
    Application Performance Tuning

    M. D. Jones, Ph.D.

Center for Computational Research, University at Buffalo

    State University of New York

    High Performance Computing I, 2012


Introduction to Performance Engineering in HPC: Performance Fundamentals

    Performance Foundations

    Three pillars of performance optimization:

Algorithmic - choose the most effective algorithm that you can for the problem of interest

Serial Efficiency - optimize the code to run efficiently in a non-parallel environment

Parallel Efficiency - effectively use multiple processors to achieve a reduction in execution time, or equivalently, to solve a proportionately larger problem


    Algorithmic Efficiency

Choose the best algorithm before you start coding (recall that good planning is an essential part of writing good software):

Running on a large number of processors? Choose an algorithm that scales well with increasing processor count

Running a large system (mesh points, particle count, etc.)? Choose an algorithm that scales well with system size

If you are going to run on a massively parallel machine, plan from the beginning on how you intend to decompose the problem (it may save you a lot of time later)


Introduction to Performance Engineering in HPC: Performance Baseline

    Serial Efficiency

Getting efficient code in parallel is made much more difficult if you have not optimized the sequential code, and in fact can lead to a misleading picture of parallel performance. Recall that our definition of parallel speedup,

    S(N_p) = t_S / t_p(N_p),

involves the time, t_S, for an optimal sequential implementation (not just t_p(1)!)
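For example, if the best sequential implementation takes t_S = 100 s and the parallel code takes t_p(16) = 10 s on 16 processors, then S(16) = 10; quoting speedup against the parallel code's own one-processor time t_p(1) would overstate the benefit whenever that code is slower than the optimal sequential one.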


    Establish a Performance Baseline

Steps to establishing a baseline for your own performance expectations:

Choose a representative problem (or better still, a suite of problems) that can be run under identical circumstances on a multitude of platforms/compilers (requires portable code!)

How fast is fast? You can utilize hardware performance counters to measure actual code performance

Profile, profile, and then profile some more ... to find bottlenecks and spend your time more effectively in optimizing code


Introduction to Performance Engineering in HPC: Pitfalls in Parallel Performance

    Parallel Performance Trap

    Pitfalls when measuring the performance of parallel codes:

    For many, speedup or linear scalability is the ultimate goal.

This goal is incomplete - a terribly inefficient code can scale well, but actually deliver poor efficiency. For example, consider a simple Monte Carlo based code that uses the most rudimentary uniform sampling (i.e. no importance sampling) - this can be made to scale perfectly in parallel, but the algorithmic efficiency (measured perhaps by the delivered variance per cpu-hour) is quite low.


Simple Profiling Tools: Timers

    time Command

Note that this is not the time built-in function in many shells (bash and tcsh included), but instead the one located in /usr/bin. This command is quite useful for getting an overall picture of code performance. The default output format:

    %Uuser %Ssystem %Eelapsed %PCPU (%Xtext+%Ddata %Mmax)k
    %Iinputs+%Ooutputs (%Fmajor+%Rminor)pagefaults %Wswaps

    and using the -p option:

    real %e
    user %U
    sys %S


    time Example

[bono:~/d_laplace]$ /usr/bin/time ./laplace_s
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
75.82user 0.00system 1:17.72elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+913minor)pagefaults 0swaps
[bono:~/d_laplace]$ /usr/bin/time -p ./laplace_s
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
real 75.73
user 74.68
sys 0.00


    time MPI Example

[bono:~/d_laplace]$ /usr/bin/time mpirun -np 2 ./laplace_mpi
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
Writing logfile....
Finished writing logfile.
28.43user 1.54system 0:31.95elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+14920minor)pagefaults 0swaps
[bono:~/d_laplace]$ qsub -q debug -l nodes=2:ppn=2,walltime=00:30:00 -I
qsub: waiting for job 577255.bono.ccr.buffalo.edu to start
############################# PBS Prologue ##############################
PBS prologue script run on host c15n32 at Fri Sep 28 15:03:05 EDT 2007
PBSTMPDIR is /scratch/577255.bono.ccr.buffalo.edu
/usr/bin/time mpiexec ./laplace_mpi
 Max value in sol:   0.999992327961218
 Min value in sol:  8.742278000372475E-008
0.01user 0.01system 0:30.82elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1416minor)pagefaults 0swaps

OSC's mpiexec will report accurate timings in the (optional) email report, as it does not rely on rsh/ssh to launch tasks (but Intel MPI does, so in that case you will see the result of timing the mpiexec shell script, not the MPI code).


Simple Profiling Tools: Code Section Timing (Calipers)

    Code Section Timing (Calipers)

Timing sections of code requires a bit more work on the part of the programmer, but there are reasonably portable means of doing so:

    Routine              Type      Resolution
    times                user/sys  ms
    gettimeofday         wall      µs
    clock_gettime        wall      ns
    system_clock (f90)   wall      system-dependent
    cpu_time (f95)       cpu       compiler-dependent
    MPI_Wtime*           wall      system-dependent
    OMP_GET_WTIME*       wall      system-dependent

Generally I prefer the MPI and OpenMP timing calls whenever I can use them (*the MPI and OpenMP specifications call for their intrinsic timers to be high precision).
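As an illustration (not from the handout), a minimal caliper sketch in C using the MPI timer might look like this; MPI_Wtick() reports the resolution of the underlying clock:

    /* Minimal caliper sketch, assuming an MPI code (compile with mpicc). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double t0 = MPI_Wtime();            /* open the caliper */
        /* ... section of code to be timed ... */
        double t1 = MPI_Wtime();            /* close the caliper */

        printf("section took %f s (timer resolution %g s)\n",
               t1 - t0, MPI_Wtick());

        MPI_Finalize();
        return 0;
    }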


    More information on code section timers (and code for doing so):

LLNL Performance Tools: https://computing.llnl.gov/tutorials/performance_tools/#gettimeofday

Stopwatch (nice F90 module, but you need to supply a low-level function for accessing a timer): http://math.nist.gov/StopWatch


Simple Profiling Tools: gprof

    GNU Tools: gprof

Tool that we used briefly before:

Generic GNU profiler

    Requires recompiling code with -pg option

Running subsequent instrumented code produces gmon.out to be read by gprof

Use the environment variable GMON_OUT_PREFIX to specify a new gmon.out prefix to which the process ID will be appended (e.g. GMON_OUT_PREFIX=gmon.laplace yields gmon.laplace.<pid>; especially useful for parallel runs) - this is a largely undocumented feature ...

Line-level profiling is possible, as we will see in the following example


    gprof Shortcomings

Shortcomings of gprof (which apply also to any statistical profiling tool):

Need to recompile to instrument the code

Instrumentation can affect the statistics in the profile

    Overhead can significantly increase the running time

    Compiler optimization can be affected by instrumentation


    Types of gprof Profiles

gprof profiles come in three types:

1. Flat Profile: shows how much time your program spent in each function, and how many times that function was called

2. Call Graph: for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function

3. Basic-block: Requires compilation with the -a flag (supported only by GNU?) - enables gprof to construct an annotated source code listing showing how many times each line of code was executed


    gprof example

g77 -I. -O3 -ffast-math -g -pg -o rp rp_read.o initial.o en_gde.o \
  adwfns.o rpqmc.o evol.o dbxgde.o dbxtri.o gen_etg.o gen_rtg.o \
  gdewfn.o lib/*.o

[jonesm@bono ~/d_bench]$ qsub -q debug -l nodes=1:ppn=2,walltime=00:30:00 -I
[jonesm@c16n30 ~/d_bench]$ ./rp
 Enter runid: (<= 9 chars)
short
...
... skip copious amount of standard output ...
...

[jonesm@bono ~/d_bench]$ gprof rp gmon.out >& out.gprof
[jonesm@bono ~/d_bench]$ less out.gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
89.74     123.23   123.23   204008     0.00     0.00  triwfns_
 6.96     132.79     9.56        1     9.56   137.22  MAIN__
 1.18     134.41     1.62   200004     0.00     0.00  en_gde__
 1.05     135.86     1.44   204002     0.00     0.00  evol_
 0.71     136.83     0.97 14790551     0.00     0.00  ranf_
 0.27     137.20     0.37   204008     0.00     0.00  gdewfn_
...
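Reading the flat profile: triwfns_ alone accounts for nearly 90% of the execution time, so that single routine is where optimization effort should be concentrated.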


[jonesm@bono ~/d_bench]$ gprof -line rp gmon.out >& out.gprof.line
[jonesm@bono ~/d_bench]$ less out.gprof.line
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
17.45     23.96    23.96                             triwfns_ (adwfns.f:129 @ 403c94)
14.46     43.82    19.86                             triwfns_ (adwfns.f:130 @ 403ce6)
12.87     61.50    17.68                             triwfns_ (adwfns.f:171 @ 404755)
12.31     78.41    16.91                             triwfns_ (adwfns.f:172 @ 4047e2)
 0.67     79.33     0.92                             triwfns_ (adwfns.f:130 @ 403cd6)
 0.59     80.14     0.82                             MAIN__ (cc4WTuQH.f:308 @ 4070c6)
 0.51     80.84     0.70                             MAIN__ (cc4WTuQH.f:304 @ 4072da)


    More gprof Information

    More gprof documentation:

gprof GNU Manual: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/gprof.html

gprof man page: man gprof

Simple Profiling Tools: pgprof

    PGI Tools: pgprof

PGI tools also have profiling capabilities (cf. man pgf95)

    Graphical profiler, pgprof


N.B. you can also use the -text option to pgprof to make it behave more like gprof

See the PGI Tools Guide for more information (should be a PDF copy in $PGI/doc)

Parallel Profiling Tools: Statistical MPI Profiling With mpiP

    mpiP: Statistical MPI Profiling

    http://mpip.sourceforge.net

Not a tracing tool, but a lightweight interface to accumulate statistics using the MPI profiling interface

Quite useful in conjunction with a tracefile analysis (e.g. using jumpshot)

Installed on CCR systems - see module avail for mpiP availability and location


    mpiP Compilation

    To use mpiP you need to:

Add a -g flag to add symbols (this will allow mpiP to access the source code symbols and line numbers)

Link in the necessary mpiP profiling library and the binary utility libraries for actually decoding symbols (there is a trick, shown below in the runtime example, that you can use most of the time to avoid having to link with mpiP).

    Compilation examples (from U2) follow ...


    mpiP Runtime Flags

You can set various mpiP runtime flags (e.g. export MPIP="-t 10.0 -k 2"):

    Option   Description                                              Default
    -c       Generate concise version of report, omitting callsite
             process-specific detail
    -e       Print report data using floating-point format
    -f dir   Record output file in directory <dir>                    .
    -g       Enable mpiP debug mode                                   disabled
    -k n     Sets callsite stack traceback depth to <n>               1
    -n       Do not truncate full pathname of filename in callsites
    -o       Disable profiling at initialization; application
             must enable profiling with MPI_Pcontrol()
    -s n     Set hash table size to <n>                               256
    -t x     Set print threshold for report, where <x> is the
             MPI percentage of time for each callsite                 0.0
    -v       Generates both concise and verbose report output
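For example, with the -o flag set the application itself brackets the region to be profiled. A minimal sketch (not from the handout) using the standard MPI_Pcontrol call that mpiP keys on:

    /* Minimal sketch, assuming MPIP="-o": profiling is off at MPI_Init and
       the application switches it on around the region of interest. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Pcontrol(1);   /* enable mpiP profiling from here on */
        /* ... communication-heavy section to be profiled ... */
        MPI_Pcontrol(0);   /* disable profiling again */

        MPI_Finalize();
        return 0;
    }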


    mpiP Example Output

From an older version of mpiP, but still almost entirely the same - this one links directly with mpiP; first compile:

mpicc -g -o atlas aij2_basis.o analyze.o atlas.o barrier.o byteflip.o \
  chordgn2.o cstrings.o io2.o map.o mutils.o numrec.o paramods.o proj.o \
  projAtlas.o sym2.o util.o -lm -L/Projects/CCR/jonesm/mpiP-2.8.2/gnu/ch_gm/lib \
  -lmpiP -lbfd -liberty -lm

    then run (in this case using 16 processors) and examine the output file:

[jonesm@joplin d_derenzo]$ ls *.mpiP
atlas.gcc3mpipapimpiP.16.20578.1.mpiP
[jonesm@joplin d_derenzo]$ less atlas.gcc3mpipapimpiP.16.20578.1.mpiP
:
:


@ mpiP
@ Command : ./atlas.gcc3mpipapimpiP study.ini 0
@ Version : 2.8.2
@ MPIP Build date : Jun 29 2005, 14:53:41
@ Start time : 2005 06 29 15:18:52
@ Stop time : 2005 06 29 15:28:34
@ Timer Used : gettimeofday
@ MPIP env var : [null]
@ Collector Rank : 0
@ Collector PID : 20578
@ Final Output Dir : .
@ MPI Task Assignment : 0 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 1 bb18n17.ccr.buffalo.edu
@ MPI Task Assignment : 2 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 3 bb18n16.ccr.buffalo.edu
@ MPI Task Assignment : 4 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 5 bb18n15.ccr.buffalo.edu
@ MPI Task Assignment : 6 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 7 bb18n14.ccr.buffalo.edu
@ MPI Task Assignment : 8 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 9 bb18n13.ccr.buffalo.edu
@ MPI Task Assignment : 10 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 11 bb18n12.ccr.buffalo.edu
@ MPI Task Assignment : 12 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 13 bb18n11.ccr.buffalo.edu
@ MPI Task Assignment : 14 bb18n10.ccr.buffalo.edu
@ MPI Task Assignment : 15 bb18n10.ccr.buffalo.edu


---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime    MPI%
   0        582       44.7    7.69
   1        579       41.9    7.24
   2        579       40.7    7.03
   3        579       36.9    6.37
   4        579       22.3    3.84
   5        579       16.6    2.87
   6        579       32      5.53
   7        579       35.9    6.20
   8        579       28.6    4.93
   9        579       25.9    4.48
  10        579       39.2    6.76
  11        579       33.8    5.84
  12        579       35.3    6.10
  13        579       41      7.07
  14        579       29.9    5.16
  15        579       41.4    7.16
   *   9.27e+03        546    5.89


---------------------------------------------------------------------------
@--- Callsites: 13 --------------------------------------------------------
---------------------------------------------------------------------------
ID  Lev  File/Address   Line  Parent_Funct         MPI_Call
 1    0  util.c          833  gsync                Barrier
 2    0  atlas.c        1531  readProjData         Allreduce
 3    0  projAtlas.c     745  backProjAtlas        Allreduce
 4    0  atlas.c        1545  readProjData         Allreduce
 5    0  atlas.c        1525  readProjData         Allreduce
 6    0  atlas.c        1541  readProjData         Allreduce
 7    0  atlas.c        1589  readProjData         Allreduce
 8    0  atlas.c        1519  readProjData         Allreduce
 9    0  util.c          789  mygcast              Bcast
10    0  projAtlas.c    1100  computeLoglikeAtlas  Allreduce
11    0  atlas.c        1514  readProjData         Allreduce
12    0  atlas.c        1537  readProjData         Allreduce
13    0  projAtlas.c     425  fwdBackProjAtlas2    Allreduce


---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call       Site      Time   App%   MPI%    COV
Allreduce    13  3.09e+05   3.33  56.50   0.46
Barrier       1  2.13e+05   2.30  38.97   0.35
Bcast         9  1.69e+04   0.18   3.10   0.37
Allreduce     3  7.78e+03   0.08   1.42   0.11
Allreduce    10      62.7   0.00   0.01   0.20
Allreduce    11      2.42   0.00   0.00   0.09
Allreduce     7      2.17   0.00   0.00   0.26
Allreduce    12      1.15   0.00   0.00   0.20
Allreduce     6      1.14   0.00   0.00   0.19
Allreduce     5      1.13   0.00   0.00   0.15
Allreduce     8      1.12   0.00   0.00   0.18
Allreduce     2       1.1   0.00   0.00   0.13
Allreduce     4       1.1   0.00   0.00   0.12


---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call       Site   Count      Total      Avrg  Sent%
Allreduce    13   65536   2.28e+09  3.48e+04  83.69
Allreduce     3    8192   2.85e+08  3.48e+04  10.46
Bcast         9  490784   1.59e+08       325   5.85
Allreduce    11      16   2.07e+04   1.3e+03   0.00
Allreduce    10     512    4.1e+03         8   0.00
Allreduce     7      16        256        16   0.00
Allreduce     2      16         64         4   0.00
Allreduce     6      16         64         4   0.00
Allreduce     5      16         64         4   0.00
Allreduce     4      16         64         4   0.00
Allreduce     8      16         64         4   0.00
Allreduce    12      16         64         4   0.00
...
...


    Using mpiP at Runtime

Now let's examine an example using mpiP at runtime.

This example solves a simple Laplace equation with Dirichlet boundary conditions using finite differences.


    #PBS -S /bin/bash

    #PBS -q debug

    #PBS -l walltime=00:20:00

    #PBS -l nodes=2:MEM24GB:ppn=8

    #PBS -M [email protected]

    #PBS -m e

    #PBS -N test

    #PBS -o subMPIP.out

    #PBS -j oe

    module load intel

    module load intel-mpi

    module load mpip

    module list

    cd $PBS_O_WORKDIR

    which mpiexec

    NNODES=`cat $PBS_NODEFILE | uniq | wc -l`

    NPROCS=`cat $PBS_NODEFILE | wc -l`

    export I_MPI_DEBUG=5

    # Use LD_PRELOAD trick to load mpiP wrappers at runtime

    export LD_PRELOAD=$MPIPDIR/lib/libmpiP.so

    mpdboot -n $NNODES -f $PBS_NODEFILE -v

    mpiexec -np $NPROCS -envall ./laplace_mpi


    @ mpiP

    @ Command : ./laplace_mpi

    @ Version : 3.3.0

    @ MPIP Build date : Oct 14 2011, 16:16:34

    @ Start time : 2011 10 14 16:26:15

@ Stop time : 2011 10 14 16:28:11
@ Timer Used : PMPI_Wtime

    @ MPIP env var : [null]

    @ Collector Rank : 0

    @ Collector PID : 8618

    @ Final Output Dir : .

    @ Report generation : Single collector task

    @ MPI Task Assignment : 0 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 1 d15n33.ccr.buffalo.edu

@ MPI Task Assignment : 2 d15n33.ccr.buffalo.edu
@ MPI Task Assignment : 3 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 4 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 5 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 6 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 7 d15n33.ccr.buffalo.edu

    @ MPI Task Assignment : 8 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 9 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 10 d15n23.ccr.buffalo.edu

@ MPI Task Assignment : 11 d15n23.ccr.buffalo.edu
@ MPI Task Assignment : 12 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 13 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 14 d15n23.ccr.buffalo.edu

    @ MPI Task Assignment : 15 d15n23.ccr.buffalo.edu


    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    @ - - MPI Time (seconds) - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Task AppTime MPITime MPI%

    0 116 16.9 14.57

    1 115 16.4 14.18

    2 115 16.8 14.53

    3 115 17.7 15.34

    4 115 16.6 14.39

    5 115 14.3 12.37

    6 115 13.7 11.90

7 115 11.1 9.65
8 115 12.5 10.87

    9 115 13.8 12.00

    10 115 14.5 12.60

    11 115 16.9 14.67

    12 115 17.5 15.15

    13 115 19.4 16.82

    14 115 16.4 14.22

    15 115 17.5 15.13

    * 1.85e+03 252 13.65


    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    @ - - Callsites: 6 - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ID Lev File/Address Line Parent_Funct MPI_Call

    1 0 laplace_mpi.f90 118 MAIN__ Allreduce

    2 0 laplace_mpi.f90 143 MAIN__ Recv

    3 0 laplace_mpi.f90 48 __paramod_MOD_xchange Sendrecv

    4 0 laplace_mpi.f90 80 MAIN__ Bcast

    5 0 laplace_mpi.f90 46 __paramod_MOD_xchange Sendrecv

    6 0 laplace_mpi.f90 138 MAIN__ Send

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

@ - - Aggregate Time (top twenty, descending, milliseconds) - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Call Site Time App% MPI% COV

    Allreduce 1 2.34e+05 12.69 92.94 0.15

    Sendrecv 3 8.88e+03 0.48 3.52 0.19

    Sendrecv 5 8.47e+03 0.46 3.36 0.16

    Send 6 321 0.02 0.13 0.51

    Bcast 4 97.7 0.01 0.04 0.26

    Recv 2 14.4 0.00 0.01 0.00

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
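Reading these tables: the Allreduce at callsite 1 (laplace_mpi.f90:118) dominates, accounting for roughly 93% of all MPI time (though only about 13% of total application time), so it is the natural first target for communication tuning.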


    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    @ - - Callsite Time statistics (all, milliseconds): 80 - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name Site Rank Count Max Mean Min App% MPI%

    Allreduce 1 0 23845 28.5 0.666 0.036 13.69 94.00

    Allreduce 1 1 23845 28.6 0.644 0.035 13.30 93.79

    Allreduce 1 2 23845 28.6 0.661 0.037 13.66 93.98

    Allreduce 1 3 23845 28.6 0.695 0.038 14.36 93.63

    Allreduce 1 4 23845 29.2 0.651 0.038 13.46 93.54

    Allreduce 1 5 23845 29.2 0.555 0.035 11.46 92.71

    Allreduce 1 6 23845 29 0.528 0.037 10.91 91.64

Allreduce 1 7 23845 26.3 0.405 0.038 8.37 86.74
Allreduce 1 8 23845 26.8 0.468 0.034 9.67 88.92

    Allreduce 1 9 23845 28.4 0.538 0.033 11.11 92.59

    Allreduce 1 10 23845 29.7 0.566 0.033 11.70 92.90

    Allreduce 1 11 23845 29.7 0.661 0.033 13.65 93.10

    Allreduce 1 12 23845 28.6 0.686 0.039 14.17 93.49

    Allreduce 1 13 23845 28.8 0.768 0.041 15.87 94.35

    Allreduce 1 14 23845 28.6 0.641 0.036 13.25 93.19

    Allreduce 1 15 23845 28.7 0.694 0.038 14.33 94.76

    Allreduce 1 * 381520 29.7 0.614 0.033 12.69 92.94


Parallel Profiling Tools: Intel Trace Analyzer/Collector

    ITAC Example


Note that you do not have to recompile your application to use ITAC (unless you are building it statically); you can just build it as usual, using Intel MPI:

[bono:~/d_laplace/d_itac]$ module load intel-mpi
[bono:~/d_laplace/d_itac]$ make laplace_mpi
mpiifort -ipo -O3 -Vaxlib -g -c laplace_mpi.f90
mpiifort -ipo -O3 -Vaxlib -g -o laplace_mpi laplace_mpi.o
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_ifortjOzYAn.o
[bono:~/d_laplace/d_itac]$

    You can turn trace collection on at run-time ...
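(With Intel MPI this is typically done at launch time, e.g. something like mpiexec -trace -np 16 ./laplace_mpi, which preloads the trace collector library without relinking; the exact flag is version-dependent, so check the ITAC documentation referenced below.)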


Running the preceding batch job on U2 produces a bunch (many!) of profiling output files, the most important of which is the one named after your binary with a .stf suffix, in this case laplace_mpi.stf, which we feed to the Intel Trace Analyzer using the traceanalyzer command ... and we should see a profile that looks very much like what you can see using jumpshot.


    More ITAC Documentation


Some helpful pointers to more ITAC documentation:

[k07n14:~]$ which traceanalyzer
/util/intel/itac/8.0.3.007/bin/traceanalyzer
[k07n14:~]$ ls -l /util/intel/itac/8.0.3.007/doc/
total 4211
-r--r--r-- 1 root root   91050 Aug 25 06:02 FAQ.pdf
-r--r--r-- 1 root root   15566 Aug 25 06:02 Getting_Started.html
drwxr-xr-x 3 root root      57 Oct  4 08:02 html
-r--r--r-- 1 root root   61598 Aug 25 06:02 INSTALL.html
-r--r--r-- 1 root root 2051029 Aug 25 06:02 ITA_Reference_Guide.pdf
-r--r--r-- 1 root root 1050276 Aug 25 06:02 ITC_Reference_Guide.pdf
-r--r--r-- 1 root root   20681 Aug 25 06:02 Release_Notes.txt

Generally a good idea to refer to the documentation for the same version that you are using (you can check with module show intel-mpi).

Performance API (PAPI): Introduction to PAPI

Introduction

    Performance Application Programming Interface

Implement a portable(!) and efficient API to access existing hardware performance counters

Ease the optimization of code by providing base infrastructure for cross-platform tools


Performance API (PAPI): Reminder: U2 Hardware

    Nehalem Xeon Block Diagram


    Block diagram for Nehalem architecture:

[Figure: Intel Nehalem microarchitecture block diagram. It shows the front end (32 KByte 4-way associative L1 instruction cache with 128-entry TLB, predecode and instruction length decoder, instruction queue, one complex and three simple decoders with macro-op and micro-op fusion, loop stream decoder), the out-of-order engine (2x register allocation table, 128-entry reorder buffer, 128-entry reservation station feeding six issue ports), the execution units (integer/MMX ALUs, SSE add and multiply/divide units, FP add and FP multiply, load/store address AGUs), the memory subsystem (memory order buffer, 32 KByte 8-way L1 data cache, 256 KByte private 8-way L2 cache, 512-entry L2 TLB), and the uncore (8 MByte shared L3 cache, integrated DDR3 memory controller at 3x 64 bit and 1.33 GT/s, QuickPath Interconnect at 4x 20 bit and 6.4 GT/s).]


    Superscalar - EM64T Irwindale


Characteristics of the Irwindale Xeons that form (part of) CCR's U2 cluster:

    Clock Cycle             3.2 GHz
    TPP                     6.4 GFlop/s
    Pipeline                31 stages
    L2 Cache Size           2 MByte
    L2 Bandwidth            102.4 GByte/s
    CPU-Memory Bandwidth    6.4 GByte/s (shared)


Performance API (PAPI): High- vs. Low-level PAPI

    High-level PAPI


    Intended for coarse-grained measurements

    Requires little (or no) setup code

Allows only PAPI preset events

    Allows only aggregate counting (no statistical profiling)


    Low-level PAPI


    More efficient (and functional) than high-level

    About 60 functions

    Thread-safe

    Supports presets and native events
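A minimal low-level sketch (not from the handout, and assuming the PAPI_TOT_INS and PAPI_TOT_CYC presets are available on the platform; compile with -lpapi):

    #include <stdio.h>
    #include <stdlib.h>
    #include "papi.h"

    int main(void)
    {
        int eventset = PAPI_NULL;
        long_long counts[2];

        /* PAPI_library_init returns PAPI_VER_CURRENT on success */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);

        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_INS);   /* instructions completed */
        PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */

        PAPI_start(eventset);
        /* ... code to be measured ... */
        PAPI_stop(eventset, counts);              /* reads and stops the counters */

        printf("instructions: %lld  cycles: %lld\n", counts[0], counts[1]);
        return 0;
    }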


    Preset vs. Native Events


preset or pre-defined events are those which have been considered useful by the PAPI community and developers: http://icl.cs.utk.edu/projects/papi/presets.html

native events are those countable by the CPU's hardware. These events are highly platform specific, and you would need to consult the processor architecture manuals for the relevant native event lists


    How to Access PAPI at CCR


Consider a simple example code to measure Flop/s using the high-level PAPI API:

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main()
{
    float real_time, proc_time, mflops;
    long_long flpops;
    float ireal_time, iproc_time, imflops;
    long_long iflpops;
    int retval;

    if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops)) < PAPI_OK)
    {
        printf("Could not initialise PAPI flops\n");


        printf("Your platform may not support floating point operation event.\n");
        printf("retval: %d\n", retval);
        exit(1);
    }

    your_slow_code();

    if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops)) < PAPI_OK)
    {
        printf("retval: %d\n", retval);
        exit(1);
    }

    printf("Real_time: %f Proc_time: %f Total flpops: %lld MFLOPS: %f\n",
           real_time, proc_time, flpops, mflops);
    exit(0);
}

int your_slow_code()
{
    int i;
    double tmp = 1.1;

    for (i = 1; i < 200000000; i++)   /* loop body truncated in the transcript; */
        tmp = (tmp + 100.0) / i;      /* a plausible busy-work reconstruction   */
    return 0;
}


On U2, you access the papi module and compile accordingly:

[k07n14:~/d_papi]$ qsub -q debug -l nodes=1:MEM48GB:ppn=12,walltime=01:00:00 -I
qsub: waiting for job 1144485.d15n41.ccr.buffalo.edu to start
qsub: job 1144485.d15n41.ccr.buffalo.edu ready

Job 1144485.d15n41.ccr.buffalo.edu has requested 12 cores/processors per node.
PBSTMPDIR is /scratch/1144485.d15n41.ccr.buffalo.edu
[k16n01b:~]$ cd $PBS_O_WORKDIR
[k16n01b:~/d_papi]$ module load papi
'papi/v4.1.4' load complete.
[k16n01b:~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c -L$PAPI/lib -lpapi
[k16n01b:~/d_papi]$ ./PAPI_flops
Real_time: 0.000042 Proc_time: 0.000029 Total flpops: 7983 MFLOPS: 276.572449

N.B., PAPI needs to support the underlying hardware, and this version does not support the 32-core Intel nodes (including the front end).


    ... and on Lennon (SGI Altix):

[jonesm@lennon ~/d_papi]$ module load papi
'papi/v3.2.1' load complete.
[jonesm@lennon ~/d_papi]$ echo $PAPI
/util/perftools/papi-3.2.1
[jonesm@lennon ~/d_papi]$ gcc -I$PAPI/include -o PAPI_flops PAPI_flops.c \
  -L$PAPI/lib -lpapi
[jonesm@lennon ~/d_papi]$ ldd PAPI_flops
    libpapi.so => /util/perftools/papi-3.2.1/lib/libpapi.so (0x2000000000040000)
    libc.so.6.1 => /lib/tls/libc.so.6.1 (0x20000000000ac000)
    libpfm.so.2 => /usr/lib/libpfm.so.2 (0x2000000000318000)
    /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
[jonesm@lennon ~/d_papi]$ ./PAPI_flops
Real_time: 0.000143 Proc_time: 0.000132 Total flpops: 48056 MFLOPS: 363.964111

N.B., the Altix is dead, but this is a good example of the cross-platform portability of PAPI accessing the hardware performance counters.


    papi_avail Command


You can use papi_avail to check event availability (different CPUs support various events):

    [k16n01b:~/d_papi]$ papi_avail

    Available events and hardware information.

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    PAPI Version : 4.1.4.0

    Vendor string and code : GenuineIntel (1)

    Model string and code : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (44)

    CPU Revision : 2.000000

    CPUID Info : Family: 6 Model: 44 Stepping: 2

    CPU Megahertz : 2400.391113

    CPU Clock Megahertz : 2400

    Hdw Threads per core : 1

    Cores per Socket : 6

    NUMA Nodes : 2

    CPU's per Node : 6

    Total CPU's : 12

    Number Hardware Counters : 16

    Max Multiplex Counters : 512

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    The following correspond to fields in the PAPI_event_info_t structure.

    Name Code Avail Deriv Description (Note)

    PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses

    PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses


    PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses

    PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses

    PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses


    PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses

    PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses

    PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses

    PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses

PAPI_CA_SNP 0x80000009 No No Requests for a snoop
PAPI_CA_SHR 0x8000000a No No Requests for exclusive access to shared cache line

    PAPI_CA_CLN 0x8000000b No No Requests for exclusive access to clean cache line

    PAPI_CA_INV 0x8000000c No No Requests for cache line invalidation

    PAPI_CA_ITV 0x8000000d No No Requests for cache line intervention

    PAPI_L3_LDM 0x8000000e Yes No Level 3 load misses

    PAPI_L3_STM 0x8000000f No No Level 3 store misses

    PAPI_BRU_IDL 0x80000010 No No Cycles branch units are idle

    PAPI_FXU_IDL 0x80000011 No No Cycles integer units are idle

PAPI_FPU_IDL 0x80000012 No No Cycles floating point units are idle
PAPI_LSU_IDL 0x80000013 No No Cycles load/store units are idle

    PAPI_TLB_DM 0x80000014 Yes No Data translation lookaside buffer misses

    PAPI_TLB_IM 0x80000015 Yes No Instruction translation lookaside buffer misses

    PAPI_TLB_TL 0x80000016 Yes Yes Total translation lookaside buffer misses

    PAPI_L1_LDM 0x80000017 Yes No Level 1 load misses

    PAPI_L1_STM 0x80000018 Yes No Level 1 store misses

    PAPI_L2_LDM 0x80000019 Yes No Level 2 load misses

    PAPI_L2_STM 0x8000001a Yes No Level 2 store misses

    PAPI_BTAC_M 0x8000001b No No Branch target address cache misses

    PAPI_PRF_DM 0x8000001c No No Data prefetch cache misses

    PAPI_L3_DCH 0x8000001d No No Level 3 data cache hits

    PAPI_TLB_SD 0x8000001e No No Translation lookaside buffer shootdowns

    PAPI_CSR_FAL 0x8000001f No No Failed store conditional instructions

    PAPI_CSR_SUC 0x80000020 No No Successful store conditional instructions

    PAPI_CSR_TOT 0x80000021 No No Total store conditional instructions


    PAPI_FML_INS 0x80000061 No No Floating point multiply instructions

PAPI_FAD_INS 0x80000062 No No Floating point add instructions
PAPI_FDV_INS 0x80000063 No No Floating point divide instructions

    PAPI_FSQ_INS 0x80000064 No No Floating point square root instructions

    PAPI_FNV_INS 0x80000065 No No Floating point inverse instructions

    PAPI_FP_OPS 0x80000066 Yes Yes Floating point operations

    PAPI_SP_OPS 0x80000067 Yes Yes Floating point operations; optimized to count

    scaled single precision vector operations

    PAPI_DP_OPS 0x80000068 Yes Yes Floating point operations; optimized to count

    scaled double precision vector operations

PAPI_VEC_SP 0x80000069 Yes No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a Yes No Double precision vector/SIMD instructions

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Of 107 possible events, 57 are available, of which 14 are derived.

    avail.c PASSED


    PAPI_L1_TCM 0x80000006 Yes Level 1 cache misses

    PAPI_L2_TCM 0x80000007 No Level 2 cache misses

    PAPI_L3_TCM 0x80000008 No Level 3 cache misses

    PAPI_L3_LDM 0x8000000e No Level 3 load misses


    PAPI_TLB_DM 0x80000014 No Data translation lookaside buffer misses

    PAPI_TLB_IM 0x80000015 No Instruction translation lookaside buffer misses

    PAPI_TLB_TL 0x80000016 Yes Total translation lookaside buffer misses

    PAPI_L1_LDM 0x80000017 No Level 1 load misses

    PAPI_L1_STM 0x80000018 No Level 1 store misses

    PAPI_L2_LDM 0x80000019 No Level 2 load misses

    PAPI_L2_STM 0x8000001a No Level 2 store misses

    PAPI_BR_UCN 0x8000002a No Unconditional branch instructions

    PAPI_BR_CN 0x8000002b No Conditional branch instructions

    PAPI_BR_TKN 0x8000002c No Conditional branch instructions taken

    PAPI_BR_NTK 0x8000002d Yes Conditional branch instructions not taken

    PAPI_BR_MSP 0x8000002e No Conditional branch instructions mispredicted

    PAPI_BR_PRC 0x8000002f Yes Conditional branch instructions correctly predicted

PAPI_TOT_IIS 0x80000031 No Instructions issued
PAPI_TOT_INS 0x80000032 No Instructions completed

    PAPI_FP_INS 0x80000034 No Floating point instructions

    PAPI_LD_INS 0x80000035 No Load instructions

    PAPI_SR_INS 0x80000036 No Store instructions

    PAPI_BR_INS 0x80000037 No Branch instructions

    PAPI_RES_STL 0x80000039 No Cycles stalled on any resource

    PAPI_TOT_CYC 0x8000003b No Total cycles

    PAPI_LST_INS 0x8000003c Yes Load/store instructions completed

PAPI_L2_DCH 0x8000003f Yes Level 2 data cache hits
PAPI_L2_DCA 0x80000041 No Level 2 data cache accesses

    PAPI_L3_DCA 0x80000042 Yes Level 3 data cache accesses

    PAPI_L2_DCR 0x80000044 No Level 2 data cache reads

    PAPI_L3_DCR 0x80000045 No Level 3 data cache reads

    PAPI_L2_DCW 0x80000047 No Level 2 data cache writes

    PAPI_L3_DCW 0x80000048 No Level 3 data cache writes


    PAPI API Examples

    PAPI Naming Scheme


The C interface:

    (return type) PAPI_function_name(arg1, arg2, ...)

and the Fortran interface:

    PAPIF_function_name(arg1, arg2, ..., check)

Note that the check parameter is the same type and value as the C return value.


    High-level API Example in C

Let's consider the following example code for using the high-level API in C.

#include "papi.h"
#include <stdio.h>
#define NUM_EVENTS 2
#define THRESHOLD 10000
#define ERROR_RETURN(retval) { fprintf(stderr, "Error %d %s: line %d: \n", \
    retval, __FILE__, __LINE__); exit(retval); }

void computation_mult()   /* stupid codes to be monitored */
{
    double tmp = 1.0;
    int i = 1;
    for (i = 1; i < THRESHOLD; i++) {
        tmp = tmp * i;
    }
}

void computation_add()    /* stupid codes to be monitored */
{
    int tmp = 0;
    int i = 0;
    for (i = 0; i < THRESHOLD; i++) {
        tmp = tmp + i;
    }
}



/**************************************************************************
 * PAPI_num_counters returns the number of hardware counters the platform *
 * has or a negative number if there is an error                          *
 **************************************************************************/
if ((num_hwcntrs = PAPI_num_counters()) < PAPI_OK) {
    printf("There are no counters available. \n");
    exit(1);
}

printf("There are %d counters in this system\n", num_hwcntrs);

/**************************************************************************
 * PAPI_start_counters initializes the PAPI library (if necessary) and    *
 * starts counting the events named in the events array. This function    *
 * implicitly stops and initializes any counters running as a result of   *
 * a previous call to PAPI_start_counters.                                 *
 **************************************************************************/
if ((retval = PAPI_start_counters(Events, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("\nCounter Started: \n");

/* Your code goes here */
computation_add();


/**************************************************************************
 * PAPI_read_counters reads the counter values into the values array      *
 **************************************************************************/
if ((retval = PAPI_read_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("Read successfully\n");
printf("The total instructions executed for addition are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);

printf("\nNow we try to use PAPI_accum to accumulate values\n");

/* Do some computation here */
computation_add();

/**************************************************************************
 * What PAPI_accum_counters does is it adds the running counter values    *
 * to what is in the values array. The hardware counters are reset and    *
 * left running after the call.                                           *
 **************************************************************************/
if ((retval = PAPI_accum_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("We did an additional %d times addition!\n", THRESHOLD);
printf("The total instructions executed for addition are %lld \n", values[0]);
printf("The total cycles used are %lld \n", values[1]);


printf("\nNow we try to do some multiplications\n");
computation_mult();

/**************************************************************************
 * Stop counting events (this reads the counters as well as stops them)   *
 **************************************************************************/
if ((retval = PAPI_stop_counters(values, NUM_EVENTS)) != PAPI_OK)
    ERROR_RETURN(retval);

printf("The total instruction executed for multiplication are %lld \n",
       values[0]);
printf("The total cycles used are %lld \n", values[1]);
exit(0);
}


    Running on E5645 U2 nodes:


    [k11n30b:~/d_papi]$ module load papi

'papi/v4.1.4' load complete.
[k11n30b:~/d_papi]$ gcc -I$PAPI/include -o highlev highlev.c -L$PAPI/lib -lpapi

    [k11n30b:~/d_papi]$ ./highlev

    There are 16 counters in this system

    Counter Started:

    Read successfully

    The total instructions executed for addition are 54977

    The total cycles used are 73894

    Now we try to use PAPI_accum to accumulate values

    We did an additional 10000 times addition!

    The total instructions executed for addition are 112352

    The total cycles used are 147814

    Now we try to do some multiplications

    The total instruction executed for multiplication are 77459

    The total cycles used are 126981


    PAPI Initialization


The preceding example used PAPI_library_init to initialize PAPI, which is also required for the low-level API. With the high-level API, however, the library can also be initialized implicitly by the first call to PAPI_num_counters, PAPI_start_counters, or one of the rate calls: PAPI_flips, PAPI_flops, or PAPI_ipc.
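For instance, here is a minimal sketch (not taken from the handout; the work loop is just filler to have something to count) of counting a single event with implicit initialization:

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int Events[1] = { PAPI_TOT_INS };
    long long values[1];
    volatile double s = 0.0;
    int i;

    /* First high-level call: implicitly initializes the PAPI library */
    if (PAPI_num_counters() < 1) {
        fprintf(stderr, "No hardware counters available\n");
        exit(1);
    }

    if (PAPI_start_counters(Events, 1) != PAPI_OK) exit(1);

    for (i = 0; i < 1000000; i++)   /* some work to measure */
        s += 0.5 * i;

    if (PAPI_stop_counters(values, 1) != PAPI_OK) exit(1);

    printf("Total instructions executed: %lld\n", values[0]);
    return 0;
}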

Events are counted, as we saw in the example, using PAPI_accum_counters, PAPI_read_counters, and PAPI_stop_counters.

Let's look at an even simpler example using just one of the rate counters.


    High-level Example in F90


    For something a little different we can look at our old friend, matrixmultiplication, this time in Fortran:

! A simple example for the use of PAPI, the number of flops you should
! get is about INDEX^3 on machines that consider add and multiply one flop,
! such as SGI, and 2*(INDEX^3) on those that don't consider it 1 flop,
! such as INTEL
! Kevin London
program flops
  implicit none
  include "f90papi.h"

  integer, parameter :: i8 = SELECTED_INT_KIND(16)   ! integer*8
  integer, parameter :: index = 1000
  real :: matrixa(index,index), matrixb(index,index), mres(index,index)
  real :: proc_time, mflops, real_time
  integer(kind=i8) :: flpins
  integer :: i, j, k, retval


  ! Matrix-matrix multiply
  do i = 1, index
    do j = 1, index
      do k = 1, index
        mres(i,j) = mres(i,j) + matrixa(i,k)*matrixb(k,j)
      end do
    end do
  end do

  ! Collect the data into the variables passed in
  call PAPIf_flops(real_time, proc_time, flpins, mflops, retval)
  if (retval .NE. PAPI_OK) then
    print *, 'Failure on PAPIf_flops: ', retval
  end if
  print *, 'Real_time: ', real_time
  print *, 'Proc_time: ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS: ', mflops

end program flops


    PAPI API Examples Low-level API

    PAPI Low-level Example

    A simple example using the low-level API:


#include <papi.h>
#include <stdio.h>
#include <stdlib.h>   /* for exit() */

#define NUM_FLOPS 10000

int main()
{
    int retval, EventSet = PAPI_NULL;
    long_long values[1];

    /* Initialize the PAPI library */
    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }

    /* Create the Event Set */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK) handle_error(1);

    /* Add Total Instructions Executed to our Event Set */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK) handle_error(1);


    /* Start counting events in the Event Set */
    if (PAPI_start(EventSet) != PAPI_OK) handle_error(1);

    /* do_flops is defined in tests/do_loops.c in the PAPI source distribution */


    do_flops(NUM_FLOPS);

    /* Read the counting events in the Event Set */
    if (PAPI_read(EventSet, values) != PAPI_OK) handle_error(1);

    printf("After reading the counters: %lld\n", values[0]);

    /* Reset the counting events in the Event Set */
    if (PAPI_reset(EventSet) != PAPI_OK) handle_error(1);

    do_flops(NUM_FLOPS);

    /* Add the counters in the Event Set */
    if (PAPI_accum(EventSet, values) != PAPI_OK) handle_error(1);
    printf("After adding the counters: %lld\n", values[0]);

    do_flops(NUM_FLOPS);

    /* Stop the counting of events in the Event Set */
    if (PAPI_stop(EventSet, values) != PAPI_OK) handle_error(1);

    printf("After stopping the counters: %lld\n", values[0]);
    return 0;
}


    PAPI in Parallel


For threads, PAPI_thread_init enables PAPI's thread support; it should be called immediately after PAPI_library_init.

MPI codes are treated very simply: each process has its own address space, and potentially its own hardware counters.
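A minimal sketch of the threaded case (assuming a Pthreads code; the cast adapts pthread_self to the id-function signature PAPI expects):

#include <papi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI library init error!\n");
        exit(1);
    }

    /* Enable thread support immediately after library initialization;
       PAPI wants a function returning a unique id for the calling thread */
    if (PAPI_thread_init((unsigned long (*)(void)) pthread_self) != PAPI_OK) {
        fprintf(stderr, "PAPI thread init error!\n");
        exit(1);
    }

    /* ... spawn threads; each thread may now create and run its own
       event set with the low-level calls shown earlier ... */
    return 0;
}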


    High-level Tools Popular Examples

    Tool Examples


Some examples of such high-level tools (not an exhaustive list):

TAU, Tuning and Analysis Utilities, http://www.cs.uoregon.edu/Research/tau/home.php

Open|SpeedShop, funded by the U.S. DOE, http://www.openspeedshop.org

IPM, Integrated Performance Monitoring, http://ipm-hpc.sourceforge.net

High-level Tools Specific Example: IPM

    Example: IPM


IPM is relatively simple to install and use, so we can easily walk through our favorite example. Note that IPM covers:

    MPI

    PAPI

    I/O profiling

    Memory

    Timings: wall, user, and system


    High-level Tools Specific Example: IPM

    Script to Generate HTML from XML Results


#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 xml_filename"
  exit
fi
XMLFILE=$1

export IPM_KEYFILE=/projects/jonesm/ipm/src/ipm/ipm_key
export PATH=${PATH}:/projects/jonesm/ipm/src/ipm/bin
/projects/jonesm/ipm/src/ipm/bin/ipm_parse -html $XMLFILE


    High-level Tools Specific Example: IPM

    Visualize Results in Browser

Transfer the compressed tar file to your local machine, unpack it, and browse the results.


    High-level Tools Specific Example: IPM

    Summary


    Summary of high-level tools

IPM is pretty easy to use and provides some good functionality.

TAU and Open|SpeedShop have steeper learning curves, but much more functionality.
