TRANSCRIPT
Robert Fowler
John Mellor-Crummey Nathan Tallent Gabriel Marin
Department of Computer Science, Rice University
Performance Tuning on Modern Systems: Tools and Experiences
http://hipersoft.cs.rice.edu/hpctoolkit/
Thought for the Day
The Hitchhiker's Guide to the Galaxy, in a moment of reasoned lucidity which is almost unique among its current tally of five million, nine hundred and seventy-three thousand, five hundred and nine pages, says of the Sirius Cybernetics Corporation products that "it is very easy to be blinded to the essential uselessness of them by the sense of achievement you get from getting them to work at all. In other words -- and this is the rock-solid principle on which the whole of the Corporation's galaxywide success is founded -- their fundamental design flaws are completely hidden by their superficial design flaws."
(Douglas Adams, "So Long, and Thanks for all the Fish")
Background
What we (Rice Parallel Compilers Group) do:
— Code optimization
  – Aggressive (mostly source-to-source) compiler optimization
  – Hand application of transformations
    • Try out transformations by hand first.
    • Work on real codes with algorithm, library, and application developers.
    • (Deployment time for compiler research is measured in years.)

Why we spend a lot of time (too much?) analyzing executions:
— It is hard to understand deeply pipelined, out-of-order, superscalar processors with non-blocking caches and deep memory hierarchies.
— Aggressive (-O4) and/or idiosyncratic vendor compilers make life harder.

What we did:
— Built tools to meet our needs in the run/analyze/tune cycle.
LMBENCH Numbers

| | P4 3.0GHz (i865PERL) | Dual Opteron 1.6GHz (Tyan 2882, node IL on) | Dual Opteron 1.6GHz (Tyan 2882, node IL off) | McKinley (x2) 900MHz | P4 2GHz (Dell) | SGI O2K R12K 300MHz | Athlon 800MHz |
|---|---|---|---|---|---|---|---|
| Bus/Memory | 800MHz, PC3200 | 333, PC2700 ECC | 333, PC2700 ECC | ?? | RB800 ?? | ?? | PC133 |
| LMBENCH cmplr. | gcc3.2.3 | gcc3.3.2 | gcc3.3.2 | gcc3.2 | gcc3.2 | gcc3.3.2 | ? |
| Int Add/Mul/Div (ns) | .17/4.7/19.4 | .67/1.9/26 | .67/1.9/26 | 1.12/4/7.9 | .25/.25/11 | 3.36/20/121 | 1.25/5/51 |
| Dbl Add/Mul/Div (ns) | 1.67/2.34/14.6 | 2.54/2.54/11.1 | 2.54/2.54/11.1 | 4.45/4.45 | 2.5/3.75/ | 6.7/6.7/70.3 | 5/5/22 |
| Latency L1 (ns) | 0.67 | 1.88 | 1.88 | 2.27 | 1.18 | 6.70 | 3.77 |
| Latency L2 (ns) | 6.15 | 12.40 | 12.40 | 7.00 | 9.23 | 45.20 | 31.60 |
| Latency Main (ns) | 91.00 | 136.50 | 107.50 | 212.00 | 176.50 (195 ?) | 387.90 | 236.10 |
| Bcopy (MB/s) | 1210 | 841 | 853 | 667 | 727 | 164 | 182 |
| LM rd (MB/s) | 2430 | 1489 | 1807 | 584 | 1474 | 253 | 304 |
| LM wr (MB/s) | 1532 | 1046 | 1313 | 567 | 1051 | 213 | 289 |
Streams numbers for native compiler and gcc

Native compiler, 6M (2M) elements:

| | P4 3.0GHz (icc8.0 -fast) | McKinley (x2) 900MHz (ecc7.1 -O3) | DEC ES40 667MHz EV67 (cc6.4 -O3, w/o DCPI) | SGI O2K R12K 300MHz (cc -Ofast -64) |
|---|---|---|---|---|
| copy | 2479 | 3318 | 1228 (1310) | 327 |
| scan | 2479 | 3306 | 1228 (1260) | 304 |
| add | 3029 | 3842 | 1282 (1293) | 367 |
| triad | 3024 | 3844 | 1293 (1328) | 353 |

gcc, 6M elements (parenthesized values: with -funroll-all-loops):

| | gcc3.2.3 -O3 | gcc3.3.2 -O3 -m64 | gcc3.2 -O3 | gcc3.3.2 -O3 |
|---|---|---|---|---|
| copy | 2422 | 1635 | 793 (820) | 491 (826) |
| scan | 2459 | 1661 | 734 (756) | 493 (728) |
| add | 2995 | 2350 | 843 (853) | 622 (877) |
| triad | 2954 | 1967 | 844 (858) | 611 (797) |
Memory Bandwidth: Fundamental Performance Limit
• Example: simple 2-D relaxation kernel
  S: r(i,j) = a * x(i,j) + b * (x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1))
— Each execution of S performs 6 ops, uses 5 x values, and writes 1 result (not counting index computations or scalars kept in registers).
— Assume x is double and always comes from main memory:
  – S consumes 40 bytes of memory read bandwidth, 6.7 bytes/op.
  – 1 GB/sec memory ⇒ 150 MFLOPS upper bound, 25M updates/sec.
— But in each outer loop, each x value is used exactly 5 times (6 if r aliases x), so if a clever programmer/compiler gets perfect reuse in registers and cache:
  – S consumes 5*(1/5)*8 = 8 bytes of memory read bandwidth, 1.3 bytes/op.
  – 1 GB/sec memory ⇒ 750 MFLOPS upper bound, 125M updates/sec.
— First, latency has to be hidden:
  – Load scheduling
  – Prefetching
  – Software pipelining
— There are lots of techniques for getting this kind of reuse:
  – Tiling -- good for cache reuse
  – Unroll-and-jam -- both cache and register reuse
Memory Bandwidth: Part 2
Example: realistic 2-D relaxation kernel
  S: r(i,j) = c(i,j)*x(i,j)
            + c(i-1,j)*x(i-1,j) + c(i+1,j)*x(i+1,j)
            + c(i,j-1)*x(i,j-1) + c(i,j+1)*x(i,j+1)
— Each execution of S performs 9 ops and uses 5 x and 5 c values (not counting index computations or scalars kept in registers).
— Assume x is double and always comes from main memory:
  – S consumes 80 bytes of memory read bandwidth, 8.9 bytes/op.
  – 1 GB/sec memory ⇒ 112 MFLOPS upper bound, 12.5M updates/sec.
— In each outer loop, each x value is used exactly 5 times (6 if r aliases x), but each c value is used only once (twice for i≠j if c is symmetric).
— With perfect x reuse, S consumes (5 + 5*(1/5))*8 = 48 bytes of memory read bandwidth, 5.33 bytes/op.
  – 1 GB/sec memory ⇒ 187 MFLOPS upper bound, 20.8M updates/sec.
• Solutions
  — "Superficial stuff": hide latency and get all the reuse you can.
  — "Fundamental stuff":
    – Aggressive algorithms to increase reuse (time skewing, etc.)
    – Other algorithms?
Who cares if the flop rate is low if the solution is better, faster?
Performance Analysis and Tuning
• Increasingly necessary
  — The gap between typical and peak performance is growing.
• Increasingly hard
  — Complex architectures are harder to program effectively:
    – complex processors: VLIW; deeply pipelined, out-of-order, superscalar
    – complex memory hierarchy: non-blocking, multi-level caches; TLB
  — Modern scientific applications pose challenges for tools:
    – multi-lingual programs
    – many source files
    – complex build process
    – external libraries in binary-only form
  — Tools have superficial design problems and aren't used.
A 32+ Year History of Disappointment

Historical origins
— Johnson & Johnston (March 1970)
  – IBM 360/91 at SLAC
  – Profiling on OS/360 MVT using a timer in a control process
— Knuth (circa 1973): characterization of Fortran programs
  – The 90/10 rule (or was it 80/20?)
  – Write a clear, correct initial version.
  – Use profiling tools to identify the hot spots.
  – Focus on the hot spots.

Since then
— Three decades of academic and industrial tools
— Sporadic advocacy in CS and Scientific Computing courses

But
— Compelling success stories are rare.
— Tools are underutilized outside a few small communities.
— A 10% hotspot in a 1M-line application could be 100K lines scattered throughout the program.
Problems with Existing Performance Tools
Most tools are hard to use on big, complex, modular applications:
multiple libraries, runtime linking, multiple languages.

Tools (feel like they) are designed to evaluate architectural choices or OS design, rather than for application tuning.

User interface issues
— GUI problems
  – Platform specific, both for instrumentation and analysis workstations
  – Non-portable, non-collaborative visualization
  – Single-metric displays don't capture underlying problems
  – Need to aggregate data by block, loop, and user-defined scopes
— Failure to make compelling cases
  – Users need to dig to find problems and solutions.
  – Hard to convince management and application developers with a fat stack of printouts and hand-generated notes.
More Problems
Language- or compiler-based tools can't handle big applications.
— They can't handle multi-language, multi-module (library) programs.
— Even vendor compilers are challenged (correctness and optimization) by big programs.

Insufficient analytic power in tools makes analysis difficult and tedious.
— Any one performance metric produces a myopic view.
  – Some metrics are causes, some are effects. What is the relationship?
    • Ex. -- cache misses are problematic only if latency isn't hidden.
    • Ex. -- instruction balance: loads/stores, FLOPS, integer.
    • Ex. -- fusion of compiler analysis, simulation, and counter profiles.
— Labor-intensive tasks are difficult and/or tedious:
  – (re-)inserting explicit instrumentation and/or recompiling
  – manual correlation of data from multiple sources with source code
  – aggregating per-line data into larger scopes
  – computing synthetic measures, e.g. loads/flop
— The tune/reanalyze cycle is slowed or prevented by manual tasks:
  – manual (re-)insertion of instrumentation points
  – manual (re-)computation of derived metrics
  – manual (re-)correlation of performance data with source constructs
HPCToolkit Goals
• Support large, multi-lingual applications
  — a mix of Fortran, C, C++
  — external libraries
  — thousands of procedures
  — hundreds of thousands of lines
  — we must avoid:
    – manual instrumentation
    – significantly altering the build process
    – frequent recompilation
• Multi-platform
• Scalable data collection
• Analyze both serial and parallel codes
• Effective presentation of analysis results
  — intuitive enough for physicists and engineers to use
  — detailed enough to meet the needs of compiler writers
HPCToolkit Workflow
[Workflow diagram: application source → compilation & linking → binary object code; binary object code → profile execution → performance profile; binary object code → binary analysis → program structure; performance profile + program structure + application source → interpret profile / source correlation → hyperlinked database → hpcviewer]
Drive this with scripts. Call scripts in Makefiles.
On parallel systems, integrate scripts with batch system.
HPCToolkit Workflow
— launch unmodified, optimized application binaries
— collect statistical profiles of events of interest
HPCToolkit Workflow
—decode instructions and combine with profile data
HPCToolkit Workflow
—extract loop nesting information from executables
HPCToolkit Workflow
— synthesize new metrics by combining metrics
— relate metrics, structure, and program source
HPCToolkit Workflow
— support top-down analysis with an interactive viewer
— analyze results anytime, anywhere
Data Collection
Support analysis of unmodified, optimized binaries
• Inserting code to start, stop, and read counters has many drawbacks, so don't do it!
  — Nested measurements skew results.
  — Instrumentation points interfere with optimization.
• Use hardware performance monitoring to collect statistical profiles of events of interest.
• Different platforms have different capabilities:
  — event-based counters: MIPS, IA64, AMD64, IA32
  — ProfileMe instruction tracing: Alpha
• Different capabilities require different approaches
http://hipersoft.cs.rice.edu/hpctoolkit/21
Data Collection Tools
Goal: limit development to essentials only
• MIPS-IRIX:
  — ssrun + prof → ptran
• Alpha-Tru64:
  — uprofile + prof → ptran
  — DCPI/ProfileMe → xprof
• IA64-Linux and IA32-Linux:
  — papirun / papiprof
http://hipersoft.cs.rice.edu/hpctoolkit/22
papirun/papiprof
• PAPI: Performance API
  — interface to hardware performance monitors
  — supports many platforms
• papirun: open-source equivalent of SGI's 'ssrun'
  — sample-based profiling of an execution:
    – preload the monitoring library before launching the application
    – inspect the load map to set up sampling for all load modules
    – record PC samples for each module along with the load map
  — Linux IA64 and IA32
• papiprof: 'prof'-like tool
  — based on Curtis Janssen's vprof
  — uses GNU binutils to map PCs to source lines
  — output styles:
    – XML for use with hpcview
    – plain text
http://hipersoft.cs.rice.edu/hpctoolkit/23
DCPI and ProfileMe
• Alpha ProfileMe
  — EV67+ records information about an instruction as it executes:
    – mispredicted branches, memory access replay traps
    – more accurate attribution of events
• DCPI: (Digital) Continuous Profiling Infrastructure
  — samples processor counters and instructions continuously during execution of all code:
    – all programs
    – shared libraries
    – the operating system
  — supports both on-line and off-line data analysis
    – to date, we use only off-line analysis
http://hipersoft.cs.rice.edu/hpctoolkit/24
Metric Synthesis with xprof (Alpha)
Interpret DCPI/ProfileMe samples into useful metrics.

• Transform low-level data into higher-level metrics:
  — DCPI/ProfileMe information is associated with PC values
  — project ProfileMe data into useful equivalence classes
  — decode the instruction type at each PC in the application binary:
    – FLOP
    – memory operation
    – integer operation
  — fuse the two kinds of information:
    – retired-instruction sample points + instruction type ⇒
      retired FLOPs, retired integer operations, retired memory operations
• Map back to source code as papiprof does.
http://hipersoft.cs.rice.edu/hpctoolkit/26
Program Structure Recovery with bloop
• Parse the instructions in an executable using GNU binutils.
• Analyze branches to identify basic blocks.
• Construct the control flow graph using branch-target analysis.
  — Be careful with machine conventions and delay slots!
• Use interval analysis to identify natural loop nests.
• Map machine instructions to source lines with the symbol table.
  — Dependent on accurate debugging information!
• Normalize output to recover a source-level view.

Platforms: Alpha+Tru64, MIPS+IRIX, Linux+IA64, Linux+IA32, Solaris+SPARC
http://hipersoft.cs.rice.edu/hpctoolkit/28
Sample Flowgraph from an Executable
Loop nesting structure:
— blue: outermost level
— red: loop level 1
— green: loop level 2

Observation: optimization complicates program structure!
Normalizing Program Structure
Coalesce duplicate lines.

(1) If duplicate lines appear in different loops:
  – find the least common ancestor in the scope tree and merge the corresponding loops along the paths to each of the duplicates.
  – Purpose: re-rolls loops that have been split.

(2) If duplicate lines appear at multiple levels in a loop nest:
  – discard all but the innermost instance.
  – Purpose: handles loop-invariant code motion.

Apply (1) and (2) repeatedly until a fixed point is reached.

Constraint: each source line must appear at most once.
<LM n="/apps/smg98/test/smg98">
  ...
  <F n="/apps/smg98/struct_linear_solvers/smg_relax.c">
    <P n="hypre_SMGRelaxFreeARem">
      <L b="146" e="146">
        <S b="146" e="146"/>
      </L>
    </P>
    <P n="hypre_SMGRelax">
      <L b="297" e="328">
        <S b="297" e="297"/>
        <L b="301" e="328">
          <S b="301" e="301"/>
          <L b="318" e="325">
            <S b="318" e="325"/>
          </L>
          <S b="328" e="328"/>
        </L>
        <S b="302" e="302"/>
      </L>
    </P>
    ...
  </F>
</PGM>
Recovered Program Structure
— LM: load module
— F: file
— P: procedure
— L: loop
— S: statement
Data Correlation
• Problem
  — Any one performance measure provides a myopic view:
    – some measure potential causes (e.g. cache misses)
    – some measure effects (e.g. cycles)
    – cache misses are not always a problem
  — Event counter attribution is inaccurate for out-of-order processors.
• Approaches
  — multiple metrics for each program line
  — computed metrics, e.g. cycles - FLOPs:
    – eliminate mental arithmetic
    – serve as a key for sorting
  — hierarchical structure:
    – line-level attribution errors still give good loop-level information
HPCViewer Screenshot
[Screenshot: metrics pane and navigation pane, with annotated source view]
Flattening for Top Down Analysis
• Problem
  — A strict hierarchical view of a program is too rigid.
  — We want to compare program components at the same level as peers.
• Solution
  — Enable a scope's descendants to be flattened so that their children can be compared as peers.

[Figure: "flatten"/"unflatten" operations applied to the current scope]
Some Uses for HPCToolkit
• Identifying unproductive work
  — Where is the program spending its time not performing FLOPs?
• Memory hierarchy issues
  — bandwidth utilization: misses × line size / cycles
  — exposed latency: ideal vs. measured
• Cross-architecture or compiler comparisons
  — What program features cause performance differences?
• Gap between peak and observed performance
  — loop balance vs. machine balance?
• Evaluating load balance in a parallelized code
  — How do profiles for different processes compare?
Assessment of HPCToolkit Functionality
• Top-down analysis focuses attention where it belongs.
  — Sorted views put the important things first.
• Integrated browsing interface facilitates exploration.
  — A rich network of connections makes navigation simple.
• Hierarchical, loop-level reporting facilitates analysis.
  — A more sensible view when statement-level data is imprecise.
• Binary analysis handles multi-lingual applications and libraries.
  — Succeeds where language- and compiler-based tools can't.
• Sample-based profiling, aggregation, and derived metrics
  — reduce manual effort in the analysis and tuning cycle.
• Multiple metrics provide a better picture of performance
• Multi-platform data collection
• Platform independent analysis tool
What’s Next?
Research
— collect and present dynamic content:
  – What path gets us to expensive computations?
  – accurate call-graph profiling of unmodified executables
  – analysis and presentation of dynamic content
— communication in parallel programs
— statistical clustering for analyzing large-scale parallelism
— performance diagnosis: why rather than what

Development
— harden the toolchain
— new platforms: Opteron and PowerPC (Apple)
— data collection with oprofile on Linux
— distributed and collaborative extensions
Contacts
HPCView tools:
http://www.cs.rice.edu/~dsystem/hpcview/
Group: [email protected]
Rob: [email protected] John: [email protected]
RISC “High ILP” processors: What went wrong?
• The "RISC revolution" was predicated on the assumption that aggressive compiler technology was mature and would be able to generate optimized code that could efficiently utilize these machines.
• "ILP features" (superscalar, VLIW, non-blocking caches) increase the reliance on effective compilation.

But

• In practice, many very aggressive optimizations fail to be applied to real programs. Implementation bugs? Incorrect assumptions? Aspects of the applications?
• A big scientific code that gets 5% of the peak flop rate is considered to be doing well.
• Some big scientific codes get <0.05 FLOPs/cycle.
• Even well-compiled programs are bandwidth bound.
  — Why not just use the cheapest Athlon compatible with the fastest chipset/memory combination?