
What if Performance Tools Were Really Easy to Use?

Rob Fowler

Center for High Performance Software, Rice University

“Easy” to “Use”?

Easy to use in program development and tuning cycles.

• “Easy” means little or no pain or tedious manual labor.

• “Use” means to routinely employ because it makes your work life better.

– Improved productivity

– New knowledge

– Improved intuition

Historical origins

– Johnson & Johnston (March 1970)
  » IBM 360/91 at SLAC.
  » Profiling on OS/360 MVT using timer in control process.

– Knuth (circa 1973) characterization of Fortran programs
  » The 90/10 rule (or was it 80/20?)
  » Write a clear, correct initial version.
  » Use profiling tools to identify the hot spots.
  » Focus on the hot spots.

Since then

– Three decades of academic and industrial tools

– Sporadic advocacy in CS and Scientific Computing courses

But

– Compelling success stories are rare.

– Tools are underutilized outside a few small communities.

A 31+ Year History of Disappointment

A 10% hotspot in a 1M-line application could be 100K lines scattered throughout the program.

Background

What we (Rice Parallel Compilers Group) do.
– Code optimization.
  » Aggressive (mostly source-to-source) compiler optimization.
  » Hand application of transformations.
    o Try out transformations by hand first.
    o Work on real codes with algorithm, library, and application developers. (Deployment time for compiler research is measured in years.)

Why we spend a lot of time (too much?) analyzing executions.
– It is hard to understand deeply pipelined, out-of-order, superscalar processors with non-blocking caches and deep memory hierarchies.
– Aggressive (-O4) and/or idiosyncratic vendor compilers make life harder.

What we did.
– Built tools to meet our needs w.r.t. the run/analyze/tune cycle.

Goals for this talk.
– To spread the word.
– To encourage others to use (and contribute to) the tools.
– To explore areas for collaboration.

Problems with Existing Performance Tools

Most tools are hard to use on big, complex, modular applications.
– Multiple libraries, runtime linking, multiple languages.

Tools (feel like they) are designed to evaluate architectural choices or OS design, rather than for application tuning.

User Interface Issues
– GUI problems
  » Platform-specific, both for instrumentation and analysis workstations.
  » Non-portable, non-collaborative visualization.
  » Single-metric displays don’t capture underlying problems.
  » Need to aggregate data by block, loop, and user-defined scopes.
– Failure to make compelling cases.
  » Users need to dig to find problems and solutions.
  » Hard to convince management and application developers with a fat stack of printouts and hand-generated notes.

More Problems

Language- or compiler-based tools can’t handle big applications.
– They can’t handle multi-language, multi-module (library) programs.

– Even vendor compilers are challenged (correctness and optimization) by big programs.

Insufficient analytic power in tools makes analysis difficult/tedious.
– Any one performance metric produces a myopic view.
  » Some metrics are causes, some are effects. What is the relationship?
  » Ex. – Cache misses are problematic only if latency isn’t hidden.
  » Ex. – Instruction balance: loads/stores, FLOPs, integer.
  » Ex. – Fusion of compiler analysis, simulation, and counter profiles.
– Labor-intensive tasks are difficult and/or tedious.
  » (Re-)inserting explicit instrumentation and/or recompiling.
  » Manual correlation of data from multiple sources with source code.
  » Aggregating per-line data into larger scopes.
  » Computing synthetic measures, e.g. loads/FLOP.
– Tune/reanalyze cycle slowed or prevented by manual tasks.
  » Manual (re-)insertion of instrumentation points.
  » Manual (re-)computation of derived metrics.
  » Manual (re-)correlation of performance data with source constructs.

Our Approach

Under control of scripts/makefiles:

1. Gather performance data from multiple sources.
   » Static analyses of source and executable code.
   » Profilers (PC sampling, hardware performance counters, ProfileMe).
   » Simulation (PIXIE, MHSIM).

2. Fuse the data, compute derived metrics, and correlate with source code to form a hyperlinked document/database. (A minimal sketch of this step follows the list.)

3. View the document anywhere using a commodity hypertext browser.
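To make step 2 concrete, here is a minimal sketch of the fuse-and-derive idea: read two per-line profiles and emit a synthetic loads/FLOP metric. The two-column "line count" input format and the file names are illustrative assumptions, not the output of any particular profiler, and this is not HPCView's actual implementation (which also handles source correlation and hyperlinking).

#include <stdio.h>

#define MAX_LINES 100000

/* Per-line accumulators for the two input profiles. */
static double loads[MAX_LINES], flops[MAX_LINES];

static void read_profile(const char *path, double *metric)
{
    FILE *f = fopen(path, "r");
    int line;
    double count;
    if (!f) { perror(path); return; }
    while (fscanf(f, "%d %lf", &line, &count) == 2)
        if (line >= 0 && line < MAX_LINES)
            metric[line] += count;
    fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s loads.prof flops.prof\n", argv[0]);
        return 1;
    }
    read_profile(argv[1], loads);   /* e.g. a load-counter profile */
    read_profile(argv[2], flops);   /* e.g. a FLOP-counter profile */

    /* Emit the fused, derived metric for every line that did floating-point work. */
    for (int line = 0; line < MAX_LINES; line++)
        if (flops[line] > 0.0)
            printf("line %d: %.2f loads/FLOP\n", line, loads[line] / flops[line]);
    return 0;
}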

Tools

– MHSIM: a simulator for multi-level memory hierarchies
  » Our first foray, but usually expensive overkill.
  » Counts memory events, but doesn’t measure actual cost. (A toy illustration of the idea follows this slide.)
– HPCView: correlates multiple “profiles” with source code
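As a toy illustration of what a memory-hierarchy simulator counts (and of why it counts events rather than measuring cost), the sketch below pushes an address trace through two direct-mapped cache models and tallies misses. The cache sizes, the synthetic traces, and the whole bare-bones model are assumptions for illustration; MHSIM itself is far more elaborate.

#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 64
#define L1_SETS    512      /* 512 x 64 B  = 32 KB direct-mapped L1 (assumed) */
#define L2_SETS    16384    /* 16384 x 64 B = 1 MB direct-mapped L2 (assumed) */

static uint64_t l1_tag[L1_SETS], l2_tag[L2_SETS];
static long refs, l1_miss, l2_miss;

/* Feed one address through the two-level model and count the events. */
static void access_mem(uint64_t addr)
{
    uint64_t block = addr / LINE_BYTES;
    refs++;
    if (l1_tag[block % L1_SETS] != block) {          /* L1 miss */
        l1_miss++;
        l1_tag[block % L1_SETS] = block;
        if (l2_tag[block % L2_SETS] != block) {      /* L2 miss */
            l2_miss++;
            l2_tag[block % L2_SETS] = block;
        }
    }
}

int main(void)
{
    for (int i = 0; i < L1_SETS; i++) l1_tag[i] = UINT64_MAX;
    for (int i = 0; i < L2_SETS; i++) l2_tag[i] = UINT64_MAX;

    /* A unit-stride sweep followed by a large-stride sweep over 8 MB of doubles. */
    for (uint64_t i = 0; i < (1u << 20); i++) access_mem(i * 8);
    for (uint64_t i = 0; i < (1u << 20); i++) access_mem((i * 4096) % (8u << 20));

    printf("%ld refs, %ld L1 misses, %ld L2 misses\n", refs, l1_miss, l2_miss);
    return 0;
}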

HPCView in Detail

A toolkit for browsing source code and profile-like data
– Data from any source that generates prof-like output
  » Timer-driven PC-sampling profiles (prof, ssrun)
  » Hardware performance counter profilers (ssrun, uprofile, DCPI/ProfileMe)
  » Static analyzers (register-spill analyzers, pipeline stall analysis)
  » Simulation (pixie, i.e. ssrun -ideal, memory simulators)
– Source code in multiple languages, modules, and libraries
– Any combination of data-collection sources, including cross-platform
– Generation of synthetic metrics, e.g. mem_refs/FLOP
– Hierarchical data aggregation specified by a “structure” file (a rollup sketch follows this list)
  » Loop structure extracted from program executables
  » Explicitly specified structure
– Output: a hyperlinked performance database
– Browse with Netscape or IE on any platform
  » View on almost any desktop or notebook platform
  » Promotes a collaborative, distributed analysis process
– New this week: HPCViewer, a standalone browser in Java
  » Gets rid of the huge mass (e.g. 300 MB, 26K files) of static HTML
  » Enables dynamic analysis, etc.
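The hierarchical-aggregation item above is the step that turns per-line samples into per-loop and per-procedure totals. Below is a minimal sketch of that rollup; the scope table, its line ranges, and the stdin profile format are invented stand-ins, not HPCView's structure-file syntax.

#include <stdio.h>

/* Hypothetical scope table standing in for what a structure file describes:
 * each scope is a named source-line range, and nested scopes simply overlap. */
struct scope { const char *name; int first_line, last_line; double total; };

static struct scope scopes[] = {
    { "file  smv.f",         1, 400, 0.0 },
    { "proc  smv_mult",     40, 180, 0.0 },
    { "loop  smv_mult:55",  55,  90, 0.0 },
};

int main(void)
{
    int nscopes = sizeof scopes / sizeof scopes[0];
    int line;
    double samples;

    /* Per-line profile on stdin as "line samples" pairs (an assumed format). */
    while (scanf("%d %lf", &line, &samples) == 2)
        for (int s = 0; s < nscopes; s++)
            if (line >= scopes[s].first_line && line <= scopes[s].last_line)
                scopes[s].total += samples;   /* every enclosing scope accumulates */

    for (int s = 0; s < nscopes; s++)
        printf("%-20s %14.0f\n", scopes[s].name, scopes[s].total);
    return 0;
}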

Aside: Against Intrusive Instrumentation

• In source code –

– User resistance/economics.
  » Manual instrumentation/recompilation is too tedious in big codes.
    o Hot spots are not “small”, e.g. 5% of 500,000 lines, in big simulations.
– Probe effects apply to software as well as hardware. (See the sketch after this slide.)
  » Used with optimizing compilers, nothing good can happen.
    o If code motion/reorganization across instrumentation points is allowed, then the measurements do not correspond to the programmer’s intent.
    o If code motion is inhibited, then the target is compiled poorly.
  » Multiple-functional-unit, deeply pipelined, non-blocking processors.
    o Pipeline perturbation is a problem no matter when the instrumentation is inserted.
– The relevant scopes are difficult to recognize a priori.
  » Aggressive optimization, again.
    o Lines: what is one statement’s contribution to an unrolled, software-pipelined loop? With code hoisting, CSE, tail merging, etc.?
    o Procedures: big procedures lack detail; small procedures should be inlined.

• In object code –

– Manual insertion is not practical. (Automated insertion in code generation or later?)
– Presenting the many-to-many source-object mapping plus performance data is an open problem.

– Again, probe effects on pipeline and cache behavior.
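To make the probe-effect argument concrete, consider a hand-instrumented loop like the sketch below. record_event() is a hypothetical probe, not any particular tool's API: because the compiler cannot see into the call, it must assume the probe may read or write anything, so it cannot freely unroll, software-pipeline, or move code across it; either the measurement constrains the optimizer or the optimizer invalidates the measurement.

/* record_event() is opaque to the optimizer, so code motion across it is
 * inhibited and the loop below is no longer the loop we meant to measure. */
void record_event(int source_line);

void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        record_event(42);        /* per-iteration probe: blocks unrolling/pipelining */
        y[i] += a * x[i];        /* the statement we actually wanted to time */
    }
}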

[HPCView screenshots: source code pane, hierarchical display of metrics aggregated by file, and navigation pane. Annotations: 18% of cycles in the SMV multiply; ~19 cycles per FLOP.]

[HPCView screenshot: stall cycles estimated by comparing actual vs. ideal cycles, a la MTOOL; 41% of stalls caused by smv_mult; indices reordered!]

Success

• We use HPCView on a daily basis.

– Integral part of our analysis of compiler and library work.

– Routine analysis of benchmark codes (NAS, Sweep3D, SMG98,…)

• We have used “hands-on deployment” to identify opportunities for tuning important apps at NCSA, Sandia, and LANL.

– Time to first interesting analysis < 1 hour.

– User retains scripts to continue independently.

Thought for the Day

The Hitchhiker’s Guide to the Galaxy, in a moment of reasoned lucidity which is almost unique among its current tally of five million, nine hundred and seventy-three thousand, five hundred and nine pages, says of the Sirius Cybernetics Corporation products that “it is very easy to be blinded to the essential uselessness of them by the sense of achievement you get from getting them to work at all. In other words -- and this is the rock-solid principle on which the whole of the Corporation's galaxywide success is founded -- their fundamental design flaws are completely hidden by their superficial design flaws.”

(Douglas Adams, "So Long, and Thanks for All the Fish")

Thesis of this talk

• When performance tools are easy enough to use routinely, we (the computer science and scientific computation communities) will be confronted with fundamental problems on a daily basis.

• Some of these will be seen as attractive research problems.

• Some will be important, but hard research problems.

• Some will be recurring, annoying reminders of truths that we have been able to ignore thus far. (Note: D. Adams also described the concept of a SEP, somebody else’s problem.)

• Of the SEPs, many will be intractable.

Can we handle the truth?

Instrumentation Issues for HPCView

• Assimilate network performance data.
• Problems with existing profiling tools
  – Must handle dynamic libraries, multi-threading, multi-process, MPI.
    » ssrun is the only “industrial strength” profiler.
    » If DCPI/ProfileMe survives, it is the preferred approach.
    » uprofile is close.
  – Low-overhead profiling is needed for detail on large problems.
  – Poor attribution of costs.

• The future of counter-based profiling tools
  – Compiler-directed instrumentation
    » Extremely efficient if counters are visible to user code. (A sketch using PAPI follows this slide.)
  – Dynamic control of data collection.
    » Collect data for disjoint sampling intervals in long production runs.
    » Reflective control of data collection.
  – Encourage kernel support for non-invasive profiling on more systems!
  – The future of PAPI/cprof on Linux and AIX systems.
    » Build a kernel-level HPC_sprofil()?
    » Add ProfileMe support to Alpha Linux?
    » Collaborators?
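As a hedged sketch of “counters visible to user code”, the fragment below uses PAPI's preset-event interface to read cycle and floating-point counts around a region. It assumes a platform where PAPI_TOT_CYC and PAPI_FP_OPS can be counted together, and it is aggregate region counting, not the sampling-based profiling discussed above.

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define N (1 << 20)
static double a[N], b[N];

static void compute_kernel(void)        /* stand-in for the code being measured */
{
    for (int i = 0; i < N; i++)
        a[i] += 2.5 * b[i];
}

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK ||
        PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK ||
        PAPI_add_event(evset, PAPI_FP_OPS) != PAPI_OK)
        exit(1);

    PAPI_start(evset);
    compute_kernel();
    PAPI_stop(evset, counts);

    printf("cycles = %lld, FP ops = %lld, cycles/FLOP = %.1f\n",
           counts[0], counts[1], (double)counts[0] / (double)counts[1]);
    return 0;
}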

RISC “High ILP” processors: What went wrong?

• The “RISC revolution” was predicated on the assumption that aggressive compiler technology would be able to generate optimized code that efficiently utilizes these machines.

• “ILP features” (superscalar, VLIW, non-blocking caches) increase the reliance on effective compilation.

But

• In practice, many very aggressive optimizations fail to be applied to real programs. Implementation bugs? Incorrect assumptions? Aspects of the apps?

• A big scientific code that gets 5% of the peak FLOP rate is considered to be doing well.

• Some big scientific codes get < 0.05 FLOP/cycle.

• Even well-compiled programs are bandwidth bound (see the back-of-the-envelope sketch below).

  – Why not just use the cheapest Athlon compatible with the fastest chipset/memory combination?
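A back-of-the-envelope calculation behind the last two points (all numbers here are illustrative assumptions, not measurements): the triad below moves 24 bytes of operands per 2 FLOPs, i.e. 12 bytes/FLOP. A hypothetical 1 GHz, 2-FLOP/cycle processor fed by 1 GB/s of sustained memory bandwidth can therefore run this kernel at only about 1e9 / 12 ≈ 83 MFLOP/s, roughly 4% of its 2 GFLOP/s peak, no matter how well the loop body is compiled.

/* 2 loads + 1 store of doubles = 24 bytes of memory traffic for 2 FLOPs. */
void triad(int n, double alpha, const double *x, const double *y, double *z)
{
    for (int i = 0; i < n; i++)
        z[i] = alpha * x[i] + y[i];
}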

Performance Counter Issues.

• An intuitive semantics for hardware counters? Consistency across architectures, generations, …

• Attribution issues (hardware)
  – Multiple function units, deep pipelines, out-of-order dispatch and execution. What’s the PC when an event occurs?
  – Non-blocking, out-of-order cache systems. The L2 miss event on the R10K occurs on the cycle after the second block of data is moved off the bus.

• Attribution issues (software)
  – Some compilers aren’t perfect at maintaining source-object maps.
  – After optimization, instructions from multiple statements are interleaved and will be executed concurrently.
  – After optimization there’s a many-to-many map between source constructs and object instructions.

ProfileMe fixes these problems, but “Alpha’s dead”. Besides, Tru64 never supported it well.

Compilers (e.g. GEM) can maintain many-to-many maps, but it’s an open problem to present the resulting information effectively to certain classes of users.

Application Lifetime Issues.

• The “get it right before trying to make it fast” approach is doomed. Performance must be engineered into the program from the beginning.

• The 100K line hotspot problem.

• Realistically, 1000 lines scattered across the program contribute 90% of the cost.

• We need Concurrent Engineering of scientific apps.

Algorithmic Issues.

We are starting to see similar algorithmic problems in many of the codes that we’ve examined with HPCView.

• Insufficient computational work for each memory operation.

• Large, sparse matrix-vector products.

– Equivalent to “stencil codes” with explicit looping over space.

– Code appears to be similar to examples in literature.

• The Krylov bottleneck

– Large, stiff systems of PDEs, e.g. with diffusion equations.

– Krylov-space (e.g. conjugate gradient) algorithms are robust and converge in few iterations, but they entail several global collective communication operations per iteration. (A sketch of both issues follows.)
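As a sketch of both issues (not code from any of the applications examined), the CSR matrix-vector product below performs 2 FLOPs per nonzero while touching roughly 20 bytes (a column index, a value, and a gathered x entry), so its computational intensity is inherently low; and the dot product that each Krylov iteration needs ends in an MPI_Allreduce that synchronizes every process.

#include <mpi.h>

/* y = A*x with A stored in compressed sparse row (CSR) form. */
void spmv(int n, const int *rowptr, const int *col, const double *val,
          const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* 2 FLOPs per ~20 bytes touched */
        y[i] = sum;
    }
}

/* The per-iteration global reduction that makes Krylov methods
 * latency-sensitive at scale: a local dot product followed by an
 * all-process reduction. */
double global_dot(int nlocal, const double *u, const double *v)
{
    double local = 0.0, result = 0.0;
    for (int i = 0; i < nlocal; i++)
        local += u[i] * v[i];
    MPI_Allreduce(&local, &result, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return result;
}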