Performance Analysis with Parallel Performance Wizard
Prashanth Prakash, Research Assistant
Dr. Vikas Aggarwal, Research Scientist
Vrishali Hajare, Research Assistant
Professor Alan D. George, Principal Investigator
HCS Research Laboratory
University of Florida
Outline
  Introduction talk (~20 minutes)
  Hands-on
    PPW basics
    Performance data collection
    Performance analysis
    Automatic analysis
  Feel free to ask questions during the talk or the hands-on session
2
Parallel Performance Analysis
  The need for performance analysis
    High-performance computing has performance as an explicit, fundamental goal
    I just got my parallel program working, and...
      My program does NOT yield the expected performance
      Why is this?
      How do I fix my program?
  The challenge of performance analysis
    Understanding the performance of sequential applications can be challenging
    The complexity of parallel computing makes program performance even harder to understand without performance analysis tools
3
Performance Analysis Approaches
  Three general performance analysis approaches
  Analytical modeling
    Mostly predictive methods
    Can also be used in conjunction with experimental performance measurement
    Pros: easy to use, fast, can be performed without running the program
    Cons: usually not very accurate
  Simulation
    Pros: allows performance estimation of a program on various system architectures
    Cons: slow, not generally applicable for regular UPC/SHMEM users
  Experimental performance measurement
    Strategy used by most modern performance analysis tools
    Uses actual event measurement to perform analysis
    Pros: most accurate
    Cons: can be time-consuming (iterative tuning process)
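As one illustration of the analytical modeling approach, Amdahl's law predicts speedup from the fraction of runtime that can be parallelized, without running the program at all. This is only a minimal sketch: the function name and the 0.9 parallel fraction are assumptions chosen for illustration and are not taken from PPW or from the slides.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Predicted speedup when only `parallel_fraction` of the runtime scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # 0.9 is a hypothetical parallel fraction, not a measured value.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} threads: predicted speedup {amdahl_speedup(0.9, n):.2f}")
```

The model is quick to evaluate, but it ignores communication and synchronization costs, which is one reason such predictions are usually not very accurate.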
4
Role of a Performance Analysis Tool
[Figure: flow diagram with boxes labeled Original Application, Runtime Performance Data Gathering, Data Processing and Analysis, Data and Result Presentation, and Optimized Application]
5
Performance Analysis Stages
  Instrumentation: insertion of code to facilitate performance measurement
  Measurement: collection of performance data at runtime
  Analysis: examination and processing of performance data to find and potentially resolve bottlenecks
  Presentation: display of analyzed data to the tool user
  Optimization: modification of the application to remove performance problems
6
Instrumentation Techniques
  Runtime/compiler instrumentation
    Provides the most detailed information about the user's program
    Requires vendor cooperation (modifications to compiler/runtime)
  Source instrumentation
    Directly modifies the user's source code
    Can provide much information, but may interfere with compiler optimizations
  Interposition ("wrapper libraries")
    No recompilation needed, only relinking
    Only provides information about library calls
    Can be difficult to get source-level information
    Relies on alternate function entry points or dynamic linker hacks
  Binary instrumentation
    Most of the benefits of source instrumentation without the need for recompilation
    Can be difficult to get source-level information
    Highly platform-specific; existing toolkits lack support for some platforms (e.g., Cray)
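The wrapper-library idea can be illustrated outside UPC. The sketch below is a Python analogy and not PPW's mechanism: a wrapper replaces a library entry point, records call counts and elapsed time, and forwards to the original, so the calling code is left unchanged. The names interpose and call_log are hypothetical.

```python
import functools
import math
import time

# Hypothetical store: function name -> [call count, total seconds].
call_log = {}


def interpose(module, name):
    """Replace module.name with a timing wrapper and return the original."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            entry = call_log.setdefault(name, [0, 0.0])
            entry[0] += 1
            entry[1] += time.perf_counter() - start

    setattr(module, name, wrapper)
    return original


if __name__ == "__main__":
    original_sqrt = interpose(math, "sqrt")
    total = sum(math.sqrt(i) for i in range(1000))  # caller code is unchanged
    math.sqrt = original_sqrt  # restore the original entry point
    count, seconds = call_log["sqrt"]
    print(f"sqrt called {count} times, {seconds:.6f} s total")
```

As with a real wrapper library, the wrapper sees only calls that go through the replaced entry point and knows nothing about the source line that made the call.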
7
Measurement Techniques
  Profiling
    Records statistical information about execution time and/or hardware counter values (PAPI)
    Relates information to basic blocks (functions, upc_forall loops) in source code
    Important concept: inclusive vs. exclusive time (total vs. self)
  Tracing
    Records a full log of when events happen at runtime and how long they take
    Gives very complete information about what happened at runtime
    Requires much more storage space than profiling!
  Sampling
    Special low-overhead mode of profiling that attributes performance information via indirect measurement (samples)
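The relationship between a trace and a profile, and between inclusive and exclusive time, can be made concrete with a small sketch. It reduces a list of enter/exit events to per-function totals: inclusive time covers a function and everything it calls, while exclusive time subtracts the time spent in callees. The event format and the sample data are assumptions for illustration and do not reflect PPW's file format.

```python
from collections import defaultdict

# Hypothetical trace: (timestamp in seconds, event kind, function name).
TRACE = [
    (0.0, "enter", "main"),
    (1.0, "enter", "compute"),
    (4.0, "exit", "compute"),
    (4.0, "enter", "exchange"),
    (9.0, "exit", "exchange"),
    (10.0, "exit", "main"),
]


def profile_from_trace(trace):
    """Reduce an enter/exit trace to per-function inclusive and exclusive time."""
    inclusive = defaultdict(float)
    exclusive = defaultdict(float)
    stack = []  # each frame: [function name, enter time, time spent in callees]
    for timestamp, kind, name in trace:
        if kind == "enter":
            stack.append([name, timestamp, 0.0])
        elif kind == "exit":
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unbalanced trace at exit of {name}")
            _, entered, child_time = stack.pop()
            total = timestamp - entered
            inclusive[name] += total
            exclusive[name] += total - child_time
            if stack:
                stack[-1][2] += total  # charge this call to the caller's callee time
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return dict(inclusive), dict(exclusive)


if __name__ == "__main__":
    inclusive, exclusive = profile_from_trace(TRACE)
    for name in inclusive:
        print(f"{name}: inclusive {inclusive[name]:.1f} s, exclusive {exclusive[name]:.1f} s")
```

The profile keeps one pair of numbers per function, while the trace grows with every event, which is why tracing needs much more storage. The sketch does not treat recursive calls specially.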
8
Parallel Performance Wizard (PPW)
  Performance analysis tool developed in the HCS Lab here at UF
  Designed for partitioned global-address-space (PGAS) programming models (UPC and SHMEM)
  Also supports MPI; other support in the works
  Features
    Uses the experimental measurement approach
    Provides profiling and tracing support
    Has numerous visualizations and advanced automated analysis
  Overarching design goals
    Be user-friendly
    Enhance productivity
    Aim for portability
9
PPW Hands-on…
10
Hands-on
  Boot the liveDVD in a VM or directly on hardware
  Initial setup
    Export the PATH variable to include a recent release of PPW and UPC
      export PATH=/usr/local/packages/ppw-2.6.2/bin/:/usr/local/packages/bupc-2.12.1/bin/:$PATH
    All applications we use today are in the directory
      cd /home/livetau/workshop-point/UPC_apps
  You can download these slides from the links below (the following slides have the necessary commands and will come in handy)
    http://hcs.ufl.edu/~prakash/pgas/PPW_Tutorial.ppt
    http://hcs.ufl.edu/~prakash/pgas/PPW_Tutorial.pdf
11
Programming in UPC (bupc)
  Compiling a UPC program
    upcc hello.c -o hello
  Execution
    upcrun -n 4 hello
12
Using PPW in a Nutshell
  Recompile application (Instrumentation)
    Use ppwupcc instead of upcc
    ppwshmemcc (for SHMEM) and ppwmpicc (for MPI)
  Run application (Measurement)
    ppwrun <ppwrun options> <command to execute parallel application>
  View performance data (Analysis + Presentation)
    ppw file.par
  Change code (Optimization), recompile, repeat
13
PPW (for UPC) in a Nutshell
  Recompile application (Instrumentation)
    ppwupcc CAMEL_upc.c -o camel
  Run application (Measurement)
    ppwrun --output=file.par upcrun -n 4 camel abcd1234
  View performance data (Analysis + Presentation)
    ppw file.par
  Change code (Optimization), recompile, repeat
  Note: PPW should be built with --with-upc, and Berkeley UPC should be built with --with-multiconf=+opt_inst
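The compile, run, and view steps above can be collected into a small driver. The sketch below only assembles the command lines shown on this slide and prints them as a dry run; the helper name ppw_upc_commands is hypothetical, and treating abcd1234 as an argument passed to camel is an assumption.

```python
import shlex


def ppw_upc_commands(source, executable, threads, par_file, program_args=()):
    """Return the compile, run, and view command lines as argument lists."""
    compile_cmd = ["ppwupcc", source, "-o", executable]
    run_cmd = ["ppwrun", f"--output={par_file}",
               "upcrun", "-n", str(threads), executable, *program_args]
    view_cmd = ["ppw", par_file]
    return [compile_cmd, run_cmd, view_cmd]


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for cmd in ppw_upc_commands("CAMEL_upc.c", "camel", 4, "file.par", ["abcd1234"]):
        print(shlex.join(cmd))
```

On a machine where PPW and Berkeley UPC are installed and on the PATH, each list could be passed to subprocess.run(cmd, check=True) instead of being printed.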
14
PPW Useful Options
  Tracking user function entry and exit
    Pass --inst-functions to ppwupcc
  Communication matrix
    Pass --comm-stats to ppwrun
  Just open the .par file using ppw to find all the data
    ppw file.par
  Source archive (.sar file)
    Required during execution
    Keep the .sar file in the same directory as the executable
15
NPB 2.4
  Compiling
    cd NPB2.4/FT
    make CLASS=X NP=N
    where X can be S, A, B, or C (preferably S or A)
  Execution is the same as before
  NPB2.4 is developed and maintained by George Washington University (upc.gwu.edu)
16
Tracing
  Compilation is the same as before, using ppwupcc
  Pass the --trace option to request tracing
    ppwrun --trace --output=a.par upcrun -n 4 ft.A.4
  Convert to SLOG-2 using ppw (or par2slog2)
    File -> Export -> <choose slog2>
  Use Jumpshot to view the trace
    jumpshot ft.slog2
17
Export: Convert to Other Popular Formats
  A .par file can be exported to several popular performance data formats
  Supported formats include
    TAU profile
    CUBE profile
    OTF trace file (Vampir)
    SLOG-2 (Jumpshot)
18
Case Study: Analyzing FT of NPB2.4
  NPB2.4 FT benchmark (class=A, np=4) executed on an InfiniBand (IB) cluster with 1 thread per node
  You can download the par file and slog2 file at
    http://hcs.ufl.edu/~prakash/pgas/ftA4.par
    http://hcs.ufl.edu/~prakash/pgas/ftA4.slog2
19
Case Study: FT
  Identify the bottleneck
    Sort by total time and look for bottlenecks
      upc_getmem at ft.c:1950
    The profile alone cannot confirm this, so take a look at the trace
    Observe the trace output and the behavior of the code section from ft.c:1943 to ft.c:1953
      The upc_getmem calls are serialized, which is unnecessary in this case
20
Case Study: FT
  How to fix?
    Use bupc_getmem_async, a Berkeley UPC extension for asynchronous getmem
      http://upc.lbl.gov/publications/upc_memcpy.pdf
  Did it improve performance?
    Download the par file generated after the changes to ft.c
      http://hcs.ufl.edu/~prakash/pgas/ftA4_m.par
    Observe the changes in the profile data
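Why serialized gets hurt can be seen with a simple cost model. This is an assumption made here for illustration and not a measurement from FT: each blocking get pays its full transfer latency in turn, while asynchronous gets that are all issued before waiting can overlap, so the cost approaches the longest transfer plus a small issue overhead per get. All latency values are hypothetical.

```python
def blocking_cost(latencies):
    """Each get must finish before the next is issued, so the costs add up."""
    return sum(latencies)


def overlapped_cost(latencies, issue_overhead=0.0):
    """All gets are issued first, then waited on together (idealized overlap)."""
    if not latencies:
        return 0.0
    return issue_overhead * len(latencies) + max(latencies)


if __name__ == "__main__":
    # Hypothetical: three remote gets of 2 ms each, 0.125 ms to issue each one.
    gets = [2.0, 2.0, 2.0]
    print(f"blocking:   {blocking_cost(gets):.3f} ms")
    print(f"overlapped: {overlapped_cost(gets, issue_overhead=0.125):.3f} ms")
```

The overlapped figure is a best case. Real transfers share network bandwidth, so the measured improvement should be checked against the profile data, as the slide suggests.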
21
Automatic Analysis
  Why do we need automatic analysis?
    The increasing size of performance data sets makes it hard to identify and resolve bottlenecks
  What will automatic analyses do?
    Automatically detect, diagnose, and possibly resolve bottlenecks
22
Automatic Analysis
  Application analyses
    Deal with a single run and include
      Bottleneck detection
      Cause analysis
      High-level analysis
  Experiment set analyses
    Compare the performance of related runs
      Scalability analysis
      Revision analysis
23
Application Analysis
  Bottleneck detection
    Examines profile data and identifies the profiling entries that are bottlenecks
    Uses a baseline comparison and deviation evaluation method
  Cause analysis
    Identifies the reason for each bottleneck; requires trace data to complete the analysis
  High-level analysis
    Mainly used to detect bottleneck nodes that, when optimized, could improve the application performance for a single experiment
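The slide names a baseline comparison and deviation evaluation method without spelling it out. One plausible reading, offered as an assumption and not as PPW's actual algorithm, is to compare each profiling entry's time against a baseline value and flag entries whose relative deviation exceeds a threshold. The function name, the sample data, and the 0.25 threshold are hypothetical.

```python
def find_bottlenecks(profile, baseline, threshold=0.25):
    """Return (entry, deviation) pairs whose time exceeds the baseline.

    profile and baseline map a profiling entry name to a time in seconds.
    Deviation is (measured - baseline) / baseline; entries above `threshold`
    are returned, largest deviation first.
    """
    flagged = []
    for name, measured in profile.items():
        expected = baseline.get(name)
        if not expected:  # skip entries with no baseline or a zero baseline
            continue
        deviation = (measured - expected) / expected
        if deviation > threshold:
            flagged.append((name, deviation))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-entry times in seconds.
    measured = {"upc_getmem": 3.0, "upc_barrier": 1.0, "fft": 2.0}
    expected = {"upc_getmem": 1.0, "upc_barrier": 1.0, "fft": 1.5}
    for name, deviation in find_bottlenecks(measured, expected):
        print(f"{name}: {deviation:.0%} above baseline")
```

A detection step like this works from profile data alone; explaining why a flagged entry is slow is the job of cause analysis, which needs the trace.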
24
Application Analysis
  Analysis -> Run Application Analysis
25
Experiment Set Analyses
  Scalability analysis
    Plots the scaling factor (relative speedup) values against the ideal scaling value
    A scaling factor of 1.00 implies perfect scalability
    Analysis -> Run Scalability Analysis
  Revision analysis
    Compares and evaluates different versions of the same application
    Profile Charts -> Total Times by Function
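The scaling factor used in scalability analysis can be sketched as follows. The slides do not give the formula, so this assumes it is the relative speedup over a baseline run divided by the ideal speedup for the same increase in thread count, which makes 1.00 correspond to perfect scalability. The function name and the run times are hypothetical.

```python
def scaling_factor(base_threads, base_time, threads, time):
    """Relative speedup divided by ideal speedup; 1.0 means perfect scaling."""
    if min(base_threads, threads) < 1 or min(base_time, time) <= 0:
        raise ValueError("thread counts and times must be positive")
    relative_speedup = base_time / time
    ideal_speedup = threads / base_threads
    return relative_speedup / ideal_speedup


if __name__ == "__main__":
    # Hypothetical run times in seconds, keyed by thread count.
    runs = {1: 80.0, 2: 40.0, 4: 25.0, 8: 20.0}
    for n, t in runs.items():
        print(f"{n} threads: scaling factor {scaling_factor(1, runs[1], n, t):.2f}")
```

Values that fall well below 1.00 as the thread count grows point to runs worth inspecting with the application analyses.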
26
For More Information on PPW
  Visit the PPW website
    http://ppw.hcs.ufl.edu
  The website has
    Overview of the tool
    Links to the detailed online/printable user manual
    Downloadable source code for the entire tool
    Workstation GUI installers
      Windows installer
      Linux packages
    Publications covering PPW and related research
27