
Performance Analysis with Parallel Performance Wizard

Prashanth Prakash, Research Assistant
Dr. Vikas Aggarwal, Research Scientist
Vrishali Hajare, Research Assistant
Professor Alan D. George, Principal Investigator

HCS Research Laboratory
University of Florida

Outline

Introduction talk (~20 minutes)
Hands-on
  PPW basics
  Performance data collection
  Performance analysis
  Automatic analysis

Feel free to ask questions during the talk or the hands-on

Parallel Performance Analysis

The need for performance analysis
  High-performance computing has performance as an explicit, fundamental goal
  I just got my parallel program working, and…
    My program does NOT yield the expected performance
    Why is this?
    How do I fix my program?

The challenge of performance analysis
  Understanding the performance of sequential applications can be challenging
  The complexity of parallel computing makes program performance even harder to understand without performance analysis tools

Performance Analysis Approaches

Three general performance analysis approaches

Analytical modeling
  Mostly predictive methods
  Can also be used in conjunction with experimental performance measurement
  Pros: easy to use, fast, can be performed without running the program
  Cons: usually not very accurate

Simulation
  Pros: allows performance estimation of a program on various system architectures
  Cons: slow, not generally applicable for regular UPC/SHMEM users

Experimental performance measurement
  Strategy used by most modern performance analysis tools
  Uses actual event measurement to perform analysis
  Pros: most accurate
  Cons: can be time-consuming (iterative tuning process)

Role of a Performance Analysis Tool

[Figure: Original Application -> Runtime Performance Data Gathering -> Data Processing and Analysis -> Data and Result Presentation -> Optimized Application]

Performance Analysis Stages

Instrumentation: insertion of code to facilitate performance measurement
Measurement: collection of performance data at runtime
Analysis: examination and processing of performance data to find and potentially resolve bottlenecks
Presentation: display of analyzed data to the tool user
Optimization: modifying the application to remove performance problems

Instrumentation Techniques

Runtime/compiler instrumentation
  Provides the most detailed information about the user’s program
  Requires vendor cooperation (modifications to compiler/runtime)

Source instrumentation
  Directly modifies the user’s source code
  Can provide much information, but may interfere with compiler optimizations

Interposition (“wrapper libraries”)
  No recompilation needed, only relinking
  Only provides information about library calls
  Can be difficult to get source-level information
  Relies on alternate function entry points or dynamic linker hacks

Binary instrumentation
  Most of the benefits of source instrumentation without the need for recompilation
  Can be difficult to get source-level information
  Highly platform-specific; existing toolkits lack support for some platforms (e.g., Cray)
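The interposition idea can be sketched in a few lines. This is only a Python analog under stated assumptions, not how PPW or any UPC/SHMEM wrapper library is implemented: the `interpose` helper, the `call_stats` store, and the `library_get` function are all invented for illustration. The wrapper replaces a library entry point, so the caller's code stays unchanged, and only calls that pass through the wrapped function are observed.

```python
import time
from collections import defaultdict

# Hypothetical in-memory store: function name -> [call count, total seconds].
call_stats = defaultdict(lambda: [0, 0.0])

def interpose(namespace, name):
    """Replace namespace[name] with a wrapper that counts and times each call."""
    original = namespace[name]

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            stats = call_stats[name]
            stats[0] += 1
            stats[1] += time.perf_counter() - start

    namespace[name] = wrapper
    return original

# Stand-in for a communication library; `library_get` is an invented name.
library = {"library_get": lambda n: list(range(n))}

interpose(library, "library_get")
result = library["library_get"](3)    # the caller's code is unchanged
result2 = library["library_get"](5)
```

Note that the wrapper knows the function name and its arguments but nothing about the call site, which mirrors the difficulty of recovering source-level information with this technique.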

Measurement Techniques

Profiling
  Records statistical information about execution time and/or hardware counter values (PAPI)
  Relates information to basic blocks (functions, upc_forall loops) in source code
  Important concept: inclusive vs. exclusive time (total vs. self)

Tracing
  Records a full log of when events happen at runtime and how long they take
  Gives very complete information about what happened at runtime
  Requires much more storage space than profiling!

Sampling
  Special low-overhead mode of profiling that attributes performance information via indirect measurement (samples)
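The difference between inclusive and exclusive time is easiest to see on a small call tree. The sketch below is illustrative only: the `Region` class, the region names, and all timings are invented. Inclusive time covers a region from entry to exit, including everything it calls, while exclusive (self) time subtracts the time spent in its direct children.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    inclusive: float                      # entry-to-exit time, in seconds
    children: list = field(default_factory=list)

    def exclusive(self) -> float:
        """Self time: inclusive time minus time spent in direct children."""
        return self.inclusive - sum(c.inclusive for c in self.children)

# Invented timings, for illustration only.
remote_get = Region("remote_get", 4.0)
compute = Region("compute", 3.0)
main = Region("main", 10.0, [remote_get, compute])

exclusive_times = {r.name: r.exclusive() for r in (main, remote_get, compute)}
```

Here `main` has 10.0 s of inclusive time but only 3.0 s of exclusive time, so sorting by exclusive time points at `remote_get` rather than `main` as the place where time is actually spent.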

Parallel Performance Wizard (PPW)

Performance analysis tool developed in the HCS Lab here at UF
Designed for partitioned global-address-space (PGAS) programming models (UPC and SHMEM)
Also supports MPI; other support in the works

Features
  Uses experimental measurement approach
  Provides profiling and tracing support
  Has numerous visualizations and advanced automated analysis

Overarching design goals
  Be user-friendly
  Enhance productivity
  Aim for portability

PPW Hands-on…


Hands-on

Boot the liveDVD in a VM or directly on hardware

Initial setup
  Export the PATH variable to include recent releases of PPW and UPC:
    export PATH=/usr/local/packages/ppw-2.6.2/bin/:/usr/local/packages/bupc-2.12.1/bin/:$PATH
  All applications we use today are in this directory:
    cd /home/livetau/workshop-point/UPC_apps

You can download these slides from the links below (the following slides contain the necessary commands and will come in handy):
  http://hcs.ufl.edu/~prakash/pgas/PPW_Tutorial.ppt
  http://hcs.ufl.edu/~prakash/pgas/PPW_Tutorial.pdf

Programming in UPC (bupc)

Compiling a UPC program
  upcc hello.c -o hello

Execution
  upcrun -n 4 hello

Using PPW in a Nutshell

Recompile application (Instrumentation)
  Use ppwupcc instead of upcc
  ppwshmemcc (for SHMEM) and ppwmpicc (for MPI)

Run application (Measurement)
  ppwrun <ppwrun options> <command to execute parallel application>

View performance data (Analysis + Presentation)
  ppw file.par

Change code (Optimization), recompile, repeat

PPW (for UPC) in a Nutshell

Recompile application (Instrumentation)
  ppwupcc CAMEL_upc.c -o camel

Run application (Measurement)
  ppwrun --output=file.par upcrun -n 4 camel abcd1234

View performance data (Analysis + Presentation)
  ppw file.par

Change code (Optimization), recompile, repeat

Note: PPW should be built with --with-upc, and Berkeley UPC should be built with --with-multiconf=+opt_inst

PPW Useful Options

Tracking user function entry and exit
  Pass --inst-functions to ppwupcc

Communication matrix
  Pass --comm-stats to ppwrun

Just open the .par file using ppw to find all the data
  ppw file.par

Source archive (.sar file)
  Required during execution
  Retain the .sar file in the same directory as the executable

NPB 2.4

Compiling
  cd NPB2.4/FT
  make CLASS=X NP=N
  where X can be S, A, B, or C (preferably use S or A)

Execution
  Same as before

NPB2.4 is developed and maintained by George Washington University (upc.gwu.edu)

Tracing

Compilation is the same as before, using ppwupcc

Pass the --trace option to ppwrun to request tracing
  ppwrun --trace --output=a.par upcrun -n 4 ft.A.4

Convert to slog2 using ppw (or par2slog2)
  File -> Export -> <choose slog2>

Use jumpshot to view the trace
  jumpshot ft.slog2

Export: Convert to Other Popular Formats

A .par file can be exported to several popular performance data formats. Supported formats include:
  TAU profile
  CUBE profile
  OTF trace file (Vampir)
  SLOG-2 (Jumpshot)

Case Study: Analyzing FT of NPB2.4

NPB2.4 FT benchmark (class=A, np=4) executed on an InfiniBand (IB) cluster with 1 thread per node

You can download the par file and slog2 file at
  http://hcs.ufl.edu/~prakash/pgas/ftA4.par
  http://hcs.ufl.edu/~prakash/pgas/ftA4.slog2

Case Study: FT

Identify the bottleneck
  Sort by total time and look for bottlenecks
    upc_memget at ft.c:1950
  This cannot be confirmed from the profile alone, so take a look at the trace
  Observe the trace output and the behavior of the code section from ft.c:1943 to ft.c:1953
    Serialization of upc_memget, which is unnecessary in this case

Case Study: FT

How to fix?
  Use bupc_memget_async, a Berkeley UPC extension for asynchronous memget
    http://upc.lbl.gov/publications/upc_memcpy.pdf

Did it improve performance?
  Download the par file generated after the changes to ft.c
    http://hcs.ufl.edu/~prakash/pgas/ftA4_m.par
  Observe the changes in profile data
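Why replacing serialized blocking gets with asynchronous ones can help is shown by a back-of-the-envelope model. This is an idealized sketch, not a measurement of FT: it assumes the transfers are independent and that the network can service them concurrently, and the latencies, the issue overhead, and both function names are invented for illustration.

```python
def serialized_time(latencies):
    """Blocking gets issued one after another: the waits add up."""
    return sum(latencies)

def overlapped_time(latencies, issue_overhead=0.0):
    """Gets issued back to back and then waited on together (idealized):
    total cost is the per-get issue overhead plus the longest transfer."""
    return issue_overhead * len(latencies) + max(latencies, default=0.0)

latencies_ms = [2.0, 2.0, 2.0]            # invented per-get latencies
blocking_ms = serialized_time(latencies_ms)
async_ms = overlapped_time(latencies_ms, issue_overhead=0.1)
```

With these made-up numbers the blocking version waits 6.0 ms and the overlapped version about 2.3 ms. Real gains depend on the network and the runtime, which is why the profile should be compared before and after the change.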

Automatic Analysis

Why do we need automatic analysis?
  The increasing size of performance data sets makes it hard to identify and resolve bottlenecks

What will automatic analyses do?
  Automatically detect, diagnose, and possibly resolve bottlenecks

Automatic Analysis

Application analyses
  Deal with a single run and include:
    Bottleneck detection
    Cause analysis
    High-level analysis

Experiment set analyses
  Compare the performance of related runs
    Scalability analysis
    Revision analysis

Application Analysis

Bottleneck detection
  Examines profile data and identifies the bottleneck profiling entries
  Uses a baseline comparison and deviation evaluation method

Cause analysis
  Identifies the reason for bottlenecks; requires trace data to complete the analysis

High-level analysis
  Mainly used to detect bottleneck nodes that, when optimized, could improve application performance for a single experiment
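One possible reading of "baseline comparison and deviation evaluation" is sketched below. It is not PPW's actual algorithm: the `find_bottlenecks` function, both thresholds, and the profile data are assumptions made up for illustration. An entry is flagged either when its average time exceeds a baseline share of the run, or when its slowest thread deviates strongly from the average.

```python
from statistics import mean

def find_bottlenecks(profile, total_time,
                     baseline_fraction=0.10, deviation_limit=0.25):
    """Flag profile entries by baseline comparison or per-thread deviation.

    profile maps an entry name to a list of per-thread times in seconds.
    Both thresholds are hypothetical defaults.
    """
    flagged = []
    for entry, per_thread in profile.items():
        avg = mean(per_thread)
        # Baseline comparison: does the entry take a large share of the run?
        over_baseline = avg > baseline_fraction * total_time
        # Deviation evaluation: how far is the slowest thread from the average?
        deviation = (max(per_thread) - avg) / avg if avg > 0 else 0.0
        if over_baseline or deviation > deviation_limit:
            flagged.append(entry)
    return flagged

# Invented per-thread times for a 4-thread run lasting 10 seconds.
profile = {
    "remote_get": [4.0, 4.2, 3.9, 4.1],   # large share of total time
    "barrier":    [0.2, 0.2, 0.2, 0.9],   # one thread waits much longer
    "compute":    [0.5, 0.5, 0.5, 0.5],   # small and balanced
}
flagged = find_bottlenecks(profile, total_time=10.0)
```

In this toy data, `remote_get` is flagged for its share of the run and `barrier` for load imbalance, while `compute` passes both checks.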

Application Analysis

Analysis -> Run Application Analysis

Experiment Set Analyses

Scalability analysis
  Plots the scaling factor (relative speedup) values against the ideal scaling value
  A scaling factor of 1.00 implies perfect scalability
  Analysis -> Run Scalability Analysis

Revision analysis
  Compares and evaluates different versions of the same application
  Profile Charts -> Total Times by Function
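A scaling factor for which 1.00 means perfect scalability can be computed as the measured relative speedup divided by the ideal speedup. The sketch below assumes that definition; the exact formula PPW uses may differ, and the function name and run times are invented.

```python
def scaling_factor(base_threads, base_time, threads, run_time):
    """Relative speedup over the baseline run divided by the ideal speedup."""
    speedup = base_time / run_time        # measured relative speedup
    ideal = threads / base_threads        # speedup if scaling were perfect
    return speedup / ideal

# Invented (thread count, run time in seconds) pairs; first run is the baseline.
runs = [(1, 100.0), (2, 52.0), (4, 30.0)]
base_threads, base_time = runs[0]
factors = {n: scaling_factor(base_threads, base_time, n, t) for n, t in runs}
```

With these numbers the factor drops from 1.00 at one thread to roughly 0.96 at two threads and 0.83 at four, which is the kind of falling curve a scalability plot would show against the ideal line at 1.00.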

For More Information on PPW

Visit the PPW website
  http://ppw.hcs.ufl.edu

The website has
  Overview of the tool
  Links to detailed online/printable user manual
  Downloadable source code for the entire tool
  Workstation GUI installers
    Windows installer
    Linux packages
  Publications covering PPW and related research
