tau for accelerating ai applications...tau_ebs_source time allows using papi hardware counters for...

34
Sameer Shende University of Oregon Join the Conversation #OpenPOWERSummit TAU for Accelerating AI Applications

Upload: others

Post on 02-Feb-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

SameerShendeUniversityofOregon

JointheConversation#OpenPOWERSummit

TAU for Accelerating AI Applications

Page 2: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

TAU Performance System® http://tau.uoregon.edu

•  Tuning and Analysis Utilities (20+ year project)

•  Comprehensive performance profiling and tracing •  Integrated, scalable, flexible, portable •  Targets all parallel programming/execution paradigms

•  Integrated performance toolkit •  Instrumentation, measurement, analysis, visualization •  Widely-ported performance profiling / tracing system •  Performance data management and data mining •  Open source (BSD-style license)

•  Integrates with application frameworks

2

Page 3: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

TAU Performance System®

Page 4: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Performance Engineering using TAU •  How much time is spent in each application routine and outer loops? Within loops, what

is the contribution of each statement? What is the time spent in OpenMP loops? Kernels on the GPU?

•  How many instructions are executed in these code regions? Floating point, Level 1 and 2 data cache misses, hits, branches taken?

•  What is the memory usage of the code? When and where is memory allocated/de-allocated? Are there any memory leaks? What is the memory footprint of the application? What is the memory high water mark?

•  How much energy does the application use in Joules? What is the peak power usage? •  What are the I/O characteristics of the code? What is the peak read and write

bandwidth of individual calls, total volume? •  How does the application scale? What is the efficiency, runtime breakdown of

performance across different core counts?

Page 5: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Instrumentation •  Sourceinstrumentationusingapreprocessor

•  Addtimerstart/stopcallsinacopyofthesourcecode.•  UseProgramDatabaseToolkit(PDT)forparsingsourcecode.•  RequiresrecompilingthecodeusingTAUshellscripts(tau_cc.sh,tau_f90.sh)•  Selectiveinstrumentation(filterfile)canreduceruntimeoverheadandnarrowinstrumentationfocus.

• Compiler-basedinstrumentation•  Usesystemcompilertoaddaspecialflagtoinserthooksatroutineentry/exit.•  RequiresrecompilingusingTAUcompilerscripts(tau_cc.sh,tau_f90.sh…)

• RuntimepreloadingofTAU’sDynamicSharedObject(DSO)•  Noneedtorecompilecode!Usetau_exec./appwithoptions.

Page 6: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Profiling and Tracing

•  Tracing shows you when the events take place on a timeline

Profiling Tracing

•  Profiling shows you how much (total) time was spent in each routine

•  Profilingandtracing

Profilingshowsyouhowmuch(total)timewasspentineachroutineTracingshowsyouwhentheeventstakeplaceonatimeline

Page 7: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Inclusive vs. Exclusive values ■  Inclusive

■  Informationofallsub-elementsaggregatedintosinglevalue

■  Exclusive■  Informationcannotbesubdividedfurther

Inclusive Exclusive

int foo() { int a; a = 1 + 1; bar(); a = a + 1; return a; }

Page 8: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Performance Data Measurement DirectviaProbes IndirectviaSampling

•  Exact measurement •  Fine-grain control •  Calls inserted into

code

•  No code modification •  Minimal effort •  Relies on debug

symbols (-g)

Call START(‘potential’) // code Call STOP(‘potential’)

Page 9: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Sampling

•  Runningprogramisperiodicallyinterruptedtotakemeasurement

•  Timerinterrupt,OSsignal,orHWCoverflow•  Serviceroutineexaminesreturn-addressstack•  Addressesaremappedtoroutinesusingsymboltableinformation

•  Statisticalinferenceofprogrambehavior•  Notverydetailedinformationonhighlyvolatilemetrics•  Requireslong-runningapplications

•  Workswithunmodifiedexecutables

Time main foo(0) foo(1) foo(2) int main()

{ int i; for (i=0; i < 3; i++) foo(i); return 0; } void foo(int i) { if (i > 0) foo(i – 1); }

Measurement

t9 t7 t6 t5 t4 t1 t2 t3 t8

Page 10: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Instrumentation

• Measurementcodeisinsertedsuchthateveryeventofinterestiscaptureddirectly

•  Canbedoneinvariousways• Advantage:

•  Muchmoredetailedinformation

• Disadvantage:•  Processingofsource-code/executablenecessary

•  Largerelativeoverheadsforsmallfunctions

Time Measurement int main()

{ int i; for (i=0; i < 3; i++) foo(i); return 0; } void foo(int i) { if (i > 0) foo(i – 1); }

Time

t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14

main foo(0) foo(1) foo(2)

Start(“main”);

Stop (“main”);

Start(“foo”);

Stop (“foo”);

Page 11: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Using TAU’s Runtime Preloading Tool: tau_exec

• Preload a wrapper that intercepts the runtime system call and substitutes with another • MPI, CUDA, OpenACC, pthread, OpenCL

• OpenMP

• POSIX I/O

• Memory allocation/deallocation routines

• Wrapper library for an external package

• No modification to the binary executable! • Enable other TAU options (communication matrix, OTF2, event-

based sampling) • For Python: tau_python replaces python. Similar to tau_exec.

Page 12: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

TAU Execution Command (tau_exec) • Uninstrumented execution

•  % ./a.out • Track GPU operations

•  % tau_exec –cupti ./a.out •  % tau_exec –cupti -um ./a.out (for Unified Memory) •  % tau_exec –opencl ./a.out •  % tau_exec –openacc ./a.out

• Track MPI performance •  % tau_exec ./a.out

• Track OpenMP, and MPI performance (MPI enabled by default) •  % export TAU_OMPT_SUPPORT_LEVEL=full;

% export TAU_OMPT_RESOLVE_ADDRESS_EAGERLY=1 •  % tau_exec –T ompt,tr6,mpi –ompt ./a.out

• Track I/O operations •  % tau_exec –io –T pthread ./a.out

• Use event based sampling (compile with –g) •  % tau_exec –ebs ./a.out •  Also –ebs_source=<PAPI_COUNTER> -ebs_period=<overflow_count>

12

Page 13: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Configuration tags for tau_exec % ./configure –pdt=<dir> -mpi –papi=<dir>; make install Creates in $TAU: Makefile.tau-papi-mpi-pdt(Configuration parameters in stub makefile) shared-papi-mpi-pdt/libTAU.so % ./configure –pdt=<dir> -mpi; make install creates Makefile.tau-mpi-pdt shared-mpi-pdt/libTAU.so To explicitly choose preloading of shared-<options>/libTAU.so change: % mpirun -np 256 ./a.out to % mpirun -np 256 tau_exec –T <comma_separated_options> ./a.out % mpirun -np 256 tau_exec –T papi,mpi,pdt ./a.out Preloads $TAU/shared-papi-mpi-pdt/libTAU.so % mpirun -np 256 tau_exec –T papi ./a.out Preloads $TAU/shared-papi-mpi-pdt/libTAU.so by matching. % aprun –n 256 tau_exec –T papi,mpi,pdt –s ./a.out Does not execute the program. Just displays the library that it will preload if executed without the –s option. NOTE: -mpi configuration is selected by default. Use –T serial for Sequential programs.

Page 14: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf Profile Browser

% paraprof

Page 15: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf Profile Browser

Page 16: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf 3D Window

Page 17: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Tensorflow mnist example with tau_python

Page 18: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)
Page 19: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf: Thread Statistics Window

Page 20: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

TAU: Tracking Data Transfers

Page 21: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Thread Profile on Thread 341

Page 22: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf: Thread Statistics Window Thread 0

Page 23: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf 3D Window for Caffe: Googlenet

Page 24: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf Manager Window

Page 25: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

ParaProf Node Window

Page 26: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Thread Statistics Window: Caffe Googlenet

Page 27: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

TAU’s Static Analysis System: Program Database Toolkit (PDT)

Application/ Library

C / C++parser

Fortran parserF77/90/95

C / C++IL analyzer

FortranIL analyzer

ProgramDatabase

Files

IL IL

DUCTAPETAU �

instrumentorAutomatic sourceinstrumentation

.

.

.

Page 28: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

tau_instrumentor

Parsedprogram

Instrumentationspecificationfile

Instrumentedcopyofsource

TAU source analyzer

Applicationsource

PDT: automatic source instrumentation

Page 29: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

EnvironmentVariable Default Description

TAU_TRACE 0 Settingto1turnsontracing

TAU_CALLPATH 0 Settingto1turnsoncallpathprofiling

TAU_TRACK_MEMORY_FOOTPRINT 0 Settingto1turnsontrackingmemoryusagebysamplingperiodicallytheresidentsetsizeandhighwatermarkofmemoryusage

TAU_TRACK_POWER 0 Trackspowerusagebysamplingperiodically.

TAU_CALLPATH_DEPTH 2 Specifiesdepthofcallpath.Settingto0generatesnocallpathorroutineinformation,settingto1generatesflatprofileandcontexteventshavejustparentinformation(e.g.,HeapEntry:foo)

TAU_SAMPLING 1 Settingto1enablesevent-basedsampling.

TAU_TRACK_SIGNALS 0 Settingto1generatedebuggingcallstackinfowhenaprogramcrashes

TAU_COMM_MATRIX 0 Settingto1generatescommunicationmatrixdisplayusingcontextevents

TAU_THROTTLE 1 Settingto0turnsoffthrottling.Throttlesinstrumentationinlightweightroutinesthatarecalledfrequently

TAU_THROTTLE_NUMCALLS 100000 Specifiesthenumberofcallsbeforetestingforthrottling

TAU_THROTTLE_PERCALL 10 Specifiesvalueinmicroseconds.Throttlearoutineifitiscalledover100000timesandtakeslessthan10usecofinclusivetimepercall

TAU_CALLSITE 0 Settingto1enablescallsiteprofilingthatshowswhereaninstrumentedfunctionwascalled.Alsocompatiblewithtracing.

TAU_PROFILE_FORMAT Profile Settingto“merged”generatesasinglefile.“snapshot”generatesxmlformat

TAU_METRICS TIME Settingtoacommaseparatedlistgeneratesothermetrics.(e.g.,ENERGY,TIME,P_VIRTUAL_TIME,PAPI_FP_INS,PAPI_NATIVE_<event>:<subevent>)

Runtime Environment Variables

Page 30: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

EnvironmentVariable Default Description

TAU_TRACE 0 Settingto1turnsontracing

TAU_TRACE_FORMAT Default Settingto“otf2”turnsonTAU’snativeOTF2tracegeneration(configurewith–otf=download)

TAU_EBS_UNWIND 0 Settingto1turnsonunwindingthecallstackduringsampling(usewithtau_exec–ebsorTAU_SAMPLING=1)

TAU_EBS_RESOLUTION line Settingto“function”or“file”changesthesamplingresolutiontofunctionorfilelevelrespectively.

TAU_TRACK_LOAD 0 Settingto1trackssystemloadonthenode

TAU_SELECT_FILE Default Settingtoafilename,enablesselectiveinstrumentationbasedonexclude/includelistsspecifiedinthefile.

TAU_OMPT_SUPPORT_LEVEL basic Settingto“full”improvesresolutionofOMPTTR6regionsonthreads1..N-1.Also,“lowoverhead”optionisavailable.

TAU_OMPT_RESOLVE_ADDRESS_EAGERLY 0 Settingto1isnecessaryforeventbasedsamplingtoresolveaddresseswithOMPTTR6(-ompt=download-tr6)

Runtime Environment Variables

Page 31: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

EnvironmentVariable Default Description

TAU_TRACK_MEMORY_LEAKS 0 Tracksallocatesthatwerenotde-allocated(needs–optMemDbgortau_exec–memory)

TAU_EBS_SOURCE TIME AllowsusingPAPIhardwarecountersforperiodicinterruptsforEBS(e.g.,TAU_EBS_SOURCE=PAPI_TOT_INSwhenTAU_SAMPLING=1)

TAU_EBS_PERIOD 100000 Specifiestheoverflowcountforinterrupts

TAU_MEMDBG_ALLOC_MIN/MAX 0 Bytesizeminimumandmaximumsubjecttoboundschecking(usedwithTAU_MEMDBG_PROTECT_*)

TAU_MEMDBG_OVERHEAD 0 SpecifiesthenumberofbytesforTAU’smemoryoverheadformemorydebugging.

TAU_MEMDBG_PROTECT_BELOW/ABOVE 0 Settingto1enablestrackingruntimeboundscheckingbeloworabovethearraybounds(requires–optMemDbgwhilebuildingortau_exec–memory)

TAU_MEMDBG_ZERO_MALLOC 0 Settingto1enablestrackingzerobyteallocationsasinvalidmemoryallocations.

TAU_MEMDBG_PROTECT_FREE 0 Settingto1detectsinvalidaccessestodeallocatedmemorythatshouldnotbereferenceduntilitisreallocated(requires–optMemDbgortau_exec–memory)

TAU_MEMDBG_ATTEMPT_CONTINUE 0 Settingto1allowsTAUtorecordandcontinueexecutionwhenamemoryerroroccursatruntime.

TAU_MEMDBG_FILL_GAP Undefined Initialvalueforgapbytes

TAU_MEMDBG_ALINGMENT Sizeof(int) Bytealignmentformemoryallocations

TAU_EVENT_THRESHOLD 0.5 Defineathresholdvalue(e.g.,.25is25%)totriggermarkereventsformin/max

Runtime Environment Variables (contd.)

Page 32: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

DownloadTAUfromU.Oregon

http://www.hpclinux.com[OVAfile]http://tau.uoregon.eduformoreinformation

Freedownload,opensource,BSDlicense

Page 33: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Performance Research Lab, UO, Eugene

www.uoregon.edu

Page 34: TAU for Accelerating AI Applications...TAU_EBS_SOURCE TIME Allows using PAPI hardware counters for periodic interrupts for EBS (e.g., TAU_EBS_SOURCE=PAPI_TOT_INS when TAU_SAMPLING=1)

Support Acknowledgments •  US Department of Energy (DOE)

•  ANL •  Office of Science contracts, ECP •  SciDAC, LBL contracts •  LLNL-LANL-SNL ASC/NNSA contract •  Battelle, PNNL and ORNL contract

•  Department of Defense (DoD) •  PETTT, HPCMP

•  National Science Foundation (NSF) •  SI2-SSI, Glassbox

•  NASA

•  CEA

•  IBM

•  Partners: • The Ohio State University • ParaTools, Inc. • University of Tennessee, Knoxville • T.U. Dresden, GWT • Jülich Supercomputing Center