Performance Visualizations using XML Representations
Presented by Kristof Beyls, Yijun Yu, Erik H. D’Hollander
Overview
1. Background: program optimization research
2. XML representations
3. Visualizations
4. Conclusion
Program optimization research
What slows down program execution? We need to pinpoint the performance bottlenecks by analyzing the program.
How to improve the performance? By program transformations, based on the pinpointed bottlenecks.
Who transforms the program?
1. Compiler. Advantage: automatic optimization. Disadvantage: it is sometimes hard to understand what the program does.
2. Programmer. Advantage: has a good understanding of program functionality. Disadvantage: requires human effort, which raises the question of how best to present performance bottlenecks.
How to construct a research infrastructure that supports all of the above in a common framework? (XML)
Two main performance factors
Parallelism: performing computation in parallel reduces execution time
Data locality: fetching data from fast CPU caches reduces execution time
Why XML representations?
Extensible and versatile
Standard and interoperable
Language independent
XML namespace (tool): what it represents
1. ast (yaxx): abstract syntax tree
2. par (oc): identified parallel or sequential loops
3. trace (isv, cv): execution trace of memory instructions
4. hotspot (isv, cv): performance bottleneck locations
5. isdg (isv): iteration space dependence graph
6. rdv (distv): a reuse distance vector
Tools: yaxx = YACC extension to XML; oc = Omega calculator; isv = iteration space visualizer; cv = cache (trace) visualizer; distv = (cache reuse) distance visualizer
1. AST (Abstract Syntax Tree) (ast)
XML is a good representation for an AST because of its hierarchical nature.
The ast namespace captures the syntactical information of a program.
We can construct the AST from source code through YAXX and regenerate the source code through XSLT.
<ast:DO_Loop>
  <var name="I"/>
  <lb><const value="1"/></lb>
  <ub><const value="10"/></ub>
  <st><const value="1"/></st>
  <body>…</body>
</ast:DO_Loop>
This element represents the loop:
DO I=1,10,1
  ……
ENDDO
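As a rough illustration of the round trip from the ast element back to source text, the sketch below regenerates the DO statement with Python's standard xml.etree.ElementTree. It stands in for the XSLT step named on the slide and is not the authors' tooling. The namespace URI urn:example:ast, the empty body element and the function name unparse_do_loop are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The namespace URI is hypothetical; the slides do not state one.
AST_XML = """
<ast:DO_Loop xmlns:ast="urn:example:ast">
  <var name="I"/>
  <lb><const value="1"/></lb>
  <ub><const value="10"/></ub>
  <st><const value="1"/></st>
  <body/>
</ast:DO_Loop>
"""

def unparse_do_loop(loop):
    """Regenerate a Fortran-style DO loop header from an ast:DO_Loop element."""
    var = loop.find("var").get("name")
    # Lower bound, upper bound and stride are each wrapped around a <const>.
    bounds = [loop.find(tag).find("const").get("value") for tag in ("lb", "ub", "st")]
    header = "DO " + var + "=" + ",".join(bounds)
    # The loop body is elided in the slide, so it is elided here as well.
    return header + "\n  ...\nENDDO"

print(unparse_do_loop(ET.fromstring(AST_XML)))
```

Only constant bounds are handled; a fuller unparser would recurse into arbitrary expressions and into the loop body.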
2. Parallel loops (par)
Identified parallel loops are annotated with a <par:true/> element in the "par" namespace.
<ast:DO_Loop>
  <par:true/>
  …
</ast:DO_Loop>
In this way, semantic and syntactic information are kept in orthogonal namespaces. Syntax-based tools (e.g. the unparser) can either ignore the annotation or translate it into directive comments, e.g. Fortran C$DOALL.
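The separation into namespaces can be sketched as follows: a tool that understands the par namespace looks for <par:true/> and emits a directive, while a purely syntactic tool never queries that namespace. This is a minimal sketch, assuming hypothetical namespace URIs, a hypothetical <program> wrapper element, and a <var> child that names each loop.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are hypothetical; the slides do not state them.
NS = {"ast": "urn:example:ast", "par": "urn:example:par"}

PROGRAM_XML = """
<program xmlns:ast="urn:example:ast" xmlns:par="urn:example:par">
  <ast:DO_Loop><par:true/><var name="I"/></ast:DO_Loop>
  <ast:DO_Loop><var name="J"/></ast:DO_Loop>
</program>
"""

def loop_directives(xml_text):
    """Return (loop variable, directive) pairs; directive is None for sequential loops."""
    root = ET.fromstring(xml_text)
    result = []
    for loop in root.iterfind("ast:DO_Loop", NS):
        # A loop is parallel when it carries a <par:true/> child.
        is_parallel = loop.find("par:true", NS) is not None
        name = loop.find("var").get("name")
        result.append((name, "C$DOALL" if is_parallel else None))
    return result

print(loop_directives(PROGRAM_XML))
```

Dropping the par lookup leaves a tool that still processes both loops, which is the sense in which syntax-based tools can ignore the annotation.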
XFPT: an extended optimizing compiler
3. Traces (trace)
A trace records a sequence of memory address accesses:
<trace:seq>
  <access addr="0x00ffe8" bytes="8"/>
  <access addr="0x00fff0" bytes="16"/>
  ……
</trace:seq>
A trace alone can be used to identify runtime data dependences, and to identify cache misses through a cache simulator.
By associating each address with the array reference number or loop iteration index on the program's AST, the trace can also be used for advanced loop dependence analysis and cache reuse distance analysis:
<trace:seq>
  <access addr="0x00ffe8" bytes="8" hotspot:id="1">   <!-- The 1st reference -->
    <do_loop hotspot:id="1" vector="1 2"/>   <!-- The 1st DO loop: (I,J)=(1,2) -->
    <array hotspot:id="1" vector="1"/>   <!-- Reference to array element X(1) -->
  </access>
  ……
</trace:seq>
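To make the "cache misses through a cache simulator" step concrete, here is a minimal sketch that feeds the two accesses of the first trace into a direct-mapped cache model. The cache geometry (32-byte lines, 4 sets), the namespace URI and the function name simulate are assumptions for illustration and do not describe the simulator used by the authors.

```python
import xml.etree.ElementTree as ET

# The namespace URI is hypothetical; the accesses are the two shown on the slide.
TRACE_XML = """
<trace:seq xmlns:trace="urn:example:trace">
  <access addr="0x00ffe8" bytes="8"/>
  <access addr="0x00fff0" bytes="16"/>
</trace:seq>
"""

def simulate(trace_xml, line_bytes=32, num_sets=4):
    """Count hits and misses of a direct-mapped cache (assumed geometry)."""
    resident = [None] * num_sets  # cache line number currently held by each set
    hits = misses = 0
    for access in ET.fromstring(trace_xml).iter("access"):
        addr = int(access.get("addr"), 16)
        size = int(access.get("bytes"))
        first = addr // line_bytes
        last = (addr + size - 1) // line_bytes
        # An access may span several cache lines; touch each of them.
        for line in range(first, last + 1):
            index = line % num_sets
            if resident[index] == line:
                hits += 1
            else:
                misses += 1
                resident[index] = line
    return hits, misses

print(simulate(TRACE_XML))
```

Both accesses fall into the same 32-byte line, so the first is a cold miss and the second a hit under these assumptions.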
4. Hotspots (hotspot)
Hot spots are the identified bottlenecks of the program. Two types are used:
Bottleneck loops: tell which loops are the performance bottlenecks
Bottleneck references: tell which references are the performance bottlenecks
<hotspot:list>
  <do_loop id="1">
    <index vector="I J"/>
    <start lineno="3" colno="1"/>
    <end lineno="7" colno="12"/>
  </do_loop>
  ……
  <array id="2" name="X">
    <dim><lb>1</lb><ub>10</ub></dim>
  </array>
  ……
  <reference id="1" type="R">
    <start lineno="5" colno="9"/>
    <end lineno="5" colno="14"/>
  </reference>
  ……
</hotspot:list>
1 DIM T(3), X(10)
2 REAL S, X
3 DO I = 1, 10
4 DO J = 1, 10
5 S = S + X(I)*J
6 ENDDO
7 ENDDO
8 …
Performance Visualizations
XML plays an important role in gluing the visualizers to an optimizing compiler:
1. Loop dependence visualization
2. Reuse distance visualization
3. Cache behavior visualization
Visualization 1: ISDG (iteration space dependence graph)
An iteration is an instance of the loop body statements. An iteration space is the set of integer vector values of the DO loop index variables for the traversed iterations.
A loop-carried dependence is a dependence caused by two references R1 and R2 that access the same memory address, where:
1. At least one of R1, R2 is a write
2. R1 belongs to loop iteration (i1, j1) and R2 belongs to loop iteration (i2, j2), with (i1, j1) ≠ (i2, j2)
An ISDG is a graph with nodes representing the iteration space and edges representing loop-carried dependences.
DO i=1,5
  DO j=1,5
    A(i,j) = A(i,j+1)
  ENDDO
ENDDO
[Figure: ISDG of this loop nest, with axes i and j each running from 1 to 5]
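The ISDG of the example loop nest can be derived by brute force: replay the iterations in execution order, record which iteration reads or writes each array element, and connect two different iterations whenever they touch the same element and at least one of them writes it. The sketch below does this for A(i,j) = A(i,j+1); the function name isdg_edges and the edge representation are choices made for this illustration.

```python
from collections import defaultdict

def isdg_edges(n=5):
    """Return loop-carried dependence edges (earlier iteration, later iteration)."""
    # element of A -> list of (iteration, kind) in execution order
    accesses = defaultdict(list)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            accesses[(i, j + 1)].append(((i, j), "R"))  # read of A(i,j+1)
            accesses[(i, j)].append(((i, j), "W"))      # write of A(i,j)
    edges = set()
    for seq in accesses.values():
        for a in range(len(seq)):
            for b in range(a + 1, len(seq)):
                (it1, kind1), (it2, kind2) = seq[a], seq[b]
                # Different iterations, same element, at least one write.
                if it1 != it2 and "W" in (kind1, kind2):
                    edges.add((it1, it2))
    return edges

edges = isdg_edges()
print(len(edges), sorted(edges)[:3])
```

Every edge runs from (i, j) to (i, j+1): iteration (i, j) reads the element that iteration (i, j+1) later overwrites. No edge crosses between different values of i, which suggests that the outer loop carries no dependence in this example.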
The WTCM CFD application
WTCM has a Computational Fluid Dynamics (CFD) simulator, which solves partial differential equations (PDEs) with a Gauss-Seidel solver.
temperature
3D geometry + 1D time
The visualized dependences
The loop transformation
A 3-D unimodular transformation was found after visualizing the 4-D loop nest, which has 177 array references at run time in each iteration. Here we use a regular shape. The transformation makes it possible to speed up the program by around N^2/6 times, where N is the diameter of the geometry.
Visualization 2: Reuse distances
Reuse distance is the amount of data accessed before a memory address is reused.
reuse distance > cache size implies a cache miss
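A small worked example may help here. The sketch below measures the reuse distance of each access as the number of distinct addresses touched strictly between it and the previous access to the same address, then predicts misses for a fully associative LRU cache. Under this counting convention the miss condition becomes distance >= capacity; other conventions shift the boundary by one. The trace, the capacity and the function names are assumptions for illustration.

```python
def reuse_distances(trace):
    """Reuse distance per access; None marks the first access to an address."""
    last_pos = {}
    distances = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            # Distinct addresses accessed since the previous use of addr.
            between = trace[last_pos[addr] + 1:pos]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_pos[addr] = pos
    return distances

def predicted_misses(trace, capacity):
    """Miss prediction for a fully associative LRU cache holding `capacity` addresses."""
    return [d is None or d >= capacity for d in reuse_distances(trace)]

trace = ["a", "b", "c", "a", "b", "b"]
print(reuse_distances(trace))      # [None, None, None, 2, 2, 0]
print(predicted_misses(trace, 2))  # [True, True, True, True, True, False]
```

The slice-based computation is quadratic in the trace length, which is fine for a demonstration but too slow for real traces.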
Execution time reduction on an Itanium processor (Spec2000 programs).
[Figure: chart of percentage execution time (0% to 100%) per program, split into calculation, other bottlenecks, and data cache misses]
Visualization 3: Cache miss traces (Tomcatv/Spec95)
White: hit
Blue: compulsory
Green: capacity
Red: conflict
4.2 Visualizing hotspots of conflict cache misses
X(I,J+1) and X(I,J) conflict if X has dimensions (512,512). The conflict is resolved by changing the dimensions to (524,524).
This technique is also known as array padding.
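The effect of padding can be checked with a little address arithmetic. In the sketch below, X is stored in column-major order as in Fortran, and each element is mapped to a cache set. The element size (8 bytes), the line size (32 bytes) and the number of sets (128) are assumptions chosen so that one column of 512 elements spans exactly one pass over all sets; the slides do not give the actual cache geometry.

```python
def set_index(i, j, rows, elem_bytes=8, line_bytes=32, num_sets=128):
    """Cache set of X(i,j) for a column-major array with `rows` rows (1-based indices)."""
    offset = ((j - 1) * rows + (i - 1)) * elem_bytes
    return (offset // line_bytes) % num_sets

for rows in (512, 524):
    same_set = set_index(1, 1, rows) == set_index(1, 2, rows)
    print(rows, "conflict" if same_set else "no conflict")
```

With 512 rows, neighbouring columns are 4096 bytes apart, a multiple of the 4096 bytes covered by all sets, so X(I,J) and X(I,J+1) always land in the same set. With 524 rows the distance is 4192 bytes and the two elements fall into different sets.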
4.2 Cache miss trace after array padding: most spatial locality is exploited and the conflict misses are resolved
On a 550 MHz Intel Pentium III (single CPU), the speedup measured with VTune is more than 50%.
Conclusion
An existing optimizing compiler FPT was extended with an extensible XML interface.
The performance factors, in particular loop parallelism and data locality, were exported from FPT.
These factors were visualized through:
Loop dependence visualizer ISV
Execution trace visualizer CacheVis
Reuse distance visualizer ReuseVis
The programmer can use the visualized feedback to improve the performance.
The End.
Any questions?
Program semantics (Software) vs. Architecture capabilities (Hardware)
Research area: Parallel computing
Program: parallelism at task, loop, and instruction levels through data dependence analysis
Architecture: multi-processors (MIMD), pipeline (SIMD), multi-threads, network of workstations (NOW, Grid computing)
Research area: Memory hierarchy
Program: temporal and spatial data locality, data layout, stack reuse distances
Architecture: cache at level 1, 2, 3, TLB, set associativity, data replacement policy
2. Major Performance factors
Parallelism: loop dependences, loop-level parallelism, instruction-level parallelism, partition load balance
Data locality: temporal locality, spatial locality, CCC (compulsory, capacity, conflict) cache misses, reuse distances
3.6 Cache parameters
To tune for different architectural cache configurations, we represent the cache parameters (cache size, cache line size and set associativity) in an XML configuration file. For example, a 2-level cache is specified as follows:
<cache:hierarchy>
  <parameters level="1">
    <size>1024</size><line>32</line><associativity>32</associativity>
  </parameters>
  <parameters level="2">
    <size>65536</size><line>32</line><associativity>1</associativity>
  </parameters>
</cache:hierarchy>
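A consumer of this configuration file might load it roughly as follows. This is a sketch only: the namespace URI, the function name load_levels and the reading of size and line as byte counts are assumptions. The number of sets is derived as size / (line size × associativity).

```python
import xml.etree.ElementTree as ET

# The namespace URI is hypothetical; the parameter values are those of the slide.
CONFIG_XML = """
<cache:hierarchy xmlns:cache="urn:example:cache">
  <parameters level="1">
    <size>1024</size><line>32</line><associativity>32</associativity>
  </parameters>
  <parameters level="2">
    <size>65536</size><line>32</line><associativity>1</associativity>
  </parameters>
</cache:hierarchy>
"""

def load_levels(xml_text):
    """Map cache level -> parameters, adding the derived number of sets."""
    levels = {}
    for params in ET.fromstring(xml_text).iter("parameters"):
        size = int(params.findtext("size"))          # assumed to be bytes
        line = int(params.findtext("line"))          # assumed to be bytes
        assoc = int(params.findtext("associativity"))
        levels[int(params.get("level"))] = {
            "size": size,
            "line": line,
            "associativity": assoc,
            "sets": size // (line * assoc),
        }
    return levels

for level, params in sorted(load_levels(CONFIG_XML).items()):
    print(level, params)
```

Under these assumptions, level 1 has a single set and is therefore fully associative, while level 2 has 2048 sets of one line each, which makes it direct-mapped.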
4.2 Visualizing data locality histogram distributed over reuse distances