Performance Models for Application Optimization
Walid Abu-Sufah
[email protected]
Visiting Scholar, University of Illinois
Associate Professor, University of Jordan
Outline
1. Objective
2. Overview
   1. Roofline model
   2. Capacity model
3. Relate roofline/capacity
4. Open Issues
5. Discussion: How could PMUs help
www.upcrc.illinois.edu2
1. Objective
Explore how a model for a target architecture could be used for application tuning (maybe in a compiler?).
2.1 Roofline Model
• For applications where off-chip memory bandwidth is the constraining resource (limit) in system performance.
• Relates processor performance to off-chip memory traffic.
• Bound and Bottleneck Model
  – good enough to understand which optimizations to try to get next level of performance
• So far, demonstrated for several HPC dwarfs and multicore systems.
Bounds
• $B_p$ = Peak Processing Bandwidth; MFLOP/sec
• $B_m$ = Peak DRAM Bandwidth; MBytes/sec
• “Operational Intensity”:
  – Average number of Floating Point Operations per Byte to DRAM, FLOPs/Byte
  – Varies by multicore design (cache org.) and dwarf
  – Characterize dwarf for a particular multicore design
Performance Model Graph
Y-axis is GFLOPs/sec
X-axis is FLOPs/Byte (i.e. Operational Intensity)
Can plot peak DRAM BW, since (GFLOPs/sec) / (FLOPs/Byte) = GBytes/sec
[Figure: “Roofline” graph bounded by $B_p$ and $B_m$]
Roofline Visual Performance Model
• “Ridge Point”: minimum Operational Intensity to get Peak Performance
• Compute Bound
• Memory Bound
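The bound sketched on these slides is commonly written as: attainable performance = min(peak processing bandwidth, operational intensity × peak DRAM bandwidth), with the ridge point at the ratio of the two peaks. A minimal Python sketch of that reading follows. The function names and the machine numbers are hypothetical, chosen only for illustration, and are not measurements of any system named in this talk.

```python
def attainable_gflops(peak_gflops, peak_gbytes_per_sec, operational_intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    memory_roof = operational_intensity * peak_gbytes_per_sec
    return min(peak_gflops, memory_roof)


def ridge_point(peak_gflops, peak_gbytes_per_sec):
    """Minimum operational intensity (FLOPs/Byte) needed to reach peak performance."""
    return peak_gflops / peak_gbytes_per_sec


# Hypothetical machine (assumed values): 16 GFLOP/s peak, 8 GB/s peak DRAM bandwidth.
B_P, B_M = 16.0, 8.0

for oi in (0.5, 2.0, 8.0):
    bound = attainable_gflops(B_P, B_M, oi)
    regime = "memory bound" if oi < ridge_point(B_P, B_M) else "compute bound"
    print(f"OI = {oi} FLOPs/Byte -> {bound} GFLOP/s ({regime})")
```

Kernels whose operational intensity falls left of the ridge point sit under the sloped (memory) part of the roof; those at or right of it sit under the flat (compute) part.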
Roofline model for AMD Opteron X2
Roofline model for Opteron X2 vs. Opteron X4
Roofline model with ceilings for Opteron X2
What is next for Roofline
• Non-floating point kernels would be interesting
  – e.g., Sort (potential exchanges/sec vs. GB/s), Graph Traversal (nodes traversed/sec vs. GB/s)
• Opportunities for others to help investigate: many kernels, multicores, metrics, …
2.2 Capacity Model
• HW represented as nodes with “peak” BW
  – In this talk & for illustration purposes, we assume only two nodes, a memory and a processing node, with BWs $B_m$ and $B_p$
• System is represented as a graph of HW nodes
Performance Depends on:
A. System Characteristics
   1. Peak BWs of nodes
   2. Memory hierarchy (cache) organization/size
   3. Operational overlap
B. Application Characteristics
   1. Relative demands on BWs
   2. Overheads
Definitions
• Ratio of peak BWs: $B_p / B_m$
• BW-used per node: $B_p^u$, $B_m^u$
• Ratio of BWs-used: $\alpha_{p,m} = B_p^u / B_m^u$, and $\alpha_{m,p} = B_m^u / B_p^u = 1 / \alpha_{p,m}$
• Ratio of BW-used per node to system bandwidth-used: $\frac{B_p^u}{B_p^u + B_m^u} = \frac{1}{1 + \alpha_{m,p}}$
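As a small worked illustration of these ratios, the Python sketch below computes the ratio of BWs-used and the processing node's share of the system bandwidth-used. It assumes that the ratio is taken as processing over memory and that system bandwidth-used is the sum of the two per-node values. The function names and numbers are hypothetical.

```python
def used_ratio(bw_used_p, bw_used_m):
    """Ratio of BWs-used, processing over memory (assumed convention)."""
    return bw_used_p / bw_used_m


def node_share(bw_used_node, bw_used_other):
    """Share of system bandwidth-used taken by one node (system = sum of nodes, assumed)."""
    return bw_used_node / (bw_used_node + bw_used_other)


# Hypothetical application: uses 4 units of processing BW and 8 units of memory BW.
bu_p, bu_m = 4.0, 8.0

alpha_pm = used_ratio(bu_p, bu_m)  # processing over memory
alpha_mp = 1.0 / alpha_pm          # memory over processing

# The processor's share computed directly and via 1 / (1 + alpha_mp) should agree.
print(alpha_pm, node_share(bu_p, bu_m), 1.0 / (1.0 + alpha_mp))
```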
Capacity of A Node
Average node BW utilized by an application
A function of
• Application characteristics
• Node BW
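$C_p = \begin{cases} B_p^u & \text{if } B_p^u \le B_p \\ B_p & \text{if } B_p^u > B_p \end{cases}$
$C_m = \begin{cases} B_m^u & \text{if } B_m^u \le B_m \\ B_m & \text{if } B_m^u > B_m \end{cases}$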
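Read as code, the capacity of a node is the bandwidth the application uses on that node, capped at the node's peak bandwidth. A minimal Python sketch under that reading follows; the function name and the numbers are hypothetical.

```python
def node_capacity(bw_used, bw_peak):
    """Capacity of a node: the used BW when it fits under the peak, else the peak BW."""
    return bw_used if bw_used <= bw_peak else bw_peak


# Hypothetical processing node with a peak BW of 16 (e.g. GFLOP/s).
print(node_capacity(10.0, 16.0))  # demand below peak: capacity equals the demand
print(node_capacity(20.0, 16.0))  # demand above peak: the node is saturated
```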
Saturated Node Capacity
• Assume that at least one of the nodes is saturated; then processor capacity, $C_p$, is given by [expression not preserved in this transcript]
• A similar expression applies for memory capacity, $C_m$
• System capacity: $C_s = C_p + C_m$
• A similar argument holds for an unsaturated node pair
Saturated Node Capacity Expression – Example
• For $\alpha_{p,m} = 1/2$
Processor, Memory, and System Capacity Curves
[Figure: capacity curves; the parenthesized symbol in the slide title, subscripted $p,m$, is not recoverable]
3. Relating Roofline/ Capacity
• A processing optimization ceiling, x, in Roofline corresponds to a used processing BW, $B_p^x$
• A memory optimization ceiling, y, in Roofline corresponds to a used memory BW, $B_m^y$
• If an application is optimized using optimizations x and y, then
  $\frac{B_p^x}{B_p^x + B_m^y} = \frac{1}{1 + \alpha_{m,p}}$, where $\alpha_{m,p} = \frac{B_m^y}{B_p^x} = \frac{1}{\alpha_{p,m}}$
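Roofline model with ceilings for Opteron X2
[Figure: roofline with ceilings, annotated with $B_p$, $B_m$, $B_p^1$ (ILP or SIMD), and $B_m^{4,5}$]
$\frac{B_p^1}{B_p^1 + B_m^{4,5}} = \frac{1}{1 + \alpha_{m,p}}$, with $\alpha_{m,p} = B_m^{4,5} / B_p^1$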
4. Open Issues
• Modeling with different performance limiting factors
  – Cache resident client applications (i.e. memory BW is not the limit)
• Introduce additional bounds: Network BW and IO BW
• Development of tools based on models for use in application optimization
5. Discussion: How could PMUs help
References: Roofline Model
• S. Williams, A. Waterman, D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, Volume 52, Issue 4 (April 2009), Pages 65-76.
• D. Patterson, “The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?”, April 8, 2009 lecture in the Parallel@Illinois Distinguished Lecture Series (http://www.parallel.illinois.edu/dls_archive.html)
References: Capacity Model
• D. J. Kuck, “Computer System Capacity Fundamentals,” National Bureau of Standards, Technical Note 851, Oct. 1974.
• D. J. Kuck, B. Kumar, “A system model for computer performance evaluation,” SIGMETRICS '76: Proceedings of the 1976 ACM SIGMETRICS Conference on Computer Performance Modeling Measurement and Evaluation, March 1976.
• D. J. Kuck, The Structure of Computers and Computations, Vol. I, John Wiley & Sons, Inc., 1978.
• D. J. Kuck, “Capacity-based Codesign of Computer HW and SW”, January 26, 2009 lecture in the Parallel@Illinois Distinguished Lecture Series (http://www.parallel.illinois.edu/dls_archive.html)