TRANSCRIPT
Resource Sharing in Multicore Processors
David Black-Schaffer, Assistant Professor, Department of Information Technology
Uppsala University
Special thanks to the Uppsala Architecture Research Team: David Eklöv, Prof. Erik Hagersten, Nikos Nikoleris, Andreas Sandberg, Andreas Sembrant
Uppsala Programming for Multicore Architectures Research Center
Resource Sharing in Multicores
• Most people focus on the core count – parallelism, synchronization, deadlock, etc.
• I am going to focus on the shared memory system – cache, bandwidth, prefetching, etc.
• The shared memory system significantly impacts performance, efficiency, and scalability.
[Die photos: Nehalem and Sandy Bridge, each with cores, a shared L3 cache, memory controller, and I/O.]
Goals For Today
1. Overview of shared memory system resources
2. Impact on application performance
3. Importance of application sensitivity
4. Research tools and directions
SHARED RESOURCES Not just cache and bandwidth…
Multicore Memory Systems: Intel Nehalem Memory Hierarchy (3 GHz)
D. Molka, et al., Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, PACT 2009.
[Figure: four cores, each with private L1/L2 caches, a shared L3, and DRAM; latency scale from ~10 to ~190 cycles.]
• Latency to private L1: 4 cycles; latency to DRAM: ~200 cycles
• Bandwidth to private L1: ~15 bytes/cycle per core
• Bandwidth to DRAM: 4–8 bytes/cycle total (1–2 bytes/cycle per core)
• The last-level cache and the off-chip bandwidth are shared.
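Latency numbers like these are typically measured with a pointer-chasing microbenchmark (the approach used in studies such as Molka et al.): follow a randomized chain of dependent loads through working sets of increasing size, so that each level of the hierarchy is exercised in turn. Below is a minimal sketch in C; it is an illustration, not the paper's benchmark, and the sizes and iteration counts are only placeholders.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Chase a randomized cyclic pointer chain over 'n' slots. Every load
     * depends on the previous one, so the time per step approximates the
     * load-to-use latency of whatever level the working set fits into. */
    static double chase_ns_per_access(size_t n, long steps) {
        size_t *perm = malloc(n * sizeof *perm);
        void **chain = malloc(n * sizeof *chain);
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)                    /* link one big cycle */
            chain[perm[i]] = &chain[perm[(i + 1) % n]];

        void **p = &chain[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long s = 0; s < steps; s++)
            p = (void **)*p;                              /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile void *sink = p; (void)sink;              /* keep the chase live */
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        free(perm); free(chain);
        return ns / steps;
    }

    int main(void) {
        /* Working sets from 32 KB (L1-sized) up to 32 MB (DRAM-sized). */
        for (size_t kb = 32; kb <= 32 * 1024; kb *= 4)
            printf("%6zu KB: %.1f ns/access\n", kb,
                   chase_ns_per_access(kb * 1024 / sizeof(void *), 10000000L));
        return 0;
    }

The plateaus in the output roughly correspond to the L1, L2, L3, and DRAM latencies, modulo TLB and prefetcher effects.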
Other Shared Resources
• SMT (Hyper-Threading) shares the "private" caches
  – Instruction cache, L1/L2 data caches
  – Branch predictors, TLBs, instruction decode
• AMD's Bulldozer module shares
  – Floating point units
  – Instruction fetch/decode
  – L2 cache
  – Branch predictor
Energy As A Shared Resource
• Probably the 2nd most important shared resource today, soon to be the 1st
  – Power (controlled by DVFS, dynamic voltage and frequency scaling)
  – Thermal capacity
• Examples: AMD's "Turbo CORE" (thermal capacity) and Intel's "Turbo Boost"
Aside: Why Share Resources?
• Parallel performance
  – Data sharing across threads (efficient data sharing)
  – Simplify synchronization and coherency (fast synchronization)
• Flexibility
  – Cache partitioning for different workloads (flexible resource allocation)
• Amdahl's law
  – Always need single-threaded performance
[Diagrams: four cores around a shared cache, illustrating efficient data sharing, fast synchronization (shared data and a lock), and flexible resource allocation.]
[Figure: Limits of Amdahl's Law. Maximum speedup vs. number of cores (1–1024, log scale): 75% parallel (4x), 90% parallel (10x), 99% parallel (91x). A worked sketch follows below.]
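For reference, the curve in the figure follows directly from Amdahl's law, speedup(p, n) = 1 / ((1 - p) + p/n). A small sketch reproducing the numbers above:

    #include <stdio.h>

    /* Amdahl's law: speedup for parallel fraction p on n cores. */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        double fractions[] = {0.75, 0.90, 0.99};
        for (int i = 0; i < 3; i++) {
            double p = fractions[i];
            /* Speedup at 1024 cores and the asymptotic limit 1/(1-p). */
            printf("p = %.2f: speedup(1024) = %.1fx, limit = %.0fx\n",
                   p, amdahl(p, 1024), 1.0 / (1.0 - p));
        }
        return 0;
    }

At 1024 cores this gives roughly 4x, 9.9x, and 91x, which is why the single-threaded (serial) fraction still matters even with many cores.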
PERFORMANCE IMPACTS OF SHARING
Cache Sharing, Cache Pollution, and Bandwidth Sharing
Cache Sharing
Experiment:
• Run 1–4 independent instances of the same program on a 4-core Nehalem
• Performance is affected by the shared cache allocation: with ¼ of the shared cache, each instance runs about 20% slower
• Understanding performance as a function of shared resource allocation is critical to understanding multicore scalability
Shared L3 cache: performance as a function of shared cache size.
[Figure: Cache Pirate data for 471.omnetpp. Throughput for 1–4 co-running instances (expected vs. measured vs. predicted) and CPI as a function of shared cache size (1M–7M).]
Cache Sharing: What's Going On?
[Figure: the memory hierarchy (four cores with private L1/L2, a shared L3, and DRAM; latencies from ~10 to ~190 cycles). As more instances run, each one's share of the shared cache shrinks: 100%, 50%, 33%, 25%.]
Less shared cache space → more DRAM accesses → increased latency → reduced performance.
How much slower? See the CPI-vs-cache-size curve (the application's shared cache sensitivity), which explains the reduced scalability in the throughput plot.
Need performance as a function of shared resources to understand multicore scalability.
Cache Pollution
• Applications put data in the cache but do not reuse it
  – No benefit, but no problem… unless you are sharing cache space
• Pollution wastes space other applications could benefit from
[Figure: Effects of cache pollution in a 4-process workload (bzip2, lbm, libquantum, gamess). Throughput when run alone, in the mixed workload, and with pollution eliminated; about 1/3 of the slowdown is due to cache pollution.]
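One way to eliminate this kind of pollution (mentioned later in the talk as changing polluting instructions to non-caching accesses) is to use non-temporal stores for write-once data. Below is a hedged sketch using x86 SSE2 intrinsics; it illustrates the mechanism only and is not the authors' automatic tool.

    #include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
    #include <stddef.h>

    /* Copy n 16-byte blocks without allocating the destination in the cache.
     * _mm_stream_si128 issues non-temporal (write-combining) stores, so data
     * that will not be reused does not evict other applications' cache lines.
     * 'dst' must be 16-byte aligned. */
    void copy_nontemporal(void *dst, const void *src, size_t n) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n; i++) {
            __m128i v = _mm_loadu_si128(&s[i]);   /* normal (cached) load */
            _mm_stream_si128(&d[i], v);           /* non-temporal store   */
        }
        _mm_sfence();                             /* order the streaming stores */
    }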
Cache Pollution: What's Going On?
• Do all applications benefit from caching? Does an application benefit from using the shared cache?
[Figure: miss ratio as a function of cache size, from the private caches into the shared cache, for libquantum and lbm. An application whose miss ratio does not decrease with more shared cache gets no benefit from using the shared cache space; one whose miss ratio keeps decreasing gets significant benefit from it.]
• Not all applications benefit from caching; better performance if caching can be controlled.
Bandwidth Sharing
• 50% reduction in cache → 57% increase in bandwidth
[Figure: Predicting multicore scaling for 470.lbm with the Cache Pirate. CPI and bandwidth as functions of shared cache size (1M–7M); throughput and required bandwidth for 1–4 co-running instances (predicted vs. measured), sharing the memory controller queue and I/O. The measured system maximum is 10.4 GB/s.]
No performance loss with less cache space… until you run out of shared bandwidth:
4 cores @ 3 GB/s = 12 GB/s required, but the system maximum is 10.4 GB/s.
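A back-of-the-envelope model of that effect (illustrative only; it assumes each instance keeps needing about 3 GB/s, whereas the real prediction uses the Cache Pirate's bandwidth-vs-cache-size data):

    #include <stdio.h>

    int main(void) {
        double bw_per_instance = 3.0;   /* GB/s needed by one 470.lbm instance (slide) */
        double bw_system_max   = 10.4;  /* measured saturation bandwidth (GB/s)        */

        for (int n = 1; n <= 4; n++) {
            double required = n * bw_per_instance;
            /* Once the required bandwidth exceeds the system maximum, aggregate
             * throughput is capped at bw_system_max / bw_per_instance. */
            double throughput = required <= bw_system_max
                              ? (double)n
                              : bw_system_max / bw_per_instance;
            printf("%d instances: need %4.1f GB/s -> throughput ~%.2fx\n",
                   n, required, throughput);
        }
        return 0;
    }

With these numbers the workload scales to 3 instances and then flattens out near 3.5x, which is the shape seen in the measured throughput curve.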
Bandwidth Sharing: What is Going On?
• Shared resources:
  – Cache, bandwidth
  – But also: queues, row buffers, prefetchers, etc.
• Results in complex application sensitivities:
  – Bandwidth, capacity, latency, access pattern
• Complex interactions with shared bandwidth
Fig. 2. The memory hierarchy. The memory hierarchy can be analyzed with Equation 1 using the different degrees of parallelism and latency at each level in the hierarchy. The queues in the figure indicate that the available parallelism comes from multiple places within the hierarchy. The details of the hierarchy are discussed in Section IV.
(Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local level in the memory hierarchy (Section IV). With this background, we then introduce the concept of application sensitivity to bandwidth and latency (Section V) and present the details of the Bandwidth Bandit implementation (Section VI). The Bandit is then validated against a series of micro-benchmarks, which allow us to describe how to interpret its output (Section VII). Finally, we present the results of using the Bandwidth Bandit to investigate the memory contention sensitivity of a variety of SPEC2006 and PARSEC benchmarks (Section VIII).
II. BACKGROUND
A. Memory Hierarchy Organization
The memory hierarchy considered in this paper is shown in Figure 2. If a memory access cannot be serviced by the core's private caches (not shown in the figure), it is first sent to the shared L3 cache. If the requested data is not found in the L3 cache, it is sent to the integrated Memory Controller (MC). The MC has three independent memory channels over which it communicates with the DRAM modules. Each channel consists of an address bus and a 64-bit wide data bus. Memory requests are typically 64 bytes (one cache line), thereby requiring eight transfers over the data bus. Each DRAM module consists of several independent memory banks, which can be accessed in parallel, as long as there are no conflicts on the address and data buses. The combination of independent channels and memory banks provides for a large degree of available parallelism in the off-chip memory hierarchy.
The DRAM memory banks are organized into rows (also called pages) and columns. To address a word of data the MC has to specify the channel, bank, row, and column of the data. To read or write an address, the whole row is first copied into the bank's row buffer. This single-entry buffer (also known as a page cache) caches the row until a different row in the same bank is accessed.
On a read or write access three events can occur: a page-hit, when the accessed row is already in the row buffer and the data can be read/written directly; a page-empty, when the row buffer is empty and the accessed row has to be copied from the bank before it can be read/written¹; or a page-miss, when a row other than the one accessed is cached in the row buffer. In the case of a page-miss, the cached row has to first be written back to the memory bank before the newly accessed row is copied into the row buffer. These three events have different latencies, with a page-hit having the shortest latency and a page-miss having the longest.
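Because these three events have different latencies, the average DRAM latency an application sees depends on how often its accesses hit an open row. A tiny illustration with assumed latencies (the actual values depend on the DRAM timing parameters and are not taken from the paper):

    #include <stdio.h>

    int main(void) {
        /* Assumed, illustrative latencies in ns; not measured values. */
        double t_hit = 20.0, t_empty = 35.0, t_miss = 50.0;

        /* Average access latency for a row-buffer-friendly (mostly page-hit)
         * mix versus a row-buffer-hostile (mostly page-miss) mix. */
        double friendly = 0.8 * t_hit + 0.1 * t_empty + 0.1 * t_miss;
        double hostile  = 0.1 * t_hit + 0.1 * t_empty + 0.8 * t_miss;
        printf("friendly: %.1f ns, hostile: %.1f ns\n", friendly, hostile);
        return 0;
    }

This is one reason why two applications with the same bandwidth demand can have very different sensitivities to contention: contention changes the mix of page-hits and page-misses as well as the raw latency.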
B. Memory Hierarchy Performance
From a performance point of view the memory hierarchy can be described by two metrics: its latency and bandwidth. These two metrics are intimately related. Using Little's law [?], the average bandwidth achieved by an application can be expressed as follows:
bandwidth = (transfer size × MLP) / latency    (1)
where MLP is the application's average Memory Level Parallelism, that is, the average number of concurrent memory requests it has in-flight, and latency is the average time to complete the application's memory accesses. For example, consider an application with an MLP of one (e.g., a linked list traversal) and an access latency of 50 ns. Such an application would complete a transfer every 50 ns, on average. With a transfer size of 64 bytes, its memory bandwidth would be 1.3 GB/s. If the MLP was increased to two, but the access latency remained the same, the application would complete two transfers every 50 ns, and its bandwidth would double to 2.6 GB/s.
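A minimal numeric sketch of Equation 1, reproducing the 50 ns / 64-byte example above:

    #include <stdio.h>

    /* Little's law for memory bandwidth (Equation 1):
     * bandwidth = transfer_size * MLP / latency. With bytes and ns, the
     * result is in bytes/ns, i.e. GB/s. */
    static double bandwidth_gbps(double transfer_bytes, double mlp, double latency_ns) {
        return transfer_bytes * mlp / latency_ns;
    }

    int main(void) {
        printf("MLP = 1: %.1f GB/s\n", bandwidth_gbps(64, 1, 50));  /* ~1.3 GB/s */
        printf("MLP = 2: %.1f GB/s\n", bandwidth_gbps(64, 2, 50));  /* ~2.6 GB/s */
        return 0;
    }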
The above equation clearly illustrates that the bandwidth achieved by an application is determined by both its memory access latency and its memory parallelism. However, these parameters vary throughout the memory hierarchy. For example, at the bank level, the parallelism is limited by the number of banks. However, MCs typically queue requests to busy banks. From the higher-level perspective of the MC, the parallelism, or number of in-flight requests, will include the requests in these queues, and appear larger. The latency will also appear different, since the time spent in the queues has to be considered. The above equation will therefore have different values for latency and MLP depending on where it is applied in the memory hierarchy.
In this work we will use the above equation to understand how contention for shared memory resources impacts application performance. Such contention can occur throughout the memory hierarchy, and can both increase access latencies and reduce the memory parallelism. The resulting effect on application performance is a function of the application's sensitivity to increased latency and reduced memory parallelism.
III. EXPERIMENTAL SETUP
The experiments presented in this paper have been run on a quad-core Intel Xeon E5520 (Nehalem) machine. Its cache configuration is detailed in the following table:
¹ Page-empties occur when the MC preemptively closes a page that hasn't been accessed recently to optimistically turn a page-miss into a page-empty.
Slowdown at 90% of saturation bandwidth
Bandwidth Bandit: Understanding Memory Contention
Abstract—Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention.
While cache-sharing effects have been studied extensively of late, the effects of memory bandwidth sharing are not as well explored. This is in large part due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and the locality properties of the application's access stream.
This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which cause decreases in available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and explains why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.
I. INTRODUCTION
Cache capacity and memory bandwidth are critical shared resources in chip multi-processors. Sharing these resources between cores has many benefits, however, it also leads to contention, which can have dramatic, negative impacts on the co-running applications' performance. Understanding the impact of such contention is therefore essential to fully realize the computational power of these systems. This work investigates the effects of contention for off-chip memory (bandwidth) by first exploring where and why contention occurs in the memory hierarchy. We then use this knowledge to develop a method that enables us to measure an application's sensitivity to memory contention.
While contention effects have been extensively studied for cache capacity (e.g. [1]), less is known about contention for memory bandwidth. Two recent studies [2], [3], show that the performance impact from contention for off-chip memory resources can be significant. Their main results, with respect to memory bandwidth contention, are that applications with higher (un-contended) bandwidth demands both generate more memory contention and are generally more sensitive to contention from others.
The importance of application sensitivity to memory bandwidth contention is shown in Figure 1. This data shows that applications' sensitivities vary significantly and can be quite substantial, with slowdowns varying from 3% to 23% for these benchmarks. It is clear from the data in Figure 1 that applications with similar bandwidth demands (e.g. lbm and soplex) can exhibit widely varying slowdowns for the same degree of memory contention. The goal of this research is to further investigate the mechanisms behind bandwidth contention and its impact on applications' performance.
[Figure 1: slowdown (%) under memory contention and baseline bandwidth (GB/s) for lbm, streamcluster, soplex, and mcf.]
Fig. 1. Baseline bandwidth (no contention) and slowdown with memory contention (90% of saturation bandwidth). While all of these applications exhibit similar baseline bandwidth consumption, their sensitivities to memory contention vary widely. This variability demonstrates the importance of a more detailed understanding of the effects of memory contention. (For each benchmark we generated sufficient bandwidth contention to reach 90% of the system saturation bandwidth for the benchmark. This contention was generated using the Bandwidth Bandit technique discussed in Section VI.)
To measure application sensitivity to memory contention we propose the Bandwidth Bandit method. It works by co-running the application whose performance we want to measure (the Target) with a Bandit application that generates memory contention. By carefully controlling the amount of contention the Bandit generates we can measure the Target's performance as a function of the contention generated by the Bandit.
As we want to measure the Target's sensitivity to memory contention in isolation, it is important that the Bandit does not consume any shared resources other than the ones for which it generates contention. For example, if the Bandit uses large amounts of shared cache capacity, this might impact the Target's performance and distort the measurement. In addition, the Bandit must generate realistic contention traffic that appropriately targets the different parts of the memory hierarchy. To design such a Bandit we must first understand the sources of memory contention in detail.
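As an illustration of the idea only (not the paper's Bandit, whose actual implementation is described in Section VI), a minimal contention generator can chase pointers through a buffer far larger than the shared cache while throttling its own request rate:

    #include <stdlib.h>
    #include <stddef.h>

    /* Hypothetical, minimal contention generator: a pointer chase over a buffer
     * much larger than the shared L3, so essentially every access goes to DRAM,
     * while touching only one word per cache line to keep the generator's own
     * cache footprint small. 'delay_iters' throttles the request rate, and thus
     * the amount of contention. A real Bandit would also randomize the chain to
     * defeat prefetching, control its MLP, and spread accesses over channels
     * and banks. */
    void bandit(size_t lines, long accesses, unsigned delay_iters) {
        size_t stride = 64 / sizeof(size_t);          /* one slot per 64-byte line */
        size_t *buf = malloc(lines * 64);
        for (size_t i = 0; i < lines; i++)            /* simple cyclic chain       */
            buf[i * stride] = ((i + 1) % lines) * stride;

        volatile size_t idx = 0;
        for (long a = 0; a < accesses; a++) {
            idx = buf[idx];                           /* roughly one DRAM access   */
            for (volatile unsigned d = 0; d < delay_iters; d++)
                ;                                     /* throttle the access rate  */
        }
        free(buf);
    }

Co-running such a generator next to a Target while sweeping delay_iters gives the Target's performance as a function of the contention level, which is the shape of measurement the Bandit method produces.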
This research makes the following contributions:
• We identify the main bottlenecks that can arise due to memory contention and measure their impact on both latency and bandwidth. We classify these bottlenecks into two categories: bottlenecks that limit bandwidth and bottlenecks that increase latency.
• We present the Bandwidth Bandit method. To our knowledge, this is the first method for measuring the performance impact of memory contention on real hardware.
• We describe how the Bandit data can be analyzed to understand an application's bandwidth and latency sensitivity.
• We use the Bandwidth Bandit method to analyze the contention sensitivity of a set of applications from the SPEC2006 and PARSEC benchmark suites.
To present this material, we begin with an overview of the memory hierarchy (Section II-A) and a discussion of the relationship between memory level parallelism and latency (Section II-B). We then describe our experimental setup
Sensitivity Critical for Scaling
• Shared resource sensitivity → understanding scaling
• Sensitivity varies from application to application
• Impact varies from platform to platform
[Figure: for nine SPEC2006 benchmarks (435.gromacs, 401.bzip2, 429.mcf, 454.calculix, 482.sphinx3, 470.lbm, 462.libquantum, 450.soplex, 403.gcc): miss ratio, fetch ratio, performance (CPI, lower is better), and bandwidth (GB/s) as functions of cache size (1M–7M).]
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware prefetching enabled.
Insensitive, sensitive, insensitive: sensitivity differs widely across these applications. And all of this holds only as long as you have enough bandwidth.
Aside: Implementation (Tools)
• Performance as a function of cache size (Cache Pirating)
  – A "Pirate" thread steals shared cache and measures the Target's performance
  – Monitor the Pirate's miss rate (to verify it actually holds the stolen cache)
  – (Similar approach for bandwidth)
  – Captures performance data for all cache sizes in one run
  – Average Target slowdown: 5% (simulation slowdown: 100x–1000x)
  – A minimal sketch of the pirating idea follows below.
  D. Eklöv, N. Nikoleris, D. Black-Schaffer, E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. ICPP 2011.
• Reducing cache pollution
  – Sample memory access behavior (reuse distances)
  – Model misses for different cache sizes
  – Identify polluting instructions and change them to non-caching accesses
  – 5% overhead, fully automatic
  A. Sandberg, D. Eklöv, E. Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. SC 2010.
[Diagram: over time, the Pirate grows its share of the 8 MB shared cache from 0 MB upward while the Target's performance is measured at each size.]
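As an illustration of the pirating idea only (not the Cache Pirating tool from the ICPP 2011 paper), the core mechanism is a co-running thread that keeps a working set of a chosen size hot in the shared cache while the Target is measured:

    #include <stdlib.h>
    #include <stddef.h>

    /* Hypothetical, minimal "pirate": repeatedly sweep a working set of 'bytes'
     * so that it stays resident in the shared cache, leaving the Target roughly
     * (shared cache size - bytes) of cache. The real tool additionally pins
     * threads to cores, monitors the pirate's own miss ratio to verify it really
     * holds the stolen cache, and samples the Target's performance counters. */
    void pirate(size_t bytes, long sweeps) {
        volatile char *buf = malloc(bytes);
        for (long s = 0; s < sweeps; s++)
            for (size_t i = 0; i < bytes; i += 64)    /* one touch per cache line */
                buf[i]++;                             /* keep the line resident   */
        free((void *)buf);
    }

Sweeping 'bytes' from 0 up to the full shared cache size, and recording the Target's CPI at each point, yields performance as a function of cache size, which is the data behind the CPI curves shown earlier.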
WRAP UP What is all of this good for?
Why Is This Important?
• Performance and scaling are impacted by resource sharing
• How can we use this information?
  – Predict and understand performance
  – Workload performance
  – Useful for schedulers (static and dynamic), optimization, compilation
• Need new modeling technology (ongoing research)
  – Can model workload cache behavior for simple cores (StatCC)
    • Not good enough for modern machines: no MLP, prefetchers, OoO, etc.
  – Developing new techniques for modeling real hardware
The Future
• More threads (100s of cores, 1000s of threads)
• More complex hierarchies (NoCs, distributed caches)
• Heterogeneous (CPU + GPU + accelerators)
• More focus on energy (per-core DVFS, dark silicon)
[Die diagrams: AMD Fusion (CPU cores, GPU cores, shared resources, I/O) and Intel integrated graphics with Turbo Boost.]
Resource Sharing in Multicores
• Important
  – Significant impact on performance (and efficiency) scaling
  – Basic data lets us understand interactions
• Complex
  – Sensitivity varies across applications
  – Hard to understand the hardware
• Moving forward
  – Research on predicting performance and understanding scaling
  – MS thesis projects to port our tools / analyze your code
  – We're open (and eager) for collaboration!
Uppsala Programming for Multicore Architectures Research Center
QUESTIONS?
APPLICATION PHASES Average behavior is not good enough
Application Phases
A. Sembrant, D. Eklöv, E. Hagersten. Efficient Software-based Online Phase Classification. IISWC 2011.
[Figures: measured CPI over time and average CPI over time for bzip2/chicken (about 160 billion instructions). Average CPI: fast, but inaccurate. Periodic CPI profiling: nearly as fast as the average, still inaccurate. Phase-guided profiling: 1.7% additional overhead over periodic profiling, good accuracy. Conclusion: fast phase detection is needed. (From: Efficient Software-based Online Phase Classification, Andreas Sembrant, Uppsala University.)]
[Figures repeated from earlier: average CPI and average bandwidth as functions of cache size for 470.lbm, and its predicted multicore scaling.]
Everything so far has been average application behavior.
Q: Is that good enough? Is application behavior constant?
A: No. Applications have distinct phases with varying behaviors.
[Figure: branch mispredictions, cycles per instruction, and basic block vectors over execution intervals, with phases labeled A, B, C, D, E.]
• Program behavior is correlated with what code is executed.
• Changes in executed code reflect changes in other metrics.
(From: Efficient Software-based Online Phase Classification, Andreas Sembrant, Uppsala University. A minimal sketch of interval-based phase classification follows below.)
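As an illustration only (not the ScarPhase implementation), online phase classification can be sketched as leader-follower clustering of per-interval code-signature vectors: each interval's vector is compared against known phase signatures and either assigned to the closest phase or used to start a new one. The bucket count and threshold below are assumed values:

    #include <math.h>
    #include <string.h>

    #define NBUCKETS   64    /* size of the (hashed) per-interval code vector */
    #define MAX_PHASES 32
    #define THRESHOLD  0.3   /* new-phase distance threshold (assumed value)  */

    static double signatures[MAX_PHASES][NBUCKETS];
    static int nphases = 0;

    /* Manhattan distance between two normalized execution-frequency vectors. */
    static double distance(const double *a, const double *b) {
        double d = 0.0;
        for (int i = 0; i < NBUCKETS; i++)
            d += fabs(a[i] - b[i]);
        return d;
    }

    /* Classify one interval: return the id of the closest known phase, or start
     * a new phase if none is within THRESHOLD. (ScarPhase builds these vectors
     * online from sampled code execution; here the vector is assumed given.)
     * Returns -1 if the phase table is full. */
    int classify_interval(const double vec[NBUCKETS]) {
        int best = -1;
        double best_d = THRESHOLD;
        for (int p = 0; p < nphases; p++) {
            double d = distance(vec, signatures[p]);
            if (d < best_d) { best_d = d; best = p; }
        }
        if (best < 0 && nphases < MAX_PHASES) {
            best = nphases++;
            memcpy(signatures[best], vec, sizeof signatures[best]);
        }
        return best;
    }

Per-phase statistics (such as the cache miss ratio on the next slide) can then be accumulated by phase id instead of being averaged over the whole run.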
Cache Miss Ratio Per Phase
A. Sembrant, D. Black-Schaffer, E. Hagersten. Phase Guided Profiling for Fast Cache Modeling. CGO 2012.
30% overhead
SHARING IN MORE DETAIL: A brief look at application sensitivity
More Detail: Cache Sharing
• Three SPEC benchmark applications
• Different behaviors due to different properties
• Each will respond differently to cache sharing
[Figure: CPI (performance, lower is better) and bandwidth (GB/s) as functions of cache size for three of the Fig. 8 benchmarks, annotated per benchmark (Perf, BW, Size, Sensitivity):]
• 0% slower; 50x bandwidth increase; full working set fits in the cache; insensitive to latency
• 50% slower; 10x bandwidth increase; most of the working set fits in the cache; sensitive to latency
• 0% slower; 2x bandwidth increase; little of the working set fits in the cache; insensitive to latency

Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware prefetching enabled.

482.sphinx3 behaves quite differently from 435.gromacs. As its cache size is decreased its CPI increases by 50%, while its miss ratio and bandwidth increase by a factor of 20x. The fetch ratio and miss ratio curves are slightly different, indicating that there is a small amount of prefetching. However, the significant performance decrease despite the increased bandwidth indicates that the benchmark is more sensitive to increased memory latency.

470.lbm shows an 8x difference between its fetch and miss ratios, indicating 8 prefetched memory accesses for each miss. However, the relative increase in miss ratio is still roughly 2x. This indicates that 470.lbm is also relatively insensitive to the increased latency.
More Detail: The Impact of Prefetching
• Different degrees of prefetching depending on application access pattern and hardware
• Prefetching reduces application sensitivity to latency
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
470.lbm
BW
CPI
cache size
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
1M 3M 5M 7M
Miss RatioFetch Ratio
CPIBandwidth
Fig. 9. Performance data for 470.lbm with hardware prefetching disabled.(Fetch ratio and miss ratio are identical.)
cache size, clearly showing that prefetching was helping tocompensate for the reduced cache space. This demonstratesthat 470.lbm not only heavily leverages hardware prefetching,but also heavily benefits from it.
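To make the fetch/miss relationship concrete, the sketch below shows how a prefetch fraction could be derived from raw counter values. The counter names and numbers are hypothetical placeholders, not the specific events or measurements used here; they only illustrate that lines fetched but never demanded must have been brought in by the prefetcher.

```c
#include <stdio.h>

/* Hypothetical raw counts read from hardware performance counters. */
struct mem_counters {
    unsigned long long mem_accesses;   /* demand loads/stores reaching the cache */
    unsigned long long demand_misses;  /* demand accesses missing the last-level cache */
    unsigned long long lines_fetched;  /* all lines brought in from DRAM (demand + prefetch) */
};

int main(void)
{
    /* Example numbers chosen to mimic 470.lbm's roughly 8x fetch/miss gap. */
    struct mem_counters c = { 1000000000ULL, 7500000ULL, 60000000ULL };

    double miss_ratio  = (double)c.demand_misses / c.mem_accesses;
    double fetch_ratio = (double)c.lines_fetched / c.mem_accesses;
    /* Lines fetched but not demanded were brought in by the prefetcher. */
    double prefetch_fraction = 1.0 - (double)c.demand_misses / c.lines_fetched;

    printf("miss ratio:  %.2f%%\n", 100.0 * miss_ratio);
    printf("fetch ratio: %.2f%%\n", 100.0 * fetch_ratio);
    printf("prefetched share of fetches: %.1f%%\n", 100.0 * prefetch_fraction);
    return 0;
}
```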
V. RELATED WORK
The standard approach to measuring performance as a function of cache size is to use architectural simulators (e.g., [11], [16]) or performance models (e.g., [14], [18]). For performance analysis, the main limitation of simulation is the long execution time, which is typically orders of magnitude slower than native execution. To address this, a substantial amount of work has been done to improve simulation performance [8], [12], [17], in particular by trading off accuracy and detail for improved speed through the use of analytical models [14], [18]. However, despite these advances, the overhead of simulation and modeling is still a limitation for performance analysis of real applications with large data sets.
Xu et al. [4] presented a model for predicting the performance degradation of co-running applications contending for shared cache. As part of their work they co-run a stress application with each application to measure its performance as a function of miss ratio. For our purpose of measuring performance as a function of cache size, their approach has two limitations: 1) Instead of ensuring that their stress application steals a fixed amount of cache, they let it freely contend for space with the Target, and then estimate the average amount of cache stolen after the fact. Such an average is hard to correlate to one cache size and is sensitive to program phases. 2) As their micro benchmark freely contends for cache with the Target, they cannot limit how much off-chip bandwidth it consumes, which can distort performance measurements (see footnote 5).
Footnote 5: We implemented and used their method to measure the CPI of the simple sequential micro benchmark in Figure 4. When we tried to steal 4MB of cache, their stress application consumed sufficient off-chip bandwidth to increase the measured CPI by 37%.
Doucette and Fedorova [5] also co-run a set of micro benchmarks, called base vectors, with a Target application to determine the impact on the target. Their base vector for measuring shared cache contention has a similar sequential access pattern to that of the Cache Pirate, but has its working set size fixed to that of the shared cache. This provides a single cache "sensitivity" measure for the Target application and does not relate the performance of the Target to the amount of cache available, nor to specific performance measurements such as CPI or off-chip bandwidth.
Cakarevic et al. [19] use a similar set of micro benchmarks as Doucette and Fedorova, but their work focuses on hardware characterization. They co-run their micro benchmarks, stressing different shared resources, and investigate how the micro benchmarks impact each other. This allows them to identify and characterize the critical shared hardware resources, but not the behavior of other applications.
VI. CONCLUSION
We have presented a general technique for quickly and accurately measuring application performance characteristics as a function of the available shared cache space. By leveraging a co-executing Pirate application that steals shared cache space and hardware performance counters, this approach works on standard hardware, accurately reflects the idiosyncrasies of the system, has a low overhead, and requires no modifications to either the operating system or target applications.
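As a rough illustration of the idea, a pirate-style loop can be sketched as below: it repeatedly walks a working set sized to the amount of shared cache it should occupy, so that space is unavailable to the Target. This is a minimal sketch under stated assumptions (fixed size, no performance-counter feedback, no thread pinning), not the actual Pirate implementation, which monitors counters and adjusts its size at runtime.

```c
#include <stdlib.h>

#define CACHE_LINE 64

/* Walk a working set of `steal_bytes` over and over so it stays resident in
 * the shared last-level cache, leaving (LLC size - steal_bytes) for the
 * co-running Target. After the first pass the set hits in the LLC, so the
 * loop consumes very little off-chip bandwidth. Illustrative sketch only. */
static void pirate_loop(size_t steal_bytes)
{
    volatile char *buf = malloc(steal_bytes);
    if (!buf)
        return;

    for (;;) {                                   /* run until killed */
        for (size_t i = 0; i < steal_bytes; i += CACHE_LINE)
            (void)buf[i];                        /* touch one byte per cache line */
    }
}

int main(void)
{
    pirate_loop(4UL * 1024 * 1024);              /* e.g. steal ~4MB of an 8MB LLC */
    return 0;
}
```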
To evaluate our technique we identified and investigated four key issues: 1) Does the cache available to the Target behave as expected? 2) How much cache can the Pirate steal? 3) How many Pirate threads can we run before we impact the Target's performance? And, 4) Can we minimize the overhead by dynamically varying the Pirate's size? For the first question we used a trace-driven cache simulator to show that the cache space available to the Target application behaved as expected, with average and maximum absolute fetch ratio errors of 0.2% and 2.7%, respectively. We introduced a method to determine when the Pirate can execute two threads to increase its fetch rate without impacting the Target's performance, and demonstrated that we were able to steal an average of 6.7MB of the 8MB cache, and up to an average of 6.8MB with a small decrease in Target execution rate. Finally, by dynamically adjusting the amount of cache the Pirate steals during execution, we were able to reduce the total overhead to 5.5%, with an average CPI error of only 0.5%.
While we motivated this general performance characterization approach with an example of how it can be used to explain throughput scaling, we believe that the data captured by Cache Pirating is fundamentally important for the understanding and optimization of applications executing in shared resource environments. Future work includes extending this approach to collect performance data against other shared resources and exploring the range of information available from the ever increasing variety of performance counters.
!"#
• Significant prefetching, and it benefits from it • Minimal Prefetching
• No Prefetching
Prefetching Disabled
Difference between fetches and misses is prefetch rate.
Uppsala University / Department of Information Technology 28/11/2011 | 30 David Black-Schaffer
More Detail: Cache Pollution
• We can measure how greedy an application is and how sensitive it is
• By changing the code to use non-caching instructions we can make an application less greedy without hurting performance (see the sketch below)
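As an illustration of the non-caching idea (not the specific code transformation evaluated in this work), x86 provides non-temporal stores that write around the cache hierarchy. A streaming write that will not be reused soon can use _mm_stream_pd so it stops evicting other data from the shared cache:

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_pd, _mm_stream_pd */
#include <stddef.h>

/* Copy that bypasses the cache for its stores: useful when `dst` will not be
 * read again soon, so caching it would only evict other useful data.
 * Assumes dst is 16-byte aligned (required by the non-temporal store). */
void copy_nocache(double *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);
        _mm_stream_pd(&dst[i], v);   /* non-temporal store: goes around the caches */
    }
    for (; i < n; i++)               /* scalar tail for odd n */
        dst[i] = src[i];
    _mm_sfence();                    /* make the streamed stores globally visible */
}
```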
Motivation: Do all applications benefit from caching?
[Chart: miss ratio vs. cache size (private through shared) for libquantum and lbm]
Not all applications benefit from caching. Better performance if caching can be controlled.
Increased Performance of Mixed Workloads
[Chart: mixed workload of 401.bzip2, 470.lbm, 462.libq., and 416.gamess]
Not all applications are affected by cache contention. Even cache gobblers get increased performance due to shared bandwidth.
Uppsala University / Department of Information Technology 28/11/2011 | 31 David Black-Schaffer
More Detail: Bandwidth Sharing
• Sensitivity is a function of the application
– Latency sensitivity (memory level parallelism)
– Bandwidth requirement (data rate)
• And the hardware
– Ability to handle out-of-order requests (queue sizes)
– Access pattern costs (streaming vs. random in DRAM banks)
• BW consumption is not a good indicator of BW sensitivity
Bandwidth Bandit: Understanding Memory Contention
Abstract—Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention.
While cache-sharing effects have been studied extensively of late, the effects of memory bandwidth sharing are not as well explored. This is in large part due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and the locality properties of the application's access stream.
This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which cause decreases in available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and explains why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.
I. INTRODUCTION
Cache capacity and memory bandwidth are critical shared
resources in chip multi-processors. Sharing these resources
between cores has many benefits; however, it also leads to
contention, which can have dramatic, negative impacts on the
co-running applications’ performance. Understanding the im-
pact of such contention is therefore essential to fully realize the
computational power of these systems. This work investigates
the effects of contention for off-chip memory (bandwidth)
by first exploring where and why contention occurs in the
memory hierarchy. We then use this knowledge to develop a
method that enables us to measure an application’s sensitivity
to memory contention.
While contention effects have been extensively studied for
cache capacity (e.g. [1]), less is known about contention
for memory bandwidth. Two recent studies [2], [3] show
that the performance impact from contention for off-chip
memory resources can be significant. Their main results, with
respect to memory bandwidth contention, are that applications
with higher (un-contended) bandwidth demands both generate
more memory contention and are generally more sensitive to
contention from others.
The importance of application sensitivity to memory band-
width contention is shown in Figure 1. This data shows
that applications’ sensitivities vary significantly and can be
quite substantial, with slowdowns varying from 3% to 23% for these benchmarks. It is clear from the data in Figure 1
that applications with similar bandwidth demands (e.g. lbm
and soplex) can exhibit widely varying slowdowns for the
same degree of memory contention. The goal of this research
is to further investigate the mechanisms behind bandwidth
contention and its impact on applications’ performance.
Fig. 1. Baseline bandwidth (no contention) and slowdown with memory contention (90% of saturation bandwidth) for lbm, streamcluster, soplex, and mcf. While all of these applications exhibit similar baseline bandwidth consumption, their sensitivities to memory contention vary widely. This variability demonstrates the importance of a more detailed understanding of the effects of memory contention. (For each benchmark we generated sufficient bandwidth contention to reach 90% of the system saturation bandwidth for the benchmark. This contention was generated using the Bandwidth Bandit technique discussed in Section VI.)
To measure application sensitivity to memory contention we
propose the Bandwidth Bandit method. It works by co-running
the application whose performance we want to measure (the
Target) with a Bandit application that generates memory
contention. By carefully controlling the amount of contention
the Bandit generates we can measure the Target’s performance
as a function of the contention generated by the Bandit.
As we want to measure the Target’s sensitivity to memory
contention in isolation, it is important that the Bandit does
not consume any shared resources other than the ones for
which it generates contention. For example, if the Bandit uses
large amounts of shared cache capacity, this might impact
the Target’s performance and distort the measurement. In
addition, the Bandit must generate realistic contention traffic
that appropriately targets the different parts of the memory
hierarchy. To design such a Bandit we must first understand
the sources of memory contention in detail.
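As a first intuition for what such a traffic generator can look like, the sketch below chases a randomized pointer chain that is far larger than the shared cache, with a tunable delay to throttle how much contention it generates. This is a simplified, hypothetical illustration, not the Bandit described in Section VI, which must additionally avoid consuming shared cache capacity and generate realistic traffic that targets specific parts of the memory hierarchy.

```c
#include <stdlib.h>

/* One node per 64-byte cache line so each step touches a new line. */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

/* Link the nodes in a random permutation so the hardware prefetcher cannot
 * follow the chain and (almost) every step misses in the caches. */
static struct node *build_chain(size_t n)
{
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *perm = malloc(n * sizeof *perm);
    if (!nodes || !perm)
        exit(1);
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        nodes[perm[i]].next = &nodes[perm[(i + 1) % n]];
    struct node *start = &nodes[perm[0]];
    free(perm);
    return start;
}

/* Generate memory contention: roughly one DRAM access per step. The `delay`
 * knob throttles the request rate and hence the contention intensity. */
static void bandit(struct node *chain, unsigned delay)
{
    struct node *volatile p = chain;                  /* volatile: keep the chase */
    for (;;) {
        p = p->next;
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                         /* spin between requests */
    }
}

int main(void)
{
    /* 4M nodes x 64B = 256MB, far larger than the shared last-level cache. */
    bandit(build_chain(4UL * 1024 * 1024), 0);
    return 0;
}
```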
This research makes the following contributions:
• We identify the main bottlenecks that can arise due to
memory contention and measure their impact on both
latency and bandwidth. We classify these bottlenecks
into two categories: bottlenecks that limit bandwidth and
bottlenecks that increase latency.
• We present the Bandwidth Bandit method. To our knowl-
edge, this is the first method for measuring the perfor-
mance impact of memory contention on real hardware.
• We describe how the Bandit data can be analyzed to
understand an application’s bandwidth and latency sensi-
tivity.
• We use the Bandwidth Bandit method to analyze the
contention sensitivity of a set of applications from the
SPEC2006 and PARSEC benchmark suites.
To present this material, we begin with an overview of
the memory hierarchy (Section II-A) and a discussion of
the relationship between memory level parallelism and la-
tency (Section II-B). We then describe our experimental setup
Slowdown at 90% of saturation bandwidth
Fig. 2. The memory hierarchy. The memory hierarchy can be analyzed with Equation 1 using the different degrees of parallelism and latency at each level in the hierarchy. The queues in the figure indicate that the available parallelism comes from multiple places within the hierarchy. The details of the hierarchy are discussed in Section IV.
(Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local level in the memory hierarchy (Section IV). With this background, we then introduce the concept of application sensitivity to bandwidth and latency (Section V) and present the details of the Bandwidth Bandit implementation (Section VI). The Bandit is then validated against a series of micro-benchmarks, which allow us to describe how to interpret its output (Section VII). Finally, we present the results of using the Bandwidth Bandit to investigate the memory contention sensitivity of a variety of SPEC2006 and PARSEC benchmarks (Section VIII).
II. BACKGROUND
A. Memory Hierarchy Organization
The memory hierarchy considered in this paper is shown in Figure 2. If a memory access cannot be serviced by the core's private caches (not shown in the figure), it is first sent to the shared L3 cache. If the requested data is not found in the L3 cache, the request is sent to the integrated Memory Controller (MC). The MC has three independent memory channels over which it communicates with the DRAM modules. Each channel consists of an address bus and a 64-bit wide data bus. Memory requests are typically 64 bytes (one cache line), thereby requiring eight transfers over the data bus. Each DRAM module consists of several independent memory banks, which can be accessed in parallel, as long as there are no conflicts on the address and data buses. The combination of independent channels and memory banks provides for a large degree of available parallelism in the off-chip memory hierarchy.
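As a quick sanity check on these numbers, the arithmetic below computes the transfers needed per request and the resulting peak bandwidth; the DDR3-1066 transfer rate is an assumed figure for this class of machine, not one taken from the text.

```c
#include <stdio.h>

int main(void)
{
    const unsigned line_bytes = 64;      /* one cache line per memory request */
    const unsigned bus_bytes  = 8;       /* 64-bit data bus per channel */
    const unsigned channels   = 3;       /* independent channels on the MC */
    const double   mt_per_sec = 1066e6;  /* assumed DDR3-1066 transfer rate */

    printf("transfers per cache line: %u\n", line_bytes / bus_bytes);
    printf("peak bandwidth per channel: %.1f GB/s\n", bus_bytes * mt_per_sec / 1e9);
    printf("peak bandwidth, %u channels: %.1f GB/s\n",
           channels, channels * bus_bytes * mt_per_sec / 1e9);
    return 0;
}
```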
The DRAM memory banks are organized into rows (also called pages) and columns. To address a word of data the MC has to specify the channel, bank, row and column of the data. To read or write an address, the whole row is first copied into the bank's row buffer. This single-entry buffer (also known as a page cache) caches the row until a different row in the same bank is accessed.
On a read or write access three events can occur: a page-hit, when the accessed row is already in the row buffer and the data can be read/written directly; a page-empty, when the row buffer is empty and the accessed row has to be copied from the bank before it can be read/written (see footnote 1); or a page-miss, when a row other than the one accessed is cached in the row buffer. In the case of a page-miss, the cached row has to first be written back to the memory bank before the newly accessed row is copied into the row buffer. These three events have different latencies, with a page-hit having the shortest latency and a page-miss having the longest.
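The three outcomes can be expressed as a small state update on one bank's row buffer. The sketch below only preserves the ordering page-hit < page-empty < page-miss; the cycle counts are illustrative placeholders, not timings from this system.

```c
#include <stdio.h>

enum outcome { PAGE_HIT, PAGE_EMPTY, PAGE_MISS };

/* One DRAM bank's row buffer: either empty or caching exactly one row. */
struct bank { int has_row; unsigned row; };

/* Classify an access to `row` and update the row buffer accordingly. */
static enum outcome access_bank(struct bank *b, unsigned row)
{
    if (b->has_row && b->row == row)
        return PAGE_HIT;                    /* row already in the row buffer */
    enum outcome o = b->has_row ? PAGE_MISS /* old row must be written back first */
                                : PAGE_EMPTY; /* just copy the new row in */
    b->has_row = 1;
    b->row = row;
    return o;
}

int main(void)
{
    static const char *name[]      = { "page-hit", "page-empty", "page-miss" };
    static const unsigned cycles[] = { 15, 30, 45 };  /* placeholder latencies */
    struct bank b = { 0, 0 };
    unsigned rows[] = { 3, 3, 7, 3 };   /* empty, then a hit, then two misses */

    for (int i = 0; i < 4; i++) {
        enum outcome o = access_bank(&b, rows[i]);
        printf("access row %u -> %s, ~%u cycles\n", rows[i], name[o], cycles[o]);
    }
    return 0;
}
```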
B. Memory Hierarchy Performance
From a performance point of view the memory hierarchy can be described by two metrics: its latency and bandwidth. These two metrics are intimately related. Using Little's law [?], the average bandwidth achieved by an application can be expressed as follows:
bandwidth = (transfer size × MLP) / latency        (1)
where MLP is the application's average Memory Level Parallelism, that is, the average number of concurrent memory requests it has in flight, and latency is the average time to complete the application's memory accesses. For example, consider an application with an MLP of one (e.g., a linked list traversal) and an access latency of 50ns. Such an application would complete a transfer every 50ns, on average. With a transfer size of 64 bytes, its memory bandwidth would be 1.3GB/s. If the MLP was increased to two, but the access latency remained the same, the application would complete two transfers every 50ns, and its bandwidth would double to 2.6GB/s.
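The two figures in this example follow directly from Equation 1; the snippet below simply reproduces the arithmetic (bytes per nanosecond equal GB/s) for MLP values of one and two.

```c
#include <stdio.h>

int main(void)
{
    const double transfer_size = 64.0;   /* bytes per transfer (one cache line) */
    const double latency_ns    = 50.0;   /* average memory access latency */

    for (int mlp = 1; mlp <= 2; mlp++) {
        /* Equation 1: bandwidth = transfer_size * MLP / latency */
        double bytes_per_ns = transfer_size * mlp / latency_ns;
        printf("MLP = %d -> %.1f GB/s\n", mlp, bytes_per_ns); /* bytes/ns == GB/s */
    }
    return 0;
}
```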
The above equation clearly illustrates that the bandwidth achieved by an application is determined by both its memory access latency and its memory parallelism. However, these parameters vary throughout the memory hierarchy. For example, at the bank level, the parallelism is limited by the number of banks. However, MCs typically queue requests to busy banks. From the higher-level perspective of the MC, the parallelism, or number of in-flight requests, will include the requests in these queues, and appear larger. The latency will also appear different, since the time spent in the queues has to be considered. The above equation will therefore have different values for latency and MLP depending on where it is applied in the memory hierarchy.
In this work we will use the above equation to understand how contention for shared memory resources impacts application performance. Such contention can occur throughout the memory hierarchy, and can both increase access latencies and reduce memory parallelism. The resulting effect on application performance is a function of the application's sensitivity to increased latency and reduced memory parallelism.
III. EXPERIMENTAL SETUP
The experiments presented in this paper have been run on a quad-core Intel Xeon E5520 (Nehalem) machine. Its cache configuration is detailed in the following table:
Footnote 1: Page-empties occur when the MC preemptively closes a page that hasn't been accessed recently to optimistically turn a page-miss into a page-empty.