TRANSCRIPT
Resource Sharing in Multicore Processors
David Black-Schaffer, Assistant Professor, Department of Information Technology
Uppsala University
Special thanks to the Uppsala Architecture Research Team: David Eklöv, Prof. Erik Hagersten, Nikos Nikoleris, Andreas Sandberg, Andreas Sembrant
Uppsala Programming for Multicore Architectures Research Center
Resource Sharing in Multicores
• Most people focus on the core count – parallelism, synchronization, deadlock, etc.
• I am going to focus on the shared memory system – cache, bandwidth, prefetching, etc.
• The shared memory system significantly impacts performance, efficiency, and scalability.
[Die photos: Nehalem and Sandy Bridge, each with cores, a shared L3 cache, memory controller, and I/O.]
Goals For Today
1. Overview of shared memory system resources
2. Impact on application performance
3. Importance of application sensitivity
4. Research tools and directions
SHARED RESOURCES Not just cache and bandwidth…
Multicore Memory Systems: Intel Nehalem Memory Hierarchy (3 GHz)
D. Molka, et al., Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, PACT 2009.
[Figure: four cores, each with private L1/L2 caches, a shared L3, and DRAM; latency scale from ~10 to ~190 cycles.]
• Latency to private L1: 4 cycles; latency to DRAM: ~200 cycles
• Bandwidth to private L1: ~15 bytes/cycle per core
• Bandwidth to DRAM: 4–8 bytes/cycle total (1–2 bytes/cycle per core)
• The last-level cache and the off-chip bandwidth are shared.
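Latency numbers like these are typically measured with a pointer-chasing microbenchmark (the approach used in studies such as Molka et al.): follow a randomized chain of dependent loads through working sets of increasing size, so that each level of the hierarchy is exercised in turn. Below is a minimal sketch in C; it is an illustration, not the paper's benchmark, and the sizes and iteration counts are only placeholders.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Chase a randomized cyclic pointer chain over 'n' slots. Every load
     * depends on the previous one, so the time per step approximates the
     * load-to-use latency of whatever level the working set fits into. */
    static double chase_ns_per_access(size_t n, long steps) {
        size_t *perm = malloc(n * sizeof *perm);
        void **chain = malloc(n * sizeof *chain);
        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)                    /* link one big cycle */
            chain[perm[i]] = &chain[perm[(i + 1) % n]];

        void **p = &chain[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long s = 0; s < steps; s++)
            p = (void **)*p;                              /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile void *sink = p; (void)sink;              /* keep the chase live */
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        free(perm); free(chain);
        return ns / steps;
    }

    int main(void) {
        /* Working sets from 32 KB (L1-sized) up to 32 MB (DRAM-sized). */
        for (size_t kb = 32; kb <= 32 * 1024; kb *= 4)
            printf("%6zu KB: %.1f ns/access\n", kb,
                   chase_ns_per_access(kb * 1024 / sizeof(void *), 10000000L));
        return 0;
    }

The plateaus in the output roughly correspond to the L1, L2, L3, and DRAM latencies, modulo TLB and prefetcher effects.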
Other Shared Resources
• SMT (Hyper-Threading) shares the "private" caches
  – Instruction cache, L1/L2 data caches
  – Branch predictors, TLBs, instruction decode
• AMD's Bulldozer module shares
  – Floating point units
  – Instruction fetch/decode
  – L2 cache
  – Branch predictor
Energy As A Shared Resource
• Probably the 2nd most important shared resource today, soon to be the 1st
  – Power (controlled by DVFS, dynamic voltage and frequency scaling)
  – Thermal capacity
• Examples: AMD's "Turbo CORE" (thermal capacity) and Intel's "Turbo Boost"
Aside: Why Share Resources?
• Parallel performance
  – Data sharing across threads (efficient data sharing)
  – Simplify synchronization and coherency (fast synchronization)
• Flexibility
  – Cache partitioning for different workloads (flexible resource allocation)
• Amdahl's law
  – Always need single-threaded performance
[Diagrams: four cores around a shared cache, illustrating efficient data sharing, fast synchronization (shared data and a lock), and flexible resource allocation.]
[Figure: Limits of Amdahl's Law. Maximum speedup vs. number of cores (1–1024, log scale): 75% parallel (4x), 90% parallel (10x), 99% parallel (91x). A worked sketch follows below.]
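For reference, the curve in the figure follows directly from Amdahl's law, speedup(p, n) = 1 / ((1 - p) + p/n). A small sketch reproducing the numbers above:

    #include <stdio.h>

    /* Amdahl's law: speedup for parallel fraction p on n cores. */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        double fractions[] = {0.75, 0.90, 0.99};
        for (int i = 0; i < 3; i++) {
            double p = fractions[i];
            /* Speedup at 1024 cores and the asymptotic limit 1/(1-p). */
            printf("p = %.2f: speedup(1024) = %.1fx, limit = %.0fx\n",
                   p, amdahl(p, 1024), 1.0 / (1.0 - p));
        }
        return 0;
    }

At 1024 cores this gives roughly 4x, 9.9x, and 91x, which is why the single-threaded (serial) fraction still matters even with many cores.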
PERFORMANCE IMPACTS OF SHARING
Cache Sharing, Cache Pollution, and Bandwidth Sharing
Cache Sharing
Experiment:
• Run 1–4 independent instances of the same program on a 4-core Nehalem
• Performance is affected by the shared cache allocation: with ¼ of the shared cache, each instance runs about 20% slower
• Understanding performance as a function of shared resource allocation is critical to understanding multicore scalability
Shared L3 cache: performance as a function of shared cache size.
[Figure: Cache Pirate data for 471.omnetpp. Throughput for 1–4 co-running instances (expected vs. measured vs. predicted) and CPI as a function of shared cache size (1M–7M).]
Cache Sharing: What's Going On?
[Figure: the memory hierarchy (four cores with private L1/L2, a shared L3, and DRAM; latencies from ~10 to ~190 cycles). As more instances run, each one's share of the shared cache shrinks: 100%, 50%, 33%, 25%.]
Less shared cache space → more DRAM accesses → increased latency → reduced performance.
How much slower? See the CPI-vs-cache-size curve (the application's shared cache sensitivity), which explains the reduced scalability in the throughput plot.
Need performance as a function of shared resources to understand multicore scalability.
Cache Pollution
• Applications put data in the cache but do not reuse it
  – No benefit, but no problem… unless you are sharing cache space
• Pollution wastes space other applications could benefit from
[Figure: Effects of cache pollution in a 4-process workload (bzip2, lbm, libquantum, gamess). Throughput when run alone, in the mixed workload, and with pollution eliminated; about 1/3 of the slowdown is due to cache pollution.]
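One way to eliminate this kind of pollution (mentioned later in the talk as changing polluting instructions to non-caching accesses) is to use non-temporal stores for write-once data. Below is a hedged sketch using x86 SSE2 intrinsics; it illustrates the mechanism only and is not the authors' automatic tool.

    #include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
    #include <stddef.h>

    /* Copy n 16-byte blocks without allocating the destination in the cache.
     * _mm_stream_si128 issues non-temporal (write-combining) stores, so data
     * that will not be reused does not evict other applications' cache lines.
     * 'dst' must be 16-byte aligned. */
    void copy_nontemporal(void *dst, const void *src, size_t n) {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n; i++) {
            __m128i v = _mm_loadu_si128(&s[i]);   /* normal (cached) load */
            _mm_stream_si128(&d[i], v);           /* non-temporal store   */
        }
        _mm_sfence();                             /* order the streaming stores */
    }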
Cache Pollution: What's Going On?
• Do all applications benefit from caching? Does an application benefit from using the shared cache?
[Figure: miss ratio as a function of cache size, from the private caches into the shared cache, for libquantum and lbm. An application whose miss ratio does not decrease with more shared cache gets no benefit from using the shared cache space; one whose miss ratio keeps decreasing gets significant benefit from it.]
• Not all applications benefit from caching; better performance if caching can be controlled.
Bandwidth Sharing
• 50% reduction in cache → 57% increase in bandwidth
[Figure: Predicting multicore scaling for 470.lbm with the Cache Pirate. CPI and bandwidth as functions of shared cache size (1M–7M); throughput and required bandwidth for 1–4 co-running instances (predicted vs. measured), sharing the memory controller queue and I/O. The measured system maximum is 10.4 GB/s.]
No performance loss with less cache space… until you run out of shared bandwidth:
4 cores @ 3 GB/s = 12 GB/s required, but the system maximum is 10.4 GB/s.
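A back-of-the-envelope model of that effect (illustrative only; it assumes each instance keeps needing about 3 GB/s, whereas the real prediction uses the Cache Pirate's bandwidth-vs-cache-size data):

    #include <stdio.h>

    int main(void) {
        double bw_per_instance = 3.0;   /* GB/s needed by one 470.lbm instance (slide) */
        double bw_system_max   = 10.4;  /* measured saturation bandwidth (GB/s)        */

        for (int n = 1; n <= 4; n++) {
            double required = n * bw_per_instance;
            /* Once the required bandwidth exceeds the system maximum, aggregate
             * throughput is capped at bw_system_max / bw_per_instance. */
            double throughput = required <= bw_system_max
                              ? (double)n
                              : bw_system_max / bw_per_instance;
            printf("%d instances: need %4.1f GB/s -> throughput ~%.2fx\n",
                   n, required, throughput);
        }
        return 0;
    }

With these numbers the workload scales to 3 instances and then flattens out near 3.5x, which is the shape seen in the measured throughput curve.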
Bandwidth Sharing: What is Going On?
• Shared resources:
  – Cache, bandwidth
  – But also: queues, row buffers, prefetchers, etc.
• Results in complex application sensitivities:
  – Bandwidth, capacity, latency, access pattern
• Complex interactions with shared bandwidth
Fig. 2. The memory hierarchy. The memory hierarchy can be analyzed with Equation 1 using the different degrees of parallelism and latency at each level in the hierarchy. The queues in the figure indicate that the available parallelism comes from multiple places within the hierarchy. The details of the hierarchy are discussed in Section IV.
(Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local level in the memory hierarchy (Section IV). With this background, we then introduce the concept of application sensitivity to bandwidth and latency (Section V) and present the details of the Bandwidth Bandit implementation (Section VI). The Bandit is then validated against a series of micro-benchmarks, which allow us to describe how to interpret its output (Section VII). Finally, we present the results of using the Bandwidth Bandit to investigate the memory contention sensitivity of a variety of SPEC2006 and PARSEC benchmarks (Section VIII).
II. BACKGROUND
A. Memory Hierarchy Organization
The memory hierarchy considered in this paper is shown in Figure 2. If a memory access cannot be serviced by the core's private caches (not shown in the figure), it is first sent to the shared L3 cache. If the requested data is not found in the L3 cache, it is sent to the integrated Memory Controller (MC). The MC has three independent memory channels over which it communicates with the DRAM modules. Each channel consists of an address bus and a 64-bit wide data bus. Memory requests are typically 64 bytes (one cache line), thereby requiring eight transfers over the data bus. Each DRAM module consists of several independent memory banks, which can be accessed in parallel, as long as there are no conflicts on the address and data buses. The combination of independent channels and memory banks provides for a large degree of available parallelism in the off-chip memory hierarchy.
The DRAM memory banks are organized into rows (also called pages) and columns. To address a word of data the MC has to specify the channel, bank, row, and column of the data. To read or write an address, the whole row is first copied into the bank's row buffer. This single-entry buffer (also known as a page cache) caches the row until a different row in the same bank is accessed.
On a read or write access three events can occur: a page-hit, when the accessed row is already in the row buffer and the data can be read/written directly; a page-empty, when the row buffer is empty and the accessed row has to be copied from the bank before it can be read/written¹; or a page-miss, when a row other than the one accessed is cached in the row buffer. In the case of a page-miss, the cached row has to first be written back to the memory bank before the newly accessed row is copied into the row buffer. These three events have different latencies, with a page-hit having the shortest latency and a page-miss having the longest.
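Because these three events have different latencies, the average DRAM latency an application sees depends on how often its accesses hit an open row. A tiny illustration with assumed latencies (the actual values depend on the DRAM timing parameters and are not taken from the paper):

    #include <stdio.h>

    int main(void) {
        /* Assumed, illustrative latencies in ns; not measured values. */
        double t_hit = 20.0, t_empty = 35.0, t_miss = 50.0;

        /* Average access latency for a row-buffer-friendly (mostly page-hit)
         * mix versus a row-buffer-hostile (mostly page-miss) mix. */
        double friendly = 0.8 * t_hit + 0.1 * t_empty + 0.1 * t_miss;
        double hostile  = 0.1 * t_hit + 0.1 * t_empty + 0.8 * t_miss;
        printf("friendly: %.1f ns, hostile: %.1f ns\n", friendly, hostile);
        return 0;
    }

This is one reason why two applications with the same bandwidth demand can have very different sensitivities to contention: contention changes the mix of page-hits and page-misses as well as the raw latency.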
B. Memory Hierarchy Performance
From a performance point of view the memory hierarchy can be described by two metrics: its latency and bandwidth. These two metrics are intimately related. Using Little's law [?], the average bandwidth achieved by an application can be expressed as follows:
bandwidth = (transfer size × MLP) / latency    (1)
where MLP is the application's average Memory Level Parallelism, that is, the average number of concurrent memory requests it has in-flight, and latency is the average time to complete the application's memory accesses. For example, consider an application with an MLP of one (e.g., a linked list traversal) and an access latency of 50 ns. Such an application would complete a transfer every 50 ns, on average. With a transfer size of 64 bytes, its memory bandwidth would be 1.3 GB/s. If the MLP was increased to two, but the access latency remained the same, the application would complete two transfers every 50 ns, and its bandwidth would double to 2.6 GB/s.
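A minimal numeric sketch of Equation 1, reproducing the 50 ns / 64-byte example above:

    #include <stdio.h>

    /* Little's law for memory bandwidth (Equation 1):
     * bandwidth = transfer_size * MLP / latency. With bytes and ns, the
     * result is in bytes/ns, i.e. GB/s. */
    static double bandwidth_gbps(double transfer_bytes, double mlp, double latency_ns) {
        return transfer_bytes * mlp / latency_ns;
    }

    int main(void) {
        printf("MLP = 1: %.1f GB/s\n", bandwidth_gbps(64, 1, 50));  /* ~1.3 GB/s */
        printf("MLP = 2: %.1f GB/s\n", bandwidth_gbps(64, 2, 50));  /* ~2.6 GB/s */
        return 0;
    }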
The above equation clearly illustrates that the bandwidth achieved by an application is determined by both its memory access latency and its memory parallelism. However, these parameters vary throughout the memory hierarchy. For example, at the bank level, the parallelism is limited by the number of banks. However, MCs typically queue requests to busy banks. From the higher-level perspective of the MC, the parallelism, or number of in-flight requests, will include the requests in these queues, and appear larger. The latency will also appear different, since the time spent in the queues has to be considered. The above equation will therefore have different values for latency and MLP depending on where it is applied in the memory hierarchy.
In this work we will use the above equation to understand how contention for shared memory resources impacts application performance. Such contention can occur throughout the memory hierarchy, and can both increase access latencies and reduce the memory parallelism. The resulting effect on application performance is a function of the application's sensitivity to increased latency and reduced memory parallelism.
III. EXPERIMENTAL SETUP
The experiments presented in this paper have been run on a quad-core Intel Xeon E5520 (Nehalem) machine. Its cache configuration is detailed in the following table:
¹ Page-empties occur when the MC preemptively closes a page that hasn't been accessed recently to optimistically turn a page-miss into a page-empty.
Slowdown at 90% of saturation bandwidth
Bandwidth Bandit: Understanding Memory Contention
Abstract—Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention.
While cache-sharing effects have been studied extensively of late, the effects of memory bandwidth sharing are not as well explored. This is in large part due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and the locality properties of the application's access stream.
This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which cause decreases in available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and explains why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.
I. INTRODUCTION
Cache capacity and memory bandwidth are critical shared resources in chip multi-processors. Sharing these resources between cores has many benefits, however, it also leads to contention, which can have dramatic, negative impacts on the co-running applications' performance. Understanding the impact of such contention is therefore essential to fully realize the computational power of these systems. This work investigates the effects of contention for off-chip memory (bandwidth) by first exploring where and why contention occurs in the memory hierarchy. We then use this knowledge to develop a method that enables us to measure an application's sensitivity to memory contention.
While contention effects have been extensively studied for cache capacity (e.g. [1]), less is known about contention for memory bandwidth. Two recent studies [2], [3], show that the performance impact from contention for off-chip memory resources can be significant. Their main results, with respect to memory bandwidth contention, are that applications with higher (un-contended) bandwidth demands both generate more memory contention and are generally more sensitive to contention from others.
The importance of application sensitivity to memory bandwidth contention is shown in Figure 1. This data shows that applications' sensitivities vary significantly and can be quite substantial, with slowdowns varying from 3% to 23% for these benchmarks. It is clear from the data in Figure 1 that applications with similar bandwidth demands (e.g. lbm and soplex) can exhibit widely varying slowdowns for the same degree of memory contention. The goal of this research is to further investigate the mechanisms behind bandwidth contention and its impact on applications' performance.
[Figure 1: slowdown (%) under memory contention and baseline bandwidth (GB/s) for lbm, streamcluster, soplex, and mcf.]
Fig. 1. Baseline bandwidth (no contention) and slowdown with memory contention (90% of saturation bandwidth). While all of these applications exhibit similar baseline bandwidth consumption, their sensitivities to memory contention vary widely. This variability demonstrates the importance of a more detailed understanding of the effects of memory contention. (For each benchmark we generated sufficient bandwidth contention to reach 90% of the system saturation bandwidth for the benchmark. This contention was generated using the Bandwidth Bandit technique discussed in Section VI.)
To measure application sensitivity to memory contention we propose the Bandwidth Bandit method. It works by co-running the application whose performance we want to measure (the Target) with a Bandit application that generates memory contention. By carefully controlling the amount of contention the Bandit generates we can measure the Target's performance as a function of the contention generated by the Bandit.
As we want to measure the Target's sensitivity to memory contention in isolation, it is important that the Bandit does not consume any shared resources other than the ones for which it generates contention. For example, if the Bandit uses large amounts of shared cache capacity, this might impact the Target's performance and distort the measurement. In addition, the Bandit must generate realistic contention traffic that appropriately targets the different parts of the memory hierarchy. To design such a Bandit we must first understand the sources of memory contention in detail.
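As an illustration of the idea only (not the paper's Bandit, whose actual implementation is described in Section VI), a minimal contention generator can chase pointers through a buffer far larger than the shared cache while throttling its own request rate:

    #include <stdlib.h>
    #include <stddef.h>

    /* Hypothetical, minimal contention generator: a pointer chase over a buffer
     * much larger than the shared L3, so essentially every access goes to DRAM,
     * while touching only one word per cache line to keep the generator's own
     * cache footprint small. 'delay_iters' throttles the request rate, and thus
     * the amount of contention. A real Bandit would also randomize the chain to
     * defeat prefetching, control its MLP, and spread accesses over channels
     * and banks. */
    void bandit(size_t lines, long accesses, unsigned delay_iters) {
        size_t stride = 64 / sizeof(size_t);          /* one slot per 64-byte line */
        size_t *buf = malloc(lines * 64);
        for (size_t i = 0; i < lines; i++)            /* simple cyclic chain       */
            buf[i * stride] = ((i + 1) % lines) * stride;

        volatile size_t idx = 0;
        for (long a = 0; a < accesses; a++) {
            idx = buf[idx];                           /* roughly one DRAM access   */
            for (volatile unsigned d = 0; d < delay_iters; d++)
                ;                                     /* throttle the access rate  */
        }
        free(buf);
    }

Co-running such a generator next to a Target while sweeping delay_iters gives the Target's performance as a function of the contention level, which is the shape of measurement the Bandit method produces.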
This research makes the following contributions:
• We identify the main bottlenecks that can arise due to memory contention and measure their impact on both latency and bandwidth. We classify these bottlenecks into two categories: bottlenecks that limit bandwidth and bottlenecks that increase latency.
• We present the Bandwidth Bandit method. To our knowledge, this is the first method for measuring the performance impact of memory contention on real hardware.
• We describe how the Bandit data can be analyzed to understand an application's bandwidth and latency sensitivity.
• We use the Bandwidth Bandit method to analyze the contention sensitivity of a set of applications from the SPEC2006 and PARSEC benchmark suites.
To present this material, we begin with an overview of the memory hierarchy (Section II-A) and a discussion of the relationship between memory level parallelism and latency (Section II-B). We then describe our experimental setup
Sensitivity Critical for Scaling
• Shared resource sensitivity → understanding scaling
• Sensitivity varies from application to application
• Impact varies from platform to platform
[Figure: for nine SPEC2006 benchmarks (435.gromacs, 401.bzip2, 429.mcf, 454.calculix, 482.sphinx3, 470.lbm, 462.libquantum, 450.soplex, 403.gcc): miss ratio, fetch ratio, performance (CPI, lower is better), and bandwidth (GB/s) as functions of cache size (1M–7M).]
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware prefetching enabled.
Insensitive, sensitive, insensitive: sensitivity differs widely across these applications. And all of this holds only as long as you have enough bandwidth.
Aside: Implementation (Tools)
• Performance as a function of cache size (Cache Pirating)
  – A "Pirate" thread steals shared cache and measures the Target's performance
  – Monitor the Pirate's miss rate (to verify it actually holds the stolen cache)
  – (Similar approach for bandwidth)
  – Captures performance data for all cache sizes in one run
  – Average Target slowdown: 5% (simulation slowdown: 100x–1000x)
  – A minimal sketch of the pirating idea follows below.
  D. Eklöv, N. Nikoleris, D. Black-Schaffer, E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. ICPP 2011.
• Reducing cache pollution
  – Sample memory access behavior (reuse distances)
  – Model misses for different cache sizes
  – Identify polluting instructions and change them to non-caching accesses
  – 5% overhead, fully automatic
  A. Sandberg, D. Eklöv, E. Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. SC 2010.
[Diagram: over time, the Pirate grows its share of the 8 MB shared cache from 0 MB upward while the Target's performance is measured at each size.]
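As an illustration of the pirating idea only (not the Cache Pirating tool from the ICPP 2011 paper), the core mechanism is a co-running thread that keeps a working set of a chosen size hot in the shared cache while the Target is measured:

    #include <stdlib.h>
    #include <stddef.h>

    /* Hypothetical, minimal "pirate": repeatedly sweep a working set of 'bytes'
     * so that it stays resident in the shared cache, leaving the Target roughly
     * (shared cache size - bytes) of cache. The real tool additionally pins
     * threads to cores, monitors the pirate's own miss ratio to verify it really
     * holds the stolen cache, and samples the Target's performance counters. */
    void pirate(size_t bytes, long sweeps) {
        volatile char *buf = malloc(bytes);
        for (long s = 0; s < sweeps; s++)
            for (size_t i = 0; i < bytes; i += 64)    /* one touch per cache line */
                buf[i]++;                             /* keep the line resident   */
        free((void *)buf);
    }

Sweeping 'bytes' from 0 up to the full shared cache size, and recording the Target's CPI at each point, yields performance as a function of cache size, which is the data behind the CPI curves shown earlier.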
WRAP UP What is all of this good for?
Why Is This Important?
• Performance and scaling are impacted by resource sharing
• How can we use this information?
  – Predict and understand performance
  – Workload performance
  – Useful for schedulers (static and dynamic), optimization, compilation
• Need new modeling technology (ongoing research)
  – Can model workload cache behavior for simple cores (StatCC)
    • Not good enough for modern machines: no MLP, prefetchers, OoO, etc.
  – Developing new techniques for modeling real hardware
The Future
• More threads (100s of cores, 1000s of threads)
• More complex hierarchies (NoCs, distributed caches)
• Heterogeneous (CPU + GPU + accelerators)
• More focus on energy (per-core DVFS, dark silicon)
[Die diagrams: AMD Fusion (CPU cores, GPU cores, shared resources, I/O) and Intel integrated graphics with Turbo Boost.]
Resource Sharing in Multicores
• Important
  – Significant impact on performance (and efficiency) scaling
  – Basic data lets us understand interactions
• Complex
  – Sensitivity varies across applications
  – Hard to understand the hardware
• Moving forward
  – Research on predicting performance and understanding scaling
  – MS thesis projects to port our tools / analyze your code
  – We're open (and eager) for collaboration!
Uppsala Programming for Multicore Architectures Research Center
QUESTIONS?
APPLICATION PHASES Average behavior is not good enough
Application Phases
A. Sembrant, D. Eklöv, E. Hagersten. Efficient Software-based Online Phase Classification. IISWC 2011.
[Figures: measured CPI over time and average CPI over time for bzip2/chicken (about 160 billion instructions). Average CPI: fast, but inaccurate. Periodic CPI profiling: nearly as fast as the average, still inaccurate. Phase-guided profiling: 1.7% additional overhead over periodic profiling, good accuracy. Conclusion: fast phase detection is needed. (From: Efficient Software-based Online Phase Classification, Andreas Sembrant, Uppsala University.)]
[Figures repeated from earlier: average CPI and average bandwidth as functions of cache size for 470.lbm, and its predicted multicore scaling.]
Everything so far has been average application behavior.
Q: Is that good enough? Is application behavior constant?
A: No. Applications have distinct phases with varying behaviors.
[Figure: branch mispredictions, cycles per instruction, and basic block vectors over execution intervals, with phases labeled A, B, C, D, E.]
• Program behavior is correlated with what code is executed.
• Changes in executed code reflect changes in other metrics.
(From: Efficient Software-based Online Phase Classification, Andreas Sembrant, Uppsala University. A minimal sketch of interval-based phase classification follows below.)
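As an illustration only (not the ScarPhase implementation), online phase classification can be sketched as leader-follower clustering of per-interval code-signature vectors: each interval's vector is compared against known phase signatures and either assigned to the closest phase or used to start a new one. The bucket count and threshold below are assumed values:

    #include <math.h>
    #include <string.h>

    #define NBUCKETS   64    /* size of the (hashed) per-interval code vector */
    #define MAX_PHASES 32
    #define THRESHOLD  0.3   /* new-phase distance threshold (assumed value)  */

    static double signatures[MAX_PHASES][NBUCKETS];
    static int nphases = 0;

    /* Manhattan distance between two normalized execution-frequency vectors. */
    static double distance(const double *a, const double *b) {
        double d = 0.0;
        for (int i = 0; i < NBUCKETS; i++)
            d += fabs(a[i] - b[i]);
        return d;
    }

    /* Classify one interval: return the id of the closest known phase, or start
     * a new phase if none is within THRESHOLD. (ScarPhase builds these vectors
     * online from sampled code execution; here the vector is assumed given.)
     * Returns -1 if the phase table is full. */
    int classify_interval(const double vec[NBUCKETS]) {
        int best = -1;
        double best_d = THRESHOLD;
        for (int p = 0; p < nphases; p++) {
            double d = distance(vec, signatures[p]);
            if (d < best_d) { best_d = d; best = p; }
        }
        if (best < 0 && nphases < MAX_PHASES) {
            best = nphases++;
            memcpy(signatures[best], vec, sizeof signatures[best]);
        }
        return best;
    }

Per-phase statistics (such as the cache miss ratio on the next slide) can then be accumulated by phase id instead of being averaged over the whole run.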
Cache Miss Ratio Per Phase
A. Sembrant, D. Black-Schaffer, E. Hagersten. Phase Guided Profiling for Fast Cache Modeling. CGO 2012.
30% overhead
SHARING IN MORE DETAIL: A brief look at application sensitivity
More Detail: Cache Sharing
• Three SPEC benchmark applications
• Different behaviors due to different properties
• Each will respond differently to cache sharing
[Figure: CPI (performance, lower is better) and bandwidth (GB/s) as functions of cache size for three of the Fig. 8 benchmarks, annotated per benchmark (Perf, BW, Size, Sensitivity):]
• 0% slower; 50x bandwidth increase; full working set fits in the cache; insensitive to latency
• 50% slower; 10x bandwidth increase; most of the working set fits in the cache; sensitive to latency
• 0% slower; 2x bandwidth increase; little of the working set fits in the cache; insensitive to latency

Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware prefetching enabled.

482.sphinx3 behaves quite differently from 435.gromacs. As its cache size is decreased its CPI increases by 50%, while its miss ratio and bandwidth increase by a factor of 20x. The fetch ratio and miss ratio curves are slightly different, indicating that there is a small amount of prefetching. However, the significant performance decrease despite the increased bandwidth indicates that the benchmark is more sensitive to increased memory latency.

470.lbm shows an 8x difference between its fetch and miss ratios, indicating 8 prefetched memory accesses for each miss. However, the relative increase in miss ratio is still roughly 2x. This indicates that 470.lbm is also relatively insensitive to the increased latency.
More Detail: The Impact of Prefetching
• Different degrees of prefetching depending on application access pattern and hardware
• Prefetching reduces application sensitivity to latency
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
435.gromacs
BW
CP
I
401.bzip2 429.mcf
F/M
454.calculix
BW
CP
I
482.sphinx3 470.lbm
F/M
462.libquantum
BW
CP
I
cache size
450.soplex
cache size
403.gcc
cache size
0.0%
0.1%
0.2%
0.3%
0.0
0.1
0.2
0.0
0.4
0.8
0%
0.3%
0.6%
0.9%
0.0
0.3
0.6
0.0
0.5
1.0
0%
4%
8%
12%
16%
0
1
2
3
0
2
4
0.00%
0.06%
0.12%
0.00
0.03
0.06
0.0
0.5
1.0
0%
2%
4%
0
1
2
0.0
0.5
1.0
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
0%
10%
20%
0
2
4
0.0
0.4
0.8
1M 3M 5M 7M
0%
6%
12%
0
1
2
3
0.0
0.5
1.0
1.5
1M 3M 5M 7M
0%
2%
4%
0.0
0.6
1.2
1.8
0.0
0.6
1.2
1M 3M 5M 7M
Miss Ratio Fetch Ratio CPI Bandwidth
Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware
prefetching enabled. .
482.sphinx3 behaves quite differently from 435.gromacs.
As its cache size is decreased its CPI increases by 50%,
while its miss ratio and bandwidth increase by a factor
of 20×. The fetch ratio and miss ratio curves are slightly
different indicating that there is a small amount of prefetching.
However, the significant performance decrease despite the
increased bandwidth indicates that the benchmark is more
sensitive to increased memory latency.
470.lbm shows an 8× difference between its fetch and miss
ratios, indicating 8 prefetched memory accesses for each miss.
However, the relative increase in miss ratio is still roughly 2×.
This indicates that 470.lbm is also relatively insensitive to the
increased latency. The data in Figure 9 show the performance
of 470.lbm with hardware prefetching disabled. This reduces
bandwidth by a third and increases CPI at all cache sizes.
Furthermore, the CPI is now no longer constant with varying
!"#
F/M
470.lbm
BW
CPI
cache size
0%
3%
6%
0
1
2
3
0.0
0.8
1.6
1M 3M 5M 7M
Miss RatioFetch Ratio
CPIBandwidth
Fig. 9. Performance data for 470.lbm with hardware prefetching disabled.(Fetch ratio and miss ratio are identical.)
cache size, clearly showing that prefetching was helping tocompensate for the reduced cache space. This demonstratesthat 470.lbm not only heavily leverages hardware prefetching,but also heavily benefits from it.
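To make the fetch/miss relationship concrete, the sketch below shows how a prefetch fraction could be derived from raw counter values. The counter names and numbers are hypothetical placeholders, not the specific events or measurements used here; they only illustrate that lines fetched but never demanded must have been brought in by the prefetcher.

```c
#include <stdio.h>

/* Hypothetical raw counts read from hardware performance counters. */
struct mem_counters {
    unsigned long long mem_accesses;   /* demand loads/stores reaching the cache */
    unsigned long long demand_misses;  /* demand accesses missing the last-level cache */
    unsigned long long lines_fetched;  /* all lines brought in from DRAM (demand + prefetch) */
};

int main(void)
{
    /* Example numbers chosen to mimic 470.lbm's roughly 8x fetch/miss gap. */
    struct mem_counters c = { 1000000000ULL, 7500000ULL, 60000000ULL };

    double miss_ratio  = (double)c.demand_misses / c.mem_accesses;
    double fetch_ratio = (double)c.lines_fetched / c.mem_accesses;
    /* Lines fetched but not demanded were brought in by the prefetcher. */
    double prefetch_fraction = 1.0 - (double)c.demand_misses / c.lines_fetched;

    printf("miss ratio:  %.2f%%\n", 100.0 * miss_ratio);
    printf("fetch ratio: %.2f%%\n", 100.0 * fetch_ratio);
    printf("prefetched share of fetches: %.1f%%\n", 100.0 * prefetch_fraction);
    return 0;
}
```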
V. RELATED WORK
The standard approach to measuring performance as a function of cache size is to use architectural simulators (e.g., [11], [16]) or performance models (e.g., [14], [18]). For performance analysis, the main limitation of simulation is the long execution time, which is typically orders of magnitude slower than native execution. To address this, a substantial amount of work has been done to improve simulation performance [8], [12], [17], in particular by trading off accuracy and detail for improved speed through the use of analytical models [14], [18]. However, despite these advances, the overhead of simulation and modeling is still a limitation for performance analysis of real applications with large data sets.
Xu et al. [4] presented a model for predicting the performance degradation of co-running applications contending for shared cache. As part of their work they co-run a stress application with each application to measure its performance as a function of miss ratio. For our purpose of measuring performance as a function of cache size, their approach has two limitations: 1) Instead of ensuring that their stress application steals a fixed amount of cache, they let it freely contend for space with the Target, and then estimate the average amount of cache stolen after the fact. Such an average is hard to correlate to one cache size and is sensitive to program phases. 2) As their micro benchmark freely contends for cache with the Target, they cannot limit how much off-chip bandwidth it consumes, which can distort performance measurements (see footnote 5).
Footnote 5: We implemented and used their method to measure the CPI of the simple sequential micro benchmark in Figure 4. When we tried to steal 4MB of cache, their stress application consumed sufficient off-chip bandwidth to increase the measured CPI by 37%.
Doucette and Fedorova [5] also co-run a set of micro benchmarks, called base vectors, with a Target application to determine the impact on the target. Their base vector for measuring shared cache contention has a similar sequential access pattern to that of the Cache Pirate, but has its working set size fixed to that of the shared cache. This provides a single cache "sensitivity" measure for the Target application and does not relate the performance of the Target to the amount of cache available, nor to specific performance measurements such as CPI or off-chip bandwidth.
Cakarevic et al. [19] use a similar set of micro benchmarks as Doucette and Fedorova, but their work focuses on hardware characterization. They co-run their micro benchmarks, stressing different shared resources, and investigate how the micro benchmarks impact each other. This allows them to identify and characterize the critical shared hardware resources, but not the behavior of other applications.
VI. CONCLUSION
We have presented a general technique for quickly and accurately measuring application performance characteristics as a function of the available shared cache space. By leveraging a co-executing Pirate application that steals shared cache space and hardware performance counters, this approach works on standard hardware, accurately reflects the idiosyncrasies of the system, has a low overhead, and requires no modifications to either the operating system or target applications.
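As a rough illustration of the idea, a pirate-style loop can be sketched as below: it repeatedly walks a working set sized to the amount of shared cache it should occupy, so that space is unavailable to the Target. This is a minimal sketch under stated assumptions (fixed size, no performance-counter feedback, no thread pinning), not the actual Pirate implementation, which monitors counters and adjusts its size at runtime.

```c
#include <stdlib.h>

#define CACHE_LINE 64

/* Walk a working set of `steal_bytes` over and over so it stays resident in
 * the shared last-level cache, leaving (LLC size - steal_bytes) for the
 * co-running Target. After the first pass the set hits in the LLC, so the
 * loop consumes very little off-chip bandwidth. Illustrative sketch only. */
static void pirate_loop(size_t steal_bytes)
{
    volatile char *buf = malloc(steal_bytes);
    if (!buf)
        return;

    for (;;) {                                   /* run until killed */
        for (size_t i = 0; i < steal_bytes; i += CACHE_LINE)
            (void)buf[i];                        /* touch one byte per cache line */
    }
}

int main(void)
{
    pirate_loop(4UL * 1024 * 1024);              /* e.g. steal ~4MB of an 8MB LLC */
    return 0;
}
```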
To evaluate our technique we identified and investigated four key issues: 1) Does the cache available to the Target behave as expected? 2) How much cache can the Pirate steal? 3) How many Pirate threads can we run before we impact the Target's performance? And, 4) Can we minimize the overhead by dynamically varying the Pirate's size? For the first question we used a trace-driven cache simulator to show that the cache space available to the Target application behaved as expected, with average and maximum absolute fetch ratio errors of 0.2% and 2.7%, respectively. We introduced a method to determine when the Pirate can execute two threads to increase its fetch rate without impacting the Target's performance, and demonstrated that we were able to steal an average of 6.7MB of the 8MB cache, and up to an average of 6.8MB with a small decrease in Target execution rate. Finally, by dynamically adjusting the amount of cache the Pirate steals during execution, we were able to reduce the total overhead to 5.5%, with an average CPI error of only 0.5%.
While we motivated this general performance characterization approach with an example of how it can be used to explain throughput scaling, we believe that the data captured by Cache Pirating is fundamentally important for the understanding and optimization of applications executing in shared resource environments. Future work includes extending this approach to collect performance data against other shared resources and exploring the range of information available from the ever increasing variety of performance counters.
!"#
• Significant prefetching, and it benefits from it • Minimal Prefetching
• No Prefetching
Prefetching Disabled
Difference between fetches and misses is prefetch rate.
Uppsala University / Department of Information Technology 28/11/2011 | 30 David Black-Schaffer
More Detail: Cache Pollution
• We can measure how greedy an application is and how sensitive it is
• By changing the code to use non-caching instructions we can make an application less greedy without hurting performance (see the sketch below)
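As an illustration of the non-caching idea (not the specific code transformation evaluated in this work), x86 provides non-temporal stores that write around the cache hierarchy. A streaming write that will not be reused soon can use _mm_stream_pd so it stops evicting other data from the shared cache:

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_pd, _mm_stream_pd */
#include <stddef.h>

/* Copy that bypasses the cache for its stores: useful when `dst` will not be
 * read again soon, so caching it would only evict other useful data.
 * Assumes dst is 16-byte aligned (required by the non-temporal store). */
void copy_nocache(double *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);
        _mm_stream_pd(&dst[i], v);   /* non-temporal store: goes around the caches */
    }
    for (; i < n; i++)               /* scalar tail for odd n */
        dst[i] = src[i];
    _mm_sfence();                    /* make the streamed stores globally visible */
}
```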
Motivation: Do all applications benefit from caching?
[Chart: miss ratio vs. cache size (private through shared) for libquantum and lbm]
Not all applications benefit from caching. Better performance if caching can be controlled.
Increased Performance of Mixed Workloads
[Chart: mixed workload of 401.bzip2, 470.lbm, 462.libq., and 416.gamess]
Not all applications are affected by cache contention. Even cache gobblers get increased performance due to shared bandwidth.
Uppsala University / Department of Information Technology 28/11/2011 | 31 David Black-Schaffer
More Detail: Bandwidth Sharing
• Sensitivity is a function of the application
– Latency sensitivity (memory level parallelism)
– Bandwidth requirement (data rate)
• And the hardware
– Ability to handle out-of-order requests (queue sizes)
– Access pattern costs (streaming vs. random in DRAM banks)
• BW consumption is not a good indicator of BW sensitivity
Bandwidth Bandit: Understanding Memory Contention
Abstract—Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention.
While cache-sharing effects have been studied extensively of late, the effects of memory bandwidth sharing are not as well explored. This is in large part due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and the locality properties of the application's access stream.
This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which cause decreases in available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and explains why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.
I. INTRODUCTION
Cache capacity and memory bandwidth are critical shared
resources in chip multi-processors. Sharing these resources
between cores has many benefits; however, it also leads to
contention, which can have dramatic, negative impacts on the
co-running applications’ performance. Understanding the im-
pact of such contention is therefore essential to fully realize the
computational power of these systems. This work investigates
the effects of contention for off-chip memory (bandwidth)
by first exploring where and why contention occurs in the
memory hierarchy. We then use this knowledge to develop a
method that enables us to measure an application’s sensitivity
to memory contention.
While contention effects have been extensively studied for
cache capacity (e.g. [1]), less is known about contention
for memory bandwidth. Two recent studies [2], [3] show
that the performance impact from contention for off-chip
memory resources can be significant. Their main results, with
respect to memory bandwidth contention, are that applications
with higher (un-contended) bandwidth demands both generate
more memory contention and are generally more sensitive to
contention from others.
The importance of application sensitivity to memory band-
width contention is shown in Figure 1. This data shows
that applications’ sensitivities vary significantly and can be
quite substantial, with slowdowns varying from 3% to 23% for these benchmarks. It is clear from the data in Figure 1
that applications with similar bandwidth demands (e.g. lbm
and soplex) can exhibit widely varying slowdowns for the
same degree of memory contention. The goal of this research
is to further investigate the mechanisms behind bandwidth
contention and its impact on applications’ performance.
Fig. 1. Baseline bandwidth (no contention) and slowdown with memory contention (90% of saturation bandwidth) for lbm, streamcluster, soplex, and mcf. While all of these applications exhibit similar baseline bandwidth consumption, their sensitivities to memory contention vary widely. This variability demonstrates the importance of a more detailed understanding of the effects of memory contention. (For each benchmark we generated sufficient bandwidth contention to reach 90% of the system saturation bandwidth for the benchmark. This contention was generated using the Bandwidth Bandit technique discussed in Section VI.)
To measure application sensitivity to memory contention we
propose the Bandwidth Bandit method. It works by co-running
the application whose performance we want to measure (the
Target) with a Bandit application that generates memory
contention. By carefully controlling the amount of contention
the Bandit generates we can measure the Target’s performance
as a function of the contention generated by the Bandit.
As we want to measure the Target’s sensitivity to memory
contention in isolation, it is important that the Bandit does
not consume any shared resources other than the ones for
which it generates contention. For example, if the Bandit uses
large amounts of shared cache capacity, this might impact
the Target’s performance and distort the measurement. In
addition, the Bandit must generate realistic contention traffic
that appropriately targets the different parts of the memory
hierarchy. To design such a Bandit we must first understand
the sources of memory contention in detail.
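As a first intuition for what such a traffic generator can look like, the sketch below chases a randomized pointer chain that is far larger than the shared cache, with a tunable delay to throttle how much contention it generates. This is a simplified, hypothetical illustration, not the Bandit described in Section VI, which must additionally avoid consuming shared cache capacity and generate realistic traffic that targets specific parts of the memory hierarchy.

```c
#include <stdlib.h>

/* One node per 64-byte cache line so each step touches a new line. */
struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

/* Link the nodes in a random permutation so the hardware prefetcher cannot
 * follow the chain and (almost) every step misses in the caches. */
static struct node *build_chain(size_t n)
{
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *perm = malloc(n * sizeof *perm);
    if (!nodes || !perm)
        exit(1);
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        nodes[perm[i]].next = &nodes[perm[(i + 1) % n]];
    struct node *start = &nodes[perm[0]];
    free(perm);
    return start;
}

/* Generate memory contention: roughly one DRAM access per step. The `delay`
 * knob throttles the request rate and hence the contention intensity. */
static void bandit(struct node *chain, unsigned delay)
{
    struct node *volatile p = chain;                  /* volatile: keep the chase */
    for (;;) {
        p = p->next;
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                         /* spin between requests */
    }
}

int main(void)
{
    /* 4M nodes x 64B = 256MB, far larger than the shared last-level cache. */
    bandit(build_chain(4UL * 1024 * 1024), 0);
    return 0;
}
```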
This research makes the following contributions:
• We identify the main bottlenecks that can arise due to
memory contention and measure their impact on both
latency and bandwidth. We classify these bottlenecks
into two categories: bottlenecks that limit bandwidth and
bottlenecks that increase latency.
• We present the Bandwidth Bandit method. To our knowl-
edge, this is the first method for measuring the perfor-
mance impact of memory contention on real hardware.
• We describe how the Bandit data can be analyzed to
understand an application’s bandwidth and latency sensi-
tivity.
• We use the Bandwidth Bandit method to analyze the
contention sensitivity of a set of applications from the
SPEC2006 and PARSEC benchmark suites.
To present this material, we begin with an overview of
the memory hierarchy (Section II-A) and a discussion of
the relationship between memory level parallelism and la-
tency (Section II-B). We then describe our experimental setup
Slowdown at 90% of saturation bandwidth
Fig. 2. The memory hierarchy. The memory hierarchy can be analyzed with Equation 1 using the different degrees of parallelism and latency at each level in the hierarchy. The queues in the figure indicate that the available parallelism comes from multiple places within the hierarchy. The details of the hierarchy are discussed in Section IV.
(Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local level in the memory hierarchy (Section IV). With this background, we then introduce the concept of application sensitivity to bandwidth and latency (Section V) and present the details of the Bandwidth Bandit implementation (Section VI). The Bandit is then validated against a series of micro-benchmarks, which allow us to describe how to interpret its output (Section VII). Finally, we present the results of using the Bandwidth Bandit to investigate the memory contention sensitivity of a variety of SPEC2006 and PARSEC benchmarks (Section VIII).
II. BACKGROUND
A. Memory Hierarchy Organization
The memory hierarchy considered in this paper is shown in Figure 2. If a memory access cannot be serviced by the core's private caches (not shown in the figure), it is first sent to the shared L3 cache. If the requested data is not found in the L3 cache, the request is sent to the integrated Memory Controller (MC). The MC has three independent memory channels over which it communicates with the DRAM modules. Each channel consists of an address bus and a 64-bit wide data bus. Memory requests are typically 64 bytes (one cache line), thereby requiring eight transfers over the data bus. Each DRAM module consists of several independent memory banks, which can be accessed in parallel, as long as there are no conflicts on the address and data buses. The combination of independent channels and memory banks provides for a large degree of available parallelism in the off-chip memory hierarchy.
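As a quick sanity check on these numbers, the arithmetic below computes the transfers needed per request and the resulting peak bandwidth; the DDR3-1066 transfer rate is an assumed figure for this class of machine, not one taken from the text.

```c
#include <stdio.h>

int main(void)
{
    const unsigned line_bytes = 64;      /* one cache line per memory request */
    const unsigned bus_bytes  = 8;       /* 64-bit data bus per channel */
    const unsigned channels   = 3;       /* independent channels on the MC */
    const double   mt_per_sec = 1066e6;  /* assumed DDR3-1066 transfer rate */

    printf("transfers per cache line: %u\n", line_bytes / bus_bytes);
    printf("peak bandwidth per channel: %.1f GB/s\n", bus_bytes * mt_per_sec / 1e9);
    printf("peak bandwidth, %u channels: %.1f GB/s\n",
           channels, channels * bus_bytes * mt_per_sec / 1e9);
    return 0;
}
```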
The DRAM memory banks are organized into rows (also called pages) and columns. To address a word of data the MC has to specify the channel, bank, row and column of the data. To read or write an address, the whole row is first copied into the bank's row buffer. This single-entry buffer (also known as a page cache) caches the row until a different row in the same bank is accessed.
On a read or write access three events can occur: a page-hit, when the accessed row is already in the row buffer and the data can be read/written directly; a page-empty, when the row buffer is empty and the accessed row has to be copied from the bank before it can be read/written (see footnote 1); or a page-miss, when a row other than the one accessed is cached in the row buffer. In the case of a page-miss, the cached row has to first be written back to the memory bank before the newly accessed row is copied into the row buffer. These three events have different latencies, with a page-hit having the shortest latency and a page-miss having the longest.
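The three outcomes can be expressed as a small state update on one bank's row buffer. The sketch below only preserves the ordering page-hit < page-empty < page-miss; the cycle counts are illustrative placeholders, not timings from this system.

```c
#include <stdio.h>

enum outcome { PAGE_HIT, PAGE_EMPTY, PAGE_MISS };

/* One DRAM bank's row buffer: either empty or caching exactly one row. */
struct bank { int has_row; unsigned row; };

/* Classify an access to `row` and update the row buffer accordingly. */
static enum outcome access_bank(struct bank *b, unsigned row)
{
    if (b->has_row && b->row == row)
        return PAGE_HIT;                    /* row already in the row buffer */
    enum outcome o = b->has_row ? PAGE_MISS /* old row must be written back first */
                                : PAGE_EMPTY; /* just copy the new row in */
    b->has_row = 1;
    b->row = row;
    return o;
}

int main(void)
{
    static const char *name[]      = { "page-hit", "page-empty", "page-miss" };
    static const unsigned cycles[] = { 15, 30, 45 };  /* placeholder latencies */
    struct bank b = { 0, 0 };
    unsigned rows[] = { 3, 3, 7, 3 };   /* empty, then a hit, then two misses */

    for (int i = 0; i < 4; i++) {
        enum outcome o = access_bank(&b, rows[i]);
        printf("access row %u -> %s, ~%u cycles\n", rows[i], name[o], cycles[o]);
    }
    return 0;
}
```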
B. Memory Hierarchy Performance
From a performance point of view the memory hierarchy can be described by two metrics: its latency and bandwidth. These two metrics are intimately related. Using Little's law [?], the average bandwidth achieved by an application can be expressed as follows:
bandwidth = (transfer size × MLP) / latency        (1)
where MLP is the application's average Memory Level Parallelism, that is, the average number of concurrent memory requests it has in flight, and latency is the average time to complete the application's memory accesses. For example, consider an application with an MLP of one (e.g., a linked list traversal) and an access latency of 50ns. Such an application would complete a transfer every 50ns, on average. With a transfer size of 64 bytes, its memory bandwidth would be 1.3GB/s. If the MLP was increased to two, but the access latency remained the same, the application would complete two transfers every 50ns, and its bandwidth would double to 2.6GB/s.
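The two figures in this example follow directly from Equation 1; the snippet below simply reproduces the arithmetic (bytes per nanosecond equal GB/s) for MLP values of one and two.

```c
#include <stdio.h>

int main(void)
{
    const double transfer_size = 64.0;   /* bytes per transfer (one cache line) */
    const double latency_ns    = 50.0;   /* average memory access latency */

    for (int mlp = 1; mlp <= 2; mlp++) {
        /* Equation 1: bandwidth = transfer_size * MLP / latency */
        double bytes_per_ns = transfer_size * mlp / latency_ns;
        printf("MLP = %d -> %.1f GB/s\n", mlp, bytes_per_ns); /* bytes/ns == GB/s */
    }
    return 0;
}
```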
The above equation clearly illustrates that the bandwidth achieved by an application is determined by both its memory access latency and its memory parallelism. However, these parameters vary throughout the memory hierarchy. For example, at the bank level, the parallelism is limited by the number of banks. However, MCs typically queue requests to busy banks. From the higher-level perspective of the MC, the parallelism, or number of in-flight requests, will include the requests in these queues, and appear larger. The latency will also appear different, since the time spent in the queues has to be considered. The above equation will therefore have different values for latency and MLP depending on where it is applied in the memory hierarchy.
In this work we will use the above equation to understand how contention for shared memory resources impacts application performance. Such contention can occur throughout the memory hierarchy, and can both increase access latencies and reduce memory parallelism. The resulting effect on application performance is a function of the application's sensitivity to increased latency and reduced memory parallelism.
III. EXPERIMENTAL SETUP
The experiments presented in this paper have been run on a quad-core Intel Xeon E5520 (Nehalem) machine. Its cache configuration is detailed in the following table:
Footnote 1: Page-empties occur when the MC preemptively closes a page that hasn't been accessed recently to optimistically turn a page-miss into a page-empty.