
Page 1: Resource Sharing for Multicore Processors

Resource Sharing in Multicore Processors

David Black-Schaffer, Assistant Professor, Department of Information Technology
Uppsala University

Special thanks to the Uppsala Architecture Research Team: David Eklöv, Prof. Erik Hagersten, Nikos Nikoleris, Andreas Sandberg, Andreas Sembrant

Uppsala Programming for Multicore Architectures Research Center

Page 2: Resource Sharing for Multicore Processors

Resource Sharing in Multicores

• Most people focus on the core count – parallelism, synchronization, deadlock, etc.
• I am going to focus on the shared memory system – cache, bandwidth, prefetching, etc.
• The shared memory system significantly impacts performance, efficiency, and scalability.

[Figure: Nehalem and Sandy Bridge die photos, each showing the cores, the shared L3 cache, and the memory controller/I/O.]

Page 3: Resource Sharing for Multicore Processors

Goals For Today

1. Overview of shared memory system resources
2. Impact on application performance
3. Importance of application sensitivity
4. Research tools and directions

Page 4: Resource Sharing for Multicore Processors

SHARED RESOURCES
Not just cache and bandwidth…

Page 5: Resource Sharing for Multicore Processors

Multicore Memory Systems: Intel Nehalem Memory Hierarchy (3 GHz)

D. Molka et al., Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System, PACT 2009.

[Figure: four cores with private L1/L2 caches, a shared L3, and DRAM, plotted against a latency axis running from 10 to roughly 190 cycles.]

• Latency to DRAM: 200 cycles
• Latency to private L1: 4 cycles
• Bandwidth to DRAM: 4-8 bytes/cycle total, 1-2 bytes/cycle per core
• Bandwidth to private L1: 15 bytes/cycle per core
• The off-chip bandwidth and the last-level cache are shared (converted to absolute units in the sketch below)
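To make the per-cycle numbers concrete, here is a minimal sketch that converts them into nanoseconds and GB/s at the 3 GHz clock quoted in the slide title; the figures are the slide's, only the conversion is added here.

```c
#include <stdio.h>

/* Minimal sketch: convert the slide's per-cycle latency and bandwidth
 * numbers into absolute units at the quoted 3 GHz clock. */
int main(void) {
    const double ghz = 3.0;  /* cycles per nanosecond */
    printf("DRAM latency:   200 cycles   = %.0f ns\n", 200.0 / ghz);
    printf("L1 latency:       4 cycles   = %.1f ns\n", 4.0 / ghz);
    printf("DRAM bandwidth: 4-8 B/cycle  = %.0f-%.0f GB/s total\n", 4 * ghz, 8 * ghz);
    printf("Per-core DRAM:  1-2 B/cycle  = %.0f-%.0f GB/s\n", 1 * ghz, 2 * ghz);
    printf("Per-core L1:    15 B/cycle   = %.0f GB/s\n", 15 * ghz);
    return 0;
}
```

The roughly 50x latency gap and order-of-magnitude bandwidth gap between the private L1 and DRAM are why the shared levels of the hierarchy dominate performance.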

Page 6: Resource Sharing for Multicore Processors

Other Shared Resources

• SMT (Hyper-Threading) – even the "private" caches are shared between hardware threads:
  • Instruction cache
  • L1/L2 data caches
  – Plus branch predictors, TLBs, and instruction decode
• AMD's Bulldozer – the two cores in a module share:
  – Floating-point units
  – Instruction fetch/decode
  – L2 cache
  – Branch predictor

[Figure: eight cores against the latency axis with their shared structures highlighted, next to a Bulldozer module diagram.]

Page 7: Resource Sharing for Multicore Processors

Energy As A Shared Resource

• Probably the 2nd most important shared resource today, soon to be the 1st
  – Power (controlled by DVFS – Dynamic Voltage/Frequency Scaling)
  – Thermal capacity

[Figure: AMD's "Turbo CORE" (thermal capacity) and Intel's "Turbo Boost".]
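Why frequency draws on a shared budget follows from the usual first-order model of dynamic power, P ≈ C·V²·f. A minimal sketch; the model and the voltage/frequency values below are illustrative assumptions, not numbers from the slides:

```c
#include <stdio.h>

/* Illustrative sketch (assumed model, not from the slides): dynamic power
 * scales roughly as C*V^2*f, so a turbo boost that raises frequency (and
 * the voltage needed to sustain it) spends the thermal headroom left by
 * idle cores. The operating points are hypothetical. */
int main(void) {
    const double v_nom = 1.00, f_nom = 3.0;     /* nominal volts, GHz */
    const double v_turbo = 1.10, f_turbo = 3.6; /* hypothetical turbo point */
    double ratio = (v_turbo * v_turbo * f_turbo) / (v_nom * v_nom * f_nom);
    printf("+20%% frequency costs about %.2fx the dynamic power\n", ratio);
    return 0;
}
```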

Page 8: Resource Sharing for Multicore Processors

Aside: Why Share Resources?

• Parallel performance
  – Data sharing across threads
  – Simplified synchronization and coherency
• Flexibility
  – Cache partitioning for different workloads
• Amdahl's law
  – You always need single-threaded performance (see the sketch after the figures)

[Figures: four cores around a shared cache illustrating efficient data sharing, fast synchronization on shared data and locks, and flexible resource allocation.]

[Figure: Limits of Amdahl's Law – maximum speedup (log scale, 1-100) vs. number of cores (1-1024): 75% parallel saturates at 4×, 90% at 10×, 99% at 91×.]
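The plotted limits follow directly from Amdahl's law, speedup(p, n) = 1 / ((1 - p) + p/n) for parallel fraction p on n cores. A minimal sketch reproducing the endpoints of the three curves:

```c
#include <stdio.h>

/* Minimal sketch: maximum speedup from Amdahl's law for the parallel
 * fractions on the slide, evaluated at the plot's last point (1024 cores). */
static double amdahl(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double fracs[] = {0.75, 0.90, 0.99};
    for (int i = 0; i < 3; i++)
        printf("%2.0f%% parallel: %.0fx at 1024 cores (asymptote %.0fx)\n",
               fracs[i] * 100.0, amdahl(fracs[i], 1024), 1.0 / (1.0 - fracs[i]));
    return 0;   /* prints ~4x, ~10x, ~91x, matching the plot labels */
}
```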

Page 9: Resource Sharing for Multicore Processors

PERFORMANCE IMPACTS OF SHARING
Cache Sharing, Cache Pollution, and Bandwidth Sharing

Page 10: Resource Sharing for Multicore Processors

Cache Sharing

Experiment:
• Run 1-4 independent instances of the same program on a 4-core Nehalem
• Performance is affected by the shared cache allocation
  – ¼ of the shared cache → 20% slower
• Understanding performance as a function of shared resource allocation is critical to understanding multicore scalability

[Figure: Cache Pirate data for 471.omnetpp – throughput vs. number of cores (expected, measured, and predicted) next to CPI vs. shared L3 cache size (1M-7M): performance as a function of shared cache size.]

Page 11: Resource Sharing for Multicore Processors

Cache Sharing: What's Going On?

[Figure: the 471.omnetpp CPI vs. cache size curve (shared cache sensitivity) next to the throughput scaling curve (reduced scalability). As instances are added on the 4-core Nehalem, each one's share of the shared L3 shrinks: 100% alone, 50% each with two, 33% each with three, 25% each with four.]

Page 12: Resource Sharing for Multicore Processors

Cache Sharing: What's Going On?

[Figure: as on the previous slide – shared cache sensitivity (CPI vs. cache size) and reduced scalability (throughput vs. cores), with each instance's share of the shared cache shrinking from 100% down to 25%.]

Less shared cache space → more DRAM accesses → increased latency → reduced performance.

How much slower? See the CPI vs. cache size curve.

Need performance as a function of shared resources to understand multicore scalability.

Page 13: Resource Sharing for Multicore Processors

Cache Pollution

• Applications put data in the cache but do not reuse it
  – No benefit, but no problem… unless you are sharing cache space
• Pollution wastes space other applications could benefit from

[Figure: Effects of cache pollution in a 4-process workload – throughput (0-2) for bzip2, lbm, libquantum, gamess, and overall, running alone, in the mixed workload, and with pollution eliminated. Roughly 1/3 of the slowdown is due to cache pollution.]

Page 14: Resource Sharing for Multicore Processors

Cache Pollution: What's Going On?

• Does an application benefit from using the shared cache?

[Figure: miss ratio (0-25%) vs. cache size for libquantum and lbm, with the private and shared portions of the hierarchy marked. One application's miss ratio stays flat across the shared sizes: no benefit from using the shared cache space. The other's miss ratio drops steeply: significant benefit from the shared cache space.]

More misses → more cache space occupied; fewer misses → less cache space. Not all applications benefit from caching, so performance improves if caching can be controlled.
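A miss-ratio curve like the one above can be approximated from the outside with a simple micro-benchmark: time a dependent pointer chase while growing the working set, and watch the time per access jump at each cache boundary. A minimal sketch; the sizes and iteration counts are arbitrary choices, and a real version would shuffle the ring to defeat the prefetcher:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Minimal sketch: time per access of a dependent pointer chase as the
 * working set grows. Steps in the curve mark cache-level boundaries. */
static double ns_per_access(size_t n_ptrs) {
    void **ring = malloc(n_ptrs * sizeof *ring);
    if (!ring)
        return 0.0;
    for (size_t i = 0; i < n_ptrs; i++)
        ring[i] = &ring[(i + 1) % n_ptrs];  /* ring of pointers */
    void **p = ring;
    const long iters = 20 * 1000 * 1000;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        p = *p;                             /* dependent loads: MLP = 1 */
    double ns = (clock() - t0) * 1e9 / CLOCKS_PER_SEC / iters;
    free(ring);
    return p ? ns : 0.0;                    /* keep the chase live */
}

int main(void) {
    for (size_t kb = 16; kb <= 16 * 1024; kb *= 2)
        printf("%6zu KB working set: %5.1f ns/access\n",
               kb, ns_per_access(kb * 1024 / sizeof(void *)));
    return 0;
}
```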

Page 15: Resource Sharing for Multicore Processors

Bandwidth Sharing

[Figure: Predicting multicore scaling for 470.lbm – performance (CPI) and bandwidth (GB/s) as a function of shared cache size (1M-7M), plus throughput and required bandwidth vs. cores (measured vs. predicted). Requests queue at the shared memory controller/I/O, and the required bandwidth for 4 cores crosses the measured system maximum of 10.4 GB/s.]

• A 50% reduction in cache → a 57% increase in bandwidth demand
• No performance loss with less cache space… until you run out of shared bandwidth
• 4 cores @ 3 GB/s = 12 GB/s required, but the system maximum is 10.4 GB/s

Page 16: Resource Sharing for Multicore Processors

Bandwidth Sharing: What Is Going On?

• Shared resources:
  – Cache and bandwidth
  – But also: queues, row buffers, prefetchers, etc.
• Results in complex application sensitivities: bandwidth, capacity, latency, access pattern
• Complex interactions with shared bandwidth

[Figure: the Nehalem memory hierarchy (cores, private L1/L2, shared L3, DRAM) against the 10-190 cycle latency axis, and a chart of slowdown at 90% of saturation bandwidth.]


Bandwidth Bandit: Understanding Memory Contention

Abstract – Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention.

While cache-sharing effects have been studied extensively of late, the effects of memory bandwidth sharing are not as well explored. This is largely due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and the locality properties of the application's access stream.

This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which decrease the available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and explains why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.

I. INTRODUCTION

Cache capacity and memory bandwidth are critical shared resources in chip multi-processors. Sharing these resources between cores has many benefits; however, it also leads to contention, which can have dramatic, negative impacts on the co-running applications' performance. Understanding the impact of such contention is therefore essential to fully realize the computational power of these systems. This work investigates the effects of contention for off-chip memory (bandwidth) by first exploring where and why contention occurs in the memory hierarchy. We then use this knowledge to develop a method that enables us to measure an application's sensitivity to memory contention.

While contention effects have been extensively studied for cache capacity (e.g. [1]), less is known about contention for memory bandwidth. Two recent studies [2], [3] show that the performance impact from contention for off-chip memory resources can be significant. Their main results, with respect to memory bandwidth contention, are that applications with higher (un-contended) bandwidth demands both generate more memory contention and are generally more sensitive to contention from others.

The importance of application sensitivity to memory bandwidth contention is shown in Figure 1. This data shows that applications' sensitivities vary significantly and can be quite substantial, with slowdowns varying from 3% to 23% for these benchmarks. It is clear from the data in Figure 1 that applications with similar bandwidth demands (e.g. lbm and soplex) can exhibit widely varying slowdowns for the same degree of memory contention. The goal of this research is to further investigate the mechanisms behind bandwidth contention and its impact on applications' performance.

[Figure 1: baseline bandwidth (GB/s) and slowdown (%) for lbm, streamcluster, soplex, and mcf.]

Fig. 1. Baseline bandwidth (no contention) and slowdown with memory contention (90% of saturation bandwidth). While all of these applications exhibit similar baseline bandwidth consumption, their sensitivities to memory contention vary widely. This variability demonstrates the importance of a more detailed understanding of the effects of memory contention. (For each benchmark we generated sufficient bandwidth contention to reach 90% of the system saturation bandwidth for the benchmark. This contention was generated using the Bandwidth Bandit technique discussed in Section VI.)

To measure application sensitivity to memory contention we propose the Bandwidth Bandit method. It works by co-running the application whose performance we want to measure (the Target) with a Bandit application that generates memory contention. By carefully controlling the amount of contention the Bandit generates, we can measure the Target's performance as a function of the contention generated by the Bandit.

As we want to measure the Target's sensitivity to memory contention in isolation, it is important that the Bandit does not consume any shared resources other than the ones for which it generates contention. For example, if the Bandit uses large amounts of shared cache capacity, this might impact the Target's performance and distort the measurement. In addition, the Bandit must generate realistic contention traffic that appropriately targets the different parts of the memory hierarchy. To design such a Bandit we must first understand the sources of memory contention in detail.

This research makes the following contributions:

• We identify the main bottlenecks that can arise due to memory contention and measure their impact on both latency and bandwidth. We classify these bottlenecks into two categories: bottlenecks that limit bandwidth and bottlenecks that increase latency.
• We present the Bandwidth Bandit method. To our knowledge, this is the first method for measuring the performance impact of memory contention on real hardware.
• We describe how the Bandit data can be analyzed to understand an application's bandwidth and latency sensitivity.
• We use the Bandwidth Bandit method to analyze the contention sensitivity of a set of applications from the SPEC2006 and PARSEC benchmark suites.

To present this material, we begin with an overview of the memory hierarchy (Section II-A) and a discussion of the relationship between memory level parallelism and latency (Section II-B). We then describe our experimental setup (Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local level in the memory hierarchy (Section IV). With this background, we then introduce the concept of application sensitivity to bandwidth and latency (Section V) and present the details of the Bandwidth Bandit implementation (Section VI). The Bandit is then validated against a series of micro-benchmarks, which allow us to describe how to interpret its output (Section VII). Finally, we present the results of using the Bandwidth Bandit to investigate the memory contention sensitivity of a variety of SPEC2006 and PARSEC benchmarks (Section VIII).

II. BACKGROUND

A. Memory Hierarchy Organization

[Fig. 2. The memory hierarchy. The memory hierarchy can be analyzed with Equation 1 using the different degrees of parallelism and latency at each level in the hierarchy. The queues in the figure indicate that the available parallelism comes from multiple places within the hierarchy. The details of the hierarchy are discussed in Section IV.]

The memory hierarchy considered in this paper is shown in Figure 2. If a memory access can not be serviced by the core's private caches (not shown in the figure), it is first sent to the shared L3 cache. If the requested data is not found in the L3 cache, it is sent to the integrated Memory Controller (MC). The MC has three independent memory channels over which it communicates with the DRAM modules. Each channel consists of an address bus and a 64-bit wide data bus. Memory requests are typically 64 bytes (one cache line), thereby requiring eight transfers over the data bus. Each DRAM module consists of several independent memory banks, which can be accessed in parallel, as long as there are no conflicts on the address and data buses. The combination of independent channels and memory banks provides for a large degree of available parallelism in the off-chip memory hierarchy.

The DRAM memory banks are organized into rows (also called pages) and columns. To address a word of data the MC has to specify the channel, bank, row, and column of the data. To read or write an address, the whole row is first copied into the bank's row buffer. This single-entry buffer (also known as a page cache) caches the row until a different row in the same bank is accessed.

On a read or write access three events can occur: a page-hit, when the accessed row is already in the row buffer and the data can be read/written directly; a page-empty, when the row buffer is empty and the accessed row has to be copied from the bank before it can be read/written¹; or a page-miss, when a row other than the one accessed is cached in the row buffer. In the case of a page-miss, the cached row has to first be written back to the memory bank before the newly accessed row is copied into the row buffer. These three events have different latencies, with a page-hit having the shortest latency and a page-miss having the longest.

¹ Page-empties occur when the MC preemptively closes a page that hasn't been accessed recently to optimistically turn a page-miss into a page-empty.
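A minimal sketch of the three row-buffer events just described, for a single hypothetical bank (the real MC tracks one row buffer per bank, per channel):

```c
#include <stdio.h>

/* Minimal sketch: classify accesses to one DRAM bank into the three
 * row-buffer events described above. open_row == -1 means the row
 * buffer is empty (the page was closed). */
enum event { PAGE_HIT, PAGE_EMPTY, PAGE_MISS };
static const char *names[] = { "page-hit", "page-empty", "page-miss" };

static int open_row = -1;

static enum event access_bank(int row) {
    if (open_row == row)
        return PAGE_HIT;        /* shortest latency */
    if (open_row == -1) {
        open_row = row;         /* copy the row into the row buffer */
        return PAGE_EMPTY;
    }
    open_row = row;             /* write back the old row, then open */
    return PAGE_MISS;           /* longest latency */
}

int main(void) {
    int rows[] = {3, 3, 7, 3};
    for (int i = 0; i < 4; i++)
        printf("row %d -> %s\n", rows[i], names[access_bank(rows[i])]);
    return 0;   /* prints: page-empty, page-hit, page-miss, page-miss */
}
```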

B. Memory Hierarchy Performance

From a performance point of view the memory hierarchy can be described by two metrics: its latency and bandwidth. These two metrics are intimately related. Using Little's law [?], the average bandwidth achieved by an application can be expressed as follows:

    bandwidth = (transfer size × MLP) / latency        (1)

where MLP is the application's average Memory Level Parallelism, that is, the average number of concurrent memory requests it has in flight, and latency is the average time to complete the application's memory accesses. For example, consider an application with an MLP of one (e.g., a linked list traversal) and an access latency of 50 ns. Such an application would complete a transfer every 50 ns, on average. With a transfer size of 64 bytes, its memory bandwidth would be 1.3 GB/s. If the MLP was increased to two, but the access latency remained the same, the application would complete two transfers every 50 ns, and its bandwidth would double to 2.6 GB/s.
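A minimal sketch of Equation 1, reproducing the two worked examples above (64-byte transfers, 50 ns latency, MLP of one and two):

```c
#include <stdio.h>

/* Minimal sketch of Equation 1: bandwidth = transfer_size * MLP / latency.
 * With bytes and nanoseconds, bytes/ns comes out directly in GB/s. */
static double bandwidth_gbs(double transfer_bytes, double mlp, double latency_ns) {
    return transfer_bytes * mlp / latency_ns;
}

int main(void) {
    printf("MLP 1: %.1f GB/s\n", bandwidth_gbs(64, 1, 50));   /* ~1.3 GB/s */
    printf("MLP 2: %.1f GB/s\n", bandwidth_gbs(64, 2, 50));   /* ~2.6 GB/s */
    return 0;
}
```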

The above equation clearly illustrates that the bandwidth achieved by an application is determined by both its memory access latency and its memory parallelism. However, these parameters vary throughout the memory hierarchy. For example, at the bank level, the parallelism is limited by the number of banks. However, MCs typically queue requests to busy banks. From the higher-level perspective of the MC, the parallelism, or number of in-flight requests, will include the requests in these queues, and appear larger. The latency will also appear different, since the time spent in the queues has to be considered. The above equation will therefore have different values for latency and MLP depending on where it is applied in the memory hierarchy.

In this work we will use the above equation to understand how contention for shared memory resources impacts application performance. Such contention can occur throughout the memory hierarchy, and can both increase access latencies and reduce the memory parallelism. The resulting effect on application performance is a function of the application's sensitivity to increased latency or reduced memory parallelism.

III. EXPERIMENTAL SETUP

The experiments presented in this paper have been run on a quad-core Intel Xeon E5520 (Nehalem) machine. Its cache configuration is detailed in the following table:

Page 17: Resource Sharing for Multicore Processors

Sensitivity Critical for Scaling

• Shared resource sensitivity → understand scaling
• Sensitivity varies from application to application
• Impact varies from platform to platform

Performance (lower is better) and bandwidth (GB/s) as a function of shared cache size:

[Figure: Fig. 8 – Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) as a function of cache size (1M-7M) for 435.gromacs, 401.bzip2, 429.mcf, 454.calculix, 482.sphinx3, 470.lbm, 462.libquantum, 450.soplex, and 403.gcc. This data was collected with hardware prefetching enabled.]

482.sphinx3 behaves quite differently from 435.gromacs. As its cache size is decreased its CPI increases by 50%, while its miss ratio and bandwidth increase by a factor of 20×. The fetch-ratio and miss-ratio curves are slightly different, indicating that there is a small amount of prefetching. However, the significant performance decrease despite the increased bandwidth indicates that the benchmark is more sensitive to increased memory latency.

470.lbm shows an 8× difference between its fetch and miss ratios, indicating 8 prefetched memory accesses for each miss. However, the relative increase in miss ratio is still roughly 2×. This indicates that 470.lbm is also relatively insensitive to the increased latency. The data in Figure 9 show the performance of 470.lbm with hardware prefetching disabled. This reduces bandwidth by a third and increases CPI at all cache sizes. Furthermore, the CPI is then no longer constant with varying cache size.

[Figure: the 471.omnetpp CPI vs. cache size curve again (shared cache sensitivity, reduced scalability), annotated with three regions: insensitive, sensitive, insensitive.]

As long as you have enough bandwidth.

Page 18: Resource Sharing for Multicore Processors

Aside: Implementation (Tools)

• Performance as a function of cache size – see the sketch below
  – A "Pirate" thread steals cache and measures the Target's performance
  – Monitor the Pirate's miss rate
  – (Similar approach for bandwidth)

[Figure: Cache Pirate overview – over time the Pirate grows its footprint from 0MB to the full 8MB shared cache while the Target runs, capturing performance data for all cache sizes in one run. Average Target slowdown: 5% (simulation slowdown: 100×-1000×).]

D. Eklöv, N. Nikoleris, D. Black-Schaffer, E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. ICPP 2011.
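A minimal sketch of the pirating idea as described above; the footprint size, the 64-byte line stride, and the endless loop are our assumptions, and the real tool additionally pins threads and reads hardware miss counters to verify that the Pirate actually holds its footprint:

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal sketch: a Pirate thread that repeatedly touches one byte per
 * 64-byte cache line of a buffer of the chosen size, so the co-running
 * Target is left with roughly (shared cache - footprint) of the L3. */
static volatile char sink;

void pirate_thread(size_t footprint_bytes) {
    char *buf = malloc(footprint_bytes);
    if (!buf)
        return;
    for (;;)                                          /* runs for the Target's lifetime */
        for (size_t i = 0; i < footprint_bytes; i += 64)
            sink = buf[i];                            /* keep each line resident */
}
```

Sweeping footprint_bytes from zero up to the full cache size while timing the Target yields CPI-vs-cache-size curves like those shown earlier.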

A. Sandberg, D. Eklöv, E. Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. SC 2010.

• Reducing cache pollution – see the sketch below
  – Sample memory access behavior (reuse distances)
  – Model misses for different cache sizes
  – Identify polluting instructions and change them to non-caching accesses
  – 5% overhead, fully automatic
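The "non-caching accesses" are non-temporal memory instructions. A minimal sketch of the transformation on x86, using the SSE2 streaming-store intrinsic; this standalone copy loop is our illustration, not code from the tool, and dst is assumed 16-byte aligned:

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_pd, _mm_stream_pd, _mm_sfence */
#include <stddef.h>

/* Minimal sketch: write a buffer with non-temporal (streaming) stores so
 * the written lines bypass the cache instead of polluting the shared L3.
 * dst must be 16-byte aligned for _mm_stream_pd. */
void stream_copy(double *dst, const double *src, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);   /* ordinary load */
        _mm_stream_pd(dst + i, v);           /* store around the caches */
    }
    for (; i < n; i++)
        dst[i] = src[i];                     /* scalar tail */
    _mm_sfence();                            /* order the streaming stores */
}
```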


Page 19: Resource Sharing for Multicore Processors

WRAP UP
What is all of this good for?

Page 20: Resource Sharing for Multicore Processors

Why Is This Important?

• Performance and scaling are impacted by resource sharing
• How can we use this information?
  – Predict and understand performance
  – Workload performance
  – Useful for schedulers (static and dynamic), optimization, compilation
• Need new modeling technology (on-going research)
  – Can model workload cache behavior for simple cores (StatCC)
    • Not good enough for modern machines: no MLP, prefetchers, OoO, etc.
  – Developing new techniques for modeling real hardware

Page 21: Resource Sharing for Multicore Processors

The Future

• More threads (100s of cores, 1000s of threads)
• More complex hierarchies (NoCs, distributed caches)
• Heterogeneous (CPU + GPU + accelerators)
• More focus on energy (per-core DVFS, dark silicon)

[Figure: AMD Fusion die – four CPU cores, GPU cores, and shared I/O; Intel integrated graphics with Turbo Boost.]

Page 22: Resource Sharing for Multicore Processors

Resource Sharing in Multicores

• Important
  – Significant impact on performance (and efficiency) → scaling
  – Basic data lets us understand interactions
• Complex
  – Sensitivity varies across applications
  – Hard to understand the hardware
• Moving forward
  – Research on predicting performance and understanding scaling
  – MS thesis projects to port our tools / analyze your code
  – We're open (and eager) for collaboration!

Uppsala Programming for Multicore Architectures Research Center

Page 23: Resource Sharing for Multicore Processors

QUESTIONS?

Page 24: Resource Sharing for Multicore Processors

APPLICATION PHASES
Average behavior is not good enough

Page 25: Resource Sharing for Multicore Processors

Application Phases

A. Sembrant, D. Eklöv, E. Hagersten. Efficient Software-based Online Phase Classification. IISWC 2011.

[Figure: measured CPI over time for bzip2/chicken (0-160 billion instructions), next to the CPI predicted by three profiling approaches over the same run.]

• Average CPI: fast, but inaccurate
• Periodic CPI profiling: nearly as fast as Average, still inaccurate
• Phase-guided profiling: 1.7% additional overhead over Periodic, good accuracy

Need fast phase detection!

[Figure: the average bandwidth and average CPI curves for 470.lbm from the Cache Pirate slides.]

Everything so far has been average application behavior.

Q: Is that good enough? Is application behavior constant?

A: No. Applications have distinct phases with varying behaviors.

[Figure: branch mispredictions, cycles per instruction, and basic block vectors over roughly 700 execution intervals, each trace annotated with the same phase sequence (A B B C B D A B B C B D E).]

• Program behavior is correlated with what code is executed.
• Changes in executed code reflect changes in other metrics – see the classification sketch below.
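A minimal sketch of basic-block-vector phase classification in the spirit of ScarPhase; the vector length, Manhattan distance, and threshold are illustrative choices, and the real tool builds its vectors from sampled branch addresses and clusters them online:

```c
#include <math.h>

/* Minimal sketch: assign each execution interval's (normalized) basic
 * block vector to the nearest known phase, or start a new phase if no
 * existing one is within the similarity threshold. */
#define VEC_LEN    32
#define MAX_PHASES 64

static double phases[MAX_PHASES][VEC_LEN];
static int n_phases;

int classify_interval(const double bbv[VEC_LEN], double threshold) {
    for (int p = 0; p < n_phases; p++) {
        double dist = 0.0;
        for (int i = 0; i < VEC_LEN; i++)
            dist += fabs(bbv[i] - phases[p][i]);   /* Manhattan distance */
        if (dist < threshold)
            return p;                              /* same code, same phase */
    }
    if (n_phases == MAX_PHASES)
        return -1;                                 /* phase table full */
    for (int i = 0; i < VEC_LEN; i++)
        phases[n_phases][i] = bbv[i];              /* remember the new phase */
    return n_phases++;
}
```

Because the vectors depend only on which code executes, phase changes can be detected without first measuring the metric (CPI, miss ratio) that is to be profiled per phase.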

Page 26: Resource Sharing for Multicore Processors

Cache Miss Ratio Per Phase

A. Sembrant, D. Black-Schaffer, E. Hagersten. Phase Guided Profiling for Fast Cache Modeling. CGO 2012.

30% overhead

Page 27: Resource Sharing for Multicore Processors

SHARING IN MORE DETAIL
A brief look at application sensitivity

Page 28: Resource Sharing for Multicore Processors

More Detail: Cache Sharing

• Three SPEC benchmark applications
• Different behaviors due to different properties
• Each will respond differently to cache sharing

F/M

435.gromacs

BW

CP

I

401.bzip2 429.mcf

F/M

454.calculix

BW

CP

I

482.sphinx3 470.lbm

F/M

462.libquantum

BW

CP

I

cache size

450.soplex

cache size

403.gcc

cache size

0.0%

0.1%

0.2%

0.3%

0.0

0.1

0.2

0.0

0.4

0.8

0%

0.3%

0.6%

0.9%

0.0

0.3

0.6

0.0

0.5

1.0

0%

4%

8%

12%

16%

0

1

2

3

0

2

4

0.00%

0.06%

0.12%

0.00

0.03

0.06

0.0

0.5

1.0

0%

2%

4%

0

1

2

0.0

0.5

1.0

0%

3%

6%

0

1

2

3

0.0

0.8

1.6

0%

10%

20%

0

2

4

0.0

0.4

0.8

1M 3M 5M 7M

0%

6%

12%

0

1

2

3

0.0

0.5

1.0

1.5

1M 3M 5M 7M

0%

2%

4%

0.0

0.6

1.2

1.8

0.0

0.6

1.2

1M 3M 5M 7M

Miss Ratio Fetch Ratio CPI Bandwidth

Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware

prefetching enabled. .

482.sphinx3 behaves quite differently from 435.gromacs.

As its cache size is decreased its CPI increases by 50%,

while its miss ratio and bandwidth increase by a factor

of 20×. The fetch ratio and miss ratio curves are slightly

different indicating that there is a small amount of prefetching.

However, the significant performance decrease despite the

increased bandwidth indicates that the benchmark is more

sensitive to increased memory latency.

470.lbm shows an 8× difference between its fetch and miss

ratios, indicating 8 prefetched memory accesses for each miss.

However, the relative increase in miss ratio is still roughly 2×.

This indicates that 470.lbm is also relatively insensitive to the

increased latency. The data in Figure 9 show the performance

of 470.lbm with hardware prefetching disabled. This reduces

bandwidth by a third and increases CPI at all cache sizes.

Furthermore, the CPI is now no longer constant with varying

!"#

•  0%  slower  •  50x  bandwidth  

increase  

•  Full  working  set  fits  in  the  cache  

•  Insensi1ve  to  latency  

F/M

435.gromacs

BW

CP

I

401.bzip2 429.mcf

F/M

454.calculixB

WC

PI

482.sphinx3 470.lbmF

/M

462.libquantum

BW

CP

I

cache size

450.soplex

cache size

403.gcc

cache size

0.0%

0.1%

0.2%

0.3%

0.0

0.1

0.2

0.0

0.4

0.8

0%

0.3%

0.6%

0.9%

0.0

0.3

0.6

0.0

0.5

1.0

0%

4%

8%

12%

16%

0

1

2

3

0

2

4

0.00%

0.06%

0.12%

0.00

0.03

0.06

0.0

0.5

1.0

0%

2%

4%

0

1

2

0.0

0.5

1.0

0%

3%

6%

0

1

2

3

0.0

0.8

1.6

0%

10%

20%

0

2

4

0.0

0.4

0.8

1M 3M 5M 7M

0%

6%

12%

0

1

2

3

0.0

0.5

1.0

1.5

1M 3M 5M 7M

0%

2%

4%

0.0

0.6

1.2

1.8

0.0

0.6

1.2

1M 3M 5M 7M

Miss Ratio Fetch Ratio CPI Bandwidth

Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware

prefetching enabled. .

482.sphinx3 behaves quite differently from 435.gromacs.

As its cache size is decreased its CPI increases by 50%,

while its miss ratio and bandwidth increase by a factor

of 20×. The fetch ratio and miss ratio curves are slightly

different indicating that there is a small amount of prefetching.

However, the significant performance decrease despite the

increased bandwidth indicates that the benchmark is more

sensitive to increased memory latency.

470.lbm shows an 8× difference between its fetch and miss

ratios, indicating 8 prefetched memory accesses for each miss.

However, the relative increase in miss ratio is still roughly 2×.

This indicates that 470.lbm is also relatively insensitive to the

increased latency. The data in Figure 9 show the performance

of 470.lbm with hardware prefetching disabled. This reduces

bandwidth by a third and increases CPI at all cache sizes.

Furthermore, the CPI is now no longer constant with varying

!"#

F/M

435.gromacs

BW

CP

I

401.bzip2 429.mcf

F/M

454.calculix

BW

CP

I

482.sphinx3 470.lbm

F/M

462.libquantum

BW

CP

I

cache size

450.soplex

cache size

403.gcc

cache size

0.0%

0.1%

0.2%

0.3%

0.0

0.1

0.2

0.0

0.4

0.8

0%

0.3%

0.6%

0.9%

0.0

0.3

0.6

0.0

0.5

1.0

0%

4%

8%

12%

16%

0

1

2

3

0

2

4

0.00%

0.06%

0.12%

0.00

0.03

0.06

0.0

0.5

1.0

0%

2%

4%

0

1

2

0.0

0.5

1.0

0%

3%

6%

0

1

2

3

0.0

0.8

1.6

0%

10%

20%

0

2

4

0.0

0.4

0.8

1M 3M 5M 7M

0%

6%

12%

0

1

2

3

0.0

0.5

1.0

1.5

1M 3M 5M 7M

0%

2%

4%

0.0

0.6

1.2

1.8

0.0

0.6

1.2

1M 3M 5M 7M

Miss Ratio Fetch Ratio CPI Bandwidth

Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware

prefetching enabled. .

482.sphinx3 behaves quite differently from 435.gromacs.

As its cache size is decreased its CPI increases by 50%,

while its miss ratio and bandwidth increase by a factor

of 20×. The fetch ratio and miss ratio curves are slightly

different indicating that there is a small amount of prefetching.

However, the significant performance decrease despite the

increased bandwidth indicates that the benchmark is more

sensitive to increased memory latency.

470.lbm shows an 8× difference between its fetch and miss

ratios, indicating 8 prefetched memory accesses for each miss.

However, the relative increase in miss ratio is still roughly 2×.

This indicates that 470.lbm is also relatively insensitive to the

increased latency. The data in Figure 9 show the performance

of 470.lbm with hardware prefetching disabled. This reduces

bandwidth by a third and increases CPI at all cache sizes.

Furthermore, the CPI is now no longer constant with varying

!"#

F/M

435.gromacs

BW

CP

I

401.bzip2 429.mcf

F/M

454.calculix

BW

CP

I

482.sphinx3 470.lbm

F/M

462.libquantum

BW

CP

I

cache size

450.soplex

cache size

403.gcc

cache size

0.0%

0.1%

0.2%

0.3%

0.0

0.1

0.2

0.0

0.4

0.8

0%

0.3%

0.6%

0.9%

0.0

0.3

0.6

0.0

0.5

1.0

0%

4%

8%

12%

16%

0

1

2

3

0

2

4

0.00%

0.06%

0.12%

0.00

0.03

0.06

0.0

0.5

1.0

0%

2%

4%

0

1

2

0.0

0.5

1.0

0%

3%

6%

0

1

2

3

0.0

0.8

1.6

0%

10%

20%

0

2

4

0.0

0.4

0.8

1M 3M 5M 7M

0%

6%

12%

0

1

2

3

0.0

0.5

1.0

1.5

1M 3M 5M 7M

0%

2%

4%

0.0

0.6

1.2

1.8

0.0

0.6

1.2

1M 3M 5M 7M

Miss Ratio Fetch Ratio CPI Bandwidth

Fig. 8. Performance (CPI), bandwidth requirements (BW) in GB/s, and fetch/miss ratios (F/M) for several benchmarks. This data was collected with hardware

prefetching enabled. .


x-axis: Cache Size (all panel columns)

482.sphinx3:
•  Perf: 50% slower
•  BW: 10x bandwidth increase
•  Size: most of the working set fits in the cache
•  Sensitivity: sensitive to latency

470.lbm:
•  Perf: 0% slower
•  BW: 2x bandwidth increase
•  Size: little of the working set fits in the cache
•  Sensitivity: insensitive to latency

Page 29: Resource Sharing for Multicore Processors · (Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local

Uppsala University / Department of Information Technology   28/11/2011 | 29   David Black-Schaffer

More Detail: The Impact of Prefetching

•  Different degrees of prefetching depending on application access pattern and hardware

•  Prefetching reduces application sensitivity to latency


[Figure: a single panel for 470.lbm plotting fetch/miss ratio (F/M, 0–6%), CPI (0–3), and bandwidth (0.0–1.6 GB/s) against cache size (1M–7M).]

Fig. 9. Performance data for 470.lbm with hardware prefetching disabled. (Fetch ratio and miss ratio are identical.)

With prefetching disabled, the CPI is no longer constant with varying cache size, clearly showing that prefetching was helping to compensate for the reduced cache space. This demonstrates that 470.lbm not only heavily leverages hardware prefetching, but also heavily benefits from it.

V. RELATED WORK

The standard approach to measuring performance as a function of cache size is to use architectural simulators (e.g. [11], [16]) or performance models (e.g. [14], [18]). For performance analysis the main limitation of simulation is the long execution time, which is typically orders of magnitude slower than native execution. To address this, a substantial amount of work has been done to improve simulation performance [8], [12], [17], in particular by trading off accuracy and detail for improved speed through the use of analytical models [14], [18]. However, despite these advances, the overhead of simulation and modeling is still a limitation for performance analysis of real applications with large data sets.

Xu et al. [4] presented a model for predicting the performance degradation of co-running applications contending for shared cache. As part of their work they co-run a stress application with each application to measure its performance as a function of miss ratio. For our purpose of measuring performance as a function of cache size, their approach has two limitations: 1) Instead of ensuring that their stress application steals a fixed amount of cache, they let it freely contend for space with the Target, and then estimate the average amount of cache stolen after the fact. Such an average is hard to correlate to one cache size and is sensitive to program phases. 2) As their micro-benchmark freely contends for cache with the Target, they cannot limit how much off-chip bandwidth it consumes, which can distort performance measurements5.

5We implemented and used their method to measure the CPI of the simple sequential micro-benchmark in Figure 4. When we tried to steal 4MB of cache, their stress application consumed sufficient off-chip bandwidth to increase the measured CPI by 37%.

Doucette and Fedorova [5] also co-run a set of micro-benchmarks, called base vectors, with a Target application to determine the impact on the target. Their base vector for measuring shared cache contention has a similar sequential access pattern to that of the Cache Pirate, but has its working set size fixed to that of the shared cache. This provides a single cache "sensitivity" measure for the Target application and does not relate the performance of the Target to the amount of cache available, nor to specific performance measurements such as CPI or off-chip bandwidth.

Cakarevic et al. [19] use a set of micro-benchmarks similar to that of Doucette and Fedorova, but their work focuses on hardware characterization. They co-run their micro-benchmarks, stressing different shared resources, and investigate how the micro-benchmarks impact each other. This allows them to identify and characterize the critical shared hardware resources, but not the behavior of other applications.

VI. CONCLUSION

We have presented a general technique for quickly and accurately measuring application performance characteristics as a function of the available shared cache space. By leveraging a co-executing Pirate application that steals shared cache space and hardware performance counters, this approach works on standard hardware, accurately reflects the idiosyncrasies of the system, has a low overhead, and requires no modifications to either the operating system or target applications.
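As a rough illustration of the Pirate idea only (a sketch under assumptions, not the paper's tuned implementation): a Pirate can be as simple as a loop that continuously touches a working set of a chosen size, keeping that much of the shared cache occupied while hitting in the cache itself and so consuming little off-chip bandwidth. The function name pirate() and the parameter steal_bytes below are hypothetical.

/* Minimal sketch of a cache "Pirate": keep a working set of
 * steal_bytes resident in the shared cache by touching it in a
 * tight loop. Assumes 64-byte cache lines; the size and the
 * sequential access pattern are illustrative only. */
#include <stdlib.h>

#define LINE 64
static volatile char sink;  /* prevents the loop from being optimized away */

void pirate(size_t steal_bytes)
{
    char *ws = malloc(steal_bytes);
    for (;;) {
        /* Read one byte per cache line so every line of the
         * working set stays cached (i.e. stolen from the Target). */
        for (size_t i = 0; i < steal_bytes; i += LINE)
            sink = ws[i];
    }
}

In the paper's actual method the stolen size is chosen and adjusted dynamically while performance counters monitor both Pirate and Target; this sketch only conveys the occupancy mechanism.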

To evaluate our technique we identified and investigated four key issues: 1) Does the cache available to the Target behave as expected? 2) How much cache can the Pirate steal? 3) How many Pirate threads can we run before we impact the Target's performance? and, 4) Can we minimize the overhead by dynamically varying the Pirate's size? For the first question we used a trace-driven cache simulator to show that the cache space available to the Target application behaved as expected, with average and maximum absolute fetch ratio errors of 0.2% and 2.7%, respectively. We introduced a method to determine when the Pirate can execute two threads to increase its fetch rate without impacting the Target's performance, and demonstrated that we were able to steal an average of 6.7MB of the 8MB cache, and up to an average of 6.8MB with a small decrease in Target execution rate. And, finally, by dynamically adjusting the amount of cache the Pirate steals during execution, we were able to reduce the total overhead to 5.5%, with an average CPI error of only 0.5%.

While we motivated this general performance characterization approach with an example of how it can be used to explain throughput scaling, we believe that the data captured by Cache Pirating is fundamentally important for the understanding and optimization of applications executing in shared resource environments. Future work includes extending this approach to collect performance data against other shared resources and exploring the range of information available from the ever increasing variety of performance counters.

!"#

Callouts on the slide:
•  Significant prefetching, and it benefits from it (470.lbm)
•  Minimal prefetching
•  No prefetching
•  Prefetching Disabled (Fig. 9)

The difference between fetches and misses is the prefetch rate.
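That rule of thumb can be checked with a few lines of arithmetic. The fetch and miss ratios below are illustrative values, chosen only so that the fetch ratio is about 8x the miss ratio, as in the 470.lbm discussion; they are not measured numbers.

/* Reading the F/M curves: everything fetched but not demand-missed
 * was brought in by the prefetcher. Values are illustrative. */
#include <stdio.h>

int main(void)
{
    double fetch_ratio = 0.24;  /* lines fetched per memory access (assumed) */
    double miss_ratio  = 0.03;  /* demand misses per memory access (assumed) */

    double prefetch_rate = fetch_ratio - miss_ratio;  /* prefetched lines per access */
    double gap           = fetch_ratio / miss_ratio;  /* the text's "8x difference" */

    printf("prefetched lines per access: %.2f\n", prefetch_rate);
    printf("fetch/miss gap: %.0fx\n", gap);
    return 0;
}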


Page 30: Resource Sharing for Multicore Processors · (Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local

Uppsala University / Department of Information Technology   28/11/2011 | 30   David Black-Schaffer

More Detail: Cache Pollution

•  We can measure how greedy an application is and how sensitive it is
•  By changing the code to use non-caching instructions we can make an application less greedy without hurting performance (see the sketch after the next figure)

Motivation

Do all applications benefit from caching?

[Figure: miss ratio (0–25%) vs. cache size for libquantum and lbm, with the private and shared cache sizes marked.]

Not all applications benefit from caching. Better performance if caching can be controlled.
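One plausible concrete form of the "non-caching instructions" mentioned above is x86 non-temporal stores, which write around the cache so a streaming output does not evict useful data. The sketch below is a generic illustration under that assumption, not the authors' modified benchmarks; it assumes SSE, 16-byte-aligned buffers, and a length that is a multiple of four floats.

/* Sketch: copy a large streaming buffer with non-temporal stores so
 * the destination does not pollute the shared cache. */
#include <emmintrin.h>
#include <stddef.h>

void stream_copy(float *dst, const float *src, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i += 4) {
        __m128 v = _mm_load_ps(src + i);  /* normal cached load       */
        _mm_stream_ps(dst + i, v);        /* store bypasses the cache */
    }
    _mm_sfence();  /* make the streamed stores globally visible */
}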

Increased Performance of Mixed Workloads

[Figure: performance of a mixed workload of 401.bzip2, 470.lbm, 462.libquantum, and 416.gamess.]

Not all applications are affected by cache contention. Even cache gobblers get increased performance due to shared bandwidth.


Page 31: Resource Sharing for Multicore Processors · (Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local

Uppsala University / Department of Information Technology   28/11/2011 | 31   David Black-Schaffer

More Detail: Bandwidth Sharing

•  Sensitivity is a function of the application
   –  Latency sensitivity (memory level parallelism)
   –  Bandwidth requirement (data rate)
•  And of the hardware
   –  Ability to handle out-of-order requests (queue sizes)
   –  Access pattern costs (streaming vs. random in DRAM banks)
•  BW consumption is not a good indicator of BW sensitivity


Bandwidth Bandit: Understanding Memory Contention

Abstract—Applications that are co-scheduled on a multicore compete for shared resources, such as cache capacity and memory bandwidth. The performance degradation resulting from this contention can be substantial, which makes it important to effectively manage these shared resources. This, however, requires an understanding of how applications are impacted by such contention.

While cache-sharing effects have been studied extensively of late, the effects of memory bandwidth sharing are not as well explored. This is in large part due to its complex nature, as sensitivity to bandwidth contention depends on bottlenecks at several levels of the memory system and the locality properties of the application's access stream.

This paper explores the contention effects of increased latency and decreased memory parallelism at different points in the memory hierarchy, both of which cause decreases in available bandwidth. To understand the impact of such contention on applications, it also presents a method whereby an application's overall sensitivity to different degrees of bandwidth contention can be directly measured. This method is used to demonstrate the varying contention sensitivity across a selection of benchmarks, and explains why some of them experience substantial slowdowns long before the overall memory bandwidth saturates.

I. INTRODUCTION

Cache capacity and memory bandwidth are critical shared resources in chip multi-processors. Sharing these resources between cores has many benefits; however, it also leads to contention, which can have dramatic, negative impacts on the co-running applications' performance. Understanding the impact of such contention is therefore essential to fully realize the computational power of these systems. This work investigates the effects of contention for off-chip memory (bandwidth) by first exploring where and why contention occurs in the memory hierarchy. We then use this knowledge to develop a method that enables us to measure an application's sensitivity to memory contention.

While contention effects have been extensively studied for cache capacity (e.g. [1]), less is known about contention for memory bandwidth. Two recent studies [2], [3] show that the performance impact from contention for off-chip memory resources can be significant. Their main results, with respect to memory bandwidth contention, are that applications with higher (un-contended) bandwidth demands both generate more memory contention and are generally more sensitive to contention from others.

The importance of application sensitivity to memory bandwidth contention is shown in Figure 1. This data shows that applications' sensitivities vary significantly and can be quite substantial, with slowdowns varying from 3% to 23% for these benchmarks. It is clear from the data in Figure 1 that applications with similar bandwidth demands (e.g. lbm and soplex) can exhibit widely varying slowdowns for the same degree of memory contention. The goal of this research is to further investigate the mechanisms behind bandwidth contention and its impact on applications' performance.

[Figure: slowdown (%, 0–20) and bandwidth (GB/s, 0–2) bars for lbm, stream, cluster, soplex, and mcf.]

Fig. 1. Baseline bandwidth (no contention) and slowdown with memory contention (90% of saturation bandwidth). While all of these applications exhibit similar baseline bandwidth consumption, their sensitivities to memory contention vary widely. This variability demonstrates the importance of a more detailed understanding of the effects of memory contention. (For each benchmark we generated sufficient bandwidth contention to reach 90% of the system saturation bandwidth for the benchmark. This contention was generated using the Bandwidth Bandit technique discussed in Section VI.)

To measure application sensitivity to memory contention we propose the Bandwidth Bandit method. It works by co-running the application whose performance we want to measure (the Target) with a Bandit application that generates memory contention. By carefully controlling the amount of contention the Bandit generates we can measure the Target's performance as a function of the contention generated by the Bandit.

As we want to measure the Target's sensitivity to memory contention in isolation, it is important that the Bandit does not consume any shared resources other than the ones for which it generates contention. For example, if the Bandit uses large amounts of shared cache capacity, this might impact the Target's performance and distort the measurement. In addition, the Bandit must generate realistic contention traffic that appropriately targets the different parts of the memory hierarchy. To design such a Bandit we must first understand the sources of memory contention in detail.
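A minimal sketch of the kind of traffic generator a Bandit needs, under stated assumptions: a random pointer chase over a buffer much larger than the shared cache guarantees that nearly every access goes to memory, while the number of independent chases run concurrently sets the Bandit's memory parallelism and hence the bandwidth it consumes. The node layout, the Sattolo-style ring construction, and the function names are illustrative, not the paper's implementation (which additionally constrains the Bandit's cache footprint).

/* Sketch of a bandwidth "Bandit": chase a randomly permuted ring of
 * cache-line-sized nodes. One chase has an MLP of 1; several
 * interleaved or threaded chases raise the Bandit's MLP. */
#include <stdlib.h>

#define LINE 64

struct node { struct node *next; char pad[LINE - sizeof(struct node *)]; };

static volatile struct node *bandit_sink;

struct node *build_ring(size_t n)
{
    struct node *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++) a[i].next = &a[i];
    /* Sattolo-style splicing: yields a single random cycle, which
     * defeats both the hardware prefetcher and the cache. */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        struct node *t = a[i].next; a[i].next = a[j].next; a[j].next = t;
    }
    return a;
}

void chase(struct node *p, long iters)
{
    while (iters--) p = p->next;  /* one outstanding miss at a time */
    bandit_sink = p;              /* keep the chase from being optimized out */
}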

This research makes the following contributions:

•  We identify the main bottlenecks that can arise due to memory contention and measure their impact on both latency and bandwidth. We classify these bottlenecks into two categories: bottlenecks that limit bandwidth and bottlenecks that increase latency.
•  We present the Bandwidth Bandit method. To our knowledge, this is the first method for measuring the performance impact of memory contention on real hardware.
•  We describe how the Bandit data can be analyzed to understand an application's bandwidth and latency sensitivity.
•  We use the Bandwidth Bandit method to analyze the contention sensitivity of a set of applications from the SPEC2006 and PARSEC benchmark suites.

Slide annotation on Fig. 1: slowdown at 90% of saturation bandwidth.

To present this material, we begin with an overview of the memory hierarchy (Section II-A) and a discussion of the relationship between memory level parallelism and latency (Section II-B). We then describe our experimental setup (Section III), and present a series of micro-benchmarks for characterizing the impacts of latency and parallelism at the global and local level in the memory hierarchy (Section IV). With this background, we then introduce the concept of application sensitivity to bandwidth and latency (Section V) and present the details of the Bandwidth Bandit implementation (Section VI). The Bandit is then validated against a series of micro-benchmarks, which allow us to describe how to interpret its output (Section VII). Finally, we present the results of using the Bandwidth Bandit to investigate the memory contention sensitivity of a variety of SPEC2006 and PARSEC benchmarks (Section VIII).

Fig. 2. The memory hierarchy. The memory hierarchy can be analyzed with Equation 1 using the different degrees of parallelism and latency at each level in the hierarchy. The queues in the figure indicate that the available parallelism comes from multiple places within the hierarchy. The details of the hierarchy are discussed in Section IV.

II. BACKGROUND

A. Memory Hierarchy Organization

The memory hierarchy considered in this paper is shown in Figure 2. If a memory access cannot be serviced by the core's private caches (not shown in the figure), it is first sent to the shared L3 cache. If the requested data is not found in the L3 cache, it is sent to the integrated Memory Controller (MC). The MC has three independent memory channels over which it communicates with the DRAM modules. Each channel consists of an address bus and a 64-bit wide data bus. Memory requests are typically 64 bytes (one cache line), thereby requiring eight transfers over the data bus. Each DRAM module consists of several independent memory banks, which can be accessed in parallel, as long as there are no conflicts on the address and data buses. The combination of independent channels and memory banks provides for a large degree of available parallelism in the off-chip memory hierarchy.

The DRAM memory banks are organized into rows (also called pages) and columns. To address a word of data the MC has to specify the channel, bank, row and column of the data. To read or write an address, the whole row is first copied into the bank's row buffer. This single-entry buffer (also known as a page cache) caches the row until a different row in the same bank is accessed.

On a read or write access three events can occur: a page-hit, when the accessed row is already in the row buffer and the data can be read/written directly; a page-empty, when the row buffer is empty and the accessed row has to be copied from the bank before it can be read/written1; or a page-miss, when a row other than the one accessed is cached in the row buffer. In the case of a page-miss, the cached row has to first be written back to the memory bank before the newly accessed row is copied into the row buffer. These three events have different latencies, with a page-hit having the shortest latency and a page-miss having the longest.
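To make the cost of these three events concrete, they can be folded into a simple expected-latency estimate. This is a toy model: the per-event latencies and the hit/empty/miss mix below are made-up placeholders, since the text gives no numbers here.

/* Toy model: expected DRAM access latency given the mix of
 * page-hits, page-empties and page-misses an access stream sees. */
#include <stdio.h>

int main(void)
{
    double t_hit = 15.0, t_empty = 30.0, t_miss = 45.0;  /* ns, assumed   */
    double p_hit = 0.60, p_empty = 0.25, p_miss = 0.15;  /* must sum to 1 */

    double t_avg = p_hit * t_hit + p_empty * t_empty + p_miss * t_miss;
    printf("expected DRAM access latency: %.1f ns\n", t_avg);
    return 0;
}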

B. Memory Hierarchy Performance

From a performance point of view the memory hierarchy can be described by two metrics: its latency and bandwidth. These two metrics are intimately related. Using Little's law [?], the average bandwidth achieved by an application can be expressed as follows:

    bandwidth = transfer size × MLP / latency,    (1)

where MLP is the application's average Memory Level Parallelism, that is, the average number of concurrent memory requests it has in-flight, and latency is the average time to complete the application's memory accesses. For example, consider an application with an MLP of one (e.g., a linked list traversal) and an access latency of 50ns. Such an application would complete a transfer every 50ns, on average. With a transfer size of 64 bytes, its memory bandwidth would be 1.3GB/s. If the MLP was increased to two, but the access latency remained the same, the application would complete two transfers every 50ns, and its bandwidth would double to 2.6GB/s.
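The worked example above can be checked directly; this tiny calculation just re-derives the 1.3GB/s and 2.6GB/s figures from Equation 1 (no new assumptions beyond the numbers in the text).

/* Equation 1 (Little's law): bandwidth = transfer_size * MLP / latency. */
#include <stdio.h>

static double bandwidth_gbs(double bytes, double mlp, double latency_ns)
{
    return bytes * mlp / latency_ns;  /* bytes per ns == GB/s */
}

int main(void)
{
    printf("MLP 1: %.1f GB/s\n", bandwidth_gbs(64, 1, 50));  /* ~1.3 */
    printf("MLP 2: %.1f GB/s\n", bandwidth_gbs(64, 2, 50));  /* ~2.6 */
    return 0;
}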

The above equation clearly illustrates that the bandwidth achieved by an application is determined by both its memory access latency and its memory parallelism. However, these parameters vary throughout the memory hierarchy. For example, at the bank level, the parallelism is limited by the number of banks. However, MCs typically queue requests to busy banks. From the higher-level perspective of the MC, the parallelism, or number of in-flight requests, will include the requests in these queues, and appear larger. The latency will also appear different, since the time spent in the queues has to be considered. The above equation will therefore have different values for latency and MLP depending on where it is applied in the memory hierarchy.

In this work we will use the above equation to understand how contention for shared memory resources impacts application performance. Such contention can occur throughout the memory hierarchy, and can both increase access latencies and reduce the memory parallelism. The resulting effect on application performance is a function of the application's sensitivity to increased latency or reduced memory parallelism.

III. EXPERIMENTAL SETUP

The experiments presented in this paper have been run on a quad-core Intel Xeon E5520 (Nehalem) machine. Its cache configuration is detailed in the following table:

1Page-empties occur when the MC preemptively closes a page that hasn't been accessed recently, to optimistically turn a page-miss into a page-empty.