analyzing the impact of data prefetching on chip multiprocessors naoto fukumoto, tomonobu mihara,...

27
Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan ACSAC 13 August 4, 2008

Upload: sydney-fleming

Post on 19-Jan-2016

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Analyzing the Impact of Data Prefetching on Chip MultiProcessors

Naoto Fukumoto, Tomonobu Mihara,Koji Inoue, Kazuaki Murakami

Kyushu University, Japan

ACSAC 13August 4, 2008

Page 2: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

2

Back Ground

• CMP (Chip MultiProcessor):– Several processor cores integrated in a chip– High performance by parallel processing– New feature: Cache-to-cache data transfer

• Limitation factor of CMP performance– Memory-wall problem is more critical

• High frequency of off-chip accesses• Not scaling bandwidth with the number of cores

CMP

Core

L1 $

L2 $

Core

L1 $

chip

Data prefetching is more important in CMPs

Page 3: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Motivation & Goal

• Motivation– Conventional prefetch techniques have been

developed for uniprocessors– Not clear that these prefetch techniques achieve high

performance in even in CMPs– Is it necessary for the prefetch techniques to consider

CMP features ?– Need to know the effect of prefetch on CMPs

• Goal– Analysis of the prefetch effect on CMPs

3

Page 4: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Outline

• Introduction• Prefetch Taxonomy for multiprocessors• Extension for CMPs• Quantitative Analysis• Conclusions

4

Page 5: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Classification of Prefetches According to Impact on Memory Performance

• Focusing on each prefetch• Definition of the prefetch states– Initial state: the state just after a block is

prefetched into cache memory – Final State: the state when the block is evicted

from cache memory– The state transits based on Events during the life

time of the prefetched block in cache memory

5

Page 6: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Definition of Events

Event1. The prefetched block is accessed by the local core

Event2. The local core accesses the block which has evicted from the cache by the prefetch

Event3. The prefetch causes a downgrade followed by a subsequent upgrade in a remote core

6

Local core Remote core

Core

L1 $

L2 $

Core

L1 $

Main MemoryA

A

A

prefetch ALoad A

hiding off-chip access latency

Page 7: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Core

L1 $

L2 $

Core

L1 $

Main Memory

Definition of Events

Event1. The prefetched block is accessed by the local core

Event2. The local core accesses the block which has evicted from the cache by the prefetch

Event3. The prefetch causes a downgrade followed by a subsequent upgrade in a remote core

B

7

Local core Remote core

A

A

A

prefetch ALoad B

Cache miss!!blockBIs evicted

Page 8: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Core

L1 $

L2 $

Core

L1 $

Main Memory

Definition of Events

Event1. The prefetched block is accessed by the local core

Event2. The local core accesses the block which has evicted from the cache by the prefetch

Event3. The prefetch causes a downgrade followed by a subsequent upgrade in a remote core

8

Local core Remote core

AA

prefetch A Store A

Invalidate Request

Page 9: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

The State Transition of Prefetch in Local Core

Useless

Useless/Conflict

Useful Event2

Event1.

Useful/Conflict

Event1

9

Local core Remote core

Core

L1 $

L2 $

Core

L1 $

Main MemoryA

A

A

prefetch ALoad A# of local L1 cache misses is decreased

# of memory accesses is increased

in local core(initial state)

# of local L1 cache misses and # of

accesses are increased in local

core

# of memory accesses is

Increased in local core

B

Load B

blockBIs evicted

cache miss!!

Page 10: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

The State Transition of Prefetch in Local and Remote Cores*

Useless

Useless/Conflict

Useful

Useful/Conflict

* Jerger, N., Hill, E., and Lipasti, M., ``Friendly Fire: Understanding the Effects of Multiprocessor Prefetching‘’ In Proceedings of the IEEE International  Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

Event2

Event1

Event1

10

Page 11: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Useful/Conflict

Event1

Useful

The State Transition of Prefetch in Local and Remote Cores*

Useless

Useless/Conflict

Harmful

Harmful/Conflict

Event3

Event2

* Jerger, N., Hill, E., and Lipasti, M., ``Friendly Fire: Understanding the Effects of Multiprocessor Prefetching‘’ In Proceedings of the IEEE International  Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

Event3

Event2

Event1

11

Core

L1 $

L2 $

Core

L1 $

Main Memory

Local core Remote core

AA

prefetch A Store A

Invalidate Request

B

Load B

# of invalidation requests and # of

memory accesses are increased

# of invalidation requests, # of memory accesses

and #of cache misses are increased

cache miss!!

invalidated

Page 12: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Considering Cache-to-Cache Data Transfer

• Event4. The prefetched block loaded from L2 or main memory is accessed by a remote core

Local core Remote core

Core

L1 $

L2 $

Core

L1 $

Main MemoryA

A

A

prefetch A Load A

A

12

hiding off chip access latency

Page 13: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

The State Transition in CMPs

  Useful

Useful/Conflict

Harmful

Harmful/Conflict

Event3.

Event2.

Event3.

Event2.

Event1.

13

Event1.

Local core Remote core

Core

L1 $

L2 $

Core

L1 $

Main MemoryA

B

Useless

   Useless/Conflict

Page 14: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Event2.Harmful

Harmful/Conflict

The State Transition in CMPs

Useless

   Useless/Conflict

  Useful

Useful/Conflict

Useless/Conflict/Remote

Useless/Remote

Event4

Event2

Event1

Event1

Event4

14

Local core Remote core

Core

L1 $

L2 $

Core

L1 $

Main MemoryA

A

A

prefetch A

Load A

AB

Load B

cache miss

Load A

# of L2 access is decreased in remote core

# of L2 accesses is decreased in remote core,

# of cache misses is increased in local core

Page 15: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Classification of Prefetches in CMPs

Useless

  Useless/Conflict

Harmful

Harmful/Conflict

Useful

Useful/Conflict

Useless/Remote

Useless/Conflict/Remote

one L2 access decreased in remote core

one memory access is increased in local core

one cache miss is decreased in local core

one memory access in local core, and invalidate request in remote core are increased

one cache miss is increasedin local core

15

Best case

Worst case

Better case

Worse case

Page 16: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Outline

• Introduction• Prefetch Taxonomy– for Multiprocessors– for CMP

• Quantitative Analysis• Conclusions

16

Page 17: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

• Simulator– M5: CMP simulator• Prefetch mechanism attached on L1 cache• Stride prefetch and tagged prefetch• MOESI coherence protocol

• Benchmark programs– SPLASH-2:

Scientific computation programs

Simulation Environment

L2$

Main memory

Core

D I

Core

D I

Core

D I

Core

D I

64KB 2-way

4MB8way

17

Page 18: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

0%10%20%30%40%50%60%70%80%90%

100%Harmful/Conflict

Harmful

Useless/Conflict/Remote

Useless/Conflict

Useless/Remote

Useless

Useful/Conflict

Useful

Brea

kdow

n of

pre

fetc

hes

Can Conventional Prefetch Techniques Exploit Cache-to-Cache data transfer ?

•The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5% Conventional prefetch techniques do not exploit cache-to-cache data transfer effectively

1 2 FMM LU Radix Water

1. stride prefetch 2.tagged prefetch

18

Useless/Remote

Useless/Conflict/Remote

Page 19: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

0%10%20%30%40%50%60%70%80%90%

100%Harmful/Conflict

Harmful

Useless/Conflict/Remote

Useless/Conflict

Useless/Remote

Useless

Useful/Conflict

Useful

Brea

kdow

n of

pre

fetc

hes

Are the Prefetched-Block Invalidations Serious Problem for CMPs?

•Prefetches of Harmful and Harmful/Conflict are extremely few (average 0.2%) Invalidations of prefetched blocks are negligible

1 2 FMM LU Radix Water

1. stride prefetch 2.tagged prefetch

19

Page 20: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

20

Multiprocessor vs. Chip Multiprocessor

• Harmful and Harmful/Conflict prefetches– 0.01~0.70% in CMP (tagged prefetch) Small negative impact – 2~18% in MP* (sequential prefetch) Large negative impact

• Why does this difference occur ?

*Jerger, N., Hill, E., and Lipasti, M., Friendly Fire: Understanding the Effects of Multiprocessor Prefetching. In Proceedings of the IEEE International  Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

Page 21: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

The Reason of Difference of Invalidation Rate

• Difference of the life time of prefetched blocks in cache – Long life time (large cache size) High possibility of invalidation– Short life time (small cache size) Low possibility of invalidation

• If the cache size is large, the negative impact is large( like MPs)

• If the cache size is small, the negative impact is small (like CMPs)

CMP

Core

L1 $

L2 $

core

L1 $

Multiprocessor

load prefetched block and keep coherence

Core

L1$

L2$

Core

L1$

L2$

21

Page 22: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

The Invalidation Rate of Prefetched Blockswith Varying L1 Cache Size (tagged prefetch)

Larger cache large negative impact ( like MPs)Smaller cache small negative impact (like CMPs)

FMM LU Radix Water0%

10%

20%

30%

40%

50%

60%128KB 256KB 512KB 1MB

inva

lidat

ed ra

te

L1 cache size

22

Page 23: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Summary

• Contributions– New method to analyze prefetch effects on CMPs– Quantitative analysis for two types of prefetches

• Observations– Conventional prefetch techniques DO NOT exploit

cache-to-cache data transfer effectively– Harmful prefetches are NOT harmful in CMPs

• Future work– Propose novel prefetch technique exploiting the

features of CMPs 23

Page 24: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Any Questions ?~Please speak slowly~

Thank you

Page 25: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Average Memory Access Time (AMAT)

25

base

strid

e

tagg

ed

base

strid

e

tagg

ed

base

strid

e

tagg

ed

base

strid

e

tagg

ed

FMM LU Radix Water

0

0.5

1

1.5

2

2.5

3

MCL2MBSSSBCCHCCL2HCCRL1HCCL1L1 $

Remote L1 $L2 $

Shared bus

Memory bus

Main memory

Page 26: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

Harmful and Harmful/Conflict Prefetches varying # of cores

26

2 co

res

4 co

res

8 co

res

2 co

res

4 co

res

8 co

res

2 co

res

4 co

res

8 co

res

2 co

res

4 co

res

8 co

res

2 co

res

4 co

res

8 co

res

2 co

res

4 co

res

8 co

res

Barnes FMM LU Radix Raytrace Water

0.00%

0.50%

1.00%

1.50%

2.00%

2.50%

3.00%

harmful/conflictharmful

Page 27: Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan

MultiProcessor Traffic and Miss Taxonomy (MPTMT [Jerger’06])

• MultiProcessor Traffic and Miss Taxonomy (MPTMT)– This is an extended version of Uniproccessor

taxonomy (Srinivasan et al.)– Prefetches are classified according to effects on

memory performance– To count the classified prefetches, we can measure

the prefetch effects precisely

27