pacman: p approximately optimalclump/cdp2010/gu_pacman.pdf · the transformation • loop unrolling...

25
PACMAN: Program-level Approximately Optimal Cache Management Xiaoming Gu, Chen Ding Department of Computer Science University of Rochester

Upload: others

Post on 06-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

PACMAN: Program-level Approximately Optimal Cache Management

Xiaoming Gu, Chen DingDepartment of Computer Science

University of Rochester

Page 2: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

• What is PACMAN?

2

• An ongoing compiler study to reduce cache misses using hardware bypassing supports

• A famous video game

Page 3: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Framework

3

Analysisidentify

references for bypassing

tag references with

bypassingTransformation

feedback

Page 4: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Outline

• Motivation

• Hardware Bypassing Support

• An Example

• Details of Analysis and Transformation

• Summary

4

Page 5: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Outline

• Motivation

• Hardware Bypassing Support

• An Example

• Details of Analysis and Transformation

• Summary

5

Page 6: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Motivation

• Running a sequential program alone on a multi-core chip

• hard to parallelize the program

• running alone for no interference

• reduce cache misses

PACMAN: focus on reducing cache misses for sequential programs running alone

6

Page 7: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Outline

• Motivation

• Hardware Bypassing Support

• An Example

• Details of Analysis and Transformation

• Summary

7

Page 8: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Normal LRU Inst.

8

evicted

Page 9: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Bypass LRU Inst.

9

evicted

Page 10: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Outline

• Motivation

• Hardware Bypassing Support

• An Example

• Details of Analysis and Transformation

• Summary

10

Page 11: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

SOR• Jacobi Successive Over-relaxation

• from NIST SciMark 2.0

• a classical stencil computation

• compiled by LLVM 2.7 using gold plugin with -O4

11

Page 12: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Gap between LRU and OPT

12

NUM_ITERATIONS=10 M=N=512

• Two gaps

• The working sets (knees) of OPT are much smoother than LRU’s

gaps

Page 13: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Transformation

Normal and bypass LRU instructions mixed ==> LRU Bypassing

13

Page 14: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Improvement of PACMAN

NUM_ITERATIONS=10 M=N=512

•The gap at the second working set disappears!

14

Page 15: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Outline

• Motivation

• Hardware Bypassing Support

• An Example

• Details of Analysis and Transformation

• Summary

15

Page 16: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Analysis

• Simulate OPT• for a given cache configuration

• each run-time access has three fields• data addr, static ref. ID, and bypassing flag (off by default)

• when an eviction happens, set bypassing flag on for the previously last access to the victim

16

A1, A2, ..., Ai, ......, Aj, ..., AN

X is evictedX is accessed

no access to Xin between

set bypassing flag on

Page 17: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Analysis (cont’d)

• Simulate OPT (cont’d)

• calculate bypassing ratios for all memory references

17

bypassing ratio of a reference =#accesses generated by the reference and with bypassing flag on

#accesses generated by the reference

• the references with high bypassing ratios are the candidates for bypassing

Page 18: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

The Transformation• Loop unrolling

• find out the target references in IR using the candidates’ ref. IDs

• figure out the last touch to a cache line in the innermost loop body• the cache line size

• the array element size

• the loop step stride

• the array indexing

• separate the last touch using loop unrolling

• tag the last touch with bypass LRU

18

Page 19: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

SOR by PACMAN

• Gim1[j] is the candidate

• only do bypassing for the last touch in the innermost loop body

• spatial locality retained

• use loop unrolling to do separation in practice

19

NUM_ITERATIONS=10, M=N=512 fully-associative, 512KB, line size=64B

Page 20: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Why only the Second Working Set Improved?

• PACMAN simulation only at the second working set

• Reduce cache misses for the second working set

20

do OPT simulation at 512KB

Page 21: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Set-associative Cache

21

• Keep losing benefits on cache with lower associativities

• The improvement is still significant

NUM_ITERATIONS=10 M=N=512

Page 22: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

With a Different Input

• The improvement is scalable with input sizes

22

NUM_ITERATIONS=10 M=N=512

NUM_ITERATIONS=20 M=N=1024

8X accesses

Page 23: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Future Work

23

• extend PACMAN for general applications

• more realistic hardware environment

• multi-level pseudo-LRU cache

• the differences between loads and stores

• the interaction with prefetching

Page 24: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Summary

24

Use simple hardware bypassing supports

Find out bypassing references by simulating OPT

Cache misses are reduced very close to the optimal case

Achieve significant improvement even on low associativity cache

The training results can be used for a real run with a larger input size

Page 25: PACMAN: P Approximately Optimalclump/cdp2010/Gu_PACMAN.pdf · The Transformation • Loop unrolling • find out the target references in IR using the candidates’ ref. IDs •

Q & ANOT