Accurate and Complexity-Effective Spatial Pattern Prediction


Page 1: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Computer Architecture Lab at University of Toronto

AENAO: Power Aware Memory Coherence & Hierarchies for Servers

http://eecg.toronto.edu/~aenao

Accurate and Complexity-Effective Spatial Pattern Prediction

Chi Chen, Se-Hyun Yang, Babak Falsafi, Andreas Moshovos

Page 2: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Motivation – Variation in Spatial Locality

Caches Exploit Spatial Locality via Block Size
• Prefetch nearby data to improve performance

“One Size Fits All” Solution
• Large enough for prefetching
• Small enough to avoid memory link saturation

Opportunity
• Variation within and across applications

If the “Best Block Size” were known:
1. Prefetch even further: higher performance
2. “Turn off” unused data in cache: lower leakage power

Page 3: Accurate and Complexity-Effective                    Spatial Pattern Prediction

This Work

Dynamic Spatial Pattern Prediction

Leakage Power Reduction
• Sub-blocks of a block as a group
• Place “unused” block parts in low-leakage state

Prefetching
• Consecutive memory blocks as a group
• Selectively prefetch blocks upon first access in group

Key Contribution: PC + Offset Within Group
• Quick learning
• Compact representation
• High coverage

Page 4: Accurate and Complexity-Effective                    Spatial Pattern Prediction

How Well it Works

Spatial Pattern Predictor (SPP)
• 256-entry tag-less direct-mapped
• ~95% coverage

L1 Data Leakage Energy Reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation

Prefetching w/ 1024-byte Group
• Up to 2x speedup and 56% average
• Conventional cache: 14% slowdown

Page 5: Accurate and Complexity-Effective                    Spatial Pattern Prediction


Outline

Conventional Cache: Optimization Opportunities

Variation in Spatial Locality

Prediction Framework

Prior Work

Results

Page 6: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Optimization Opportunity #1

L1D with 64-byte cache lines

typedef struct person {
    char name[20];
    …
    int age;
    int isAdult;
    struct person *next;
} person; // total 64 bytes

// do something …

while (people) {
    if (people->age >= 21)
        people->isAdult = TRUE;
    people = people->next;
}

[Figure: conventional cache with three lines, each brought in by a miss; in each line only age, isAdult, and next are touched. Legend: untouched, touched.]

Resident untouched data: wasteful leakage
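The amount of untouched data in the example above can be made concrete with a short sketch. It assumes 8-byte sub-blocks and a field layout (age at byte 48, isAdult at byte 52, an 8-byte next pointer at byte 56) that is purely illustrative, since the slide elides part of the struct. `touched_mask` and `person_loop_mask` are hypothetical helpers, not names from the presentation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 64-byte cache line split into 8-byte sub-blocks. */
enum { LINE_BYTES = 64, SUB_BYTES = 8 };

/* Hypothetical helper: bit i of the result is set when an access of
 * `size` bytes at byte `offset` within the line touches sub-block i. */
static uint8_t touched_mask(unsigned offset, unsigned size)
{
    assert(size > 0 && offset + size <= LINE_BYTES);
    unsigned first = offset / SUB_BYTES;
    unsigned last = (offset + size - 1) / SUB_BYTES;
    uint8_t mask = 0;
    for (unsigned i = first; i <= last; i++)
        mask |= (uint8_t)(1u << i);
    return mask;
}

/* Sub-blocks touched by one loop iteration under the ASSUMED layout:
 * age at byte 48, isAdult at byte 52, next (8 bytes) at byte 56. */
static uint8_t person_loop_mask(void)
{
    return (uint8_t)(touched_mask(48, 4) | touched_mask(52, 4) |
                     touched_mask(56, 8));
}
```

Under these assumptions the loop touches only the last two of eight sub-blocks (mask 0xC0), so three quarters of each resident line is never referenced.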

Page 7: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Optimization Opportunity #2

L1D with 64-byte cache lines

struct person {
    char name[20];
    …
    int age;
    int isAdult;
} people[LARGE];

// do something …

for (int i = 0; i < LARGE; i++) {
    if (people[i].age >= 21)
        people[i].isAdult = TRUE;
}

[Figure: conventional cache holding the age and isAdult fields of successive array elements; consecutive cache lines are marked as Group #1 and Group #2.]

• Detect access patterns at group level
• Selectively prefetch same block members
• Improve performance w/o saturating memory

Page 8: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Variation in Spatial Locality

[Figure: stacked bars for facerec, gcc, mcf, and vortex. Y-axis: All Cache Lines Touched (0% to 100%). Legend: 1/8, 2/8, 3/8, 4/8, 5/8, 6/8, 7/8, 8/8 of the line used.]

Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines

Benchmark:           facerec   gcc   mcf   vortex
Average Line Usage:  40%       89%   26%   48%
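A minimal sketch of how the "fraction of data used before eviction" statistic could be gathered, assuming each 64-byte line carries an 8-bit usage mask (one bit per 8-byte sub-block) that is inspected at eviction. The struct and function names are hypothetical; the slides do not describe the measurement code.

```c
#include <assert.h>
#include <stdint.h>

enum { SUBS_PER_LINE = 8 }; /* assumed: 64B line / 8B sub-block */

/* Count set bits in an 8-bit usage mask (one bit per sub-block). */
static unsigned subblocks_used(uint8_t mask)
{
    unsigned n = 0;
    while (mask) {
        mask &= (uint8_t)(mask - 1); /* clear lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical per-benchmark statistics, updated at eviction time. */
struct usage_stats {
    unsigned long histogram[SUBS_PER_LINE + 1]; /* [k]: lines with k/8 used */
    unsigned long evictions;
    unsigned long subblocks_touched;
};

static void record_eviction(struct usage_stats *s, uint8_t mask)
{
    unsigned k = subblocks_used(mask);
    s->histogram[k]++;
    s->evictions++;
    s->subblocks_touched += k;
}

/* Average line usage in percent across all evicted lines. */
static double average_line_usage(const struct usage_stats *s)
{
    if (s->evictions == 0)
        return 0.0;
    return 100.0 * (double)s->subblocks_touched /
           ((double)s->evictions * SUBS_PER_LINE);
}
```

The histogram buckets correspond to the 1/8 through 8/8 legend entries, and the average is assumed to match what the slide calls Average Line Usage.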

Page 9: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Prediction Framework

Minimum Fetch Unit (MFU):
• replacement unit of cache
• e.g., cache line or sub-block

Spatial Group:
• group of adjacent MFUs
• indexed by logical tag

Spatial Pattern:
• reference pattern of a spatial group, e.g., 1 0 . . . 1

Spatial Group Generation:
• starts with a new logical tag

[Figure: timeline of accesses labeled Tag0, Tag0, Tag1, Tag1, Tag1, ...; X-axis: Time.]
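The terms above can be sketched in code. This is an illustration under stated assumptions: 64-byte MFUs, 1024-byte spatial groups, and a single tracked generation. All names are hypothetical and the slides do not give this implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, both powers of two. */
enum { MFU_BYTES = 64, GROUP_BYTES = 1024 };

/* One bit per MFU; 16 MFUs per group fit in 16 bits. */
typedef uint16_t spatial_pattern;

/* Logical tag: identifies the spatial group an address belongs to. */
static uint64_t logical_tag(uint64_t addr)
{
    return addr / GROUP_BYTES;
}

/* Offset of the accessed MFU within its spatial group (0..15 here). */
static unsigned group_offset(uint64_t addr)
{
    return (unsigned)((addr % GROUP_BYTES) / MFU_BYTES);
}

/* State of the spatial group generation currently being recorded. */
struct generation {
    uint64_t tag;
    spatial_pattern pattern;
    int valid;
};

/* Record one access. Returns 1 if a new logical tag started a new
 * generation, 0 if the access extended the current one. */
static int record_access(struct generation *g, uint64_t addr)
{
    uint64_t tag = logical_tag(addr);
    int started = !g->valid || g->tag != tag;
    if (started) {
        g->tag = tag;
        g->pattern = 0;
        g->valid = 1;
    }
    g->pattern |= (spatial_pattern)(1u << group_offset(addr));
    return started;
}
```

For example, accesses to 0x1000, 0x1040, and 0x13C0 fall in the same group and set bits 0, 1, and 15; a following access to 0x1400 carries a new logical tag and starts a fresh pattern.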

Page 10: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Spatial Pattern Predictor

[Figure: Data Cache, Current Pattern Table (CPT), and Pattern History Table (PHT). Labels: Spatial Pattern Register, PHT Entry Pointer, Prediction Index, Spatial Pattern History, a "=?" comparison, and the output Spatial Pattern Prediction.]

• Current Pattern Table records patterns
• Pattern History Table stores captured patterns
• Prediction Index: PC + SPG Offset, 32 bits
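A rough sketch of how the two tables might interact, assuming a 256-entry tag-less direct-mapped PHT and up to 16 MFUs per group. The index function here (word-aligned PC plus offset, masked to the table size) is an assumption made for illustration; the slides only state that the index combines the PC with the SPG offset. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PHT_ENTRIES = 256 };       /* tag-less, direct-mapped */

typedef uint16_t spatial_pattern; /* assumed: up to 16 MFUs per group */

/* Pattern History Table: last captured pattern per prediction index. */
static spatial_pattern pht[PHT_ENTRIES];

/* ASSUMED index function combining PC and spatial-group offset. */
static unsigned prediction_index(uint32_t pc, unsigned spg_offset)
{
    return ((pc >> 2) + spg_offset) & (PHT_ENTRIES - 1);
}

/* Current Pattern Table entry: the pattern being recorded for a live
 * group, plus a pointer to the PHT entry to update at the end. */
struct cpt_entry {
    spatial_pattern pattern;
    unsigned pht_index;
};

/* First access of a generation: look up the prediction and start
 * recording the new pattern. */
static spatial_pattern generation_start(struct cpt_entry *e, uint32_t pc,
                                        unsigned spg_offset)
{
    e->pht_index = prediction_index(pc, spg_offset);
    e->pattern = (spatial_pattern)(1u << spg_offset);
    return pht[e->pht_index];
}

/* Later access within the same generation. */
static void generation_access(struct cpt_entry *e, unsigned spg_offset)
{
    e->pattern |= (spatial_pattern)(1u << spg_offset);
}

/* End of generation (e.g. eviction): store the captured pattern. */
static void generation_end(const struct cpt_entry *e)
{
    pht[e->pht_index] = e->pattern;
}
```

Because the table is tag-less, two different (PC, offset) pairs that map to the same entry simply share history; the sketch makes no attempt to detect that.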

Page 11: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Prior Work

• Static profiling, V. Vleet, et al. ICCD 1999
• Adjustable block size, Dubnicki & LeBlanc. ISCA 1992
• Fetching adjacent cache lines, Temam & Jegou. ICS 1994
• Dual cache, Gonzalez, Aliagas & Valero. ICS 1995
• Spatial Locality Detection Table, Johnson, Merten & Hwu. MICRO 1998
• Spatial Footprint Predictor (SFP), Kumar & Wilkerson. ISCA 1998

Key Difference is Prediction Handle: PC + Group Offset
1. Compact Representation
2. Quick Learning
3. High Coverage

Page 12: Accurate and Complexity-Effective                    Spatial Pattern Prediction


Results Overview

Predictor Performance Statistics

Leakage Power Reduction

Performance Improvement w/ Prefetching

Page 13: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Methodology

SimpleScalar simulator
• 64KB 2-way L1D/L1I cache, 2-cycle latency
• 2MB 8-way L2 cache, 12-cycle latency

SPEC CPU2000 Alpha binaries + reference inputs

Predictor performance evaluation
• Simulated to completion

Performance impact evaluation
• Skipped 10B and simulated next 500M instructions

Energy reduction evaluation
• SPICE w/ 70nm CMOS technology & 1V supply voltage

Page 14: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Practical Predictor: Performance

[Figure: stacked bars for gcc, mcf, vortex, and facerec, three configurations (A, B, C) each. Y-axis: % of perfect predictions (0% to 160%). Legend: Training, Over-Prediction, Under-Prediction, Correct Prediction.]

256 entries. A: 16-way, B: DM, C: FA

256-entry tag-less direct-mapped: average prediction accuracy of 96%
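The legend categories can be illustrated with bitwise operations on patterns. This is one plausible reading, not a definition taken from the slides: an MFU counts as correct if it was both predicted and used, over-predicted if predicted but unused, and under-predicted if used but not predicted. The Training category (no history available yet) is not modeled, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t spatial_pattern; /* assumed: one bit per MFU */

struct outcome {
    unsigned correct; /* predicted and actually used */
    unsigned over;    /* predicted but never used */
    unsigned under;   /* used but not predicted */
};

/* Count set bits in a 16-bit pattern. */
static unsigned popcount16(spatial_pattern p)
{
    unsigned n = 0;
    for (; p != 0; p &= (spatial_pattern)(p - 1))
        n++;
    return n;
}

/* Compare a predicted pattern with the pattern observed by the end of
 * the spatial group generation. */
static struct outcome classify(spatial_pattern predicted,
                               spatial_pattern actual)
{
    struct outcome o;
    o.correct = popcount16((spatial_pattern)(predicted & actual));
    o.over = popcount16((spatial_pattern)(predicted & ~actual));
    o.under = popcount16((spatial_pattern)(actual & ~predicted));
    return o;
}
```

Under this reading, correct plus under-predicted MFUs add up to the perfect prediction, and over-predictions stack on top of it, which would explain a y-axis that extends past 100%.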

Page 15: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Predictor Applications

Leakage energy reduction
• Sub-blocks as minimum fetch units
• Cache lines as spatial groups
• A cache miss starts a spatial group generation
• Assuming Gated-Ground by Agarwal, Li, & Roy

Spatial group prefetcher
• Cache lines as minimum fetch units
• Adjacent cache lines grouped into spatial groups
• A new logical tag starts a spatial group generation
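Both applications consume the same predicted pattern in different ways, which the following sketch tries to show. It assumes 64-byte cache lines, 1024-byte groups for the prefetcher, and 8 sub-blocks per line for the leakage case; the function names are hypothetical and the gating mechanism itself is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetcher geometry: 16 blocks of 64 bytes = 1024-byte group. */
enum { BLOCK_BYTES = 64, BLOCKS_PER_GROUP = 16 };

/* Prefetcher use: fill `out` with the addresses of blocks predicted to
 * be used, skipping the block whose access triggered the prediction.
 * Returns the number of addresses written. */
static size_t blocks_to_prefetch(uint64_t trigger_addr, uint16_t predicted,
                                 uint64_t out[BLOCKS_PER_GROUP])
{
    uint64_t group_bytes = (uint64_t)BLOCK_BYTES * BLOCKS_PER_GROUP;
    uint64_t base = trigger_addr - trigger_addr % group_bytes;
    unsigned trigger = (unsigned)((trigger_addr - base) / BLOCK_BYTES);
    size_t n = 0;
    for (unsigned i = 0; i < BLOCKS_PER_GROUP; i++) {
        if (i != trigger && ((predicted >> i) & 1u))
            out[n++] = base + (uint64_t)i * BLOCK_BYTES;
    }
    return n;
}

/* Leakage use: sub-blocks NOT predicted to be used are the candidates
 * for the low-leakage state (8 sub-blocks per line assumed). */
static uint8_t subblocks_to_gate(uint8_t predicted)
{
    return (uint8_t)~predicted;
}
```

With a trigger access at 0x1040 and a predicted pattern of 0x8003, the sketch requests blocks 0x1000 and 0x13C0 and skips the triggering block itself.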

Page 16: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Leakage Energy Reduction

[Figure: bars for gcc, mcf, vortex, facerec, and AVG. Axes: Relative Leakage Power (0% to 100%) and Execution Time Increase (0% to 5%). Chart annotations: 60%, <1%, ~2%.]

• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation

Page 17: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Performance Improvement

[Figure: bars for facerec, gcc, mcf, vortex, and AVG. Y-axis: -50% to 150%. Legend: SPG 1024, SPG 512, CONV. 1024, CONV. 512.]

• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups

Page 18: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Summary

Spatial Pattern Predictor (SPP)
• Key Contribution: PC + Group Offset

Small and Effective, High Coverage
• 256-entry tag-less direct-mapped
• ~95% coverage

L1 Data Leakage Energy Reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation

Prefetching w/ 1024-byte Group
• Up to 2x speedup and 56% average
• Conventional cache: 14% slowdown


Page 20: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Prediction Index

Infinite Tables
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements

[Figure: stacked bars for facerec, gcc, mcf, and vortex, four index schemes (A to D) each. Y-axis: 0% to 160%. Legend: Training, Over-Prediction, Under-Prediction, Correct Prediction.]

A: PC, B: PC+SPG ID, C: PC+SPG OFFSET, D: PC+ADDR

Page 21: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Contributions

Spatial Pattern Predictor (SPP)
• 256-entry tag-less direct-mapped
• ~95% coverage

Leakage Energy Reduction
• ~40% reduction w/ 70nm CMOS technology
• < 1% average performance degradation

Processor Performance Improvement
• Up to 2x speedup

Page 22: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Variations in Spatial Locality

[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex. Y-axis: Percentage of All Cache Line Usages (0% to 100%). Legend: <=13%, 14-25%, 26-38%, 39-50%, 51-63%, 64-75%, 76-88%, 89-100%.]

Fraction of data used before eviction
Measured on 64KB 2-way L1D w/ 64B cache lines

Page 23: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Prediction Index

• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has a low prediction memory requirement

[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, four index schemes (A to D) each. Y-axis: Percent of Perfect Predictions (0% to 160%). Legend: Correct Prediction, Underprediction, Overprediction, Training.]

A: PC-only, B: PC+SPG ID, C: PC+SPG OFFSET, D: PC+ADDR

Page 24: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Predictor Memory Organization

256-entry tag-less direct-mapped yields average prediction accuracy of 96%

[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, six configurations (A to F) each. Y-axis: Percent of Perfect Predictions (0% to 160%). Legend: Correct Prediction, Underprediction, Overprediction, Training.]

A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA

Page 25: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Spatial Group Size (1/2)

[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, five configurations (A to E) each. Y-axis: Percentage of Perfect Predictions (0% to 160%). Legend: Correct Prediction, Underprediction, Overprediction, Training.]

A: 16B Spatial Group, 8B Fetch Unit
B: 32B Spatial Group, 8B Fetch Unit
C: 64B Spatial Group, 8B Fetch Unit
D: 128B Spatial Group, 8B Fetch Unit
E: 256B Spatial Group, 8B Fetch Unit

Page 26: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Spatial Group Size (2/2)

[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, six configurations (A to F) each. Y-axis: Percentage of Perfect Predictions (0% to 160%). Legend: Correct Prediction, Underprediction, Overprediction, Training.]

A: 32B Spatial Group, 8B Fetch Unit
B: 64B Spatial Group, 8B Fetch Unit
C: 128B Spatial Group, 8B Fetch Unit
D: 128B Spatial Group, 64B Fetch Unit
E: 256B Spatial Group, 64B Fetch Unit
F: 512B Spatial Group, 64B Fetch Unit

Page 27: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Predictor Memory Organization

[Figure: stacked bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, and vortex, seven table sizes (A to G) each. Y-axis: Percentage of Perfect Predictions (0% to 160%). Legend: Correct Prediction, Underprediction, Overprediction, Training.]

A: 8-entry, B: 16-entry, C: 32-entry, D: 64-entry, E: 128-entry, F: 256-entry, G: INF

Page 28: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Leakage Energy Reduction

• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation

[Figure: bars for ammp, art, bzip, equake, facerec, fma3d, gap, gcc, lucas, mcf, mgrid, vortex, and AVG. Axes: Fraction of Baseline Leakage Dissipation (0% to 100%) and Execution Time Increase (up to 5%).]

Page 29: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Performance Improvement

Performance improvement (%) per benchmark:

Benchmark   CONV. 512B   CONV. 1024B   SPG 512B   SPG 1024B
ammp           -41          -63           10         -25
art             32           96          121         305
bzip           -43          -49            6           8
equake         -34          -41           59          99
facerec        -13           -3           58         103
fma3d           -9           -9            0           0
gap             20           31           31          47
gcc             -2           -2            1           1
lucas          -23          -67           34          51
mcf            -27          -32           38          67
mgrid            6           12           36          53
vortex         -27          -43            1           1
AVG            -13          -14           33          59

• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups

Page 30: Accurate and Complexity-Effective                    Spatial Pattern Prediction

Predictor Memory Organization

[Figure: stacked bars for gcc, mcf, vortex, and facerec, six configurations (A to F) each. Y-axis: 0% to 160%. Legend: Training, Over-Prediction, Under-Prediction, Correct Prediction.]

A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA

256-entry tag-less direct-mapped: average prediction accuracy of 96%