predictor-directed stream buffers

Predictor-Directed Stream Buffers

Timothy Sherwood

Suleyman Sair

Brad Calder


Overview

• Introduction

• Past Stream Buffer work

• Predictor-Directed Stream Buffers

• Policy Improvements

• Results

• Contribution


Introduction

• Memory Wall

• Latency reduction through prefetching
– without eating too much bandwidth

• Stream Buffers are among the most widely used prefetchers
– simple to implement
– very efficient

• Pointer based codes


Past Stream Buffer work

• Jouppi 1990 – consecutive cache line FIFO

• Palacharla and Kessler 1994
– non-unit stride (based on memory chunk)
– allocation filters

• Farkas et al. 1997
– PC-based stride
– fully associative / non-overlapping



[Figure: stride-based stream buffer. Labels: N buffers; entries with tag, cache block, comparator; Predicted Stride; Last Address; store predicted stride in streaming buffer on allocation; from/to next lower level of memory; to data cache, register file, and MSHRs.]



• Past work targeted at streaming in arrays
– either in sequential order
– or stride order (multidimensional array)

• Could not handle Pointer Codes
– repetitive non-striding references

• Need a more General Predictor


Predictor-Directed Stream Buffer

• The Goal: Simple and efficient hardware based prefetching of complex but predictable streams

• Approach: Take a general predictor and hook it up to the well established stream buffer front end.

• Separate the predictor from the prefetcher

• Can use almost any predictor
– 2 Delta
– Context
– Markov
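As a rough illustration of this separation, the sketch below defines a minimal predictor interface and one plug-in behind it, a two-delta stride predictor. The interface (`update`, `predict`), the class names, the table layout, and the helper `fill_stream` are all assumptions made for illustration; the slides do not specify an interface.

```python
from typing import Optional, Protocol


class AddressPredictor(Protocol):
    """Hypothetical interface: anything with these two methods can drive a buffer."""

    def update(self, pc: int, addr: int) -> None: ...
    def predict(self, pc: int, last_addr: int) -> Optional[int]: ...


class TwoDeltaStridePredictor:
    """Two-delta stride: the stride changes only after a delta is seen twice in a row."""

    def __init__(self):
        self.table = {}  # load PC -> [last address, last delta, confirmed stride]

    def update(self, pc, addr):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [addr, 0, 0]
            return
        delta = addr - entry[0]
        if delta == entry[1]:
            entry[2] = delta  # same delta twice: accept it as the stride
        entry[0] = addr
        entry[1] = delta

    def predict(self, pc, last_addr):
        entry = self.table.get(pc)
        if entry is None or entry[2] == 0:
            return None  # nothing confident to offer yet
        return last_addr + entry[2]


def fill_stream(predictor, pc, start_addr, depth):
    """Stream buffer side: chain predictions without knowing how they are made."""
    addrs = []
    addr = start_addr
    for _ in range(depth):
        nxt = predictor.predict(pc, addr)
        if nxt is None:
            break
        addrs.append(nxt)
        addr = nxt
    return addrs
```

Swapping in a context or Markov predictor would, under this assumed interface, only require another class with the same two methods; `fill_stream` stays unchanged.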


PSB Generalized Architecture

[Figure: PSB generalized architecture. Labels: Prediction Info (Load PC, History, Stride, Confidence, Last Address); entries with tag, cache block, comparator; N buffers; Address Predictor; load info (PC, address) from write-back stage; update prediction information; subset of prediction info; predicted address; from/to next lower level of memory.]


PSB Stages

• Allocation

• Prediction

• Probe

• Prefetching

• Lookup


Stage Descriptions

• Allocation
– Stream Buffer is allocated to a particular load
– the buffer is initialized
– subject to Allocation Filters

• Prediction
– an empty buffer entry asks for an address
– subject to limited predictor speed


Stage Descriptions (Continued)

• Probe
– if there are free ports, remove useless prefetches
– not mandatory

• Prefetching
– subject to scheduling for ports and priority, prefetches are sent to memory

• Lookup
– when a load performs an L1 access, the Stream Buffers are checked in parallel
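The stages can be pictured with a toy model of a single buffer. This is a minimal sketch under stated assumptions, not the hardware described in the talk: the class `StreamBuffer`, its method names, the entry layout, and the one-prediction-per-call limit are invented for illustration; memory latency is ignored, and the optional Probe stage is left out.

```python
from collections import deque


class StreamBuffer:
    """Toy model of one stream buffer; sizes and fields are illustrative."""

    def __init__(self, depth=4):
        self.depth = depth
        self.pc = None
        self.last_addr = None
        self.entries = deque()  # each entry: {"addr": int, "issued": bool}

    def allocate(self, pc, miss_addr):
        # Allocation: bind the buffer to a load and reset its state.
        self.pc = pc
        self.last_addr = miss_addr
        self.entries.clear()

    def predict_step(self, predict):
        # Prediction: fill at most one empty entry per call, standing in
        # for a predictor with limited bandwidth.
        if self.pc is None or len(self.entries) >= self.depth:
            return False
        addr = predict(self.pc, self.last_addr)
        if addr is None:
            return False
        self.entries.append({"addr": addr, "issued": False})
        self.last_addr = addr
        return True

    def prefetch_step(self):
        # Prefetching: issue the oldest entry that has not been sent yet.
        for entry in self.entries:
            if not entry["issued"]:
                entry["issued"] = True
                return entry["addr"]
        return None

    def lookup(self, addr):
        # Lookup: compare against every issued entry; a hit frees the entry.
        for i, entry in enumerate(self.entries):
            if entry["issued"] and entry["addr"] == addr:
                del self.entries[i]
                return True
        return False
```

For example, with a hypothetical next-line predictor `lambda pc, a: a + 64`, allocating at address 1000 and calling `predict_step` twice queues 1064 and 1128; one `prefetch_step` issues 1064, after which `lookup(1064)` hits and frees a slot for the next prediction.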


PSB Implementation

• Tried many different address predictors

• Best is Stride Filtered Markov
– similar to Joseph and Grunwald’s Predictor
– first order Markov
– striding behavior is filtered out

• Difference is stored to reduce size
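One way to picture a stride-filtered, first-order Markov table that stores differences is sketched below. The indexing function, the full-address tag, the single-delta stride check, and the defaults `diff_bits=16` and `line_bytes=32` are assumptions for illustration only; just the 2K-entry size, the stride filtering, the difference storing, and the "use Markov on a hit, otherwise stride" choice come from the slides.

```python
class StrideFilteredMarkov:
    """Sketch: Markov table holds only non-striding transitions, as differences."""

    def __init__(self, entries=2048, diff_bits=16, line_bytes=32):
        self.entries = entries
        self.diff_bits = diff_bits
        self.line_bytes = line_bytes
        self.markov = {}  # table index -> (tag address, signed difference)
        self.stride = {}  # load PC -> (last address, last stride)

    def _index(self, addr):
        return (addr // self.line_bytes) % self.entries

    def _fits(self, diff):
        # Only differences representable in diff_bits (signed) are stored.
        limit = 1 << (self.diff_bits - 1)
        return -limit <= diff < limit

    def update(self, pc, addr):
        prev = self.stride.get(pc)
        if prev is None:
            self.stride[pc] = (addr, 0)
            return
        last_addr, last_stride = prev
        diff = addr - last_addr
        # Stride filter: a transition the stride would have predicted
        # never enters the Markov table.
        if diff != last_stride and self._fits(diff):
            self.markov[self._index(last_addr)] = (last_addr, diff)
        self.stride[pc] = (addr, diff)

    def predict(self, pc, last_addr):
        hit = self.markov.get(self._index(last_addr))
        if hit is not None and hit[0] == last_addr:
            return last_addr + hit[1]  # Markov hit takes precedence
        prev = self.stride.get(pc)
        if prev is not None and prev[1] != 0:
            return last_addr + prev[1]  # otherwise fall back to the stride
        return None
```

A real table would keep a short partial tag rather than the full address used here; the full tag only keeps the sketch easy to check.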


Difference Storing

[Figure: Difference Storing. y-axis: Percent of L1 Misses (0% to 100%); x-axis: Number of bits (1 to 19); series: burg, delta, gs, sis, turb3d, health.]


PSB with SFM

[Figure: PSB with SFM. Labels: 8 buffers; entries with tag, cache block, comparator; Predicted Stride; Last Address; store predicted stride in streaming buffer on allocation; Markov Predictor; Stride Predictor; load info (PC, address) from write-back stage; last address; if hit, return predicted address; predicted markov address; predicted stride address; MUX (markov hit?); predicted address; from/to next lower level of memory.]



Methods

• SimpleScalar 3.0

• Rewrote memory hierarchy

• Model bandwidth between all levels

• Added perfect store sets

• Ran over set of Pointer Benchmarks

• 2K entry predictor table

• 8 buffers x 4 entry Stream Buffers

• 32k 4-way associative cache


Speedup from PSB

[Figure: y-axis: Percent Speedup (0% to 45%); benchmarks: health, burg, deltablue, gs, sis, turb3d; series: PC-Stride, PSB w/ SFM.]


Allocation Filtering

• Farkas et al. showed how two-miss filtering
– prevents too many streams from requesting resources

• Does not work as well for pointer codes
– irregular miss patterns

• We use Priority and Accuracy Counters
– track behavior of Loads
– allocate to Loads that are behaving well
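A minimal sketch of accuracy-based allocation, assuming one small saturating counter per load PC, might look like the following. The class name `ConfidenceAllocator`, the counter width (`max_count=3`), and the `threshold=2` cutoff are hypothetical; the slides only say that load behavior is tracked and that buffers go to loads that behave well.

```python
class ConfidenceAllocator:
    """Sketch: allow allocation only for loads whose predictions have been accurate."""

    def __init__(self, threshold=2, max_count=3):
        self.threshold = threshold  # assumed cutoff for "behaving well"
        self.max_count = max_count  # assumed saturation value
        self.counters = {}          # load PC -> saturating accuracy counter

    def record(self, pc, predicted_addr, actual_addr):
        # Raise the counter on a correct prediction, lower it on a wrong one.
        count = self.counters.get(pc, 0)
        if predicted_addr == actual_addr:
            count = min(count + 1, self.max_count)
        else:
            count = max(count - 1, 0)
        self.counters[pc] = count

    def should_allocate(self, pc):
        # A load with no history, or a poor record, is filtered out.
        return self.counters.get(pc, 0) >= self.threshold
```

Unlike a filter keyed on consecutive misses, a counter like this tolerates an occasional irregular miss without immediately disqualifying the load.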


Allocation Filtering Speedup

[Figure: y-axis: Percent Speedup (0% to 45%); series: PC-Stride, 2 Miss, ConfAlloc.]


Stream Buffer Priority

• Round Robin
– give each active buffer equal resources
– predictor and prefetching

• Priority Counters
– use a small counter with each buffer
– use the counters to rank buffers
– more resources to better performing buffers
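The two policies can be contrasted with a short sketch. The function names, the tie-break rule (round robin among equally ranked buffers), the hit-driven counter update, and the 3-bit saturation value are assumptions for illustration rather than details taken from the slides.

```python
def pick_round_robin(active, last_pick):
    """Return the next active buffer index after last_pick, wrapping around."""
    if not active:
        return None
    later = [i for i in sorted(active) if i > last_pick]
    return later[0] if later else min(active)


def pick_priority(priority, active, last_pick):
    """Pick the active buffer with the highest counter; ties go round robin."""
    if not active:
        return None
    best = max(priority[i] for i in active)
    top = [i for i in active if priority[i] == best]
    return pick_round_robin(top, last_pick)


def on_hit(priority, index, max_count=7):
    """Assumed update rule: a lookup hit raises that buffer's counter (saturating)."""
    priority[index] = min(priority[index] + 1, max_count)
```

With counters `[1, 5, 5, 0]` and all four buffers active, round robin would cycle through every buffer, while the priority picker alternates between buffers 1 and 2 until their counters change.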


Priority Scheduling Speedup

[Figure: y-axis: Percent Speedup (0% to 45%); series: PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, ConfAlloc-Priority.]


Latency Reduction

[Figure: y-axis: Avg. Access Latency (cycles), 0 to 12; series: Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, ConfAlloc-Priority.]


Contributions

• Predictor-Directed Stream Buffers allow decoupling of Stream Buffer front end from address generation

• Using accuracy-based allocation filtering and priority scheduling can make a large difference in performance

• With some simple compression, even small Markov tables can be very effective


Accuracy

[Figure: y-axis: Percent Accuracy (0% to 90%); series: PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, ConfAlloc-Priority.]


Bus Results

[Figure: two series, L1 to L2 Bus Utilization (axis 0% to 100%) and L2 to Mem Bus Utilization (axis 0% to 20%); x-axis: repeated groups of Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, Conf-Pri.]
