predictor-directed stream buffers
Predictor-Directed Stream Buffers
Timothy Sherwood
Suleyman Sair
Brad Calder
Sherwood, Sair, and Calder 2
Overview
• Introduction
• Past Stream Buffer work
• Predictor-Directed Stream Buffers
• Policy Improvements
• Results
• Contribution
Introduction
• Memory Wall
• Latency reduction through prefetching
– without eating too much bandwidth
• Stream buffers are among the most widely used prefetchers
– simple to implement
– very efficient
• Pointer-based codes
Past Stream Buffer work
• Jouppi 1990
– consecutive cache line FIFO
• Palacharla and Kessler 1994
– non-unit stride (based on memory chunk)
– allocation filters
• Farkas et al. 1997
– PC-based stride
– fully associative / non-overlapping
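The stride-based stream buffers listed above can be illustrated with a minimal sketch. It is a toy model, not the design from any of the cited papers: the class name `StrideStreamBuffer`, the `BLOCK` size, and the head-only FIFO lookup are assumptions made for illustration. With `stride=BLOCK` it behaves like a consecutive-cache-line FIFO.

```python
from collections import deque

BLOCK = 32  # assumed cache block size in bytes (hypothetical value)


class StrideStreamBuffer:
    """Toy FIFO of prefetched addresses that advances by a fixed stride."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()
        self.last_addr = None
        self.stride = 0

    def allocate(self, miss_addr, stride=BLOCK):
        """Start a new stream at a missing address and fill the FIFO."""
        self.fifo.clear()
        self.last_addr = miss_addr
        self.stride = stride
        self._refill()

    def _refill(self):
        # Keep the FIFO full: each new entry is last address + stride.
        while len(self.fifo) < self.depth:
            self.last_addr += self.stride
            self.fifo.append(self.last_addr)

    def lookup(self, addr):
        """Hit only on the head entry (assumption); on a hit, pop and refill."""
        if self.fifo and self.fifo[0] == addr:
            self.fifo.popleft()
            self._refill()
            return True
        return False
```

A fully associative variant would compare `addr` against every entry rather than only the head.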
Past Stream Buffer work
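[Figure: stride-based stream buffer architecture. N buffers, each with entries consisting of a tag, a cache block, and a comparator, plus a per-buffer predicted stride and last address. The predicted stride is stored in the streaming buffer on allocation. The buffers connect from/to the next lower level of memory and to the data cache, register file, and MSHRs.]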
Past Stream Buffer work
• Past work targeted streaming through arrays
– either in sequential order
– or in stride order (multidimensional arrays)
• Could not handle pointer codes
– repetitive non-striding references
• Need a more general predictor
Predictor-Directed Stream Buffer
• The Goal: simple and efficient hardware-based prefetching of complex but predictable streams
• Approach: take a general predictor and hook it up to the well-established stream buffer front end
• Separate the predictor from the prefetcher
• Can use almost any predictor
– 2-Delta
– Context
– Markov
PSB Generalized Architecture
[Figure: PSB generalized architecture. Load info (PC, address) from the write-back stage updates the prediction information in the address predictor. Each of the N buffers holds prediction info (load PC, history, stride, confidence, last address) and entries consisting of a tag, a cache block, and a comparator. A subset of the prediction info is sent to the address predictor, which returns the predicted address. The buffers connect from/to the next lower level of memory and to the data cache, register file, and MSHRs.]
PSB Stages
• Allocation
• Prediction
• Probe
• Prefetching
• Lookup
Stage Descriptions
• Allocation
– a stream buffer is allocated to a particular load
– the buffer is initialized
– subject to allocation filters
• Prediction
– an empty buffer entry asks for an address
– subject to limited predictor speed
Stage Descriptions (Continued)
• Probe
– if there are free ports, remove useless prefetches
– not mandatory
• Prefetching
– subject to scheduling for ports and priority, prefetches are sent to memory
• Lookup
– when a load performs an L1 access, the stream buffers are checked in parallel
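The stages above can be walked through with a small toy model of a single buffer. This is a sketch under stated assumptions, not the hardware described in the slides: the class `PSBuffer`, its method names, and the instant arrival of prefetched data are all hypothetical, and the optional probe stage is left out. The predictor is passed in as a plain function, reflecting the idea that the predictor is separate from the prefetcher.

```python
class PSBuffer:
    """Toy model of one predictor-directed stream buffer (hypothetical names)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entries = []      # each entry: {"addr": int, "ready": bool}
        self.pc = None         # load this buffer is allocated to
        self.last_addr = None  # last address fed to the predictor

    def allocate(self, pc, miss_addr):
        """Allocation: bind the buffer to a load and reset its state."""
        self.pc, self.last_addr = pc, miss_addr
        self.entries = []

    def predict(self, predictor):
        """Prediction: an empty entry asks the predictor for an address."""
        if self.pc is None or len(self.entries) >= self.depth:
            return None
        nxt = predictor(self.pc, self.last_addr)
        self.last_addr = nxt
        self.entries.append({"addr": nxt, "ready": False})
        return nxt

    def prefetch(self):
        """Prefetching: send the oldest unfetched entry to memory."""
        for entry in self.entries:
            if not entry["ready"]:
                entry["ready"] = True  # toy model: data arrives immediately
                return entry["addr"]
        return None

    def lookup(self, addr):
        """Lookup: checked alongside the L1 access; a hit frees the entry."""
        for i, entry in enumerate(self.entries):
            if entry["addr"] == addr and entry["ready"]:
                del self.entries[i]
                return True
        return False
```

For example, calling `allocate(0x400, 0x1000)`, then `predict(lambda pc, last: last + 64)` and `prefetch()`, makes a later `lookup(0x1040)` hit.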
PSB Implementation
• Tried many different address predictors
• Best is Stride-Filtered Markov (SFM)
– similar to Joseph and Grunwald’s predictor
– first-order Markov
– striding behavior is filtered out
• Address differences are stored instead of full addresses to reduce table size
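A minimal sketch of a stride-filtered Markov predictor with difference storing is given below. It is an illustration under assumptions, not the predictor evaluated in the talk: the class name, the per-load last-stride filter, the modulo index, the tagless table, and the 16-bit delta width are all hypothetical choices.

```python
class StrideFilteredMarkov:
    """Toy first-order Markov predictor that skips stride-predictable transitions."""

    def __init__(self, delta_bits=16, table_entries=2048):
        self.delta_bits = delta_bits
        self.table_entries = table_entries
        self.stride_table = {}  # load PC -> (last address, last stride)
        self.markov = {}        # table index -> signed delta to the next address

    def _index(self, addr):
        # Placeholder hash; a real table would also need tags (assumption).
        return addr % self.table_entries

    def _fits(self, delta):
        # Only deltas representable in delta_bits (signed) are stored.
        limit = 1 << (self.delta_bits - 1)
        return -limit <= delta < limit

    def update(self, pc, addr):
        last, stride = self.stride_table.get(pc, (None, 0))
        if last is not None:
            delta = addr - last
            # Filter: transitions the stride predictor already covers are not
            # stored. The first step of a new stride still lands in the table.
            if delta != stride and self._fits(delta):
                self.markov[self._index(last)] = delta
            stride = delta
        self.stride_table[pc] = (addr, stride)

    def predict(self, pc, last_addr):
        delta = self.markov.get(self._index(last_addr))
        if delta is not None:      # Markov hit: use the stored difference
            return last_addr + delta
        _, stride = self.stride_table.get(pc, (None, 0))
        return last_addr + stride  # otherwise fall back to the stride
```

Storing a small signed delta rather than a full address is what lets each entry shrink to `delta_bits` bits in this sketch; transitions whose delta does not fit are simply not recorded.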
Difference Storing
[Figure: percent of L1 misses (0% to 100%) versus number of bits (1 to 19); series: burg, delta, gs, sis, turb3d, health.]
PSB with SFM
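[Figure: PSB with SFM. 8 buffers, each with entries consisting of a tag, a cache block, and a comparator, plus a per-buffer predicted stride and last address. The predicted stride is stored in the streaming buffer on allocation. Load info (PC, address) from the write-back stage feeds a Markov predictor and a stride predictor. The last address is sent to the Markov predictor, which returns a predicted address on a hit. A MUX, controlled by the Markov hit signal, selects between the predicted Markov address and the predicted stride address. The buffers connect from/to the next lower level of memory and to the data cache, register file, and MSHRs.]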
Methods
• SimpleScalar 3.0
• Rewrote memory hierarchy
• Model bandwidth between all levels
• Added perfect store sets
• Ran over set of pointer benchmarks
• 2K entry predictor table
• 8 buffers x 4 entry stream buffers
• 32k 4-way associative cache
Speedup from PSB
[Figure: percent speedup (0% to 45%) for health, burg, deltablue, gs, sis, and turb3d; series: PC-Stride, PSB w/ SFM.]
Allocation Filtering
• Farkas et al. showed how two-miss filtering
– prevents too many streams from requesting resources
• Does not work as well for pointer codes
– irregular miss patterns
• We use priority and accuracy counters
– track the behavior of loads
– allocate to loads that are behaving well
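Accuracy-based allocation can be sketched with a saturating counter per load. The sketch below is an assumption-laden illustration: the class name `ConfidenceAllocFilter`, the counter width, and the threshold are hypothetical, and the slides do not specify the actual update or allocation rule.

```python
class ConfidenceAllocFilter:
    """Toy allocation filter: only loads with accurate predictions get a buffer."""

    def __init__(self, max_count=7, threshold=2):
        self.max_count = max_count  # saturation limit (assumed)
        self.threshold = threshold  # minimum count to allocate (assumed)
        self.counters = {}          # load PC -> saturating accuracy counter

    def train(self, pc, predicted_addr, actual_addr):
        """Raise the counter on a correct prediction, lower it otherwise."""
        count = self.counters.get(pc, 0)
        if predicted_addr == actual_addr:
            count = min(self.max_count, count + 1)
        else:
            count = max(0, count - 1)
        self.counters[pc] = count

    def should_allocate(self, pc):
        """Allow allocation only for loads that have been behaving well."""
        return self.counters.get(pc, 0) >= self.threshold
```

Unlike a filter keyed on miss patterns, this one depends only on how often the predictor was right for a given load, which is why irregular miss patterns do not by themselves block allocation in this sketch.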
Allocation Filtering Speedup
[Figure: percent speedup (0% to 45%) for health, burg, deltablue, gs, sis, and turb3d; series: PC-Stride, 2 Miss, ConfAlloc.]
Stream Buffer Priority
• Round Robin
– give each active buffer equal resources
– predictor and prefetching
• Priority Counters
– use a small counter with each buffer
– use the counters to rank buffers
– give more resources to better-performing buffers
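The two scheduling policies can be contrasted with a short sketch that picks which buffer gets the next predictor or prefetch slot. The function names, the tie-breaking rule, and the counter update in `bump` are hypothetical choices for illustration; the slides do not give the exact counter policy.

```python
def round_robin(active, last):
    """Pick the next active buffer id after `last`, wrapping around."""
    if not active:
        return None
    ordered = sorted(active)
    for buf_id in ordered:
        if buf_id > last:
            return buf_id
    return ordered[0]


def by_priority(priority):
    """Pick the buffer with the highest counter (ties go to the lowest id)."""
    if not priority:
        return None
    return max(sorted(priority), key=lambda buf_id: priority[buf_id])


def bump(priority, buf_id, hit, max_count=3):
    """Assumed update rule: saturating increment on a hit, decrement otherwise."""
    current = priority.get(buf_id, 0)
    if hit:
        priority[buf_id] = min(max_count, current + 1)
    else:
        priority[buf_id] = max(0, current - 1)
```

Round robin ignores how useful each stream has been, while the priority version steers the shared predictor and prefetch bandwidth toward buffers whose counters are highest.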
Priority Scheduling Speedup
[Figure: percent speedup (0% to 45%) for health, burg, deltablue, gs, sis, and turb3d; series: PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, ConfAlloc-Priority.]
Latency Reduction
[Figure: average access latency in cycles (0 to 12) for health, burg, deltablue, gs, sis, and turb3d; series: Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, ConfAlloc-Priority.]
Contributions
• Predictor-Directed Stream Buffers allow decoupling of Stream Buffer front end from address generation
• Using accuracy-based allocation filtering and priority scheduling can make a large difference in performance
• With some simple compression, even small Markov tables can be very effective
Accuracy
[Figure: percent accuracy (0% to 90%) for health, burg, deltablue, gs, sis, and turb3d; series: PC-Stride, 2Miss-RR, 2Miss-Priority, ConfAlloc-RR, ConfAlloc-Priority.]
Bus Results
[Figure: L1 to L2 bus utilization (0% to 100%) and L2 to memory bus utilization (0% to 20%) for health, burg, deltablue, gs, sis, and turb3d; configurations per benchmark: Base, PC-Stride, 2Miss-RR, 2Miss-Pri, Conf-RR, Conf-Pri.]