Accurate Analytical Modeling of Superscalar Processors
J. E. Smith
Tejas Karkhanis
October 27, 2003, copyright J. E. Smith, 2003
Superscalar Processor Evaluation

Processors typically evaluated via simulation
• Highly detailed simulator
• Many cycles of simulation
• Has a black-box character -- provides little insight
Workload implications
• All workload characteristics are needed for detailed simulation, but not all are critical for determining performance
• Workload space limited to specific benchmarks
Alternative approach – use an analytical model
Analytical Approach

Analytical model driven by relevant benchmark properties
Helps isolate important workload characteristics
• If the performance estimate is accurate, then the workload characteristics used must be the important ones
Workload characteristics can be varied over a "workload space"
• Apply characteristics directly by short-circuiting simulation

[Diagram: Benchmarks → Functional Simulator → Extract Relevant Program Properties → Analytical Model → Performance/Power estimates]
Basis for Model

Consider a profile of dynamic instructions issued per cycle:
• Background constant IPC
• With a never-ending series of transient events
Determine performance with ideal caches & predictors, then account for transient events

[Figure: IPC vs. time – constant background IPC punctuated by dips for branch mispredicts, i-cache misses, and long d-cache misses]
IBID Model

Based on a generic superscalar processor
Useful for reasoning about transient events

[Diagram: generic superscalar model – I-fetch stops/restarts on a mispredict (sized by the pipe), an I-cache miss (sized by the window), and a long D-cache miss (sized by the ROB); issue is governed by the IW characteristic]
Series/Parallel Performance Penalties

Branch misprediction and I-cache miss penalties "serialize"
• i.e. penalties add linearly
Long D-cache misses may overlap with I-cache and branch-predict misses (and with each other)
• Overlap with other long D-cache misses is more important
• Short D-cache misses are handled differently (later)

[Diagram: branch mispredicts and I-cache misses in series; long D-cache misses in parallel with them]
Validating Series/Parallel Model

Combined: simulated performance with realistic caches/predictor
Independent: ideal performance minus individually determined performance losses
Overlap Compensated: account for overlaps with D-cache misses
Configuration: 4-way issue, 48-entry window, 128-entry ROB, 16K I-cache and D-cache, 8K gshare branch predictor

[Chart: IPC per benchmark for Combined, Independent, and Overlap Compensated]
[Diagram: modeled processor – I-cache (miss rate), branch predictor (mispredict rate), decode pipeline (# stages), issue buffer (# entries), execution units (# and type of units, unit latencies), data cache (miss rate), MSHRs (# entries), reorder buffer (# entries), register file (# values), store queue (# entries)]
IW Characteristic

Key result (Michaud, Seznec, Jourdan):
• Square-root relationship between issue rate I and window size W: I ∝ √W
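The square-root IW characteristic lends itself to a two-line model. A minimal sketch (the function names, the constant-alpha assumption, and the 16-entry/2.0-IPC calibration point are illustrative, not from the slides):

```python
import math

def fit_alpha(window_size, measured_ipc):
    """Recover the per-benchmark constant alpha from one measured
    (window size, IPC) point, assuming I = alpha * sqrt(W)."""
    return measured_ipc / math.sqrt(window_size)

def background_ipc(window_size, alpha):
    """Background (ideal-cache, ideal-predictor) issue rate predicted
    by the square-root IW characteristic."""
    return alpha * math.sqrt(window_size)

# Hypothetical benchmark: IPC 2.0 measured with a 16-entry window.
alpha = fit_alpha(16, 2.0)           # alpha = 0.5
print(background_ipc(64, alpha))     # 4x the window -> 2x the IPC: 4.0
```

One calibration point per benchmark fixes the whole characteristic, which is what makes the straight log-log lines on the next slide useful.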
Similar Experiment

Ideal caches, predictor
Efficient I-fetch keeps window full
Graph issue rate I as a function of window size W
Straight lines on a log-log graph confirm I ∝ √W

[Chart: lg(IPC) vs. lg(window size) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr]
IW Characteristic

Allows determination of "background" IPC
Allows evaluation of transients to determine penalties

[Figure: IPC vs. time – background IPC with dips for branch mispredicts, i-cache misses, and long d-cache misses]
Transient #1: Branch Mispredictions

Typical behavior:

[Timeline: steady state → mispredicted branch enters window → misspeculated instructions issue → misprediction detected → flush pipeline → re-fill pipeline → instructions re-enter window → issue ramps back up to steady state]
Branch Misprediction Penalty

1) Lost opportunity
• performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty
• obvious penalty; most people equate this with the penalty
3) Window fill penalty
• performance lost due to window startup

[Figure: issue-rate timeline annotated with the lost opportunity, pipeline re-fill, and window fill regions]
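The re-fill and window-fill components can be sketched directly from the IW characteristic. The ramp model below is an assumption of this sketch (the front end delivers `fetch_width` instructions per cycle, issue tracks alpha·√occupancy), not the authors' exact derivation, and lost opportunity is left out because it requires a measured wrong-path issue count:

```python
import math

def window_fill_penalty(window_size, alpha, fetch_width):
    """Cycles lost while the window ramps back up after a flush.
    Each cycle the front end adds fetch_width instructions and the
    core issues roughly alpha*sqrt(occupancy); the penalty is the
    total issue deficit vs. steady state, converted to cycles."""
    steady = alpha * math.sqrt(window_size)
    occupancy, lost_slots = 0.0, 0.0
    while occupancy < window_size:
        issued = min(alpha * math.sqrt(occupancy), occupancy)
        lost_slots += steady - issued
        gain = fetch_width - issued
        if gain <= 0:  # front end can't outpace issue; window never refills
            break
        occupancy = min(occupancy + gain, window_size)
    return lost_slots / steady

def mispredict_penalty(pipe_depth, window_size, alpha, fetch_width):
    """Pipeline re-fill plus window fill (lost opportunity omitted)."""
    return pipe_depth + window_fill_penalty(window_size, alpha, fetch_width)
```

Comparing `mispredict_penalty(5, 48, 0.5, 4)` with `mispredict_penalty(10, 48, 0.5, 4)` shows how the pipe-depth term and the window-startup term separate.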
Use Sqrt Model

[Chart: modeled IPC vs. clock cycle (0–24) after a mispredict, using the square-root IW characteristic]
Experimental Data

[Chart: GCC – issue rate (useful instructions only) vs. cycle after the mispredict, for load latency 1 and load latency 2]
Branch Mispredict Penalty

Short pipeline = 5 stages before issue; long pipeline = 10 stages before issue

[Chart: penalty in cycles (0–16) per benchmark, short vs. long pipeline]

Insight from analytical model: the penalty from drain/fill is significant
Insight from analytical model: the penalty is similar across all benchmarks for a given pipeline length
Implication of Wider Pipes

Assume 1 mispredict every 96 instructions
• E.g. SPEC benchmark crafty with 4K gshare
• Graph the full mispredict "cycle"

[Chart: IPC vs. clock cycle (0–60) for issue widths 2, 4, and 8]

Issue=8 gives a very modest improvement vs. issue=4 (window never full enough to issue 8)
Issue=4 barely reaches peak performance
Importance of Branch Prediction

[Chart: instructions between mispredictions (0–1800) vs. percent of time at 3.5, 7, and 14 issues per cycle, for issue widths 4, 8, and 16]

Insight: Doubling issue width means the predictor has to be four times better for a similar performance profile (issue efficiency)
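The "four times better" insight follows from the IW characteristic: since I ∝ √W, the window occupancy (and hence the instructions consumed ramping back up after each flush) grows as the square of the issue rate. A back-of-envelope sketch of that scaling (the quadratic-scaling function is this sketch's encoding of the slide's claim, not a formula from the talk):

```python
def required_mispredict_interval(base_interval, base_width, new_width):
    """Instructions between mispredictions needed to keep a similar
    issue-efficiency profile when issue width changes: from I ~ sqrt(W),
    the ramp-up cost grows as the square of the issue rate, so the
    interval must scale quadratically with issue width."""
    return base_interval * (new_width / base_width) ** 2

# Going from 4-wide to 8-wide with the slide's 1-in-96 mispredict rate:
print(required_mispredict_interval(96, 4, 8))  # -> 384.0
```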
Implication of Deeper Pipelines

Assume 1 misprediction per 96 instructions
Vary the fetch/decode/rename section of the pipe
Advantage of wide issue diminishes as the pipe deepens
Pentium 4: decode pipe depth = 15 & issue width = 3

[Chart: IPC vs. fetch/decode pipe length (0–16) for issue widths 2, 4, and 8]
Transient #2: I-Cache Misses

[Timeline: steady state → cache miss occurs → window drains while instructions buffered in the decode pipe issue → miss delay elapses → instructions fill the decode pipe and re-enter the window → issue ramps back up to steady state]
I-cache Miss Penalty

Penalty = miss delay (L2 or memory latency)
          minus window drain
          plus window re-fill penalty
Instructions buffered in the window offset the re-fill penalty

Insight: the penalty is independent of pipeline length; instructions buffered in the pipe compensate for pipe re-fill
I-cache Miss Penalty

Estimated I-cache penalty for n consecutive (clustered) misses:

Avg. miss penalty = (miss delay − drain + fill + (n−1)(miss delay − 1)) / n
                  ≈ miss delay − 1 + 1/n

• For an isolated miss (n = 1): ≈ miss delay
• For a long cluster (large n): ≈ miss delay − 1
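The clustered-miss formula is easy to check numerically. A direct transcription (the 10-cycle delay and the drain ≈ fill assumption are illustrative inputs):

```python
def icache_avg_penalty(miss_delay, drain, fill, n):
    """Average penalty per miss for n consecutive (clustered) I-cache
    misses: the first miss pays (miss_delay - drain + fill); each of
    the remaining n-1 misses in the cluster pays (miss_delay - 1)."""
    return (miss_delay - drain + fill + (n - 1) * (miss_delay - 1)) / n

# With drain ~ fill they cancel and the average tends to miss_delay - 1 + 1/n:
print(icache_avg_penalty(10, 3, 3, 1))    # isolated miss: 10.0
print(icache_avg_penalty(10, 3, 3, 100))  # long cluster: 9.01
```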
Independence from Pipe Length

16K I-cache; ideal D-cache and predictor
Two different pipeline lengths (4 and 8 cycles)
I-cache miss delay 10 cycles
Penalty independent of pipe length
Similar across benchmarks

[Chart: penalty in cycles (0–12) per benchmark, pipelen=4 vs. pipelen=8]
Reducing Miss Penalty – I-Caches

Add an I-fetch (decoupling) buffer
• Overlaps execution with miss handling
• Bypassed by miss instructions
To be effective, should be enhanced with high fetch bandwidth
• greater than issue width (e.g. fetch 2n for issue width n)

[Diagram: I-cache → decoupling buffer (width 2n) → decode pipeline (width n); the buffer increases the instructions buffered when a miss occurs without increasing the pipe-fill time]
Transient #3: D-Cache Misses

More complex than front-end miss events
• Branch mispredicts and I-cache misses block I-fetch
• Data cache misses can be handled in parallel with I-fetch and execution
Divide into:
• Short misses – handle like a long-latency functional unit
• Long misses – get special treatment
D-cache Long Miss Penalty

Three things can reduce performance:
1) Structural hazard
• ROB fills up behind the load (or an instruction dependent on the load) and dispatch stalls
2) Data dependences
• Instructions dependent on the load pile up and stall the window
3) Control dependences
• A mispredicted branch dependent on the load data; instructions beyond the branch are wasted
ROB Blockage

Experiment:
• Window size 32, issue width 4, ROB size 64
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Simulate sampled, isolated cache misses and observe what happens
Results

Benchmark   Avg. # insts issued   # insts in window   Fract. of samples
            after miss            after miss          where ROB fills
Bzip2       44.1                  13.1                1.0
Crafty      44.6                   9.6                0.9
Eon         55.2                   6.0                1.0
Gap         56.8                  10.7                1.0
Gcc         51.7                   8.2                0.9
Mcf         55.8                   5.5                0.9
Parser      44.2                   7.4                1.0
Twolf       49.6                  12.9                0.8
Vortex      49.7                   3.5                1.0
Vpr         27.0                  16.9                0.6

A full ROB stalls the machine most of the time
Relatively few dependent instructions pile up in the window
D-Cache Miss Penalty

For typical ROBs, data and control dependences are not the limiters – assume a structural (ROB) stall:

If the load is at the tail of the window:
  Penalty = miss delay − ROB fill − window drain + ramp-up ≈ miss delay − ROB fill
If the load is at the head of the window:
  Penalty = miss delay − window drain + ramp-up ≈ miss delay

If a second long load miss is within ROB distance of the first, its penalty is completely overlapped
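Both the first-order penalty and the overlap rule fit in a few lines. A sketch under the slide's assumptions (ROB stall dominates; drain and ramp-up roughly cancel); the 200-cycle delay and the miss positions in the example are hypothetical:

```python
def long_miss_penalty(miss_delay, rob_fill_cycles, at_tail):
    """First-order D-cache long-miss penalty, assuming the structural
    (ROB) stall dominates:
      load at tail of window: ~ miss delay - ROB fill time
      load at head of window: ~ miss delay
    (window drain and ramp-up roughly cancel and are dropped here)."""
    return miss_delay - rob_fill_cycles if at_tail else miss_delay

def nonoverlapped_misses(miss_positions, rob_size):
    """Count long misses that actually pay a penalty: a miss within
    ROB distance of the previous penalized miss is fully overlapped."""
    count, last = 0, None
    for pos in miss_positions:
        if last is None or pos - last >= rob_size:
            count += 1
            last = pos
    return count

# Hypothetical dynamic-instruction positions of long misses, ROB = 64:
print(nonoverlapped_misses([0, 30, 100, 120], 64))  # 30 and 120 overlap -> 2
```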
Transient #3: D-Cache Misses

[Timeline: steady state → d-cache miss occurs → window drains → ROB fills → miss delay elapses → miss data returns → commit resumes; issue ramps back up to steady state]
Reducing Miss Penalty – D-Caches

Enlarge the ROB, window, and rename values
• Overlap miss delay with execution

[Timeline: d-cache miss occurs → window drains of independent instructions → ROB fills later (larger ROB) → miss data returns → commit resumes; issue ramps back up to steady state, with less of the miss delay exposed]
Put it together

Issue width 4, window size 48 => peak CPI
8-cycle L1 I-cache miss delay
200-cycle L2 cache miss delay (both I and D)
6.4-cycle branch mispredict delay (4 in pipeline)

Performance (cycles)
  = #insts × peak CPI
  + (total # br mispredicts − mispredicts w/in ROB-size of a long miss) × penalty
  + (total # I-cache misses − misses w/in ROB-size of a long miss) × penalty
  + (total # long misses − long misses w/in ROB-size of a long miss) × penalty
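The overall model is a single linear combination. A sketch using the slide's penalty values (6.4-cycle mispredict, 8-cycle L1 I-cache, 200-cycle long miss); the peak CPI of 0.25 and the event counts are hypothetical inputs, and each count is assumed to already exclude events hidden within ROB distance of a long miss:

```python
def total_cycles(n_insts, peak_cpi,
                 br_mispredicts, mispredict_penalty,
                 icache_misses, icache_penalty,
                 long_misses, long_penalty):
    """First-order cycle count: base cycles at peak CPI plus one
    penalty per non-overlapped miss event."""
    return (n_insts * peak_cpi
            + br_mispredicts * mispredict_penalty
            + icache_misses * icache_penalty
            + long_misses * long_penalty)

# Hypothetical counts for a 1M-instruction run:
cycles = total_cycles(1_000_000, 0.25,
                      10_000, 6.4,   # branch mispredicts
                      2_000, 8,      # L1 I-cache misses
                      1_000, 200)    # long (L2) misses
print(cycles)  # about 530,000 cycles
```

Because each term is separable, the same function doubles as a sensitivity analysis: vary one count or penalty and read off the cycle change directly.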
Compare with Detailed Simulation

Very accurate
Greatest inaccuracy comes from D-cache long misses

[Chart: IPC per benchmark, detailed simulation vs. analytical model]
Important Workload Characteristics

[Chart: CPI breakdown (0–1.2) per benchmark – ideal, L1 I-cache misses, L2 I-cache misses, L1 D-cache misses, L2 D-cache misses, branch mispredictions]
Conclusions: Key Workload Characteristics

Instruction dependences are important:
• For establishing background (ideal) IPC
• Not for performance penalties
All "major" events are important:
• Branch mispredicts
• I-cache misses (both short and long)
• D-cache misses (long)
But ONLY "major" events are important in a well-balanced design
Clustering of events matters only for D-cache misses
• Is the miss within ROB distance of the preceding miss?
Conclusions: Performance Evaluation

Accurate analytical models can (and should) be developed
Trace-driven cache/predictor simulators have an important role
Hybrid analytical/simulation models should also be considered
• Combine real address streams with analytical processor models
• Statistical simulation
If you really need detailed simulation – you're not doing research, you're doing development!