A Mechanistic Model for Superscalar Processors
J. E. Smith, University of Wisconsin-Madison
Lieven Eeckhout, Stijn Eyerman, Ghent University
Tejas Karkhanis, AMD
Superscalar Modeling © J. E. Smith, 2006 2

Interval Analysis
Superscalar execution can be divided into intervals separated by miss events
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.
Provides more insight than simulation
• You can see the forest and the trees
• Supplements simulation; does not replace it
[Figure: IPC over time, with intervals 0 to 3 separated by branch mispredicts, an I-cache miss, and a long D-cache miss]

Outline
Development of Interval Analysis
• Modeling ILP
• Modeling miss events
Balanced Superscalar Processors
• Performance components
• Optimal pipeline configurations
Performance Counter Architecture
• Accurate CPI stacks

Superscalar Processors
[Figure: superscalar processor block diagram. Components: branch predictor, I-cache, fetch buffer, decode pipeline, issue buffer, execution units, reorder buffer (window, W entries), physical register file(s), load and store queues, MSHRs, L1 data cache, L2 cache, main memory. Width labels: F, D, I, R. Annotated parameters: entry counts, miss rates, mispredict rate, instruction delivery algorithm, number and type of units, unit latencies, pipeline depth, number of cache ports, main memory latency]

Superscalar Processors
Ifetch
• Adequate fetch resources to sustain decode/dispatch width D
• F > D plus fetch buffer to smooth flow
Decode
• Assume decode pipe and dispatch bandwidth D
Window
• Window, size W, holds in-flight instructions
• Equivalent to ROB
• Issue buffer holds subset of window (as an optimization)
• Assume unified issue buffer, but model can support partitioned buffers
Issue
• Width may be more or less than dispatch and commit widths
Retire
• Retire width R typically equal to dispatch width

Superscalar Processor Performance
Maximum IPC under ideal conditions
• No cache misses or branch mispredictions
Miss events disrupt smooth flow
• In a balanced design, performance is all about the transients
[Figure: IPC over time, marked with branch mispredicts, an I-cache miss, and a long D-cache miss]

Modeling ILP
Relationship between maximum window size W and achieved issue width i
Program dependence structure
Has a long history…

Riseman and Foster (1972)
Basic relationship between window size and IPC
• Classic study
• Approx. quadratic relationship under ideal conditions
[Figure: graph relating window size W and issue rate i]

Wall (1991)
Limits of ILP
• Another classic study
• Approx. quadratic relationship under “perfect” conditions

Michaud, Seznec, Jourdan
More recent study
Key result (Michaud, Seznec, Jourdan):
• Approx. quadratic relationship

Our Experiment
Ideal caches, predictor
Efficient I-fetch keeps window full
Graph issue rate i as a function of window size W
• Approx. quadratic relationship

Modeling IW Characteristic
Clearly a function of program dependence structure
Simple, single-level dependence models don’t work very well
• Need to consider dependence chains
Slide window over dynamic stream and compute average critical path k(W)
For unit latency, i = W/k(W)
[Figure: a window sliding over the dynamic instruction stream]
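
The sliding-window measurement can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the trace format (`deps[j]` lists the earlier instructions that instruction `j` depends on), the stride of one instruction, and both function names are assumptions made here, and every instruction is given unit latency.

```python
def critical_path_length(deps, start, window):
    """Longest dependence chain (unit latency) among the instructions
    start .. start+window-1. deps[j] lists the producers of instruction j."""
    depth = {}
    longest = 0
    for j in range(start, start + window):
        # Only producers that are inside the window can extend a chain.
        d = 1 + max((depth[p] for p in deps[j] if p >= start), default=0)
        depth[j] = d
        longest = max(longest, d)
    return longest


def average_issue_rate(deps, window):
    """Slide the window one instruction at a time over the stream and
    return (k(W), i) with i = W / k(W)."""
    n_windows = len(deps) - window + 1
    total = sum(critical_path_length(deps, s, window) for s in range(n_windows))
    k = total / n_windows
    return k, window / k
```

A fully serial stream gives k(W) = W and i = 1, while a stream with no dependences gives k(W) = 1 and i = W; real programs fall in between.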

Average Critical Path
Average critical path: $k(W) = \frac{1}{\alpha} W^{1/\beta}$
For our benchmarks, 1.3 ≤ β ≤ 1.9
• Quadratic when β = 2
Unit latency, avg. IPC: $i = \alpha W^{1-1/\beta}$
Avg. latency l, avg. IPC: $i = \frac{\alpha}{l} W^{1-1/\beta}$, equivalently $W = (l\,i/\alpha)^{\beta/(\beta-1)}$
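
A small numeric sketch of this power law, assuming the fit has the form k(W) = W^(1/β)/α, where α is a per-benchmark scale factor and β is the exponent quoted on the slide. The function names and the sample values are illustrative only.

```python
def critical_path(W, alpha, beta):
    """Average critical path length k(W) under the assumed power-law fit."""
    return W ** (1.0 / beta) / alpha


def issue_rate(W, alpha, beta, latency=1.0):
    """Average IPC i = W / (l * k(W)) for average instruction latency l."""
    return (alpha / latency) * W ** (1.0 - 1.0 / beta)


def window_for_issue_rate(i, alpha, beta, latency=1.0):
    """Window size needed to sustain issue rate i (inverse of issue_rate)."""
    return (latency * i / alpha) ** (beta / (beta - 1.0))
```

With β = 2 and α = 1 the relationship is exactly quadratic: a window of 64 sustains an issue rate of 8, and an issue rate of 4 needs a window of 16. Smaller β (toward 1.3) makes the required window grow much faster with issue rate.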

Generic Interval
All intervals follow same basic profile
[Figure: instructions per cycle vs. time (in cycles) for a generic interval: ramp-up as instructions enter the window, ramp-down as the window drains, and a transient due to the miss event whose time depends on the type of miss event]

I-Cache Miss Interval
total time = n/D + ciL1
• n = no. instructions in interval
• D = decode/dispatch width
• ciL1 = miss delay cycles
Predicts performance loss is independent of pipe length
[Figure: I-cache miss interval timeline. Labels: time = n/D, window drains, miss delay, re-fill pipeline]

Independence from Pipe Length
16 K I-cache; ideal D-cache and predictor
Two different pipeline lengths (4 and 8 cycles)
I-cache miss delay 8 cycles
Penalty independent of pipe length
Similar across benchmarks
[Figure: bar chart in cycles (0.0 to 8.0) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, with 4 and 8 front-end stages]

Branch Misprediction Interval
Total time = n/D + cdr(D) + cfe
• n = no. instructions in interval
• D = decode/dispatch width
• cdr(D) = drain cycles; function of width (and ILP)
• cfe = front-end pipeline length
[Figure: branch misprediction interval timeline. Labels: time = n/D, window drain time, time = branch latency, time = pipeline length]

Branch Resolution Time
Assumes mispredicted branch is one of the last instructions to issue
[Figure: stacked bar chart, percentage (0% to 100%) per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr); legend categories 0, 1, 2, 3, 4, 5, >5]

Branch Misprediction Penalty
Branch penalty is dependent on interval length
The penalty can be 2+ times pipeline length
Penalty is less for short intervals; more for long intervals
See ISPASS ’06 paper for more details

Long D-cache Miss Interval
[Figure: long D-cache miss interval timeline. Labels: instructions enter window; issue ramps up to steady state; steady state (time = n/D); load enters window; load issues; ROB fills; issue window empty of issuable insns; data returns from memory; load resolution time; ROB fill time; miss latency]

Long D-cache Miss Interval
For isolated miss:
total time = n/D - W/D + cLr(D) + cL2
• n = no. instructions in interval
• D = decode/dispatch width
• W = window (ROB) size
• cLr(D) = load resolution time; function of width
• cL2 = L2 miss delay
[Figure: long D-cache miss interval timeline. Labels: instructions enter window; issue ramps up to steady state; steady state (time = n/D); load enters window; load issues; ROB fills; issue window empty of issuable insns; data returns from memory; load resolution time; ROB fill time; miss latency]

Miss Event Overlaps
Branch misprediction and I-cache miss effects “serialize”
• i.e. penalties add linearly
Long D-cache misses may overlap with I-cache misses and branch mispredictions (and with each other)
• Overlap with other long D-cache misses is more important
• Overlaps with branch mispredictions and I-cache misses are insignificant
[Figure: diagram with regions labeled branch mispredicts, I-cache misses, and long D-cache misses]

Overlapping Long D-cache Misses
s/D reflects amount of overlap
Total penalty is independent of s/D
[Figure: timeline of two overlapping long D-cache misses. Labels: time = n/D; 1st load enters window; 1st load issues; 2nd load issues, s/D later; ROB fills; issue window empty of issuable insns; miss latency; load 1 data returns from memory]

Experimental Results
For each long miss, collect stats on other misses within a “ROB distance”
• This is a trace statistic
• Assume W/D = cLr
[Figure: bar chart in cycles (0.0 to 200.0) comparing simulation and analytical model for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr]
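
One way to gather the “ROB distance” statistic from a trace is sketched below. It is a simplification under stated assumptions: `miss_positions` holds the dynamic instruction indices of long D-cache misses, and any miss fewer than W instructions after the miss that opened the current group is treated as fully overlapped, so only the first miss of each group pays the penalty. The function name and grouping rule are assumptions made for this illustration.

```python
def count_penalized_long_misses(miss_positions, W):
    """Count long D-cache misses that pay a full miss penalty.

    A miss fewer than W instructions after the miss that opened the
    current group is assumed to overlap with it (both fit in the ROB).
    """
    penalized = 0
    group_head = None
    for pos in sorted(miss_positions):
        if group_head is None or pos - group_head >= W:
            penalized += 1       # isolated miss, or first miss of a new group
            group_head = pos
    return penalized
```

For example, with W = 128 and misses at instructions 0, 50, 300, 310, and 900, three penalties are counted rather than five.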

Overall Performance
Sum over all intervals
• I-cache miss interval: n/D + ciL1
• Branch mispredict: n/D + cdr + cfe
• Long D-cache miss (non-overlapping): n/D - W/D + cLr + cL2
Collect the n/D terms: Ntotal/D
Account for “ceiling inefficiency”: ((D-1)/2D)*(miL1 + mbr + mL2)
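
One reading of the (D-1)/2D factor, offered here as an interpretation rather than the authors' derivation: an interval of n instructions needs ceil(n/D) dispatch cycles, not n/D, and if interval lengths are spread evenly over the residues modulo D, the average excess per interval is (D-1)/2D cycles. The check below uses exact rational arithmetic; `mean_ceiling_excess` is a name introduced for this sketch.

```python
from fractions import Fraction


def mean_ceiling_excess(D):
    """Average of ceil(n/D) - n/D over one full period n = 1..D."""
    total = Fraction(0)
    for n in range(1, D + 1):
        cycles = -(-n // D)              # integer ceil(n / D)
        total += cycles - Fraction(n, D)
    return total / D
```

For D = 4 the excesses are 3/4, 1/2, 1/4, and 0, which average to 3/8 = (4-1)/(2*4).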

Overall Performance
Total Cycles = Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)
+ miL1 * ciL1
+ mbr * (cdr + cfe)
+ mL2 * (- W/D + cLr + cL2)
TLB misses similar to L2 misses
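
The total-cycles expression maps directly onto a small function that returns each component separately, which is also the shape of a CPI stack. The argument names mirror the slide's symbols; the numbers used in the usage note are made up for illustration and are not measurements from the talk.

```python
def cycle_stack(N_total, D, W, m_iL1, m_br, m_L2,
                c_iL1, c_dr, c_fe, c_Lr, c_L2):
    """Per-component cycle counts of the interval model.

    m_* are miss-event counts, c_* are per-event cycle costs.
    m_L2 should count only non-overlapping long D-cache misses.
    """
    return {
        "base": N_total / D + ((D - 1) / (2 * D)) * (m_iL1 + m_br + m_L2),
        "icache": m_iL1 * c_iL1,
        "branch": m_br * (c_dr + c_fe),
        "long_dcache": m_L2 * (-W / D + c_Lr + c_L2),
    }
```

With N_total = 1,000,000, D = 4, W = 128, 1000 I-cache misses of 8 cycles, 5000 mispredicts with c_dr = 6 and c_fe = 5, and 500 long misses with c_Lr = 32 and c_L2 = 200, the components sum to 415,437.5 cycles, a CPI of about 0.415.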

Accuracy
Decode width D = 4: average error 4.2%; max 8.6%
D = 2: error = 1.8%
D = 6: error = 5.6%
D = 8: error = 5.6%

Decode Efficiency
Compare with simulation
• D = 4
mcf dominated by intervals of length 5 and 13
• Less efficient than model would predict
This is an inherent inefficiency due to intervals
• Strongly correlates w/ interval lengths

Convert From Cycles to Time
Important if pipeline depth is to be modeled
• Latch overheads become important
Start with baseline 5-stage front-end
• pb = # pipeline stages in baseline
Allow for arbitrary number of stages
• p = # pipeline stages
• Increase all latencies proportionate to relative depth
• Multiply cycles by p/pb
Convert total cycles to total time
• tp = total pipeline latency; to = latch overhead
• cycle time = tp / p + to
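
The two conversions on this slide can be written as a pair of helpers. The function names are introduced here for illustration, and the time unit is arbitrary; the usage note takes to = 1 and tp = 55, the ratio the talk later borrows from Hartstein and Puzak.

```python
def cycle_time(p, t_p, t_o):
    """Cycle time for a p-stage pipeline: logic delay per stage plus latch overhead."""
    return t_p / p + t_o


def scaled_penalty_time(cycles, p, p_b, t_p, t_o):
    """Time for a penalty of `cycles` cycles measured at baseline depth p_b,
    after scaling the cycle count by p / p_b for a p-stage design."""
    return cycles * (p / p_b) * cycle_time(p, t_p, t_o)
```

At the 5-stage baseline the cycle time is 55/5 + 1 = 12 units. Doubling the depth to 10 stages shortens the cycle to 6.5 units but turns an 8-cycle penalty into 16 cycles, or 104 units instead of 96.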

Convert to Absolute Time
Total Time = [Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp / p + to)
+ miL1 * ciL1 * (p/pb) * (tp / p + to)
+ mbr * (cdr(p,D) + cfe) * (p/pb) * (tp / p + to)
+ mL2 * (- W/Dp + cLr(p,D) + cL2) * (p/pb) * (tp / p + to)
TPI = Total Time/Ntotal
Now, consider some of the terms in isolation

Base TPI + One Linear Miss Event
[Figure: “Component TPI”: TPI (0 to 1.2) vs. pipeline stages (5 to 35) for width 2, width 4, width 6, width 8, and the miss event]
Total Time = [Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp / p + to)
+ miL1 * ciL1 * (p/pb) * (tp / p + to)
+ mbr * (cdr(p,D) + cfe) * (p/pb) * (tp / p + to)
+ mL2 * (- W/Dp + cLr(p,D) + cL2) * (p/pb) * (tp / p + to)
TPI = Total Time/Ntotal
[Figure: “Total TPI”: TPI (0.6 to 1.4) vs. pipeline stages (5 to 35) for width 2, width 4, width 6, width 8, and the miss event]
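
A two-term simplification shows why the total TPI curve has a minimum. The sketch keeps only the base term (1/D cycles per instruction) and one miss event whose cycle cost scales linearly with depth; `m` is misses per instruction and `c` the penalty in baseline cycles. The closed-form optimum is derived here from that simplification by setting the derivative with respect to p to zero; it is not taken from the slides and ignores the ceiling-inefficiency and nonlinear terms.

```python
from math import sqrt


def tpi(p, D, m, c, p_b, t_p, t_o):
    """Time per instruction: base term plus one linearly scaled miss event."""
    cycle = t_p / p + t_o
    return cycle / D + m * c * (p / p_b) * cycle


def optimal_depth(D, m, c, p_b, t_p, t_o):
    """Depth minimizing tpi(): p* = sqrt(t_p * p_b / (D * m * c * t_o))."""
    return sqrt(t_p * p_b / (D * m * c * t_o))
```

Because D appears in the denominator under the square root, the optimum moves to shallower pipelines as the width grows, in line with the inverse width/depth relationship stated elsewhere in the talk.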

Pipelining of Miss Events
[Figure: “Fully Pipelined Unit”: TPI (0 to 5) vs. pipeline stages (5 to 35) for miss event, miss event x 2, miss event x 3, miss event x 4]
Not all paths are fully pipelined
• e.g. cache misses may not be fully pipelined
• A pipeline factor (0 ≤ f ≤ 1) can be added to a term
• Example: I-cache miss: miL1 * ciL1 * (p/pb) * (tp / p + fiL1 * to)
[Figure: “Changing Pipeline Factor”: TPI (0.6 to 1.2) vs. pipeline stages (5 to 35) for pipelined, pipelined .5, pipelined .25, nonpipelined]
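
The effect of the pipeline factor can be sketched for a single miss term. The helper name and sample numbers are assumptions for illustration; `f = 1` corresponds to a fully pipelined miss path and `f = 0` to a non-pipelined one.

```python
def miss_term_time(m, c, p, p_b, t_p, t_o, f=1.0):
    """Time contributed by one miss-event term with pipeline factor f.

    m: miss count (or misses per instruction), c: penalty in baseline cycles.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("pipeline factor f must lie in [0, 1]")
    return m * c * (p / p_b) * (t_p / p + f * t_o)
```

With f = 0 the term reduces to m * c * t_p / p_b, which does not depend on p: a non-pipelined miss path costs the same time at any depth. With f = 1 the latch overhead is paid on every added stage, so the term grows linearly with p.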

Fetch Inefficiency
Inherent fetch inefficiency
• Due to presence of misses
• As opposed to structural inefficiency
• More important for wider pipelines
[Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp / p + to)
[Figure: “Effect of Inefficiency”: TPI (0 to 0.6) vs. pipeline stages (axis ticks 1 to 7) for width 2, width 4, width 6, width 8, w2+ovhd, w4+inherent, w6+inherent, w8+inherent]

Miss Events Dependent on ROB Size
Miss events are dependent on ROB size
• And therefore dependent on depth/width for balanced designs
Branch mispredicts go up due to late update of predictor
L2 miss behavior may be better or worse depending on overlaps
• Deeper pipeline implies longer miss penalty
• Longer ROB implies more MLP

Balanced Superscalar Processor Design
Definition: At iW balance point:
• Under ideal conditions, achieved issue width i = I, but decreasing W means achieved issue width diminishes
• For practical issue widths, there is enough ILP that balance can be achieved (see earlier work)
• Balance does not imply overall width/depth optimality
Provide adequate numbers of other resources
• Issue buffer, load/store buffers, rename regs., functional units, etc.
• Reducing resources below adequate level causes reduced performance

Balanced Superscalar Processor Design
Choose width/depth
Optimize other elements based on width/depth
[Figure: design dependence diagram. Nodes: issue width (achieved width), pipeline depth, ROB size, commit width, I-fetch resources, # rename registers, load/store buffer sizes, numbers of functional units, issue buffer size. Edge labels: beta (~ quadratic) relationship, linear relationship(s), inverse relationship]
At optimal point, wider issue implies shallower pipeline

Optimize Pipeline Depth
Start with baseline 5-stage front-end
• pb = # pipeline stages in baseline
Evaluate 1x, 2x, 3x, 4x, 5x depths
• Increase all latencies proportionate to depths
• Multiply by p/pb
Convert total cycles to total time
• cycle time = tp / p + to
• p = # stages; tp = total pipeline latency; to = latch overhead
Total Time = [Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp / p + to)
+ miL1 * ciL1 * (p/pb) * (tp / p + to)
+ mbr * (cdr(p,D) + cfe) * (p/pb) * (tp / p + to)
+ mL2 * (- W/Dp + cLr(p,D) + cL2) * (p/pb) * (tp / p + to)
TPI = Total Time/Ntotal

Pipeline Depth Results
Use tp/to = 55 as in Hartstein and Puzak
• Also illustrates accuracy of model
• Consider four typical benchmarks:

Pipeline Depth Results
On average, 2X baseline pipeline depth is optimal
Consistent w/ H&P

Optimize Pipeline Width
In general, wider means higher performance (to 8-wide)
Optimal depth becomes shallower as width grows
Diminishing returns w/ wider pipelines
• 4 vs. 2: 13.3%; 6 vs. 4: 7.1%; 8 vs. 6: 2.9%

Short Interval Effects
With short intervals, may never reach peak issue rate
Example: assume 1 mispredict every 96 instructions
• E.g. SPEC benchmark crafty with 4K gshare
• Max issue rate never reached for D = 6, 8
Yet, there is a benefit from wider pipelines
[Figure: IPC (0 to 7) vs. cycle (0 to 60) for D = 2, 4, 6, 8]

Benefit Does Not Come From Issue Width
Benefit comes from wider decode/dispatch width
• Get to next I-cache miss sooner
• Resolve branch mispredicts sooner
• Benefit comes from faster ramp-up
• D = 8 faster than D = 6
• D = 8, I = 6 gives same performance as D = 8, I = 8
[Figure: IPC (0 to 7) vs. cycle (0 to 60) for D = 2, 4, 6, 8]

Potential High Perf Processor
Widen Fetch, Decode, Retire
• Keep relatively narrow issue
Lengthen ROB
• And related structures
[Figure: superscalar processor block diagram (branch predictor, I-cache, fetch buffer, decode pipeline, issue buffer, execution units, reorder buffer, physical register file(s), load and store queues, L1 data cache, L2 cache, main memory) with its annotated parameters]

Issue Buffer Sizing
Similar to ROB sizing
Use average path rather than average critical path (see Tejas thesis)
[Figure: issue buffer size (0 to 250) vs. reorder buffer size (0 to 800), with linear fit y = 0.3115x]
Processor      ROB Size   Issue Buffer   Ratio
Intel Core     96         32             .3
Power4         100        36             .4
MIPS R10K      64         20             .3
Pentium Pro    40         20             .5
Alpha 21264    80         20             .25
Opteron        72         24             .3
AMD K5         16         4              .25

Function Unit Demand Variation
Example: gcc
[Figure: IALU demand (0 to 1) vs. instructions (millions, 2 to 92); series: mean, mean + 1 stdev, mean + 2 stdev, actual]

Function Unit Resources
Demand proportional to instruction mix
Dependent on program and phases
• Collect phase-based data
Must be an integer
Number of functional units of type k:
Fk = I (Dk) + (2 (Dk)) Lk
• Lk = issue latency for unit k
• Gk = fraction using unit k
Use similar approach for other hardware resources

Comparison With H&P
H&P:
Total Time = Ntotal/α * (tp / p + to)
+ γ NH * (to p + tp)
Empirical: fit to detailed simulation data to determine α and γ
• Requires re-simulation if caches/predictor/pipeline factor, etc. change
Interval Model:
Total Time = Ntotal/D * (tp / p + to) + ((D-1)/2D)*(miL1 + mbr + mL2) * (tp / p + to)
+ miL1 * ciL1 * 1/pb * (to piL1 + tp)
+ mbr * (cdr(p,D)/p + cfe) * 1/pb * (to p + tp)
+ mL2 * (- W/Dp + cLr(p,D)/p + cL2) * 1/pb * (to pL2 + tp)
Mechanistic: bottom-up; no need to perform detailed simulation
• Not all hazard terms are linear in p
• Not all hazard terms are independent of D

Application: Performance Architecture
Construct performance counters based on interval model
Total cycle counter + one counter per miss event type
Front-end miss events
• Front-end Miss Event Table (FMT)
Back-end miss events
• Begin counting when full ROB stalls
• Increment appropriate counter depending on inst. at ROB head: D-TLB miss, L2 D-cache miss, L1 D-cache miss, long functional unit (divide)
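
The back-end counting rule can be mimicked in software over a per-cycle trace. This is only a behavioral sketch: the trace format (one `(rob_full_and_stalled, head_kind)` pair per cycle) and the counter names are assumptions made here, and the hardware classification of the instruction at the ROB head is taken as given.

```python
from collections import Counter


def attribute_backend_stalls(cycles):
    """Accumulate a total-cycle counter plus one counter per back-end miss type.

    cycles: iterable of (rob_full_and_stalled, head_kind), where head_kind is
    e.g. "dtlb_miss", "l2_dcache_miss", "l1_dcache_miss", or "long_fu".
    """
    counters = Counter()
    for stalled, head_kind in cycles:
        counters["total_cycles"] += 1
        if stalled:
            # Charge the stall cycle to whatever blocks the ROB head.
            counters[head_kind] += 1
    return counters
```

Dividing each counter by the number of retired instructions gives the back-end components of a CPI stack.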

Performance Architecture: FMT
One entry per outstanding branch
Tracks pre-window instructions
• between fetch and dispatch tail
Tracks in-flight instructions
• between ROB tail and ROB head
Table increments
• For I1 or I2 miss or I-TLB miss, increment counter pointed to by fetch
• Branch penalty counters between head and tail increment every cycle
Counter updates
• When correctly predicted branch retires, update I1, I2, I-TLB counters
• When mispredicted branch retires, update branch mispredict counter (and continue counting until first instruction is dispatched)

Simplified FMT
Shared I1, I2, I-TLB entry
Instructions in ROB marked w/ I-cache miss or I-TLB miss
When a miss instruction retires,
• Shared entry is copied to counters
• ROB tag bits are cleared
When a mispredicted branch retires,
• Add to branch mispredict counter
• Clear shared entries

Evaluation
Compare:
• Simulation: add miss events one at a time and measure difference
• Simulation-rev: same as above, but reverse order of miss events
• naïve: count miss events, multiply by fixed penalty
• naïve non-spec: similar to above, but wrong-path events not counted
• Power5: IBM Power5 method
• FMT
• sFMT

Evaluation

Comparison
FMT and sFMT are most accurate
• naïve is worst
FMT and sFMT similar
• Simplified version is adequate
Power5 underestimates front-end miss events

Interval Model Development
Michaud, Seznec, Jourdan: issue transient
Tejas Gap model: all transients
Taha and Wills: interval (macro block) model
Hartstein and Puzak: optimal pipelines

Conclusions
Intervals yield a divide-and-conquer approach
Supports intuition (adds confidence to intuition)
It's all about transients
• The only things that count are cache misses and branch mispredictions
Application to automated design, performance monitoring, very fast simulation, optimizing compiler analysis, etc.
Analysis of pipeline limits
• Reinforces conventional wisdom
• We are close to the practical limits for depth and width
Extends to energy modeling (Tejas PhD)