Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
Sangyeun Cho, U of Minnesota/Samsung
Pen-Chung Yew, U of Minnesota
Gyungho Lee, U of Texas at San Antonio
1999 ACM/IEEE International Symposium on Computer Architecture (ISCA '99)
May 1, 1999
Roadmap
Need for Higher-Bandwidth Caches
Multi-Ported Data Caches
Data Decoupling
– Motivation
– Approach
– Implementation Issues
– Quantitative Evaluation
Conclusions
Wide-Issue Superscalar Processors

[Figure: superscalar pipeline — Fetch, Decode, Dispatch, Complete, Retire stages, with Instruction/Decode Buffer, Dispatch Buffer, Reservation Stations, Reorder/Completion Buffer, Store Buffer, Load/Store Units, and data cache]

Current Generation
– Alpha 21264
– Intel's Merced
Future Generation (IEEE Computer, Sept. '97)
– Superspeculative Processors
– Trace Processors
Multi-Ported Data Caches

Cache Built with Multi-Ported Cells
Replicated Cache
– Alpha 21164
Interleaved Cache
– MIPS R10K
Time-Division Multiplexing
– Alpha 21264
Replicated Cache

Pros
– Simple design
– Symmetric read ports
Cons
– Doubled area
– Exclusive writes for data coherence

[Figure: fetch unit feeding two identical cache copies, X and Y; each copy serves one load, while a store is sent to both]
Time-Division Multiplexed Cache

Pros
– True 2-port cache
Cons
– Hardware design complexity
– Not scalable beyond 2 ports

[Figure: fetch unit and a single cache serving two load/store ports by time-division multiplexing]
Interleaved Cache

Pros
– Scalable
Cons
– Asymmetric ports
– Bank conflicts
– Constraints on the number of banks

[Figure: "even" and "odd" cache banks, each serving the load/store accesses whose addresses map to that bank]
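The even/odd banking above can be illustrated with a minimal sketch. This is not the paper's mechanism, just a toy model assuming word-granularity interleaving, where the low bit of the word address selects the bank and two simultaneous accesses conflict only if they pick the same bank:

```python
WORD = 4  # assumed word size in bytes

def bank(addr, n_banks=2):
    """Select the cache bank for a byte address under word interleaving."""
    return (addr // WORD) % n_banks

def conflicts(addr_a, addr_b):
    """Two same-cycle accesses conflict when they map to the same bank."""
    return bank(addr_a) == bank(addr_b)

# Adjacent words land in different banks and can issue in parallel;
# same-parity words contend for one bank.
assert not conflicts(0x100, 0x104)
assert conflicts(0x100, 0x108)
```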
Window Logic Complexity

Pointed out as the major source of hardware complexity (Palacharla et al., ISCA '97)
More severe for the memory window
– Difficult to partition
– Thick network needed to connect RSs and LSUs

[Figure: dispatch feeding the reservation stations, with a wide network connecting them to four load/store units and the data cache]
Data Decoupling

A divide-and-conquer approach
– Instructions partitioned before entering RS
– Narrower networks
– Fewer ports to each cache

[Figure: dispatch splits memory instructions into two streams; network "X" connects two LSUs to cache "X", and network "Y" connects two LSUs to cache "Y"]
Data Decoupling: Operating Issues

Memory Stream Partitioning
– Hardware classification
– Compiler classification
Load Balancing
– Enough instructions in different groups?
– Are they well interleaved?

[Figure: dispatch stage deciding which set of reservation stations each memory instruction is sent to]
Case for Decoupling Stack Accesses

Easily Identifiable
– Hardware Mechanism: a simple 1-bit predictor with enough context information works well (>99.9%).
– Compiler Mechanism: helps reduce the prediction table space required for good performance, but is not essential.
Many of Them
– 30% of loads, 48% of stores
Well-Interleaved
– Continuous supply of stack references with a reasonable window size
Details are found in:
– Cho, Yew, and Lee. "Access Region Locality for High-Bandwidth Processor Memory System Design", CSTR #99-004, Univ. of Minnesota.
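The 1-bit predictor above can be sketched in a few lines. This is a hypothetical software model, not the hardware design: it assumes a table of 1-bit entries indexed by a hash of the instruction's PC and some context value (e.g., a call-site identifier), each bit recording whether that instruction's last access was to the stack region:

```python
class RegionPredictor:
    """Toy 1-bit access-region predictor (names are illustrative)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0] * entries  # 1 bit per entry: 1 = stack, 0 = other

    def _index(self, pc, context=0):
        # simple hash of PC and context information into the table
        return (pc ^ context) % self.entries

    def predict(self, pc, context=0):
        return self.table[self._index(pc, context)] == 1

    def update(self, pc, was_stack, context=0):
        # 1-bit "last outcome" state, as with a last-value predictor
        self.table[self._index(pc, context)] = 1 if was_stack else 0

pred = RegionPredictor()
# A load that consistently touches the stack is predicted correctly
# after a single training update.
pred.update(pc=0x4000, was_stack=True)
assert pred.predict(pc=0x4000)
```

Because most static memory instructions touch only one region (the access region locality shown later), even this minimal 1-bit state achieves very high accuracy.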
Data Decoupling: Mechanism

Dynamically Predicting Access Regions for Partitioning Memory Instructions
– Utilize access region locality
– Refer to context information, e.g., global branch history, call-site identifier
Dynamically Verifying Region Prediction
– Let the TLB (i.e., page table) contain verification information such that a memory access is reissued on a misprediction.
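The verification step can be sketched as follows. This is a toy software model of the idea, not the hardware implementation: it assumes each page-table/TLB entry carries a region tag, so the predicted region can be checked at address translation time and the access reissued to the correct partition on a mismatch:

```python
STACK, OTHER = "stack", "other"
PAGE_SHIFT = 12  # assumed 4 KB pages

# toy page table: virtual page number -> region tag (illustrative addresses)
page_region = {0x7FFF: STACK, 0x1000: OTHER}

def issue_access(addr, predicted_region):
    """Issue a memory access under a region prediction.

    Returns ("ok", region) when the prediction was correct, or
    ("reissue", region) when translation reveals a misprediction and
    the access must be squashed and reissued to the other partition."""
    page = addr >> PAGE_SHIFT
    actual = page_region.get(page, OTHER)
    if actual != predicted_region:
        return ("reissue", actual)
    return ("ok", actual)

assert issue_access(0x7FFF123, STACK) == ("ok", STACK)       # correct
assert issue_access(0x1000040, STACK) == ("reissue", OTHER)  # mispredicted
```

Since the region check piggybacks on a translation that happens anyway, verification adds no extra table lookups on the common, correctly predicted path.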
Data Decoupling: Mechanism, Cont'd

[Chart: access region locality — for each benchmark (099 go through 107 mgrid, plus Int.Avg and FP.Avg), the fraction (0 to 1) of references falling into the region combinations D/H/S, H/S, D/S, D/H, S, H, and D]
Data Decoupling: Mechanism, Cont'd

Dynamic Partitioning Accuracy

[Chart: prediction rate (98–100%) per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid), with and without compiler hints, for prediction tables of 1 KB, 2 KB, 4 KB, 8 KB, and unlimited size]
Data Decoupling: Optimizations

Fast Forwarding
– Uses the offset (used with $sp) to resolve dependences
– Can shorten latency

  st r3, 8($sp)
  ...
  ld r4, 8($sp)    ← address matched!

Access Combining
– Combines accesses to adjacent locations
– Can save bandwidth

  st r3, 4($sp)
  st r4, 8($sp)    →    st {r3,r4}, {4,8}($sp)
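The access-combining optimization above can be sketched as a small merge pass. This is an illustrative software model under assumed 4-byte words, not the hardware queue logic: pending $sp-relative stores to adjacent words are paired into a single wider access, matching the st r3,4($sp) / st r4,8($sp) example on the slide:

```python
WORD = 4  # assumed word size in bytes

def combine_stores(stores):
    """2-way combining of pending $sp-relative stores.

    stores: list of (offset, value) pairs. Returns a list of issued
    accesses, where two stores to adjacent words are merged into one
    combined ("st2") access, saving one cache port that cycle."""
    stores = sorted(stores)
    out, i = [], 0
    while i < len(stores):
        if i + 1 < len(stores) and stores[i + 1][0] == stores[i][0] + WORD:
            # adjacent words: issue as one combined double-word store
            out.append(("st2", stores[i][0], (stores[i][1], stores[i + 1][1])))
            i += 2
        else:
            out.append(("st", stores[i][0], stores[i][1]))
            i += 1
    return out

# st r3,4($sp); st r4,8($sp)  ->  one combined store, as on the slide
assert combine_stores([(4, "r3"), (8, "r4")]) == [("st2", 4, ("r3", "r4"))]
```

Stack frames are small and accesses to them cluster at adjacent offsets, which is why even 2-way combining recovers most of the available bandwidth (as the evaluation below shows).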
Benchmark Programs

Benchmark     Input                         Inst. Count
099.go        train                         541M
124.m88ksim   reference                     250M
126.gcc       stmt-protoize.i               220M
129.compress  train (100K)                  293M
130.li        ctak.lsp                      434M
132.ijpeg     penguin.ppm                   621M
134.perl      scrabble.pl                   525M
147.vortex    train (1 iteration)           284M
101.tomcatv   test (N = 254, 1 iteration)   549M
102.swim      test (3 iterations)           473M
103.su2cor    test                          676M
107.mgrid     train (1 iteration)           684M
Program's Memory Accesses

[Chart: frequency (%) of stack accesses vs. other accesses per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid), y-axis 0–35%]
Program's Frame Size Distribution

Stack references tend to access a small region.
The average size of dynamic frames was around 3 words; the average size of static frames was around 7 words.

[Histogram: frequency (%) of frame sizes from 0 to 16 words, y-axis 0–40%]
Base Machine Model

Issue Width         16
Registers           32 GPRs / 32 FPRs
ROB / LSQ Size      128 / 64
Functional Units    Integer: 16 ALUs, 4 MULT/DIV units
                    FP: 16 ALUs, 4 MULT/DIV units
L1 D-Cache          32 KB, 2-way set-associative, 2-cycle access
L2 D-Cache          512 KB, 4-way set-associative, 12-cycle access
Memory              50-cycle access, fully interleaved
I-Cache             Perfect (100% hit), 1-cycle access
Branch Prediction   Perfect (100% correct)
Instruction Lat.    Same as MIPS R10000
Program's Bandwidth Requirements

Performance suffers greatly with fewer than 3 cache ports.
We study 3 cases:
– Cache has 2 ports
– Cache has 3 ports
– Cache has 4 ports

[Chart: relative performance (%) of the Integer and FP averages as the number of cache ports varies; plotted values include 62.5, 70.5, 88.0, 91.4, 96.1, 99.4, 97.3, 98.8, 98.4, 99.2]
Impact of LVC Size

2 KB and 4 KB LVCs achieve high hit rates (~99.9%).
Set associativity is less important if the LVC is 2 KB or larger.
A small, simple LVC works well.

[Chart: miss rate (%) vs. LVC size (0.5 KB, 1 KB, 2 KB, 4 KB) for 126.gcc, 129.compress, and the average; plotted values include 8.42, 3.98, 2.30, 1.12, 0.73, 0.44, 0.19, 0.09, 0.02, 0.00]
Fast Data Forwarding

Performance Improvement (%):

Benchmark    099  124  126  129  130  132  134  147  101  102  103  107
Improvement  2.1  0.0  1.2  1.2  0.3  1.9  3.1  3.9  3.9  0.2  0.5  0.0
Access Combining

Effective (over 8% improvement) when LVC bandwidth is scarce.
2-way combining is enough.

[Chart: improvement (%) over "no combining" for 2-way, 3-way, and 4-way combining under the (3+1) and (3+2) configurations; plotted values include 8.4, 10.1, 10.8 and 2.1, 2.1, 2.3]
Performance of Various Configurations

[Chart: improvement (%) over the (2+0) configuration for (N+0) through (N+5), with N = 2, 3, 4; plotted values include 0.0, 3.4, 6.4, 6.7, 8.2, 8.8, 9.3, 9.5, 10.3, 11.6, 12.4, 12.6, 12.9, 13.1]
Performance of 126.gcc

[Chart: improvement (%) over (2+0) for (N+0) through (N+5), with N = 2, 3, 4; plotted values include 0.0, 2.4, 5.7, 6.7, 10.9, 12.8, 14.0, 14.5, 14.7, 15.0, 16.6, 18.1, 18.5, 18.8, 19.7, 20.0, 20.2]
Performance of 130.li

[Chart: improvement (%) over (2+0) for (N+0) through (N+5), with N = 2, 3, 4; plotted values include 0.0, 14.3, 22.1, 23.6, 24.6, 25.3, 26.1, 26.3, 28.7, 29.7, 30.0, 30.4, 30.8, 31.0, 31.3]
Performance of 102.swim

[Chart: improvement (%) over (2+0) for (N+0) through (N+5), with N = 2, 3, 4; plotted values include 0.0, 2.8, 4.4, 4.7, 6.0, 6.3, 6.6, 6.9]
Other Findings

LVC hit latency has less impact than data cache latency due to
– Many loads hitting in the LVAQ
– Out-of-order issuing
Addition of the LVC reduced conflict misses in
– 130.li (by 24%) and 147.vortex (by 7%)
– May reduce bandwidth requirements on the bus to the L2 cache
Overall Performance

[Chart: improvement (%) over the (2+0) baseline per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid) for four configurations: (2+2) with 1-cycle LVC and 2-cycle cache; (3+3) with 1-cycle LVC and 2-cycle cache; (4+0) with 2-cycle cache; (4+0) with 3-cycle cache. Improvements range from about -7% to 39%]
Conclusions

Superscalar processors will be around…
– But their design complexity will call for architectural solutions.
– Memory bandwidth becomes critical.
Data Decoupling is a way to
– Decrease the hardware complexity of memory issue logic and caches.
– Provide additional bandwidth for decoupled stack accesses.