ICS’02
UPC
An Interleaved Cache Clustered VLIW Processor
E. Gibert, J. Sánchez* and A. González*
Dept. d’Arquitectura de Computadors
Universitat Politècnica de Catalunya (UPC)
* Also at Intel Barcelona Research Center
June 2002
Motivation
Capacity-bound vs. Communication-bound
Solution: clustered microarchitectures
• Partition some hardware resources
• Simpler + faster
• Power consumption
• Communications not homogeneous
Goal: clustering the memory hierarchy in statically scheduled processors
Talk Outline
State-of-the-art: multiVLIW
Interleaved Cache Clustered VLIW
Scheduling Algorithms
Enhancement: Attraction Buffers
Experimental Framework
Results
Conclusions
State-of-the-art: MultiVLIW
Sánchez and González [MICRO’00]
[Figure: MultiVLIW organization. Clusters 1 to n each contain a register file, functional units (F.U.) and an L1 data cache. The L1 data caches are linked by a coherency network and connect to the next memory level; the clusters communicate through register-to-register buses.]
Basic Interleaved Cache Clustered VLIW Processor
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units (FUs) and a cache module, connected by register-to-register buses and, through memory buses, to the next memory level. A cache block (TAG, W0 to W7) is word-interleaved across the cache modules: cluster 1 holds TAG, W0, W4 (subblock 1); cluster 2 holds TAG, W1, W5; cluster 3 holds TAG, W2, W6; cluster 4 holds TAG, W3, W7.]
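The word-to-module mapping in the figure (W0 and W4 in cluster 1, W1 and W5 in cluster 2, and so on) can be sketched as below. This is only an illustration, assuming 4 clusters and a 4-byte interleaving factor as in the experimental setup later in the talk; the name `home_cluster` is hypothetical and does not come from the slides.

```python
N_CLUSTERS = 4   # number of clusters (assumed, matches the figure)
INTERLEAVE = 4   # interleaving factor in bytes (assumed: one word)


def home_cluster(address: int) -> int:
    """Return the 1-based cluster whose cache module holds this byte address."""
    word_index = address // INTERLEAVE      # which word the byte falls in
    return word_index % N_CLUSTERS + 1      # consecutive words rotate over clusters


# Words W0..W7 of a 32-byte cache block that starts at address 0.
for w in range(8):
    print(f"W{w} -> cluster {home_cluster(w * INTERLEAVE)}")
```

Under these assumptions, two addresses that differ by a multiple of N x I = 16 bytes always map to the same cluster, which is the property the scheduling slides rely on.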
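Modulo Scheduling
Extract ILP from loops by overlapping the execution of successive iterations
[Figure: loop L with operations A, B and C. Successive iterations (A, B, C; A’, B’, C’; A’’, B’’, C’’) start II cycles apart; the figure marks the initiation interval (II), the stage count (SC) and the kernel.]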
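To make the benefit of overlapping iterations concrete, here is a small sketch of the usual cycle-count estimate for a modulo-scheduled loop. It assumes every stage is II cycles long (so one iteration takes SC x II cycles) and ignores stalls; the function name is hypothetical and the numbers are illustrative, not results from the talk.

```python
def modulo_schedule_cycles(iterations: int, ii: int, stage_count: int) -> int:
    """Cycles to run a modulo-scheduled loop, prologue and epilogue included.

    A new iteration starts every `ii` cycles and each iteration spans
    `stage_count` stages of `ii` cycles, so the last iteration finishes
    (iterations + stage_count - 1) * ii cycles after the first one starts.
    """
    if iterations <= 0:
        return 0
    return (iterations + stage_count - 1) * ii


# Three single-cycle stages (A, B, C) and II = 1, as in the figure.
overlapped = modulo_schedule_cycles(iterations=100, ii=1, stage_count=3)
sequential = 100 * 3 * 1   # same loop with no overlap between iterations
print(overlapped, sequential)
```

The estimate shows why the scheduler tries to keep II small: for long loops the cycle count is dominated by iterations x II.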
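Base Scheduling Algorithm
Used for Unified Cache
[Flowchart: START; sort nodes; take the next node and select the possible clusters. How many? If 0, set II=II+1 and start again. If >0, keep the clusters with the best profit in output edges. How many? If 1, schedule the node there. If >1, choose the least loaded cluster and schedule the node there.]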
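The cluster-selection step of the flowchart can be sketched as follows. This is a hedged illustration, not the authors' implementation: the slides do not say how the profit in output edges or the cluster load is computed, so both are passed in as precomputed numbers, and `choose_cluster` and its parameters are hypothetical names.

```python
from typing import Dict, List, Optional


def choose_cluster(candidates: List[int],
                   edge_profit: Dict[int, int],
                   load: Dict[int, int]) -> Optional[int]:
    """Pick a cluster for one node, or return None if no cluster is possible.

    candidates  -- clusters where the node can be scheduled at the current II
    edge_profit -- assumed precomputed "profit in output edges" per cluster
    load        -- assumed precomputed number of operations per cluster
    """
    if not candidates:
        return None                        # caller retries with II = II + 1
    best = max(edge_profit[c] for c in candidates)
    tied = [c for c in candidates if edge_profit[c] == best]
    if len(tied) == 1:
        return tied[0]                     # a single best-profit cluster
    return min(tied, key=lambda c: load[c])  # tie: take the least loaded


print(choose_cluster([1, 2, 3], {1: 2, 2: 2, 3: 0}, {1: 4, 2: 1, 3: 0}))
```

A driver loop would call this for each node in sorted order and restart the whole schedule with a larger II whenever it returns None.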
Interleaved Cache Scheduling Algorithm
Unroll the loop to maximize the number of instructions with a stride that is a multiple of NxI (N = number of clusters, I = interleaving factor); each such instruction accesses ONE cache module
Assign latencies to memory instructions
Assign memory instructions to clusters:
– IPBC (Interleaved Pre-Build Chains): minimize stall time
– IBC (Interleaved Build Chains): minimize compute time
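The unrolling step can be illustrated with a little arithmetic. The slides do not state how the unrolling factor is chosen, so the sketch below makes an assumption: it returns the smallest factor that turns a given stride into a multiple of N x I. The function name and default values are hypothetical (the defaults follow the 4-cluster, 4-byte configuration used later in the talk).

```python
from math import gcd


def unroll_factor(stride_bytes: int, n_clusters: int = 4, interleave: int = 4) -> int:
    """Smallest unroll factor U such that U * stride is a multiple of N * I.

    After unrolling by U, each copy of the memory instruction advances by
    U * stride bytes per iteration and therefore keeps hitting one module.
    """
    period = n_clusters * interleave            # N x I, in bytes
    return period // gcd(period, abs(stride_bytes))


for stride in (4, 8, 12, 16):
    print(f"stride {stride:2d} bytes -> unroll x{unroll_factor(stride)}")
```

For a 4-byte stride this gives an unroll factor of 4 and a 16-byte stride per unrolled instruction. A real compiler would presumably also bound the factor by code size and register pressure, which this sketch ignores.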
Memory Dependent Instructions
[Figure: a dependence graph of loads, stores and adds in which the memory instructions form two memory dependent chains (chain 1 and chain 2). The memory instructions are annotated with their preferred cluster: Preferred=1, Preferred=1, Preferred=2, Preferred=2.]
IPBC: preferred cluster info is used
vs.
IBC: minimize register communications
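One way to picture memory dependent chains is as connected groups of memory instructions that must stay in the same cluster. The sketch below groups instructions with a small union-find and then picks a cluster per chain. The grouping follows the figure; the cluster choice (most frequent preferred cluster in the chain) is purely an assumption for illustration, since the slides only say that IPBC uses the preferred-cluster information. All names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_chains(mem_ops: Iterable[str],
                 dependences: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group memory operations linked by memory dependences into chains."""
    parent = {op: op for op in mem_ops}

    def find(op: str) -> str:
        while parent[op] != op:
            parent[op] = parent[parent[op]]   # path halving
            op = parent[op]
        return op

    for a, b in dependences:
        parent[find(a)] = find(b)             # merge the two chains

    chains = defaultdict(list)
    for op in parent:
        chains[find(op)].append(op)
    return list(chains.values())


def chain_cluster(chain: List[str], preferred: Dict[str, int]) -> int:
    """Assumed policy: the most frequent preferred cluster within the chain."""
    return Counter(preferred[op] for op in chain).most_common(1)[0][0]


ops = ["st1", "ld1", "ld2", "st2"]
deps = [("st1", "ld1"), ("ld2", "st2")]
preferred = {"st1": 1, "ld1": 1, "ld2": 2, "st2": 2}
for chain in build_chains(ops, deps):
    print(chain, "-> cluster", chain_cluster(chain, preferred))
```

With the annotations from the figure, the two chains end up in clusters 1 and 2 respectively, so every access in each chain is local to its preferred module.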
Enhancement: Attraction Buffers
[Figure: inside a cluster, the local logic sends the ADDRESS both to the CACHE MODULE (TAG, W2, W6, with word select), which holds the local data, and to the ATTRACTION BUFFER (ABuffer; TAG, W, with a tag comparison); each returns data and a hit signal.]
An Example

for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
}

Unroll x4

for (i=0; i<MAX; i+=4) {
    ld r31, a[i]      (stride 16)
    ld r32, a[i+1]
    ld r33, a[i+2]
    ld r34, a[i+3]
    r41 = OP(r31)
    r42 = OP(r32)
    r43 = OP(r33)
    r44 = OP(r34)
    st r41, b[i]
    st r42, b[i+1]
    st r43, b[i+2]
    st r44, b[i+3]
}

16 byte strides (NxI multiple)
N = 4 clusters, I = 4 bytes

[Figure: memory holds a[0] a[1] a[2] a[3] ... interleaved across CLUSTER 1 to CLUSTER 4. In CLUSTER 4 the local module holds a[3] and a[7] and the ABuffer holds a[0] and a[4]; the instruction shown is ld r31, a[0].]
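The behaviour in the figure can be mimicked with a toy model of one cluster. This is a sketch under several assumptions that the slides do not confirm: array a starts at address 0 with 4-byte elements, a remote access copies the whole remote subblock into the Attraction Buffer (which is what the figure suggests, with a[0] and a[4] both present), the buffer never evicts anything, and stores and cache misses are ignored. The class and its fields are hypothetical.

```python
N_CLUSTERS = 4    # clusters (from the example)
INTERLEAVE = 4    # interleaving factor in bytes (from the example)
LINE_WORDS = 8    # words per cache block (32-byte lines, 4-byte words)


class Cluster:
    """Toy model of one cluster: local cache module plus Attraction Buffer."""

    def __init__(self, cid: int) -> None:
        self.cid = cid              # 1-based cluster id
        self.abuffer = set()        # word indices copied from remote modules
        self.counts = {"local": 0, "abuffer": 0, "remote": 0}

    def load(self, address: int) -> str:
        word = address // INTERLEAVE
        home = word % N_CLUSTERS + 1
        if home == self.cid:
            kind = "local"          # word lives in this cluster's module
        elif word in self.abuffer:
            kind = "abuffer"        # remote word already attracted
        else:
            kind = "remote"         # go over the memory buses...
            base = (word // LINE_WORDS) * LINE_WORDS
            # ...and (assumption) attract the whole remote subblock.
            self.abuffer.update(w for w in range(base, base + LINE_WORDS)
                                if w % N_CLUSTERS == word % N_CLUSTERS)
        self.counts[kind] += 1
        return kind


# "ld r31, a[i]" (i = 0, 4, 8, 12) scheduled in cluster 4, a[] at address 0.
c4 = Cluster(4)
trace = [c4.load(4 * i) for i in (0, 4, 8, 12)]
print(trace, c4.counts)
```

In this toy run every second access to the remote subblock becomes an Attraction Buffer hit instead of a remote access, which is the effect the enhancement aims for.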
Enhancement: Attraction Buffers
Why remote accesses? Why Attraction Buffers?
– Double precision accesses (low benefit)
– Indirect accesses: a[b[i]] (low benefit)
– “Unclear” preferred cluster (big benefit)

    for (i=0; i<MAX; i++)
        for (k=i; k<i+MAX; k+=4)
            ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]

– Memory dependent chains (big benefit)
– IBC: preferred cluster info is not used (big benefit)
Experimental Framework
IMPACT C compiler
Modulo scheduling on hyperblock loops
– BASE for a Unified Cache
– IPBC and IBC for an Interleaved Cache
– IPBC and IBC for the MultiVLIW
– The same unrolling factor has been used for all architecture configurations!
Mediabench benchmark suite
Experimental Framework
Number of clusters: 4
Functional units: 1 FP / cluster + 1 int / cluster + 1 mem / cluster
Cache configuration: 8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses: 4 buses that run at ½ the core frequency
Memory buses: 4 buses that run at ½ (or ¼) the core frequency
Next memory level: 4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache): 4 bytes
Latencies: 1-10 (Unified Cache + MultiVLIW); 1-(5/6)-10-15 (Interleaved Cache)
Results (I)
IPBC vs IBC: similar cycle count results
MultiVLIW vs Interleaved: similar results BUT… lower complexity!
[Figure: Comparison with Unified Cache and MultiVLIW. Number of cycles (y axis, 0 to 2) for epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc and rasta. Legend: stall time, compute time MULTIVLIW, compute time INTERLEAVED; bar groups labelled 1/2 and 1/4.]
Results (II)
Memory dependent chains
– Interleaved cache: workload unbalance + remote accesses
– MultiVLIW: workload unbalance
– Working on techniques to overcome scheduling restrictions
[Figure: Interleaved and MultiVLIW with and without chains. Number of cycles (y axis, 0 to 1.5) for epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc and rasta. Legend: stall time, compute time MULTIVLIW, compute time INTERLEAVED; bar groups labelled no chains and chains.]
Results (III)
[Figure: Remote hit classification with no ABuffers and with ABuffers. Remote hits (y axis, 0 to 140) for epicdec, epicenc, gsmdec, gsmenc, jpegdec, jpegenc, mpeg2dec, mpeg2enc and rasta, classified as CLOC, CLSC and OTHER; bar groups labelled No ABuffers and ABuffers.]
Local hits are increased by 15%
Stall time reduced by 30%
Conclusions
Scheduling Algorithms
– Good latency assignment process (stall time accounts for 9% of execution time)
– Coherence kept through memory dependent chains (5% cycle count degradation)
Attraction Buffers
– Effective to increase local hits (15% average) + reduce stall time (30% average)
– Reduce remote hits to previously accessed subblocks (70% average)
Cycle count results
– similar to Unified Cache and MultiVLIW
Questions