Architectural Enhancements for Efficient Operand Transport in Multimedia Systems
ECE7102 Class Presentation
Date: 2006. 4. 13
Hongkyu Kim
2/40
Overview
• Introduction
• Characterization and modeling of operand usage and transport
• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain
• Summary
3/40
Interconnect Complexity
[Figure: functional units (FU) and storage blocks joined by an interconnect, drawn at two scales: a few FUs and storage blocks, then many]
• Exponential increase of chip capacity → More devices
• Exponential decrease of feature size → Interconnect limitation
J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp.28-35, May/June 2003.
4/40
Interconnect Bottleneck
ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.
[Figure: relative delay (log scale, 0.1 to 100) versus process technology node (nm), from 250 nm downward; curves annotated α and 1/α²]
• Disparity between wire delay and gate delay
5/40
Problem Statement
• High-performance interconnect
– Interconnect organizations
– Interconnect technologies
• Why are architectural responses limited?
– Compatibility with old ISAs
  • Sequentially-specified operations
  • Restricted register file-based operand namespace
– ILP mechanisms
  • Operand bypass network, register renaming, and instruction scheduling
  • Poorly scaling broadcast buses
6/40
Research Objective and Approach
• Objective: Reduce latency of operand transport for multimedia
– Development of dynamic execution techniques
– Development of low-cost operand bypass networks
• Approach summary
[Diagram: three columns labeled Background work, General approach, and Application-specific approach]
– Analysis of operands: examine operand usage properties; explore the impact of architectural techniques on the operand transport
– Technology model-based evaluation on target platforms: GENESYS, SimpleScalar
– Dynamic execution technique: instruction clustering; recognition of regular operand transport patterns; efficient execution unit
– Cluster mapping on inter-ALU network: basic instruction clustering; raw cluster mapping; local operand mapping on dedicated inter-ALU path
– Optimizing operand transport for multimedia systems: regular pattern recognition; cluster reorganization; function remapping; dynamic SIMDization
8/40
Motivation and Approach
• Motivation
– Shift of microarchitectural design focus: Operand computation → Operand communication
– Recognizing and understanding operand usage and transport properties → Efficiently controlling operand traffic
• Approach summary
– Operand usage characteristics
  • How often operands are used → Examine temporal property
  • Where operands are used → Examine spatial property
– Operand transport properties
  • What accounts for the majority of communication needs → Explore the impact of architectural techniques on the operand transport
9/40
Operand Usage Analysis
• General terms
– Operands: values in registers, memory locations, or memory addresses
– Operand transport: buffering and delivery of operands to FUs
• Operands’ temporal characteristics
– Which instructions consume operands after they are produced
– Metrics: Degree of use, Age, Lifetime
• Operands’ spatial characteristics
– From/to which FU operands are moved in the execution model
– Metrics: Degree of functionality, Transport pattern
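The temporal metrics can be made concrete with a small walk over an instruction trace. The sketch below is only an illustration and rests on working definitions that the slides do not spell out: degree of use is taken as the number of reads of a value before its register is overwritten, age as the producer-to-consumer distance in instructions, and lifetime as the distance from the producer to the last consumer. The `Inst` record and the function name `temporal_metrics` are hypothetical.

```python
from collections import namedtuple

# Hypothetical trace record: destination register (or None) and source registers.
Inst = namedtuple("Inst", ["dest", "srcs"])

def temporal_metrics(trace):
    """Return one dict per produced value with degree of use, ages, and lifetime.

    Assumed definitions (not taken from the slides):
      degree_of_use = number of reads before the register is overwritten
      ages          = producer-to-consumer distances, in instructions
      lifetime      = distance from producer to the last consumer (0 if unused)
    """
    live = {}      # register -> record of the value it currently holds
    values = []
    for idx, inst in enumerate(trace):
        # Reads are handled before the write, so "addu r2, r2, r3" reads the old r2.
        for reg in inst.srcs:
            rec = live.get(reg)
            if rec is not None:
                rec["ages"].append(idx - rec["producer"])
        if inst.dest is not None:
            rec = {"producer": idx, "reg": inst.dest, "ages": []}
            live[inst.dest] = rec
            values.append(rec)
    for rec in values:
        rec["degree_of_use"] = len(rec["ages"])
        rec["lifetime"] = max(rec["ages"], default=0)
    return values

# Tiny hand-written trace (illustrative only, not measured data).
trace = [
    Inst("r4", ["r9"]),        # 0: load into r4
    Inst("r4", ["r4"]),        # 1: shift r4 (reads value 0, produces a new r4)
    Inst("r2", ["r4"]),        # 2: load via r4
    Inst("r3", ["r4"]),        # 3: second load via r4
    Inst("r2", ["r2", "r3"]),  # 4: add
]
metrics = temporal_metrics(trace)
```

Histogramming `degree_of_use`, `ages`, and `lifetime` over a full trace would give distributions of the kind the presentation reports per metric.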
10/40
Operand Transport Analysis
• Operand transport model
[Figure: FUs, each with its own local storage, connected through a bypass network to global storage; transport paths labeled trans_rd_global, trans_rd_bypass, and trans_wr_global]
11/40
Preliminary Results
• Operand usage properties (MediaBench average)
[Figure: stacked horizontal bars (0% to 100%) for degree of use (bins 0, 1, 2, 3, >3), age (bins 1, 2, 3~5, >5), lifetime (bins 1, 2, 3~5, 6~10, >10), and degree of functionality (bins 0, 1 (same), 1 (different), >1)]
H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.
12/40
Preliminary Results (cont.)
• Operand transport pattern (MediaBench average)
– integer → integer: 43.0%
– integer → branch: 14.9%
– integer → ld/st: 13.6%
– ld/st → integer: 13.8%
– ld/st → ld/st: 6.6%
– Others: 8.1%
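A breakdown like this one can be obtained by tallying producer and consumer unit classes over every operand transfer. The following is a minimal sketch of that bookkeeping, assuming each transfer has already been reduced to a (producer class, consumer class) pair; the function name `transport_pattern_shares` and the sample pairs are hypothetical and are not the measured data.

```python
from collections import Counter

def transport_pattern_shares(edges):
    """edges: iterable of (producer_class, consumer_class), one pair per operand transfer.

    Returns a dict mapping each pattern to its share of all transfers, in percent.
    """
    counts = Counter(edges)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: 100.0 * n / total for pattern, n in counts.items()}

# Hypothetical operand transfers (illustrative only).
edges = [
    ("integer", "integer"), ("integer", "integer"),
    ("integer", "branch"), ("ld/st", "integer"),
]
shares = transport_pattern_shares(edges)
```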
13/40
Preliminary Results (cont.)
• Effective architectural techniques on operand transport
– Storage hierarchy: local buffering
– Dedicated transport network
– Lifetime detection: compile-time/run-time
– Smart instruction steering
15/40
Motivation and Approach
• Motivation
Multimedia applications
– Operand movement is highly regular
– Most operands are short-lived, transient operands
→ Develop dynamic execution technique exploiting regular operand distribution patterns and local properties
• Approach summary
– Instruction clustering: dynamic instruction grouping
– Recognition of regular operand transport pattern
– Efficient execution unit: reduce transport latency
16/40
Related Work
• Solutions for multimedia processing
– Multimedia-specific ISA extensions
  • Exploit data-level parallelism at subword level
  • General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec
  • Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia
– Vectorization and retargeting
  • Manual assembly coding
  • Hand-optimization: in-lined assembly code, library routines
  • Automatic vectorization: compiler/retargeting technology
17/40
Related Work (cont.)
• Solutions for reducing operand transport complexity
– Communication-aware execution
  • Network-connected tile architecture: RAW, GPA
  • Transport triggered architecture: MOVE
– Resource partitioning: Clustered architectures
  • Heterogeneous: decoupled architecture
  • Commercial: DEC Alpha 21264
  • Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP
– Dynamic optimizations
  • Fill unit: reform instructions in H/W, and cache them
  • Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction
18/40
Related Research Landscape
Dynamic execution technique exploiting regular operand transport patterns in multimedia
Communication-aware execution: efficient operand transport
Resource partitioning: clever instruction steering
Dynamic optimizations: instruction grouping, small-scale dependence collapsing
Multimedia processing: independent computation
Regular pattern of dependent instructions
Steering burden off the critical path
Binary-compatibility, run-time optimization
Larger, more general instruction grouping
19/40
Research Methodology
[Flow diagram]
Application code (C source) → gcc cross-compiler → PISA binary → Instruction trace → Instruction stream
Instruction stream → Cluster formation logic → Cluster storage (cache)
Execution platform: Matched?
– N: Instruction queue → Normal execution unit
– Y: Cluster queue → Cluster execution unit
20/40
Dynamic Instruction Clustering
• Instruction Cluster
– A connected subgraph of instructions joined by local operands
– Dataflow graph → Dependence edge classification → Instruction grouping
• Dependence edge types
– External: produced/consumed by previous/next blocks
– Non-clusterable: operands from/to memory
– Local: produced and consumed within the same block
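The classification and grouping steps can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's hardware algorithm: an edge is treated as non-clusterable when its producer is a load or its consumer is a load or store, a source with no producer in the block is external, and clusters are the connected components formed by local edges (found here with union-find). The opcode sets and the function name `classify_and_cluster` are hypothetical.

```python
LOADS = {"lw", "lbu"}
STORES = {"sb", "sw"}
MEM_OPS = LOADS | STORES

def classify_and_cluster(block):
    """block: list of (opcode, dest, srcs) tuples in program order.

    Returns (edges, clusters). Each edge is (producer, consumer, kind);
    producer is None when the value is defined outside the block.
    """
    last_writer = {}                    # register -> index of latest in-block producer
    edges = []
    parent = list(range(len(block)))    # union-find forest over instructions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for idx, (op, dest, srcs) in enumerate(block):
        for reg in srcs:
            prod = last_writer.get(reg)
            if prod is None:
                kind = "external"
            elif block[prod][0] in LOADS or op in MEM_OPS:
                kind = "non-clusterable"    # assumed rule for operands from/to memory
            else:
                kind = "local"
                parent[find(idx)] = find(prod)   # join producer and consumer
            edges.append((prod, idx, kind))
        if dest is not None:
            last_writer[dest] = idx

    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(g for g in groups.values() if len(g) > 1)
    return edges, clusters

# Short MIPS-like block written for this sketch.
block = [
    ("lbu",  "r4", ["r9"]),         # 0
    ("sll",  "r4", ["r4"]),         # 1
    ("addu", "r4", ["r4", "r10"]),  # 2
    ("lw",   "r2", ["r4"]),         # 3
    ("lw",   "r3", ["r5"]),         # 4
    ("addu", "r2", ["r2", "r3"]),   # 5
    ("sra",  "r2", ["r2"]),         # 6
    ("sb",   None, ["r2", "r7"]),   # 7
]
edges, clusters = classify_and_cluster(block)
```

With this rule the shift/add pair and the add/shift pair each form a cluster, while the loads and the store stay outside. A different reading of "operands from/to memory" would move the boundaries.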
21/40
Instruction Clustering Example
• Color conversion block in JPEG encoder
0: lbu r4, 0(r9)
1: lbu r5, 1(r9)
2: lbu r6, 2(r9)
3: sll r4, r4, 0x2
4: addu r4, r4, r10
5: sll r5, r5, 0x2
6: addu r5, r5, r10
7: lw r2, 0(r4)
8: lw r3, 1024(r5)
9: sll r6, r6, r10
10: addu r6, r6, r10
11: addu r2, r2, r3
12: lw r3, 2048(r6)
13: addu r7, r25, r8
14: addu r2, r2, r3
15: sra r2, r2, 0x10
16: sb r2, 0(r7)
17: lw r2, 3072(r4)
18: lw r3, 4096(r5)
19: addu r2, r2, r3
20: lw r3, 5120(r6)
21: addu r7, r15, r8
22: addu r2, r2, r3
23: sra r2, r2, 0x10
24: sb r2, 0(r7)
25: lw r2, 5120(r4)
26: lw r3, 6144(r5)
27: addiu r9, r9, 3
28: addu r2, r2, r3
29: lw r3, 7168(r6)
30: addu r7, r12, r8
31: addiu r8, r8, 1
32: addu r2, r2, r3
33: sra r2, r2, 0x10
34: sb r2, 0(r7)
35: sltu r2, r8, r16
36: bne r2, r0, 0x412188
[Figure: dataflow graph of instructions 0 to 36, shown in three steps: the plain graph, the graph with edges classified as External, Local, or Non-clusterable, and the resulting instruction clusters]
23/40
Implementation Example - I
• Raw cluster execution on inter-ALU network
– Focus on intermediate, short-lived operands
  • Local operands: inter-ALU dedicated bypass network
  • Others: traditional global bypass network
– Organization
  • Instruction cluster formation
  • Cluster queue and scheduling
  • Cluster execution: inter-ALU network
H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of 7th International Symposium on Multimedia, December 2005.
24/40
Cluster Queue and Scheduling
[Figure: a conventional instruction queue holding I0 to I3 between head and tail, compared with a cluster queue in which each entry holds a whole cluster (C0: I0 to I2, C1: I0 to I3, C2: I0 to I1); the cluster queue has a width, a depth, and an issue pointer per entry]
• Organization of cluster queue
– Single entry per cluster (2D)
– Ready flags for local operands are always set
– Issue pointer for each entry, in-order issue
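The queue organization can be modeled in software as follows. This is a simplified sketch, not the presented hardware: each entry holds one cluster and its own issue pointer, instructions inside an entry issue strictly in order, only external operands are checked for readiness (local ones are treated as always ready), and at most one instruction per entry issues per cycle, which is an extra assumption made here. The names `ClusterEntry` and `issue_cycle` are hypothetical.

```python
from collections import deque

class ClusterEntry:
    """One cluster-queue entry: a whole cluster plus its own issue pointer."""
    def __init__(self, name, insts):
        self.name = name
        self.insts = insts      # list of (inst_name, external_source_registers)
        self.issue_ptr = 0      # next instruction to issue, in order

    def done(self):
        return self.issue_ptr == len(self.insts)

def issue_cycle(queue, ready_regs, width):
    """Issue up to `width` instructions this cycle, at most one per entry."""
    issued = []
    for entry in queue:
        if len(issued) == width:
            break
        if entry.done():
            continue
        inst, ext_srcs = entry.insts[entry.issue_ptr]
        # Local operands are assumed ready, so only external sources are checked.
        if all(reg in ready_regs for reg in ext_srcs):
            issued.append((entry.name, inst))
            entry.issue_ptr += 1
    # Retire fully issued clusters from the head of the queue.
    while queue and queue[0].done():
        queue.popleft()
    return issued

queue = deque([
    ClusterEntry("C0", [("I0", ["r9"]), ("I1", []), ("I2", ["r10"])]),
    ClusterEntry("C1", [("I0", ["r5"]), ("I1", [])]),
])
cycle1 = issue_cycle(queue, {"r9", "r5"}, width=2)
cycle2 = issue_cycle(queue, {"r9", "r5"}, width=2)
cycle3 = issue_cycle(queue, {"r9", "r5"}, width=2)          # C0:I2 waits for r10
cycle4 = issue_cycle(queue, {"r9", "r5", "r10"}, width=2)
```

Because readiness is tracked only for external operands, the wakeup logic per entry looks at a single instruction (the one under the issue pointer) rather than at every queued instruction.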
25/40
Cluster Execution Unit
• Cluster mapping on inter-ALU network
– Local operands: dedicated bypass network
– Others: traditional global bypass network
[Figure: an instruction cluster (I0 to I6) drawn along an instruction depth axis (0 to 4) and its mapping onto a 4x4 network ALU with rows 0 to 3 and columns 0 to 3; the grid shows I0, I1, I6 in one row, then I2, I4, then I3, then I5]
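One simple way to place a cluster on a grid of ALUs is to levelize it: compute each instruction's depth along local edges and use the depth as the row. The sketch below illustrates that policy only; it is an assumption and is not claimed to reproduce the mapping in the figure, whose edges did not survive extraction. The function name `map_cluster_to_grid` and the example cluster are hypothetical.

```python
def map_cluster_to_grid(deps, rows=4, cols=4):
    """deps: dict inst -> list of in-cluster producers (local edges only).

    Returns dict inst -> (row, col), or None if the cluster does not fit.
    Row = instruction depth (longest chain of local producers), an assumed policy.
    """
    depth = {}

    def depth_of(inst):
        if inst not in depth:
            depth[inst] = 1 + max((depth_of(p) for p in deps[inst]), default=-1)
        return depth[inst]

    next_col = [0] * rows
    placement = {}
    for inst in deps:                   # dicts keep insertion (program) order
        row = depth_of(inst)
        if row >= rows or next_col[row] >= cols:
            return None                 # too deep or too wide for the network ALU
        placement[inst] = (row, next_col[row])
        next_col[row] += 1
    return placement

# Hypothetical six-instruction cluster.
deps = {
    "A": [], "B": [],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["B"],
    "F": ["D", "E"],
}
placement = map_cluster_to_grid(deps)

# A five-deep chain does not fit in four rows under this policy.
chain = {"n0": [], "n1": ["n0"], "n2": ["n1"], "n3": ["n2"], "n4": ["n3"]}
too_deep = map_cluster_to_grid(chain)
```

With producers always in an earlier row than their consumers, a local operand only ever travels downward through the dedicated inter-ALU paths.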
26/40
Experimental Setup
• Simulation Environment
– SimpleScalar sim-outorder simulator
– MediaBench application programs
• Processor Configurations
– Queues
  • 8-way: 24 instruction queue, 8 cluster queue, 16 load/store queue
  • 16-way: 48 instruction queue, 16 cluster queue, 32 load/store queue
– FU resources
  • 8-way: 4 integer ALUs, 1 (4x4) network ALU, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
  • 16-way: 8 integer ALUs, 2 (4x4) network ALUs, 2 integer MULs, 2 floating ALUs, 1 floating MUL, 2 memory ports
– Operand bypass (latency)
  • 8-way: Local (0), pass-through (1), Global (1)
  • 16-way: Local (0), pass-through (1), Global (max 3)
27/40
Experimental Result
• Dynamic instruction coverage
[Figure: clustered inst. / total committed inst. (0% to 80%) for cjpeg, djpeg, epic, epicun, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average, with cluster storage of 32, 64, 128, 256, 512, and 1K entries]
28/40
Experimental Result (cont.)
• Operand transport types
[Figure: average dependence edges per inst. (0.0 to 1.4), split into global, pass-through, and local transport, for cjpeg, djpeg, epic, epicun, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average, under the 8-way and 16-way configurations; annotated shares: 29.5%, 11.0%, 59.5% and 31.5%, 10.6%, 57.8%]
29/40
Experimental Result (cont.)
• IPC speedup
[Figure: IPC speedup (0% to 80%) for cjpeg, djpeg, epic, epicun, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average, under the 8-way and 16-way configurations]
30/40
Summary
• Summary of approach
– Dynamically group dependent instructions into clusters
– Store regular operand transport patterns
– Execute them on inter-ALU network where intermediate values are propagated among ALUs without using global buses
• Summary of results (MediaBench average)
– Dynamic instruction coverage: 57.3% @ 256-entry cluster cache
– Shortest transport rate: 30% (8-way), 32% (16-way)
– IPC speedup: 16.2% (8-way), 35.2% (16-way)
32/40
Implementation Example - II
• Data parallel execution using dynamic SIMDization
– Observation (Image processing applications)
  • Operand movement within a loop iteration is highly regular
  • A small number of inner loops covers most of the execution time
– Focus on regular operand transport pattern between iterations of innermost loop
  • Stride prediction: break loop-carried dependences → data-parallel execution
  • Operand lifetime detection → operand traffic control
– Organization
  • Instruction cluster formation
  • SIMD instruction queue and scheduling
  • SIMD PE array
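Stride prediction is what turns a loop-carried register, such as an induction variable, into something each iteration can compute independently. Below is a minimal sketch of a last-value-plus-stride predictor, assuming a register counts as predictable once the same stride has been seen twice in a row; that confidence rule, the class name `StridePredictor`, and the sample values are assumptions made for illustration.

```python
class StridePredictor:
    """Per-register last value and stride, with a simple two-in-a-row confidence rule."""

    def __init__(self):
        self.last = {}          # register -> last observed value
        self.stride = {}        # register -> last observed stride
        self.confident = set()  # registers whose stride repeated

    def observe(self, reg, value):
        """Record the value a register holds at the start of a loop iteration."""
        if reg in self.last:
            new_stride = value - self.last[reg]
            if self.stride.get(reg) == new_stride:
                self.confident.add(reg)
            else:
                self.confident.discard(reg)
            self.stride[reg] = new_stride
        self.last[reg] = value

    def predict(self, reg, lanes):
        """Start-of-iteration values for the next `lanes` iterations, or None."""
        if reg not in self.confident:
            return None
        base, step = self.last[reg], self.stride[reg]
        return [base + step * (i + 1) for i in range(lanes)]

sp = StridePredictor()
for v in (100, 103, 106):       # e.g. a pointer advanced by 3 per iteration
    sp.observe("r9", v)
for v in (5, 9, 10):            # irregular updates: no stable stride
    sp.observe("r7", v)

r9_lanes = sp.predict("r9", lanes=4)
r7_lanes = sp.predict("r7", lanes=4)
```

Once the next four values of the register are known up front, four iterations no longer have to wait on one another for it and can run side by side, one per processing element.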
33/40
Dynamic Instruction Clustering
• External dependence edge types
– External-input: serving only as input
– External-output: serving only as output
– External-updated: serving as both input and output
• Parallel and non-parallel region detection
– p-cluster: producing no external-updated output and not having unpredicted external-updated input
– np-cluster: any other cluster
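The p-cluster rule reads directly as a predicate over a cluster's external-updated registers. The sketch below encodes it as stated, assuming each cluster is summarized by the external-updated registers it reads and writes and that a set of stride-predicted registers is available; the function name `classify_cluster` and the register sets in the example are hypothetical.

```python
def classify_cluster(updated_inputs, updated_outputs, predicted):
    """Return "p" or "np" for one cluster.

    updated_inputs / updated_outputs: external-updated registers the cluster
    reads / writes; predicted: external-updated registers with a known stride.
    """
    if updated_outputs:
        return "np"     # produces an external-updated output
    if any(reg not in predicted for reg in updated_inputs):
        return "np"     # depends on an unpredicted loop-carried value
    return "p"

predicted = {"r8"}      # hypothetical: only r8 has a stable stride
reads_predicted = classify_cluster({"r8"}, set(), predicted)
writes_updated = classify_cluster({"r8"}, {"r8"}, predicted)
reads_unpredicted = classify_cluster({"r9"}, set(), predicted)
no_updated_regs = classify_cluster(set(), set(), predicted)
```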
34/40
Instruction Clustering Example
• Image convolution code in TI’s IMGLIB
[Figure: dataflow graph of instructions 0 to 21 grouped into instruction clusters IC0 to IC5, with registers labeling the external edges]
external-input = {r10, r11, r13, r15}
external-output = {r2, r3, r4, r5, r6, r7}
external-updated = {r8, r9}
p-clusters = {IC0, IC1, IC2, IC3}
np-clusters = {IC4, IC5}
35/40
SIMD Execution Unit
• Cluster scheduling on SIMD PE array
[Figure: cluster schedules on PE0 to PE3 over time t: (a) p-cluster scheduling, (b) np-cluster scheduling; labels include 8[0:3] and 13[0:3]]
36/40
SIMD Execution Unit (cont.)
• Operand transport model
[Figure: a conventional ILP processor (scalar resources, P_ILP) alongside the SIMD PE array (P_SIMD), with operand paths labeled external-input, external-output, local, and external-updated]
37/40
Summary of Approach
• Dynamic parallelization
– Detect regular operand transport pattern on external-updated operands
– Compute stride → predict external-updated values
• Optimizing operand transport
– Identify the lifetime of operands
– Remove needless communication → localize transport
• Execute the clusters on 1-D mesh SIMD PE array
39/40
Summary
• Characterization and modeling of operand
– Examine the operand usage properties
– Explore the impact of architectural techniques on the operand transport
• Development of a dynamic execution technique
– Instruction clustering
– Recognition of regular operand transport pattern
– Efficient execution unit
40/40
Thank you. Any questions?