
Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

ECE7102 Class Presentation

Date: 2006. 4. 13

Hongkyu [email protected]

2/40

Overview

• Introduction

• Characterization and modeling of operand usage and transport

• Dynamic execution technique exploiting regular operand transport patterns in multimedia
– Instruction cluster mapping on the inter-ALU network for general-purpose domain
– Dynamic SIMDization for application-specific domain

• Summary

3/40

Interconnect Complexity

[Figure: functional units (FU) and storage blocks joined by an interconnect, shown for a small design (two FUs, two storage blocks) and a large design (eight FUs, six storage blocks)]

• Exponential increase of chip capacity → More devices

• Exponential decrease of feature size → Interconnect limitation

J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp.28-35, May/June 2003.

4/40

Interconnect Bottleneck

ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.

[Figure: relative delay (log scale, 0.1 to 100) versus process technology node (nm), with scaling trends labeled α and 1/α²]

• Disparity between wire delay and gate delay

5/40

Problem Statement

• High-performance interconnect
– Interconnect organizations
– Interconnect technologies

• Why are architectural responses limited?
– Compatibility with old ISAs
  • Sequentially-specified operations
  • Restricted register file-based operand namespace
– ILP mechanisms
  • Operand bypass network, register renaming, and instruction scheduling
  • Poorly scaling broadcast buses

6/40

Research Objective and Approach

• Objective: Reduce latency of operand transport for multimedia
– Development of dynamic execution techniques
– Development of low-cost operand bypass networks

• Approach summary (stages: background work, general approach, application-specific approach)
– Analysis of operands
  • Examine operand usage properties
  • Explore the impact of architectural techniques on the operand transport
– Dynamic execution technique
  • Instruction clustering
  • Recognition of regular operand transport patterns
  • Efficient execution unit
– Cluster mapping on inter-ALU network
  • Basic instruction clustering
  • Raw cluster mapping
  • Local operand mapping on dedicated inter-ALU path
– Optimizing operand transport for multimedia systems
  • Regular pattern recognition
  • Cluster reorganization
  • Function remapping
  • Dynamic SIMDization
– Technology model-based evaluation on target platforms
  • GENESYS
  • SimpleScalar


8/40

Motivation and Approach

• Motivation
– Shift of microarchitectural design focus: Operand computation → Operand communication
– Recognizing and understanding operand usage and transport properties → Efficiently controlling operand traffic

• Approach summary
– Operand usage characteristics
  • How often operands are used → Examine temporal property
  • Where operands are used → Examine spatial property
– Operand transport properties
  • What accounts for the majority of communication needs → Explore the impact of architectural techniques on the operand transport

9/40

Operand Usage Analysis

• General terms
– Operands: values in registers, memory locations, or memory addresses
– Operand transport: buffering and delivery of operands to FUs

• Operands’ temporal characteristics
– Which instructions consume operands after they are produced
– Metrics: Degree of use, Age, Lifetime

• Operands’ spatial characteristics
– From/to which FU operands are moved in the execution model
– Metrics: Degree of functionality, Transport pattern
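
The temporal metrics can be illustrated with a small trace walker. This is only a sketch: the slides name the metrics but do not define them, so the definitions used here are assumptions (degree of use = number of reads of a value before its register is overwritten, age = dynamic-instruction distance from producer to first consumer, lifetime = distance from producer to last consumer). The trace format and the function name `operand_metrics` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Value:
    """One dynamically produced register value."""
    producer: int                      # index of the producing instruction
    reads: list = field(default_factory=list)  # indices of consuming instructions


def operand_metrics(trace):
    """trace: list of (dest_reg or None, [src_regs]) in dynamic order.

    Assumed definitions (not given in the slides):
      degree_of_use = number of reads before the register is overwritten
      age           = first consumer index - producer index
      lifetime      = last consumer index - producer index
    """
    live = {}       # register -> Value currently held in it
    finished = []   # values whose register has been overwritten
    for idx, (dest, srcs) in enumerate(trace):
        for reg in srcs:
            if reg in live:
                live[reg].reads.append(idx)
        if dest is not None:
            if dest in live:
                finished.append(live[dest])
            live[dest] = Value(producer=idx)
    finished.extend(live.values())

    metrics = []
    for v in sorted(finished, key=lambda v: v.producer):
        metrics.append({
            "producer": v.producer,
            "degree_of_use": len(v.reads),
            "age": v.reads[0] - v.producer if v.reads else None,
            "lifetime": v.reads[-1] - v.producer if v.reads else None,
        })
    return metrics


# A short dependent chain: lbu r4 / sll r4 / addu r4 / lw r2
trace = [("r4", ["r9"]), ("r4", ["r4"]), ("r4", ["r4", "r10"]), ("r2", ["r4"])]
metrics = operand_metrics(trace)
```

In this chain every intermediate value is read exactly once by the very next instruction, which is the kind of short-lived operand the later slides target.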

10/40

Operand Transport Analysis

• Operand transport model

[Figure: FUs, each with its own local storage, connected through a bypass network to global storage; transport paths labeled trans_rd_global, trans_rd_bypass, and trans_wr_global]

11/40

Preliminary Results

• Operand usage properties (MediaBench average)

[Figure: stacked bars (0% to 100%) for four metrics, with the following bins]
– Degree of use: 0, 1, 2, 3, >3
– Age: 1, 2, 3~5, >5
– Lifetime: 1, 2, 3~5, 6~10, >10
– Degree of functionality: 0, 1 (same), 1 (different), >1

H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.

12/40

Preliminary Results (cont.)

• Operand transport pattern (MediaBench average)
– integer → integer: 43.0%
– integer → branch: 14.9%
– integer → ld/st: 13.6%
– ld/st → integer: 13.8%
– ld/st → ld/st: 6.6%
– Others: 8.1%
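
A transport pattern is a (producer FU class, consumer FU class) pair. The sketch below shows one way such pairs could be tallied from a register trace. The opcode-to-class table and the trace format are assumptions made for illustration, and the function name `transport_patterns` is hypothetical; it is not the measurement tool used for the numbers above.

```python
from collections import Counter

# Assumed mapping from opcode to FU class (illustrative subset only).
FU_CLASS = {
    "lbu": "ld/st", "lw": "ld/st", "sb": "ld/st",
    "sll": "integer", "sra": "integer", "addu": "integer",
    "addiu": "integer", "sltu": "integer",
    "bne": "branch",
}


def transport_patterns(trace):
    """trace: list of (opcode, dest_reg or None, [src_regs]) in dynamic order.

    Returns a Counter keyed by (producer class, consumer class).
    Operands with no producer inside the trace are ignored.
    """
    last_writer = {}   # register -> opcode of the most recent producer
    counts = Counter()
    for opcode, dest, srcs in trace:
        for reg in srcs:
            if reg in last_writer:
                pair = (FU_CLASS[last_writer[reg]], FU_CLASS[opcode])
                counts[pair] += 1
        if dest is not None:
            last_writer[dest] = opcode
    return counts


trace = [
    ("lbu", "r4", ["r9"]),
    ("sll", "r4", ["r4"]),
    ("addu", "r4", ["r4", "r10"]),
    ("lw", "r2", ["r4"]),
]
counts = transport_patterns(trace)
```

Dividing each count by the total number of edges gives shares comparable in form to the percentages listed above.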

13/40

Preliminary Results (cont.)

• Effective architectural techniques on operand transport
– Storage hierarchy: local buffering
– Dedicated transport network
– Lifetime detection: compile-time/run-time
– Smart instruction steering


15/40

Motivation and Approach

• Motivation
– Multimedia applications
  • Operand movement is highly regular
  • Most operands are short-lived, transient operands
– → Develop a dynamic execution technique exploiting regular operand distribution patterns and local properties

• Approach summary
– Instruction clustering: dynamic instruction grouping
– Recognition of regular operand transport pattern
– Efficient execution unit: reduce transport latency

16/40

Related Work

• Solutions for multimedia processing
– Multimedia-specific ISA extensions
  • Exploit data-level parallelism at subword level
  • General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec
  • Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia
– Vectorization and retargeting
  • Manual assembly coding
  • Hand-optimization: in-lined assembly code, library routines
  • Automatic vectorization: compiler/retargeting technology

17/40

Related Work (cont.)

• Solutions for reducing operand transport complexity
– Communication-aware execution
  • Network-connected tile architecture: RAW, GPA
  • Transport triggered architecture: MOVE
– Resource partitioning: Clustered architectures
  • Heterogeneous: decoupled architecture
  • Commercial: DEC Alpha 21264
  • Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP
– Dynamic optimizations
  • Fill unit: reform instructions in H/W, and cache them
  • Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction

18/40

Related Research Landscape

[Diagram: the proposed work, "Dynamic execution technique exploiting regular operand transport patterns in multimedia", at the center, linked to four related areas]

• Related areas
– Communication-aware execution: efficient operand transport
– Resource partitioning: clever instruction steering
– Dynamic optimizations: instruction grouping, small-scale dependence collapsing
– Multimedia processing: independent computation

• Labels on the links to the proposed work
– Regular pattern of dependent instructions
– Steering burden off the critical path
– Binary-compatibility, run-time optimization
– Larger, more general instruction grouping

19/40

Research Methodology

[Flow diagram]

• Trace generation: Application code (C source) → gcc cross-compiler → PISA binary → Instruction trace

• Execution platform: Instruction stream → Cluster formation logic → Cluster storage (cache) → Matched?
– N: Instruction queue → Normal execution unit
– Y: Cluster queue → Cluster execution unit

20/40

Dynamic Instruction Clustering

• Instruction cluster
– A connected subgraph of instructions joined by local operands
– Dataflow graph → Dependence edge classification → Instruction grouping

• Dependence edge types
– External: produced/consumed by previous/next blocks
– Non-clusterable: operands from/to memory
– Local: produced and consumed within the same block
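
The edge classification and grouping steps can be sketched in software as follows. This is a minimal illustration, not the hardware cluster formation logic: it assumes that any edge whose producer or consumer is a load or store counts as non-clusterable, that a source with no producer in the block is external, and that clusters are the connected components over local edges. The block format and function names are hypothetical.

```python
MEMORY_OPS = {"lbu", "lw", "sb", "sw"}  # assumed set of load/store opcodes


def classify_edges(block):
    """block: list of (opcode, dest_reg or None, [src_regs]) for one basic block.

    Returns a list of (producer_index or None, consumer_index, kind).
    """
    last_writer = {}
    edges = []
    for i, (op, dest, srcs) in enumerate(block):
        for reg in srcs:
            p = last_writer.get(reg)
            if p is None:
                kind = "external"            # produced by a previous block
            elif block[p][0] in MEMORY_OPS or op in MEMORY_OPS:
                kind = "non-clusterable"     # operand from/to memory (assumption)
            else:
                kind = "local"
            edges.append((p, i, kind))
        if dest is not None:
            last_writer[dest] = i
    return edges


def form_clusters(block):
    """Group instructions connected by local edges (union-find)."""
    parent = list(range(len(block)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for p, c, kind in classify_edges(block):
        if kind == "local":
            parent[find(p)] = find(c)

    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    # Single instructions are not worth treating as a cluster.
    return sorted(g for g in groups.values() if len(g) > 1)


block = [
    ("lbu", "r4", ["r9"]),
    ("sll", "r4", ["r4"]),
    ("addu", "r4", ["r4", "r10"]),
    ("lw", "r2", ["r4"]),
]
edges = classify_edges(block)
clusters = form_clusters(block)
```

With these assumptions the `sll`/`addu` pair forms one cluster, while the load feeding it and the load consuming its result stay outside.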

21/40

Instruction Clustering Example

• Color conversion block in JPEG encoder

 0: lbu r4, 0(r9)
 1: lbu r5, 1(r9)
 2: lbu r6, 2(r9)
 3: sll r4, r4, 0x2
 4: addu r4, r4, r10
 5: sll r5, r5, 0x2
 6: addu r5, r5, r10
 7: lw r2, 0(r4)
 8: lw r3, 1024(r5)
 9: sll r6, r6, r10
10: addu r6, r6, r10
11: addu r2, r2, r3
12: lw r3, 2048(r6)
13: addu r7, r25, r8
14: addu r2, r2, r3
15: sra r2, r2, 0x10
16: sb r2, 0(r7)
17: lw r2, 3072(r4)
18: lw r3, 4096(r5)
19: addu r2, r2, r3
20: lw r3, 5120(r6)
21: addu r7, r15, r8
22: addu r2, r2, r3
23: sra r2, r2, 0x10
24: sb r2, 0(r7)
25: lw r2, 5120(r4)
26: lw r3, 6144(r5)
27: addiu r9, r9, 3
28: addu r2, r2, r3
29: lw r3, 7168(r6)
30: addu r7, r12, r8
31: addiu r8, r8, 1
32: addu r2, r2, r3
33: sra r2, r2, 0x10
34: sb r2, 0(r7)
35: sltu r2, r8, r16
36: bne r2, r0, 0x412188

[Figure: dataflow graph of instructions 0 to 36, shown in three steps: the raw graph, the graph with edges classified as External, Local, or Non-clusterable, and the graph with the resulting instruction clusters outlined]


23/40

Implementation Example - I

• Raw cluster execution on inter-ALU network
– Focus on intermediate, short-lived operands
  • Local operands: inter-ALU dedicated bypass network
  • Others: traditional global bypass network
– Organization
  • Instruction cluster formation
  • Cluster queue and scheduling
  • Cluster execution: inter-ALU network

H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of 7th International Symposium on Multimedia, December 2005.

24/40

Cluster Queue and Scheduling

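[Figure: a conventional instruction queue holds one instruction per entry (I0 to I3, head to tail). The cluster queue holds one cluster per entry along its width (C0, C1, C2, head to tail) and that cluster's instructions along its depth (C0:I0 to C0:I2, C1:I0 to C1:I3, C2:I0 to C2:I1), with an issue pointer per entry]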

• Organization of cluster queue
– Single entry per cluster (2D)
– Ready flags for local operands are always set
– Issue pointer for each entry, in-order issue
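
The queue organization above can be modeled with a short sketch. It assumes one instruction issues per cluster per cycle and that an instruction waits only on its external source registers, since local operands are treated as always ready. The class and method names are hypothetical and do not come from the slides.

```python
from collections import deque


class ClusterEntry:
    """One cluster-queue entry: a whole cluster plus its issue pointer."""

    def __init__(self, name, external_srcs):
        # external_srcs[i] = set of external registers instruction i waits on.
        # Local operands are not listed: their ready flags are always set.
        self.name = name
        self.external_srcs = external_srcs
        self.issue_ptr = 0

    def done(self):
        return self.issue_ptr == len(self.external_srcs)

    def try_issue(self, ready_regs):
        """In-order issue: only the instruction at the issue pointer may go."""
        if self.done():
            return None
        if self.external_srcs[self.issue_ptr] <= ready_regs:
            issued = self.issue_ptr
            self.issue_ptr += 1
            return issued
        return None


class ClusterQueue:
    def __init__(self, width):
        self.width = width          # maximum number of clusters held
        self.entries = deque()

    def insert(self, entry):
        if len(self.entries) >= self.width:
            return False            # queue full, front end must stall
        self.entries.append(entry)
        return True

    def cycle(self, ready_regs):
        """Issue at most one instruction per cluster (assumption)."""
        issued = []
        for entry in self.entries:
            idx = entry.try_issue(ready_regs)
            if idx is not None:
                issued.append((entry.name, idx))
        while self.entries and self.entries[0].done():
            self.entries.popleft()  # retire finished clusters from the head
        return issued


queue = ClusterQueue(width=2)
ok0 = queue.insert(ClusterEntry("C0", [{"r9"}, set()]))
ok1 = queue.insert(ClusterEntry("C1", [{"r5"}]))
ok2 = queue.insert(ClusterEntry("C2", [set()]))   # rejected: queue is full
first = queue.cycle({"r9"})
second = queue.cycle({"r9"})
third = queue.cycle({"r5"})
```

Because only the head instruction of each cluster is examined, wakeup logic is needed per cluster rather than per instruction, which is the point of keeping a single entry per cluster.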

25/40

Cluster Execution Unit

• Cluster mapping on inter-ALU network
– Local operands: dedicated bypass network
– Others: traditional global bypass network

[Figure: an instruction cluster (I0 to I6) drawn as a dependence graph, then ordered by instruction depth (0 to 4), then mapped onto a network ALU with rows 0 to 3 and columns 0 to 3; the placement lists I0, I1, I6, then I2, I4, then I3, then I5]
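
One plausible way to place a cluster on a 4x4 network ALU is to put each instruction in the row given by its dependence depth. The sketch below assumes that policy, assumes instructions are numbered in program order so producers precede consumers, and treats an edge that skips a row as needing a pass-through. The example cluster is invented, and the placement rule may differ from the one used in the actual design.

```python
def map_cluster(n_instr, local_edges, rows=4, cols=4):
    """Place instructions on a rows x cols ALU grid by dependence depth.

    local_edges: list of (producer, consumer) with producer < consumer.
    Returns (grid, pass_through_edges), or None if the cluster does not fit.
    """
    depth = [0] * n_instr
    # Visiting edges by consumer index guarantees each producer's depth is final.
    for p, c in sorted(local_edges, key=lambda e: e[1]):
        depth[c] = max(depth[c], depth[p] + 1)

    grid = [[None] * cols for _ in range(rows)]
    for i in range(n_instr):
        r = depth[i]
        if r >= rows or None not in grid[r]:
            return None                      # too deep or row already full
        grid[r][grid[r].index(None)] = i

    # Edges spanning more than one row must pass through intermediate ALUs.
    pass_through = [(p, c) for p, c in local_edges if depth[c] - depth[p] > 1]
    return grid, pass_through


# Hypothetical six-instruction cluster.
edges = [(0, 2), (1, 2), (2, 3), (1, 4), (3, 5), (4, 5)]
grid, pass_through = map_cluster(6, edges)
too_deep = map_cluster(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
```

A chain of five dependent instructions is rejected here because it needs five rows, which shows why cluster formation has to respect the grid dimensions.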

26/40

Experimental Setup

• Simulation environment
– SimpleScalar sim-outorder simulator
– MediaBench application programs

• Processor configurations

                         | 8-way                  | 16-way
Queues                   | 24 instruction queue   | 48 instruction queue
                         | 8 cluster queue        | 16 cluster queue
                         | 16 load/store queue    | 32 load/store queue
FU resources             | 4 integer ALUs         | 8 integer ALUs
                         | 1 (4x4) network ALU    | 2 (4x4) network ALUs
                         | 2 integer MULs         | 2 integer MULs
                         | 2 floating ALUs        | 2 floating ALUs
                         | 1 floating MUL         | 1 floating MUL
                         | 2 memory ports         | 2 memory ports
Operand bypass (latency) | Local (0)              | Local (0)
                         | Pass-through (1)       | Pass-through (1)
                         | Global (1)             | Global (max 3)

27/40

Experimental Result

• Dynamic instruction coverage

[Figure: clustered instructions / total committed instructions (0% to 80%) for cjpeg, djpeg, epic, unepic, g721decode, g721encode, mpeg2decode, mpeg2encode, rawcaudio, rawdaudio, and the average, with cluster cache sizes of 32, 64, 128, 256, 512, and 1K entries]

28/40

Experimental Result (cont.)

• Operand transport types

[Figure: average dependence edges per instruction (0.0 to 1.4), split into global, pass-through, and local transport, for each MediaBench program and the average, on the 8-way and 16-way configurations. Shares annotated on the average bars: 29.5%, 11.0%, 59.5% (8-way) and 31.5%, 10.6%, 57.8% (16-way)]

29/40

Experimental Result (cont.)

• IPC speedup

[Figure: IPC speedup (0% to 80%) for each MediaBench program and the average, on the 8-way and 16-way configurations]

30/40

Summary

• Summary of approach
– Dynamically group dependent instructions into clusters
– Store regular operand transport patterns
– Execute them on the inter-ALU network, where intermediate values are propagated among ALUs without using global buses

• Summary of results (MediaBench average)
– Dynamic instruction coverage: 57.3% @ 256-entry cluster cache
– Shortest transport rate: 30% (8-way), 32% (16-way)
– IPC speedup: 16.2% (8-way), 35.2% (16-way)


32/40

Implementation Example - II

• Data parallel execution using dynamic SIMDization
– Observation (image processing applications)
  • Operand movement within a loop iteration is highly regular
  • A small number of inner loops covers most of the execution time
– Focus on regular operand transport pattern between iterations of the innermost loop
  • Stride prediction: break loop-carried dependences → data-parallel execution
  • Operand lifetime detection → operand traffic control
– Organization
  • Instruction cluster formation
  • SIMD instruction queue and scheduling
  • SIMD PE array

33/40

Dynamic Instruction Clustering

• External dependence edge types
– External-input: serving only as input
– External-output: serving only as output
– External-updated: serving as both input and output

• Parallel and non-parallel region detection
– p-cluster: producing no external-updated output and not having unpredicted external-updated input
– np-cluster
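
The two rules above translate directly into set operations. The sketch below assumes each loop body and each cluster is summarized by the registers it reads from outside and the registers it writes for outside use; the function names and the two sample clusters are hypothetical, while the register sets for the loop body follow the image convolution example in this deck.

```python
def classify_external(loop_reads, loop_writes):
    """Split a loop body's external registers into the three edge types.

    loop_reads:  registers read before being written in an iteration
    loop_writes: registers written in an iteration
    """
    external_input = loop_reads - loop_writes
    external_output = loop_writes - loop_reads
    external_updated = loop_reads & loop_writes
    return external_input, external_output, external_updated


def is_p_cluster(cluster_inputs, cluster_outputs, external_updated, predicted):
    """p-cluster: no external-updated output, no unpredicted external-updated input."""
    if cluster_outputs & external_updated:
        return False
    unpredicted_inputs = (cluster_inputs & external_updated) - predicted
    return not unpredicted_inputs


reads = {"r8", "r9", "r10", "r11", "r13", "r15"}
writes = {"r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"}
ext_in, ext_out, ext_upd = classify_external(reads, writes)

# Hypothetical clusters: one computes a result from r8, one updates r9.
compute_ok = is_p_cluster({"r8", "r15"}, {"r2"}, ext_upd, predicted={"r8", "r9"})
compute_blocked = is_p_cluster({"r8", "r15"}, {"r2"}, ext_upd, predicted=set())
update = is_p_cluster({"r9"}, {"r9"}, ext_upd, predicted={"r8", "r9"})
```

A cluster that only reads an external-updated register becomes parallelizable once that register's value is predicted, whereas a cluster that produces such a register stays non-parallel regardless.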

34/40

Instruction Clustering Example

• Image convolution code in TI’s IMGLIB

[Figure: dataflow graph of instructions 0 to 21 grouped into instruction clusters IC0 to IC5, with registers r2 to r11, r13, and r15 on the external edges]

external-input = {r10, r11, r13, r15}
external-output = {r2, r3, r4, r5, r6, r7}
external-updated = {r8, r9}

p-clusters = {IC0, IC1, IC2, IC3}
np-clusters = {IC4, IC5}

35/40

SIMD Execution Unit

• Cluster scheduling on SIMD PE array

[Figure: schedules over time t on PE0 to PE3. (a) p-cluster scheduling: instances of instructions 2, 3, and 16 for iterations 0 to 3, one iteration per PE. (b) np-cluster scheduling: 8[0:3] and 13[0:3], followed by instances of instructions 20 and 21 for iterations 0 to 3]

36/40

SIMD Execution Unit (cont.)

• Operand transport model

[Figure: a conventional ILP processor (scalar resources, P_ILP) coupled with a SIMD PE array (P_SIMD); the operand classes shown on the paths are external-input, external-output, local, and external-updated]

37/40

Summary of Approach

• Dynamic parallelization
– Detect regular operand transport pattern on external-updated operands
– Compute stride → predict external-updated values

• Optimizing operand transport
– Identify the lifetime of operands
– Remove needless communication → localize transport

• Execute the clusters on 1-D mesh SIMD PE array
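
Stride prediction for an external-updated register can be sketched as follows. The confirmation rule (the last two deltas must agree) and the function names are assumptions chosen for illustration; the slides do not specify how many observations the hardware requires.

```python
def detect_stride(history, min_repeats=2):
    """Return the stride if the last `min_repeats` deltas agree, else None.

    history: values of one external-updated register observed at the start
    of successive loop iterations.
    """
    if len(history) < min_repeats + 1:
        return None
    deltas = [b - a for a, b in zip(history, history[1:])]
    recent = deltas[-min_repeats:]
    if all(d == recent[0] for d in recent):
        return recent[0]
    return None


def predict_for_pes(history, num_pes=4):
    """Predict the register value for the next `num_pes` iterations.

    With these values available up front, each PE can start its own
    iteration without waiting for the loop-carried update.
    """
    stride = detect_stride(history)
    if stride is None:
        return None
    return [history[-1] + stride * (k + 1) for k in range(num_pes)]


# A pointer advanced by 3 each iteration, as an `addiu r9, r9, 3` would do.
pointer_values = predict_for_pes([100, 103, 106])
irregular = predict_for_pes([1, 2, 4])
```

When no stable stride is found the function returns None, which corresponds to leaving the affected clusters as np-clusters.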


39/40

Summary

• Characterization and modeling of operand usage and transport
– Examine the operand usage properties
– Explore the impact of architectural techniques on the operand transport

• Development of a dynamic execution technique
– Instruction clustering
– Recognition of regular operand transport pattern
– Efficient execution unit

40/40

Thank you. Any questions?
