Embedded Computer Architecture TU/e 5KK73 Henk Corporaal Bart Mesman Exploiting ILP VLIW architectures

Upload: stephanie-randall

Post on 18-Jan-2016


TRANSCRIPT

Page 1: Embedded Computer Architecture TU/e 5KK73 Henk Corporaal Bart Mesman Exploiting ILP VLIW architectures

Embedded Computer Architecture

TU/e 5KK73

Henk Corporaal

Bart Mesman

Exploiting ILP

VLIW architectures

Page 2

04/21/23 Embedded Computer Architecture H. Corporaal and B. Mesman 2

What are we talking about?

ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture

Instruction format example of a 5-issue VLIW:

| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

Page 3

Single Issue RISC vs VLIW

[Figure: the compiler generates code for two machines.
RISC CPU: 12 instructions of one operation (op) each; executes 1 instr/cycle.
3-issue VLIW: 5 instructions of three operation slots each; executes 1 instr/cycle = 3 ops/cycle:

instr 1: op  op  op
instr 2: nop op  op
instr 3: op  op  nop
instr 4: op  nop op
instr 5: op  op  op]

Page 4

Topics Overview
• Enhance performance:
  – What options do you have?
• Operation/Instruction Level Parallelism
  – Limits on ILP
• VLIW
  – Examples
• Clustering
• Code generation
• Hands-on

Page 5

Architecture methods

Pipelined Execution of Instructions

Simple 5-stage pipeline:

IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

CYCLE          1   2   3   4   5   6   7   8
INSTRUCTION 1  IF  DC  RF  EX  WB
INSTRUCTION 2      IF  DC  RF  EX  WB
INSTRUCTION 3          IF  DC  RF  EX  WB
INSTRUCTION 4              IF  DC  RF  EX  WB

Purpose of pipelining:
• Reduce #gate_levels in critical path
• Reduce CPI close to one (instead of a large number for the multicycle machine)
• More efficient hardware

Problems
• Hazards: pipeline stalls
  • Structural hazards: add more hardware
  • Control hazards, branch penalties: use branch prediction
  • Data hazards: bypassing required

Page 6

Architecture methods

Pipelined Execution of Instructions

Superpipelining:
• Split one or more of the critical pipeline stages
• Superpipelining degree S:

$$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$$

where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op
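The superpipelining degree is a frequency-weighted average latency. A minimal Python sketch of that sum follows; the operation mix and latencies are hypothetical illustration values, not taken from the slides, and the function name is an assumption.

```python
def superpipelining_degree(op_mix):
    """Return S = sum of f(op) * lt(op) over the instruction set.

    op_mix maps an operation name to (frequency, latency in cycles).
    Frequencies are expected to sum to 1.
    """
    total_freq = sum(freq for freq, _ in op_mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("operation frequencies must sum to 1")
    return sum(freq * latency for freq, latency in op_mix.values())


# Hypothetical operation mix (assumed values for illustration only).
example_mix = {
    "alu":    (0.50, 1),
    "load":   (0.25, 2),
    "store":  (0.10, 1),
    "branch": (0.15, 2),
}

degree = superpipelining_degree(example_mix)
print(f"S = {degree:.2f}")
```

A machine where every operation has latency 1 gives S = 1; longer-latency operations raise S in proportion to how often they occur.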

Page 7

Architecture methods

Powerful Instructions (1)

MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data

Vector instruction:

for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];

or

c = a + 5*b

Assembly:

set vl,64
ldv v1,0(r2)
mulvi v2,v1,5
ldv v1,0(r1)
addv v3,v1,v2
stv v3,0(r3)
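The vector assembly can be followed step by step in a small Python sketch. It assumes, from the way the assembly matches c = a + 5*b, that r1 points to a, r2 to b and r3 to c; the function name is hypothetical and Python lists stand in for vector registers.

```python
def vector_kernel(a, b, vl=64):
    """Mimic the vector assembly step by step on Python lists."""
    v1 = b[:vl]                            # ldv   v1,0(r2)   load b
    v2 = [5 * x for x in v1]               # mulvi v2,v1,5    multiply by immediate
    v1 = a[:vl]                            # ldv   v1,0(r1)   load a
    v3 = [x + y for x, y in zip(v1, v2)]   # addv  v3,v1,v2   element-wise add
    return v3                              # stv   v3,0(r3)   store to c


a = list(range(64))
b = [2] * 64
c = vector_kernel(a, b)

# Reference result from the scalar loop on the slide.
scalar = [a[i] + 5 * b[i] for i in range(64)]
print(c == scalar)
```

Six vector instructions replace the 64 iterations of the scalar loop, which is the "multiple data operands per operation" idea.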

Page 8

Architecture methods

Powerful Instructions (1)

SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• Dense encoding (few instruction bits needed)

[Figure: SIMD Execution Method. Along the time axis, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are each executed across node1, node2, ..., node-K.]

Page 9

Architecture methods

Powerful Instructions (1)

• Sub-word parallelism
  – SIMD on restricted scale
  – Used for multi-media instructions
• Examples
  – MMX, SSX, SUN-VIS, HP MAX-2, AMD 3Dnow, Trimedia II
  – Example: $\sum_{i=1}^{4} |a_i - b_i|$
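The sum of absolute differences over four sub-words can be sketched in Python as follows. The 8-bit lane width, the 32-bit word and the helper names are assumptions made for illustration; real multimedia instruction sets perform this in a single operation.

```python
def pack(values, lane_bits=8):
    """Pack a list of small unsigned integers into one word, lane 0 lowest."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & ((1 << lane_bits) - 1)) << (lane * lane_bits)
    return word


def packed_abs_diff_sum(word_a, word_b, lanes=4, lane_bits=8):
    """Sum |a_i - b_i| over the unsigned sub-words packed in two words."""
    mask = (1 << lane_bits) - 1
    total = 0
    for lane in range(lanes):
        shift = lane * lane_bits
        a_i = (word_a >> shift) & mask
        b_i = (word_b >> shift) & mask
        total += abs(a_i - b_i)
    return total


a = pack([10, 200, 30, 40])
b = pack([12, 100, 35, 40])
sad = packed_abs_diff_sum(a, b)
print(sad)
```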

Page 10

Architecture methods

Powerful Instructions (2)

MO-technique: multiple operations per instruction

Two options:
• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)

VLIW instruction example:

field        FU 1           FU 2            FU 3            FU 4          FU 5
instruction  sub r8, r5, 3  and r1, r5, 12  mul r6, r5, r2  ld r3, 0(r5)  bnez r5, 13

Page 11

VLIW architecture: central Register File

[Figure: Exec units 1 to 9, grouped under Issue slots 1 to 3, all connected to one shared, multi-ported register file.]

Q: How many ports does the register file need for n-issue?
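One common way to reason about this question, sketched in Python: assume every issue slot executes one RISC-like operation that reads two source registers and writes one result per cycle. Under that assumption, which is not stated on the slide, a central register file needs 3n ports. The function name is hypothetical.

```python
def register_file_ports(issue_slots, reads_per_op=2, writes_per_op=1):
    """Ports needed when every slot can read and write in the same cycle."""
    read_ports = issue_slots * reads_per_op
    write_ports = issue_slots * writes_per_op
    return read_ports, write_ports, read_ports + write_ports


for n in (1, 3, 5, 8):
    reads, writes, total = register_file_ports(n)
    print(f"{n}-issue: {reads} read + {writes} write = {total} ports")
```

For n = 5 this gives 15 ports, which agrees with the 15-port register file listed for the 5-issue TriMedia TM1000 later in this deck.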

Page 12

TriMedia TM32A processor

[Figure: TM32A die floorplan showing D-cache, I-Cache, TAG blocks, SEQUENCER / DECODE, I/O INTERFACE, and the functional units IFMUL1 and IFMUL2 (FLOAT), DSPMUL1, DSPMUL2, FTOUGH1, SHIFTER0, SHIFTER1, ALU0 to ALU4, FCOMP2, DSPALU0, DSPALU2, FALU0 and FALU3.]

• 0.18 micron, area: 16.9 mm2
• 200 MHz (typ), 1.4 W
• 7 mW/MHz (MIPS processor: 0.9 mW/MHz)

Page 13

Architecture methods: Powerful Instructions (2)

VLIW Characteristics

• Only RISC like operation support
  – Short cycle times
• Flexible: Can implement any FU mixture
• Extensible
• Tight inter FU connectivity required
• Large instructions (up to 1024 bits)
• Not binary compatible !!!
• But good compilers exist

Page 14

Architecture methods

Multiple instruction issue (per cycle)

Who guarantees semantic correctness?
– Which instructions can be executed in parallel?

• User: specifies multiple instruction streams
  – Multi-processor: MIMD (Multiple Instruction Multiple Data)
• HW: run-time detection of ready instructions
  – Superscalar
• Compiler: compile into dataflow representation
  – Dataflow processors

Page 15

Multiple instruction issue

Three Approaches

Example code:

a := b + 15;
c := 3.14 * d;
e := c / f;

Translation to DDG (Data Dependence Graph)

[Figure: DDG of the example code. ld(&b) and the constant 15 feed +, whose result goes to st(&a). ld(&d) and the constant 3.14 feed *, whose result goes to st(&c) and to /. ld(&f) also feeds /, whose result goes to st(&e).]

Page 16

Generated Code

Instr.  Sequential Code    Dataflow Code
I1      ld r1,M(&b)        ld M(&b) -> I2
I2      addi r1,r1,15      addi 15 -> I3
I3      st r1,M(&a)        st M(&a)
I4      ld r1,M(&d)        ld M(&d) -> I5
I5      muli r1,r1,3.14    muli 3.14 -> I6, I8
I6      st r1,M(&c)        st M(&c)
I7      ld r2,M(&f)        ld M(&f) -> I8
I8      div r1,r1,r2       div -> I9
I9      st r1,M(&e)        st M(&e)

3 approaches:
• An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
  – No dependencies between streams; in practice communication and synchronization are required between streams
• A superscalar issues multiple instructions from a sequential stream
  – Obey dependencies (true and name dependencies)
  – Reverse engineering of DDG needed at run-time
• Dataflow code is a direct representation of the DDG
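The "reverse engineering of the DDG" that a superscalar performs at run-time can be illustrated with a short Python sketch. It recovers the true (read-after-write) dependences of the sequential code by tracking the last writer of each register. The tuple encoding of the instructions is an assumption made for this example, and memory dependences are ignored because every access here uses a distinct address.

```python
# Sequential code from the slide as (name, destination register, source registers).
# Stores have no destination register.
SEQUENTIAL = [
    ("I1", "r1", []),            # ld   r1,M(&b)
    ("I2", "r1", ["r1"]),        # addi r1,r1,15
    ("I3", None, ["r1"]),        # st   r1,M(&a)
    ("I4", "r1", []),            # ld   r1,M(&d)
    ("I5", "r1", ["r1"]),        # muli r1,r1,3.14
    ("I6", None, ["r1"]),        # st   r1,M(&c)
    ("I7", "r2", []),            # ld   r2,M(&f)
    ("I8", "r1", ["r1", "r2"]),  # div  r1,r1,r2
    ("I9", None, ["r1"]),        # st   r1,M(&e)
]


def true_dependences(code):
    """Return the set of (producer, consumer) read-after-write edges."""
    last_writer = {}  # register -> instruction that wrote it most recently
    edges = set()
    for name, dest, sources in code:
        for reg in sources:
            if reg in last_writer:
                edges.add((last_writer[reg], name))
        if dest is not None:
            last_writer[dest] = name
    return edges


def critical_path_length(code, edges):
    """Longest dependence chain, counted in instructions (unit latency)."""
    depth = {}
    for name, _, _ in code:  # program order is a valid topological order
        preds = [depth[p] for p, c in edges if c == name]
        depth[name] = 1 + max(preds, default=0)
    return max(depth.values())


edges = true_dependences(SEQUENTIAL)
for producer, consumer in sorted(edges):
    print(f"{producer} -> {consumer}")
print("critical path:", critical_path_length(SEQUENTIAL, edges))
```

The recovered edges are the same arrows that the dataflow code states explicitly. The reuse of r1 in I4 is only a name dependence, so it does not appear as an edge.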

Page 17

Multiple Instruction Issue: Data flow processor

[Figure: dataflow processor organization with Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations and function units FU-1, FU-2, ..., FU-K, plus a Result Tokens path.]

Page 18

Instruction Pipeline Overview

[Figure: pipeline stages per architecture.
CISC: IF DC RF EX WB
RISC: IF DC/RF EX WB
Superscalar: k parallel pipelines IFi DCi RFi EXi WBi (i = 1..k), with ISSUE and ROB stages
Superpipelined: IF1 IF2 ... IFs DC RF EX1 EX2 ... EX5 WB
VLIW: IF DC followed by k parallel RFi EXi WBi (i = 1..k)
DATAFLOW: k parallel RFi EXi WBi (i = 1..k)
One of the pipelines is annotated "(no pipelining)".]

Page 19

Four dimensional representation of the architecture design space <I, O, D, S>

[Figure: four axes, Instructions/cycle ‘I’, Operations/instruction ‘O’, Data/operation ‘D’ and Superpipelining Degree ‘S’, with tick marks at 0.1, 10 and 100, on which CISC, RISC, Superpipelined, Superscalar, MIMD, Dataflow, VLIW, Vector and SIMD are placed.]

Page 20

Architecture design space

Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures:

Architecture     K    I    O    D    S    Mpar
CISC             1    0.2  1.2  1.1  1    0.26
RISC             1    1    1    1    1.2  1.2
VLIW             10   1    10   1    1.2  12
Superscalar      3    3    1    1    1.2  3.6
Superpipelined   1    1    1    1    3    3
Vector           7    0.1  1    64   5    32
SIMD             128  1    1    128  1.2  154
MIMD             32   32   1    1    1.2  38
Dataflow         10   10   1    1    1.2  12

$$M_{par} = I \cdot O \cdot D \cdot S$$

$$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$$
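As a quick consistency check, the Mpar column can be recomputed from the I, O, D and S columns. The Python sketch below copies the values from the table; the dictionary and function names are assumptions for this example.

```python
# (I, O, D, S) per architecture, copied from the table.
DESIGN_SPACE = {
    "CISC":           (0.2, 1.2, 1.1, 1),
    "RISC":           (1, 1, 1, 1.2),
    "VLIW":           (1, 10, 1, 1.2),
    "Superscalar":    (3, 1, 1, 1.2),
    "Superpipelined": (1, 1, 1, 3),
    "Vector":         (0.1, 1, 64, 5),
    "SIMD":           (1, 1, 128, 1.2),
    "MIMD":           (32, 1, 1, 1.2),
    "Dataflow":       (10, 1, 1, 1.2),
}


def mpar(i, o, d, s):
    """Parallelism Mpar = I * O * D * S."""
    return i * o * d * s


results = {name: mpar(*point) for name, point in DESIGN_SPACE.items()}
for name, value in results.items():
    print(f"{name:15s} Mpar = {value:.2f}")
```

The products agree with the table once rounded, for example 153.6 for SIMD (listed as 154) and 38.4 for MIMD (listed as 38).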

Page 21

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism (ILP)
  – limits on ILP
• VLIW
  – Examples
• Clustering
• Code generation
• Hands-on

Page 22

General organization of an ILP architecture

[Figure: block diagram with Instruction memory, Instruction fetch unit, Instruction decode unit, FU-1 to FU-5, Register file, Bypassing network and Data memory; label: CPU.]

Page 23

Motivation for ILP
• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  – multi-media (image, audio, video, 3-D, holographic)
  – intelligent search and filtering engines
  – neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: $P = f C V_{dd}^2$

Page 24

Low power through parallelism

• Sequential Processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – $P = f C V^2$
• Parallel Processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – $P = (f/2) \cdot 2C \cdot V'^2 = f C V'^2$
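The power argument can be checked numerically. In the Python sketch below the supply voltages are hypothetical values chosen only to illustrate the effect; the function names are assumptions as well.

```python
def dynamic_power(frequency, capacitance, voltage):
    """Dynamic switching power P = f * C * V^2."""
    return frequency * capacitance * voltage ** 2


def parallel_power_ratio(v_parallel, v_sequential):
    """P_parallel / P_sequential for twice the units at half the frequency."""
    f, c = 1.0, 1.0  # normalised; the ratio does not depend on them
    p_seq = dynamic_power(f, c, v_sequential)
    p_par = dynamic_power(f / 2, 2 * c, v_parallel)
    return p_par / p_seq


# Hypothetical voltages (assumed for illustration only).
ratio = parallel_power_ratio(v_parallel=0.9, v_sequential=1.2)
print(f"parallel design uses {ratio:.0%} of the sequential power")
```

Because 2C and f/2 cancel, the whole saving comes from the lower supply voltage that the halved clock frequency permits: the ratio is (V'/V)^2.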

Page 25

Measuring and exploiting available ILP

• How much ILP is there in applications?
• How to measure parallelism within applications?
  – Using existing compiler
  – Using trace analysis
    • Track all the real data dependencies (RaWs) of instructions from issue window
      – register dependence
      – memory dependence
    • Check for correct branch prediction
      – if prediction correct continue
      – if wrong, flush schedule and start in next cycle

Page 26

Trace analysis

Program

For i := 0..2

A[i] := i;

S := X+3;

Compiled code

set r1,0

set r2,3

set r3,&A

Loop: st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

add r1,r5,3

Trace

set r1,0

set r2,3

set r3,&A

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

st r1,0(r3)

add r1,r1,1

add r3,r3,4

brne r1,r2,Loop

add r1,r5,3

How parallel can you execute this code?

Page 27

Trace analysis

Parallel Trace

set r1,0 set r2,3 set r3,&A

st r1,0(r3) add r1,r1,1 add r3,r3,4

st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop

st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop

brne r1,r2,Loop

add r1,r5,3

Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
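The parallel trace can be reproduced with a small scheduler sketch in Python. It makes the following assumptions, which are an interpretation of the slides rather than something they state in this form: unit latency, register renaming (so only read-after-write dependences constrain the schedule), unlimited issue width, and correct prediction of every branch except the final loop-exit branch, after which scheduling restarts in the next cycle.

```python
# Dynamic trace as (text, destination register, source registers).
SETUP = [("set r1,0", "r1", []), ("set r2,3", "r2", []), ("set r3,&A", "r3", [])]
BODY = [
    ("st r1,0(r3)", None, ["r1", "r3"]),
    ("add r1,r1,1", "r1", ["r1"]),
    ("add r3,r3,4", "r3", ["r3"]),
    ("brne r1,r2,Loop", None, ["r1", "r2"]),
]
EXIT = [("add r1,r5,3", "r1", ["r5"])]
TRACE = SETUP + BODY * 3 + EXIT
MISPREDICTED = {14}  # index of the final brne, which falls out of the loop


def schedule(trace, mispredicted):
    """Assign each instruction the earliest cycle allowed by RAW dependences."""
    ready = {}     # register -> first cycle in which its newest value can be read
    earliest = 1   # raised to the cycle after each mispredicted branch
    cycles = []
    for index, (_, dest, sources) in enumerate(trace):
        cycle = max([earliest] + [ready.get(reg, 1) for reg in sources])
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + 1   # unit latency
        if index in mispredicted:
            earliest = cycle + 1      # flush: restart in the next cycle
    return cycles


cycles = schedule(TRACE, MISPREDICTED)
length = max(cycles)
print("cycle per instruction:", cycles)
print("parallel length:", length)
print("speedup:", round(len(TRACE) / length, 1))
```

Under these assumptions the 16 instructions of the trace fit in 6 cycles, the same grouping as the parallel trace on the slide.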

Page 28

Ideal Processor

Assumptions for ideal/perfect processor:

1. Register renaming – infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction – perfect => all program instructions available for execution
3. Memory-address alias analysis – addresses are known. A store can be moved before a load provided the addresses are not equal

Also:
– unlimited number of instructions issued/cycle (unlimited resources), and
– unlimited instruction window
– perfect caches
– 1 cycle latency for all instructions (FP *,/)

Programs were compiled using the MIPS compiler with maximum optimization level

Page 29

Upper Limit to ILP: Ideal Processor

[Figure: bar chart of instruction issues per cycle (IPC) per program, axis 0 to 160.]

Program    IPC
gcc        54.8
espresso   62.6
li         17.9
fpppp      75.2
doducd     118.7
tomcatv    150.1

Integer: 18 - 60
FP: 75 - 150

Page 30

Window Size and Branch Impact
• Change from infinite window to examine 2000 and issue at most 64 instructions per cycle

[Figure: bar chart of instruction issues per cycle (IPC) per program and branch predictor, axis 0 to 60. Legend: Perfect; Selective predictor (Tournament); Standard 2-bit (BHT(512)); Static (Profile); None (No prediction).]

Program    Perfect  Selective  Standard 2-bit  Static  None
gcc        35       9          6               6       2
espresso   41       12         7               6       2
li         16       10         6               7       2
fpppp      61       48         46              45      29
doducd     58       15         13              14      4
tomcatv    60       46         45              45      19

FP: 15 - 45
Integer: 6 - 12

Page 31

Limiting nr. of Renaming Registers
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

[Figure: bar chart of instruction issues per cycle (IPC) per program and number of renaming registers, axis 0 to 70.]

Program    Infinite  256  128  64  32  None
gcc        11        10   10   9   5   4
espresso   15        15   13   10  5   4
li         12        12   12   11  6   5
fpppp      59        49   35   20  5   4
doducd     29        16   15   11  5   5
tomcatv    54        45   44   28  7   5

Integer: 5 - 15
FP: 11 - 45

Page 32

Memory Address Alias Impact
• Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers

[Figure: bar chart of instruction issues per cycle (IPC) per program and alias analysis model, axis 0 to 50.]

Program    Perfect  Global/stack perfect  Inspection  None
gcc        10       7                     4           3
espresso   15       7                     5           5
li         12       9                     4           3
fpppp      49       49                    4           3
doducd     16       16                    6           4
tomcatv    45       45                    5           4

FP: 4 - 45 (Fortran, no heap)
Integer: 4 - 9

Page 33

Reducing Window Size
• Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window

[Figure: bar chart of instruction issues per cycle (IPC) per program and window size, axis 0 to 60.]

Program    Infinite  256  128  64  32  16  8  4
gcc        10        10   10   9   8   6   4  3
espresso   15        15   13   10  8   6   4  2
li         12        12   11   11  9   6   4  3
fpppp      52        47   35   22  14  8   5  3
doducd     17        16   15   12  9   7   4  3
tomcatv    56        45   34   22  14  9   6  3

Integer: 6 - 12
FP: 8 - 45

Page 34

How to Exceed ILP Limits of This Study?

• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not those through memory
• Unnecessary dependences: the compiler did not unroll loops, so the iteration-variable dependence remains
• Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction
  – Address value prediction and speculation predicts addresses and speculates by reordering loads and stores. Could provide better aliasing analysis

Page 35

Conclusions

• Amount of parallelism is limited
  – higher in Multi-Media and Signal Processing appl.
  – higher in kernels
• Trace analysis detects all types of parallelism
  – task, data and operation types
• Detected parallelism depends on
  – quality of compiler
  – hardware
  – source-code transformations

Page 36

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
  – Examples
    • C6
    • TM
    • IA-64: Itanium, ....
    • TTA
• Clustering
• Code generation
• Hands-on

Page 37

VLIW: general concept

A VLIW architecture with 7 FUs

[Figure: Instruction Memory feeds the Instruction register, which drives the function units: three Int FUs, two LD/ST units and two FP FUs. The units connect to an Int Register File and a Floating Point Register File, and to the Data Memory.]

Page 38

VLIW characteristics
• Multiple operations per instruction
• One instruction per cycle issued (at most)
• Compiler is in control
• Only RISC like operation support
  – Short cycle times
  – Easier to compile for
• Flexible: Can implement any FU mixture
• Extensible / Scalable

However:
• tight inter FU connectivity required
• not binary compatible !!
  – (new long instruction format)
• low code density

Page 39

VelociTI C6x datapath

Page 40

VLIW example: TMS320C62

TMS320C62 VelociTI Processor

• 8 operations (of 32-bit) per instruction (256 bit)
• Two clusters
  – 8 FUs: 4 FUs / cluster (2 Multipliers, 6 ALUs)
  – 2 x 16 registers
  – One bus available to write in register file of other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All instructions conditional
• Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM

Page 41

VLIW example: Philips TriMedia TM1000

[Figure: PC and Instruction cache (32 kB) feed the Instruction register (5 issue slots), which drives five exec units connected to the Register file (128 regs, 32 bit, 15 ports) and the Data cache (16 kB).]

Function units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

Page 42

Intel EPIC Architecture IA-64

Explicit Parallel Instruction Computer (EPIC)
• IA-64 architecture -> Itanium, first realization 2001

Register model:
• 128 64-bit int x bits, stack, rotating
• 128 82-bit floating point, rotating
• 64 1-bit boolean
• 8 64-bit branch target address
• system control registers

See http://en.wikipedia.org/wiki/Itanium

Page 43

EPIC Architecture: IA-64

• Instructions grouped in 128-bit bundles
  – 3 * 41-bit instruction
  – 5 template bits, indicate type and stop location
• Each 41-bit instruction
  – starts with 4-bit opcode, and
  – ends with 6-bit guard (boolean) register-id
• Supports speculative loads
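The bundle arithmetic (3 * 41 + 5 = 128 bits) can be made concrete with a Python sketch that packs and unpacks a bundle as one integer. The slide gives only the field widths; placing the template in the 5 least significant bits followed by slots 0 to 2 is an assumption here, and the function names and slot values are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3 * 41


def pack_bundle(template, slots):
    """Pack a template and three 41-bit instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: return (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


bundle = pack_bundle(0b10000, [0x123, 0x456, 0x789])  # arbitrary example values
print(BUNDLE_BITS, unpack_bundle(bundle))
```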

Page 44

Itanium organization

Page 45

Itanium 2: McKinley

Page 46

EPIC Architecture: IA-64

• EPIC allows for more binary compatibility than a plain VLIW:
  – Function unit assignment performed at run-time
  – Lock when FU results not available
• See other website (course 5MD00) for more info on IA-64:
  – www.ics.ele.tue.nl/~heco/courses/ACA
  – (look at related material)

Page 47

What are we talking about?

ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture

Instruction format:

| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

Page 48

VLIW evaluation

[Figure: block diagram with Instruction memory, Instruction fetch unit, Instruction decode unit, FU-1 to FU-5, Register file, Bypassing network and Data memory; label: CPU. Annotations for N function units: "Control problem", O(N^2) and O(N)-O(N^2).]

Page 49

VLIW evaluation

Strong points of VLIW:
– Scalable (add more FUs)
– Flexible (an FU can be almost anything; e.g. multimedia support)

Weak points:
• With N FUs:
  – Bypassing complexity: O(N^2)
  – Register file complexity: O(N)
  – Register file size: O(N^2)
• Register file design restricts FU flexibility

Solution: .................................................. ?

Page 50

Solution

TTA: Transport Triggered Architecture

[Figure: function units labelled >, st, *, + and -.]

Page 51

Transport Triggered Architecture

General organization of a TTA

[Figure: block diagram with Instruction memory, Instruction fetch unit, Instruction decode unit, FU-1 to FU-5, Register file, Bypassing network and Data memory; label: CPU.]

Page 52

TTA structure; datapath details

[Figure: TTA datapath. Units attached through sockets: two load/store units, two integer ALUs, a float ALU, integer RF, float RF, boolean RF, instruct. unit and immediate unit; also shown are the Data Memory and the Instruction Memory.]

Page 53

TTA hardware characteristics

• Modular: building blocks easy to reuse
• Very flexible and scalable
  – easy inclusion of Special Function Units (SFUs)
• Very low complexity
  – > 50% reduction on # register ports
  – reduced bypass complexity (no associative matching)
  – up to 80 % reduction in bypass connectivity
  – trivial decoding
  – reduced register pressure
  – easy register file partitioning (a single port is enough!)

Page 54

TTA software characteristics

• More difficult to schedule !
• But: extra scheduling optimizations

add r3, r1, r2

becomes

r1 -> add.o1; r2 -> add.o2;
add.r -> r3

That does not look like an improvement !?!

[Figure: adder function unit (+) with operand ports o1 and o2 and result port r.]

Page 55

Program TTAs

How to do data operations ?
1. Transport of operands to FU
   • Operand move(s)
   • Trigger move
2. Transport of results from FU
   • Result move(s)

How to do control flow ?
1. Jumps: #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call: pc -> r; #call-address -> pcd

Example: add r3,r1,r2 becomes
r1 -> Oint // operand move to integer unit
r2 -> Tadd // trigger move to integer unit
…………. // addition operation in progress
Rint -> r3 // result move from integer unit

[Figure: FU pipeline with Operand and Trigger registers, an internal stage and a Result register.]

Page 56

Scheduling example

VLIW:
add r1,r1,r2
sub r4,r1,95

TTA:
r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4

[Figure: datapath with integer RF, immediate unit, two integer ALUs and a load/store unit.]
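The TTA schedule can be traced with a toy move interpreter in Python. This is a simplified sketch under stated assumptions: each function unit has operand ports o1 and o2 and a result port r, the result appears as soon as both operands have arrived (no separate trigger port, no extra latency), and the function and variable names are hypothetical.

```python
def run_moves(instructions, registers):
    """Execute TTA-style move instructions on a toy machine.

    Each instruction is a list of (source, destination) moves that happen
    in the same cycle.
    """
    operations = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}
    ports = {}
    for moves in instructions:
        # Read all sources first so moves within one cycle see old values.
        values = []
        for source, _ in moves:
            if isinstance(source, int):      # immediate
                values.append(source)
            elif source in registers:        # register file read
                values.append(registers[source])
            else:                            # function unit result port
                values.append(ports[source])
        for (_, destination), value in zip(moves, values):
            if "." in destination:           # function unit operand port
                ports[destination] = value
                unit = destination.split(".")[0]
                if f"{unit}.o1" in ports and f"{unit}.o2" in ports:
                    ports[f"{unit}.r"] = operations[unit](
                        ports[f"{unit}.o1"], ports[f"{unit}.o2"]
                    )
            else:                            # register file write
                registers[destination] = value
    return registers


program = [
    [("r1", "add.o1"), ("r2", "add.o2")],
    [("add.r", "sub.o1"), (95, "sub.o2")],
    [("sub.r", "r4")],
]
regs = run_moves(program, {"r1": 100, "r2": 20, "r4": 0})
print(regs)
```

The add result travels directly from add.r to sub.o1 and is never written back to r1. That saves a register write and a register read, and it is only valid if the new value of r1 is not needed afterwards.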

Page 57

TTA Instruction format

General MOVE field:

| g | i | src | dst |

g : guard specifier
i : immediate specifier
src : source
dst : destination

General MOVE instructions: multiple fields

| move 1 | move 2 | move 3 | move 4 |

How to use immediates?

Small, 6 bits: | g | 1 | imm | dst |
Long, 32 bits: | g | 0 | Ir-1 | dst | plus a separate imm field

Page 58

Programming TTAs

How to do conditional execution

Each move is guarded

Example
r1 -> cmp.o1 // operand move to compare unit
r2 -> cmp.o2 // trigger move to compare unit
cmp.r -> g // put result in boolean register g
g: r3 -> r4 // guarded move takes place when r1=r2

Page 59

Register file port pressure for TTAs

[Figure: chart of ILP degree (axis from 1.00 to 3.50) as a function of the number of read ports (1 to 5) and write ports (1 to 5). Caption: Read and write ports required.]

Page 60

Summary of TTA Advantages

• Better usage of transport capacity
  – Instead of 3 transports per dyadic operation, about 2 are needed
  – # register ports reduced by at least 50%
  – Inter-FU connectivity reduced by 50-70%
    • No full connectivity required
• Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
• Flexible: FUs can incorporate arbitrary functionality
• Scalable: #FUs, #reg.files, etc. can be changed
• FU splitting results in extra exploitable concurrency
• TTAs are easy to design and can have short cycle times

Page 61

TTA automatic DSE

[Figure: Move framework. Blocks: Architecture parameters, User interaction, Optimizer, Parametric compiler (output: Parallel object code), Hardware generator (output: chip), with feedback paths. Result: Pareto curve (solution space) plotting cost against exec. time.]

Page 62

Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
• Hands-on

Page 63

Clustered VLIW
• Clustering = splitting up the VLIW data path (the same can be done for the instruction path)

[Figure: three clusters, each with three FUs, a register file and a loop buffer; Level 1 Instruction Cache, Level 1 Data Cache and a shared Level 2 Cache.]

Page 64

Clustered VLIW

Why clustering?
• Timing: faster clock
• Lower cost
  – silicon area
  – T2M
• Lower energy

What’s the disadvantage?

Want to know more? See the PhD thesis of Andrei Terechko.

Page 65

Fine-Grained reconfigurable: Xilinx XC4000 FPGA

[Figure: array of Configurable Logic Blocks (CLBs) with Switch Matrices and Programmable Interconnect, surrounded by I/O Blocks (IOBs). Detail of an IOB: Input Buffer, Output Buffer, Slew Rate Control, Passive Pull-Up/Pull-Down, Delay, Pad. Detail of a CLB: F, G and H function generators with inputs F1-F4, G1-G4 and C1-C4 (H1, DIN, S/R, EC), two flip-flops with S/R Control, clock K, outputs X and Y.]

Page 66

Coarse-Grained reconfigurable: Chameleon CS2000

Highlights:
• 32-bit datapath (ALU/Shift)
• 16x24 Multiplier
• distributed local memory
• fixed timing

Page 67

Recent Coarse Grain Reconfigurable Architectures

• SmartCell 2009
  – read http://www.hindawi.com/journals/es/2009/518659.html
• Montium (reconfigurable VLIW)
• RAPID
• NIOS II
• RAW
• PicoChip
• PACT XPP64
• many more ….

Page 68

Hybrid FPGAs: Virtex II-Pro

[Figure, courtesy of Xilinx (Virtex II Pro): PowerPCs, reconfigurable logic blocks, memory blocks, and GHz IO with up to 16 serial transceivers.]

Page 69

Xilinx Zynq with 2 ARM processors

Page 70

Granularity Makes Differences

                    Fine-Grained Architecture  Coarse-Grained Architecture
Clock Speed         Low                        High
Configuration Time  Long                       Short
# of Blocks         Large                      Small
Flexibility         High                       Low
Power               High                       Low
Area                Large                      Small

Page 71

HW or SW reconfigurable?

[Figure: design space with data path granularity (fine to coarse) on one axis and reconfiguration time (from 1 cycle, via loopbuffer and context, to reset) on the other. Labels in the plane: FPGA, VLIW, Subword parallelism, Spatial mapping, Temporal mapping, configuration bandwidth.]