platform-based design tu/e 5kk70 henk corporaal bart mesman exploiting ilp vliw architectures (part...

Platform-based Design

TU/e 5kk70Henk Corporaal

Bart Mesman

Exploiting ILPVLIW architectures (part b)

ILP compilation (part a)

04/18/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

2

What are we talking about?

ILP = Instruction Level Parallelism =

ability to perform multiple operations (or instructions),from a single instruction stream,

in parallel

VLIW = Very Long Instruction Word architecture

operation 1 operation 2 operation 3 operation 4

Instruction format:

operation 5


3

VLIW evaluation

Strong points of VLIW:– Scalable (add more FUs)– Flexible (an FU can be almost anything; e.g. multimedia support)

Weak points:• With N FUs:

– Bypassing complexity: O(N2)– Register file complexity: O(N)– Register file size: O(N2)

• Register file design restricts FU flexibility

Solution: .................................................. ?


4

VLIW evaluationIn

stru

ctio

n m

emor

y

Inst

ruct

ion

fetc

h un

it

Inst

ruct

ion

deco

de u

nit

FU-1

FU-2

FU-3

FU-4

FU-5

Reg

iste

r fi

le

Dat

a m

emor

y

CPU

Byp

assi

ng n

etw

ork

Control problem O(N2) O(N)-O(N2) With N function units


5

Solution

TTA: Transport Triggered ArchitectureTTA: Transport Triggered Architecture

>

st

*

+ -

>

st

*

+ -


6

Transport Triggered Architecture

General organization of a TTAIn

stru

ctio

n m

emor

y

Inst

ruct

ion

fetc

h un

it

Inst

ruct

ion

deco

de u

nit

FU-1

FU-2

FU-3

FU-4

FU-5

Reg

iste

r fi

le

Dat

a m

emor

y

CPU

Byp

assi

ng n

etw

ork


7

TTA structure; datapath details

Socket

integer RF

floatRF

booleanRF

instruct.unit

immediateunit

load/store unit

integer ALU

float ALU

integer ALU

load/store unit

Data Memory

Instruction Memory


8

TTA hardware characteristics

• Modular: building blocks easy to reuse• Very flexible and scalable

– easy inclusion of Special Function Units (SFUs)

• Very low complexity– > 50% reduction on # register ports– reduced bypass complexity (no associative matching)– up to 80 % reduction in bypass connectivity– trivial decoding– reduced register pressure– easy register file partitioning (a single port is enough!)


9

TTA software characteristics

• More difficult to schedule !• But: extra scheduling optimizations

add r3, r1, r2

r1 add.o1; r2 add.o2;

add.r r3

That does not look like an

improvement !?!

+o1 o2

r


10

Program TTAs

How to do data operations ?1. Transport of operands to FU

• Operand move (s)• Trigger move

2. Transport of results from FU• Result move (s)

How to do Control flow ?1. Jumps: #jump-address pc

2. Branch: #displacement pcd

3. Call: pc r; #call-address pcd

Example Add r3,r1,r2 becomesr1 Oint // operand move to integer unitr2 Tadd // trigger move to integer unit…………. // addition operation in progressRint r3 // result move from integer unit

Trigger Operand

Internal stage

Result

FU Pipeline


11

Scheduling example

add r1,r2,r2

sub r4,r1,95

VLIW

r1 -> add.o1, r2 -> add.o2

add.r -> sub.o1, 95 -> sub.o2

sub.r -> r4

TTA

integer RF

immediateunit

integer ALU

integer ALU

load/store unit


12

TTA Instruction format

General MOVE field:g : guard specifieri : immediate specifiersrc : sourcedst : desitnation

g i src dst

How to use immediates?

Small, 6 bits

Long, 32 bits

g 1 imm dst

g 0 Ir-1 dst imm

move 1

General MOVE instructions: multiple fields

move 2 move 3 move 4


13

Programming TTAs

How to do conditional executionEach move is guarded

Exampler1 cmp.o1 // operand move to compare unitr2 cmp.o2 // trigger move to compare unitcmp.r g // put result in boolean register gg:r3 r4 // guarded move takes place when r1=r2


14

Register file port pressure for TTAs

12

34

5

12

34

51.00

1.50

2.00

2.50

3.00

3.50

ILP

de

gre

e

Read portsWrite ports

Read and write ports required


15

Summary of TTA Advantages

• Better usage of transport capacity– Instead of 3 transports per dyadic operation, about 2 are

needed– # register ports reduced with at least 50%– Inter FU connectivity reduces with 50-70%

• No full connectivity required

• Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs

• Flexible: Fus can incorporate arbitrary functionality• Scalable: #FUS, #reg.files, etc. can be changed• FU splitting results into extra exploitable concurrency• TTAs are easy to design and can have short cycle times


16

TTA automatic DSE

Architectureparameters

OptimizerOptimizer

Parametric compilerParametric compiler Hardware generatorHardware generator

feedbackfeedback

Userintercation

Parallel object code chip

Pareto curve(solution space)

cost

exec

. tim

e

x

x

x

x

xx

x

xx

x

x

x

x

x

x

xx x

x

x

Move framework


17

Overview• Enhance performance: architecture methods• Instruction Level Parallelism• VLIW• Examples

– C6

– TM

– TTA

• Clustering and Reconfigurable components• Code generation• Hands-on


18

Clustered VLIW• Clustering = Splitting up the VLIW data path

- same can be done for the instruction path –

FU FU FU

loop buffer

register file

FU FU FU

loop buffer

register file

FU FU FU

loop buffer

register file

Level 1 Instruction Cache

Level 1 Data Cache

Level 2 (shared) C

ache


19

Clustered VLIW

Why clustering?

• Timing: faster clock

• Lower Cost– silicon area– T2M

• Lower Energy


20

CLB

CLB

CLB

CLB

SwitchMatrix

ProgrammableInterconnect I/O Blocks (IOBs)

ConfigurableLogic Blocks (CLBs)

D Q

SlewRate

Control

PassivePull-Up,

Pull-Down

Delay

Vcc

OutputBuffer

InputBuffer

Q D

Pad

D QSD

RDEC

S/RControl

D QSD

RDEC

S/RControl

1

1

F'

G'

H'

DIN

F'

G'

H'

DIN

F'

G'

H'

H'

HFunc.Gen.

GFunc.Gen.

FFunc.Gen.

G4G3G2G1

F4F3F2F1

C4C1 C2 C3

K

Y

X

H1 DIN S/R EC

Fine-Grained reconfigurable: Xilinx XC4000 FPGA


21

Coarse-Grained reconfigurable: Chameleon CS2000

Highlights:•32-bit datapath (ALU/Shift)•16x24 Multiplier•distributed local memory•fixed timing


22

Hybrid FPGAs: Virtex II-Pro

ReConfig

.logic

Up to 16 serial transceivers

PowerP

Cs

Courtesy of Xilinx (Virtex II Pro)

PowerPC

Reconfigurable logicblocks

Memory blocks

GHz IO: Up to 16 serial transceivers


23

HW or SW reconfigurable?

Data path granularityfine coarse

Rec

onfi

gura

tion

tim

e

1 cycleSubword parallelism

loopbuffercontext

reset

Spatial mapping

Temporal mapping

FPGA

VLIW

configuration bandwidth


24

Granularity Makes Differences

Fine-Grained Architecture

Coarse-Grained Architecture

Clock Speed Low High

Configuration Time

Long Short

Unit Amount Large Small

Flexibility High Low

Power High Low

Area Large Small


25

Overview• Enhance performance: architecture methods• Instruction Level Parallelism• VLIW• Examples

– C6

– TM

– TTA

• Clustering and Reconfigurable components• Code generation• Hands-on


26

Compiler basics

• Overview– Compiler trajectory / structure / passes– Control Flow Graph (CFG)– Mapping and Scheduling– Basic block list scheduling– Extended scheduling scope– Loop scheduling– Loop transformations


27

Compiler basics: trajectory

Preprocessor

Compiler

Assembler

Loader/Linker

Source program

Object program

Error messages

Library code


28

Compiler basics: structure / passes

Lexical analyzer

Parsing

Code optimization

Register allocation

Source code

Sequential code

Intermediate code

Code generation

Scheduling and allocation

Object code

token generation

check syntax check semantic parse tree generation

data flow analysis local optimizations global optimizationscode selection peephole optimizations

making interference graph graph coloring spill code insertion caller / callee save and restore code

exploiting ILP


29

Compiler basics: structure Simple compilation example

Lexical analyzer

Syntax analyzer

Intermediate code generator

position := initial + rate * 60

id := id + id * 60

:=

+id

*id

60id

Code optimizer

Code generator

temp1 := intoreal(60)temp2 := id3 * temp1temp3 := id2 + temp2id1 := temp3

temp1 := id3 * 60.0id1 := id2 + temp1

movf id3, r2mulf #60, r2, r2movf id2, r1addf r2, r1movf r1, id1


30

Compiler basics: Control flow graph (CFG)

C input code:

CFG: 1 sub t1, a, b bgz t1, 2, 3

4 ………….. …………..

3 rem r, b, a goto 4

2 rem r, a, b goto 4

Program, is collection of Functions, each function is collection of Basic Blocks, each BB contains set of Instructions, each instruction consists of several Transports,..

if (a > b) { r = a % b; } else { r = b % a; }


31

• Machine independent optimizations

• Machine dependent optimizations

Compiler basics: Basic optimizations


32

• Machine independent optimizations– Common subexpression elimination

– Constant folding

– Copy propagation

– Dead-code elimination

– Induction variable elimination

– Strength reduction

– Algebraic identities• Commutative expressions

• Associativity: Tree height reduction– Note: not always allowed(due to limited precision)



33

• Machine dependent optimization example

What’s the optimal implementation of a*34 ?

– Use multiplier: mul Tb, Ta, 34• Pro: No thinking required

• Con: May take many cycles

– Alternative:SHL Tc, Ta, 1ADD Tb, Tc, TzeroSHL Tc, Tc, 4ADD Tb, Tb, Tc

• Pros: May take fewer cycles• Cons:• Uses more registers• Additional instructions ( I-cache load / code size)



34

• Register Organization Conventions needed for parameter passing and register usage across function calls

Compiler basics: Register allocation

r31

r21

r20

r11

r10

r1

r0

Callee saved registers

Caller saved registers

Argument and result transfer

Hard-wired 0

Temporaries


35

Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?

• A variable is defined at a point in program when a value is assigned to it.

• A variable is used at a point in a program when its value is referenced in an expression.

• The live range of a variable is the execution range between definitions and uses of a variable.


36

Program:

a := c := b := := bd := := a := c := d

a b c dLive Ranges


Example:


37


a

b c

d

Inference Graph

a

b c

d

Coloring:a = redb = greenc = blued = green

Graph needs 3 colors => program needs 3 registers


38

Register allocation using graph coloringSpill/ Reload code Spill/ Reload code is needed when there are not enough colors (registers) to color the interference graph

Example: Only two registers available !!

Program:

a := c := store cb := := bd := := aload c := c := d

a b c dLive Ranges


39

Register allocation for a monolithic RF

Scheme of the optimistic register allocator

Renumber Build Spill costs Simplify Select

Spill code

The Select phase selects a color (= machine register) for a variable that minimizes the heuristic:

h1 = fdep(col, var) + caller_callee(col, var)

where: fdep(col, var) : a measure for the introduction of false dependencies caller_callee(col, var) : cost for mapping var on a caller or callee saved register


40

• CISC era– Code size important– Determine shortest sequence of code

• Many options may exist

– Pattern matchingExample M68029:

D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ] ADD ([10,A1], D2*16, 20) D1

• RISC era– Performance important– Only few possible code sequences– New implementations of old architectures optimize RISC part of

instruction set only; for e.g. i486 / Pentium / M68020

Compiler basics: Code selection


41

Mapping / Scheduling: placing operations in space and time

d = a * b;

e = a + d;

f = 2 * b + d;

r = f – e;

x = z + y;

* *

+ +

-

+

a b 2

z yd

e f

r

x

Data Dependence Graph (DDG)


42

How to map these operations?

* *

+ +

-+

a b 2

z y

d

e f

rx

Architecture constraints:• One Function Unit• All operations single cycle latency

*

*

+

+

-

+

cycle 1

2

3

4

5

6


43

How to map these operations?

* *

+ +

-+

a b 2

z y

d

e f

rx

Architecture constraints:• One Add-sub and one Mul unit• All operations single cycle latency

*

* +

+

-

+cycle 1

2

3

4

5

6

Mul Add-sub


44

There are many mapping solutions

Pareto curve(solution space)

T e

xecu

tion

x

x

x

x

xx

x

xx

x

x

x

x

x

x

xxx

x

x

xx

x

x

x

x

x

xx

x

xx

Cost0


45

Basic Block Scheduling

• Make a dependence graph

• Determine minimal length

• Determine ASAP, ALAP, and slack of each operation

• Place each operation in first cycle with sufficient resources

Note:– Scheduling order sequential

– Priority determined by used heuristic; e.g. slack


46

Basic Block Scheduling

ADD

LD

A C

y

<1,3>

<2,4>MUL

A B

z

<1,4>

ADD

ADD

SUB

NEG LD

A

B C

X

<3,3>

<4,4>

<2,2>

<2,3>

<1,1>

ASAP cycle

ALAP cycle

slack


47

Cycle based list schedulingproc Schedule(DDG = (V,E))beginproc ready = { v | (u,v) E } ready’ = ready sched = current_cycle = 0 while sched V do for each v ready’ do if ResourceConfl(v,current_cycle, sched) then cycle(v) = current_cycle sched = sched {v} endif endfor current_cycle = current_cycle + 1 ready = { v | v sched (u,v) E, u sched } ready’ = { v | v ready (u,v) E, cycle(u) + delay(u,v) current_cycle} endwhileendproc

platform-based design tu/e 5kk70 henk corporaal bart mesman exploiting ilp vliw architectures (part...

Documents