Embedded Computer Architecture
TU/e 5KK73, Henk Corporaal
VLIW architectures: Generating VLIW code



04/19/23 Embedded Computer Architecture H. Corporaal, and B. Mesman 2

VLIW lectures overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
  – compiler basics
  – mapping and scheduling
  – TTA code generation
  – Design space exploration
• Hands-on


Compiler basics
• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling
  – Loop transformations (separate lecture)


Compiler basics: trajectory

  Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program

Error messages are reported along the way; library code is added at link time.


Compiler basics: structure / passes

  Source code
    → Lexical analyzer (token generation)
    → Parsing (check syntax, check semantics, parse tree generation)
    → Intermediate code
    → Code optimization (data flow analysis, local optimizations, global optimizations)
    → Code generation (code selection, peephole optimizations) → Sequential code
    → Register allocation (building the interference graph, graph coloring, spill code insertion, caller/callee save and restore code)
    → Scheduling and allocation (exploiting ILP)
    → Object code


Compiler basics: structure
Simple example: from HLL to (sequential) assembly code

  position := initial + rate * 60

Lexical analyzer:
  id1 := id2 + id3 * 60

Syntax analyzer (parse tree):
  :=
    id1
    +
      id2
      *
        id3
        60

Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3

Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1

Code generator:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1


Compiler basics: Control Flow Graph (CFG)

C input code:
  if (a > b) { r = a % b; }
  else { r = b % a; }

The CFG shows the flow between basic blocks:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...

A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.
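Basic blocks can be recovered from a linear instruction list by finding "leaders" (the first instruction, branch targets, and fall-through successors of branches). A minimal Python sketch; the tuple encoding of instructions and the concrete example are made up for illustration, with branch targets given as instruction indices:

```python
# Sketch: splitting a linear instruction list into basic blocks via leaders.
# instrs: list of (opcode, branch_target_or_None).

def find_leaders(instrs):
    leaders = {0}                       # rule 1: the first instruction leads
    for i, (op, target) in enumerate(instrs):
        if op in ("goto", "bgz"):       # a branch makes its target a leader...
            leaders.add(target)
            if i + 1 < len(instrs):     # ...and its fall-through successor too
                leaders.add(i + 1)
    return sorted(leaders)

def basic_blocks(instrs):
    """Each block runs from one leader up to (not including) the next."""
    ls = find_leaders(instrs)
    return [list(range(a, b)) for a, b in zip(ls, ls[1:] + [len(instrs)])]

# the if/else example: 0: sub; 1: bgz -> 4; 2: rem; 3: goto -> 5; 4: rem; 5: ...
instrs = [("sub", None), ("bgz", 4), ("rem", None),
          ("goto", 5), ("rem", None), ("nop", None)]
```

Running `basic_blocks(instrs)` splits the six instructions into four blocks, matching the four basic blocks of the CFG above.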


Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations


Compiler basics: Basic optimizations
• Machine independent optimizations
  – Common subexpression elimination
  – Constant folding
  – Copy propagation
  – Dead-code elimination
  – Induction variable elimination
  – Strength reduction
  – Algebraic identities
    • Commutative expressions
    • Associativity: tree height reduction
      – Note: not always allowed (due to limited precision)
• For details check any good compiler book!


Compiler basics: Basic optimizations
• Machine dependent optimization example
  – What's the optimal implementation of a*34?
  – Use multiplier: mul Tb, Ta, 34
    • Pro: no thinking required
    • Con: may take many cycles
  – Alternative:
      SHL Tb, Ta, 1
      SHL Tc, Ta, 5
      ADD Tb, Tb, Tc
    • Pro: may take fewer cycles
    • Cons: uses more registers; additional instructions (I-cache load / code size)
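The shift/add alternative can be sanity-checked in a few lines: since 34 = 2 + 32, a*34 = (a<<1) + (a<<5). A quick Python sketch mirroring the SHL/SHL/ADD sequence:

```python
# Sketch: strength reduction of a constant multiply, as a backend might emit it.
def mul_by_34(a):
    tb = a << 1        # SHL Tb, Ta, 1   (a * 2)
    tc = a << 5        # SHL Tc, Ta, 5   (a * 32)
    return tb + tc     # ADD Tb, Tb, Tc  (a * 34)
```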


Compiler basics: Register allocation

• Register organization
  – Conventions needed for parameter passing and for register usage across function calls

[Figure: example register file layout (r0..r31), partitioned into: hard-wired 0 (r0), function argument and result transfer, other temporaries, caller saved registers, and callee saved registers]


Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?

Some definitions:
• A variable is defined at a point in a program when a value is assigned to it.
• A variable is used at a point in a program when its value is referenced in an expression.
• The live range of a variable is the execution range between its definition and its last use.
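These definitions translate directly into code. A small Python sketch that computes live ranges for straight-line code; the statement encoding (a pair of defined variable and used variables) is an assumption for illustration:

```python
# Sketch: live ranges from def/use information in straight-line code.
def live_ranges(stmts):
    """stmts: list of (defined_var_or_None, [used_vars]).
    Returns {var: (def_index, last_use_index)}."""
    rng = {}
    for i, (d, uses) in enumerate(stmts):
        if d is not None and d not in rng:
            rng[d] = [i, i]          # the definition opens the live range
        for u in uses:
            rng[u][1] = i            # every use extends it to this statement
    return {v: tuple(r) for v, r in rng.items()}

# the slide's program: a:=; c:=; b:=; :=b; d:=; :=a; :=c; :=d
stmts = [("a", []), ("c", []), ("b", []), (None, ["b"]),
         ("d", []), (None, ["a"]), (None, ["c"]), (None, ["d"])]
```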


Register allocation using graph coloring

Program:
  a :=
  c :=
  b :=
     := b
  d :=
     := a
     := c
     := d

[Figure: live ranges of a, b, c and d; each range runs from the definition of the variable to its last use]


Register allocation using graph coloring

Interference graph: an edge connects two variables whose live ranges overlap. For the program above: a-b, a-c, a-d, b-c, c-d.

Coloring:
  a = red
  b = green
  c = blue
  d = green

The graph needs 3 colors => the program needs 3 registers.

Question: map coloring requires (at most) 4 colors; what is the maximum number of colors (= registers) needed for register interference graph coloring?
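A minimal sketch of (non-optimal) greedy coloring on this interference graph, assuming the edge set implied by the live ranges above; colors are small integers instead of names:

```python
# Sketch: greedy smallest-available-color coloring of an interference graph.
def color(graph):
    colors = {}
    for v in graph:                    # visit order affects greedy quality
        taken = {colors[n] for n in graph[v] if n in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in taken)
    return colors

graph = {"a": {"b", "c", "d"}, "b": {"a", "c"},
         "c": {"a", "b", "d"}, "d": {"a", "c"}}
```

On this graph the greedy pass uses 3 colors, matching the slide (b and d share one, since their live ranges do not overlap).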


Register allocation using graph coloring: spill/reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!

Program:
  a :=
  c :=
  store c        // spill c
  b :=
     := b
  d :=
     := a
  load c         // reload c
     := c
     := d

[Figure: the live range of c is split by the spill, so at most two of a, b, c, d are live at once]


Register allocation for a monolithic RF

Scheme of the optimistic register allocator:

  Renumber → Build → Spill costs → Simplify → Select
  (Spill code is inserted and the allocator reruns when Select fails to color a node)

The Select phase selects a color (= machine register) for a variable such that the heuristic h is minimized:

  h = fdep(col, var) + caller_callee(col, var)

where:
  fdep(col, var): a measure for the introduction of false dependencies
  caller_callee(col, var): the cost of mapping var on a caller or callee saved register


Some explanation of the register allocation phases:

• [Renumber:] The first phase finds all live ranges in a procedure and numbers (renames) them uniquely.

• [Build:] This phase constructs the interference graph.

• [Spill costs:] In preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.

• [Simplify:] This phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.

• [Select:] Colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph and given a color distinct from its neighbors. Whenever no color is available for some node, it is left uncolored and the allocator continues with the next node.

• [Spill code:] In the final phase spill code is inserted for the live ranges of all uncolored nodes.

• Some symbolic registers must be mapped on a specific machine register (like the stack pointer). These registers get their color in the simplify stage instead of being pushed on the stack.

• The other machine registers are divided into caller-saved and callee-saved registers, and the allocator computes both a caller-saved and a callee-saved cost. The caller-saved cost for a symbolic register applies when it has a live range across a procedure call; the cost is twice the execution frequency of its transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.


Compiler basics: Code selection

• CISC era (before 1985)
  – Code size important
  – Determine the shortest code sequence
    • Many options may exist
  – Pattern matching. Example, M68020:
      D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
      maps onto the single instruction: ADD ([10,A1], D2*16, 20), D1
• RISC era
  – Performance important
  – Only few possible code sequences
  – New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020


Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
  – Compiler basics
  – Mapping and Scheduling of Operations
    • What is scheduling
    • Basic Block Scheduling
    • Extended Basic Block Scheduling
    • Loop Scheduling
• Design Space Exploration: TTA framework


Mapping / Scheduling = placing operations in space and time

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

[Figure: the Data Dependence Graph (DDG) of this code: multiplies a*b and 2*b, adds producing e and f (both consuming d), a subtract producing r = f - e, and an independent add x = z + y]


How to map these operations?

Architecture constraints:
• One function unit
• All operations single cycle latency

[Figure: on a single FU the six DDG operations are issued one per cycle, occupying cycles 1 through 6 (*, *, +, +, -, +)]


How to map these operations?

Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency

[Figure: with the Mul and Add-sub units operating in parallel, the DDG now fits in 4 cycles]


There are many mapping solutions

[Figure: Pareto graph of the solution space, plotting execution time against cost; the Pareto points form the lower-left frontier of the point cloud]

Point x is Pareto ⟺ there is no point y for which y_i < x_i for all dimensions i.
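The Pareto condition translates directly into a filter. A small Python sketch over (cost, execution time) pairs; it uses weak dominance, so a point that is equal in one dimension and worse in the other is also removed:

```python
# Sketch: keep only the Pareto points of a multi-objective solution set.
def pareto(points):
    def dominates(y, x):
        # y is at least as good in every dimension and differs somewhere
        return y != x and all(yi <= xi for yi, xi in zip(y, x))
    return [x for x in points if not any(dominates(y, x) for y in points)]
```

For example, `pareto([(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)])` drops (2, 6) and (4, 4), which are dominated by (2, 4) and (3, 3) respectively.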


Scheduling: Overview

Transforming a sequential program into a parallel program:

  read sequential program
  read machine description file
  for each procedure do
      perform function inlining
  for each procedure do
      transform an irreducible CFG into a reducible CFG
      perform control flow analysis
      perform loop unrolling
      perform data flow analysis
      perform memory reference disambiguation
      perform register allocation
      for each scheduling scope do
          perform instruction scheduling
  write out the parallel program


Basic Block Scheduling

• Basic block = piece of code which can only be entered from the top (first instruction) and left at the bottom (final instruction)
• Scheduling a basic block = assigning resources and a cycle to every operation
• List scheduling = heuristic scheduling approach, scheduling the operations one by one
  – Time complexity = O(N), where N is the number of operations
• Optimal scheduling has time complexity = O(exp(N))
• Question: what is a good scheduling heuristic?


Basic Block Scheduling
• Make a Data Dependence Graph (DDG)
• Determine the minimal length of the DDG (for the given architecture)
  – minimal number of cycles to schedule the graph (assuming sufficient resources)
• Determine:
  – ASAP (As Soon As Possible) cycle = earliest cycle the instruction can be scheduled
  – ALAP (As Late As Possible) cycle = latest cycle the instruction can be scheduled
  – Slack of each operation = ALAP – ASAP
  – Priority of operations = f(slack, #descendants, register impact, ...)
• Place each operation in the first cycle with sufficient resources
• Notes:
  – Basic block = a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  – Scheduling order is sequential
  – Scheduling priority is determined by the used heuristic, e.g. slack + other contributions
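ASAP, ALAP and slack can be computed with two traversals of the DDG. A Python sketch assuming single-cycle operations and a successor-map representation; the tiny example DAG in the test is made up, not the slide's figure:

```python
# Sketch: ASAP/ALAP cycles (1-based) and slack for a DAG of single-cycle ops.
def asap_alap(succs):
    """succs: op -> list of successor ops. Returns (asap, alap, slack)."""
    preds = {v: [] for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)
    asap, alap = {}, {}

    def a(v):                                  # earliest cycle: after all preds
        if v not in asap:
            asap[v] = 1 + max((a(p) for p in preds[v]), default=0)
        return asap[v]
    for v in succs:
        a(v)
    length = max(asap.values())                # minimal schedule length

    def l(v):                                  # latest cycle: before all succs
        if v not in alap:
            alap[v] = min((l(s) for s in succs[v]), default=length + 1) - 1
        return alap[v]
    for v in succs:
        l(v)
    return asap, alap, {v: alap[v] - asap[v] for v in succs}
```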


Basic Block Scheduling: determine ASAP and ALAP cycles

[Figure: a DDG with LD, MUL, ADD, SUB and NEG operations over inputs A, B and C; each node is annotated with its <ASAP, ALAP> pair (e.g. <1,1>, <1,3>, <2,4>, <3,3>, <4,4>); the difference ALAP - ASAP is the slack]

We assume all operations are single cycle!


Cycle based list scheduling

  proc Schedule(DDG = (V,E))
      ready  = { v | ¬∃(u,v) ∈ E }           /* operations without predecessors */
      ready' = ready
      sched = ∅
      current_cycle = 0
      while sched ≠ V do
          for each v ∈ ready' (select in priority order) do
              if ¬ResourceConfl(v, current_cycle, sched) then
                  cycle(v) = current_cycle
                  sched = sched ∪ {v}
              endif
          endfor
          current_cycle = current_cycle + 1
          ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
          ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
      endwhile
  endproc
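A runnable sketch of this cycle-based list scheduler; the resource-conflict check is simplified to an issue-width limit `nunits` and all delays are fixed to 1 (both simplifications are assumptions, not part of the slide's algorithm):

```python
# Sketch: cycle-based list scheduling with a simplified resource model.
def list_schedule(succs, priority, nunits=1):
    """succs: op -> successor list; priority: sort key; at most nunits ops/cycle."""
    preds = {v: set() for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)
    cycle, sched, t = {}, set(), 1
    while len(sched) < len(succs):
        # ready': unscheduled ops whose predecessors all complete before cycle t
        ready = [v for v in succs if v not in sched
                 and all(p in sched and cycle[p] + 1 <= t for p in preds[v])]
        for v in sorted(ready, key=priority)[:nunits]:   # issue in priority order
            cycle[v] = t
            sched.add(v)
        t += 1
    return cycle
```

With one unit, two independent producers and their common consumer take 3 cycles; with two units the producers issue together and the schedule shrinks to 2 cycles.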


Extended Scheduling Scope: look at the CFG

Code:
  A;
  If cond Then B Else C;
  D;
  If cond Then E Else F;
  G;

[Figure: the corresponding CFG (Control Flow Graph): A branches to B and C, which join in D; D branches to E and F, which join in G]

Q: Why enlarge the scheduling scope?


Extended basic block scheduling: Code Motion

[Figure: CFG with block A containing (a) add r3, r4, 4 and (b) beq ...; block B containing (c) add r1, r1, r2; block C containing (d) sub r3, r3, r2; both reaching block D containing (e) mul r1, r1, r3]

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A

Q: Why move code?


Possible Scheduling Scopes:
• Trace
• Superblock
• Decision tree
• Hyperblock/region


Create and Enlarge Scheduling Scope

[Figure: the example CFG (A; B|C; D; E|F; G). A Trace selects one path, e.g. A-B-D-E-G. A Superblock is formed from the trace by tail duplication: the side entries from C and F are redirected to duplicated blocks D', E', G', so the superblock can only be entered at the top]


Create and Enlarge Scheduling Scope

[Figure: the same CFG turned into a Decision Tree (tail duplication of the join blocks yields D', E', F', G', G'') and into a Hyperblock/region (the alternative paths B and C are merged, with join points still allowed)]


Comparing scheduling scopes

                             Trace   Sup.block   Hyp.block   Dec.Tree   Region
  Multiple exc. paths        No      No          Yes         Yes        Yes
  Side-entries allowed       Yes     No          No          No         No
  Join points allowed        Yes     No          Yes         No         Yes
  Code motion down joins     Yes     No          No          No         No
  Must be if-convertible     No      No          Yes         No         No
  Tail dup. before sched.    No      Yes         No          Yes        No


Code movement (upwards) within regions: what to check?

[Figure: an add is moved from its source block up to a destination block; on other paths into the source block a copy of the operation is needed, and in intermediate blocks the moved operation must be checked for off-liveness of its destination register]


Extended basic block scheduling: Code Motion

• A dominates B ⟺ A is always executed before B
  – Consequently: A does not dominate B ⟹ code motion from B to A requires code duplication
• B post-dominates A ⟺ B is always executed after A
  – Consequently: B does not post-dominate A ⟹ code motion from B to A is speculative

[Figure: CFG with entry block A, blocks B and C, blocks D and E, and exit block F]

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
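Dominators can be computed with the standard iterative data-flow algorithm. A Python sketch; the edge set below is an assumed reading of the figure (A → B,C; B → D; C → D,E; D,E → F), so take the algorithm from it, not the concrete answers:

```python
# Sketch: iterative dominator computation on a CFG given as a successor map.
def dominators(succs, entry):
    nodes = set(succs)
    preds = {v: {u for u in nodes if v in succs[u]} for v in nodes}
    dom = {v: set(nodes) for v in nodes}   # start from "everything dominates v"
    dom[entry] = {entry}
    changed = True
    while changed:                         # iterate to the greatest fixed point
        changed = False
        for v in nodes - {entry}:
            new = {v} | set.intersection(*(dom[p] for p in preds[v]))
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom

# assumed CFG for illustration
succs = {"A": {"B", "C"}, "B": {"D"}, "C": {"D", "E"},
         "D": {"F"}, "E": {"F"}, "F": set()}
```

On this assumed CFG, C dominates E (E is only reachable through C) but not D (D is also reachable via B).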


Scheduling: Loops

Loop optimizations (on a loop with header A, body B-C, exit D):

• Loop peeling: a copy of the loop body (C') is peeled off and placed before the loop.
• Loop unrolling: the loop body is duplicated (C', C'') inside the loop, reducing loop overhead and enlarging the scheduling scope.


Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining; only software pipelining keeps the resources continuously busy]


Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x, with LD, ML and ST operations per iteration:
  – Basic block scheduling: 3 cycles/iteration
  – Unrolling (3 times): 5/3 cycles/iteration
  – Software pipelining: 1 cycle/iteration

[Figure: issue slots per cycle for the three schemes; with software pipelining the LD of iteration i+2, the ML of i+1 and the ST of i execute in the same cycle]


Software pipelining (cont'd)

Basic loop scheduling techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge

This algorithm is the one most used in commercial compilers.


Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

(a) Example loop:
  for (i = 0; i < n; i++)
      A[i+6] = 3*A[i] - 1;

(b) Code (without loop control):
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)

(c) Software pipeline: each new iteration starts before the previous one has finished. The first cycles form the Prologue, the repeating steady state (ld of iteration i+3, mul of i+2, sub of i+1 and st of i in the same cycle) forms the Kernel, and the final cycles form the Epilogue.

• The Prologue fills the SW pipeline with iterations
• The Epilogue drains the SW pipeline


Software pipelining: determine II, the Initiation Interval

  for (i = 0; ...)
      A[i+6] = 3*A[i] - 1;

Cyclic data dependences, annotated with (delay, iteration distance):

  ld r1,(r2) --(1,0)--> mul r3,r1,3 --(1,0)--> sub r4,r3,1 --(1,0)--> st r4,(r5)

with anti-dependences (0,1) in the reverse direction, and the recurrence edge

  st r4,(r5) --(1,6)--> ld r1,(r2)

since the store to A[i+6] feeds the load of A[i] six iterations later.

The scheduling constraint per edge (u,v) is:

  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)


Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:

  MII = max{ ResMinII, RecMinII }

Resources:

  ResMinII = max over all resources r of ⌈ used(r) / available(r) ⌉

Cycles: summing the per-edge constraint cycle(v) ≥ cycle(u) + delay(e) - II · distance(e) around a dependence cycle c gives:

  sum over e in c of delay(e) ≤ II · (sum over e in c of distance(e))

Therefore:

  RecMinII = min { II ∈ N | II > 0, for all cycles c: sum over e in c of delay(e) ≤ II · sum over e in c of distance(e) }

Or equivalently:

  RecMinII = max over all cycles c of ⌈ (sum over e in c of delay(e)) / (sum over e in c of distance(e)) ⌉
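For the example loop these bounds are easy to compute. A Python sketch; the per-resource usage and availability numbers in the test are assumptions about the target machine, not from the slide:

```python
from math import ceil

# (delay, iteration distance) per dependence edge of the example loop
# A[i+6] = 3*A[i] - 1; the st -> ld edge closes the recurrence with distance 6.
edges = {("ld", "mul"): (1, 0), ("mul", "sub"): (1, 0),
         ("sub", "st"): (1, 0), ("st", "ld"): (1, 6)}

def rec_min_ii(cycles):
    """cycles: list of dependence cycles, each given as a list of edge keys."""
    return max(ceil(sum(edges[e][0] for e in c) /
                    sum(edges[e][1] for e in c)) for c in cycles)

def res_min_ii(used, available):
    return max(ceil(used[r] / available[r]) for r in used)

def mii(cycles, used, available):
    return max(res_min_ii(used, available), rec_min_ii(cycles))
```

The single cycle ld → mul → sub → st → ld has total delay 4 and total distance 6, so RecMinII = ⌈4/6⌉ = 1: the recurrence is slack enough to start a new iteration every cycle.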


Let's go back to: The Role of the Compiler

9 steps are required to translate an HLL program (see the online book chapter):

1. Front-end compilation

2. Determine dependencies

3. Graph partitioning: make multiple threads (or tasks)

4. Bind partitions to compute nodes

5. Bind operands to locations

6. Bind operations to time slots: Scheduling

7. Bind operations to functional units

8. Bind transports to buses

9. Execute operations and perform transports


Division of responsibilities between hardware and compiler

[Figure: the translation pipeline Frontend → Determine Dependencies → Binding of Operands → Scheduling → Binding of Operations → Binding of Transports → Execute. The split between compiler responsibility (left of the line) and hardware responsibility (right of the line) shifts per architecture class: going from Superscalar via Dataflow, Multi-threaded, Independence architectures and VLIW to TTA, the compiler takes over progressively more of these steps]


Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework


Mapping applications to processors: the MOVE framework

[Figure: in the MOVE framework, architecture parameters drive a parametric compiler and a hardware generator; an optimizer, guided by feedback and user interaction, explores the design space; the outputs are parallel object code and a TTA based chip, and the explored design points form a Pareto curve (execution time vs. cost) of the solution space]


TTA (MOVE) organization

[Figure: a TTA consists of function units (integer ALUs, a float ALU, load/store units), register files (integer, float, boolean), an instruction unit and an immediate unit, all connected through sockets to the transport buses, together with data and instruction memories]


Code generation trajectory for TTAs

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The backend is driven by an architecture description and by profiling data; both the sequential and the parallel code can be simulated (with input/output) for verification]

• Frontend: GCC or SUIF (adapted)


Exploration: TTA resource reduction


Exploration: TTA connectivity reduction

[Figure: execution time as a function of the number of connections removed; removing connections reduces bus delay, and execution time stays nearly flat until critical connections disappear, after which it rises sharply; eventually the FU stage constrains the cycle time]


Can we do better? How?

• Code Transformations
• SFUs: Special Function Units
• Vector processing
• Multiple Processors

[Figure: cost vs. execution time trade-off of the solution space]


Transforming the specification (1)

Based on associativity of the + operation:
  a + (b + c) = (a + b) + c

[Figure: a linear chain of dependent additions rebalanced into a balanced addition tree, reducing the critical path (tree height reduction)]


Transforming the specification (2)

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

can be transformed into:

  r = 2*b - a;
  x = z + y;

since r = f - e = (2*b + d) - (a + d) = 2*b - a. The whole DDG collapses: r needs only a shift (2*b = b << 1) and a subtract, plus the independent add for x.

[Figure: the original DDG next to the reduced DDG built from <<, - and +]


Changing the architecture: adding SFUs (special function units)

[Figure: three dependent 2-input additions collapsed into a single 4-input adder SFU]

4-input adder: why is this faster?


Changing the architecture: adding SFUs (special function units)

In the extreme case, put everything into one unit!
• Spatial mapping: no control flow
• However: no flexibility / programmability!
  – but one could use FPGAs


SFUs: fine grain patterns
• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports a whole application domain!
    • coarse grain would only help certain specific applications
• Which patterns need support?
  – Detection of recurring operation patterns needed


SFUs: covering results

Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!


Exploration: resulting architecture

Architecture for image processing:
• 9 buses, 4 RFs
• 4 Adder/cmp FUs, 2 Multiplier FUs, 2 Diffadd FUs (SFUs)
• stream input and stream output units
• Note the reduced connectivity


Conclusions
• Billions of embedded processing systems per year
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can easily be tuned to the application domain
• TTAs are even more flexible, scalable, and lower power


Conclusions
• Compilation for ILP architectures is mature
  – used in commercial compilers
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques are needed to exploit ILP


Bottom line:


Hands-on 1 (2014)

• HOW FAR ARE YOU?
• VLIW processor of Silicon Hive (Intel)
• Map your algorithm
• Optimize the mapping
• Optimize the architecture
• Perform DSE (Design Space Exploration), trading off (=> Pareto curves):
  – Performance,
  – Energy, and
  – Area (= Cost)