Embedded Computer Architecture
TU/e 5KK73, Henk Corporaal
VLIW architectures: Generating VLIW code



04/19/23 Embedded Computer Architecture H. Corporaal, and B. Mesman 2

VLIW lectures overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering and Reconfigurable components
• Code generation
  – compiler basics
  – mapping and scheduling
  – TTA code generation
  – Design space exploration
• Hands-on


Compiler basics
• Overview
  – Compiler trajectory / structure / passes
  – Control Flow Graph (CFG)
  – Mapping and Scheduling
  – Basic block list scheduling
  – Extended scheduling scope
  – Loop scheduling
  – Loop transformations (separate lecture)


Compiler basics: trajectory

  Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program

Error messages are reported along the way; library code is added at link time.


Compiler basics: structure / passes

  Source code
    → Lexical analyzer (token generation)
    → Parsing (check syntax, check semantics, parse tree generation)
    → Intermediate code
    → Code optimization (data flow analysis, local optimizations, global optimizations)
    → Code generation (code selection, peephole optimizations) → Sequential code
    → Register allocation (building the interference graph, graph coloring, spill code insertion, caller/callee save and restore code)
    → Scheduling and allocation (exploiting ILP)
    → Object code


Compiler basics: structure
Simple example: from HLL to (sequential) assembly code

  position := initial + rate * 60

Lexical analyzer:
  id1 := id2 + id3 * 60

Syntax analyzer (parse tree):
  :=
    id1
    +
      id2
      *
        id3
        60

Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3

Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1

Code generator:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1


Compiler basics: Control Flow Graph (CFG)

C input code:
  if (a > b) { r = a % b; }
  else { r = b % a; }

The CFG shows the flow between basic blocks:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...

A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.
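Basic blocks can be recovered from a linear instruction list by finding "leaders" (the first instruction, branch targets, and fall-through successors of branches). A minimal Python sketch; the tuple encoding of instructions and the concrete example are made up for illustration, with branch targets given as instruction indices:

```python
# Sketch: splitting a linear instruction list into basic blocks via leaders.
# instrs: list of (opcode, branch_target_or_None).

def find_leaders(instrs):
    leaders = {0}                       # rule 1: the first instruction leads
    for i, (op, target) in enumerate(instrs):
        if op in ("goto", "bgz"):       # a branch makes its target a leader...
            leaders.add(target)
            if i + 1 < len(instrs):     # ...and its fall-through successor too
                leaders.add(i + 1)
    return sorted(leaders)

def basic_blocks(instrs):
    """Each block runs from one leader up to (not including) the next."""
    ls = find_leaders(instrs)
    return [list(range(a, b)) for a, b in zip(ls, ls[1:] + [len(instrs)])]

# the if/else example: 0: sub; 1: bgz -> 4; 2: rem; 3: goto -> 5; 4: rem; 5: ...
instrs = [("sub", None), ("bgz", 4), ("rem", None),
          ("goto", 5), ("rem", None), ("nop", None)]
```

Running `basic_blocks(instrs)` splits the six instructions into four blocks, matching the four basic blocks of the CFG above.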


Compiler basics: Basic optimizations
• Machine independent optimizations
• Machine dependent optimizations


Compiler basics: Basic optimizations
• Machine independent optimizations
  – Common subexpression elimination
  – Constant folding
  – Copy propagation
  – Dead-code elimination
  – Induction variable elimination
  – Strength reduction
  – Algebraic identities
    • Commutative expressions
    • Associativity: tree height reduction
      – Note: not always allowed (due to limited precision)
• For details check any good compiler book!


Compiler basics: Basic optimizations
• Machine dependent optimization example
  – What's the optimal implementation of a*34?
  – Use multiplier: mul Tb, Ta, 34
    • Pro: no thinking required
    • Con: may take many cycles
  – Alternative:
      SHL Tb, Ta, 1
      SHL Tc, Ta, 5
      ADD Tb, Tb, Tc
    • Pro: may take fewer cycles
    • Cons: uses more registers; additional instructions (I-cache load / code size)
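The shift/add alternative can be sanity-checked in a few lines: since 34 = 2 + 32, a*34 = (a<<1) + (a<<5). A quick Python sketch mirroring the SHL/SHL/ADD sequence:

```python
# Sketch: strength reduction of a constant multiply, as a backend might emit it.
def mul_by_34(a):
    tb = a << 1        # SHL Tb, Ta, 1   (a * 2)
    tc = a << 5        # SHL Tc, Ta, 5   (a * 32)
    return tb + tc     # ADD Tb, Tb, Tc  (a * 34)
```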


Compiler basics: Register allocation

• Register organization
  – Conventions needed for parameter passing and for register usage across function calls

[Figure: example register file layout (r0..r31), partitioned into: hard-wired 0 (r0), function argument and result transfer, other temporaries, caller saved registers, and callee saved registers]


Register allocation using graph coloring

Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?

Some definitions:
• A variable is defined at a point in a program when a value is assigned to it.
• A variable is used at a point in a program when its value is referenced in an expression.
• The live range of a variable is the execution range between its definition and its last use.
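These definitions translate directly into code. A small Python sketch that computes live ranges for straight-line code; the statement encoding (a pair of defined variable and used variables) is an assumption for illustration:

```python
# Sketch: live ranges from def/use information in straight-line code.
def live_ranges(stmts):
    """stmts: list of (defined_var_or_None, [used_vars]).
    Returns {var: (def_index, last_use_index)}."""
    rng = {}
    for i, (d, uses) in enumerate(stmts):
        if d is not None and d not in rng:
            rng[d] = [i, i]          # the definition opens the live range
        for u in uses:
            rng[u][1] = i            # every use extends it to this statement
    return {v: tuple(r) for v, r in rng.items()}

# the slide's program: a:=; c:=; b:=; :=b; d:=; :=a; :=c; :=d
stmts = [("a", []), ("c", []), ("b", []), (None, ["b"]),
         ("d", []), (None, ["a"]), (None, ["c"]), (None, ["d"])]
```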


Register allocation using graph coloring

Program:
  a :=
  c :=
  b :=
     := b
  d :=
     := a
     := c
     := d

[Figure: live ranges of a, b, c and d; each range runs from the definition of the variable to its last use]


Register allocation using graph coloring

Interference graph: an edge connects two variables whose live ranges overlap. For the program above: a-b, a-c, a-d, b-c, c-d.

Coloring:
  a = red
  b = green
  c = blue
  d = green

The graph needs 3 colors => the program needs 3 registers.

Question: map coloring requires (at most) 4 colors; what is the maximum number of colors (= registers) needed for register interference graph coloring?
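A minimal sketch of (non-optimal) greedy coloring on this interference graph, assuming the edge set implied by the live ranges above; colors are small integers instead of names:

```python
# Sketch: greedy smallest-available-color coloring of an interference graph.
def color(graph):
    colors = {}
    for v in graph:                    # visit order affects greedy quality
        taken = {colors[n] for n in graph[v] if n in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in taken)
    return colors

graph = {"a": {"b", "c", "d"}, "b": {"a", "c"},
         "c": {"a", "b", "d"}, "d": {"a", "c"}}
```

On this graph the greedy pass uses 3 colors, matching the slide (b and d share one, since their live ranges do not overlap).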


Register allocation using graph coloring: spill/reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!

Program:
  a :=
  c :=
  store c        // spill c
  b :=
     := b
  d :=
     := a
  load c         // reload c
     := c
     := d

[Figure: the live range of c is split by the spill, so at most two of a, b, c, d are live at once]


Register allocation for a monolithic RF

Scheme of the optimistic register allocator:

  Renumber → Build → Spill costs → Simplify → Select
  (Spill code is inserted and the allocator reruns when Select fails to color a node)

The Select phase selects a color (= machine register) for a variable such that the heuristic h is minimized:

  h = fdep(col, var) + caller_callee(col, var)

where:
  fdep(col, var): a measure for the introduction of false dependencies
  caller_callee(col, var): the cost of mapping var on a caller or callee saved register


Some explanation of the register allocation phases:

• [Renumber:] The first phase finds all live ranges in a procedure and numbers (renames) them uniquely.

• [Build:] This phase constructs the interference graph.

• [Spill costs:] In preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.

• [Simplify:] This phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.

• [Select:] Colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph and given a color distinct from its neighbors. Whenever no color is available for some node, it is left uncolored and the allocator continues with the next node.

• [Spill code:] In the final phase spill code is inserted for the live ranges of all uncolored nodes.

• Some symbolic registers must be mapped on a specific machine register (like the stack pointer). These registers get their color in the simplify stage instead of being pushed on the stack.

• The other machine registers are divided into caller-saved and callee-saved registers, and the allocator computes both a caller-saved and a callee-saved cost. The caller-saved cost for a symbolic register applies when it has a live range across a procedure call; the cost is twice the execution frequency of its transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.


Compiler basics: Code selection

• CISC era (before 1985)
  – Code size important
  – Determine the shortest code sequence
    • Many options may exist
  – Pattern matching. Example, M68020:
      D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]
      maps onto the single instruction: ADD ([10,A1], D2*16, 20), D1
• RISC era
  – Performance important
  – Only few possible code sequences
  – New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020


Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
  – Compiler basics
  – Mapping and Scheduling of Operations
    • What is scheduling
    • Basic Block Scheduling
    • Extended Basic Block Scheduling
    • Loop Scheduling
• Design Space Exploration: TTA framework


Mapping / Scheduling = placing operations in space and time

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

[Figure: the Data Dependence Graph (DDG) of this code: multiplies a*b and 2*b, adds producing e and f (both consuming d), a subtract producing r = f - e, and an independent add x = z + y]


How to map these operations?

Architecture constraints:
• One function unit
• All operations single cycle latency

[Figure: on a single FU the six DDG operations are issued one per cycle, occupying cycles 1 through 6 (*, *, +, +, -, +)]


How to map these operations?

Architecture constraints:
• One Add-sub and one Mul unit
• All operations single cycle latency

[Figure: with the Mul and Add-sub units operating in parallel, the DDG now fits in 4 cycles]


There are many mapping solutions

[Figure: Pareto graph of the solution space, plotting execution time against cost; the Pareto points form the lower-left frontier of the point cloud]

Point x is Pareto ⟺ there is no point y for which y_i < x_i for all dimensions i.
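The Pareto condition translates directly into a filter. A small Python sketch over (cost, execution time) pairs; it uses weak dominance, so a point that is equal in one dimension and worse in the other is also removed:

```python
# Sketch: keep only the Pareto points of a multi-objective solution set.
def pareto(points):
    def dominates(y, x):
        # y is at least as good in every dimension and differs somewhere
        return y != x and all(yi <= xi for yi, xi in zip(y, x))
    return [x for x in points if not any(dominates(y, x) for y in points)]
```

For example, `pareto([(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)])` drops (2, 6) and (4, 4), which are dominated by (2, 4) and (3, 3) respectively.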


Scheduling: Overview

Transforming a sequential program into a parallel program:

  read sequential program
  read machine description file
  for each procedure do
      perform function inlining
  for each procedure do
      transform an irreducible CFG into a reducible CFG
      perform control flow analysis
      perform loop unrolling
      perform data flow analysis
      perform memory reference disambiguation
      perform register allocation
      for each scheduling scope do
          perform instruction scheduling
  write out the parallel program


Basic Block Scheduling

• Basic block = piece of code which can only be entered from the top (first instruction) and left at the bottom (final instruction)
• Scheduling a basic block = assigning resources and a cycle to every operation
• List scheduling = heuristic scheduling approach, scheduling the operations one by one
  – Time complexity = O(N), where N is the number of operations
• Optimal scheduling has time complexity = O(exp(N))
• Question: what is a good scheduling heuristic?


Basic Block Scheduling
• Make a Data Dependence Graph (DDG)
• Determine the minimal length of the DDG (for the given architecture)
  – minimal number of cycles to schedule the graph (assuming sufficient resources)
• Determine:
  – ASAP (As Soon As Possible) cycle = earliest cycle the instruction can be scheduled
  – ALAP (As Late As Possible) cycle = latest cycle the instruction can be scheduled
  – Slack of each operation = ALAP – ASAP
  – Priority of operations = f(slack, #descendants, register impact, ...)
• Place each operation in the first cycle with sufficient resources
• Notes:
  – Basic block = a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  – Scheduling order is sequential
  – Scheduling priority is determined by the used heuristic, e.g. slack + other contributions
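ASAP, ALAP and slack can be computed with two traversals of the DDG. A Python sketch assuming single-cycle operations and a successor-map representation; the tiny example DAG in the test is made up, not the slide's figure:

```python
# Sketch: ASAP/ALAP cycles (1-based) and slack for a DAG of single-cycle ops.
def asap_alap(succs):
    """succs: op -> list of successor ops. Returns (asap, alap, slack)."""
    preds = {v: [] for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)
    asap, alap = {}, {}

    def a(v):                                  # earliest cycle: after all preds
        if v not in asap:
            asap[v] = 1 + max((a(p) for p in preds[v]), default=0)
        return asap[v]
    for v in succs:
        a(v)
    length = max(asap.values())                # minimal schedule length

    def l(v):                                  # latest cycle: before all succs
        if v not in alap:
            alap[v] = min((l(s) for s in succs[v]), default=length + 1) - 1
        return alap[v]
    for v in succs:
        l(v)
    return asap, alap, {v: alap[v] - asap[v] for v in succs}
```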


Basic Block Scheduling: determine ASAP and ALAP cycles

[Figure: a DDG with LD, MUL, ADD, SUB and NEG operations over inputs A, B and C; each node is annotated with its <ASAP, ALAP> pair (e.g. <1,1>, <1,3>, <2,4>, <3,3>, <4,4>); the difference ALAP - ASAP is the slack]

We assume all operations are single cycle!


Cycle based list scheduling

  proc Schedule(DDG = (V,E))
      ready  = { v | ¬∃(u,v) ∈ E }           /* operations without predecessors */
      ready' = ready
      sched = ∅
      current_cycle = 0
      while sched ≠ V do
          for each v ∈ ready' (select in priority order) do
              if ¬ResourceConfl(v, current_cycle, sched) then
                  cycle(v) = current_cycle
                  sched = sched ∪ {v}
              endif
          endfor
          current_cycle = current_cycle + 1
          ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
          ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
      endwhile
  endproc
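A runnable sketch of this cycle-based list scheduler; the resource-conflict check is simplified to an issue-width limit `nunits` and all delays are fixed to 1 (both simplifications are assumptions, not part of the slide's algorithm):

```python
# Sketch: cycle-based list scheduling with a simplified resource model.
def list_schedule(succs, priority, nunits=1):
    """succs: op -> successor list; priority: sort key; at most nunits ops/cycle."""
    preds = {v: set() for v in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)
    cycle, sched, t = {}, set(), 1
    while len(sched) < len(succs):
        # ready': unscheduled ops whose predecessors all complete before cycle t
        ready = [v for v in succs if v not in sched
                 and all(p in sched and cycle[p] + 1 <= t for p in preds[v])]
        for v in sorted(ready, key=priority)[:nunits]:   # issue in priority order
            cycle[v] = t
            sched.add(v)
        t += 1
    return cycle
```

With one unit, two independent producers and their common consumer take 3 cycles; with two units the producers issue together and the schedule shrinks to 2 cycles.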


Extended Scheduling Scope: look at the CFG

Code:
  A;
  If cond Then B Else C;
  D;
  If cond Then E Else F;
  G;

[Figure: the corresponding CFG (Control Flow Graph): A branches to B and C, which join in D; D branches to E and F, which join in G]

Q: Why enlarge the scheduling scope?


Extended basic block scheduling: Code Motion

[Figure: CFG with block A containing (a) add r3, r4, 4 and (b) beq ...; block B containing (c) add r1, r1, r2; block C containing (d) sub r3, r3, r2; both reaching block D containing (e) mul r1, r1, r3]

• Downward code motions?
  – a → B, a → C, a → D, c → D, d → D
• Upward code motions?
  – c → A, d → A, e → B, e → C, e → A

Q: Why move code?


Possible Scheduling Scopes:
• Trace
• Superblock
• Decision tree
• Hyperblock/region


Create and Enlarge Scheduling Scope

[Figure: the example CFG (A; B|C; D; E|F; G). A Trace selects one path, e.g. A-B-D-E-G. A Superblock is formed from the trace by tail duplication: the side entries from C and F are redirected to duplicated blocks D', E', G', so the superblock can only be entered at the top]


Create and Enlarge Scheduling Scope

[Figure: the same CFG turned into a Decision Tree (tail duplication of the join blocks yields D', E', F', G', G'') and into a Hyperblock/region (the alternative paths B and C are merged, with join points still allowed)]


Comparing scheduling scopes

                             Trace   Sup.block   Hyp.block   Dec.Tree   Region
  Multiple exc. paths        No      No          Yes         Yes        Yes
  Side-entries allowed       Yes     No          No          No         No
  Join points allowed        Yes     No          Yes         No         Yes
  Code motion down joins     Yes     No          No          No         No
  Must be if-convertible     No      No          Yes         No         No
  Tail dup. before sched.    No      Yes         No          Yes        No


Code movement (upwards) within regions: what to check?

[Figure: an add is moved from its source block up to a destination block; on other paths into the source block a copy of the operation is needed, and in intermediate blocks the moved operation must be checked for off-liveness of its destination register]


Extended basic block scheduling: Code Motion

• A dominates B ⟺ A is always executed before B
  – Consequently: A does not dominate B ⟹ code motion from B to A requires code duplication
• B post-dominates A ⟺ B is always executed after A
  – Consequently: B does not post-dominate A ⟹ code motion from B to A is speculative

[Figure: CFG with entry block A, blocks B and C, blocks D and E, and exit block F]

Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
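Dominators can be computed with the standard iterative data-flow algorithm. A Python sketch; the edge set below is an assumed reading of the figure (A → B,C; B → D; C → D,E; D,E → F), so take the algorithm from it, not the concrete answers:

```python
# Sketch: iterative dominator computation on a CFG given as a successor map.
def dominators(succs, entry):
    nodes = set(succs)
    preds = {v: {u for u in nodes if v in succs[u]} for v in nodes}
    dom = {v: set(nodes) for v in nodes}   # start from "everything dominates v"
    dom[entry] = {entry}
    changed = True
    while changed:                         # iterate to the greatest fixed point
        changed = False
        for v in nodes - {entry}:
            new = {v} | set.intersection(*(dom[p] for p in preds[v]))
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom

# assumed CFG for illustration
succs = {"A": {"B", "C"}, "B": {"D"}, "C": {"D", "E"},
         "D": {"F"}, "E": {"F"}, "F": set()}
```

On this assumed CFG, C dominates E (E is only reachable through C) but not D (D is also reachable via B).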


Scheduling: Loops

Loop optimizations (on a loop with header A, body B-C, exit D):

• Loop peeling: a copy of the loop body (C') is peeled off and placed before the loop.
• Loop unrolling: the loop body is duplicated (C', C'') inside the loop, reducing loop overhead and enlarging the scheduling scope.


Scheduling: Loops

Problems with unrolling:
• Exploits only parallelism within sets of n iterations
• Iteration start-up latency
• Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining; only software pipelining keeps the resources continuously busy]


Software pipelining

• Software pipelining a loop is:
  – Scheduling the loop such that iterations start before preceding iterations have finished
  Or:
  – Moving operations across the backedge

Example: y = a.x, with LD, ML and ST operations per iteration:
  – Basic block scheduling: 3 cycles/iteration
  – Unrolling (3 times): 5/3 cycles/iteration
  – Software pipelining: 1 cycle/iteration

[Figure: issue slots per cycle for the three schemes; with software pipelining the LD of iteration i+2, the ML of i+1 and the ST of i execute in the same cycle]


Software pipelining (cont'd)

Basic loop scheduling techniques:
• Modulo scheduling (Rau, Lam)
  – list scheduling with modulo resource constraints
• Kernel recognition techniques
  – unroll the loop
  – schedule the iterations
  – identify a repeating pattern
  – Examples:
    • Perfect pipelining (Aiken and Nicolau)
    • URPR (Su, Ding and Xia)
    • Petri net pipelining (Allan)
• Enhanced pipeline scheduling (Ebcioğlu)
  – fill first cycle of iteration
  – copy this instruction over the backedge

This algorithm is the one most used in commercial compilers.


Software pipelining: Modulo scheduling

Example: modulo scheduling a loop

(a) Example loop:
  for (i = 0; i < n; i++)
      A[i+6] = 3*A[i] - 1;

(b) Code (without loop control):
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)

(c) Software pipeline: each new iteration starts before the previous one has finished. The first cycles form the Prologue, the repeating steady state (ld of iteration i+3, mul of i+2, sub of i+1 and st of i in the same cycle) forms the Kernel, and the final cycles form the Epilogue.

• The Prologue fills the SW pipeline with iterations
• The Epilogue drains the SW pipeline


Software pipelining: determine II, the Initiation Interval

  for (i = 0; ...)
      A[i+6] = 3*A[i] - 1;

Cyclic data dependences, annotated with (delay, iteration distance):

  ld r1,(r2) --(1,0)--> mul r3,r1,3 --(1,0)--> sub r4,r3,1 --(1,0)--> st r4,(r5)

with anti-dependences (0,1) in the reverse direction, and the recurrence edge

  st r4,(r5) --(1,6)--> ld r1,(r2)

since the store to A[i+6] feeds the load of A[i] six iterations later.

The scheduling constraint per edge (u,v) is:

  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)


Modulo scheduling constraints

MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:

  MII = max{ ResMinII, RecMinII }

Resources:

  ResMinII = max over all resources r of ⌈ used(r) / available(r) ⌉

Cycles: summing the per-edge constraint cycle(v) ≥ cycle(u) + delay(e) - II · distance(e) around a dependence cycle c gives:

  sum over e in c of delay(e) ≤ II · (sum over e in c of distance(e))

Therefore:

  RecMinII = min { II ∈ N | II > 0, for all cycles c: sum over e in c of delay(e) ≤ II · sum over e in c of distance(e) }

Or equivalently:

  RecMinII = max over all cycles c of ⌈ (sum over e in c of delay(e)) / (sum over e in c of distance(e)) ⌉
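For the example loop these bounds are easy to compute. A Python sketch; the per-resource usage and availability numbers in the test are assumptions about the target machine, not from the slide:

```python
from math import ceil

# (delay, iteration distance) per dependence edge of the example loop
# A[i+6] = 3*A[i] - 1; the st -> ld edge closes the recurrence with distance 6.
edges = {("ld", "mul"): (1, 0), ("mul", "sub"): (1, 0),
         ("sub", "st"): (1, 0), ("st", "ld"): (1, 6)}

def rec_min_ii(cycles):
    """cycles: list of dependence cycles, each given as a list of edge keys."""
    return max(ceil(sum(edges[e][0] for e in c) /
                    sum(edges[e][1] for e in c)) for c in cycles)

def res_min_ii(used, available):
    return max(ceil(used[r] / available[r]) for r in used)

def mii(cycles, used, available):
    return max(res_min_ii(used, available), rec_min_ii(cycles))
```

The single cycle ld → mul → sub → st → ld has total delay 4 and total distance 6, so RecMinII = ⌈4/6⌉ = 1: the recurrence is slack enough to start a new iteration every cycle.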


Let's go back to: The Role of the Compiler

9 steps are required to translate an HLL program (see the online book chapter):

1. Front-end compilation

2. Determine dependencies

3. Graph partitioning: make multiple threads (or tasks)

4. Bind partitions to compute nodes

5. Bind operands to locations

6. Bind operations to time slots: Scheduling

7. Bind operations to functional units

8. Bind transports to buses

9. Execute operations and perform transports


Division of responsibilities between hardware and compiler

[Figure: the translation pipeline Frontend → Determine Dependencies → Binding of Operands → Scheduling → Binding of Operations → Binding of Transports → Execute. The split between compiler responsibility (left of the line) and hardware responsibility (right of the line) shifts per architecture class: going from Superscalar via Dataflow, Multi-threaded, Independence architectures and VLIW to TTA, the compiler takes over progressively more of these steps]


Overview
• Enhance performance: architecture methods
• Instruction Level Parallelism
• VLIW
• Examples
  – C6
  – TM
  – TTA
• Clustering
• Code generation
• Design Space Exploration: TTA framework


Mapping applications to processors: the MOVE framework

[Figure: in the MOVE framework, architecture parameters drive a parametric compiler and a hardware generator; an optimizer, guided by feedback and user interaction, explores the design space; the outputs are parallel object code and a TTA based chip, and the explored design points form a Pareto curve (execution time vs. cost) of the solution space]


TTA (MOVE) organization

[Figure: a TTA consists of function units (integer ALUs, a float ALU, load/store units), register files (integer, float, boolean), an instruction unit and an immediate unit, all connected through sockets to the transport buses, together with data and instruction memories]


Code generation trajectory for TTAs

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. The backend is driven by an architecture description and by profiling data; both the sequential and the parallel code can be simulated (with input/output) for verification]

• Frontend: GCC or SUIF (adapted)


Exploration: TTA resource reduction


Exploration: TTA connectivity reduction

[Figure: execution time as a function of the number of connections removed; removing connections reduces bus delay, and execution time stays nearly flat until critical connections disappear, after which it rises sharply; eventually the FU stage constrains the cycle time]


Can we do better? How?

• Code Transformations
• SFUs: Special Function Units
• Vector processing
• Multiple Processors

[Figure: cost vs. execution time trade-off of the solution space]


Transforming the specification (1)

Based on associativity of the + operation:
  a + (b + c) = (a + b) + c

[Figure: a linear chain of dependent additions rebalanced into a balanced addition tree, reducing the critical path (tree height reduction)]


Transforming the specification (2)

  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;

can be transformed into:

  r = 2*b - a;
  x = z + y;

since r = f - e = (2*b + d) - (a + d) = 2*b - a. The whole DDG collapses: r needs only a shift (2*b = b << 1) and a subtract, plus the independent add for x.

[Figure: the original DDG next to the reduced DDG built from <<, - and +]


Changing the architecture: adding SFUs (special function units)

[Figure: three dependent 2-input additions collapsed into a single 4-input adder SFU]

4-input adder: why is this faster?


Changing the architecture: adding SFUs (special function units)

In the extreme case, put everything into one unit!
• Spatial mapping: no control flow
• However: no flexibility / programmability!
  – but one could use FPGAs


SFUs: fine grain patterns
• Why use fine grain SFUs:
  – Code size reduction
  – Register file #ports reduction
  – Could be cheaper and/or faster
  – Transport reduction
  – Power reduction (avoid charging non-local wires)
  – Supports a whole application domain!
    • coarse grain would only help certain specific applications
• Which patterns need support?
  – Detection of recurring operation patterns needed


SFUs: covering results

Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!


Exploration: resulting architecture

Architecture for image processing:
• 9 buses, 4 RFs
• 4 Adder/cmp FUs, 2 Multiplier FUs, 2 Diffadd FUs (SFUs)
• stream input and stream output units
• Note the reduced connectivity


Conclusions
• Billions of embedded processing systems per year
  – how to design these systems quickly, cheaply, correctly, at low power, ...?
  – what will their processing platform look like?
• VLIWs are very powerful and flexible
  – can easily be tuned to the application domain
• TTAs are even more flexible, scalable, and lower power


Conclusions
• Compilation for ILP architectures is mature
  – used in commercial compilers
• However
  – Great discrepancy between available and exploitable parallelism
• Advanced code scheduling techniques are needed to exploit ILP


Bottom line:


Hands-on 1 (2014)

• HOW FAR ARE YOU?
• VLIW processor of Silicon Hive (Intel)
• Map your algorithm
• Optimize the mapping
• Optimize the architecture
• Perform DSE (Design Space Exploration), trading off (=> Pareto curves):
  – Performance,
  – Energy, and
  – Area (= Cost)