UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT
CISC 673: Optimizing Compilers
Spring 2009
Instruction Scheduling
John Cavazos, University of Delaware
Instruction Scheduling

Reordering instructions to improve performance:
- Takes anticipated latencies into account
- Machine-specific
- Performed late in the optimization pass
- Exploits Instruction-Level Parallelism (ILP)
Modern Architecture Features
- Superscalar: multiple logic units
- Multiple issue: 2 or more instructions issued per cycle
- Speculative execution: branch predictors, speculative loads
- Deep pipelines
Types of Instruction Scheduling
- Local scheduling: basic block scheduling
- Global scheduling: trace scheduling, superblock scheduling, software pipelining
Scheduling for Different Computer Architectures
- Out-of-order issue: scheduling is useful
- In-order issue: scheduling is very important
- VLIW: scheduling is essential!
Challenges to ILP
- Structural hazards: insufficient resources to exploit parallelism
- Data hazards: an instruction depends on the result of a previous instruction still in the pipeline
- Control hazards: branches and jumps modify the PC, affecting which instructions should be in the pipeline
Recall from Architecture…
- IF – Instruction Fetch
- ID – Instruction Decode
- EX – Execute
- MA – Memory Access
- WB – Write Back

With no hazards, consecutive instructions overlap in the pipeline, each one stage behind the previous:

i1: IF ID EX MA WB
i2:    IF ID EX MA WB
i3:       IF ID EX MA WB
Structural Hazards

addf R3,R1,R2   IF ID EX EX MA WB
addf R3,R3,R4      IF ID stall EX EX MA WB

- Assumes floating-point ops take 2 execute cycles
- Instruction latency: execute takes > 1 cycle
Data Hazards

lw  R1,0(R2)    IF ID EX MA WB
add R3,R1,R4       IF ID stall EX MA WB

- Memory latency: data not ready
Control Hazards

Taken Branch        IF ID EX MA WB
Instr + 1              IF --- --- --- ---
Branch Target             IF ID EX MA WB
Branch Target + 1            IF ID EX MA WB

The instruction after a taken branch is fetched and then squashed; the branch target cannot be fetched until the branch resolves.
Basic Block Scheduling

For each basic block:
- Construct a directed acyclic graph (DAG) using dependences between statements
  - Node = statement / instruction
  - Edge (a,b) = statement a must execute before b
- Schedule instructions using the DAG
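As a sketch (not from the slides), the DAG construction above can be written in a few lines of Python; the `(dest, sources)` tuple IR is a hypothetical format chosen for illustration:

```python
from collections import defaultdict

def build_dependence_dag(instrs):
    """Build the dependence DAG of one basic block.

    instrs: list of (dest, sources) tuples, in program order.
    An edge i -> j means instruction i must execute before j.
    """
    edges = defaultdict(set)
    for i, (dst_i, srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dst_j, srcs_j = instrs[j]
            raw = dst_i in srcs_j      # j reads what i wrote
            war = dst_j in srcs_i      # j overwrites what i read
            waw = dst_i == dst_j       # both write the same register
            if raw or war or waw:
                edges[i].add(j)
    return edges

# r1 = r2 + r3; r4 = r1 * 6; r5 = r6 + 1
dag = build_dependence_dag([("r1", ["r2", "r3"]),
                            ("r4", ["r1"]),
                            ("r5", ["r6", "1"])])
print(sorted(dag[0]))  # [1]: the RAW edge on r1
```

The pairwise comparison is exactly the worst-case O(n²) step discussed later under complexity.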
Data Dependences

If two operations access the same register and one access is a write, they are dependent.

Types of data dependences:
- RAW (read after write):
  r1 = r2 + r3
  r4 = r1 * 6
- WAW (write after write):
  r1 = r2 + r3
  r1 = r4 * 6
- WAR (write after read):
  r1 = r2 + r3
  r2 = r5 * 6

Cannot reorder two dependent instructions.
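The three cases map directly onto a small classifier; a sketch, again using a hypothetical `(dest, sources)` instruction format:

```python
def dependence_type(first, second):
    """Classify the dependence from first to second (program order)."""
    dst1, srcs1 = first
    dst2, srcs2 = second
    if dst1 in srcs2:
        return "RAW"   # true dependence: second reads first's result
    if dst1 == dst2:
        return "WAW"   # output dependence
    if dst2 in srcs1:
        return "WAR"   # anti-dependence
    return None        # independent: safe to reorder

# The three examples above:
print(dependence_type(("r1", ["r2", "r3"]), ("r4", ["r1"])))  # RAW
print(dependence_type(("r1", ["r2", "r3"]), ("r1", ["r4"])))  # WAW
print(dependence_type(("r1", ["r2", "r3"]), ("r2", ["r5"])))  # WAR
```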
Basic Block Scheduling Example

Original schedule:
a) lw R2, (R1)
b) lw R3, 4(R1)
c) R4 = R2 + R3
d) R5 = R2 - 1

Dependence DAG: edges a→c, b→c, and a→d, each with latency 2 (a load's result is available two cycles after issue).

Schedule 1 (5 cycles):
a) lw R2, (R1)
b) lw R3, 4(R1)
   nop
c) R4 = R2 + R3
d) R5 = R2 - 1

Schedule 2 (4 cycles):
a) lw R2, (R1)
b) lw R3, 4(R1)
d) R5 = R2 - 1
c) R4 = R2 + R3
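The two cycle counts can be checked with a small single-issue simulation; this is a sketch that models only result latencies, not structural hazards:

```python
def schedule_length(order, dag, latency):
    """Cycles taken by a linear schedule on a single-issue machine.

    order: instructions in issue order; dag: {producer: set of consumers};
    latency: {instr: cycles until its result is available}.
    A consumer issues no earlier than each of its producers' finish times.
    """
    finish = {}
    next_issue = 0
    for n in order:
        start = next_issue
        for p, succs in dag.items():
            if n in succs and p in finish:
                start = max(start, finish[p])   # wait for the operand
        finish[n] = start + latency[n]
        next_issue = start + 1                  # one issue slot per cycle
    return max(finish.values())

# The example above: loads a, b have latency 2; c and d have latency 1.
dag = {"a": {"c", "d"}, "b": {"c"}}
lat = {"a": 2, "b": 2, "c": 1, "d": 1}
print(schedule_length(["a", "b", "c", "d"], dag, lat))  # 5
print(schedule_length(["a", "b", "d", "c"], dag, lat))  # 4
```

Moving d into the load's delay slot is exactly what saves the cycle.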
Scheduling Algorithm
- Construct dependence DAG on the basic block
- Put roots in the candidate set
- While the candidate set is not empty:
  - Evaluate all candidates and select the best one, using scheduling heuristics (in order)
  - Delete the scheduled instruction from the candidate set
  - Add newly exposed candidates
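A minimal Python sketch of this loop; the priority map is an assumed input (e.g. node heights), and latencies and functional units are ignored:

```python
def list_schedule(dag, priority):
    """Greedy list scheduling over a dependence DAG.

    dag: {node: set of successors}; priority: {node: heuristic score}.
    """
    nodes = set(dag) | {s for succs in dag.values() for s in succs}
    n_preds = {n: 0 for n in nodes}
    for succs in dag.values():
        for s in succs:
            n_preds[s] += 1

    ready = [n for n in nodes if n_preds[n] == 0]     # DAG roots
    schedule = []
    while ready:
        best = max(ready, key=lambda n: priority[n])  # apply the heuristic
        ready.remove(best)
        schedule.append(best)
        for s in dag.get(best, ()):                   # expose new candidates
            n_preds[s] -= 1
            if n_preds[s] == 0:
                ready.append(s)
    return schedule

dag = {"a": {"c", "d"}, "b": {"c"}}
print(list_schedule(dag, {"a": 4, "b": 3, "c": 1, "d": 2}))
# ['a', 'b', 'd', 'c']
```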
Instruction Scheduling Heuristics
- Optimal scheduling is NP-complete, so we need heuristics
- Bias the scheduler to prefer instructions that:
  - Have the earliest execution time
  - Have many successors (more flexibility in scheduling)
  - Make progress along the critical path
  - Free registers (reduce register pressure)
- Heuristics can be used in combination
Computing Priorities

height(n) = exec(n)                                      if n is a leaf
height(n) = max(height(m)) + exec(n) over successors m   otherwise

Critical path(s): the path(s) through the dependence DAG with the longest latency.
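The recurrence is a straightforward bottom-up (memoized) traversal. The sketch below uses the example DAG that appears on the following slides (a chain a→b→d→f→h→i, with loads c, e, g feeding the multiplies) and the stated latencies:

```python
def heights(dag, exec_time):
    """height(n) = exec(n) for a leaf, else exec(n) + max height of successors."""
    memo = {}
    def height(n):
        if n not in memo:
            succs = dag.get(n, ())
            memo[n] = exec_time[n] + (max(height(m) for m in succs) if succs else 0)
        return memo[n]
    for n in exec_time:
        height(n)
    return memo

# Example DAG: a->b->d->f->h->i, with loads c, e, g feeding d, f, h.
dag = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
       "f": ["h"], "g": ["h"], "h": ["i"]}
# Latencies: memory instructions 3, mult 2, everything else 1.
exec_time = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2,
             "g": 3, "h": 2, "i": 3}
print(heights(dag, exec_time)["a"])  # 13: the longest-latency path
```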
Example – Determine Height and CP

Code:
a: lw   r1, w
b: add  r1,r1,r1
c: lw   r2, x
d: mult r1,r1,r2
e: lw   r2, y
f: mult r1,r1,r2
g: lw   r2, z
h: mult r1,r1,r2
i: sw   r1, a

Assume: memory instructions = 3 cycles, mult = 2 cycles (to have the result in a register), everything else = 1 cycle.

(Figure: the dependence DAG — a chain a→b→d→f→h→i, with loads c, e, g feeding d, f, h; edges are labeled with the producer's latency.)

Critical path: _______
Example – Schedule

Code (as before):
a: lw   r1, w
b: add  r1,r1,r1
c: lw   r2, x
d: mult r1,r1,r2
e: lw   r2, y
f: mult r1,r1,r2
g: lw   r2, z
h: mult r1,r1,r2
i: sw   r1, a

(Figure: the same dependence DAG, annotated with each node's start cycle.)

Schedule: ___ cycles
Global Scheduling: Superblock

Definition:
- A single trace of contiguous, frequently executed blocks
- A single entry and multiple exits

Formation algorithm:
- Pick a trace of frequently executed basic blocks
- Eliminate side entrances (tail duplication)

Scheduling and optimization:
- Speculate operations in the superblock
- Apply optimizations to the scope defined by the superblock
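The side-entrance elimination step can be sketched over a hypothetical CFG representation (the block names and predecessor map below are illustrative, not from the slides):

```python
def tail_duplicate(trace, preds):
    """Superblock formation sketch.

    trace: blocks of the selected trace, in order; preds: {block: its
    predecessor blocks}. Every side entrance into the middle of the trace
    is redirected to a duplicated copy of that block (the start of the
    duplicated tail), so the superblock keeps a single entry. Returns
    (pred, old_target, new_target) edge rewrites.
    """
    in_trace = set(trace)
    rewrites = []
    for b in trace[1:]:                  # entries into the head are allowed
        for p in preds.get(b, ()):
            if p not in in_trace:        # side entrance: duplicate the tail
                rewrites.append((p, b, b + "'"))
    return rewrites

# Trace A->B->E->F, with one side entrance into F (assumed to come from C):
print(tail_duplicate(["A", "B", "E", "F"],
                     {"B": ["A"], "E": ["B"], "F": ["E", "C"]}))
# [('C', 'F', "F'")]
```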
Superblock Formation
A100
B90
E90
C10
D0
F100
A100
B90
E90
C10
D0
F90
F’10
Select a trace Tail duplicate
Optimizations within Superblock

By limiting the scope of optimization to the superblock:
- Optimize for the frequent path
- May enable optimizations that are not feasible otherwise (CSE, loop-invariant code motion, ...)

For example, CSE:

1. Trace selection. The frequent path computes
     r1 = r2*3
     ...
     r3 = r2*3
   while a side path that merges in between contains r2 = r2 + 1, blocking CSE at the merge.

2. Tail duplication. The block containing r3 = r2*3 is duplicated, so the side path gets its own copy.

3. CSE within the superblock (no merge, since single entry). On the trace, r3 = r2*3 becomes r3 = r1; the duplicated copy keeps r3 = r2*3.
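A sketch of local CSE over a straight-line region such as a superblock trace; the four-tuple IR is hypothetical:

```python
def local_cse(block):
    """Local CSE over a straight-line region (e.g. a superblock trace).

    block: list of (dest, op, src1, src2) tuples. Single entry means an
    earlier computation reaches every later use on this path.
    """
    available = {}   # (op, src1, src2) -> register holding that value
    out = []
    for dst, op, s1, s2 in block:
        key = (op, s1, s2)
        if key in available:
            out.append((dst, "copy", available[key], None))   # reuse result
        else:
            out.append((dst, op, s1, s2))
        # the write to dst kills expressions that read dst or live in dst
        available = {k: r for k, r in available.items()
                     if dst not in (k[1], k[2]) and r != dst}
        if dst not in (s1, s2):
            available[key] = dst
    return out

# The superblock trace (r2 = r2 + 1 is on the excluded side path):
print(local_cse([("r1", "mul", "r2", "3"), ("r3", "mul", "r2", "3")]))
# [('r1', 'mul', 'r2', '3'), ('r3', 'copy', 'r1', None)]
```

With the intervening r2 = r2 + 1 included (as on the side path), the availability of r2*3 is killed and the second multiply is kept, which is why the duplicated tail is left unoptimized.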
Scheduling Algorithm Complexity

Time complexity: O(n²), where n = max number of instructions in a basic block
- Building the dependence DAG is worst-case O(n²): each instruction must be compared to every other instruction
- Scheduling then requires each instruction be inspected at each step: O(n²)
- Average case: each instruction is compared against only a small constant number of others (e.g., 3)
Very Long Instruction Word (VLIW)
- The compiler determines exactly what is issued every cycle (before the program is run)
- Schedules also account for latencies
- Any hardware change requires a compiler change
- Usually embedded systems (hence simple hardware)
- Itanium is actually an EPIC-style machine (the compiler accounts for most parallelism, but not latencies)
Sample VLIW Code

VLIW processor, 5-issue:
- 2 Add/Sub units (1 cycle)
- 1 Mul/Div unit (2 cycles, unpipelined)
- 1 Ld/St unit (2 cycles, pipelined)
- 1 Branch unit (no delay slots)

Add/Sub    | Add/Sub   | Mul/Div   | Ld/St      | Branch
c = a + b  | d = a - b | e = a * b | ld j = [x] | nop
g = c + d  | h = c - d | nop       | ld k = [y] | nop
nop        | nop       | i = j * c | ld f = [z] | br g
Next Time
Phase-ordering