
UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT

Optimizing Compilers
CISC 673
Spring 2009
Instruction Scheduling

John Cavazos
University of Delaware

Instruction Scheduling

Reordering instructions to improve performance
Takes into account anticipated latencies
  Machine-specific
Performed late in optimization pass
Instruction-Level Parallelism (ILP)


Modern Architecture Features

Superscalar
  Multiple logic units
Multiple issue
  2 or more instructions issued per cycle
Speculative execution
  Branch predictors
  Speculative loads
Deep pipelines

Types of Instruction Scheduling

Local Scheduling
  Basic Block Scheduling
Global Scheduling
  Trace Scheduling
  Superblock Scheduling
  Software Pipelining

Scheduling for different Computer Architectures

Out-of-order issue
  Scheduling is useful
In-order issue
  Scheduling is very important
VLIW
  Scheduling is essential!

Challenges to ILP

Structural hazards
  Insufficient resources to exploit parallelism
Data hazards
  Instruction depends on result of previous instruction still in pipeline
Control hazards
  Branches & jumps modify PC
  Affect which instructions should be in pipeline

Recall from Architecture…

IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory access
WB – Write back

Structural Hazards

addf R3,R1,R2    IF  ID  EX  EX     MA  WB
addf R3,R3,R4        IF  ID  stall  EX  EX  MA  WB

Assumes floating point ops take 2 execute cycles
Instruction latency: execute takes > 1 cycle

Data Hazards

lw  R1,0(R2)     IF  ID  EX  MA     WB
add R3,R1,R4         IF  ID  stall  EX  MA  WB

Memory latency: data not ready

Control Hazards

[Figure: pipeline diagram (stages IF ID EX MA WB) with rows Taken Branch, Instr + 1, Branch Target, Branch Target + 1; "---" marks empty pipeline slots]

Basic Block Scheduling

For each basic block:
  Construct directed acyclic graph (DAG) using dependences between statements
    Node = statement / instruction
    Edge (a,b) = statement a must execute before b
  Schedule instructions using the DAG

Data Dependences

If two operations access the same register and at least one access is a write, they are dependent

Types of data dependences

RAW = Read after Write
  r1 = r2 + r3
  r4 = r1 * 6

WAW = Write after Write
  r1 = r2 + r3
  r1 = r4 * 6

WAR = Write after Read
  r1 = r2 + r3
  r2 = r5 * 6

Cannot reorder two dependent instructions
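
The three dependence types above can be detected mechanically by comparing the registers each instruction writes and reads. The following is a minimal sketch, not taken from the slides: the (dest, sources) tuple encoding and the function names `dependences` and `build_dag` are assumptions made for illustration. The pairwise comparison is conservative and may add edges that a more careful analysis would omit.

```python
def dependences(first, second):
    """Return the dependence types from `first` to `second`.

    Each instruction is a hypothetical (dest, sources) pair, e.g.
    ("r1", ("r2", "r3")) for r1 = r2 + r3.
    """
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 in s2:          # second reads what first wrote
        kinds.add("RAW")
    if d1 == d2:          # both write the same register
        kinds.add("WAW")
    if d2 in s1:          # second overwrites what first read
        kinds.add("WAR")
    return kinds


def build_dag(block):
    """Compare every pair of instructions (earlier, later) in a basic block
    and return edges (i, j, kinds) meaning instruction i must precede j."""
    edges = []
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            kinds = dependences(block[i], block[j])
            if kinds:
                edges.append((i, j, kinds))
    return edges


# The three register examples from the slide:
raw = dependences(("r1", ("r2", "r3")), ("r4", ("r1",)))
waw = dependences(("r1", ("r2", "r3")), ("r1", ("r4",)))
war = dependences(("r1", ("r2", "r3")), ("r2", ("r5",)))
```

Comparing every instruction with every later one is what gives DAG construction its quadratic worst case.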

Basic Block Scheduling Example

Original Schedule
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1

Dependence DAG
  a → c, b → c, a → d

Schedule 1 (5 cycles)
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  --- nop ---
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1

Schedule 2 (4 cycles)
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  d) R5 ← R2 - 1
  c) R4 ← R2 + R3

Scheduling Algorithm

Construct dependence DAG on basic block
Put roots in candidate set
Use scheduling heuristics (in order) to select instruction
While candidate set not empty
  Evaluate all candidates and select best one
  Delete scheduled instruction from candidate set
  Add newly-exposed candidates
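
The loop above can be sketched as a small cycle-by-cycle list scheduler. This is an illustrative sketch under stated assumptions, not the course's implementation: it models a single-issue machine, the function name `list_schedule` and its parameters are hypothetical, and the latencies (loads 2 cycles, everything else 1) are inferred from the single nop in Schedule 1 of the basic block example. The priority values are supplied by hand.

```python
def list_schedule(order, preds, latency, priority):
    """Single-issue list scheduler.

    order    -- instruction names in original program order
    preds    -- name -> list of names it depends on (DAG edges)
    latency  -- name -> cycles until its result can be used
    priority -- name -> number; larger is scheduled first
    Returns one entry per cycle: an instruction name or "nop".
    """
    succs = {n: [] for n in order}
    for n in order:
        for p in preds.get(n, []):
            succs[p].append(n)
    remaining = {n: len(preds.get(n, [])) for n in order}
    ready_at = {n: 1 for n in order}      # earliest cycle operands are ready
    candidates = [n for n in order if remaining[n] == 0]   # DAG roots
    schedule, cycle = [], 1
    while candidates:
        ready = [n for n in candidates if ready_at[n] <= cycle]
        if not ready:
            schedule.append("nop")        # nothing can issue this cycle
        else:
            best = max(ready, key=lambda n: priority[n])   # ties: first listed
            schedule.append(best)
            candidates.remove(best)
            for s in succs[best]:         # expose newly ready successors
                ready_at[s] = max(ready_at[s], cycle + latency[best])
                remaining[s] -= 1
                if remaining[s] == 0:
                    candidates.append(s)
        cycle += 1
    return schedule


# Basic block example: c depends on a and b, d depends on a.
result = list_schedule(
    order=["a", "b", "c", "d"],
    preds={"c": ["a", "b"], "d": ["a"]},
    latency={"a": 2, "b": 2, "c": 1, "d": 1},     # assumed latencies
    priority={"a": 3, "b": 3, "c": 1, "d": 1},    # hand-assigned
)
```

Under these assumptions the sketch issues a, b, d, c, which matches the 4-cycle Schedule 2: d fills the slot where Schedule 1 needed a nop.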

Instruction Scheduling Heuristics

Scheduling is NP-complete, so we need heuristics
Bias scheduler to prefer instructions that:
  Have the earliest execution time
  Have many successors
    More flexibility in scheduling
  Progress along critical path
  Free registers
    Reduce register pressure
Can be a combination of heuristics

Computing Priorities

Height(n) = exec(n)                      if n is a leaf
          = max(Height(m)) + exec(n)     otherwise, over all successors m of n

Critical path(s) = path through the dependence DAG with longest latency
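
The height recurrence can be evaluated with a memoized walk over the successors of each node. The sketch below uses a small hypothetical DAG (x → y → z and w → z) with made-up execution times; the function names and the dictionary encoding are assumptions for illustration only.

```python
def heights(succs, exec_time):
    """Height(n) = exec(n) for a leaf, else exec(n) + max height of successors."""
    memo = {}

    def height(n):
        if n not in memo:
            below = [height(m) for m in succs.get(n, [])]
            memo[n] = exec_time[n] + (max(below) if below else 0)
        return memo[n]

    return {n: height(n) for n in exec_time}


def critical_path(succs, exec_time):
    """Follow the tallest successor from the tallest node down to a leaf."""
    h = heights(succs, exec_time)
    node = max(h, key=h.get)
    path = [node]
    while succs.get(node):
        node = max(succs[node], key=h.get)
        path.append(node)
    return path


# Hypothetical DAG and execution times, not the example from the slides.
succs = {"x": ["y"], "y": ["z"], "w": ["z"]}
exec_time = {"x": 3, "y": 1, "z": 2, "w": 1}
h = heights(succs, exec_time)
cp = critical_path(succs, exec_time)
```

Here z is a leaf with height 2, y and w each have height 3, and x has height 6, so the critical path is x, y, z.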

Example – Determine Height and CP

a lw r1, w

b add r1,r1,r1

c lw r2,x

d mult r1,r1,r2

e lw r2,y

f mult r1,r1,r2

g lw r2,z

h mult r1,r1,r2

i sw r1, a

Assume:
  memory instrs = 3 cycles
  mult = 2 cycles (to have result in register)
  rest = 1 cycle

Critical path: _______

Example

Schedule

___ cycles

13 Code

a lw r1, w

b add r1,r1,r1

c lw r2,x

d mult r1,r1,r2

e lw r2,y

f mult r1,r1,r2

g lw r2,z

h mult r1,r1,r2

i sw r1, a

Global Scheduling: Superblock

Definition:
  single trace of contiguous, frequently executed blocks
  a single entry and multiple exits
Formation algorithm:
  pick a trace of frequently executed basic blocks
  eliminate side entrances (tail duplication)
Scheduling and optimization:
  speculate operations in the superblock
  apply optimization to scope defined by superblock

Superblock Formation

Select a trace
Tail duplicate

Optimizations within Superblock

By limiting the scope of optimization to superblock:
  optimize for the frequent path
  may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion, ...)

For example: CSE

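Trace selection
  r1 = r2*3
  r2 = r2 + 1      (side path, off the trace)
  r3 = r2*3

Tail duplication
  r1 = r2*3
  r2 = r2 + 1      (side path)
  r3 = r2*3        (on trace)
  r3 = r2*3        (duplicate for the side path)

CSE within superblock (no merge since single entry)
  r1 = r2*3
  r2 = r2 + 1      (side path)
  r3 = r1          (on trace)
  r3 = r2*3        (duplicate for the side path)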
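
To see why removing the side entrance matters, the sketch below runs a simple common-subexpression pass over a straight-line sequence of instructions. It is an illustrative assumption, not the algorithm from the slides: the (dest, op, lhs, rhs) tuple encoding and the name `cse` are hypothetical. On the trace, r2 is not modified between the two multiplications, so the second one becomes a copy; on the side path, r2 = r2 + 1 kills the expression and nothing can be reused.

```python
def cse(block):
    """Replace recomputed expressions in straight-line code with copies.

    block is a list of hypothetical (dest, op, lhs, rhs) tuples.
    A reused value is emitted as (dest, "copy", source_register, None).
    """
    available = {}   # (op, lhs, rhs) -> register currently holding it
    out = []
    for dest, op, lhs, rhs in block:
        key = (op, lhs, rhs)
        if key in available:
            out.append((dest, "copy", available[key], None))
        else:
            out.append((dest, op, lhs, rhs))
        # Writing dest kills expressions that read dest or live in dest.
        available = {
            k: reg for k, reg in available.items()
            if reg != dest and dest not in (k[1], k[2])
        }
        # The expression is available in dest unless dest is an operand.
        if dest not in (lhs, rhs):
            available.setdefault(key, dest)
    return out


# Path along the trace: r2 is unchanged, so r3 can reuse r1.
on_trace = cse([("r1", "*", "r2", 3), ("r3", "*", "r2", 3)])

# Path through the side block: r2 changes, so r3 must be recomputed.
side_path = cse([("r1", "*", "r2", 3), ("r2", "+", "r2", 1), ("r3", "*", "r2", 3)])
```

Without tail duplication the single copy of r3 = r2*3 would be reachable from both paths, and the replacement by r3 = r1 would be wrong on the side path.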

Scheduling Algorithm Complexity

Time complexity: O(n^2)
  n = max number of instructions in basic block
Building dependence DAG: worst-case O(n^2)
  Each instruction must be compared to every other instruction
Scheduling then requires each instruction be inspected at each step = O(n^2)
Average-case: small constant (e.g., 3)

Very Long Instruction Word (VLIW)

Compiler determines exactly what is issued every cycle (before the program is run)
Schedules also account for latencies
All hardware changes result in a compiler change
Usually embedded systems (hence simple HW)
Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code

Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
c = a + b    d = a - b    e = a * b    ld j = [x]    nop
g = c + d    h = c - d    nop          ld k = [y]    nop
nop          nop          i = j * c    ld f = [z]    br g

VLIW processor:
  5 issue
  2 Add/Sub units (1 cycle)
  1 Mul/Div unit (2 cycle, unpipelined)
  1 LD/ST unit (2 cycle, pipelined)
  1 Branch unit (no delay slots)
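
Because a VLIW compiler must account for latencies itself, a schedule like the one above can be checked mechanically. The following is a rough sketch under several assumptions: the (dest, kind, sources) encoding, the name `check_schedule`, and the treatment of `br g` as an operation with no register operands are all hypothetical. It checks only operand readiness and the unpipelined Mul/Div unit, not the number of units per bundle.

```python
# Latencies taken from the processor description on the slide.
LATENCY = {"add": 1, "sub": 1, "mul": 2, "ld": 2}


def check_schedule(bundles, live_in):
    """Return a list of latency violations (empty if none are found).

    bundles -- one list per cycle of (dest, kind, sources); nops are omitted
    live_in -- names whose values are available before cycle 1
    """
    ready = {name: 1 for name in live_in}   # first cycle a value can be read
    mul_free = 1                            # first cycle Mul/Div can start
    problems = []
    for cycle, bundle in enumerate(bundles, start=1):
        written = {}
        for dest, kind, srcs in bundle:
            for s in srcs:
                if ready.get(s, float("inf")) > cycle:
                    problems.append(f"cycle {cycle}: {s} not ready for {kind}")
            if kind == "mul":
                if mul_free > cycle:
                    problems.append(f"cycle {cycle}: Mul/Div unit busy")
                mul_free = cycle + LATENCY["mul"]   # unpipelined unit
            if dest is not None:
                written[dest] = cycle + LATENCY[kind]
        ready.update(written)   # results become visible in later cycles only
    return problems


slide = [
    [("c", "add", ("a", "b")), ("d", "sub", ("a", "b")),
     ("e", "mul", ("a", "b")), ("j", "ld", ("x",))],
    [("g", "add", ("c", "d")), ("h", "sub", ("c", "d")), ("k", "ld", ("y",))],
    [("i", "mul", ("j", "c")), ("f", "ld", ("z",)), (None, "br", ())],
]
ok = check_schedule(slide, live_in=["a", "b", "x", "y", "z"])

# Moving i = j * c one cycle earlier breaks both the load and multiply latency.
early = [slide[0], slide[1] + [slide[2][0]], slide[2][1:]]
bad = check_schedule(early, live_in=["a", "b", "x", "y", "z"])
```

Under these assumptions the slide's schedule passes, which explains the nop in the Mul/Div slot of the second instruction word: the unit is still busy with e = a * b, and j is not loaded yet.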

Next Time

Phase-ordering
