UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT
CISC 673: Optimizing Compilers
Spring 2009
Instruction Scheduling
John Cavazos, University of Delaware
Instruction Scheduling

Reordering instructions to improve performance:
- Takes anticipated latencies into account
- Machine-specific
- Performed late in the optimization pass
- Exploits Instruction-Level Parallelism (ILP)
Modern Architecture Features
- Superscalar: multiple logic units
- Multiple issue: 2 or more instructions issued per cycle
- Speculative execution: branch predictors, speculative loads
- Deep pipelines
Types of Instruction Scheduling
- Local scheduling: basic block scheduling
- Global scheduling: trace scheduling, superblock scheduling, software pipelining
Scheduling for Different Computer Architectures
- Out-of-order issue: scheduling is useful
- In-order issue: scheduling is very important
- VLIW: scheduling is essential!
Challenges to ILP
- Structural hazards: insufficient resources to exploit parallelism
- Data hazards: an instruction depends on the result of a previous instruction still in the pipeline
- Control hazards: branches and jumps modify the PC, affecting which instructions should be in the pipeline
Recall from Architecture…
- IF – Instruction Fetch
- ID – Instruction Decode
- EX – Execute
- MA – Memory Access
- WB – Write Back

With no hazards, consecutive instructions overlap in the pipeline, each one stage behind the previous:

i1: IF ID EX MA WB
i2:    IF ID EX MA WB
i3:       IF ID EX MA WB
Structural Hazards

addf R3,R1,R2   IF ID EX EX MA WB
addf R3,R3,R4      IF ID stall EX EX MA WB

- Assumes floating-point ops take 2 execute cycles
- Instruction latency: execute takes > 1 cycle
Data Hazards

lw  R1,0(R2)    IF ID EX MA WB
add R3,R1,R4       IF ID stall EX MA WB

- Memory latency: data not ready
Control Hazards

Taken Branch        IF ID EX MA WB
Instr + 1              IF --- --- --- ---
Branch Target             IF ID EX MA WB
Branch Target + 1            IF ID EX MA WB

The instruction after a taken branch is fetched and then squashed; the branch target cannot be fetched until the branch resolves.
Basic Block Scheduling

For each basic block:
- Construct a directed acyclic graph (DAG) using dependences between statements
  - Node = statement / instruction
  - Edge (a,b) = statement a must execute before b
- Schedule instructions using the DAG
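As a sketch (not from the slides), the DAG construction above can be written in a few lines of Python; the `(dest, sources)` tuple IR is a hypothetical format chosen for illustration:

```python
from collections import defaultdict

def build_dependence_dag(instrs):
    """Build the dependence DAG of one basic block.

    instrs: list of (dest, sources) tuples, in program order.
    An edge i -> j means instruction i must execute before j.
    """
    edges = defaultdict(set)
    for i, (dst_i, srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dst_j, srcs_j = instrs[j]
            raw = dst_i in srcs_j      # j reads what i wrote
            war = dst_j in srcs_i      # j overwrites what i read
            waw = dst_i == dst_j       # both write the same register
            if raw or war or waw:
                edges[i].add(j)
    return edges

# r1 = r2 + r3; r4 = r1 * 6; r5 = r6 + 1
dag = build_dependence_dag([("r1", ["r2", "r3"]),
                            ("r4", ["r1"]),
                            ("r5", ["r6", "1"])])
print(sorted(dag[0]))  # [1]: the RAW edge on r1
```

The pairwise comparison is exactly the worst-case O(n²) step discussed later under complexity.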
Data Dependences

If two operations access the same register and one access is a write, they are dependent.

Types of data dependences:
- RAW (read after write):
  r1 = r2 + r3
  r4 = r1 * 6
- WAW (write after write):
  r1 = r2 + r3
  r1 = r4 * 6
- WAR (write after read):
  r1 = r2 + r3
  r2 = r5 * 6

Cannot reorder two dependent instructions.
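The three cases map directly onto a small classifier; a sketch, again using a hypothetical `(dest, sources)` instruction format:

```python
def dependence_type(first, second):
    """Classify the dependence from first to second (program order)."""
    dst1, srcs1 = first
    dst2, srcs2 = second
    if dst1 in srcs2:
        return "RAW"   # true dependence: second reads first's result
    if dst1 == dst2:
        return "WAW"   # output dependence
    if dst2 in srcs1:
        return "WAR"   # anti-dependence
    return None        # independent: safe to reorder

# The three examples above:
print(dependence_type(("r1", ["r2", "r3"]), ("r4", ["r1"])))  # RAW
print(dependence_type(("r1", ["r2", "r3"]), ("r1", ["r4"])))  # WAW
print(dependence_type(("r1", ["r2", "r3"]), ("r2", ["r5"])))  # WAR
```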
Basic Block Scheduling Example

Original schedule:
a) lw R2, (R1)
b) lw R3, 4(R1)
c) R4 = R2 + R3
d) R5 = R2 - 1

Dependence DAG: edges a→c, b→c, and a→d, each with latency 2 (a load's result is available two cycles after issue).

Schedule 1 (5 cycles):
a) lw R2, (R1)
b) lw R3, 4(R1)
   nop
c) R4 = R2 + R3
d) R5 = R2 - 1

Schedule 2 (4 cycles):
a) lw R2, (R1)
b) lw R3, 4(R1)
d) R5 = R2 - 1
c) R4 = R2 + R3
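The two cycle counts can be checked with a small single-issue simulation; this is a sketch that models only result latencies, not structural hazards:

```python
def schedule_length(order, dag, latency):
    """Cycles taken by a linear schedule on a single-issue machine.

    order: instructions in issue order; dag: {producer: set of consumers};
    latency: {instr: cycles until its result is available}.
    A consumer issues no earlier than each of its producers' finish times.
    """
    finish = {}
    next_issue = 0
    for n in order:
        start = next_issue
        for p, succs in dag.items():
            if n in succs and p in finish:
                start = max(start, finish[p])   # wait for the operand
        finish[n] = start + latency[n]
        next_issue = start + 1                  # one issue slot per cycle
    return max(finish.values())

# The example above: loads a, b have latency 2; c and d have latency 1.
dag = {"a": {"c", "d"}, "b": {"c"}}
lat = {"a": 2, "b": 2, "c": 1, "d": 1}
print(schedule_length(["a", "b", "c", "d"], dag, lat))  # 5
print(schedule_length(["a", "b", "d", "c"], dag, lat))  # 4
```

Moving d into the load's delay slot is exactly what saves the cycle.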
Scheduling Algorithm
- Construct dependence DAG on the basic block
- Put roots in the candidate set
- While the candidate set is not empty:
  - Evaluate all candidates and select the best one, using scheduling heuristics (in order)
  - Delete the scheduled instruction from the candidate set
  - Add newly exposed candidates
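A minimal Python sketch of this loop; the priority map is an assumed input (e.g. node heights), and latencies and functional units are ignored:

```python
def list_schedule(dag, priority):
    """Greedy list scheduling over a dependence DAG.

    dag: {node: set of successors}; priority: {node: heuristic score}.
    """
    nodes = set(dag) | {s for succs in dag.values() for s in succs}
    n_preds = {n: 0 for n in nodes}
    for succs in dag.values():
        for s in succs:
            n_preds[s] += 1

    ready = [n for n in nodes if n_preds[n] == 0]     # DAG roots
    schedule = []
    while ready:
        best = max(ready, key=lambda n: priority[n])  # apply the heuristic
        ready.remove(best)
        schedule.append(best)
        for s in dag.get(best, ()):                   # expose new candidates
            n_preds[s] -= 1
            if n_preds[s] == 0:
                ready.append(s)
    return schedule

dag = {"a": {"c", "d"}, "b": {"c"}}
print(list_schedule(dag, {"a": 4, "b": 3, "c": 1, "d": 2}))
# ['a', 'b', 'd', 'c']
```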
Instruction Scheduling Heuristics
- Optimal scheduling is NP-complete, so we need heuristics
- Bias the scheduler to prefer instructions that:
  - Have the earliest execution time
  - Have many successors (more flexibility in scheduling)
  - Make progress along the critical path
  - Free registers (reduce register pressure)
- Heuristics can be used in combination
Computing Priorities

height(n) = exec(n)                                      if n is a leaf
height(n) = max(height(m)) + exec(n) over successors m   otherwise

Critical path(s): the path(s) through the dependence DAG with the longest latency.
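The recurrence is a straightforward bottom-up (memoized) traversal. The sketch below uses the example DAG that appears on the following slides (a chain a→b→d→f→h→i, with loads c, e, g feeding the multiplies) and the stated latencies:

```python
def heights(dag, exec_time):
    """height(n) = exec(n) for a leaf, else exec(n) + max height of successors."""
    memo = {}
    def height(n):
        if n not in memo:
            succs = dag.get(n, ())
            memo[n] = exec_time[n] + (max(height(m) for m in succs) if succs else 0)
        return memo[n]
    for n in exec_time:
        height(n)
    return memo

# Example DAG: a->b->d->f->h->i, with loads c, e, g feeding d, f, h.
dag = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
       "f": ["h"], "g": ["h"], "h": ["i"]}
# Latencies: memory instructions 3, mult 2, everything else 1.
exec_time = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2,
             "g": 3, "h": 2, "i": 3}
print(heights(dag, exec_time)["a"])  # 13: the longest-latency path
```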
Example – Determine Height and CP

Code:
a: lw   r1, w
b: add  r1,r1,r1
c: lw   r2, x
d: mult r1,r1,r2
e: lw   r2, y
f: mult r1,r1,r2
g: lw   r2, z
h: mult r1,r1,r2
i: sw   r1, a

Assume: memory instructions = 3 cycles, mult = 2 cycles (to have the result in a register), everything else = 1 cycle.

(Figure: the dependence DAG — a chain a→b→d→f→h→i, with loads c, e, g feeding d, f, h; edges are labeled with the producer's latency.)

Critical path: _______
Example – Schedule

Code (as before):
a: lw   r1, w
b: add  r1,r1,r1
c: lw   r2, x
d: mult r1,r1,r2
e: lw   r2, y
f: mult r1,r1,r2
g: lw   r2, z
h: mult r1,r1,r2
i: sw   r1, a

(Figure: the same dependence DAG, annotated with each node's start cycle.)

Schedule: ___ cycles
Global Scheduling: Superblock

Definition:
- A single trace of contiguous, frequently executed blocks
- A single entry and multiple exits

Formation algorithm:
- Pick a trace of frequently executed basic blocks
- Eliminate side entrances (tail duplication)

Scheduling and optimization:
- Speculate operations in the superblock
- Apply optimizations to the scope defined by the superblock
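The side-entrance elimination step can be sketched over a hypothetical CFG representation (the block names and predecessor map below are illustrative, not from the slides):

```python
def tail_duplicate(trace, preds):
    """Superblock formation sketch.

    trace: blocks of the selected trace, in order; preds: {block: its
    predecessor blocks}. Every side entrance into the middle of the trace
    is redirected to a duplicated copy of that block (the start of the
    duplicated tail), so the superblock keeps a single entry. Returns
    (pred, old_target, new_target) edge rewrites.
    """
    in_trace = set(trace)
    rewrites = []
    for b in trace[1:]:                  # entries into the head are allowed
        for p in preds.get(b, ()):
            if p not in in_trace:        # side entrance: duplicate the tail
                rewrites.append((p, b, b + "'"))
    return rewrites

# Trace A->B->E->F, with one side entrance into F (assumed to come from C):
print(tail_duplicate(["A", "B", "E", "F"],
                     {"B": ["A"], "E": ["B"], "F": ["E", "C"]}))
# [('C', 'F', "F'")]
```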
Superblock Formation
A100
B90
E90
C10
D0
F100
A100
B90
E90
C10
D0
F90
F’10
Select a trace Tail duplicate
Optimizations within Superblock

By limiting the scope of optimization to the superblock:
- Optimize for the frequent path
- May enable optimizations that are not feasible otherwise (CSE, loop-invariant code motion, ...)

For example, CSE:

1. Trace selection. The frequent path computes
     r1 = r2*3
     ...
     r3 = r2*3
   while a side path that merges in between contains r2 = r2 + 1, blocking CSE at the merge.

2. Tail duplication. The block containing r3 = r2*3 is duplicated, so the side path gets its own copy.

3. CSE within the superblock (no merge, since single entry). On the trace, r3 = r2*3 becomes r3 = r1; the duplicated copy keeps r3 = r2*3.
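A sketch of local CSE over a straight-line region such as a superblock trace; the four-tuple IR is hypothetical:

```python
def local_cse(block):
    """Local CSE over a straight-line region (e.g. a superblock trace).

    block: list of (dest, op, src1, src2) tuples. Single entry means an
    earlier computation reaches every later use on this path.
    """
    available = {}   # (op, src1, src2) -> register holding that value
    out = []
    for dst, op, s1, s2 in block:
        key = (op, s1, s2)
        if key in available:
            out.append((dst, "copy", available[key], None))   # reuse result
        else:
            out.append((dst, op, s1, s2))
        # the write to dst kills expressions that read dst or live in dst
        available = {k: r for k, r in available.items()
                     if dst not in (k[1], k[2]) and r != dst}
        if dst not in (s1, s2):
            available[key] = dst
    return out

# The superblock trace (r2 = r2 + 1 is on the excluded side path):
print(local_cse([("r1", "mul", "r2", "3"), ("r3", "mul", "r2", "3")]))
# [('r1', 'mul', 'r2', '3'), ('r3', 'copy', 'r1', None)]
```

With the intervening r2 = r2 + 1 included (as on the side path), the availability of r2*3 is killed and the second multiply is kept, which is why the duplicated tail is left unoptimized.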
Scheduling Algorithm Complexity

Time complexity: O(n²), where n = max number of instructions in a basic block
- Building the dependence DAG is worst-case O(n²): each instruction must be compared to every other instruction
- Scheduling then requires each instruction be inspected at each step: O(n²)
- Average case: each instruction is compared against only a small constant number of others (e.g., 3)
Very Long Instruction Word (VLIW)
- The compiler determines exactly what is issued every cycle (before the program is run)
- Schedules also account for latencies
- Any hardware change requires a compiler change
- Usually embedded systems (hence simple hardware)
- Itanium is actually an EPIC-style machine (the compiler accounts for most parallelism, but not latencies)
Sample VLIW Code

VLIW processor, 5-issue:
- 2 Add/Sub units (1 cycle)
- 1 Mul/Div unit (2 cycles, unpipelined)
- 1 Ld/St unit (2 cycles, pipelined)
- 1 Branch unit (no delay slots)

Add/Sub    | Add/Sub   | Mul/Div   | Ld/St      | Branch
c = a + b  | d = a - b | e = a * b | ld j = [x] | nop
g = c + d  | h = c - d | nop       | ld k = [y] | nop
nop        | nop       | i = j * c | ld f = [z] | br g
Next Time
Phase-ordering