Optimizing Compilers, CISC 673, Spring 2009: Instruction Scheduling

University of Delaware • Computer & Information Sciences Department

Optimizing Compilers
CISC 673, Spring 2009

Instruction Scheduling

John Cavazos
University of Delaware

Instruction Scheduling

- Reordering instructions to improve performance
- Takes into account anticipated latencies
  - Machine-specific
- Performed late in optimization pass
- Instruction-Level Parallelism (ILP)

Modern Architecture Features

- Superscalar
  - Multiple logic units
- Multiple issue
  - 2 or more instructions issued per cycle
- Speculative execution
  - Branch predictors
  - Speculative loads
- Deep pipelines

Types of Instruction Scheduling

- Local Scheduling
  - Basic Block Scheduling
- Global Scheduling
  - Trace Scheduling
  - Superblock Scheduling
  - Software Pipelining

Scheduling for different Computer Architectures

- Out-of-order issue
  - Scheduling is useful
- In-order issue
  - Scheduling is very important
- VLIW
  - Scheduling is essential!

Challenges to ILP

- Structural hazards
  - Insufficient resources to exploit parallelism
- Data hazards
  - Instruction depends on result of previous instruction still in pipeline
- Control hazards
  - Branches & jumps modify PC
  - Affect which instructions should be in pipeline

Recall from Architecture…

IF – Instruction Fetch
ID – Instruction Decode
EX – Execute
MA – Memory access
WB – Write back

Three instructions overlapped in the pipeline (one column per cycle):

IF  ID  EX  MA  WB
    IF  ID  EX  MA  WB
        IF  ID  EX  MA  WB

Structural Hazards

addf R3,R1,R2    IF     ID     EX     EX     MA     WB
addf R3,R3,R4           IF     ID     stall  EX     EX     MA     WB

Assumes floating point ops take 2 execute cycles

Instruction latency: execute takes > 1 cycle

Data Hazards

lw  R1,0(R2)     IF     ID     EX     MA     WB
add R3,R1,R4            IF     ID     stall  EX     MA     WB

Memory latency: data not ready

Control Hazards

Taken Branch         IF   ID   EX   MA   WB
Instr + 1                 IF   ---  ---  ---  ---
Branch Target                  IF   ID   EX   MA   WB
Branch Target + 1                   IF   ID   EX   MA   WB

Basic Block Scheduling

For each basic block:

- Construct directed acyclic graph (DAG) using dependences between statements
  - Node = statement / instruction
  - Edge (a,b) = statement a must execute before b
- Schedule instructions using the DAG

Data Dependences

If two operations access the same register and one access is a write, they are dependent

Types of data dependences:

RAW (Read after Write)    WAW (Write after Write)    WAR (Write after Read)
r1 = r2 + r3              r1 = r2 + r3               r1 = r2 + r3
r4 = r1 * 6               r1 = r4 * 6                r2 = r5 * 6

Cannot reorder two dependent instructions
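
The register rule on this slide can be sketched as a small classifier. This is only an illustration: the `Instr` record and the `dependences` function are hypothetical names, and the model assumes each instruction writes exactly one register.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str                # register written (assumed: exactly one)
    srcs: Tuple[str, ...]    # registers read


def dependences(first: Instr, second: Instr) -> List[str]:
    """Dependence kinds that force `first` to stay ahead of `second`."""
    kinds = []
    if first.dest in second.srcs:    # second reads what first wrote
        kinds.append("RAW")
    if first.dest == second.dest:    # both write the same register
        kinds.append("WAW")
    if second.dest in first.srcs:    # second overwrites what first read
        kinds.append("WAR")
    return kinds


# The three instruction pairs from the slide.
raw = dependences(Instr("r1", ("r2", "r3")), Instr("r4", ("r1",)))
waw = dependences(Instr("r1", ("r2", "r3")), Instr("r1", ("r4",)))
war = dependences(Instr("r1", ("r2", "r3")), Instr("r2", ("r5",)))
print(raw, waw, war)
```

An empty result means the pair shares no register conflict and may be reordered, at least as far as registers are concerned.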

Basic Block Scheduling Example

Original Schedule

a) lw R2, (R1)
b) lw R3, (R1) 4
c) R4 ← R2 + R3
d) R5 ← R2 - 1

Dependence DAG (edge labels are latencies in cycles)

a → c (2)
b → c (2)
a → d (2)

Schedule 1 (5 cycles)

a) lw R2, (R1)
b) lw R3, (R1) 4
--- nop ---
c) R4 ← R2 + R3
d) R5 ← R2 - 1

Schedule 2 (4 cycles)

a) lw R2, (R1)
b) lw R3, (R1) 4
d) R5 ← R2 - 1
c) R4 ← R2 + R3
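
The two cycle counts can be checked with a short sketch. It assumes a single-issue, in-order machine on which an instruction stalls until every predecessor's latency has elapsed. The function name `issue_cycles` and the dictionary layout are illustrative, not part of the lecture.

```python
def issue_cycles(order, preds, latency):
    """Issue cycle of each instruction when issued in `order`, one per cycle,
    stalling until every predecessor's result is available."""
    issue = {}
    cycle = 0
    for name in order:
        cycle += 1                                   # next issue slot
        for p in preds.get(name, []):
            cycle = max(cycle, issue[p] + latency[p])  # wait for operand
        issue[name] = cycle
    return issue


# DAG from the slide: c needs a and b, d needs a; loads take 2 cycles.
preds = {"c": ["a", "b"], "d": ["a"]}
latency = {"a": 2, "b": 2, "c": 1, "d": 1}

original = issue_cycles(["a", "b", "c", "d"], preds, latency)
reordered = issue_cycles(["a", "b", "d", "c"], preds, latency)
print(max(original.values()), max(reordered.values()))
```

In the original order, c cannot issue until cycle 4 because the load in b only completes then, which is the nop in Schedule 1. Moving d into that slot hides the load latency.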

Scheduling Algorithm

- Construct dependence DAG on basic block
- Put roots in candidate set
- Use scheduling heuristics (in order) to select instruction
- While candidate set not empty
  - Evaluate all candidates and select best one
  - Delete scheduled instruction from candidate set
  - Add newly-exposed candidates
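
The loop above can be sketched as a minimal list scheduler. Several things here are assumptions made for illustration: a single-issue machine, a caller-supplied `priority` table standing in for the scheduling heuristics, alphabetical tie-breaking, and the names `list_schedule`, `succs`, and `latency`.

```python
def list_schedule(succs, latency, priority):
    """Return {instruction: issue cycle} for a single-issue machine."""
    preds_left = {n: 0 for n in succs}
    for outs in succs.values():
        for s in outs:
            preds_left[s] += 1
    ready_at = {n: 1 for n in succs}           # earliest cycle operands are ready
    candidates = {n for n in succs if preds_left[n] == 0}   # DAG roots
    schedule = {}
    cycle = 1
    while candidates:
        ready = [n for n in sorted(candidates) if ready_at[n] <= cycle]
        if ready:
            best = max(ready, key=lambda n: priority[n])    # select best one
            schedule[best] = cycle
            candidates.remove(best)                         # delete scheduled
            for s in succs[best]:
                ready_at[s] = max(ready_at[s], cycle + latency[best])
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    candidates.add(s)                       # newly exposed
        cycle += 1                             # an idle cycle acts as a nop
    return schedule


# Four-instruction basic block: a and b are 2-cycle loads feeding c; a feeds d.
succs = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
latency = {"a": 2, "b": 2, "c": 1, "d": 1}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}    # assumed: latency-weighted height
result = list_schedule(succs, latency, priority)
print(sorted(result, key=result.get))
```

On this block the sketch issues a, b, d, c in cycles 1 to 4: d fills the slot in which c is still waiting on the second load.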

Instruction Scheduling Heuristics

- NP-complete = we need heuristics
- Bias scheduler to prefer instructions:
  - Earliest execution time
  - Have many successors
    - More flexibility in scheduling
  - Progress along critical path
  - Free registers
    - Reduce register pressure
- Can be a combination of heuristics

Computing Priorities

Height(n) =
    exec(n)                       if n is a leaf
    max(Height(m)) + exec(n)      over all m, where m is a successor of n

Critical path(s) = path through the dependence DAG with longest latency

Example – Determine Height and CP

Code

a lw r1, w

b add r1,r1,r1

c lw r2,x

d mult r1,r1,r2

e lw r2,y

f mult r1,r1,r2

g lw r2,z

h mult r1,r1,r2

i sw r1, a

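Assume: memory instrs = 3 cycles, mult = 2 cycles (to have result in register), rest = 1 cycle

Critical path: _______

Dependence DAG (each edge labeled with the latency of its source instruction):

a → b (3)
b → d (1)
c → d (3)
d → f (2)
e → f (3)
f → h (2)
g → h (3)
h → i (2)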
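
One way to work this exercise is to apply the Height definition to the nine instructions directly. The sketch below assumes the DAG contains only the true (read-after-write) dependences visible in the code, so the reuse of r2 adds no edges. `EXEC`, `SUCCS`, `height`, and `critical_path` are illustrative names.

```python
from functools import lru_cache

# Execution latencies from the slide: memory = 3, mult = 2, rest = 1.
EXEC = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Assumed edges: true dependences only (producer -> consumers).
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}


@lru_cache(maxsize=None)
def height(n):
    """Height(n) = exec(n) for a leaf, else max over successors + exec(n)."""
    if not SUCCS[n]:
        return EXEC[n]
    return max(height(m) for m in SUCCS[n]) + EXEC[n]


def critical_path():
    """Start at the tallest node and keep following the tallest successor."""
    node = max(SUCCS, key=height)
    path = [node]
    while SUCCS[node]:
        node = max(SUCCS[node], key=height)
        path.append(node)
    return path


heights = {n: height(n) for n in SUCCS}
print(heights)
print(critical_path())
```

Under these assumptions the tallest node is a, and the path traced from it runs through every instruction that writes r1.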

Example

Schedule (from start): ___ cycles

Node heights in the dependence DAG:

a = 13, b = 10, c = 12, d = 9, e = 10, f = 7, g = 8, h = 5, i = 3

Code

a lw r1, w

b add r1,r1,r1

c lw r2,x

d mult r1,r1,r2

e lw r2,y

f mult r1,r1,r2

g lw r2,z

h mult r1,r1,r2

i sw r1, a

Global Scheduling: Superblock

- Definition:
  - single trace of contiguous, frequently executed blocks
  - a single entry and multiple exits
- Formation algorithm:
  - pick a trace of frequently executed basic blocks
  - eliminate side entrances (tail duplication)
- Scheduling and optimization:
  - speculate operations in the superblock
  - apply optimization to scope defined by superblock

Superblock Formation

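[Figure: control-flow graph with block execution counts, shown in two steps]

Select a trace:  A (100), B (90), C (10), D (0), E (90), F (100)

Tail duplicate:  A (100), B (90), C (10), D (0), E (90), F (90), F' (10)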
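
The two formation steps can be sketched on a small control-flow graph. The transcript preserves only block names and counts, so the edges and edge weights in `WEIGHTS` are an assumption chosen to be consistent with those counts. The helper names are hypothetical, and the sketch does not update frequencies after duplication.

```python
# Assumed CFG edges with execution counts (not given in the transcript).
WEIGHTS = {("A", "B"): 90, ("A", "C"): 10, ("B", "E"): 90, ("B", "D"): 0,
           ("C", "F"): 10, ("D", "F"): 0, ("E", "F"): 90}


def successors(weights):
    succs = {}
    for src, dst in weights:
        succs.setdefault(src, []).append(dst)
        succs.setdefault(dst, [])
    return succs


def select_trace(weights, entry):
    """Greedy trace: keep following the most frequently taken out-edge."""
    trace = [entry]
    while True:
        outs = {d: w for (s, d), w in weights.items()
                if s == trace[-1] and d not in trace}
        if not outs:
            return trace
        trace.append(max(outs, key=outs.get))


def tail_duplicate(succs, trace):
    """Clone the trace tail from the first side entrance onward and
    redirect every off-trace entry into the clones."""
    pos = {b: i for i, b in enumerate(trace)}

    def on_trace(p, s):
        return p in pos and s in pos and pos[s] == pos[p] + 1

    first = next((i for i in range(1, len(trace))
                  if any(trace[i] in outs and not on_trace(p, trace[i])
                         for p, outs in succs.items())), None)
    if first is None:                       # already single-entry
        return {b: list(outs) for b, outs in succs.items()}
    clone = {b: b + "'" for b in trace[first:]}
    new = {p: [s if s not in clone or on_trace(p, s) else clone[s] for s in outs]
           for p, outs in succs.items()}
    for b, c in clone.items():
        new[c] = [clone.get(s, s) for s in succs[b]]
    return new


trace = select_trace(WEIGHTS, "A")
cfg = tail_duplicate(successors(WEIGHTS), trace)
print(trace, cfg)
```

With these assumed edges the trace has one side-entered block, F, so only F is cloned and the off-trace blocks C and D are redirected to the copy. That leaves the trace with a single entry at A.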

Optimizations within Superblock

By limiting the scope of optimization to superblock:

- optimize for the frequent path
- may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion, ...)

For example: CSE

trace selection:
    r1 = r2*3
    r2 = r2 + 1
    r3 = r2*3

tail duplication:
    r1 = r2*3
    r2 = r2 + 1
    r3 = r2*3        r3 = r2*3

CSE within superblock (no merge since single entry):
    r1 = r2*3
    r2 = r2 + 1
    r3 = r1          r3 = r2*3
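
The effect of the single entry can be illustrated with a small local CSE pass over straight-line code. It assumes that the on-trace path consists of `r1 = r2*3` followed by `r3 = r2*3`, and that `r2 = r2 + 1` sits on the off-trace path. The tuple encoding and the name `local_cse` are illustrative.

```python
def local_cse(block):
    """Replace a recomputed expression with a copy of the register that
    already holds it. Instructions are (dest, op, lhs, rhs) tuples."""
    available = {}      # (op, lhs, rhs) -> register holding that value
    out = []
    for dest, op, lhs, rhs in block:
        key = (op, lhs, rhs)
        if key in available:
            out.append((dest, "copy", available[key], None))
        else:
            out.append((dest, op, lhs, rhs))
        # Writing dest kills expressions that read it or were held in it.
        available = {k: r for k, r in available.items()
                     if r != dest and dest not in k[1:]}
        if dest not in (lhs, rhs):          # r2 = r2 + 1 is stale at once
            available.setdefault(key, dest)
    return out


# Superblock after tail duplication (assumed on-trace path): no merge point.
superblock = [("r1", "*", "r2", 3), ("r3", "*", "r2", 3)]
# Off-trace path reaching the duplicated tail: r2 changes in between.
side_path = [("r1", "*", "r2", 3), ("r2", "+", "r2", 1), ("r3", "*", "r2", 3)]

print(local_cse(superblock))
print(local_cse(side_path))
```

On the superblock the second multiply becomes a copy of r1. On the side path the increment of r2 invalidates the earlier expression, so the duplicated tail keeps its multiply.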

Scheduling Algorithm Complexity

- Time complexity: O(n^2)
  - n = max number of instructions in basic block
- Building dependence DAG: worst-case O(n^2)
  - Each instruction must be compared to every other instruction
- Scheduling then requires each instruction be inspected at each step = O(n^2)
- Average-case: small constant (e.g., 3)

Very Long Instruction Word (VLIW)

- Compiler determines exactly what is issued every cycle (before the program is run)
- Schedules also account for latencies
- All hardware changes result in a compiler change
- Usually embedded systems (hence simple HW)
- Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code

Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
c = a + b    d = a - b    e = a * b    ld j = [x]    nop
g = c + d    h = c - d    nop          ld k = [y]    nop
nop          nop          i = j * c    ld f = [z]    br g

VLIW processor: 5 issue

- 2 Add/Sub units (1 cycle)
- 1 Mul/Div unit (2 cycle, unpipelined)
- 1 LD/ST unit (2 cycle, pipelined)
- 1 Branch unit (no delay slots)
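
Because a VLIW compiler is responsible for latencies, a schedule like the one above can be checked statically. The sketch below encodes the three instruction words and the unit latencies from the slide. The checker itself (`check`, the `(dest, sources)` encoding, and the error strings) is hypothetical, and it assumes an operation reads its operands in the cycle it issues.

```python
LATENCY = [1, 1, 2, 2, 1]   # slots: Add/Sub, Add/Sub, Mul/Div, Ld/St, Branch
UNPIPELINED = {2}           # the Mul/Div unit cannot start a new op every cycle


def check(bundles):
    """Return a list of latency or resource violations (empty if none)."""
    ready = {}        # register -> first cycle its value can be read
    busy_until = {}   # unpipelined slot -> first cycle it is free again
    problems = []
    for cycle, bundle in enumerate(bundles, start=1):
        for slot, op in enumerate(bundle):
            if op is None:                      # nop
                continue
            _, srcs = op
            for s in srcs:
                if ready.get(s, 0) > cycle:
                    problems.append(f"cycle {cycle}: {s} not ready")
            if slot in UNPIPELINED and busy_until.get(slot, 0) > cycle:
                problems.append(f"cycle {cycle}: slot {slot} busy")
        # Record results only after every op in the word has read its inputs.
        for slot, op in enumerate(bundle):
            if op is None:
                continue
            dest, _ = op
            if dest is not None:
                ready[dest] = cycle + LATENCY[slot]
            if slot in UNPIPELINED:
                busy_until[slot] = cycle + LATENCY[slot]
    return problems


# Each op is (destination, sources); None is a nop; the branch has no destination.
sample = [
    [("c", ("a", "b")), ("d", ("a", "b")), ("e", ("a", "b")), ("j", ("x",)), None],
    [("g", ("c", "d")), ("h", ("c", "d")), None, ("k", ("y",)), None],
    [None, None, ("i", ("j", "c")), ("f", ("z",)), (None, ("g",))],
]
# Hypothetical bad variant: the multiply i = j * c moved up one cycle.
too_early = [sample[0],
             [sample[1][0], sample[1][1], ("i", ("j", "c")), sample[1][3], None],
             [None, None, None, sample[2][3], sample[2][4]]]

print(check(sample))
print(check(too_early))
```

The slide's schedule passes. In the bad variant the multiply is flagged twice: the load of j has not completed, and the unpipelined Mul/Div unit is still busy with e = a * b.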

Next Time

Phase-ordering
