
Instruction Rescheduling and Loop-Unroll

Department of Computer Science
Southern Illinois University Edwardsville

Fall, 2015

Dr. Hiroshi Fujinoki
E-mail: [email protected]

CS 312 Computer Architecture & Organization


Example

A for-loop structure written in a high-level programming language

for (i = 0; i < 1000; i++)
{
    a[i] = a[i] + 10.19;
}

• There is an array of floating-point numbers, a[], which has 1,000 elements

• Add a constant, 10.19, to every element in the FP array

Loop-unrolling

Loop-unrolling is a technique to increase instruction-level parallelism (ILP) in loop structures
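As a rough illustration of the idea (a sketch, not code from the slides), the loop above can be unrolled by a factor of two in C. The function names add_const_rolled and add_const_unrolled2 are made up for this example, and the unrolled version assumes the element count is even, as it is for 1,000 elements.

```c
#include <stddef.h>

/* Original loop: one element per iteration. */
void add_const_rolled(double *a, size_t n, double c)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + c;
    }
}

/* Unrolled by 2: two elements per iteration, so the counter update
   and the branch run half as often.
   Assumption: n is even (true for n = 1000). */
void add_const_unrolled2(double *a, size_t n, double c)
{
    for (size_t i = 0; i < n; i += 2) {
        a[i]     = a[i]     + c;
        a[i + 1] = a[i + 1] + c;
    }
}
```

Both functions leave the array in the same state; the unrolled one simply executes 500 loop iterations instead of 1,000.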


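Assumptions

[Figure: main memory drawn from the low address 000000 to the high address FFFFFF, holding the array elements a[0], a[1], ..., a[999], each 8 bytes wide, with registers R1 and R2 pointing into the array]

F2 = 10.19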


After the high-level programming language statements are compiled

for (i = 0; i < 1000; i++)
{
    a[i] = a[i] + 10.19;
}

LOOP: L.D    F0, 0(R1)      // F0 = Mem[R1]
      ADD.D  F4, F0, F2     // F4 = F0 + F2
      S.D    F4, 0(R1)      // Mem[R1] = F4
      DADDUI R1, R1, -8     // R1 = R1 - 8
      BNE    R1, R2, LOOP   // R1 ≠ R2 → LOOP

We focus on this loop structure. BNE = “Branch if NOT EQUAL”
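To connect the assembly back to the high-level loop, here is a minimal C sketch that walks the array with byte offsets the way the compiled loop does. It rests on assumptions that the transcript does not state: R1 starts at the last element, R2 holds the address 8 bytes below a[0], and a double occupies 8 bytes. The function name add_const_like_asm is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Mirrors the compiled loop with byte offsets instead of indices.
   Assumptions (not stated on the slide): r1 starts at the byte offset
   of the last element, r2 is the offset 8 bytes below a[0], and
   sizeof(double) == 8. n must be at least 1. */
void add_const_like_asm(double *a, size_t n, double f2)
{
    unsigned char *mem = (unsigned char *)a;   /* byte-addressed view */
    long r1 = (long)(n - 1) * 8;               /* offset of a[n-1]    */
    const long r2 = -8;                        /* stop value for R1   */
    double f0, f4;

    do {
        memcpy(&f0, mem + r1, sizeof f0);      /* L.D    F0, 0(R1)    */
        f4 = f0 + f2;                          /* ADD.D  F4, F0, F2   */
        memcpy(mem + r1, &f4, sizeof f4);      /* S.D    F4, 0(R1)    */
        r1 = r1 - 8;                           /* DADDUI R1, R1, -8   */
    } while (r1 != r2);                        /* BNE    R1, R2, LOOP */
}
```

The loop runs from the highest element down to a[0], which is why the pointer update subtracts 8 rather than adding it.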


After the high-level programming language statements are compiled

for (i = 0; i < 1000; i++)
{
    a[i] = a[i] + 10.19;
}

LOOP: LOAD  F0, 0(R1)      // F0 ← Mem[R1]
      ADD   F4, F0, F2     // F4 ← F0 + F2
      STORE F4, 0(R1)      // Mem[R1] ← F4
      ADD   R1, R1, -8     // R1 ← R1 - 8
      BNE   R1, R2, LOOP   // R1 ≠ R2 → LOOP


Assumptions (Part 2)

• Branch slot (for a conditional branch) = 1 cycle

• RAW dependency for integer ALU instructions = 1 cycle

The numbers of stalled cycles for this CPU are defined as follows:

Instruction producing result (WRITE)   Instruction using result (READ)   Stalled cycles
FP ALU operation                       Another FP ALU operation          3
FP ALU operation                       Store FP data                     2
Load FP data                           FP ALU operation                  1
Load FP data                           Store FP data                     0

(This table appears on page 304 of the textbook)
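The stall table can be written down as a small lookup function, which may make it easier to check a schedule by hand. This is only a sketch: the enum and function names are invented for illustration, and pairs that the table does not list return -1.

```c
/* Instruction classes used in the stall table. */
typedef enum { FP_ALU, FP_LOAD, FP_STORE } fp_kind;

/* Returns the number of stall cycles between a producer and a dependent
   consumer, or -1 for pairs the table does not list. */
int fp_stall_cycles(fp_kind producer, fp_kind consumer)
{
    if (producer == FP_ALU  && consumer == FP_ALU)   return 3;
    if (producer == FP_ALU  && consumer == FP_STORE) return 2;
    if (producer == FP_LOAD && consumer == FP_ALU)   return 1;
    if (producer == FP_LOAD && consumer == FP_STORE) return 0;
    return -1;
}
```

For the loop in these slides, the two relevant entries are load followed by FP ALU (1 stall) and FP ALU followed by store (2 stalls).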


Categorizing instruction types

LOOP: LOAD  F0, 0(R1)      // F0 ← Mem[R1]        Floating-point instruction
      ADD   F4, F0, F2     // F4 ← F0 + F2        Floating-point instruction
      STORE F4, 0(R1)      // Mem[R1] ← F4        Floating-point instruction
      ADD   R1, R1, -8     // R1 ← R1 - 8         Integer instruction
      BNE   R1, R2, LOOP   // R1 ≠ R2 → LOOP      Conditional branch


Identifying all pipeline hazards

LOOP: LOAD  F0, 0(R1)      // F0 ← Mem[R1]
          RAW (F0)
      ADD   F4, F0, F2     // F4 ← F0 + F2
          RAW (F4)
      STORE F4, 0(R1)      // Mem[R1] ← F4
          WAR (R1)
      ADD   R1, R1, -8     // R1 ← R1 - 8
          RAW (R1)
      BNE   R1, R2, LOOP   // R1 ≠ R2 → LOOP
          Control hazard


Determining stalled and flushed cycles

Instruction                                      Type       Hazard with next instruction      # of stalls
LOOP: L.D    F0, 0(R1)      // F0 ← Mem[R1]      FP Load    RAW                               1
      ADD.D  F4, F0, F2     // F4 ← F0 + F2      FP ALU     RAW                               2
      S.D    F4, 0(R1)      // Mem[R1] ← F4      FP Store                                     0
      DADDUI R1, R1, -8     // R1 ← R1 - 8       Int ALU    RAW                               1
      BNE    R1, R2, LOOP   // R1 ≠ R2 → LOOP    Branch     Control hazard (1-cycle flush)

How many cycles are stalled or flushed due to RAW and control hazards?


Instruction issuing schedule w/ stalls and flush

                                                 Cycle issued
LOOP: L.D    F0, 0(R1)      // F0 ← Mem[R1]      1
      stall                                      2
      ADD.D  F4, F0, F2     // F4 ← F0 + F2      3
      stall                                      4
      stall                                      5
      S.D    F4, 0(R1)      // Mem[R1] ← F4      6
      DADDUI R1, R1, -8     // R1 ← R1 - 8       7
      stall                                      8
      BNE    R1, R2, LOOP   // R1 ≠ R2 → LOOP    9
      flush                                      10
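The cycle numbers in this schedule follow mechanically from the stall counts. The sketch below shows the arithmetic; the function name issue_schedule and its parameters are hypothetical, and the stall counts are taken as inputs rather than derived from the instructions themselves.

```c
#include <stddef.h>

/* stalls_before[i] is the number of stall cycles inserted before
   instruction i; flush is the number of cycles discarded after the
   final branch. Writes the issue cycle of each instruction into
   issue[] (cycles are numbered from 1) and returns the number of
   cycles used by one loop iteration. */
int issue_schedule(const int *stalls_before, size_t n, int flush, int *issue)
{
    int cycle = 0;
    for (size_t i = 0; i < n; i++) {
        cycle += stalls_before[i] + 1;   /* skip the stalls, then issue */
        issue[i] = cycle;
    }
    return cycle + flush;
}
```

With the stall counts 0, 1, 2, 0, 1 for L.D, ADD.D, S.D, DADDUI, BNE and a 1-cycle flush, the five instructions issue in cycles 1, 3, 6, 7, and 9, and one iteration takes 10 cycles.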


Technique #4: Instruction Re-Scheduling

                                                 Cycle issued
LOOP: L.D    F0, 0(R1)      // F0 ← Mem[R1]      1
      stall                                      2
      ADD.D  F4, F0, F2     // F4 ← F0 + F2      3
      stall                                      4
      stall                                      5
      S.D    F4, 0(R1)      // Mem[R1] ← F4      6
      DADDUI R1, R1, -8     // R1 ← R1 - 8       7
      stall                                      8
      BNE    R1, R2, LOOP   // R1 ≠ R2 → LOOP    9
      flush                                      10


Technique #4: Instruction Re-Scheduling

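Before re-scheduling:

                                                 Cycle issued
LOOP: L.D    F0, 0(R1)      // F0 ← Mem[R1]      1
      stall                                      2
      ADD.D  F4, F0, F2     // F4 ← F0 + F2      3
      stall                                      4
      stall                                      5
      S.D    F4, 0(R1)      // Mem[R1] ← F4      6
      DADDUI R1, R1, -8     // R1 ← R1 - 8       7
      stall                                      8
      BNE    R1, R2, LOOP   // R1 ≠ R2 → LOOP    9
      flush                                      10

After re-scheduling:

• The store becomes S.D F4, 8(R1) // Mem[R1+8] ← F4

• Make sure to add 8! The store now executes after DADDUI has already subtracted 8 from R1.

• Delayed branch applied

• Loop completed here (marked on the schedule)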
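The offset change is the easy part to get wrong, so here is a small C sketch of why it is needed. It assumes the re-ordered body is load, pointer update, add, store, which is the usual textbook arrangement; the transcript does not preserve the exact final order. The variable r1 counts elements rather than bytes, so "+1" plays the role of the 8-byte offset, and the function name is hypothetical.

```c
#include <stddef.h>

/* Re-ordered loop body: the pointer update is moved ahead of the
   store, so the store has to add one element (8 bytes) back.
   Assumptions: n >= 1; r1 is an element index standing in for R1. */
void add_const_rescheduled(double *a, size_t n, double f2)
{
    long r1 = (long)n - 1;            /* index of the last element   */
    const long r2 = -1;               /* stop value for r1           */

    do {
        double f0 = a[r1];            /* L.D    F0, 0(R1)            */
        r1 = r1 - 1;                  /* DADDUI R1, R1, -8 (moved)   */
        double f4 = f0 + f2;          /* ADD.D  F4, F0, F2           */
        a[r1 + 1] = f4;               /* S.D    F4, 8(R1)            */
    } while (r1 != r2);               /* BNE    R1, R2, LOOP         */
}
```

Without the "+1", every result would be written one element too low, which is the mistake the "add 8" note warns about.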


Technique #5: Loop-Unrolling

LOOP: LOAD  F0, 0(R1)
      stall
      ADD   F4, F0, F2
      stall
      stall
      STORE F4, 0(R1)
      ADD   R1, R1, -8
      stall
      BNE   R1, R2, LOOP
      flush

We repeat this 1,000 times


Technique #5: Loop-Unrolling

We repeat this 1,000 times

LOOP1: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       ADD   R1, R1, -8
       stall
       BNE   R1, R2, LOOP
       flush

LOOP2: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       ADD   R1, R1, -8
       stall
       BNE   R1, R2, LOOP
       flush

Merge them together


Technique #5: Loop-Unrolling

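Two copies of the loop body:

LOOP1: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       ADD   R1, R1, -8
       stall
       BNE   R1, R2, LOOP
       flush

LOOP2: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       ADD   R1, R1, -8
       stall
       BNE   R1, R2, LOOP
       flush

When the copies are merged, the second copy is rewritten with new registers and a new offset, and a single pointer update and branch serve both copies:

       LOAD  F6, 8(R1)
       ADD   F8, F6, F2
       STORE F8, 8(R1)
       ADD   R1, R1, -16
       stall
       BNE   R1, R2, LOOP
       flush

WAW dependency (pseudo dependency) = name dependency: if the second copy reused F0 and F4, it would overwrite the registers of the first copy, so it uses F6 and F8 instead.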


Technique #5: Loop-Unrolling

                                 Cycle issued
LOOP1: LOAD  F0, 0(R1)           1
       LOAD  F6, 8(R1)           2
       ADD   F4, F0, F2          3
       ADD   F8, F6, F2          4
       stall                     5
       STORE F4, 0(R1)           6
       STORE F8, 8(R1)           7
       ADD   R1, R1, -16         8
       stall                     9
       BNE   R1, R2, LOOP        10
       flush                     11

Previous: 10 cycles × 1,000 iterations = 10,000 cycles

Now: 11 cycles × 500 iterations = 5,500 cycles
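The comparison is simple arithmetic, sketched below in C for checking other unrolling factors. The helper names total_cycles and speedup are made up for this example.

```c
/* Total cycles = cycles per loop iteration x number of iterations. */
long total_cycles(long cycles_per_iteration, long iterations)
{
    return cycles_per_iteration * iterations;
}

/* Speedup of the new schedule over the old one (old / new). */
double speedup(long old_cycles, long new_cycles)
{
    return (double)old_cycles / (double)new_cycles;
}
```

For the numbers on this slide, total_cycles(10, 1000) is 10,000 and total_cycles(11, 500) is 5,500, so unrolling once makes the loop roughly 1.8 times faster.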


Further Improvement

Is further improvement possible?

Especially, eliminate control hazards

Combine instruction-scheduling (Technique 4) and loop-unrolling

More loop-unrolling

Further eliminate stalls

But how many times should the loop be unrolled?


How many times should the loop be unrolled?

• Too much unrolling: loop size becomes too big

• Too little unrolling: stalls still exist

• The best unrolling: only enough to eliminate stalls

How can we know the best unrolling if the number of loop iterations is unknown before run-time?
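One common way to cope with an iteration count that is only known at run-time is to unroll by a fixed factor and finish the leftover elements in a short cleanup loop. The sketch below shows that pattern; it is not necessarily the answer the exercise expects, the factor of 4 is an arbitrary choice, and the function name is hypothetical.

```c
#include <stddef.h>

/* Unrolls by 4 and finishes leftover elements in a cleanup loop, so n
   does not need to be known at compile time or be a multiple of 4.
   The factor 4 is an arbitrary choice for illustration. */
void add_const_unrolled4(double *a, size_t n, double c)
{
    size_t i = 0;
    size_t limit = n - (n % 4);   /* largest multiple of 4 that is <= n */

    for (; i < limit; i += 4) {
        a[i]     += c;
        a[i + 1] += c;
        a[i + 2] += c;
        a[i + 3] += c;
    }
    for (; i < n; i++) {          /* 0 to 3 leftover elements */
        a[i] += c;
    }
}
```

The cleanup loop runs at most three times, so its stalls and branches cost little compared with the unrolled main loop.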

Exercise 2.7 (p. 144 in the 4th edition; Exercise 4.4 in the 3rd edition)


Code Optimization Examples by Visual Studio 2010
