Instruction Rescheduling and Loop-Unroll
Department of Computer Science
Southern Illinois University Edwardsville
Fall, 2015
Dr. Hiroshi Fujinoki
E-mail: [email protected]
CS 312 Computer Architecture & Organization
Example
A for-loop structure written in a high-level programming language
for (i = 0; i < 1000; i++) { a[i] = a[i] + 10.19; }
• There is an array of floating-point numbers, a[i], which has 1,000 elements
• Add a constant, 10.19, to every element in the FP array
Loop-unrolling
Loop-unrolling is a technique to increase instruction-level parallelism (ILP) in loop structures
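As a source-level illustration of the idea, here is a minimal sketch of the example loop and a version unrolled by 2. The function names are hypothetical, and the sketch assumes the array length is even (as with 1,000 elements). Python gains no ILP from this; it only shows that both forms do the same work with half as many loop tests.

```python
def add_constant(a, c=10.19):
    """Original loop: one element per iteration, 1,000 loop tests for 1,000 elements."""
    for i in range(len(a)):
        a[i] = a[i] + c
    return a


def add_constant_unrolled2(a, c=10.19):
    """Loop unrolled by 2: two elements per iteration, half as many loop tests.

    Assumes len(a) is even, so no cleanup loop is needed.
    """
    for i in range(0, len(a), 2):
        a[i] = a[i] + c
        a[i + 1] = a[i + 1] + c
    return a


original = [float(i) for i in range(1000)]
rolled = add_constant(list(original))
unrolled = add_constant_unrolled2(list(original))
```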
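Assumptions
[Figure: main memory drawn from the high address (FFFFFF) to the low address (000000), holding the array elements a[0], a[1], ..., a[999]; R1 and R2 are shown as pointers into the array]
• Each array element is 8 bytes
• F2 = 10.19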
After the high-level programming language statements are compiled
for (i = 0; i < 1000; i++) { a[i] = a[i] + 10.19; }
LOOP: L.D    F0, 0(R1)     // F0 = Mem[R1]
      ADD.D  F4, F0, F2    // F4 = F0 + F2
      S.D    F4, 0(R1)     // Mem[R1] = F4
      DADDUI R1, R1, -8    // R1 = R1 - 8
      BNE    R1, R2, LOOP  // if R1 != R2, go to LOOP
We focus on this loop structure
BNE = “Branch if NOT EQUAL”
After the high-level programming language statements are compiled
for (i = 0; i < 1000; i++) { a[i] = a[i] + 10.19; }
LOOP: LOAD  F0, 0(R1)     // F0 = Mem[R1]
      ADD   F4, F0, F2    // F4 = F0 + F2
      STORE F4, 0(R1)     // Mem[R1] = F4
      ADD   R1, R1, -8    // R1 = R1 - 8
      BNE   R1, R2, LOOP  // if R1 != R2, go to LOOP
Assumptions (Part 2)
• Branch slot (for a conditional branch) = 1 cycle
• RAW dependency for integer ALU instructions = 1 cycle
The numbers of stall cycles for this CPU are defined as follows:
Instruction producing result (WRITE)   Instruction using result (READ)   Stall cycles
FP ALU operation                       Another FP ALU operation          3
FP ALU operation                       Store FP data                     2
Load FP data                           FP ALU operation                  1
Load FP data                           Store FP data                     0
(This table appears on page 304 of the textbook)
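The assumptions above can be written down as a small lookup table. This is only a sketch: the constant and function names are hypothetical, and the values are copied from the table and the two bullet points on this slide.

```python
FP_ALU = "FP ALU operation"
FP_LOAD = "Load FP data"
FP_STORE = "Store FP data"

# (instruction producing result, instruction using result) -> stall cycles
STALL_CYCLES = {
    (FP_ALU, FP_ALU): 3,
    (FP_ALU, FP_STORE): 2,
    (FP_LOAD, FP_ALU): 1,
    (FP_LOAD, FP_STORE): 0,
}

BRANCH_SLOT_CYCLES = 1   # branch slot for a conditional branch
INT_ALU_RAW_CYCLES = 1   # RAW dependency for integer ALU instructions


def stall_cycles(producer, consumer):
    """Look up the stall cycles between a producer and a dependent consumer."""
    return STALL_CYCLES[(producer, consumer)]
```

For example, `stall_cycles(FP_LOAD, FP_ALU)` gives the 1-cycle stall between a LOAD and the ADD that uses the loaded value.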
Categorizing instruction types
LOOP: LOAD  F0, 0(R1)     // F0 = Mem[R1]
      ADD   F4, F0, F2    // F4 = F0 + F2
      STORE F4, 0(R1)     // Mem[R1] = F4
      ADD   R1, R1, -8    // R1 = R1 - 8
      BNE   R1, R2, LOOP  // if R1 != R2, go to LOOP
• Floating-point instructions: LOAD F0, ADD F4, STORE F4
• Integer instruction: ADD R1
• Conditional branch: BNE
Identifying all pipeline hazards
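LOOP: LOAD  F0, 0(R1)     // F0 = Mem[R1]
      ADD   F4, F0, F2    // F4 = F0 + F2
      STORE F4, 0(R1)     // Mem[R1] = F4
      ADD   R1, R1, -8    // R1 = R1 - 8
      BNE   R1, R2, LOOP  // if R1 != R2, go to LOOP
• RAW: LOAD writes F0, then ADD F4 reads F0
• RAW: ADD F4 writes F4, then STORE reads F4
• WAR: STORE reads R1, then ADD R1 writes R1
• RAW: ADD R1 writes R1, then BNE reads R1
• Control hazard: BNE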
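The data hazards can also be found mechanically by comparing the registers each instruction writes and reads. The sketch below is a simplified checker, not part of the original slides. The tuple layout and the function name are assumptions, and it compares every pair of instructions in one iteration, so it also reports the WAR between LOAD (reads R1) and ADD R1 (writes R1). The control hazard of BNE is not a data hazard and is not covered here.

```python
# Each instruction: (text, destination register or None, source registers)
LOOP_BODY = [
    ("LOAD  F0, 0(R1)",    "F0", ["R1"]),
    ("ADD   F4, F0, F2",   "F4", ["F0", "F2"]),
    ("STORE F4, 0(R1)",    None, ["F4", "R1"]),
    ("ADD   R1, R1, -8",   "R1", ["R1"]),
    ("BNE   R1, R2, LOOP", None, ["R1", "R2"]),
]


def find_data_hazards(instructions):
    """Return (kind, earlier index, later index, register) for every pair.

    Simplification: intervening writes are ignored, which is fine here
    because each register is written at most once per iteration.
    """
    hazards = []
    for later in range(len(instructions)):
        _, later_dest, later_srcs = instructions[later]
        for earlier in range(later):
            _, earlier_dest, earlier_srcs = instructions[earlier]
            if earlier_dest is not None and earlier_dest in later_srcs:
                hazards.append(("RAW", earlier, later, earlier_dest))
            if later_dest is not None and later_dest in earlier_srcs:
                hazards.append(("WAR", earlier, later, later_dest))
            if later_dest is not None and later_dest == earlier_dest:
                hazards.append(("WAW", earlier, later, later_dest))
    return hazards


hazards = find_data_hazards(LOOP_BODY)
for kind, earlier, later, reg in hazards:
    print(kind, reg, ":", LOOP_BODY[earlier][0], "->", LOOP_BODY[later][0])
```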
Determining stalled and flushed cycles
How many cycles are stalled or flushed due to RAW and control hazards?
Instruction                                           Type       # of stalls
LOOP: L.D    F0, 0(R1)     // F0 = Mem[R1]            FP Load
      ADD.D  F4, F0, F2    // F4 = F0 + F2            FP ALU     1 (RAW on F0)
      S.D    F4, 0(R1)     // Mem[R1] = F4            FP Store   2 (RAW on F4)
      DADDUI R1, R1, -8    // R1 = R1 - 8             Int ALU    0
      BNE    R1, R2, LOOP  // if R1 != R2, go to LOOP  Branch     1 (RAW on R1)
Control hazard after BNE: 1-cycle flush
Instruction issuing schedule w/ stalls and flush
Cycle issued   Instruction
1              LOOP: L.D    F0, 0(R1)     // F0 = Mem[R1]
2              stall
3              ADD.D  F4, F0, F2    // F4 = F0 + F2
4              stall
5              stall
6              S.D    F4, 0(R1)     // Mem[R1] = F4
7              DADDUI R1, R1, -8    // R1 = R1 - 8
8              stall
9              BNE    R1, R2, LOOP  // if R1 != R2, go to LOOP
10             flush
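The schedule above can be reproduced with a small model of an in-order, single-issue pipeline. This is a simplified sketch under the assumptions stated on the earlier slides (the stall table, 1 stall between an integer ALU result and a dependent branch, and a 1-cycle flush after the branch). The tuple layout, the kind names, and the function name are hypothetical.

```python
# (text, kind, destination register or None, source registers)
LOOP_BODY = [
    ("L.D    F0, 0(R1)",    "fp_load",  "F0", ["R1"]),
    ("ADD.D  F4, F0, F2",   "fp_alu",   "F4", ["F0", "F2"]),
    ("S.D    F4, 0(R1)",    "fp_store", None, ["F4", "R1"]),
    ("DADDUI R1, R1, -8",   "int_alu",  "R1", ["R1"]),
    ("BNE    R1, R2, LOOP", "branch",   None, ["R1", "R2"]),
]

# (producer kind, consumer kind) -> stall cycles; other pairs are assumed to be 0
STALLS = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "fp_store"): 2,
    ("fp_load", "fp_alu"): 1,
    ("fp_load", "fp_store"): 0,
    ("int_alu", "branch"): 1,
}
BRANCH_FLUSH = 1


def issue_schedule(instructions):
    """Return ([(issue cycle, text), ...], total cycles) for one iteration."""
    schedule = []
    last_writer = {}  # register -> (issue cycle, kind) of the instruction that wrote it
    cycle = 0
    for text, kind, dest, srcs in instructions:
        cycle += 1  # earliest slot: one cycle after the previous issue
        for reg in srcs:
            if reg in last_writer:
                prod_cycle, prod_kind = last_writer[reg]
                stalls = STALLS.get((prod_kind, kind), 0)
                cycle = max(cycle, prod_cycle + stalls + 1)
        schedule.append((cycle, text))
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    total = cycle
    if instructions[-1][1] == "branch":
        total += BRANCH_FLUSH  # cycle lost after the branch
    return schedule, total


schedule, total = issue_schedule(LOOP_BODY)
for issue_cycle, text in schedule:
    print(issue_cycle, text)
print("cycles per iteration:", total)
```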
Technique #4: Instruction Re-Scheduling
[Figure: the 10-cycle issue schedule of the original loop (L.D, stall, ADD.D, stall, stall, S.D, DADDUI, stall, BNE, flush), with the S.D moved to the position after the BNE]
      S.D    F4, 8(R1)     // Mem[R1 + 8] = F4
• Delayed branch applied: the S.D is issued after the BNE, and the loop is completed here
• Make sure to add 8 to the S.D offset! DADDUI has already subtracted 8 from R1 when the S.D is executed
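To see what re-scheduling can buy, the sketch below counts cycles for one possible re-scheduled order that uses the delayed branch and the 8(R1) offset noted above. The instruction order is an assumption (it is not necessarily the exact order drawn on the slide), and the model is the same simplified in-order pipeline used for the stall table, with hypothetical names throughout.

```python
# (text, kind, destination register or None, source registers)
# Assumed order: DADDUI is moved up, and S.D fills the branch delay slot.
RESCHEDULED = [
    ("L.D    F0, 0(R1)",    "fp_load",  "F0", ["R1"]),
    ("DADDUI R1, R1, -8",   "int_alu",  "R1", ["R1"]),
    ("ADD.D  F4, F0, F2",   "fp_alu",   "F4", ["F0", "F2"]),
    ("BNE    R1, R2, LOOP", "branch",   None, ["R1", "R2"]),
    ("S.D    F4, 8(R1)",    "fp_store", None, ["F4", "R1"]),  # delay slot, offset +8
]

# (producer kind, consumer kind) -> stall cycles; other pairs are assumed to be 0
STALLS = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "fp_store"): 2,
    ("fp_load", "fp_alu"): 1,
    ("fp_load", "fp_store"): 0,
    ("int_alu", "branch"): 1,
}


def cycles_per_iteration(instructions, branch_flush=1):
    """Count cycles for one iteration of an in-order, single-issue pipeline."""
    last_writer = {}  # register -> (issue cycle, kind)
    cycle = 0
    for _text, kind, dest, srcs in instructions:
        cycle += 1
        for reg in srcs:
            if reg in last_writer:
                prod_cycle, prod_kind = last_writer[reg]
                stalls = STALLS.get((prod_kind, kind), 0)
                cycle = max(cycle, prod_cycle + stalls + 1)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    # A flush is only paid when nothing useful follows the branch.
    if instructions[-1][1] == "branch":
        cycle += branch_flush
    return cycle


rescheduled_cycles = cycles_per_iteration(RESCHEDULED)
print("cycles per iteration after re-scheduling:", rescheduled_cycles)
```

Under these assumptions the iteration takes 6 cycles instead of 10: one stall remains while S.D waits for ADD.D, and the flush disappears because the delay slot does useful work.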
Technique #5: Loop-Unrolling
LOOP: LOAD  F0, 0(R1)
      stall
      ADD   F4, F0, F2
      stall
      stall
      STORE F4, 0(R1)
      ADD   R1, R1, -8
      stall
      BNE   R1, R2, LOOP
      flush
We repeat this 1,000 times
Technique #5: Loop-Unrolling
We repeat this 1,000 times
LOOP1: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       ADD   R1, R1, -8
       stall
       BNE   R1, R2, LOOP
       flush
LOOP2: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       ADD   R1, R1, -8
       stall
       BNE   R1, R2, LOOP
       flush
Merge them together
Technique #5: Loop-Unrolling
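Merging the two copies into one loop body:
• The first copy keeps LOAD, ADD and STORE; its ADD R1, BNE, stall and flush are removed
• In the second copy, LOAD F0, 0(R1) becomes LOAD F6, 8(R1), ADD F4, F0, F2 becomes ADD F8, F6, F2, and STORE F4, 0(R1) becomes STORE F8, 8(R1)
• ADD R1, R1, -8 becomes ADD R1, R1, -16
• Reusing F0 and F4 in the second copy would create a WAW dependency (pseudo dependency = name dependency), so F6 and F8 are used instead
LOOP1: LOAD  F0, 0(R1)
       stall
       ADD   F4, F0, F2
       stall
       stall
       STORE F4, 0(R1)
       LOAD  F6, 8(R1)
       stall
       ADD   F8, F6, F2
       stall
       stall
       STORE F8, 8(R1)
       ADD   R1, R1, -16
       stall
       BNE   R1, R2, LOOP1
       flush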
Technique #5: Loop-Unrolling
Cycle issued   Instruction
1              LOOP1: LOAD  F0, 0(R1)
2              LOAD  F6, 8(R1)
3              ADD   F4, F0, F2
4              ADD   F8, F6, F2
5              stall
6              STORE F4, 0(R1)
7              STORE F8, 8(R1)
8              ADD   R1, R1, -16
9              stall
10             BNE   R1, R2, LOOP1
11             flush
Previous: 10 cycles × 1,000 iterations = 10,000 cycles
Now: 11 cycles × 500 iterations = 5,500 cycles
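The comparison above is plain arithmetic: cycles per pass through the loop body times the number of passes. A minimal sketch, with a hypothetical helper name and the figures taken from this slide:

```python
def total_cycles(cycles_per_pass, elements, elements_per_pass):
    """Total cycles = cycles per pass through the loop body * number of passes."""
    passes = elements // elements_per_pass
    return cycles_per_pass * passes


before = total_cycles(10, 1000, 1)  # original loop: 10 cycles, 1,000 passes
after = total_cycles(11, 1000, 2)   # unrolled by 2: 11 cycles, 500 passes
speedup = before / after

print(before, after, round(speedup, 2))
```

Each pass is one cycle longer, but there are half as many passes, so the total drops from 10,000 to 5,500 cycles (a speedup of about 1.82).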
Further Improvement
Is further improvement possible?
Especially, can we eliminate the control hazards?
Combine instruction scheduling (Technique #4) and loop-unrolling
More loop-unrolling
Further eliminate stalls
But how many times should the loop be unrolled?
How many times should the loop be unrolled?
• Too much unrolling: loop size becomes too big
• Too little unrolling: stalls still exist
• The best unrolling: only enough to eliminate stalls
How can we know the best unrolling if the number of loop iterations is unknown before run-time?
Exercise 2.7 (p. 144) in ED4 (Exercise 4.4 in ED3)
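One common way to keep an unrolled loop correct when the iteration count is only known at run-time is to pair it with a short cleanup loop for the leftover elements. The sketch below assumes an unroll factor of 4 and uses hypothetical function names. It addresses correctness only, not how to choose the best unroll factor.

```python
def add_constant_simple(a, c=10.19):
    """Reference version: one element per iteration."""
    for i in range(len(a)):
        a[i] = a[i] + c
    return a


def add_constant_unrolled4(a, c=10.19):
    """Unrolled by 4, with a cleanup loop for lengths not divisible by 4."""
    n = len(a)
    main_end = n - (n % 4)  # largest multiple of 4 that is <= n
    for i in range(0, main_end, 4):
        a[i] = a[i] + c
        a[i + 1] = a[i + 1] + c
        a[i + 2] = a[i + 2] + c
        a[i + 3] = a[i + 3] + c
    for i in range(main_end, n):  # at most 3 leftover elements
        a[i] = a[i] + c
    return a


def same_result(n):
    """Check that both versions agree for an array of n elements."""
    data = [float(i) for i in range(n)]
    return add_constant_simple(list(data)) == add_constant_unrolled4(list(data))


print(all(same_result(n) for n in (0, 1, 3, 4, 1000, 1003)))
```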
Code Optimization Examples by Visual Studio 2010