TRANSCRIPT
Global Load Instruction Aggregation
Based on Code Motion
The 2012 International Symposium on Parallel Architectures, Algorithms and Programming
December 18, 2012
Outline
Background
Previous works
Motivations
Partial Redundancy Elimination (PRE)
Lazy Code Motion (LCM)
Global Load Instruction Aggregation (GLIA)
Experimental results
Conclusion
Background
[Figure: memory hierarchy — the processor is much faster than main memory, so the cache memory between them is important for performance.]
Previous works
1. Prefetch instructions
2. Transform loop structures

before:
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]

after:
for(i=0;i<10;i++)
  for(j=0;j<10;j++)
    ... = a[i][j]
Previous works
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]
[Figure: memory layout of a — row i:0 holds j:0, j:1, ...; row i:1 holds j:0, j:1, ...; with the j-outer loop, consecutive accesses jump between rows, striding through memory.]
Problems
1. They are local techniques (e.g., they target only the initial load instruction, or only loops).
2. They require changing the program structure.
How can we apply cache optimization to any program globally?

main(){
  ... = a[i]
  ... = b[i]
  ... = a[i+1]
}

[Figure: the load of a[i] brings the cache line holding a[i] and a[i+1] from main memory into cache memory; the load of b[i] then brings in the line holding b[i] and b[i+1], which can expel a's line, so the load of a[i+1] causes a cache miss.]
We can remove this cache miss by changing the order of accesses.
Code motion

x = a[i]
z = b[i]      ← the load of b[i] may expel a's line from cache memory
w = a[i+j]
y = x+1

Moving w = a[i+j] up next to x = a[i] changes the access order, but it extends the live ranges (e.g., of w), which can cause a register spill.

To change the access order while suppressing spills, only the load of j is moved up and the rest is delayed:
x = a[i]
t = Load(j)
z = b[i]
w = a[i+t]
y = x+1
Implementation
We use Partial Redundancy Elimination (PRE), a code optimization that eliminates redundant expressions.

PRE
[Figure: before — one branch computes x = a[i], and the join node computes y = a[i] again. After — that branch becomes t = a[i]; x = t, t = a[i] is inserted on the other branch, and the join becomes y = t.]
LCM
LCM determines two insertion nodes: Earliest and Latest.
• Earliest(n) denotes that node n is the closest to the start node among the nodes where the load can be inserted.
• Latest(n) denotes that node n is the closest to the nodes that contain the same load instruction.
Knoop, J. et al.: Lazy Code Motion, Proc. Programming Language Design and Implementation, ACM, pp. 224-234, 1992.
[Figure: a flow graph whose nodes compute x = a[i] and y = a[i]; LCM inserts t = a[i] at the Latest node, delaying the load as far as possible, and replaces the loads with x = t and y = t.]
Global Load Instruction Aggregation (GLIA)
Purpose
1. Decrease cache misses.
2. Suppress register spills.
Extensions
1. Move load instructions even when they are not redundant.
2. Delay them considering the order of memory accesses.
GLIA
[Figure: a flow graph containing x = a[i], w = a[i+1], and y = b[i]. GLIA inserts t = a[i+1] next to x = a[i], aggregating the accesses to a's cache line, and replaces the original load with w = t.]
Application to the entire program
[Figure: a control flow graph with loads = a[i], = a[i+1], = b[i], and = a[i+1] spread across its nodes; GLIA aggregates the loads of a's cache line across the whole graph, not only within a single loop or basic block.]
Experiment
We implemented our technique in the COINS compiler as an LIR converter.
Benchmark
  SPEC2000
Measurement
1. Execution efficiency
2. The number of cache misses
Environment
  SPARC64-V 2GHz, Solaris 10
Optimization
  BASE: applies Dead Code Elimination (DCE)
  GLIADCE: applies GLIA and DCE
Experiment (1/2) | Execution efficiency
The improvement of art is about 10.5%.
Reason for the decrease (1): speculative code motion
[Figure: loads = a[i], = a[j], and = b[i] are aggregated together; moving a load speculatively ahead of a branch can execute it on paths that do not need its value.]
Reason for the decrease (2): register spills
[Chart: the number of spills per benchmark.]
System parameters of the x86 machine
Intel Core i5-2320, 3.00GHz
Floating-point registers: 8
Integer registers: 8
L1D cache memory: 32KB
L2 cache memory: 256KB
L3 cache memory: 6144KB
Experiment (2/2) | Cache misses
Level 2 cache misses: the improvement of twolf is about 10.6%.
Level 3 cache misses: the improvement of art is about 93.7%.
Conclusion
We proposed a new cache optimization, GLIA.
1. GLIA can be applied to any program.
2. GLIA improves cache efficiency.
3. GLIA considers register spills.
Thank you for your attention.