TRANSCRIPT
Global Load Instruction Aggregation
Based on Code Motion
The 2012 International Symposium on Parallel Architectures, Algorithms and Programming
December 18, 2012
Outline
Background
Previous works
Motivations
Partial Redundancy Elimination (PRE)
Lazy Code Motion (LCM)
Global Load Instruction Aggregation (GLIA)
Experimental results
Conclusion
Background
[Figure: memory hierarchy — the processor is much faster than main memory, so the cache memory between them is important for performance.]
Previous works
1. Prefetch instructions
2. Transform loop structures

before:
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]

after:
for(i=0;i<10;i++)
  for(j=0;j<10;j++)
    ... = a[i][j]
Previous works
for(j=0;j<10;j++)
  for(i=0;i<10;i++)
    ... = a[i][j]
[Figure: memory layout of a — row i:0 holds j:0, j:1, ...; row i:1 holds j:0, j:1, ...; with the j-outer loop, consecutive accesses jump between rows, striding through memory.]
Problems
1. They are local techniques (e.g., they target only the initial load instruction, or only loops).
2. They require changing the program structure.
How can we apply cache optimization to any program globally?

main(){
  ... = a[i]
  ... = b[i]
  ... = a[i+1]
}

[Figure: the load of a[i] brings the cache line holding a[i] and a[i+1] from main memory into cache memory; the load of b[i] then brings in the line holding b[i] and b[i+1], which can expel a's line, so the load of a[i+1] causes a cache miss.]
We can remove this cache miss by changing the order of accesses.
Code motion

x = a[i]
z = b[i]      ← the load of b[i] may expel a's line from cache memory
w = a[i+j]
y = x+1

Moving w = a[i+j] up next to x = a[i] changes the access order, but it extends the live ranges (e.g., of w), which can cause a register spill.

To change the access order while suppressing spills, only the load of j is moved up and the rest is delayed:
x = a[i]
t = Load(j)
z = b[i]
w = a[i+t]
y = x+1
Implementation
We use Partial Redundancy Elimination (PRE), a code optimization that eliminates redundant expressions.

PRE
[Figure: before — one branch computes x = a[i], and the join node computes y = a[i] again. After — that branch becomes t = a[i]; x = t, t = a[i] is inserted on the other branch, and the join becomes y = t.]
LCM
LCM determines two insertion nodes: Earliest and Latest.
• Earliest(n) denotes that node n is the closest to the start node among the nodes where the load can be inserted.
• Latest(n) denotes that node n is the closest to the nodes that contain the same load instruction.
Knoop, J. et al.: Lazy Code Motion, Proc. Programming Language Design and Implementation, ACM, pp. 224-234, 1992.
[Figure: a flow graph whose nodes compute x = a[i] and y = a[i]; LCM inserts t = a[i] at the Latest node, delaying the load as far as possible, and replaces the loads with x = t and y = t.]
Global Load Instruction Aggregation (GLIA)
Purpose
1. Decrease cache misses.
2. Suppress register spills.
Extensions
1. Move load instructions even when they are not redundant.
2. Delay them considering the order of memory accesses.
GLIA
[Figure: a flow graph containing x = a[i], w = a[i+1], and y = b[i]. GLIA inserts t = a[i+1] next to x = a[i], aggregating the accesses to a's cache line, and replaces the original load with w = t.]
Application to the entire program
[Figure: a control flow graph with loads = a[i], = a[i+1], = b[i], and = a[i+1] spread across its nodes; GLIA aggregates the loads of a's cache line across the whole graph, not only within a single loop or basic block.]
Experiment
We implemented our technique in the COINS compiler as an LIR converter.
Benchmark
  SPEC2000
Measurement
1. Execution efficiency
2. The number of cache misses
Environment
  SPARC64-V 2GHz, Solaris 10
Optimization
  BASE: applies Dead Code Elimination (DCE)
  GLIADCE: applies GLIA and DCE
Experiment (1/2) | Execution efficiency
The improvement of art is about 10.5%.
Reason for the decrease (1): speculative code motion
[Figure: loads = a[i], = a[j], and = b[i] are aggregated together; moving a load speculatively ahead of a branch can execute it on paths that do not need its value.]
Reason for the decrease (2): register spills
[Chart: the number of spills per benchmark.]
System parameters of the x86 machine
Intel Core i5-2320, 3.00GHz
Floating-point registers: 8
Integer registers: 8
L1D cache memory: 32KB
L2 cache memory: 256KB
L3 cache memory: 6144KB
Experiment (2/2) | Cache misses
Level 2 cache misses: the improvement of twolf is about 10.6%.
Level 3 cache misses: the improvement of art is about 93.7%.
Conclusion
We proposed a new cache optimization, GLIA.
1. GLIA can be applied to any program.
2. GLIA improves cache efficiency.
3. GLIA considers register spills.
Thank you for your attention.