Advanced Compiler Techniques
LIU Xianhua
School of EECS, Peking University
Parallelism & Locality
Outline
• Data dependences
• Loop transformation
• Software pipelining
• Software prefetching & data layout
Data Dependence of Variables
• True dependence:   a = …  followed by  … = a
• Anti-dependence:   … = a  followed by  a = …
• Output dependence: a = …  followed by  a = …
• Input dependence:  … = a  followed by  … = a
Domain of Data Dependence Analysis

Only use loop bounds and array indexes that are affine functions of loop variables:

for i = 1 to n
  for j = 2i to 100
    a[i + 2j + 3][4i + 2j][i * i] = …
    … = a[1][2i + 1][j]

Assume a data dependence between the read and write operation if there exist integers iw, jw, ir, jr such that:

1 ≤ iw, ir ≤ n
2iw ≤ jw ≤ 100
2ir ≤ jr ≤ 100
iw + 2jw + 3 = 1
4iw + 2jw = 2ir + 1

Equate each dimension of the array access; ignore non-affine ones (such as i * i). No solution ⇒ no data dependence. A solution ⇒ there may be a dependence.
Iteration Space
An abstraction for loops
Iteration is represented as coordinates in iteration space.
for i = 0, 5
  for j = 0, 3
    a[i, j] = 3

[figure: the 6 × 4 rectangular iteration space, axes i and j]
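The iteration-space abstraction can be made concrete with a short sketch (variable names are mine) that enumerates the points of the loop nests on these slides:

```python
# Enumerate the rectangular iteration space of:
#   for i = 0, 5
#     for j = 0, 3
rect = [(i, j) for i in range(0, 6) for j in range(0, 4)]

# Enumerate the triangular space of the variant with j = i, 3:
tri = [(i, j) for i in range(0, 6) for j in range(i, 4)]

print(len(rect))  # 24 points: a 6 x 4 rectangle
print(len(tri))   # 10 points: iterations with i > 3 execute no j iterations
```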
Iteration Space

An abstraction for loops:

for i = 0, 5
  for j = i, 3
    a[i, j] = 0

[figure: the triangular iteration space, axes i and j]
Iteration Space

An abstraction for loops:

for i = 0, 5
  for j = i, 7
    a[i, j] = 0

[figure: the iteration space with its bounds written as a system of linear inequalities in i and j]
Affine Access

An array access is affine when every index expression is a linear function of the loop variables plus a constant, e.g. a[i + 2j + 3].
Affine Transform

An affine transform maps each iteration (i, j) to a new iteration (u, v):

(u, v)ᵀ = B (i, j)ᵀ + b
Loop Transformation
Original:

for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
  end_for
end_for

After interchange:

for u = 1, 200
  for v = 1, 100
    A[v, u] = A[v, u] + 3
  end_for
end_for

This interchange is the unimodular transform (u, v)ᵀ = [[0, 1], [1, 0]] (i, j)ᵀ.
Interchange Loops?
for i = 2, 1000
  for j = 1, 1000
    A[i, j] = A[i-1, j+1] + 3
  end_for
end_for

• e.g. dependence vector d_old = (1, −1)

Interchanging swaps the components, giving (−1, 1), which is lexicographically negative, so the interchanged version below is illegal:

for u = 1, 1000
  for v = 2, 1000
    A[v, u] = A[v-1, u+1] + 3
  end_for
end_for
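The legality check can be sketched in a few lines (function names are illustrative, not from the lecture): interchange swaps the components of every dependence vector, and the result must remain lexicographically positive.

```python
def lex_positive(v):
    # A dependence vector is valid when its first nonzero component is positive.
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector carries no cross-iteration dependence

def interchange_legal(dep_vectors):
    # Interchange swaps (d1, d2) -> (d2, d1); it is legal only if every
    # swapped vector remains lexicographically positive.
    return all(lex_positive((d2, d1)) for (d1, d2) in dep_vectors)

print(interchange_legal([(1, -1)]))  # False: (-1, 1) is lexicographically negative
print(interchange_legal([(1, 1)]))   # True
```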
GCD Test
Is there any dependence?

for i = 1, 100
  a[2*i] = …
  … = a[2*i+1] + 3

Solve a linear Diophantine equation: 2·iw = 2·ir + 1.
GCD

The greatest common divisor (GCD) of integers a1, a2, …, an, denoted gcd(a1, a2, …, an), is the largest integer that evenly divides all these integers.

Theorem: the linear Diophantine equation

a1 x1 + a2 x2 + … + an xn = c

has an integer solution x1, x2, …, xn iff gcd(a1, a2, …, an) divides c.
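The theorem translates directly into code; applied to the previous slide's equation (2·iw − 2·ir = 1), it proves independence. A minimal sketch:

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs, c):
    """a1*x1 + ... + an*xn = c has an integer solution
    iff gcd(a1, ..., an) divides c."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    return c % g == 0

# a[2*i] = ...  versus  ... = a[2*i+1]  gives  2*iw - 2*ir = 1
print(gcd_test([2, -2], 1))   # False: gcd = 2 does not divide 1, no dependence
print(gcd_test([2, -2], 4))   # True: a solution exists, there may be a dependence
```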
Loop Transformation

• Loop Fusion
• Loop Distribution/Fission
• Loop Re-Indexing
• Loop Scaling
• Loop Reversal
• Loop Permutation/Interchange
• Loop Skewing

Unimodular Transformation
Cache Locality

Suppose array A has column-major layout. Then this loop nest has poor spatial cache locality:

for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
  end_for
end_for

Memory order: A[1,1], A[2,1], …, A[1,2], A[2,2], …, A[1,3], …; the inner j loop strides across columns instead of walking down one.
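The effect can be illustrated with a deliberately crude miss model (names and the 8-word line size are my assumptions): a single line-sized cache slot, column-major addressing, counting a miss whenever the traversal leaves the current line.

```python
def line_crossings(walk, rows, line_words=8):
    # Column-major layout: element (i, j) lives at address j*rows + i.
    # Count a miss every time the walk leaves the current cache line
    # (a one-line cache model, enough to show the stride effect).
    last, misses = None, 0
    for i, j in walk:
        line = (j * rows + i) // line_words
        if line != last:
            misses += 1
            last = line
    return misses

rows, cols = 100, 200
by_rows = [(i, j) for i in range(rows) for j in range(cols)]  # the loop above
by_cols = [(i, j) for j in range(cols) for i in range(rows)]  # interchanged

print(line_crossings(by_rows, rows))  # 20000: every access starts a new line
print(line_crossings(by_cols, rows))  # 2500: one miss per 8-word line
```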
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[figure: performance improvements of roughly 1×–3× on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7), from merged arrays, loop interchange, loop fusion, and blocking]
Resource Constraints

Functional units: ld, st, alu, fmpy, fadd, br, …

Each instruction type has a resource reservation table (resources × time slots):

• Pipelined functional units occupy only one slot
• Non-pipelined functional units occupy multiple time slots
• Instructions may use more than one resource
• There may be multiple units of the same resource
• Limited instruction issue slots may also be managed like a resource
List Scheduling

• Scope: DAGs
• Schedules operations in topological order
• Never backtracks

Variations: priority function for node n
• critical path: max clocks from n to any node
• resource requirements
• source order
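A minimal list scheduler in this spirit, using critical-path length as the priority (the tiny DAG and the issue width are made up for illustration):

```python
def list_schedule(ops, edges, latency, width=1):
    preds = {o: [] for o in ops}
    succs = {o: [] for o in ops}
    for p, s in edges:
        preds[s].append(p)
        succs[p].append(s)

    prio = {}
    def critical_path(o):  # longest latency-weighted path from o to any leaf
        if o not in prio:
            prio[o] = latency[o] + max((critical_path(s) for s in succs[o]),
                                       default=0)
        return prio[o]

    start, cycle = {}, 0
    while len(start) < len(ops):
        # ready = all predecessors scheduled and finished by this cycle
        ready = sorted((o for o in ops
                        if o not in start
                        and all(p in start and start[p] + latency[p] <= cycle
                                for p in preds[o])),
                       key=lambda o: -critical_path(o))
        for o in ready[:width]:   # fill this cycle's issue slots, no backtracking
            start[o] = cycle
        cycle += 1
    return start

# tiny DAG: a and b feed c; a has a 2-cycle latency
sched = list_schedule(["a", "b", "c"], [("a", "c"), ("b", "c")],
                      {"a": 2, "b": 1, "c": 1})
print(sched)  # {'a': 0, 'b': 1, 'c': 2}
```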
Global Scheduling
Assume each clock can execute 2 operations of any kind.

Source:

    if (a == 0) goto L
    e = d + d
    c = b
L:

Machine code, by basic block:

B1:  LD R6 <- 0(R1)
     nop
     BEQZ R6, L
B2:  LD R8 <- 0(R4)
     nop
     ADD R8 <- R8, R8
     ST 0(R5) <- R8
L: B3:
     LD R7 <- 0(R2)
     nop
     ST 0(R3) <- R7
Result of Code Scheduling

B1:  LD R6 <- 0(R1) ; LD R8 <- 0(R4)
     LD R7 <- 0(R2)
     ADD R8 <- R8, R8 ; BEQZ R6, L
B3’ (copy of B3 merged into the fall-through path):
     ST 0(R5) <- R8 ; ST 0(R3) <- R7
L: B3:
     ST 0(R3) <- R7
Code Motions
Goal: shorten execution time probabilistically.

Moving instructions up:
• Move instruction to a cut set (from entry)
• Speculation: execute even when not anticipated

Moving instructions down:
• Move instruction to a cut set (from exit)
• May execute extra instructions
• Can duplicate code
General-Purpose Applications
• Lots of data dependences
• Key performance factor: memory latencies
• Move memory fetches up
  – Speculative memory fetches can be expensive
• Control-intensive: get execution profile
  – Static estimation
    • Innermost loops are frequently executed; back edges are likely to be taken
    • Edges that branch to exit and exception routines are not likely to be taken
  – Dynamic profiling
    • Instrument code and measure using representative data
A Basic Global Scheduling Algorithm
• Schedule innermost loops first
• Only upward code motion
• No creation of copies
• Only one level of speculation

Extra: prepass before scheduling: loop unrolling
Software Pipelining
Software Pipelining
• Obtain parallelism by executing iterations of a loop in an overlapping way.
• Try to maximize parallelism across loop iterations
• We’ll focus on simplest case: the do-all loop, where iterations are independent.
• Goal: Initiate iterations as frequently as possible.
• Limitation: Use same schedule and delay for each iteration.
Example of DoAll Loops
Machine (per clock): 1 read, 1 write, 1 (2-stage) arithmetic op, with a hardware loop op and auto-incrementing addressing mode.

Source code:

for i = 1 to n
  D[i] = A[i] * B[i] + c

Code for one iteration (empty slots are pipeline delays):

1. LD R5, 0(R1++)
2. LD R6, 0(R2++)
3. MUL R7, R5, R6
4.
5. ADD R8, R7, R4
6.
7. ST 0(R3++), R8

There is no parallelism within this basic block.
Unrolling

 1. L: LD
 2.    LD
 3.    LD
 4.    MUL  LD
 5.    MUL  LD
 6.    ADD  LD
 7.    ADD  LD
 8.    ST   MUL  LD
 9.    MUL
10.    ST   ADD
11.    ADD
12.    ST
13.    ST   BL (L)

• Let u be the degree of unrolling:
  – Length of u iterations = 7 + 2(u − 1)
  – Execution time per source iteration = (7 + 2(u − 1)) / u = 2 + 5/u
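The 2 + 5/u formula is easy to evaluate; it shows unrolling approaching, but never reaching, the 2-cycle bound per source iteration:

```python
def cycles_per_source_iteration(u):
    # u unrolled iterations take 7 + 2*(u - 1) cycles on this machine model
    return (7 + 2 * (u - 1)) / u

for u in (1, 2, 4, 8, 100):
    print(u, cycles_per_source_iteration(u))
# 7.0, 4.5, 3.25, 2.625, 2.05: approaches but never reaches 2 cycles/iteration
```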
Software Pipelined Code

 1. LD
 2. LD
 3. MUL  LD
 4.      LD
 5. MUL  LD
 6. ADD  LD
 7. MUL  LD
 8. ST   ADD  LD
 9. MUL  LD
10. ST   ADD  LD
11. MUL
12. ST   ADD
13.
14. ST   ADD
15.
16. ST

• Unlike unrolling, software pipelining can give an optimal result.
• Locally compacted code may not be globally optimal.
• DOALL: can fill arbitrarily long pipelines with infinitely many iterations.
Example of DoAcross Loop

Loop: Sum = Sum + A[i];
      B[i] = A[i] * c;

One iteration:        Software pipelined:
1. LD                 1. LD
2. MUL                2. MUL
3. ADD                3. ADD  LD
4. ST                 4. ST   MUL
                      5.      ADD
                      6.      ST

• Doacross loops: recurrences can be parallelized.
• Harder to fully utilize hardware with large degrees of parallelism.
Problem Formulation

Goals:
• maximize throughput
• small code size

Find:
• an identical relative schedule S(n) for every iteration
• a constant initiation interval (T)

such that the initiation interval is minimized.

Complexity: NP-complete in general.

Example (T = 2):

t=0: LD
t=1: MUL
t=2: ADD  LD
t=3: ST   MUL
t=4:      ADD
t=5:      ST
Scheduling Constraints: Resource

• RT: resource reservation table for a single iteration
• RTs: modulo resource reservation table

RTs[i] = ∪ { RT[t] : t mod T = i }

[figure: the LD/ALU/ST reservation tables of iterations 1–4 overlapped at T = 2, and the resulting steady-state modulo reservation table]
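The folding rule above can be sketched directly; resource names follow the running example, and the Counter-based tallies are my addition so that over-subscription stays visible:

```python
from collections import Counter

def modulo_reservation(rt, T):
    # RTs[i] = union over { t : t mod T == i } of RT[t], keeping counts
    rts = [Counter() for _ in range(T)]
    for t, resources in enumerate(rt):
        rts[t % T].update(resources)
    return rts

def fits(rts, machine):
    # every modulo slot must stay within the machine's per-clock resources
    return all(slot[r] <= machine.get(r, 0) for slot in rts for r in slot)

# one iteration: LD (MEM) at t=0, MUL (ALU) at t=1, ADD (ALU) at t=2, ST (MEM) at t=3
rt = [["MEM"], ["ALU"], ["ALU"], ["MEM"]]
print(fits(modulo_reservation(rt, 2), {"MEM": 1, "ALU": 1}))  # True: T = 2 works
print(fits(modulo_reservation(rt, 1), {"MEM": 1, "ALU": 1}))  # False: T = 1 overcommits
```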
Is It Possible to Have an Iteration Start at Every Clock?

• Hint: no.
• Why? An iteration injects 2 MEM and 2 ALU resource requirements.
  – If injected every clock, the machine cannot possibly satisfy all requests.
• Minimum delay = 2.
Assigning Registers
• We don’t need an infinite number of registers.
• We can reuse registers for iterations that do not overlap in time.
• But we can’t just use the same old registers for every iteration.
Assigning Registers (2)

• The inner loop may have to involve more than one copy of the smallest repeating pattern.
  – Enough so that registers may be reused at each iteration of the expanded inner loop.
• Our example: 3 iterations coexist, so we need 3 sets of registers and 3 copies of the pattern.
Example: Assigning Registers
• Our original loop used registers:
  – r9 to hold a constant 4N
  – r8 to count iterations and index the arrays
  – r1 to copy a[i] into b[i]
• The expanded loop needs:
  – r9 to hold 12N
  – r6, r7, r8 to count iterations and index
  – r1, r2, r3 to copy certain array elements
The Loop Body
L:   ADD r8,r8,#12    nop             LD r3,a(r6)
     BGE r8,r9,L'     ST b(r7),r2     nop
     LD r1,a(r8)      ADD r7,r7,#12   nop
     nop              BGE r7,r9,L''   ST b(r6),r3
     nop              LD r2,a(r7)     ADD r6,r6,#12
     ST b(r8),r1      nop             BLT r6,r9,L

Each register handles every third element of the arrays. The BGE branches break the loop early; L' and L'' are places for appropriate codas.
Cyclic Data Dependence Graph
• We assumed that data at an iteration depends only on data computed at the same iteration.
  – Not even true for our example: r8 is computed from its previous iteration.
  – But it doesn’t matter in this example.
• Fixup: edge labels have two components: (iteration change, delay).
Cyclic Data Dependence Graph

Label each edge with ⟨δ, d⟩, where δ = iteration difference and d = delay. The constraint for an edge n1 → n2 is:

δ × T + S(n2) − S(n1) ≥ d

(A) LD r1,a(r8)
(B) ST b(r8),r1
(C) ADD r8,r8,#4
(D) BLT r8,r9,L

Edges: A →⟨0,2⟩ B, B →⟨0,1⟩ C, C →⟨0,1⟩ D, C →⟨1,1⟩ A.

(C) must wait at least one clock after the (B) from the same iteration. (A) must wait at least one clock after the (C) from the previous iteration.
Matrix of Delays
• Let T be the delay between the start times of one iteration and the next.
• Replace edge label ⟨i, j⟩ by the delay j − iT.
• Compute, for each pair of nodes n and m, the total delay along the longest acyclic path from n to m.
• This gives upper and lower bounds relating the times at which n and m can be scheduled.
Example: Delay Matrix

[tables: the edge-delay matrix over nodes A, B, C, D (entries include 2, 1, 1 and 1 − T) and its acyclic transitive closure (entries include 3, 4, 2 − T and 3 − T)]

From the closure:

S(B) ≥ S(A) + 2
S(A) ≥ S(B) + 2 − T

i.e. S(B) − 2 ≥ S(A) ≥ S(B) + 2 − T.

Note: this implies T ≥ 4 (because only one register is used for loop-counting). If T = 4, then A (LD) must be exactly 2 clocks before B (ST). If T = 5, A can be 2–3 clocks before B.
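These bounds can also be checked mechanically: each edge ⟨δ, d⟩ imposes S(dst) − S(src) ≥ d − δT, and a candidate T is feasible iff the constraint graph has no positive-weight cycle. A Bellman-Ford-style sketch over the four-node graph (the encoding is mine):

```python
def feasible(edges, nodes, T):
    # edge (src, dst, delta, d) encodes S(dst) - S(src) >= d - delta*T;
    # satisfiable iff there is no positive-weight cycle in the graph
    dist = {n: 0 for n in nodes}
    for _ in nodes:                       # |V| longest-path relaxation passes
        for s, t, delta, d in edges:
            dist[t] = max(dist[t], dist[s] + d - delta * T)
    # any further improvement means a positive cycle -> infeasible
    return all(dist[s] + d - delta * T <= dist[t] for s, t, delta, d in edges)

edges = [("A", "B", 0, 2), ("B", "C", 0, 1), ("C", "D", 0, 1), ("C", "A", 1, 1)]
print(min(T for T in range(1, 10) if feasible(edges, "ABCD", T)))  # 4
```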
Iterative Modulo Scheduling
• Compute the lower bound (MII) on the delay between the start times of one iteration and the next (the initiation interval, aka II):
  – due to resources
  – due to recurrences
• Try to find a schedule for II = MII.
• If no schedule can be found, try a larger II.
The Algorithm
1. Choose an initiation interval, ii
   > Compute lower bounds on ii
   > Shorter ii means faster overall execution
2. Generate a loop body that takes ii cycles
   > Try to schedule into ii cycles, using a modulo scheduler
   > If it fails, bump ii by one and try again
3. Generate the needed prologue & epilogue code
   > For the prologue, work backward from upward-exposed uses in the scheduled loop body
   > For the epilogue, work forward from downward-exposed definitions in the scheduled loop body

M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", in Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988, pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
A Simple Sum Reduction Loop

Source code:

c = 0
for i = 1 to n
  c = c + a[i]

LLIR code:

      rc ← 0
      r@a ← @a
      r1 ← n × 4
      rub ← r1 + r@a
      if r@a > rub goto Exit
Loop: ra ← MEM(r@a)
      rc ← rc + ra
      r@a ← r@a + 4
      if r@a ≤ rub goto Loop
Exit: c ← rc
OOO Execution

[figures: the loop’s steady-state behavior, prologue, and epilogue under out-of-order execution]
Implementing the Concept

General schema for the loop:

Prologue:
      ra ← MEM(r@a)
      r@a ← r@a + 4
      if r@a > rub goto Exit

Kernel:
Loop: ra ← MEM(r@a)
      r@a ← r@a + 4
      rc ← rc + ra
      if r@a ≤ rub goto Loop

Epilogue:
      r@a ← r@a + 4
      rc ← rc + ra

The actual schedule must respect both the data dependences and the operation latencies.
The Example

The code:

 1.       rc ← 0
 2.       r@a ← @a
 3.       r1 ← n × 4
 4.       rub ← r1 + r@a
 5.       if r@a > rub goto Exit
 6. Loop: ra ← MEM(r@a)
 7.       rc ← rc + ra
 8.       r@a ← r@a + 4
 9.       if r@a ≤ rub goto Loop
10. Exit: c ← rc

Focus on the loop body (ops 6–9).

[figure: the dependence graph; op 6 is not involved in a cycle]
The Example

Template for the modulo schedule (ii = 2):

Label  | Fetch Unit | Integer Unit | Branch Unit
Loop:  |            |              |
       |            |              |

Scheduling the loop body (ops 6–9):

• Schedule 6 on the fetch unit
• Schedule 8 on the integer unit
• Advance the scheduler’s clock
• Schedule 9 on the branch unit
• Advance the clock (modulo ii)
• Advance the clock again
• Schedule 7 on the integer unit
• No unscheduled ops remain in the loop body

The final schedule for the loop’s body:

Label  | Fetch Unit | Integer Unit | Branch Unit
Loop:  | 6          | 8            |
       |            | 7            | 9
The Example: Final Schedule

Label  | Fetch Unit    | Integer Unit   | Branch Unit
       | nop           | r@a ← @a       | nop
       | nop           | r1 ← n × 4     | nop
       | nop           | rub ← r1 + r@a | nop
       | ra ← MEM(r@a) | r@a ← r@a + 4  | nop
       | nop           | rc ← 0         | if r@a > rub goto Exit
Loop:  | ra ← MEM(r@a) | r@a ← r@a + 4  | nop
       | nop           | rc ← rc + ra   | if r@a ≤ rub goto Loop
Exit:  | nop           | nop            | nop
       | nop           | rc ← rc + ra   | nop

What about the empty slots? Fill them (if needed) in some other way, e.g. fuse the loop with another loop that is memory bound.
Global Software Pipelining: Conditional Execution

• Control dependence → data dependence
• Hardware support: FU, operation field, extra pipe stages
• Conditional composition: code inflation, complicated scheduling
• Branch prediction: dynamic vs. static
Summary

Software pipelining vs. loop unrolling: software pipelining eliminates most dependences and improves ILP, with not much code inflation, but at the cost of register pressure.
Data Prefetching
Why Data Prefetching
• Increasing processor–memory “distance”
• Caches do work!!! … IF …
  – the data set is cache-able and accesses are local (in space/time)
• Else? …
Data Prefetching
• What is it?
  – A request for a future data need is initiated
  – Useful execution continues during the access
  – Data moves from slow/far memory to fast/near cache
  – Data is ready in cache when needed (load/store)
Data Prefetching
• When can it be used?
  – Future data needs are (somewhat) predictable
• How is it implemented?
  – In hardware: history-based prediction of future accesses
  – In software: compiler-inserted prefetch instructions
Software Data Prefetching
• Compiler-scheduled prefetches
• Moves entire cache lines (not just a datum)
  – Spatial locality assumed (often the case)
• Typically a non-faulting access
  – Compiler is free to speculate the prefetch address
• Hardware is not obligated to obey
  – A performance enhancement, no functional impact
  – Loads/stores may be preferentially treated
Software Data Prefetching Use
• Mostly in scientific codes
  – Vectorizable loops accessing arrays deterministically
    • Data access pattern is predictable
    • Prefetch scheduling is easy (far in time, near in code)
  – Large working data sets consumed
    • Even large caches are unable to capture access locality
• Sometimes in integer codes
  – Loops with pointer dereferences
Selective Data Prefetch
do j = 1, n
  do i = 1, m
    A(i,j) = B(1,i) + B(1,i+1)
  enddo
enddo

E.g. A(i,j) has spatial locality, therefore only one prefetch is required for every cache line.

[figure: access patterns over A (column by column) and over row 1 of B, elements 1..m]
68
Formal Definitions

• Temporal locality occurs when a given reference reuses exactly the same data location.
• Spatial locality occurs when a given reference accesses different data locations that fall within the same cache line.
• Group locality occurs when different references access the same cache line.
69
Prefetch Predicates

• If an access has spatial locality, only the first access to a cache line will incur a miss.
• For temporal locality, only the first access will incur a cache miss.
• If an access has group locality, only the leading reference incurs a cache miss.
• If an access has no locality, it will miss in every iteration.
70
Example Code with Prefetches

do j = 1, n
  do i = 1, m
    A(i,j) = B(1,i) + B(1,i+1)
    if (iand(i,7) == 0) prefetch (A(i+k,j))
    if (j == 1) prefetch (B(1,i+t))
  enddo
enddo

Assumed cache line size (CLS) = 64 bytes and data size = 8 bytes; k and t are prefetch distance values.
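The effect of the iand(i,7) == 0 predicate is easy to quantify: with 8-byte elements in 64-byte lines, one prefetch covers eight iterations. A small sketch (helper name is mine):

```python
def prefetches_issued(m):
    # the predicate iand(i,7) == 0 fires once every 8 iterations,
    # i.e. once per 64-byte cache line of 8-byte elements
    return sum(1 for i in range(1, m + 1) if (i & 7) == 0)

m = 1000
print(prefetches_issued(m))  # 125 prefetches instead of 1000 (one per iteration)
```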
71
Spreading of Prefetches

• If more than one reference has spatial locality within the same loop nest, spread these prefetches across the 8-iteration window.
• This reduces stress on the memory subsystem by minimizing the number of outstanding prefetches.
Example Code with Spreading

do j = 1, n
  do i = 1, m
    C(i,j) = D(i-1,j) + D(i+1,j)
    if (iand(i,7) == 0) prefetch (C(i+k,j))
    if (iand(i,7) == 1) prefetch (D(i+k+1,j))
  enddo
enddo

Assumed CLS = 64 bytes and data size = 8 bytes; k is the prefetch distance value.
73
Prefetch Strategy: Conditional

Example loop:

L: Load A(I)
   Load B(I)
   ...
   I = I + 1
   Br L, if I < n

Conditional prefetching:

L: Load A(I)
   Load B(I)
   Cmp pA = (I mod 8 == 0)
   if (pA) prefetch A(I+X)
   Cmp pB = (I mod 8 == 1)
   if (pB) prefetch B(I+X)
   ...
   I = I + 1
   Br L, if I < n

Costs: code for condition generation; prefetches occupy issue slots.
74
Prefetch Strategy: Unroll

Example loop:

L: Load A(I)
   Load B(I)
   ...
   I = I + 1
   Br L, if I < n

Unrolled by 8, with prefetches spread across the window:

Unr_Loop:
   prefetch A(I+X)
   load A(I)
   load B(I)
   ...
   prefetch B(I+X)
   load A(I+1)
   load B(I+1)
   ...
   prefetch C(I+X)
   load A(I+2)
   load B(I+2)
   ...
   prefetch D(I+X)
   load A(I+3)
   load B(I+3)
   ...
   prefetch E(I+X)
   load A(I+4)
   load B(I+4)
   ...
   load A(I+5)
   load B(I+5)
   ...
   load A(I+6)
   load B(I+6)
   ...
   load A(I+7)
   load B(I+7)
   ...
   I = I + 8
   Br Unr_Loop, if I < n

Costs: code bloat (>8×); a remainder loop is needed.
Software Data Prefetching Cost

• Requires memory instruction resources
  – A prefetch instruction for each access stream
• Issues every iteration, but needed less often
  – If branched around, inefficient execution results
  – If conditionally executed, more instruction overhead results
  – If the loop is unrolled, code bloat results
Software Data Prefetching Cost (2)

• Redundant prefetches get in the way
  – Resources are consumed until the prefetches are discarded
• Non-redundant prefetches need careful scheduling
  – Resources are overwhelmed when many issue and miss
• Redundant prefetches increase power/energy consumption
Prefetch Execution Diagram

[figure omitted]

Measurements

[figure omitted]
Data Re-placement/Layout

for (i = 0; i < 12; i++)
  A[i] = B[i+3] + B[i];

• Traditional analysis: max = 27 words
• Detailed analysis: max = 15 words

[figure: a 16-word cache backed by main memory; laying out A[] and B[] so the 15-word working set fits in the cache, versus the traditional 27-word estimate]
Summary
• Today
  – Data dependences
  – Loop transformation
  – Software pipelining
  – Software prefetching
  – Data replacement
• Next
  – Compilers for modern architectures
  – Future compilers