advanced compiler techniques

Advanced Compiler Techniques LIU Xianhua School of EECS, Peking University Parallelism & Locality


Post on 24-Feb-2016



TRANSCRIPT

Page 1: Advanced Compiler Techniques

Advanced Compiler Techniques

LIU Xianhua

School of EECS, Peking University

Parallelism & Locality

Page 2: Advanced Compiler Techniques

Outline

• Data dependences
• Loop transformation
• Software pipelining
• Software prefetching & data layout

Page 3: Advanced Compiler Techniques

Data Dependence of Variables

• True dependence (write, then read)
    a = …
    … = a

• Anti-dependence (read, then write)
    … = a
    a = …

• Output dependence (write, then write)
    a = …
    a = …

• Input dependence (read, then read)
    … = a
    … = a

Page 4: Advanced Compiler Techniques

Domain of Data Dependence Analysis

• Only use loop bounds and array indexes that are affine functions of loop variables

for i = 1 to n
  for j = 2i to 100
    a[i + 2j + 3][4i + 2j][i * i] = …
    … = a[1][2i + 1][j]

• Assume a data dependence between the read & write operation if there exist integers ir, jr, iw, jw such that:
    1 ≤ iw, ir ≤ n
    2iw ≤ jw ≤ 100
    2ir ≤ jr ≤ 100
    iw + 2jw + 3 = 1
    4iw + 2jw = 2ir + 1

• Equate each dimension of array access; ignore non-affine ones
  – No solution: no data dependence
  – Solution: there may be a dependence
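The existence question on this slide can be checked by brute force for a small n. The sketch below is only an illustration: the function name `may_depend` and the choice n = 20 are assumptions, and a real compiler would solve the integer constraints symbolically instead of enumerating them.

```python
from itertools import product

def may_depend(n):
    """Return True if some write iteration (iw, jw) and read iteration (ir, jr)
    satisfy the loop bounds and the two affine index equations."""
    for iw, ir in product(range(1, n + 1), repeat=2):
        for jw in range(2 * iw, 101):
            # Dimension 1: iw + 2*jw + 3 = 1; dimension 2: 4*iw + 2*jw = 2*ir + 1.
            if iw + 2 * jw + 3 == 1 and 4 * iw + 2 * jw == 2 * ir + 1:
                # jr only has to satisfy its loop bounds (the non-affine
                # third dimension is ignored), so any jr in [2*ir, 100] works.
                if 2 * ir <= 100:
                    return True
    return False

print(may_depend(20))
```

Since iw ≥ 1 and jw ≥ 2, the first equation already has no solution, so the search finds nothing and no dependence needs to be assumed.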

Page 5: Advanced Compiler Techniques

Iteration Space

• An abstraction for loops
• Iteration is represented as coordinates in iteration space.

for i = 0, 5
  for j = 0, 3
    a[i, j] = 3

[Figure: iteration space with axes i and j]

Page 6: Advanced Compiler Techniques

Iteration Space

• An abstraction for loops

for i = 0, 5
  for j = i, 3
    a[i, j] = 0

[Figure: iteration space with axes i and j]

Page 7: Advanced Compiler Techniques

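Iteration Space

• An abstraction for loops

for i = 0, 5
  for j = i, 7
    a[i, j] = 0

$$\begin{bmatrix} 1 & 0 \\ -1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 5 \\ 0 \\ 7 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

[Figure: iteration space with axes i and j]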

Page 8: Advanced Compiler Techniques

Affine Access

Page 9: Advanced Compiler Techniques

Affine Transform

$$\begin{bmatrix} u \\ v \end{bmatrix} = B \begin{bmatrix} i \\ j \end{bmatrix} + b$$

[Figure: iteration space (i, j) mapped to (u, v)]

Page 10: Advanced Compiler Techniques

Loop Transformation

for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
  end_for
end_for

for u = 1, 200
  for v = 1, 100
    A[v, u] = A[v, u] + 3
  end_for
end_for

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$
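The interchange on this slide is the affine transform with matrix [[0, 1], [1, 0]] and zero offset. The sketch below applies it to a small iteration space; `apply_transform` and `INTERCHANGE` are illustrative names, and the bounds are shrunk from 100 × 200 to 3 × 2 to keep the example readable.

```python
def apply_transform(B, point):
    """Multiply the 2x2 integer matrix B by the iteration vector `point`."""
    i, j = point
    return (B[0][0] * i + B[0][1] * j, B[1][0] * i + B[1][1] * j)

INTERCHANGE = [[0, 1], [1, 0]]

# Original loop order: i outermost, j innermost.
original = [(i, j) for i in range(1, 4) for j in range(1, 3)]

# New loop order: iterate (u, v) lexicographically, where (u, v) = B (i, j).
transformed = sorted(apply_transform(INTERCHANGE, p) for p in original)

# Element A[v, u] touched by each new iteration.
touched = [(v, u) for (u, v) in transformed]
print(touched)
```

The same set of array elements is touched; only the visiting order changes, which is what makes the transform a pure reordering.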

Page 11: Advanced Compiler Techniques

Interchange Loops?

for i = 2, 1000
  for j = 1, 1000
    A[i, j] = A[i-1, j+1] + 3
  end_for
end_for

• e.g. dependence vector d_old = (1,-1)

for u = 1, 1000
  for v = 2, 1000
    A[v, u] = A[v-1, u+1] + 3
  end_for
end_for

[Figure: iteration space with axes i and j]
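One common way to answer the question on this slide is to transform each dependence vector and check that it stays lexicographically positive. The sketch below assumes that criterion; the helper names are illustrative.

```python
def lex_positive(d):
    """True if the first nonzero component of d is positive."""
    for x in d:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # all zeros: not a loop-carried dependence

def transform(B, d):
    """Apply the 2x2 matrix B to a dependence vector d."""
    return (B[0][0] * d[0] + B[0][1] * d[1], B[1][0] * d[0] + B[1][1] * d[1])

def is_legal(B, dep_vectors):
    """A reordering is legal if every loop-carried dependence still points forward."""
    return all(lex_positive(transform(B, d)) for d in dep_vectors if any(d))

INTERCHANGE = [[0, 1], [1, 0]]
print(transform(INTERCHANGE, (1, -1)), is_legal(INTERCHANGE, [(1, -1)]))
```

For d_old = (1,-1) the interchanged vector is (-1, 1), which is not lexicographically positive, so under this criterion the interchange would reverse the dependence and is rejected.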

Page 12: Advanced Compiler Techniques

GCD Test

• Is there any dependence?

for i = 1, 100
  a[2*i] = …
  … = a[2*i+1] + 3

• Solve a linear Diophantine equation: 2*iw = 2*ir + 1

Page 13: Advanced Compiler Techniques

GCD

• The greatest common divisor (GCD) of integers a1, a2, …, an, denoted gcd(a1, a2, …, an), is the largest integer that evenly divides all these integers.

• Theorem: The linear Diophantine equation

$$a_1 x_1 + a_2 x_2 + \dots + a_n x_n = c$$

has an integer solution x1, x2, …, xn iff gcd(a1, a2, …, an) divides c
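The theorem turns directly into a test. The sketch below (the function name `gcd_test` is an assumption) applies it to the earlier equation 2*iw = 2*ir + 1, rewritten as 2*iw - 2*ir = 1. The test ignores loop bounds, so a True result only means a dependence cannot be ruled out.

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs, c):
    """True if a1*x1 + ... + an*xn = c has an integer solution."""
    g = reduce(gcd, (abs(a) for a in coeffs), 0)
    if g == 0:            # all coefficients are zero
        return c == 0
    return c % g == 0

# 2*iw - 2*ir = 1: gcd(2, 2) = 2 does not divide 1, so no dependence.
print(gcd_test([2, -2], 1))
```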

Page 14: Advanced Compiler Techniques

Loop Transformation

• Loop Fusion
• Loop Distribution/Fission
• Loop Re-Indexing
• Loop Scaling
• Loop Reversal
• Loop Permutation/Interchange
• Loop Skew

Unimodular Transformation

Page 15: Advanced Compiler Techniques

Loop Transformation

Page 16: Advanced Compiler Techniques

Cache Locality

• Suppose array A has column-major layout
• Loop nest has poor spatial cache locality.

for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
  end_for
end_for

Memory layout: A[1,1] A[2,1] … A[1,2] A[2,2] … A[1,3] …

Page 17: Advanced Compiler Techniques

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: bar chart of performance improvement (scale 1 to 3) from merged arrays, loop interchange, loop fusion, and blocking, for compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]

Page 18: Advanced Compiler Techniques

Resource Constraints

[Figure: resource reservation table; columns are functional units (ld, st, alu, fmpy, fadd, br, …), rows are time steps 0, 1, 2]

• Each instruction type has a resource reservation table
  – Pipelined functional units: occupy only one slot
  – Non-pipelined functional units: multiple time slots
  – Instructions may use more than one resource
  – Multiple units of same resource
  – Limited instruction issue slots may also be managed like a resource

Page 19: Advanced Compiler Techniques

List Scheduling

• Scope: DAGs
  – Schedules operations in topological order
  – Never backtracks

• Variations:
  – Priority function for node n
    • critical path: max clocks from n to any node
    • resource requirements
    • source order
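A minimal sketch of list scheduling with a critical-path priority follows. It assumes a single kind of issue slot (`width` operations per clock) in place of full reservation tables, and the five-operation DAG and its latencies are made up for illustration.

```python
def list_schedule(latency, succs, width):
    """latency: op -> clocks; succs: op -> dependent ops; width: ops issued per clock.
    Returns op -> start clock. Ties are broken by source (insertion) order."""
    preds = {n: [] for n in latency}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)

    prio = {}
    def critical_path(n):
        # Longest latency-weighted path from n to any node reachable from it.
        if n not in prio:
            prio[n] = latency[n] + max(
                (critical_path(s) for s in succs.get(n, [])), default=0)
        return prio[n]
    for n in latency:
        critical_path(n)

    start, clock = {}, 0
    while len(start) < len(latency):
        ready = [n for n in latency if n not in start and all(
            p in start and start[p] + latency[p] <= clock for p in preds[n])]
        ready.sort(key=lambda n: -prio[n])   # stable sort keeps source order on ties
        for n in ready[:width]:              # never backtracks once placed
            start[n] = clock
        clock += 1
    return start

# Hypothetical DAG: two loads feed a multiply, then an add, then a store.
latency = {"ld_a": 1, "ld_b": 1, "mul": 2, "add": 2, "st": 1}
succs = {"ld_a": ["mul"], "ld_b": ["mul"], "mul": ["add"], "add": ["st"]}
print(list_schedule(latency, succs, width=2))
```

Operations are only ever placed at the current clock after all predecessors have finished, so the result respects topological order without any backtracking.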

Page 20: Advanced Compiler Techniques

Global Scheduling

• Assume each clock can execute 2 operations of any kind.

Source code:
  if (a==0) goto L
  e = d + d
  c = b
  L:

Machine code:
  LD R6 <- 0(R1)
  nop
  BEQZ R6, L

  LD R8 <- 0(R4)
  nop
  ADD R8 <- R8,R8
  ST 0(R5) <- R8

  LD R7 <- 0(R2)
  nop
  ST 0(R3) <- R7
  L:

[Figure: control-flow graph with basic blocks B1, B2, B3]

Page 21: Advanced Compiler Techniques

Result of Code Scheduling

  LD R6 <- 0(R1) ; LD R8 <- 0(R4)
  LD R7 <- 0(R2)
  ADD R8 <- R8,R8 ; BEQZ R6, L

  ST 0(R5) <- R8

  ST 0(R5) <- R8 ; ST 0(R3) <- R7
  L:

[Figure: control-flow graph with basic blocks B1, B3’, B3]

Page 22: Advanced Compiler Techniques

Code Motions

Goal: Shorten execution time probabilistically

• Moving instructions up:
  – Move instruction to a cut set (from entry)
  – Speculation: even when not anticipated.

• Moving instructions down:
  – Move instruction to a cut set (from exit)
  – May execute extra instruction
  – Can duplicate code

[Figure: two control-flow graphs, each with the moved instruction marked "src"]

Page 23: Advanced Compiler Techniques

General-Purpose Applications

• Lots of data dependences
• Key performance factor: memory latencies
• Move memory fetches up
  – Speculative memory fetches can be expensive
• Control-intensive: get execution profile
  – Static estimation
    • Innermost loops are frequently executed: back edges are likely to be taken
    • Edges that branch to exit and exception routines are not likely to be taken
  – Dynamic profiling
    • Instrument code and measure using representative data

Page 24: Advanced Compiler Techniques

A Basic Global Scheduling Algorithm

• Schedule innermost loops first
• Only upward code motion
• No creation of copies
• Only one level of speculation

• Extra:
  – Prepass before scheduling: loop unrolling

Page 25: Advanced Compiler Techniques

Software Pipelining


Page 26: Advanced Compiler Techniques

Software Pipelining

• Obtain parallelism by executing iterations of a loop in an overlapping way.

• Try to maximize parallelism across loop iterations.

• We’ll focus on the simplest case: the do-all loop, where iterations are independent.

• Goal: Initiate iterations as frequently as possible.

• Limitation: Use the same schedule and delay for each iteration.

Page 27: Advanced Compiler Techniques

Example of DoAll Loops

• Machine:
  – Per clock: 1 read, 1 write, 1 (2-stage) arithmetic op, with hardware loop op and auto-incrementing addressing mode.

• Source code:
  For i = 1 to n
    D[i] = A[i] * B[i] + c

• Code for one iteration:
  1. LD R5,0(R1++)
  2. LD R6,0(R2++)
  3. MUL R7,R5,R6
  4.
  5. ADD R8,R7,R4
  6.
  7. ST 0(R3++),R8

• No parallelism in basic block

Page 28: Advanced Compiler Techniques

Unrolling

   1. L: LD
   2.    LD
   3.    LD
   4.    MUL LD
   5.    MUL LD
   6.    ADD LD
   7.    ADD LD
   8.    ST MUL LD
   9.    MUL
  10.    ST ADD
  11.    ADD
  12.    ST
  13.    ST BL (L)

• Let u be the degree of unrolling:
  – Length of u iterations = 7+2(u-1)
  – Execution time per source iteration = (7+2(u-1)) / u = 2 + 5/u
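The formula can be tabulated to see how quickly the per-iteration cost approaches the limit of 2 clocks. This is only an arithmetic sketch; the function name is illustrative.

```python
from fractions import Fraction

def clocks_per_iteration(u):
    """Execution time per source iteration when the loop is unrolled u times."""
    return Fraction(7 + 2 * (u - 1), u)

for u in (1, 2, 4, 8, 16):
    t = clocks_per_iteration(u)
    assert t == 2 + Fraction(5, u)   # same value as the closed form 2 + 5/u
    print(u, float(t))
```

The cost falls from 7 clocks at u = 1 to 3.25 at u = 4, but it only reaches 2 in the limit, while code size grows linearly with u.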

Page 29: Advanced Compiler Techniques

Software Pipelined Code

   1. LD
   2. LD
   3. MUL LD
   4. LD
   5. MUL LD
   6. ADD LD
   7. MUL LD
   8. ST ADD LD
   9. MUL LD
  10. ST ADD LD
  11. MUL
  12. ST ADD
  13.
  14. ST ADD
  15.
  16. ST

• Unlike unrolling, software pipelining can give optimal result.
• Locally compacted code may not be globally optimal
• DOALL: Can fill arbitrarily long pipelines with infinitely many iterations

Page 30: Advanced Compiler Techniques

Example of DoAcross Loop

Loop:
  Sum = Sum + A[i];
  B[i] = A[i] * c;

Code for one iteration:
  1. LD
  2. MUL
  3. ADD
  4. ST

Software Pipelined Code:
  1. LD
  2. MUL
  3. ADD LD
  4. ST MUL
  5. ADD
  6. ST

• Doacross loops
  – Recurrences can be parallelized
  – Harder to fully utilize hardware with large degrees of parallelism

Page 31: Advanced Compiler Techniques

Problem Formulation

Goals:
• maximize throughput
• small code size

Find:
• an identical relative schedule S(n) for every iteration
• a constant initiation interval (T)
such that
• the initiation interval is minimized

Complexity:
• NP-complete in general

Schedule S with T=2:
  0  LD
  1  MUL
  2  ADD  LD
  3  ST   MUL
          ADD
          ST

Page 32: Advanced Compiler Techniques

Scheduling Constraints: Resource

• RT: resource reservation table for single iteration
• RTs: modulo resource reservation table

$$RT_s[i] = \sum_{t \mid (t \bmod T = i)} RT[t]$$

[Figure: iterations 1 to 4, each using LD, Alu, ST, started T=2 clocks apart along the time axis; the steady state repeats every T=2 clocks]
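The definition of RTs can be sketched as folding the single-iteration table onto T rows. The four-clock iteration below (LD, then two ALU operations, then ST) and the one-unit-per-resource capacities are assumptions chosen for illustration.

```python
def modulo_table(rt, T):
    """Fold a single-iteration reservation table into T rows:
    RTs[i] is the sum of RT[t] over all t with t mod T == i."""
    rts = [dict() for _ in range(T)]
    for t, row in enumerate(rt):
        for res, used in row.items():
            rts[t % T][res] = rts[t % T].get(res, 0) + used
    return rts

def fits(rts, capacity):
    """True if no row asks for more units of a resource than the machine has."""
    return all(used <= capacity[res] for row in rts for res, used in row.items())

# Hypothetical single-iteration table: one row per clock.
rt = [{"LD": 1}, {"ALU": 1}, {"ALU": 1}, {"ST": 1}]
capacity = {"LD": 1, "ALU": 1, "ST": 1}

print(modulo_table(rt, 2), fits(modulo_table(rt, 2), capacity))
print(modulo_table(rt, 1), fits(modulo_table(rt, 1), capacity))
```

With T = 2 every row stays within capacity; with T = 1 the two ALU uses collide in the same row, so that initiation interval is infeasible on this assumed machine.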

Page 33: Advanced Compiler Techniques

Is It Possible to Have an Iteration Start at Every Clock?

• Hint: No.
• Why?
• An iteration injects 2 MEM and 2 ALU resource requirements.
  – If injected every clock, the machine cannot possibly satisfy all requests.
• Minimum delay = 2.
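The argument is a counting bound: each resource must serve all of one iteration's requests within one initiation interval. A small sketch, assuming the machine has one MEM unit and one ALU (the slide does not state the unit counts):

```python
def resource_bound(uses, units):
    """Smallest initiation interval allowed by resources alone:
    max over resources of ceil(uses / units)."""
    return max(-(-uses[r] // units[r]) for r in uses)

uses = {"MEM": 2, "ALU": 2}    # requests injected by one iteration
units = {"MEM": 1, "ALU": 1}   # assumed machine: one unit of each
print(resource_bound(uses, units))
```

Under that assumption the bound is 2, matching the minimum delay stated on the slide.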

Page 34: Advanced Compiler Techniques


Assigning Registers

• We don’t need an infinite number of registers.

• We can reuse registers for iterations that do not overlap in time.

• But we can’t just use the same old registers for every iteration.


Page 35: Advanced Compiler Techniques

Assigning Registers --- (2)

• The inner loop may have to involve more than one copy of the smallest repeating pattern.
  – Enough so that registers may be reused at each iteration of the expanded inner loop.

• Our example: 3 iterations coexist, so we need 3 sets of registers and 3 copies of the pattern.

Page 36: Advanced Compiler Techniques

Example: Assigning Registers

• Our original loop used registers:
  – r9 to hold a constant 4N.
  – r8 to count iterations and index the arrays.
  – r1 to copy a[i] into b[i].

• The expanded loop needs:
  – r9 holds 12N.
  – r6, r7, r8 to count iterations and index.
  – r1, r2, r3 to copy certain array elements.

Page 37: Advanced Compiler Techniques

The Loop Body

L:  ADD r8,r8,#12    nop              LD r3,a(r6)
    BGE r8,r9,L’     ST b(r7),r2      nop
    LD r1,a(r8)      ADD r7,r7,#12    nop
    nop              BGE r7,r9,L’’    ST b(r6),r3
    nop              LD r2,a(r7)      ADD r6,r6,#12
    ST b(r8),r1      nop              BLT r6,r9,L

[Figure annotations: operations are labelled by iteration (i, i+1, i+2, i+3, i+4); the BGE instructions serve to break the loop early]

• Each register handles every third element of the arrays.
• L’ and L’’ are places for appropriate codas.

Page 38: Advanced Compiler Techniques

Cyclic Data Dependence Graph

• We assumed that data at an iteration depends only on data computed at the same iteration.
  – Not even true for our example.
    • r8 computed from its previous iteration.
    • But it doesn’t matter in this example.

• Fixup: edge labels have two components: (iteration change, delay).

Page 39: Advanced Compiler Techniques

Cyclic Data Dependence Graph

• Label edges with <δ, d>
  – δ = iteration difference, d = delay
  – δ × T + S(n2) – S(n1) ≥ d

Nodes:
  (A) LD r1,a(r8)
  (B) ST b(r8),r1
  (C) ADD r8,r8,#4
  (D) BLT r8,r9,L

Edges:
  (A) → (B) <0,2>
  (B) → (C) <0,1>
  (C) → (A) <1,1>
  (C) → (D) <0,1>

• (C) must wait at least one clock after the (B) from the same iteration.
• (A) must wait at least one clock after the (C) from the previous iteration.

Page 40: Advanced Compiler Techniques

Matrix of Delays

• Let T be the delay between the start times of one iteration and the next.

• Replace edge label <i,j> by delay j-iT.

• Compute, for each pair of nodes n and m, the total delay along the longest acyclic path from n to m.

• Gives upper and lower bounds relating the times to schedule n and m.

Page 41: Advanced Compiler Techniques

Example: Delay Matrix

Edges:
        A      B      C      D
  A            2
  B                   1
  C     1-T                  1
  D

Acyclic Transitive Closure:
        A      B      C      D
  A            2      3      4
  B     2-T           1      2
  C     1-T    3-T           1
  D

S(B) ≥ S(A)+2
S(A) ≥ S(B)+2-T

S(B)-2 ≥ S(A) ≥ S(B)+2-T

Note: Implies T ≥ 4 (because only one register used for loop-counting). If T=4, then A (LD) must be 2 clocks before B (ST). If T=5, A can be 2-3 clocks before B.
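The bound T ≥ 4 can be reproduced mechanically: with each edge <δ, d> weighted d - δT, a schedule can exist only if no cycle has positive total weight. The sketch below searches for the smallest such T using a Floyd-Warshall style longest-path pass; the function names and the search limit `max_T` are illustrative assumptions.

```python
NEG_INF = float("-inf")

def has_positive_cycle(nodes, edges, T):
    """edges: (src, dst) -> (delta, d). Edge weight is d - delta*T."""
    dist = {u: {v: NEG_INF for v in nodes} for u in nodes}
    for (u, v), (delta, d) in edges.items():
        dist[u][v] = max(dist[u][v], d - delta * T)
    for k in nodes:                       # longest paths through intermediate k
        for u in nodes:
            for v in nodes:
                if dist[u][k] + dist[k][v] > dist[u][v]:
                    dist[u][v] = dist[u][k] + dist[k][v]
    return any(dist[v][v] > 0 for v in nodes)

def recurrence_bound(nodes, edges, max_T=64):
    """Smallest T for which every dependence cycle can be satisfied."""
    for T in range(1, max_T + 1):
        if not has_positive_cycle(nodes, edges, T):
            return T
    return None

nodes = ["A", "B", "C", "D"]
edges = {("A", "B"): (0, 2), ("B", "C"): (0, 1),
         ("C", "A"): (1, 1), ("C", "D"): (0, 1)}
print(recurrence_bound(nodes, edges))
```

The only cycle is A → B → C → A with total delay 4 and iteration difference 1, so the search stops at T = 4.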

Page 42: Advanced Compiler Techniques

Iterative Modulo Scheduling

• Compute the lower bounds (MII) on the delay between the start times of one iteration and the next (initiation interval, aka II)
  – due to resources
  – due to recurrences

• Try to find a schedule for II = MII
• If no schedule can be found, try a larger II.

Page 43: Advanced Compiler Techniques

The Algorithm

1. Choose an initiation interval, ii
  > Compute lower bounds on ii
  > Shorter ii means faster overall execution

2. Generate a loop body that takes ii cycles
  > Try to schedule into ii cycles, using modulo scheduler
  > If it fails, bump ii by one and try again

3. Generate the needed prologue & epilogue code
  > For prologue, work backward from upward exposed uses in the scheduled loop body
  > For epilogue, work forward from downward exposed definitions in the scheduled loop body

M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988, pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
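Steps 1 and 2 can be sketched as a retry loop around a greedy modulo scheduler. This is a simplified illustration, not the algorithm from the cited paper: it never evicts or backtracks, it expects operations in a topological order of the same-iteration dependences, and the toy loop body (a load, a pointer increment, a branch, and an add, with a load latency of 3) is an assumption.

```python
def modulo_schedule(ops, deps, units, ii):
    """ops: name -> resource, in topological order of same-iteration deps.
    deps: (src, dst, delta, delay) tuples. Returns name -> start clock, or None."""
    table = [dict() for _ in range(ii)]          # modulo reservation table
    start = {}
    for op, res in ops.items():
        earliest = 0
        for src, dst, delta, delay in deps:
            if dst == op and src in start:
                earliest = max(earliest, start[src] + delay - delta * ii)
        for t in range(earliest, earliest + ii):  # each modulo slot tried once
            if table[t % ii].get(res, 0) < units[res]:
                table[t % ii][res] = table[t % ii].get(res, 0) + 1
                start[op] = t
                break
        else:
            return None                           # no free slot at this ii
    # Check every edge, including loop-carried ones skipped during placement.
    for src, dst, delta, delay in deps:
        if delta * ii + start[dst] - start[src] < delay:
            return None
    return start

def iterative_modulo_schedule(ops, deps, units, mii, max_ii=32):
    for ii in range(mii, max_ii + 1):             # bump ii by one on failure
        schedule = modulo_schedule(ops, deps, units, ii)
        if schedule is not None:
            return ii, schedule
    return None

ops = {"ld": "fetch", "inc": "integer", "br": "branch", "add": "integer"}
deps = [("ld", "add", 0, 3), ("inc", "br", 0, 1),
        ("inc", "ld", 1, 1), ("inc", "inc", 1, 1), ("add", "add", 1, 1)]
units = {"fetch": 1, "integer": 1, "branch": 1}
print(iterative_modulo_schedule(ops, deps, units, mii=1))
```

Starting from ii = 1 the two integer operations collide, so the driver retries and succeeds at ii = 2, with the add placed three clocks after the load it depends on.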

Page 44: Advanced Compiler Techniques

A simple sum reduction loop

Source code:
  c = 0
  for i = 1 to n
    c = c + a[i]

LLIR code:
        rc ← 0
        r@a ← @a
        r1 ← n x 4
        rub ← r1 + r@a
        if r@a > rub goto Exit
  Loop: ra ← MEM(r@a)
        rc ← rc + ra
        r@a ← r@a + 4
        if r@a ≤ rub goto Loop
  Exit: c ← rc

Page 45: Advanced Compiler Techniques

OOO Execution

[Figure: the loop’s steady state behavior]

Page 46: Advanced Compiler Techniques

OOO Execution

[Figure: the loop’s prologue and the loop’s epilogue]

Page 47: Advanced Compiler Techniques

Implementing the Concept

General schema for the loop

Prologue:
  ra ← MEM(r@a)
  r@a ← r@a + 4
  if r@a > rub goto Exit

Kernel:
  ra ← MEM(r@a)
  r@a ← r@a + 4
  rc ← rc + ra
  if r@a ≤ rub goto Loop

Epilogue:
  r@a ← r@a + 4
  rc ← rc + ra

[Figure: dependence edges between the operations, labelled with latencies of 1 or 3]

The actual schedule must respect both the data dependences and the operation latencies.

Page 48: Advanced Compiler Techniques

The Example

The Code:
   1.       rc ← 0
   2.       r@a ← @a
   3.       r1 ← n x 4
   4.       rub ← r1 + r@a
   5.       if r@a > rub goto Exit
   6. Loop: ra ← MEM(r@a)
   7.       rc ← rc + ra
   8.       r@a ← r@a + 4
   9.       if r@a ≤ rub goto Loop
  10. Exit: c ← rc

Its Dependence Graph:
[Figure: dependence graph over operations 1 to 9]

• Focus on the loop body (operations 6, 7, 8, 9)
• Op 6 is not involved in a cycle

Page 49: Advanced Compiler Techniques

The Example

Focus on the loop body

Template for the Modulo Schedule (ii = 2):

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:

[Figure: dependence graph of the loop body, operations 6, 7, 8, 9]

Page 50: Advanced Compiler Techniques

The Example

Focus on the loop body
• Schedule 6 on the fetch unit

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 51: Advanced Compiler Techniques

The Example

Focus on the loop body
• Schedule 6 on the fetch unit
• Schedule 8 on the integer unit

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 52: Advanced Compiler Techniques

The Example

Focus on the loop body
• Schedule 6 on the fetch unit
• Schedule 8 on the integer unit
• Advance the scheduler’s clock

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 53: Advanced Compiler Techniques

The Example

Focus on the loop body
• Schedule 6 on the fetch unit
• Schedule 8 on the integer unit
• Advance the scheduler’s clock
• Schedule 9 on the branch unit

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8
                                      9

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 54: Advanced Compiler Techniques

The Example

Focus on the loop body
• Schedule 6 on the fetch unit
• Schedule 8 on the integer unit
• Advance the scheduler’s clock
• Schedule 9 on the branch unit
• Advance the clock (modulo ii)

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8
                                      9

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 55: Advanced Compiler Techniques

The Example

Focus on the loop body
• …
• Advance the scheduler’s clock
• Schedule 9 on the branch unit
• Advance the clock (modulo ii)
• Advance the clock again

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8
                                      9

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 56: Advanced Compiler Techniques

The Example

Focus on the loop body
• …
• Schedule 9 on the branch unit
• Advance the clock (modulo ii)
• Advance the clock again
• Schedule 7 on the integer unit

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8
                       7              9

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 57: Advanced Compiler Techniques

The Example

Focus on the loop body
• …
• Advance the clock (modulo ii)
• Advance the clock again
• Schedule 7 on the integer unit
• No unscheduled ops remain in loop body

The final schedule for the loop’s body:

  Label   Fetch Unit   Integer Unit   Branch Unit
  Loop:   6            8
                       7              9

[Figure: dependence graph of operations 6, 7, 8, 9 with latencies 3 and 1; simulated clock marker]

Page 58: Advanced Compiler Techniques

The Example: Final Schedule

  Label   Fetch Unit       Integer Unit       Branch Unit
          nop              r@a ← @a           nop
          nop              r1 ← n x 4         nop
          nop              rub ← r1 + r@a     nop
          ra ← MEM(r@a)    r@a ← r@a + 4      nop
          nop              rc ← 0             if r@a > rub goto Exit
  Loop:   ra ← MEM(r@a)    r@a ← r@a + 4      nop
          nop              rc ← rc + ra       if r@a ≤ rub goto Loop
  Exit:   nop              nop                nop
          nop              rc ← rc + ra       nop

What about the empty slots? Fill them (if needed) in some other way (e.g., fuse loop with another loop that is memory bound?)

Page 59: Advanced Compiler Techniques

Global Software Pipelining

• Conditional Execution
  – Control Dependence -> Data Dependence
  – Hardware Support: FU, operation field, extra pipe stages

• Conditional Composition
  – Code Inflation
  – Complicated scheduling

• Branch Prediction
  – Dynamic vs. Static

Page 60: Advanced Compiler Techniques

Summary

• Eliminate most dependences, improve ILP
• Not much code inflation
• Register Pressure

[Figure: software pipelining compared with loop unrolling]

Page 61: Advanced Compiler Techniques

Data Prefetching


Page 62: Advanced Compiler Techniques

Why Data Prefetching

• Increasing Processor – Memory “distance”

• Caches do work !!! … IF …
  – Data set cache-able, accesses local (in space/time)
• Else ? …

Page 63: Advanced Compiler Techniques

Data Prefetching

• What is it ?
  – Request for a future data need is initiated
  – Useful execution continues during access
  – Data moves from slow/far memory to fast/near cache
  – Data ready in cache when needed (load/store)

Page 64: Advanced Compiler Techniques

Data Prefetching

• When can it be used ?
  – Future data needs are (somewhat) predictable
• How is it implemented ?
  – in hardware: history based prediction of future access
  – in software: compiler inserted prefetch instructions

Page 65: Advanced Compiler Techniques

Software Data Prefetching

• Compiler scheduled prefetches
• Moves entire cache lines (not just datum)
  – Spatial locality assumed – often the case
• Typically a non-faulting access
  – Compiler free to speculate prefetch address
• Hardware not obligated to obey
  – A performance enhancement, no functional impact
  – Loads/store may be preferentially treated

Page 66: Advanced Compiler Techniques

Software Data Prefetching Use

• Mostly in Scientific codes
  – Vectorizable loops accessing arrays deterministically
    • Data access pattern is predictable
    • Prefetch scheduling easy (far in time, near in code)
  – Large working data sets consumed
    • Even large caches unable to capture access locality
• Sometimes in Integer codes
  – Loops with pointer de-references

Page 67: Advanced Compiler Techniques

Selective Data Prefetch

do j = 1, n
  do i = 1, m
    A(i,j) = B(1,i) + B(1,i+1)
  enddo
enddo

E.g. A(i,j) has spatial locality, therefore only one prefetch is required for every cache line.

[Figure: access patterns in A (axes i and j) and in B (row 1, columns 1 to m)]

Page 68: Advanced Compiler Techniques

Formal Definitions

• Temporal locality occurs when a given reference reuses exactly the same data location

• Spatial locality occurs when a given reference accesses different data locations that fall within the same cache line

• Group locality occurs when different references access the same cache line

Page 69: Advanced Compiler Techniques

Prefetch Predicates

• If an access has spatial locality, only the first access to the same cache line will incur a miss.

• For temporal locality, only the first access will incur a cache miss.

• If an access has group locality, only the leading reference incurs a cache miss.

• If an access has no locality, it will miss in every iteration.

Page 70: Advanced Compiler Techniques

Example Code with Prefetches

do j = 1, n
  do i = 1, m
    A(i,j) = B(1,i) + B(1,i+1)
    if (iand(i,7) == 0) prefetch (A(i+k,j))
    if (j == 1) prefetch (B(1,i+t))
  enddo
enddo

[Figure: access patterns in A (axes i and j) and in B (row 1, columns 1 to m)]

Assumed CLS = 64 bytes and data size = 8 bytes
k and t are prefetch distance values
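With 64-byte lines and 8-byte elements there are 8 elements per line, which is why the A prefetch is guarded by iand(i,7) == 0. The sketch below only counts how many prefetches the two predicates let through; the function name and the sizes n = 4, m = 64 are arbitrary assumptions.

```python
def prefetches_issued(n, m):
    """Count prefetches issued for A and B under the two predicates."""
    a_count = b_count = 0
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if i & 7 == 0:     # spatial locality: one prefetch per cache line of A
                a_count += 1
            if j == 1:         # temporal locality: B is only fetched on the first pass
                b_count += 1
    return a_count, b_count

n, m = 4, 64
print(prefetches_issued(n, m), "versus", n * m, "per array without predicates")
```

For these sizes the predicates cut A's prefetches to one eighth and B's to one quarter of the unconditional count.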

Page 71: Advanced Compiler Techniques

Spreading of Prefetches

• If there is more than one reference that has spatial locality within the same loop nest, spread these prefetches across the 8-iteration window

• Reduces the stress on the memory subsystem by minimizing the number of outstanding prefetches

Page 72: Advanced Compiler Techniques

Example Code with Spreading

do j = 1, n
  do i = 1, m
    C(i,j) = D(i-1,j) + D(i+1,j)
    if (iand(i,7) == 0) prefetch (C(i+k,j))
    if (iand(i,7) == 1) prefetch (D(i+k+1,j))
  enddo
enddo

[Figure: access pattern in C and D (axes i and j)]

Assumed CLS = 64 bytes and data size = 8 bytes
k is the prefetch distance value

Page 73: Advanced Compiler Techniques

Prefetch Strategy - Conditional

Example loop:
  L:
    Load A(I)
    Load B(I)
    ...
    I = I + 1
    Br L, if I<n

Conditional Prefetching:
  L:
    Load A(I)
    Load B(I)
    Cmp pA=(I mod 8 == 0)
    if(pA) prefetch A(I+X)
    Cmp pB=(I mod 8 == 1)
    if(pB) prefetch B(I+X)
    ...
    I = I + 1
    Br L, if I<n

• Code for condition generation
• Prefetches occupy issue slots

Page 74: Advanced Compiler Techniques

Prefetch Strategy - Unroll

Example loop:
  L:
    Load A(I)
    Load B(I)
    ...
    I = I + 1
    Br L, if I<n

Unrolled:
  Unr_Loop:
    prefetch A(I+X)
    load A(I)
    load B(I)
    ...
    prefetch B(I+X)
    load A(I+1)
    load B(I+1)
    ...
    prefetch C(I+X)
    load A(I+2)
    load B(I+2)
    ...
    prefetch D(I+X)
    load A(I+3)
    load B(I+3)
    ...
    prefetch E(I+X)
    load A(I+4)
    load B(I+4)
    ...
    load A(I+5)
    load B(I+5)
    ...
    load A(I+6)
    load B(I+6)
    ...
    load A(I+7)
    load B(I+7)
    ...
    I = I + 8
    Br Unr_Loop, if I<n

• Code bloat (>8X)
• Remainder loop

Page 75: Advanced Compiler Techniques

Software Data Prefetching Cost

• Requires memory instruction resources
  – A prefetch instruction for each access stream
• Issues every iteration, but needed less often
  – If branched around, inefficient execution results
  – If conditionally executed, more instruction overhead results
  – If loop is unrolled, code bloat results

Page 76: Advanced Compiler Techniques

Software Data Prefetching Cost

• Redundant prefetches get in the way
  – Resources consumed until prefetches discarded!
• Non-redundant prefetches need careful scheduling
  – Resources overwhelmed when many issued & miss
• Redundant prefetches increase power/energy consumption

Page 77: Advanced Compiler Techniques

Prefetch Execution Diagram

Page 78: Advanced Compiler Techniques

Measurements

Page 79: Advanced Compiler Techniques

Data Re-placement/Layout

for(i=0; i<12; i++)
  A[i] = B[i+3]+B[i];

• Traditional Analysis: max=27 words
• Detailed Analysis: max=15 words

[Figure: arrays A[] and B[] in Main Memory and their placement in Cache Memory (16 words), where A overlays B[new]; axes are #Words and i]

Page 80: Advanced Compiler Techniques

Summary

• Today
  – Data dependences
  – Loop transformation
  – Software pipelining
  – Software prefetching
  – Data replacement

• Next
  – Compiler for Modern Architectures
  – Future Compilers