
Page 1: Advanced Compiler Techniques

Advanced Compiler Techniques

LIU Xianhua

School of EECS, Peking University

Compiler for Modern Architectures

Page 2: Advanced Compiler Techniques


Outline

• Introduction & Basic Concepts
• Superscalar & VLIW
• Accelerator-Based Computing
• Compiler for Embedded Systems
• Compiler Optimization for Many-Core


Page 3: Advanced Compiler Techniques

Features of Machine Architectures

• Pipelining
• Multiple execution units
  – pipelined
• Vector operations
• Parallel processing
  – Shared memory, distributed memory, message-passing
• VLIW and superscalar instruction issue
• Registers
• Cache hierarchy
• Combinations of the above
  – Parallel-vector machines


Page 4: Advanced Compiler Techniques

Moore’s Law


Page 5: Advanced Compiler Techniques

Instruction Pipelining

• Instruction pipelining
  – DLX instruction pipeline
  – What is the performance challenge?

[Figure: three instructions flowing through the five-stage DLX pipeline (IF, ID, EX, MEM, WB) across cycles c1–c7]


Page 6: Advanced Compiler Techniques

Replicated Execution Logic

• Pipelined Execution Units

• Multiple Execution Units

[Figure: a pipelined floating-point adder with stages Fetch Operands (FO), Equate Exponents (EE), Add Mantissas (AM), and Normalize Result (NR) processing b1+c1, b2+c2, …; and four parallel adders each computing one bi + ci]

What is the performance challenge?


Page 7: Advanced Compiler Techniques

Vector Operations

• Apply the same operation to different positions of one or more arrays
  – Goal: keep the pipelines of the execution units full

• Example:
    VLOAD  V1,A
    VLOAD  V2,B
    VADD   V3,V1,V2
    VSTORE V3,C

How do we specify vector operations in Fortran 77?

    DO I = 1, 64
      C(I) = A(I) + B(I)
      D(I+1) = D(I) + C(I)
    ENDDO


Page 8: Advanced Compiler Techniques

Superscalar & VLIW

• Multiple instruction issue on the same cycle
  – Wide-word instruction (VLIW) or superscalar
  – Usually one instruction slot per functional unit
• What are the performance challenges?
  – Finding enough parallel instructions
  – Avoiding interlocks
• Scheduling instructions early enough


Page 9: Advanced Compiler Techniques

SMP Parallelism

• Multiple processors with uniform shared memory
  – Task parallelism
    • Independent tasks
  – Data parallelism
    • The same task on different data
• What is the performance challenge?

[Figure: processors p1–p4 connected to a shared memory over a bus]


Page 10: Advanced Compiler Techniques

Bernstein’s Conditions

• When is it safe to run two tasks R1 and R2 in parallel?
  – If none of the following holds:
    • R1 writes into a memory location that R2 reads
    • R2 writes into a memory location that R1 reads
    • Both R1 and R2 write to the same memory location
• How can we convert this to loop parallelism?
  – Think of loop iterations as tasks, as in the sketch below
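A minimal C sketch (not from the slides) of checking Bernstein's conditions on loop iterations viewed as tasks:

    /* Iteration i writes c[i] and reads a[i] and b[i]. No iteration
       writes a location that another iteration reads or writes, so by
       Bernstein's conditions all iterations may run in parallel. */
    void vadd(double *c, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* task R_i: write c[i], read a[i], b[i] */
    }

    /* By contrast, d[i+1] = d[i] + c[i] fails the conditions:
       iteration i writes d[i+1], which iteration i+1 reads. */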


Page 11: Advanced Compiler Techniques

Memory Hierarchy

• Problem: memory is moving farther away in processor cycles
  – Latency and bandwidth difficulties
• Solution
  – Reuse data in cache and registers
• Challenge: How can we enhance reuse?
  – Coloring-based register allocation works well, but only for scalars:

      DO I = 1, N
        DO J = 1, N
          C(I) = C(I) + A(J)

  – Strip mining to reuse data from cache (see the sketch below)
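A minimal C sketch of strip mining the loop above (the strip size S is an assumed cache-sized tuning constant, not from the slides):

    /* Strip-mine the J loop so a strip of A stays in cache while
       every I iteration reuses it. Caller zero-initializes c[]. */
    #define S 1024
    void rowsum(double *c, const double *a, int n) {
        for (int jj = 0; jj < n; jj += S)          /* loop over strips of A */
            for (int i = 0; i < n; i++)            /* reuse the current strip */
                for (int j = jj; j < jj + S && j < n; j++)
                    c[i] += a[j];
    }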

Page 12: Advanced Compiler Techniques

Distributed Memory

• Memory packaged with processors
  – Message passing
  – Distributed shared memory
• SMP clusters
  – Shared memory on node, message passing off node
• What are the performance issues?
  – Minimizing communication
    • Data placement
  – Optimizing communication
    • Aggregation
    • Overlap of communication and computation


Page 13: Advanced Compiler Techniques

Compiler Technologies

• Program transformations
  – Most of these architectural issues can be dealt with by restructuring transformations that can be reflected in source
    • Vectorization, parallelization, cache-reuse enhancement
  – Challenges:
    • Determining when transformations are legal
    • Selecting transformations based on profitability
• Low-level code generation
  – Some issues must be dealt with at a low level
    • Prefetch insertion
    • Instruction scheduling
• All require some understanding of the ways that instructions and statements depend on one another (share data)


Page 14: Advanced Compiler Techniques

A Common Problem: Matrix Multiply

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO


Page 15: Advanced Compiler Techniques

Problem for Pipelines

• The inner loop of matrix multiply is a reduction
• Solution:
  – Work on several iterations of the J loop simultaneously

[Figure: the K loop forms a serial chain C(1,1) + A(1,1)*B(1,1) + A(1,2)*B(2,1) + …; interleaving C(1,1), C(2,1), C(3,1), and C(4,1) from four J iterations keeps the adder pipeline full]


Page 16: Advanced Compiler Techniques

MatMult for a Pipelined Machine

DO I = 1, N
  DO J = 1, N, 4
    C(J,I)   = 0.0  !Register 1
    C(J+1,I) = 0.0  !Register 2
    C(J+2,I) = 0.0  !Register 3
    C(J+3,I) = 0.0  !Register 4
    DO K = 1, N
      C(J,I)   = C(J,I)   + A(J,K)   * B(K,I)
      C(J+1,I) = C(J+1,I) + A(J+1,K) * B(K,I)
      C(J+2,I) = C(J+2,I) + A(J+2,K) * B(K,I)
      C(J+3,I) = C(J+3,I) + A(J+3,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

Page 17: Advanced Compiler Techniques

Matrix Multiply on Vector Machines

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO


Page 18: Advanced Compiler Techniques

Problems for Vectors

• The inner loop must be vector
  – And should be stride 1
• Vector registers have finite length (Cray: 64 elements)
  – We would like to reuse a vector register in the compute loop
• Solution
  – Strip-mine the loop over the stride-one dimension to 64
  – Move the iterate-over-strip loop to the innermost position
    • Vectorize it there


Page 19: Advanced Compiler Techniques

Vectorizing Matrix Multiply

DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
      DO K = 1, N
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO


Page 20: Advanced Compiler Techniques

Vectorizing Matrix Multiply

DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
    ENDDO
    DO K = 1, N
      DO JJ = 0, 63
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO

Page 21: Advanced Compiler Techniques

MatMult for a Vector Machine

DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO


Page 22: Advanced Compiler Techniques

Matrix Multiply on Parallel SMPs

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO


Page 23: Advanced Compiler Techniques

Matrix Multiply on Parallel SMPs

DO I = 1, N   ! Independent for all I
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO


Page 24: Advanced Compiler Techniques

Problems on a Parallel Machine

• Parallelism must be found at the outer loop level
  – But how do we know it is safe?
• Solution
  – Bernstein's conditions
    • Can we apply them to loop iterations?
    • Yes, with dependence
  – Statement S2 depends on statement S1 if
    • S2 comes after S1, and
    • S2 must come after S1 in any correct reordering of statements
  – Usually keyed to memory: there is a path from S1 to S2, and
    • S1 writes and S2 reads the same location, or
    • S1 reads and S2 writes the same location, or
    • S1 and S2 both write the same location
    (each case is illustrated in the sketch below)
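A small C illustration (not from the slides) of the three memory-keyed cases:

    void cases(double a, double b) {
        double x, y, z;
        /* True (flow) dependence: S1 writes x, S2 reads x. */
        x = a + b;        /* S1 */
        y = x + 1.0;      /* S2 must see S1's value of x */
        /* Antidependence: S1 reads x, S2 writes x. */
        z = x + 1.0;      /* S1 */
        x = a;            /* S2 must not overwrite x before S1 reads it */
        /* Output dependence: S1 and S2 both write x. */
        x = a;            /* S1 */
        x = b;            /* S2's write must land last */
        (void)x; (void)y; (void)z;
    }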


Page 25: Advanced Compiler Techniques

MatMult on a Shared-Memory MP

PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO
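For reference, a minimal C/OpenMP sketch of the same parallelization (a modern equivalent of PARALLEL DO; not from the slides, and note that C's row-major layout swaps the index order):

    #include <omp.h>

    void matmult(int n, double C[n][n], double A[n][n], double B[n][n]) {
        #pragma omp parallel for        /* outer iterations are independent */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }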


Page 26: Advanced Compiler Techniques

MatMult on a Vector SMP

PARALLEL DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO


Page 27: Advanced Compiler Techniques

Matrix Multiply for Cache Reuse

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO


Page 28: Advanced Compiler Techniques

Problems on Cache

• There is reuse of C but no reuse of A and B
• Solution
  – Block the loops so you get reuse of both A and B
    • Multiply a block of A by a block of B and add the result to a block of C
  – When is it legal to interchange the iterate-over-blocks loops to the inside?


Page 29: Advanced Compiler Techniques

MatMult on a Uniprocessor with Cache

DO I = 1, N, S
  DO J = 1, N, S
    DO p = I, I+S-1
      DO q = J, J+S-1
        C(q,p) = 0.0
      ENDDO
    ENDDO
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

(Each block of A and B holds S×T elements; each block of C holds S² elements.)


Page 30: Advanced Compiler Techniques

MatMult on a Distributed-Memory MP

PARALLEL DO I = 1, N
  PARALLEL DO J = 1, N
    C(J,I) = 0.0
  ENDDO
ENDDO
PARALLEL DO I = 1, N, S
  PARALLEL DO J = 1, N, S
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO


Page 31: Advanced Compiler Techniques

Compiler Optimization for Superscalar & VLIW

Page 32: Advanced Compiler Techniques

What is Superscalar?

• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.

• Equally applicable to RISC & CISC, but more straightforward in RISC machines.

• The order of execution is usually assisted by the compiler.

A Superscalar machine executes multiple independent instructions in parallel. They are pipelined as well.


Page 33: Advanced Compiler Techniques

Superscalar vs. Superpipeline


Page 34: Advanced Compiler Techniques

Limitations of Superscalar

• Dependent upon:
  – Instruction-level parallelism available
  – Compiler-based optimization
  – Hardware support
• Limited by:
  – Data dependency
  – Procedural dependency
  – Resource conflicts


Page 35: Advanced Compiler Techniques

Effect of Dependencies on Superscalar Operation

Notes:
1) Superscalar operation is doubly impacted by a stall: a dependence can idle multiple issue slots at once.
2) CISC machines typically have variable-length instructions that must be at least partially decoded before the next can be fetched – not good for superscalar operation.

Page 36: Advanced Compiler Techniques

In-Order Issue – In-Order Completion (Example)

Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit


Page 37: Advanced Compiler Techniques

In-Order Issue – Out-of-Order Completion (Example)

How does this affect interrupts?

Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit


Page 38: Advanced Compiler Techniques

Out-of-Order Issue – Out-of-Order Completion (Example)

Note: I5 depends upon I4, but I6 does not

Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit


Page 39: Advanced Compiler Techniques

Speedups of Machine Organizations (Without Procedural Dependencies)

• Not worth duplicating functional units without register renaming
• Need an instruction window large enough (more than 8, probably not more than 32)

Page 40: Advanced Compiler Techniques

View of Superscalar Execution


Page 41: Advanced Compiler Techniques

Example: Pentium 4 – A Superscalar CISC Machine


Page 42: Advanced Compiler Techniques

Pentium 4 Alternative View


Page 43: Advanced Compiler Techniques

Pentium 4 pipeline

20 stages!


Page 44: Advanced Compiler Techniques

VLIW Approach

• Superscalar hardware is tough
  – E.g., for a two-issue superscalar: every cycle we examine the opcodes of two instructions and six register specifiers, and we dynamically determine whether one or two instructions can issue and dispatch them to the appropriate functional units (FUs)
• VLIW approach
  – Packages multiple independent instructions into one very long instruction
  – Burden of finding independent instructions is on the compiler
  – No dependences within an issue package, or at least the package indicates when a dependence is present
  – Simpler hardware


Page 45: Advanced Compiler Techniques

Superscalar vs. VLIW


Page 46: Advanced Compiler Techniques

Superscalar vs. VLIW


Page 47: Advanced Compiler Techniques

Role of the Compiler for Superscalar and VLIW


Page 48: Advanced Compiler Techniques

What is EPIC?

• Explicitly Parallel Instruction-set Computing – a new hardware architecture paradigm that is the successor to VLIW (Very Long Instruction Word).
• EPIC features:
  – Parallel instruction encoding
  – Flexible instruction grouping
  – Large register file
  – Predicated instruction set
  – Control and data speculation
  – Compiler control of the memory hierarchy


Page 49: Advanced Compiler Techniques

History

• A timeline of architectures:

  Hard-wired discrete logic → Microprocessor (advent of VLSI) → CISC (microcode) → RISC (microcode bad!) → VLIW → Superpipelined / Superscalar → EPIC

• Hennessy and Patterson were early RISC proponents.
• VLIW (very long instruction word) architectures, 1981:
  – Alan Charlesworth (Floating Point Systems)
  – Josh Fisher (Yale / Multiflow – Multiflow Trace machines)
  – Bob Rau (TRW / Cydrome – Cydra-5 machine)


Page 50: Advanced Compiler Techniques

History (cont.)

• Superscalar processors
  – Tilak Agerwala and John Cocke of IBM coined the term
  – First microprocessor (Intel i960CA) and workstation (IBM RS/6000) introduced in 1989
  – Intel Pentium in 1993
• Foundations of EPIC – HP
  – The HP FAST (Fine-grained Architecture and Software Technologies) research project in 1989 included Bob Rau; this work later developed into the PlayDoh architecture.
  – In 1990 Bill Worley started the PA-Wide Word project (PA-WW). Josh Fisher, also hired by HP, made contributions to these projects.
  – HP teamed with Intel in 1994 – announced EPIC and the IA-64 architecture in 1997.
  – The first implementation was Merced / Itanium, announced in 1999.


Page 51: Advanced Compiler Techniques

IA-64 / Itanium

• IA-64 is not a complete architecture – it is more a set of high-level requirements for an architecture and instruction set, co-developed by HP and Intel.
• Itanium is Intel's implementation of IA-64. HP announced an IA-64 successor to their PA-RISC line, but apparently abandoned it in favor of joint development of the Itanium 2.
• Architectural features:
  – No complex out-of-order logic in the processor – leave it to the compiler
  – A large register file and multiple execution units (again, the compiler has visibility to manage them explicitly)
  – Hardware support for advanced ILP optimization: predication, control and data speculation, register rotation, multi-way branches, parallel compares, a visible memory hierarchy
  – Load/store architecture – no direct-to-memory instructions


Page 52: Advanced Compiler Techniques

Hardware Resources

• General (user-accessible) registers
  – 128 64-bit general registers, each with a 1-bit NaT flag attached (GR0 hardwired to 0)
  – 128 82-bit floating-point registers
  – 64 1-bit predicate registers (p0 hardwired to 1, i.e., always true)
  – 8 64-bit branch registers
• Special application registers
  – Current Frame Marker (CFM), Loop Counter (LC), Epilog Counter (EC)
• Advanced Load Address Table (ALAT)
• Register Stack Engine (RSE)


Page 53: Advanced Compiler Techniques

Instruction Groups

• A sequence of instructions with no register dependencies (they can all be executed in parallel)
  – WAR dependencies are allowed
  – Memory operations still require sequential execution
• Stops delimit groups, so a group of parallel instructions can span bundles (this allows a new hardware implementation to add functional units – it would still be able to execute code compiled for a smaller number of units)
  – Itanium 2 can issue up to six instructions simultaneously


Page 54: Advanced Compiler Techniques

Predication

• The IA-64 includes 64 special-purpose registers that can be used to predicate most of the normal instructions.
• For example, this allows the following conversion:

    if (a > b) then
      c = c + a
    else
      c = c + b

  becomes

    cmp.gt p1, p2 = a, b
    (p1) c = c + a
    (p2) c = c + b

• Most special compare/test instructions write the result and its complement into two different predicate registers.
• Predication converts control dependencies into data dependencies, allowing more scheduling options.
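At the source level, if-conversion has the effect of this minimal C sketch (illustrative only, not the IA-64 mechanism): both arms are evaluated and a data-dependent select replaces the branch.

    /* Branchless form of: if (a > b) c += a; else c += b; */
    int add_larger(int a, int b, int c) {
        int p1 = (a > b);            /* "predicate" ...          */
        int p2 = !p1;                /* ... and its complement   */
        return c + p1 * a + p2 * b;  /* exactly one term survives */
    }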


Page 55: Advanced Compiler Techniques

Software Pipelining

• Similar to hardware pipelining – overlap the execution of different pieces of a loop. Requires multiple execution units.
• Problem with loop unrolling / pipelining – it requires a lot of registers.
• Hardware support:
  – Rotating predicate registers embed the prolog and epilog of the software pipeline within the main loop, minimizing code expansion.
  – Rotating registers can be used to avoid register renaming within unrolled loops, further minimizing code expansion.
  – Loop Count (LC) and Epilog Count (EC) registers, and a special loop branch instruction that rotates the registers.

A software-pipelined loop is sketched below.
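A minimal hand-pipelined C sketch (not IA-64-specific): the kernel overlaps the load for iteration i+1 with the compute and store for iteration i, the way a modulo scheduler would.

    void incr(int *b, const int *a, int n) {
        if (n <= 0) return;
        int t = a[0];                /* prolog: start iteration 0 */
        for (int i = 0; i < n - 1; i++) {
            int next = a[i + 1];     /* stage 1 of iteration i+1 ...    */
            b[i] = t + 1;            /* ... overlaps stage 2 of iter. i */
            t = next;
        }
        b[n - 1] = t + 1;            /* epilog: finish the last iteration */
    }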

Page 56: Advanced Compiler Techniques

The Register Stack

• A portion of the process address space is set aside as a backing store for registers – registers spill to this area.
• The processor supports a register stack frame similar to a normal memory stack frame.
  – The first 32 registers are static; the remaining 96 form the register stack.
  – The Current Frame Marker (CFM) and Top of Stack (TOS) registers are used to manipulate stack frames.
  – A called routine issues the alloc instruction to indicate how many registers it requires; this allocates the next frame and updates the registers.
  – Code uses virtual register names (i.e., it always uses register 32 as the first register in its frame) – the processor maps these to physical registers.


Page 57: Advanced Compiler Techniques

Memory Hierarchy Control

• Compiler can insert prefetch hints (lfetch) in instruction stream.

• Load, store, and prefetch instructions can include a post-increment – this will also cause a cache load of the line containing the post-incremented address


Page 58: Advanced Compiler Techniques

Superscalar, VLIW and EPIC

• Superscalar (static) – dynamic issue, hardware hazard detection, static scheduling; distinguishing characteristic: in-order execution. Examples: Sun UltraSparc II/III
• Superscalar (dynamic) – dynamic issue, hardware hazard detection, dynamic scheduling; distinguishing characteristic: some out-of-order execution. Examples: IBM Power2
• Superscalar (speculative) – dynamic issue, hardware hazard detection, dynamic scheduling with speculation; distinguishing characteristic: out-of-order execution with speculation. Examples: Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64 III
• VLIW/LIW – static issue, software hazard detection, static scheduling; distinguishing characteristic: no hazards between issue packets. Examples: Trimedia, i860
• EPIC – mostly static issue, mostly software hazard detection, mostly static scheduling; distinguishing characteristic: explicit dependences marked by the compiler. Examples: Itanium, Itanium 2


Page 59: Advanced Compiler Techniques

Major Categories

From Mark Smotherman, “Understanding EPIC Architectures and Implementations”


Page 60: Advanced Compiler Techniques

What’s a Compiler to Do?

• Many new architectural features are based on improving opportunities for ILP.
  – Obviously, there are high expectations for the compiler to perform good instruction scheduling.
  – The scheduler should probably drive many other optimizations.
• There is a tradeoff between ILP and code expansion.
• The choice of making the compiler do a lot of the work is probably a good one – it has much more context available to make decisions.
  – However, it lacks knowledge of the dynamic run-time environment.

Page 61: Advanced Compiler Techniques

Superscalar, VLIW and ILP

• ILP has a significant impact on performance, which is important for high-end systems
  – Superscalar
  – VLIW/EPIC
• For embedded systems, ILP yields performance gains and power savings (lower clock rate), provided that the ILP implementation is low-cost and low-power
  – VLIW
  – DSP and custom

Page 62: Advanced Compiler Techniques

Energy-Efficient Branch Prediction with Compiler-Guided History Stack

Expected source code – the correlated branches are adjacent:

    if (a == 2)    // B1
      a = 0;
    if (b == 2)    // B2
      b = 0;
    if (a != b)    // B3

Unexpected source code – a call or a loop separates them:

    if (a == 2)    // B1
      a = 0;
    if (b == 2)    // B2
      b = 0;
    call fun1(); …        (or:  for (i = 0; i < N; i++) … )
    if (a != b)    // B3

Branch correlation can be very-long-distance.

[Figure: the intervening loop's history bits flood the global history register (GHR), pushing out the B1/B2 outcomes that B3 correlates with]

A sketch of the intended fix follows.
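A minimal C sketch of the transformation (guide_save_history / guide_restore_history stand in for the proposed guiding instructions; they are modeled here as hypothetical no-op hooks, not a real API):

    #include <stdio.h>

    static void guide_save_history(void)    { /* would checkpoint the GHR */ }
    static void guide_restore_history(void) { /* would restore the GHR   */ }

    int main(void) {
        int a = 2, b = 3, sum = 0;
        if (a == 2) a = 0;               /* B1 */
        if (b == 2) b = 0;               /* B2 */
        guide_save_history();            /* checkpoint before the loop */
        for (int i = 0; i < 1000; i++)   /* these branches would otherwise */
            sum += i;                    /* flush B1/B2 out of the GHR     */
        guide_restore_history();         /* bring the correlated history back */
        if (a != b)                      /* B3 now sees B1/B2 history */
            printf("%d\n", sum);
        return 0;
    }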


Page 63: Advanced Compiler Techniques

Energy-Efficient Branch Prediction with Compiler-Guided History Stack

We observed: it is straightforward for the compiler to identify such very-long-distance branch correlation based on high-level dataflow information.

[Figure: code from a chess program – search_root, push_slidE, and bishop_mobility – annotated with save history / restore history points around the loops and calls that separate the correlated branches B1–B4]


Page 64: Advanced Compiler Techniques

Energy-Efficient Branch Prediction with Compiler-Guided History Stack

Compiler work flow: identify program substructure → analyze branch correlation → insert guiding instructions (source code in, binary code out).

Loop transformation:

[Figure: for a loop over basic blocks BB1–BB4, guide_save_history is inserted in the preheader and guide_restore_history at the tail, between the pred-branches before the loop and the succ-branches after it]


Page 65: Advanced Compiler Techniques

Energy-Efficient Branch Prediction with Compiler-Guided History Stack

Compiler work flow (continued): procedure call transformation.

[Figure: guide_save_history is inserted before "call proc" and guide_restore_history after it, so the pred-branch history survives the call for the succ-branches]


Page 66: Advanced Compiler Techniques

CHS-based Branch Predictor

[Figure: a gshare predictor – the GHR combined with the branch address indexes a PHT of counters, and a BTB supplies the predicted target – extended with a Compiler-guided History Stack (CHS): guide_save_history pushes the current GHR onto the stack and guide_restore_history pops it back]


Page 67: Advanced Compiler Techniques

Performance of gshare-CHS

On gzip, mcf, parser, twolf, gcc06, gobmk, hmmer, sjeng, and astar, gshare-CHS reduces branch MPKI (smaller is better) by 28% on average and improves IPC by 10% relative to gshare.

[Figure: per-benchmark bar charts of branch MPKI and IPC improvement for gshare vs. gshare-CHS]


Page 68: Advanced Compiler Techniques

Compiler Microarchitecture Cooperation

How do we feed compiler-guided information to the predictor? Through compiler–microarchitecture cooperation.

[Figure: the traditional layering – problem, algorithm, program, ISA, microarchitecture, circuits, electronic devices – shown (a) with the ISA as the only interface vs. (b) with explicit compiler/microarchitecture collaboration across it]


Page 69: Advanced Compiler Techniques

Compiler Optimization for Accelerator-Based Architectures


Page 70: Advanced Compiler Techniques

HPC Design Using Accelerators

• High level of performance from accelerators
• A variety of general-purpose hardware accelerators
  – GPUs: nVidia, ATI
  – Accelerators: Clearspeed, Cell BE, …
  – A plethora of instruction sets, even for SIMD
• Programmable accelerators, e.g., FPGA-based
• HPC design using accelerators
  – Exploit instruction-level parallelism
  – Exploit data-level parallelism on SIMD units
  – Exploit thread-level parallelism on multiple units / multi-cores
• Challenges
  – Portability across different generations and platforms
  – Ability to exploit different types of parallelism


Page 71: Advanced Compiler Techniques

Accelerators – Cell BE


Page 72: Advanced Compiler Techniques

Accelerators - 8800 GPU


Page 73: Advanced Compiler Techniques

Evolution of Intel Vector Instructions

• MMX (1996, Pentium)
  – CPU-based MPEG decoding
  – Integers only; 64 bits divided into 2 x 32 down to 8 x 8
  – Phased out with SSE4
• SSE (1999, Pentium III)
  – CPU-based 3D graphics
  – 4-way float operations, single precision
  – 8 new 128-bit registers, 100+ instructions
• SSE2 (2001, Pentium 4)
  – High-performance computing
  – Adds 2-way double-precision float ops; same registers as 4-way single precision
  – Integer SSE instructions make MMX obsolete
• SSE3 (2004, Pentium 4E Prescott)
  – Scientific computing
  – New 2-way and 4-way vector instructions for complex arithmetic
• SSSE3 (2006, Core Duo)
  – Minor advancement over SSE3
• SSE4 (2007, Core2 Duo Penryn)
  – Modern codecs, cryptography
  – New integer instructions
  – Better support for unaligned data, super shuffle engine
• AVX – advanced version of SSE
  – 128 bits to 256 bits, and 3-operand instructions (up from 2)
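As a concrete illustration (not from the slides), the 4-way single-precision SSE add is exposed in C through compiler intrinsics:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* c[0..3] = a[0..3] + b[0..3] in one 4-way SSE operation. */
    void vadd4(float *c, const float *a, const float *b) {
        __m128 va = _mm_loadu_ps(a);            /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* 4-way single-precision add */
    }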


Page 74: Advanced Compiler Techniques

The Challenge


Page 75: Advanced Compiler Techniques

Approach Framework

[Figure: CUDA programs are lowered to a high-level intermediate representation; a profile-based compiler and runtime system then target GPUs and multicore]


Page 76: Advanced Compiler Techniques

Existing Approaches

[Figure: single-threaded execution vs. SIMD execution]


Page 77: Advanced Compiler Techniques

Existing Approaches (cont.)

[Figure: execution on the Cell BE vs. our approach for GPUs]


Page 78: Advanced Compiler Techniques

Stream Graph Execution

Stream graph: A feeds B, which feeds C and D.

SIMD execution runs one filter at a time across all SMs (A1–A4 on SM1–SM4, then B1–B4, then the C and D instances), so buffer requirement = 4 x.

[Figure: SIMD schedule of filters A–D across SM1–SM4 over time steps 0–7]


Page 79: Advanced Compiler Techniques

Stream Graph Execution

Software-pipelined execution instead staggers the filters so that different SMs run different filters from different iterations concurrently, and buffer requirement = 2 x.

[Figure: software-pipelined schedule of filters A–D across SM1–SM4 over time steps 0–7]

01234567

79“Advanced Compiler Techniques”

Page 80: Advanced Compiler Techniques

Compiler Framework


Page 81: Advanced Compiler Techniques


Experimental Results

Speedup of stream programs on the GPU (8800) compared to the CPU

Filters are coarsened before scheduling!

Page 82: Advanced Compiler Techniques

Compiler Optimization for Embedded Systems

Page 83: Advanced Compiler Techniques


Design Cost

Page 84: Advanced Compiler Techniques


Embedded-Specific Tradeoffs for Compilers

• Space, time, and energy tradeoffs
• These are contrasting goals for the compiler
• Ideally, a compiler should expose some linear combination of the optimization dimensions to application developers:

  K1·{speed} + K2·{code_size} + K3·{energy_efficiency}
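A hedged sketch of what exposing such a knob could look like in C (the weights and interface are illustrative, not an actual compiler API):

    /* Hypothetical objective an optimizer could minimize when comparing
       transformation alternatives; k1-k3 are supplied by the developer. */
    typedef struct { double exec_time, code_size, energy; } Estimate;

    double cost(Estimate e, double k1, double k2, double k3) {
        return k1 * e.exec_time + k2 * e.code_size + k3 * e.energy;
    }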


Page 85: Advanced Compiler Techniques


Effect of Compiler Optimizations on Space

• Code size determines the cost of ROM
• Should minimize I-cache and D-cache misses
• Code layout techniques:
  – DAG-based placement
  – Pettis-Hansen
  – Inlining
  – Cache line coloring
  – Temporal-order placement


Page 86: Advanced Compiler Techniques


Power-aware Software Techniques

• Reducing switching activity
• Power-aware instruction selection
• Scheduling for minimal dissipation
• Memory access optimizations
• Data remapping


Page 87: Advanced Compiler Techniques


Effect of Compiler Optimizations on Power

• Most of the traditional “scalar” optimizations benefit space, time, and (therefore) power
• Not so for:
  – Loop unrolling
  – Tail duplication
  – Inlining and cloning
  – Speculation and predication
  – Global code motion
• These and other optimizations will be discussed in subsequent lectures


Page 88: Advanced Compiler Techniques

Low-Power Compilation Support (1): Compiler-Directed DVS-Based Power Optimization

Directive information is added based on profiling data or manual intervention. The compiler recognizes the directives and inserts voltage-switching code sections, effectively realizing fine-grained, intra-program DVS. According to the literature and preliminary measurements, this approach reduces average power by roughly 20%. A sketch follows the figure below.

[Figure: processor utilization and power over time, comparing power under dynamic voltage/frequency scaling against power under a work/sleep regulation mode]
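A minimal sketch of compiler-inserted DVS around a memory-bound phase (set_frequency and the LOW/HIGH levels are hypothetical hooks, not a real API):

    /* The compiler, guided by profiles or directives, brackets a
       memory-bound loop with hypothetical frequency-scaling calls. */
    enum Level { LOW, HIGH };
    static void set_frequency(enum Level l) { (void)l; /* hypothetical hook */ }

    void scale(double *x, int n) {
        set_frequency(LOW);             /* the CPU mostly waits on memory here */
        for (int i = 0; i < n; i++)
            x[i] *= 2.0;
        set_frequency(HIGH);            /* restore speed for compute-bound code */
    }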


Page 89: Advanced Compiler Techniques

Low-Power Compilation Support (2): Power-Constrained Code Generation

Each instruction is annotated with a power weight, so lower-power instructions can be selected to replace higher-power ones, and the compiler generates different code depending on the selection. Rearranging code to reduce the number of logic transitions can cut switching by 20-30%, with a possible performance loss of 1-2%; with a careful placement strategy, a balance point can be chosen for the application's needs. Likewise, adjusting the code layout of parallel programs can effectively lower the required voltage without affecting performance. For example, reordering the four encodings below reduces the total bit switches between consecutive instructions:

Original order:            Reordered:
(a) 10010011               (a) 10010011
(b) 11101000               (c) 10111011
(c) 10111011               (b) 11101000
(d) 01110100               (d) 01110100
total switches: 24         total switches: 18
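A toy C sketch of one way to find such an order (greedy nearest-neighbor on Hamming distance; illustrative only, not necessarily the slide's algorithm):

    #include <stdio.h>

    /* Bits that differ between consecutive encodings = switches. */
    static int hamming(unsigned a, unsigned b) {
        return __builtin_popcount(a ^ b);   /* GCC/Clang builtin */
    }

    int main(void) {
        unsigned enc[4] = { 0x93, 0xE8, 0xBB, 0x74 };  /* (a)-(d) above */
        int used[4] = { 1, 0, 0, 0 }, order[4] = { 0 }; /* keep (a) first */
        for (int k = 1; k < 4; k++) {       /* greedily pick nearest neighbor */
            int best = -1, bestd = 99;
            for (int j = 0; j < 4; j++) {
                int d = used[j] ? 99 : hamming(enc[order[k - 1]], enc[j]);
                if (d < bestd) { bestd = d; best = j; }
            }
            order[k] = best;
            used[best] = 1;
        }
        for (int k = 0; k < 4; k++)         /* prints 0 2 1 3, i.e. a c b d */
            printf("%d ", order[k]);
        return 0;
    }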


Page 90: Advanced Compiler Techniques

Compiler Optimization for Many-Core

Page 91: Advanced Compiler Techniques

THIS is the Problem!

[Figure: SPEC CPU integer performance over time, with the historical growth trend breaking around 2004]

Pages 92–99: [figure-only slides]

Page 100: Advanced Compiler Techniques


Summary

• Today
  – Introduction & Basic Concepts
  – Superscalar & VLIW
  – Accelerator-Based Computing
  – Compiler for Embedded Systems
  – Compiler Optimization for Many-Core
• Next
  – Future Compilers