Advanced Compiler Techniques
LIU Xianhua
School of EECS, Peking University
Compiler for Modern Architectures
“Advanced Compiler Techniques”
Outline
• Introduction & Basic Concepts
• Superscalar & VLIW
• Accelerator-Based Computing
• Compiler for Embedded Systems
• Compiler Optimization for Many-Core
Features of Machine Architectures
• Pipelining
• Multiple execution units
  – Pipelined
• Vector operations
• Parallel processing
  – Shared memory, distributed memory, message passing
• VLIW and superscalar instruction issue
• Registers
• Cache hierarchy
• Combinations of the above
  – Parallel-vector machines
Moore’s Law
Instruction Pipelining
• Instruction pipelining
  – DLX instruction pipeline: five stages – IF, ID, EX, MEM, WB
  – What is the performance challenge?

[Figure: three successive instructions flowing through the IF/ID/EX/MEM/WB stages, one stage per cycle, over cycles c1–c7]
Replicated Execution Logic
• Pipelined Execution Units
• Multiple Execution Units
[Figure: a pipelined floating-point adder with stages Fetch Operands (FO), Equate Exponents (EE), Add Mantissas (AM), and Normalize Result (NR), plus a bank of four adders computing b1+c1 … b4+c4 in parallel]

• What is the performance challenge?
Vector Operations
• Apply the same operation to different positions of one or more arrays
  – Goal: keep pipelines of execution units full
• Example:
  VLOAD  V1,A
  VLOAD  V2,B
  VADD   V3,V1,V2
  VSTORE V3,C
• How do we specify vector operations in Fortran 77?
  DO I = 1, 64
    C(I) = A(I) + B(I)
    D(I+1) = D(I) + C(I)
  ENDDO
Superscalar & VLIW
• Multiple instruction issue on the same cycle
  – Wide-word instruction (VLIW) or superscalar
  – Usually one instruction slot per functional unit
• What are the performance challenges?
  – Finding enough parallel instructions
  – Avoiding interlocks
    • Scheduling instructions early enough
SMP Parallelism
• Multiple processors with uniform shared memory
  – Task parallelism: independent tasks
  – Data parallelism: the same task on different data
• What is the performance challenge?

[Figure: processors p1–p4 sharing memory over a common bus]
Bernstein’s Conditions
• When is it safe to run two tasks R1 and R2 in parallel?
  – If none of the following holds:
    • R1 writes into a memory location that R2 reads
    • R2 writes into a memory location that R1 reads
    • Both R1 and R2 write to the same memory location
• How can we convert this to loop parallelism?
  – Think of loop iterations as tasks
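The three conditions above amount to intersecting read/write sets. A minimal C sketch (the `TaskAccess` summary type and address sets are hypothetical, for illustration only):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a task: the sets of locations it reads and writes. */
typedef struct {
    const int *reads;  size_t n_reads;
    const int *writes; size_t n_writes;
} TaskAccess;

static bool overlaps(const int *a, size_t na, const int *b, size_t nb) {
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) return true;  /* shared location found */
    return false;
}

/* Bernstein's conditions: R1 and R2 may run in parallel iff none of
 * W1∩R2, W2∩R1, W1∩W2 is non-empty. */
bool can_run_in_parallel(const TaskAccess *r1, const TaskAccess *r2) {
    return !overlaps(r1->writes, r1->n_writes, r2->reads,  r2->n_reads)   /* R1 writes what R2 reads */
        && !overlaps(r2->writes, r2->n_writes, r1->reads,  r1->n_reads)   /* R2 writes what R1 reads */
        && !overlaps(r1->writes, r1->n_writes, r2->writes, r2->n_writes); /* both write same location */
}
```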
Memory Hierarchy
• Problem: memory is moving farther away in processor cycles
  – Latency and bandwidth difficulties
• Solution
  – Reuse data in cache and registers
• Challenge: How can we enhance reuse?
  – Coloring-based register allocation works well
    • But only for scalars
      DO I = 1, N
        DO J = 1, N
          C(I) = C(I) + A(J)
  – Strip mining to reuse data from cache
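A C rendering of strip mining for the loop nest above (the strip size B is an assumed tuning parameter, chosen so one strip of A stays cache-resident while it is reused across every iteration of I):

```c
#define N 1024
#define B 64   /* assumed cache-resident strip length */

/* Strip-mined form of: for all I, C[I] += sum of A[J].
 * Each strip of A is brought into cache once and reused for every C[i]. */
void strip_mined_sum(double *C, const double *A) {
    for (int jj = 0; jj < N; jj += B)               /* iterate over strips of A */
        for (int i = 0; i < N; i++)                 /* reuse the strip for each C[i] */
            for (int j = jj; j < jj + B && j < N; j++)
                C[i] += A[j];
}
```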
Distributed Memory
• Memory packaged with processors
  – Message passing
  – Distributed shared memory
• SMP clusters
  – Shared memory on node, message passing off node
• What are the performance issues?
  – Minimizing communication
    • Data placement
  – Optimizing communication
    • Aggregation
    • Overlap of communication and computation
Compiler Technologies
• Program transformations
  – Most of these architectural issues can be dealt with by restructuring transformations that can be reflected in source
    • Vectorization, parallelization, cache-reuse enhancement
  – Challenges:
    • Determining when transformations are legal
    • Selecting transformations based on profitability
• Low-level code generation
  – Some issues must be dealt with at a low level
    • Prefetch insertion
    • Instruction scheduling
• All require some understanding of the ways that instructions and statements depend on one another (share data)
A Common Problem: Matrix Multiply
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problem for Pipelines
• Inner loop of matrix multiply is a reduction
• Solution:
  – Work on several iterations of the J-loop simultaneously

[Figure: the serial chain of updates to C(1,1) – add A(1,1)*B(1,1), then A(1,2)*B(2,1), … – versus four independent chains updating C(1,1), C(2,1), C(3,1), C(4,1) in parallel]
MatMult for a Pipelined Machine
DO I = 1, N
  DO J = 1, N, 4
    C(J,I) = 0.0    !Register 1
    C(J+1,I) = 0.0  !Register 2
    C(J+2,I) = 0.0  !Register 3
    C(J+3,I) = 0.0  !Register 4
    DO K = 1, N
      C(J,I)   = C(J,I)   + A(J,K)   * B(K,I)
      C(J+1,I) = C(J+1,I) + A(J+1,K) * B(K,I)
      C(J+2,I) = C(J+2,I) + A(J+2,K) * B(K,I)
      C(J+3,I) = C(J+3,I) + A(J+3,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Matrix Multiply on Vector Machines
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problems for Vectors
• Inner loop must be vector
  – And should be stride 1
• Vector registers have finite length (Cray: 64 elements)
  – Would like to reuse the vector register in the compute loop
• Solution
  – Strip mine the loop over the stride-one dimension to 64
  – Move the iterate-over-strip loop to the innermost position
    • Vectorize it there
Vectorizing Matrix Multiply
DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
      DO K = 1, N
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO
Vectorizing Matrix Multiply
DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
    ENDDO
    DO K = 1, N
      DO JJ = 0, 63
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO
MatMult for a Vector Machine
DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Matrix Multiply on Parallel SMPs
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Matrix Multiply on Parallel SMPs
DO I = 1, N    ! Independent for all I
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problems on a Parallel Machine
• Parallelism must be found at the outer loop level
  – But how do we know?
• Solution
  – Bernstein's conditions
    • Can we apply them to loop iterations?
    • Yes, with dependence
  – Statement S2 depends on statement S1 if
    • S2 comes after S1, and
    • S2 must come after S1 in any correct reordering of statements
  – Usually keyed to memory: there is a path from S1 to S2, and
    • S1 writes and S2 reads the same location (true dependence)
    • S1 reads and S2 writes the same location (antidependence)
    • S1 and S2 both write the same location (output dependence)
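The three cases above can be seen in a tiny C fragment; reordering any dependent pair of statements would change the result:

```c
/* Illustration of the three memory dependences.
 * S1 -> S2 is a true (flow) dependence, S2 -> S3 an antidependence,
 * and S1 -> S3 an output dependence. */
int dependence_demo(void) {
    int a[2] = {0, 0};
    a[0] = 1;          /* S1: writes a[0]                        */
    int x = a[0] + 1;  /* S2: reads a[0]  -> true dep. on S1     */
    a[0] = x;          /* S3: writes a[0] -> anti dep. on S2,    */
                       /*                   output dep. on S1    */
    return a[0];
}
```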
MatMult on a Shared-Memory MP
PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO
MatMult on a Vector SMP
PARALLEL DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO
Matrix Multiply for Cache Reuse
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problems on Cache
• There is reuse of C but no reuse of A and B
• Solution
  – Block the loops so you get reuse of both A and B
    • Multiply a block of A by a block of B and add to a block of C
  – When is it legal to interchange the iterate-over-block loops to the inside?
MatMult on a Uniprocessor with Cache
DO I = 1, N, S
  DO J = 1, N, S
    DO p = I, I+S-1
      DO q = J, J+S-1
        C(q,p) = 0.0
      ENDDO
    ENDDO
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

(The A and B blocks touched by the inner loops hold S*T elements each; the C block holds S*S elements.)
MatMult on a Distributed-Memory MP
PARALLEL DO I = 1, N
  PARALLEL DO J = 1, N
    C(J,I) = 0.0
  ENDDO
ENDDO
PARALLEL DO I = 1, N, S
  PARALLEL DO J = 1, N, S
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
Compiler Optimization for Superscalar & VLIW
What is Superscalar?
• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in RISC machines.
• The order of execution is usually assisted by the compiler.
A Superscalar machine executes multiple independent instructions in parallel. They are pipelined as well.
Superscalar vs. Superpipeline
Limitations of Superscalar
• Dependent upon:
  – Instruction-level parallelism possible
  – Compiler-based optimization
  – Hardware support
• Limited by
  – Data dependency
  – Procedural dependency
  – Resource conflicts
Effect of Dependencies onSuperscalar Operation
Notes:
1) Superscalar operation is doubly impacted by a stall.
2) CISC machines typically have different-length instructions that need to be at least partially decoded before the next can be fetched – not good for superscalar operation.
In-Order Issue –In-Order Completion (Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
In-Order Issue –Out-of-Order Completion (Example)
How does this affect interrupts?
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
Out-of-Order Issue –Out-of-Order Completion (Example)
Note: I5 depends upon I4, but I6 does not
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
Speedups of Machine Organizations (Without Procedural Dependencies)
• Not worth duplicating functional units without register renaming
• Need an instruction window large enough (more than 8, probably not more than 32)
View of Superscalar Execution
Example: Pentium 4A Superscalar CISC Machine
Pentium 4 Alternative View
Pentium 4 pipeline
20 stages!
VLIW Approach
• Superscalar hardware is tough
  – E.g., for a two-issue superscalar: every cycle we examine the opcodes of 2 instructions and the 6 register specifiers, dynamically determine whether one or two instructions can issue, and dispatch them to the appropriate functional units (FUs)
• VLIW approach
  – Packages multiple independent instructions into one very long instruction
  – Burden of finding independent instructions is on the compiler
  – No dependence within an issue package, or at least an indication when a dependence is present
  – Simpler hardware
Superscalar vs. VLIW
Superscalar vs. VLIW
Role of the Compiler forSuperscalar and VLIW
What is EPIC?
• Explicitly Parallel Instruction-set Computing – a new hardware architecture paradigm that is the successor to VLIW (Very Long Instruction Word).
• EPIC features:
  – Parallel instruction encoding
  – Flexible instruction grouping
  – Large register file
  – Predicated instruction set
  – Control and data speculation
  – Compiler control of memory hierarchy
History
• A timeline of architectures:
  Hard-wired discrete logic → Microprocessor (advent of VLSI) → CISC (microcode) → RISC (microcode bad!) → VLIW → Superpipelined, Superscalar → EPIC
• Hennessy and Patterson were early RISC proponents.
• VLIW (very long instruction word) architectures, 1981:
  – Alan Charlesworth (Floating Point Systems)
  – Josh Fisher (Yale / Multiflow – Multiflow Trace machines)
  – Bob Rau (TRW / Cydrome – Cydra-5 machine)
History (cont.)
• Superscalar processors
  – Tilak Agerwala and John Cocke of IBM coined the term
  – First microprocessor (Intel i960CA) and workstation (IBM RS/6000) introduced in 1989
  – Intel Pentium in 1993
• Foundations of EPIC – HP
  – HP FAST (Fine-grained Architecture and Software Technologies) research project in 1989 included Bob Rau; this work later developed into the PlayDoh architecture.
  – In 1990 Bill Worley started the PA-Wide Word (PA-WW) project. Josh Fisher, also hired by HP, contributed to these projects.
  – HP teamed with Intel in 1994 – announced EPIC and the IA-64 architecture in 1997.
  – First implementation was Merced / Itanium, announced in 1999.
IA-64 / Itanium
• IA-64 is not a complete architecture – it is more a set of high-level requirements for an architecture and instruction set, co-developed by HP and Intel.
• Itanium is Intel's implementation of IA-64. HP announced an IA-64 successor to their PA-RISC line, but apparently abandoned it in favor of joint development of the Itanium 2.
• Architectural features:
  – No complex out-of-order logic in the processor – leave it to the compiler
  – Large register file, multiple execution units (again, the compiler has visibility to explicitly manage them)
  – Hardware support for advanced ILP optimization: predication, control and data speculation, register rotation, multi-way branches, parallel compares, visible memory hierarchy
  – Load/store architecture – no direct-to-memory instructions
Hardware Resources
• General (user-accessible) registers
  – 128 64-bit general registers; each has a 1-bit NaT flag attached (GR0 hardwired to 0)
  – 128 82-bit floating-point registers
  – 64 1-bit predicate registers (p0 hardwired to 1, i.e. always true)
  – 8 64-bit branch registers
• Special application registers
  – Current Frame Marker (CFM), Loop Counter (LC), Epilog Counter (EC)
• Advanced Load Address Table (ALAT)
• Register Stack Engine (RSE)
Instruction Groups
• Sequence of instructions with no register dependencies (can all be executed in parallel)
  – WAR dependencies are allowed
  – Memory operations still require sequential execution
• Stops delimit groups, so a group of parallel instructions can span bundles (this allows a new hardware implementation to add functional units – it would still be able to execute code compiled for a smaller number of units)
  – Itanium 2 can issue up to six instructions simultaneously
Predication
• The IA-64 includes 64 special-purpose registers that can be used to predicate most of the normal instructions.
• For example, this allows the following conversion:

  if (a > b) then
      c = c + a
  else
      c = c + b

becomes

       cmp.lt p1, p2 = b, a
  (p1) c = c + a
  (p2) c = c + b

• Most special compare / test instructions write the result and its complement into two different predicate registers
• Predication converts control dependencies into data dependencies, allowing more scheduling options
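The effect of if-conversion can be mimicked in plain C: compute both predicates as 0/1 masks and apply both updates, so no branch remains (a sketch of the idea, not IA-64 semantics):

```c
/* Branch-free "predicated" version of:
 *   if (a > b) c = c + a; else c = c + b;
 * p1 and p2 play the role of the complementary predicate registers. */
int predicated_update(int a, int b, int c) {
    int p1 = (a > b);   /* predicate, like p1 */
    int p2 = !p1;       /* complement, like p2 */
    c += p1 * a;        /* contributes only "under" p1 */
    c += p2 * b;        /* contributes only "under" p2 */
    return c;
}
```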
Software Pipelining
• Similar to hardware pipelining – overlap executions of different pieces of a loop. Requires multiple execution units.
• Problem with loop unrolling / pipelining – requires a lot of registers.
• Hardware support:
  – Rotating predicate registers to embed the prolog and epilog of the software pipeline within the main loop – minimizes code expansion.
  – Rotating registers can be used to avoid register renaming within unrolled loops, further minimizing code expansion.
  – Loop Count (LC) and Epilog Count (EC) registers, and a special loop branch instruction to rotate registers.
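A schematic two-stage software pipeline, written in C for illustration (a real compiler would schedule the two stages onto separate functional units; the explicit prolog/kernel/epilog structure is the point):

```c
/* Two-stage software pipeline for y[i] = 2*x[i].
 * Stage 1: load x[i].  Stage 2: multiply and store.
 * The kernel overlaps stage 1 of iteration i with stage 2 of i-1. */
void pipelined_double(const int *x, int *y, int n) {
    if (n <= 0) return;
    int loaded = x[0];              /* prolog: fill the pipeline */
    for (int i = 1; i < n; i++) {
        int next = x[i];            /* stage 1 for iteration i   */
        y[i - 1] = loaded * 2;      /* stage 2 for iteration i-1 */
        loaded = next;              /* "rotate" the register     */
    }
    y[n - 1] = loaded * 2;          /* epilog: drain the pipeline */
}
```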
The Register Stack
• A portion of the process address space is set aside as a backing store for registers – spills go to this area.
• The processor supports a register stack frame similar to a normal memory stack frame.
  – The first 32 registers are static; the remaining 96 form the register stack.
  – The Current Frame Marker (CFM) and Top of Stack (TOS) registers are used to manipulate stack frames.
  – A called routine issues the alloc instruction to indicate how many registers it requires; this allocates the next frame and updates the registers.
  – Code uses virtual register names (i.e., it always uses register 32 as the first register in its frame) – the processor maps them to physical registers.
Memory Hierarchy Control
• Compiler can insert prefetch hints (lfetch) in the instruction stream.
• Load, store, and prefetch instructions can include a post-increment, which will also cause a cache load of the line containing the post-incremented address.
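A portable analogue of compiler-inserted lfetch hints is GCC/Clang's `__builtin_prefetch`. A sketch, where DIST is an assumed prefetch distance that would be tuned per machine:

```c
#define DIST 16   /* assumed prefetch distance, machine-dependent */

/* Sum an array while prefetching DIST elements ahead of the current one. */
long sum_with_prefetch(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]); /* hint: bring the line into cache */
        s += a[i];
    }
    return s;
}
```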
Superscalar, VLIW and EPIC
Name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II/III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64 III
VLIW/LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium, Itanium 2
Major Categories
From Mark Smotherman, “Understanding EPIC Architectures and Implementations”
What’s a Compiler to Do?
• Many new architectural features are based on improving opportunities for ILP.
  – Obviously, there are high expectations for the compiler to perform good instruction scheduling.
  – The scheduler should probably drive many other optimizations.
• Tradeoff between ILP and code expansion.
• The choice of making the compiler do a lot of the work is probably a good one – it has much more context available to make decisions.
  – However, it lacks knowledge of the dynamic run-time environment.
Superscalar, VLIW and ILP
• ILP has a significant impact on performance, which is important for high-end systems
  – Superscalar
  – VLIW/EPIC
• For embedded systems, ILP yields performance gains and power savings (lower clock rate), provided that the ILP implementation is low-cost and low-power
  – VLIW
  – DSP and custom
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
Expected source code:

  if (a==2)    // B1
      a = 0;
  if (b==2)    // B2
      b = 0;
  if (a!=b)    // B3

Unexpected source code (the correlated branches are separated by a call or a loop):

  if (a==2)    // B1
      a = 0;
  if (b==2)    // B2
      b = 0;
  call fun1() ...   or   for(i=0;i<N;i++) ...
  if (a!=b)    // B3

• Branch correlation can be very long distance

[Figure: the intervening loop's branch outcomes fill the GHR's history bits, pushing the outcomes of B1 and B2 out of the history before B3 is predicted]
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
We observed: it is straightforward for the compiler to identify such very-long-distance branch correlation based on high-level dataflow information.
[Figure: code fragments from a chess program (search_root, push_slidE, bishop_mobility) in which correlated branches B1–B4 are separated by procedure calls and loops; the compiler brackets the intervening regions with "save history" / "restore history" operations]
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
Compiler work flow: Source code → Identifying program substructure → Analyzing branch correlation → Inserting guiding instructions → Binary code

Loop transformation:

[Figure: a loop (header and basic blocks BB1–BB4) is given a preheader containing guide_save_history and a tail containing guide_restore_history, so the loop's branch outcomes do not disturb the correlation between the pred-branches before the loop and the succ-branches after it]
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
Compiler work flow (cont.): procedure call transformation

[Figure: a call site in BB1 is bracketed with guide_save_history before "call proc" and guide_restore_history after it, preserving the branch-history correlation between the pred-branches and succ-branches across the call]
CHS-based Branch Predictor
[Figure: a gshare predictor – the GHR, hashed with the branch address, indexes a PHT of counters, while a BTB selects between baddr+4 and the predicted target – extended with a CHS (compiler-guided history stack): guide_save_history pushes the current GHR onto the stack and guide_restore_history pops it back]
Performance of gshare-CHS
[Figure: on gzip, mcf, parser, twolf, gcc06, gobmk, hmmer, sjeng, and astar, gshare-CHS reduces branch MPKI (mispredictions per kilo-instruction, smaller is better) by 28% on average versus gshare, and improves IPC by 10% on average]
Compiler Microarchitecture Cooperation
How do we feed compiler-guided information to the predictor? Compiler–microarchitecture cooperation.

[Figure: (a) the traditional layering – Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electronic Devices – versus (b) improved layers in which the compiler and the microarchitecture collaborate directly]
Compiler Optimization for Accelerator-Based Architectures
HPC Design Using Accelerators
• High level of performance from accelerators
• Variety of general-purpose hardware accelerators
  – GPUs: NVIDIA, ATI, …
  – Accelerators: ClearSpeed, Cell BE, …
  – Plethora of instruction sets, even for SIMD
• Programmable accelerators, e.g., FPGA-based
• HPC design using accelerators
  – Exploit instruction-level parallelism
  – Exploit data-level parallelism on SIMD units
  – Exploit thread-level parallelism on multiple units / multi-cores
• Challenges
  – Portability across different generations and platforms
  – Ability to exploit different types of parallelism
Accelerators – Cell BE
Accelerators - 8800 GPU
Evolution of Intel Vector Instructions
• MMX (1996, Pentium)
  – CPU-based MPEG decoding
  – Integers only, 64-bit divided into 2 x 32 to 8 x 8
  – Phased out with SSE4
• SSE (1999, Pentium III)
  – CPU-based 3D graphics
  – 4-way float operations, single precision
  – 8 new 128-bit registers, 100+ instructions
• SSE2 (2001, Pentium 4)
  – High-performance computing
  – Adds 2-way double-precision float ops; same registers as 4-way single precision
  – Integer SSE instructions make MMX obsolete
• SSE3 (2004, Pentium 4E Prescott)
  – Scientific computing
  – New 2-way and 4-way vector instructions for complex arithmetic
• SSSE3 (2006, Core Duo)
  – Minor advancement over SSE3
• SSE4 (2007, Core 2 Duo Penryn)
  – Modern codecs, cryptography
  – New integer instructions
  – Better support for unaligned data, super shuffle engine
• AVX – advanced version of SSE
  – 128 bits to 256 bits, and 3-operand instructions (up from 2)
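A minimal taste of the SSE instructions above, using Intel's C intrinsics (assumes an x86 target with SSE available):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* One 4-way single-precision add, the kind of data-level parallelism
 * introduced with SSE on the Pentium III. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);            /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 adds in one instruction */
}
```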
The Challenge
Approach Framework
[Figure: stream programs are lowered to a high-level intermediate representation, then compiled by a profile-based compiler and runtime system to CUDA for GPUs and to multicore targets]
Existing Approaches
Single Threaded
SIMD Execution
Existing Approaches (cont.)
Execution on Cell BE
Our Approach for GPUs
Stream Graph Execution
Stream Graph: A → (B, C) → D

[Figure: SIMD execution on SM1–SM4 – each filter runs on all four SMs in turn (A1–A4, then B1–B4, then C1–C4 and D1–D4) over time steps 0–7; buffer requirement = 4x]
Stream Graph Execution
Stream Graph: A → (B, C) → D

[Figure: software-pipelined execution on SM1–SM4 – filters from different steady-state iterations run concurrently (e.g., A3/A4 overlap with B1/B2 and C1/D1) over time steps 0–7; buffer requirement = 2x]
Compiler Framework
Experimental Results
Speedup on GPU (8800) compared to CPU of stream programs
Filters are coarsened before scheduling!
Compiler Optimization for Embedded Systems
Design Cost
Embedded-Specific Tradeoffs for Compilers
• Space, time, and energy tradeoffs
• These are contrasting goals for the compiler
• Ideally, a compiler should expose some linear combination of the optimization dimensions to application developers:

  K1{speed} + K2{code_size} + K3{energy_efficiency}
Effect of Compiler Optimizations on Space
• Code size determines cost of ROM
• Should minimize I-cache and D-cache misses
• Code layout techniques:
  – DAG-based placement
  – Pettis-Hansen
  – Inlining
  – Cache line coloring
  – Temporal-order placement
Power-aware Software Techniques
• Reducing switching activity
• Power-aware instruction selection
• Scheduling for minimal dissipation
• Memory access optimizations
• Data remapping
Effect of CompilerOptimizations on Power
• Most of the traditional "scalar" optimizations benefit space, time, and (therefore) power
• Not so for:
  – Loop unrolling
  – Tail duplication
  – Inlining and cloning
  – Speculation and predication
  – Global code motion
• These and other optimizations will be discussed in subsequent lectures
Low-Power Compilation Support (1): Compiler-Directed DVS-Based Power Optimization

Guided by profiling information or manual annotation, directives are added to the program; the compiler recognizes the directives and inserts voltage-switching code segments, effectively implementing fine-grained, intra-program DVS. According to the literature and preliminary measurements, this approach reduces average power consumption by about 20%.

[Figure: power versus processor utilization over time under dynamic voltage/frequency scaling, compared with a work/sleep (race-to-idle) scheme]
Low-Power Compilation Support (2): Power-Constrained Code Generation

Each instruction is annotated with a power weight; lower-power instructions can be substituted for higher-power ones, and the compiler chooses among code-generation alternatives accordingly. Reordering code to reduce the number of logic transitions can cut switching by 20–30%, at a possible performance cost of 1–2%; with a deliberate placement strategy, a balance point can be found for the application's requirements. Likewise, adjusting the code layout of parallel programs can effectively lower the required voltage without affecting performance.

Original order:
  (a) 10010011
  (b) 11101000
  (c) 10111011
  (d) 01110100
  total switches: 24

Reordered:
  (a) 10010011
  (c) 10111011
  (b) 11101000
  (d) 01110100
  total switches: 18
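The metric behind this reordering is the number of bit flips between consecutive instruction encodings. A minimal sketch (byte-wide encodings for illustration; the slide's exact totals depend on how transitions are counted, so this only demonstrates the metric and the benefit of reordering):

```c
#include <stdint.h>

/* Number of differing bits between two encodings. */
static int hamming(uint8_t x, uint8_t y) {
    int d = 0;
    for (uint8_t v = x ^ y; v; v >>= 1)
        d += v & 1;
    return d;
}

/* Total switching activity of a code sequence: the sum of Hamming
 * distances between consecutive instruction encodings. */
int total_switches(const uint8_t *code, int n) {
    int total = 0;
    for (int i = 1; i < n; i++)
        total += hamming(code[i - 1], code[i]);
    return total;
}
```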
Compiler Optimizationfor Many-core
THIS is the Problem!

[Figure: SPEC CPU integer performance over time – the historical growth curve flattens around 2004]
Summary
• Today
  – Introduction & Basic Concepts
  – Superscalar & VLIW
  – Accelerator-Based Computing
  – Compiler for Embedded Systems
  – Compiler Optimization for Many-Core
• Next
  – Future Compilers