Advanced Compiler Techniques
LIU Xianhua
School of EECS, Peking University
Compiler for Modern Architectures
“Advanced Compiler Techniques”
Outline
• Introduction & Basic Concepts
• Superscalar & VLIW
• Accelerator-Based Computing
• Compiler for Embedded Systems
• Compiler Optimization for Many-Core
Features of Machine Architectures
• Pipelining
• Multiple execution units
  – Pipelined
• Vector operations
• Parallel processing
  – Shared memory, distributed memory, message passing
• VLIW and superscalar instruction issue
• Registers
• Cache hierarchy
• Combinations of the above
  – Parallel-vector machines
Moore’s Law
Instruction Pipelining
• Instruction pipelining
  – DLX instruction pipeline: five stages – IF, ID, EX, MEM, WB
  – What is the performance challenge?

[Figure: three successive instructions flowing through the IF/ID/EX/MEM/WB stages, one stage per cycle, over cycles c1–c7]
Replicated Execution Logic
• Pipelined Execution Units
• Multiple Execution Units
[Figure: a pipelined floating-point adder with stages Fetch Operands (FO), Equate Exponents (EE), Add Mantissas (AM), and Normalize Result (NR), plus a bank of four adders computing b1+c1 … b4+c4 in parallel]

• What is the performance challenge?
Vector Operations
• Apply the same operation to different positions of one or more arrays
  – Goal: keep pipelines of execution units full
• Example:
  VLOAD  V1,A
  VLOAD  V2,B
  VADD   V3,V1,V2
  VSTORE V3,C
• How do we specify vector operations in Fortran 77?
  DO I = 1, 64
    C(I) = A(I) + B(I)
    D(I+1) = D(I) + C(I)
  ENDDO
Superscalar & VLIW
• Multiple instruction issue on the same cycle
  – Wide-word instruction (VLIW) or superscalar
  – Usually one instruction slot per functional unit
• What are the performance challenges?
  – Finding enough parallel instructions
  – Avoiding interlocks
    • Scheduling instructions early enough
SMP Parallelism
• Multiple processors with uniform shared memory
  – Task parallelism: independent tasks
  – Data parallelism: the same task on different data
• What is the performance challenge?

[Figure: processors p1–p4 sharing memory over a common bus]
Bernstein’s Conditions
• When is it safe to run two tasks R1 and R2 in parallel?
  – If none of the following holds:
    • R1 writes into a memory location that R2 reads
    • R2 writes into a memory location that R1 reads
    • Both R1 and R2 write to the same memory location
• How can we convert this to loop parallelism?
  – Think of loop iterations as tasks
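The three conditions above amount to intersecting read/write sets. A minimal C sketch (the `TaskAccess` summary type and address sets are hypothetical, for illustration only):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a task: the sets of locations it reads and writes. */
typedef struct {
    const int *reads;  size_t n_reads;
    const int *writes; size_t n_writes;
} TaskAccess;

static bool overlaps(const int *a, size_t na, const int *b, size_t nb) {
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) return true;  /* shared location found */
    return false;
}

/* Bernstein's conditions: R1 and R2 may run in parallel iff none of
 * W1∩R2, W2∩R1, W1∩W2 is non-empty. */
bool can_run_in_parallel(const TaskAccess *r1, const TaskAccess *r2) {
    return !overlaps(r1->writes, r1->n_writes, r2->reads,  r2->n_reads)   /* R1 writes what R2 reads */
        && !overlaps(r2->writes, r2->n_writes, r1->reads,  r1->n_reads)   /* R2 writes what R1 reads */
        && !overlaps(r1->writes, r1->n_writes, r2->writes, r2->n_writes); /* both write same location */
}
```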
Memory Hierarchy
• Problem: memory is moving farther away in processor cycles
  – Latency and bandwidth difficulties
• Solution
  – Reuse data in cache and registers
• Challenge: How can we enhance reuse?
  – Coloring-based register allocation works well
    • But only for scalars
      DO I = 1, N
        DO J = 1, N
          C(I) = C(I) + A(J)
  – Strip mining to reuse data from cache
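A C rendering of strip mining for the loop nest above (the strip size B is an assumed tuning parameter, chosen so one strip of A stays cache-resident while it is reused across every iteration of I):

```c
#define N 1024
#define B 64   /* assumed cache-resident strip length */

/* Strip-mined form of: for all I, C[I] += sum of A[J].
 * Each strip of A is brought into cache once and reused for every C[i]. */
void strip_mined_sum(double *C, const double *A) {
    for (int jj = 0; jj < N; jj += B)               /* iterate over strips of A */
        for (int i = 0; i < N; i++)                 /* reuse the strip for each C[i] */
            for (int j = jj; j < jj + B && j < N; j++)
                C[i] += A[j];
}
```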
Distributed Memory
• Memory packaged with processors
  – Message passing
  – Distributed shared memory
• SMP clusters
  – Shared memory on node, message passing off node
• What are the performance issues?
  – Minimizing communication
    • Data placement
  – Optimizing communication
    • Aggregation
    • Overlap of communication and computation
Compiler Technologies
• Program transformations
  – Most of these architectural issues can be dealt with by restructuring transformations that can be reflected in source
    • Vectorization, parallelization, cache-reuse enhancement
  – Challenges:
    • Determining when transformations are legal
    • Selecting transformations based on profitability
• Low-level code generation
  – Some issues must be dealt with at a low level
    • Prefetch insertion
    • Instruction scheduling
• All require some understanding of the ways that instructions and statements depend on one another (share data)
A Common Problem: Matrix Multiply
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problem for Pipelines
• Inner loop of matrix multiply is a reduction
• Solution:
  – Work on several iterations of the J-loop simultaneously

[Figure: the serial chain of updates to C(1,1) – add A(1,1)*B(1,1), then A(1,2)*B(2,1), … – versus four independent chains updating C(1,1), C(2,1), C(3,1), C(4,1) in parallel]
MatMult for a Pipelined Machine
DO I = 1, N
  DO J = 1, N, 4
    C(J,I) = 0.0    !Register 1
    C(J+1,I) = 0.0  !Register 2
    C(J+2,I) = 0.0  !Register 3
    C(J+3,I) = 0.0  !Register 4
    DO K = 1, N
      C(J,I)   = C(J,I)   + A(J,K)   * B(K,I)
      C(J+1,I) = C(J+1,I) + A(J+1,K) * B(K,I)
      C(J+2,I) = C(J+2,I) + A(J+2,K) * B(K,I)
      C(J+3,I) = C(J+3,I) + A(J+3,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Matrix Multiply on Vector Machines
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problems for Vectors
• Inner loop must be vector
  – And should be stride 1
• Vector registers have finite length (Cray: 64 elements)
  – Would like to reuse the vector register in the compute loop
• Solution
  – Strip mine the loop over the stride-one dimension to 64
  – Move the iterate-over-strip loop to the innermost position
    • Vectorize it there
Vectorizing Matrix Multiply
DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
      DO K = 1, N
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO
Vectorizing Matrix Multiply
DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
    ENDDO
    DO K = 1, N
      DO JJ = 0, 63
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO
MatMult for a Vector Machine
DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Matrix Multiply on Parallel SMPs
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Matrix Multiply on Parallel SMPs
DO I = 1, N    ! Independent for all I
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problems on a Parallel Machine
• Parallelism must be found at the outer loop level
  – But how do we know?
• Solution
  – Bernstein's conditions
    • Can we apply them to loop iterations?
    • Yes, with dependence
  – Statement S2 depends on statement S1 if
    • S2 comes after S1, and
    • S2 must come after S1 in any correct reordering of statements
  – Usually keyed to memory: there is a path from S1 to S2, and
    • S1 writes and S2 reads the same location (true dependence)
    • S1 reads and S2 writes the same location (antidependence)
    • S1 and S2 both write the same location (output dependence)
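The three cases above can be seen in a tiny C fragment; reordering any dependent pair of statements would change the result:

```c
/* Illustration of the three memory dependences.
 * S1 -> S2 is a true (flow) dependence, S2 -> S3 an antidependence,
 * and S1 -> S3 an output dependence. */
int dependence_demo(void) {
    int a[2] = {0, 0};
    a[0] = 1;          /* S1: writes a[0]                        */
    int x = a[0] + 1;  /* S2: reads a[0]  -> true dep. on S1     */
    a[0] = x;          /* S3: writes a[0] -> anti dep. on S2,    */
                       /*                   output dep. on S1    */
    return a[0];
}
```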
MatMult on a Shared-Memory MP
PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO
MatMult on a Vector SMP
PARALLEL DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO
Matrix Multiply for Cache Reuse
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Problems on Cache
• There is reuse of C but no reuse of A and B
• Solution
  – Block the loops so you get reuse of both A and B
    • Multiply a block of A by a block of B and add to a block of C
  – When is it legal to interchange the iterate-over-block loops to the inside?
MatMult on a Uniprocessor with Cache
DO I = 1, N, S
  DO J = 1, N, S
    DO p = I, I+S-1
      DO q = J, J+S-1
        C(q,p) = 0.0
      ENDDO
    ENDDO
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

(The A and B blocks touched by the inner loops hold S*T elements each; the C block holds S*S elements.)
MatMult on a Distributed-Memory MP
PARALLEL DO I = 1, N
  PARALLEL DO J = 1, N
    C(J,I) = 0.0
  ENDDO
ENDDO
PARALLEL DO I = 1, N, S
  PARALLEL DO J = 1, N, S
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
Compiler Optimization for Superscalar & VLIW
What is Superscalar?
• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in RISC machines.
• The order of execution is usually assisted by the compiler.
A Superscalar machine executes multiple independent instructions in parallel. They are pipelined as well.
Superscalar vs. Superpipeline
Limitations of Superscalar
• Dependent upon:
  – Instruction-level parallelism possible
  – Compiler-based optimization
  – Hardware support
• Limited by
  – Data dependency
  – Procedural dependency
  – Resource conflicts
Effect of Dependencies onSuperscalar Operation
Notes:
1) Superscalar operation is doubly impacted by a stall.
2) CISC machines typically have different-length instructions that need to be at least partially decoded before the next can be fetched – not good for superscalar operation.
In-Order Issue –In-Order Completion (Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
In-Order Issue –Out-of-Order Completion (Example)
How does this affect interrupts?
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
Out-of-Order Issue –Out-of-Order Completion (Example)
Note: I5 depends upon I4, but I6 does not
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
Speedups of Machine Organizations (Without Procedural Dependencies)
• Not worth duplicating functional units without register renaming
• Need an instruction window large enough (more than 8, probably not more than 32)
View of Superscalar Execution
Example: Pentium 4A Superscalar CISC Machine
Pentium 4 Alternative View
Pentium 4 pipeline
20 stages!
VLIW Approach
• Superscalar hardware is tough
  – E.g., for a two-issue superscalar: every cycle we examine the opcodes of 2 instructions and the 6 register specifiers, dynamically determine whether one or two instructions can issue, and dispatch them to the appropriate functional units (FUs)
• VLIW approach
  – Packages multiple independent instructions into one very long instruction
  – Burden of finding independent instructions is on the compiler
  – No dependence within an issue package, or at least an indication when a dependence is present
  – Simpler hardware
Superscalar vs. VLIW
Superscalar vs. VLIW
Role of the Compiler forSuperscalar and VLIW
What is EPIC?
• Explicitly Parallel Instruction-set Computing – a new hardware architecture paradigm that is the successor to VLIW (Very Long Instruction Word).
• EPIC features:
  – Parallel instruction encoding
  – Flexible instruction grouping
  – Large register file
  – Predicated instruction set
  – Control and data speculation
  – Compiler control of memory hierarchy
History
• A timeline of architectures:
  Hard-wired discrete logic → Microprocessor (advent of VLSI) → CISC (microcode) → RISC (microcode bad!) → VLIW → Superpipelined, Superscalar → EPIC
• Hennessy and Patterson were early RISC proponents.
• VLIW (very long instruction word) architectures, 1981:
  – Alan Charlesworth (Floating Point Systems)
  – Josh Fisher (Yale / Multiflow – Multiflow Trace machines)
  – Bob Rau (TRW / Cydrome – Cydra-5 machine)
History (cont.)
• Superscalar processors
  – Tilak Agerwala and John Cocke of IBM coined the term
  – First microprocessor (Intel i960CA) and workstation (IBM RS/6000) introduced in 1989
  – Intel Pentium in 1993
• Foundations of EPIC – HP
  – HP FAST (Fine-grained Architecture and Software Technologies) research project in 1989 included Bob Rau; this work later developed into the PlayDoh architecture.
  – In 1990 Bill Worley started the PA-Wide Word (PA-WW) project. Josh Fisher, also hired by HP, contributed to these projects.
  – HP teamed with Intel in 1994 – announced EPIC and the IA-64 architecture in 1997.
  – First implementation was Merced / Itanium, announced in 1999.
IA-64 / Itanium
• IA-64 is not a complete architecture – it is more a set of high-level requirements for an architecture and instruction set, co-developed by HP and Intel.
• Itanium is Intel's implementation of IA-64. HP announced an IA-64 successor to their PA-RISC line, but apparently abandoned it in favor of joint development of the Itanium 2.
• Architectural features:
  – No complex out-of-order logic in the processor – leave it to the compiler
  – Large register file, multiple execution units (again, the compiler has visibility to explicitly manage them)
  – Hardware support for advanced ILP optimization: predication, control and data speculation, register rotation, multi-way branches, parallel compares, visible memory hierarchy
  – Load/store architecture – no direct-to-memory instructions
Hardware Resources
• General (user-accessible) registers
  – 128 64-bit general registers; each has a 1-bit NaT flag attached (GR0 hardwired to 0)
  – 128 82-bit floating-point registers
  – 64 1-bit predicate registers (p0 hardwired to 1, i.e. always true)
  – 8 64-bit branch registers
• Special application registers
  – Current Frame Marker (CFM), Loop Counter (LC), Epilog Counter (EC)
• Advanced Load Address Table (ALAT)
• Register Stack Engine (RSE)
Instruction Groups
• Sequence of instructions with no register dependencies (can all be executed in parallel)
  – WAR dependencies are allowed
  – Memory operations still require sequential execution
• Stops delimit groups, so a group of parallel instructions can span bundles (this allows a new hardware implementation to add functional units – it would still be able to execute code compiled for a smaller number of units)
  – Itanium 2 can issue up to six instructions simultaneously
Predication
• The IA-64 includes 64 special-purpose registers that can be used to predicate most of the normal instructions.
• For example, this allows the following conversion:

  if (a > b) then
      c = c + a
  else
      c = c + b

becomes

       cmp.lt p1, p2 = b, a
  (p1) c = c + a
  (p2) c = c + b

• Most special compare / test instructions write the result and its complement into two different predicate registers
• Predication converts control dependencies into data dependencies, allowing more scheduling options
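The effect of if-conversion can be mimicked in plain C: compute both predicates as 0/1 masks and apply both updates, so no branch remains (a sketch of the idea, not IA-64 semantics):

```c
/* Branch-free "predicated" version of:
 *   if (a > b) c = c + a; else c = c + b;
 * p1 and p2 play the role of the complementary predicate registers. */
int predicated_update(int a, int b, int c) {
    int p1 = (a > b);   /* predicate, like p1 */
    int p2 = !p1;       /* complement, like p2 */
    c += p1 * a;        /* contributes only "under" p1 */
    c += p2 * b;        /* contributes only "under" p2 */
    return c;
}
```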
Software Pipelining
• Similar to hardware pipelining – overlap executions of different pieces of a loop. Requires multiple execution units.
• Problem with loop unrolling / pipelining – requires a lot of registers.
• Hardware support:
  – Rotating predicate registers to embed the prolog and epilog of the software pipeline within the main loop – minimizes code expansion.
  – Rotating registers can be used to avoid register renaming within unrolled loops, further minimizing code expansion.
  – Loop Count (LC) and Epilog Count (EC) registers, and a special loop branch instruction to rotate registers.
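A schematic two-stage software pipeline, written in C for illustration (a real compiler would schedule the two stages onto separate functional units; the explicit prolog/kernel/epilog structure is the point):

```c
/* Two-stage software pipeline for y[i] = 2*x[i].
 * Stage 1: load x[i].  Stage 2: multiply and store.
 * The kernel overlaps stage 1 of iteration i with stage 2 of i-1. */
void pipelined_double(const int *x, int *y, int n) {
    if (n <= 0) return;
    int loaded = x[0];              /* prolog: fill the pipeline */
    for (int i = 1; i < n; i++) {
        int next = x[i];            /* stage 1 for iteration i   */
        y[i - 1] = loaded * 2;      /* stage 2 for iteration i-1 */
        loaded = next;              /* "rotate" the register     */
    }
    y[n - 1] = loaded * 2;          /* epilog: drain the pipeline */
}
```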
The Register Stack
• A portion of the process address space is set aside as a backing store for registers – spills go to this area.
• The processor supports a register stack frame similar to a normal memory stack frame.
  – The first 32 registers are static; the remaining 96 form the register stack.
  – The Current Frame Marker (CFM) and Top of Stack (TOS) registers are used to manipulate stack frames.
  – A called routine issues the alloc instruction to indicate how many registers it requires; this allocates the next frame and updates the registers.
  – Code uses virtual register names (i.e., it always uses register 32 as the first register in its frame) – the processor maps them to physical registers.
Memory Hierarchy Control
• Compiler can insert prefetch hints (lfetch) in the instruction stream.
• Load, store, and prefetch instructions can include a post-increment, which will also cause a cache load of the line containing the post-incremented address.
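A portable analogue of compiler-inserted lfetch hints is GCC/Clang's `__builtin_prefetch`. A sketch, where DIST is an assumed prefetch distance that would be tuned per machine:

```c
#define DIST 16   /* assumed prefetch distance, machine-dependent */

/* Sum an array while prefetching DIST elements ahead of the current one. */
long sum_with_prefetch(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST]); /* hint: bring the line into cache */
        s += a[i];
    }
    return s;
}
```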
Superscalar, VLIW and EPIC
Name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II/III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64 III
VLIW/LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium, Itanium 2
Major Categories
From Mark Smotherman, “Understanding EPIC Architectures and Implementations”
What’s a Compiler to Do?
• Many new architectural features are based on improving opportunities for ILP.
  – Obviously, there are high expectations for the compiler to perform good instruction scheduling.
  – The scheduler should probably drive many other optimizations.
• Tradeoff between ILP and code expansion.
• The choice of making the compiler do a lot of the work is probably a good one – it has much more context available to make decisions.
  – However, it lacks knowledge of the dynamic run-time environment.
Superscalar, VLIW and ILP
• ILP has a significant impact on performance, which is important for high-end systems
  – Superscalar
  – VLIW/EPIC
• For embedded systems, ILP yields performance gains and power savings (lower clock rate), provided that the ILP implementation is low-cost and low-power
  – VLIW
  – DSP and custom
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
Expected source code:

  if (a==2)    // B1
      a = 0;
  if (b==2)    // B2
      b = 0;
  if (a!=b)    // B3

Unexpected source code (the correlated branches are separated by a call or a loop):

  if (a==2)    // B1
      a = 0;
  if (b==2)    // B2
      b = 0;
  call fun1() ...   or   for(i=0;i<N;i++) ...
  if (a!=b)    // B3

• Branch correlation can be very long distance

[Figure: the intervening loop's branch outcomes fill the GHR's history bits, pushing the outcomes of B1 and B2 out of the history before B3 is predicted]
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
We observed: it is straightforward for the compiler to identify such very-long-distance branch correlation based on high-level dataflow information.
[Figure: code fragments from a chess program (search_root, push_slidE, bishop_mobility) in which correlated branches B1–B4 are separated by procedure calls and loops; the compiler brackets the intervening regions with "save history" / "restore history" operations]
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
Compiler work flow: Source code → Identifying program substructure → Analyzing branch correlation → Inserting guiding instructions → Binary code

Loop transformation:

[Figure: a loop (header and basic blocks BB1–BB4) is given a preheader containing guide_save_history and a tail containing guide_restore_history, so the loop's branch outcomes do not disturb the correlation between the pred-branches before the loop and the succ-branches after it]
Energy-Efficient Branch Prediction withCompiler-Guided History Stack
Compiler work flow (cont.): procedure call transformation

[Figure: a call site in BB1 is bracketed with guide_save_history before "call proc" and guide_restore_history after it, preserving the branch-history correlation between the pred-branches and succ-branches across the call]
CHS-based Branch Predictor
[Figure: a gshare predictor – the GHR, hashed with the branch address, indexes a PHT of counters, while a BTB selects between baddr+4 and the predicted target – extended with a CHS (compiler-guided history stack): guide_save_history pushes the current GHR onto the stack and guide_restore_history pops it back]
Performance of gshare-CHS
[Figure: on gzip, mcf, parser, twolf, gcc06, gobmk, hmmer, sjeng, and astar, gshare-CHS reduces branch MPKI (mispredictions per kilo-instruction, smaller is better) by 28% on average versus gshare, and improves IPC by 10% on average]
Compiler Microarchitecture Cooperation
How do we feed compiler-guided information to the predictor? Compiler–microarchitecture cooperation.

[Figure: (a) the traditional layering – Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electronic Devices – versus (b) improved layers in which the compiler and the microarchitecture collaborate directly]
Compiler Optimization for Accelerator-Based Architectures
HPC Design Using Accelerators
• High level of performance from accelerators
• Variety of general-purpose hardware accelerators
  – GPUs: NVIDIA, ATI, …
  – Accelerators: ClearSpeed, Cell BE, …
  – Plethora of instruction sets, even for SIMD
• Programmable accelerators, e.g., FPGA-based
• HPC design using accelerators
  – Exploit instruction-level parallelism
  – Exploit data-level parallelism on SIMD units
  – Exploit thread-level parallelism on multiple units / multi-cores
• Challenges
  – Portability across different generations and platforms
  – Ability to exploit different types of parallelism
Accelerators – Cell BE
Accelerators - 8800 GPU
Evolution of Intel Vector Instructions
• MMX (1996, Pentium)
  – CPU-based MPEG decoding
  – Integers only, 64-bit divided into 2 x 32 to 8 x 8
  – Phased out with SSE4
• SSE (1999, Pentium III)
  – CPU-based 3D graphics
  – 4-way float operations, single precision
  – 8 new 128-bit registers, 100+ instructions
• SSE2 (2001, Pentium 4)
  – High-performance computing
  – Adds 2-way double-precision float ops; same registers as 4-way single precision
  – Integer SSE instructions make MMX obsolete
• SSE3 (2004, Pentium 4E Prescott)
  – Scientific computing
  – New 2-way and 4-way vector instructions for complex arithmetic
• SSSE3 (2006, Core Duo)
  – Minor advancement over SSE3
• SSE4 (2007, Core 2 Duo Penryn)
  – Modern codecs, cryptography
  – New integer instructions
  – Better support for unaligned data, super shuffle engine
• AVX – advanced version of SSE
  – 128 bits to 256 bits, and 3-operand instructions (up from 2)
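A minimal taste of the SSE instructions above, using Intel's C intrinsics (assumes an x86 target with SSE available):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* One 4-way single-precision add, the kind of data-level parallelism
 * introduced with SSE on the Pentium III. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);            /* load 4 unaligned floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 adds in one instruction */
}
```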
The Challenge
Approach Framework
[Figure: stream programs are lowered to a high-level intermediate representation, then compiled by a profile-based compiler and runtime system to CUDA for GPUs and to multicore targets]
Existing Approaches
Single Threaded
SIMD Execution
Existing Approaches (cont.)
Execution on Cell BE
Our Approach for GPUs
Stream Graph Execution
Stream Graph: A → (B, C) → D

[Figure: SIMD execution on SM1–SM4 – each filter runs on all four SMs in turn (A1–A4, then B1–B4, then C1–C4 and D1–D4) over time steps 0–7; buffer requirement = 4x]
Stream Graph Execution
Stream Graph: A → (B, C) → D

[Figure: software-pipelined execution on SM1–SM4 – filters from different steady-state iterations run concurrently (e.g., A3/A4 overlap with B1/B2 and C1/D1) over time steps 0–7; buffer requirement = 2x]
Compiler Framework
Experimental Results
Speedup on GPU (8800) compared to CPU of stream programs
Filters are coarsened before scheduling!
Compiler Optimization for Embedded Systems
Design Cost
Embedded-Specific Tradeoffs for Compilers
• Space, time, and energy tradeoffs
• These are contrasting goals for the compiler
• Ideally, a compiler should expose some linear combination of the optimization dimensions to application developers:

  K1{speed} + K2{code_size} + K3{energy_efficiency}
Effect of Compiler Optimizations on Space
• Code size determines cost of ROM
• Should minimize I-cache and D-cache misses
• Code layout techniques:
  – DAG-based placement
  – Pettis-Hansen
  – Inlining
  – Cache line coloring
  – Temporal-order placement
Power-aware Software Techniques
• Reducing switching activity
• Power-aware instruction selection
• Scheduling for minimal dissipation
• Memory access optimizations
• Data remapping
Effect of CompilerOptimizations on Power
• Most of the traditional "scalar" optimizations benefit space, time, and (therefore) power
• Not so for:
  – Loop unrolling
  – Tail duplication
  – Inlining and cloning
  – Speculation and predication
  – Global code motion
• These and other optimizations will be discussed in subsequent lectures
Low-Power Compilation Support (1): Compiler-Directed DVS-Based Power Optimization

Guided by profiling information or manual annotation, directives are added to the program; the compiler recognizes the directives and inserts voltage-switching code segments, effectively implementing fine-grained, intra-program DVS. According to the literature and preliminary measurements, this approach reduces average power consumption by about 20%.

[Figure: power versus processor utilization over time under dynamic voltage/frequency scaling, compared with a work/sleep (race-to-idle) scheme]
Low-Power Compilation Support (2): Power-Constrained Code Generation

Each instruction is annotated with a power weight; lower-power instructions can be substituted for higher-power ones, and the compiler chooses among code-generation alternatives accordingly. Reordering code to reduce the number of logic transitions can cut switching by 20–30%, at a possible performance cost of 1–2%; with a deliberate placement strategy, a balance point can be found for the application's requirements. Likewise, adjusting the code layout of parallel programs can effectively lower the required voltage without affecting performance.

Original order:
  (a) 10010011
  (b) 11101000
  (c) 10111011
  (d) 01110100
  total switches: 24

Reordered:
  (a) 10010011
  (c) 10111011
  (b) 11101000
  (d) 01110100
  total switches: 18
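The metric behind this reordering is the number of bit flips between consecutive instruction encodings. A minimal sketch (byte-wide encodings for illustration; the slide's exact totals depend on how transitions are counted, so this only demonstrates the metric and the benefit of reordering):

```c
#include <stdint.h>

/* Number of differing bits between two encodings. */
static int hamming(uint8_t x, uint8_t y) {
    int d = 0;
    for (uint8_t v = x ^ y; v; v >>= 1)
        d += v & 1;
    return d;
}

/* Total switching activity of a code sequence: the sum of Hamming
 * distances between consecutive instruction encodings. */
int total_switches(const uint8_t *code, int n) {
    int total = 0;
    for (int i = 1; i < n; i++)
        total += hamming(code[i - 1], code[i]);
    return total;
}
```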
Compiler Optimizationfor Many-core
THIS is the Problem!

[Figure: SPEC CPU integer performance over time – the historical growth curve flattens around 2004]
Summary
• Today
  – Introduction & Basic Concepts
  – Superscalar & VLIW
  – Accelerator-Based Computing
  – Compiler for Embedded Systems
  – Compiler Optimization for Many-Core
• Next
  – Future Compilers