Compiler Techniques for Single Processor Tuning

An Introduction

The Compiler

The compiler manages processor resources:

• registers

• integer/floating-point execution units

• load/store/prefetch for data flow in/out of processor

• the implementation details of processor and system architecture are built into the compiler

Compilation process:

User program (C/C++/Fortran, etc.), high level representation
  -> intermediate representation
  -> machine instructions, low level representation

Solving:
– data dependencies
– control flow dependencies
– parallelization
– compactification of code
– optimal scheduling of the code

MIPSpro Compiler Components

• There are no source-to-source optimizers or parallelizers

• Source code is translated to WHIRL (Winning Hierarchical Intermediate Representation Language):
– same IR for different levels of representation
– whirl2f and whirl2c translate back into Fortran or C from the IR

• Inter-Procedural analyzer requires final translation at link time

[Figure: MIPSpro compilation pipeline. Components shown: driver (f77/f90, cc/CC), macro pre-processor, front-end (source to WHIRL format), Inter-Procedural Analyzer, loop nest optimizer, parallel optimizer, global optimizer, code generator, linker with Inter-Procedural Analyzer. Input: source. Outputs: object, executable.]

Compiler Optimizations

• Global Optimizer:
– dead code elimination
– copy propagation
– loop normalization
    • stride one loops
    • single induction variable
– memory alias analysis
– strength reduction

• Inter-Procedural Analyzer:
– cross-file function inlining
– dead function elimination
– dead variable elimination
– padding of variables in common blocks
– inter-procedural constant propagation

• Automatic Parallelizer:
– loop level work distribution

• Loop Nest Optimizer:
– loop unrolling (outer)
– loop interchange
– loop fusion/fission
– loop blocking
– memory prefetch
– padding local variables

• Code Generator:
– software pipelining
– inner loop unrolling
– if-conversion
– read/write optimization
– recurrence breaking
– instruction scheduling inside basic blocks

SGI Architecture, ABI, Languages

• Instruction Set Architecture (ISA):
– -mips4 (R1x000, R8000, R5000 processors)
– -mips3 (R4400)
– -mips[1|2] (R3000, R4000 processors, invokes old ucode compiler)

• ABI (Application Binary Interface):
– -n32 (32 bit pointers, 4 byte integers, 4 byte real)
– -64 (64 bit pointers, 4 byte integers, 4 byte real)

• Languages:
– C
– C++
– Fortran 77
– Fortran 90

Variable          C size [bit]      Fortran size [bit]
                  -n32    -64       -n32    -64
char/character      8       8         8       8
short              16      16
int/integer        32      32        32      32
long               32      64
long long          64      64
logical                              32      32
float/real         32      32        32      32
double             64      64        64      64
pointer            32      64

Options: ABI & ISA

Option        Functionality
-n32          invoke the MIPSpro compiler, use 32 bit addressing
-64           invoke the MIPSpro compiler, use 64 bit addressing
-o32/-32      invoke the old ucode compiler, 32 bit addressing
-mips[1234]   ISA; -mips[12] implies ucode compiler

There are two more ways to define the ABI and ISA:

• the environment variable "SGI_ABI" can be set to -n32 or -64

• the ABI/ISA/processor/optimization can be set in a file ~/compiler.defaults or /etc/compiler.defaults. In addition, the location of the file can be defined by the "COMPILER_DEFAULTS_PATH" environment variable.

The file should contain a line like:

    -DEFAULT:abi=n32:isa=mips4:proc=r10000:arith=3:opt=O3

There is a way to find which compiler flags were used:

    dwarfdump -i file.o | grep DW_AT_producer

Optimization Levels

Compilation speed degrades with higher optimization.

• -O0 turn off all optimizations

• -O1 only local optimizations

• -O2 or -O extensive but conservative optimizations

• -O3 aggressive optimizations, LNO, software pipelining

• -ipa inter-procedural analysis (only at -O2 and -O3)

• -apo automatic parallelization option (same as -pfa)

• -g[0|3] debugging switch: -g0 forces -O0; -g3 to debug with -O3

Options: Performance

Option          Functionality
-r10000         Generate optimal instruction schedule for the R10000 processor
-r8000          Generate optimal instruction schedule for the R8000 processor
-O[0|1|2|3]     Set optimization level to 0, 1, 2, 3
-Ofast=[ipXX]   Select best optimization for the given architecture
-mp             Enable multi-processing directives
-mpio           Support I/O from a parallel region
-apo            Invoke automatic parallelization option

XX    machine (output of the hinv -c processor command)
27    Origin2000 (all cpu frequencies and cache sizes)
35    Origin3000 (all cpu frequencies and cache sizes)

The optimizations selected by -Ofast may differ depending on the version of the compiler. Currently:

    -O3 -IPA -TARG:platform=ip27 -n32 -OPT:Olimit=0:roundoff=3:div_split=ON:alias=typed

(thus the -Ofast switch invokes the Interprocedural Analyzer)

Options: Porting

Option        Functionality
-d8/d16       Double precision variables as 8 or 16 bytes
-r8           Convert REAL to REAL*8 and COMPLEX to COMPLEX*16 (1)
-i8           Convert INTEGER to INTEGER*8 and LOGICAL to 8 byte sizes (1)
-static       Local variables will be initialized in fixed locations on the heap
              (-static_threadprivate makes static variables private to each thread)
-col[72|120]  Source line is 72 or 120 columns
-Dname        Define name for the pre-processor
-Idir         Define include directory dir
-alignN       Assume alignment on the N=8,16,32,64,128 bit boundary
-G0           Put all static data into indirect address area
-xgot         Make big tables for static data and program addresses
-multigot     Automatic choice of table sizes for static variables and addresses
-version      Show compiler version
-show         Put the compiler in verbose mode: all switches are displayed

(1) Note: explicit sizes are preserved, i.e. REAL*4 remains 32 bit

Options: Debugging

Option    Functionality
-g        Disable optimization and keep all symbol tables
-DEBUG:   the DEBUG group option (man DEBUG_GROUP):

• check_div=n
    n=1 (default) check integer divide by zero
    n=2 check integer overflow
    n=3 check integer divide by zero and overflow

• subscript_check (default ON) to check for subscripts out of range
    C/C++: produces trap #8
    f77: aborts run and dumps core
    f90: aborts run if setenv F90_BOUNDS_CHECK_ABORT

• verbose_runtime (default OFF) to give source line number of failures

• trap_uninitialized (default OFF) initialize all variables to 0xFFFA5A5A
    when used as pointer: access violation
    when used as fp values: NaN causes fp trap

Example:

    f77 -n32 -mips4 -g file.f \
        -DEBUG:subscript_check:verbose_runtime=ON \
        -DEBUG:check_div=3 -DEBUG:trap_uninitialized=ON

Compilation Examples

1. Produce executable a.out with default compilation options:

    f77 source.f
    cc source.c

   Be aware of the defaults setting (e.g. /etc/compiler.defaults). The same flags are used for Fortran and C.

2. Options for debugging:

    f77/cc -o prog -n32 -g -static source.f

3. Explicit setting of ABI/ISA/processor, highest optimization:

    f77/cc -o prog -n32 -mips4 -r10000 -O3 source.f

4. Detailed control of the optimization process with the group options:

    f77/cc -o prog -64 -mips4 -O3 -Ofast=ip27 -OPT:round=3:IEEE_arith=3 -IPA:dfe=on ...

Fine Tuning Compiler Actions

The compiler performs many sophisticated optimizations on the source code under certain assumptions about the program. Typically:

• program data is large (does not fit into the cache)
• program does not violate the language standard
• program is insensitive to roundoff errors
• all data in the program is aliased, unless it can be proved otherwise

If one or more of these assumptions does not hold, the compiler should be tuned to the program with the compiler options. Most important:

• OPT for general optimization assumptions
• LNO for the Loop Nest Optimizer options
• IPA for the Inter-Procedural Analyzer options

Additional options that help to tune the compiler properly:

• TENV, TARG for the target machine and environment description

    -TENV:align_aggregates=x (bytes)

• LIST, DEBUG for the listing and debugging options

Group Options

Compiler options can be set with key=value expressions on the command line. These options are combined in logical groups. Multiple key=val expressions are colon separated; the same group heading can be specified several times, and the effects are cumulative:

E.g.: -OPT:roundoff=2:alias=restrict -OPT:IEEE_arithmetic=3 etc.

Group heading      Reference page         Usage comments
-OPT:key=val       cc(1) f77(1) opt(5)    Optimizations
-TENV:key=val      cc(1) f77(1)           Control target environment
-TARG:key=val      cc(1) f77(1)           Control target architecture
-FLIST/CLIST       cc(1) f77(1)           Listing control
-LIST:key=val      cc(1) f77(1)           Options to control listing
-DEBUG:key=val     debug_group(5)         Debugging options
-IPA:key=val       ipa(5)                 Inter-Procedural Analyzer control
-INLINE:key=val    ipa(5)                 Procedure inliner control
-LNO:key=val       lno(5)                 Loop Nest Optimizer control
-MP:key=val        cc(1) f77(1)           Parallelization control
-LANG:             cc(1) f77(1)           Language compatibility features
-CG:               cc(1) f77(1)           Code generation
-WOPT:             cc(1) f77(1)           Global optimizer

Compiler man Pages

• Primary man pages:

    man f77(1) f90(1) cc(1) CC(1) ld(1)

• Some of the compiler option groups are rather large and deserve their own man pages:

    man opt(5)
    man lno(5)
    man ipa(5)
    man DEBUG_GROUP(5)
    man mp(3F)
    man pe_environ(5)
    man sigfpe(3C)

The Run-Time Library Structure

[Figure: run-time library directory tree]

/usr
  lib/     (o32, ucode compiler)
    *.a, *.so
    nonshared/*.a
  lib32/   (n32)
    *.a, *.so
    cmplrs/  (mongoose compiler)
    nonshared/*.a
    mips3/
    mips4/
  lib64/   (n64)
    *.a, *.so
    cmplrs/  (mongoose compiler)
    nonshared/*.a
    mips3/
    mips4/

The mips3/ and mips4/ directories each contain *.a, *.so and nonshared/*.a.

Environment variables:
  LD_LIBRARY_PATH
  LD_LIBRARYN32_PATH
  LD_LIBRARY64_PATH

The Scientific Libraries

Standard scientific libraries containing:

• Basic Linear Algebra operations and algorithms:
– BLAS1, BLAS2, BLAS3 (see man intro_blas1, _blas2, _blas3)
– LAPACK (see man intro_lapack)

• Fast Fourier Transformations (FFT):
– 1D, 2D, 3D, multiple 1D transformations (see man intro_fft)

• Convolutions (Signal Processing, e.g. man SIIR2D)

• Sparse Solvers (see man solvers; man PSLDLT)

To use:
– -lscs for serial versions (-lscs_i8, -lscs_i8_mp for long integers)
– -lscs_mp -mp for parallel versions
– man intro_scsl for detailed description
– -lcomplib.sgimath or -lcomplib.sgimath_mp for older versions
– man complib.sgimath for detailed description

Computational Domain

Range of numbers (from /usr/include/limits.h):

    FLT_DIG        6   /* decimal digits of precision of a float */
    FLT_MAX        3.40282347E+38F
    FLT_MIN        1.17549435E-38F

    DBL_DIG        15  /* decimal digits of precision of a double */
    DBL_MAX        1.7976931348623157E+308
    DBL_MIN        2.2250738585072014E-308

    LONGLONG_MIN   -9223372036854775807LL-1LL
    LONGLONG_MAX   9223372036854775807LL
    ULONGLONG_MAX  18446744073709551615LLU

Extended precision (REAL*16) is available and supported by the compiler, but this mode of calculation is slow (by a factor of ~40).

Underflow and Denormal Numbers

When de-normalized numbers emerge in a computation (i.e. numbers |x| < DBL_MIN) they are flushed to zero by default:

      Program denorm
      real*8 a,b
      a = 2.2250738585072014D-308
      b = a/10.0D0
      write(6,10) b
      end

will print zero. To force IEEE-754 gradual underflow it is necessary to manipulate the status register on the R1x000 CPU. Calling no_flush at the beginning of the program will print 0.22250738585072014D-308:

      #include <sys/fpu.h>
      void no_flush_()
      {
          union fpc_csr f;
          f.fc_word = get_fpc_csr();   /* read the FP control/status register */
          f.fc_struct.flush = 0;       /* clear the flush-to-zero bit */
          set_fpc_csr(f.fc_word);      /* write it back */
      }

The flush-to-zero property can lead to x-y=0 while x ≠ y. Keeping de-normalized numbers in computations will avoid that condition, but will cause an fp exception that must be processed in software.

For performance reasons, it is best to avoid de-normalized numbers in calculations.

Overflow Example

Program example that generates overflows and underflows:

      Parameter (N=20)
      INCLUDE "/usr/include/limits.h"
      Real*8 A(N),B(N)
      complex*16 C(N)

      do I=1,N
        A(I) = (FLT_MAX/10)*I   ! single precision range
        B(I) = (FLT_MIN*10)/I   ! will fit into double
      enddo

      C = CMPLX(A,B)   ! Standard requires passing from base precision: real*4 !

      write (0,'(I3,2(2G22.15/))') (I,A(I),B(I),C(I),I=1,N)

Compile with: f77 -n32 -mips4 -O3
Note: Compilation with -r8 avoids the error.

Output with all exceptions ignored by default:

  10 0.340282347000000E+39 0.117549435000000E-37    A,B
     0.340282346638529E+39 0.117549435082229E-37    Cr,Ci
  11 0.374310581700000E+39 0.106863122727273E-37
     Infinity              0.000000000000000        Overflow! Flush to zero!
  12 etc…

setenv TRAP_FPE "UNDERFL=TRACE; OVERFL=TRACE" will trap at overflow and underflow and produce a traceback (link with -lfpe).

Floating Point Exceptions

A flag in the fp status register is set when the FPU encounters an illegal condition:

• division by zero
• overflow
• underflow
• invalid
• inexact

By default, all exceptions are ignored! (e.g. for 1/0 the result is set to Infinity, for 0/0 to NaN, and execution continues)

The status register can be programmed to raise a Floating Point Exception. If an FPE occurs, the system can take a specified action:

• abort
• ignore the exception
• repair the illegal condition

You can manipulate the status register to select the action:

• with calls to the FPE library, link with -lfpe
• with the environment variable TRAP_FPE

see man handle_sigfpes

Compiler-Generated Exceptions

• The compiler can do more optimizations if it is allowed to generate code that could cause exceptions (-TENV:X=0..4)

    X = 0 no speculative code motion
    X = 1 IEEE-754 underflow and inexact FPE disabled (default -O0 and -O2)
    X = 2 all IEEE-754 exceptions disabled except 1/0 (default -O3)
    X = 3 all IEEE-754 exceptions are disabled
    X = 4 memory access exceptions are disabled

IF-conversion with conditional moves for Software Pipelining (with -O3):

      do i=1,N
        if(a(i) .lt. eps) then
          x = x + 1/eps
        else
          x = x + 1/a(i)
        endif
      enddo

    # put eps in $f1 and (1/eps) in $f0
    ldc1     $f5,-8($2)        load a(i)
    c.lt.d   $fcc0,$f5,$f1     if(a(i) < eps)
    recip.d  $f2,$f5           1/a(i)
    movt.d   $f2,$f0,$fcc0     y = 1/a or 1/eps
    add.d    $f1,$f1,$f2       x = x + y

Removing the IF(…) will cause a divide by zero! In this case the exception must be ignored.

Note: this transformation is already applied at -O3 with X=1!

IEEE-754 Compliance

The MIPS4 instruction set contains IEEE-754 non-compliant instructions:

• recip.s/d (reciprocal 1/x) instruction is accurate to 1 ulp

• rsqrt.s/d (reciprocal-sqrt: 1/sqrt(x)) instruction is accurate to 2 ulp

-OPT:IEEE_arithmetic=X specifies the degree of non-compliance and what to do with inf and NaN operands:

    X = 1 strict IEEE-754 compliance; does not use the recip and rsqrt instructions (default at -O1, -O2)
    X = 2 optimize 0*x=0 and x/x=1 while x can be NaN (default at -O3)
    X = 3 any mathematically valid transformation is allowed, including recip & rsqrt instructions

Original loop (21 cycles/iteration; 4% peak):

      do i=1,n
        x = x + a(i)/y
      enddo

With -O3 -OPT:IEEE_arithmetic=3 (1 cycle/iteration; 100% peak):

      y_tmp = 1/y
      do i=1,n
        x = x + a(i)*y_tmp
      enddo

Note: X=3 is required!

Rounding Accuracy

Rounding mode can be specified with the -OPT:roundoff=X switch:

    X = 0 no optimizations that affect fp behaviour (default at -O1 -O2)
    X = 1 allows simple transformations with limited round-off and overflow differences
    X = 2 allows reordering of reduction loops (default at -O3)
    X = 3 any mathematically valid transformation is allowed

With -O3 -OPT:roundoff=1 (2 cycles/iter; 25% peak):

      do i=1,n
        x = x + a(i)
      enddo

With -O3 -OPT:roundoff=2, the default at -O3 (1 cycle/iter; 50% peak):

      do i=1,n,8
        x0 = x0 + a(i)
        x1 = x1 + a(i+1)
        …
      enddo
      x = x0 + x1 + …

Recommendation: Your program should work correctly when compiled with -O3 -OPT:IEEE_arithmetic=3:roundoff=3

Summary

The compiler is the primary tool of program optimization

• Compilation is the process of lowering the code representation from high level to low, i.e. processor level

• The MIPSpro compiler targets the MIPS R1x000 processor and has the features of the processor and the Origin architecture built in

• A large number of options exist to steer the compilation process– ABI, ISA and optimization options selections

– setting of assumptions about the program behaviour

• There are optimized and parallelized libraries of subroutines for scientific computation

• When programming for a digital computer, it is important to remember the limitations due to limited validity range of the floating point calculations
