OPTIMIZING SOFTWARE THROUGH FULLY UNDERSTANDING THE UNDERLYING
ARCHITECTURE & STRENGTHENING THE COMPILER
by
AHMED ESKANDER MEZHER
B.Sc., University of Technology, 2009
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado Denver in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2013
This thesis for the Master of Science degree
by
Ahmed E. Mezher
has been approved for the
Computer Science Program
By
Gita Alaghband, Chair
Tom Altman
Bogdan Chlebus
September 12, 2013
Mezher, Ahmed E. (Master of Science in Computer Science)
Optimizing Software Through Fully Understanding the Underlying Architecture &
Strengthening the Compiler
Thesis directed by Professor Gita Alaghband
ABSTRACT
Software developers strive to write faster, more efficient programs. Increasing speed and
efficiency depends not only on software programming knowledge, but also on an understanding
of the underlying computer architecture and compiler. Many programmers do not consider the
machine architecture when they write code, yet knowing the architecture specifications of the
computer before writing code can influence program efficiency and speed. Optimizing
compilers also have a significant impact on the speed of program execution. The GCC compiler
is a well-known compiler with advanced techniques for optimizing code. However, the GCC
compiler has weaknesses in certain areas, such as optimizing arithmetic operations, function
calls, and loops, and providing fully automatic parallelization. These weaknesses can make
programs run slower. This thesis helps developers write more efficient software by presenting
the weaknesses and strengths of the machine architecture and by building tools that improve on
the weaknesses of the GCC compiler. We show the negative impact of ignoring the architecture
specifications when programs are written, in order to motivate software developers to pay
attention to the architecture specifications of the machine. Several tools are designed to improve
on the weaknesses in the GCC compiler. Our tools work together with the GCC compiler to help
developers design more efficient code, and they are not limited to a specific compiler
architecture. We tested our
tools on real C applications such as Strassen.cpp, Bubble Sort.cpp and Selection Sort.cpp, and
we successfully optimized them. By using our tools together with the GCC compiler, programs
become more efficient and faster. Through understanding the weaknesses and strengths of the
underlying computer architecture, and by using our tools together with the GCC compiler,
developers can design more efficient software.
The form and content of this abstract are approved. I recommend its publication.
Approved: Gita Alaghband
CONTENTS
Chapter
1. Introduction
2. Finding the limits of the underlying computer architecture through de-optimization techniques
2.1 Optimization information and recommendations for the AMD Opteron processor
2.2 Finding the limits of the underlying computer architecture through de-optimization techniques
2.3 Methods
2.4 Analysis and results
2.5 Detailed results
2.5.1 Branch density de-optimization
2.5.2 Unpredictable instruction de-optimization
2.5.3 Branch pattern de-optimization
2.5.4 Float comparison de-optimization
2.5.5 Costly instruction de-optimization
2.5.6 Load-store dependency de-optimization
2.5.7 High latency instruction de-optimization
2.5.8 If condition de-optimization
2.5.9 Loop re-rolling de-optimization
2.5.10 Dependency chain de-optimization
3. Building tools to optimize weaknesses in the GCC compiler
3.1 Compiler
3.2 GCC compiler
3.3 GCC optimizations
3.4 Our optimizations to the GCC compiler
3.5 Methods, analysis and results
3.5.1 Division vs. multiplication
3.5.2 Loop and recursive function
3.5.3 Loop re-rolling
3.5.4 Loop unrolling
3.5.5 Power vs. multiplication
3.5.6 SQRT function vs. division
3.5.7 The cost of the function call inside and outside loops
3.6 Automatic parallelization in compilers
3.7 Automatic parallelization in GCC
3.8 Our improvement to automatic parallelization of the GCC compiler
3.9 Optimizing real C programs by our tool
3.9.1 Strassen optimization
3.9.2 Bubble sort optimization
3.9.3 Selection sort optimization
4. Conclusion
References
LIST OF FIGURES
Figure
2.1 Branch density (mod_ten_counter)
2.2 Unpredictable instructions (factorial_over_array) de-optimization
2.3 Compare_two_floats
2.4 Costly instruction de-optimization
2.5 Load-store dependency de-optimization
2.6 Fib de-optimization
2.7 If condition de-optimization
2.8 Loop re-rolling de-optimization
2.9 Dependency chain de-optimization
3.1 Division vs. multiplication (milliseconds)
3.2 Loop and recursive function (time in milliseconds)
3.3 Loop re-rolling (milliseconds)
3.4 Loop unrolling (seconds)
3.5 Power vs. multiplication (seconds)
3.6 SQRT function vs. division (seconds)
3.7 Function calls inside and outside the loop (milliseconds)
3.8 Strassen optimization
3.9 Bubble sort optimization
3.10 Selection sort optimization
LIST OF TABLES
Table
2.1 Branch density de-optimization for the AMD Opteron and Intel Nehalem
2.2 Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem
2.3 Compare two floats de-optimization
2.4 Costly instruction de-optimization for the AMD Opteron and Intel Nehalem
2.5 Load-store dependency de-optimization for the Opteron and Intel Nehalem
2.6 Fib de-optimization for the Opteron and Intel Nehalem
2.7 If condition de-optimization for the AMD Opteron and Intel Nehalem
2.8 Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem
2.9 Dependency chain de-optimization for the AMD Opteron and Intel Nehalem
3.1 Time in milliseconds for the division vs. multiplication optimization technique
3.2 Time in milliseconds for the loop and recursive function optimization technique
3.3 Time in milliseconds for the loop re-rolling optimization technique
3.4 Time in seconds for the loop unrolling optimization technique
3.5 Time in seconds for the power vs. multiplication optimization technique
3.6 Time in seconds for the SQRT function vs. division optimization technique
3.7 Time in milliseconds for the cost of the function call inside and outside the loop optimization technique
3.8 Running time in seconds for the Strassen optimization
3.9 Running time in seconds for the Bubble sort optimization
3.10 Time in seconds for the Selection sort optimization
1. Introduction
This thesis is in the area of compilers and software. Writing efficient software depends
not only on the skill of programmers but also on knowledge of the underlying architecture and
compiler. Code optimized to run fast on one machine may run very slowly on a different
machine. For example, suppose a loop in a program has been unrolled several times so that the
code runs very fast on machine A, and a programmer runs the same code on a completely
different machine B. If the programmer does not know the architecture specifications of
machine B, the program may run very slowly because the instruction cache of machine B is
smaller than that of machine A, and when the number of instructions in a program exceeds the
size of the instruction cache, the program becomes slower. Therefore, knowing the architecture
specifications helps programmers write optimized code. Additionally, the effort required to
optimize generally depends upon knowledge of the architecture specifications: How long is its
pipeline? How does it detect hazards? Does it dynamically schedule? How does its caching
work? By understanding these aspects of the machine architecture, programmers can write very
efficient software. Does ignoring the architecture specifications when writing software affect its
efficiency? Chapter two answers this question. In chapter two, we research the weaknesses
and strengths of the underlying computer architecture and show the consequences of not
considering the architecture specifications when programs are written. The goal of chapter two is
to design a series of de-optimizations for the Opteron and to show how gracefully, or not, its
performance degrades. In some circumstances, serious degradations in performance were found.
In others, expected de-optimizations were difficult if not impossible to implement. By examining
the results of chapter two, one can gain a thorough understanding of which development aspects
need attention and which can be safely ignored. The de-optimizations that we implemented are
branch density, unpredictable instructions, branch patterns, float comparisons, costly
instructions, load-store dependency, high latency instruction, if condition, loop re-rolling and
dependency chain. Branch patterns, unpredictable instructions and branch density are deemed
unsuccessful because of the power of the hardware. The other de-optimizations showed a great
impact on the Opteron and on the Intel Nehalem as well.
Moreover, some programmers depend only on the compiler to optimize their code.
Depending on the compiler alone will not fully optimize code, because compilers may have
weaknesses. We chose to research the GCC compiler because it is a well-known compiler used
by many developers and programmers. We developed tools that help produce more efficient
code and compared the results with the GCC compiler. In chapter three, we present several
weaknesses in the GCC compiler and then build two tools that help programmers who use GCC
write more efficient code. The first tool detects optimization opportunities in a given program
and presents the programmer with a list of optimizations. The programmer can then apply the
optimizations manually to make the program more efficient. The optimizations that we
implemented for the GCC compiler are the division operation, loop and recursive function, loop
re-rolling, loop unrolling, the power operation, the square root function and function calls.
None of these optimizations is compiler- or architecture-dependent
except for loop unrolling, which depends on the size of the instruction cache, and the size of the
instruction cache differs from one architecture to another. Another part of our study concerns
automatic parallelization in the GCC compiler, which does not support dependence detection.
The GCC compiler can parallelize loops that do not have dependencies, so the programmer
needs to check a program manually for dependencies. Checking dependencies visually is not
always easy, because it requires following the iterations of the loop. If a loop has a very long
sequence of instructions, it becomes very hard and time-consuming to decide whether the loop
has dependencies. Therefore, we designed a tool that can detect most loop dependencies in
programs. Our tool gives the programmer a message indicating whether a particular program has
dependencies; it helps programmers identify dependencies and can save them a lot of time. The
main goal of chapter three is to improve the weaknesses we detected in the GCC compiler and to
build tools with a series of optimizations that help programmers write more efficient code. The
overall goal of the thesis is to help developers write the most efficient software possible.
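To make the dependence problem concrete, here is a minimal C sketch (our own illustration with hypothetical function names, not the tool itself) of the two cases such a dependence detector must distinguish:

```c
/* Iterations are independent: each b[i] depends only on b[i] itself,
   so a compiler may safely parallelize this loop. */
void double_all(int *b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = b[i] * 2;
}

/* Loop-carried dependence: iteration i reads a[i-1], which the
   previous iteration wrote, so the iterations must run in order. */
void add_chain(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1;
}
```

A dependence detector must report the second loop as unsafe to parallelize while allowing the first.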
2. Finding the limits of the underlying computer architecture through de-optimization
techniques
2.1 Optimization information and recommendations for the AMD Opteron processor
Many optimizations have been applied to 32-bit and 64-bit software to work with AMD
processors and the AMD architecture. These optimizations include instruction-decoding
optimizations, cache and memory optimizations, branch optimizations, scheduling
optimizations, integer optimizations, optimizing with the SIMD instructions and x87 floating-
point optimizations. Instruction-decoding optimizations help the processor maximize the number
of decoded instructions. There are two types of instructions: direct-path instructions and vector-
path instructions. In general, direct-path instructions minimize the number of operations that
cost the AMD processor. Three direct-path instructions can be completed in one cycle, and one
and a half double direct-path instructions per cycle. Therefore, using direct-path instructions
rather than vector-path instructions can optimize code. There are also optimizations for load-
execute instructions, load-execute integer instructions and many others of the instruction-
decoding type. Memory optimization is another optimization that can be applied to the AMD
processor. One memory optimization is using non-temporal store instructions such as
MOVNTPS and MOVNTQ. These instructions let the processor write data without first reading
it from memory, saving the time of reading memory or cache and therefore making programs
faster. This optimization can be applied to 32-bit software and 64-bit software. Another
optimization that can be applied to the AMD processor is branch optimization, which can
improve branch prediction
and minimize branch penalties. One possible branch optimization is avoiding conditional
branches that depend on random data, as these branches are difficult to predict. Moreover,
scheduling optimization is another optimization that can be applied to the AMD processor. One
possible scheduling optimization is pushing data directly onto the stack instead of loading it into
a register first; this technique reduces register pressure and eliminates data dependencies. This
optimization, too, applies to both 32-bit and 64-bit software [12].
We have shown some of the recommendations for the AMD processor. These hardware
recommendations can make programs faster. We then exercised the AMD hardware in order to
find the limitations of the hardware and what other weaknesses in the hardware could be
improved by the AMD designers.
2.2 Finding the limits of the underlying computer architecture through de-optimization techniques
The computer industry is developing very fast. CPU speeds are increasing, and most
computers come with more than one CPU. Software developers focus on writing good software
without considering the architecture specifications; they rely on compilers to do the dirty work
of concerning themselves with the details of the architecture. Fully understanding performance,
and what can hurt it on the hardware, gives us a clear clue about developing hardware or
software. In this part of the thesis, the limits of the underlying computer architecture were
researched by intentionally de-optimizing software. Since developers typically do not consider
architecture specifications, it is useful to see how optimized hardware behaves under the worst
circumstances. With this viewpoint, one can imagine that a company might choose one piece of
hardware over another not because it performs the best in very special circumstances, but rather
because its performance degrades the slowest in adverse circumstances. The goal of
this part of the thesis was to design a series of de-optimizations for the Opteron and to show how
gracefully, or not, its performance degrades. After the implementation, serious degradations in
performance were found. Also, in some circumstances it was harder to implement de-
optimization techniques because of the power of the hardware. The main benefit of this work is
that one can look at the results and decide which development aspects need attention and which
can be safely ignored.
2.3 Methods
In our implementation, we chose the AMD Opteron and the Intel Nehalem. The Opteron
was chosen as the CPU to benchmark. The Intel Nehalem is used to check whether the de-
optimizations are universal: if a de-optimization affects both the Opteron and the Nehalem, then
it may be fairly universal. In our implementation, we chose ten de-optimizations to implement.
The de-optimizations are:
● Branch Density - The code for this de-optimization contains densely packed branch
instructions in order to overload branch target buffers and/or branch prediction
algorithms.
● Unpredictable Instruction - The code for this de-optimization has intentionally
misaligned return instructions in order to prevent branch prediction.
● Float Comparison - The code for this de-optimization uses normal float comparison (a
conditional) when comparing integers (after casting) would suffice.
● Costly Instruction - The code for this de-optimization uses division when
multiplication would suffice.
● Load-Store Dependency - The code for this de-optimization sets up a load-store
dependency, the intention of which is to cause the CPU to wait until stores complete
before loads can occur.
● Dependency Chain - The code for this de-optimization creates a very long
dependency that quickly exhausts the resources of the dynamic scheduler.
● High Latency Instruction - The code for this de-optimization uses a high-latency
instruction (LOOP) when lower latency instructions (DEC/JZ) would suffice.
● If Condition Organization - The code for this de-optimization intentionally orders
sub-conditions of an IF statement so that sub-conditions which take longer to
evaluate are first.
● Loop Re-Rolling - The code for this de-optimization breaks a loop into two that could
otherwise be combined so that the dynamic scheduler has less scheduling leeway.
● Branch Patterns - The code for this de-optimization tries to find patterns that are
difficult for branch prediction hardware to predict.
All of these de-optimizations are described in more detail below.
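As a concrete illustration of the Costly Instruction case, the following C sketch (our own illustration with hypothetical function names, not the thesis benchmark itself) computes the same result two ways: the "de-optimized" form divides inside the loop, while the "optimized" form multiplies by a precomputed reciprocal, since a floating-point divide has a much higher latency than a multiply:

```c
/* "De-optimized": a floating-point divide in every iteration. */
double scale_div(const double *x, int n, double d) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] / d;
    return sum;
}

/* "Optimized": hoist the divide out of the loop and multiply by the
   reciprocal instead. */
double scale_mul(const double *x, int n, double d) {
    double r = 1.0 / d;   /* one division, performed once */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * r;
    return sum;
}
```

Note that the two forms can differ in the last bits of the floating-point result, which is why compilers only perform this substitution under relaxed floating-point settings.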
Most of these optimizations and de-optimizations were written in C and compiled with
GCC, after making sure that GCC would not undo or weaken the de-optimization techniques. In
some cases GCC can interfere with a de-optimization, so the code could not be written entirely
in C. When C could not be used, the NASM assembly language was used to build C-importable
assembly modules. These modules were then used by a C wrapper in order to implement the
de-optimization. Typically, the C wrapper builds an array that is passed to an assembly module,
which in turn processes the array. An important question is how to evaluate the de-optimization
techniques. Several tools can give information about code, such as CodeAnalyst and VTune. At
a glance, CodeAnalyst appeared ideal, since it proffers a great deal of information about program
execution and it was written for the chosen hardware platform. However, it turned out to be less
than ideal: it is cumbersome to work with, and it is hard to evaluate the data for sections of code
together. In short, it is great for profiling code but poor for generating lots of test results, so
CodeAnalyst was not an easy answer to the problem of evaluating de-optimizations. Rather than
using prepackaged software to evaluate de-optimizations, we decided to wrap important sections
with code that counts the number of cycles. In the post-Pentium world, there is a ready resource
for counting clock cycles: the CPU Timestamp Counter (CTC). The CTC counts the number of
clock cycles executed since the CPU was booted. It can be a little tricky to work with, especially
in multi-core, multi-CPU environments. However, if the core/CPU used for execution is tightly
controlled, it can be trusted to provide a reasonable measure of the number of cycles needed to
run a section of code.
#if defined(__i386__)
/* On 32-bit x86, RDTSC (opcode 0x0F 0x31) returns the 64-bit counter in EDX:EAX. */
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
/* On x86-64, combine the low and high halves returned in EAX and EDX. */
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif
The above code was inserted into all the C optimizations and de-optimizations. With this
code, a call to rdtsc() yields the current clock-cycle count (from the CTC). When running the
code, whether optimization or de-optimization, the number of clock cycles differs slightly from
one execution to another, so we run the code several times and compute the average over all
executions. For simplicity, we created a program called 'The Version Tester' (VT). This program
is simply an executable that takes a configuration file as an argument. Within the configuration
file, the number of test iterations is specified along with a series of programs to test. For each
program, there is a description and a run command line. For each specified program, the Version
Tester runs the executable for the specified number of iterations. After each iteration, it captures
the number of cycles (which is written to stdout by the executable being tested). After
completing the test iterations, the VT computes the average number of cycles per iteration.
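The wrap-and-average idea can be sketched as follows. This is a minimal illustration under our own naming (read_cycles, work, avg_cycles), not the actual Version Tester, and on non-x86 machines it falls back to clock(), which the thesis does not use:

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Read the CPU timestamp counter, as in the thesis. */
static inline uint64_t read_cycles(void) {
    unsigned lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | (((uint64_t)hi) << 32);
}
#else
#include <time.h>
/* Portability fallback for non-x86 machines (not cycle-accurate). */
static inline uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Example section of code to measure. */
static volatile long sink;
static void work(void) {
    long s = 0;
    for (long i = 0; i < 100000; i++) s += i;
    sink = s;
}

/* Run f() for the given number of iterations and return the average
   cycle count per iteration, as the Version Tester computes. */
uint64_t avg_cycles(void (*f)(void), int iters) {
    uint64_t total = 0;
    for (int i = 0; i < iters; i++) {
        uint64_t start = read_cycles();
        f();
        total += read_cycles() - start;
    }
    return total / (uint64_t)iters;
}
```

In practice the first few iterations should also be discarded as warm-up, since cold caches and branch predictors inflate the early measurements.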
2.4 Analysis and results
Many of the de-optimizations designed during the implementation worked as expected.
For example, the Branch Density de-optimization consistently showed a 12% slowdown on the
Opteron for all array sizes. This comported very well with the expected cost of so consistently
missing the branch target buffer (BTB) on the Opteron. Other de-optimizations had a greater
effect than anticipated. The Loop Re-Rolling and Dependency Chain de-optimizations showed
very significant slowdowns for all array sizes. The expectation was that winnowing down the
options available to the Opteron's dynamic scheduler would have an impact, but not such a stark
impact. The upshot of these surprises is that software engineers must consider the dynamic
scheduler in order to write software that can run efficiently on the Opteron. Finally, some de-
optimizations had little or an inconclusive effect. For example, the Branching Pattern de-
optimization turned out to be very difficult to realize. A slowdown was achieved using random
data, but no significant slowdown was achieved using patterned data. Since using random data
to cause mispredictions yields no useful information about the Opteron, this de-optimization was
deemed unsuccessful. Overall, modern CPUs are a wild cacophony of interacting threads and
processes, which makes profiling optimizations (and de-optimizations) very difficult. It turns out
that Hydra was a great place to test de-optimizations; its multiple nodes with multiple CPUs
made it easy to run processes with relative confidence that the program being tested would not
end up competing for resources. As a result, the data for the Opteron is very smooth and
consistent, while the noisy data for the Intel Nehalem reflects its meager resources.
2.5 Detailed results
2.5.1 Branch density de-optimization
For this de-optimization, we designed an algorithm called 'mod_ten_counter'. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 9. (Note that the data for this array
is not random; if it were, then branch mispredictions could artificially inflate the effect of the
Branch Density de-optimization.) After this array is populated, the number of times that each
integer x appears in the array is counted.
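In C, the algorithm can be sketched roughly as follows (a simplified reconstruction with hypothetical helper names; the thesis versions do the counting in NASM, as shown below):

```c
/* Populate an array with the deterministic pattern 0,1,...,9,0,1,...
   The data is intentionally not random, so branch mispredictions do
   not pollute the measurement. */
void fill_mod_ten(int *a, int size) {
    for (int i = 0; i < size; i++)
        a[i] = i % 10;
}

/* Count how many times each integer 0..9 appears in the array. */
void count_digits(const int *a, int size, long counts[10]) {
    for (int d = 0; d < 10; d++)
        counts[d] = 0;
    for (int i = 0; i < size; i++)
        counts[a[i]]++;
}
```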
The executables associated with this algorithm are called 'mod_ten_counter_op.exe' and
'mod_ten_counter_deop_1.exe' (on Windows). The code is written in C; the section of code
being optimized/de-optimized is written in NASM. The cycles counted include only the time
that mod_ten_counter spends running its assembly code.
The optimized version counts the integer instances within the array using a structure that
is much like a switch statement in C. However, the branch statements within this structure
are spaced out using NOP instructions. These NOP instructions ensure that the optimized
version maintains proper alignment and spacing for branch instructions so that branch target
buffer (BTB) misses are not incurred. The important optimized section is below:
cmp ecx, 0
je mark_0 ; We have a 0
nop
dec ecx
je mark_1 ; We have a 1
nop
dec ecx
je mark_2 ; We have a 2
nop
dec ecx
je mark_3 ; We have a 3
nop
dec ecx
je mark_4 ; We have a 4
nop
dec ecx
je mark_5 ; We have a 5
nop
dec ecx
je mark_6 ; We have a 6
nop
dec ecx
je mark_7 ; We have a 7
nop
dec ecx
je mark_8 ; We have a 8
nop
dec ecx
je mark_9 ; We have a 9
Notice that there is a NOP between each DEC/JE pair. This was done in order to create space
between branches and to better align the branching instructions. The de-optimized version
counts the integer instances with much the same structure as the optimized version. However, it
does not maintain proper alignment and spacing, so it should incur many BTB misses. The
important de-optimized section is below:
cmp ecx, 0
je mark_0 ; We have a 0
dec ecx
je mark_1 ; We have a 1
dec ecx
je mark_2 ; We have a 2
dec ecx
je mark_3 ; We have a 3
dec ecx
je mark_4 ; We have a 4
dec ecx
je mark_5 ; We have a 5
dec ecx
je mark_6 ; We have a 6
dec ecx
je mark_7 ; We have a 7
dec ecx
je mark_8 ; We have a 8
dec ecx
je mark_9 ; We have a 9
Notice that, unlike the optimized version, there are no additional NOP instructions between the
DEC/JE instructions. This means that the de-optimized version has a very tightly packed bunch
of branches, unlike the optimized version. Also note that, on average, the optimized version
executes five more instructions per iteration, yet it significantly outperforms the de-optimized
version.
Data
As can be seen by the slowdown percentages below, packing branches as densely as the
de-optimization does can have a significant impact on run-time. On the Opteron, it caused ~10%
slowdown for all but the smallest array size; this slowdown is due to the branch target buffer
misses that come with packing branches too densely. On the Nehalem, this de-optimization had an
even bigger impact, though the reasons are less well understood. One point of note for the
Nehalem is the length of its pipeline, which is 17 stages versus 12 on the Opteron. Thus, the
additional slowdown could be due to a higher cost of BTB misses on the Nehalem.
Figure 2.1 Branch Density (mod_ten_counter)
Table 2.1: Branch Density de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 -47 -7.31 39 9.92
100 185 7.80 331 27.36
1000 2234 11.86 6355 89.42
10000 23230 12.69 67653 110.38
100000 203760 10.98 537362 84.27
1000000 1620652 8.40 5306766 89.63
10000000 17263048 8.34 52082971 78.85
2.5.2 Unpredictable instruction de-optimization
For this de-optimization, we designed an algorithm called ‘factorial_over_array’. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 12. After this array is populated, the
factorial of each element of the array is calculated and written back into the array, overwriting
the original array element.
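The timed factorial in the thesis is recursive NASM; as a reference, the algorithm just described can be sketched in C as follows (the names are ours, and the 0 <= x <= 12 cap matters because 12! = 479001600 is the largest factorial that fits in a signed 32-bit integer):

```c
#include <assert.h>

/* Recursive factorial, mirroring the structure of the NASM version:
 * recurse until the argument reaches 1, then return 1. */
int factorial(int n) {
    if (n <= 1)
        return 1;       /* base case: 0! = 1! = 1 */
    return n * factorial(n - 1);
}

/* Overwrite each element of the array with its factorial, as
 * factorial_over_array is described to do. */
void factorial_over_array(int *arr, int size) {
    for (int i = 0; i < size; i++)
        arr[i] = factorial(arr[i]);
}
```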
The executables associated with this algorithm are called ‘factorial_over_array_op.exe’
and ‘factorial_over_array_deop_1.exe’ (on Windows). The code is written in C. The section of
code being optimized/de-optimized is written in NASM. The cycles being counted include only
the time that factorial_over_array spends running its assembly code.
The optimized and de-optimized versions are almost identical. Of course, both perform
factorial, recursively, on an integer argument. However, the optimized version properly spaces
the JNE and RET instructions so that the RET is not mispredicted. Below is the important
optimized section:
mov eax, [esp+4] ; Get the integer whose factorial is
                 ; being calculated
cmp eax, 1       ; Have we hit one yet?
jne Calculate    ; If we haven't then do another call
nop              ; This is here for alignment purposes
ret              ; return with a 1
Notice that there is a NOP instruction between the JNE instruction and the RET instruction. This
was done in order to create space between the two and ensure that the RET instruction is
properly aligned.
The de-optimized version, on the other hand, does not properly space these instructions, and so
it results in a large number of mispredictions on the RET instruction. Below is the important
de-optimized section:
mov eax, [esp+4] ; Get the integer whose factorial is
                 ; being calculated
cmp eax, 1       ; Have we hit one yet?
jne Calculate    ; If we haven't then do another call
ret              ; return with a 1
Notice that, unlike the optimized version, there is no NOP instruction between the JNE and the
RET instruction. This means that the two branching instructions are adjacent and the RET
instruction is misaligned.
Data
As can be seen by the slowdown percentages below, the cost of misaligning heavily used
RET calls is high. On the Opteron, it caused ~12% slowdown for most array sizes; this slowdown
is due to the branch mispredictions engendered by misaligning RET. On the Nehalem, this
de-optimization had an inconclusive impact. Therefore, it is reasonable to assume that the way
the Nehalem indexes branching instructions is quite different from the Opteron's.
Figure 2.2 Unpredictable instructions (factorial_over_array) de-optimization
Table 2.2: Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 80 7.07 21 2.58
100 1510 0.88 2436 32.15
1000 10429 11.61 -990 -1.24
10000 115975 12.76 36158 5.33
100000 1139748 12.59 238852 3.54
1000000 11529624 12.66 -3505949 -5.42
10000000 175191774 18.90 3081557 0.51
2.5.3 Branch pattern de-optimization
For this de-optimization, we designed an algorithm called ‘even_number_sieve’. This
algorithm takes an array size and a pattern as inputs. It then generates an array of the specified
size in which each element is populated with either 1 or 2. The order in which 1’s and 2’s appear
depends upon the pattern provided; this pattern essentially represents the branching pattern being
tested. For example, the pattern ‘12’ would cause the array to be populated with alternating 1’s
and 2’s; the pattern ‘r’ would cause the array to be populated with a random sequence of 1’s and
2’s. After the array has been constructed, the algorithm iterates over the array and replaces all of
its odd entries, the 1’s, with 0’s.
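The thesis does not show the population code; a plausible C sketch of the pattern-driven fill (the cycling behavior, the 'r' convention, and the function name are our assumptions) is:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Populate arr with 1's and 2's by cycling over the pattern string;
 * the special pattern "r" fills the array randomly instead. */
void fill_from_pattern(int *arr, int size, const char *pattern) {
    if (pattern[0] == 'r') {
        for (int i = 0; i < size; i++)
            arr[i] = 1 + rand() % 2;           /* random 1's and 2's */
        return;
    }
    int plen = (int)strlen(pattern);
    for (int i = 0; i < size; i++)
        arr[i] = pattern[i % plen] - '0';      /* "12" -> 1,2,1,2,... */
}
```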
Only one executable was needed for this de-optimization: here, an 'optimized' version is simply a
branching pattern that can be easily managed by the CPU's branch prediction mechanism, while a
'de-optimized' version is a branching pattern that cannot be easily managed by it.
The executable associated with this algorithm is called ‘even_number_sieve.exe’ (on
Windows). The code is written in C. The section of code being optimized/de-optimized is also
written in C; there is no assembly for this de-optimization.
The cycles being counted include only the time taken to mark the odd elements of the array.
The code that evaluates the array elements simply tests whether each element x satisfies
x mod 2 = 1. If it does, the element is replaced with a 0. Below is the important code:
unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    if ( number_array[i] % 2 == 1 ) {
        number_array[i] = 0;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );
Notice that each element, mod 2, is compared to 1. If it is equal, then the element is
replaced with 0.
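The rdtsc() helper used in this and the other timed sections is never shown in these excerpts. A minimal sketch of how it is commonly implemented on x86 with GCC inline assembly (an assumption about this particular codebase, not code from the thesis) is:

```c
#include <assert.h>
#include <stdio.h>

/* Read the x86 time-stamp counter via the RDTSC instruction; the
 * 64-bit cycle count is returned in the EDX:EAX register pair. */
static unsigned long long rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}
```

Subtracting two readings, as the excerpts do, yields an approximate cycle count for the code between them; the counter's exact behavior varies across CPU generations.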
We were never able to generate interesting results from this de-optimization. It is
reasonable to conclude that the branch prediction mechanism on the Opteron is very good. Of
course, many cycles were lost when random data was used. However, no branch prediction
mechanism can find and use a pattern in random data; it would not then be random.
2.5.4 Float comparison de-optimization
For this de-optimization, we designed an algorithm called ‘compare_two_floats’. This
algorithm takes a number of iterations as input. It then generates two simple floats and tests
them, over and over again, for the requested number of iterations.
The executables associated with this algorithm are called ‘compare_two_floats_op.exe’
and ‘compare_two_floats_deop.exe’ (on Windows). The code is written in C. The section of
code being optimized/de-optimized is also written in C; there is no assembly for this de-
optimization. The cycles being counted include only the time that compare_two_floats spends
comparing the floating point values.
The optimized version performs the comparison of floating point values by casting the
floats into integers and then performing integer comparison. The important optimized section is
below:
#define FLOAT2INTCAST(f) (*((int *)(&f)))
float t = f1 - f2;
pstart = rdtsc();
for ( j = 0; j < numberof_iteration; j++ ) {
    if ( FLOAT2INTCAST(t) <= 0 ) {
        Count_numbers(i);
        count++;
    } else {
        count++;
    }
}
result = rdtsc() - pstart;
printf( "Cycles=%llu\n", result );
Notice that the two floats being compared, f1 and f2, are subtracted (outside of the section being
timed) and the difference is then cast to an integer and compared to zero. Thus, no float
comparison occurs.
The de-optimized version performs the comparison of floating point values in
the normal fashion, i.e. by straightforwardly comparing the floats. The important de-optimized
section is below:
float t = f1 - f2;
pstart = rdtsc();
for ( j = 0; j < numberof_iteration; j++ ) {
    if ( f1 <= f2 ) {
        Count_numbers(i);
        count++;
    } else {
        count++;
    }
}
result = rdtsc() - pstart;
printf( "Cycles=%llu\n", result );
Notice that f1 and f2 are being compared in the straightforward fashion. Thus, they are
being compared as floats.
Data
As can be seen by the slowdown percentages below, the cost of casting and then comparing
integers is higher until the number of iterations crosses a certain threshold; past this
threshold, the cost of comparing floats begins to dominate. On the Opteron, the de-optimization
caused ~6% slowdown for sizes greater than 10000; this slowdown is due to the fact that the
floating point data path on the Opteron is more costly than the integer data path. On the
Nehalem, this de-optimization had a similar impact. Therefore, it is reasonable to assume that
the Nehalem has a similar discrepancy between its integer and floating point data paths.
Figure 2.3 Compare_Two_floats
Table 2.3: Compare two floats de-optimization
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 -10991 -97.19 43 26.70
100 -10960 -86.18 471 49.68
1000 -9892 -37.97 5285 59.90
10000 -982 -0.60 52211 58.82
100000 88962 5.86 413932 41.79
1000000 987559 6.56 4298502 44.46
10000000 9949917 6.62 41906333 46.14
2.5.5 Costly instruction de-optimization
For this de-optimization, we designed an algorithm called ‘div_vs_mult’. This algorithm
takes an array size as input. It then generates an array of the specified size in which each
element is an integer x where 2^1 <= x <= 2^12; thus each element is a random power of two less
than or equal to 2^12. After this array is populated, each element in the array is divided by two.
The executables associated with this algorithm are called ‘div_vs_mult_op.exe’ and
‘div_vs_mult_deop_1.exe’ (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that ‘div_vs_mult’ spends dividing each element of
the array by 2.
Both the optimized and de-optimized versions do basically the same thing. The only
difference is how each version divides each element of the array by 2. In the optimized version,
this is done by multiplying each element by 0.5. This means that the optimized version is able to
use the multiple multiply data paths available on the Opteron. The important optimized section
is below:
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that each element is divided by multiplying it by 0.5. Thus, no actual division occurs.
The de-optimized version does the same thing as the optimized version; it just uses
division, instead of multiplication, to do it. This can be costly on the Opteron, since it has
more limited resources for division, i.e. a single data path. The important de-optimized section
is below:
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the division of each element occurs by dividing it by 2.0. (Note that the use of 2.0
rather than 2 ensures that the de-optimization is processed as a floating point operation, just
like the optimization.) Thus, division occurs with each iteration.
Data
As can be seen by the slowdown percentages below, the cost of using division when
multiplication would suffice is very high. On the Opteron, it caused ~25% slowdown for all
array sizes; this slowdown is due to the fact that the Opteron can only handle division on one of
its scalar pipelines. On the Nehalem, this de-optimization had little impact. Therefore, it is
reasonable to assume that the Nehalem has more resources available for division.
Figure 2.4 Costly instruction de-optimization
Table 2.4: Costly instruction de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 99 1.81 -42 -14.34
100 949 34842 74 38.4
1000 9474 34849 142. 7894
10000 9447. 37844 14244 78.4
100000 92.413 37834 114.14 4874
1000000 9419374 32894 44794. 3844
10000000 94444224 3289. .7.7474 2841
2.5.6 Load-store dependency de-optimization
For this de-optimization, we designed an algorithm called ‘dependency’. This algorithm
takes an array size as input. It then generates an array of the specified size in which each element
is populated with an integer x where 0 <= x <= 9. After this array is populated, each element in
the array has the previous element added to it; this effect ripples from the front of the array to the
back of the array. The last element in the array is the sum of all elements in the array. Thus, this
is basically a prefix sum.
The executables associated with this algorithm are called ‘dependency_op.exe’ and
‘dependency_deop.exe’ (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that dependency spends summing array values.
The optimized version performs the additions by keeping the previous three array
elements in temporary variables; it creates few load-store dependencies since the previous array
element does not need to be reloaded, i.e. the sum for the previous element does not need to be
written to memory before it can be used again. The important optimized section is below:
int temp_prev = test_array[0], temp1, temp2;
unsigned long long start = rdtsc();
for ( i = 3; i < size_of_array; i += 3 ) {
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the value being added to the current element of the array comes from a temporary
variable. Thus, its value does not need to be reloaded, i.e. no load-store dependency is (likely
to be) created.
The de-optimized version, on the other hand, performs this computation in a more
natural way. However, since it must wait for the previous element to have its value written
before the next one can be calculated, many load-store dependencies are created. The
important de-optimized section is below:
unsigned long long start = rdtsc();
for ( i = 1; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] + test_array[i - 1];
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that, unlike the optimized version, there are no temporary variables that can be used to
prevent load-store dependencies.
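Note that the optimized loop above, stepping by three, only covers every element when size_of_array - 1 is a multiple of three; a self-contained sketch of the same idea with an explicit tail loop (the tail handling is our addition, not shown in the thesis) is:

```c
#include <assert.h>

/* Prefix sum that carries the running total in a temporary so the
 * value just stored never has to be reloaded from memory, avoiding
 * load-store dependencies. A scalar tail loop handles the elements
 * that the three-element stride does not reach. */
void prefix_sum(int *arr, int size) {
    if (size < 1)
        return;
    int prev = arr[0];
    int i;
    for (i = 3; i < size; i += 3) {
        int t2 = arr[i - 2] + prev;
        int t1 = arr[i - 1] + t2;
        arr[i - 2] = t2;
        arr[i - 1] = t1;
        arr[i] = prev = arr[i] + t1;
    }
    for (i -= 2; i < size; i++)    /* tail: up to two leftover elements */
        arr[i] = prev = arr[i] + prev;
}
```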
Data
As can be seen by the slowdown percentages below, the cost associated with these kinds
of dependencies is very high. On the Opteron, it caused ~60% slowdown for all array sizes; this
is due to the fact that the Opteron does not schedule stores in the way that it schedules other
instruction types, leading to very costly dependency stalls. On the Nehalem, this de-optimization
had a lesser impact. Therefore, it is reasonable to assume that its dynamic scheduler is better at
managing stores.
Figure 2.5 Load-store dependency de-optimization
Table 2.5: Load-store dependency de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 .4 2487 13 .814
100 3293 4489. 32. 3189
1000 9.7. 4.841 141. 1782.
10000 491.. 428.9 14714 148.4
100000 4.44.. 4281 144.92 1.84
1000000 .1.19.1 44813 13.7... 13813
10000000 .149.449 7.8.2 14344314 1.8.4
2.5.7 High latency instruction de-optimization
For this de-optimization, we designed an algorithm called ‘fibonacci’. This algorithm
takes an array size as input. It then generates an empty array of the specified size. After this
array is created, the algorithm calculates the Fibonacci number associated with each index of the
array.
The executables associated with this algorithm are called ‘fib_op.exe’ and ‘fib_deop.exe’
(on Windows). The code is written in C. The section of code being optimized/de-optimized is
written in NASM. The cycles being counted include only the time that fib spends running its
assembly code.
The optimized and de-optimized versions are almost identical. Of course, both calculate
the Fibonacci number associated with each index. However, the optimized version uses the
combination of DEC and JNZ instructions in order to control its branching. Below is the
important optimized section:
29
calculate:
mov edx, eax
add ebx, edx
mov eax, ebx
mov dword [edi], ebx
add edi, 4
dec ecx
jnz calculate
Notice that each iteration ends with DEC and JNZ instructions. Thus, the loop will end
when the ECX register is zero.
The de-optimized version on the other hand uses a LOOP instruction instead of the
DEC/JNZ combination. The important thing to note about the LOOP instruction is that it has a
high latency (approximately 8 cycles) compared to DEC/JNZ. Below is the important de-
optimized section:
calculate:
mov edx, eax
add ebx, edx
mov eax, ebx
mov dword [edi], ebx
add edi, 4
loop calculate
Notice that, unlike the optimized version, the loop is controlled by the LOOP instruction. Just
like the optimized version, the loop ends when the ECX register reaches zero.
Data
As can be seen by the slowdown percentages below, the cost of the LOOP instruction is
high. On the Opteron, it caused ~17% slowdown for all array sizes; this is due solely to the very
high latency of the LOOP instruction. On the Nehalem, this de-optimization had an even greater
impact. It is hard to speculate on what might cause the Nehalem to have such a poor
implementation of LOOP.
Figure 2.6 Fib de-optimization
Table 2.6: Fib de-optimization for the Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 17 9.04 9 8.57
100 155 29.4 244 81.87
1000 1386 34.74 2239 104.52
10000 14588 22.03 19519 62.16
100000 123271 17.91 256839 87.42
1000000 1206678 16.94 2430301 81.51
10000000 11716747 14.07 19896396 65.26
2.5.8 If condition de-optimization
For this de-optimization, we designed an algorithm called ‘if’. This algorithm takes an
array size as input. It then generates an array of the specified size in which each element is
populated with a float x where 0.5 <= x <= 11.0. After this array is populated, each element in
the array is evaluated with an if-then statement in order to determine whether to increment or
decrement a dummy variable.
The executables associated with this algorithm are called ‘if_op.exe’ and ‘if_deop.exe’
(on Windows). The code is written in C. The section of code being optimized/de-optimized is
also written in C; there is no assembly for this de-optimization. The cycles being counted
include only the time that ‘if’ spends evaluating the elements of the array with the if-then
statement.
Both the optimized and de-optimized versions do basically the same thing. The only
difference is the order in which they evaluate the clauses of the if-then statement. One of these
clauses is very easy and fast to evaluate: a comparison to zero. The other clause is hard and
time consuming to evaluate: a floating point comparison. The important optimized section is
below:
int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( mod == 0 && test_array[i] > 1.5 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice the ordering of the clauses within the if-then statement. This is an optimal
ordering since the floating point comparison clause need only be evaluated half of the time in
order to know the value of the entire ‘&&’ statement.
Again, the de-optimized version does the same thing as the optimized version. It just
evaluates the ‘&&’ statement in a different order. The important de-optimized section is below:
int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( test_array[i] > 1.5 && mod == 0 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Again, notice the ordering of the clauses within the if-then statement. This is a sub-optimal
ordering since the floating point comparison clause needs to be evaluated each and every time in
order to know the value of the entire ‘&&’ statement.
Data
As can be seen by the slowdown percentages below, the cost of improperly ordering
clauses within a conditional is very high. On the Opteron, it caused ~37% slowdown for all array
sizes; this is due to being forced to evaluate the most expensive of the conditional clauses for
each iteration; therefore, a simple switch in ordering can make this program 37% faster. On the
Nehalem, this de-optimization had a similar impact.
Figure 2.7 If Condition de-optimization
Table 2.7: If Condition de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 -207 -26.90 74 328.7
100 22. 1784. 99. 74899
1000 442. 2.834 4..4 228.4
10000 42947 2.8.9 44244 248.4
100000 47142. 2.894 73.19. 7.844
1000000 42.79.4 24824 474441. 418.4
10000000 44.32.44 278.9 499944.4 44873
2.5.9 Loop re-rolling de-optimization
For this de-optimization, we designed an algorithm called ‘loop_re_rolling’. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 9. After this array is populated, two
more arrays are built: one will hold the square (x^2) of each element of the integer array and
the other will hold the cube (x^3). These squares and cubes arrays are then populated by
iterating over the elements of the integer array.
The executables associated with this algorithm are called ‘loop_re_rolling_op.exe’ and
‘loop_re_rolling_deop.exe’ (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that the executables spend filling out the squares
and cubes array from the original.
The optimized version fills out the squares and cubes arrays in a single loop, since the
integer array is used to populate both. This is ideal since it gives the dynamic instruction
scheduler more flexibility when scheduling instructions to run. The important optimized section
is below:
unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );
Notice that the calculation that populates the element of each array, squares and cubes, is
within the same loop.
35
The de-optimized version fills out the squares and cubes arrays using separate loops. This
is sub-optimal since it takes away some of the flexibility that the dynamic instruction scheduler
might otherwise have when scheduling instructions to run. The important de-optimized section is
below:
unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for ( i = 0; i < size_of_array; i++ ) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );
Notice that the calculations that populate the squares and cubes arrays are within separate
loops.
Data
As can be seen by the slowdown percentages below, the cost of not combining loops that
can be combined is very high. On the Opteron, it caused ~50% slowdown for all array sizes; this
is solely due to the fact that splitting the computations into separate loops isolates them such that
the dynamic scheduler cannot schedule them together; some of the flexibility that it might have
had otherwise has been lost. On the Nehalem, this de-optimization had an impact, but a lesser
one. It is hard to imagine why this is less costly on the Nehalem.
Figure 2.8 Loop re-rolling de-optimization
Table 2.8: Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 189 43.95 83 28.32
100 1653 53.68 357 15.35
1000 21311 54.96 4696 21.28
10000 234336 52.42 39311 17.10
100000 2352328 50.93 134786 6.05
2.5.10 Dependency chain de-optimization
For this de-optimization, we designed an algorithm called ‘dependency_chain’. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with integers x where 0 <= x <= 20. After this array is populated, all
of the elements of the array are summed into a single integer variable.
The executables associated with this algorithm are called ‘dependency_chain_op.exe’ and
‘dependency_chain_deop_1.exe’ (on Windows). The code is written in C. The section of code
being optimized/de-optimized is also written in C; there is no assembly for this de-optimization.
The cycles being counted include only the time that the executables spend adding the elements of
the array.
The optimized version adds the elements of the array by striding through the array in four
element chunks and adding elements to four different temporary variables. After the array has
been processed, the four temporary variables are added. The advantage of the optimized version
is that it creates four large dependency chains instead of one massive one. Thus, the dynamic
scheduler has many more options when scheduling instructions. The important optimized section
is below:
int sum = 0, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the summing occurs across four temporary variables. These temporary
variables are then added together after the array has been processed.
The de-optimized version sums each element of the array into one variable. This creates a
massive dependency chain that quickly exhausts the scheduling resources of the dynamic
scheduler. The important de-optimized section is below:
int sum = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the summing occurs using one variable only.
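The optimized loop above steps by four and therefore assumes the array size is a multiple of four; a self-contained sketch of the same multi-accumulator technique with a tail loop (the tail handling is our addition, not shown in the thesis) is:

```c
#include <assert.h>

/* Sum with four independent accumulators so the dynamic scheduler can
 * overlap the additions (four shorter dependency chains instead of
 * one long one); a scalar tail loop handles leftover elements. */
int sum_four_way(const int *arr, int size) {
    int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    int i;
    for (i = 0; i + 3 < size; i += 4) {
        sum1 += arr[i];
        sum2 += arr[i + 1];
        sum3 += arr[i + 2];
        sum4 += arr[i + 3];
    }
    for (; i < size; i++)          /* tail: up to three leftover elements */
        sum1 += arr[i];
    return sum1 + sum2 + sum3 + sum4;
}
```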
Data
As can be seen by the slowdown percentages below, the cost of creating a massive
dependency chain is high. On the Opteron, it caused ~150% slowdown for all array sizes; this is
solely due to the exhaustion of the scheduling resources of the dynamic scheduler; exhausting
these resources effectively makes the program run sequentially (no ILP). On the Nehalem, this
de-optimization had a large impact but not quite as large as the Opteron. One can only imagine
that it has more scheduling resources at its disposal.
Figure 2.9 Dependency chain de-optimization
Table 2.9: Dependency chain de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 42 198.2 2 387
100 .41 13.821 374 74844
1000 ..74 14.8.3 3492 498.2
10000 .47.2 14.844 2..42 .4844
100000 .44143 17782 343.94 4.8.2
1000000 .422.44 1438.4 3924192 ..872
10000000 9.334421 12389. 37424329 41813
3. Building tools to optimize weaknesses in the GCC compiler
3.1 Compiler
There are many definitions of a compiler, but the simplest one is a program that
converts other programs from a high level language to a low level language. Compilers differ
from one language to another; however, compilers can be similar to each other if they belong to
related programming languages. For example, C++, C# and Java are languages extended from C,
so compilers for those languages are similar in structure and operation.

Programs are written in high level languages instead of low level languages for several
reasons. One reason is that programs in high level languages are much shorter than programs
written in low level languages, so high level languages help programmers write less code and
write it more efficiently. Another advantage is that programs can be compiled to different low
level languages and then run on different machines. However, compiling a program written in a
high level language takes time, because the program must be converted from the high level
language to the low level language. So, the most efficient compiler can convert from a high
level language to a low level language in the shortest time [6] [7].

Every compiler works in phases; the number and sequence of phases differ from one
compiler to another. The first phase in a compiler is lexical analysis, the operation that reads
a program as text and then divides the whole program into tokens corresponding to the rules of
the language. The second phase is syntax analysis, which takes the output of the previous phase
and represents it as a tree corresponding to the structure of the program. The third phase is
type checking, which is responsible for checking whether a program follows the rules of the
language's grammar and structure. The fourth phase is intermediate code generation, which
translates the program into an intermediate, machine-independent representation. The fifth phase
is register allocation, in which the variables and names used by the intermediate code are
mapped to the registers of the target machine. The sixth phase is machine code generation, in
which the code in the intermediate language is converted to assembly language. The last phase is
assembly and linking, in which the assembly code is translated into binary form.

Lexical analysis, syntax analysis and type checking are called the frontend of a compiler,
while register allocation, machine code generation, and assembly and linking are called the
backend. Most compiler optimizations are done in these two parts, the frontend and the backend.
The optimization procedure differs from one compiler to another, but there are basic
optimizations that appear in all of them [7].
3.2 GCC compiler
GCC is one of the most popular compilers and is used for several languages. GCC is used widely because it is free software and has very effective tools to optimize programs efficiently. Currently, GCC supports six languages: C, C++, Objective-C, FORTRAN, Java and Ada; support for another language, COBOL, is planned for the future. GCC is part of the GNU project, which was launched in 1984 to build a complete UNIX-like operating system as free software. There have been many versions of GCC from the first version until today [8].
The base language of the GCC compiler is C, and the GCC compiler itself was written in C. So, in the beginning, GCC was a C compiler, and several languages were added to it afterwards. The first language added to GCC was C++, an extension of the C language. Objective-C, another language extended from C, was also added. Another important addition was FORTRAN, which can be considered stronger than C for mathematical work because it is rich in math libraries; it can perform complex operations and it handles complex numbers. Java was also added to GCC. Java differs from the other languages in the form of its code and in its output: the Java compiler generates an executable that is run by a Java virtual machine rather than directly by the hardware. The last language added to GCC is Ada, developed by Ada Core Technologies and added to GCC in 2001. Ada is designed to handle large programs [8] [9].
3.3 GCC optimizations
Compilers basically aim to make programs faster and more efficient and to reduce code size. The most important goal of compiler optimization is program speedup. Code generated by an optimizing compiler is called optimized code. Optimized code is produced by several transformations of the source code that make it more efficient and faster than the original code. The GCC compiler uses several analyses to decide whether a particular piece of code can be improved, such as control flow analysis and data flow analysis. Control flow analysis examines how control statements are used and which path is taken from the beginning of a control statement to its end. Data flow analysis identifies how data moves through a program [10, pp. 101-117]. Several optimization techniques are used in the GCC compiler to optimize programs. Copy propagation is one of them: it reduces redundant copies by substituting constant values for copy operations. To clarify this technique, consider the following example:
i = 10;
x[j] = i;
y[min] = b;
b = i;
The GCC compiler optimizes the previous code into the following:
i = 10;
x[j] = 10;
y[min] = b;
b = 10;
In the optimized code, the constant value is stored directly into the variables x[j] and b, while in the original code a copy operation is used to copy the values. The copy operation costs more time than simply storing the constant value [10, p. 104].
Another GCC compiler optimization is dead code elimination. Dead code elimination removes extra or unused code in programs, that is, code that does not affect the sequence of operations or the result of the program. Code written with extra calculations or extra variables is called dead code. Code that can never be reached by the program is called unreachable code. For example:

if (i == 10) {
    /* ... */
} else {
    if (i == 10) {
        /* ... */
    }
}

In the example above, the body of the inner if can never be reached: when i equals 10, the first branch is taken, and otherwise the identical condition inside the else branch is also false. Such code is called unreachable code, and the optimization that removes it is called unreachable code elimination.
The GCC compiler, especially in its latest versions, has many smart techniques for optimizing code. GCC has optimizations called architecture-independent optimizations; what is most interesting about them is that they do not depend on the features of a given architecture or processor, so they apply across different architectures and processors. GCC has several optimization switches: -O, -O1, -O2, -O3, -Os and -O0. Passing -O0, or omitting the switch entirely, turns off all GCC optimizations. -O1 turns on the first level of GCC optimizations. -O2 and -O3 turn on levels two and three, which add to the first level. -O attempts to reduce code size and execution time without performing optimizations that greatly increase compilation time. -Os attempts to reduce code size only [10].
In the GCC compiler, an individual optimization can be turned off by inserting "no-" between -f and the name of the optimization to be disabled. Suppose we want to disable the optimization that guesses branch probabilities; the command is:

gcc filename.c -o filename -O1 -fno-guess-branch-probability

The original name of this optimization flag is -fguess-branch-probability [10, p. 108].
In the GCC compiler, as mentioned before, there are three levels of GCC optimizations. The first level is called level one. Level one contains many optimizations that reduce code size and improve execution time. -ftree-dce is one of the level-one optimizations; it eliminates unreachable code and dead code in programs. -fif-conversion is another level-one optimization, which converts conditional jumps into branchless code. Another level-one optimization, -fguess-branch-probability, predicts branch probabilities in programs. Level two adds many optimizations to the first level, including loop transformations, mathematical optimizations and jump handling. One of the mathematical optimizations is -fpeephole, which replaces long instruction sequences with shorter ones. For example, suppose we have the following code:

a = 2;
for (i = 1; i < 10; i++)
    a += 2;

By applying the second level of optimization, the GCC compiler can replace the code above with a = 20, saving the time that would be spent executing a for loop with many iterations. Another level-two optimization, -foptimize-sibling-calls, improves recursive functions: it replaces qualifying (sibling) calls with jumps, eliminating the call overhead. In general, function calls consume much time, so removing the call overhead makes programs faster. Other level-two optimizations reduce code size: the -Os switch attempts to reduce code size, which improves memory use, and -Os applies all level-two optimizations except those that enlarge the program's code. Level three turns on all level-one and level-two optimizations plus additional ones such as -funswitch-loops, -fgcse-after-reload and -finline-functions [10, pp. 110-112].
Furthermore, there are extra optimizations called specific optimizations. One of them is -fbounds-check, which validates array indices before they are used for array access. -ffunction-cse is a specific optimization that stores function addresses in registers. Another, -finline-functions, expands simple functions into their callers. -ffloat-store is a specific optimization that prevents floating-point values from being kept in registers. Most of these specific optimizations are not enabled by -O1, -O2 or -O3; they are enabled individually by passing -f followed by the name of the optimization [10].
3.4 Our optimizations to the GCC compiler
In our implementation, we found several weaknesses in the GCC compiler; we addressed these weaknesses and built a tool that contains all of the resulting optimizations. The weaknesses we found in the GCC compiler concern the division operation, loops and recursive functions, loop re-rolling, loop unrolling, the power operation, the square root operation, and function calls. Division consumes much more time than multiplication, so our goal in that optimization technique is to replace the division operation with multiplication to make programs more efficient and faster. Functions are used in many programs, and function calls are similar to the iterations used in loops, but function calls are much more time-consuming than loop iterations; therefore, programs with repeated function calls can often be rewritten with loops to make them faster. Loop re-rolling is an optimization technique used in many compilers to optimize loops. The basic idea of loop re-rolling is to combine several loops into one loop, which reduces the number of instructions, because in general one combined loop executes fewer instructions than separate individual loops. So the main goal of this technique is to combine several loops into one whenever possible, to save time and make programs faster. We also implemented another loop optimization, loop unrolling, which is used in many compilers to improve the execution time of code containing loops. The main idea of loop unrolling is to reduce the number of loop-control instructions: many programs spend relatively little time executing the actual computation, while most of the time goes to loop overhead. The article "An aggressive approach to loop unrolling" states that many programs spend 90% of their time executing 10% of the code [11], so reducing the time spent in that small portion of the code makes programs faster. The same article states that loop unrolling can improve instruction-level parallelism and memory hierarchy locality [11]. Loop unrolling can be applied to many loops but not to all of them, because some code is difficult to unroll, such as loops over two-dimensional arrays and pointers. The basic idea of loop unrolling is to replicate the loop body several times and reduce the number of iterations, which makes programs faster. However, loop unrolling has the disadvantage of increasing code size, which affects memory and, in particular, the instruction cache. When the unrolled code overflows the instruction cache, performance suffers and the program becomes slower; this can happen when the instruction cache is small and the original code is large. Another issue in loop unrolling is choosing the unrolling factor: how many times should a given loop be unrolled? We raise this question to help programmers choose a good unrolling factor, which depends on several parameters such as the size of the instruction cache, the size of the code, and the number of instructions in the program.
In our implementation, we help the programmer find the best unrolling factor when loop unrolling applies. Our tool takes as input the size of the instruction cache, the length of an individual instruction (instruction length differs from one architecture to another), and the number of iterations of the loop. The output of our tool is the maximum safe unrolling factor. The programmer should not exceed this factor, because doing so overflows the instruction cache and degrades the performance of the program. In addition to those optimization techniques, another one optimizes the power operation. The power function in C and C++ (pow()) is used in many mathematical programs. A power operation can be replaced by multiplication, and both give the same result; we replace pow() with multiplication because the power operation is much more expensive than multiplication, so the replacement makes programs faster. Another optimization technique included in our tool targets the square root function (sqrt()), which is used in many C and C++ programs to find the square root of a number. The sqrt() function is much more expensive than division, as the results of our implementation show, so our optimization technique replaces the sqrt() call with a division operation to make programs faster. The last optimization technique we implemented in our tool is optimizing function calls: when a function call inside a for loop always returns the same value, hoisting that call out of the loop saves much time, because otherwise the function is called once per loop iteration. Therefore, our optimization technique moves such a function call out of the loop whenever possible, to make programs faster.
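The unrolling-factor tool described above can be sketched as follows. The exact formula is our illustrative assumption, not necessarily the tool's actual code: it requires that all copies of the unrolled loop body fit in the instruction cache, and never unrolls beyond the iteration count. The parameter body_instrs (the number of instructions in one copy of the loop body) is a hypothetical extra input added for the sketch.

```c
/* Sketch: a safe maximum unrolling factor, assuming the unrolled loop
   body must fit entirely in the instruction cache. */
long max_unroll_factor(long icache_bytes, long instr_len_bytes,
                       long body_instrs, long iterations) {
    long body_bytes = body_instrs * instr_len_bytes; /* size of one body copy */
    if (body_bytes <= 0 || iterations < 1)
        return 1;
    long fit = icache_bytes / body_bytes;            /* copies that fit in cache */
    if (fit < 1)
        fit = 1;                                     /* the rolled loop itself */
    if (fit > iterations)
        fit = iterations;                            /* no point past the trip count */
    return fit;
}
```

For example, with a 32 KB instruction cache, 4-byte instructions, and an 8-instruction body, the cache alone would permit a factor of 1024, so the bound comes from the loop's iteration count instead.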
3.5 Methods, analysis and results
3.5.1 Division vs. multiplication
We designed a benchmark that tests division to see whether the compiler can optimize it. We used an array populated with powers of the integers 1 to 12. We divided each element of the array by 5, ran the program for arrays of different sizes, and collected the results. We also implemented an optimized version of the benchmark: in the optimized code, we multiplied each element of the array by 0.2. The two codes give the same result; the only difference is using multiplication instead of division. In the implementation, we compiled each code in two configurations: with all GCC optimizations disabled and with all GCC optimizations enabled. The results show that division has a real impact that the GCC compiler does not remove: the average slowdown is 118%, so the GCC compiler has very limited optimization ability in this area. The lesson of this benchmark is not to rely on the compiler to remove divisions; if you have the option to replace a division with a multiplication, do so, because division degrades performance compared with multiplication. We measured time in milliseconds to determine which version is faster. The array sizes used in this optimization technique range from 1000000 to 10000000 elements; we used large arrays because the CPU is very fast, and with small arrays the measured time is zero most of the time. Measuring time in milliseconds lets us see the effect of our implementation in real time.
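The two benchmark kernels just described can be sketched as follows. The function and array names are illustrative, not the thesis's actual benchmark code, and we assume a double array so that dividing by 5 and multiplying by the reciprocal 0.2 yield the same values.

```c
#include <stddef.h>

/* Original kernel: one division per element. */
void divide_by_5(double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] / 5.0;
}

/* Optimized kernel: multiply by the reciprocal instead. */
void multiply_by_0_2(double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * 0.2;
}
```

Timing each kernel over a large array (for example with clock() from <time.h>) reproduces the comparison; the division version is expected to be slower whenever the compiler does not rewrite the division itself.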
Here are the results and charts for our implementation (time in milliseconds):
A: Division code without compiler optimization.
B: Multiplication code without compiler optimization.
C: Division code with compiler optimization.
D: Multiplication code with compiler optimization.
Table 3.1: Time in milliseconds for the Division vs. Multiplication optimization technique.
Figure 3.1: Division vs. Multiplication, milliseconds.
3.5.2 Loop and recursive function
We designed another benchmark which tests the GCC compiler to see if it can
optimize the function call. We know that function calls spend much time comparative
with the other programming techniques. In our implementation, we implemented two
benchmarks, both benchmarks calculate the factorial of a number, and we repeat this
operation many times to see the differences in terms of the performance between the two
benchmarks. The original benchmark calculates the factorial of the number using
function call, so a function will be called several times to calculate the factorial of a
number. The other benchmark which is our optimized benchmark calculates the factorial
of a number using loop instead of calling a function several times. We know that using a
loop instead of function call will optimize our program; the GCC compiler has a
weakness in optimizing this kind of optimization. We depend on time in milliseconds to
see the differences in performance between the two benchmarks. We run our programs
with different number of iterations. We run the benchmark with the iterations from
1000000 to 10000000, and we collected the time in milliseconds. We depend on time in
milliseconds in order to see clearly which benchmark is faster or perform better than the
other. The results showed that our optimized code (the code with loop) is much faster
than the code with function call that optimized with the GCC optimization. Therefore,
GCC compiler has a weakness in doing this kind of optimization. Let’s see the codes,
results and graphs.
The original code, using a recursive function:

int tail(int factorial)
{
    if (factorial > 1)
    {
        result = result * factorial;   /* result is a global, assumed initialized to 1 */
        tail(factorial - 1);
    }
    return result;
}
Our optimized code, using a loop instead of calling the function each time:

int tail(int factorial)
{
    temp = factorial;                  /* temp, i and result are globals, as above */
    for (i = 1; i < temp; i++)
    {
        result = result * factorial;
        factorial = factorial - 1;
    }
    return result;
}
A: The original recursive function code without GCC optimization.
B: Our optimized “For Loop” code without GCC optimization.
C: The original recursive function code with GCC optimization.
D: Our optimized “For Loop” code with GCC optimization.
Table 3.2: Time in milliseconds for the Loop and recursive function optimization technique.
Figure 3.2: Loop and recursive function, time in milliseconds.
3.5.3 Loop re-rolling
We designed another benchmark that tests the GCC compiler to see if it can
optimize loops using loop re-rolling optimization technique. This optimization designed
to work with “For loop”. There are two benchmarks; one of them is the original
benchmark, and the other one is our optimized benchmark. The two benchmarks are
doing the same job; they calculate square numbers and cubic numbers. We used three
arrays which are load_store_array, quadratic_ array and cubic_array. Load_store_array
initializes randomly with integers form 0 to 9. After this array is populated, two more
arrays are built. The first benchmark which is the original benchmark calculates square
numbers and cubic numbers in two separate loops. While the other benchmark calculates
square numbers and cubic numbers using one loop. We developed these two benchmarks
to test GCC compiler because we know that loop re-rolling optimization is very
important technique used by many compilers for optimizing loops. In our implementation, we measured time in milliseconds to determine which benchmark is faster. The time is measured only for the part of the code we want to optimize, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with sizes from 10 to 1000. In this optimization technique, our optimized code is approximately 40% faster than the original code; from this percentage, we can see that the GCC compiler has a significant weakness in loop re-rolling. The codes, results and graphs follow.
The original code:
for(i=0;i<size_of_array;i++)
{
quadratic_array[i]=load_store_array[i]*load_store_array[i];
}
for (i=0;i<size_of_array;i++)
{
cubic_array[i]=load_store_array[i]*load_store_array[i]*load_store_array[i];
}
Our optimized code:
for (i=0;i<size_of_array;i++)
{
quadratic_array[i]=load_store_array[i]*load_store_array[i];
cubic_array[i]=load_store_array[i]*load_store_array[i]*load_store_array[i];
}
A: The original code without GCC optimization.
B: The original code with GCC optimization.
C: Our optimized code without GCC optimization.
D: Our optimized code with GCC optimization.
Table 3.3: Time in milliseconds for the Loop Re-Rolling optimization technique.
Figure 3.3: Loop Re-Rolling, milliseconds.
3.5.4 Loop unrolling
We designed another benchmark to test whether the GCC compiler can optimize loops using the loop unrolling technique. We did this optimization because loops are used widely in many programs, so optimizing loops can significantly improve many such programs. There are two benchmarks in our implementation: the original benchmark and our optimized benchmark. Both do the same job with different strategies. In the original benchmark, we multiply each number in an array by itself and store the result, using a for loop whose iteration count equals the size of the array. In our optimized benchmark, we unrolled the loop: we repeated the loop body several times to reduce the number of iterations, and we know that reducing the number of iterations speeds up the benchmark. In our implementation the loop body is repeated six times, so the number of iterations is reduced by roughly a factor of six; for example, if the array size is 120, the loop executes 20 iterations instead of 120. We measured time in seconds, and only for the part of the code being optimized, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with array sizes from 10000000 to 900000000. The results show that this optimization technique is more effective than the previous ones, because our optimized benchmark without GCC optimization is faster than the original benchmark with GCC optimizations enabled. The codes, results and graphs follow.
The original code:
for ( i = 0; i < size_of_array; i++ ) {
test_array[i] = test_array[i]*test_array[i];
}
Our optimized code:
for ( i = 0; i < size_of_array; i += 6 ) {   /* assumes size_of_array is a multiple of 6 */
    test_array[i]   = test_array[i]*test_array[i];
    test_array[i+1] = test_array[i+1]*test_array[i+1];
    test_array[i+2] = test_array[i+2]*test_array[i+2];
    test_array[i+3] = test_array[i+3]*test_array[i+3];
    test_array[i+4] = test_array[i+4]*test_array[i+4];
    test_array[i+5] = test_array[i+5]*test_array[i+5];
}
A: The original code without GCC optimizations.
B: The original code with GCC optimizations
C: Our optimized code (Unrolling code) without GCC optimization.
D: Our optimized code (Unrolling code) with GCC optimizations.
Table 3.4: Time in seconds for the Loop Unrolling optimization technique.
Figure 3.4: Loop Unrolling, seconds.
3.5.5 Power vs. multiplication
We designed another benchmark to test whether the GCC compiler can optimize the power operation. The pow() function is slower than plain multiplication because extra time is spent calling the function and executing its subroutine, in addition to the multiplication itself. We designed this benchmark because we know that multiplication is much faster than the power operation, and the power operation is widely used in mathematical programs, so optimizing it will help improve many programs. There are two benchmarks: the original benchmark and our optimized benchmark. Both do the same job: they square the numbers in an array. We used one array, test_array[], initialized randomly with different integers; the result is stored back into the same array. The original benchmark squares every element of test_array[] using the pow() function, while the optimized benchmark uses a multiplication operation instead. We measured time in seconds, and only for the part of the code we want to optimize, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with sizes from 10000000 to 90000000. This optimization technique is more effective than the previous ones, because our optimized benchmark (the multiplication version) without GCC optimization is faster than the original benchmark with GCC optimizations enabled. The codes, results and graphs follow.
The original Power code:
for ( i = 0; i < size_of_array; i++ ) {
test_array[i] = pow(test_array[i],2);
}
Our optimized code:
for ( i = 0; i < size_of_array; i++ ) {
test_array[i] = test_array[i]*test_array[i]; }
A: The original power code without GCC optimizations.
B: Our optimized code (multiplication code) without GCC optimization.
C: The original power code with GCC optimizations.
D: Our optimized code (multiplication code) with GCC optimizations.
Table 3.5: Time in seconds for the Power vs. Multiplication optimization technique.
Figure 3.5: Power vs. Multiplication, seconds.
3.5.6 SQRT function vs. division
We designed another benchmark to test whether the GCC compiler can optimize the sqrt() function, which is used in C to calculate square roots. We compared sqrt() with division because sqrt() takes much more time than a division. sqrt() is used in many mathematical benchmarks, so optimizing it speeds up those programs. In this implementation, we wrote two codes that do the same job with different mechanisms. The original benchmark uses the sqrt() function, while our optimized benchmark uses division: in the original benchmark we multiplied each number in the array by sqrt(Number1), while in the optimized code we multiplied each number by Number1/Number2, with Number2 chosen so that the quotient equals the square root. We used different numbers, and the results are almost the same, because the main point of this implementation is to show that sqrt() spends extra time on the function call, while the optimized benchmark performs only a division. In both codes, the sqrt() call or the division is repeated once per iteration. We measured time in seconds, and only for the part of the code we want to optimize, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with iteration counts from 10000000 to 100000000. This optimization technique is more effective than the previous ones, because our optimized benchmark without GCC optimization is faster than the original benchmark with GCC optimizations enabled. The results and graphs follow.
A: Sqrt() function code without GCC optimization.
B: Our optimized division code without GCC optimization.
C: Sqrt() function code with GCC optimization.
D: Our optimized division code with GCC optimization.
Table 3.6: Time in seconds for the SQRT function vs. Division optimization technique.
Figure 3.6: SQRT function vs. Division, seconds.
3.5.7 The cost of the function call inside and outside loops
We designed a benchmark that tests whether a compiler can optimize a function
call inside a loop. This optimization is designed to work with "for" loops. The original
code calls the strlen() function, which returns the length of a string, inside a "for" loop.
We passed this code to the GCC compiler to see whether it could optimize it. We
developed this benchmark because calling a function many times costs significant time.
We also optimized the benchmark manually and passed both versions to the compiler
to see which optimizes better: the GCC compiler or our manual transformation. In our
optimization, we moved the strlen() call from inside the loop to outside it, so that
instead of being called on every iteration of the "for" loop, it is called only once and its
result is used as a constant inside the loop. This transformation saves much time and
makes the program faster. The GCC compiler cannot perform this kind of optimization:
when we tested both programs (the original code and our optimized code), our
optimized code was clearly much faster than the original. We measured time in
milliseconds, and only for the part of the code we want to optimize rather than for the
whole program, which gives a clear indication of performance. With this technique, our
optimized code is approximately four times faster than the original code. From this
ratio, we can see that the GCC compiler has a significant weakness here. The codes,
results and graphs follow.
This is the code that we passed to the GCC compiler to optimize.
unsigned sum(const unsigned char *s) {
    // x[] is an array initialized randomly with integer numbers less than 100.
    unsigned result = 0;
    for (size_t i = 0; i < strlen((const char *)s); i++) {
        result += x[i];
    }
    return result;
}
This is our optimized code, which we transformed manually.
unsigned sum(const unsigned char *s) {
    unsigned result = 0;
    size_t length = strlen((const char *)s);
    for (size_t i = 0; i < length; i++) {
        result += x[i];
    }
    return result;
}
A: The original function calls in a loop code without GCC optimizations.
B: Function calls out of loop code without GCC optimizations (our optimized code).
C: The original function calls in a loop code with GCC optimizations.
D: Function calls out of loop code with GCC optimizations (our optimized code).
These results show the difference between the code optimized by the GCC compiler
and our optimized code, in milliseconds.
Table 3. 7: Time in milliseconds for the cost of the function call in and out the loop
optimization technique
Figure 3. 7 Function calls in and out of the loop: run time in milliseconds vs. array size (1,000,000 to 10,000,000; series A, B, C, D)
3.6 Automatic parallelization in compilers
When programs have begun to execute in parallel, programmers need a compiler
to take care of parallelization. The basic idea of parallelization is to distribute the job
among processors in order to complete big work in a significant small amount of time.
Today, computers have more than one central processing unit, so to take advantage from
those CPUs, they can collaborate together to finish their job. There are two types of
parallelism which are shared and distributed memory. The main difference between
shared and distributed memory is how the communication occurs between them. In
shared memory environment, threads are communicating by each other through reads and
writes while in distributed memory environment, processes are communicating by each
other through the messages [1] [2].
There is also an architecture that mixes the previous two types, called the mixed-type
multiprocessor architecture; it combines the shared memory environment with the
distributed memory environment.
(a): Shared memory multiprocessor: processors (P) connected through a switch (S) to shared memories (M) [4]
(b): Distributed memory multiprocessor: processor-memory pairs (P-M) connected through a switch (S) [4]
In a shared memory environment, processors share the same memory, while in a
distributed memory environment, each processor has its own memory. In a mixed-type
architecture, threads may share the same memory and may also communicate with
other physical locations through messages, as in a distributed memory architecture.
Parallel processing can complete big jobs in a very small amount of time by
distributing the work between processors. Each processor finishes its part of the work,
and at the end all processors return their results to the master. The master processor,
or process, is in charge of creating the worker processes, collecting their results, and
finally terminating them. To get good results from parallel processing, the programmer
or user should balance the work between processors when distributing it.
Many programs can be parallelized perfectly without difficulty or further work, but
other programs need extra work, such as synchronization [4]. We need synchronization
when two processes access the same data; in this situation, the programmer must
resolve the conflicts between the processes before implementing the parallelism.
Synchronization costs much time, both because it limits parallelism and because of
the time spent in the synchronization operations themselves. For instance, when many
processes need to access the same value, we need synchronization techniques so that
the value is accessed by only one process at a time. Several operations are used in
synchronization, such as produce and consume. Produce sets the bit to full if it is
empty, while consume waits for the bit to be full, then reads the value and sets the bit
to empty [4]. With this method, conflicting accesses to the same value cannot happen,
because only one process at a time can access it.
There is another problem that makes some programs harder or impossible to
parallelize: dependencies. Dependencies are one of the main obstacles facing parallel
processing. The basic idea is that some computations depend on the value produced
by a previous one; if all processes run at the same time, we get wrong results. There
are several kinds of dependencies, such as read after write, write after read, write after
write and read after read. Dependencies limit our ability to parallelize programs, and
even when we parallelize them anyway, they become slower. In some programs with
few dependencies, we can keep one or more sequential sections and parallelize the
rest of the code. Sometimes the whole program, including its results, is dependent on
itself: for instance, when each result at every level of a program is used in the next
level, it is very difficult or impossible to parallelize. Deciding which sections of a
program can be parallelized depends on the programmer, who looks at the program
and sees which sections of the code have dependencies and are difficult to parallelize,
and which do not. Dependencies can also be detected by compilers. Some compilers
are built to work with the parallel environment, so they have the ability to detect many
kinds of dependencies in programs. One compiler with very powerful dependence
detection techniques is the Intel compiler, which can analyze loops and determine
which loops have dependencies and are hard to parallelize [5].
3.7 Automatic parallelization in GCC
GCC compiler is one of the compilers that support automatic parallelization.
GCC can parallelize loops, but it has some limitations in automatic parallelization. One
of the limitations in automatic parallelization is that GCC cannot detect dependencies in
loops, so it can parallelize loops without dependencies. Also, there is another limitation in
GCC compiler which is GCC parallelize innermost loops only, it cannot parallelize outer
loops. Now, GCC compiler parallelizes every loop without dependency, so there is no
strategy to determine which loop can be parallelized or not depending on performance
issues [3].
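For reference, GCC's auto-parallelizer is not enabled by default at -O2 or -O3; it must be requested explicitly. A typical invocation (the file name here is hypothetical) looks like:

```shell
# Split parallelizable loop iteration spaces across 4 threads.
# -ftree-parallelize-loops=N is GCC's automatic parallelization flag.
gcc -O2 -ftree-parallelize-loops=4 loops.c -o loops
```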
Our improvement to the GCC compiler is a tool that addresses the weaknesses
in its automatic parallelization. Our tool can detect dependencies in most loops. The
input of our tool is a C or C++ program, and the output is a message saying either that
the loop can safely be parallelized or that there are dependencies and it is difficult to
parallelize it. In our tool, we used the idea of the GCD test, which detects most of the
dependencies among array references in loops. The GCD test works as follows:
1- We consider only the subscript expressions of the arrays used in the loop to
calculate the GCD.
2- We assign the values in the subscripts to variables. For instance, in A[2i+3] =
A[2i-1]*B+C we have a=2, b=3, c=2 and d=-1; thus a=2 and b=3 represent
A[2i+3], while c=2 and d=-1 represent A[2i-1].
3- We find the GCD of a and c; in the previous example, GCD(2, 2) is 2.
4- We find the result of (b-d); in the above example, (3-(-1)) = 4.
Since GCD(a, c) = 2 divides (b-d) = 4, there may be dependencies in our example, and
we can confirm this result by following the iterations. Suppose we have the following
example with 6 iterations:
A[2i+3] = A[2i-1]
i=0 A[3]=A[-1]
i=1 A[5]=A[1]
i=2 A[7]=A[3]
i=3 A[9]=A[5]
i=4 A[11]=A[7]
i=5 A[13]=A[9]
From this listing we can clearly see that iteration 2 depends on iteration 0, iteration 3
depends on the result of iteration 1, and so on.
For more clarification of the GCD test, suppose we have the following example:
for (………………………..)
A[2i] = A[2i-1]*B+C
end for
From this example, we can set the values for the GCD test as follows:
a=2 b=0
c=2 d=-1
GCD(a, c) = 2
(d-b) = -1
Since GCD(a, c) = 2 does not divide (d-b) = -1 [4], there are no dependencies in our
example. Therefore, by using the GCD test, we can find dependencies in programs
without manually following the iterations of loops.
Therefore, in our tool we implemented the idea of the GCD test. Our tool takes a
C or C++ program as input and prints a message saying either that a loop has
dependencies or that there are no possible dependencies in the program. This tool is
designed to help the user or programmer discover dependencies in programs
compiled by the GCC compiler.
3.9 Optimizing real C programs by our tool
After we built our tool that has seven optimization techniques (Division, recursive
function, loop re-rolling, loop unrolling, power operation, SQRT function and recursive
function in a loop), we tested our tool to improve real C programs such as Strassen,
Bubble sort and Selection sort.
3.9.1 Strassen optimization
Strassen is a C program can do matrix multiply faster than the normal way of
matrix multiplication. The C program was passed as input to our tool. We optimized the
C program in several ways. The first optimizing way is by using GCC compiler only. The
second optimizing way is by using our tool only, and the third optimizing way is by using
both our tool and GCC compiler. Our tool detected several optimization techniques in
Strassen.cpp which are not optimized by the GCC compiler. Our tool found that several
divisions in the program can be replaced by multiplications to make the program run
faster. Also, our tool found that two loops in Strassen.cpp can be unrolled to make the
program run faster. So, our tool gave hints or messages showing that these optimizations
can improve the program. After we optimized Strassen.cpp by our tool, we run three
programs which are the original Strassen.cpp, Strassen.cpp optimized by GCC compiler
only and Strassen.cpp optimized by both our tool and GCC compiler. We saw an
interesting result which is the optimized program by our tool only is faster than the
74
original program, and also faster than the program optimized by GCC compiler. In
addition, when we run the program that optimized by our tool and GCC compiler, we got
much speedup.
Results
A: The code without both GCC optimization and our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
The runtime for our programs:
Table 3. 8: Running time in seconds for the Strassen optimization
Array size Strassen A Strassen B Strassen C Strassen D
256 0.62 0.47 0.29 0.15
512 4.56 3.32 2.09 1.07
1024 41.38 33.20 27.65 23.17
2048 371.49 321.51 217.96 204.08
4096 3341.04 2945.98 2349.21 2222.88
Figure 3. 8 Strassen Optimization
3.9.2 Bubble sort optimization
Bubble sort is a very well-known sorting algorithm used to sort numbers.
BubbleSort.cpp is the original C++ program that used in our implementation, and
OptimizedBubbleSort.cpp is a C++ program that optimized by our tool. The Bubble.cpp
program was passed as input to our tool. We optimized the C++ program in several ways.
The first optimizing way is by using GCC compiler only. The second optimizing way is
by using our tool only, and the third optimizing way is by using both our tool and GCC
compiler. Our tool detected several optimization techniques in the BubbleSort.cpp which
are not optimized by the GCC compiler. Our tool found that several loops can be unrolled
several times to make BubbleSort.cpp faster. Our tool gave hints indicating that these
optimizations can optimize the program. After we optimized the BubbleSort.cpp program
manually, then we collected the running time results, we found that the optimized
0
500
1000
1500
2000
2500
3000
3500
4000
256 512 1024 2048 4096
Ru
nti
me
Array size
Strassen implementation
Strassen A
Strassen B
Strassen C
Strassen D
76
program by our tool only (OptimizedBubbleSort.cpp) is faster than the original program
BubbleSort.cpp, and also the program that optimized by our tool and the GCC compiler is
much faster than the original program that optimized by the GCC compiler only.
Results and graphs
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
The runtime for our programs:
Table 3. 9: Running time in seconds for the Bubble Sort optimization
Array size BubbleSort A BubbleSort B BubbleSort C BubbleSort D
10000 0.64 0.28 0.52 0.17
20000 2.63 1.89 2.13 0.83
40000 10.45 6.68 8.62 3.59
80000 42.58 28.38 34.52 13.42
100000 65.55 43.80 54.04 20.90
Figure 3. 9 Bubble Sort optimization
3.9.3 Selection sort optimization
Selection sort is one of the sorting algorithms that used to order numbers.
SelectionSort.cpp is the original C++ program that used in our implementation, and
OptimizedSelectionSort.cpp is a C++ program that optimized by our tool. The
SelectionSort.cpp program was passed as input to our tool. We optimized the C++
program in several ways. The first optimizing way is by using GCC compiler only. The
second optimizing way is by using our tool only, and the third optimizing way is by using
both our tool and GCC compiler. Our tool detected several optimization techniques in the
SelectionSort.cpp which are not optimized by the GCC compiler. Our tool found that
several loops can be unrolled several times to make SelectionSort.cpp faster. After we
optimized the SelectionSort.cpp program manually, then we collected the results, we
0
10
20
30
40
50
60
70
10000 20000 40000 80000 100000
Ru
nTi
me
Array Size
Bubble Sort Optimization
BubbleSort A
BubbleSort B
BubbleSort C
BubbleSort D
78
found that the optimized program by our tool only (OptimizedSelectionSort.cpp) is faster
than the original program SelectionSort.cpp, and also the program that optimized by our
tool and the GCC compiler is much faster than the original program that optimized by the
GCC compiler only. Our tool in the programs Strassen.cpp, BubbleSort.cpp and
SelectionSort.cpp worked very well, and it added much speedup to the programs that
optimized by the GCC compiler only.
Therefore, our tool works very well together with the GCC compiler to make
programs faster. Developers who use the GCC compiler to optimize their code can use
our tool together with it to get fast and efficient results. We carried out this
implementation on real C code in order to see the effectiveness of our tool when
developers want to optimize real software.
Results and graphs
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
Table 3. 10: Time in seconds for the Selection Sort optimization
Array size SelectionSort A SelectionSort B SelectionSort C SelectionSort D
100000 33.60 11.64 14.50 6.31
200000 135.65 46.87 59.34 25.62
400000 544.67 188.83 237.86 104.12
800000 2197.39 751.43 959.02 428.39
1000000 3539.38 1361.53 1600 862.97
Figure 3. 10 Selection sort optimization
4. Conclusion
In our thesis, we have presented several ways to help developers write more
efficient code. Writing good software depends on the architecture specifications and
the compiler. We researched both, and we introduced tools to help developers write
well-optimized code. Regarding the architecture specifications, developers should fully
understand the architecture of the specific hardware they target; not considering it can
make programs run slower. In chapter II of our thesis, we designed intentionally
de-optimized benchmarks to expose the strengths and weaknesses of the underlying
computer architecture and to show the effects of ignoring architecture specifications
when programs are written. The de-optimizations show, convincingly, that ignoring
architecture specifications when writing software can be very costly indeed. Most of the
de-optimizations had effects greater than 25% of running time, and some had effects
up to 150%. These are not trivial slowdowns. They show that prioritizing aesthetics,
portability, and similar considerations ahead of the hardware can result in significant
slowdowns. It is striking when one considers that many of the de-optimizations we
implemented look just like code that one sees every day in business and academic
environments. Moreover, much of it was compiled straightforwardly, i.e. the compiler
did not resolve the issues during compilation. The results of chapter II should, at the
very least, make software developers reconsider many of their accumulated habits.
Sometimes the best "looking" code may perform the worst.
Compilers have very powerful techniques for optimizing code, so writing good
software depends not only on the skills of programmers and on architecture
specifications, but also on the power of the compiler. One of the compiler's tasks is
code optimization: compilers can make programs run faster, work more efficiently and
also reduce code size. Therefore, in order to write optimized software, developers
should choose a powerful compiler to optimize their code. The GCC compiler is one of
the well-known compilers used by many developers. In chapter III of our thesis, we
researched the weaknesses of the GCC compiler. Several weaknesses were found,
and we then built several optimization techniques: Division vs. Multiplication, Loop and
Recursive Function, Loop Re-rolling, Loop Unrolling, Power vs. Multiplication, SQRT
Function vs. Division and the cost of the function call inside and outside loops. To
make these optimizations easier to use, we built a tool that helps developers optimize
their code manually: the tool gives users a message indicating whether or not their
programs need optimizing. We built this tool because in some circumstances the GCC
compiler cannot fully optimize code; optimizing the code by hand will then make
programs run faster.
The GCC compiler has another weakness: its automatic parallelization can only
parallelize loops that have no dependencies. Therefore, we built a second tool that
performs dependence detection. It can detect most of the dependencies in programs
and then give the user a message indicating whether or not a particular program has
dependencies. The main goal of chapter III was to build tools that address the
weaknesses of the GCC compiler and help programmers write more efficient code.
To make sure our tools can optimize real C or C++ applications, we tested our
tool on real code: Strassen.cpp, BubbleSort.cpp and SelectionSort.cpp. Strassen.cpp
is a C program that performs matrix multiplication faster. The optimization of
Strassen.cpp was successfully implemented, and we found that our optimized code
without GCC optimization is faster than the original Strassen.cpp with GCC
optimizations. Moreover, we optimized BubbleSort.cpp: the optimization was
implemented successfully, and the Bubble Sort optimized by our tool is much faster
than the Bubble Sort optimized by the GCC compiler only. We also successfully
optimized the Selection Sort algorithm, and again the program optimized by our tool is
much faster than the program optimized by the GCC compiler. Therefore, our tool is a
very powerful tool that can work with the GCC compiler to make programs faster and
more efficient. The main goal of our thesis is to help developers write more efficient
code, by showing that fully understanding the architecture specifications leads to
efficient code and that improving the weaknesses of the GCC compiler makes
programs run faster.
REFERENCES
[1] Midkiff, S. P. (2012). Automatic Parallelization: An Overview of Fundamental
Compiler Techniques (Vol. 7, pp. 3-5). Morgan & Claypool. Retrieved June 6, 2013,
from http://0-www.morganclaypool.com.skyline.ucdenver.edu/doi/abs/10.2200/S00340ED1V01Y201201CAC019
[2] Adve, S. V. and Gharachorloo, K., "Shared memory consistency models: a
tutorial," Computer, vol. 29, no. 12, pp. 66-76, Dec 1996. doi: 10.1109/2.546611.
URL: http://0-ieeexplore.ieee.org.skyline.ucdenver.edu/stamp/stamp.jsp?tp=&arnumber=546611&isnumber=11956
[3] Stallman, R. 2013. Using the GNU Compiler Collection. Boston, USA: GNU Press.
[4] Jordan, H. and Alaghband, G. 2003. Fundamentals of parallel processing. Upper
Saddle River, NJ: Prentice Hall/Pearson Education.
[5] Software.intel.com. 1999. Automatic Parallelization with Intel® Compilers | Intel®
Developer Zone. [online] Available at: http://software.intel.com/en-us/articles/automatic-
parallelization-with-Intel-compilers [Accessed: 13 Jun 2013].
[6] Wilhelm, R. and Seidl, H. 2010. Compiler design. Heidelberg: Springer.
[7] Mogensen, T. 2009. Basics of compiler design. [Kbh.]: Torben Ægidius Mogensen.
[8] Griffith, A. 2002. GCC, the complete reference. New York: McGraw-Hill/Osborne.
[9] Gcc.gnu.org. 2013. GCC, the GNU Compiler Collection- GNU Project - Free
Software Foundation (FSF). [online] Available at: http://gcc.gnu.org/ [Accessed: 13 Jun
2013].
[10] Von Hagen, W. 2006. The definitive guide to GCC. Berkeley, CA: Apress.
[11] Jack W. Davidson and Sanjay Jinturkar. 2001. An Aggressive Approach to Loop
Unrolling. Technical Report. University of Virginia, Charlottesville, VA, USA.
[12] AMD64 Technology. Software Optimization Guide for AMD64 Processors. 2005.