OPTIMIZING SOFTWARE THROUGH FULLY UNDERSTANDING THE UNDERLYING
ARCHITECTURE & STRENGTHENING THE COMPILER
by
AHMED ESKANDER MEZHER
B.Sc., University of Technology, 2009
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado Denver in partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2013
This thesis for the Master of Science degree
by
Ahmed E. Mezher
has been approved for the
Computer Science Program
By
Gita Alaghband, Chair
Tom Altman
Bogdan Chlebus
September 12, 2013
Mezher, Ahmed E. (Master of Science in Computer Science)
Optimizing Software Through Fully Understanding the Underlying Architecture &
Strengthening the Compiler
Thesis directed by Professor Gita Alaghband
ABSTRACT
Software developers strive to write faster, more efficient programs. Increasing speed and
efficiency depends not only on software programming knowledge, but also on an understanding
of the underlying computer architecture and compiler. Many programmers do not consider the
machine architecture when they write code, yet knowing the architecture specifications of the
computer before writing code can influence program efficiency and speed. Optimizing
compilers also have a significant impact on the speed of program execution. The GCC compiler
is a well-known compiler with advanced techniques for optimizing code. However, the GCC
compiler has weaknesses in certain areas, such as optimizing arithmetic operations, function
calls, and loops, and providing fully automatic parallelization. These weaknesses can make
programs run slower. This thesis helps developers write more efficient software by presenting
the weaknesses and strengths of the machine architecture and by building tools that improve on
the weaknesses of the GCC compiler. We show the negative impact of ignoring the architecture
specifications when programs are written, in order to motivate software developers to pay
attention to the architecture specifications of the machine. Several tools are designed to improve
on the weaknesses in the GCC compiler. Our tools work together with the GCC compiler to help
developers design more efficient code, and they are not limited to a specific compiler
architecture. We tested our
tools on real C applications such as Strassen.cpp, Bubble Sort.cpp and Selection Sort.cpp, and
we successfully optimized them. By using our tools together with the GCC compiler, programs
become more efficient and faster. Through understanding the weaknesses and strengths of the
underlying computer architecture, and by using our tools together with the GCC compiler,
developers can design more efficient software.
The form and content of this abstract are approved. I recommend its publication.
Approved: Gita Alaghband
CONTENTS
Chapter
1. Introduction
2. Finding the limits of the underlying computer architecture through de-optimization techniques
2.1 Optimization information and recommendations for the AMD Opteron processor
2.2 Finding the limits of the underlying computer architecture through de-optimization techniques
2.3 Methods
2.4 Analysis and results
2.5 Detailed results
2.5.1 Branch density de-optimization
2.5.2 Unpredictable instruction de-optimization
2.5.3 Branch pattern de-optimization
2.5.4 Float comparison de-optimization
2.5.5 Costly instruction de-optimization
2.5.6 Load-store dependency de-optimization
2.5.7 High latency instruction de-optimization
2.5.8 If condition de-optimization
2.5.9 Loop re-rolling de-optimization
2.5.10 Dependency chain de-optimization
3. Building tools to optimize weaknesses in the GCC compiler
3.1 Compiler
3.2 GCC compiler
3.3 GCC optimizations
3.4 Our optimizations to the GCC compiler
3.5 Methods, analysis and results
3.5.1 Division vs. multiplication
3.5.2 Loop and recursive function
3.5.3 Loop re-rolling
3.5.4 Loop unrolling
3.5.5 Power vs. multiplication
3.5.6 SQRT function vs. division
3.5.7 The cost of the function call inside and outside loops
3.6 Automatic parallelization in compilers
3.7 Automatic parallelization in GCC
3.8 Our improvement to automatic parallelization of the GCC compiler
3.9 Optimizing real C programs by our tool
3.9.1 Strassen optimization
3.9.2 Bubble sort optimization
3.9.3 Selection sort optimization
4. Conclusion
References
LIST OF FIGURES
Figure
2.1 Branch density (mod_ten_counter)
2.2 Unpredictable instructions (factorial_over_array) de-optimization
2.3 Compare_two_floats
2.4 Costly instruction de-optimization
2.5 Load-store dependency de-optimization
2.6 Fib de-optimization
2.7 If condition de-optimization
2.8 Loop re-rolling de-optimization
2.9 Dependency chain de-optimization
3.1 Division vs. multiplication (milliseconds)
3.2 Loop and recursive function (time in milliseconds)
3.3 Loop re-rolling (milliseconds)
3.4 Loop unrolling (seconds)
3.5 Power vs. multiplication (seconds)
3.6 SQRT function vs. division (seconds)
3.7 Function calls inside and outside the loop (milliseconds)
3.8 Strassen optimization
3.9 Bubble sort optimization
3.10 Selection sort optimization
LIST OF TABLES
Table
2.1 Branch density de-optimization for the AMD Opteron and Intel Nehalem
2.2 Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem
2.3 Compare two floats de-optimization
2.4 Costly instruction de-optimization for the AMD Opteron and Intel Nehalem
2.5 Load-store dependency de-optimization for the Opteron and Intel Nehalem
2.6 Fib de-optimization for the Opteron and Intel Nehalem
2.7 If condition de-optimization for the AMD Opteron and Intel Nehalem
2.8 Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem
2.9 Dependency chain de-optimization for the AMD Opteron and Intel Nehalem
3.1 Time in milliseconds for the division vs. multiplication optimization technique
3.2 Time in milliseconds for the loop and recursive function optimization technique
3.3 Time in milliseconds for the loop re-rolling optimization technique
3.4 Time in seconds for the loop unrolling optimization technique
3.5 Time in seconds for the power vs. multiplication optimization technique
3.6 Time in seconds for the SQRT function vs. division optimization technique
3.7 Time in milliseconds for the cost of the function call inside and outside the loop optimization technique
3.8 Running time in seconds for the Strassen optimization
3.9 Running time in seconds for the Bubble sort optimization
3.10 Time in seconds for the Selection sort optimization
1. Introduction
This thesis is in the area of compilers and software. Writing efficient software depends
not only on the skill of programmers but also on knowledge of the underlying architecture and
compiler. Code optimized to run fast on one machine may run very slowly on a different
machine. For example, suppose a loop in a program has been unrolled several times so that the
code runs very fast on machine A, and a programmer runs the same code on a completely
different machine B. If the programmer does not know the architecture specifications of
machine B, the program may run very slowly because the instruction cache of machine B is
smaller than that of machine A, and when the number of instructions in a program exceeds the
size of the instruction cache, the program becomes slower. Therefore, knowing the architecture
specifications helps programmers write optimized code. Additionally, the effort required to
optimize generally depends upon knowledge of the architecture specifications: How long is its
pipeline? How does it detect hazards? Does it dynamically schedule? How does its caching
work? By understanding these aspects of the machine architecture, programmers can write very
efficient software. Does ignoring the architecture specifications when writing software affect its
efficiency? Chapter two answers this question. In chapter two, we research the weaknesses
and strengths of the underlying computer architecture and show the consequences of not
considering the architecture specifications when programs are written. The goal of chapter two is
to design a series of de-optimizations for the Opteron and to show how gracefully, or not, its
performance degrades. In some circumstances, serious degradations in performance were found.
In others, expected de-optimizations were difficult if not impossible to implement. By examining
the results of chapter two, one can gain a thorough understanding of which development aspects
need attention and which can be safely ignored. The de-optimizations that we implemented are
branch density, unpredictable instructions, branch patterns, float comparisons, costly
instructions, load-store dependency, high latency instruction, if condition, loop re-rolling and
dependency chain. Branch patterns, unpredictable instructions and branch density are deemed
unsuccessful because of the power of the hardware. The other de-optimizations showed a great
impact on the Opteron and on the Intel Nehalem as well.
Moreover, some programmers depend only on the compiler to optimize their code.
Depending on the compiler alone will not fully optimize code, because compilers may have
weaknesses. We chose to research the GCC compiler because it is a well-known compiler used
by many developers and programmers. We developed tools that help produce more efficient
code and compared the results with the GCC compiler. In chapter three, we present several
weaknesses in the GCC compiler and then build two tools that help programmers who use GCC
write more efficient code. The first tool detects optimization opportunities in a given program
and presents the programmer with a list of optimizations. The programmer can then apply the
optimizations manually to make the program more efficient. The optimizations that we
implemented for the GCC compiler are the division operation, loop and recursive function, loop
re-rolling, loop unrolling, the power operation, the square root function and function calls.
None of these optimizations is compiler- or architecture-dependent
except for loop unrolling, which depends on the size of the instruction cache, and the size of the
instruction cache differs from one architecture to another. Another part of our study concerns
automatic parallelization in the GCC compiler, which does not support dependence detection.
The GCC compiler can parallelize loops that do not have dependencies, so the programmer
needs to check a program manually for dependencies. Checking dependencies visually is not
always easy, because it requires following the iterations of the loop. If a loop has a very long
sequence of instructions, it becomes very hard and time-consuming to decide whether the loop
has dependencies. Therefore, we designed a tool that can detect most loop dependencies in
programs. Our tool gives the programmer a message indicating whether a particular program has
dependencies; it helps programmers identify dependencies and can save them a lot of time. The
main goal of chapter three is to improve the weaknesses we detected in the GCC compiler and to
build tools with a series of optimizations that help programmers write more efficient code. The
overall goal of the thesis is to help developers write the most efficient software possible.
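To make the dependence problem concrete, here is a minimal C sketch (our own illustration with hypothetical function names, not the tool itself) of the two cases such a dependence detector must distinguish:

```c
/* Iterations are independent: each b[i] depends only on b[i] itself,
   so a compiler may safely parallelize this loop. */
void double_all(int *b, int n) {
    for (int i = 0; i < n; i++)
        b[i] = b[i] * 2;
}

/* Loop-carried dependence: iteration i reads a[i-1], which the
   previous iteration wrote, so the iterations must run in order. */
void add_chain(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1;
}
```

A dependence detector must report the second loop as unsafe to parallelize while allowing the first.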
2. Finding the limits of the underlying computer architecture through de-optimization
techniques
2.1 Optimization information and recommendations for the AMD Opteron processor
Many optimizations have been applied to 32-bit and 64-bit software to work with AMD
processors and the AMD architecture. These optimizations include instruction-decoding
optimizations, cache and memory optimizations, branch optimizations, scheduling
optimizations, integer optimizations, optimizing with the SIMD instructions and x87 floating-
point optimizations. Instruction-decoding optimizations help the processor maximize the number
of decoded instructions. There are two types of instructions: direct-path instructions and vector-
path instructions. In general, direct-path instructions minimize the number of operations that
cost the AMD processor. Three direct-path instructions can be completed in one cycle, and one
and a half double direct-path instructions per cycle. Therefore, using direct-path instructions
rather than vector-path instructions can optimize code. There are also optimizations for load-
execute instructions, load-execute integer instructions and many others of the instruction-
decoding type. Memory optimization is another optimization that can be applied to the AMD
processor. One memory optimization is using non-temporal store instructions such as
MOVNTPS and MOVNTQ. These instructions let the processor write data without first reading
it from memory, saving the time of reading memory or cache and therefore making programs
faster. This optimization can be applied to 32-bit software and 64-bit software. Another
optimization that can be applied to the AMD processor is branch optimization, which can
improve branch prediction
and minimize branch penalties. One possible branch optimization is avoiding conditional
branches that depend on random data, as these branches are difficult to predict. Moreover,
scheduling optimization is another optimization that can be applied to the AMD processor. One
possible scheduling optimization is pushing data directly onto the stack instead of loading it into
a register first; this technique reduces register pressure and eliminates data dependencies. This
optimization, too, applies to both 32-bit and 64-bit software [12].
We have shown some of the recommendations for the AMD processor. These hardware
recommendations can make programs faster. We then exercised the AMD hardware in order to
find the limitations of the hardware and what other weaknesses in the hardware could be
improved by the AMD designers.
2.2 Finding the limits of the underlying computer architecture through de-optimization techniques
The computer industry is developing very fast. CPU speeds are increasing, and most
computers come with more than one CPU. Software developers focus on writing good software
without considering the architecture specifications; they rely on compilers to do the dirty work
of concerning themselves with the details of the architecture. Fully understanding performance,
and what can hurt it on the hardware, gives us a clear clue about developing hardware or
software. In this part of the thesis, the limits of the underlying computer architecture were
researched by intentionally de-optimizing software. Since developers typically do not consider
architecture specifications, it is useful to see how optimized hardware behaves under the worst
circumstances. With this viewpoint, one can imagine that a company might choose one piece of
hardware over another not because it performs the best in very special circumstances, but rather
because its performance degrades the slowest in adverse circumstances. The goal of
this part of the thesis was to design a series of de-optimizations for the Opteron and to show how
gracefully, or not, its performance degrades. After the implementation, serious degradations in
performance were found. Also, in some circumstances it was harder to implement de-
optimization techniques because of the power of the hardware. The main benefit of this work is
that one can look at the results and decide which development aspects need attention and which
can be safely ignored.
2.3 Methods
In our implementation, we chose the AMD Opteron and the Intel Nehalem. The Opteron
was chosen as the CPU to benchmark. The Intel Nehalem is used to check whether the de-
optimizations are universal: if a de-optimization affects both the Opteron and the Nehalem, then
it may be fairly universal. In our implementation, we chose ten de-optimizations to implement.
The de-optimizations are:
● Branch Density - The code for this de-optimization contains densely packed branch
instructions in order to overload branch target buffers and/or branch prediction
algorithms.
● Unpredictable Instruction - The code for this de-optimization has intentionally
misaligned return instructions in order to prevent branch prediction.
● Float Comparison - The code for this de-optimization uses normal float comparison (a
conditional) when comparing integers (after casting) would suffice.
● Costly Instruction - The code for this de-optimization uses division when
multiplication would suffice.
● Load-Store Dependency - The code for this de-optimization sets up a load-store
dependency, the intention of which is to cause the CPU to wait until stores complete
before loads can occur.
● Dependency Chain - The code for this de-optimization creates a very long
dependency that quickly exhausts the resources of the dynamic scheduler.
● High Latency Instruction - The code for this de-optimization uses a high-latency
instruction (LOOP) when lower latency instructions (DEC/JZ) would suffice.
● If Condition Organization - The code for this de-optimization intentionally orders
sub-conditions of an IF statement so that sub-conditions which take longer to
evaluate are first.
● Loop Re-Rolling - The code for this de-optimization breaks a loop into two that could
otherwise be combined so that the dynamic scheduler has less scheduling leeway.
● Branch Patterns - The code for this de-optimization tries to find patterns that are
difficult for branch prediction hardware to predict.
All of these de-optimizations are described in more detail below.
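As a concrete illustration of the Costly Instruction case, the following C sketch (our own illustration with hypothetical function names, not the thesis benchmark itself) computes the same result two ways: the "de-optimized" form divides inside the loop, while the "optimized" form multiplies by a precomputed reciprocal, since a floating-point divide has a much higher latency than a multiply:

```c
/* "De-optimized": a floating-point divide in every iteration. */
double scale_div(const double *x, int n, double d) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] / d;
    return sum;
}

/* "Optimized": hoist the divide out of the loop and multiply by the
   reciprocal instead. */
double scale_mul(const double *x, int n, double d) {
    double r = 1.0 / d;   /* one division, performed once */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * r;
    return sum;
}
```

Note that the two forms can differ in the last bits of the floating-point result, which is why compilers only perform this substitution under relaxed floating-point settings.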
Most of these optimizations and de-optimizations were written in C and compiled with
GCC, after making sure that GCC would not undo or weaken the de-optimization techniques. In
some cases GCC can interfere with a de-optimization, so the code could not be written entirely
in C. When C could not be used, the NASM assembly language was used to build C-importable
assembly modules. These modules were then used by a C wrapper in order to implement the
de-optimization. Typically, the C wrapper builds an array that is passed to an assembly module,
which in turn processes the array. An important question is how to evaluate the de-optimization
techniques. Several tools can give information about code, such as CodeAnalyst and VTune. At
a glance, CodeAnalyst appeared ideal, since it proffers a great deal of information about program
execution and it was written for the chosen hardware platform. However, it turned out to be less
than ideal: it is cumbersome to work with, and it is hard to evaluate the data for sections of code
together. In short, it is great for profiling code but poor for generating lots of test results, so
CodeAnalyst was not an easy answer to the problem of evaluating de-optimizations. Rather than
using prepackaged software to evaluate de-optimizations, we decided to wrap important sections
with code that counts the number of cycles. In the post-Pentium world, there is a ready resource
for counting clock cycles: the CPU Timestamp Counter (CTC). The CTC counts the number of
clock cycles executed since the CPU was booted. It can be a little tricky to work with, especially
in multi-core, multi-CPU environments. However, if the core/CPU used for execution is tightly
controlled, it can be trusted to provide a reasonable measure of the number of cycles needed to
run a section of code.
#if defined(__i386__)
/* On 32-bit x86, RDTSC (opcode 0x0F 0x31) returns the 64-bit counter in EDX:EAX. */
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
/* On x86-64, combine the low and high halves returned in EAX and EDX. */
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
#endif
The above code was inserted into all the C optimizations and de-optimizations. With this
code, a call to rdtsc() yields the current clock-cycle count (from the CTC). When running the
code, whether optimization or de-optimization, the number of clock cycles differs slightly from
one execution to another, so we run the code several times and compute the average over all
executions. For simplicity, we created a program called 'The Version Tester' (VT). This program
is simply an executable that takes a configuration file as an argument. Within the configuration
file, the number of test iterations is specified along with a series of programs to test. For each
program, there is a description and a run command line. For each specified program, the Version
Tester runs the executable for the specified number of iterations. After each iteration, it captures
the number of cycles (which is written to stdout by the executable being tested). After
completing the test iterations, the VT computes the average number of cycles per iteration.
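The wrap-and-average idea can be sketched as follows. This is a minimal illustration under our own naming (read_cycles, work, avg_cycles), not the actual Version Tester, and on non-x86 machines it falls back to clock(), which the thesis does not use:

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Read the CPU timestamp counter, as in the thesis. */
static inline uint64_t read_cycles(void) {
    unsigned lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | (((uint64_t)hi) << 32);
}
#else
#include <time.h>
/* Portability fallback for non-x86 machines (not cycle-accurate). */
static inline uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Example section of code to measure. */
static volatile long sink;
static void work(void) {
    long s = 0;
    for (long i = 0; i < 100000; i++) s += i;
    sink = s;
}

/* Run f() for the given number of iterations and return the average
   cycle count per iteration, as the Version Tester computes. */
uint64_t avg_cycles(void (*f)(void), int iters) {
    uint64_t total = 0;
    for (int i = 0; i < iters; i++) {
        uint64_t start = read_cycles();
        f();
        total += read_cycles() - start;
    }
    return total / (uint64_t)iters;
}
```

In practice the first few iterations should also be discarded as warm-up, since cold caches and branch predictors inflate the early measurements.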
2.4 Analysis and results
Many of the de-optimizations designed during the implementation worked as expected.
For example, the Branch Density de-optimization consistently showed a 12% slowdown on the
Opteron for all array sizes. This comported very well with the expected cost of so consistently
missing the branch target buffer (BTB) on the Opteron. Other de-optimizations had a greater
effect than anticipated. The Loop Re-Rolling and Dependency Chain de-optimizations showed
very significant slowdowns for all array sizes. The expectation was that winnowing down the
options available to the Opteron's dynamic scheduler would have an impact, but not such a stark
impact. The upshot of these surprises is that software engineers must consider the dynamic
scheduler in order to write software that can run efficiently on the Opteron. Finally, some de-
optimizations had little or an inconclusive effect. For example, the Branching Pattern de-
optimization turned out to be very difficult to realize. A slowdown was achieved using random
data, but no significant slowdown was achieved using patterned data. Since using random data
to cause mispredictions yields no useful information about the Opteron, this de-optimization was
deemed unsuccessful. Overall, modern CPUs are a wild cacophony of interacting threads and
processes, which makes profiling optimizations (and de-optimizations) very difficult. It turns out
that Hydra was a great place to test de-optimizations; its multiple nodes with multiple CPUs
made it easy to run processes with relative confidence that the program being tested would not
end up competing for resources. As a result, the data for the Opteron is very smooth and
consistent, while the noisy data for the Intel Nehalem reflects its meager resources.
2.5 Detailed results
2.5.1 Branch density de-optimization
For this de-optimization, we designed an algorithm called 'mod_ten_counter'. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 9. (Note that the data for this array
is not random; if it were, then branch mispredictions could artificially inflate the effect of the
Branch Density de-optimization.) After this array is populated, the number of times that each
integer x appears in the array is counted.
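In C, the algorithm can be sketched roughly as follows (a simplified reconstruction with hypothetical helper names; the thesis versions do the counting in NASM, as shown below):

```c
/* Populate an array with the deterministic pattern 0,1,...,9,0,1,...
   The data is intentionally not random, so branch mispredictions do
   not pollute the measurement. */
void fill_mod_ten(int *a, int size) {
    for (int i = 0; i < size; i++)
        a[i] = i % 10;
}

/* Count how many times each integer 0..9 appears in the array. */
void count_digits(const int *a, int size, long counts[10]) {
    for (int d = 0; d < 10; d++)
        counts[d] = 0;
    for (int i = 0; i < size; i++)
        counts[a[i]]++;
}
```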
The executables associated with this algorithm are called 'mod_ten_counter_op.exe' and
'mod_ten_counter_deop_1.exe' (on Windows). The code is written in C; the section of code
being optimized/de-optimized is written in NASM. The cycles counted include only the time
that mod_ten_counter spends running its assembly code.
The optimized version counts the integer instances within the array using a structure that
is much like a switch statement in C. However, the branch statements within this structure
are spaced out using NOP instructions. These NOP instructions ensure that the optimized
version maintains proper alignment and spacing for branch instructions so that branch target
buffer (BTB) misses are not incurred. The important optimized section is below:
cmp ecx, 0
je mark_0 ; We have a 0
nop
dec ecx
je mark_1 ; We have a 1
nop
dec ecx
je mark_2 ; We have a 2
nop
dec ecx
je mark_3 ; We have a 3
nop
dec ecx
je mark_4 ; We have a 4
nop
dec ecx
je mark_5 ; We have a 5
nop
dec ecx
je mark_6 ; We have a 6
nop
dec ecx
je mark_7 ; We have a 7
nop
dec ecx
je mark_8 ; We have a 8
nop
dec ecx
je mark_9 ; We have a 9
Notice that there is a NOP between each DEC/JE pair. This was done in order to create space
between branches and to better align the branching instructions. The de-optimized version
counts the integer instances with much the same structure as the optimized version. However, it
does not maintain proper alignment and spacing, so it should incur many BTB misses. The
important de-optimized section is below:
cmp ecx, 0
je mark_0 ; We have a 0
dec ecx
je mark_1 ; We have a 1
dec ecx
je mark_2 ; We have a 2
dec ecx
je mark_3 ; We have a 3
dec ecx
je mark_4 ; We have a 4
dec ecx
je mark_5 ; We have a 5
dec ecx
je mark_6 ; We have a 6
dec ecx
je mark_7 ; We have a 7
dec ecx
je mark_8 ; We have a 8
dec ecx
je mark_9 ; We have a 9
Notice that, unlike the optimized version, there are no additional NOP instructions between the
DEC/JE instructions. This means that the de-optimized version has a very tightly packed bunch
of branches, unlike the optimized version. Also note that, on average, the optimized version
executes five more instructions per iteration, yet it significantly outperforms the de-optimized
version.
Data
As can be seen by the slowdown percentages below, packing branches as densely as the
de-optimization does can have a significant impact on run-time. On the Opteron, it caused ~10%
slowdown for all but the smallest array size; this slowdown is due to the branch target buffer
misses that come with packing branches too densely. On the Nehalem, this de-optimization had an
even bigger impact, though the reasons are less well understood. One point of note for the
Nehalem is the length of its pipeline, which is 17 stages versus 12 on the Opteron. Thus, the
additional slowdown could be due to a higher cost of BTB misses on the Nehalem.
Figure 2.1 Branch Density (mod_ten_counter)
Table 2.1: Branch Density de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 -47 -7.31 39 9.92
100 185 7.80 331 27.36
1000 2234 11.86 6355 89.42
10000 23230 12.69 67653 110.38
100000 203760 10.98 537362 84.27
1000000 1620652 8.40 5306766 89.63
10000000 17263048 8.34 52082971 78.85
2.5.2 Unpredictable instruction de-optimization
For this de-optimization, we designed an algorithm called ‘factorial_over_array’. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 12. After this array is populated, the
factorial of each element of the array is calculated and written back into the array, overwriting
the original array element.
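The timed factorial in the thesis is recursive NASM; as a reference, the algorithm just described can be sketched in C as follows (the names are ours, and the 0 <= x <= 12 cap matters because 12! = 479001600 is the largest factorial that fits in a signed 32-bit integer):

```c
#include <assert.h>

/* Recursive factorial, mirroring the structure of the NASM version:
 * recurse until the argument reaches 1, then return 1. */
int factorial(int n) {
    if (n <= 1)
        return 1;       /* base case: 0! = 1! = 1 */
    return n * factorial(n - 1);
}

/* Overwrite each element of the array with its factorial, as
 * factorial_over_array is described to do. */
void factorial_over_array(int *arr, int size) {
    for (int i = 0; i < size; i++)
        arr[i] = factorial(arr[i]);
}
```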
The executables associated with this algorithm are called ‘factorial_over_array_op.exe’
and ‘factorial_over_array_deop_1.exe’ (on Windows). The code is written in C. The section of
code being optimized/de-optimized is written in NASM. The cycles being counted include only
the time that factorial_over_array spends running its assembly code.
The optimized and de-optimized versions are almost identical. Of course, both perform
factorial, recursively, on an integer argument. However, the optimized version properly spaces
the JNE and RET instructions so that the RET is not mispredicted. Below is the important
optimized section:
mov eax, [esp+4] ; Get the integer whose factorial is
                 ; being calculated
cmp eax, 1       ; Have we hit one yet?
jne Calculate    ; If we haven't then do another call
nop              ; This is here for alignment purposes
ret              ; return with a 1
Notice that there is a NOP instruction between the JNE instruction and the RET instruction. This
was done in order to create space between the two and ensure that the RET instruction is
properly aligned.
The de-optimized version, on the other hand, does not properly space these instructions, and so
it results in a large number of mispredictions on the RET instruction. Below is the important
de-optimized section:
mov eax, [esp+4] ; Get the integer whose factorial is
                 ; being calculated
cmp eax, 1       ; Have we hit one yet?
jne Calculate    ; If we haven't then do another call
ret              ; return with a 1
Notice that, unlike the optimized version, there is no NOP instruction between the JNE and the
RET instruction. This means that the two branching instructions are adjacent and the RET
instruction is misaligned.
Data
As can be seen by the slowdown percentages below, the cost of misaligning heavily used
RET calls is high. On the Opteron, it caused ~12% slowdown for most array sizes; this slowdown
is due to the branch mispredictions engendered by misaligning RET. On the Nehalem, this
de-optimization had an inconclusive impact. Therefore, it is reasonable to assume that the way
the Nehalem indexes branching instructions is quite different from the Opteron's.
Figure 2.2 Unpredictable instructions (factorial_over_array) de-optimization
Table 2.2: Unpredictable instructions de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 80 7.07 21 2.58
100 1510 0.88 2436 32.15
1000 10429 11.61 -990 -1.24
10000 115975 12.76 36158 5.33
100000 1139748 12.59 238852 3.54
1000000 11529624 12.66 -3505949 -5.42
10000000 175191774 18.90 3081557 0.51
2.5.3 Branch pattern de-optimization
For this de-optimization, we designed an algorithm called ‘even_number_sieve’. This
algorithm takes an array size and a pattern as inputs. It then generates an array of the specified
size in which each element is populated with either 1 or 2. The order in which 1’s and 2’s appear
depends upon the pattern provided; this pattern essentially represents the branching pattern being
tested. For example, the pattern ‘12’ would cause the array to be populated with alternating 1’s
and 2’s; the pattern ‘r’ would cause the array to be populated with a random sequence of 1’s and
2’s. After the array has been constructed, the algorithm iterates over the array and replaces all of
its odd entries, the 1’s, with 0’s.
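The thesis does not show the population code; a plausible C sketch of the pattern-driven fill (the cycling behavior, the 'r' convention, and the function name are our assumptions) is:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Populate arr with 1's and 2's by cycling over the pattern string;
 * the special pattern "r" fills the array randomly instead. */
void fill_from_pattern(int *arr, int size, const char *pattern) {
    if (pattern[0] == 'r') {
        for (int i = 0; i < size; i++)
            arr[i] = 1 + rand() % 2;           /* random 1's and 2's */
        return;
    }
    int plen = (int)strlen(pattern);
    for (int i = 0; i < size; i++)
        arr[i] = pattern[i % plen] - '0';      /* "12" -> 1,2,1,2,... */
}
```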
Only one executable was needed for this de-optimization: here, an 'optimized' version is simply a
branching pattern that can be easily managed by the CPU's branch prediction mechanism, while a
'de-optimized' version is a branching pattern that cannot be easily managed by it.
The executable associated with this algorithm is called ‘even_number_sieve.exe’ (on
Windows). The code is written in C. The section of code being optimized/de-optimized is also
written in C; there is no assembly for this de-optimization.
The cycles being counted include only the time taken to mark the odd elements of the array.
The code that evaluates the array elements simply tests whether each element x satisfies
x mod 2 = 1. If it does, the element is replaced with a 0. Below is the important code:
unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    if ( number_array[i] % 2 == 1 ) {
        number_array[i] = 0;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );
Notice that each element, mod 2, is compared to 1. If it is equal, then the element is
replaced with 0.
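The rdtsc() helper used in this and the other timed sections is never shown in these excerpts. A minimal sketch of how it is commonly implemented on x86 with GCC inline assembly (an assumption about this particular codebase, not code from the thesis) is:

```c
#include <assert.h>
#include <stdio.h>

/* Read the x86 time-stamp counter via the RDTSC instruction; the
 * 64-bit cycle count is returned in the EDX:EAX register pair. */
static unsigned long long rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}
```

Subtracting two readings, as the excerpts do, yields an approximate cycle count for the code between them; the counter's exact behavior varies across CPU generations.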
We were never able to generate interesting results from this de-optimization. It is
reasonable to conclude that the branch prediction mechanism on the Opteron is very good. Of
course, many cycles were lost when random data was used. However, no branch prediction
mechanism can find and use a pattern in random data; it would not then be random.
2.5.4 Float comparison de-optimization
For this de-optimization, we designed an algorithm called ‘compare_two_floats’. This
algorithm takes a number of iterations as input. It then generates two simple floats and tests
them, over and over again, for the requested number of iterations.
The executables associated with this algorithm are called ‘compare_two_floats_op.exe’
and ‘compare_two_floats_deop.exe’ (on Windows). The code is written in C. The section of
code being optimized/de-optimized is also written in C; there is no assembly for this de-
optimization. The cycles being counted include only the time that compare_two_floats spends
comparing the floating point values.
The optimized version performs the comparison of floating point values by casting the
floats into integers and then performing integer comparison. The important optimized section is
below:
#define FLOAT2INTCAST(f) (*((int *)(&f)))
float t = f1 - f2;
pstart = rdtsc();
for ( j = 0; j < numberof_iteration; j++ ) {
    if ( FLOAT2INTCAST(t) <= 0 ) {
        Count_numbers(i);
        count++;
    } else {
        count++;
    }
}
result = rdtsc() - pstart;
printf( "Cycles=%llu\n", result );
Notice that the two floats being compared, f1 and f2, are subtracted (outside of the section being
timed) and the difference is then cast to an integer and compared to zero. Thus, no float
comparison occurs.
The de-optimized version performs the comparison of floating point values in
the normal fashion, i.e. by straightforwardly comparing the floats. The important de-optimized
section is below:
float t = f1 - f2;
pstart = rdtsc();
for ( j = 0; j < numberof_iteration; j++ ) {
    if ( f1 <= f2 ) {
        Count_numbers(i);
        count++;
    } else {
        count++;
    }
}
result = rdtsc() - pstart;
printf( "Cycles=%llu\n", result );
Notice that f1 and f2 are being compared in the straightforward fashion. Thus, they are
being compared as floats.
Data
As can be seen by the slowdown percentages below, the cost of casting and then comparing
integers is higher until the number of iterations crosses a certain threshold; past this
threshold, the cost of comparing floats begins to dominate. On the Opteron, the de-optimization
caused ~6% slowdown for sizes greater than 10000; this slowdown is due to the fact that the
floating point data path on the Opteron is more costly than the integer data path. On the
Nehalem, this de-optimization had a similar impact. Therefore, it is reasonable to assume that
the Nehalem has a similar discrepancy between its integer and floating point data paths.
Figure 2.3 Compare_Two_floats
Table 2.3: Compare two floats de-optimization
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 -10991 -97.19 43 26.70
100 -10960 -86.18 471 49.68
1000 -9892 -37.97 5285 59.90
10000 -982 -0.60 52211 58.82
100000 88962 5.86 413932 41.79
1000000 987559 6.56 4298502 44.46
10000000 9949917 6.62 41906333 46.14
2.5.5 Costly instruction de-optimization
For this de-optimization, we designed an algorithm called ‘div_vs_mult’. This algorithm
takes an array size as input. It then generates an array of the specified size in which each
element is an integer x where 2^1 <= x <= 2^12; thus each element is a random power of two less
than or equal to 2^12. After this array is populated, each element in the array is divided by two.
The executables associated with this algorithm are called ‘div_vs_mult_op.exe’ and
‘div_vs_mult_deop_1.exe’ (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that ‘div_vs_mult’ spends dividing each element of
the array by 2.
Both the optimized and de-optimized versions do basically the same thing. The only
difference is how each version divides each element of the array by 2. In the optimized version,
this is done by multiplying each element by 0.5. This means that the optimized version is able to
use the multiple multiply data paths available on the Opteron. The important optimized section
is below:
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that each element is divided by multiplying it by 0.5. Thus, no actual division occurs.
The de-optimized version does the same thing as the optimized version; it just uses
division, instead of multiplication, to do it. This can be costly on the Opteron, since it has
more limited resources for division, i.e. a single data path. The important de-optimized section
is below:
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the division of each element occurs by dividing it by 2.0. (Note that the use of 2.0
rather than 2 ensures that the de-optimization is processed as a floating point operation, just
like the optimization.) Thus, division occurs with each iteration.
Data
As can be seen by the slowdown percentages below, the cost of using division when
multiplication would suffice is very high. On the Opteron, it caused ~25% slowdown for all
array sizes; this slowdown is due to the fact that the Opteron can only handle division on one of
its scalar pipelines. On the Nehalem, this de-optimization had little impact. Therefore, it is
reasonable to assume that the Nehalem has more resources available for division.
Figure 2.4 Costly instruction de-optimization
Table 2.4: Costly instruction de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 99 1.81 -42 -14.34
100 949 34842 74 38.4
1000 9474 34849 142. 7894
10000 9447. 37844 14244 78.4
100000 92.413 37834 114.14 4874
1000000 9419374 32894 44794. 3844
10000000 94444224 3289. .7.7474 2841
2.5.6 Load-store dependency de-optimization
For this de-optimization, we designed an algorithm called ‘dependency’. This algorithm
takes an array size as input. It then generates an array of the specified size in which each element
is populated with an integer x where 0 <= x <= 9. After this array is populated, each element in
the array has the previous element added to it; this effect ripples from the front of the array to the
back of the array. The last element in the array is the sum of all elements in the array. Thus, this
is basically a prefix sum.
The executables associated with this algorithm are called ‘dependency_op.exe’ and
‘dependency_deop.exe’ (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that dependency spends summing array values.
The optimized version performs the additions by keeping the previous three array
elements in temporary variables; it creates few load-store dependencies since the previous array
element does not need to be reloaded, i.e. the sum for the previous element does not need to be
written to memory before it can be used again. The important optimized section is below:
int temp_prev = test_array[0], temp1, temp2;
unsigned long long start = rdtsc();
for ( i = 3; i < size_of_array; i += 3 ) {
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the value being added to the current element of the array comes from a temporary
variable. Thus, its value does not need to be reloaded, i.e. no load-store dependency is (likely
to be) created.
The de-optimized version, on the other hand, performs this computation in a more
natural way. However, since it must wait for the previous element to have its value written
before the next one can be calculated, many load-store dependencies are created. The
important de-optimized section is below:
unsigned long long start = rdtsc();
for ( i = 1; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] + test_array[i - 1];
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that, unlike the optimized version, there are no temporary variables that can be used to
prevent load-store dependencies.
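Note that the optimized loop above, stepping by three, only covers every element when size_of_array - 1 is a multiple of three; a self-contained sketch of the same idea with an explicit tail loop (the tail handling is our addition, not shown in the thesis) is:

```c
#include <assert.h>

/* Prefix sum that carries the running total in a temporary so the
 * value just stored never has to be reloaded from memory, avoiding
 * load-store dependencies. A scalar tail loop handles the elements
 * that the three-element stride does not reach. */
void prefix_sum(int *arr, int size) {
    if (size < 1)
        return;
    int prev = arr[0];
    int i;
    for (i = 3; i < size; i += 3) {
        int t2 = arr[i - 2] + prev;
        int t1 = arr[i - 1] + t2;
        arr[i - 2] = t2;
        arr[i - 1] = t1;
        arr[i] = prev = arr[i] + t1;
    }
    for (i -= 2; i < size; i++)    /* tail: up to two leftover elements */
        arr[i] = prev = arr[i] + prev;
}
```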
Data
As can be seen by the slowdown percentages below, the cost associated with these kinds
of dependencies is very high. On the Opteron, it caused ~60% slowdown for all array sizes; this
is due to the fact that the Opteron does not schedule stores in the way that it schedules other
instruction types, leading to very costly dependency stalls. On the Nehalem, this de-optimization
had a lesser impact. Therefore, it is reasonable to assume that its dynamic scheduler is better at
managing stores.
Figure 2.5 Load-store dependency de-optimization
Table 2.5: Load-store dependency de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 .4 2487 13 .814
100 3293 4489. 32. 3189
1000 9.7. 4.841 141. 1782.
10000 491.. 428.9 14714 148.4
100000 4.44.. 4281 144.92 1.84
1000000 .1.19.1 44813 13.7... 13813
10000000 .149.449 7.8.2 14344314 1.8.4
2.5.7 High latency instruction de-optimization
For this de-optimization, we designed an algorithm called ‘fibonacci’. This algorithm
takes an array size as input. It then generates an empty array of the specified size. After this
array is created, the algorithm calculates the Fibonacci number associated with each index of the
array.
The executables associated with this algorithm are called ‘fib_op.exe’ and ‘fib_deop.exe’
(on Windows). The code is written in C. The section of code being optimized/de-optimized is
written in NASM. The cycles being counted include only the time that fib spends running its
assembly code.
The optimized and de-optimized versions are almost identical. Of course, both calculate
the Fibonacci number associated with each index. However, the optimized version uses the
combination of DEC and JNZ instructions in order to control its branching. Below is the
important optimized section:
29
calculate:
mov edx, eax
add ebx, edx
mov eax, ebx
mov dword [edi], ebx
add edi, 4
dec ecx
jnz calculate
Notice that each iteration ends with DEC and JNZ instructions. Thus, the loop will end
when the ECX register is zero.
The de-optimized version on the other hand uses a LOOP instruction instead of the
DEC/JNZ combination. The important thing to note about the LOOP instruction is that it has a
high latency (approximately 8 cycles) compared to DEC/JNZ. Below is the important de-
optimized section:
calculate:
mov edx, eax
add ebx, edx
mov eax, ebx
mov dword [edi], ebx
add edi, 4
loop calculate
Notice that, unlike the optimized version, the loop is controlled by the LOOP instruction. Just
like the optimized version, the loop ends when the ECX register reaches zero.
Data
As can be seen by the slowdown percentages below, the cost of the LOOP instruction is
high. On the Opteron, it caused ~17% slowdown for all array sizes; this is due solely to the very
high latency of the LOOP instruction. On the Nehalem, this de-optimization had an even greater
impact. It is hard to speculate on what might cause the Nehalem to have such a poor
implementation of LOOP.
Figure 2.6 Fib de-optimization
Table 2.6: Fib de-optimization for the Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 17 9.04 9 8.57
100 155 29.4 244 81.87
1000 1386 34.74 2239 104.52
10000 14588 22.03 19519 62.16
100000 123271 17.91 256839 87.42
1000000 1206678 16.94 2430301 81.51
10000000 11716747 14.07 19896396 65.26
2.5.8 If condition de-optimization
For this de-optimization, we designed an algorithm called ‘if’. This algorithm takes an
array size as input. It then generates an array of the specified size in which each element is
populated with a float x where 0.5 <= x <= 11.0. After this array is populated, each element in
the array is evaluated with an if-then statement in order to determine whether to increment or
decrement a dummy variable.
The executables associated with this algorithm are called ‘if_op.exe’ and ‘if_deop.exe’
(on Windows). The code is written in C. The section of code being optimized/de-optimized is
also written in C; there is no assembly for this de-optimization. The cycles being counted
include only the time that ‘if’ spends evaluating the elements of the array with the if-then
statement.
Both the optimized and de-optimized versions do basically the same thing. The only
difference is the order in which they evaluate the clauses of the if-then statement. One of these
clauses is very easy and fast to evaluate: a comparison to zero. The other clause is hard and
time consuming to evaluate: a floating point comparison. The important optimized section is
below:
int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( mod == 0 && test_array[i] > 1.5 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice the ordering of the clauses within the if-then statement. This is an optimal
ordering since the floating point comparison clause need only be evaluated half of the time in
order to know the value of the entire ‘&&’ statement.
Again, the de-optimized version does the same thing as the optimized version. It just
evaluates the ‘&&’ statement in a different order. The important de-optimized section is below:
int dummy = 0, mod = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    mod = ( i % 2 );
    if ( test_array[i] > 1.5 && mod == 0 ) {
        dummy++;
    } else {
        dummy--;
    }
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Again, notice the ordering of the clauses within the if-then statement. This is a sub-optimal
ordering since the floating point comparison clause needs to be evaluated each and every time in
order to know the value of the entire ‘&&’ statement.
Data
As can be seen by the slowdown percentages below, the cost of improperly ordering
clauses within a conditional is very high. On the Opteron, it caused ~37% slowdown for all array
sizes; this is due to being forced to evaluate the most expensive of the conditional clauses for
each iteration; therefore, a simple switch in ordering can make this program 37% faster. On the
Nehalem, this de-optimization had a similar impact.
Figure 2.7 If Condition de-optimization
Table 2.7: If Condition de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 -207 -26.90 74 328.7
100 22. 1784. 99. 74899
1000 442. 2.834 4..4 228.4
10000 42947 2.8.9 44244 248.4
100000 47142. 2.894 73.19. 7.844
1000000 42.79.4 24824 474441. 418.4
10000000 44.32.44 278.9 499944.4 44873
2.5.9 Loop re-rolling de-optimization
For this de-optimization, we designed an algorithm called ‘loop_re_rolling’. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with an integer x where 0 <= x <= 9. After this array is populated, two
more arrays are built: one will hold the square (x^2) of each element of the integer array and
the other will hold the cube (x^3). These squares and cubes arrays are then populated by
iterating over the elements of the integer array.
The executables associated with this algorithm are called ‘loop_re_rolling_op.exe’ and
‘loop_re_rolling_deop.exe’ (on Windows). The code is written in C. The section of code being
optimized/de-optimized is also written in C; there is no assembly for this de-optimization. The
cycles being counted include only the time that the executables spend filling out the squares
and cubes array from the original.
The optimized version fills out the squares and cubes arrays in a single loop, since the
integer array is used to populate both. This is ideal since it gives the dynamic instruction
scheduler more flexibility when scheduling instructions to run. The important optimized section
is below:
unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );
Notice that the calculation that populates the element of each array, squares and cubes, is
within the same loop.
35
The de-optimized version fills out the squares and cubes arrays using separate loops. This
is sub-optimal since it takes away some of the flexibility that the dynamic instruction scheduler
might otherwise have when scheduling instructions to run. The important de-optimized section is
below:
unsigned long long pstart = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for ( i = 0; i < size_of_array; i++ ) {
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - pstart ) );
Notice that the calculations that populate the squares and cubes arrays are within separate
loops.
Data
As can be seen by the slowdown percentages below, the cost of not combining loops that
can be combined is very high. On the Opteron, it caused ~50% slowdown for all array sizes; this
is solely due to the fact that splitting the computations into separate loops isolates them such that
the dynamic scheduler cannot schedule them together; some of the flexibility that it might have
had otherwise has been lost. On the Nehalem, this de-optimization had an impact, but a lesser
one. It is hard to imagine why this is less costly on the Nehalem.
Figure 2.8 Loop re-rolling de-optimization
Table 2.8: Loop re-rolling de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 189 43.95 83 28.32
100 1653 53.68 357 15.35
1000 21311 54.96 4696 21.28
10000 234336 52.42 39311 17.10
100000 2352328 50.93 134786 6.05
2.5.10 Dependency chain de-optimization
For this de-optimization, we designed an algorithm called ‘dependency_chain’. This
algorithm takes an array size as input. It then generates an array of the specified size in which
each element is populated with integers x where 0 <= x <= 20. After this array is populated, all
of the elements of the array are summed into a single integer variable.
The executables associated with this algorithm are called ‘dependency_chain_op.exe’ and
‘dependency_chain_deop_1.exe’ (on Windows). The code is written in C. The section of code
being optimized/de-optimized is also written in C; there is no assembly for this de-optimization.
The cycles being counted include only the time that the executables spend adding the elements of
the array.
The optimized version adds the elements of the array by striding through the array in four
element chunks and adding elements to four different temporary variables. After the array has
been processed, the four temporary variables are added. The advantage of the optimized version
is that it creates four large dependency chains instead of one massive one. Thus, the dynamic
scheduler has many more options when scheduling instructions. The important optimized section
is below:
int sum = 0, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the summing occurs across four temporary variables. These temporary
variables are then added together after the array has been processed.
The de-optimized version sums each element of the array into one variable. This creates a
massive dependency chain that quickly exhausts the scheduling resources of the dynamic
scheduler. The important de-optimized section is below:
int sum = 0;
unsigned long long start = rdtsc();
for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}
printf( "Cycles=%llu\n", ( rdtsc() - start ) );
Notice that the summing occurs using one variable only.
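The optimized loop above steps by four and therefore assumes the array size is a multiple of four; a self-contained sketch of the same multi-accumulator technique with a tail loop (the tail handling is our addition, not shown in the thesis) is:

```c
#include <assert.h>

/* Sum with four independent accumulators so the dynamic scheduler can
 * overlap the additions (four shorter dependency chains instead of
 * one long one); a scalar tail loop handles leftover elements. */
int sum_four_way(const int *arr, int size) {
    int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
    int i;
    for (i = 0; i + 3 < size; i += 4) {
        sum1 += arr[i];
        sum2 += arr[i + 1];
        sum3 += arr[i + 2];
        sum4 += arr[i + 3];
    }
    for (; i < size; i++)          /* tail: up to three leftover elements */
        sum1 += arr[i];
    return sum1 + sum2 + sum3 + sum4;
}
```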
Data
As can be seen by the slowdown percentages below, the cost of creating a massive
dependency chain is high. On the Opteron, it caused ~150% slowdown for all array sizes; this is
solely due to the exhaustion of the scheduling resources of the dynamic scheduler; exhausting
these resources effectively makes the program run sequentially (no ILP). On the Nehalem, this
de-optimization had a large impact but not quite as large as the Opteron. One can only imagine
that it has more scheduling resources at its disposal.
Figure 2.9 Dependency chain de-optimization
Table 2.9: Dependency chain de-optimization for the AMD Opteron and Intel Nehalem
Array Size | AMD Opteron: Difference | Slowdown (%) | Intel Nehalem: Difference | Slowdown (%)
10 42 198.2 2 387
100 .41 13.821 374 74844
1000 ..74 14.8.3 3492 498.2
10000 .47.2 14.844 2..42 .4844
100000 .44143 17782 343.94 4.8.2
1000000 .422.44 1438.4 3924192 ..872
10000000 9.334421 12389. 37424329 41813
3. Building tools to optimize weaknesses in the GCC compiler
3.1 Compiler
There are many definitions of a compiler, but the simplest one is a program that
converts other programs from a high level language to a low level language. Compilers differ
from one language to another; however, compilers can be similar to each other if they belong to
related programming languages. For example, C++, C# and Java are languages extended from C,
so compilers for those languages are similar in structure and operation.

Programs are written in high level languages instead of low level languages for several
reasons. One reason is that programs in high level languages are much shorter than programs
written in low level languages, so high level languages help programmers write less code and
write it more efficiently. Another advantage is that programs can be compiled to different low
level languages and then run on different machines. However, compiling a program written in a
high level language takes time, because the program must be converted from the high level
language to the low level language. So, the most efficient compiler can convert from a high
level language to a low level language in the shortest time [6] [7].

Every compiler works in phases; the number and sequence of phases differ from one
compiler to another. The first phase in a compiler is lexical analysis, the operation that reads
a program as text and then divides the whole program into tokens corresponding to the rules of
the language. The second phase is syntax analysis, which takes the output of the previous phase
and represents it as a tree corresponding to the structure of the program. The third phase is
type checking, which is responsible for checking whether a program follows the rules of the
language's grammar and structure. The fourth phase is intermediate code generation, which
translates the program into an intermediate, machine-independent representation. The fifth phase
is register allocation, in which the variables and names used by the intermediate code are
mapped to the registers of the target machine. The sixth phase is machine code generation, in
which the code in the intermediate language is converted to assembly language. The last phase is
assembly and linking, in which the assembly code is translated into binary form.

Lexical analysis, syntax analysis and type checking are called the frontend of a compiler,
while register allocation, machine code generation, and assembly and linking are called the
backend. Most compiler optimizations are done in these two parts, the frontend and the backend.
The optimization procedure differs from one compiler to another, but there are basic
optimizations that appear in all of them [7].
3.2 GCC compiler
GCC is one of the most popular compilers and is used for several languages. GCC is used widely because it is free software and has very effective tools to optimize programs efficiently. Currently, GCC supports six languages: C, C++, Objective-C, FORTRAN, Java and Ada; support for another language, COBOL, is planned for the future. GCC is part of the GNU project, which was launched in 1984 to build a complete UNIX-like operating system as free software. There have been many versions of GCC from the first version until today [8].
The base language of the GCC compiler is C, and the GCC compiler itself was written in C. So, in the beginning, GCC was a C compiler, and several languages were added to it afterwards. The first language added to GCC was C++, an extension of the C language. Objective-C, another language extended from C, was also added. Another important addition was FORTRAN, which can be considered stronger than C for mathematical work because it is rich in math libraries; it can perform complex operations and it handles complex numbers. Java was also added to GCC. Java differs from the other languages in the form of its code and in its output: the Java compiler generates an executable that is run by a Java virtual machine rather than directly by the hardware. The last language added to GCC is Ada, developed by Ada Core Technologies and added to GCC in 2001. Ada is designed to handle large programs [8] [9].
3.3 GCC optimizations
Compilers basically aim to make programs faster and more efficient and to reduce code size. The most important goal of compiler optimization is program speedup. Code generated by an optimizing compiler is called optimized code. Optimized code is produced by several transformations of the source code that make it more efficient and faster than the original code. The GCC compiler uses several analyses to decide whether a particular piece of code can be improved, such as control flow analysis and data flow analysis. Control flow analysis examines how control statements are used and which path is taken from the beginning of a control statement to its end. Data flow analysis identifies how data moves through a program [10, pp. 101-117]. Several optimization techniques are used in the GCC compiler to optimize programs. Copy propagation is one of them: it reduces redundant copies by substituting constant values for copy operations. To clarify this technique, consider the following example:
i = 10;
x[j] = i;
y[min] = b;
b = i;
The GCC compiler optimizes the previous code into the following:
i = 10;
x[j] = 10;
y[min] = b;
b = 10;
In the optimized code, the constant value is stored directly into the variables x[j] and b, while in the original code a copy operation is used to copy the values. The copy operation costs more time than simply storing the constant value [10, p. 104].
Another GCC compiler optimization is dead code elimination. Dead code elimination removes extra or unused code in programs, that is, code that does not affect the sequence of operations or the result of the program. Code written with extra calculations or extra variables is called dead code. Code that can never be reached by the program is called unreachable code. For example:

if (i == 10) {
    /* ... */
} else {
    if (i == 10) {
        /* ... */
    }
}

In the example above, the body of the inner if can never be reached: when i equals 10, the first branch is taken, and otherwise the identical condition inside the else branch is also false. Such code is called unreachable code, and the optimization that removes it is called unreachable code elimination.
The GCC compiler, especially in its latest versions, has many smart techniques for optimizing code. GCC has optimizations called architecture-independent optimizations; what is most interesting about them is that they do not depend on the features of a given architecture or processor, so they apply across different architectures and processors. GCC has several optimization switches: -O, -O1, -O2, -O3, -Os and -O0. Passing -O0, or omitting the switch entirely, turns off all GCC optimizations. -O1 turns on the first level of GCC optimizations. -O2 and -O3 turn on levels two and three, which add to the first level. -O attempts to reduce code size and execution time without performing optimizations that greatly increase compilation time. -Os attempts to reduce code size only [10].
In the GCC compiler, an individual optimization can be turned off by inserting "no-" between -f and the name of the optimization to be disabled. Suppose we want to disable the optimization that guesses branch probabilities; the command is:

gcc filename.c -o filename -O1 -fno-guess-branch-probability

The original name of this optimization flag is -fguess-branch-probability [10, p. 108].
In the GCC compiler, as mentioned before, there are three levels of GCC optimizations. The first level is called level one. Level one contains many optimizations that reduce code size and improve execution time. -ftree-dce is one of the level-one optimizations; it eliminates unreachable code and dead code in programs. -fif-conversion is another level-one optimization, which converts conditional jumps into branchless code. Another level-one optimization, -fguess-branch-probability, predicts branch probabilities in programs. Level two adds many optimizations to the first level, including loop transformations, mathematical optimizations and jump handling. One of the mathematical optimizations is -fpeephole, which replaces long instruction sequences with shorter ones. For example, suppose we have the following code:

a = 2;
for (i = 1; i < 10; i++)
    a += 2;

By applying the second level of optimization, the GCC compiler can replace the code above with a = 20, saving the time that would be spent executing a for loop with many iterations. Another level-two optimization, -foptimize-sibling-calls, improves recursive functions: it replaces qualifying (sibling) calls with jumps, eliminating the call overhead. In general, function calls consume much time, so removing the call overhead makes programs faster. Other level-two optimizations reduce code size: the -Os switch attempts to reduce code size, which improves memory use, and -Os applies all level-two optimizations except those that enlarge the program's code. Level three turns on all level-one and level-two optimizations plus additional ones such as -funswitch-loops, -fgcse-after-reload and -finline-functions [10, pp. 110-112].
Furthermore, there are extra optimizations called specific optimizations. One of them is -fbounds-check, which validates array indices before they are used for array access. -ffunction-cse is a specific optimization that stores function addresses in registers. Another, -finline-functions, expands simple functions into their callers. -ffloat-store is a specific optimization that prevents floating-point values from being kept in registers. Most of these specific optimizations are not enabled by -O1, -O2 or -O3; they are enabled individually by passing -f followed by the name of the optimization [10].
3.4 Our optimizations to the GCC compiler
In our implementation, we found several weaknesses in the GCC compiler; we addressed these weaknesses and built a tool that contains all of the resulting optimizations. The weaknesses we found in the GCC compiler concern the division operation, loops and recursive functions, loop re-rolling, loop unrolling, the power operation, the square root operation, and function calls. Division consumes much more time than multiplication, so our goal in that optimization technique is to replace the division operation with multiplication to make programs more efficient and faster. Functions are used in many programs, and function calls are similar to the iterations used in loops, but function calls are much more time-consuming than loop iterations; therefore, programs with repeated function calls can often be rewritten with loops to make them faster. Loop re-rolling is an optimization technique used in many compilers to optimize loops. The basic idea of loop re-rolling is to combine several loops into one loop, which reduces the number of instructions, because in general one combined loop executes fewer instructions than separate individual loops. So the main goal of this technique is to combine several loops into one whenever possible, to save time and make programs faster. We also implemented another loop optimization, loop unrolling, which is used in many compilers to improve the execution time of code containing loops. The main idea of loop unrolling is to reduce the number of loop-control instructions: many programs spend relatively little time executing the actual computation, while most of the time goes to loop overhead. The article "An aggressive approach to loop unrolling" states that many programs spend 90% of their time executing 10% of the code [11], so reducing the time spent in that small portion of the code makes programs faster. The same article states that loop unrolling can improve instruction-level parallelism and memory hierarchy locality [11]. Loop unrolling can be applied to many loops but not to all of them, because some code is difficult to unroll, such as loops over two-dimensional arrays and pointers. The basic idea of loop unrolling is to replicate the loop body several times and reduce the number of iterations, which makes programs faster. However, loop unrolling has the disadvantage of increasing code size, which affects memory and, in particular, the instruction cache. When the unrolled code overflows the instruction cache, performance suffers and the program becomes slower; this can happen when the instruction cache is small and the original code is large. Another issue in loop unrolling is choosing the unrolling factor: how many times should a given loop be unrolled? We raise this question to help programmers choose a good unrolling factor, which depends on several parameters such as the size of the instruction cache, the size of the code, and the number of instructions in the program.
In our implementation, we help the programmer find the best unrolling factor when loop unrolling applies. Our tool takes as input the size of the instruction cache, the length of an individual instruction (instruction length differs from one architecture to another), and the number of iterations of the loop. The output of our tool is the maximum safe unrolling factor. The programmer should not exceed this factor, because doing so overflows the instruction cache and degrades the performance of the program. In addition to those optimization techniques, another one optimizes the power operation. The power function in C and C++ (pow()) is used in many mathematical programs. A power operation can be replaced by multiplication, and both give the same result; we replace pow() with multiplication because the power operation is much more expensive than multiplication, so the replacement makes programs faster. Another optimization technique included in our tool targets the square root function (sqrt()), which is used in many C and C++ programs to find the square root of a number. The sqrt() function is much more expensive than division, as the results of our implementation show, so our optimization technique replaces the sqrt() call with a division operation to make programs faster. The last optimization technique we implemented in our tool is optimizing function calls: when a function call inside a for loop always returns the same value, hoisting that call out of the loop saves much time, because otherwise the function is called once per loop iteration. Therefore, our optimization technique moves such a function call out of the loop whenever possible, to make programs faster.
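The unrolling-factor tool described above can be sketched as follows. The exact formula is our illustrative assumption, not necessarily the tool's actual code: it requires that all copies of the unrolled loop body fit in the instruction cache, and never unrolls beyond the iteration count. The parameter body_instrs (the number of instructions in one copy of the loop body) is a hypothetical extra input added for the sketch.

```c
/* Sketch: a safe maximum unrolling factor, assuming the unrolled loop
   body must fit entirely in the instruction cache. */
long max_unroll_factor(long icache_bytes, long instr_len_bytes,
                       long body_instrs, long iterations) {
    long body_bytes = body_instrs * instr_len_bytes; /* size of one body copy */
    if (body_bytes <= 0 || iterations < 1)
        return 1;
    long fit = icache_bytes / body_bytes;            /* copies that fit in cache */
    if (fit < 1)
        fit = 1;                                     /* the rolled loop itself */
    if (fit > iterations)
        fit = iterations;                            /* no point past the trip count */
    return fit;
}
```

For example, with a 32 KB instruction cache, 4-byte instructions, and an 8-instruction body, the cache alone would permit a factor of 1024, so the bound comes from the loop's iteration count instead.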
3.5 Methods, analysis and results
3.5.1 Division vs. multiplication
We designed a benchmark that tests division to see whether the compiler can optimize it. We used an array populated with powers of the integers 1 to 12. We divided each element of the array by 5, ran the program for arrays of different sizes, and collected the results. We also implemented an optimized version of the benchmark: in the optimized code, we multiplied each element of the array by 0.2. The two codes give the same result; the only difference is using multiplication instead of division. In the implementation, we compiled each code in two configurations: with all GCC optimizations disabled and with all GCC optimizations enabled. The results show that division has a real impact that the GCC compiler does not remove: the average slowdown is 118%, so the GCC compiler has very limited optimization ability in this area. The lesson of this benchmark is not to rely on the compiler to remove divisions; if you have the option to replace a division with a multiplication, do so, because division degrades performance compared with multiplication. We measured time in milliseconds to determine which version is faster. The array sizes used in this optimization technique range from 1000000 to 10000000 elements; we used large arrays because the CPU is very fast, and with small arrays the measured time is zero most of the time. Measuring time in milliseconds lets us see the effect of our implementation in real time.
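The two benchmark kernels just described can be sketched as follows. The function and array names are illustrative, not the thesis's actual benchmark code, and we assume a double array so that dividing by 5 and multiplying by the reciprocal 0.2 yield the same values.

```c
#include <stddef.h>

/* Original kernel: one division per element. */
void divide_by_5(double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] / 5.0;
}

/* Optimized kernel: multiply by the reciprocal instead. */
void multiply_by_0_2(double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * 0.2;
}
```

Timing each kernel over a large array (for example with clock() from <time.h>) reproduces the comparison; the division version is expected to be slower whenever the compiler does not rewrite the division itself.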
Here are the results and charts for our implementation (time in milliseconds):
A: Division code without compiler optimization.
B: Multiplication code without compiler optimization.
C: Division code with compiler optimization.
D: Multiplication code with compiler optimization.
Table 3.1: Time in milliseconds for the Division vs. Multiplication optimization technique.
Figure 3.1: Division vs. Multiplication, milliseconds.
3.5.2 Loop and recursive function
We designed another benchmark which tests the GCC compiler to see if it can
optimize the function call. We know that function calls spend much time comparative
with the other programming techniques. In our implementation, we implemented two
benchmarks, both benchmarks calculate the factorial of a number, and we repeat this
operation many times to see the differences in terms of the performance between the two
benchmarks. The original benchmark calculates the factorial of the number using
function call, so a function will be called several times to calculate the factorial of a
number. The other benchmark which is our optimized benchmark calculates the factorial
of a number using loop instead of calling a function several times. We know that using a
loop instead of function call will optimize our program; the GCC compiler has a
weakness in optimizing this kind of optimization. We depend on time in milliseconds to
see the differences in performance between the two benchmarks. We run our programs
with different number of iterations. We run the benchmark with the iterations from
1000000 to 10000000, and we collected the time in milliseconds. We depend on time in
milliseconds in order to see clearly which benchmark is faster or perform better than the
other. The results showed that our optimized code (the code with loop) is much faster
than the code with function call that optimized with the GCC optimization. Therefore,
GCC compiler has a weakness in doing this kind of optimization. Let’s see the codes,
results and graphs.
The original code, using a recursive function:

int tail(int factorial)
{
    if (factorial > 1)
    {
        result = result * factorial;   /* result is a global, assumed initialized to 1 */
        tail(factorial - 1);
    }
    return result;
}
Our optimized code, using a loop instead of calling the function each time:

int tail(int factorial)
{
    temp = factorial;                  /* temp, i and result are globals, as above */
    for (i = 1; i < temp; i++)
    {
        result = result * factorial;
        factorial = factorial - 1;
    }
    return result;
}
A: The original recursive function code without GCC optimization.
B: Our optimized “For Loop” code without GCC optimization.
C: The original recursive function code with GCC optimization.
D: Our optimized “For Loop” code with GCC optimization.
Table 3.2: Time in milliseconds for the Loop and recursive function optimization technique.
Figure 3.2: Loop and recursive function, time in milliseconds.
3.5.3 Loop re-rolling
We designed another benchmark that tests the GCC compiler to see if it can
optimize loops using loop re-rolling optimization technique. This optimization designed
to work with “For loop”. There are two benchmarks; one of them is the original
benchmark, and the other one is our optimized benchmark. The two benchmarks are
doing the same job; they calculate square numbers and cubic numbers. We used three
arrays which are load_store_array, quadratic_ array and cubic_array. Load_store_array
initializes randomly with integers form 0 to 9. After this array is populated, two more
arrays are built. The first benchmark which is the original benchmark calculates square
numbers and cubic numbers in two separate loops. While the other benchmark calculates
square numbers and cubic numbers using one loop. We developed these two benchmarks
to test GCC compiler because we know that loop re-rolling optimization is very
important technique used by many compilers for optimizing loops. In our implementation, we measured time in milliseconds to determine which benchmark is faster. The time is measured only for the part of the code we want to optimize, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with sizes from 10 to 1000. In this optimization technique, our optimized code is approximately 40% faster than the original code; from this percentage, we can see that the GCC compiler has a significant weakness in loop re-rolling. The codes, results and graphs follow.
The original code:
for(i=0;i<size_of_array;i++)
{
quadratic_array[i]=load_store_array[i]*load_store_array[i];
}
for (i=0;i<size_of_array;i++)
{
cubic_array[i]=load_store_array[i]*load_store_array[i]*load_store_array[i];
}
Our optimized code:
for (i=0;i<size_of_array;i++)
{
quadratic_array[i]=load_store_array[i]*load_store_array[i];
cubic_array[i]=load_store_array[i]*load_store_array[i]*load_store_array[i];
}
A: The original code without GCC optimization.
B: The original code with GCC optimization.
C: Our optimized code without GCC optimization.
D: Our optimized code with GCC optimization.
Table 3.3: Time in milliseconds for the Loop Re-Rolling optimization technique.
Figure 3.3: Loop Re-Rolling, milliseconds.
3.5.4 Loop unrolling
We designed another benchmark to test whether the GCC compiler can optimize loops using the loop unrolling technique. We did this optimization because loops are used widely in many programs, so optimizing loops can significantly improve many such programs. There are two benchmarks in our implementation: the original benchmark and our optimized benchmark. Both do the same job with different strategies. In the original benchmark, we multiply each number in an array by itself and store the result, using a for loop whose iteration count equals the size of the array. In our optimized benchmark, we unrolled the loop: we repeated the loop body several times to reduce the number of iterations, and we know that reducing the number of iterations speeds up the benchmark. In our implementation the loop body is repeated six times, so the number of iterations is reduced by roughly a factor of six; for example, if the array size is 120, the loop executes 20 iterations instead of 120. We measured time in seconds, and only for the part of the code being optimized, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with array sizes from 10000000 to 900000000. The results show that this optimization technique is more effective than the previous ones, because our optimized benchmark without GCC optimization is faster than the original benchmark with GCC optimizations enabled. The codes, results and graphs follow.
The original code:
for ( i = 0; i < size_of_array; i++ ) {
test_array[i] = test_array[i]*test_array[i];
}
Our optimized code:
for ( i = 0; i < size_of_array; i += 6 ) {   /* assumes size_of_array is a multiple of 6 */
    test_array[i]   = test_array[i]*test_array[i];
    test_array[i+1] = test_array[i+1]*test_array[i+1];
    test_array[i+2] = test_array[i+2]*test_array[i+2];
    test_array[i+3] = test_array[i+3]*test_array[i+3];
    test_array[i+4] = test_array[i+4]*test_array[i+4];
    test_array[i+5] = test_array[i+5]*test_array[i+5];
}
A: The original code without GCC optimizations.
B: The original code with GCC optimizations
C: Our optimized code (Unrolling code) without GCC optimization.
D: Our optimized code (Unrolling code) with GCC optimizations.
Table 3.4: Time in seconds for the Loop Unrolling optimization technique.
Figure 3.4: Loop Unrolling, seconds.
3.5.5 Power vs. multiplication
We designed another benchmark to test whether the GCC compiler can optimize the power operation. The pow() function is slower than plain multiplication because extra time is spent calling the function and executing its subroutine, in addition to the multiplication itself. We designed this benchmark because we know that multiplication is much faster than the power operation, and the power operation is widely used in mathematical programs, so optimizing it will help improve many programs. There are two benchmarks: the original benchmark and our optimized benchmark. Both do the same job: they square the numbers in an array. We used one array, test_array[], initialized randomly with different integers; the result is stored back into the same array. The original benchmark squares every element of test_array[] using the pow() function, while the optimized benchmark uses a multiplication operation instead. We measured time in seconds, and only for the part of the code we want to optimize, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with sizes from 10000000 to 90000000. This optimization technique is more effective than the previous ones, because our optimized benchmark (the multiplication version) without GCC optimization is faster than the original benchmark with GCC optimizations enabled. The codes, results and graphs follow.
The original Power code:
for ( i = 0; i < size_of_array; i++ ) {
test_array[i] = pow(test_array[i],2);
}
Our optimized code:
for ( i = 0; i < size_of_array; i++ ) {
test_array[i] = test_array[i]*test_array[i]; }
A: The original power code without GCC optimizations.
B: Our optimized code (multiplication code) without GCC optimization.
C: The original power code with GCC optimizations.
D: Our optimized code (multiplication code) with GCC optimizations.
Table 3.5: Time in seconds for the Power vs. Multiplication optimization technique.
Figure 3.5: Power vs. Multiplication, seconds.
3.5.6 SQRT function vs. division
We designed another benchmark to test whether the GCC compiler can optimize the sqrt() function, which is used in C to calculate square roots. We compared sqrt() with division because sqrt() takes much more time than a division. sqrt() is used in many mathematical benchmarks, so optimizing it speeds up those programs. In this implementation, we wrote two codes that do the same job with different mechanisms. The original benchmark uses the sqrt() function, while our optimized benchmark uses division: in the original benchmark we multiplied each number in the array by sqrt(Number1), while in the optimized code we multiplied each number by Number1/Number2, with Number2 chosen so that the quotient equals the square root. We used different numbers, and the results are almost the same, because the main point of this implementation is to show that sqrt() spends extra time on the function call, while the optimized benchmark performs only a division. In both codes, the sqrt() call or the division is repeated once per iteration. We measured time in seconds, and only for the part of the code we want to optimize, not the whole program, which gives a clear picture of performance. The GCC compiler cannot do this kind of optimization: when we tested both programs, our optimized code was clearly much faster than the original code. We ran both benchmarks with iteration counts from 10000000 to 100000000. This optimization technique is more effective than the previous ones, because our optimized benchmark without GCC optimization is faster than the original benchmark with GCC optimizations enabled. The results and graphs follow.
A: Sqrt() function code without GCC optimization.
B: Our optimized division code without GCC optimization.
C: Sqrt() function code with GCC optimization.
D: Our optimized division code with GCC optimization.
Table 3.6: Time in seconds for the SQRT function vs. Division optimization technique.
Figure 3.6: SQRT function vs. Division, seconds.
3.5.7 The cost of the function call inside and outside loops
We designed a benchmark that tests whether a compiler can optimize a function
call inside a loop. This optimization is designed to work with "for" loops. The original
code calls the strlen() function, which returns the length of a string, inside a "for" loop.
We passed this code to the GCC compiler to see whether it could optimize it. We
developed this benchmark because calling a function many times costs significant time.
We also optimized the benchmark manually and passed both versions to the compiler
to see which optimizes better: the GCC compiler or our manual transformation. In our
optimization, we moved the strlen() call from inside the loop to outside it, so that
instead of being called on every iteration of the "for" loop, it is called only once and its
result is used as a constant inside the loop. This transformation saves much time and
makes the program faster. The GCC compiler cannot perform this kind of optimization:
when we tested both programs (the original code and our optimized code), our
optimized code was clearly much faster than the original. We measured time in
milliseconds, and only for the part of the code we want to optimize rather than for the
whole program, which gives a clear indication of performance. With this technique, our
optimized code is approximately four times faster than the original code. From this
ratio, we can see that the GCC compiler has a significant weakness here. The codes,
results and graphs follow.
This is the code that we passed to the GCC compiler to optimize.
unsigned sum(const unsigned char *s) {
    // x[] is an array initialized randomly with integer numbers less than 100.
    unsigned result = 0;
    for (size_t i = 0; i < strlen((const char *)s); i++) {
        result += x[i];
    }
    return result;
}
This is our optimized code, which we transformed manually.
unsigned sum(const unsigned char *s) {
    unsigned result = 0;
    size_t length = strlen((const char *)s);
    for (size_t i = 0; i < length; i++) {
        result += x[i];
    }
    return result;
}
A: The original function calls in a loop code without GCC optimizations.
B: Function calls out of loop code without GCC optimizations (our optimized code).
C: The original function calls in a loop code with GCC optimizations.
D: Function calls out of loop code with GCC optimizations (our optimized code).
These results show the difference between the code optimized by the GCC compiler
and our optimized code, in milliseconds.
Table 3. 7: Time in milliseconds for the cost of the function call in and out the loop
optimization technique
Figure 3. 7 Function calls in and out of the loop: run time in milliseconds vs. array size (1,000,000 to 10,000,000; series A, B, C, D)
3.6 Automatic parallelization in compilers
When programs have begun to execute in parallel, programmers need a compiler
to take care of parallelization. The basic idea of parallelization is to distribute the job
among processors in order to complete big work in a significant small amount of time.
Today, computers have more than one central processing unit, so to take advantage from
those CPUs, they can collaborate together to finish their job. There are two types of
parallelism which are shared and distributed memory. The main difference between
shared and distributed memory is how the communication occurs between them. In
shared memory environment, threads are communicating by each other through reads and
writes while in distributed memory environment, processes are communicating by each
other through the messages [1] [2].
There is also an architecture that mixes the previous two types, called the mixed-type
multiprocessor architecture; it combines the shared memory environment with the
distributed memory environment.
(a): Shared memory multiprocessor: processors (P) connected through a switch (S) to shared memories (M) [4]
(b): Distributed memory multiprocessor: processor-memory pairs (P-M) connected through a switch (S) [4]
In a shared memory environment, processors share the same memory, while in a
distributed memory environment, each processor has its own memory. In a mixed-type
architecture, threads may share the same memory and may also communicate with
other physical locations through messages, as in a distributed memory architecture.
Parallel processing can complete big jobs in a very small amount of time by
distributing the work between processors. Each processor finishes its part of the work,
and at the end all processors return their results to the master. The master processor,
or process, is in charge of creating the worker processes, collecting their results, and
finally terminating them. To get good results from parallel processing, the programmer
or user should balance the work between processors when distributing it.
Many programs can be parallelized perfectly without difficulty or further work, but
other programs need extra work, such as synchronization [4]. We need synchronization
when two processes access the same data; in this situation, the programmer must
resolve the conflicts between the processes before implementing the parallelism.
Synchronization costs much time, both because it limits parallelism and because of
the time spent in the synchronization operations themselves. For instance, when many
processes need to access the same value, we need synchronization techniques so that
the value is accessed by only one process at a time. Several operations are used in
synchronization, such as produce and consume. Produce sets the bit to full if it is
empty, while consume waits for the bit to be full, then reads the value and sets the bit
to empty [4]. With this method, conflicting accesses to the same value cannot happen,
because only one process at a time can access it.
There is another problem that makes some programs harder or impossible to
parallelize: dependencies. Dependencies are one of the main obstacles facing parallel
processing. The basic idea is that some computations depend on the value produced
by a previous one; if all processes run at the same time, we get wrong results. There
are several kinds of dependencies, such as read after write, write after read, write after
write and read after read. Dependencies limit our ability to parallelize programs, and
even when we parallelize them anyway, they become slower. In some programs with
few dependencies, we can keep one or more sequential sections and parallelize the
rest of the code. Sometimes the whole program, including its results, is dependent on
itself: for instance, when each result at every level of a program is used in the next
level, it is very difficult or impossible to parallelize. Deciding which sections of a
program can be parallelized depends on the programmer, who looks at the program
and sees which sections of the code have dependencies and are difficult to parallelize,
and which do not. Dependencies can also be detected by compilers. Some compilers
are built to work with the parallel environment, so they have the ability to detect many
kinds of dependencies in programs. One compiler with very powerful dependence
detection techniques is the Intel compiler, which can analyze loops and determine
which loops have dependencies and are hard to parallelize [5].
3.7 Automatic parallelization in GCC
GCC compiler is one of the compilers that support automatic parallelization.
GCC can parallelize loops, but it has some limitations in automatic parallelization. One
of the limitations in automatic parallelization is that GCC cannot detect dependencies in
loops, so it can parallelize loops without dependencies. Also, there is another limitation in
GCC compiler which is GCC parallelize innermost loops only, it cannot parallelize outer
loops. Now, GCC compiler parallelizes every loop without dependency, so there is no
strategy to determine which loop can be parallelized or not depending on performance
issues [3].
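For reference, GCC's auto-parallelizer is not enabled by default at -O2 or -O3; it must be requested explicitly. A typical invocation (the file name here is hypothetical) looks like:

```shell
# Split parallelizable loop iteration spaces across 4 threads.
# -ftree-parallelize-loops=N is GCC's automatic parallelization flag.
gcc -O2 -ftree-parallelize-loops=4 loops.c -o loops
```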
Our improvement to the GCC compiler is a tool that addresses the weaknesses
in its automatic parallelization. Our tool can detect dependencies in most loops. The
input of our tool is a C or C++ program, and the output is a message saying either that
the loop can safely be parallelized or that there are dependencies and it is difficult to
parallelize it. In our tool, we used the idea of the GCD test, which detects most of the
dependencies among array references in loops. The GCD test works as follows:
1- We consider only the subscript expressions of the arrays used in the loop to
calculate the GCD.
2- We assign the values in the subscripts to variables. For instance, in A[2i+3] =
A[2i-1]*B+C we have a=2, b=3, c=2 and d=-1; thus a=2 and b=3 represent
A[2i+3], while c=2 and d=-1 represent A[2i-1].
3- We find the GCD of a and c; in the previous example, GCD(2, 2) is 2.
4- We find the result of (b-d); in the above example, (3-(-1)) = 4.
Since GCD(a, c) = 2 divides (b-d) = 4, there may be dependencies in our example, and
we can confirm this result by following the iterations. Suppose we have the following
example with 6 iterations:
A[2i+3] = A[2i-1]
i=0 A[3]=A[-1]
i=1 A[5]=A[1]
i=2 A[7]=A[3]
i=3 A[9]=A[5]
i=4 A[11]=A[7]
i=5 A[13]=A[9]
From this listing we can clearly see that iteration 2 depends on iteration 0, iteration 3
depends on the result of iteration 1, and so on.
For more clarification of the GCD test, suppose we have the following example:
for (………………………..)
A[2i] = A[2i-1]*B+C
end for
From this example, we can set the values for the GCD test as follows:
a=2 b=0
c=2 d=-1
GCD(a, c) = 2
(d-b) = -1
Since GCD(a, c) = 2 does not divide (d-b) = -1 [4], there are no dependencies in our
example. Therefore, by using the GCD test, we can find dependencies in programs
without manually following the iterations of loops.
Therefore, in our tool we implemented the idea of the GCD test. Our tool takes a
C or C++ program as input and prints a message saying either that a loop has
dependencies or that there are no possible dependencies in the program. This tool is
designed to help the user or programmer discover dependencies in programs
compiled by the GCC compiler.
3.9 Optimizing real C programs by our tool
After we built our tool that has seven optimization techniques (Division, recursive
function, loop re-rolling, loop unrolling, power operation, SQRT function and recursive
function in a loop), we tested our tool to improve real C programs such as Strassen,
Bubble sort and Selection sort.
3.9.1 Strassen optimization
Strassen is a C program can do matrix multiply faster than the normal way of
matrix multiplication. The C program was passed as input to our tool. We optimized the
C program in several ways. The first optimizing way is by using GCC compiler only. The
second optimizing way is by using our tool only, and the third optimizing way is by using
both our tool and GCC compiler. Our tool detected several optimization techniques in
Strassen.cpp which are not optimized by the GCC compiler. Our tool found that several
divisions in the program can be replaced by multiplications to make the program run
faster. Also, our tool found that two loops in Strassen.cpp can be unrolled to make the
program run faster. So, our tool gave hints or messages showing that these optimizations
can improve the program. After we optimized Strassen.cpp by our tool, we run three
programs which are the original Strassen.cpp, Strassen.cpp optimized by GCC compiler
only and Strassen.cpp optimized by both our tool and GCC compiler. We saw an
interesting result which is the optimized program by our tool only is faster than the
74
original program, and also faster than the program optimized by GCC compiler. In
addition, when we run the program that optimized by our tool and GCC compiler, we got
much speedup.
Results
A: The code without both GCC optimization and our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
The runtime for our programs:
Table 3. 8: Running time in seconds for the Strassen optimization
Array size Strassen A Strassen B Strassen C Strassen D
256 0.62 0.47 0.29 0.15
512 4.56 3.32 2.09 1.07
1024 41.38 33.20 27.65 23.17
2048 371.49 321.51 217.96 204.08
4096 3341.04 2945.98 2349.21 2222.88
Figure 3. 8 Strassen Optimization
3.9.2 Bubble sort optimization
Bubble sort is a very well-known sorting algorithm used to sort numbers.
BubbleSort.cpp is the original C++ program that used in our implementation, and
OptimizedBubbleSort.cpp is a C++ program that optimized by our tool. The Bubble.cpp
program was passed as input to our tool. We optimized the C++ program in several ways.
The first optimizing way is by using GCC compiler only. The second optimizing way is
by using our tool only, and the third optimizing way is by using both our tool and GCC
compiler. Our tool detected several optimization techniques in the BubbleSort.cpp which
are not optimized by the GCC compiler. Our tool found that several loops can be unrolled
several times to make BubbleSort.cpp faster. Our tool gave hints indicating that these
optimizations can optimize the program. After we optimized the BubbleSort.cpp program
manually, then we collected the running time results, we found that the optimized
0
500
1000
1500
2000
2500
3000
3500
4000
256 512 1024 2048 4096
Ru
nti
me
Array size
Strassen implementation
Strassen A
Strassen B
Strassen C
Strassen D
76
program by our tool only (OptimizedBubbleSort.cpp) is faster than the original program
BubbleSort.cpp, and also the program that optimized by our tool and the GCC compiler is
much faster than the original program that optimized by the GCC compiler only.
Results and graphs
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
The runtime for our programs:
Table 3. 9: Running time in seconds for the Bubble Sort optimization
Array size BubbleSort A BubbleSort B BubbleSort C BubbleSort D
10000 0.64 0.28 0.52 0.17
20000 2.63 1.89 2.13 0.83
40000 10.45 6.68 8.62 3.59
80000 42.58 28.38 34.52 13.42
100000 65.55 43.80 54.04 20.90
Figure 3. 9 Bubble Sort optimization
3.9.3 Selection sort optimization
Selection sort is one of the sorting algorithms that used to order numbers.
SelectionSort.cpp is the original C++ program that used in our implementation, and
OptimizedSelectionSort.cpp is a C++ program that optimized by our tool. The
SelectionSort.cpp program was passed as input to our tool. We optimized the C++
program in several ways. The first optimizing way is by using GCC compiler only. The
second optimizing way is by using our tool only, and the third optimizing way is by using
both our tool and GCC compiler. Our tool detected several optimization techniques in the
SelectionSort.cpp which are not optimized by the GCC compiler. Our tool found that
several loops can be unrolled several times to make SelectionSort.cpp faster. After we
optimized the SelectionSort.cpp program manually, then we collected the results, we
0
10
20
30
40
50
60
70
10000 20000 40000 80000 100000
Ru
nTi
me
Array Size
Bubble Sort Optimization
BubbleSort A
BubbleSort B
BubbleSort C
BubbleSort D
78
found that the optimized program by our tool only (OptimizedSelectionSort.cpp) is faster
than the original program SelectionSort.cpp, and also the program that optimized by our
tool and the GCC compiler is much faster than the original program that optimized by the
GCC compiler only. Our tool in the programs Strassen.cpp, BubbleSort.cpp and
SelectionSort.cpp worked very well, and it added much speedup to the programs that
optimized by the GCC compiler only.
Therefore, our tool works very well together with the GCC compiler to make
programs faster. Developers who use the GCC compiler to optimize their code can use
our tool together with it to get fast and efficient results. We carried out this
implementation on real C code in order to see the effectiveness of our tool when
developers want to optimize real software.
Results and graphs
A: The code without GCC optimization and without our tool optimization.
B: The code with GCC optimization but without our tool optimization.
C: The code without GCC optimization but with our tool optimization.
D: The code with GCC optimization and with our tool optimization.
Table 3. 10: Time in seconds for the Selection Sort optimization
Array size SelectionSort A SelectionSort B SelectionSort C SelectionSort D
100000 33.60 11.64 14.50 6.31
200000 135.65 46.87 59.34 25.62
400000 544.67 188.83 237.86 104.12
800000 2197.39 751.43 959.02 428.39
1000000 3539.38 1361.53 1600 862.97
Figure 3. 10 Selection sort optimization
4. Conclusion
In our thesis, we have presented several ways to help developers write more
efficient code. Writing good software depends on the architecture specifications and
the compiler. We researched both, and we introduced tools to help developers write
well-optimized code. Regarding the architecture specifications, developers should fully
understand the architecture of the specific hardware they target; not considering it can
make programs run slower. In chapter II of our thesis, we designed intentionally
de-optimized benchmarks to expose the strengths and weaknesses of the underlying
computer architecture and to show the effects of ignoring architecture specifications
when programs are written. The de-optimizations show, convincingly, that ignoring
architecture specifications when writing software can be very costly indeed. Most of the
de-optimizations had effects greater than 25% of running time, and some had effects
up to 150%. These are not trivial slowdowns. They show that prioritizing aesthetics,
portability, and similar considerations ahead of the hardware can result in significant
slowdowns. It is striking when one considers that many of the de-optimizations we
implemented look just like code that one sees every day in business and academic
environments. Moreover, much of it was compiled straightforwardly, i.e. the compiler
did not resolve the issues during compilation. The results of chapter II should, at the
very least, make software developers reconsider many of their accumulated habits.
Sometimes the best "looking" code may perform the worst.
Compilers have very powerful techniques for optimizing code, so writing good
software depends not only on the skills of programmers and on architecture
specifications, but also on the power of the compiler. One of the compiler's tasks is
code optimization: compilers can make programs run faster, work more efficiently and
also reduce code size. Therefore, in order to write optimized software, developers
should choose a powerful compiler to optimize their code. The GCC compiler is one of
the well-known compilers used by many developers. In chapter III of our thesis, we
researched the weaknesses of the GCC compiler. Several weaknesses were found,
and we then built several optimization techniques: Division vs. Multiplication, Loop and
Recursive Function, Loop Re-rolling, Loop Unrolling, Power vs. Multiplication, SQRT
Function vs. Division and the cost of the function call inside and outside loops. To
make these optimizations easier to use, we built a tool that helps developers optimize
their code manually: the tool gives users a message indicating whether or not their
programs need optimizing. We built this tool because in some circumstances the GCC
compiler cannot fully optimize code; optimizing the code by hand will then make
programs run faster.
The GCC compiler has another weakness: its automatic parallelization can only
parallelize loops that have no dependencies. Therefore, we built a second tool that
performs dependence detection. It can detect most of the dependencies in programs
and then give the user a message indicating whether or not a particular program has
dependencies. The main goal of chapter III was to build tools that address the
weaknesses of the GCC compiler and help programmers write more efficient code.
To make sure our tools can optimize real C or C++ applications, we tested our
tool on real code: Strassen.cpp, BubbleSort.cpp and SelectionSort.cpp. Strassen.cpp
is a C program that performs matrix multiplication faster. The optimization of
Strassen.cpp was successfully implemented, and we found that our optimized code
without GCC optimization is faster than the original Strassen.cpp with GCC
optimizations. Moreover, we optimized BubbleSort.cpp: the optimization was
implemented successfully, and the Bubble Sort optimized by our tool is much faster
than the Bubble Sort optimized by the GCC compiler only. We also successfully
optimized the Selection Sort algorithm, and again the program optimized by our tool is
much faster than the program optimized by the GCC compiler. Therefore, our tool is a
very powerful tool that can work with the GCC compiler to make programs faster and
more efficient. The main goal of our thesis is to help developers write more efficient
code, by showing that fully understanding the architecture specifications leads to
efficient code and that improving the weaknesses of the GCC compiler makes
programs run faster.
REFERENCES
[1] Midkiff, S. P. (2012). Automatic Parallelization: An Overview of Fundamental
Compiler Techniques (Vol. 7, pp. 3-5). Morgan & Claypool. Retrieved June 6, 2013,
from http://0-www.morganclaypool.com.skyline.ucdenver.edu/doi/abs/10.2200/S00340ED1V01Y201201CAC019
[2] Adve, S. V. and Gharachorloo, K., "Shared memory consistency models: a
tutorial," Computer, vol. 29, no. 12, pp. 66-76, Dec 1996. doi: 10.1109/2.546611.
URL: http://0-ieeexplore.ieee.org.skyline.ucdenver.edu/stamp/stamp.jsp?tp=&arnumber=546611&isnumber=11956
[3] Stallman, R. 2013. Using the GNU Compiler Collection. Boston, USA: GNU Press.
[4] Jordan, H. and Alaghband, G. 2003. Fundamentals of parallel processing. Upper
Saddle River, NJ: Prentice Hall/Pearson Education.
[5] Software.intel.com. 1999. Automatic Parallelization with Intel® Compilers | Intel®
Developer Zone. [online] Available at: http://software.intel.com/en-us/articles/automatic-
parallelization-with-Intel-compilers [Accessed: 13 Jun 2013].
[6] Wilhelm, R. and Seidl, H. 2010. Compiler design. Heidelberg: Springer.
[7] Mogensen, T. 2009. Basics of compiler design. [Kbh.]: Torben Ægidius Mogensen.
[8] Griffith, A. 2002. GCC, the complete reference. New York: McGraw-Hill/Osborne.
[9] Gcc.gnu.org. 2013. GCC, the GNU Compiler Collection- GNU Project - Free
Software Foundation (FSF). [online] Available at: http://gcc.gnu.org/ [Accessed: 13 Jun
2013].
[10] Von Hagen, W. 2006. The definitive guide to GCC. Berkeley, CA: Apress.
[11] Jack W. Davidson and Sanjay Jinturkar. 2001. An Aggressive Approach to Loop
Unrolling. Technical Report. University of Virginia, Charlottesville, VA, USA.
[12] AMD64 Technology. Software Optimization Guide for AMD64 Processors. 2005.