Sifting out the Mud: Low Level C++ Code Reuse

Bjorn De Sutter    Bruno De Bus    Koen De Bosschere
Electronics and Information Systems Department

Ghent University, Belgium
{brdsutte,bdebus,kdb}@elis.rug.ac.be

ABSTRACT

More and more computers are being incorporated in devices where the available amount of memory is limited. This contrasts with the increasing need for additional functionality and the need for rapid application development. While object-oriented programming languages, providing mechanisms such as inheritance and templates, allow fast development of complex applications, they have a detrimental effect on program size. This paper introduces new techniques to reuse the code of whole procedures at the binary level and a supporting technique for data reuse. These techniques benefit specifically from program properties originating from the use of templates and inheritance. Together with our previous work on code abstraction at lower levels of granularity, they achieve additional code size reductions of up to 38% on already highly optimized and compacted binaries, without sacrificing execution speed. We have incorporated these techniques in Squeeze++, a prototype link-time binary rewriter for the Alpha architecture, and extensively evaluate them on a suite of 8 real-life C++ applications. The total code size reductions achieved post link-time (i.e. without requiring any change to the compiler) range from 27 to 70%, averaging at around 43%.

Categories and Subject Descriptors

D.3.2 [Programming Languages]: Language Classifications—C++; D.3.4 [Programming Languages]: Processors—code generation; compilers; optimization; E.4 [Coding and Information Theory]: Data Compaction and Compression—program representation

General Terms

Experimentation, Performance

Keywords

Code compaction, code size reduction

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
OOPSLA'02, November 4-8, 2002, Seattle, Washington, USA.
Copyright 2002 ACM 1-58113-417-1/02/0011 ...$5.00.

1. INTRODUCTION

More and more computers are being incorporated in devices where the available amount of memory is limited, such as PDAs, set-top boxes, wearables, and mobile and embedded systems in general. The limitations on memory size result from considerations such as space, weight, power consumption and production cost.

At the same time, ever more sophisticated applications have to be executed on these devices, such as encryption and speech recognition, often accompanied by all kinds of eye-candy and fancy GUIs. These applications have to be developed in shorter and shorter design cycles. More complex applications, i.e. applications providing more functionality, generally mean larger applications. Additional functionality is however not the only reason why applications are becoming bigger. Another important reason is the use of modern software engineering techniques, such as OO-frameworks and component-based development, where generic building blocks are used to tackle the complexity problem. These building blocks are primarily developed with reusability and generality in mind.

An application developer often uses only part of a component or a library, but because of the complex relations between these building blocks and the straightforward way linkers build a program (only using symbolic references), the linker often links a lot of redundant code and data into an application. Even useful code that is linked with the application will often involve some superfluous instructions, since it was not programmed with that specific application context in mind.

During the last decade, the creation of smaller programs using compaction and compression techniques was extensively researched. The difference between the two categories is that, while compressed programs need to be decompressed before being executed, compacted programs are directly executable.

On the hardware side, the techniques used range from architectural design to hardware-supported dynamic decompression. Examples are the design and widespread use of the code-size-efficient Thumb ISA [36] and dynamic decompression when code is loaded into higher levels of the memory hierarchy [22, 35].

On the software side, a wide range of techniques has been developed. Whole-program analysis and code extraction avoid to some extent the linking of redundant code with a program [1, 32, 33]. Application-specific, ultra-compact instruction sets are interpreted [18, 19], and frequently repeated instruction sequences within a program are identified and abstracted into procedures [6, 17] or macro-operations [4]. Dynamic decompression is done in software [9] as well. Charles Lefurgy's PhD thesis [24] provides an excellent overview of these techniques.

As object-oriented programming languages (OOPL) encourage the programmer to develop and use reusable code, it is no surprise that the average application written in an OOPL contains a considerable amount of redundant code. (For a set of Java programs, Tip and Palsberg [34] on average found more than 40% of all methods to be unreachable.) Besides this, the facilities provided by OOPL to develop reusable code at the source code level, such as template instantiation and class inheritance, often result in duplicate code fragments at the assembly level, thus needlessly increasing program size again. For each different template instantiation, e.g., a separate specialization or low-level implementation is generated. While at the source code level the instantiations have different types, their low-level implementations are often very similar. Pointers to different types of objects, e.g., have different types at the source code level, but are all simple addresses at the lower level.

Because of the high overhead in code size, the use of modern software engineering techniques and OOPL is often not even considered viable for embedded or portable applications. One can ask whether the limitations on complexity and offered functionality of these applications result from hardware limitations (memory size, power consumption, etc.) or from the overhead introduced when trying to manage the complexity using modern software engineering techniques. This question and the problems with code size have led to the development of Embedded C++ [11], a subset of the C++ language from which features such as templates that lead to code bloat have been removed.

The techniques introduced and discussed in this paper address the mentioned code bloat directly. Without sacrificing language features (and without requiring any changes to compilers), programs are produced that are both significantly smaller and at the same time faster. As these programs are still directly executable, all forms of run-time optimization and dynamic (de)compression techniques can still be applied to them. This is all achieved by aggressive whole-program optimization and extensive code reuse after a program has been linked. The achieved code size reductions range from 27 to 70%, averaging around 43%. From a commercial point of view, this work results in somewhat less than a halving of the code size requirement, which under otherwise identical conditions (e.g. a fixed maximum die area available for ROM) means getting to market about a year and a half earlier.

In the whole picture we have sketched so far, the linker, albeit a very important component in any software development environment, is largely understudied. Several techniques are being used to avoid linking multiple identical specializations into a program. These techniques include incremental linking and the use of repositories [25], which rely on compiler-linker cooperation. They are based on the forwarding of extra information by the compiler to the linker or on feedback from the linker to the compiler. However, these techniques do not at all address the possible reuse of nearly identical code fragments, or of code fragments that are identical only at the assembly level, but not at the source level.

In the past we have proposed applying code and data compaction in a link-time binary rewriter named Squeeze. We have demonstrated how aggressive whole-program optimization [10] and combined redundant data and code elimination [7] at link-time (or more precisely post link-time) can effectively reduce the size of programs. It was no surprise that these techniques, and especially the elimination of unreachable code by detecting statically allocated data (including procedure pointers) that is dead, performed much better on C++ programs than on C programs. This follows from the discussion above. In [10] we also discussed code abstraction at low levels of granularity: code regions and basic blocks. These code abstraction techniques were evaluated on a set of C programs, and resulted in a significant, but rather limited, additional code compaction.

stk.h:

    template <class T> class Stk {
    private:
        static int total_space;
        T* stk;
        int size;
        int elements;
        T* top;
    public:
        Stk(void);
        void Push(T elem);
        T Pop(void);
        ~Stk(void);
    };

stk.cpp:

    #include <stdlib.h>   /* for realloc(); missing from the original listing */
    #include "stk.h"

    template <class T> T Stk<T>::Pop(void) {
        if (elements == size) {
            size -= 10;
            total_space -= 10;
            stk = (T*) realloc(stk, size * sizeof(T));
            top = &(stk[elements]);
        }
        elements--;
        return *--top;
    }

main.cpp:

    #include <stdio.h>
    #include "stk.h"

    int main() {
        Stk<long int> x;
        Stk<int> y;
        x.Push(1000L);
        y.Push(1000);
        printf("%ld %d\n", x.Pop(), y.Pop());
        return 0;
    }

Figure 1: Example C++ code.


Today we present Squeeze++, our latest prototype link-time program compactor. Its name was chosen not only because it is a better Squeeze, but because it incorporates a number of new code abstraction and parameterization techniques specifically aimed at reusing whole identical or nearly identical procedures, as typically found in programs written in object-oriented languages such as C++. These new code reuse techniques are introduced in this paper, and our whole range of previous work on code abstraction is reevaluated for a number of C++ programs, showing that code abstraction and parameterization is a very powerful technique for additional code size reduction for C++ programs (up to 38%) on top of aggressive post link-time program optimization. This results in a total link-time code size reduction of up to 70%, which is considerably more than what we previously achieved for C or Fortran programs [10], thus addressing the problems of code bloat and program size where they are most needed: where (redundant) code is most reused.

The remainder of this paper is organized as follows. In section 2 a motivating example is studied. The new code reuse techniques are discussed in section 3. Our previous work on code abstraction techniques is discussed in section 4; some new insights and enhancements are discussed as well. An extensive evaluation of the discussed techniques is made in section 5. Related work is discussed in section 6, after which our conclusions are drawn in the last section.

2. MOTIVATING EXAMPLE

We have depicted some C++ code fragments in Figure 1. This code is not from a real-world application. It serves our purpose however, as it contains some typical code aspects found in real-world applications.

A template class Stk<T> implements a stack data structure for different types of objects of type T. The elements on the stack are stored in a private array stk with size elements. The top data member points to the top element of the stack. The number of elements on the stack is stored in elements. The array stk is dynamically allocated. It grows and shrinks in chunks of 10 elements, as objects are pushed or popped. The private static data member total_space is a class member that records the total memory size for one type of stack. Its only purpose is to serve our discussion (although it might be useful for debugging purposes). In the main program, two stacks are created, one for objects of type int and one for objects of type long int.

In Figure 2 we have depicted the control flow graphs (CFGs) and assembly code for the two generated instances of the Pop() method. This code was generated using the Compaq C++ V6.3-002 compiler for Compaq Tru64 UNIX V5.1 on the Alpha platform. The CFGs are generated from the internal representation of the program after Squeeze++ has compacted the program. Inlining was not applied, since we want to depict these methods isolated from their calling contexts for the purpose of our example. Neither have we applied code abstraction, as we want to show where opportunities for code abstraction and parameterization are found. We have depicted assembly code with more generally understandable mnemonics instead of the Alpha mnemonics, although there is no need to understand this code to understand the discussion of this example.

These CFGs and assembly code obviously are very similar to one another. There are some differences however, indicated by the use of a bold and italic typeface. These differences have two origins:

1. A first difference has to do with the static total_space data member. Each of the specializations of the Stk class has a separate instance of this class member. In the final program, these two instances are stored in the statically allocated data. They are allocated at different locations however, so the locations from which their values are loaded in the Pop method differ. These different locations result in different indices (or relative addresses) in the instructions that access them (index 0x2280 in int Stk<int>::Pop() and index 0x2270 in long Stk<long>::Pop()).

More generally, all statically allocated objects for some class will have different addresses for each specialization. This is not just the case for class members, but also for, e.g., stored initialization values of local variables in methods (even if these values are the same for all specializations).

2. The second difference results from the different sizes of the objects stored on the stack. On our 64-bit target architecture, int variables occupy 4 bytes, while long int variables occupy 8 bytes. Different instructions are used to load and store these values, corresponding to their respective word widths (e.g. load int vs. load long). The pointer arithmetic also differs: indexing is done using different instructions (add int index vs. add long index) or different relative addresses (0x0004 vs. 0x0008).

With this motivating example, we have shown how very similar the assembly code of different specializations of a template can be, and what some of the most important reasons for differences are. A third origin of differences is often found at calls to methods that depend on the type of object the template is specialized for: a sorting method of some container class, e.g., often calls the compare method of the class it was specialized for. Specializations for different classes will usually have exactly the same sorting routine at the assembly level, except for the call to the compare method. If the method called is a virtual method, the address of the vtable from which the address of the actually called method is loaded will be different. This is very similar to the first origin of differences in the motivating example.
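Both origins can be checked with a small standalone C++ sketch (our illustration, not part of the paper's figures; unlike Figure 1, the static member is public here so that its address can be inspected, and an LP64 target such as the Alpha is assumed):

    #include <cassert>

    // Toy stand-in for the Stk template of Figure 1, with a public
    // static member so its address can be taken in this sketch.
    template <class T> struct StkStatics { static int total_space; };
    template <class T> int StkStatics<T>::total_space = 0;

    int main() {
        // Origin 1: each specialization owns its own static member,
        // placed at its own statically allocated address.
        assert(&StkStatics<int>::total_space != &StkStatics<long int>::total_space);
        // Origin 2: the element widths differ on an LP64 target, so the
        // generated loads, stores and indexing arithmetic differ too.
        assert(sizeof(int) == 4 && sizeof(long int) == 8);
        return 0;
    }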

3. REUSING WHOLE PROCEDURES

Programming constructs such as templates and inheritance often result in multiple low-level implementations of source code fragments that show great similarity. Significant compaction of programs can be obtained by abstracting multiply occurring instruction sequences: a multiply occurring instruction sequence is put in a separate procedure and all occurrences of the sequence are replaced by calls to that single procedure. This is the inverse of inlining. Looking for multiply occurring instruction sequences can be done at various levels of granularity, which are discussed in this and the next section. We will look at whole procedures, code regions (i.e. groups of basic blocks with a unique entry and exit point), basic blocks and subblock instruction sequences. In this section we focus on the highest level: that of whole procedures. We also discuss the reuse of statically allocated data, as it creates more opportunities for the code reuse techniques.

3.1 Whole-Procedure Reuse

The targets of whole-procedure reuse are multiple identical procedures that might be found in the program because template specializations are different on the source code level, but identical on the assembly code level.

Figure 2: The CFGs generated for int Stk<int>::Pop() and long Stk<long>::Pop().

A question to answer is: what do we consider to be identical procedures? Do we, e.g., consider two procedures identical if they are exactly the same, apart from the registers they use for storing local variables? Or apart from the exact instruction schedules? The answer is that we require the procedures to be exactly the same, having the same structure of basic blocks and having exactly the same instructions, in identical schedules, and with identical register operands. This requirement speeds up the search for identical procedures, as we never need to check whether two variants are functionally equivalent.

One might think that a significant number of functionally equivalent procedures will thus not be discovered, but this is not the case, as a compiler implementing identical functionality will use identical registers and produce the same basic block layout. There is no reason to assume nondeterminism in this respect. Even for compilers implementing interprocedural register allocation this is often the case, as template classes are usually defined in separate source code modules and the interprocedural register allocation typically is intramodular.

Looking for fully identical procedures is straightforward: we apply a pairwise comparison of the CFGs of procedures by traversing the procedures in depth-first order, during which their instructions are compared. If two or more identical procedures have been found, one master procedure is selected and all call-sites of the others are converted to call the master procedure. Function pointers stored in the statically allocated data of the program (such as method addresses in vtables) are also converted to all point to the master procedure. Note that the search for identical procedures is an iterative process: as a procedure's callees change, procedures that were different might become identical.

To limit the number of comparisons necessary, a fingerprinting scheme is used. A procedure's fingerprint consists of its number of basic blocks and instructions, and a string describing the structure of its CFG. This string is constructed using a depth-first traversal, during which characters that denote basic block types are concatenated. The type of a block is the type of its last instruction: there are 'C'all blocks, 'R'eturn blocks, 'B'ranch blocks, etc. Using the fingerprints, the procedures are partitioned, after which the more comprehensive pairwise comparison is performed per partition of similar procedures. Note that the various fingerprinting schemes discussed in this paper are not fundamental to the code reuse techniques; they are very useful however to limit their computational cost.
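As an illustration only (the types below are toy stand-ins, not Squeeze++ internals), such a fingerprint could be computed as follows:

    #include <string>
    #include <vector>

    enum LastInstrType { CALL, RETURN, BRANCH, OTHER };

    struct Block { int num_instructions; LastInstrType last; };
    struct Proc  { std::vector<Block> blocks; };   // assumed already in depth-first order

    struct Fingerprint {
        int num_blocks;
        int num_instructions;
        std::string structure;                     // one type character per block
        bool operator==(const Fingerprint& o) const {
            return num_blocks == o.num_blocks &&
                   num_instructions == o.num_instructions &&
                   structure == o.structure;
        }
    };

    Fingerprint fingerprint(const Proc& p) {
        Fingerprint f = { 0, 0, "" };
        for (const Block& b : p.blocks) {
            f.num_blocks++;
            f.num_instructions += b.num_instructions;
            switch (b.last) {                      // block type = type of its last instruction
                case CALL:   f.structure += 'C'; break;   // 'C'all block
                case RETURN: f.structure += 'R'; break;   // 'R'eturn block
                case BRANCH: f.structure += 'B'; break;   // 'B'ranch block
                default:     f.structure += 'O'; break;   // other, e.g. fall-through
            }
        }
        return f;
    }

Only procedures with equal fingerprints then enter the expensive pairwise instruction-by-instruction comparison.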


3.2 Procedure Parameterization

As shown in the motivating example, it is possible that procedures are nearly identical. One possible solution to reuse the common code fragments is to apply abstraction techniques on smaller code fragments, such as the ones we will discuss in the next section. This can be very expensive however. Suppose we need to separately abstract every basic block but one: this would mean that for each of those basic blocks a procedure call has to be inserted.

A different approach is to merge nearly identical procedures and to add a parameter to the merged procedure. This parameter is used to select and execute the code fragments corresponding to one of the original procedures where the code differs. The merged procedure gets multiple entry points, one for each of the original procedures, at which the parameter is set. This approach requires two questions to be answered: what do we consider as nearly identical procedures, and how do we add a parameter to the merged procedure?

As for whole-procedure reuse, we limit the search for nearly identical procedures to procedures with identical structures of basic blocks. A pre-pass phase again partitions the procedures using a fingerprint based on the number of blocks and the structure and types of the blocks. The number of instructions is now not used to partition procedures. Procedures with identical structures are then pairwise compared and given an equivalence score. This score equals the (predicted) code size reduction if the two procedures were merged. Each pair of corresponding basic blocks in the two procedures contributes to the score as follows:

• The number of identical instructions starting from the entry points of the blocks is added to the score. If both blocks are completely identical, this is the final contribution of these blocks to the equivalence score. It indicates that a whole block can be eliminated.

• If the blocks are not identical, the number of identical instructions going backwards from the exit points of the blocks is added to the score. 3 points are subtracted however, as we conservatively assume that 3 instructions are needed to implement the conditional branch that needs to be inserted before the different parts of the blocks.

The rationale behind this counting is that identical prefixes and postfixes of instruction sequences in basic blocks will be hoisted or sunk in the merged procedure, thus eliminating one of them. As this counting is only a prediction, and sometimes not a very accurate one, due to, e.g., possible register renaming or instruction reordering being applied later on during the code compaction, we only count identical instructions in identical orders, with identical operands (register and immediate operands). For nearly identical specializations of the same template class, with differences like those discussed in section 2, this suffices. Experiments have shown that the threshold of the equivalence score at which to apply parameterization is best chosen at 10, i.e. the score has to be at least 10.
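A minimal sketch of this scoring, for one pair of corresponding blocks on a toy representation in which every instruction (opcode plus operands) has already been encoded as one comparable value, could look as follows; summing over all block pairs and testing the threshold of 10 is left to the caller:

    #include <algorithm>
    #include <vector>

    typedef unsigned long Instr;   // stand-in for one fully encoded instruction

    int block_pair_score(const std::vector<Instr>& a, const std::vector<Instr>& b) {
        std::size_t n = std::min(a.size(), b.size());
        std::size_t prefix = 0;                        // identical instructions from the entry
        while (prefix < n && a[prefix] == b[prefix]) prefix++;
        if (prefix == a.size() && prefix == b.size())
            return (int)prefix;                        // whole blocks identical: final score
        std::size_t suffix = 0;                        // identical instructions from the exit
        while (suffix < n - prefix &&
               a[a.size() - 1 - suffix] == b[b.size() - 1 - suffix]) suffix++;
        return (int)(prefix + suffix) - 3;             // minus 3 for the inserted branch
    }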

Once we have found nearly identical procedures, we must find a location to store the parameter that will be used in the merged procedure to select the correct code to be executed at program points where the original procedures were different. This is not an easy task, because of two observations:

1. We almost always must assume the merged procedure to be re-entrant through possible recursive calls. The reason is that we very often don't know exactly what methods can be called at call-sites with indirect procedure calls. This is in general the case where procedure pointers are used, such as for all virtual method calls. For these indirect calls, we assume that all code addresses stored in the program can be possible targets, resulting in an overly conservative call graph of the program. The consequence of this observation of re-entrancy is that the parameter must be stored on the stack.

2. Transforming the stack behavior of procedures or a program in a link-time or post link-time program compactor is very difficult, if not impossible, due to the general lack of any high-level semantic information about the program being compacted, and particularly about its stack behavior. Relying on compilers to provide this information to the program compactor would endanger the general applicability of our techniques, so we have chosen not to follow that path. The consequence is that we will not allocate additional space on the stack for storing the parameter.

If we cannot change the stack frame of a procedure, how are we going to store a parameter in it? The solution involves the return address corresponding to procedure calls. If a procedure is not a leaf procedure, this address is definitely stored on the stack, as the calls in the procedure overwrite the return address register. If the procedure is a leaf procedure, the return address is often not stored on the stack, but in that case we are sure that the procedure is not re-entrant and we know that at least some register will hold the return address throughout the whole procedure. It is therefore safe to assume that the location where the return address is stored, either in a register or on the stack, is a location that holds its value throughout a whole procedure's execution, whether this execution is interrupted by recursive calls or not. On our target architecture, as on most 32-bit RISC architectures having a fixed instruction width and 4-byte aligned instructions, the 2 least significant bits of the return address are always zero. We can therefore use these 2 bits to encode at most 4 different parameters corresponding to at most 4 procedures being parameterized and merged. The maximum number of procedures that can be merged into a single parameterized procedure is thus limited to 4, but as we will see in the evaluation section, this is enough to get a significant compaction with procedure parameterization. To select groups of 4 procedures from a larger group of nearly identical procedures, we use a greedy algorithm based on the equivalence scores.

Using the thus stored parameter for conditional branches in the merged procedure body is trivial, as the location of a return address stored on the stack is determined by the calling conventions. The 3 instructions needed for these conditional branches are (1) a load of the return address from the stack; (2) the extraction of the least significant bits (if necessary) from the return address; and (3) the conditional branch on these bits. Note that on architectures where the return instruction doesn't care about the two least significant bits of the return address, there is no need to remove the parameter from the return address before returning from a call to the merged procedure.
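Conceptually, the dispatch boils down to the following C++ sketch (our illustration; on the real Alpha this is three machine instructions operating on the saved return address):

    #include <cstdint>

    // 'ra' stands for the saved return address. With 4-byte-aligned
    // instructions, its 2 least significant bits are normally zero and
    // can carry a selector in the range 0..3, set at the entry points.
    int variant_selector(std::uintptr_t ra) {  // (1) ra was loaded from its stack slot
        return (int)(ra & 0x3);                // (2) extract the 2 least significant bits
    }                                          // (3) the merged body then branches
                                               //     conditionally on the result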


Figure 3: The CFGs for int Stk<int>::Pop() and long Stk<long>::Pop() after parameterization.


For the motivating example in section 2, the final CFGs of the original procedures after parameterization are depicted in Figure 3. As one can see, the total number of instructions in these 2 procedures has gone down from 68 to 56. This count of 56 includes the unconditional branches that need to be inserted because basic blocks can only have one predecessor on a fall-through path. These unconditional branch instructions are not part of our internal program representation, and are hence not depicted.

3.3 Reusing Statically Allocated Data

It often occurs that constant values stored in the read-only data sections of a program occur multiple times. On compiling a source code module, the compiler provides space in the statically allocated data of the generated object file for the addresses of all externally declared objects that are accessed from within the module. While linking and relocating the modules, the linker fills in the final addresses. If global objects are accessed from within multiple modules, their addresses will occur multiple times in the statically allocated data of the final program.

Another case is that of constant numerical values for which loading them from the data sections is the most efficient way to get them into a register. This is typically the case for floating-point constants such as π or e. These constant values are often stored in the data sections of multiple object files and therefore occur multiple times in the final program.

As data is linked with the program to be loaded somewhere, the result of multiple occurrences of identical data is that otherwise identical code fragments can differ in the location from which they load the (identical) data. As our link-time code compactor incorporates a constant propagator that to some extent detects what constant data is loaded, we can convert instructions that load the same constant value from different locations into instructions that load it from a single location. This has proven to be very beneficial for code reuse, and especially for the reuse of whole identical or nearly identical procedures, as it limits the first kind of differences between procedures discussed in section 2. Note however that we apply this conversion independently of the code reuse techniques, as such conversions also improve cache behavior.
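To illustrate the kind of duplication targeted here, consider two hypothetical translation units that both materialize the same floating-point constant in their read-only data:

    // module1.cpp: the literal ends up in this object file's read-only data
    double area(double r)          { return 3.14159265358979 * r * r; }

    // module2.cpp: an identical copy ends up in this object file's data
    double circumference(double r) { return 2.0 * 3.14159265358979 * r; }

After linking, the constant exists twice, and the otherwise identical load instructions in the two functions use different addresses; once the constant propagator proves both loads fetch the same value, they can be redirected to a single copy.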

4. FINE-GRAINED CODE REUSE

In this section, our previous work on code reuse of more fine-grained identical or functionally equivalent code fragments is rediscussed. On several occasions, new insights and enhancements are introduced.

The use of inheritance and virtual method calls results in procedures that are often not identical or even similar, but that still contain similar code fragments, simply because they provide similar functionality. It therefore often happens that parts of such methods are identical or at least functionally equivalent (i.e. they apply the same computations, but on different register operands). Sometimes template instantiations are not similar enough according to our equivalence score metric, but still contain similar or identical code fragments. In both cases, identical or functionally equivalent code fragments can be abstracted into procedures.

The execution of thus abstracted procedures ends at a return instruction, after which the execution of the program always continues at the instruction following the call-site of the abstracted procedure. Therefore, abstracted code fragments must have a unique entry point and a unique exit point. This by definition is the case for basic blocks and subblock instruction sequences. Code fragments consisting of multiple basic blocks can also have a unique entry and exit point. Those fragments are called code regions, and we'll start our discussion of fine-grained code reuse techniques with them.

4.1 Code Region Abstraction

The entry and exit points of a code region correspond to a pair of dominator and postdominator blocks. Such pairs therefore uniquely identify code regions. Again a fingerprinting system is used to prepartition all the regions in a program. As the number of code regions is much higher than the number of procedures, and as techniques such as register renaming (to make functionally equivalent regions using different register operands identical) are computationally expensive, we limit our search space to fully identical regions: i.e. regions with identical structure, instructions, schedules and register operands.

Allocating space for the return address poses a problem when procedures are being called from within the abstracted procedure, as we discussed in section 3.2: calls to abstracted code need to store a return address somewhere, and this return address can be seen as some kind of parameter. A difference with the discussion in section 3.2 is that, because compiler-generated stack allocation mostly takes place in the entry and exit blocks of a procedure, and not in the code in between, we can expect that fewer stack allocations in the original code regions will interfere with the additional stack space we want to allocate. However, while we have found a very conservative way to allocate an additional location to store the return address for such abstracted code regions, it is so conservative that we can almost never apply it, and we are therefore almost never able to abstract code regions in which procedure calls occur.

One exception is when the code regions to be abstracted end with a return instruction. As the call to the abstracted code in this case is a tail call, there is no need for a procedure call to the abstracted procedure. Applying tail-call optimization, we can just jump into the abstracted procedure, and the return at the end of it will return directly to the callers of the procedures from which the region was abstracted.

4.2 Basic Block Abstraction

As we said before, basic blocks are by definition code fragments with a unique entry and a unique exit point. As opposed to whole procedures with identical functionality, basic blocks with identical functionality show much more variation in the register operands used. The reason is that the compiler allocates registers for a whole procedure. Therefore, the register operands used in a basic block largely depend on the registers used in the surrounding code. For this reason, and because local register renaming (within a single basic block) is not nearly as complex and time-consuming as global register renaming (over multiple basic blocks), we try to make functionally equivalent basic blocks identical by renaming registers.

As we did for the other reuse techniques, basic blocks are first partitioned to speed up the detection of equivalent blocks. The fingerprints used to do so include the opcodes of the instructions, but not the register operands. Within each partition, the search for functionally identical blocks is done as follows:

• As long as there are multiple blocks in a partition, a master block is selected.

• We try to rename all the other (slave) blocks to the master block. If a slave block already is identical to the master block, renaming is of course unnecessary.

• The master block and all slave blocks (made) identical to the master block are abstracted into a procedure, and procedure calls to this abstracted procedure replace the original occurrences of the blocks.

• If no slave blocks are found that are or can be made identical to the master block, the master block is removed from the partition.

This is all similar to our previous work in [10]. We have however enhanced the renaming algorithm. Our new renaming algorithm works in three phases:

1. Comparing the dependency graphs
First, the dependency graphs¹ of the master and slave blocks are compared. Conceptually this is done by converting the blocks to a symbolic static single assignment representation and by simply comparing whether the symbolic register operands of each instruction are the same.

One detail of this conversion and comparison is important to discuss: all externally defined registers are converted to a single symbolic register. Consider the example code in Figure 4. The fourth instruction in the master block has two different externally defined source register operands: r7 and r8. In the slave block, register r7 is used twice. In our symbolic representation for comparing the dependency graphs of the two blocks, the same symbolic register x is used for r7 and r8.

2. Adding copy operations
If the slave and master blocks implement the same dependency graph as discussed in the first phase, we try to insert copy operations. These are added for three reasons:

(a) When externally defined registers differ, copy operations are inserted before the slave block (copy operation (2) in the example).

(b) When different registers are defined in the blocks that are live² at the end of the slave block, copy operations are inserted after the slave block (copy operation (3) in the example).

(c) When a register is defined in the master block that is not defined in the slave block, but which is live over the whole slave block, copy operations are inserted before and after the slave block to temporarily store the register content in another register. Assume register r0 to be live over the whole slave block in the example. Then two copy operations need to be inserted: (1) and (4).

¹ A dependency graph is a directed graph in which the operands of instructions are vertices. Directed edges connect definitions (i.e. write operations) of destination operands with consuming source operands (i.e. operands of operations reading the value produced at the definition).
² A register is considered live if its content can be used by some instruction later on during the program execution. If a register is live at some program point, we cannot overwrite its value, unless its original value is stored somewhere else (spilled on the stack or in some other register).

3. Actual renaming and abstraction
The insertion of copy operations is considered successful if there are no copy operations necessary before or after the slave block that need to copy different registers to the same destination register. If the number of inserted copy operations is small enough for the blocks to be abstracted, the slave block is simply replaced by the master block, thus effectively renaming the slave block. In the example, renaming is possible, but not considered worthwhile, as the number of added copy operations is higher than the code size reduction obtained by abstracting the basic blocks. Note that we would also have to add two call instructions and one return instruction to abstract the blocks.
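Phase 1 can be pictured with the following toy sketch of the symbolic conversion on a three-address code (types and encoding are our illustration): every register defined inside a block receives a fresh symbol i1, i2, ..., and every externally defined register maps to the single symbol x, so applied to Figure 4 both the master and the slave block yield the same symbolic form i1 := x + x; i2 := i1 - x; i3 := i2 / x; i4 := x * x.

    #include <map>
    #include <vector>

    // Toy three-address instruction: opcode plus register numbers.
    struct Instr {
        int opcode, dst, src1, src2;
        bool operator==(const Instr& o) const {
            return opcode == o.opcode && dst == o.dst &&
                   src1 == o.src1 && src2 == o.src2;
        }
    };

    // Rewrite a block into symbolic SSA-like form: symbol 0 plays the
    // role of 'x' (any externally defined register), and each definition
    // inside the block gets a fresh symbol 1, 2, 3, ...
    std::vector<Instr> symbolic_form(const std::vector<Instr>& block) {
        std::map<int, int> sym;        // register -> symbol defined in this block
        int next_sym = 1;
        std::vector<Instr> out;
        for (const Instr& i : block) {
            auto use = [&](int r) {
                std::map<int, int>::iterator it = sym.find(r);
                return it == sym.end() ? 0 : it->second;   // external -> 'x'
            };
            Instr s = { i.opcode, 0, use(i.src1), use(i.src2) };
            sym[i.dst] = next_sym++;                       // fresh symbol per definition
            s.dst = sym[i.dst];
            out.push_back(s);
        }
        return out;
    }
    // Two blocks implement the same dependency graph iff their symbolic
    // forms compare equal; phase 2 then tries to insert the copy
    // operations described above.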

The difference between this renaming algorithm and the one we previously published [10] is that the old algorithm combined phases 1 and 2: the copy operations were collected during the dependency graph comparison. To be able to do this in one phase, all externally defined registers used in the blocks must get unique symbolic names (x1, x2, etc. instead of a single x). If this is done, the fourth instructions in the master and slave blocks of the example in Figure 4 have different operands, and the dependency graphs are not considered identical. These blocks were therefore not considered renameable by the old algorithm. One could say that the old algorithm could only rename functionally equivalent blocks, whereas the new algorithm is also able to rename blocks to functionally superior blocks.

Cooper and McIntosh [6] propose renaming basic blocks by globally renaming registers instead of inserting register copy operations. This has the disadvantage that renaming to make one pair of blocks identical can affect (even undo) renaming for another pair of blocks. We feel that inserting copy operations is preferable, especially since a separate copy elimination phase after the code abstraction will be able to eliminate most, if not all, of the copy instructions in those cases where global renaming could have been applied to make the blocks identical.

Another advantage of using copy instructions is that these can easily be used to "rename" immediate operands to registers. Just like a copy instruction is inserted, an inserted instruction can store a value in a register that is then used in the abstracted block instead of the immediate operands in the original blocks. This of course is useful only when the corresponding immediate operands or registers in the original blocks differ.

The renaming algorithm is the only transformation we apply to make functionally equivalent (or superior) blocks identical. A question now arises concerning our search space: do the schedules (or order) of the instructions in the master and slave blocks matter?

What if two blocks are identical, except for the order of the instructions? We have done some experiments to answer this question by inserting a pre-pass scheduler that generates a deterministic schedule given a dependency graph. This means that two blocks with identical dependency graphs but with different instruction orders are transformed into blocks with the same order. This allows us to disregard scheduling differences when comparing dependency graphs or inserting copy instructions. We have learned that this pre-pass does not yield a significant increase in the number of blocks being abstracted (less than 0.1%). We conjecture that the local schedules generated by the compiler are pretty deterministic.

Original blocks:

    master           slave
    r4 := r2 + r1    r1 := r2 + r1
    r4 := r4 - r3    r4 := r1 - r3
    r0 := r4 / r6    r9 := r4 / r6
    r4 := r7 * r8    r5 := r7 * r7

Using symbolic registers:

    master           slave
    i1 := x + x      i1 := x + x
    i2 := i1 - x     i2 := i1 - x
    i3 := i2 / x     i3 := i2 / x
    i4 := x * x      i4 := x * x

Renamed blocks:

    master           slave
                     r9 := r0         (1)
                     r8 := r7         (2)
    r4 := r2 + r1    r4 := r2 + r1
    r4 := r4 - r3    r4 := r4 - r3
    r0 := r4 / r6    r0 := r4 / r6
    r4 := r7 * r8    r4 := r7 * r8
                     r5 := r4         (3)
                     r0 := r9         (4)

Figure 4: Example basic blocks for basic block abstraction.


4.3 Subblock Instruction Sequence Abstraction

Our previous research on abstracting partially matched basic blocks involved two important special cases: saves and restores of callee-saved registers. Most of the time the saves occur in the entry block of a procedure and the restores occur in blocks ending with a return instruction. As this limits the number of code fragments that have to be compared, we can afford the time to try to make these sequences identical by locally rescheduling code (i.e. within the entry or return blocks of a procedure). This rescheduling might be necessary when the compiler has mixed the save and restore instruction sequences with other instructions. While abstracting register save sequences involves some overhead because of the extra call and return, this is not the case for register restores, as tail-call optimization can often eliminate the overhead. How this is done is discussed in more detail in [10].

Whereas we previously found that a more general abstraction of partially matched blocks is computationally quite expensive without offering significant code size reductions for C programs, this is not the case for C++ programs. The reason is again that C++ programs offer more possibilities for code reuse. However, we still need to keep the computational cost reasonable, so we restrict the general abstraction of partially matched basic blocks to identical instruction sequences. No register renaming or rescheduling is applied to create identical sequences. Even then, an enormous number of instruction sequences has to be compared. A quite efficient way we found to do this is as follows.

A separate pass over the whole code is applied for each possible instruction sequence length we want to abstract, starting with the longest sequences and ending with the shortest ones (3 instructions). During each pass for instruction sequences of length n:

1. all occurring sequences of length n are collected and prepartitioned according to a fingerprint (that now includes the register operands used);

2. for each partition, identical sequences are collected. These multiple identical sequences are placed in separate basic blocks, by splitting up the blocks in which they first occurred.

After this splitting of basic blocks for all sequence lengths, the basic block abstraction technique is applied again to actually abstract the identical sequences in the split basic blocks. By not collecting instruction sequences in phase 1 that were separated in phase 2 of a previous pass, larger blocks are greedily abstracted first. Although this is not optimal (more beneficial splitting of basic blocks might be possible), we found it to be a good compromise between computational efficiency and code size reduction. By having a separate pass for each possible instruction sequence length, the memory footprint can be kept acceptable without sacrificing execution speed. Note that for high values of n, where large fingerprints need to be used, the number of instruction sequences in a program decreases, as basic blocks on average consist of only 4-5 instructions.
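The pass structure can be pictured with the following runnable toy (our simplification, not the Squeeze++ implementation: instructions are single characters, a repeated substring plays the role of an identical instruction sequence, and positions already claimed by a longer match in an earlier pass are skipped):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        std::string code = "abcdefgXabcdefgYbcdefZ";
        std::vector<bool> taken(code.size(), false);
        for (int n = (int)code.size() - 1; n >= 3; n--) {     // one pass per length
            // 1. collect and prepartition all still-free sequences of length n
            std::map<std::string, std::vector<int>> occ;      // fingerprint -> start positions
            for (int i = 0; i + n <= (int)code.size(); i++) {
                bool free_run = true;                         // skip already-claimed code
                for (int j = i; j < i + n; j++) free_run = free_run && !taken[j];
                if (free_run) occ[code.substr(i, n)].push_back(i);
            }
            // 2. sequences occurring more than once are abstraction candidates
            for (auto& [seq, starts] : occ) {
                if (starts.size() > 1) {
                    std::cout << "length " << n << ": \"" << seq << "\" occurs "
                              << starts.size() << " times\n";
                    for (int s : starts)                      // claim them greedily
                        for (int j = s; j < s + n; j++) taken[j] = true;
                }
            }
        }
        return 0;
    }

On this input, the longest pass already claims "abcdefg" twice, so the shorter "bcdef" repeats inside it are no longer collected in later passes.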

4.4 Summary

Reusing multiple instances of identical code fragments can be done at various levels of granularity: procedures, code regions with unique entry and exit points, basic blocks and partially matched basic blocks. Looking at the ratio between computational cost and code size reduction, certain trade-offs have to be made. Register renaming to create identical code fragments is applied at the basic block level only. Rescheduling to create identical code sequences is applied for callee-saved register stores and restores only.

It is obvious that the different abstraction levels influence each other: without procedural abstraction or parameterization, all the identical blocks or regions in procedures will still be abstracted if basic block and region abstraction are applied. The introduced overhead will then be larger however, since a procedure is then largely replaced by a number of calls to abstracted procedures.


5. EXPERIMENTAL EVALUATION

The algorithms described in this paper are implemented in Squeeze++, a binary rewriting tool aiming at code compaction for the Alpha architecture. The Alpha architecture was chosen because its clean nature eases the implementation of algorithms, in particular of the Squeeze++ backend. We firmly believe that the algorithms implemented in Squeeze++ are generally applicable and not architecture-specific, and therefore consider Squeeze++ and the results obtained with it a valid proof of concept.

5.1 The Benchmarks

The performance of Squeeze++ is measured on a number of real-life C++ benchmarks. Although some of these benchmarks are not typical examples of embedded applications, we believe they represent a broad range of the code properties of the targeted (embedded) applications: applications written in C++ using more or less library code and using more or less templates and inheritance. Some properties of the benchmark programs are summarized in Table 1: a short functional description is provided, together with the number of assembly instructions in the base versions of the applications. It is with these base versions that the compacted versions will be compared. They were generated using three tools:

1. Compaq’s C++ V6.3-002 compiler was used to compilewith the -O1 -arch ev67 flags, resulting in the smallestbinaries the compiler can generate.

2. The Compaq linker for Tru64 UNIX 5.1 (the operating system used for our evaluation) was used to link the programs with the flags -Wl,-m -Wl,-r -Wl,-z -Wl,-r -non_shared, resulting in statically³ linked binaries including the relocation and symbol information necessary for Squeeze++, and resulting in a map of the original object file code and data sections in the final program. Squeeze++ requires this map for its analyses discussed in [7].

3. A very basic post link-time program compactor. It eliminates no-ops from the programs and performs an initial elimination of unreachable code. It applies no-op elimination on the linked binaries because a code-size-oriented compiler would not have inserted no-ops in the first place, and an initial unreachable code elimination because this compensates for the fact that the Compaq system libraries are not really structured with compact programs in mind: the object files in the libraries have a larger granularity than one would expect from a library oriented at compact programs, and more coarse-grained object files generally result in more redundant code being linked with a program. It also applies the same code scheduling and layout algorithms as Squeeze++, using the same profiling information for the benchmarks. Using the same scheduler and layouter allows us to make a fair comparison between the execution speeds of the different versions of the benchmarks.

³ Unless programs are statically linked, the Compaq linker will not include the relocation information needed by Squeeze++ in the generated binary.

As the prototype Squeeze++ only handles statically linked programs, including both application-specific and library code, the fraction of the program code that we consider application-specific is also given. All system libraries (libc, libcxx, libstdcxx, libX11, etc.) are considered library code and not application-specific. Libraries that typically are used by more applications on a system are also not considered application-specific. For the lyx benchmark, e.g., the forms GUI library is not considered application-specific, while the xforms library is considered application-specific, as it is written as a wrapper around the forms library to ease the implementation of lyx, and only of lyx. Although we have linked xforms into the program as a library, such that the linker only includes the necessary parts of the xforms library (this part may depend, e.g., on the target platform), we still consider it part of the application-specific code. A similar reasoning holds for the GTL library: it is a library, so we link it with the program as a library. Still we consider it to be application-specific, as it represents the kind of library that is typically not used in multiple applications on a system.

The table also includes the fraction of the application-specific code that comes from the so-called repository, i.e. the fraction of the code that was generated for template instantiations. As discussed in section 6, a repository is used to avoid the inclusion of multiple identical template instantiations or specializations into a program. We have included these fractions because they give an idea of the size of the code that we explicitly target with our techniques for whole-procedure reuse and parameterization.

5.2 Whole-Program Code Compaction

The code abstraction performance of Squeeze++ is evaluated by comparing three versions of the benchmark programs. The base version for our comparison is the version generated as described in the previous section. The second version consists of the compacted binaries in which no code is reused: i.e. the programs are compacted applying all the available techniques in Squeeze++ (including constant propagation, useless code elimination, unreachable code elimination, etc.; see [7, 10] for an extended discussion) except the techniques discussed in this paper. The third version is generated by Squeeze++ applying all available compaction and code reuse techniques.

The results for our benchmark programs are depicted in Figure 5. The lower line indicates the compaction achieved by Squeeze++ without the code abstraction techniques. As can be seen, 26-39% of the code is eliminated, averaging around 31%.

If code abstraction and parameterization are applied as well, the average code size reduction climbs to 45%. This reduction fluctuates between 34 and 62%. The additional code size reduction resulting from the code reuse thus averages around 14%, ranging from 6% to 24%.

5.3 Application-Specific Code

One can question the usefulness of measuring code size reductions on statically linked programs. Aren't smaller programs among the benefits of using shared libraries? Our answer to this is twofold.


    Program    Description                                             # instr.   application-     repository
                                                                                   specific code    code
    blackbox   Fully functional, lightweight window manager              328240   14%               0%
    bochs      Virtual Pentium machine                                   275440   37%               0%
    lcom       "L" hardware description language compiler                 99168   31%               0%
    xkobo      Arcade space shooter game                                 252800    4%               0%
    fpt        In-house HPF automatic parallelization tool               530896   32%               7%
    252.eon    Probabilistic ray tracer (from the SPECint2000 suite)     136224   51%              10%
    lyx        WYSIWYM(ean) word processor (LaTeX-like documents)       1329616   66%              23%
    gtl        Test program from the Graph Template Library (GTL)        161936   60%              47%

Table 1: Description and properties of the programs on which our code compaction was evaluated.

Figure 5: Code size reductions achieved on statically linked programs.


The second and more interesting part of our answer is the evaluation of our techniques on the application-specific code of the program, i.e. the code that is also found in dynamically linked programs. To approximate the performance of a Squeeze++-like tool for dynamically linked applications as well as possible, we have adapted Squeeze++ in two ways:

1. At call sites in application-specific code where library code is called, we have not resolved the calls. This means that at the call site, we do not know the callee; we only know that the callee respects the calling conventions. Vice versa, the library code called this way is called from unknown calling contexts, under the sole assumption that its callers respect the calling conventions as well. This is exactly the same as what compilers do when callees are externally declared or when procedures are exported.

2. For code reuse, all fingerprints used for prepartitioning include a tag indicating whether the code belongs to the application-specific or the library part of the program. Identical, similar or equivalent code fragments are therefore only detected within either application-specific or library code, and the code reuse is completely separated for both parts.
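The following sketch shows one way such tagging can be realized; it reflects our reading of the scheme, not Squeeze++'s actual data structures, and all names are illustrative.

#include <cstddef>
#include <cstdint>
#include <functional>

// Tag separating application-specific from library code.
enum class Origin : std::uint8_t { Application, Library };

struct Fingerprint {
  std::uint64_t code_hash;  // e.g. a hash over the fragment's opcode sequence
  Origin origin;            // the tag described in point 2 above

  bool operator==(const Fingerprint& o) const {
    return code_hash == o.code_hash && origin == o.origin;
  }
};

// Hash that combines the tag, so fragments from different sides fall
// into different prepartitioning classes and duplicates are only
// sought within one side.
struct FingerprintHash {
  std::size_t operator()(const Fingerprint& f) const {
    return std::hash<std::uint64_t>()(f.code_hash) * 2 +
           static_cast<std::size_t>(f.origin);
  }
};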

With this adaptation, application-specific code is completely separated from library code in Squeeze++, and the code reuse techniques applied to the application-specific code are exactly those that are also applicable to dynamically linked programs. The only remaining difference with dynamically linked programs is that there are no stubs implementing the calls to shared library code. The influence of these stubs on the results of applying our code reuse techniques is, in our opinion, negligible.

Figure 6: Code size reductions achieved on the application-specific part of the program.

Figure 6 shows the same information as Figure 5, but now for application-specific code only. As can be expected, the performance of the general optimization and compaction techniques is lower on the application-specific code. The reason is precisely that this code is application-specific, and therefore fewer possibilities are found to optimize the code for the application. By and large, the general optimization and compaction possibilities for application-specific code result from the inefficiencies of separate compilation. In library code, by contrast, much more code is unreachable in a given application because it is redundant for that specific application. The average application-specific code size reduction due to general optimization and compaction techniques is 25% (compared to 32% for the whole programs). This reduction ranges from 19 to 32% (compared to 26 to 39% for the whole programs).

At least for some of the benchmarks, the code abstraction and parameterization techniques perform much better on the application-specific code, resulting in a total average code size reduction of 43%. This is still lower than the average reduction achieved on whole programs (45%). For some programs, however, the maximal compaction achieved on application-specific code is larger than the compaction on the whole program. The compaction now ranges from 27 to 70% (compared to 34 to 62% for whole programs).

For some programs, the code reuse techniques perform much better on application-specific code than on the whole program, as can be seen when comparing Figures 5 and 6. The additional application-specific code size reduction due to code reuse techniques now ranges from 7 to 38% (compared to 6 to 24% for the whole programs), averaging around 18% (compared to 14%). This stronger performance of the code reuse techniques on application-specific code can be explained by looking at the third line we've included in Figure 6, indicating the fraction of application-specific code that comes from the repository. There is a very strong correlation between the performance of the code reuse techniques and the amount of code originating from templates. This corresponds to the claims we have made about reusable code generated for template specializations, even when repositories are used to avoid to some extent the linking of duplicate code fragments.

5.4 Compaction Times

Common knowledge has it that whole-program analyses and optimizations are notoriously slow and therefore often impractical. In Figure 7, we have depicted the time needed to fully compact the binaries with Squeeze++, i.e. to apply all the techniques at our disposal. We believe these compaction times show that link-time optimizations are practical, as there is no need to apply them during each edit/compile/debug cycle.

It is important to note that Squeeze++ is a research prototype. We have optimized it only to facilitate our research and to shorten our edit/compile/debug cycles. So while we have, e.g., optimized the liveness analysis, it is still fully separated from all other implemented techniques. Before any of the other techniques that rely on liveness information is applied, liveness analysis is restarted from scratch. It would be much more efficient if all, or at least some, of the program transformations updated the liveness information on the fly. We have opted not to do so, because it would severely limit the ease with which Squeeze++ can be adapted to try out new techniques.
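As an illustration of what such a restart entails, the following minimal sketch (ours; the register count and data layout are illustrative, not Squeeze++'s) recomputes block-level liveness with a classic backward fixpoint over the whole control flow graph, rather than incrementally patching the sets after each transformation:

#include <bitset>
#include <vector>

constexpr int kRegs = 64;   // assumption: a fixed register file, e.g. Alpha's
using RegSet = std::bitset<kRegs>;

struct BasicBlock {
  RegSet use, def;          // computed once from the block's instructions
  RegSet live_in, live_out;
  std::vector<int> succs;   // indices of successor blocks in the CFG
};

// Round-robin fixpoint: live_in = use | (live_out & ~def), iterated
// until no set changes anywhere in the CFG.
void recompute_liveness(std::vector<BasicBlock>& cfg) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (BasicBlock& b : cfg) {
      RegSet out;
      for (int s : b.succs) out |= cfg[s].live_in;
      RegSet in = b.use | (out & ~b.def);
      if (in != b.live_in || out != b.live_out) {
        b.live_in = in;
        b.live_out = out;
        changed = true;
      }
    }
  }
}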

Figure 7: Compaction times (in minutes) as a function of the number of instructions in the binaries.

5.5 Breakdown of the Results

In the lower graph of Figure 8, we show the contributions of the different code reuse techniques to the total additional code size reduction achieved by these techniques combined. These contributions are additive, since more fine-grained techniques are applied where coarse-grained techniques failed to find multiply occurring code fragments. As we consider the numbers for the application-specific code to be more relevant, we have depicted the graph for those numbers. It is clear from the graph that whole-procedure reuse and abstraction perform best where templates are used most. It is also clear that the importance of the more fine-grained techniques decreases as more templates are used. Also note that, as the lines in the graph for procedure parameterization and code region abstraction are basically the same, code region abstraction has lost most if not all of its merits.

5.6 Influence on Compaction Speed

In the upper graph of Figure 8, we have depicted the (additive) slowdowns of Squeeze++ when the code reuse techniques are applied. Several things can be said about the results shown in this graph. First of all, notice that whole-procedure reuse and parameterization seem almost free. This is not because these techniques consume few cycles compared to the other optimization techniques; the reason is that they are applied relatively early in the compaction process. Squeeze++ consists of a number of phases:

1. disassembly and CFG construction;

2. trivial optimizations (basic version);

3. iterative optimizations;

4. optimizations applied only once;

5. iterative optimizations;

6. fine-grained code reuse;

7. iterative optimizations;

8. code layout, scheduling and assembling.


Figure 8: Contributions to code size reduction and compaction slowdown for the various code reuse techniques.

The iterative optimizations include constant propagation, unreachable and dead code elimination, CFG refinements, etc. The optimizations that are applied only once include some CFG refinements and the inlining of procedures with only one calling context.

As whole-procedure code reuse targets identical procedures, we do not want to risk that the application of the iterative optimizations introduces differences in initially identical procedures. On the other hand, these optimizations might remove differences between procedures. To exploit both possibilities, data reuse and whole-procedure reuse are applied at the beginning of each run over the iterative optimizations. As this immediately eliminates a large amount of code for some of the benchmarks, all the later techniques in Squeeze++ are applied on a smaller program; hence the overall speedup on some benchmarks due to the reuse of whole procedures.
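A minimal sketch of detecting identical procedures at this level, under our own simplifying assumptions (ignoring relocations and PC-relative operands; all names are illustrative):

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Procedure {
  std::string name;
  std::vector<std::uint32_t> code;  // encoded instruction words
};

// Returns (duplicate, representative) pairs: the instruction words of
// each procedure act as a map key, and every procedure mapping to an
// existing key duplicates the stored representative. In a binary
// rewriter, the duplicate's callers would then be retargeted to the
// representative and the duplicate's code removed.
std::vector<std::pair<Procedure*, Procedure*>>
find_identical(std::vector<Procedure>& procs) {
  std::map<std::vector<std::uint32_t>, Procedure*> representative;
  std::vector<std::pair<Procedure*, Procedure*>> folds;
  for (Procedure& p : procs) {
    auto ins = representative.try_emplace(p.code, &p);
    if (!ins.second)  // key already present: p is an identical copy
      folds.push_back({&p, ins.first->second});
  }
  return folds;
}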

As parameterization targets similar whole procedures, it is of no use for nearly identical procedures that have been inlined in different contexts: once they are inlined, we no longer consider them procedures. On the other hand, as parameterization involves the introduction of some overhead, we want to avoid parameterizing two procedures of which one might later turn out to be unreachable. Therefore, procedure parameterization is applied just prior to the inlining of procedures with only one calling context. At that time, most of the unreachable code that we can detect has been detected and removed. Again, the algorithms in the later phases are applied on a smaller program.
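To make the overhead of parameterization concrete, consider a hypothetical source-level analogue (the actual transformation operates on Alpha machine code, not on source): two procedures differing only in a constant are merged into one procedure taking that constant as an extra parameter, and each call site now has to pass it.

// Before: two similar procedures differing only in one constant.
int scale_by_2(int x) { return x * 2; }
int scale_by_3(int x) { return x * 3; }

// After: one parameterized procedure. The extra argument at each
// call site is the overhead the text refers to.
int scale_by(int x, int factor) { return x * factor; }

int use_both(int x) {
  // old call sites scale_by_2(x) and scale_by_3(x) become:
  return scale_by(x, 2) + scale_by(x, 3);
}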

Two further remarks should be made. First, it looks as if (the almost useless) code region abstraction is responsible for the largest slowdown. A large part of the computations needed for abstracting code regions, however, is also needed for basic block abstraction. The relative weight of code region abstraction in the slowdown is therefore overestimated in the graph. Finally, it is important to know that these numbers were collected using the Squeeze++ version that keeps application-specific code and library code separated. All techniques are applied on both of them, albeit separately. The measured slowdowns are therefore an overestimation of the actual slowdowns had the techniques been applied on application-specific code only. This is because most algorithms scale superlinearly, and because the whole-procedure reuse and parameterization techniques do not reduce the size of the library code to be handled by later phases in Squeeze++ at all, as they do for application-specific code.4

5.7 Influence on Execution Speed

When code reuse for minimizing the static code size is applied, control flow transfers and copy instructions are inserted into the program. The dynamic number of executed instructions therefore increases with code reuse, and we can expect execution times to increase as well. On the other hand, code reuse can improve cache behavior and thus have a positive influence on execution speed. To measure the combined positive and negative influences on execution speed, we've measured the execution times of four versions of some benchmark programs. These versions are:

• The base binaries.

• The compacted binaries where the reuse of identical procedures is the only applied code reuse technique, since it is the only reuse technique that introduces no overhead (no overhead reuse).

• The fully compacted binaries (all reuse).

• A version of the programs where all code reuse techniques are applied, but the ones that involve overhead are applied only on cold code, i.e. code that according to the profile information is not frequently executed. As cold code we took the part of the code with the lowest execution frequencies that is responsible for the bottom 5% of the number of dynamically executed instructions (cold code reuse).
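One possible realization of this cold-code selection, sketched under our own assumptions (profile counts are available per basic block; all names are hypothetical):

#include <algorithm>
#include <cstdint>
#include <vector>

struct Block {
  std::uint64_t exec_count;  // times the block was executed (profile)
  std::uint32_t size;        // number of instructions in the block
  bool cold = false;
};

// Mark the least frequently executed blocks as cold until they account
// for the bottom `fraction` of all dynamically executed instructions.
void mark_cold(std::vector<Block*>& blocks, double fraction = 0.05) {
  std::uint64_t total = 0;
  for (const Block* b : blocks) total += b->exec_count * b->size;

  std::sort(blocks.begin(), blocks.end(),
            [](const Block* a, const Block* b) {
              return a->exec_count < b->exec_count;
            });

  const std::uint64_t budget = static_cast<std::uint64_t>(total * fraction);
  std::uint64_t spent = 0;
  for (Block* b : blocks) {
    const std::uint64_t dyn = b->exec_count * b->size;
    if (spent + dyn > budget) break;
    spent += dyn;
    b->cold = true;  // overhead-introducing reuse may be applied here
  }
}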

4 Because of dependencies between the different general compaction techniques, major parts of Squeeze++ would require an (infeasible) rewrite not to apply the general compaction techniques on the library code. As we have not rewritten Squeeze++ for this purpose, not applying the abstraction techniques to the library code would have resulted in an underestimation. We prefer presenting overestimated slowdowns.
