Transcript
Page 1: Reverse engineering is reverse forward engineering

Science of Computer Programming 36 (2000) 131–147www.elsevier.nl/locate/scico

Reverse engineering is reverse forward engineering

Ira D. Baxter ∗, Michael MehlichSemantic Designs, Inc., 12636 Research Blvd, #C-214 Austin, TX 78759-2200, USA

Abstract

Reverse Engineering is focused on the challenging task of understanding legacy program codewithout having suitable documentation. Using a transformational forward engineering perspec-tive, we gain the insight that much of this di�culty is caused by design decisions made duringsystem development. Such decisions “hide” the program functionality and performance require-ments in the �nal system by applying repeated re�nements through layers of abstraction, andinformation-spreading optimizations, both of which change representations and force single pro-gram entities to serve multiple purposes. To be able to reverse engineer, we essentially have toreverse these design decisions. Following the transformational approach we can use the transfor-mations of a forward engineering methodology and apply them “backwards” to reverse engineercode to a more abstract speci�cation. Since most of the existing code was not generated bytransformational synthesis, this produces a plausible formal transformational design rather thanthe original authors’ actual design. As an example, a small fragment of a real-time operatingsystem is reverse-engineered using this approach. A byproduct of the transformational reverseengineering process is a design database for the program that then can be maintained to min-imize the need for further reverse engineering during the remaining lifetime of the system. Aconsequence of a transformational forward engineering perspective is the belief that the standardplan recognition methods proposed for reverse engineering are not su�cient. c© 2000 ElsevierScience B.V. All rights reserved.

1. Introduction

Software engineering practice tends to focus on the design and implementation ofa software product without considering its lifetime, usually 10 years or more (see[18]). However, the major e�ort in software engineering organizations is spent afterdevelopment (see [5, 6]) on maintaining the systems to remove existing errors and toadapt them to changed requirements.Unfortunately, mature software systems often have incomplete, incorrect, or even

nonexistent design documentation. This makes it di�cult to understand what the systemis doing, why it is doing it, how the work is performed, and why it is coded that way.

∗ Corresponding author.E-mail addresses: [email protected] (M. Mehlich), [email protected] (I.D. Baxter)

0167-6423/00/$ - see front matter c© 2000 Elsevier Science B.V. All rights reserved.PII: S 0167 -6423(99)00034 -9

Page 2: Reverse engineering is reverse forward engineering

132 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

Consequently, mature systems are hard to modify and the modi�cations are di�cult tovalidate.For newly developed systems, the problem can be reduced by thoroughly document-

ing the system and maintaining the documentation together with the system. Ideally,the system documentation describes the product and the complete design, including itsrationale.Most captured design information is informal (not machine-interpretable). While this

is valuable for the software maintainers, informal information is subject to a widevariety of interpretations. These varied interpretations create communication problemsbetween developers, which limits the usability of the informal information.Formal descriptions of the design (and its rationale) with a precise semantics can

overcome these communication problems. It can even allow us to modify the designrather than the code (cf. [4]) and, thus, to modify a software system using semiauto-matic tools. We conjecture that transformational development of software (see [12, 13])guided by performance criteria (see [11]) is the right way to get such descriptions ofa design.In such an ideal setting there would be no reason for reverse engineering. However,

the large amount of existing software that has to be maintained forces us to face thisproblem.Current reverse engineering technology focuses on regaining information by using

analysis tools (cf. [1]) and by abstracting programs bottom-up by recognizing plansin the source code (cf. [9, 14, 15, 19]). The major purpose of such tools essentially isto aid maintainers understand the program (cf. [16]). While we believe in the utilityof these approaches as part of a larger toolkit and vision, they are quite inadequatefor the task of design recovery. Analysis tools, though valuable, only provide somedesign information derived from the structure of the code, not from its intention orconstruction. Pure plan recognition is unlikely to be powerful enough by itself to drivereverse engineering for the following reasons:

• all the necessary plan patterns in all the variations must be supplied in advance fora particular application,

• di�erent abstract concepts can map to the same code within one application (theright abstraction choice is impossible without a knowledge of the intended programfunction), and

• legacy code often has tricky optimizations which may not be the simple compositionof multiple plans.

Even more important, a common reason for doing reverse engineering is to supportmodifying the software system; understanding it is only a necessary precondition to doso. We consider transformational development with modi�cation of the design ratherthan the code as the means to accomplish incremental modi�cation. Considerable theoryand techniques have been developed to carry this out (see [2, 3]). For existing systems,this approach implies a need to reconstruct a plausible transformational design thatcould have been used to derive the code from a suitable abstract speci�cation of the

Page 3: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 133

system, i.e. we want to apply the transformations backwards from the program to itsspeci�cation. Recording these transformations (together with the rationale for applyingthem) then allows us to modify the design instead of just the program code.In the following, we explore forward and reverse engineering in greater detail and

provide an example for transformational reverse engineering legacy assembler code toa high-level concept.

2. Forward engineering

In current forward engineering practice, informal requirements are somehow con-verted into a semiformal speci�cation using domain notations lacking precise un-derlying semantics, such as data- ow diagrams, entity-relationship diagrams, naturallanguage descriptions, or other problem-speci�c informal or semiformal notations. Theprogram then is constructed manually (i.e., in an error-prone way) from the speci�ca-tion by a creative agent, the programmer.Hidden in this activity is a set of obvious as well as non-obvious design decisions

about how to encode certain parts of the speci�cation using available implementationmechanisms to achieve performance criteria (the why of the design decisions). As anexample, a speci�cation fragment for associative retrieval with numeric keys may beimplemented using hash tables, achieving good system-response time. These decisionsare usually not documented.Over time, the program code is modi�ed to remove errors and to adapt the system to

changed requirements. The requirements may change to allow usage of alphanumerickeys and to be able to handle large amounts of data, and the implementation maychange to use disk-based B-trees. Unfortunately, these changes often occur withoutbeing re ected correctly in the speci�cation. With each such change, the gap betweenthe original speci�cation and the program increases. The result is a program withouta proper speci�cation and with untrustworthy design information (such as commentsdescribing hash tables!). The code becomes di�cult to understand and, thus, di�cultto maintain.To overcome this de�ciency, it is important to change the speci�cation �rst and then

re ect the changes in the program code. A necessary precondition for this is to havereliable information about the relationship between the speci�cation and the programcode. The design and its rationale describe the how and why of this relationship;however, they are not documented in current practice.There are two known approaches to reduce the gap between the speci�cation and

the program.The �rst one is the development of software by stepwise re�nement, introducing

intermediate descriptions of the system between the speci�cation and the �nal programcode. The intermediate descriptions should re ect the consequences of major designdecisions during the construction, which helps to understand the software and its design.However, this approach has some important drawbacks: the development steps are still

Page 4: Reverse engineering is reverse forward engineering

134 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

manual, they are too large not to contain hidden design decisions, and there is norationale directly associated with the steps. The relation between the descriptions canonly be established a posteriori, and there are a lot more documents to be maintainedfor incorporating changes of the system requirements.Despite its drawbacks, the stepwise re�nement approach provides a useful starting

point. Attacking its drawbacks leads us to the second approach: the transformationaldevelopment of software where the fairly large manual development steps are replacedby smaller, formal, correctness-preserving transformations, each of which re ects asmall design decision or optimization.A necessary prerequisite for this approach is to have a formal speci�cation. Any

domain notations used in this speci�cation (for general-purpose domains, such asdata- ow, as well as application-speci�c domains, such as money management) mustbe assigned precise underlying semantics. After doing so, the program code can bederived by making small implementation steps using formal, correctness-preservingtransformations leading from more abstract speci�cations to more concrete ones. The�nal program code then is correct by construction with respect to the formal func-tional speci�cation. Each transformation falls into one of three categories: re�nements(i.e., maps of concepts from an abstract level to a more concrete one), optimizations(i.e., maps to reduce the resources used at a level of abstraction according to somecriteria), and jittering transformations (i.e., maps to enable the application of re�ne-ments and optimizations). The decision to apply a particular transformation is thus acrucial design information. The applied transformations coupled with the rationale forselecting them is the design information that “explains” the program code. This col-lection of information is called the transformational design of the program code fromthe speci�cation.Although the selection of a transformation to apply to a certain speci�cation, in

general, is still a creative process, it is guided by the performance criteria to beachieved. This guided selection of the transformations allows the development processto be supported by a semiautomatic system that contains a large repertoire of avail-able transforms, which are applied to achieve or at least approach the performancecriteria.The transformations needed for practical software engineering comprise all those

design decisions implicitly used by current software developers, including the following(see [13, 16, 17]):

• Decomposition — Most problems can be decomposed into subproblems, which, ingeneral, can be individually solved in many di�erent ways. The actual hierarchicalstructure of the code represents only one particular choice from the set of possibledecompositions.

• Generalization=specialization — If di�erent components are similar, it may be pos-sible to construct a more general version that comprises both of them as specialcases. For e�ciency reasons, it may be a good choice to specialize a component fora certain (part of the) application.

Page 5: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 135

• Choice of representation — To be able to implement high-level data types in pro-gram code, it is often necessary to change its representation. A common example isthe representation of sets by lists.

• Choice of algorithm — High-level concepts can be realized by many di�erent al-gorithms. The choice of algorithm may be guided by performance criteria that isdocumented poorly or not at all.

• Interleaving — For e�ciency reasons, it may be useful to realize di�erent conceptsin the same section of code or the same data structure.

• Delocalization — Certain high-level concepts may be spread throughout the wholecode introducing distracting details in other concepts (see the study of the e�ects ofdelocalization on comprehension in [10]).

• Resource sharing — Interleaved code often allows di�erent concepts to share someresources, such as control conditions, intermediate data results, functions, names, andother computational resources.

• Data caching/memoization — If some data that has to be computed is needed oftenor its computation is expensive, then it is worthwhile to cache or memoize it.

• Optimizations — In order to satisfy memory or response time constraints, manydi�erent optimizations are used, including the following:◦ the folding and unfolding (inlining) of program code,◦ the use of algebraic properties of functions,◦ the merging of computations by composing or combining functions,◦ partial evaluation (often in the form of context-dependent simpli�cation),◦ �nite di�erencing,◦ result accumulation,◦ recursion simpli�cation and elimination,◦ loop fusion,◦ optimizations typically performed by good compilers (code motion, commonsubexpression elimination, partial redundancy removal), and

◦ domain-speci�c optimizations (e.g., minimizing a �nite-state machine recognizinga certain language).

The consequences of these design decisions may overlap and may be delocalizedduring the construction of the program code from the speci�cation. Whether thesemethods are carried out mechanically by a tool or informally by smart programmers,the resulting software systems are very di�cult to understand.The transformational software development approach has the advantage that it al-

lows the automatic recording of the design decisions made during the derivation of the�nal program code from the formal speci�cation. Provided the selection of the trans-formations that are applied during this development are guided by the performancecriteria to be achieved and the relation between the selection and the performancecriteria is recorded, we get the complete design together with its rationale as the prod-uct of software development. The code itself is just a byproduct that is correct byconstruction.

Page 6: Reverse engineering is reverse forward engineering

136 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

Recording this design information would allow us to modify the speci�cation insteadof the code and then to modify the design to get a new implementation of the modi�edspeci�cation. It is our belief that the formal nature of transformations makes suchdesign modi�cations supportable by a semiautomatic tool that increases the reliabilityand reduces the cost of maintaining the system.

3. Reverse engineering

If all existing software had been developed using a transformation system recordingthe design and its rationale, there would be no need for reverse engineering any code.However, we have to make a concession to reality. A huge amount of conventionallydeveloped software exists and such systems will continue to be developed for a longtime.These systems have errors, and demands continually arise for the enhancement of

their functional and performance requirements. Modi�cation is a necessity in this set-ting. This means that there is a fundamental need to understand the systems, which isa challenging task since they are often badly designed and have an incomplete, nonex-istent, or, even worse, incorrect documentation with little or no design information.Currently, a major approach to understanding consists of trying to recognize plans

(code fragments implementing programming concepts) bottom-up from the program toa more abstract description (cf. [7, 8, 14]). Essentially, this is done by pattern matchingon an internal representation of the code (e.g., an abstract syntax tree) to recognizepatterns of higher-level plans or concepts in a lower-level code.This approach has a major drawback which we believe makes it insu�cient for

reverse engineering. It depends on having developed all (or at least all important) planinstances that may occur in the code that we want to understand. Unfortunately, theseplans can be (and have been) realized in many di�erent ways because of the interactionsbetween the many possible design decisions (i.e., transformations). Imagine a logicalarray of structures; it can be implemented as an array of structures or a structure ofarrays. Code to sort the array will look very di�erent depending on the choice of dataimplementation. A set of plans for sorting must somehow account for all the possibledi�erent representations of the data.Thus, to detect a plan in legacy code we have to be able to detect all these di�erent

realizations of the same plan. But, unfortunately, we need di�erent patterns to detectthese di�erent realizations; and all these patterns have to exist in advance to be able toreverse engineer the code, which is impossible because of the multitude of conceivablerealizations all solving the same problem.A single implementation may in fact represent any of several possible plans. An “INC

Memory” assembly instruction may implement the “Add One” abstraction, the “Storea One (knowing a value is already zero)” abstraction, the “Complement the LSB”abstraction, or several other possibilities. Without understanding the overall programfunctionality, the proper detection (choice) of plan cannot be made.

Page 7: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 137

Worse, we must take into account that the plans we want to recognize are interleaved,delocalized (cf. [10]), and share resources. So we must �nd plan patterns in the codethat are intertwined with other plan patterns, both of which may be scattered throughoutthe whole program code.Thus, to e�ectively perform the reverse engineering task we have to use some cre-

ativity to detect and separate the high-level concepts in the code; a task that, in general,cannot be performed in advance by providing a reasonable amount of prede�ned planpatterns even together with sophisticated pattern-matching rules.These problems can be overcome by two di�erent approaches. The �rst one is to

allow a reverse engineer to add, as necessary, more plans, patterns, and rules forinterwining the rules 1 to enable a plan recognizer to detect the actual realization inthe code. However, to be able to maintain the system, we are not only interested inthe plan that is realized in the code but also in the design and its rationale that couldhave led from the plan to its realization in the actual code. Thus, the reverse engineernot only has to provide the plan pattern for the actual realization of a plan but alsothe transformations and the justi�cations for applying them (i.e., he has to forwardengineer a piece of software and achieve a certain solution that has been used in thecode).As an alternative approach, we can achieve the same e�ect by using small trans-

formations backwards (these transformations may or may not already be provided bya forward transformation engine) and abstract the code step by step to produce anabstract plan (or a set of plans) corresponding to the given code. In essence, the trans-formations applied backwards “jitter” the code into a form in which canonical plansare more easily found.Such an abstract plan may already exist in the system or may be new domain

knowledge developed during the reverse engineering process. By recording the (reverse)transformation steps used to achieve that plan, we acquire a plausible design that can beused as a formal basis for maintaining the reverse engineered system. Doing so avoidsthe need to repeatedly reverse engineer the system over its lifetime as modi�cationsare made to the code.As a side bene�t, the transformational reverse engineering approach lets us acquire

domain knowledge as well as capture a possible design. It is not necessary to de�ne thedomain knowledge in advance. Samples include new concepts identi�ed by the reverseengineer along with implementations found in the code. This information can also beused to develop and maintain other systems (i.e., we gain additional knowledge thatcan be used by a forward engineer).While we argue that transformational reverse engineering should be much more

e�ective than pure plan recognition, plan recognition has its place. First, the samepattern-matching technology on which plan recognition is based is used by any rewritecomponent of transformation systems (see Fig. 1). Second, plans may be able to “tile”

1 Note that the intertwining rules essentially are forward transformations. These transformations must beapplied to existing patterns to derive new intertwined patterns.

Page 8: Reverse engineering is reverse forward engineering

138 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

Fig. 1. Simple model of transformational reverse engineering.

a signi�cant part of an existing system, providing strong clues about some aspectsof the code. These clues can be used as clues for guiding the reverse applicationof transformations. As an example, �nding a computation that maps a string to anatural number is a hint that hashing is occurring, and hash-table-related (reverse)transformations should be attempted.Comparing transformational reverse engineering with transformational forward engi-

neering, we see that both need domain knowledge and a semiautomatic transformationengine guided by some additional criteria to reduce the e�ort needed to develop theproduct (i.e., the system design).Thus, we conclude that:

Reverse engineering needs the same knowledge and infrastructure as forwardengineering.

Reverse engineering “just” applies the forward transformations backwards (i.e., inreverse).

4. A legacy reverse engineering example: semaphores

In this section, we demonstrate that transformational reverse engineering is feasiblein practice by applying it manually to a real example. Our purpose here is not tospeci�cally carry out this task, or to provide precise algorithms, but rather to highlightthe various decisions and how to handle them transformationally. In particular, wewant to make the point that plan recognition by itself cannot handle this task, and that

Page 9: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 139

reverse transformation can. We do not claim that this process is automatic. We assumethat the reverse engineer has considerable informal problem domain knowledge, andincrementally provides guidance to a tool that records the transformations used for laterinspection and maintenance.The example is hand-coded, highly tuned legacy Motorola 6809 assembler code from

a 1980s operating system written by one of the authors. We will reverse engineer fromthe assembly source back to the abstraction the code represents. The domain knowledgeused by the engineer is related to the operating system code, speci�cally to semaphores.This small bit of information provides considerable focus to the process.The goal of the code is to release a semaphore resource to allow another waiting,

possibly higher-priority task to eventually acquire the semaphore and continue.Having some domain knowledge about task synchronization (including at least syn-

chronization with semaphores) and not being interested in how the highest-priority taskgets the processor to run we expect to see something like the following abstract code:

procedure RT UnLock(x: ptr to semaphore)begin // unlock specified semaphoreV(x); // release a resource unitreturn; // return to original caller

end;

The actual code fragment we intend to reverse-engineer is provided in Fig 2. (Thereader need not examine it carefully, as we abstract it immediately on the next page.)Our smart reverse engineer determines that it is the code he is interested in by decodingthe assembly labels, and doing some control ow analysis.

Fig. 2. Legacy Motorola 6809 assembler code fragment.

Page 10: Reverse engineering is reverse forward engineering

140 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

The code comments here are original. (The comments in the transformed code areinformally derived from these. We do not really expect a transformation system to dothis.)This code contains several design decisions some of which we consider in greater

detail below. For this code, it is important to know that releasing a semaphore re-source is interleaved with priority scheduling; for e�ciency reasons the current task“voluntarily” gives up the processor provided it is no longer the highest-priority task.Thus, in order to be able to reverse engineer this code we must have (or develop)some knowledge about priority scheduling (a problem that goes beyond the scope ofthis paper).Since the gap between the assembler code and the procedural program code we

want to abstract to is pretty large, we make a �rst step backwards using an interme-diate domain notation that is sometimes used by compilers to translate from higher-level procedural code to assembler: a register transfer language (RTL). To do that,we can apply the following straightforward TransformSet of re�ning transformationsbackwards:

call subroutine; ⇒ jsr subroutinegoto target; ⇒ jmp targetreturn; ⇒ rtsgoto x+offset; ⇒ jmp offset, xx:=address(variable); ⇒ ldx #variablex:=variable; ⇒ ldx variablevariable:=x; ⇒ stx variablex:=x->offset; ⇒ ldx offset, xd:=x->offset; ⇒ ldd offset, x(x->offset):=d; ⇒ std offset, xint enable:=true; ⇒ intenint enable:=false; ⇒ intds(x->offset):=x->offset+1;cc:=compare(x->offset,0); ⇒ inc offset,x

cc:=compare (x, variable); ⇒ cmpx variableif cc=0 then goto label fi; ⇒ beq labelif cc>0 then goto label fi; ⇒ bgt label

These transformations are M6809 speci�c. Obviously, these transformations are sim-ple pattern-matching rules that can be applied to existing assembler code simply andautomatically. The design history must capture the application of all of these; it can cap-ture them as a set rather than a linear sequence since there is no real order requirement.Before looking at the resulting RTL-code for our example let us introduce four

simple optimizing transformations (which are equivalence transformations and, thus,

Page 11: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 141

may be used in both directions for our purpose):

[common subexpression elimination]code(expression)

⇒ variable:=expression; code(variable)where variable is a new variable not occurring anywhereelse in the program code and there is no assignment to avariable occurring in expression within code

[register life-time renaming]newvariable:=expression1;

code(newvariable);variable:=expression2;

⇒ variable:=expression1;code(variable);variable:=expression2;

where newvariable is a new variable not occurring any-where else in the program

[dead assignment elimination]variable:=expression; ⇒ �

if variable is not used again in the code execution beforea value is assigned to it the next time

[compare introduction](expression1 relation expression2)

⇒ compare(expression1,expression2)relation 0

These transformations are useful for most assembly reverse engineering tasks. Weremark that the simple model of structural pattern-matching used by most plan recog-nition engines does not work for these; data- ow analysis is required (see [14]).Applying the re�ning and optimizing transformations backwards may lead us to the

RTL-code depicted in Fig. 3.To be able to reverse engineer further, we use our domain knowledge to detect some

non-obvious optimizations that have been used during the development of the code:• The statement goto RT SInt; actually does not indicate a continuation of RT ITSCbut is an optimized call of a subroutine with a following return;. Thus it has tobe replaced by the statements call RT SInt; return;.

• RT ITSC itself is a subroutine that happened to be located at the end of RT UnLockbut may be called from another place in the program code (which is not shown inthis program fragment). This means that we have to replace the control ow from

Page 12: Reverse engineering is reverse forward engineering

142 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

Fig. 3. The legacy assembler code transformed to register transfer language (RTL).

RT UnLock to RT ITSC, that is caused by incrementing the program counter in theprocessor, by a call to RT ITSC as a subroutine followed by a return;.

• RT ITSX is a continuation for both RT UnLock and RT ITSC under certain con-ditions. It is also a parameterless subroutine by itself. Thus jumps to it (using gotoRT ITSX;) have to be replaced by call RT ITSX; return;. For the sake ofsimplicity we inline the simple code of this procedure into the caller.

• There is a hidden convention in the code for the parameters to the proceduresRT UnLock and RT ITSC (and RT SInt) which are placed in the register X.To be able to reverse engineer to parametric procedures we have to detect thisfact. Then we can replace label: code return; by procedure label(x)begin code return; end; provided there is no goto-statement in the wholeprogram code entering or leaving the body of the procedure, i.e. code. This includes agoto-statement to label. Correspondingly, for these procedures we replace calllabel; on the caller sides by call label(x);.

Additionally, by using the obvious equivalence transformation

[if–then–else with procedure return]if condition then code1; return; fi;

code2; return;⇒ if condition then

code1;else

code2;fi;return;

if there is no goto-statement entering or leaving code1or code2

Page 13: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 143

Fig. 4. Unoptimized register transfer language code.

and some of the optimizing transformations given above, we get the program codepresented in Fig. 4.Now let us try to gain a deeper insight into the procedure RT ITSC. To do this,

we have to know something about the organization of the ready queue (which is notpossible by just looking at the program fragment in this paper):

The tasks in the ready queue are ordered with respect to their priority (i.e.; thehighest-priority task is at the beginning of the queue whereas the lowest-prioritytask can be found at the end of the queue). The pointer to the whole queue is thusidentical to the pointer to the �rst (highest priority) task in the queue.

Examining the procedure RT ITSC, we see a check whether the current taskRT CTCB is the same as the highest-priority task at the head of the ready queueRT TCBQ. Thus, the task scheduler (which switches the program context to executethe highest-priority task) is called if and only if the highest-priority task is not thecurrent task (which called the procedure). This means that the solution re ects thata context switch to another task is expensive and therefore should be avoided if theother task to be executed is the current one. However, we are not interested in a highlye�cient program code, but rather in an abstract description of the procedure (i.e., wedo not have to avoid the call to the scheduler for e�ciency reasons). With the ad-ditional knowledge that int enable=true after calling RT SInt, we can simplifyRT ITSC as shown in Fig. 5.Inlining 2 the procedure RT ITSC into RT Unlock allows us to move the statement

int enable:=true; from inside the if-statement to the end of the procedure. 3

2 This inlining does not mean that the procedure RT ITSC becomes obsolete. The procedure may be (is)called from additional places in the operating system.3 Note, that int enable=true is valid after calling RT SInt which means that we can add the state-

ment int enable:=true; directly following the call without changing the meaning of the program code.

Page 14: Reverse engineering is reverse forward engineering

144 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

Fig. 5. Unoptimized task switching code for a RT ITSC.

Fig. 6. Simpli�ed code for a RT Unlock.

Fig. 7. Code for a RT UnLock with recovered data types.

Performing some further equivalence transformations on the if-statement and its con-dition we obtain the code as shown in Fig. 6.Because of space constraints, we cannot present the details of further reverse en-

gineering of this code. However, the next step has to recover the abstract data typesused, such as “semaphore” and “task” together with their associated procedures. Doingthis could �nally lead us to the code given in Fig. 7.Abstracting this in a task synchronization domain should then lead us to the expected

abstract code that has been presented at the beginning of this section.Recording all the transformations used in this process allows us to reapply them to

this code to reproduce the original code. We note that replacing the transformationsthat intertwine priority-scheduling code with spin-lock code will allow us to regeneratea revised semaphore unlock procedure for multiple processors. This is precisely thekind of maintenance task that we desire reverse engineering to support.

Page 15: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 145

Now let us brie y consider a big problem that may arise when trying to reverseengineer the code using plan patterns and look speci�cally at the priority scheduling.Let us assume that there exists some domain knowledge and associated plan-patterns

about priority scheduling. On an abstract level we may �nd, for example, the followingprocedural plan for adding an additional task into the ready queue and (re)schedulingtasks:

Task.InsertIntoReadyQueue(task,priority);Task.RunHighestPriorityTask();

However, in the assembler code we found something that roughly corresponds to thefollowing re�ned plan (that demands the usage of a certain class of implementationsfor priority queues):

Task.InsertIntoReadyQueue(task,priority);if Task.Current()

=Task.FirstTaskInReadyQueue() thenTask.ContinueCurrent();

elseTask.ContinueHighestPriorityTask();

fi;

Who produced=produces this re�ned plan? Can we expect it to be part of an existingdomain about priority scheduling? Which additional re�nements of the original plando we need to cover solutions used in other legacy codes for priority scheduling?The answer to these questions is: No one can provide in advance all the possible plan

re�nements that may be found in legacy code. This means that we need a creative-plan-re�nement generator (i.e., a semiautomatic forward transformation engine). But then,why do we not just generate the plan-re�nement backwards (on demand) by reverseengineering the code using the forward transformations in the reverse direction?

5. Related work

Considerable e�ort has been invested in plan recognition. The programmer’s appren-tice (PA) (see [14]) is still one of the most sophisticated, representing plans (“clich�es”)using data and control ow. This representation provides some language independenceand maximizes the ability to match a clich�e against arbitrarily ordered code. PA essen-tially attempts to tile (with overlaps) a program recursively with cliches to explain it.We hope our example convincingly shows that recursively tiling cannot explain evensmall fragments of code.

Page 16: Reverse engineering is reverse forward engineering

146 I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147

A di�cult problem with a large set of plans is the cost of matching them to apotentially very large program. E�cient graph matchers (see [15]) and constraint sat-isfaction solvers (see [21]) have been implemented to help match plans against code.Both of these types of matchers can be used to good e�ect in a transformation system.However, we hope to avoid the “big-bang” problem of explaining the entire programat once by performing incremental recovery, storing the results of previous sessionsin design histories, and reusing the design for maintenance. The maintainer’s assistant(MA) (see [20]) attacks the software maintenance problem by using transformationalmethods, using a core representation for programs derived from the lambda calculus.This gives an excellent foundation with clear semantics, and so allows transformationsto be applied to the original code. MA appears to be used for porting and translationrather than explanation and modi�cation.

6. Concluding remarks

A major reason for performing reverse engineering is to maintain legacy code. There-fore, it should not be focused on program understanding, but on system maintenanceinstead. This should be done in a way that frees us from reverse engineering a systemagain and again because of modi�cations made to its code over time.Hence, reverse engineering for maintaining legacy code has to be focused on de-

sign recovery. As we need sophisticated tool support to perform system maintenancee�ciently, the recovered design has to be based on formal methods.We have shown that transformational reverse engineering is a feasible approach to

recover such a design. We essentially need the same domain knowledge and infrastruc-ture as needed for transformational forward engineering, in which a design is producedduring system construction. All the domains, their notations, and the associated trans-formations within and between domains needed for forward engineering are also neededfor reverse engineering legacy code to recover its missing design.Transformational reverse engineering can easily be performed incrementally to allow

us to reverse engineer only those parts of the code that are needed for an actual main-tenance task. Over time, more parts of the code will be abstracted and the abstractionlevel will become higher until we possibly get an abstract requirements speci�cationof the whole system.All this can be supported e�ectively by a domain-based transformation engine that

can apply transformations forwards and backwards and can modify a design incremen-tally. Such a system, called the design maintenance system (DMS), which is intendedfor handling commercial-size software, is currently under development at SemanticDesigns (www.semdesigns.com).

References

[1] V.R. Basili, H.D. Mills, Understanding and documenting programs, IEEE Trans. Software Eng. 8 (3)(1982) 270–283.

Page 17: Reverse engineering is reverse forward engineering

I.D. Baxter, M. Mehlich / Science of Computer Programming 36 (2000) 131–147 147

[2] I.D. Baxter, Transformational maintenance by reuse of design histories, Ph.D. Thesis, University ofCalifornia, Irvine, 1990.

[3] I.D. Baxter, Design maintenance systems, Commun. ACM 35 (4) (1992) 73–89.[4] I.D. Baxter, Design (not code!) maintenance, in: ICSE-17 Workshop on Program Transformation for

Software Evolution, 1995.[5] B. Boehm, Software Engineering Economics, Prentice-Hall, Englewood Cli�s, NJ, 1981.[6] T. Guimaraes, Managing application program maintenance expenditures, Commun. ACM 26 (10) (1983)

739–746.[7] J. Hartman, Understanding natural programs using proper decomposition, in: 13th International

Conference on Software Engineering, 1991.[8] J. Hartman, Automatic control understanding for natural programs, Ph.D. Thesis, University of Texas,

Austin, 1991.[9] V. Kozaczynski, J.Q. Ning, A. Engberts, Program concept recognition and transformation, IEEE Trans.

Software Eng. 18 (12) (1992) 1065–1075.[10] S. Letovsky, E. Soloway, Delocalized plans and program comprehension, IEEE Software 3 (3) (1986)

41–49.[11] R.D. McCartney, Synthesizing algorithms with performance constraints, Ph.D. Thesis, Brown University,

1987.[12] J.M. Neighbors, The Draco approach to constructing software from components, IEEE Trans. Software

Eng. 10 (5) (1984) 564–574.[13] H.A. Partsch, in: Speci�cation and Transformation of Programs: A Formal Approach to Software

Development, Springer, Berlin, 1990.[14] C. Rich, R.C. Waters, The Programmer’s Apprentice, ACM, New York, 1990.[15] C. Rich, L. Wills, Recognizing a program’s design: a graph parsing approach, IEEE Software 7 (1)

(1990) 82–89.[16] S. Rugaber, White paper on reverse engineering, Georgia Institute of Technology, 1994.[17] S. Rugaber, K. Stirewalt, L. Wills, The interleaving problem in program understanding, in: Working

Conference on Reverse Engineering, 1995, pp. 166–175.[18] T. Tamai, Y. Torimitsu, Software lifetime and its evolution process over generations, Proceedings of

1992 Conference on Software Maintenance, IEEE Computer Society Press, Silver Spring, MD, 1992.[19] P. Tonella, R. Fiutem, G. Antoniol, E. Merlo, Augmenting pattern-based architectural recovery with ow

analysis: Mosaic – a case study, in: Working Conference on Reverse Engineering, 1996, pp. 198–207.[20] M.P. Ward, K.H. Bennett, Formal methods for legacy systems, J. Software Main. Res. Practice 7 (3)

(1995) 203–220.[21] S. Woods, A. Quilici, Some experiments toward understanding how program plan recognition algorithms

scale, in: Working Conference on Reverse Engineering, 1996, pp. 21–30.


Top Related