Reverse engineering is reverse forward engineering








  • Science of Computer Programming 36 (2000) 131-147

    Reverse engineering is reverse forward engineering

    Ira D. Baxter, Michael Mehlich

    Semantic Designs, Inc., 12636 Research Blvd, #C-214 Austin, TX 78759-2200, USA


    Reverse Engineering is focused on the challenging task of understanding legacy program code without having suitable documentation. Using a transformational forward engineering perspective, we gain the insight that much of this difficulty is caused by design decisions made during system development. Such decisions "hide" the program functionality and performance requirements in the final system by applying repeated refinements through layers of abstraction, and information-spreading optimizations, both of which change representations and force single program entities to serve multiple purposes. To be able to reverse engineer, we essentially have to reverse these design decisions. Following the transformational approach we can use the transformations of a forward engineering methodology and apply them "backwards" to reverse engineer code to a more abstract specification. Since most of the existing code was not generated by transformational synthesis, this produces a plausible formal transformational design rather than the original authors' actual design. As an example, a small fragment of a real-time operating system is reverse-engineered using this approach. A byproduct of the transformational reverse engineering process is a design database for the program that then can be maintained to minimize the need for further reverse engineering during the remaining lifetime of the system. A consequence of a transformational forward engineering perspective is the belief that the standard plan recognition methods proposed for reverse engineering are not sufficient. © 2000 Elsevier Science B.V. All rights reserved.

    1. Introduction

    Software engineering practice tends to focus on the design and implementation of a software product without considering its lifetime, usually 10 years or more (see [18]). However, the major effort in software engineering organizations is spent after development (see [5, 6]) on maintaining the systems to remove existing errors and to adapt them to changed requirements.

    Unfortunately, mature software systems often have incomplete, incorrect, or even nonexistent design documentation. This makes it difficult to understand what the system is doing, why it is doing it, how the work is performed, and why it is coded that way.

    Corresponding author. E-mail addresses: (M. Mehlich), (I.D. Baxter)

    0167-6423/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0167-6423(99)00034-9


    Consequently, mature systems are hard to modify and the modifications are difficult to validate.

    For newly developed systems, the problem can be reduced by thoroughly documenting the system and maintaining the documentation together with the system. Ideally, the system documentation describes the product and the complete design, including its rationale.

    Most captured design information is informal (not machine-interpretable). While this is valuable for the software maintainers, informal information is subject to a wide variety of interpretations. These varied interpretations create communication problems between developers, which limits the usability of the informal information.

    Formal descriptions of the design (and its rationale) with precise semantics can overcome these communication problems. They can even allow us to modify the design rather than the code (cf. [4]) and, thus, to modify a software system using semiautomatic tools. We conjecture that transformational development of software (see [12, 13]) guided by performance criteria (see [11]) is the right way to get such descriptions of a design.

    In such an ideal setting there would be no reason for reverse engineering. However, the large amount of existing software that has to be maintained forces us to face this problem.

    Current reverse engineering technology focuses on regaining information by using analysis tools (cf. [1]) and by abstracting programs bottom-up by recognizing plans in the source code (cf. [9, 14, 15, 19]). The major purpose of such tools is essentially to help maintainers understand the program (cf. [16]). While we believe in the utility of these approaches as part of a larger toolkit and vision, they are quite inadequate for the task of design recovery. Analysis tools, though valuable, only provide some design information derived from the structure of the code, not from its intention or construction. Pure plan recognition is unlikely to be powerful enough by itself to drive reverse engineering for the following reasons:

    - all the necessary plan patterns in all the variations must be supplied in advance for a particular application,

    - different abstract concepts can map to the same code within one application (the right abstraction choice is impossible without a knowledge of the intended program function), and

    - legacy code often has tricky optimizations which may not be the simple composition of multiple plans.
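
    The first two reasons can be made concrete with a small sketch. It assumes a hypothetical, exact-match plan library; the names `PLANS` and `recognize` are illustrative only and are not taken from any of the cited recognizers, which typically match program structure rather than strings.

```python
# Hypothetical plan library: (abstract concept, code pattern that realizes it).
PLANS = [
    ("set membership test", "for e in xs: if e == k: return True"),
    ("is the key bound in the table?", "for e in xs: if e == k: return True"),
    ("bit-set union", "a | b"),
]


def recognize(code: str) -> list[str]:
    """Return every abstract concept whose pattern equals the given code."""
    return [concept for concept, pattern in PLANS if pattern == code]


scan = "for e in xs: if e == k: return True"
print(recognize(scan))     # two candidate abstractions for one fragment
print(recognize("a ^ b"))  # no plan was supplied in advance for this variation
```

    In this toy setting the scanning loop matches two plans, and nothing in the code itself says which abstraction was intended; the second fragment matches none because no pattern for it was supplied beforehand.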

    Even more important, a common reason for doing reverse engineering is to support modifying the software system; understanding it is only a necessary precondition to do so. We consider transformational development with modification of the design rather than the code as the means to accomplish incremental modification. Considerable theory and techniques have been developed to carry this out (see [2, 3]). For existing systems, this approach implies a need to reconstruct a plausible transformational design that could have been used to derive the code from a suitable abstract specification of the system, i.e., we want to apply the transformations backwards from the program to its specification. Recording these transformations (together with the rationale for applying them) then allows us to modify the design instead of just the program code.

    In the following, we explore forward and reverse engineering in greater detail and provide an example of transformational reverse engineering of legacy assembler code to a high-level concept.
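
    The idea of applying a forward transformation backwards can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: terms are nested Python tuples, variables are strings starting with `?`, and the rule `SET_AS_LIST` as well as the helpers `match`, `substitute`, `rewrite` and `reverse_engineer` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    abstract: tuple   # specification-level pattern; "?x" marks a variable
    concrete: tuple   # implementation-level pattern
    rationale: str


def match(pattern, term, env):
    """Extend env so that pattern equals term; return None if impossible."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env:
            return env if env[pattern] == term else None
        return {**env, pattern: term}
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and len(pattern) == len(term)):
        for p, t in zip(pattern, term):
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return env if pattern == term else None


def substitute(pattern, env):
    """Instantiate the variables of pattern with their bindings in env."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        return env[pattern]
    if isinstance(pattern, tuple):
        return tuple(substitute(p, env) for p in pattern)
    return pattern


def rewrite(term, lhs, rhs):
    """Rewrite the first (outermost, leftmost) match of lhs; None if no match."""
    env = match(lhs, term, {})
    if env is not None:
        return substitute(rhs, env)
    if isinstance(term, tuple):
        for i, sub in enumerate(term):
            new = rewrite(sub, lhs, rhs)
            if new is not None:
                return term[:i] + (new,) + term[i + 1:]
    return None


def apply_forward(rule, term):
    return rewrite(term, rule.abstract, rule.concrete)


def apply_backward(rule, term):
    return rewrite(term, rule.concrete, rule.abstract)


def reverse_engineer(code, rules):
    """Apply rules backwards until none matches; record each step taken."""
    design, changed = [], True
    while changed:
        changed = False
        for rule in rules:
            abstracted = apply_backward(rule, code)
            if abstracted is not None:
                design.append((rule.name, rule.rationale))
                code, changed = abstracted, True
    return code, design


# Assumed example refinement: a set is represented by a list and searched linearly.
SET_AS_LIST = Rule(
    name="set-as-list",
    abstract=("member", "?x", "?s"),
    concrete=("linear_search", "?x", ("as_list", "?s")),
    rationale="small sets; no hashing support available",
)

code = ("if", ("linear_search", "key", ("as_list", "pending")), "signal", "skip")
spec, design = reverse_engineer(code, [SET_AS_LIST])
print(spec)
print(design)
```

    The recovered steps form only a plausible design: the sketch cannot tell whether the original programmer actually thought in terms of sets, which is exactly the caveat stated above.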

    2. Forward engineering

    In current forward engineering practice, informal requirements are somehow converted into a semiformal specification using domain notations lacking precise underlying semantics, such as data-flow diagrams, entity-relationship diagrams, natural language descriptions, or other problem-specific informal or semiformal notations. The program then is constructed manually (i.e., in an error-prone way) from the specification by a creative agent, the programmer.

    Hidden in this activity is a set of obvious as well as non-obvious design decisions about how to encode certain parts of the specification using available implementation mechanisms to achieve performance criteria (the why of the design decisions). As an example, a specification fragment for associative retrieval with numeric keys may be implemented using hash tables, achieving good system-response time. These decisions are usually not documented.

    Over time, the program code is modified to remove errors and to adapt the system to changed requirements. The requirements may change to allow usage of alphanumeric keys and to be able to handle large amounts of data, and the implementation may change to use disk-based B-trees. Unfortunately, these changes often occur without being reflected correctly in the specification. With each such change, the gap between the original specification and the program increases. The result is a program without a proper specification and with untrustworthy design information (such as comments describing hash tables!). The code becomes difficult to understand and, thus, difficult to maintain.

    To overcome this deficiency, it is important to change the specification first and then reflect the changes in the program code. A necessary precondition for this is to have reliable information about the relationship between the specification and the program code. The design and its rationale describe the how and why of this relationship; however, they are not documented in current practice.

    There are two known approaches to reduce the gap between the specification and the program.

    The first one is the development of software by stepwise refinement, introducing intermediate descriptions of the system between the specification and the final program code. The intermediate descriptions should reflect the consequences of major design decisions during the construction, which helps to understand the software and its design. However, this approach has some important drawbacks: the development steps are still manual, they are too large not to contain hidden design decisions, and there is no rationale directly associated with the steps. The relation between the descriptions can only be established a posteriori, and there are a lot more documents to be maintained for incorporating changes of the system requirements.

    Despite its drawbacks, the stepwise refinement approach provides a useful starting point. Attacking its drawbacks leads us to the second approach: the transformational development of software where the fairly large manual development steps are replaced by smaller, formal, correctness-preserving transformations, each of which reflects a small design decision or optimization.

    A necessary prerequisite for this approach is to have a formal specification. Any domain notations used in this specification (for general-purpose domains, such as data-flow, as well as application-specific domains, such as money management) must be assigned precise underlying semantics. After doing so, the program code can be derived by making small implementation steps using formal, correctness-preserving transformations leading from more abstract specifications to more concrete ones. The final program code then is correct by construction with respect to the formal functional specification. Each transformation falls into one of three categories: refinements (i.e., maps of concepts from an abstract level to a more concrete one), optimizations (i.e., maps to reduce the resources used at a level of abstraction according to some criteria), and jittering transformations (i.e., maps to enable the application of refinements and optimizations). The decision to apply a particular transformation is thus a crucial piece of design information. The applied transformations coupled with the rationale for selecting them are the design information that "explains" the program code. This collection of information is called the transformational design of the program code from the specification.

    Although the selection of a transformation to apply to a certain specification is, in general, still a creative process, it is guided by the performance criteria to be achieved. This guided selection of the transformations allows the development process to be supported by a semiautomatic system that contains a large repertoire of available transforms, which are applied to achieve or at least approach the performance criteria.

    The transformations needed for practical software engineering comprise all those design decisions implicitly used by current software developers, including the following (see [13, 16, 17]):

    - Decomposition: Most problems can be decomposed into subproblems, which, in general, can be individually solved in many different ways. The actual hierarchical structure of the code represents only one particular choice from the set of possible decompositions.

    - Generalization/specialization: If different components are similar, it may be possible to construct a more general version that comprises both of them as special cases. For efficiency reasons, it may be a good choice to specialize a component for a certain (part of the) application.

    - Choice of representation: To be able to implement high-level data types in program code, it is often necessary to change their representation. A common example is the representation of sets by lists.

    - Choice of algorithm: High-level concepts can be realized by many different algorithms. The choice of algorithm may be guided by performance criteria that are documented poorly or not at all.

    - Interleaving: For efficiency reasons, it may be useful to realize different concepts in the same section of code or the same data structure.

    - Delocalization: Certain high-level concepts may be spread throughout the whole code, introducing distracting details in other concepts (see the study of the effects of delocalization on comprehension in [10]).

    - Resource sharing: Interleaved code often allows different concepts to share some resources, such as control conditions, intermediate data results, functions, names, and other computational resources.

    - Data caching/memoization: If some data that has to be computed is needed often or its computation is expensive, then it is worthwhile to cache or memoize it.

    - Optimizations: In order to satisfy memory or response time constraints, many different optimizations are used, including the following:
        - the folding and unfolding (inlining) of program code,
        - the use of algebraic properties of functions,
        - the merging of computations by composing or combining functions,
        - partial evaluation (often in the form of context-dependent simplification),
        - finite differencing,
        - result accumulation,
        - recursion simplification and elimination,
        - loop fusion,
        - optimizations typically performed by good compilers (code motion, common subexpression elimination, partial redundancy removal), and
        - domain-specific optimizations (e.g., minimizing a finite-state machine recognizing a certain language).
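
    A few of these decisions can be seen side by side in a toy example. The sketch below is an assumption-laden illustration (the functions `spec_stats` and `impl_stats` and the sample data are invented for this purpose); it contrasts a specification-level formulation with an implementation that applies choice of representation, interleaving and resource sharing. Both assume a non-empty input.

```python
def spec_stats(readings):
    """Specification level: a set of distinct readings, then two separate queries."""
    distinct = set(readings)
    return sum(distinct), max(distinct)


def impl_stats(readings):
    """Implementation level: same result after several design decisions."""
    seen = []        # choice of representation: the set becomes a duplicate-free list
    total = 0
    largest = None
    for r in readings:   # interleaving: one loop builds the set, sums, and finds the maximum
        if r in seen:    # resource sharing: the membership test guards both computations
            continue
        seen.append(r)
        total += r       # result accumulation instead of a separate summation pass
        if largest is None or r > largest:
            largest = r
    return total, largest


data = [3, 1, 3, 7, 1]
print(spec_stats(data))
print(impl_stats(data))
```

    Even at this size, the concepts "set", "sum" and "maximum" are no longer separate in `impl_stats`; recovering them means undoing the individual decisions noted in the comments.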

    The consequences of these design decisions may overlap and may be delocalized during the construction of the program code from the specification. Whether these methods are carried out mechanically by a tool or informally by smart programmers, the resulting software systems are very difficult to understand.

    The transformational software development approach has the advantage that it allows the automatic recording of the design decisions made during the derivation of the final program code from the formal specification. Provided the selection of the transformations that are applied during this development is guided by the performance criteria to be achieved and the relation between the selection and the performance criteria is recorded, we get the complete design together with its rationale as the product of software development. The code itself is just a byproduct that is correct by construction.
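
    What such a recorded design might look like can be sketched in a few lines. This is a deliberately crude illustration under assumptions: specifications are plain strings, transformations are string-rewriting functions whose correctness is not checked, and the names `Step`, `Design`, `derive` and `replay_on` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    category: str                     # "refinement", "optimization", or "jittering"
    rationale: str                    # performance criterion that motivated the step
    transform: Callable[[str], str]   # assumed to be correctness-preserving


@dataclass
class Design:
    specification: str
    steps: list = field(default_factory=list)

    def apply(self, step: Step) -> None:
        """Record a design decision; the code is re-derived on demand."""
        self.steps.append(step)

    def derive(self) -> str:
        """Produce the code by replaying every recorded step on the specification."""
        code = self.specification
        for step in self.steps:
            code = step.transform(code)
        return code

    def replay_on(self, new_specification: str) -> "Design":
        """Reuse the recorded decisions for a modified specification."""
        return Design(new_specification, list(self.steps))


design = Design("lookup(key, table)")
design.apply(Step(
    name="assoc-as-hash",
    category="refinement",
    rationale="fast system-response time",
    transform=lambda s: s.replace("lookup(", "hash_lookup("),
))
print(design.derive())
print([(s.name, s.category, s.rationale) for s in design.steps])
print(design.replay_on("lookup(name, table)").derive())
```

    In this reading the list of steps with their rationale is the product, and the derived code is obtained from it; a changed specification is handled by replaying, and where necessary revising, the recorded steps.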


    Recording this design information would allow us to modify the specification instead of the code and then to modify the design to get a new implementation of the modified specification. It is our belief that the formal nature of transfo...

