the development of a partial design recovery environment for legacy systems

The Development of a Partial Design Recovery Environment fo rLegacy Systems *

K. Kontogiannis

M . Bernstein

E. Merlo

R. DeMori

McGill University3480 University St ., Room 318, Montreal, Canada H3A 2A 7

Abstract

Computer Assisted Program Understanding sys-tems take input primarily in the form of source codeand produce output representing system concepts insome useful form. This paper discusses the develop-ment of a program design recovery environment basedon structural and behavioral recognition of program-ming plans . Our research investigates program an dplan representation methods and the issues related t othe detection of code fragments using pattern match-ing techniques. In particular, we consider the integra-tion of diverse tools allowing selection from a choiceof strategies. Our investigation focuses on technique sto allow partial design recovery when complete recog-nition is not feasible . Finally, we discuss the underly-ing process paradigm, called "Goal-Question-Analysis -Action".

1 Introduction

Reverse engineering is concerned with the develop -ment of tools and techniques for understanding unfa-miliar code to facilitate system maintenance [5] . De-sign recovery, a fundamental task for software mainte-nance and reverse engineering, refers to the process i nwhich a program is analyzed in order to be understoodas a whole with respect to

• functional specifications ,

• input parameters ,

• expected output ,

• performance, and

*This work is funded by IBM Canada Ltd . Laboratory -Center for Advanced Studies (Toronto) . Parts of this paperappear at "Program Representation and Behavioural Matchin gFor Localizing Similar Code Fragments " paper of CASCON '93Proceedings .

• the software and hardware environment in whichthe system runs .

The objective is to discover features such as the or-ganization of program structure (how procedures orsubmodules are organized), run-time functionality ofthe program (how and in what order modules are in-voked), parameter passing, aliases, side effects, an dmeaningful system abstractions .

The primary input is source code which is repre-sented usually as a flat text file. The file may con-tain a variety of forms of informal information suchas comments, I/O commands (e .g ., print statements) ,indentation, and meaningful variable names . Unfortu-nately, informal information provides insufficient dat afor analyzing complex program features. It is neces-sary to apply other techniques which operate on th elevel of abstract syntax trees (AST) or other progra mrepresentation schemes .

In the past few years several research groups havefocused their efforts on the development of tools andtechniques for program understanding and progra mrestructuring. Research issues include the develop-ment of formalisms for representing program struc-ture, control and data flow analysis, and technique sfor visualizing program execution . Design recovery re-search has been concentrated mostly on program rep-resentation and plan localization .

For real applications, however, additional issue smust be addressed . Practical problems include th elimited number of programming plans (templates) ,the difficulty of analyzing unstructured code, and th ecomplexity of plan localization algorithms . Together ,these problems make program understanding applica-tions rigid and unable to cope with large-scale systems .We take an approach in which methods and tools fo rdesign recovery are conceived based on a set of strate-gic but specific objectives .

We introduce the idea of Goal Directed Design Re-covery in which the representation method, the leve l

206

of abstraction, the appropriate analysis tool, and thecontrol strategy are dictated by the objectives an dprogram attributes that the maintainer wishes to re-cover. Depending on the particular design recoveryobjective, specific techniques include

• program flow analysis ,

• program equivalence relations (using forma lmethods), an d

• knowledge-based techniques (heuristics for com-plexity reduction, domain knowledge) .

At lower levels of abstraction, the analysis proces sexamines features derived directly from the static cod estructure . Issues considered at this level of abstrac-tion include problems such as syntactic variations, im-plementation variations, overlapping implementation ,unrecognizable code (or partial recognition), and dif-fuse structures . In general, they are issues regard-ing the identification of relationship among code frag-ments .

At higher levels of abstraction, behavioral analysisis used to recognize abstract functionalities in progra mfragments . Approaches for understanding behaviora laspects of a program are derived primarily from lan-guage semantics . Denotational and operational se-mantics provide a formal description of program be-havior and can be used as a basis to represent an drecognize semantic cliches .

In this research project a number of different re-verse engineering tools and methodologies must be in-tegrated into a single reverse engineering environment .The environment must be able t o

• perform clustering,

• represent the program's structural relationships ,

• allow for the definition of useful metrics, and

• perform selection operations based on user -defined criteria .

This yields a number of research issues that must b eaddressed, including the development of program rep-resentation schemes, the development of compariso nalgorithms, the development of repositories for soft -ware components, and the integration of user-interfacetechnology to browse, navigate, and search large col-lections of software artifacts .

Within this framework, we investigate technique sfor performing partial program design recovery. Be-cause of the complexity of the problem, partial de -sign recovery is a more realistic objective than com-pletely automated design recovery and can be very

useful when used by an experienced maintainer. Thetechniques we investigate focus on three major issues :

1. program representation schemes using languagesemantics and information obtained from the pro-gram's abstract syntax tree ;

2. comparison methods based on structural and be-havioral program properties ; and

3. interface programs for integrating different re-verse engineering tools and environments in on eoperational system using a common softwarerepository .

In the following sections we discuss the structureand functionality of such an integrated reverse engi-neering environment and we investigate the relevan tresearch issues .

2 An Integrated Reverse EngineeringEnvironment

Reverse engineering is a process that produces high-level descriptions of software from lower-level repre-sentations . Usually this means extracting design fro ma program's source code. Once design decisions ar erecognized, they are then organized into a form thatis convenient for the developer . This process requiresthe integration of different technologies, environments ,and analysis tools . In developing an integrated revers eengineering environment, it is necessary to address is -sues of

1. program representation ,

2. definition of algorithms and methodologies to ex-tract system abstractions ,

3. design of software repository systems ,

4. user-interface technology to browse, navigate, andsearch large collections of software artifacts, and

5. the definition of process models for reverse engi-neering .

Current reverse engineering research focuses on th edevelopment and integration of several different tech-niques and methodologies. There exist a few reverseengineering environments on the market and in re -search laboratories which support reverse engineeringfunctions that are limited in scope to syntactic andlexical analysis of the source code. The objective i nour research is to extend the range of available ap-proaches for reconstructing design requirements . Inparticular, we are considering

207

• program abstraction schemes ,

• approximate pattern matching (with respect t ostored code descriptions) ,

• interactive sessions with programmers ,

• graph theoretic properties [14], an d

• repositories for storing software components andresults of software analysis .

Techniques based on compiler technology, such asparsers, flow analysis methods, slicing, and dicing ar eused to build abstract syntax trees and for performin ga number of analysis tasks (e .g ., constant propagatio nanalysis, value range analysis) .

Database technology is used to develop softwar erepositories that are used for storing code descriptions ,documentations, test data, results of previous analy-sis, and other relevant information .

Reverse Engineering environments such as REFIN Eare used to provide a prototyping framework and a

programming environment for developing reverse en-gineering applications. These environments can b every flexible, incorporating a variety of programmin gparadigms (such as object oriented, logic, and func-tional programming) and facilitating the constructio nof appropriate user interfaces .

Finally, pattern matching technology is used to de-tect common programming plans in source code frag-ments .

Within this framework we are working towards th edevelopment of a reverse engineering environment tha tsupports not only the identification of system's com-ponents and their dependencies but also the extractio nof system abstractions, including design and require-ments information .

Some of the issues that must be address when de-signing a reverse engineering environment include

• communication among the different tools ,

• definition of recoverable design features and thei rapproprate level(s) of abstraction ,

• development of representation schemes for high-lighting software attributes, an d

• the development of appropriate comparison tech-niques .

Abstract Syntax Trees provide the basic structur efor program representation because they serve as in-puts of a number of flow analysis algorithms and canbe annotated with higher-level semantic information .

1 REFINE is a trademark of Reasoning Corp .

The process of discovering and identifying systemabstractions involves the development of a mentalmodel describing the function of the system at dif-ferent views. In [8] four different levels of programviews are identified :

1. Implementation, represented as an Abstract Syn-tax Tree (AST) and a symbol table of programtokens ;

2. Structural, which gives an explicit representatio nof the dependencies among program components ,

3. Functional, which relates parts of the program bytheir functions and shows logical relations amongthem; and

4. Domain, which replaces items in the functiona lview by concepts specific to the application do-main .

The implementation-level program view in ou rsystem is achieved by the use of a powerfulparser/generator [12] which creates an AST stored asa collection of objects in a software repository. Thesystem includes a very high level language (VHLL) forquerying and updating the object base and for manip-ulating the tree.

The structural view is implemented using tools [14 ]which :

1. parse the target program to extract relevant sys-tem components and dependencies, storing themin a repository ,

2. allow the user to generate hierarchies of subsys-tems ,

3. determine interfaces among the subsystems, and

4. evaluate subsystem structures using establishedsoftware engineering principles .

The functional view can be achieved by using lan-guage semantics to annotate AST representations .Examples of contributions from language semantics in -clude denotational semantics [19, 20] and operationalsemantics [10] [16] . Denotational and operational se-mantics have been quite effective in achieving math-ematical understanding of the dynamic behaviour o fprograms .

The domain-level view can be achieved by addingspecific annotations to the AST using project-specifi csemantics (such as programming standards) an ddomain-specific information such as integrity con-straints .

208

Figure 1 : A possible architecture for a distributed re-verse engineering environment . Dashed lines distin-guish different machines, or computing environments .

At lower levels of abstraction, the focus is mainl yon structural recognition and uses information de-rived from the code . Questions asked at this leve linclude analysis of resource usage between modules(e .g ., global variables, data structures, actual/formalparameter lists, files) and the identification of depen-dencies between program components (e .g ., precondi-tions/postconditions, data dependencies) . Structuralrecognition involves the analysis of control and dat aflow properties, system organization, and data depen-dencies .

At higher levels of abstraction, behavioral recogni-tion is used to recognize abstract functionalities in pro -gram fragments . Code features for behavioral recog-nition include the investigation of the effects of a com-putation in terms of initial and final values induced ,how a computation is performed in terms of transi-tion systems, and the definition of meaningful rela-tionships between different program parts (e.g., whichcomputations affect which program parts) . Behavioralrecognition implies the analysis of the semantics of th eexchanged resources between program modules (e.g. ,how exchanged resources are used, what effects theyhave), the operations which change the imported re -sources into their exported values, and, finally, the

concepts, relationships, and actions that occur in amodule.

From the preceding discussion, it is evident tha ta number of different tools and environments mus twork together order to produce the effect of a Goa lDirected Design Recovery Environment . Moreover, wewould like to have the whole environment distributedacross different machines and available to multipl eusers . These requirements suggest a system architec-ture in which each tool has its own local workspace an dall communicate through a central software repositor ythat stores the results of analysis, design concepts ,links to graph structures, and any other informationneeded for the different tools to operate . A possiblesystem architecture for such an integrated reverse en-gineering environment is shown in Fig . 1 2 .

Finally, the nature and the dynamics of the pro-cess used for design recovery suggest a model that isbased on a Goal-Question-Analysis-Action paradigm .This strategy focuses on the improvement of the soft-ware process by considering the specific project goal sand environments . In such a way, project objective sand the domain specific information in the develop-ment environment guide the selection of appropriatemodels, methods, and tools in the software process .

In this model, the maintainer operates within aframework of top-level goals . These goals provide th eframework and the justification for a series of ques-tions that must be addressed to satisfy the top-leve lgoals . Specific goals give rise to specific question swhich require the use of specific analysis strategiesand tools . Once the strategy and tool have been se-lected, an action is recommended to achieve the goal .The maintainer may require additional informationand can set new goals, reapplying the whole model a sshown in Fig. 2. Some relationships between Goals ,Questions, Analysis Tools, and Actions are shown inFig. 3 .

3 Partial Design Recovery

In order to recover the design of a program, severalissues must be addressed .

1 . A code representation formalism must be cho-sen. This formalism should allow for represent-ing both structural and behavioral code feature sand for the application of a different of analysi salgorithms .

2 This architecture was suggested by the participants of theProgram Understanding project (University of Toronto, Univer-sity of Victoria, McGill University, and IBM) .

Local

Workspace

Local

Workfp•o,

Local

Workspac e

ZControl Inlpndon

Schem a

Dale Inlwpatlon CbI.C1 OYaRepository

Tool

Pill.rnWalther

209

2. A comparison algorithm must be devised t omatch the source code representations with pro-gramming plans that represent commonly used al-gorithms and programming structures . Compar-ison algorithms depend heavily on the progra mand on the selected representation method an dthey do not always involve simple pattern match-ing .

3. A top-level control strategy must be defined . Top -level control methods focus on techniques to se-lect program parts and programming plans fo rcomparison in order to achieve plan instance lo-calization .

Partial recognition deals with the problem of rec-ognizing plan instances even when these plans are in-terleaved with other types of information in the codeor are scattered throughout the program. Multiple,failed, or incomplete plan recognition must be take ninto consideration . Multiple recognition occurs when asingle programming plan matches more than one pro-gram part . Ambiguities can be resolved using needs ,domain knowledge, or other external information . Onthe other hand, failed recognition should produce fail-ure information explaining the cause of failure . How-ever, the most common case occurs when no success o rfailure can be proven . In this case incomplete bindingsshould be produced for explanation and control .

3.1 Program Representation

One of the central issues to be addressed for designrecovery is the definition of a program representationformalism that encompasses design information at ahigher level than the source code . Most of the researchapproaches focus on the development of mathematica lformalisms and techniques which can facilitat e

1. representation of program functions in a more ab-stract way than source code ,

2. the representation of the behaviour of a program ,

3. search techniques (borrowed from A .I, or graphtheory) ,

4. ways to reflect information on the problem andthe application domain ,

5. user friendliness, in terms of how program desig nis presented to the programmer ,

6. adaptability, in terms of how easily one represen-tation can be transformed into another more ab-stract one (in case of reverse engineering) or les sabstract (in case of forward engineering), and

Figure 2: The overall functionality of a ReverseEngineering Environment using the "Goal-Question -Analysis-Action" paradigm.

7. portability to a computer environment (be abl eto define data structures for encoding the formal -ism) .

Most approaches use internal representations thatare based on structural and lexical properties, but i tis also necessary to find representations based on be-havioral code properties. A popular internal repre-sentation of code is the Abstract Syntax Tree (AST )which is (a) language independent (b) represents th efunctionality of the system as a whole, and (c) can b eused in a a variety of software engineering activities .

Our approach distinguishes between structural andfunctional code recognition . For structural code recog-nition AST program representations are considered .For functional code recognition, representations base don language semantics are investigated . This paperfocuses on structural code recognition in which AST sare annotated with information on system structur e(call graphs), data flow (data flow graphs), the result sof previous analysis, and links to informal information .Annotations are defined as needed, and thus we main-tain the capability to change or add annotations asthe system evolves. Moreover, Annotated Text (AT )representations of the code can be created using suc ha schema. Annotated Text representations link sourc ecode fragments with meaningful concepts, comments ,

Use r

Provides I Redefines

Specifications (Analysis of jecive s

code patterns, dependenc y

constraints, etc.)

output I Analysis results

Interface

Passes requests

Analyze r

Pedarms analysis

(localize code,

show graphs,

perform no w

analysis, etc.)Control Calls the appropriate analysis tools

210

and abstractions so that design recovery can be as-sisted .

The main idea is to parse source code, creating anAbstract Syntax Tree that uses a domain model t oshow the structure and the hierarchies between th edifferent language constructs . Once such an AST i screated [4], analysis can be performed to annotate thenodes of the tree . Annotations take the form of map -pings from AST nodes to other AST nodes or otherentities . The REFINE environment is used to createsuch an Abstract Syntax Tree and its correspondin gdomain model . The resulting AST is stored in an ob-ject base and can be retrieved using a query languageand a specialized programming language . AbstractSyntax Trees have been used extensively [1] for rep -resenting structural properties and serving as a basi sfor a several analysis methods .

Examples of reverse engineering environments us-ing ASTs include the RECORDER system [3], theREFINE system [12], and the Programmer's Appren-tice Project [18] .

Once the representation of the basic components o fa program by plans, cliches, or other formalisms ha sbeen studied, representation and comparison method sfor controlling plan detection in complex application smust be considered .

3.2 Comparison Methods

Comparison methods are used to perform simpl eplan instance recognition by proving equivalence orshowing a partial order relationship between two sim-ple programming plans . A simple programming planis a plan that has no other plans interleaved . In ourapproach such simple plan instance recognition will beperformed by applying equivalence relations on som ebehavioral representation of the program .

Plans can be described at different levels of abstrac -tion . The most abstract level is the one correspond-ing to the most concise description of a very complexsystem . With new representations, plan fragment sat higher levels of abstraction could be detected an ddescribed. The most common plan-code comparisonmethods include :

• simple pattern matching ,

• similarity metrics ,

• graph matching, and

• body structure .

Some systems (e .g ., PROUST) [11], [7] match syn-tax trees with syntax tree templates . A plan matches

a program statement if its unified template matchesthe statement's syntax tree and its constraints andsubgoals are satisfied . TALUS [15] compares studentand reference functions by applying a heuristic simi-larity measure . In CPU [13], programs are representedas lambda calculus expressions and procedural plans .Comparisons in CPU are performed by applying a uni -fication and matching algorithm on lambda calculu sexpressions . In UNPROG [9], program control flo wgraphs and data flow relations are compared with theprogramming plan's control flow graph and data flo wrelations . Quilici [17] matches frame schema repre-sentations of C code and if they structurally matc hthen data flow graphs are compared too . GRASP[21, 22] uses attributed data flow sub-graphs to repre-sent programs and programming plans . Comparisonsare performed by matching subgraphs and by checkin gconstraints involving control dependencies and otherprogram attributes .

We consider two types of comparison methods :structural and behavioral . In structural comparison ,the dominant program representation scheme is theAnnotated AST . The objective is to locate code frag-ments based on structural/lexical properties, contro land data flow properties, and communication proper -ties .

Comparison criteria for this level of design recov-ery include structural similarity, control and data flo wsimilarity, and relations defined on uses of variable s(e .g ., initial and final values, uninitialized data, valueranges) .

In behavioral comparison, the dominant represen-tation scheme is again the Annotated Abstract SyntaxTree emphasis placed on the annotations representingthe effects of a computation andhow a computation i sperformed . The objective is to locate code fragmentsbased on equivalence and partial order relations de -fined over Annotated Abstract Syntax Trees and o nlogic relations that structural properties of languageconstructs with their corresponding semantics . In-formation obtained from Annotated Abstract SyntaxTrees, language semantics (denotational, operational) ,the domain model, flow analysis results, and slicingtechniques can help for perfoming behavioral compar -ison of simple programming plans and code fragmen-nts .

3.3 Top-Level Control

Program parts and programming plans represente dat higher levels of abstraction are selected using a top -level control strategy and are used as input to a com-parison module . The output of such comparisons are

211

Goal Question Analysis ActionCode Correction Data Type Mismatch Type Inferencing Update declarationsCode Performance Detection of Inefficient Code Refine programs Normalize and Replace Cod eCode Correction Memory Allocation Flow Analysis Eliminate Memory Leak sCode Performance Equivalent Code Language Semantics , Pattern matching Perform AbstractionsCode Performance Find Specific Algorithms Language Semantics, Pattern matching Perform AbstractionsCode Correction Unitialized data Flow Analysis Initi alize DataCode Correction Change/Impact Analysis Slices, Dependency Ana lysis Localize Erroneous Code

Figure 3 : Basic relationships for the Goal-Question-Analysis-Action paradigm

recognized concepts and program parts satisfying thespecifications of a programming plan . Control can beguided by the needs of the particular application an dthe results of previous comparisons . Search algorithmsare used to select from the program representation dif -ferent programming parts for comparison . Bottom-up search strategies systematically select all progra mparts covering the program representation, while top-down search strategies seek single parts that can b eused to satisfy a given expectation (subgoal) . Pro-gramming plans and program parts are not alwaysrepresented using the same formalism . Moreover, dur-ing the recognition process, comparisons must be per-formed between already recognized concepts and origi -nal program material . Hierarchical recognition contro lstrategies focus on such multi-leveled representation sand are used for compositional recognition where com-plex concepts are recognized in terms of their subcom -ponents .

Program decomposition can be used . to guide theselection process . Performance is best when decom-position produces program parts that correspond t oplans in the library. Program decomposition can b eperformed a priori before the selection process start sor in a dynamic way, based on previous recognitionresults and the current needs of the application as th eselection process is performed .

For the control strategy, we focus on investigatingan island driven opportunistic search in which searc hsubgoals are set around some well-recognized point i nthe program. The idea is to use this positively identi-fied point as an anchor and then try to satisfy subgoalsaround it .

Islands are obtained by starting with design deci-sions that have been positively (or with high evidence )identified and are of particular semantic relevance, an dthen proceeding outward, extending the analysis i nboth directions . Irrelevant and intermixed plans canbe ruled out by allowing recognition gaps and partia lrecognition of plans .

As an example, consider intermixed plans or partsof plans which do not share data dependencies . Suchplans can be distinguished irrelevant ; the island-drivensearch will skip them and proceed towards what i tconsiders the current goal to be .

Such an island-driven search starts by establishin ga top-level goal which, in practice, is the satisfactio nof some programming plan . The top-level goal is es-tablished by examination of the properties of the well -identified program part and the properties of the plan sin the plan base . The search continues by trying t oprove the existence in the code of other parts of th eplan as these exist in the generic plan. Throughoutthe process, incomplete and partial satisfaction of th eplan properties causes the search to jump in an oppor-tunistic way in an attempt to satisfy whatever it can .Island-driven opportunistic search will not fail if theplan and the code representation do not match exactly.In this case the island- driven search will jump to an -other point with high likelihood of recognition and wil lstart again from this point . The search will termi-nate when no further plan recognition can take place .Thus, even partially matched plans will be reported bysuch an algorithm. Failure occurs only when none of aplan's subparts can be positively recognized. The in-fluence of constraints between nonadjacent fragment scan be investigated in the case where other fragment sare left as uninterpreted gaps . These concepts can beused to specify interactive tools that propose an in-terpretation of a fragment to the maintainer who ma ynot be interested in every fragment or who may pro -vide a personal interpretation of some uninterpretedfragments .

4 Research issues

Large legacy systems represent significant assetsfor the companies that use them. These systemshave evolved over time and tend to require continu-ous maintenance . The definition of the objectives for

212

a design recovery process is based both on a genericlist of desired design attributes to be recovered (suc has call graphs, data types, value ranges, control flow ,data flow) and on domain-specific design attributes ofa specific product or specific language .

Within the framework of this project, maintenanceobjectives were identified during a sequence of con-textual interviews with the Program Understandin gGroup (PUG) at IBM's Centre for Advanced Studies .The top-level objectives of the Program Understand-ing Project (PUP) address problems related to

a) code correctness and ;

b) performance enhancement .

The research issues which arise in such a frameworkand constitute the objectives of our work include :

• selection, incorporation, and use of an appropri-ate process model ;

• development of code representation schemes andplan localization techniques that are appropriatefor mechanical manipulation and allow for th eanalysis of structural and behavioral propertie sof the code ;

• normalization and correction of erroneous code(e.g ., removing dead code, simplifying or remov-ing redundant expressions, localizing erroneou scommunication points) ; and

• integration and use of different technologies (flo wanalysis, language semantics, pattern matching ,search and control strategies) in a reverse engi-neering environment .

Existing Program Understanding systems attemptto recognize plan instances by comparing program rep-resentations against programming plans . A commontheme in these approaches is to use a program repre-sentation formalism, a plan repository, a plan localiza-tion control strategy , and a comparison algorithm .

The major problems associated with the programunderstanding systems built so far can be summarize das follows :

• Repository completeness: it is not possible to en-code and store all possible plans occurring in a napplication, and usually the ones encoded aretrivial cases (e .g ., sorting algorithms, list traver-sals) .

• High complexity : most understanders performwell only in small and medium-sized (approx .

5000 lines) programs. In large programs com-plexity makes design recovery an extremely dif-ficult and time-consuming process . Difficultiesare caused by interleaved and scattered plans an dsyntactic or implementation variations .

• Structural and lexical matching: in most appli-cations, plan-program similarity or equivalence i sdetermined by testing structural and lexical in -formation from both the code and the plan . Thiscauses a problem when the code fragment con-tains other plans, irrelevant statements, or whenthe plan is scattered among different parts of th eapplication, because the structural and lexical -based representations of the code and the plancannot be matched . Moreover, the behaviour ofa program can not be encoded and represented .

• Graph-based matching: not all approaches useprogram representations that are based on struc-tural and lexical information . Several pro-gram understanders use specific graph-based for-malisms which incorporate data and control flow .Graph grammars are used to perform abstrac-tions and to perform plan instance recognition .The problem in these applications is that grap htransformations are usually very expensive t ocompute and graph pattern matching algorithm scan have high complexity. This imposes a seri-ous problem when large programs are analyzedbecause their corresponding graph representatio ncan be very large and complex .

The approach taken in this project focuses on min-imizing the drawbacks of other existing approaches .Below, we examine the basic views that we haveadopted and the points of difference with respect t oexisting approaches .

• Use of a development strategy. In our ap-proach we use a strategy for performing a Re-verse Engineering (Design Recovery) task. Thisstrategy was inspired by the "Goal-Question -Metrics" (GQM) strategy described by Basili i n[2] . Our strategy, the " Goal-Question-Analysis-Action" strategy, is task oriented, goal driven ,and focuses on decomposing the task of desig nrecovery into a number of interrelated but dis-tinct subtasks. This can be justified in two ways .Firstly, it is a natural way to perform design re-covery. In practice, human maintainers do not tr yto understand the whole program at once but, in-stead, they gather different pieces of informatio nand then they relate them using their expertise

213

and their programming skills . Secondly, main-tainers gather this information in a goal-drive nway. At the beginning, they establish an objective(e .g ., "find what this procedure returns and ho wit computes its returned values" , "Is it a sortingalgorithm?"), and then they gather informatio n(e .g ., "find where this variable is updated" , "whendoes this loop terminate?"), which they believe i srelevant for meeting the specific objective . Infor-mation gathering is not a random process but i sguided by the experience and the programmingskills of the maintainer . We believe that this ex-perience is valuable and that a Reverse Engineer -ing environment should give the programmer th eflexibility to specify his own queries and provid ehim with the tools to gather the information hethinks is important and relevant. Thus, the ques-tion the maintainer asks is the key to selectin gthe right analysis tool and the required actions tobe performed .

• Use of behavioral program representation model .This approach uses a program representationmethod which reveals the behavioral aspects ofthe represented program. Specifically, insteadof using a representation method based only onstructural and lexical properties of the program ,we use information which is based on languagesemantics and focuses on how a computation i sperformed, on what is the effect of a compu-tation after its termination. The advantage o fthis approach is twofold . Firstly, it is based ona well-established formalism which is supporte dby a solid mathematical theory (language seman-tics) . Secondly, the mathematical theory sup-porting the model serves as a foundation fo rdefining equivalence and partial order relation swhich can then be used to relate plans and pro-grams . Most program understanding applica-tions use program representations that incorpo-rate control flow, data flow, and structural infor-mation . The drawbacks of these representatio nschemes are that plan-program equivalence ha sno mathematical foundation and that they ar ebased on simplistic pattern-matching algorithms .The complexity of these formalisms is high, whic hmakes the design recovery of realistically largeprograms difficult . Our approach, which uses lan-guage semantics, has not been investigated yet o ncomplexity issues, but it definitely gives an ad -vantage on plan-program matching, which can beachieved based not only on structural and syn-tactic properties, but also on semantic and be-

havioral properties of programs .

• Use of an ef ficient and flexible plan localizatio nstrategy . Plan localization is a key issue in mos tprogram understanders . Plan localization algo-rithms must cope with problems due to syntac-tic variations, interleaved plans, implementatio nvariations, overlapping implementations, and un-recognizable code . Some of the most commonplan localization algorithms used in existing ap-plications are top-down goal agendas, depth-firstsearch, best-first search, repeated traversals, an dexhaustive search . The drawbacks in most appli-cations are (a) the complexity issues and (b) fail-ure to produce any results if perfect localizationcan not be achieved. Our approach has the com-plexity of an island-driven parser [6] and has th eadvantage that if the plan localization algorith mfails to recognize a plan completely it still recog-nizes parts of it, so that partial plan-program lo-calization can be performed. In most cases, then ,the maintainer can use his experience to recognizethe rest of the plan himself.

• Use of an integrating environment . As indi-cated above, the view adopted in this projectfavours programmer-assisted program under -standing over automatic program understanding .Our approach considers plan localization onl yas a subgoal of design recovery. Plans shouldbe instances of principles and not simply in-stances of abstracted code fragments . Principle scould be high-level descriptions of concepts oc-curring in the code (something that we believeit is not feasible with the current technology)or practical properties of the code itself (e .g . ,data dependencies, control dependencies, precon-ditions, and postconditions) with which program-mers are familiar . Within this framework, a pro-grammer may set high level-objectives with ab-stracted concepts . In order to address these ob-jectives, he examines practical properties of th eprogram such as control and data flow, depen-dencies, partial correctness properties, communi-cation points, and informal information . A Re-verse Engineering environment should provide t othe maintainers tools for addressing these prac-tical questions and not rely exclusively on auto-matic plan recognition using a static plan library.This approach fits well with the Goal-Question-Analysis-Action" process model, and has not beenincorporated in any of the existing program un-derstanders .

214

5 Conclusion

In this paper we gave an overview of a researchproject whose goal is the development of a revers eengineering environment . Reverse engineering and i nparticular design recovery are complex tasks and re -quire the application of a number of diverse method-ologies and techniques ranging from compiler technol-ogy to artificial intelligence and language semantics .Our research focuses on (a) the definition of a strat-egy for specifying the design recovery process, (b) th eidentification of a repertoire of design recovery objec-tives useful to a typical system maintainer, (c) th edefinition of program representation and plan repre-sentation schemes, (d) the development of a controlstrategy for localizing programming plans, (e) the de-velopment of a comparison technique for relating plan sand code representations, and (f) the integration ofanalysis results from a diverse tools such as graph ed-itors, parsers, and repositories .

Within this framework we have discussed a strategyfor describing the design recovery process, the "Goal -Question-Analysis-Action" model . In this model top-level objectives give rise to specific questions whic hin turn require the application of specific analysi stools and actions. We have looked at a list of top-level objectives identified from contextual interview swith IBM researchers and software engineers . Ob-jectives include: (a) data type mismatch in expres-sions (PL/AS related data type mismatch problem snot handled by the compiler) (b) appropriateness o fdata structures used (e.g., all fields of a structure ar eused) ; (c) localization of equivalent or similar cod efragments (code reduction) ; (d) localization of specificalgorithms in the code (plan localization) ; (e) detec-tion of inefficient and error prone code (high complex-ity, high and complex module interaction) ; (f) Codeflow analysis and unitialized data (control/data flow ,value ranges, constant propagation), and, finally ; (g)change / impact analysis (slicing, dicing, module de-pendencies)

The program representation scheme chosen to meetthese objectives is the Abstract Syntax Tree annotate dwith information accommodating links to concepts, re-sults of previous analysis, references to control flow ,data flow and structure graphs as well as representa-tions of program behaviour and functionality by usingformalisms borrowed from the area of language seman-tics. The advantage of this representation scheme i sthat it incorporates both a well-established model fo rrepresenting software structure (AST) with formal an -notations borrowed from the well-defined and formalarea of programming language semantics for represent-

ing program functionality.Comparison methods are used to relate plans an d

code representations . We distinguish between com-parisons for structural and functional design recovery.Comparisons for structural design recovery are base don control and data flow properties, system organiza-tion properties, and data dependencies . On the othe rhand, comparisons for functional design recovery arebased on the semantics of the exchanged resources be-tween different program parts, the operations whic hchange imported resources into their exported values ,and the attributes, relations, and actions that manip-ulate program's components . Within this framewor kplan localization control strategies must be definedso that comparison can be performed even when in-terleaved plans, implementation variations, and scat-tered plans exist in the code . Island-driven search al-gorithms may contribute towards the partial designrecovery objective .

Finally, the integration of different environmentsand tools in a distributed environment can be achievedby allowing a local workspace for each individual too lin which specific results and artifacts are stored, aglobal repository for storing information and data use din all applications and tools, and, finally, a translatio nprogram for transforming tool specific software enti-ties into a common and compatible in all environment sentities stored in the global repository.

The target system is currently in its design phaseand it will be applied for recovering the design of largecomplex legacy systems such as the SQL/DS package .

About the Authors

Kostas Kontogiannis is a Ph.D student at Com-puter Science Department, McGill University . He isworking in the area of software design recovery, pat -tern matching, and artificial intelligence . He can bereached at [email protected] .

Morris Bernstein is a research assistant at Com-puter Science Department, McGill University and h eis working towards the development of a reverse en-gineering environment using data flow analysis tech-niques . he can be reached at zaphod@cs .mcgill .ca .

Renato DeMori is a professor and chairman o fthe. Computer Science Department at McGill Univer-sity . His interests are focused on the areas of patternmatching, speech recognition, and program design re-covery. He can be reached at renato@cs .mcgill .ca

Ettore Merlo is a researcher at CRIM and an ad-junct professor at McGill University. He is investi-gating program representation methods and data flow

215

analysis techniques for design recovery .

He can be [11] Johnson, W .L. and Soloway, E., "PROUST :reached at merloOcrim .ca . Knowledge-Based Program Understanding, "

IEEE Transactions on Software Engineering,

ReferencesMarch 1985, pp . 267 - 275 .

[12] Kotik, G.B . and Markosian, L .Z ., Automat-ing Software Analysis and Testing Using a

[1] "AAAI92 Workshop Program on AI & Auto-mated Program Understanding", WorkshopNotes AAAI92, July 12-16, San Jose, 1992 .

Program Transformation System, ReasoningSystems Inc ., 1989 .

[13] Letovsky, S . "Plan Analysis of Programs,"[2] Basili V. R., Rombach H. D., "Tailoring the

Software Process to Project Goals and Envi -ronments" 9th International Conference onSoftware Engineering ,1987, pp . 345-359 .

Ph.D thesis, Yale University, Dept . of Com-puter Science,

YALEU/CSD/RR662, De-cember 1988 .

[14] Mueller, H .A., Rigi as a Reverse Engineering[3] Bush,

"The Automatic Restructuring ofCOBOL," IEEE Conf on Software Mainte -nance, 1985, pp. 35 .

Tool, A report from the Dept. of Comp Sci ,U of Victoria, March 1991 .

[15] Ourston, D ., "Program Recognition," IEEE[4] Buss,

E .

Henshaw,

J .

"Experiences

inProgram Understanding" Proceedings CAS-

Expert, Winter 1989, pp . 36 .

CON'92, Toronto, November 1992 .[16] Plotkin G. D, "Structural Operational Se -

mantics " ,

Lecture notes,

DAIMI FN-19 ,Aarhus University, Denmark, 1981 .[5] Chikofsky, E.J . and Cross, J .H. II, "Reverse

Engineering and Design Recovery: A Tax-onomy," IEEE Software, Jan . 1990, pp . 13 -

[17] Quilici, A., Khan, J . "Extracting Object sand Operations from C Programs, " Work-

17 . shop Notes, AI and Automated Program Un-derstanding, AAAI'92 1992, pp . 93 - 97.[6] Corazza, A., De Mori, R., Gretter, R. and

Satta G., "Computation of Probabilities foran Island-Driven Parser," IEEE Transac-tions on Pattern Analysis and Machine In -

[18] Rich, C. and Wills, L .M., "Recognizing aProgram 's Design :

A Graph-Parsing Ap-proach," IEEE Software, Jan 1990, pp . 82

telligence, Sept . 1991 . - 89 .

[7] Engberts, A .,

Kozaczynski, W., Ning,

J ."Automating software maintenance by con -

[19] Scott, D., "Data Types as Lattices," SIAMJournal of Computing, Vol . 5, No 3, 1976 ,

cept recognition-based program transforma-tion, " IEEE Conference on Software Main -

pp . 522 - 587 .

tenance - 1991, IEEE, IEEE Press, October [20] Stoy,

J .E ., Denotational Semantics,

MIT

14-17 1991 . Press, 1977.

[21] Wills, L .M., "Automated Program Recogni-[8] Harandi, M.T. and Ning, J .Q., "Knowledge-

Based Program Analysis," IEEE Software,tion: A Feasibility Demonstration," Artifi-cial Intelligence, Vol . 45, No . 1-2, Sept . 1990 .

Jan 1990, pp. 74 - 81 .[22] Wills, L .M., "Automated Program Recogni-

[9 ] Hartman, J ., "Understanding Natural Pro-grams Using Proper Decomposition," Pro-ceedings of the 13th International Conference

tion: Breaking out of the Toy Program Rut ," Workshop Notes, AI and Automated Pro -gram Understanding, AAAI'92 1992, pp . 129

of Software Engineering, May 1991 . - 133 .

[10] Hennessy M., "The Semantics of Program-ming Languages : An Elementary Introduc-tion using Structural Operational Seman -tics", Wiley 1991.

216

the development of a partial design recovery environment for legacy systems

Documents