[ieee comput. soc fourth working conference on reverse engineering - amsterdam, netherlands (6-8...

Dimensions of Database Reverse Engineering

Michael R. Blaha OMT Associates Inc., Chesterfield, Missouri 63017 USA

(blaha@acm.org)

Abstract

We continue to be surprised by the variability of reverse engineering problems. When we tackle new problems, we often encounter situations we have not seen before. For these different situations, we have to adjust our reverse engineering techniques, level of effort, and expectations. This paper characterizes dimensions of variation for reverse engineering of databases.

1. Introduction

By now we have reverse engineered about 30 databases, mostly relational databases. We continue to be surprised by the variability of problems. It almost appears as if developers are deliberately trying to make their systems different from all others. This paper presents a survey across problems that we hope will be helpful for formulating reverse engineering methods, metrics, and tools.

We first discuss inputs and outputs for reverse engineering. The outputs of reverse engineering depend on the reason for performing reverse engineering. The inputs depend on the available resources. The inputs and outputs vary widely from problem to problem and are major factors that influence the appropriate approach, level of effort, and outcome for database reverse engineering.

The next section of the paper discusses design issues. Design issues are important for two reasons. First, there are different reasonable approaches to designing a database. A versatile reverse engineer must understand these approaches well. Furthermore, they must be carefully considered by builders of reverse engineering tools. Second, design errors are often encountered during reverse engineering. Some errors are minor and subtle. Others are gross and it is surprising that they occur at all. We have discussed design issues in a prior paper [5], so we just summarize them here.

The last major section of the paper discusses modeling issues that can become apparent during reverse engineering. Modeling issues can arise even when the developer did not deliberately consider models. Modeling issues are important if a reverse engineer is to understand a problem deeply. In addition, recognition of modeling artifacts can provide insight into the thinking of the developers. Such insight facilitates interpretation of a problem and extrapolation of conclusions from sample inputs.

This paper uses the UML [1] as the language for expressing models. The appendix describes the UML constructs used in this paper.

2. Inputs and Outputs of Reverse Engineering

2.1 Required outputs

The purpose of reverse engineering determines the appropriate outputs, as well as the depth of analysis that is required. In practice we usually perform reverse engineering for the following purposes:

Tentative requirements. Reverse engineering of existing software can yield tentative requirements for the new replacement system. Reverse engineering ensures that the functionality of the existing system is not overlooked or forgotten.

Documentation. Reverse engineering can elucidate poorly documented existing software when the developers are no longer available for advice. This documentation can greatly assist maintenance of legacy software.

Software assessment. We have derived substantial benefit from reverse engineering of databases from vendor software. Reverse engineering provides an unusual source of insight. The quality of the database design is an indicator of the quality of the software as a whole. A model of the underlying concepts also lets us judge functionality claims better.

Integration. Reverse engineering facilitates integration of applications. A logical model of encompassed software is a prerequisite for integration.

Conversion of legacy data. You must understand the logical correspondence between the old database and the new database before attempting to convert data.

Assessment of state-of-the-art. From our perspective as methodologists, reverse engineering provides candid insight about the state of the database design art, as practiced.

0-8186-8162-4/97 $10.00 © 1997 IEEE

We most often use reverse engineering for eliciting tentative requirements and assessing vendor software. This emphasis differs from that of most papers in the literature, which tend to emphasize software maintenance and conversion of databases and programming code.

We derive the following outputs from database reverse engineering:

Models. Typically we build an object model during reverse engineering. The quality of the model depends on the available inputs and the motive for reverse engineering. For example, we require a thorough model for conversion of legacy data but a partial model can suffice for assessing vendor products. The model conveys the scope and intent of software.

Mappings. For integration and conversion of legacy data we prepare mappings between attributes of models and fields of schema. We have been managing these mappings with repository software that we are developing. The repository software preserves mappings across imports of revised models, checks for inconsistencies, and generates reports.

Evaluations. Sometimes we explicitly judge the quality of a database, especially for vendor software. We assign a grade that reflects the consistency of the design and the extent of design and modeling errors (Section 3.3 and Section 4.1).

2.2 Available inputs

The available inputs vary widely and this also complicates reverse engineering, especially for tool developers. Furthermore, the inputs differ in their completeness and consistency. Many of the inputs that we list below are described in [3].

Schema. The schema is normally the dominant input for database reverse engineering. It specifies the data structure and many constraints, precisely and explicitly.

The schema varies according to the kind of DBMS, both by paradigm and product. Relational DBMSs have declarative schema that can readily be inspected. OO-DBMSs declare less, so it is more difficult to analyze their schema. Network (such as CODASYL) and hierarchical (such as IMS) DBMSs also express fewer constraints.

There is also variability in the constructs that are used for a database schema. For example, a designer may or may not define primary keys, candidate keys, and foreign keys. In fact, our experience has been that most relational database schemas lack formal definitions of foreign keys. The designer can adopt different philosophies for indexing a schema to improve performance and enforce uniqueness. The designer may or may not define nullability for attributes, either because of sloppiness or a philosophical decision to avoid the DBMS facilities for handling nulls. Some schema incorporate views that reveal common accesses of the database. Some DBMSs encourage the use of stored procedures to define common data operations.

Data. For existing applications the database is often available for experimentation and queries. In contrast, when we reverse engineer vendor products, we may have only a printed copy of the schema and some explanation. Furthermore, even when we have access to a vendor database, we may or may not have representative data.

Database queries. Sometimes we scan the application code, looking for clues from queries that manipulate data.

User interface forms and reports. These may contain helpful information and are often easy to analyze. The reverse engineer can enter known, unusual values to establish the connection between user interface forms and the underlying schema. [7]

Definitions of application constructs. Sometimes a data dictionary is available as a resource.

Access to application experts. Application experts can answer questions and provide rationale and context to temper reverse engineering.

Familiarity with the application. We have reverse engineered problems ranging from familiar applications to applications that we did not understand at all. If the reverse engineer understands an application well, he or she is in a better position to make inferences.

Documentation. Problems have different quality, quantity, and kind of documentation. Documentation provides context for reverse engineering, enabling the reverse engineer to better understand the meaning of application names and make better guesses.

2.3 Problem size

The size of a database also influences reverse engineering difficulty and effort. For example, the number of tables and attributes mostly characterize the size of a relational database; the number of indexes, views, and records are also important. With large schema the reverse engineer is more likely to encounter the styles of multiple designers. It is more compelling to use automation (such as detecting foreign keys) for large problems than for small problems.

3. Design Issues

3.1 Approaches to identity

Identity is “that property of an object which distinguishes each object from all others.” [2] Identity is a prominent concern in databases. Developers must have some way for referring to things. We have seen variations in identity caused by the choice of approach and by errors.

The design of identity is important to reverse engineering for two reasons. First, it is a prerequisite for understanding implicit relationships. [3] We must determine the identifiers for objects and records before we can find relationships between them. Second, identity can be used as a basis for organizing and subdividing a large schema. We often group together classes and record types that are closely related via propagated keys [4].

There are several approaches for implementing identity with a database. [5] The choice of approach is a matter of preference and implies little about the quality of a schema. Most databases that we have studied use artificial identity.

Artificial identity. A system-generated identifier (also called an ID, a surrogate, or a pointer) identifies each object. Figure 1 shows a logical model and relational database tables with two kinds of artificial identifiers. The bold font denotes attributes that are part of the primary key.

[Figure 1 panels: logical model (Region, Bank, Account, with regionName, bankName, and accountNum); tables with unstructured identifiers; tables with structured identifiers (regionID, regionalBankID, regionalAccountID).]

Figure 1 Artificial identity

-- Unstructured identifier. In the middle portion of Figure 1 the identifiers of Region, Bank, and Account are just handles and have no intrinsic meaning.

-- Structured identifier. The identifier has a meaningful internal structure. In the bottom portion of Figure 1 regionalBankID consists of a regionID concatenated with some digits to differentiate each bank in a region. Similarly, regionalAccountID distinguishes each account relative to a regionID. Such identifiers could occur for a large corporation that services its banks from several regional databases. Each regional database can independently allocate structured identifiers for its banks and accounts.
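A structured identifier can be illustrated with a minimal sketch. The paper does not give the actual layout, so the fixed-width decimal format, the width `SEQ_DIGITS`, and both helper names below are assumptions made only for illustration.

```python
# Hypothetical layout: regionID occupies the leading digits, followed by a
# fixed-width per-region sequence number. The paper does not give the format.
SEQ_DIGITS = 4  # assumed width of the per-region counter


def make_regional_bank_id(region_id: int, seq: int) -> int:
    """Concatenate a regionID with digits that distinguish a bank in the region."""
    if not 0 <= seq < 10 ** SEQ_DIGITS:
        raise ValueError("sequence number does not fit the assumed width")
    return region_id * 10 ** SEQ_DIGITS + seq


def split_regional_bank_id(regional_bank_id: int) -> tuple[int, int]:
    """Recover (regionID, sequence) from a structured identifier."""
    return divmod(regional_bank_id, 10 ** SEQ_DIGITS)
```

A reverse engineer who recognizes such internal structure can recover the owning region directly from the identifier, which an unstructured handle does not allow.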

Value-based identity. Some combination of application attributes identifies each object. In Figure 2 each region and bank have a unique regionName and bankName respectively. Each account can be identified by a bankName combined with an accountNum.

Implementation with value-based identity

Figure 2 Value-based identity

Hybrid identity. A schema can combine artificial identity with value-based identity. In Figure 3 Region and Bank have artificial identity and Account has identity derived from a reference to a bank combined with an account number.

Implementation with hybrid identity

Figure 3 Hybrid identity

Propagated identity. Identity can also be propagated via migration of foreign keys. In Figure 4 the identifier of A is used as the primary key for both the A and B tables.

[Figure 4 panels: logical model; implementation with propagated identity. The A table holds the A primary key and the other A attributes; the B table holds the A primary key and the B attributes.]

Figure 4 Propagated identity
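The value-based, hybrid, and propagated approaches can be contrasted in a small DDL sketch. The table suffixes (`_v`, `_h`), the column `bankID`, the data types, and the use of SQLite are assumptions for illustration; only regionName, bankName, accountNum, and the A and B tables come from the text.

```python
import sqlite3

# Names follow the figures where the text gives them; data types, bankID,
# and the use of SQLite are assumptions made for illustration.
DDL = """
-- Value-based identity: application attributes form the primary keys.
CREATE TABLE Region_v  (regionName TEXT PRIMARY KEY);
CREATE TABLE Bank_v    (bankName TEXT PRIMARY KEY);
CREATE TABLE Account_v (bankName TEXT NOT NULL REFERENCES Bank_v,
                        accountNum TEXT NOT NULL,
                        PRIMARY KEY (bankName, accountNum));

-- Hybrid identity: Bank has an artificial ID; Account is identified by a
-- bank reference combined with an account number.
CREATE TABLE Bank_h    (bankID INTEGER PRIMARY KEY, bankName TEXT);
CREATE TABLE Account_h (bankID INTEGER NOT NULL REFERENCES Bank_h,
                        accountNum TEXT NOT NULL,
                        PRIMARY KEY (bankID, accountNum));

-- Propagated identity: the identifier of A is also the primary key of B.
CREATE TABLE A (aID INTEGER PRIMARY KEY, aData TEXT);
CREATE TABLE B (aID INTEGER PRIMARY KEY REFERENCES A, bData TEXT);
"""


def build_schema() -> sqlite3.Connection:
    """Create all sketch tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def accepts(conn, sql, params=()):
    """Return True if the statement succeeds, False on a key violation."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With value-based identity the same accountNum may appear under two banks, but a repeated (bankName, accountNum) pair is rejected by the primary key.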


3.2 Database mappings

There are also different ways for implementing concepts and relationships with database schema that we summarize here. References [4] and [6] provide further explanation. The precise mapping strategy is a matter of preference and says little about the quality of a schema.

Domains. A domain is the set of possible values for an attribute. Simple domains can be implemented by merely substituting an appropriate data type and size. Complex domains require specialized implementations. For example, a developer can implement an enumeration domain with a string, multiple boolean flags, an enumeration table, or an enumeration encoding.

Classes. A class is normally mapped to a table or record type in a database. Occasionally a class is horizontally partitioned (the schema is repeated but the records are apportioned to different databases) or vertically partitioned (the primary key is repeated and the other attributes are apportioned to different databases).

Associations. Simple one-to-one and one-to-many associations may be promoted to a record type or buried in a record type for a related class. Many-to-many associations are normally promoted to a record type. Additional rules apply to complex associations that we do not elaborate here.

Generalizations. The superclass and each subclass can be implemented with a separate record type. A second technique is to push subclass data up to the superclass. A third is to push superclass data down to the subclasses. And there are additional, unusual ways for implementing generalization.
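The three generalization mappings can be sketched as alternative schemas for the same pair of classes. The classes Asset and StockAsset are borrowed from a later figure; every column other than assetID, the strategy labels, and the helper `tables_for` are hypothetical.

```python
import sqlite3

# Hypothetical classes: Asset (superclass) and StockAsset (subclass).
# Column names other than assetID are invented for illustration.
STRATEGIES = {
    # 1. Separate record type for the superclass and each subclass.
    "separate": """
        CREATE TABLE Asset (assetID INTEGER PRIMARY KEY, assetName TEXT);
        CREATE TABLE StockAsset (assetID INTEGER PRIMARY KEY REFERENCES Asset,
                                 tickerSymbol TEXT);
    """,
    # 2. Push subclass data up into the superclass table.
    "push_up": """
        CREATE TABLE Asset (assetID INTEGER PRIMARY KEY, assetName TEXT,
                            assetType TEXT, tickerSymbol TEXT);
    """,
    # 3. Push superclass data down into each subclass table.
    "push_down": """
        CREATE TABLE StockAsset (assetID INTEGER PRIMARY KEY, assetName TEXT,
                                 tickerSymbol TEXT);
    """,
}


def tables_for(strategy: str) -> list[str]:
    """Create the schema for one strategy in memory and list its tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STRATEGIES[strategy])
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [name for (name,) in rows]
```

The same logical model therefore yields one or two tables depending on the mapping, which is why a reverse engineer cannot read class structure directly from a table count.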

3.3 Design errors

Errors affect both the viability and benefit of reverse engineering. It can be difficult to understand a database with many errors. However, it is important to detect errors when assessing the quality of a database and fix them when developing a successor application.

Most flawed databases have only a few kinds of errors that are consistently applied. Reference [5] catalogs many of the design errors we have encountered. Some errors are isolated problems that do not significantly affect reverse engineering. For example, some schema are mostly correct, but have a few primary keys with extraneous attributes beyond those actually required for uniqueness. We have found minor design errors in about 75% of the databases we have reverse engineered.
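One way to look for primary keys with extraneous attributes is to test whether a proper subset of the declared key is already unique in the available data. The sketch below is an illustration, not the method used in the paper; the function names are hypothetical, and uniqueness in a data sample is only evidence, not proof, that a smaller key holds in general.

```python
from itertools import combinations


def is_unique(rows, columns):
    """True if no two rows share the same values for the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def smaller_unique_keys(rows, declared_key):
    """Return proper subsets of declared_key that are unique in the sample."""
    found = []
    for size in range(1, len(declared_key)):
        for subset in combinations(declared_key, size):
            if is_unique(rows, subset):
                found.append(subset)
    return found
```

Any subset reported by `smaller_unique_keys` is a candidate for review with an application expert before concluding that the declared key is oversized.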

However, we occasionally encounter severe errors that pervade a schema. We have found major design errors in about 50% of the databases we have reverse engineered.

As an example of a major design error, consider a relational database application for managing school data. The database was originally developed for a single school. When the software was extended to handle districts of schools, the district identifier was not fully propagated into dependent tables. Figure 5 shows an excerpt of the model, actual implementation, and an improved (partially corrected) implementation.

[Figure 5 panels: intended logical model (classes include School and Teacher); original flawed implementation; an improved implementation. The tables shown carry columns such as districtID, schoolID, studentID, and teacherID.]

Figure 5 A design error: identity conflict
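A simple check for this kind of incomplete propagation is to list the tables that carry a locally scoped identifier without the identifier that qualifies it. The table layouts in the usage example are hypothetical and are not taken from Figure 5; the function name is also an assumption.

```python
def tables_missing_qualifier(tables, local_column, qualifier_column):
    """List tables that contain local_column but lack qualifier_column.

    tables maps a table name to its column names. A table that refers to a
    school by schoolID alone becomes ambiguous once schoolID is only unique
    within a district.
    """
    flagged = []
    for name, columns in tables.items():
        if local_column in columns and qualifier_column not in columns:
            flagged.append(name)
    return sorted(flagged)
```

For example, with a hypothetical schema `{"School": ["districtID", "schoolID"], "Student": ["studentID", "schoolID"]}`, calling `tables_missing_qualifier(schema, "schoolID", "districtID")` flags only the Student table.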

In another application we found several tables with dual identifiers; the first identifier was referenced by some tables and the second identifier by others. Figure 6 shows a sample table with additional contract data. Columns id_contract + revision_num are the primary key and contract_num + revision_num are a candidate key. Column prev_contract_num refers to the previous contract_num and is by inspection a foreign key. In the remainder of the schema some tables refer to id_contract and others refer to contract_num.

When we talked to the developers, we discovered that the dual identity was caused by an attempt to facilitate data conversion. Instead of properly designing the database and then writing the tedious code needed to convert legacy data, they tried to adjust the schema to make legacy data conversion easier. (The flawed schema did not help.) Technically, the table in Figure 6 satisfies normal forms, but it is confusing and difficult to program against.


CREATE TABLE contaddl
( id_contract NUMBER(9)
, revision_num NUMBER(4)
, contract_num NUMBER(8)
, contract_type VARCHAR2(2)
, booked_date DATE
, current_revision_ind VARCHAR2(1)
, revision_date DATE
, prev_contract_num NUMBER(8)
, completed_date DATE
, est_review_date DATE
, est_proficiency NUMBER(11,2)
, est_risk NUMBER(11,2)
, CONSTRAINT contaddl_pk PRIMARY KEY (id_contract, revision_num)
, CONSTRAINT contaddl_uq UNIQUE (contract_num, revision_num) );

Figure 6 A design error: dual identity
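When the keys are declared, dual identity can be spotted from the catalog by listing every unique key of a table. The sketch below uses SQLite rather than the original DBMS, keeps only the key columns of the Figure 6 table, and writes the column names with underscores; the helper name `identifying_keys` is hypothetical.

```python
import sqlite3

# Abbreviated version of the Figure 6 table, using SQLite instead of the
# original DBMS dialect; only the key columns are kept.
DDL = """
CREATE TABLE contaddl (
    id_contract   NUMBER(9),
    revision_num  NUMBER(4),
    contract_num  NUMBER(8),
    CONSTRAINT contaddl_pk PRIMARY KEY (id_contract, revision_num),
    CONSTRAINT contaddl_uq UNIQUE (contract_num, revision_num)
);
"""


def identifying_keys(conn, table):
    """Return the column lists of all unique indexes on a table, sorted."""
    keys = []
    # PRAGMA index_list rows are (seq, name, unique, ...).
    for row in conn.execute(f"PRAGMA index_list({table})").fetchall():
        index_name, is_unique = row[1], row[2]
        if is_unique:
            # PRAGMA index_info rows are (seqno, cid, column name).
            info = conn.execute(f"PRAGMA index_info({index_name})").fetchall()
            keys.append(tuple(r[2] for r in info))
    return sorted(keys)
```

A table that reports more than one identifying key is worth a second look: the next step is to check which of the keys the other tables actually reference.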

4. Modeling Issues

4.1 Modeling errors

About 25% of the time we see conceptual errors in a database and they are usually due to lack of modeling by the software developers. These errors are especially nasty because often the reverse engineer must deeply understand the application to detect them and repair them.

In Figure 7 we have further corrected the model and schema from Figure 5. In the original flawed schema, a teacher is associated with one school. Instead a teacher should be associated with multiple schools (or a district); some persons teach at more than one school.

[Figure 7 panels: corrected logical model; further improved implementation. The tables shown carry columns such as districtID, schoolID, studentID, and teacherID.]

Figure 8 shows another error we encountered with a software development tool. The inferior model is limited to binary associations; a “from” object and a “to” object must be specified for each association. The improved model introduces the notion of a role which is the intersection of a class and association and can handle binary, ternary, and n-ary associations. The “from” and “to” designation can be stored as an attribute of role.

Figure 8 A modeling error: poor conceptualization
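The role-based model can be sketched with two small data classes. The class and attribute names are hypothetical and do not reproduce the figure; the point is only that a list of roles removes the restriction to exactly two ends.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """Intersection of a class and an association."""
    class_name: str
    designation: str = ""  # may hold a "from" or "to" designation


@dataclass
class Association:
    """An association with any number of roles (binary, ternary, n-ary)."""
    name: str
    roles: list[Role] = field(default_factory=list)

    @property
    def arity(self) -> int:
        return len(self.roles)


# A binary association keeps its "from"/"to" designation as a role attribute.
binary = Association("WorksFor", [Role("Person", "from"), Role("Company", "to")])

# A ternary association needs no change to the structure.
ternary = Association(
    "Supplies", [Role("Supplier"), Role("Part"), Role("Project")])
```

In the inferior design, the ternary association above could not be stored without splitting it into artificial binary pieces.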

4.2 Genericity

The typical schema (such as Figure 1, Figure 5, Figure 6, and Figure 7) directly describes application data. A generic class is more flexible and combines data and metadata. In Figure 9 we define a generic window parameter instead of directly storing values such as top left position, height, width, and background color with a window default.

[Figure 9: WindowDefault with a generic window parameter (attributes parameterName and parameterValue).]

Figure 9 Generic attribute (from [4])

Figure 10 illustrates a generic class. A document can be recursively divided into document components. Page, paragraph, and line are examples of document components.

[Figure 10: Document and its recursively nested document components; the attributes shown include number.]

Figure 7 A modeling error: wrong multiplicity

Figure 10 Generic class (from [4])


Developers often use generic data to deliver a more flexible application. For example, the window parameters in Figure 9 need not be specified until run time when the database is populated. In contrast, if the parameters were specified as explicit attributes of window default they would have to be known at compile time.
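The run-time flexibility of a generic attribute can be sketched as follows. Only parameterName and parameterValue come from Figure 9; the table names, the key column windowDefaultID, the text storage of values, and the helper functions are assumptions.

```python
import sqlite3

# Hypothetical tables; only parameterName and parameterValue come from Figure 9.
DDL = """
CREATE TABLE WindowDefault (windowDefaultID INTEGER PRIMARY KEY);
CREATE TABLE WindowParameter (
    windowDefaultID INTEGER NOT NULL REFERENCES WindowDefault,
    parameterName   TEXT NOT NULL,
    parameterValue  TEXT,
    PRIMARY KEY (windowDefaultID, parameterName)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def set_parameter(conn, window_default_id, name, value):
    """Store any parameter by name; no schema change is needed for new names."""
    conn.execute(
        "INSERT OR REPLACE INTO WindowParameter VALUES (?, ?, ?)",
        (window_default_id, name, str(value)))


def get_parameters(conn, window_default_id):
    """Return all parameters of one window default as a dictionary."""
    rows = conn.execute(
        "SELECT parameterName, parameterValue FROM WindowParameter "
        "WHERE windowDefaultID = ?", (window_default_id,))
    return dict(rows.fetchall())
```

The cost for the reverse engineer is visible here as well: names such as height or backgroundColor appear only in the data, never in the schema.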

Generic data appear in additional forms. Some applications let users customize user-defined fields and anonymous relationships. Generic data also occur with fields that hold binary data that is not understood by the database and must be parsed by application programming.

Regardless of the motives, genericity complicates reverse engineering. It introduces variation beyond that intrinsic to the application. Also some application concepts are no longer explicit in the schema, but are buried in the data. Improper use of genericity can be downright confusing.

Figure 11 shows an excerpt from an application we studied that badly misused generic data. (The figure does not show the actual table names, but the acronyms used were equally uninformative.) Every class had a primary key of OID and most foreign keys were stored in forwardPointer and backwardPointer attributes that refer to an OID. The schema did not declare referential integrity for any of the foreign keys.

[Figure 11: tables A, B, C, D, and so on, each with columns OID, backwardPointer, forwardPointer, description, creationDate, lastUpdate, ...]

Figure 11 Misuse of genericity

The size and ambiguity of the schema frustrated our detection of foreign keys. We tried to use a commercial product to detect the foreign keys and it still had not resolved the 200-table schema after overnight processing. (It either died or was still executing!) And manually we could not resolve the foreign keys because there were just too many possibilities. For example, the backwardPointer for the A table could refer to the OID for the A, B, C, D, or some other table. For the purpose of reverse engineering it was as if the schema were encrypted (which we do not believe to be the developers' intent).
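A rough count shows why the possibilities are unmanageable. The sketch assumes that every one of the 200 tables has both pointer columns and that each pointer may target any table; the text only says that most foreign keys were stored this way, so the figures are an upper-bound illustration.

```python
def candidate_pairs(num_tables: int, pointers_per_table: int = 2) -> int:
    """Number of (pointer column, target table) pairs to examine."""
    return num_tables * pointers_per_table * num_tables


def joint_assignments(num_tables: int, pointers_per_table: int = 2) -> int:
    """Number of ways to pick one target table for every pointer column."""
    return num_tables ** (num_tables * pointers_per_table)
```

Under these assumptions `candidate_pairs(200)` gives 80,000 individual column-to-table hypotheses, and the number of complete assignments, `joint_assignments(200)`, has 921 decimal digits. Testing the pairs one at a time against the data is feasible in principle, but nothing in the schema helps to prune them.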

4.3 Style

Style is another factor that affects the difficulty, effort, likelihood of success, and appropriate techniques for reverse engineering. Choices of style are often due to developer preferences. Some companies try to encourage a uniform style for their applications. There are several major aspects of style that permeate database schemas:

Consistent or variable. Perhaps the foremost aspect of style is whether decisions are consistently or sporadically applied. The reverse engineer can more confidently draw conclusions for a uniform schema. Our experience has been that most schema have a uniform style.

Model-driven or not. Normally, we can tell whether a schema is derived from a model. Some developers just write a schema and repeatedly patch it until it appears to work. Other developers are disciplined, carefully modeling a problem and then systematically converting the model into a schema. Model-driven schema are more consistent and tend to have fewer mistakes.

Modeling paradigm. If a model is built, the chosen paradigm can also affect the ultimate schema. For example, UML models tend to capture more constraints than most ER modeling paradigms. When a model accurately captures constraints, they can be better considered for inclusion in the schema.

Discipline with data types. This is similar to modeling paradigm. If a developer thinks in terms of domains, the schema tends to have more consistent data types. The usage of data types is often a major hint to detecting implicit foreign keys and facilitates reverse engineering.

Generated or hand-written. It is usually readily apparent whether a schema has been manually written or generated by a tool. Tools tend to yield code with few irregularities. There can still be mistakes; they are just more consistently applied.

Age of schema. The age of a schema tends to influence the style of design, beyond the choice of DBMS paradigm (relational, network, hierarchical).

Age of an application. Many applications have accumulated artifacts from prior releases. Some of these are false clues that must be disregarded during reverse engineering.

Extent of optimization. Some developers prefer to eliminate lightweight tables and record types that are not material to an application. Others would prefer to keep them to increase the uniformity of an implementation. For example, as Figure 12 shows, we need not define a table for PreciousMetal because it has no columns other than an identifier.

[Figure 12: model with Asset (assetID) and implementation tables such as the StockAsset table; PreciousMetal has no table of its own.]

Figure 12 Eliminating lightweight tables

Computational artifacts. During reverse engineering we sometimes find artifacts of computation, such as intermediate and temporary data structures.

There are also several minor aspects of style that mostly concern documentation, but still provide important clues for reverse engineering.

Sequence of attributes. Primary keys often tend to be listed towards the front of relational database tables.

Naming protocols. Some schemas have strong naming conventions. Certain phrases in names suggest a relationship, such as “billing”. Tables for many-to-many associations often incorporate the names of one or both related classes. Underscores often separate the words in a name.

Prefixes and suffixes. A suffix of “ID” often denotes an identifier. (Though sometimes non-identifiers may also have a suffix of ID.) A suffix of “type” often denotes a shift in the level of abstraction. A prefix of “is” often denotes an attribute of boolean domain.
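Naming and data type clues of this kind can be combined into a simple heuristic for proposing implicit foreign keys. The sketch below is an illustration under strong assumptions: each table has a single-column primary key, a referencing column reuses the exact name of the key it refers to, and the data types match. The function name and the sample schema are hypothetical, and every proposal still needs confirmation against data or an application expert.

```python
def foreign_key_candidates(schema, primary_keys):
    """Propose (table, column, target table) foreign key candidates.

    schema maps table -> {column: data_type}; primary_keys maps
    table -> name of its single primary key column.
    """
    candidates = []
    for table, columns in schema.items():
        for column, data_type in columns.items():
            for target, pk in primary_keys.items():
                if target == table:
                    continue  # ignore a table's own primary key
                # Clues: same name as the target key and same data type.
                if column == pk and data_type == schema[target][pk]:
                    candidates.append((table, column, target))
    return sorted(candidates)
```

A schema with disciplined data types and naming yields short candidate lists; a schema like the one in Figure 11, where every key is called OID, defeats the heuristic entirely.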

5. Conclusions

In this paper we have stepped back from the details and tried to observe the broad themes guiding the difficulty and success of reverse engineering a database. We hope this paper will cause other researchers to reflect on their own experiences and be useful grist for improving reverse engineering processes and tools.

Related Work

We browsed the past WCRE proceedings, but did not find much related work to cite. Most papers deal with fine details of reverse engineering. In contrast this paper is more of a broad overview.

Reference [8] reflects on the factors inhibiting broader adoption of reverse engineering. The difficulty of judging the required effort and appropriate deliverables is one of the obstacles. We are hopeful that in the future we can build on this paper and devise some metrics for quantifying the effort for reverse engineering a database.

Reference [9] notes that database reverse engineering is an exploratory and often unstructured activity with the reverse engineer often learning as the process proceeds. The specifications are often incomplete and inconsistent.

Reference [10] presents some challenges for reverse engineering. Even though the subject matter is more restrictive (just database reverse engineering), our paper elaborates aspects of some challenges. Our work consists of empirical observations rather than fundamental research. We are compiling and abstracting across our experience with reverse engineering for a number of real databases, some internal to corporate information system organizations and some from vendors. We elaborate the multiple information sources alluded to in [10]. Section 2.1 of this paper focuses on economic impact.

References

[1] The following UML books are planned: Grady Booch, James Rumbaugh, and Ivar Jacobson. UML User's Guide. Reading, Massachusetts: Addison-Wesley. James Rumbaugh, Ivar Jacobson, and Grady Booch. UML Reference Manual. Reading, Massachusetts: Addison-Wesley. Ivar Jacobson, Grady Booch, and James Rumbaugh. UML Process Book. Reading, Massachusetts: Addison-Wesley.

[2] SN Khoshafian and GP Copeland. Object identity. OOPSLA'86 as ACM SIGPLAN 21, 11 (Nov 1986), 406-416.

[3] JL Hainaut, J Henrard, D Roland, V Englebert, and JM Hick. Structure elicitation in database reverse engineering. Third Working Conference on Reverse Engineering, November 1996, Monterey, California, 131-140.

[4] Michael Blaha and William Premerlani. Object-Oriented Modeling and Design for Database Applications. Prentice Hall, Englewood Cliffs, New Jersey, 1998.

[5] Michael Blaha and William Premerlani. Observed idiosyncracies of relational database designs. Second Working Conference on Reverse Engineering, July 1995, Toronto, Ontario, 116-125.

[6] William Premerlani and Michael Blaha. An approach for reverse engineering of relational databases. First Working Conference on Reverse Engineering, May 1993, Baltimore, Maryland, 151-160.

[7] Jeanette Bruno. Invited talk at the Third Working Conference on Reverse Engineering, November 1996, Monterey, California.

[8] Spencer Rugaber and Linda M Wills. Creating a research infrastructure for reengineering. Third Working Conference on Reverse Engineering, November 1996, Monterey, California, 98-101.

[9] JL Hainaut, V Englebert, J Henrard, JM Hick, and D Roland. Requirements for information system reverse engineering support. Second Working Conference on Reverse Engineering, July 1995, Toronto, Ontario, 136-145.

[10] Peter G Selfridge, Richard C Waters, and Elliot J Chikofsky. Challenges to the field of reverse engineering. First Working Conference on Reverse Engineering, May 1993, Baltimore, Maryland, 144-150.


Appendix. Summary of the UML Object Modeling Notation

Figure 13 summarizes UML constructs that are included in this paper. Object models are built from three basic constructs: classes, associations, and generalizations.

[Figure 13 panels: Class (with attributes); Generalization; Association with link attribute; Qualified association (with a role name); Multiplicity of associations (zero or one, exactly one, many (zero or more)); Aggregation (an assembly class and its parts).]

Figure 13 UML syntax used in this paper

A class is denoted by a rectangle and describes objects with common attributes, behavior, and semantic intent. Attributes may be suppressed or displayed in the second portion of the class box. We use a bold font to indicate an attribute that is part of a primary key.

Generalization organizes classes by their similarities and differences. A large hollow arrowhead denotes generalization. The arrowhead points to the superclass. Simple generalization apportions superclass instances among the subclasses. The UML also supports several forms of multiple inheritance.

An association relates instances of two or more classes and is indicated by a line. Multiplicity specifies the number of instances of one class that may relate to a single instance of an associated class. Solid balls denote “many” multiplicity, meaning zero or more. A hollow ball denotes “zero or one” multiplicity. The lack of a symbol at the end of an association line means exactly one.

An association may be qualified, in which case the qualifier attribute further refines the multiplicity. For example, a directory has many files but the combination of directory and file name corresponds to one file. A role is one end of an association and may be assigned an explicit name as we have shown for class2 of the qualified association. An aggregation is a special kind of association in which an assembly is composed of parts.
