[ieee comput. soc fourth working conference on reverse engineering - amsterdam, netherlands (6-8...



Page 1: [IEEE Comput. Soc Fourth Working Conference on Reverse Engineering - Amsterdam, Netherlands (6-8 Oct. 1997)] Proceedings of the Fourth Working Conference on Reverse Engineering - Dimensions

Dimensions of Database Reverse Engineering

Michael R. Blaha
OMT Associates Inc., Chesterfield, Missouri 63017 USA
(blaha@acm.org)

Abstract

We continue to be surprised by the variability of reverse engineering problems. When we tackle new problems, we often encounter situations we have not seen before. For these different situations, we have to adjust our reverse engineering techniques, level of effort, and expectations. This paper characterizes dimensions of variation for reverse engineering of databases.

1. Introduction

By now we have reverse engineered about 30 databases, mostly relational databases. We continue to be surprised by the variability of problems. It almost appears as if developers are deliberately trying to make their systems different from all others. This paper presents a survey across problems that we hope will be helpful for formulating reverse engineering methods, metrics, and tools.

We first discuss inputs and outputs for reverse engineering. The outputs of reverse engineering depend on the reason for performing reverse engineering. The inputs depend on the available resources. The inputs and outputs vary widely from problem to problem and are major factors that influence the appropriate approach, level of effort, and outcome for database reverse engineering.

The next section of the paper discusses design issues. Design issues are important for two reasons. First, there are different reasonable approaches to designing a database. A versatile reverse engineer must understand these approaches well. Furthermore, they must be carefully considered by builders of reverse engineering tools. Second, design errors are often encountered during reverse engineering. Some errors are minor and subtle. Others are gross and it is surprising that they occur at all. We have discussed design issues in a prior paper [5], so we just summarize them here.

The last major section of the paper discusses modeling issues that can become apparent during reverse engineering. Modeling issues can arise even when the developer did not deliberately consider models. Modeling issues are important if a reverse engineer is to understand a problem deeply. In addition, recognition of modeling artifacts can provide insight into the thinking of the developers. Such insight facilitates interpretation of a problem and extrapolation of conclusions from sample inputs.

This paper uses the UML [1] as the language for expressing models. The appendix describes the UML constructs used in this paper.

2. Inputs and Outputs of Reverse Engineering

2.1 Required outputs

The purpose of reverse engineering determines the appropriate outputs, as well as the depth of analysis that is required. In practice we usually perform reverse engineering for the following purposes:

Tentative requirements. Reverse engineering of existing software can yield tentative requirements for the new replacement system. Reverse engineering ensures that the functionality of the existing system is not overlooked or forgotten.

Documentation. Reverse engineering can elucidate poorly documented existing software when the developers are no longer available for advice. This documentation can greatly assist maintenance of legacy software.

Software assessment. We have derived substantial benefit from reverse engineering of databases from vendor software. Reverse engineering provides an unusual source of insight. The quality of the database design is an indicator of the quality of the software as a whole. A model of the underlying concepts also lets us judge functionality claims better.

Integration. Reverse engineering facilitates integration of applications. A logical model of encompassed software is a prerequisite for integration.

Conversion of legacy data. You must understand the logical correspondence between the old database and the new database before attempting to convert data.

Assessment of state-of-the-art. From our perspective as methodologists, reverse engineering provides candid insight about the state of the database design art as practiced.

0-8186-8162-4/97 $10.00 © 1997 IEEE


We most often use reverse engineering for eliciting tentative requirements and assessing vendor software. This is a different emphasis than we have seen with most papers in the literature, which tend to emphasize software maintenance and conversion of databases and programming code.

We derive the following outputs from database reverse engineering:

Models. Typically we build an object model during reverse engineering. The quality of the model depends on the available inputs and the motive for reverse engineering. For example, we require a thorough model for conversion of legacy data but a partial model can suffice for assessing vendor products. The model conveys the scope and intent of software.

Mappings. For integration and conversion of legacy data we prepare mappings between attributes of models and fields of schema. We have been managing these mappings with repository software that we are developing. The repository software preserves mappings across imports of revised models, checks for inconsistencies, and generates reports.

Evaluations. Sometimes we explicitly judge the quality of a database, especially for vendor software. We assign a grade that reflects the consistency of the design and the extent of design and modeling errors (Section 3.3 and Section 4.1).

2.2 Available inputs

The available inputs vary widely and this also complicates reverse engineering, especially for tool developers. Furthermore, the inputs differ in their completeness and consistency. Many of the inputs that we list below are described in [3].

Schema. The schema is normally the dominant input for database reverse engineering. It specifies the data structure and many constraints, precisely and explicitly.

The schema varies according to the kind of DBMS, both by paradigm and product. Relational DBMSs have declarative schema that can readily be inspected. OO-DBMSs declare less, so it is more difficult to analyze their schema. Network (such as CODASYL) and hierarchical (such as IMS) DBMSs also express fewer constraints.

There is also variability in the constructs that are used for a database schema. For example, a designer may or may not define primary keys, candidate keys, and foreign keys. In fact, our experience has been that most relational database schemas lack formal definitions of foreign keys. The designer can adopt different philosophies for indexing a schema to improve performance and enforce uniqueness. The designer may or may not define nullability for attributes, either because of sloppiness or a philosophical decision to avoid the DBMS facilities for handling nulls. Some schema incorporate views that reveal common accesses of the database. Some DBMSs encourage the use of stored procedures to define common data operations.

Data. For existing applications the database is often available for experimentation and queries. In contrast, when we reverse engineer vendor products, we may have only a printed copy of the schema and some explanation. Furthermore, even when we have access to a vendor database, we may or may not have representative data.

Database queries. Sometimes we scan the application code, looking for clues from queries that manipulate data.

User interface forms and reports. These may contain helpful information and are often easy to analyze. The reverse engineer can enter known, unusual values to establish the connection between user interface forms and the underlying schema. [7]

Definitions of application constructs. Sometimes a data dictionary is available as a resource.

Access to application experts. Application experts can answer questions and provide rationale and context to temper reverse engineering.

Familiarity with the application. We have reverse engineered problems ranging from familiar applications to applications that we did not understand at all. If the reverse engineer understands an application well, he or she is in a better position to make inferences.

Documentation. Problems have different quality, quantity, and kind of documentation. Documentation provides context for reverse engineering, enabling the reverse engineer to better understand the meaning of application names and make better guesses.

2.3 Problem size

The size of a database also influences reverse engineering difficulty and effort. For example, the number of tables and attributes mostly characterize the size of a relational database; the number of indexes, views, and records are also important. With large schema the reverse engineer is more likely to encounter the styles of multiple designers. It is more compelling to use automation (such as detecting foreign keys) for large problems than for small problems.

3. Design Issues

3.1 Approaches to identity

Identity is “that property of an object which distinguishes each object from all others.” [2] Identity is a prominent concern in databases. Developers must have some way for referring to things. We have seen variations in identity caused by the choice of approach and by errors.

The design of identity is important to reverse engineering for two reasons. First, it is a prerequisite for understanding implicit relationships. [3] We must determine the identifiers for objects and records before we can find relationships between them. Second, identity can be used as a basis for organizing and subdividing a large schema. We often group together classes and record types that are closely related via propagated keys [4].

There are several approaches for implementing identity with a database. [5] The choice of approach is a matter of preference and implies little about the quality of a schema. Most databases that we have studied use artificial identity.

Artificial identity. A system-generated identifier (also called an ID, a surrogate, or a pointer) identifies each object. Figure 1 shows a logical model and relational database tables with two kinds of artificial identifiers. The bold font denotes attributes that are part of the primary key.

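[Figure 1 diagram: logical model of Region (regionName), Bank (bankName), and Account (accountNum), with table implementations using unstructured identifiers (middle) and structured identifiers such as regionID, regionalBankID, and regionalAccountID (bottom)]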

Figure 1 Artificial identity

-- Unstructured identifier. In the middle portion of Figure 1 the identifiers of Region, Bank, and Account are just handles and have no intrinsic meaning.

-- Structured identifier. The identifier has a meaningful internal structure. In the bottom portion of Figure 1 regionalBankID consists of a regionID concatenated with some digits to differentiate each bank in a region. Similarly, regionalAccountID distinguishes each account relative to a regionID. Such identifiers could occur for a large corporation that services its banks from several regional databases. Each regional database can independently allocate structured identifiers for its banks and accounts.

Value-based identity. Some combination of application attributes identifies each object. In Figure 2 each region and bank have a unique regionName and bankName respectively. Each account can be identified by a bankName combined with an accountNum.

[Figure 2 diagram: implementation with value-based identity]

Figure 2 Value-based identity

Hybrid identity. A schema can combine artificial identity with value-based identity. In Figure 3 Region and Bank have artificial identity and Account has identity derived from a reference to a bank combined with an account number.

[Figure 3 diagram: implementation with hybrid identity]

Figure 3 Hybrid identity

Propagated identity. Identity can also be propagated via migration of foreign keys. In Figure 4 the identifier of A is used as the primary key for both the A and B tables.

[Figure 4 diagram: logical model and implementation with propagated identity; the A table holds the A primary key and other A attributes, and the B table holds the A primary key and B attributes]

Figure 4 Propagated identity
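The first three approaches can be compared side by side. The following is a minimal sketch that uses Python's sqlite3 module only as a convenient in-memory DBMS. Table and column names follow the text for Figures 1 to 3 where the text names them; the data types, the bankID and accountID columns, and the helper make_regional_bank_id (including its four-digit width) are assumptions made for illustration.

```python
import sqlite3

# Artificial identity (unstructured): system-generated handles with no meaning.
ARTIFICIAL = """
CREATE TABLE Region  (regionID INTEGER PRIMARY KEY, regionName TEXT);
CREATE TABLE Bank    (bankID INTEGER PRIMARY KEY, bankName TEXT,
                      regionID INTEGER REFERENCES Region);
CREATE TABLE Account (accountID INTEGER PRIMARY KEY, accountNum TEXT,
                      bankID INTEGER REFERENCES Bank);
"""

# Value-based identity: application attributes identify each object.
VALUE_BASED = """
CREATE TABLE Region  (regionName TEXT PRIMARY KEY);
CREATE TABLE Bank    (bankName TEXT PRIMARY KEY,
                      regionName TEXT REFERENCES Region);
CREATE TABLE Account (bankName TEXT REFERENCES Bank, accountNum TEXT,
                      PRIMARY KEY (bankName, accountNum));
"""

# Hybrid identity: artificial for Region and Bank, derived for Account.
HYBRID = """
CREATE TABLE Region  (regionID INTEGER PRIMARY KEY, regionName TEXT);
CREATE TABLE Bank    (bankID INTEGER PRIMARY KEY, bankName TEXT,
                      regionID INTEGER REFERENCES Region);
CREATE TABLE Account (bankID INTEGER REFERENCES Bank, accountNum TEXT,
                      PRIMARY KEY (bankID, accountNum));
"""


def primary_key(ddl, table):
    """Return the primary key columns of `table`, in key order."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    conn.close()
    # Field 5 of table_info is the 1-based position in the key (0 = not in key).
    return [name for _, name in sorted((r[5], r[1]) for r in rows if r[5] > 0)]


def make_regional_bank_id(region_id, bank_seq, digits=4):
    """Structured identifier: a regionID concatenated with some digits."""
    return f"{region_id}{bank_seq:0{digits}d}"
```

Calling primary_key on the Account table of each schema shows how the same logical class ends up with a different identifier under each approach, which is the variation a reverse engineer has to recognize.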


3.2 Database mappings

There are also different ways for implementing concepts and relationships with database schema that we summarize here. References [4] and [6] provide further explanation. The precise mapping strategy is a matter of preference and says little about the quality of a schema.

Domains. A domain is the set of possible values for an attribute. Simple domains can be implemented by merely substituting an appropriate data type and size. Complex domains require specialized implementations. For example, a developer can implement an enumeration domain with a string, multiple boolean flags, an enumeration table, or an enumeration encoding.

Classes. A class is normally mapped to a table or record type in a database. Occasionally a class is horizontally partitioned (the schema is repeated but the records are apportioned to different databases) or vertically partitioned (the primary key is repeated and the other attributes are apportioned to different databases).

Associations. Simple one-to-one and one-to-many associations may be promoted to a record type or buried in a record type for a related class. Many-to-many associations are normally promoted to a record type. Additional rules apply to complex associations that we do not elaborate here.

Generalizations. The superclass and each subclass can be implemented with a separate record type. A second technique is to push subclass data up to the superclass. A third is to push superclass data down to the subclasses. And there are additional, unusual ways for implementing generalization.
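The two association mappings can be illustrated with a small sketch, again using Python's sqlite3 as a stand-in DBMS. The School, Student, and Teacher tables, their columns, and the sample rows are hypothetical and are not taken from any schema discussed in this paper; the point is only the contrast between a buried foreign key and a promoted association table.

```python
import sqlite3

SCHEMA = """
-- One-to-many association buried as a foreign key on the "many" side.
CREATE TABLE School  (schoolID INTEGER PRIMARY KEY, schoolName TEXT);
CREATE TABLE Student (studentID INTEGER PRIMARY KEY, studentName TEXT,
                      schoolID INTEGER REFERENCES School);

-- Many-to-many association promoted to a table of its own.
CREATE TABLE Teacher (teacherID INTEGER PRIMARY KEY, teacherName TEXT);
CREATE TABLE School_Teacher (
    schoolID  INTEGER REFERENCES School,
    teacherID INTEGER REFERENCES Teacher,
    PRIMARY KEY (schoolID, teacherID));
"""


def build_sample():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO School VALUES (?, ?)",
                     [(1, "North"), (2, "South")])
    conn.executemany("INSERT INTO Student VALUES (?, ?, ?)",
                     [(1, "Ann", 1), (2, "Bob", 1)])
    conn.executemany("INSERT INTO Teacher VALUES (?, ?)",
                     [(1, "Lee"), (2, "Kim")])
    conn.executemany("INSERT INTO School_Teacher VALUES (?, ?)",
                     [(1, 1), (2, 1), (1, 2)])
    return conn


def schools_of_teacher(conn, teacher_name):
    """Traverse the promoted association from Teacher to School."""
    rows = conn.execute(
        """SELECT s.schoolName
             FROM Teacher t
             JOIN School_Teacher st ON st.teacherID = t.teacherID
             JOIN School s ON s.schoolID = st.schoolID
            WHERE t.teacherName = ?
            ORDER BY s.schoolName""", (teacher_name,))
    return [r[0] for r in rows]


def students_of_school(conn, school_name):
    """Traverse the buried association from School to Student."""
    rows = conn.execute(
        """SELECT st.studentName
             FROM Student st JOIN School s ON s.schoolID = st.schoolID
            WHERE s.schoolName = ?
            ORDER BY st.studentName""", (school_name,))
    return [r[0] for r in rows]
```

During reverse engineering the direction is the opposite: a table whose primary key is composed entirely of references to other tables suggests a promoted association, while a lone reference column suggests a buried one.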

3.3 Design errors

Errors affect both the viability and benefit of reverse engineering. It can be difficult to understand a database with many errors. However, it is important to detect errors when assessing the quality of a database and fix them when developing a successor application.

Most flawed databases have only a few kinds of errors that are consistently applied. Reference [5] catalogs many of the design errors we have encountered. Some errors are isolated problems that do not significantly affect reverse engineering. For example, some schema are mostly correct, but have a few primary keys with extraneous attributes beyond those actually required for uniqueness. We have found minor design errors in about 75% of the databases we have reverse engineered.

However, we occasionally encounter severe errors that pervade a schema. We have found major design errors in about 50% of the databases we have reverse engineered.

As an example of a major design error, consider a relational database application for managing school data. The database was originally developed for a single school. When the software was extended to handle districts of schools, the district identifier was not fully propagated into dependent tables. Figure 5 shows an excerpt of the model, actual implementation, and an improved (partially corrected) implementation.

[Figure 5 diagram: intended logical model (School, Teacher, and related classes), the original flawed implementation, and an improved implementation; columns shown include districtID, schoolID, studentID, and teacherID]

Figure 5 A design error: identity conflict

In another application we found several tables with dual identifiers; the first identifier was referenced by some tables and the second identifier by others. Figure 6 shows a sample table with additional contract data. Columns id_contract + revision_num are the primary key and contract_num + revision_num are a candidate key. Column prev_contract_num refers to the previous contract_num and is by inspection a foreign key. In the remainder of the schema some tables refer to id_contract and others refer to contract_num.

When we talked to the developers, we discovered that the dual identity was caused by an attempt to facilitate data conversion. Instead of properly designing the database and then writing the tedious code needed to convert legacy data, they tried to adjust the schema to make legacy data conversion easier. (The flawed schema did not help.) Technically, the table in Figure 6 satisfies normal forms, but it is confusing and difficult to program against.


    CREATE TABLE contaddl
    ( id_contract           NUMBER(9)
    , revision_num          NUMBER(4)
    , contract_num          NUMBER(8)
    , contract_type         VARCHAR2(2)
    , booked_date           DATE
    , current_revision_ind  VARCHAR2(1)
    , revision_date         DATE
    , prev_contract_num     NUMBER(8)
    , completed_date        DATE
    , est_review_date       DATE
    , est_proficiency       NUMBER(11,2)
    , est_risk              NUMBER(11,2)
    , CONSTRAINT contaddl_pk
        PRIMARY KEY (id_contract, revision_num)
    , CONSTRAINT contaddl_uq
        UNIQUE (contract_num, revision_num)
    );

Figure 6 A design error: dual identity
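A dual identifier of this kind is visible in the schema catalog whenever both keys are declared. The sketch below is one possible way to list the declared identifiers of a table; it is not a tool used in the work reported here. It assumes SQLite (via Python's sqlite3) rather than the DBMS of Figure 6, so the data types are simplified to INTEGER, only the key columns of the table are reproduced, and the function name declared_identifiers is hypothetical.

```python
import sqlite3

# Key columns of the Figure 6 table, with simplified SQLite data types.
DDL = """
CREATE TABLE contaddl (
    id_contract       INTEGER,
    revision_num      INTEGER,
    contract_num      INTEGER,
    prev_contract_num INTEGER,
    PRIMARY KEY (id_contract, revision_num),
    UNIQUE (contract_num, revision_num)
);
"""


def load(ddl):
    """Create an in-memory database from the given DDL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    return conn


def declared_identifiers(conn, table):
    """Map constraint origin ('pk' or 'u') to the column lists it covers."""
    found = {}
    indexes = conn.execute(f"PRAGMA index_list({table})").fetchall()
    for _seq, index_name, unique, origin, _partial in indexes:
        # Keep only indexes created by PRIMARY KEY or UNIQUE constraints.
        if not unique or origin not in ("pk", "u"):
            continue
        info = conn.execute(f"PRAGMA index_info({index_name})").fetchall()
        found.setdefault(origin, []).append([row[2] for row in info])
    return found
```

A table that reports both a primary key and one or more unique constraints is a candidate for closer inspection: the reverse engineer can then check which of the identifiers the remaining tables actually reference.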

4. Modeling Issues

4.1 Modeling errors

About 25% of the time we see conceptual errors in a database and they are usually due to lack of modeling by the software developers. These errors are especially nasty because often the reverse engineer must deeply understand the application to detect them and repair them.

In Figure 7 we have further corrected the model and schema from Figure 5. In the original flawed schema, a teacher is associated with one school. Instead a teacher should be associated with multiple schools (or a district); some persons teach at more than one school.

[Figure 7 diagram: corrected logical model and further improved implementation; columns shown include districtID, schoolID, studentID, and teacherID]

Figure 8 shows another error we encountered with a software development tool. The inferior model is limited to binary associations; a “from” object and a “to” object must be specified for each association. The improved model introduces the notion of a role which is the intersection of a class and association and can handle binary, ternary, and n-ary associations. The “from” and “to” designation can be stored as an attribute of role.

Figure 8 A modeling error: poor conceptualization
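The improved model can be approximated in a few lines. This is a minimal sketch in Python and not a reproduction of Figure 8: the class names Role and Association follow the text, while the attribute names, the helper binary, and the sample associations (WorksFor, Supplies) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Role:
    """One end of an association: the intersection of a class and an association."""
    class_name: str
    direction: Optional[str] = None  # optional "from"/"to" designation


@dataclass
class Association:
    name: str
    roles: List[Role] = field(default_factory=list)

    @property
    def arity(self) -> int:
        """Number of participating roles: 2 for binary, 3 for ternary, and so on."""
        return len(self.roles)


def binary(name, from_class, to_class):
    """Express the inferior from/to model as a special case of the role model."""
    return Association(name, [Role(from_class, "from"), Role(to_class, "to")])


# Hypothetical examples: a binary and a ternary association.
works_for = binary("WorksFor", "Person", "Company")
supplies = Association(
    "Supplies", [Role("Supplier"), Role("Part"), Role("Project")])
```

The binary case remains expressible, so nothing is lost by adopting roles; the from/to designation simply becomes data on the role rather than structure in the model.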

4.2 Genericity

The typical schema (such as Figure 1, Figure 5, Figure 6, and Figure 7) directly describes application data. A generic class is more flexible and combines data and metadata. In Figure 9 we define a generic window parameter instead of directly storing values such as top left position, height, width, and background color with a window default.

[Figure 9 diagram: WindowDefault associated with a generic parameter class having attributes parameterName and parameterValue]

Figure 9 Generic attribute (from [4])

Figure 10 illustrates a generic class. A document can be recursively divided into document components. Page, paragraph, and line are examples of document components.

[Figure 10 diagram: Document recursively composed of document components; the attribute number appears twice]

Figure 7 A modeling error: wrong multiplicity

Figure 10 Generic class (from [4])
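The recursive structure described for Figure 10 can be sketched as follows. The diagram itself is not reproduced here, so the attribute names kind and children, the count method, and the sample data are assumptions; only Document, the recursive division into components, and the attribute number come from the text and figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentComponent:
    """A generic component; `kind` is metadata stored as ordinary data."""
    kind: str    # e.g. "page", "paragraph", "line"
    number: int  # position of this component within its parent
    children: List["DocumentComponent"] = field(default_factory=list)

    def count(self, kind: str) -> int:
        """Count components of the given kind in this subtree, including self."""
        own = 1 if self.kind == kind else 0
        return own + sum(child.count(kind) for child in self.children)


@dataclass
class Document:
    title: str
    components: List[DocumentComponent] = field(default_factory=list)


# Hypothetical sample: one page holding two paragraphs with three lines in total.
sample_page = DocumentComponent("page", 1, [
    DocumentComponent("paragraph", 1, [
        DocumentComponent("line", 1), DocumentComponent("line", 2)]),
    DocumentComponent("paragraph", 2, [DocumentComponent("line", 1)]),
])
sample_document = Document("Sample", [sample_page])
```

Note that nothing in the class structure says that a page contains paragraphs or that a paragraph contains lines. That knowledge lives in the data, which is the trade-off that later complicates reverse engineering.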


Developers often use generic data to deliver a more flexible application. For example, the window parameters in Figure 9 need not be specified until run time when the database is populated. In contrast, if the parameters were specified as explicit attributes of window default they would have to be known at compile time.
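The run-time flexibility can be seen in a small sketch. It uses Python's sqlite3 as a stand-in DBMS; parameterName and parameterValue come from Figure 9, while the table name WindowParameter, the windowDefaultID column, the TEXT data types, and the sample values are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE WindowDefault (windowDefaultID INTEGER PRIMARY KEY);
CREATE TABLE WindowParameter (
    windowDefaultID INTEGER REFERENCES WindowDefault,
    parameterName   TEXT,
    parameterValue  TEXT,
    PRIMARY KEY (windowDefaultID, parameterName));
"""


def open_sample():
    """Create an in-memory database with one window default and two parameters."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO WindowDefault VALUES (1)")
    conn.executemany("INSERT INTO WindowParameter VALUES (1, ?, ?)",
                     [("height", "400"), ("width", "600")])
    return conn


def add_parameter(conn, window_default_id, name, value):
    """Introduce a new kind of parameter with no change to the schema."""
    conn.execute("INSERT INTO WindowParameter VALUES (?, ?, ?)",
                 (window_default_id, name, value))


def load_defaults(conn, window_default_id):
    """Return all parameters of a window default as a name-to-value mapping."""
    rows = conn.execute(
        "SELECT parameterName, parameterValue FROM WindowParameter "
        "WHERE windowDefaultID = ?", (window_default_id,))
    return dict(rows.fetchall())
```

Adding a backgroundColor parameter is a single insert. The cost is that the schema no longer reveals which parameters exist or what their data types are.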

Generic data appear in additional forms. Some applications let users customize user-defined fields and anonymous relationships. Generic data also occur with fields that hold binary data that is not understood by the database and must be parsed by application programming.

Regardless of the motives, genericity complicates reverse engineering. It introduces variation beyond that intrinsic to the application. Also some application concepts are no longer explicit in the schema, but are buried in the data. Improper use of genericity can be downright confusing.

Figure 11 shows an excerpt from an application we studied that badly misused generic data. (The figure does not show the actual table names, but the acronyms used were equally uninformative.) Every class had a primary key of OID and most foreign keys were stored in forwardPointer and backwardPointer attributes that refer to an OID. The schema did not declare referential integrity for any of the foreign keys.

A table: OID, backwardPointer, forwardPointer, description, creationDate, lastUpdate, ...
B table: OID, backwardPointer, forwardPointer, description, creationDate, lastUpdate, ...
C table: OID, backwardPointer, forwardPointer, description, creationDate, lastUpdate, ...
D table: OID, backwardPointer, forwardPointer, description, creationDate, lastUpdate, ...
...

Figure 11 Misuse of genericity

The size and ambiguity of the schema frustrated our detection of foreign keys. We tried to use a commercial product to detect the foreign keys and it still had not resolved the 200 table schema after overnight processing. (It either died or was still executing!) And manually we could not resolve the foreign keys because there were just too many possibilities. For example, the backwardPointer for the A table could refer to the OID for the A, B, C, D, or some other table. For the purpose of reverse engineering it was as if the schema were encrypted (which we do not believe to be the developers' intent).
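The ambiguity can be made concrete with a simple value-inclusion test, a common heuristic for proposing foreign keys from data: a column can only refer to a table whose identifiers include every non-null value in the column. The sketch below is our illustration of that heuristic, not the algorithm of the commercial product mentioned above. The function name candidate_targets and the sample OID sets are hypothetical, and the sample assumes that each table numbers its OIDs from 1.

```python
from typing import Dict, Iterable, List, Optional, Set


def candidate_targets(pointer_values: Iterable[Optional[int]],
                      oids_by_table: Dict[str, Set[int]]) -> List[str]:
    """Return the tables whose OID set contains every non-null pointer value."""
    values = {v for v in pointer_values if v is not None}
    return sorted(table for table, oids in oids_by_table.items()
                  if values <= oids)


# Hypothetical data: each table allocates its own OIDs starting at 1.
SAMPLE_OIDS = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 4, 5, 6},
    "C": {1, 2, 3},
    "D": {1, 2, 3, 4},
}
```

With overlapping OID ranges a pointer column holding small values is consistent with every table, so the data cannot narrow the choice. Only columns that happen to contain large values eliminate some candidates, and with 200 tables the number of pairings to test grows quickly.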

4.3 Style

Style is another factor that affects the difficulty, effort, likelihood of success, and appropriate techniques for reverse engineering. Choices of style are often due to developer preferences. Some companies try to encourage a uniform style for their applications. There are several major aspects of style that permeate database schemas:

Consistent or variable. Perhaps the foremost aspect of style is whether decisions are consistently or sporadically applied. The reverse engineer can more confidently draw conclusions for a uniform schema. Our experience has been that most schema have a uniform style.

Model-driven or not. Normally, we can tell whether a schema is derived from a model. Some developers just write a schema and repeatedly patch it until it appears to work. Other developers are disciplined, carefully modeling a problem and then systematically converting the model into a schema. Model-driven schema are more consistent and tend to have fewer mistakes.

Modeling paradigm. If a model is built, the chosen paradigm can also affect the ultimate schema. For example, UML models tend to capture more constraints than most ER modeling paradigms. When a model accurately captures constraints, they can be better considered for inclusion in the schema.

Discipline with data types. This is similar to modeling paradigm. If a developer thinks in terms of domains, the schema tends to have more consistent data types. The usage of data types is often a major hint to detecting implicit foreign keys and facilitates reverse engineering.

Generated or hand-written. It is usually readily apparent whether a schema has been manually written or generated by a tool. Tools tend to yield code with few irregularities. There can still be mistakes; they are just more consistently applied.

Age of schema. The age of a schema tends to influence the style of design, beyond the choice of DBMS paradigm (relational, network, hierarchical).

Age of an application. Many applications have accumulated artifacts from prior releases. Some of these are false clues that must be disregarded during reverse engineering.

Extent of optimization. Some developers prefer to eliminate lightweight tables and record types that are not material to an application. Others would prefer to keep them to increase the uniformity of an implementation. For example, as Figure 12 shows, we need not define a table for PreciousMetal because it has no columns other than an identifier.


[Figure 12 diagram: logical model with Asset (assetID) and the resulting StockAsset table; no table is defined for PreciousMetal]

Figure 12 Eliminating lightweight tables

Computational artifacts. During reverse engineering we sometimes find artifacts of computation, such as intermediate and temporary data structures.

There are also several minor aspects of style that mostly concern documentation, but still provide important clues for reverse engineering.

Sequence of attributes. Primary keys often tend to be listed towards the front of relational database tables.

Naming protocols. Some schemas have strong naming conventions. Certain phrases in names suggest a relationship, such as “billing”. Tables for many-to-many associations often incorporate the names of one or both related classes. Underscores often separate the words in a name.

Prefixes and suffixes. A suffix of “ID” often denotes an identifier. (Though sometimes non-identifiers may also have a suffix of ID.) A suffix of “type” often denotes a shift in the level of abstraction. A prefix of “is” often denotes an attribute of boolean domain.
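Such naming clues are easy to automate as hints, never as conclusions. The following minimal sketch splits a column name into words and applies the three prefix and suffix rules listed above. The function names, the wording of the hints, and the sample column isActive are hypothetical.

```python
import re


def split_words(name):
    """Split a name on underscores and on lower-to-upper case transitions."""
    marked = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return [part.lower() for part in marked.split("_") if part]


def naming_hints(column):
    """Return weak hints suggested by the prefix and suffix of a column name."""
    words = split_words(column)
    hints = []
    if not words:
        return hints
    if words[-1] == "id":
        # Only a hint: non-identifiers sometimes carry an ID suffix too.
        hints.append("possible identifier")
    if words[-1] == "type":
        hints.append("possible shift in abstraction level")
    if words[0] == "is":
        hints.append("possible boolean")
    return hints
```

For example, naming_hints("regionID") reports a possible identifier and naming_hints("contract_type") reports a possible shift in abstraction level. A tool would combine such hints with stronger evidence, such as matching data types and data values, before proposing a key.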

5. Conclusions

In this paper we have stepped back from the details and tried to observe the broad themes guiding the difficulty and success of reverse engineering a database. We hope this paper will cause other researchers to reflect on their own experiences and be useful grist for improving reverse engineering processes and tools.

Related Work

We browsed the past WCRE proceedings, but did not find much related work to cite. Most papers deal with fine details of reverse engineering. In contrast this paper is more of a broad overview.

Reference [8] reflects on the factors inhibiting broader adoption of reverse engineering. The difficulty of judging the required effort and appropriate deliverables is one of the obstacles. We are hopeful that in the future we can build on this paper and devise some metrics for quantifying the effort for reverse engineering a database.

Reference [9] notes that database reverse engineering is an exploratory and often unstructured activity with the reverse engineer often learning as the process proceeds. The specifications are often incomplete and inconsistent.

Reference [10] presents some challenges for reverse engineering. Even though the subject matter is more restrictive (just database reverse engineering), our paper elaborates aspects of some challenges. Our work is empirical observations rather than fundamental research. We are compiling and abstracting across our experience with reverse engineering for a number of real databases, some internal to corporate information system organizations and some from vendors. We elaborate the multiple information sources alluded to in [10]. Section 2.1 of this paper focuses on economic impact.

References

[1] The following UML books are planned:
Grady Booch, James Rumbaugh, and Ivar Jacobson. UML User’s Guide. Reading, Massachusetts: Addison-Wesley.
James Rumbaugh, Ivar Jacobson, and Grady Booch. UML Reference Manual. Reading, Massachusetts: Addison-Wesley.
Ivar Jacobson, Grady Booch, and James Rumbaugh. UML Process Book. Reading, Massachusetts: Addison-Wesley.

[2] SN Khoshafian and GP Copeland. Object identity. OOPSLA’86 as ACM SIGPLAN 21, 11 (Nov 1986), 406-416.

[3] JL Hainaut, J Henrard, D Roland, V Englebert, and JM Hick. Structure elicitation in database reverse engineering. Third Working Conference on Reverse Engineering, November 1996, Monterey, California, 131-140.

[4] Michael Blaha and William Premerlani. Object-Oriented Modeling and Design for Database Applications. Prentice Hall, Englewood Cliffs, New Jersey, 1998.

[5] Michael Blaha and William Premerlani. Observed idiosyncracies of relational database designs. Second Working Conference on Reverse Engineering, July 1995, Toronto, Ontario, 116-125.

[6] William Premerlani and Michael Blaha. An approach for reverse engineering of relational databases. First Working Conference on Reverse Engineering, May 1993, Baltimore, Maryland, 151-160.

[7] Jeanette Bruno. Invited talk at the Third Working Conference on Reverse Engineering, November 1996, Monterey, California.

[8] Spencer Rugaber and Linda M Wills. Creating a research infrastructure for reengineering. Third Working Conference on Reverse Engineering, November 1996, Monterey, California, 98-101.

[9] JL Hainaut, V Englebert, J Henrard, JM Hick, and D Roland. Requirements for information system reverse engineering support. Second Working Conference on Reverse Engineering, July 1995, Toronto, Ontario, 136-145.

[10] Peter G Selfridge, Richard C Waters, and Elliot J Chikofsky. Challenges to the field of reverse engineering. First Working Conference on Reverse Engineering, May 1993, Baltimore, Maryland, 144-150.


Appendix. Summary of the UML Object Modeling Notation

Figure 13 summarizes UML constructs that are included in this paper. Object models are built from three basic constructs: classes, associations, and generalizations.

[Figure 13 panels: Class; Generalization; Association; Association with link attribute; Qualified Association; Multiplicity of Associations (zero or one, exactly one, many (zero or more)); Aggregation]

Figure 13 UML syntax used in this paper

A class is denoted by a rectangle and describes objects with common attributes, behavior, and semantic intent. Attributes may be suppressed or displayed in the second portion of the class box. We use a bold font to indicate an attribute that is part of a primary key.

Generalization organizes classes by their similarities and differences. A large hollow arrowhead denotes generalization. The arrowhead points to the superclass. Simple generalization apportions superclass instances among the subclasses. The UML also supports several forms of multiple inheritance.

An association relates instances of two or more classes and is indicated by a line. Multiplicity specifies the number of instances of one class that may relate to a single instance of an associated class. Solid balls denote “many” multiplicity, meaning zero or more. A hollow ball denotes “zero or one” multiplicity. The lack of a symbol at the end of an association line means exactly one.

An association may be qualified, in which case the qualifier attribute further refines the multiplicity. For example, a directory has many files, but the combination of directory and file name corresponds to one file. A role is one end of an association and may be assigned an explicit name as we have shown for class2 of the qualified association. An aggregation is a special kind of association in which an assembly is composed of parts.
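The directory example of a qualified association can be sketched in code as a mapping keyed by the qualifier. This is only one possible rendering, written under our own assumptions: the class names `Directory` and `File` and the methods `add_file`, `lookup`, and `files` are hypothetical and are not part of the UML or of any system we studied. Without the qualifier, a directory relates to many files; with the qualifier (file name), a lookup yields at most one file.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    name: str
    contents: str = ""


@dataclass
class Directory:
    name: str
    # Qualifier (file name) -> File. The dictionary key plays the role
    # of the qualifier attribute on the association.
    _files: dict = field(default_factory=dict)

    def add_file(self, file_name, contents=""):
        """Add a file; the file name must be unique within the directory."""
        if file_name in self._files:
            raise ValueError(f"{file_name!r} already exists in {self.name!r}")
        self._files[file_name] = File(file_name, contents)

    def lookup(self, file_name):
        """Qualified traversal: directory + file name -> one file (or None)."""
        return self._files.get(file_name)

    def files(self):
        """Unqualified traversal: a directory has many files."""
        return list(self._files.values())


if __name__ == "__main__":
    home = Directory("home")
    home.add_file("notes.txt", "draft")
    home.add_file("todo.txt")
    print(len(home.files()))
    print(home.lookup("notes.txt"))
```

In a relational schema, the same constraint would typically surface as a unique key on the combination of the directory reference and the file name, which is the kind of clue a reverse engineer can look for.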
