[ieee comput. soc eighth working conference on reverse engineering - stuttgart, germany (2-5 oct....


A Retrospective on Industrial Database Reverse Engineering Projects-Part 1

Michael Blaha

OMT Associates Inc., Chesterfield, Missouri 63017 USA (www.omtassociates.com)

Abstract

This paper presents a compilation of results from the reverse engineering of 35 databases. All the work was performed by the same reverse engineer (the author) over the past nine years. Since the quantity of data is large, it has been split between two papers.

1. Introduction

I began working in the area of database reverse engineering about ten years ago. Initially I became involved because the technology seemed to complement my prior interest in databases, modeling, and methodology. I had published several works about methodology, but felt isolated. I knew what my practices were, I believed that my advice was sound, but I had no idea what others were actually doing. Reverse engineering was attractive because it could give me an objective, unbiased look at reality.

I was amazed at what I found with the first few databases. They were so poorly designed! I wondered why anyone would build software that way. Numerous publications in the literature show how to design a database. I believe the state-of-the-art is quite good, and it seemed odd that the state-of-the-practice could be so bad. At this point I was hooked on reverse engineering. I wondered if my observations were aberrations or were representative.

This led me to seek opportunities to perform reverse engineering so I could get more data. Sadly, I discovered that my early experiences were typical. I encountered the occasional well-constructed database, but more often what I found was an embarrassment. Not only are many software engineers designing databases badly, but they are doing it in perversely creative ways.

At this point I began to collect systematic data about my experiences with an eye towards eventually writing this paper. As I learned more about reverse engineering over the years, I collected better data. For the older case studies, when possible I reconstructed missing data (repeated the reverse engineering) or otherwise indicated that certain data are missing. To the best of my knowledge, the data in this paper is accurate and on a comparable basis for the various databases.

In total I have reverse engineered about 50 databases to date, and most of them are included in this compilation. I omitted several early efforts for which I had not kept sufficient notes or artifacts. Every database presented in this paper was prepared by a different team of developers, as best I can tell. I rejected several databases that were built by a team that was already represented. I only considered databases for which I had no effect on the design and did not advise the developers. I only included databases for applications that were actually built and disregarded some incomplete efforts.

The data in this paper is taken entirely from my own first-hand work. I did not include results from any other persons. All of the case studies but one concern relational databases. I believe the data is representative of the broad practice. Furthermore, I would expect to see similar results for other database paradigms and files.

The remainder of the paper is organized as follows. Section 2 describes the case studies. Unfortunately, I do not have permission to identify my sources, but I did try to characterize each application. Section 3 lists the reverse engineering purpose, the available inputs, and the desired outputs. Section 4 evaluates each database, giving separate grades for the quality of the database design and the underlying model. Section 5 lists size metrics and the amount of reverse engineering time. The paper finishes with conclusions.

A companion paper [6] presents additional data on specific practices that were found.

2. Characterization of the Case Studies

Table 1 characterizes the case studies. The case studies are sequentially numbered within the year I encountered the problem and did the work. Thus the case study number is a crude indicator of when the database was built.

As a courtesy to the database sources, I cannot identify the actual applications. However, the software theme does give a flavor of the different kinds of software; the data covers a wide variety of applications.

1095-1350/01 $10.00 © 2001 IEEE

Case study   Software theme

94-5

Source

Manufacturing operations

DB platform

96-3

General comments

Service management

92-1 Work orders in-house (???)

an RDBMS My client had a tentative database. I was trying to help them model it and improve their database design.

The terrible design practices and many generalizations made this difficult to reverse engineer.

various RDBMSs

Oracle

vendor (failure)

vendor (success)

Uses the database not only to store data but also to communicate between subsystems.

Was programmed in Fortran. Infrastructure was archaic, but application logic was sophisticated.

hierarchical DBMS

in-house (success)

vendor (success)

-

an RDBMS I understood little about this application. Nevertheless, I was able to reverse engineer 400 tables in a hard week of work.

A terrible design. The developers were application experts but naive about databases. 94-3 Customer management

vendor (success)

Sybase

PC RDBMS Another terrible design. However, the application seemed to serve my client’s needs.

Gross lack of consistency in the database design. Most databases are more uniform in style, regardless of whether their design quality is excellent or poor.

The product is credible even though the database design is poor.

94-4 Project management in-house (success)

in-house (success)

MS-Access

95-1 RDBMS metadata (from the system catalog) vendor (success) an RDBMS

in-house (success)

Another poor design, but not as bad as case study #94-4.

The developers were uncooperative. Typically developers are more willing to listen to suggestions.

A vendor commissioned this project. Nearly all my other product assessments have been commissioned by potential customers.

95-2 Financial data analysis MS-Access

Oracle 95-3 Financial data analysis in-house (failure)

vendor (failure)

FoxPro (a PC RDBMS)


Watcom RDBMS

RDBMS on an AS/400 platform

The vendor replaced this product with another before my client could make a decision.

This was an unusual application. My client heavily modified vendor code.

The database was conceived and designed using ERwin. The application failed for reasons apart from software quality.

From a database perspective, I recommended the product. My client decided they did not need the software.

vendor

modified vendor code (success)

(???)

96-4 Equipment management in-house (failed)

an RDBMS

96-5 Project management vendor (success)

various RDBMSs

96-6 Employee time tracking vendor (success)

various RDBMSs

The same vendor as for case study #96-5. I was lukewarm about this product and that discouraged my client from pursuing it.

Table 1 Characterization of the case studies


Case study   Software theme   Source   DB platform   General comments

97-2

97-3

97-4

97-5

vendor (???)

vendor (???)

in-house (failure)

in-house (???)

Parts data management

Software modeling tool

Service management

Customer data

98-1 vendor (success) Parts data management

management in-house (failure)

00-1 Employee data

00-5

00-6

00-7

01-1

vendor (success)

in-house (success)

vendor (success)

vendor (success)

Sensor data

Catalog data management

Help desk operations

Process control

an RDBMS vendor Considered this product as an alternative to #96-6.

The reverse engineering assessment concurred with difficult training and an unsuccessful trial deployment. Oracle

Btrieve I don’t know if this product succeeded in the marketplace. The developer listened to my suggestions and greatly improved the next release. MS-Access

MS-Access Not sure if this was successful or not.

There was also an OO-DBMS implementation that ran much slower than Oracle! Training and deployment were better for this product than for case study #97-2.

The database structure had names that were unusually confusing.


Oracle

98-2 Financial data in-house (success) Rdb

The developers complained about database performance, but it was a result of their poor design and poor query formulation.

MS-Access

99-1 Financial data vendor (success) MS-Access

My client had already purchased the application and I was trying to understand it better.

Oracle A poor database design was threatening the future viability of this application and became the impetus for reengineering.

in-house (success)

00-2 Customer data I in-house (success)

I had parts of several applications. I can assess the style, but cannot give application-specific statistics. MySQL

Oracle

SQL Server

My client was poring over the product and wanted me to do an especially thorough assessment.

This was a small, but complex metadatabase storing both structure and data in application tables.

vendor 00-3 Library data management (success)

00-4 Network security vendor (success)

unspecified RDBMS(s)

I just took a quick look at the database.

The input was an IDEF1X model with about 80% of relationships defined. IDEF1X models tend to be easier to reverse engineer.

The database had extensive cross links among tables (probably the most I have ever seen). Many tables relate to other tables. It seems excessive.

Oracle

various RDBMSs

SQL Server

A vendor also commissioned this reverse engineering.

vendor (success) 01-2 Service management

My client had already bought the product, but wanted to understand it better. Oracle

Table 1 (continued) Characterization of the case studies


The case studies cover both in-house developed software (14 databases) and commercial products (21 databases). I also listed, where possible, whether the application succeeded or failed. I define success as software that has been used. By definition, the case studies are biased towards success; for example, customers would not see the products that were scrapped and kept from the marketplace.

Figure 1 shows that all but one (#94-1) of the case studies are for relational databases (RDBMSs) and that the data are for a mix of products.

FoxPro (a PC RDBMS)   1
MS-Access   6
Oracle   8
Rdb   1
RDBMS on an AS/400 platform
SQL Server
Sybase
Watcom RDBMS
unspecified RDBMSs   11

Figure 1 Distribution of database platform for case studies

One of the things that has always intrigued me about database reverse engineering is how widely the problems vary. The general comments summarize some nuances of the case studies.

3. Purpose, Inputs, and Outputs

Table 2 lists the purpose of reverse engineering, available inputs, and required outputs for each case study.

Figure 2 tallies the distribution of purposes for reverse engineering. (The sum exceeds 35 because some case studies had multiple motivations.) The case studies provide empirical evidence for the wide utility of database reverse engineering.

Reverse engineering purpose   Count
Consider for purchase   13
Reengineering   9
Enterprise modeling / integration
Facilitate maintenance
Understand application better
Curiosity
Provide training material
Unknown

Figure 2 Distribution of reverse engineering purpose for case studies

A data dictionary was available for several of the case studies. A data dictionary is a list of tables and columns with a brief explanation.

Some case studies had overlapping information sources that did not completely agree. For example, several case studies had both an SQL script to create an empty database and a data dictionary. The tables and columns in the data dictionary did not precisely match those in the SQL script. In these situations I arbitrarily favored one of the information sources. This was sufficient to achieve the reverse engineering purpose.
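A disagreement between two schema sources, such as an SQL script and a data dictionary, can be listed mechanically before deciding which source to favor. The following is a minimal sketch, not the procedure used in the case studies; the table and column names, the dictionary layout, and the function `compare_sources` are all hypothetical and chosen only for illustration.

```python
# Hypothetical inputs: each source maps a table name to its set of column names.
sql_script = {
    "customer": {"customer_id", "name", "phone"},
    "work_order": {"order_id", "customer_id", "opened_on"},
}
data_dictionary = {
    "customer": {"customer_id", "name", "fax"},
    "invoice": {"invoice_id", "order_id"},
}


def compare_sources(a, b):
    """Report tables and columns that appear in only one of two schema sources."""
    report = {
        "tables_only_in_a": sorted(set(a) - set(b)),
        "tables_only_in_b": sorted(set(b) - set(a)),
        "column_mismatches": {},
    }
    # For tables present in both sources, compare their column sets.
    for table in sorted(set(a) & set(b)):
        only_a = sorted(a[table] - b[table])
        only_b = sorted(b[table] - a[table])
        if only_a or only_b:
            report["column_mismatches"][table] = {
                "only_in_a": only_a,
                "only_in_b": only_b,
            }
    return report


report = compare_sources(sql_script, data_dictionary)
print(report)
```

A report like this only makes the differences visible; choosing which source to trust remains a judgment call.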

4. Database Evaluation

Table 3 evaluates each database. Grey cells are not documented; I did not collect full information for some case studies and lacked inputs for reconstructing them. Table 3 has several abbreviations.

CK (candidate key): a combination of columns that uniquely identifies each row in a table. The combination must be minimal and not include any columns that are not needed for unique identification. No column in a candidate key can be null.

PK (primary key): a candidate key that is preferentially used to access the records in a table. A table can have at most one primary key; normally each table should have a primary key.

FK (foreign key): a reference to a primary key; it is the glue that binds tables. A foreign key must have a value for all columns or it must be wholly null. Nominally, a foreign key need not refer to a primary key and could instead refer to some other candidate key, but that is a poor design form.
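The three kinds of keys can be illustrated with a small schema. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `department` and `employee` tables and the helper `rejected` are hypothetical and do not come from any case study. A candidate key that is not the primary key is declared here as `NOT NULL UNIQUE`, which is one common way to express it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,          -- PK
    dept_code TEXT NOT NULL UNIQUE          -- another CK, not the PK
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,            -- PK
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department (dept_id)  -- FK to the PK
);
""")

conn.execute("INSERT INTO department VALUES (1, 'ENG')")
conn.execute("INSERT INTO employee VALUES (10, 'Ada', 1)")
conn.execute("INSERT INTO employee VALUES (11, 'Bob', NULL)")  # a wholly null FK is allowed


def rejected(sql):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A dangling FK reference (no department 99) is refused.
dangling_rejected = rejected("INSERT INTO employee VALUES (12, 'Cy', 99)")
# A duplicate value for the candidate key dept_code is refused.
duplicate_ck_rejected = rejected("INSERT INTO department VALUES (2, 'ENG')")
print(dangling_rejected, duplicate_ck_rejected)
```

When keys are declared this way, the DBMS itself prevents dangling references and duplicate candidate key values.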

A previous paper [3] presented a database grading scale. Grades are helpful for summarizing the result of reverse engineering. Table 3 extends the earlier work and assigns separate grades for the database design (largely syntactic) and the underlying model (largely semantic); they do not always correlate. Figure 3 defines the extended grading scales.

Figure 4 summarizes the grades for the case studies. Letter grades are converted to numeric grades with an “A”=4.0, “F”=0.0, and other grades in between.
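The conversion from letter grades to a numeric average can be sketched as follows. The text fixes only A=4.0 and F=0.0; the evenly spaced values for B, C, and D are an assumption, and the sample grade lists are hypothetical rather than taken from Table 3.

```python
# Assumed mapping: only A=4.0 and F=0.0 are stated; B, C, D are evenly spaced.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def average_grade(letters):
    """Convert letter grades to points and return their arithmetic mean."""
    points = [GRADE_POINTS[letter] for letter in letters]
    return sum(points) / len(points)


# Hypothetical grade lists, for illustration only.
print(average_grade(["B", "C", "D", "F"]))
print(average_grade(["A", "B", "C", "C"]))
```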

According to Figure 4, database designs seem to have improved over the past several years. I believe this trend is


Case study - 92-1

92-2

93-1

Reverse engineering purpose Available inputs

Reengineering-recover logic for successor application


Consider for purchase and under- An SQL script for creating tables stand how to integrate with other A user manual applications

An IDEF1X model Dialogue with the developers

A sample populated database Various manuals Executable software

94-1

94-2

Desired outputs

A conceptual model

An evaluation Product understanding A conceptual model


94-3

94-4

Consider for purchase and understand how to load legacy data (reengineering)

Reengineering-recover legacy data for another application

94-5

An SQL script for creating tables Various manuals

A populated database A user manual


A rough conceptual model

95-1 A populated database A user manual Training material

95-2

An evaluation A conceptual model

95-3

A populated database A brief meeting with the developer Include in an enterprise model

96-1

A conceptual model

96-2

96-3


Include in an enterprise model

96-4

96-5

96-6

A data dictionary An SQL script for creating tables

The database structure


A conceptual model

97-1

An IDEF1X model Dialogue with the developers Include in an enterprise model

A conceptual model

The database structure A brief explanation of names

A conceptual model Consider conversion to RDBMS



Consider for purchase   An SQL script for creating tables   An evaluation

A data dictionary An SQL script for creating tables Marketing brochures

A data dictionary An SQL script for creating tables Marketing brochures

An evaluation Product understanding

An evaluation Product understanding

Curiosity   A populated database   An evaluation   A conceptual model

I was never quite sure why management asked me to look at it

Port database to a new platform

An SQL script for creating tables An evaluation

Repair of major problems

A data dictionary An SQL script for creating tables

An evaluation Product understanding Consider for purchase

Table 2 Purpose, inputs, and outputs


Case study   Reverse engineering purpose

97-2

Available inputs

Table definitions in the DBMS catalog An evaluation Training materials Product understanding Consider for purchase

97-3 An SQL script for creating tables Brief explanation of relationships An Entity-Relationship model

An evaluation Product understanding Consider for purchase

A printout of the database structure Dialogue with the developers

An evaluation. Repair of major problems. 97-4 Reengineer the application

97-5

98-1

98-2

Understand legacy software to facilitate maintenance Executable software


Reengineering-seed the model for a related application

A populated database A conceptual model

Table definitions in the DBMS catalog An evaluation Training materials and user's manual Product understanding

Table definitions in the DBMS catalog A conceptual model Dialogue with the developers

98-3 A printout of the database structure An evaluation Repair of major problems Reengineer the application

A populated database A user manual 99-1 Understand application better

A printout of the database structure A conceptual model Repair of major problems. 00-2 Reengineer the application

Product understanding

00-1 A printout of the database structure An evaluation.

A conceptual model. Repair of major problems.

Reengineer the application

00-3

00-7 Consider for purchase

A data dictionary An IDEF1X model

An SQL script for creating tables Various manuals

An evaluation A conceptual model A comparison of the product model to my client's desired model

00-4

00-5

00-6

Table 2 (continued) Purpose, inputs, and outputs


An SQL script for creating tables

Various manuals

An SQL script for creating tables A brief explanation from the vendor

An IDEF1X model Business documentation

An evaluation Improve understanding Sample data Product understanding

A quick evaluation

A conceptual model

Curiosity

Reengineering-recover logic for successor software


A database evaluation Product understanding A conceptual model

01-1 A data dictionary with indexes and foreign key definitions A conceptual model Include in an enterprise model

01-2

A database evaluation A conceptual model Improve understanding

A disciplined design style. Excessive propagated identity. B Model seems larger than necessary.

Database design comments Database design grade

Case study - 92-1

Model comments

C

92-2 F

D 93-1

94-1

94-2

B

C

94-3 F

94-4 F

C 94-5

95-1 D

95-2

95-3

D

The database tables have dual sources of identity (Section 2.3 of [4]).

dress fields make it D

96-1 D

96-2 C

96-3 B A disciplined design style. Inconsistent identity (see Figure 2 of [1]). C Some concepts seem to overlap.

96-4 C

96-5 B A disciplined design style. Some giant tables with parallel fields (see Figure 10 of [1]). A I could readily understand most of the model.

Table 3 Database and model evaluation


Case study   Database design grade

97-5

98-1

98-2

98-3

99-1

00-1

00-2

00-3

96-6 C

D

B

C

C

B

D

C

C

97-4 F

Poor indexing was my biggest concern. Some FKs refer to CKs rather than to PKs. Overloaded FKs with alternative referents (can point to multiple records).

A disciplined and thorough design

C Database implements a linked list (probably misguided). Some redundant, derived data that seems unnecessary.

Metatables store values and references within the same field. This permits dangling references.

D

00-4 A

00-5 C

Database design comments

errors. Parallel FKs. Some FKs refer to CKs.

A disciplined and thorough design

A complex model that misuses genericity (see Figure 11 of [2]). (Note: #98-1 is a competing product. #97-2 has 177 tables. #98-1 has 33 tables.)

A disciplined and thorough design The model has directed relationships rather than the notion of a role (see Figure 8 of [2]). Many arbitrary restrictions. For example, forecast data is limited to three years. Estimated and actual data can be stored, but no other possibilities.

It is difficult to judge the model because the database design is so bad.

Two-digit years. FK data type does not always match PK data type. Much denormalization of tables.

D

No PKs defined. Dangling FK references. Sloppy indexing. Parallel FKs.

The model is more complex than needed. The model could be smaller and better conceived.

Some tables seem to have duplicate sources of identity. An inefficient database design.

One PK seems to have an extraneous field. An unnecessary index on a descriptive field. Some inconsistencies in data types.

Insufficient understanding to prepare a model

The database has surprisingly few indexes.

Table 3 (continued) Database and model evaluation


Case study - 00-6

The database has no major flaws. The style is reasonable and uniformly applied.

The database has flaws that are not readily apparent in the operation of the application. The flaws can be repaired without much disruption.

00-7

01-1

Data types and lengths are not uniform. Not null constraints are not used to enforce required fields. Candidate keys and enumerations are not defined in the database. Columns have cryptic names.

01-2

Grade

A

B

C

D

F

Database design grade

B

C

C

D

Database design comments

Has an awkward mix of existence-based and value-based identity. Otherwise a clean design. Overloaded FKs with alternative referents (can point to multiple records). Every table (including many-to-many relationships) has an ID as the PK; they do not enforce the uniqueness of many-to-many relationships.

Irregular indexing. Parallel FKs.

Sloppy indexing. Much redundancy.

Model grade - A

Model comments

A deep metamodel. I rarely see this kind of sophistication.

The product uses directed relationships. This is much inferior to roles (see Figure 8 of [2]). The underlying model seems needlessly complex and insufficiently abstract.

Brute force listing of many detailed fields; a property list would be better. Insufficient understanding to prepare a model

Table 3 (continued) Database and model evaluation

Explanation   Examples of design flaws

The database has major flaws that are difficult to fix and cause noticeable problems (bugs, reduced performance, difficult maintenance) in the application.

Primary keys are not defined. Indexing is haphazard; many foreign keys lack indexes and some indexes are subsumed by other indexes. Foreign keys have mismatching data types. Excessive propagated identity. Parallel foreign keys.

The database has severe flaws that compromise the application.

The database has much unnecessary redundant data. Extensive binary data (compiled programming language data structures) may be stored in the database, subverting the declaration of data. There may be gross denormalization and dangling foreign key references.

The database is appalling. The application does not run properly or runs only because of brute-force programming effort.

Gross design errors

Examples of model flaws

Anonymous fields that application code must interpret.

Needless complexity. Excessive generalization. Specific modeling errors.

Lack of crisp conceptualization. Many arbitrary restrictions.

Gross conceptual errors

Figure 3 Database grading scale


Statistic   Grade
Design average, first 17 case studies   1.5
Design average, last 18 case studies   2.2
Model average, first 10 case studies   2.1
Model average, last 10 case studies   2.1

Figure 4 Average grades for the case studies

real and reflects an increasing use of database design tools. Nevertheless, it is disappointing that the design grade is not higher. It is relatively easy to get a solid design (grade of A or B) from a tool, so presumably many designers still are not using tools.

According to Figure 4, the quality of models has not changed and is about a "C". I am not surprised by this statistic. Many developers are baffled by models and do not appreciate the leverage that modeling can provide for building applications. I have no great insight about how to raise the quality of modeling. About all that I can advise is that companies should rely more heavily on the talented few who understand models well. This advice is consistent with Fred Brooks' observation that the productivity of experts is an order of magnitude higher than that of the masses [7].

Figure 5 shows the average grade for failed and successful projects. Many factors affect the success or failure of a software project, but the data show that the quality of the database design and the underlying model are material.

Figure 5 Grade vs. project success/failure (design average and model average, for successes and for failures)

5. Database Metrics

Table 4 presents data about the reverse engineering productivity of the author and some possible variables. There are two sets of numbers for case study #00-4; the first is for reverse engineering of the metamodel and the second for reverse engineering of the model that populates the metamodel.

Note the size of the databases. The average number of tables in the applications was about 90 and the average number of columns per table was 12.

I believe that reverse engineering effort is proportional to the total number of columns in the database (number of tables * columns per table) as well as some other factors. The last column of Table 4 shows the number of columns reverse engineered per hour of effort. Not counting partial efforts, the ratio varies from 6 to 104 so there are more variables besides this simple ratio. Familiarity with the application does not seem to cause a noticeable difference. On average I was able to reverse engineer 60 columns per hour.
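The ratio in the last column of Table 4 can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name `columns_per_hour` is hypothetical, and the four rows are the values for case studies 92-1, 95-1, 95-2, and 00-6 as they can be read from Table 4. Rounding to the nearest whole number is an assumption about how the published ratios were obtained.

```python
def columns_per_hour(tables, cols_per_table, hours):
    """Total columns (tables * columns per table) divided by hours of effort."""
    return tables * cols_per_table / hours


# (case study, number of tables, columns per table, effort in hours)
rows = [
    ("92-1", 32, 8, 3),
    ("95-1", 7, 14, 1),
    ("95-2", 67, 12, 12),
    ("00-6", 94, 8, 10),
]

# Rounded ratio per case study, comparable to the last column of Table 4.
rates = {case: round(columns_per_hour(t, c, h)) for case, t, c, h in rows}
for case, rate in rates.items():
    print(case, rate)
```

The spread among even these four rows shows why the total column count alone does not predict effort.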

6. Conclusions

This paper has reported experimental data obtained over the years from database reverse engineering as well as the interpretations of the author. It would be interesting to see results from other database reverse engineers, as well as data from other reverse engineering disciplines. There are all kinds of questions:

What kinds of results are found for other database paradigms, files, and Cobol data structures?

How does the quality of programming compare to the quality of databases?

How do the results in this paper compare to what others have found?

I encourage other reverse engineers to publish their experiences and suggest additional kinds of data that the reverse engineering community should collect.

References

[1] Michael Blaha and William Premerlani. Observed idiosyncrasies of relational database designs. Second Working Conference on Reverse Engineering, July 1995, Toronto, Ontario, 116-125.

[2] Michael Blaha. Dimensions of database reverse engineering. Fourth Working Conference on Reverse Engineering, October 1997, Amsterdam, The Netherlands, 176-183.

[3] Michael Blaha. On reverse engineering of vendor databases. Fifth Working Conference on Reverse Engineering, October 1998, Honolulu, Hawaii, 183-190.

[4] Michael Blaha. An industrial example of database reverse engineering. Sixth Working Conference on Reverse Engineering, October 1999, Atlanta, Georgia, 196-203.

[5] Michael Blaha and Ian Benson. Teaching database reverse engineering. Seventh Working Conference on Reverse Engineering, November 2000, Brisbane, Australia, 79-85.

[6] Michael Blaha. A Retrospective on Industrial Database Reverse Engineering Projects-Part 2. Submitted to Eighth Working Conference on Reverse Engineering.

[7] Frederick P. Brooks, Jr. The Mythical Man-Month, Anniversary Edition. Reading, Massachusetts: Addison-Wesley, 1995.


Case study   Number of tables   Columns per table   My familiarity with application   Approximate recovery effort (in hours)   No. of tables * cols per table / Effort

92-1   32   8   low   3 (no analysis recovery)   85

93-1 pG------ 19

88

00-5 134 12 low 14 102

94-2

94-3 94-4

~400 low 40 (no analysis recovery)

33 16 high

37 34 medium - - .

94-5

95-1

95-2


13 5 medium 2 32

7 14 high 1 98

67 12 medium 12 67

00-4 11 8 7 13

34 ~20 low 9 76

00-6 00-7 01-1

94 8 low 10 75

232 10 low 77 30

82 14 low 10 (partial analysis recovery) 115

01-2 484 ~12 medium 40 (partial design, no analysis recovery) 145
