a process for assessing data quality harry m. sneed anecon gmbh, wien university of regensburg,...

22
A Process for Assessing Data Quality Harry M. Sneed ANECON GmbH, Wien University of Regensburg, Bayern OÖ-Fachhochschule Hagenberg, Austria ACM Workshop on Software Quality Szeged, Hungary 4th September 2011 SNEED WOSQ2011

Upload: bonnie-logan

Post on 01-Jan-2016

216 views

Category:

Documents


0 download

TRANSCRIPT

A Process for AssessingData QualityHarry M. Sneed

ANECON GmbH, WienUniversity of Regensburg, Bayern

OÖ-Fachhochschule Hagenberg, Austria

ACM Workshop on Software QualitySzeged, Hungary

4th September 2011

SNEEDWOSQ2011

The Problem of Data Quality• The original data model was never thought through and does

not satisfy the requirements on the data. • The original database design was once adequate, but over the

years it has become obsolete as data requirements changed.• The database structure has depreciated under the weight of

many minor alterations (new attributes, keys and Indexes)• The database is now used for other purposes than for which it

was originally intended.• As a result of program faults more and more errors have

crept unnoticed into the data and corrupted it. • An ever increasing number of dead data objects has

accumulated over time and burdens the database.• Data redundancy leads to inconsistent information.• As a result of continual evolution, the data structure has

become so complex that it can no longer be evolved.

SNEED-1WOSQ2011

The Purpose of a Data Audit• A Data Audit is a process by which the quality of the database

is assessed and suggestions for improvement are made. • The Process encompasses two parallel activities:

– Static Analysis of the Database Schema– Static Analysis of the Database Content

• The static analysis of the database schema entails an inspection and measurement of the data structures and data type definitions.

• The static analysis of the database content entails a validation of the actual data content against a specification of what that content should be.

• The goal is to uncover discrepancies and deficiencies in the data and to propose means of eliminating or improving them.

SNEED-2WOSQ2011

Static Analysis of the Database Schema• The database schema is compared against the rules for

data structuring, including the rules of data normalization.

• The data type definitions are checked for compliance and incompliant types reported.

• Structural deficiencies in the database schema are searched for and uncovered.

• Size and complexity limits are controlled and excesses noted.

• The size, complexity and quality of the database structure is measured to compare with other databases and with a benchmark.

• Recommendations for the improvement of the database structure are suggested.

SNEED-3WOSQ2011

Setting the rules

Definingthe metrics

Rules Metrics

DatabaseSchemaAuditor

DeficiencyReport

MetricReport

DatabaseSchema

DatabaseViews

DatabaseAnalyst

Figure 1: Static Database Analysis

SNEED-4WOSQT2011

Typical relational data deficiencies:

• There is no unique primary key • Foreign key to related table is missing• There are no views defined upon a table• No indexes are defined for a table• Key contains too many sub keys• Table contains more than allowed number of columns• Table has more than allowed number of indexes• Table has more than allowed number of views• Table row exceeds maximum allowed length• Table contains incompatible data types• Null Option is missing• Delete Option is missing• Data names do not conform to naming convention• Data schema is inadequately commented.

Checking for Database DeficienciesSNEED-5WOSQ2011

| | || WARN: | 22 TimeStamp Data is not allowed || 253| LEDG_DAT_INS DATETIME YEAR TO SECOND DEFA || | || WARN: | 18 Variable Character Data is not allowed || 254| LEDG_USR_INS VARCHAR (20) DEFAULT USER, || | || WARN: | 17 Not null Option is missing || 254| LEDG_USR_INS VARCHAR (20) DEFAULT USER, || | || WARN: | 17 Not null Option is missing || 256| LEDG_USR_UPD VARCHAR (20), || | || PROB: | 05 Table has no View definition || 258| EPSLEDGE || | || DEFI: | 08 Table has too many relationships (foreign keys) || 258| EPSLEDGE || | || DEFI: | 11 Row Length exceeds maximum length allowed || 258| EPSLEDGE || | || DEFI: | 12 Table has too few comments || 258| EPSLEDGE || | |+----------------------------------------------------------------------+| Number of major Rule Violations = 11 || Number of media Rule Violations = 11 || Number of minor Rule Violations = 132 || Total Number of Rule Violations = 154 || Number of Database Schema Lines = 267 || || Rate of Rule Conformity = 0.651 |+----------------------------------------------------------------------+

Sample Database Deficiency Report SNEED-6WOSQ2011

Database Quality Metrics

• data independence = number of views and indexes relative to the number of tables or record types

• data accessibility = number of keys, indexes and selects relative to the number of tables,

• data flexibility = number of attributes and keys relative to the number of tables,

• data testability = number of test cases relative to the number of data attributes stored,

• data storage efficiency = number of attributes stored relative to the size of the storage in bytes,

• data conformity = number of schema deficiencies relative to the number of schema lines.

WOSQ2011 SNEED-7

Database Complexity Metrics

• data complexity = number of different data types and keys , relative to the sum of all data attributes stored,

• view complexity = number of views relative to the number of tables,

• access complexity = number of tables, keys and indexes relative to the number of stored attributes,

• relational complexity = number of foreign keys and relations relative to the number of tables and keys in all,

• structural complexity = number of structural elements - tables, relations and views – relative to the number of stored attributes,

• storage complexity = number of attributes stored relative to the storage capacity.

WOSQ2011 SNEED-8

+------------------------------------------------------------------------------+| || SQL DATA BASE METRIC REPORT ||SYSTEM: DATABASE || DATE: 10.03.11 PAGE: 001 |+------------------------------------------------------------------------------+| NR NR DATA PRIM FOR NR NR ROW DATA FUNC NR || TABLE NAME LINES FLDS TYPES KEYS KEYS RELS INDX SIZE PTS PTS DEFIS|+------------------------------------------------------------------------------+|[WEEKDAYS] 012 03 02 01 00 01 00 104 009 007 002 ||[ERRORLOGS] 016 05 02 01 00 01 00 017 011 007 002 ||[AUSWERTUNGTYP] 016 06 03 01 00 01 00 163 012 007 002 ||[FEATUREDMITGLIEDPOO 014 05 04 01 00 01 00 774 011 007 003 ||[FACHGRUPPEN] 028 10 06 01 00 01 00 101 016 007 002 ||[KONTAKTARTEN] 014 04 02 01 00 01 00 053 010 007 002 ||[CATEGORIES] 022 06 02 01 00 01 00 021 012 007 002 ||[AUSWERTUNGSPALTE] 012 03 02 01 00 01 00 101 009 007 002 ||[MENUCATEGORIES] 012 03 02 01 00 01 00 058 009 007 002 ||[SITEMAP] 024 12 04 01 00 01 00 1309 018 007 003 ||#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 ||[BUSINESSHOURS] 014 05 03 01 00 01 00 017 011 007 002 ||[ZERTIFIKATE] 034 10 05 01 00 01 00 755 016 007 003 ||#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 ||[REPORTINGHISTORY] 018 09 04 01 00 01 00 538 015 007 003 ||#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 ||[MITARBEITERFUNKTION 013 04 04 01 00 01 00 058 010 007 002 ||#FILTERCRITERIA 010 02 02 00 00 01 00 008 006 007 003 ||[SCHLAGWORTE] 019 03 03 01 00 01 00 257 009 007 002 ||[PRODUKTE] 017 07 03 01 00 01 00 1205 013 007 003 ||[REGIONEN] 022 12 06 01 00 01 00 142 018 007 002 ||[REPORTTYPE] 012 03 03 01 00 01 00 058 009 007 002 ||[SUCHNAMEPATTERNS] 007 02 02 00 00 01 00 017 006 007 003 ||[STATISTIKOHNEPRODUK 018 09 04 01 00 01 00 523 015 007 003 ||[BANNER] 014 05 02 01 00 01 00 773 011 007 003 ||[STATISTIKOHNEEDITIE 018 09 04 01 00 01 00 523 015 007 003 |

Size Metrics of individual Data Tables SNEED-9WOSQ2011

+------------------------------------------------------------------------------+| D A T A B A S E Q U A N T I T Y M E T R I C S |+------------------------------------------------------------------------------+| Total Number of SQL Source Files ===> 01 || Total Number of SQL Code Lines ===> 20107 || Total Number of SQL Comment Lines ===> 2320 || Total Number of SQL Procedure Statements ===> 7842 || Total Number of SQL Statement Types ===> 3245 || Total Number of SQL Access Operations ===> 3732 || Total Number of SQL Select Operations ===> 1323 || Total Number of SQL Insert Operations ===> 334 || Total Number of SQL Update Operations ===> 73 || Total Number of SQL Delete Operations ===> 43 || Total Number of SQL Stored Procedures ===> 1497 || Total Number of SQL Procedure Blocks ===> 558 || Total Number of SQL Procedure Conditions ===> 1933 || Total Number of SQL Procedure Loops ===> 09 || Total Number of SQL Procedural Branches ===> 2439 || Total Number of SQL Data Declarations ===> 266 || Total Number of SQL Set Data Assignments ===> 1325 || Total Number of SQL Output Operations ===> 1074 || Total Number of SQL Input Operations ===> 749 || Total Number of Stored Procedure Points ===> 4806 || Total Number of SQL Tables ===> 62 || Total Number of Data Attributes ===> 468 || Total Number of different Data Types ===> 202 || Total Number of primary Keys ===> 56 || Total Number of Foreign Keys ===> 00 || Total Number of DB-Indexes ===> 17 || Total Number of Relationships ===> 62 || Total Number of Database Views ===> 15 || Total Number of Database View Attributes ===> 119 || Total Length of all SQL Rows ===> 36028 || Total Number of Data-Points ===> 10884 || Total Number of Function-Points ===> 434 || Total Number of Deficiencies ===> 311 || Total Number of required Test Cases ===> 1377 |+------------------------------------------------------------------------------+

Summary Report of Database Size Metrics SNEED-10WOSQ2011

+------------------------------------------------------------------------------+| D A T A B A S E C O M P L E X I T Y M E T R I C S |+------------------------------------------------------------------------------+| Data Complexity ===> 0.611 || Access Complexity ===> 0.635 || Relational Complexity ===> 0.609 || Structural Complexity ===> 0.228 || Storage Complexity ===> 0.120 || || Average Database Complexity ===> 0.440 || |+------------------------------------------------------------------------------+| D A T A B A S E Q U A L I T Y M E T R I C S |+------------------------------------------------------------------------------+| Data Independence ===> 0.471 || Data Accessability ===> 0.859 || Data Testability ===> 0.518 || Data Conformity ===> 0.984 || Storage Efficiency ===> 0.200 || || Average Database Quality ===> 0.606 || |+------------------------------------------------------------------------------+

Sample Complexity and Quality MetricsSNEED-11WOSQ2011

Static Analysis of the Database Content

• The database contents are checked against assertions on the valid value domains as well as against assertions on valid relationships.

• Invalid data values are discovered thru comparison with valid values and recorded.

• Missing data objects are reported. • Redundant data objects and data corpses are revealed

and reported.• The degree of correctness is computed for each data

table to compare with other tables.• The correctness metric of the database content is

passed on to the metric database for further evaluation.

SNEED-12WOSQ2011

Specify the scripts

Validationscript

ValidationScript

Compiler

Comparison

Tables

DatabaseContentAuditor

ValidationReport

DatabaseSchema

DataBase

Content

DatabaseTables

Figure 2: Data Validation Process

DatabaseAnalyst

SNEED-13WOSQ2011 Data Validation Process

table: LAENDER;// This is test procedure is for checking the correctness of the Bundesländer when (new.BUNDESLAND_ID = "1") then replace new.SUMME by "11"; when (new.BUNDESLAND_ID = "2") then replace new.SUMME by "12"; when (new.BUNDESLAND_ID = "3") then replace new.SUMME by "13"; when (new.BUNDESLAND_ID = "4") then replace new.SUMME by "14"; when (new.BUNDESLAND_ID = "5") then replace new.SUMME by "15"; when (new.BUNDESLAND_ID = "6") then replace new.SUMME by "16"; when (new.BUNDESLAND_ID = "7") then replace new.SUMME by "17"; when (new.BUNDESLAND_ID = "8") then replace new.SUMME by "18"; when (new.BUNDESLAND_ID = "9") then replace new.SUMME by "19"; when (old.STAATSID = "14") then replace old.STAATSID by "13"; when (new.SUMME = "65") then replace new.SUMME by "66"; when (old.BUNDESLANDNR = "9") then skip; if ( new.BUNDESLAND_ID = old.BUNDESLANDNR ! "0" ); assert new.STAAT_ID = old.STAATSID; assert new.BUNDESLAND = "NIEDEROESTERREICH" if (old.SUMME = "33";) assert new.BUNDESLAND = "OBEROESTERREICH" if (old.SUMME = "44"); assert new.BUNDESLAND = old.BUNDESLAND if (other); assert new.BUNDESLAND = old.BUNDESLAND; assert new.SUMME = "100" ! "200" ! "300" ! "400"; assert new.SUMME = {1.00:5.00}; assert new.MITGLIEDER = "300" ! "656" ! "644" ! "92";end;

Sample Assertion Script SNEED-14WOSQ2011

| Key Fields of Record(new,old) +-----------------------------------------------------------------------------------------+| Ist :BUNDESLANDNR | Soll:BUNDESLANDNR +----------------------------------------------+-----------------------------------------+| Non-Matching Fields | Non-Matching Values +----------------------------------------------+-----------------------------------------+| RecKey:1 | •| Ist : BUNDESLAND | NIEDERöSTERREICH | Soll: BUNDESLAND | NIEDEROESTERREICH +----------------------------------------------+-----------------------------------------+| RecKey:2 | | Ist : STAATSID | 13 | Soll: STAATSID | 12 +----------------------------------------------+-----------------------------------------+| RecKey:2 | | Ist : BUNDESLAND | OBERöSTERREICH | Soll: BUNDESLAND | OBEROESTERREICH +----------------------------------------------+-----------------------------------------+| RecKey:4 | | Ist : STAATSID | 13 | Soll: STAATSID | 14 +----------------------------------------------+-----------------------------------------+| RecKey:5 | | Ist : BUNDESLAND | KäRNTEN | Soll: BUNDESLAND | KAERNTEN +----------------------------------------------+-----------------------------------------+| RecKey:8 | | Ist : STAATSID | 13 | Soll: STAATSID | 18 +----------------------------------------------+-----------------------------------------+

Sample Data Validation ReportSNEED-15WOSQ2011

+----------------------------------------------------------------------------------------+| New:MGNR SUBNR JAHR ART LFNR || Old:MGNR SUBNR JAHR ART LFNR |+----------------------------------------------+-----------------------------------------+| Non-Matching Fields | Non-Matching Values |+----------------------------------------------+-----------------------------------------+| RecKey:000083 0000 0000 BAN 00002 | || Ist: DATUM | 2002-04-29 || Soll: DATUM | 2003-04-10 |+----------------------------------------------+-----------------------------------------+| RecKey:000095 0000 0000 BAN 00002 | missing from the old File/Table |+----------------------------------------------+-----------------------------------------+| RecKey:000112 0000 0000 BAN 00002 | || Ist: DATUM | 2002-05-02 || Soll: DATUM | 2003-04-29 |+----------------------------------------------+-----------------------------------------+| RecKey:000083 0000 0000 BAN 00001 | missing from the new File/Table |+----------------------------------------------+-----------------------------------------+| RecKey:000095 0000 0000 BAN 00001 | missing from the new File/Table |+----------------------------------------------+-----------------------------------------+| RecKey:000112 0000 0000 BAN 00001 | missing from the new File/Table |+----------------------------------------------+-----------------------------------------++----------------------------------------------+-----------------------------------------+| Total Number of old Records checked: 76 || Number of old Records found in new File: 73 || Number of old Records with duplicate Keys: 00 || Number of old Records not in new Table: 03 || Total Number of new Records checked: 76 || Number of new Records found in old File: 75 || Number of new Records with alternate Keys: 00 || Number of new Records not in old File: 01 || Total Number of Fields checked: 225 || Total Number of non-Matching Fields: 02 || Percentage of matching Fields: 99 % || Percentage of matching Records: 99 % |+----------------------------------------------------------------------------------------+

Data Validation Metric ReportSNEED-16WOSQ2011

Database Size ComparisonSNEED-17WOSQ2011

Database Complexity EvaluationSNEED-18WOSQ2011

Database Quality EvaluationSNEED-19WOSQ2011

Quality0

0,4 0,6

1

0,5

Complexity1

0,6 0,4

0

0,5

Change rate1

0,7 0,05

0

0,1

Defect DensityDefects per 1000 Data

0,020

0,002 0,001

0

0,003

Conformity0,5

0,6 0,8

1,00

0,7

Correctness0,5

0,6 0,9

1,0

0,8

Visualization of the Database Status with SoftEval

Database DashboardSNEED-20WOSQ2011

Conclusions

• Databases are an important asset of corporate users, perhaps even more important than the software!

• As with software the quality of data decreases over time as a result of usage and change.

• The complexity of data increases over time as a result of change and extension.

• Therefore, it is imperative to periodically clean up and restructure the data.

• For that purpose one needs information on the state of the data.

• Such information is provided by a database audit.

SNEED-21WOSQ2011