
Role of Data cleaning in Data Warehouse

Presentation by

Ramakant Soni
Assistant Professor, BKBIET, Pilani

[email protected]

Introduction

What is a Data Warehouse?

A data warehouse is an information delivery system in which data is integrated and transformed into information used largely for strategic decision making. Historical data from the enterprise's various operational systems is collected and combined with relevant data from outside sources to form the integrated content of the data warehouse.

What is Data Cleaning?

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of the data.



Steps to Build a Data Warehouse: The ETL Process

Figure 1. ETL Process
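As a companion to the figure, here is a minimal ETL sketch in Python; it is not from the original slides, and all function and field names are illustrative assumptions:

```python
# Minimal ETL sketch (illustrative only): extract rows from a source,
# transform/clean them, and load them into a target store.

def extract():
    # Stand-in for reading from an operational system or flat file.
    return [
        {"cust_id": "1", "name": " Alice ", "city": "pilani"},
        {"cust_id": "2", "name": "Bob",     "city": "PILANI"},
    ]

def transform(rows):
    # Apply simple cleaning/normalisation rules before loading.
    cleaned = []
    for row in rows:
        cleaned.append({
            "cust_id": int(row["cust_id"]),  # enforce the target type
            "name": row["name"].strip(),     # remove stray whitespace
            "city": row["city"].title(),     # normalise casing
        })
    return cleaned

def load(rows, target):
    # Stand-in for inserting into the warehouse's target schema.
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```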

Need for Data Cleaning

• Data warehouses require, and provide extensive support for, data cleaning.

• They load and continuously refresh huge amounts of data from a variety of sources, so the probability of “dirty data” is high.

• Data warehouses are used for decision making, so the correctness of their data is vital to avoid wrong conclusions.

RAMAKANT SONI, BKBIET

Requirements

A data cleaning approach should satisfy several requirements:

• Detect and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources. The approach should be supported by tools that limit manual inspection and programming effort.

• Data cleaning should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata.

• Mapping functions for data cleaning should be specified in a declarative way and be reusable for other data sources as well as for query processing (a sketch follows this list).

• A workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.
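To make the declarative-mapping requirement concrete, here is a minimal sketch, assuming Python and invented field names, of cleaning rules specified as data rather than ad-hoc code, so the same rule set can be reused across sources:

```python
# Sketch of declarative, reusable cleaning rules: each rule maps a field
# name to a transformation, so one rule set can be applied to any source.

RULES = {
    "name": str.strip,   # trim stray whitespace
    "city": str.title,   # normalise casing
    "email": str.lower,  # canonical form for matching
}

def apply_rules(record, rules=RULES):
    # Apply each rule to its field if the field is present.
    return {
        field: rules[field](value) if field in rules else value
        for field, value in record.items()
    }

print(apply_rules({"name": " Ada ", "city": "pilani", "email": "[email protected]"}))
```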


Data Quality Problems


Single-source problems

The data quality of a source largely depends on the degree to which it is governed by a schema and by integrity constraints controlling permissible data values.

• Sources without a schema, such as files, place few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies.

• Database systems enforce the restrictions of a specific data model (e.g., the relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific integrity constraints.

Schema-level problems occur because of the lack of appropriate model-specific or application-specific integrity constraints.

Instance-level problems relate to errors and inconsistencies that cannot be prevented at the schema level (e.g., misspellings); the sketch below illustrates the distinction.
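A small sketch of the distinction, with assumed field names and constraints (not from the slides): the schema-level check enforces a constraint the data model should carry, while the instance-level check catches values no schema can rule out:

```python
# Schema-level check: a constraint a proper schema would enforce
# (e.g., age must be a non-negative integer).
def violates_schema(record):
    return not (isinstance(record["age"], int) and record["age"] >= 0)

# Instance-level check: errors the schema cannot prevent,
# such as a misspelled or unknown city name.
KNOWN_CITIES = {"Pilani", "Jaipur", "Delhi"}
def violates_instance(record):
    return record["city"] not in KNOWN_CITIES

records = [
    {"age": 34, "city": "Pilani"},  # clean
    {"age": -5, "city": "Jaipur"},  # schema-level problem
    {"age": 28, "city": "Pilnai"},  # instance-level problem (misspelling)
]
for r in records:
    print(r, violates_schema(r), violates_instance(r))
```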


Example: Single-Source Problems


Multi-source problems

The problems present in single sources are aggravated when multiple sources are integrated. Each source may contain dirty data, and the data in the sources may be represented differently, overlap, or contradict one another because the sources are independent.

Result: a large degree of heterogeneity.

Problem in cleaning: identifying overlapping data, in particular matching records that refer to the same real-world entity. This is also referred to as the object identity problem or the duplicate elimination problem.

Frequently, the information is only partially redundant, and the sources may complement each other by providing additional information about an entity.

Solution: duplicate information should be purged, and complementary information should be consolidated and merged, in order to achieve a consistent view of real-world entities.
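A minimal sketch of this purge-and-merge idea, with an illustrative match key and invented records (production systems use far more robust approximate matching):

```python
# Sketch of duplicate elimination and merging across two sources.
# Matching here is on a normalised name; real systems use fuzzy matching.

def key(record):
    # Normalised match key for identifying the same real-world entity.
    return record["name"].strip().lower()

def merge(a, b):
    # Consolidate complementary information: prefer non-empty values.
    return {f: a.get(f) or b.get(f) for f in set(a) | set(b)}

source1 = [{"name": "Kristen Smith", "phone": "", "city": "Pilani"}]
source2 = [{"name": "kristen smith ", "phone": "555-0101", "city": ""}]

merged = {}
for record in source1 + source2:
    k = key(record)
    merged[k] = merge(merged[k], record) if k in merged else record

print(list(merged.values()))  # one consolidated record per entity
```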


Example: Multi-Source Problem

Figure 2. Multi-Source problem example


Data Cleaning Phases

In general, data cleaning involves several phases (sketched in code below):

• Data analysis

• Definition of the transformation workflow and mapping rules

• Verification

• Transformation

• Backflow of cleaned data
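A skeletal view of these phases, under assumed function names and toy rules rather than any implementation from the slides; the point is that rules are verified on a sample before the full transformation, and the cleaned data flows back to the source:

```python
# Skeleton of the cleaning phases: analyse, define rules, verify on a
# sample, transform the full set, then feed cleaned data back.

def analyse(data):
    # Phase 1: profile the data to discover errors and patterns.
    return {"empty_city": sum(1 for r in data if not r["city"])}

def define_rules(profile):
    # Phase 2: derive transformation/mapping rules from the analysis.
    return [lambda r: {**r, "city": r["city"].title() or "UNKNOWN"}]

def apply_all(rules, record):
    for rule in rules:
        record = rule(record)
    return record

def verify(rules, sample):
    # Phase 3: test the rules on a sample and inspect the result.
    return [apply_all(rules, r) for r in sample]

def backflow(cleaned, source):
    # Phase 5: replace dirty data in the source with the cleaned version.
    source[:] = cleaned

source = [{"city": "pilani"}, {"city": ""}]
rules = define_rules(analyse(source))
print(verify(rules, source[:1]))                 # verification on a sample
cleaned = [apply_all(rules, r) for r in source]  # phase 4: full transformation
backflow(cleaned, source)
print(source)
```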


Data Cleaning Process

Figure 3. Data Cleaning Process: data analysis & definition of transformation workflow and mapping rules → verification & transformation → backflow of cleaned data


Data Cleaning Tool Support

A large variety of tools is available to support data transformation and data cleaning:

• Data analysis tools: 1. data profiling tools, e.g., MigrationArchitect (Evoke Software); 2. data mining tools, e.g., WizRule (WizSoft).

• Data reengineering tools use the discovered patterns and rules for cleaning, e.g., Integrity (Vality Software).

• Specialized cleaning tools deal with a particular domain: 1. special-domain cleaning, e.g., IDCentric (FirstLogic); 2. duplicate elimination, e.g., MatchIt (HelpItSystems).

• ETL tools use a repository built on a DBMS to manage all metadata about data sources, target schemas, mappings, scripts, etc., in a uniform way, e.g., Extract (ETI), CopyManager (Information Builders).


References

1. Erhard Rahm and Hong Hai Do, “Data Cleaning: Problems and Current Approaches,” University of Leipzig.

2. Shridhar B. Dandin, “Data Cleaning, a Problem that is Redolent of Data Integration in Data Warehousing,” BKBIET, Pilani.

3. Arthur D. Chapman, “Principles and Methods of Data Cleaning.”


Thank You
