
Page 1: Data Cleaning Techniques

Data Cleaning Techniques

Shahid Rajaee Teacher Training University
Faculty of Computer Engineering

PRESENTED BY: Amir Masoud Sefidian

Page 2: Data Cleaning Techniques

Today's Lecture Content
• Introduction
• Enhanced Technique to Clean Data in the Data Warehouse
• DWCLEANSER: A Framework for Approximate Duplicate Detection
• Data Quality Mining
• Data Quality Mining with Association Rules
• Data Cleaning Using Functional Dependencies

Page 4: Data Cleaning Techniques

Introduction

• Data quality is a central issue in quality information management.
• Data quality problems can occur anywhere in an information system.
• These problems are addressed by data cleaning: a process used to detect inaccurate, incomplete, or unreasonable data and then improve its quality by correcting the detected errors, which reduces errors and improves data quality.
• Data cleaning can be a time-consuming and tedious process, but it cannot be ignored.
• Data quality criteria include accuracy, integrity, completeness, validity, consistency, schema conformance, uniqueness, and more.

Page 6: Data Cleaning Techniques

An Enhanced Technique to Clean Data in the Data Warehouse

• Uses a new algorithm that detects and corrects most common error types, such as lexical errors, domain format errors, irregularities, integrity constraint violations, duplicates, and missing values (see the sketch below).
• Presents a solution that works on quantitative data and on any data with a limited value domain.
• Offers user interaction: the user selects the rules, the sources, and the desired targets.
• The algorithm can clean the data completely, addressing all mistakes and inconsistencies in the specified data or numerical values.
• The time taken to process huge data sets is less important than obtaining high-quality data, since a huge amount of data can be treated in one pass.
• The main focus is on achieving good data quality.
• The implementation speed of this algorithm is adequate: it scales well to large amounts of data without significant degradation in performance.
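As a concrete illustration of the rule-driven detection step, here is a minimal Python sketch. The rule format (a type with a numeric range, or a regular-expression pattern) is a hypothetical stand-in for the user-selected rules; the paper's actual rule representation is not shown on the slides.

import re

# Hypothetical rule definitions: a range rule and a format rule.
rules = {
    "age":   {"type": int, "min": 0, "max": 120},        # domain/range rule
    "phone": {"pattern": re.compile(r"^\d{3}-\d{4}$")},  # format rule
}

def detect_errors(record):
    """Return a list of (field, problem) pairs for one record."""
    problems = []
    for field, rule in rules.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append((field, "missing value"))
            continue
        if "type" in rule:
            try:
                value = rule["type"](value)
            except ValueError:
                problems.append((field, "lexical error"))
                continue
            if not rule["min"] <= value <= rule["max"]:
                problems.append((field, "domain violation"))
        if "pattern" in rule and not rule["pattern"].match(str(value)):
            problems.append((field, "format error"))
    return problems

print(detect_errors({"age": "210", "phone": "555-01"}))
# [('age', 'domain violation'), ('phone', 'format error')]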

Page 7: Data Cleaning Techniques

Flowchart of the proposed technique

The proposed model can easily be implemented in a data warehouse using the following algorithm:

Page 8: Data Cleaning Techniques

The user selects the rules needed in the data cleaning system, along with the layout and descriptions of the data set's fields, which are used in implementing the algorithm.

Page 9: Data Cleaning Techniques

COMPARISON OF THE PROPOSED TECHNIQUE WITH SOME EXISTING TECHNIQUES

A set of 1009 records containing many anomalies was examined before and after processing by different available methods (such as statistics and clustering). The large difference in the number of remaining anomalies confirms the effectiveness and quality of this algorithm.

Page 11: Data Cleaning Techniques

DWCLEANSER: A Framework for Approximate Duplicate Detection

• A novel framework for detecting exact as well as approximate duplicates in a data warehouse.
• Reduces the complexity of previously designed frameworks by providing efficient data cleaning techniques.
• Provides comprehensive metadata support for the whole cleaning process.
• Also suggests provisions for handling outliers and missing fields.

Page 12: Data Cleaning Techniques


Existing Framework

Page 13: Data Cleaning Techniques

Existing Framework

The previously designed framework is a sequential, token-based framework that offers the fundamental services of data cleaning in six steps:

1) Selection of attributes: Attributes are identified and selected for further processing in the following steps.
2) Formation of tokens: The selected attributes are used to form tokens for similarity computation.
3) Clustering/blocking of records: A blocking/clustering algorithm groups the records based on the calculated similarity and a block-token key.
4) Similarity computation for selected attributes: The Jaccard similarity method is used to compare token values of selected attributes in a field (see the sketch below).
5) Detection and elimination of duplicate records: A rule-based detection and elimination approach finds and removes duplicates within one or more clusters.
6) Merge: The cleansed data is combined and stored.
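Step 4 relies on Jaccard similarity between token sets. A minimal Python sketch, assuming simple whitespace tokenization (the framework's exact token-formation rules are more involved):

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty values are trivially identical
    return len(ta & tb) / len(ta | tb)  # |A intersect B| / |A union B|

print(jaccard("John A Smith", "Smith John"))  # 0.666...: 2 shared of 3 distinct tokens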

Page 14: Data Cleaning Techniques


Proposed Framework: DWCLEANSER

Page 15: Data Cleaning Techniques

1. Field Selection

• Records are decomposed into fields.
• Fields are analyzed to gather data about their type, relationships with other fields, key fields, and integrity constraints, so that enough metadata exists about the decomposed fields.
• Missing fields are stored in a separate temporary table and preserved in the repository along with their source record, relation name, data types, and integrity constraints.
• Missing fields are reviewed by the DBA to verify the reason for their existence:
  (1) if the data is missing, it can be recaptured;
  (2) if the value is not known, efforts can be made to gather the data to complete the record or fill the missing field with a valid value;
  (3) if no valid data can be collected, the value is preserved in the repository for further verification and is not used in the cleaning procedure.

Page 16: Data Cleaning Techniques

2. Computation of Rules

Certain rules are computed that will be used during the cleaning process.

Threshold value: The threshold value is calculated based on experiments conducted in previous research. Values below the thresholds increase the number of false positives; values above the thresholds fail to detect all duplicates; values in between can be used to recognize approximate duplicates.

Rules for classification of fields: Selected fields are classified on the basis of their data types.

Rules for data quality attributes: The previous framework focused on only three quality attributes of data: completeness, accuracy, and consistency. Two further quality attributes are proposed in the new framework: validity and integrity.

Page 17: Data Cleaning Techniques

3. Formation of Clusters

• A recursive record-matching algorithm is used for initial cluster formation, with a slight modification: it matches fields rather than whole records.
• Clusters are stored in a priority queue.
• Priorities of clusters in the queue are assigned on the basis of their ability to detect duplicate data sets: the cluster that detected the most recent match is assigned the highest priority.

4. Match Score

Match scores are assigned by applying the Smith-Waterman algorithm (an edit-distance-based strategy). The calculations are stored in a matrix; a minimal sketch follows.
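A minimal Python sketch of Smith-Waterman local-alignment scoring over the dynamic-programming matrix mentioned above. The scoring weights (match = 2, mismatch = -1, gap = -1) are illustrative assumptions; the slides do not give the framework's actual parameters.

def smith_waterman(s: str, t: str, match=2, mismatch=-1, gap=-1) -> int:
    # DP matrix of size (len(s)+1) x (len(t)+1); cell (i, j) holds the best
    # local-alignment score ending at s[i-1], t[j-1], floored at zero.
    h = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = h[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch)
            h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
            best = max(best, h[i][j])
    return best  # highest local-alignment score found in the matrix

print(smith_waterman("johnsmith", "jonsmith"))  # 15: eight matches, one gap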

5. Detection of Exact and Approximate Duplicates

When a new field is to be matched against the data sets present in a cluster, a Union-Find structure is used first; if it fails to detect a match, the Smith-Waterman algorithm is employed. A sketch of the Union-Find shortcut follows.
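A minimal Python sketch of the Union-Find (disjoint-set) shortcut: fields already unioned into one set are known matches, so the costlier Smith-Waterman comparison runs only on a miss.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
uf.union("J. Smith", "John Smith")  # recorded as duplicates earlier
print(uf.find("J. Smith") == uf.find("John Smith"))  # True: cheap exact hit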

6. Handling of Outliers and Missing Fields

Records that do not match any existing cluster are called outliers or singleton records. Singleton records may be stored in a separate file in the repository for future analysis and comparisons.

Page 18: Data Cleaning Techniques

7. Updating Metadata/Repository

Metadata and repositories are an integral part of the proposed framework. The important components of the repositories (sketched below) are:

1. Data dictionary: stores information about the relations, their sources, schema, etc.
2. Rules directory: stores all calculated values of thresholds, quality attributes, matching scores, etc.
3. Log files: store information about the selected fields and their source records, and the classification of the fields by data type into three categories: numeric, strings, and characters.
4. Outlier and missing-field files: store the outliers and missing fields with their related information, such as type and source relation.
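A minimal sketch of these repository components as plain Python dataclasses; the containers and field names are illustrative stand-ins, not the framework's actual storage schema.

from dataclasses import dataclass, field

@dataclass
class Repository:
    data_dictionary: dict = field(default_factory=dict)   # relation -> source/schema info
    rules_directory: dict = field(default_factory=dict)   # thresholds, quality attrs, scores
    log: list = field(default_factory=list)               # selected fields + type category
    outliers_and_missing: list = field(default_factory=list)

repo = Repository()
repo.data_dictionary["customers"] = {"source": "crm_db", "schema": ["name", "zip", "city"]}
repo.rules_directory["similarity_threshold"] = 0.7
repo.log.append({"field": "name", "source_record": 42, "category": "strings"})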

Page 19: Data Cleaning Techniques


Comparison of Existing and Proposed Framework

Page 21: Data Cleaning Techniques

Data Quality Mining

The data mining process:
• Involves data collection, cleaning the data, building a model, and monitoring the model.
• Automatically extracts hidden, intrinsic information from collections of data.
• Offers various techniques that are suitable for data cleaning.

Some commonly used data mining techniques:

Association rule mining:
• Takes an input and induces rules as output; the outputs can be association rules.
• Association rules describe relationships among large data sets and the co-occurrence of items.

Functional dependency: shows the connection and association between attributes, i.e., how one specific combination of values on one set of attributes determines a specific combination of values on another set (a sketch of the check appears below).
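A minimal Python sketch of checking whether a functional dependency X -> Y holds in a relation: every combination of X-values must map to exactly one combination of Y-values.

def fd_holds(rows, x, y):
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in x)
        val = tuple(row[a] for a in y)
        if seen.setdefault(key, val) != val:
            return False  # same X-values, different Y-values: FD violated
    return True

rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Brelin"},  # typo breaks zip -> city
]
print(fd_holds(rows, x=("zip",), y=("city",)))  # False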

Page 23: Data Cleaning Techniques

Data Quality Mining with Association Rules

Objective: association rules are used here to detect, quantify, explain, and correct data quality deficiencies in very large databases; mining finds relationships among the items in a huge database and, in addition, improves data quality.

Association rule mining generates rules over all transactions, which are then checked against their confidence levels.

The strength of the rules is determined by the following steps:
• Determine the transaction type.
• Generate the association rules.
• Assign a score to each transaction based on the generated rules.

Score: the sum of the confidence values of the rules the transaction violates. A rule violation occurs when a tuple satisfies the rule body but not its consequent.

Idea: transactions with high scores are suspected of data quality deficiencies. A minimal confidence threshold is suggested to restrict the rule set and improve the results.

The transactions are sorted by their score values. Based on the score, the system decides whether to accept or reject the data, or else issue a warning. A minimal sketch of the scoring step follows.
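A minimal Python sketch of the scoring step. The two example rules and their confidence values are hypothetical, not mined from real data: a transaction violates a rule when it contains the rule body but not its consequent, and its score is the sum of the violated rules' confidences.

rules = [
    # (body, consequent, confidence)
    ({"diapers"}, {"baby_food"}, 0.9),
    ({"bread", "butter"}, {"milk"}, 0.7),
]

def score(transaction: set) -> float:
    # Sum the confidences of all rules whose body holds but consequent does not.
    return sum(conf for body, cons, conf in rules
               if body <= transaction and not cons <= transaction)

t = {"diapers", "bread", "butter"}  # violates both rules
print(score(t))                     # 1.6 -> a suspicious transaction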

Page 24: Data Cleaning Techniques

Data Cleaning Using Functional Dependencies

A functional dependency (FD) is an important feature for describing the relationship between attributes and candidate keys in tuples.

FD discovery can find too many FDs and, if used directly in a cleaning process, can blow up its running time, degrading the performance of the data cleaning.

The approach develops a cleaning engine by combining an FD discovery technique with a data cleaning technique, and uses a feature from query optimization called the selectivity value to decrease the number of FDs discovered (pruning unlikely FDs).

Page 26: Data Cleaning Techniques

SYSTEM ARCHITECTURE

Page 27: Data Cleaning Techniques

SYSTEM ARCHITECTURE

Data collector
• Retrieves data from a relational database, improves some aspects of data quality (correcting basic typos, invalid domains, and invalid formats), and prepares the data for the next module (in a relational format).

FD engine
• An FD-finding module.
• Dirty data usually contains some errors, so the approximate FD technique is used to handle errors while finding FDs.
• The selectivity value technique is applied to rank the candidates in the pruning step, and only candidates with high or low rank are selected from the FD computation step.
• At the same time, any errors detected by this modified FD engine are suspicious tuples for cleaning.
• The errors can be separated into two types:
  o Errors from finding non-candidate-key FDs indicate inconsistent data.
  o Errors from finding candidate-key FDs indicate potentially duplicated data.
• The discovered FDs, together with all suspicious error tuples, are sent to the next step.

Page 28: Data Cleaning Techniques

SYSTEM ARCHITECTURE

Cleaning engine
Receives the suspicious error tuples and the FDs selected by the FD engine, then assigns a weight to the data (more errors produce a higher weight). Tuples with low weights are used to repair the high-weight tuples.

FD repairing technique: after updating the weights, the engine uses the FDs to clean the data with a cost-based algorithm (low-cost data is used to repair high-cost data).

Duplicate elimination: the last step finds duplicate data by improving the sorted neighborhood method, using the candidate-key FDs from the FD engine to assign the key and sorting the data on the attributes on the left-hand side of the FDs (sketched below).

Relational database: the other modules store and retrieve their data through this module.
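A minimal Python sketch of the sorted neighborhood method: sort the records on a key built from the candidate-key FD's left-hand-side attributes, then compare only records inside a small sliding window. The window size and the toy duplicate test are illustrative assumptions.

def sorted_neighborhood(records, key_attrs, is_dup, window=3):
    # Sort on the FD left-hand-side attributes so likely duplicates are adjacent.
    records = sorted(records, key=lambda r: tuple(r[a] for a in key_attrs))
    pairs = []
    for i, r in enumerate(records):
        for s in records[i + 1 : i + window]:  # compare only within the window
            if is_dup(r, s):
                pairs.append((r, s))
    return pairs

recs = [{"name": "john smith"}, {"name": "john smth"}, {"name": "zoe kim"}]
print(sorted_neighborhood(recs, ["name"],
                          is_dup=lambda a, b: a["name"][:6] == b["name"][:6]))
# [({'name': 'john smith'}, {'name': 'john smth'})]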

Page 29: Data Cleaning Techniques

SELECTING THE FD

The selectivity value is applied to rank the candidates in order to find the appropriate FDs.

1. Selectivity value
The selectivity value measures the distribution of an attribute's values. If the selectivity value of an attribute:
• is high => the attribute's values are highly distributed;
• is low => the attribute's values are more likely to repeat (few distinct values).
A highly distributed attribute is potentially a candidate key and can be used to eliminate duplicates. The attribute with the lowest distribution can be used to repair distorted attribute values in the cleaning engine.

Page 30: Data Cleaning Techniques

SELECTING THE FD

2. Ranking the candidates
After calculating the selectivity values to determine the ranks of the candidates, the ranks are sorted in ascending order.

To choose potentially good candidates, a low-ranking threshold and a high-ranking threshold are defined as pruning points. The selected candidates are those with either high-ranking or low-ranking values:
• A high-ranking candidate has high selectivity and is potentially a candidate key.
• A low-ranking candidate potentially has an invariant value, which can be functionally determined by some attribute in a trivial manner; it can thus be treated as a non-candidate key on the right-hand side of an FD.
• The middle ranks are not precise, so they are ignored.

A minimal sketch of selectivity computation and rank-based pruning follows.
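A minimal Python sketch of the selectivity value (distinct values divided by total rows) and the low/high cut-offs used to keep only extreme-ranked candidates. The threshold values 0.2 and 0.9 are illustrative assumptions, not the paper's tuned parameters.

def selectivity(rows, attr):
    # Fraction of distinct values: 1.0 means all values differ.
    return len({row[attr] for row in rows}) / len(rows)

def keep_candidate(sel, low=0.2, high=0.9):
    # High selectivity: potential candidate key (duplicate elimination).
    # Low selectivity: near-constant, likely right-hand side of an FD.
    # Middle ranks are imprecise and ignored, as described above.
    return sel >= high or sel <= low

rows = [{"id": i, "country": "TH"} for i in range(100)]
print(selectivity(rows, "id"), selectivity(rows, "country"))  # 1.0 0.01
print(keep_candidate(1.0), keep_candidate(0.5))               # True False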

Page 31: Data Cleaning Techniques

SELECTING THE FD

3. Improving the pruning step
The pruning step generates the candidate set for each level by computing it from the candidates of the previous level.

Pruning lattice example

Page 32: Data Cleaning Techniques

Improved pruning method

• The pruning begins by getting the set of candidates in the previous level (level ℓ−1) and then checks each pair of candidates.
• If the candidates are not yet FDs and fall in either the high or the low accepted ranking band, the StoreCandidate function stores a new candidate formed from candidate_x and candidate_y in the current level.
• Candidates whose ranking is neither low nor high are ignored. A minimal sketch follows.
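A minimal Python sketch of this improved pruning loop, assuming candidates are represented as frozensets of attributes; is_fd and in_accepted_rank are stand-ins for the real FD test and ranking bands, and the list append mirrors the StoreCandidate function.

def prune_level(prev_level, is_fd, in_accepted_rank):
    current = []
    for i, cand_x in enumerate(prev_level):
        for cand_y in prev_level[i + 1:]:
            if is_fd(cand_x) or is_fd(cand_y):
                continue  # already an FD: do not extend further
            if in_accepted_rank(cand_x) and in_accepted_rank(cand_y):
                current.append(cand_x | cand_y)  # StoreCandidate equivalent
    return current

lvl1 = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]
print(prune_level(lvl1,
                  is_fd=lambda c: False,
                  in_accepted_rank=lambda c: c != frozenset({"c"})))
# one new candidate: {a, b}; pairs involving the middle-ranked "c" are ignored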

Page 33: Data Cleaning Techniques

Results

50,000 real customer tuples were used as the data source, separated into three data sets:
o the first data set has 10% duplicates,
o the second data set has 10% errors,
o the last data set has 10% duplicates and errors.

The results showed that this approach can identify duplicates and anomalies with high recall and a low false-positive rate.

PROBLEM: the combined solution is sensitive to data size:
• As the data volume increases, the speed of the discovery algorithm decreases.
• As the number of attributes increases, the discovery creates more FD candidates and generates too many FDs, including noisy ones.

Page 34: Data Cleaning Techniques

Strengths and Limitations of Data Quality Mining Methods

Association rules
  Strengths: reduces the number of rules generated for a transaction; avoids a severe pitfall of association rule mining.
  Limitation: difficult to generate association rules for all transactions.

Functional dependency
  Strengths: easily identifies suspicious tuples for cleaning; decreases the number of functional dependencies discovered.
  Limitation: not suitable for large databases, because it is difficult to sort all the records.

Page 35: Data Cleaning Techniques

Main References

1. Hamad, Mortadha M., and Alaa Abdulkhar Jihad. "An Enhanced Technique to Clean Data in the Data Warehouse." 2011 Developments in E-systems Engineering (DeSE), 2011.

2. Thakur, G., Singh, M., Pahwa, P., and Tyagi, N. "DWCLEANSER: A Framework for Approximate Duplicate Detection." Advances in Computing and Information Technology, 2011, pp. 355-364.

3. Natarajan, K., Li, J., and Koronios, A. "Data Mining Techniques for Data Cleaning." Engineering Asset Lifecycle Management, Springer London, 2010, pp. 796-804.

4. Kaewbuadee, Kollayut, Yae Temtanapat, and Ratchata Peachavanish. "Data Cleaning Using Functional Dependency from Data Mining Process." International Journal on Computer Science and Information Systems (IADIS), vol. 1, no. 2, 2006, pp. 117-131. ISSN 1646-3692.

Page 37: Data Cleaning Techniques

QUESTIONS?