data preprocess


Upload: srigiridharan92

Post on 13-Jul-2015


Category:

Education



Page 1: Data preprocess
Page 2: Data preprocess

1. Introduction

2. Data Quality: Why Preprocess the Data?

3. Data Preprocessing tasks

4. Data Cleaning

5. Data integration

6. Data reduction

7. Data Transformation and Data Discretization

8. Conclusion

Page 3: Data preprocess

• Data preprocessing is the process that comes before applying data mining techniques.

Page 4: Data preprocess

• Low-quality data will lead to low-quality mining results.

• So we need to apply data preprocessing techniques such as:

– Data quality

– Data cleaning

– Data integration

– Data reduction

– Data transformation

– Data discretization

Page 5: Data preprocess

• Data have quality if they satisfy the requirements of the intended use.

• There are many factors comprising data quality, including:

– Accuracy

– Completeness

– Consistency

– Timeliness

– Believability

– Interpretability

Page 6: Data preprocess

• Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

• Basic methods of data cleaning:

– Missing values

– Noisy data

– Data cleaning as a process

Page 7: Data preprocess

• Ignore the tuple that contains the missing value

• Fill in the missing values manually [time-consuming and often infeasible]

• Fill them in automatically with a global constant [e.g., "Unknown", ∞]

• Use the most probable value to fill in the missing value [determined by regression, inference-based tools using a Bayesian formalism, or decision tree induction]
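The strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as None, the function names are hypothetical, and the most frequent observed value is used as a simple stand-in for the "most probable" value (the slide's regression, Bayesian, or decision-tree approaches are not implemented here).

```python
from collections import Counter


def ignore_tuples(rows):
    """Drop every tuple (row) that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]


def fill_constant(values, constant="Unknown"):
    """Replace each missing value with the same global constant."""
    return [constant if v is None else v for v in values]


def fill_most_frequent(values):
    """Replace each missing value with the most frequent observed value.

    Assumption: the mode is used as a crude proxy for the most probable value.
    """
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


if __name__ == "__main__":
    rows = [("alice", 30), ("bob", None)]
    print(ignore_tuples(rows))
    print(fill_constant(["teacher", None]))
    print(fill_most_frequent(["red", None, "red", "blue"]))
```

Ignoring tuples is the simplest option but discards the other attribute values in the row, which is why the filling strategies are usually preferred when many rows are affected.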

Page 8: Data preprocess

• Noise is the random error or variance in a measured variable.

• Binning:

Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.

The sorted values are distributed into a number of “buckets”, or “bins”.

Page 9: Data preprocess

• Smoothing by bin means:

Each value in a bin is replaced by the mean value of the bin [e.g., the mean of the values 4, 8, 15 in a bin is 9, so each value is replaced by 9].

• Smoothing by bin medians:

Each value in a bin is replaced by the bin median.

• Smoothing by bin boundaries:

The minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

Binning is also used as a discretization technique.
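The three smoothing variants can be sketched as follows. The sample list is hypothetical except for the values 4, 8, 15 taken from the slide, the bins are assumed to be equal-frequency (three values each), and ties in boundary smoothing are assumed to go to the lower boundary.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


if __name__ == "__main__":
    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample
    bins = equal_frequency_bins(data, 3)
    print(smooth_by_means(bins))
    print(smooth_by_medians(bins))
    print(smooth_by_boundaries(bins))
```

Larger bins give a stronger smoothing effect, at the cost of flattening more of the genuine variation in the data.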

Page 10: Data preprocess

• Regression:

Data smoothing can also be done by regression, a technique that conforms data values to a function.

– Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.

– Multiple linear regression is an extension of linear regression in which more than two attributes are involved.

• Outlier analysis:

Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Values that fall outside the set of clusters may be considered outliers.
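As an illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces the observed values of the second attribute with the values on the line. The function names are hypothetical, and ordinary least squares is assumed as the definition of the "best" line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical noisy measurements
    print(fit_line(xs, ys))
    print(smooth_by_regression(xs, ys))
```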

Page 11: Data preprocess

• The first step in data cleaning is discrepancy detection [finding inconsistent data].

• The data should be examined regarding:

– Unique rule [each value of the given attribute must be different from all other values of that attribute]

– Consecutive rule [no missing values between the lowest and highest values of the attribute]

– Null rule [specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition]
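A rough sketch of how the three rules could be checked on a single attribute is shown below. The function names and the set of null markers are assumptions made for illustration, and the consecutive rule is checked for integer-valued attributes only.

```python
# Assumption: these strings are the markers that stand for "no value".
NULL_MARKERS = {"", "?", "N/A"}


def violates_unique_rule(values):
    """True if any value of the attribute occurs more than once."""
    return len(set(values)) != len(values)


def violates_consecutive_rule(values):
    """True if an integer is missing between the lowest and highest values."""
    present = set(values)
    return any(v not in present for v in range(min(values), max(values) + 1))


def find_nulls(values, null_markers=NULL_MARKERS):
    """Return the positions whose value is None or one of the null markers."""
    return [i for i, v in enumerate(values) if v is None or v in null_markers]


if __name__ == "__main__":
    check_numbers = [101, 102, 104]  # hypothetical attribute values
    print(violates_unique_rule(check_numbers))
    print(violates_consecutive_rule(check_numbers))
    print(find_nulls(["600001", "?", "", "600042"]))
```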

Page 12: Data preprocess

• Use commercial tools

Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections

Data auditing: analyze the data to discover rules and relationships, and detect data that violate them (e.g., correlation and clustering to find outliers)

• Data migration and integration

Data migration tools: allow transformations to be specified

ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

Page 13: Data preprocess

• Data integration is the merging of data from multiple data stores.

• Careful integration helps avoid and reduce redundancies and inconsistencies in the resulting data set.

• Schema integration: [integrate metadata from different sources]

• Entity identification problem: [identify real-world entities from multiple data sources]

• Redundancy analysis: [an attribute may be redundant; some redundancies can be detected by correlation analysis]
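For numeric attributes, correlation analysis is often done with the Pearson correlation coefficient. The sketch below is one way to compute it; the attribute names and the 0.9 threshold are hypothetical choices for illustration, not values from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # assumes neither attribute is constant


def looks_redundant(xs, ys, threshold=0.9):
    """Flag a pair of attributes whose absolute correlation exceeds the threshold.

    Assumption: 0.9 is an arbitrary cut-off chosen for this example.
    """
    return abs(pearson(xs, ys)) > threshold


if __name__ == "__main__":
    price_usd = [1.0, 2.5, 4.0, 7.5]      # hypothetical attribute
    price_cents = [100, 250, 400, 750]    # same information in another unit
    print(pearson(price_usd, price_cents))
    print(looks_redundant(price_usd, price_cents))
```

A high correlation only suggests redundancy. Whether one attribute can actually be dropped still depends on what it means in each source.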

Page 14: Data preprocess

• Data reduction techniques are applied to obtain a reduced representation of the data set.

• Data reduction strategies include:

– Dimensionality reduction:

Remove unimportant attributes.

Its methods include wavelet transforms and principal components analysis (PCA), which transform or project the original data onto a smaller space.

Page 15: Data preprocess

– Numerosity reduction:

Replace the original data volume by alternative, smaller forms of data representation.

– Data compression:

Transformations are applied to obtain a reduced or “compressed” representation of the original data.

• If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called “lossless”.

• If we can reconstruct only an approximation of the original data, the data reduction is called “lossy”.

• Dimensionality reduction and numerosity reduction techniques can also be considered forms of “data compression”.
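The difference between lossless and lossy reduction can be illustrated with a small sketch. The lossless part uses Python's standard zlib module. The lossy part uses simple rounding to a fixed step, which is an assumption chosen for brevity and not a technique named in the slides.

```python
import zlib


def lossless_roundtrip(text):
    """Compress text with zlib and restore it exactly."""
    compressed = zlib.compress(text.encode("utf-8"))
    restored = zlib.decompress(compressed).decode("utf-8")
    return compressed, restored


def lossy_roundtrip(values, step=10):
    """Store each value as a multiple of step, then rebuild an approximation.

    Assumption: rounding to a fixed step stands in for a real lossy method.
    """
    codes = [round(v / step) for v in values]
    approximated = [c * step for c in codes]
    return codes, approximated


if __name__ == "__main__":
    original = "abc" * 100
    compressed, restored = lossless_roundtrip(original)
    print(len(original), len(compressed), restored == original)
    print(lossy_roundtrip([12, 27, 33, 48]))
```

In the lossy case only the small integer codes need to be stored, but the exact original values cannot be recovered from them.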

Page 16: Data preprocess

[Figure: Data compression. Lossless: Original Data and Compressed Data. Lossy: Original Data and Approximated data.]

Page 17: Data preprocess

• Data transformation routines convert the data into appropriate forms for mining.

• Strategies for data transformation include:

– Smoothing: Remove noise from the data.

– Attribute/feature construction: New attributes are constructed from the given ones to help the mining process.

– Aggregation: Summarization and data cube construction, e.g., daily sales data may be aggregated to compute monthly or annual totals.

– Normalization: Attribute values are scaled to fall within a smaller, specified range, e.g., by min-max normalization (−1.0 to 1.0 or 0.0 to 1.0).
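Min-max normalization maps a value v from the range [min, max] to a new range [new_min, new_max] using v' = (v − min) / (max − min) × (new_max − new_min) + new_min. A minimal sketch follows. The function name, the sample values, and the handling of a constant attribute are assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Assumption: a constant attribute is mapped to new_min everywhere.
        return [new_min for _ in values]
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


if __name__ == "__main__":
    incomes = [12000, 73600, 98000]  # hypothetical attribute values
    print(min_max_normalize(incomes))              # range 0.0 to 1.0
    print(min_max_normalize(incomes, -1.0, 1.0))   # range -1.0 to 1.0
```

Because the mapping depends on the observed minimum and maximum, a later value outside that range would fall outside the target range as well.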

Page 18: Data preprocess

• Data discretization transforms numeric data by mapping values to interval or concept labels.

• Discretization and concept hierarchy generation can also be useful, where raw values for attributes are replaced by ranges or higher conceptual levels.

• For example, raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or higher-level concepts (e.g., youth, adult, senior).
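The age example can be sketched as two small mapping functions. The interval labels follow the slide (0-10, 11-20, and so on). The cut-off ages for youth, adult, and senior are not given in the slides and are assumed here purely for illustration.

```python
def interval_label(age):
    """Map a non-negative integer age to an interval label such as '0-10' or '11-20'."""
    if age <= 10:
        return "0-10"
    low = ((age - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"


def concept_label(age, youth_max=20, adult_max=59):
    """Map an age to a higher-level concept.

    Assumption: youth_max and adult_max are hypothetical thresholds.
    """
    if age <= youth_max:
        return "youth"
    if age <= adult_max:
        return "adult"
    return "senior"


if __name__ == "__main__":
    for age in [7, 11, 20, 35, 70]:
        print(age, interval_label(age), concept_label(age))
```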

Page 19: Data preprocess

• Three types of attributes

– Nominal: values from an unordered set, e.g., color, profession

– Ordinal: values from an ordered set, e.g., military or academic rank

– Numeric: quantitative values, e.g., integer or real numbers

• Discretization:

Divide the range of a continuous attribute into intervals.

– Interval labels can then be used to replace actual data values

– Reduce data size by discretization

– Supervised vs. unsupervised

– Split (top-down) vs. merge (bottom-up)

– Discretization can be performed recursively on an attribute

– Prepare for further analysis, e.g., classification
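One simple unsupervised way to divide a range into intervals is equal-width binning, sketched below. The slides do not prescribe a particular method, so equal-width binning, the function names, and the choice of returning interval indices as labels are all assumptions.

```python
def equal_width_intervals(values, k):
    """Split the range [min, max] of the values into k intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]


def discretize(values, k):
    """Replace each value by the index (0 to k-1) of the interval it falls into."""
    low, high = min(values), max(values)
    width = (high - low) / k
    if width == 0:
        # Assumption: a constant attribute goes entirely into interval 0.
        return [0 for _ in values]
    # The maximum value would land in interval k, so it is clamped to k - 1.
    return [min(int((v - low) / width), k - 1) for v in values]


if __name__ == "__main__":
    ages = [0, 5, 10, 15, 20, 30]  # hypothetical attribute values
    print(equal_width_intervals(ages, 3))
    print(discretize(ages, 3))
```

Applying the same function again to the values inside one interval gives a finer split, which is one way to read the recursive, top-down discretization mentioned above.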

Page 20: Data preprocess

Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of the problem.
