
  • Data Preprocessing in Data Warehouse

    Group Members: Kaavya Johri – 090101088, Palash Gaur – 090101122, Roshan P Babu – 090101143, Sakshi Kulbhaskar – 090101145

  • Today's real-world databases are highly susceptible to incomplete (lacking attribute values or certain attributes of interest), noisy (containing errors), missing, and inconsistent data because of their huge size and their origin from multiple, heterogeneous sources; in other words, low-quality data will lead to low-quality mining. So, to improve the efficiency and ease of the mining process, data is preprocessed. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analysed can lead to huge payoffs for decision making.

    Introduction

  • Data have quality if they satisfy the requirements of the intended use. Many factors comprise data quality, including accuracy, completeness, consistency, timeliness (completion with respect to time), believability (how much the data are trusted by users), and interpretability (how easily the data are understood).

    Data Preprocessing

  • Data cleaning: works to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
    Data integration: integration of multiple databases, data cubes, or files, i.e. combining data from multiple sources.
    Data reduction: obtains a reduced representation of the data set that is smaller in volume, yet produces the same analytical results.
    Data transformation: data is transformed into forms appropriate for mining.

    Major Tasks in Data Preprocessing

  • Forms of Data Preprocessing

  • Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies; as we know, dirty data can cause confusion for the mining procedure.

    Data Cleaning

  • Data is not always available; e.g., many tuples have no recorded value for several attributes, such as employee income in sales data. Missing data may be due to:
    equipment malfunction
    data that is inconsistent with other recorded data and thus deleted
    data not entered due to misunderstanding
    certain data not being considered important at the time of entry
    failure to register history or changes of the data

    Missing Data/values

  • Ignore the tuple: usually done when the class label is missing (assuming the mining task involves classification); not effective when the percentage of missing values per attribute varies considerably, unless the tuple contains several attributes with missing values.
    Fill in the missing value manually: time-consuming and may not be feasible.
    Use a global constant to fill in the missing value: replace all missing attribute values by the same constant, such as "unknown".
    Use the attribute mean or median to fill in the missing value.
    Use the most probable value to fill in the missing value: inference-based methods such as the Bayesian formula or decision tree induction (a sketch of some of these options appears after this list).

    Data Cleaning Methods
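
    A minimal sketch (not from the slides) of the drop, constant-fill, and mean-fill options above, assuming a small pandas DataFrame with a hypothetical income column:

    import pandas as pd

    # Hypothetical sales table with a missing employee income value.
    df = pd.DataFrame({
        "employee": ["A", "B", "C", "D"],
        "income": [52000, None, 48000, 61000],
    })

    # Option 1: ignore (drop) tuples whose income is missing.
    dropped = df.dropna(subset=["income"])

    # Option 2: fill with a global constant; the slide's "unknown" suits a
    # categorical attribute, so a numeric sentinel is used here instead.
    constant_filled = df.fillna({"income": -1})

    # Option 3: fill with the attribute mean (the median works the same way).
    mean_filled = df.fillna({"income": df["income"].mean()})

    print(mean_filled)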

  • Noise is a random error or variance in a measured variable. Incorrect attribute values may be due to:
    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitations
    inconsistency in naming conventions

    To overcome this, we have to smooth the data to remove noise.

    Noisy Data

  • Binning: this method does local smoothing; first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc.
    Clustering: detects and removes outliers; similar values are organised into clusters, and values that fall outside them are outliers.
    Regression: smooth by fitting the data to regression functions, i.e. conform data values to a function; linear regression involves finding the best line to fit two attributes so that one can be used to predict the other (see the sketch after this list).

    Smoothing Techniques
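
    As a rough illustration of the regression technique above (the numbers below are made up, not from the slides), a least-squares line fitted with NumPy can conform one noisy attribute to values predicted from another:

    import numpy as np

    # Two hypothetical numeric attributes; y is noisy.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

    # Fit y ~ slope * x + intercept and replace y by the fitted values.
    slope, intercept = np.polyfit(x, y, deg=1)
    y_smoothed = slope * x + intercept

    print(np.round(y_smoothed, 2))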

  • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
    Partition into (equi-depth) bins:
    Bin 1: 4, 8, 9, 15
    Bin 2: 21, 21, 24, 25
    Bin 3: 26, 28, 29, 34
    Smoothing by bin means:
    Bin 1: 9, 9, 9, 9
    Bin 2: 23, 23, 23, 23
    Bin 3: 29, 29, 29, 29
    Smoothing by bin boundaries:
    Bin 1: 4, 4, 4, 15
    Bin 2: 21, 21, 25, 25
    Bin 3: 26, 26, 26, 34

    Example of Binning Method
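
    The example above can be reproduced with a short Python sketch; the bin depth of 4 and the rounding of bin means follow the slide's numbers:

    # Prices from the slide, already sorted.
    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    depth = 4  # equi-depth bins of four values each
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value becomes its bin's (rounded) mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value moves to the nearer boundary.
    by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]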

  • Smoothing: remove noise from data.
    Aggregation: summarization, data cube construction.
    Generalization: concept hierarchy climbing.
    Normalization: values scaled to fall within a small, specified range; methods include min-max normalization, z-score normalization, and normalization by decimal scaling.

    Data Transformation

  • Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    Z-score normalization: v' = (v - mean_A) / std_dev_A
    Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

    Data Transformation: Normalization
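
    A minimal NumPy sketch of the three formulas above, using a made-up attribute vector; the target range [0, 1] for min-max normalization is an assumption:

    import numpy as np

    values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute A

    # Min-max normalization onto the new range [new_min, new_max].
    new_min, new_max = 0.0, 1.0
    min_max = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min

    # z-score normalization: (v - mean) / standard deviation.
    z_score = (values - values.mean()) / values.std()

    # Decimal scaling: divide by 10^j, with j the smallest integer such that
    # max(|v'|) < 1 (here j = 4, since the largest absolute value is 1000).
    j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
    decimal_scaled = values / 10 ** j

    print(min_max, z_score, decimal_scaled, sep="\n")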

  • Thank You