DM Lecture 2 (18-09-2012)


    Data Mining: Concepts and Techniques

    Slides for Textbook

    Chapters 2 & 3


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Why Data Preprocessing?

    Data in the real world is dirty:

    incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

    noisy: containing errors or outliers

    inconsistent: containing discrepancies in codes or names

    No quality data, no quality mining results!

    Quality decisions must be based on quality data

    A data warehouse needs consistent integration of quality data


    Multi-Dimensional Measure of Data Quality

    A well-accepted multidimensional view:

    Accuracy

    Completeness

    Consistency

    Timeliness

    Believability

    Value added

    Interpretability

    Accessibility


    Major Tasks in Data Preprocessing

    Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

    Data integration: integration of multiple databases, data cubes, or files

    Data transformation: normalization and aggregation

    Data reduction: obtains a reduced representation that is much smaller in volume yet produces the same or similar analytical results

    Data discretization: part of data reduction, of particular importance for numerical data


    Forms of data preprocessing


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Data Cleaning

    Data cleaning tasks

    Fill in missing values

    Identify outliers and smooth out noisy data

    Correct inconsistent data


    Missing Data

    Data is not always available

    E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

    Missing data may be due to:

    equipment malfunction

    inconsistent with other recorded data and thus deleted

    data not entered due to misunderstanding

    certain data may not be considered important at the time of entry

    history or changes of the data not registered

    Missing data may need to be inferred.


    How to Handle Missing Data?

    Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

    Fill in the missing value manually: tedious + infeasible?

    Use a global constant to fill in the missing value: e.g., "unknown", a new class?!

    Use the attribute mean to fill in the missing value

    Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter

    Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
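    The fill-in strategies above can be sketched in a few lines of pandas. This is a minimal illustration rather than anything from the textbook; the toy DataFrame, the attribute income, and the class column cls are made-up names.

    import pandas as pd

    # Toy data with missing income values (hypothetical example)
    df = pd.DataFrame({
        "cls":    ["A", "A", "B", "B", "B"],
        "income": [30.0, None, 50.0, None, 70.0],
    })

    # Global constant: replace every missing value with one sentinel value
    filled_const = df["income"].fillna(-1)

    # Attribute mean: replace missing values with the overall column mean
    filled_mean = df["income"].fillna(df["income"].mean())

    # Class-conditional mean: replace missing values with the mean of the same class
    filled_class_mean = df.groupby("cls")["income"].transform(
        lambda s: s.fillna(s.mean())
    )

    print(filled_const.tolist())
    print(filled_mean.tolist())
    print(filled_class_mean.tolist())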


    Noisy Data

    Noise: random error or variance in a measured variable

    Incorrect attribute values may be due to:

    faulty data collection instruments

    data entry problems

    data transmission problems

    technology limitations

    inconsistency in naming conventions

    Other data problems that require data cleaning:

    duplicate records

    incomplete data

    inconsistent data


    How to Handle Noisy Data?

    Binning method:

    first sort the data and partition it into (equi-depth) bins

    then smooth by bin means, bin medians, or bin boundaries, etc.

    Clustering:

    detect and remove outliers

    Combined computer and human inspection:

    detect suspicious values and have a human check them

    Regression:

    smooth by fitting the data to regression functions


    Simple Discretization Methods: Binning

    Equal-width (distance) partitioning:

    divides the range into N intervals of equal size (uniform grid)

    if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N

    the most straightforward approach, but outliers may dominate the presentation and skewed data is not handled well

    Equal-depth (frequency) partitioning:

    divides the range into N intervals, each containing approximately the same number of samples

    good data scaling

    Managing categorical attributes can be tricky.
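    A minimal Python sketch of both partitioning schemes, assuming pandas is available; the price values are arbitrary and only illustrate the difference between cut (equal-width) and qcut (equal-depth).

    import pandas as pd

    # Hypothetical attribute values
    prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-width partitioning: N intervals of width W = (B - A) / N
    equal_width = pd.cut(prices, bins=3)

    # Equal-depth (frequency) partitioning: each interval holds about the same number of samples
    equal_depth = pd.qcut(prices, q=3)

    print(equal_width.value_counts().sort_index())
    print(equal_depth.value_counts().sort_index())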


    Binning Methods for Data Smoothing

    * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

    * Partition into (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34

    * Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29

    * Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
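    The worked example above can be reproduced with a short NumPy sketch (my own illustration, not code from the slides); ties between the two boundaries are broken toward the lower boundary here, which the example data never exercises.

    import numpy as np

    # Sorted prices from the example above, split into 3 equi-depth bins of 4 values
    prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = prices.reshape(3, 4)

    # Smoothing by bin means: every value becomes its bin's (rounded) mean
    by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4).reshape(3, 4)

    # Smoothing by bin boundaries: every value snaps to the nearer of its bin's min/max
    lo = bins.min(axis=1, keepdims=True)
    hi = bins.max(axis=1, keepdims=True)
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

    print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
    print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]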


    Cluster Analysis


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Data Integration

    Data integration: combines data from multiple sources into a coherent store

    Schema integration:

    integrate metadata from different sources

    Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

    Detecting and resolving data value conflicts:

    for the same real-world entity, attribute values from different sources are different

    possible reasons: different representations, different scales, e.g., metric vs. British units


    Handling Redundant Data in Data Integration

    Redundant data often occur when integrating multiple databases:

    The same attribute may have different names in different databases

    One attribute may be a derived attribute in another table, e.g., annual revenue

    Redundant data may be detected by correlation analysis

    Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
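    A small sketch of the correlation check mentioned above, assuming NumPy; the two attribute names and their values are invented, with annual_revenue roughly 12 times monthly_revenue so the pair looks redundant.

    import numpy as np

    # Hypothetical columns from two integrated sources
    monthly_revenue = np.array([10.0, 12.0, 9.5, 15.0, 11.0])
    annual_revenue  = np.array([120.0, 145.0, 113.0, 181.0, 131.0])

    # Pearson correlation coefficient between the two attributes
    r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]

    # A coefficient close to +1 or -1 suggests one attribute is derivable from the other
    print(f"correlation = {r:.3f}")
    if abs(r) > 0.9:
        print("attributes look redundant; one could be dropped")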


    Data Transformation

    Smoothing: remove noise from data

    Aggregation: summarization, data cube construction

    Generalization: concept hierarchy climbing

    Normalization: scale values to fall within a small, specified range

    min-max normalization

    z-score normalization

    normalization by decimal scaling

    Attribute/feature construction:

    New attributes constructed from the given ones


    Data Transformation: Normalization

    min-max normalization:

    v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

    z-score normalization:

    v' = \frac{v - mean_A}{stand\_dev_A}

    normalization by decimal scaling:

    v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
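    The three normalizations can be sketched with NumPy as follows; this is my own illustration, with an arbitrary value array and an assumed target range [new_min, new_max] = [0, 1] for the min-max case.

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # min-max normalization into the new range [new_min, new_max]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization: subtract the mean, divide by the standard deviation
    v_zscore = (v - v.mean()) / v.std()

    # decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    v_decimal = v / (10 ** j)

    print(v_minmax)
    print(v_zscore)
    print(v_decimal)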


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Data Reduction Strategies

    A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set

    Data reduction:

    Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

    Data reduction strategies:

    Data cube aggregation

    Dimensionality reduction

    Numerosity reduction

    Discretization and concept hierarchy generation


    Example of Decision Tree Induction

    Initial attribute set: {A1, A2, A3, A4, A5, A6}

    [Decision tree: the root splits on A4; its branches split on A1 and A6, and each of those leads to leaves labeled Class 1 and Class 2]

    ==> Reduced attribute set: {A1, A4, A6}
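    A rough Python sketch of the idea, not the slides' procedure: induce a small decision tree over all attributes (here with scikit-learn on synthetic data I made up, where only A1, A4, and A6 carry signal) and keep only the attributes that actually appear in the splits.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Hypothetical data: 6 attributes A1..A6, class depends only on A1, A4, A6
    X = rng.random((200, 6))
    y = ((X[:, 3] > 0.5) & (X[:, 0] > 0.3) | (X[:, 5] > 0.8)).astype(int)

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Attributes used in at least one split form the reduced attribute set
    used = {f"A{i + 1}" for i in tree.tree_.feature if i >= 0}
    print("Reduced attribute set:", sorted(used))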


    Data Compression

    String compression:

    There are extensive theories and well-tuned algorithms

    Typically lossless

    But only limited manipulation is possible without expansion

    Audio/video compression:

    Typically lossy compression, with progressive refinement

    Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

    Time sequence is not audio:

    Typically short and vary slowly with time


    Data Compression

    [Figure: original data mapped to compressed data and recovered exactly (lossless), versus original data recovered only approximately]


    Histograms

    A popular data reduction technique

    Divide data into buckets and store the average (or sum) for each bucket

    Can be constructed optimally in one dimension using dynamic programming

    Related to quantization problems

    [Figure: example histogram showing counts per bucket]
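    A minimal equal-width histogram in NumPy, using made-up price values: only the per-bucket counts and sums are stored, which is the reduced representation this slide describes.

    import numpy as np

    # Hypothetical values to be reduced into a histogram
    prices = np.array([1200, 3400, 5600, 7100, 2300, 8800, 9900, 4100, 6600, 300])

    # Equal-width buckets over the data range; keep only counts (or sums) per bucket
    counts, edges = np.histogram(prices, bins=5)
    sums, _ = np.histogram(prices, bins=edges, weights=prices)

    for lo, hi, c, s in zip(edges[:-1], edges[1:], counts, sums):
        print(f"[{lo:7.1f}, {hi:7.1f})  count={c}  sum={s:.0f}")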


    Clustering

    Partition the data set into clusters, and store only the cluster representation

    Can be very effective if the data is clustered, but not if the data is smeared

    Can use hierarchical clustering and store the result in multi-dimensional index tree structures

    There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
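    One way to sketch "store only the cluster representation" is with k-means (my own example using scikit-learn on synthetic, well-clustered 2-D points): the reduced representation is a few centroids plus one small label per point instead of all coordinates.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)

    # Hypothetical well-clustered 2-D data around three centres
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

    km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

    # Reduced representation: 3 centroids plus one label per point
    print("centroids:\n", km.cluster_centers_)
    print("labels shape:", km.labels_.shape)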


    Sampling

    Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

    Choose a representative subset of the data:

    Simple random sampling may have very poor performance in the presence of skew

    Develop adaptive sampling methods

    Stratified sampling:

    Approximate the percentage of each class (or subpopulation of interest) in the overall database

    Used in conjunction with skewed data

    Sampling may not reduce database I/Os (page at a time).
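    A short pandas sketch contrasting simple random sampling with stratified sampling on skewed data; the DataFrame, the class column cls, and the 5% "rare" class are assumptions made for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)

    # Hypothetical skewed data: class "rare" is only about 5% of the rows
    df = pd.DataFrame({
        "cls": np.where(rng.random(1000) < 0.05, "rare", "common"),
        "x": rng.random(1000),
    })

    # Simple random sample: may badly under-represent the rare class
    srs = df.sample(frac=0.1, random_state=2)

    # Stratified sample: take ~10% of each class, preserving class percentages
    strat = df.groupby("cls").sample(frac=0.1, random_state=2)

    print(df["cls"].value_counts(normalize=True))
    print(srs["cls"].value_counts(normalize=True))
    print(strat["cls"].value_counts(normalize=True))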




    Sampling

    [Figure: raw data alongside a cluster/stratified sample]


    Hierarchical Reduction

    Use multi-resolution structures with different degrees of reduction

    Hierarchical clustering is often performed but tends to define partitions of data sets rather than clusters

    Parametric methods are usually not amenable to hierarchical representation

    Hierarchical aggregation:

    An index tree hierarchically divides a data set into partitions by the value range of some attributes

    Each partition can be considered as a bucket

    Thus an index tree with aggregates stored at each node is a hierarchical histogram


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Discretization

    Three types of attributes:

    Nominal: values from an unordered set

    Ordinal: values from an ordered set

    Continuous: real numbers

    Discretization:

    divide the range of a continuous attribute into intervals

    Some classification algorithms only accept categorical attributes

    Reduce data size by discretization

    Prepare for further analysis


    Discretization and Concept Hierarchy

    Discretization:

    reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

    Concept hierarchies:

    reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
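    The age example can be sketched with pandas: continuous values are discretized into intervals and then replaced by the higher-level concept labels from this slide. The specific ages and the interval boundaries (30 and 60) are my own assumptions.

    import pandas as pd

    # Hypothetical ages; replace numeric values with higher-level concept labels
    ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

    age_concepts = pd.cut(
        ages,
        bins=[0, 30, 60, 120],                      # assumed interval boundaries
        labels=["young", "middle-aged", "senior"],  # concepts named on the slide
    )
    print(age_concepts.tolist())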


    Specification of a set of attributes

    A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

    country: 15 distinct values

    province_or_state: 65 distinct values

    city: 3,567 distinct values

    street: 674,339 distinct values
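    The heuristic on this slide is just a sort by distinct-value count; a minimal Python sketch using the counts listed above (the dictionary and printing format are mine).

    # Distinct-value counts for the location attribute set from the slide
    distinct_counts = {
        "country": 15,
        "province_or_state": 65,
        "city": 3567,
        "street": 674339,
    }

    # Fewest distinct values at the top of the hierarchy, most at the bottom
    hierarchy = sorted(distinct_counts, key=distinct_counts.get)
    print(" < ".join(hierarchy))   # country < province_or_state < city < street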


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary