why preprocess the data? - burapha universitykomate/886464/[week... · 2011. 11. 22. · 4 data...

1 Preprocessing data Komate AMPHAWAN

Post on 25-Jan-2021

TRANSCRIPT

  • 1

    Preprocessing data

    Komate AMPHAWAN

  • 2

    Why Preprocess the Data?

  • 3

    Low-quality data will lead to low-quality mining results

    • How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results?

    • How can the data be preprocessed so as to improve the efficiency and ease of the mining process?

  • 4

    Data preprocessing techniques

    • Data cleaning can be applied to remove noise and correct inconsistencies in the data.

    • Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.

    • Data transformations, such as normalization, may be applied.

    • Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.

  • 5

    Why Preprocess the Data?

    • Real-world data that you wish to analyze by data mining techniques are often incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).

  • 6

    Descriptive Data Summarization [1]

    • For data preprocessing to be successful, it is essential to have an overall picture of your data.

    • Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outliers.

  • 7

    Descriptive Data Summarization [2]

    • Descriptive data summarization helps us learn about data characteristics regarding both the central tendency and the dispersion (spread) of the data.

    • Measures of central tendency include mean, median, mode, and midrange.

    • Measures of data dispersion include quartiles, interquartile range (IQR), and variance.

  • 8

    Descriptive Data Summarization [3]

    • Descriptive statistics are of great help in understanding the distribution of the data.

    • We also examine how these measures can be computed efficiently in large databases.

  • 9

    Measuring the Central Tendency [1]

    • The most effective numerical measure of the “center” of a set of data is the (arithmetic) mean.
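For reference, the standard definition of the arithmetic mean can be written out as follows. The symbols $x_1, \dots, x_N$ (the values) and $\bar{x}$ (the mean) are notation assumed here, since the slide itself does not show a formula.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}
```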

  • 10

    Measuring the Central Tendency [2]

    • A distributive measure is a function that can be computed for a given data set by (i) partitioning the data into smaller subsets, (ii) computing the measure for each subset, and then (iii) merging the results in order to arrive at the measure’s value for the original (entire) data set.

    • sum(), count(), min(), and max() are distributive measures.
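The three steps of a distributive measure can be sketched in a few lines of Python. This is only an illustration: the partitions and their values are made up, and the dictionary layout is one possible choice among many.

```python
# Step (i): hypothetical data, already split into three partitions
# (for example, three shards of a large table).
partitions = [[4, 8, 15], [16, 23], [42, 7, 1]]

# Step (ii): compute each measure on every partition separately.
partial = [
    {"sum": sum(p), "count": len(p), "min": min(p), "max": max(p)}
    for p in partitions
]

# Step (iii): merge the partial results into the value for the whole data set.
merged = {
    "sum": sum(r["sum"] for r in partial),
    "count": sum(r["count"] for r in partial),
    "min": min(r["min"] for r in partial),
    "max": max(r["max"] for r in partial),
}

# Reference: the same measures computed directly on the unpartitioned data.
all_values = [x for p in partitions for x in p]
direct = {
    "sum": sum(all_values),
    "count": len(all_values),
    "min": min(all_values),
    "max": max(all_values),
}

print(merged == direct)  # the merged result matches the direct computation
```

Because each partition is reduced to a few numbers before merging, the partitions can be processed independently, which is what makes these measures cheap to compute on large databases.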

  • 11

    Measuring the Central Tendency [3]

    • An algebraic measure is computed by applying an algebraic function to one or more distributive measures.

    • average (or mean()) is an algebraic measure because it can be computed by sum()/count().
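A minimal sketch of the mean as an algebraic measure: each partition contributes only its sum() and count(), and the mean is derived from the merged totals. The function names and the sample values are hypothetical.

```python
def partial_sum_count(values):
    """Return the two distributive measures needed for the mean."""
    return sum(values), len(values)


def merged_mean(partitions):
    """Combine per-partition (sum, count) pairs, then apply sum / count."""
    total, count = 0, 0
    for part in partitions:
        s, c = partial_sum_count(part)
        total += s
        count += c
    return total / count


# Hypothetical values split unevenly across three partitions.
partitions = [[30, 36, 47], [50, 52], [52, 56, 60, 63, 70, 70, 110]]

overall = merged_mean(partitions)

# Averaging the partition means instead gives a different (wrong) answer
# whenever the partitions have different sizes.
mean_of_means = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(overall, mean_of_means)
```

The design point is that the merge step works on (sum, count) pairs, not on per-partition means.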

  • 12

    Example

    • Weighted arithmetic mean (weighted average)

    Each value x_i in a set may be associated with a weight w_i, for i = 1, …, N.

    The weights reflect the significance, importance, or occurrence frequency attached to their respective values.
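The weighted arithmetic mean is the sum of each value times its weight, divided by the sum of the weights. A short sketch, with a hypothetical function name and made-up scores and weights:

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical example: three scores, the middle one counted twice as heavily.
scores = [70, 80, 90]
weights = [1, 2, 1]
result = weighted_mean(scores, weights)
print(result)  # 80.0
```

With all weights equal, this reduces to the ordinary arithmetic mean.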

  • 13

    Measuring the Central Tendency [4]

    • A holistic measure is computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset.

    • The median is an example of a holistic measure.
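A small counterexample helps show why the median is holistic: merging per-partition medians does not, in general, give the median of the whole data set. The partitions below are made up for illustration; `statistics.median` is from the Python standard library.

```python
from statistics import median

# Hypothetical data split into three partitions.
partitions = [[1, 2, 3], [4, 100, 200], [5, 6, 7]]

# Attempt to merge: take the median of each partition, then the median of those.
partition_medians = [median(p) for p in partitions]
median_of_medians = median(partition_medians)

# Correct value: the median of all the data taken as a whole.
true_median = median([x for p in partitions for x in p])

print(partition_medians, median_of_medians, true_median)
```

Here the merged value is 6 while the true median is 5, so the partition-and-merge shortcut that works for sum() or count() fails for the median.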

  • 14

  • 15

    Measuring the Central Tendency [5]

    • The mode is the value that occurs most frequently in the set.

    • It is possible for the greatest frequency to correspond to several different values, which results in more than one mode.

    • Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
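One way to find every mode, not just the first one, is to count occurrences and keep all values that reach the highest count. The helper name `modes` and the sample lists are hypothetical; `collections.Counter` is from the standard library.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the greatest frequency, sorted."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Names for data sets with one, two, or three modes.
MODALITY = {1: "unimodal", 2: "bimodal", 3: "trimodal"}

one_mode = modes([1, 2, 2, 3])
two_modes = modes([1, 1, 2, 2, 3])

print(one_mode, MODALITY[len(one_mode)])
print(two_modes, MODALITY[len(two_modes)])
```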

  • 16

    Measuring the Dispersion of Data [1]

    • The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are:

    the range

    the five-number summary (based on quartiles)

    the interquartile range

    the standard deviation

  • 17

    Measuring the Dispersion of Data [2]

    • Let x_1, x_2, …, x_N be a set of observations for some attribute.

    • The range of the set is the difference between the largest (max()) and smallest (min()) values.

  • 18

    Measuring the Dispersion of Data [3]

    • Quartiles

    The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile.

    • Interquartile range (IQR)

    The distance between the first and third quartiles: IQR = Q3 − Q1.
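The quartiles and the interquartile range (Q3 minus Q1) can be sketched with the standard library. Note that several conventions exist for interpolating percentiles; the `method="inclusive"` option of `statistics.quantiles` (Python 3.8+) is just one of them and is an assumption here, as are the sample values.

```python
from statistics import quantiles

# Hypothetical observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# The interquartile range is the spread of the middle half of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

A different percentile convention may give slightly different Q1 and Q3 on the same data, so the method should be stated whenever quartiles are reported.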

  • 19

    Measuring the Dispersion of Data [4]

    • The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order:

    {Minimum, Q1, Median, Q3, Maximum}
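A five-number summary could be assembled as below. The function name is hypothetical, the sample values are made up, and the quartiles use the `"inclusive"` interpolation of `statistics.quantiles`, which is one convention among several.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a non-empty sequence."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


# Hypothetical observations with one unusually large value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(data)
print(summary)
```

Comparing the maximum with Q3 in such a summary is a quick way to spot values that sit far from the bulk of the data.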

  • 20

    Variance and Standard Deviation

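    • The variance of N observations, x_1, x_2, …, x_N, is $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean of the observations.

    • The standard deviation, σ, of the observations is the square root of the variance, σ².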
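The sketch below computes the variance in its population form (dividing by N, which is an assumption about the intended formula) and takes its square root to get the standard deviation. It also shows an equivalent form built only from count, sum, and sum of squares, which is why the variance can be treated as an algebraic measure. Function names and data are hypothetical.

```python
import math


def population_variance(values):
    """Mean squared deviation from the mean, dividing by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_from_aggregates(n, total, total_sq):
    """Same quantity from three distributive measures: count, sum, sum of squares."""
    return total_sq / n - (total / n) ** 2


# Hypothetical observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = population_variance(data)
std = math.sqrt(var)
var_agg = variance_from_aggregates(len(data), sum(data), sum(x * x for x in data))

print(var, std, var_agg)
```

The aggregate form is convenient for partitioned data, though it can lose precision with floating-point values when the mean is large relative to the spread.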

  • 21

    Q & A