Data Mining, Spring 2007
• Noisy data
• Data Discretization using Entropy-Based and ChiMerge
Noisy Data
• Noise: random error; data present but not correct.
  – Data transmission errors
  – Data entry problems
• Removing noise:
  – Data smoothing (rounding, averaging within a window)
  – Clustering/merging and detecting outliers
• Data smoothing:
  – First sort the data and partition it into (equi-depth) bins.
  – Then smooth the values in each bin using smoothing by bin means, smoothing by bin medians, smoothing by bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
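As a concrete illustration, here is a minimal Python sketch of equi-depth binning and the two smoothing methods, run on the price data above; the function names are illustrative, not from the lecture.

```python
# A minimal sketch of equi-depth binning and smoothing; names are
# illustrative, not from the lecture.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into bins of equal depth (size)."""
    values = sorted(values)
    depth = len(values) // n_bins
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min/max boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```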
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.
• Values that fall outside of the set of clusters may be considered outliers.
Data Discretization
• Attribute (feature) discretization techniques map the values of a continuous feature into a small number of intervals, where each interval is mapped to a discrete symbol.
• Advantages:
  – Simplified data description and easy-to-understand data and final data-mining results.
  – Only a small number of interesting rules are mined.
  – End-result processing time decreases.
  – End-result accuracy improves.
Effect of Continuous Data on Results Accuracy
Test data (class label unknown):
age     income   age   buys_computer
<=30    medium    9    ?
<=30    medium   11    ?
<=30    medium   13    ?

Training data:
age     income   age   buys_computer
<=30    medium    9    no
<=30    medium   10    no
<=30    medium   11    no
<=30    medium   12    no
Data Mining
• If ‘age <= 30’ and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’
Discover only those rules whose support (frequency) is >= 1.
Because the value age = 13 never occurs in the training dataset, no rule covers the third test tuple; the accuracy of prediction therefore decreases to 66.7% (2 of 3 tuples predicted correctly).
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

• where Ent(S1) = −Σi pi log2(pi), and pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (and similarly for S2).
Entropy-Based Discretization
• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
• The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) − E(S, T) of the best split falls below a small threshold δ.
Example 1
ID:    1   2   3   4   5   6   7   8   9
Age:   21  22  24  25  27  27  27  35  41
Grade: F   F   P   F   P   P   P   P   P
• Let Grade be the class attribute. Use entropy-based discretization to divide the range of ages into different discrete intervals.
• There are 6 possible boundaries. They are 21.5, 23, 24.5, 26, 31, and 38.
• Let us consider the boundary at T = 21.5: S1 = {21}, S2 = {22, 24, 25, 27, 27, 27, 35, 41}.
• Each boundary is the midpoint of two adjacent distinct values, e.g., (21 + 22) / 2 = 21.5 and (22 + 24) / 2 = 23.
Example 1 (cont’)
• The numbers of elements in S1 and S2 are: |S1| = 1, |S2| = 8.
• The entropy of S1 is
  Ent(S1) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
          = −(1) log2(1) − (0) log2(0) = 0
• The entropy of S2 is
  Ent(S2) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
          = −(2/8) log2(2/8) − (6/8) log2(6/8) ≈ 0.811
  (taking 0 · log2(0) = 0 by convention)
Example 1 (cont’)
• Hence, the entropy after partitioning at T = 21.5 is
  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
          = (1/9) Ent(S1) + (8/9) Ent(S2)
          = (1/9)(0) + (8/9)(0.811) ≈ 0.721
Example 1 (cont’)
• The entropies after partitioning for all the boundaries are:
  T = 21.5: E(S, 21.5)
  T = 23:   E(S, 23)
  ...
  T = 38:   E(S, 38)
Select the boundary with the smallest entropy. Suppose the best is T = 23.
Now recursively apply entropy-based discretization to both partitions; a sketch of one step is below.
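For concreteness, here is a minimal Python sketch of a single entropy-based split on the Age/Grade data above; the helper names are illustrative, not from the lecture.

```python
# A minimal sketch of one entropy-based split on the Age/Grade data;
# helper names are illustrative, not from the lecture.
from math import log2

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ['F', 'F', 'P', 'F', 'P', 'P', 'P', 'P', 'P']

def entropy(labels):
    """Ent(S) = -sum_i p_i log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def split_entropy(values, labels, t):
    """E(S, T): size-weighted entropy after splitting at boundary t."""
    left  = [l for v, l in zip(values, labels) if v <= t]
    right = [l for v, l in zip(values, labels) if v > t]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Candidate boundaries: midpoints between distinct adjacent values.
boundaries = sorted({(a + b) / 2 for a, b in zip(ages, ages[1:]) if a != b})
print(boundaries)  # [21.5, 23.0, 24.5, 26.0, 31.0, 38.0]
for t in boundaries:
    print(t, round(split_entropy(ages, grades, t), 3))
# For this data the minimum is at T = 26 (E(S, 26) = 0.361); the slide's
# "suppose the best is T = 23" is only a supposition. The winning
# partitions would then be split recursively in the same way.
```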
ChiMerge (Kerber92)
• This discretization method uses a (bottom-up) merging approach.
• ChiMerge’s view:
  – First sort the data on the attribute being discretized.
  – List all possible boundaries or intervals. In the last example, the boundary points were 0, 21.5, 23, 24.5, 26, 31, and 38.
  – For each pair of adjacent intervals, compute the χ² test of class independence:
    • {0, 21.5} and {21.5, 23}
    • {21.5, 23} and {23, 24.5}
    • ...
  – Pick the pair of adjacent intervals with the lowest χ² value and merge them.
ChiMerge -- The Algorithm
1. Compute the χ² value for each pair of adjacent intervals.
2. Merge the pair of adjacent intervals with the lowest χ² value.
3. Repeat steps 1 and 2 until the χ² values of all adjacent pairs exceed a threshold.
Chi-Square Test
oij = observed frequency of interval i for class j
eij = expected frequency = (Ri × Cj) / N

χ² = Σ (over i = 1..r) Σ (over j = 1..c) (oij − eij)² / eij
        Class 1   Class 2   Σ
Int 1   o11       o12       R1
Int 2   o21       o22       R2
Σ       C1        C2        N
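A minimal Python sketch of this statistic follows; the function name is illustrative, and the 0.1 substitution for a zero expected frequency follows the worked example below.

```python
# A minimal sketch of the chi-square statistic for two adjacent
# intervals (rows) over the classes (columns).
def chi_square(table):
    """table[i][j] = oij, observed count of class j in interval i."""
    R = [sum(row) for row in table]        # interval (row) totals Ri
    C = [sum(col) for col in zip(*table)]  # class (column) totals Cj
    N = sum(R)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = R[i] * C[j] / N or 0.1     # eij; 0.1 avoids dividing by 0
            chi2 += (o - e) ** 2 / e
    return chi2

print(chi_square([[1, 0], [1, 0]]))  # 0.2    -> first merge below
print(chi_square([[2, 1], [2, 0]]))  # 0.8333 -> second iteration below
```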
ChiMerge Example
Data set sample (feature F, class K):

Sample:  1   2   3   4   5   6   7   8   9  10  11  12
F:       1   3   7   8   9  11  23  37  39  45  46  59
K:       1   2   1   1   1   2   2   1   2   1   1   1
• Interval points for feature F are: 0, 2, 5, 7.5, 8.5, 10, etc.
ChiMerge Example (cont.)
• χ² was minimal for the intervals [7.5, 8.5] and [8.5, 10]:

                       K = 1    K = 2
Interval [7.5, 8.5]    A11 = 1  A12 = 0  R1 = 1
Interval [8.5, 10]     A21 = 1  A22 = 0  R2 = 1
Σ                      C1 = 2   C2 = 0   N = 2
Based on the table’s values, we can calculate the expected values:
E11 = 2/2 = 1, E12 = 0/2 ≈ 0, E21 = 2/2 = 1, and E22 = 0/2 ≈ 0
(zero expected frequencies are replaced by the small constant 0.1 to avoid division by zero)
and the corresponding χ² test:
χ² = (1 − 1)² / 1 + (0 − 0.1)² / 0.1 + (1 − 1)² / 1 + (0 − 0.1)² / 0.1 = 0.2
For degree of freedom d = 1, χ² = 0.2 < 2.706 (MERGE!)
ChiMerge Example (cont.)
Additional iterations:
                      K = 1    K = 2
Interval [0, 7.5]     A11 = 2  A12 = 1  R1 = 3
Interval [7.5, 10]    A21 = 2  A22 = 0  R2 = 2
Σ                     C1 = 4   C2 = 1   N = 5
E11 = 12/5 = 2.4, E12 = 3/5 = 0.6, E21 = 8/5 = 1.6, and E22 = 2/5 = 0.4

χ² = (2 − 2.4)² / 2.4 + (1 − 0.6)² / 0.6 + (2 − 1.6)² / 1.6 + (0 − 0.4)² / 0.4 ≈ 0.833

For degree of freedom d = 1, χ² ≈ 0.833 < 2.706 (MERGE!)
ChiMerge Example (cont.)
                        K = 1    K = 2
Interval [0, 10.0]      A11 = 4  A12 = 1  R1 = 5
Interval [10.0, 42.0]   A21 = 1  A22 = 3  R2 = 4
Σ                       C1 = 5   C2 = 4   N = 9
E11 = 2.78, E12 = 2.22, E21 = 2.22, and E22 = 1.78

and χ² = 2.72 > 2.706 (NO MERGE!)
Final discretization: [0, 10], [10, 42], and [42, 60]
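Putting it together, here is a minimal Python sketch of the whole ChiMerge loop on the F/K sample above, assuming the same 2.706 threshold (the chi-square critical value for d = 1 at the 90% significance level); all names are illustrative.

```python
# A minimal sketch of the full ChiMerge loop on the F/K sample above.
def chi_square(table):                     # same statistic as above
    R = [sum(row) for row in table]
    C = [sum(col) for col in zip(*table)]
    N = sum(R)
    return sum((o - e) ** 2 / e
               for i, row in enumerate(table)
               for j, o in enumerate(row)
               for e in [R[i] * C[j] / N or 0.1])

data = [(1, 1), (3, 2), (7, 1), (8, 1), (9, 1), (11, 2),
        (23, 2), (37, 1), (39, 2), (45, 1), (46, 1), (59, 1)]
THRESHOLD = 2.706

# Start with one interval per value: [low, high, counts for K=1, K=2].
intervals = [[f, f, [int(k == 1), int(k == 2)]] for f, k in data]

while len(intervals) > 1:
    chis = [chi_square([a[2], b[2]])       # chi2 of each adjacent pair
            for a, b in zip(intervals, intervals[1:])]
    i = min(range(len(chis)), key=chis.__getitem__)
    if chis[i] >= THRESHOLD:               # every pair exceeds it: stop
        break
    lo, _, c1 = intervals[i]               # merge the lowest-chi2 pair
    _, hi, c2 = intervals[i + 1]
    intervals[i:i + 2] = [[lo, hi, [x + y for x, y in zip(c1, c2)]]]

print([(lo, hi) for lo, hi, _ in intervals])
# -> [(1, 9), (11, 39), (45, 59)], i.e. the groups behind the intervals
#    [0, 10], [10, 42], [42, 60] once cut points are placed between them.
```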
References
– R. Kerber, “ChiMerge: Discretization of Numeric Attributes”, Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), 1992.
– Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2000 (Chapter 3).
– Mehmed Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms”, John Wiley & Sons, 2003 (Chapter 3).