
A dynamic-programming algorithm for hierarchical discretization of continuous attributes

Amit Goyal (15th April 2008)

Department of Computer Science

The University of British Columbia


Reference

Ching-Cheng Shen and Yen-Liang Chen. A dynamic-programming algorithm for hierarchical discretization of continuous attributes. European Journal of Operational Research 184 (2008) 636-651 (Elsevier).

Amit Goyal (UBC Computer Science)


Overview

Motivation
Background
Why Do We Need Discretization?
Related Work
DP Solution
Analysis
Conclusion


Motivation

Situation: The attrition rate for mobile phone customers is around 25-30% per year.

Task: Given customer information for the past N months, predict who is likely to attrite next month.

Also estimate customer value and the most cost-effective offer to make to this customer.

Customer Attributes:
Age
Gender
Location
Phone bills
Income
Occupation


Pattern Discovery

Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

Assume:
min_support = 30%
min_confidence = 80%

An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]

Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
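The support and confidence figures on this slide can be checked with a short script. This is only an illustrative sketch: the helper names `support` and `confidence` are made up here and do not come from the talk.

```python
# Transactions t1..t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Chicken", "Clothes", "Milk"}))       # 3 of 7 transactions
print(confidence({"Clothes"}, {"Milk", "Chicken"}))  # 3 of 3 transactions
```

Both rules clear the assumed thresholds (support above 30%, confidence above 80%).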


Issues with Numeric Attributes

The size of the discretized intervals affects support and confidence:
{Occupation = SE, (Income = $70,000)} → {Attrition = Yes}
{Occupation = SE, (60K ≤ Income ≤ 80K)} → {Attrition = Yes}
{Occupation = SE, (0K ≤ Income ≤ 1B)} → {Attrition = Yes}

If the intervals are too small, rules may not have enough support.

If the intervals are too large, rules may not have enough confidence.

Loss of information (how to minimize it?)

Potential solution: use all possible intervals. Too many rules!


Background

Discretization: reduces the number of values of a given continuous attribute by dividing the range of the attribute into intervals.

Concept hierarchies: reduce the data by collecting low-level concepts (such as numeric values for the attribute age) and replacing them with higher-level concepts (such as young, middle-aged, or senior).


Why Do We Need Discretization?

Data Warehousing and Mining:
  Data reduction
  Association Rule Mining
  Sequential Patterns Mining

Some machine learning algorithms, such as Bayesian approaches and decision trees

Granular Computing


Related Work

Manual
Equal-Width Partition
Equal-Depth Partition
Chi-Square Partition
Entropy Based Partition
Clustering


Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
  It divides the range into N intervals of equal size (a uniform grid).
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N.
  It is the most straightforward method.

Equal-depth (frequency) partitioning:
  It divides the range into N intervals, each containing approximately the same number of samples.
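The two binning schemes can be sketched as follows. The function names and the sample `ages` list are hypothetical, chosen only to illustrate W = (B-A)/N and the equal-frequency split.

```python
def equal_width_bins(values, n):
    """Return the n+1 boundaries that split [min, max] into n equal-width intervals."""
    a, b = min(values), max(values)
    w = (b - a) / n  # W = (B - A) / N
    return [a + k * w for k in range(n + 1)]

def equal_depth_bins(values, n):
    """Split the sorted values into n groups of roughly equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    groups, start = [], 0
    for k in range(n):
        # The first `extra` groups take one additional sample each.
        end = start + size + (1 if k < extra else 0)
        groups.append(ordered[start:end])
        start = end
    return groups

ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up sample data
print(equal_width_bins(ages, 3))  # [4.0, 14.0, 24.0, 34.0]
print(equal_depth_bins(ages, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

Equal-width boundaries depend only on the extremes, so outliers can leave most samples in one bin; equal-depth avoids that at the cost of uneven interval widths.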


Chi-Square Based Partitioning

χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

The larger the χ² value, the more likely the variables are related.

Merge: Find the best neighboring intervals and merge them to form larger intervals recursively.


Entropy Based Partition

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

$$E(S,T) = \frac{|S_1|}{|S|}\,\text{Ent}(S_1) + \frac{|S_2|}{|S|}\,\text{Ent}(S_2)$$

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

The process is recursively applied to the partitions obtained until some stopping criterion is met.
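A single step of this procedure (choosing one boundary T) might look like the sketch below. It assumes candidate boundaries are midpoints between consecutive distinct values, which the slide does not specify; `entropy` and `best_boundary` are illustrative names.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_boundary(samples):
    """samples: list of (value, label). Returns (T, E(S, T)) with minimal E(S, T)."""
    ordered = sorted(samples)
    best = None
    for k in range(1, len(ordered)):
        if ordered[k - 1][0] == ordered[k][0]:
            continue  # no boundary can separate equal values
        t = (ordered[k - 1][0] + ordered[k][0]) / 2  # assumed: midpoint candidates
        s1 = [lab for _, lab in ordered[:k]]
        s2 = [lab for _, lab in ordered[k:]]
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(ordered)
        if best is None or e < best[1]:
            best = (t, e)
    return best

# Made-up data: the classes separate cleanly between 30 and 50.
data = [(20, "no"), (25, "no"), (30, "no"), (50, "yes"), (60, "yes")]
boundary, split_entropy = best_boundary(data)
print(boundary)
```

Applying `best_boundary` again to each side of the chosen boundary gives the recursive scheme; the stopping criterion is left open here, as it is on the slide.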


Clustering

Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only.

Can be very effective if the data is clustered, but not if the data is “smeared”.

Clustering can be hierarchical, and clusters can be stored in multi-dimensional index tree structures.

There are many choices of clustering definitions and clustering algorithms.


Weaknesses

Existing methods seek a locally optimal solution instead of a globally optimal one.

They are subject to the constraint that each interval can only be partitioned into a fixed number of sub-intervals.

The constructed tree may be unbalanced.


Notations

val(i): value of the i-th data
num(i): number of occurrences of value val(i)
R: depth of the output tree
ub: upper bound on the number of subintervals spawned from an interval
lb: lower bound on the number of subintervals spawned from an interval


Example

R = 2, lb = 2, ub = 3


Problem Definition

Given parameters R, ub, and lb and input data val(1), val(2), …, val(n) and num(1), num(2), …, num(n), our goal is to build a minimum volume tree subject to the constraints that all leaf nodes must be at level R and that the branch degree of every internal node must be between lb and ub.


Distances and Volume

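Intra-distance of a node containing data from data i to data j:

$$\text{intradist}(i,j) = \sum_{x=i}^{j} \lvert \text{val}(x) - \text{mean}(i,j) \rvert \cdot \text{num}(x)$$

Inter-distance between two adjacent siblings, the first node containing data from i to u and the second node containing data from u+1 to j:

$$\begin{aligned}
\text{interdist}(i,u,j) &= (\text{val}(u+1) - \text{val}(u)) \cdot \text{totalnum}(i,j) \\
&= (\text{val}(u+1) - \text{val}(u)) \cdot (\text{totalnum}(i,u) + \text{totalnum}(u+1,j)) \\
&= \text{interdist}_L(i,u) + \text{interdist}_R(u+1,j)
\end{aligned}$$

Volume of a tree is the total intra-distance minus the total inter-distance in the tree.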
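As a worked illustration of these definitions, the sketch below computes the volume of a depth-1 tree (a root whose children are all leaves). The data, the 0-based indexing, and the helper `volume_depth1` are assumptions made for this example only.

```python
val = [1, 2, 10, 11]  # hypothetical sorted distinct values
num = [1, 1, 1, 1]    # number of occurrences of each value

def totalnum(i, j):
    return sum(num[i:j + 1])

def intradist(i, j):
    mean = sum(val[x] * num[x] for x in range(i, j + 1)) / totalnum(i, j)
    return sum(abs(val[x] - mean) * num[x] for x in range(i, j + 1))

def interdist(i, u, j):
    """Inter-distance between adjacent siblings [i, u] and [u+1, j]."""
    return (val[u + 1] - val[u]) * totalnum(i, j)

def volume_depth1(cuts):
    """Volume of a root with leaf children; cuts holds the last index of each child."""
    n = len(val)
    starts = [0] + [c + 1 for c in cuts[:-1]]
    # Intra-distance of the root plus intra-distance of every leaf ...
    total = intradist(0, n - 1) + sum(intradist(s, e) for s, e in zip(starts, cuts))
    # ... minus the inter-distance of every pair of adjacent siblings.
    for (s, e), nxt in zip(zip(starts, cuts), cuts[1:]):
        total -= interdist(s, e, nxt)
    return total

print(volume_depth1([1, 3]))  # children {1, 2} and {10, 11}: 18 + 1 + 1 - 32 = -12.0
```

Splitting at the large gap between 2 and 10 gives tight leaves and a large inter-distance, hence a much smaller volume than the split {1} | {2, 10, 11}.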


Theorem

The volume of a tree = the intra-distance of the root node + the volumes of all its sub-trees - the inter-distances among its children


Notations

T*(i,j,r): the minimum volume tree that contains data from data i to data j and has depth r

T(i,j,r,k): the minimum volume tree that contains data from data i to data j, has depth r, and whose root has k branches

D*(i,j,r): the volume of T*(i,j,r)

D(i,j,r,k): the volume of T(i,j,r,k)


Notations Cont.

$$QD(i,j,r,k) = \min\Big\{ \sum_{v=1}^{k} (\text{the volume of the } v\text{-th node}) - \sum_{v=1}^{k-1} (\text{the inter-distance between the } v\text{-th node and the } (v+1)\text{-th node}) \Big\}$$


Notations Cont.

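$$QD^M(i,j,r,k) = \min\Big\{ \sum_{v=1}^{k} (\text{the volume of the } v\text{-th node}) - \sum_{v=1}^{k-1} (\text{the inter-distance between the } v\text{-th node and the } (v+1)\text{-th node}) - \text{interdist}_R(i,u) \Big\}$$

Here the first of the k nodes contains data from i to u.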


Algorithm

$$QD^M(i,j,r,k) = \min_{i \le u < j} \{ D^*(i,u,r) + QD^M(u+1,j,r,k-1) - \text{interdist}_R(i,u) - \text{interdist}_L(i,u) \}$$


Algorithm Cont.

$$QD(i,j,r,k) = \min_{i \le u < j} \{ D^*(i,u,r) + QD^M(u+1,j,r,k-1) - \text{interdist}_L(i,u) \}$$


Algorithm Cont.

$$D(i,j,r,k) = \text{intradist}(i,j) + QD(i,j,r-1,k)$$


The complete DP algorithm

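$$QD^M(i,j,r,k) = \min_{i \le u < j} \{ D^*(i,u,r) + QD^M(u+1,j,r,k-1) - \text{interdist}_R(i,u) - \text{interdist}_L(i,u) \}$$

$$QD(i,j,r,k) = \min_{i \le u < j} \{ D^*(i,u,r) + QD^M(u+1,j,r,k-1) - \text{interdist}_L(i,u) \}$$

$$D(i,j,r,k) = \text{intradist}(i,j) + QD(i,j,r-1,k)$$

$$D^*(i,j,r) = \min_{lb \le k \le ub} \{ D(i,j,r,k) \}$$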
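The four recurrences can be turned into a small memoized program. The sketch below is not the authors' implementation and rests on several assumptions that the slides do not state: indices are 0-based; a depth-0 tree is a single leaf, so D*(i,j,0) = intradist(i,j); the base case QD^M(i,j,r,1) = D*(i,j,r) - interdist_R(i,j); interdist_L(i,u) is read as (val(u+1) - val(u)) * totalnum(i,u) and interdist_R(i,u) as (val(i) - val(i-1)) * totalnum(i,u). It returns only the minimum volume, not the tree.

```python
from functools import lru_cache

def min_volume(val, num, R, lb, ub):
    """Minimum tree volume for sorted distinct values val with counts num."""
    n = len(val)
    prefix = [0]
    for c in num:
        prefix.append(prefix[-1] + c)

    def totalnum(i, j):
        return prefix[j + 1] - prefix[i]

    def intradist(i, j):
        mean = sum(val[x] * num[x] for x in range(i, j + 1)) / totalnum(i, j)
        return sum(abs(val[x] - mean) * num[x] for x in range(i, j + 1))

    def inter_left(i, u):   # assumed reading: node [i, u] is the left sibling of a pair
        return (val[u + 1] - val[u]) * totalnum(i, u)

    def inter_right(i, u):  # assumed reading: node [i, u] is the right sibling of a pair
        return (val[i] - val[i - 1]) * totalnum(i, u)

    @lru_cache(maxsize=None)
    def d_star(i, j, r):
        if r == 0:  # assumed base case: a depth-0 tree is a single leaf
            return intradist(i, j)
        # D*(i,j,r) = min over lb <= k <= ub of intradist(i,j) + QD(i,j,r-1,k)
        return min((intradist(i, j) + qd(i, j, r - 1, k)
                    for k in range(lb, ub + 1)), default=float("inf"))

    @lru_cache(maxsize=None)
    def qd_m(i, j, r, k):
        if k == 1:  # assumed base case: the last child only has a left sibling
            return d_star(i, j, r) - inter_right(i, j)
        return min((d_star(i, u, r) + qd_m(u + 1, j, r, k - 1)
                    - inter_left(i, u) - inter_right(i, u)
                    for u in range(i, j)), default=float("inf"))

    @lru_cache(maxsize=None)
    def qd(i, j, r, k):
        if k == 1:  # a single child has no siblings
            return d_star(i, j, r)
        return min((d_star(i, u, r) + qd_m(u + 1, j, r, k - 1)
                    - inter_left(i, u)
                    for u in range(i, j)), default=float("inf"))

    return d_star(0, n - 1, R)

print(min_volume([1, 2, 10, 11], [1, 1, 1, 1], R=1, lb=2, ub=2))  # -12.0
```

Infeasible configurations (too few data points for the requested depth and branching) come out as infinity because the minimum over an empty set of splits defaults to `float("inf")`.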


Figure: Volume of trees constructed

Figure: Run times of different algorithms

Figure: Gain ratios of different algorithms (Monthly Household Income)

Figure: Gain ratios of different algorithms (Money Spent Monthly)


Conclusion

Global optima instead of local optima
Each interval is partitioned into the most appropriate number of subintervals
Trees are balanced
Time complexity is cubic, thus slightly slower


http://www.cs.ubc.ca/~goyal

Thank you!!!


Gain Ratio

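$$\text{entropy}(T) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

$$\text{purity}(S_1, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\,\text{entropy}(S_i)$$

The information gain due to a particular split of S into S_i, i = 1, 2, …, r:

$$\text{Gain}(S, \{S_1, S_2, \ldots, S_r\}) = \text{purity}(S) - \text{purity}(S_1, S_2, \ldots, S_r)$$

$$\text{IntrinsicInfo}(S,A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

$$\text{GainRatio}(S,A) = \frac{\text{Gain}(S,A)}{\text{IntrinsicInfo}(S,A)}$$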
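The gain ratio of a given split can be computed directly from these formulas. The sketch below takes purity(S) of the unsplit set to be entropy(S); the function name `gain_ratio` and the toy label lists are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(subsets):
    """subsets: list of label lists S_1..S_r that together form S (needs r >= 2)."""
    whole = [lab for s in subsets for lab in s]
    n = len(whole)
    purity_split = sum(len(s) / n * entropy(s) for s in subsets)
    gain = entropy(whole) - purity_split
    intrinsic = -sum(len(s) / n * log2(len(s) / n) for s in subsets)
    return gain / intrinsic

print(gain_ratio([["yes", "yes"], ["no", "no"]]))  # a perfect split
print(gain_ratio([["yes", "no"], ["yes", "no"]]))  # an uninformative split
```

Dividing by the intrinsic information penalizes splits into many small subsets, which plain information gain tends to favor.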


Chi-Square Test Example

                    Heads   Tails   Total
Observed            53      47      100
Expected            50      50      100
(O-E)^2             9       9
χ² = (O-E)^2/E      0.18    0.18    0.36

To see whether this result is statistically significant, the P-value (the probability of observing a deviation at least this large if the coin is fair) must be calculated or looked up in a chart. The P-value is found to be Prob(χ²_1 ≥ 0.36) = 0.5485. There is thus a probability of about 55% of seeing data that deviates at least this much from the expected results if indeed the coin is fair. Hence there is no evidence against the coin being fair.
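The numbers in this example can be reproduced with the standard library alone. The sketch uses the identity P(χ²_1 ≥ x) = erfc(√(x/2)), which holds for one degree of freedom only; the variable names are illustrative.

```python
from math import erfc, sqrt

observed = {"heads": 53, "tails": 47}
expected = {"heads": 50, "tails": 50}

# Sum of (O - E)^2 / E over both outcomes.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

# For one degree of freedom, P(chi2_1 >= x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(round(chi2, 2), round(p_value, 4))  # 0.36 0.5485
```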
