CLASSIFICATION
Business Intelligence 3. Data Mining
Classification
Classification assigns each given example to a class based on the values of the example's condition (predictor) variables.
Two-step process
1. Model construction using past (labeled) examples
2. Model usage on future (unseen) examples
Accuracy of the model
◦ Percentage of test set samples that are correctly classified by the model
◦ Test set is independent of training set
◦ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Step 1: Model Construction
Training data:
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
A classification algorithm builds a classifier (model) from the training data, e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Step 2: Using the Model in Classification
The classifier is then applied to testing data:
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen data: (Jeff, Professor, 4) → Tenured?
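To make the two steps concrete, here is a minimal Python sketch (my illustration, not part of the slides): Step 1's learned rule is hard-coded, Step 2 applies it to the testing data above and measures accuracy, and finally the unseen tuple is classified.

```python
# Step 1's model, as learned from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: apply the classifier to the independent test set and measure accuracy.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy = {correct}/{len(test_set)} = {correct / len(test_set):.0%}")  # 3/4 = 75%

# If the accuracy is acceptable, classify a tuple whose class label is unknown:
print("Jeff:", classify("Professor", 4))   # -> 'yes'
```

Note that Merlisa is misclassified (the rule predicts 'yes' for 7 years), which is exactly what the accuracy figure captures.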
Decision Tree
age income student credit buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Training set for the buys_computer example
Algorithm for Decision Tree Construction
Basic algorithm (a greedy algorithm)
◦ Tree is constructed in a top-down recursive manner
◦ Examples are partitioned recursively based on the most distinguishing attribute (e.g., the attribute with the highest information gain)
◦ Partitioning repeats until there are no more training tuples left to distinguish (all samples at a given node belong to the same class)
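As an illustration of the greedy attribute choice (my sketch, not from the slides), the code below computes the information gain of each attribute on the buys_computer training set above and picks the best one for the root node:

```python
from collections import Counter
from math import log2

# buys_computer training set from the slide: (age, income, student, credit, class)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attributes = ["age", "income", "student", "credit"]

def entropy(rows):
    """Entropy of the class label over a set of rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    """Entropy reduction obtained by splitting on the given attribute."""
    remainder = 0.0
    for v in {r[attr_index] for r in rows}:
        subset = [r for r in rows if r[attr_index] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

# Greedy choice of the most distinguishing attribute for the root node.
gains = {a: info_gain(data, i) for i, a in enumerate(attributes)}
print(gains)                       # age has the highest gain (~0.246 bits)
print(max(gains, key=gains.get))   # -> 'age'
```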
Enhancements to Decision Tree Construction
Tree pruning
◦ When a tree has too many branches, some may reflect anomalies due to noise or outliers
◦ Such trees may show poor accuracy on unseen samples
◦ The tree therefore needs to be pruned appropriately
Attribute discretization
◦ Continuous-valued attributes must be discretized in advance
Missing-value fill-in
◦ Assign the most common value of the attribute, or
◦ Assign a probability to each of the possible values
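For the two missing-value strategies just listed, a small sketch (the attribute values below are made up for illustration, not taken from the slides):

```python
import random
from collections import Counter

# Hypothetical attribute values with a missing entry (None).
income = ["high", "low", None, "medium", "low"]

observed = [v for v in income if v is not None]
counts = Counter(observed)

# Strategy 1: fill in the most common value of the attribute.
most_common = counts.most_common(1)[0][0]
filled_mode = [v if v is not None else most_common for v in income]

# Strategy 2: assign each possible value with probability proportional to its frequency.
values, weights = zip(*counts.items())
filled_prob = [v if v is not None else random.choices(values, weights=weights)[0]
               for v in income]

print(filled_mode)
print(filled_prob)
```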
IF-THEN Rules Extracted from the Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each test along a path forms a conjunction, and the leaf holds the consequent (a class prediction)
Rules are mutually exclusive and exhaustive
Decision tree for buys_computer:
age?
  <=30   -> student?
              no  -> buys_computer = no
              yes -> buys_computer = yes
  31..40 -> buys_computer = yes
  >40    -> credit_rating?
              excellent -> buys_computer = no
              fair      -> buys_computer = yes
Example: rule extraction from the buys_computer decision tree (young = <=30, mid-age = 31..40, old = >40)
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
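A short sketch of the extraction itself (my illustration, not from the slides): the buys_computer tree above is encoded as a nested dict, and one rule is printed per root-to-leaf path.

```python
# The buys_computer tree, encoded as {attribute: {value: subtree-or-class-label}}.
tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path; the tests form the antecedent, the leaf the consequent."""
    if isinstance(node, str):                       # leaf: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        print(f"IF {antecedent} THEN buys_computer = {node}")
        return
    attribute, branches = next(iter(node.items()))  # internal node: one attribute test
    for value, child in branches.items():
        extract_rules(child, conditions + ((attribute, value),))

extract_rules(tree)
```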
PATTERN MINING
Business Intelligence 3. Data Mining
Pattern Mining
Find frequent patterns in a data set.
◦ Frequent pattern – a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Why find (hidden) frequent patterns in data?
◦ What products were often purchased together?— Beer and diapers?!
◦ What are the subsequent purchases after buying a PC?
◦ Can we automatically classify web documents?
Applications
◦ Basket analysis, cross-marketing, catalog design, sale campaign analysis, web log (click count and stream) analysis, etc.
Frequent Patterns and Association Rules
Transaction-id   Items bought
10   A, B, D
20   A, C, D
30   A, D, E
40   B, E, F
50   B, C, D, E, F
• Itemset X = {x1, …, xk}
• Find all the rules X → Y with minimum support and confidence
  ◦ support(X → Y) = P(X ∪ Y): probability that a basket contains both X and Y
  ◦ confidence(X → Y) = P(Y | X): probability that Y is contained when X is contained
• Example
  ◦ Given supmin = 50%, confmin = 50%
  ◦ Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
  ◦ Association rules: A → D (60%, 100%), D → A (60%, 75%)
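As a worked illustration (my sketch, not from the slides), the code below computes support and confidence over the five transactions above and reproduces the A → D and D → A figures:

```python
# Transactions from the slide (transaction-id -> items bought).
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """P(Y | X): support of X ∪ Y divided by support of X."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> 60%
print(confidence({"A"}, {"D"}))   # 1.0  -> 100%
print(confidence({"D"}, {"A"}))   # 0.75 -> 75%
```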
Apriori Algorithm for Mining Frequent Patterns
Apriori pruning principle:
◦ Which is more frequent, {beer, diaper, nuts} or {beer, diaper}?
◦ "If any itemset is not frequent, its supersets cannot be frequent."
Method:
◦ Initially, scan the DB once to get the frequent 1-itemsets
◦ Generate length-(k+1) candidate itemsets (only) from the length-k frequent itemsets
◦ Test the candidates against the DB to see if they are frequent
◦ Terminate when no frequent or candidate itemsets can be generated
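A compact sketch of this level-wise method (my own illustration, not code from the slides), using the transactions of the previous example with min_sup given as an absolute count of 3 (50% of 5 transactions, rounded up):

```python
from itertools import combinations

# Transactions from the frequent-pattern example.
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def apriori(transactions, min_sup):
    # Scan once for frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    level = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in level}

    k = 1
    while level:
        # Generate length-(k+1) candidates only from length-k frequent itemsets,
        # and prune any candidate that has an infrequent length-k subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in level})
        k += 1
    return frequent

result = apriori(transactions, min_sup=3)
for itemset, sup in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)   # expect {A}:3 {B}:3 {D}:4 {E}:3 {A,D}:3
```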
Apriori Algorithm: An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidates from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2

Why NOT {A, B, C}? Its subset {A, B} is not frequent, so by the Apriori pruning principle {A, B, C} cannot be frequent and is never generated as a candidate.
Apriori Algorithm: An Example (continued) – generating association rules

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

Rules from L3 = {B, C, E} (contained in 2 of the 4 transactions):
Rule        Sup (%)   Conf (%)
B → C, E    50        66.7
C → B, E    50        66.7
E → B, C    50        66.7
B, C → E    50        100
B, E → C    50        66.7
C, E → B    50        100

Rules from L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2:
Rule     Sup (%)   Conf (%)
A → C    50        100
C → A    50        66.7
B → C    50        66.7
C → B    50        66.7
B → E    75        100
E → B    75        100
C → E    50        66.7
E → C    50        66.7
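A short sketch (mine, not from the slides) of this rule-generation step: every non-empty proper subset X of a frequent itemset F yields a candidate rule X → (F - X), which is kept if its confidence meets the threshold.

```python
from itertools import combinations

# The four transactions of the example (Tid 10-40).
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def rules_from(frequent_itemset, min_conf):
    """Generate X -> (F - X) for every non-empty proper subset X of F."""
    f = set(frequent_itemset)
    for r in range(1, len(f)):
        for lhs in combinations(sorted(f), r):
            rhs = f - set(lhs)
            conf = support(f) / support(lhs)
            if conf >= min_conf:
                yield lhs, tuple(sorted(rhs)), support(f), conf

for lhs, rhs, sup, conf in rules_from({"B", "C", "E"}, min_conf=0.5):
    print(f"{','.join(lhs)} -> {','.join(rhs)}  sup={sup:.0%}  conf={conf:.1%}")
```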
Lift: Measure of Interest in Association
play basketball → eat cereal [40%, 66.7%] may be misleading
◦ In case the overall percentage of students eating cereal is 75% > 66.7%
◦ Compare P(eat cereal | play basketball) with P(eat cereal)
play basketball → not eat cereal [20%, 33.3%] is more accurate
Lift – a measure of interest. To be of interest, a rule must have lift > 1
Lift(C → A) = P(A | C) / P(A)
Support, confidence and lift must be considered at the same time.
Using the contingency table below (B = plays basketball, C = eats cereal):
lift(B → ¬C) = P(¬C | B) / P(¬C) = (1000/3000) / (1250/5000) ≈ 1.33
lift(B → C) = P(C | B) / P(C) = (2000/3000) / (3750/5000) ≈ 0.89
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
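A quick re-computation of those two lift values from the contingency table above (my sketch, not from the slides):

```python
# 2x2 contingency table from the slide (counts out of 5000 students).
basketball_cereal         = 2000
basketball_not_cereal     = 1000
not_basketball_cereal     = 1750
not_basketball_not_cereal = 250
total = 5000

p_cereal     = (basketball_cereal + not_basketball_cereal) / total          # 3750/5000
p_not_cereal = (basketball_not_cereal + not_basketball_not_cereal) / total  # 1250/5000
n_basketball = basketball_cereal + basketball_not_cereal                    # 3000
p_cereal_given_b     = basketball_cereal / n_basketball
p_not_cereal_given_b = basketball_not_cereal / n_basketball

print(round(p_cereal_given_b / p_cereal, 2))         # lift(B -> C)  = 0.89 (< 1: not interesting)
print(round(p_not_cereal_given_b / p_not_cereal, 2)) # lift(B -> ¬C) = 1.33 (> 1: interesting)
```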
Exercise
Consider the following transaction data. Assume min_sup = 60% and min_conf = 80%.
Find all the association rules with their supports and confidences (with lift > 1).
TID Items_bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
Multiple-Level Association Mining
In the real world, items often form hierarchies
Observation: Items at lower level are expected to have lower support
Uniform support: the same min_sup is used at every level
◦ e.g., Level 1 min_sup = 5% and Level 2 min_sup = 5%
Reduced support: a lower min_sup is used at lower levels
◦ e.g., Level 1 min_sup = 5% and Level 2 min_sup = 3%
Example hierarchy: Milk [support = 10%] (Level 1) with children 2% Milk [support = 6%] and Skim Milk [support = 4%] (Level 2)
◦ With uniform support, Skim Milk (4% < 5%) is not frequent; with reduced support (3% at Level 2), both kinds of milk are frequent
Multi-Level Association Mining
Redundancy filtering: some rules may be redundant due to "ancestor" rules
Example
1. milk → wheat bread [support = 8%, confidence = 70%]
2. 2% milk → wheat bread [support = 2%, confidence = 72%]
◦ The first rule is an ancestor of the second rule
◦ A rule is redundant if its confidence is close to that of its ancestor rule
Problems of Frequent Pattern Mining
Problems in Apriori
◦ Multiple scans of transaction database
◦ Huge number of candidates
Many algorithms have been developed to address these problems
◦ Algorithms that reduce DB scans
◦ Algorithms that shrink the candidate size (or sometimes generate no candidates at all)
◦ Algorithms that facilitate candidate counting
◦ Constraint-based mining
Monotonic Constraints
Monotonic constraint
◦ When an itemset S satisfies the constraint, so does any of its supersets SS.
◦ We call such a constraint monotonic (monotonically increasingly satisfied)
Examples
◦ C1: sum(S.profit) ≥ 15 is monotone
  If itemset S satisfies C1, so does every superset of S
◦ C2: sum(S.price) ≥ v is monotone
◦ C3: min(S.price) ≤ v is monotone
Anti-Monotonic Constraints
Anti-monotonic constraint
◦ When an itemset S violates the constraint, so does any of its supersets SS.
◦ We call such a constraint anti-monotonic (monotonically increasingly violated)
Examples
◦ C4: sum(S.price) ≤ v is anti-monotone:
  if S violates "sum(S.price) ≤ v", every superset SS also violates "sum(SS.price) ≤ v"
◦ C5: sum(S.price) ≥ v is not anti-monotone
◦ C6: sum(S.profit) ≤ 15 is anti-monotone
The Apriori Algorithm — Example (min_sup = 2)

Database D:
TID   Items
100   1, 3, 4
200   2, 3, 5
300   1, 2, 3, 5
400   2, 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
Scan D → C2 counts: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
L2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2

C3: {2,3,5}
Scan D → L3: {2,3,5}:2
Apriori + Anti-monotonic Constraint
The database D and the candidate-generation steps (C1, L1, C2, L2) are the same as in the previous example.
Anti-monotonic constraint: sum(S.price) < $100
◦ Given that item 5's price is > $120, every itemset containing item 5 violates the constraint; by anti-monotonicity it can be pruned as soon as it appears, so {5} and all candidates containing 5 (e.g., {1,5}, {2,5}, {3,5}, {2,3,5}) need not be counted.
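To show how such a constraint is used inside the mining loop, here is a sketch (my own; the prices of items 1-4 are invented for illustration, only "item 5 costs more than $120" comes from the slide) in which any candidate violating sum(S.price) < $100 is discarded before its support is ever counted:

```python
# Database D from the example; item prices are hypothetical except item 5's (> $120).
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
price = {1: 20, 2: 30, 3: 40, 4: 50, 5: 125}
MIN_SUP = 2
MAX_SUM = 100   # anti-monotonic constraint: sum(S.price) < $100

def satisfies_constraint(itemset):
    return sum(price[i] for i in itemset) < MAX_SUM

def constrained_apriori(transactions, min_sup):
    items = {i for t in transactions for i in t}
    # Anti-monotonicity: once an itemset violates the constraint, every superset does too,
    # so violating candidates are dropped before any support counting.
    level = [frozenset([i]) for i in sorted(items) if satisfies_constraint({i})]
    level = [s for s in level if sum(s <= t for t in transactions) >= min_sup]
    frequent = list(level)
    k = 1
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = [c for c in candidates if satisfies_constraint(c)]  # prune by constraint
        level = [c for c in candidates if sum(c <= t for t in transactions) >= min_sup]
        frequent += level
        k += 1
    return frequent

for s in constrained_apriori(transactions, MIN_SUP):
    print(set(s))   # item 5 never appears: {5} alone already violates sum(S.price) < $100
```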