CLASSIFICATION
Business Intelligence 3. Data Mining
Classification
Classification assigns each given example to a class based on the values of the example's condition (predictor) variables.
Two-step process
1. Model construction using past (labeled) examples
2. Model usage on future (unseen) examples
Accuracy of the model
◦ Percentage of test set samples that are correctly classified by the model
◦ Test set is independent of training set
◦ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Step 1: Model Construction
Training data:
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
A classification algorithm builds a classifier (model) from the training data, e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Step 2: Using the Model in Classification
The classifier is then applied to testing data:
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen data: (Jeff, Professor, 4) → Tenured?
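To make the two steps concrete, here is a minimal Python sketch (my illustration, not part of the slides): Step 1's learned rule is hard-coded, Step 2 applies it to the testing data above and measures accuracy, and finally the unseen tuple is classified.

```python
# Step 1's model, as learned from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: apply the classifier to the independent test set and measure accuracy.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy = {correct}/{len(test_set)} = {correct / len(test_set):.0%}")  # 3/4 = 75%

# If the accuracy is acceptable, classify a tuple whose class label is unknown:
print("Jeff:", classify("Professor", 4))   # -> 'yes'
```

Note that Merlisa is misclassified (the rule predicts 'yes' for 7 years), which is exactly what the accuracy figure captures.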
Decision Tree
age income student credit buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Training set for the buys_computer example
Algorithm for Decision Tree Construction
Basic algorithm (a greedy algorithm)
◦ Tree is constructed in a top-down recursive manner
◦ Examples are partitioned recursively based on the most distinguishing attribute (e.g., the attribute with the highest information gain)
◦ Partitioning repeats until there are no more training tuples left to distinguish (all samples at a given node belong to the same class)
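As an illustration of the greedy attribute choice (my sketch, not from the slides), the code below computes the information gain of each attribute on the buys_computer training set above and picks the best one for the root node:

```python
from collections import Counter
from math import log2

# buys_computer training set from the slide: (age, income, student, credit, class)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attributes = ["age", "income", "student", "credit"]

def entropy(rows):
    """Entropy of the class label over a set of rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    """Entropy reduction obtained by splitting on the given attribute."""
    remainder = 0.0
    for v in {r[attr_index] for r in rows}:
        subset = [r for r in rows if r[attr_index] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

# Greedy choice of the most distinguishing attribute for the root node.
gains = {a: info_gain(data, i) for i, a in enumerate(attributes)}
print(gains)                       # age has the highest gain (~0.246 bits)
print(max(gains, key=gains.get))   # -> 'age'
```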
Enhancements to Decision Tree Construction
Tree pruning
◦ When a tree has too many branches, some may reflect anomalies due to noise or outliers
◦ Such trees may show poor accuracy on unseen samples
◦ The tree therefore needs to be pruned appropriately
Attribute discretization
◦ Continuous-valued attributes must be discretized in advance
Missing-value fill-in
◦ Assign the most common value of the attribute, or
◦ Assign a probability to each of the possible values
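For the two missing-value strategies just listed, a small sketch (the attribute values below are made up for illustration, not taken from the slides):

```python
import random
from collections import Counter

# Hypothetical attribute values with a missing entry (None).
income = ["high", "low", None, "medium", "low"]

observed = [v for v in income if v is not None]
counts = Counter(observed)

# Strategy 1: fill in the most common value of the attribute.
most_common = counts.most_common(1)[0][0]
filled_mode = [v if v is not None else most_common for v in income]

# Strategy 2: assign each possible value with probability proportional to its frequency.
values, weights = zip(*counts.items())
filled_prob = [v if v is not None else random.choices(values, weights=weights)[0]
               for v in income]

print(filled_mode)
print(filled_prob)
```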
IF-THEN Rules Extracted from the Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each test along a path forms a conjunction, and the leaf holds the consequent (a class prediction)
Rules are mutually exclusive and exhaustive
Decision tree for buys_computer:
age?
  <=30   -> student?
              no  -> buys_computer = no
              yes -> buys_computer = yes
  31..40 -> buys_computer = yes
  >40    -> credit_rating?
              excellent -> buys_computer = no
              fair      -> buys_computer = yes
Example: rule extraction from the buys_computer decision tree (young = <=30, mid-age = 31..40, old = >40)
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
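A short sketch of the extraction itself (my illustration, not from the slides): the buys_computer tree above is encoded as a nested dict, and one rule is printed per root-to-leaf path.

```python
# The buys_computer tree, encoded as {attribute: {value: subtree-or-class-label}}.
tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path; the tests form the antecedent, the leaf the consequent."""
    if isinstance(node, str):                       # leaf: emit one rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        print(f"IF {antecedent} THEN buys_computer = {node}")
        return
    attribute, branches = next(iter(node.items()))  # internal node: one attribute test
    for value, child in branches.items():
        extract_rules(child, conditions + ((attribute, value),))

extract_rules(tree)
```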
PATTERN MINING
Business Intelligence 3. Data Mining
Pattern Mining
Find frequent patterns in a data set.
◦ Frequent pattern – a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Why find (hidden) frequent patterns in data?
◦ What products were often purchased together?— Beer and diapers?!
◦ What are the subsequent purchases after buying a PC?
◦ Can we automatically classify web documents?
Applications
◦ Basket analysis, cross-marketing, catalog design, sale campaign analysis, web log (click count and stream) analysis, etc.
Frequent Patterns and Association Rules
Transaction-id   Items bought
10   A, B, D
20   A, C, D
30   A, D, E
40   B, E, F
50   B, C, D, E, F
• Itemset X = {x1, …, xk}
• Find all the rules X → Y with minimum support and confidence
  ◦ support(X → Y) = P(X ∪ Y): probability that a basket contains both X and Y
  ◦ confidence(X → Y) = P(Y | X): probability that Y is contained when X is contained
• Example
  ◦ Given supmin = 50%, confmin = 50%
  ◦ Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
  ◦ Association rules: A → D (60%, 100%), D → A (60%, 75%)
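As a worked illustration (my sketch, not from the slides), the code below computes support and confidence over the five transactions above and reproduces the A → D and D → A figures:

```python
# Transactions from the slide (transaction-id -> items bought).
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """P(Y | X): support of X ∪ Y divided by support of X."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> 60%
print(confidence({"A"}, {"D"}))   # 1.0  -> 100%
print(confidence({"D"}, {"A"}))   # 0.75 -> 75%
```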
Apriori Algorithm for Mining Frequent Patterns
Apriori pruning principle:
◦ Which is more frequent, {beer, diaper, nuts} or {beer, diaper}?
◦ "If any itemset is not frequent, its supersets cannot be frequent."
Method:
◦ Initially, scan the DB once to get the frequent 1-itemsets
◦ Generate length-(k+1) candidate itemsets (only) from the length-k frequent itemsets
◦ Test the candidates against the DB to see if they are frequent
◦ Terminate when no frequent or candidate itemsets can be generated
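A compact sketch of this level-wise method (my own illustration, not code from the slides), using the transactions of the previous example with min_sup given as an absolute count of 3 (50% of 5 transactions, rounded up):

```python
from itertools import combinations

# Transactions from the frequent-pattern example.
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def apriori(transactions, min_sup):
    # Scan once for frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    level = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in level}

    k = 1
    while level:
        # Generate length-(k+1) candidates only from length-k frequent itemsets,
        # and prune any candidate that has an infrequent length-k subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in level})
        k += 1
    return frequent

result = apriori(transactions, min_sup=3)
for itemset, sup in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)   # expect {A}:3 {B}:3 {D}:4 {E}:3 {A,D}:3
```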
Apriori Algorithm: An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidates from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2

Why NOT {A, B, C}? Its subset {A, B} is not frequent, so by the Apriori pruning principle {A, B, C} cannot be frequent and is never generated as a candidate.
Apriori Algorithm: An Example (continued) – generating association rules

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

Rules from L3 = {B, C, E} (contained in 2 of the 4 transactions):
Rule        Sup (%)   Conf (%)
B → C, E    50        66.7
C → B, E    50        66.7
E → B, C    50        66.7
B, C → E    50        100
B, E → C    50        66.7
C, E → B    50        100

Rules from L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2:
Rule     Sup (%)   Conf (%)
A → C    50        100
C → A    50        66.7
B → C    50        66.7
C → B    50        66.7
B → E    75        100
E → B    75        100
C → E    50        66.7
E → C    50        66.7
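A short sketch (mine, not from the slides) of this rule-generation step: every non-empty proper subset X of a frequent itemset F yields a candidate rule X → (F - X), which is kept if its confidence meets the threshold.

```python
from itertools import combinations

# The four transactions of the example (Tid 10-40).
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def rules_from(frequent_itemset, min_conf):
    """Generate X -> (F - X) for every non-empty proper subset X of F."""
    f = set(frequent_itemset)
    for r in range(1, len(f)):
        for lhs in combinations(sorted(f), r):
            rhs = f - set(lhs)
            conf = support(f) / support(lhs)
            if conf >= min_conf:
                yield lhs, tuple(sorted(rhs)), support(f), conf

for lhs, rhs, sup, conf in rules_from({"B", "C", "E"}, min_conf=0.5):
    print(f"{','.join(lhs)} -> {','.join(rhs)}  sup={sup:.0%}  conf={conf:.1%}")
```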
Lift: Measure of Interest in Association
play basketball → eat cereal [40%, 66.7%] may be misleading
◦ In case the overall percentage of students eating cereal is 75% > 66.7%
◦ Compare P(eat cereal | play basketball) with P(eat cereal)
play basketball → not eat cereal [20%, 33.3%] is more accurate
Lift – a measure of interest. To be of interest, a rule must have lift > 1
Lift(C → A) = P(A | C) / P(A)
Support, confidence and lift must be considered at the same time.
Using the contingency table below (B = plays basketball, C = eats cereal):
lift(B → ¬C) = P(¬C | B) / P(¬C) = (1000/3000) / (1250/5000) ≈ 1.33
lift(B → C) = P(C | B) / P(C) = (2000/3000) / (3750/5000) ≈ 0.89
Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000
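A quick re-computation of those two lift values from the contingency table above (my sketch, not from the slides):

```python
# 2x2 contingency table from the slide (counts out of 5000 students).
basketball_cereal         = 2000
basketball_not_cereal     = 1000
not_basketball_cereal     = 1750
not_basketball_not_cereal = 250
total = 5000

p_cereal     = (basketball_cereal + not_basketball_cereal) / total          # 3750/5000
p_not_cereal = (basketball_not_cereal + not_basketball_not_cereal) / total  # 1250/5000
n_basketball = basketball_cereal + basketball_not_cereal                    # 3000
p_cereal_given_b     = basketball_cereal / n_basketball
p_not_cereal_given_b = basketball_not_cereal / n_basketball

print(round(p_cereal_given_b / p_cereal, 2))         # lift(B -> C)  = 0.89 (< 1: not interesting)
print(round(p_not_cereal_given_b / p_not_cereal, 2)) # lift(B -> ¬C) = 1.33 (> 1: interesting)
```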
Exercise
Consider the following transaction data. Assume min_sup = 60% and min_conf = 80%.
Find all the association rules with their supports and confidences (with lift > 1).
TID Items_bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
Multiple-Level Association Mining
In the real world, items often form hierarchies
Observation: Items at lower level are expected to have lower support
Uniform support: the same min_sup is used at every level
◦ e.g., Level 1 min_sup = 5% and Level 2 min_sup = 5%
Reduced support: a lower min_sup is used at lower levels
◦ e.g., Level 1 min_sup = 5% and Level 2 min_sup = 3%
Example hierarchy: Milk [support = 10%] (Level 1) with children 2% Milk [support = 6%] and Skim Milk [support = 4%] (Level 2)
◦ With uniform support, Skim Milk (4% < 5%) is not frequent; with reduced support (3% at Level 2), both kinds of milk are frequent
Multi-Level Association Mining
Redundancy filtering: some rules may be redundant due to "ancestor" rules
Example
1. milk → wheat bread [support = 8%, confidence = 70%]
2. 2% milk → wheat bread [support = 2%, confidence = 72%]
◦ The first rule is an ancestor of the second rule
◦ A rule is redundant if its confidence is close to that of its ancestor rule
Problems of Frequent Pattern Mining
Problems in Apriori
◦ Multiple scans of transaction database
◦ Huge number of candidates
Many algorithms have been developed to address these problems
◦ Algorithms that reduce DB scans
◦ Algorithms that shrink the candidate size (or sometimes generate no candidates at all)
◦ Algorithms that facilitate candidate counting
◦ Constraint-based mining
Monotonic Constraints
Monotonic constraint
◦ When an itemset S satisfies the constraint, so does any of its supersets SS.
◦ We call such a constraint monotonic (monotonically increasingly satisfied)
Examples
◦ C1: sum(S.profit) ≥ 15 is monotone
  If itemset S satisfies C1, so does every superset of S
◦ C2: sum(S.price) ≥ v is monotone
◦ C3: min(S.price) ≤ v is monotone
Anti-Monotonic Constraints
Anti-monotonic constraint
◦ When an itemset S violates the constraint, so does any of its supersets SS.
◦ We call such a constraint anti-monotonic (monotonically increasingly violated)
Examples
◦ C4: sum(S.price) ≤ v is anti-monotone:
  if S violates "sum(S.price) ≤ v", every superset SS also violates "sum(SS.price) ≤ v"
◦ C5: sum(S.price) ≥ v is not anti-monotone
◦ C6: sum(S.profit) ≤ 15 is anti-monotone
The Apriori Algorithm — Example (min_sup = 2)

Database D:
TID   Items
100   1, 3, 4
200   2, 3, 5
300   1, 2, 3, 5
400   2, 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
Scan D → C2 counts: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
L2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2

C3: {2,3,5}
Scan D → L3: {2,3,5}:2
Apriori + Anti-monotonic Constraint
The database D and the candidate-generation steps (C1, L1, C2, L2) are the same as in the previous example.
Anti-monotonic constraint: sum(S.price) < $100
◦ Given that item 5's price is > $120, every itemset containing item 5 violates the constraint; by anti-monotonicity it can be pruned as soon as it appears, so {5} and all candidates containing 5 (e.g., {1,5}, {2,5}, {3,5}, {2,3,5}) need not be counted.
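To show how such a constraint is used inside the mining loop, here is a sketch (my own; the prices of items 1-4 are invented for illustration, only "item 5 costs more than $120" comes from the slide) in which any candidate violating sum(S.price) < $100 is discarded before its support is ever counted:

```python
# Database D from the example; item prices are hypothetical except item 5's (> $120).
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
price = {1: 20, 2: 30, 3: 40, 4: 50, 5: 125}
MIN_SUP = 2
MAX_SUM = 100   # anti-monotonic constraint: sum(S.price) < $100

def satisfies_constraint(itemset):
    return sum(price[i] for i in itemset) < MAX_SUM

def constrained_apriori(transactions, min_sup):
    items = {i for t in transactions for i in t}
    # Anti-monotonicity: once an itemset violates the constraint, every superset does too,
    # so violating candidates are dropped before any support counting.
    level = [frozenset([i]) for i in sorted(items) if satisfies_constraint({i})]
    level = [s for s in level if sum(s <= t for t in transactions) >= min_sup]
    frequent = list(level)
    k = 1
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = [c for c in candidates if satisfies_constraint(c)]  # prune by constraint
        level = [c for c in candidates if sum(c <= t for t in transactions) >= min_sup]
        frequent += level
        k += 1
    return frequent

for s in constrained_apriori(transactions, MIN_SUP):
    print(set(s))   # item 5 never appears: {5} alone already violates sum(S.price) < $100
```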