TRANSCRIPT
Data Mining: Mining Frequent Patterns,
Association and Correlations
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
Note: all data are assumed to be categorical; there is no good algorithm for numeric data.
Why Is Freq. Pattern Mining Important?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks:
  Association, correlation, and causality analysis
  Sequential and structural (e.g., sub-graph) patterns
  Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  Classification: associative classification
  Cluster analysis: frequent pattern-based clustering
  Data warehousing: cube-gradient analysis
Broad applications
Association Rules
Example rule: customers who buy a PC also buy antivirus software
  PC ⇒ antivirus-sw [support = 2%, confidence = 60%]
  Support: 2% of customers buy PC and antivirus-sw together
  Confidence: 60% of PC buyers also buy antivirus-sw
A rule needs to satisfy minimum support and minimum confidence
Definitions
Let I be the set of all items, e.g., I = {bread, milk, beer, chocolate, egg, diaper}
Transaction: a set of items, e.g., T1 = {milk, egg}
Itemset: a set of items
  k-itemset: an itemset with k items
Frequent itemset: an itemset that satisfies a frequency threshold called the minimum support count
Transaction data: a set of documents
A text document data set. Each document is treated as a "bag" of keywords:
  doc1: Student, Teach, School
  doc2: Student, School
  doc3: Teach, School, City, Game
  doc4: Baseball, Basketball
  doc5: Basketball, Player, Spectator
  doc6: Baseball, Coach, Game, Team
  doc7: Basketball, Team, City, Game
Association Rules
Association rule: an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
A rule holds with some support s and confidence c
Support (A ⇒ B): the percentage of transactions that contain A ∪ B, i.e., P(A ∪ B)
Confidence (A ⇒ B): the conditional probability that a transaction having A also contains B, i.e., P(B|A) = freq(A ∪ B) / freq(A)
Minimum support and minimum confidence are prespecified thresholds
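To make these definitions concrete, here is a minimal Python sketch (illustrative, not part of the original slides) that computes support and confidence over a small made-up transaction list:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    """P(B|A) = freq(A ∪ B) / freq(A)."""
    return support(A | B, transactions) / support(A, transactions)

transactions = [{"milk", "egg"}, {"milk", "bread"},
                {"milk", "egg", "bread"}, {"beer", "diaper"}]
A, B = {"milk"}, {"egg"}
print(support(A | B, transactions))    # 0.5
print(confidence(A, B, transactions))  # 0.666...
```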
Basic Concepts: Frequent Patterns and Association Rules
Itemset I = {i1, …, ik}
Find all the rules X ⇒ Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
Let supmin = 50%, confmin = 50%

Transaction-id | Items bought
1              | A, B, D
2              | A, C, D
3              | A, D, E
4              | B, E, F
5              | B, C, D, E, F

Frequent itemsets: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
  A ⇒ D (support 60%, confidence 100%)
  D ⇒ A (support 60%, confidence 75%)
[Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both]
Association Rule Mining
Goal: find all rules that satisfy the user-specified minimum support and minimum confidence
Association rule mining can be viewed as two steps:
  1. Find all frequent itemsets: itemsets that occur at least as frequently as a prespecified minimum support count min_sup
  2. Generate strong association rules: rules that satisfy minimum support and minimum confidence
The first step is more costly
Closed Patterns
Huge number of itemsets: if an itemset is frequent, then all of its subsets are frequent
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 sub-patterns
Too huge to compute or store
Solution: mine closed patterns instead
Closed Patterns
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
  i.e., use the biggest possible itemset for a given support level
A closed pattern is a lossless compression of the frequent patterns: it reduces the number of patterns and rules
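As an illustration (a minimal sketch, not from the slides): given all frequent itemsets with their supports, the closed ones can be filtered directly from the definition:

```python
def closed_itemsets(supports):
    """Keep only itemsets that have no proper superset with equal support."""
    return {X: s for X, s in supports.items()
            if not any(X < Y and t == s for Y, t in supports.items())}

# Toy DB = [{a,b,c}, {a,b}] with min_sup = 1; all frequent itemsets and counts:
supports = {
    frozenset("a"): 2, frozenset("b"): 2, frozenset("c"): 1,
    frozenset("ab"): 2, frozenset("ac"): 1, frozenset("bc"): 1,
    frozenset("abc"): 1,
}
print(closed_itemsets(supports))  # only {a,b}: 2 and {a,b,c}: 1 are closed
```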
Closed Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.
What is the set of closed itemsets?
  <a1, …, a100>: 1
  <a1, …, a50>: 2
What is the set of all frequent patterns?
  All 2^100 - 1 nonempty subsets of {a1, …, a100}!
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns:
  Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: several approaches
  Apriori (Agrawal & Srikant @VLDB'94)
  Frequent pattern growth (FPgrowth; Han, Pei & Yin @SIGMOD'00)
Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
Method:
  Initially, scan the DB once to get the frequent 1-itemsets
  Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  Test the candidates against the DB
  Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example
Supmin = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan gives C1:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 ({D} pruned, support < 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2, generated from L1:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan counts C2:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 ({A, B} and {A, E} pruned):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3, generated from L2:
{B, C, E}

3rd scan gives L3:
{B, C, E}: 2
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
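Below is a runnable Python sketch of the algorithm (an illustration under the slides' definitions, not the original authors' code); the candidate-generation step follows the self-join and prune procedure detailed on the next slide:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # 1st scan: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Self-join: unite pairs of frequent k-itemsets into (k+1)-candidates,
        # then prune any candidate that has an infrequent k-subset.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the candidates against the DB in one scan
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))  # includes frozenset({'B', 'C', 'E'}): 2
```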
Important Details of Apriori
How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning
How to count supports of candidates?
Example of candidate generation:
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3 * L3
    abcd from abc and abd
    acde from acd and ace
  Pruning: acde is removed because ade is not in L3
  C4 = {abcd}
Step 2: Generating Rules from Frequent Itemsets
Frequent itemsets ≠ association rules; one more step is needed to generate association rules
For each frequent itemset X:
  For each proper nonempty subset A of X, let B = X - A
  A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, where
    support(A ⇒ B) = support(A ∪ B) = support(X)
    confidence(A ⇒ B) = support(A ∪ B) / support(A)
Generating rules: an example
Suppose {2,3,4} is frequent, with sup = 50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
These generate the following association rules:
  2,3 ⇒ 4, confidence = 100%
  2,4 ⇒ 3, confidence = 100%
  3,4 ⇒ 2, confidence = 67%
  2 ⇒ 3,4, confidence = 67%
  3 ⇒ 2,4, confidence = 67%
  4 ⇒ 2,3, confidence = 67%
All rules have support = 50%, since support(A ⇒ B) = support(A ∪ B)
Generating rules: summary
To recap, in order to obtain A ⇒ B, we need support(A ∪ B) and support(A)
All the information required for confidence computation has already been recorded during itemset generation; there is no need to scan the data T again
This step is not as time-consuming as frequent itemset generation
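A minimal sketch of this rule-generation step (illustrative; it consumes the support dictionary produced by the apriori sketch above):

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Yield (A, B, confidence) for every rule A => B meeting min_conf.
    `supports` maps frozenset itemsets to support counts, as recorded
    during itemset generation; the data is never scanned again."""
    for X, sup_x in supports.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                    # proper nonempty subsets
            for A in map(frozenset, combinations(X, r)):
                conf = sup_x / supports[A]            # support(X) / support(A)
                if conf >= min_conf:
                    yield A, X - A, conf

rules = list(generate_rules(apriori(tdb, min_sup=2), min_conf=0.5))
```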
Bottleneck of the Apriori Algorithm
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 … i100, the number of scans is 100
Bottleneck: candidate generation and test
Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items:
  "abc" is a frequent pattern
  Get all transactions having "abc": DB|abc
  "d" is a local frequent item in DB|abc
  ⇒ abcd is a frequent pattern
Construct FP-tree from a Transaction Database
min_support = 3

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan DB again, construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3

Tree after inserting TID 100 (a single path):
{} → f:1 → c:1 → a:1 → m:1 → p:1
Construct FP-tree from a Transaction Database (cont.)
After inserting TID 200 ({f, c, a, b, m}), the tree shares the f-c-a prefix and branches below a:

{}
└─ f:2
   └─ c:2
      └─ a:2
         ├─ m:1 → p:1
         └─ b:1 → m:1

F-list = f-c-a-b-m-p
Construct FP-tree from a Transaction Database (cont.)
After all five transactions, the complete FP-tree:

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2 → p:2
│  │     └─ b:1 → m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header table (item: frequency, with head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p
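The construction can be sketched in Python (an illustrative implementation of the three steps above, not the original FP-growth code; note that ties in the f-list may come out in a different order than on the slides):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    # Step 1: scan once, count items; Step 2: frequency-descending f-list
    freq = Counter(item for t in transactions for item in t)
    flist = [item for item, c in freq.most_common() if c >= min_sup]
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None, None)
    header = {item: [] for item in flist}   # header table: item -> node-links
    # Step 3: scan again, insert each transaction's frequent items in f-list order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist
```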
Find Patterns Having p From p's Conditional Pattern Base
Starting at the frequent-item header table of the FP-tree, traverse the tree by following the node-links of each frequent item p
Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (from the FP-tree above):
item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

Example: conditional patterns having 'p'
  Start at the bottom of the header table and find all paths containing 'p':
    (f:4, c:3, a:3, m:2, p:2) and (c:1, b:1, p:1)
  Adjust the counts on each path to p's count:
    (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1)
  Remove 'p' to obtain the prefix paths:
    (f:2, c:2, a:2, m:2) and (c:1, b:1)
From Conditional Pattern Bases to Conditional FP-trees
For each pattern base:
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base
With min_support = 3:
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: {} → f:3 → c:3 → a:3  (b is dropped, since its count 1 < 3)
  All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
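Recursing on these conditional pattern bases yields the complete FP-growth procedure; here is a compact sketch that reuses the build_fp_tree helper above (illustrative; it materializes each conditional database as replicated paths, which a real implementation would avoid):

```python
def fp_growth(transactions, min_sup, suffix=frozenset()):
    """Yield (frequent itemset, support count) pairs."""
    root, header, flist = build_fp_tree(transactions, min_sup)
    for item in reversed(flist):              # bottom of the header table first
        support = sum(node.count for node in header[item])
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: prefix paths of `item`, each
        # replicated as many times as the node's count
        cond_db = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        yield from fp_growth(cond_db, min_sup, pattern)  # mine conditional tree

# e.g. dict(fp_growth([{"f","c","a","m","p"}, {"f","c","a","b","m"}, {"f","b"},
#                      {"c","b","p"}, {"f","c","a","m","p"}], min_sup=3))
```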
Another example
Min_sup = 2

Transaction | Itemset    | Sorted itemset
T1          | a, b, e    | b, a, e
T2          | b, d       | b, d
T3          | b, c       | b, c
T4          | a, b, d    | b, a, d
T5          | a, c       | a, c
T6          | b, c       | b, c
T7          | a, c       | a, c
T8          | a, b, c, e | b, a, c, e
T9          | a, b, c    | b, a, c

Frequent items (descending): b:7, a:6, c:6, d:2, e:2
Example (cont.)

Item | Cond. pattern base     | Cond. FP-tree     | Freq. patterns
e    | {b,a:1}, {b,a,c:1}     | <b:2, a:2>        | {b,e:2}, {a,e:2}, {a,b,e:2}
d    | {b,a:1}, {b:1}         | <b:2>             | {b,d:2}
c    | {b,a:2}, {b:2}, {a:2}  | <b:4, a:2>, <a:2> | {b,c:4}, {a,c:4}, {a,b,c:2}
a    | {b:4}                  | <b:4>             | {b,a:4}
Benefits of the FP-tree Structure
Completeness
  Preserves complete information for frequent pattern mining
  Never breaks a long pattern of any transaction
Compactness
  Reduces irrelevant info: infrequent items are gone
  Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  Never larger than the original database (not counting node-links and the count fields)
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  Patterns containing p
  Patterns having m but no p
  …
  Patterns having c but none of a, b, m, or p
  Pattern f
This partitioning is complete and non-redundant
Scaling FP-growth by DB Projection
What if the FP-tree cannot fit in memory? DB projection
  First partition the database into a set of projected DBs
  Then construct and mine an FP-tree for each projected DB
Parallel projection vs. partition projection techniques
  Parallel projection is space-costly
Why Is FP-Growth the Winner?
Divide-and-conquer:
  Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  Leads to focused search of smaller databases
Other factors:
  No candidate generation, no candidate test
  Compressed database: the FP-tree structure
  No repeated scan of the entire database
  Basic ops are counting local frequent items and building sub-FP-trees: no pattern search and matching
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Mining Various Kinds of Association Rules
Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings: items at a lower level are expected to have lower support
  e.g., Logitech wireless laser mouse M505 & Norton antivirus 2015
It is difficult to find strong associations with a single threshold, so use multiple min_sup values:

  Milk [support = 10%]
    2% Milk [support = 6%]
    Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Problem with uniform support
If the frequencies of items vary a great deal, we encounter two problems:
  If min_sup is set too high, rules that involve rare items will not be found
  To find rules that involve both frequent and rare items, min_sup has to be set very low; this causes a combinatorial explosion, because the frequent items become associated with one another in all possible ways
Mining Multi-Dimensional Association
Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: 2 or more dimensions or predicates
  Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values (occupation, brand, color)
Quantitative attributes: numeric, implicit ordering among values; handled by discretization, clustering, and gradient approaches
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules); intervals are determined dynamically
3. Clustering: distance-based association; cluster one dimension first, then form associations
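As a concrete illustration of the first approach, here is a minimal sketch of static discretization (the bin edges and labels are hypothetical, chosen only for this example): each numeric value is replaced by its predefined interval, after which the transactions are categorical and can be mined as usual:

```python
import bisect

# Hypothetical predefined age intervals, acting as a one-level concept hierarchy
AGE_EDGES = [18, 26, 36, 51]   # boundaries for: <18, 18-25, 26-35, 36-50, 51+
AGE_LABELS = ["age:<18", "age:18-25", "age:26-35", "age:36-50", "age:51+"]

def discretize_age(age):
    """Map a numeric age to its interval label."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

record = {"age": 22, "occupation": "student", "buys": "coke"}
items = {discretize_age(record["age"]),
         "occupation:" + record["occupation"],
         "buys:" + record["buys"]}
# `items` is now a categorical transaction, ready for Apriori / FP-growth
```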
Mining Other Interesting Patterns
Flexible support constraints
  Some items (e.g., diamonds) may occur rarely but are valuable
  Customized supmin specification and application
Top-k closed frequent patterns
  Useful because supmin is hard to specify in advance
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Interestingness Measure: Lift
Strong rules are not necessarily interesting: confidence is only an estimate and does not measure the real strength of a correlation
play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
Interestingness Measure: Correlations (Lift)
Measure of dependent/correlated events: lift

  lift(A, B) = P(A ∪ B) / (P(A) × P(B))

If lift > 1: positively correlated
If lift < 1: negatively correlated
If lift = 1: independent

Using the contingency table above (B = basketball, C = cereal):

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
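The same computation in a few lines (a sketch using the counts from the contingency table; the function and variable names are mine):

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

total = 5000
basketball, cereal, not_cereal = 3000, 3750, 1250
print(round(lift(2000, basketball, cereal, total), 2))      # 0.89
print(round(lift(1000, basketball, not_cereal, total), 2))  # 1.33
```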
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Summary
Frequent pattern mining: an important task in data mining
Support & confidence
Scalable frequent pattern mining methods:
  Apriori (candidate generation & test)
  FPgrowth (projection-based)
Mining a variety of rules and interesting patterns
Correlation analysis: lift