TRANSCRIPT
Data Mining: Mining Frequent Patterns,
Association and Correlations
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
Note: all data are assumed to be categorical; there is no good algorithm for numeric data.
Why Is Freq. Pattern Mining Important?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks:
  Association, correlation, and causality analysis
  Sequential and structural (e.g., sub-graph) patterns
  Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  Classification: associative classification
  Cluster analysis: frequent pattern-based clustering
  Data warehousing: cube-gradient analysis
Broad applications
Association Rules
Example rule: customers who buy a PC also buy antivirus software
  PC ⇒ antivirus-sw [support = 2%, confidence = 60%]
  Support: 2% of customers buy PC and antivirus-sw together
  Confidence: 60% of PC buyers also buy antivirus-sw
A rule needs to satisfy minimum support and minimum confidence
Definitions
Let I be the set of all items, e.g., I = {bread, milk, beer, chocolate, egg, diaper}
Transaction: a set of items, e.g., T1 = {milk, egg}
Itemset: a set of items
  k-itemset: an itemset with k items
Frequent itemset: an itemset that satisfies a frequency threshold called the minimum support count
Transaction data: a set of documents
A text document data set. Each document is treated as a "bag" of keywords:
  doc1: Student, Teach, School
  doc2: Student, School
  doc3: Teach, School, City, Game
  doc4: Baseball, Basketball
  doc5: Basketball, Player, Spectator
  doc6: Baseball, Coach, Game, Team
  doc7: Basketball, Team, City, Game
Association Rules
Association rule: an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
A rule holds with some support s and confidence c
Support (A ⇒ B): the percentage of transactions that contain A ∪ B, i.e., P(A ∪ B)
Confidence (A ⇒ B): the conditional probability that a transaction having A also contains B, i.e., P(B|A) = freq(A ∪ B) / freq(A)
Minimum support and minimum confidence are prespecified thresholds
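To make these definitions concrete, here is a minimal Python sketch (illustrative, not part of the original slides) that computes support and confidence over a small made-up transaction list:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    """P(B|A) = freq(A ∪ B) / freq(A)."""
    return support(A | B, transactions) / support(A, transactions)

transactions = [{"milk", "egg"}, {"milk", "bread"},
                {"milk", "egg", "bread"}, {"beer", "diaper"}]
A, B = {"milk"}, {"egg"}
print(support(A | B, transactions))    # 0.5
print(confidence(A, B, transactions))  # 0.666...
```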
Basic Concepts: Frequent Patterns and Association Rules
Itemset I = {i1, …, ik}
Find all the rules X ⇒ Y with minimum support and confidence
  support, s: probability that a transaction contains X ∪ Y
  confidence, c: conditional probability that a transaction having X also contains Y
Let supmin = 50%, confmin = 50%

Transaction-id | Items bought
1              | A, B, D
2              | A, C, D
3              | A, D, E
4              | B, E, F
5              | B, C, D, E, F

Frequent itemsets: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
  A ⇒ D (support 60%, confidence 100%)
  D ⇒ A (support 60%, confidence 75%)
[Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both]
Association Rule Mining
Goal: find all rules that satisfy the user-specified minimum support and minimum confidence
Association rule mining can be viewed as two steps:
  1. Find all frequent itemsets: itemsets that occur at least as frequently as a prespecified minimum support count min_sup
  2. Generate strong association rules: rules that satisfy minimum support and minimum confidence
The first step is more costly
Closed Patterns
Huge number of itemsets: if an itemset is frequent, then all of its subsets are frequent
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 sub-patterns
Too huge to compute or store
Solution: mine closed patterns instead
Closed Patterns
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
  i.e., use the biggest possible itemset for a given support level
A closed pattern is a lossless compression of the frequent patterns: it reduces the number of patterns and rules
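As an illustration (a minimal sketch, not from the slides): given all frequent itemsets with their supports, the closed ones can be filtered directly from the definition:

```python
def closed_itemsets(supports):
    """Keep only itemsets that have no proper superset with equal support."""
    return {X: s for X, s in supports.items()
            if not any(X < Y and t == s for Y, t in supports.items())}

# Toy DB = [{a,b,c}, {a,b}] with min_sup = 1; all frequent itemsets and counts:
supports = {
    frozenset("a"): 2, frozenset("b"): 2, frozenset("c"): 1,
    frozenset("ab"): 2, frozenset("ac"): 1, frozenset("bc"): 1,
    frozenset("abc"): 1,
}
print(closed_itemsets(supports))  # only {a,b}: 2 and {a,b,c}: 1 are closed
```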
Closed Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.
What is the set of closed itemsets?
  <a1, …, a100>: 1
  <a1, …, a50>: 2
What is the set of all frequent patterns?
  All 2^100 - 1 nonempty subsets of {a1, …, a100}!
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns:
  Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods: several approaches
  Apriori (Agrawal & Srikant @VLDB'94)
  Frequent pattern growth (FPgrowth; Han, Pei & Yin @SIGMOD'00)
Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
Method:
  Initially, scan the DB once to get the frequent 1-itemsets
  Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  Test the candidates against the DB
  Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example
Supmin = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan gives C1:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 ({D} pruned, support < 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2, generated from L1:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan counts C2:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 ({A, B} and {A, E} pruned):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3, generated from L2:
{B, C, E}

3rd scan gives L3:
{B, C, E}: 2
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
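Below is a runnable Python sketch of the algorithm (an illustration under the slides' definitions, not the original authors' code); the candidate-generation step follows the self-join and prune procedure detailed on the next slide:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # 1st scan: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Self-join: unite pairs of frequent k-itemsets into (k+1)-candidates,
        # then prune any candidate that has an infrequent k-subset.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the candidates against the DB in one scan
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))  # includes frozenset({'B', 'C', 'E'}): 2
```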
Important Details of Apriori
How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning
How to count supports of candidates?
Example of candidate generation:
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3 * L3
    abcd from abc and abd
    acde from acd and ace
  Pruning: acde is removed because ade is not in L3
  C4 = {abcd}
Step 2: Generating Rules from Frequent Itemsets
Frequent itemsets ≠ association rules; one more step is needed to generate association rules
For each frequent itemset X:
  For each proper nonempty subset A of X, let B = X - A
  A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, where
    support(A ⇒ B) = support(A ∪ B) = support(X)
    confidence(A ⇒ B) = support(A ∪ B) / support(A)
Generating rules: an example
Suppose {2,3,4} is frequent, with sup = 50%
Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively
These generate the following association rules:
  2,3 ⇒ 4, confidence = 100%
  2,4 ⇒ 3, confidence = 100%
  3,4 ⇒ 2, confidence = 67%
  2 ⇒ 3,4, confidence = 67%
  3 ⇒ 2,4, confidence = 67%
  4 ⇒ 2,3, confidence = 67%
All rules have support = 50%, since support(A ⇒ B) = support(A ∪ B)
Generating rules: summary
To recap, in order to obtain A ⇒ B, we need support(A ∪ B) and support(A)
All the information required for confidence computation has already been recorded during itemset generation; there is no need to scan the data T again
This step is not as time-consuming as frequent itemset generation
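A minimal sketch of this rule-generation step (illustrative; it consumes the support dictionary produced by the apriori sketch above):

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Yield (A, B, confidence) for every rule A => B meeting min_conf.
    `supports` maps frozenset itemsets to support counts, as recorded
    during itemset generation; the data is never scanned again."""
    for X, sup_x in supports.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                    # proper nonempty subsets
            for A in map(frozenset, combinations(X, r)):
                conf = sup_x / supports[A]            # support(X) / support(A)
                if conf >= min_conf:
                    yield A, X - A, conf

rules = list(generate_rules(apriori(tdb, min_sup=2), min_conf=0.5))
```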
Bottleneck of the Apriori Algorithm
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 … i100, the number of scans is 100
Bottleneck: candidate generation and test
Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items:
  "abc" is a frequent pattern
  Get all transactions having "abc": DB|abc
  "d" is a local frequent item in DB|abc
  ⇒ abcd is a frequent pattern
Construct FP-tree from a Transaction Database
min_support = 3

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan DB again, construct the FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3

Tree after inserting TID 100 (a single path):
{} → f:1 → c:1 → a:1 → m:1 → p:1
Construct FP-tree from a Transaction Database (cont.)
After inserting TID 200 ({f, c, a, b, m}), the tree shares the f-c-a prefix and branches below a:

{}
└─ f:2
   └─ c:2
      └─ a:2
         ├─ m:1 → p:1
         └─ b:1 → m:1

F-list = f-c-a-b-m-p
Construct FP-tree from a Transaction Database (cont.)
After all five transactions, the complete FP-tree:

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2 → p:2
│  │     └─ b:1 → m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header table (item: frequency, with head of node-links): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p
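The construction can be sketched in Python (an illustrative implementation of the three steps above, not the original FP-growth code; note that ties in the f-list may come out in a different order than on the slides):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    # Step 1: scan once, count items; Step 2: frequency-descending f-list
    freq = Counter(item for t in transactions for item in t)
    flist = [item for item, c in freq.most_common() if c >= min_sup]
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None, None)
    header = {item: [] for item in flist}   # header table: item -> node-links
    # Step 3: scan again, insert each transaction's frequent items in f-list order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist
```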
Find Patterns Having p From p's Conditional Pattern Base
Starting at the frequent-item header table of the FP-tree, traverse the tree by following the node-links of each frequent item p
Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (from the FP-tree above):
item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

Example: conditional patterns having 'p'
  Start at the bottom of the header table and find all paths containing 'p':
    (f:4, c:3, a:3, m:2, p:2) and (c:1, b:1, p:1)
  Adjust the counts on each path to p's count:
    (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1)
  Remove 'p' to obtain the prefix paths:
    (f:2, c:2, a:2, m:2) and (c:1, b:1)
From Conditional Pattern Bases to Conditional FP-trees
For each pattern base:
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base
With min_support = 3:
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: {} → f:3 → c:3 → a:3  (b is dropped, since its count 1 < 3)
  All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
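Recursing on these conditional pattern bases yields the complete FP-growth procedure; here is a compact sketch that reuses the build_fp_tree helper above (illustrative; it materializes each conditional database as replicated paths, which a real implementation would avoid):

```python
def fp_growth(transactions, min_sup, suffix=frozenset()):
    """Yield (frequent itemset, support count) pairs."""
    root, header, flist = build_fp_tree(transactions, min_sup)
    for item in reversed(flist):              # bottom of the header table first
        support = sum(node.count for node in header[item])
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: prefix paths of `item`, each
        # replicated as many times as the node's count
        cond_db = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        yield from fp_growth(cond_db, min_sup, pattern)  # mine conditional tree

# e.g. dict(fp_growth([{"f","c","a","m","p"}, {"f","c","a","b","m"}, {"f","b"},
#                      {"c","b","p"}, {"f","c","a","m","p"}], min_sup=3))
```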
Another example
Min_sup = 2

Transaction | Itemset    | Sorted itemset
T1          | a, b, e    | b, a, e
T2          | b, d       | b, d
T3          | b, c       | b, c
T4          | a, b, d    | b, a, d
T5          | a, c       | a, c
T6          | b, c       | b, c
T7          | a, c       | a, c
T8          | a, b, c, e | b, a, c, e
T9          | a, b, c    | b, a, c

Frequent items (descending): b:7, a:6, c:6, d:2, e:2
Example (cont.)

Item | Cond. pattern base     | Cond. FP-tree     | Freq. patterns
e    | {b,a:1}, {b,a,c:1}     | <b:2, a:2>        | {b,e:2}, {a,e:2}, {a,b,e:2}
d    | {b,a:1}, {b:1}         | <b:2>             | {b,d:2}
c    | {b,a:2}, {b:2}, {a:2}  | <b:4, a:2>, <a:2> | {b,c:4}, {a,c:4}, {a,b,c:2}
a    | {b:4}                  | <b:4>             | {b,a:4}
Benefits of the FP-tree Structure
Completeness
  Preserves complete information for frequent pattern mining
  Never breaks a long pattern of any transaction
Compactness
  Reduces irrelevant info: infrequent items are gone
  Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  Never larger than the original database (not counting node-links and the count fields)
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  Patterns containing p
  Patterns having m but no p
  …
  Patterns having c but none of a, b, m, or p
  Pattern f
This partitioning is complete and non-redundant
Scaling FP-growth by DB Projection
What if the FP-tree cannot fit in memory? DB projection
  First partition the database into a set of projected DBs
  Then construct and mine an FP-tree for each projected DB
Parallel projection vs. partition projection techniques
  Parallel projection is space-costly
Why Is FP-Growth the Winner?
Divide-and-conquer:
  Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  Leads to focused search of smaller databases
Other factors:
  No candidate generation, no candidate test
  Compressed database: the FP-tree structure
  No repeated scan of the entire database
  Basic ops are counting local frequent items and building sub-FP-trees: no pattern search and matching
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Mining Various Kinds of Association Rules
Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns
Mining Multiple-Level Association Rules
Items often form hierarchies
Flexible support settings: items at a lower level are expected to have lower support
  e.g., Logitech wireless laser mouse M505 & Norton antivirus 2015
It is difficult to find strong associations with a single threshold, so use multiple min_sup values:

  Milk [support = 10%]
    2% Milk [support = 6%]
    Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Problem with uniform support
If the frequencies of items vary a great deal, we encounter two problems:
  If min_sup is set too high, rules that involve rare items will not be found
  To find rules that involve both frequent and rare items, min_sup has to be set very low; this causes a combinatorial explosion, because the frequent items become associated with one another in all possible ways
Mining Multi-Dimensional Association
Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
Multi-dimensional rules: 2 or more dimensions or predicates
  Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
Categorical attributes: finite number of possible values, no ordering among values (occupation, brand, color)
Quantitative attributes: numeric, implicit ordering among values; handled by discretization, clustering, and gradient approaches
Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules); intervals are determined dynamically
3. Clustering: distance-based association; cluster one dimension first, then form associations
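As a concrete illustration of the first approach, here is a minimal sketch of static discretization (the bin edges and labels are hypothetical, chosen only for this example): each numeric value is replaced by its predefined interval, after which the transactions are categorical and can be mined as usual:

```python
import bisect

# Hypothetical predefined age intervals, acting as a one-level concept hierarchy
AGE_EDGES = [18, 26, 36, 51]   # boundaries for: <18, 18-25, 26-35, 36-50, 51+
AGE_LABELS = ["age:<18", "age:18-25", "age:26-35", "age:36-50", "age:51+"]

def discretize_age(age):
    """Map a numeric age to its interval label."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]

record = {"age": 22, "occupation": "student", "buys": "coke"}
items = {discretize_age(record["age"]),
         "occupation:" + record["occupation"],
         "buys:" + record["buys"]}
# `items` is now a categorical transaction, ready for Apriori / FP-growth
```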
Mining Other Interesting Patterns
Flexible support constraints
  Some items (e.g., diamonds) may occur rarely but are valuable
  Customized supmin specification and application
Top-k closed frequent patterns
  Useful because supmin is hard to specify in advance
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Interestingness Measure: Lift
Strong rules are not necessarily interesting: confidence is only an estimate and does not measure the real strength of a correlation
play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
Interestingness Measure: Correlations (Lift)
Measure of dependent/correlated events: lift

  lift(A, B) = P(A ∪ B) / (P(A) × P(B))

If lift > 1: positively correlated
If lift < 1: negatively correlated
If lift = 1: independent

Using the contingency table above (B = basketball, C = cereal):

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
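The same computation in a few lines (a sketch using the counts from the contingency table; the function and variable names are mine):

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

total = 5000
basketball, cereal, not_cereal = 3000, 3750, 1250
print(round(lift(2000, basketball, cereal, total), 2))      # 0.89
print(round(lift(1000, basketball, not_cereal, total), 2))  # 1.33
```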
Mining Frequent Patterns, Association and Correlations
Basic concepts and a road map
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Summary
Summary
Frequent pattern mining: an important task in data mining
Support & confidence
Scalable frequent pattern mining methods:
  Apriori (candidate generation & test)
  FPgrowth (projection-based)
Mining a variety of rules and interesting patterns
Correlation analysis: lift