pattern mining - knowledge discovery and data...

Pa�ern MiningKnowledge Discovery and Data Mining 2

Roman Kern and Heimo Gursch

ISDS, TU Graz

2019-10-31

Roman Kern and Heimo Gursch (ISDS, TU Graz) Pa�ern Mining 2019-10-31 1 / 44

Outline

1 Introduction

2 Apriori Algorithm

3 FP-Growth Algorithm

4 Sequence Pa�ern Mining

5 Alternatives & Remarks

6 Tools


Introduction to Pa�ern MiningWhat & Why


Pa�ern Mining Intro

Definition of KDDKDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimatelyunderstandable pa�erns in data

→ But, until now we did not directly investigate into mining of these pa�erns.


Pa�ern Mining Example

Famous example for pa�ern miningMany supermarkets/grocery (chains) do have loyalty cards

→ gives the store the ability to analyse the purchase behaviour (and increase sales, formarketing, etc.)Benefits

I Which items have been bought togetherI Hidden relationships: probability of purchasing an item given another item is already part of

the purchaseI … or temporal pa�erns



Transactions in a groceryAll purchases are organised in transactions

Each transaction contains a list of purchased items

For example:

Id Items

0 Bread, Milk1 Bread, Diapers, Beer, Eggs2 Milk, Diapers, Beer, Coke3 Bread, Milk, Diapers, Beer4 Bread, Milk, Diapers, Coke



Ways of analysing the transaction dataFrequent itemset

I Which items are o�en bought together?I → association analysis also called itemset mining

Frequent associationsI If one itemset occurs what itemset will most probably also occur alongside?I Find “if A - then B” associationsI → association rule mining



ResultA grocery store chain in the Midwest of the US found that on Thursdays men tend to buy beer ifthey bought diapers as well.

Association Rule{diapers → beer}

Note #1: The grocery store could make use of this knowledge to place diapers and beer close to each other and tomake sure none of these product has a discount on Thursdays.Note #2: Not clear whether this is actually true.


Pa�ern Mining Basics

Formal description: items and transactionsI = {I1, I2, ..., In} - the set of all available items (e.g. the grocery goods - inventory)

D = {t1, t2, ..., tm} - the transactions, where each transaction consists of a set of items(tx ⊆ I)

Note #1: The organisation of transaction is sometimes referred to as horizontal dataNote #2: One could see each item as feature and thus each transaction as feature vector with zeros for items not contained in thetransaction.format



Formal description: itemset and association ruleX = {I1, I2, ..., Ii} - the itemset, just a bunch of items, a subset of IFrequent itemset is an itemset that found o�en in the transactionsX → Y - an association rule

I X and Y are itemssetsI X is called premise, antecedent or le�-hand-side (LHS)I Y is called conclusion, consequent or right-hand-side (RHS)

Frequent association rule is an association rule that is true for many transactions. I. e., X → Yis found o�en.



Formal description: support of an itemsetHow frequent is a frequent itemset?

→ The support of an itemset is the probability of finding the itemset in the transactions.

Support(X) = P(X) = P(I1, I2, ..., In)Support(X) = |{t∈D|X⊆t}|

|D|The relative frequency of the itemset (relative number of transactions containing all itemsfrom the itemset)

Note: Some algorithms take the total number of transactions instead of the relative frequency



Formal description: support of an association ruleHow frequent is a frequent association rule?

→ The support of an association rule is the probability of finding the association rule to betrue in the transactions.

Support(X → Y) = P(X ∪ Y) = P(IX1 , IX2 , ..., IXn , IY1 , IY2 , ..., IYm)Support(X) = |{t∈D|X∪Y⊆t}|

|D|… equals to the joint itemset



Formal description: confidenceHow relevant is a (frequent) association rule?

→ The confidence is the probability to find the conclusion Y given the premise X is there.

Conf(X → Y) = P(IY1 , IY2 , ..., IYm|IX1 , IX2 , ..., IXn )Conf(X → Y) = Support(X∪Y)

Support(X)

Conf(X → Y) = |{t∈D|X∪Y⊆t}||{t∈D|X⊆t}|

The confidence reflects how o�en one finds Y in transactions, which contain X



Association Rule Mining GoalFind all frequent, relevant association rules, i.e. all rules with a su�icient support and su�icientconfidence

What is it useful for?Market basket analysis, query completion, click stream analysis, genome analysis, …

Preprocessing for other data mining tasks (e. g., select most frequent part from dataset).



Simple Solution (for association analysis)1 Pick out any combination of items2 Test for su�icient support3 Goto 1 (until all combinations have been tried)

Not the optimal solution, due to 2n combinations for n items

Two improvements are presented next:→ Aprioi Algorithm→ FP-Growth Algorithm


The Apriori Algorithm… and the apriori principle


Apriori Algorithm

�estionReduce the number of candidates from 2n combinations for n items

Can some candidates be ignored?

Apriori algorithmSimple and e�icient algorithm to find frequent itemsets

Can also be applied to find association rules

InputThe transactions

The (minimal) support threshold

For association rules additionally: The (minimal) confidence threshold

R. Agrawal, H. Road, and S. Jose, “Mining Association Rules between Sets of Items in Large Databases,” ACM SIGMOD Record. Vol.22. No. 2. ACM, 1993.


Apriori Algorithm

Apriori principleIf an itemset is frequent, then all of the subsets are frequent as wellIf an itemset is infrequent, then all of the supersets are infrequent as well

Note: This can be used to reduce the number of candidates considerably


Apriori Algorithm

Apriori basic approach for frequent itemsets1 Find all frequent itemsets of size k = 1 and add them to the frequent itemsets F1 (i.e., check

for support of single items)2 k = k + 13 Generate the candidates Ck of size k from the known frequent itemsets Fk−1 of size k − 14 Prune the candidates in Ck containing infrequent sub-itemsets (apriori principle)5 For all candidate in Ck

I Calculate the support countI If the support is large enough move to Fk ; if not delete it

6 Goto 2 (until there no entries for Fk are found)


Apriori Algorithm

Basis for frequent association rule miningFrequent association rule mining is done on the frequent itemsets and not on thetransactions directly, thus reducing the search space considerably.

A set with k items has 2k − 2 candidates for association rules, i.e., all subsets but the emptyand the full set can be premise and conclusion.Confidence is monotonous only on the same itemsets

I Conf(ABC → D) ≥ Conf(AC → CD) ≥ Conf(A→ BCD)

but not in generalI Conf(ABC → D) can be larger or smaller than Conf(AB→ D)


Apriori Algorithm

Apriori basic approach for frequent association rules1 Take the frequent itemsets of size k ≥ 22 Start by generating association rules with a conclusion set Hm of size m = 1

I Hm of size m = 1 is generated by picking all items out of the frequent itemset one at a timeand one a�er the other.

3 m = m+ 14 Generate candidates the conclusion for size m, by merging two rules from the previous step

I New premise: intersection of old premisesI New conclusion: union of old premisesI I.e., one item moves from the premise to the conclusion

5 Prune the candidates containing infrequent sub-itemsets (apriori principle)6 For all candidate

I Calculate the confidenceI If the confidence is large enough move to Hm; if not delete it

7 Goto 2 (until there no entries for Hk are found)Roman Kern and Heimo Gursch (ISDS, TU Graz) Pa�ern Mining 2019-10-31 21 / 44

The FP-Growth Algorithm… more e�icient than Apriori


FP-Growth Algorithm

OverviewCompletely di�erent approach than the Apriori algorithm

Follows a divide and conquer approachLower complexity than Apriori for finding frequent itemsets by several orders of magnitude

I Reduces the scans scans through the transaction database to two

Main idea: transform the data set into an e�icient data structure, the FP Tree1 Build the Frequent Pa�ern Tree (FP-Tree)2 Mine the FP-Tree for frequent itemsets by

F Creating conditional FP-Trees on-the-flyF Recursively mine the conditional FP-Trees for frequent itemsets


FP-Growth Algorithm

The FP-TreeContains all information from the transactions

Allows e�icient retrieval of frequent itemsetsConsists of a tree

I Each node represents an itemI Each node has a count, # of transactions from the root to the node

Consists of a linked listI For each frequent item there is a head with countI Links to all nodes with the same item


FP-Growth Algorithm

Building the FP-Tree1 Scan through all transactions and count each item generate the table Ix → countx of the

item support given by their count2 Scan through all transactions and

1 Sort the items in the transaction by descending support count2 Throw away infrequent items, countx < minSupport3 Iterate over the items and match against tree (starting with the root node)

F if the current item is found as child to the current node, increase its countF if the current item is not found, create a new child (branch)F Repeat for all items in transaction

4 Repeat for all transactions


FP-Growth Algorithm

Given a set of transactions (1), sort the items by frequency in the data set and prune infrequent items (4), create theFP-Tree with the tree itself (root on the le�) a linked list for each of the items (head on the top)


FP-Growth Algorithm

Mine the FP-Tree for frequent itemsets1 Start with single item itemsets I ordered by increasing support2 Create a conditional FP-Tree for the selected itemset I

I Remove all items with less support from the treeI Remove all transaction where the selected item is not it

3 Recursively mine the FP-Tree:I If the tree has only one branch, all combination of the items in that branch are frequent item

setsI If the tree has more than one branch:

F Build a new conditional FP-Tree (new recursion) by added the item with the least support toitemset I

I Recursively mine by going to 3 until no new items are in the conditional treeI Goto 2 until all items are done


FP-Growth Algorithm

Complete FP-tree (top le�), element e removed (top right), and conditional tree for e (bo�om le�).


Sequence Pa�ern MiningMining pa�erns in temporal data


Sequence Pa�ern Mining

Initial definitionGiven a set of sequences, where each sequence consists of a list of elements and each elementconsists of a set of items, and given a user-specified min support threshold, sequential pa�ernmining is to find all frequent subsequences, i.e., the subsequences whose occurrence frequencyin the set of sequences is no less than min support.

- R. Agrawal and R. Srikant, “Mining Sequential Pa�erns,” Proc. 1995 Int’l Conf. Data Eng. (ICDE ’95), pp. 3-14, Mar.1995.

Note #1: This definition can then be expanded to include all sorts of constraints.Note #2: Each element in a sequence is an itemset.



Applications & Use-CasesShopping cart analysis

I e.g. common shopping sequences

DNA sequence analysis(Web)Log-file analysis

I Common access pa�erns

Derive features for subsequent analysis



Transaction database vs. sequence database

ID Itemset

1 a, b, c2 b, c3 a, e, f, g4 c, f, g

ID Sequence

1 <a(abc)(ac)d(cf)>2 <(ad)c(bc)(ae)>3 <(ef)(ab)(df)cb>4 <eg(af)cbc>



Subsequence & Super-sequenceA sequence is an ordered list of elements, denoted < x1 x2 . . . xk >

Given two sequences α =< a1 a2 . . . an > and β =< b1 b2 . . . bm >α is called a subsequence of β, denoted as α v β

I if there exist integers 1 ≤ j1 < j2 < · · · < jn ≤ m such that a1 ⊆ bj1 , a2 ⊆ bj2 , . . . , an ⊆ bjnβ is a super-sequence of α

I e.g. α =< (ab), d > and β =< (abc), (de) >Note #1: A database transaction (single sequence) is defined to contain a sequence, if that sequence is a subsequence.Note #2: Frequently found subsequences (within a transaction database) are called sequential pa�erns.



Initial approaches to sequence pa�ern mining relied on the a-priory principleI If a sequence S is not frequent, then none of the super-sequences of S is frequent

Generalized Sequential Pa�ern Mining (GSP) is one example of such algorithmsDisadvantages:

I Requires multiple scans within the databaseI Potentially many candidates


FreeSpan - Algorithm

FreeSpan - Frequent Pa�ern-Projected Sequential Pa�ern MiningInput: minimal support, sequence database

Output: all sequences with a frequency equal or higher than the minimal support

Main idea: Recursively reduce (project) the database instead of scanning the wholedatabase multiple timesProjection

I Reduced version of a database consisting just a set of sequencesI Thus each sequence in the projected database contains the sequence

Keep the sequences sortedI All sequences are kept in an fixed orderI Starting with the most frequent to the least frequent

Similar divide and conquer approach than in FP-Growth, but uses an ordered matrix datastructure instead of tree.


PrefixSpan - Algorithm

PrefixSpan - IdeaProblem in FreeSpan: The projected database size dose only shrink in each iteration byremoval of the infrequent items

I Only consider frequent prefixes and not all prefixes to make it shrink faster

PrefixSpan - Algorithm1 Calculate the support for each item. Count it once per sequence, not per occurrence in

sequence2 For each item in increasing order of support: Make the item to an on-element prefix pi and

do3 Generate a projected database for pi4 Find the frequent items f having the current prefix pi and extend pi with f to pi+1

5 Goto 2 for all elements in pi+1 until pi+1 is empty


Alternatives & Remarks… and remarks


Remarks

Problems of frequent association rulesRules with high confidence, but also high support

I e.g. use significance test

Confidence is relative to the frequency of the premiseI Hard to compare multiple association rules

Redundant association rules

Many frequent itemsets (o�en even more than transactions)


Remarks

How to deal with numeric a�ributes?Binning (discretization) on predefined ranges

Distribution-based binning

Extend algorithm to allow numeric a�ributes, e.g. Approximate Association Rule Mining

Frequent closed pa�ern miningA pa�ern is called closed, if there is no superset that satisfies the minimal support

Produces a smaller number of pa�erns

Similar maximum pa�ern mining


Further Algorithms

Frequent tree mining

Frequent graph mining

Constraint based miningApproximate pa�ern mining

I e.g. picking a single representative for a group of pa�ernsI … by invoking a clustering algorithm

Mining cyclic or periodic pa�erns

J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pa�ern mining: current status and future directions,” Data Min. Knowl. Discov., vol.15, no. 1, pp. 55–86, Jan. 2007.


Connection to ML

Connection to other tasks in machine learningFrequent pa�ern based classification

I Mine the relation between features/items and the labelI … useful for feature selection/engineering

Frequent pa�ern based clusteringI Clustering in high dimensions is a challenge, e.g. textI … use frequent pa�erns as a dimensionality reduction


Evaluation & Tools… for pa�ern mining


Evaluation

EvaluationUsually there are many frequent itemsets and rules found.

I Which ones are the interesting once?I Which ones are spurious?

Support & ConfidenceI Support and Confidence are defined on the primes that the occurrence of the elements is

statistically independent from each other.I This is not true in the general case!I If there is a occurrence frequencies are mismatched they tend to fail.


Evaluation

Contingency table

Y YX f11 f10 f1+X f01 f00 f0+

f+1 f+0 N

f11 support of X and Y




f+1 & f+1 sum of column

f1+ & f0+ sum of column


Evaluation

Problems with Confidence

Co�ee Co�eeTea 15 5 20Tea 75 5 80

90 10 100

Association Rule: Tea→ Co�ee

Conf(Tea→ Co�ee) = 1520 = 0.75

But P(Co�ee) = 0.9 meaning knowing that a persondrinks tea reduces the probability that the persondrinks co�ee considerably

P(Co�ee|Tea) = 7580 = 0.9375)

What rules to look for?Conf(X → Y) should be high, so that the rule is more likely to be true than false

Conf(X → Y) > Support(Y): Otherwise, the rule will be misleading since having item Xactually reduces the chance of having item Y in the same transaction

There are loads of such measures!


Evaluation

Li� and Interest

Lift = P(Y |X)P(Y) Interest = P(Y ,X)

P(Y)P(Y)

Lift is used for rules

Interest is used for itemsets

Also called Interest Factor (IF) or Interest (I)

It measures the ratio of the support of a pa�ern (numerator) against its baseline support(denominator) under the independence assumption.

Lift = 1 implies statistical independence


Evaluation

Problems with Confidence

Co�ee Co�eeTea 15 5 20Tea 75 5 80

90 10 100

Association Rule: Tea→ Co�ee

Conf(Tea→ Co�ee) = 1520 = 0.75

But P(Co�ee) = 0.9

Lift(Tea→ Co�ee) = 7590 = 0.8333)

Lift is smaller than 1, so there is a slightly negativerelationship between tea drinkers and co�eedrinkers.


Tools for pa�ern Mining

Java ToolsSPMF - http://www.philippe-fournier-viger.com/spmf/

Python ToolsPyFIM - http://www.borgelt.net/pyfim.html

PrefixSpan - https://pypi.org/project/prefixspan/

R Toolsarules - https://cran.r-project.org/web/packages/arules/index.html

arulesSequences -https://cran.r-project.org/web/packages/arulesSequences/index.html

Spark (Scala, Java, Python & R)Spark ML -https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html


http://www.philippe-fournier-viger.com/spmf/

http://www.borgelt.net/pyfim.html

https://pypi.org/project/prefixspan/

https://cran.r-project.org/web/packages/arules/index.html

https://cran.r-project.org/web/packages/arulesSequences/index.html

https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html

The EndNext: Deep Learning

Recommended book: P-N. Tan, M. Steinbach, A. Karpatne, V. Kumar: Introduction to DataMining. Pearson, Second Edition, 2018. ISBN 9780134545943https://www-users.cs.umn.edu/~kumar001/dmbook/index.php Chapter 5 “AssociationAnalysis: Basic Concepts and Algorithms” available on the web page

Christian Borgelt Slides: http://www.borgelt.net/teach/fpm/slides.html (516 slides)http://www.is.informatik.uni-duisburg.de/courses/im_ss09/folien/

MiningSequentialPatterns.ppt


https://www-users.cs.umn.edu/~kumar001/dmbook/index.php

http://www.borgelt.net/teach/fpm/slides.html

http://www.is.informatik.uni-duisburg.de/courses/im_ss09/folien/MiningSequentialPatterns.ppt

http://www.is.informatik.uni-duisburg.de/courses/im_ss09/folien/MiningSequentialPatterns.ppt

pattern mining - knowledge discovery and data...

Documents