Lecture 14 - Advanced Topics in Association Rule Mining (transcript)
Introduction to Machine Learning
Lecture 14: Advanced Topics in Association Rule Mining
Albert Orriols i Puig ([email protected])
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
Recap of Lecture 13

The ideas come from market basket analysis (MBA). Let's go shopping!

Customer 1: Milk, eggs, sugar, bread
Customer 2: Milk, eggs, cereal, bread
Customer 3: Eggs, sugar

What do my customers buy? Which products are bought together?

Aim: Find associations and correlations between the different items that customers place in their shopping basket.
Recap of Lecture 13

Database TDB:
  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
          L1: {A}:2, {B}:3, {C}:3, {E}:3
2nd scan, C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
          L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
3rd scan, C3: {B,C,E}
          L3: {B,C,E}:2
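As a concrete sketch of the trace above, the following minimal Apriori implementation (function and variable names are illustrative, not from the lecture) reproduces C1/L1 through L3 for the TDB with minimum support 2:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    # C1: every single item is a candidate
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count support by scanning the database once per level
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Lk -> C(k+1): join frequent k-itemsets whose union has k+1 items
        prev = list(level)
        candidates = list({a | b for a, b in combinations(prev, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

tdb = [frozenset("ACD"), frozenset("BCE"), frozenset("ABCE"), frozenset("BE")]
freq = apriori(tdb, min_sup=2)
# e.g. freq[frozenset("BCE")] == 2, matching L3 on the slide
```

Note that this sketch relies on the support count to discard candidates such as {A,B,C}; textbook Apriori would additionally prune them before counting because a subset ({A,B}) is already infrequent.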
Recap of Lecture 13: Challenges

- Apriori scans the database multiple times
- Most often, there is a high number of candidates
- Support counting for the candidates can be time-expensive

Several methods try to improve these points by:
- Reducing the number of scans of the database
- Shrinking the number of candidates
- Counting the support of candidates more efficiently
Today's Agenda

Starting a journey through some advanced topics in ARM:
- Mining frequent patterns without candidate generation
- Multiple-level association rules
- Sequential pattern mining
- Quantitative association rules
- Mining class association rules
- Beyond support & confidence
- Applications
Revisiting Candidate Generation

Remember Apriori? It uses the previous frequent (k-1)-itemsets to generate the candidate k-itemsets, and counts the support of each candidate by scanning the database.

The bottleneck in the process is candidate generation. Suppose 100 items:
- First level of the tree: 100 nodes
- Second level of the tree: C(100, 2) = 4,950 nodes
- In general, the number of candidate k-itemsets is C(100, k)
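To see the blow-up concretely, a quick check with Python's `math.comb` (the numbers follow directly from the binomial coefficients above):

```python
from math import comb

print(comb(100, 1))   # 100 candidate 1-itemsets
print(comb(100, 2))   # 4950 candidate 2-itemsets
print(comb(100, 5))   # 75287520 candidate 5-itemsets

# Summing over all k gives every possible itemset over 100 items:
total = sum(comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
```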
Can We Avoid Candidate Generation?

Build an auxiliary structure that gathers statistics about the itemsets in order to avoid candidate generation: an FP-tree.
- Avoids multiple scans of the data
- Follows a divide-and-conquer methodology
- Avoids candidate generation

Outline of the process:
1. Generate the FP-tree
2. Mine the FP-tree
Building the FP-Tree

TID  Items               Sorted FIS
1    {F,A,C,D,G,I,M,P}   {F,C,A,M,P}
2    {A,B,C,F,L,M,O}     {F,C,A,B,M}
3    {B,F,H,J,O}         {F,B}
4    {B,C,K,S,P}         {C,B,P}
5    {A,F,C,E,L,P,M,N}   {F,C,A,M,P}

Scan the DB for the first time and identify the frequent items: <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)>. Then sort the frequent items of each transaction by their frequency, which gives the last column.
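The first scan and the reordering can be sketched as follows. The global item order <F, C, A, B, M, P> is the one fixed on the slide; ties between equally frequent items (F/C and A/B/M/P) could be broken in other ways, and here we simply adopt the slide's order:

```python
from collections import Counter

db = [list("FACDGIMP"), list("ABCFLMO"), list("BFHJO"),
      list("BCKSP"), list("AFCELPMN")]
min_sup = 3

# First scan: count each item's frequency and keep the frequent ones
counts = Counter(i for t in db for i in t)
frequent = {i for i, n in counts.items() if n >= min_sup}

# Global order from the slide: <F, C, A, B, M, P>
slide_order = "FCABMP"

def sort_fis(t):
    """Drop infrequent items and sort the rest by the global order."""
    return sorted((i for i in t if i in frequent), key=slide_order.index)

sorted_db = [sort_fis(t) for t in db]
# sorted_db now matches the 'Sorted FIS' column of the table
```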
Building the FP-Tree (second scan of the DB, inserting one transaction at a time):

After reading TID 1 ({F,C,A,M,P}): a single path
  root -> F:1 -> C:1 -> A:1 -> M:1 -> P:1

After reading TID 2 ({F,C,A,B,M}): the shared prefix F,C,A is reused, and B:1 -> M:1 branches off A
  root -> F:2 -> C:2 -> A:2 -> M:1 -> P:1
                             -> B:1 -> M:1

After reading TID 3 ({F,B}): F is incremented to F:3 and a new child B:1 hangs from it

After reading TID 4 ({C,B,P}): no shared prefix, so a new branch starts at the root
  root -> C:1 -> B:1 -> P:1

After reading TID 5 ({F,C,A,M,P}): the whole path already exists, so only the counts are incremented
  root -> F:4 -> C:3 -> A:3 -> M:2 -> P:2
Building the FP-Tree: the complete tree

root
  F:4
    C:3
      A:3
        M:2
          P:2
        B:1
          M:1
    B:1
  C:1
    B:1
      P:1

Header table (items F, C, A, B, M, P): build an index to access the nodes quickly and traverse the tree; each entry links together all the nodes of the same item.
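The construction above can be sketched in a few lines of Python. `Node`, `build_fp_tree`, and the header-table layout are illustrative choices, not a prescribed API:

```python
class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(sorted_db):
    """Insert each frequency-sorted transaction, sharing common prefixes."""
    root = Node(None, None)
    header = {}                       # item -> list of its nodes (node-links)
    for trans in sorted_db:
        node = root
        for item in trans:
            if item in node.children:
                node.children[item].count += 1   # prefix shared: bump count
            else:
                child = Node(item, node)          # new branch
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
    return root, header

sorted_db = [list("FCAMP"), list("FCABM"), list("FB"),
             list("CBP"), list("FCAMP")]
root, header = build_fp_tree(sorted_db)
# root.children['F'].count == 4; P appears in two nodes, with counts 2 and 1
```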
Mining the FP-Tree

Properties used to mine the FP-tree:

Node-link property: all possible itemsets in which the frequent item a is included can be found by following a's node-links.

Example: P has a support of 3. Following P's node-links, there are two paths in the FP-tree for node P:
1. {F,C,A,M,P}, where P has count 2
2. {C,B,P}, where P has count 1
Mining the FP-Tree

Prefix path property: to calculate the frequent patterns for a node a in a path P, only the prefix subpath of node a in P needs to be accumulated, and the frequency count of every node in the prefix path should carry the same count as node a.

Example on the path (F:4, C:3, A:3, M:2, P:2):
- Take the prefix of the path up to M: (F:4, C:3, A:3)
- Adjust the counts to M's count of 2: (F:2, C:2, A:2)
- So F, C, and A co-occur with M
Mining the FP-Tree

Fragment growth: let α be an itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then the support of α ∪ β is equivalent to the support of β in B.

For M, the conditional pattern base contains:
  (F:2, C:2, A:2)
  (F:1, C:1, A:1, B:1)

Therefore, the frequent patterns with M include {(F,C,A,M):2}, {(F,C,M):2}, …
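Putting the properties together, the conditional pattern base of an item can be extracted by walking each of its node-links up to the root. The small builder below just recreates the example tree; all names are illustrative:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build(sorted_db):
    """Compact FP-tree builder mirroring the construction slides."""
    root, header = Node(None, None), {}
    for trans in sorted_db:
        node = root
        for item in trans:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    """Collect (prefix path, count) pairs via the item's node-links."""
    base = []
    for node in header[item]:           # node-link property
        path, p = [], node.parent
        while p.item is not None:       # stop at the root
            path.append(p.item)
            p = p.parent
        if path:
            # prefix path property: every prefix node carries node's count
            base.append((path[::-1], node.count))
    return base

_, header = build([list("FCAMP"), list("FCABM"), list("FB"),
                   list("CBP"), list("FCAMP")])
cpb_m = conditional_pattern_base('M', header)
# [(['F','C','A'], 2), (['F','C','A','B'], 1)] — the slide's base for M
```

FP-growth then recurses: it builds a small conditional FP-tree from each such base and mines it the same way, which is the divide-and-conquer step mentioned earlier.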
Is FP-growth Faster than Apriori?

As the support threshold goes down, the number of itemsets increases dramatically. FP-growth does not need to generate candidates and test them.
Is FP-growth Faster than Apriori?

Both FP-growth and Apriori scale linearly with the number of transactions, but FP-growth is more efficient.
Next Class

Advanced topics in association rule mining