AR mining
Implementation and comparison of three AR mining algorithms
Xuehai Wang, Xiaobo Chen, Shen Chen
CSCI6405 class project
Outline
• Motivation
• Dataset
• Apriori based hash tree algorithm
• FP-tree algorithm
• Conclusion
• Reference
Motivation
• Make the time to generate rules as short as possible!
• Understand the three algorithms
– Apriori algorithm
– Apriori with hash tree algorithm
– FP-tree algorithm
• Learn how to improve an algorithm
Dataset
• IBM dataset generator
– Can set item number
– Can set minimal support
– Can set dataset size
Tid  Items
1    2 5 8 9
2    3 4 6 7 12
Apriori principle
• Apriori principle
– A candidate generation-and-test approach [4]
– Every subset of a frequent itemset must also be frequent
– If a set is infrequent, its supersets are not generated or tested
• But there are still some places that can be improved
– Counting the support
– Number of I/O scans
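The generation-and-test step can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `apriori_gen` and the sample frequent 2-itemsets are assumptions chosen for the example. Candidates of size k are built by joining frequent (k-1)-itemsets, and any candidate with an infrequent (k-1)-subset is dropped before it is ever counted.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets (hypothetical helper)."""
    prev = set(frequent_prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            # Apriori principle: every (k-1)-subset of a candidate must be frequent,
            # otherwise the candidate is never generated or tested.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# Illustrative input: four frequent 2-itemsets.
frequent_2 = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 6)]}
print(sorted(sorted(c) for c in apriori_gen(frequent_2)))
```

Here {1, 3, 6} and {2, 3, 6} are pruned because {1, 6} and {2, 6} are not in the frequent set, so only {1, 2, 3} survives as a candidate.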
Apriori Hash Tree Alg
• The number of candidate k-itemsets is l
• There are n transactions
• The average transaction size is m
• Cost of calculating the support counts:
– Original Apriori algorithm: $O\left(n \cdot l \cdot \binom{m}{k}\right)$
– With hash tree: $O\left(n \cdot \log(l) \cdot \binom{m}{k}\right)$
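The two counting costs can be compared with a small sketch. It assumes the (mk) term counts the k-item subsets of a transaction of size m. The function names and the sample data are hypothetical. A Python dict stands in for the hash tree, so each lookup here costs constant time on average rather than the log(l) of a tree walk.

```python
from itertools import combinations

def count_linear_scan(transactions, candidates, k):
    """Compare every k-subset of every transaction against all l candidates."""
    counts = {c: 0 for c in candidates}
    comparisons = 0
    for t in transactions:                           # n transactions
        for subset in combinations(sorted(t), k):    # C(m, k) subsets each
            for c in candidates:                     # l candidates
                comparisons += 1
                if c == subset:
                    counts[c] += 1
    return counts, comparisons

def count_with_lookup(transactions, candidates, k):
    """Look each k-subset up in an index instead of scanning the candidate list."""
    counts = {c: 0 for c in candidates}
    lookups = 0
    for t in transactions:
        for subset in combinations(sorted(t), k):
            lookups += 1
            if subset in counts:                     # dict lookup, stand-in for the hash tree
                counts[subset] += 1
    return counts, lookups

# Illustrative data: candidates are sorted tuples so they match combinations() output.
TRANSACTIONS = [[1, 2], [1, 3, 6], [1, 2, 3], [2, 4], [2, 3, 6], [5, 6]]
CANDIDATES = [(1, 2), (1, 3), (1, 6), (2, 3), (2, 6), (3, 6)]

slow_counts, comparisons = count_linear_scan(TRANSACTIONS, CANDIDATES, 2)
fast_counts, lookups = count_with_lookup(TRANSACTIONS, CANDIDATES, 2)
print(comparisons, lookups, slow_counts == fast_counts)
```

Both functions return the same counts; only the work per subset differs.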
Apriori Hash Tree Alg
• Candidates are stored in a hash tree structure
Tid Items
1 1 2
2 1 3 6
3 1 2 3
4 2 4
5 2 3 6
6 5 6
[Figure: 1-itemset candidate hash tree under construction; each node is labeled item(count), e.g. 1(2), 2(1), 3(1)]
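A hash tree for candidates might look like the sketch below. It is a simplified illustration, not the structure used in the project. The class name `HashTree`, the hash function `item % branching`, and the leaf split threshold are all assumptions. Interior nodes hash on the item at their depth, leaves hold candidates with their counts, and counting walks the tree once per k-subset of a transaction instead of using the usual recursive traversal.

```python
from itertools import combinations

class HashTree:
    """Candidates of length k live in leaves; interior nodes hash on the item at their depth."""

    def __init__(self, k, branching=3, leaf_size=2):
        self.k = k
        self.branching = branching      # assumed hash function: item % branching
        self.leaf_size = leaf_size      # assumed split threshold
        self.root = self._new_leaf()

    @staticmethod
    def _new_leaf():
        return {"children": None, "counts": {}}

    def _child(self, node, item, create=False):
        bucket = item % self.branching
        if create:
            return node["children"].setdefault(bucket, self._new_leaf())
        return node["children"].get(bucket)

    def insert(self, candidate):
        candidate = tuple(sorted(candidate))
        node, depth = self.root, 0
        while node["children"] is not None:          # walk down interior nodes
            node = self._child(node, candidate[depth], create=True)
            depth += 1
        node["counts"][candidate] = 0
        # Split an overfull leaf, unless all k items have already been hashed.
        if len(node["counts"]) > self.leaf_size and depth < self.k:
            stored, node["counts"], node["children"] = node["counts"], {}, {}
            for cand, count in stored.items():
                self._child(node, cand[depth], create=True)["counts"][cand] = count

    def add_transaction(self, transaction):
        for subset in combinations(sorted(set(transaction)), self.k):
            node, depth = self.root, 0
            while node is not None and node["children"] is not None:
                node = self._child(node, subset[depth])
                depth += 1
            if node is not None and subset in node["counts"]:
                node["counts"][subset] += 1

    def counts(self):
        result, stack = {}, [self.root]
        while stack:
            node = stack.pop()
            if node["children"] is None:
                result.update(node["counts"])
            else:
                stack.extend(node["children"].values())
        return result

tree = HashTree(k=2)
for cand in [(1, 2), (1, 3), (1, 6), (2, 3), (2, 6), (3, 6)]:
    tree.insert(cand)
for t in [[1, 2], [1, 3, 6], [1, 2, 3], [2, 4], [2, 3, 6], [5, 6]]:
    tree.add_transaction(t)
print(sorted(tree.counts().items()))
```

Only the leaf reached by hashing a subset is inspected, which is what saves comparisons relative to scanning every candidate.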
Apriori Hash Tree Alg
Tid Items
1 1 2
2 1 3 6
3 1 2 3
4 2 4
5 2 3 6
6 5 6
1-itemsets, min support = 2 (candidate counts, item(count)):
1(3)  3(3)  4(1)
2(4)  5(1)  6(3)
Apriori Hash Tree Alg
Tid Items
1 1 2
2 1 3 6
3 1 2 3
4 2 4
5 2 3 6
6 5 6
2-itemsets, min support = 2 (candidate counts, itemset(count)):
1 2(2)  1 3(2)  1 6(1)
2 3(2)  2 6(1)  3 6(2)
3-itemsets, min support = 2:
1 2 3(1)
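The counts on these slides can be reproduced with a short level-wise loop. This is a sketch under assumptions: the function name `candidate_counts_by_level` is hypothetical, and support is counted by a plain scan rather than with a hash tree.

```python
from itertools import combinations

TRANSACTIONS = [{1, 2}, {1, 3, 6}, {1, 2, 3}, {2, 4}, {2, 3, 6}, {5, 6}]

def candidate_counts_by_level(transactions, min_support):
    """Return one dict per level k, mapping each candidate k-itemset to its support count."""
    candidates = {frozenset([i]) for t in transactions for i in t}
    levels, k = [], 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        levels.append(counts)
        frequent = {c for c, n in counts.items() if n >= min_support}
        k += 1
        # Join frequent (k-1)-itemsets; keep unions of size k whose (k-1)-subsets are all frequent.
        candidates = {
            a | b
            for a in frequent for b in frequent
            if len(a | b) == k
            and all(frozenset(s) in frequent for s in combinations(a | b, k - 1))
        }
    return levels

for k, counts in enumerate(candidate_counts_by_level(TRANSACTIONS, 2), start=1):
    print(k, sorted((sorted(c), n) for c, n in counts.items()))
```

The loop stops after level 3, because the only 3-itemset candidate, {1, 2, 3}, has support 1 and leaves nothing to join.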
FP-tree
• Since the dataset to be mined is usually very large, it is impossible to read all transactions into memory at once.
• But I/O scans are very time consuming.
• The FP-tree algorithm tries to fit all the information from the dataset into memory in compact form, and hence needs only two scans of the dataset.
FP-tree
• FP-tree algorithm and implementation
– By Xiaobo Chen
FP-tree (Frequent Pattern Tree)
• Mines frequent patterns without candidate generation
• Divide and conquer methodology: decompose mining tasks into smaller ones
FP-tree (Merits of FP-tree algorithm)
• Makes the most of shared common prefixes
• Complete and compact
– All information of a transaction is stored in a path
– The size is bounded by the dataset; consequently, the longest path corresponds to the longest pattern
– Compression ratio: over 100
FP-tree (Construction of FP-tree)
TID   freq. items bought
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, p, b}
500   {f, c, a, m, p}
min_support = 3
Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3
Tree after inserting TID 100:
root
  f:1
    c:1
      a:1
        m:1
          p:1
FP-tree (Construction (Cont’d))
TID   freq. items bought
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, p, b}
500   {f, c, a, m, p}
Tree after inserting TID 200:
root
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
FP-tree construction (Cont’d)
TID   freq. items bought
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, p, b}
500   {f, c, a, m, p}
min_support = 3
Header table (each entry also holds a head pointer to the item's nodes in the tree):
Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3
Final FP-tree:
root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
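The construction can be sketched in a few lines. The names `Node`, `build_fp_tree` and `show` are hypothetical, and the item order f, c, a, b, m, p is passed in explicitly from the header table because frequency ties could otherwise be broken differently. Each transaction is re-sorted into that order before insertion, so {c, p, b} is inserted as c, b, p.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_support, order=None):
    """Two passes: count items, then insert each transaction's frequent items in a fixed order."""
    freq = Counter(item for t in transactions for item in set(t))        # scan 1
    ranked = order or sorted(freq, key=lambda i: -freq[i])
    kept = [i for i in ranked if freq[i] >= min_support]
    rank = {item: r for r, item in enumerate(kept)}
    root, header = Node(None, None), {item: [] for item in kept}
    for t in transactions:                                                # scan 2
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])   # node-links for this item
            node = node.children[item]
            node.count += 1                                 # shared prefixes just bump counts
    return root, header

def show(node, depth=0):
    """Render the tree as indented 'item:count' lines."""
    label = "root" if node.item is None else f"{node.item}:{node.count}"
    lines = ["  " * depth + label]
    for child in node.children.values():
        lines += show(child, depth + 1)
    return lines

TRANSACTIONS = ["fcamp", "fcabm", "fb", "cpb", "fcamp"]
root, header = build_fp_tree(TRANSACTIONS, 3, order="fcabmp")
print("\n".join(show(root)))
```

The header lists play the role of the head pointers: summing the counts of the nodes linked from an item gives that item's frequency.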
FP-tree (Mining Frequent Patterns Using the FP-tree)
• General idea (divide-and-conquer)
– Recursively grow frequent pattern paths using the FP-tree
• Method
– For each item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Stop when the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
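The recursion can be sketched as below. This is a simplified illustration, not the project's code. The names `build_tree` and `fp_growth` are hypothetical, nodes are plain dicts, and the single-path shortcut is omitted so the recursion simply continues until the conditional tree is empty. Frequency ties are broken alphabetically, so the tree shape may differ from the one drawn on the slides, but the set of frequent patterns does not depend on that order.

```python
from collections import Counter, defaultdict

def build_tree(weighted_paths, min_support):
    """weighted_paths: list of (items, count). Returns the header: item -> list of its nodes."""
    freq = Counter()
    for items, count in weighted_paths:
        for item in set(items):
            freq[item] += count
    keep = {i for i, n in freq.items() if n >= min_support}
    root = {"item": None, "parent": None, "count": 0, "children": {}}
    header = defaultdict(list)
    for items, count in weighted_paths:
        node = root
        for item in sorted((i for i in set(items) if i in keep), key=lambda i: (-freq[i], i)):
            child = node["children"].get(item)
            if child is None:
                child = {"item": item, "parent": node, "count": 0, "children": {}}
                node["children"][item] = child
                header[item].append(child)
            child["count"] += count
            node = child
    return header

def fp_growth(weighted_paths, min_support, suffix=frozenset()):
    """Yield (pattern, support) for every frequent pattern that extends suffix."""
    header = build_tree(weighted_paths, min_support)
    for item, nodes in header.items():
        pattern = suffix | {item}
        yield pattern, sum(n["count"] for n in nodes)
        # Conditional pattern base: the prefix path above each node of this item.
        base = []
        for n in nodes:
            path, parent = [], n["parent"]
            while parent["item"] is not None:
                path.append(parent["item"])
                parent = parent["parent"]
            if path:
                base.append((path, n["count"]))
        # Recurse on the conditional FP-tree built from that base.
        yield from fp_growth(base, min_support, pattern)

TRANSACTIONS = ["fcamp", "fcabm", "fb", "cpb", "fcamp"]
found = list(fp_growth([(list(t), 1) for t in TRANSACTIONS], 3))
patterns = dict(found)
print(len(patterns), patterns[frozenset("fcam")])
```

No candidate itemsets are generated or tested here; every pattern that is yielded is already known to be frequent from the counts in the conditional tree.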
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Start with the last item in the order (i.e., p).
• Follow node pointers and traverse only the paths containing p.
• Accumulate all of the transformed prefix paths of that item to form a conditional pattern base.
Paths containing p:
root -> f:4 -> c:3 -> a:3 -> m:2 -> p:2
root -> c:1 -> b:1 -> p:1
Conditional pattern base for p: fcam:2, cb:1
Constructing a new FP-tree based on this pattern base leads to only one branch, c:3. Thus we derive only one frequent pattern extending p: cp.
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Move to the next least frequent item in the order, i.e., m.
• Follow node pointers and traverse only the paths containing m.
• Accumulate all of the transformed prefix paths of that item to form a conditional pattern base.
Paths containing m:
root -> f:4 -> c:3 -> a:3 -> m:2
root -> f:4 -> c:3 -> a:3 -> b:1 -> m:1
Conditional pattern base for m: fca:2, fcab:1
Constructing a new FP-tree based on this pattern base leads to the path fca:3. From this we derive the frequent patterns fm, cm, am, fcm, fam, cam, fcam.
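When a conditional FP-tree is a single path, its patterns can be enumerated directly. The helper below is a hypothetical sketch: it takes the path as (item, count) pairs and appends the suffix item to every non-empty combination of path items, with the support being the smallest count in the combination.

```python
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """path: list of (item, count) from root to leaf; returns {pattern: support}."""
    result = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = "".join(item for item, _ in combo) + suffix
            result[items] = min(count for _, count in combo)
    return result

m_patterns = patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)], "m")
print(sorted(m_patterns))
```

For the path fca:3 with suffix m this enumerates seven patterns, each with support 3: fm, cm, am, fcm, fam, cam and fcam.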
FP-tree (Conditional Pattern-Bases for the example)
Item  Conditional pattern-base      Conditional FP-tree
p     {(fcam:2), (cb:1)}            {(c:3)}|p
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}       Empty
a     {(fc:3)}                      {(f:3, c:3)}|a
c     {(f:3)}                       {(f:3)}|c
f     Empty                         Empty
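The conditional pattern bases for the example can be checked with a shortcut that skips the tree. This sketch assumes the item order f, c, a, b, m, p from the header table, and the function name is hypothetical. The prefix path above a node in the FP-tree equals the prefix of the ordered transaction that created it, so tallying those prefixes per item gives the same bases.

```python
from collections import Counter

ORDER = "fcabmp"                                   # header-table order from the example
TRANSACTIONS = ["fcamp", "fcabm", "fb", "cpb", "fcamp"]

def conditional_pattern_bases(transactions, order):
    """Map each item to a Counter of the prefixes that precede it in ordered transactions."""
    rank = {item: r for r, item in enumerate(order)}
    bases = {item: Counter() for item in order}
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        for pos, item in enumerate(ordered):
            if pos:                                # an empty prefix contributes nothing
                bases[item]["".join(ordered[:pos])] += 1
    return bases

bases = conditional_pattern_bases(TRANSACTIONS, ORDER)
for item in reversed(ORDER):
    print(item, dict(bases[item]))
```

A real implementation would follow the node-links in the tree instead, which avoids rescanning the transactions.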
FP-tree (Why is Frequent pattern Growth fast?)
• Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning:
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
FP-tree: Expected result: FP-growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec., 0 to 100) versus support threshold (%, 0 to 3) on dataset D1; series: D1 FP-growth runtime, D1 Apriori runtime]
Conclusion
• FP-tree is faster than the other two algorithms.
• Apriori and the hash tree algorithm are easier to implement.
– We can easily combine them with other methods or tools (e.g., distributed parallel computing).
• The parameters of the dataset are very important too.
– Density, size, min support …
References
• [1] Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann, 2001.
• [2] Jiawei Han, Jian Pei, and Yiwen Yin. "Mining Frequent Patterns without Candidate Generation". ACM SIGMOD, 2000.
• [3] N. Mamoulis. "Advanced Database Technologies" (slides).
• [4] Jiawei Han and Micheline Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2001.