
Page 1: Department of Computer Science University of California—Los Angeles


Department of Computer Science, University of California—Los Angeles

Pattern Decomposition Algorithm for Data Mining Frequent Patterns

Qinghua Zou

Advisor: Dr. Wesley Chu

Page 2: Department of Computer Science University of California—Los Angeles


Outline

1. The problem

2. Importance of mining frequent sets

3. Related work

4. PDA, an efficient approach

5. Performance analysis

6. Conclusion

Page 3: Department of Computer Science University of California—Los Angeles


1. The Problem

Frequent itemsets:
a, b, c, d, e
ab, ac, ad, bc, bd, be, cd, ce, de
abc, abd, bcd, bce, bde, cde
bcde

D:
1: a b c d e f
2: a b c g
3: a b d h
4: b c d e k
5: a b c

D is a transaction database with 5 transactions over 9 items: a, b, c, d, e, f, g, h, k.

Minimum support = 2

The problem: given a transaction dataset D and a minimum support, find all frequent itemsets.

Page 4: Department of Computer Science University of California—Los Angeles


1.1 More Terms for the Problem

Basic terms:
• I0 = {1, 2, …, n}: the set of all items, e.g., items in a supermarket, words in a sentence
• ti, transaction: a set of items, e.g., the items I bought yesterday in a supermarket, or a sentence in a document
• D, data set: a set of transactions
• I, itemset: any subset of I0
• sup(I), support of I: the number of transactions containing I
• frequent set: an itemset I with sup(I) >= minsup
• conf(r), confidence of a rule r: {1,2} => {3}: conf(r) = sup({1,2,3}) / sup({1,2})

The problem: given a minsup, how do we find all frequent sets quickly? E.g., the 1-item, 2-item, …, k-item frequent sets.
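To make these definitions concrete, here is a minimal Python sketch, added for this transcript (not from the original slides), that computes sup and conf over the example dataset D of the previous slide:

def sup(itemset, D):
    # support = the number of transactions containing the itemset
    return sum(1 for t in D if itemset <= t)

def conf(lhs, rhs, D):
    # confidence of the rule lhs => rhs
    return sup(lhs | rhs, D) / sup(lhs, D)

# the five transactions of D, including the infrequent items f, g, h, k
D = [set("abcdef"), set("abcg"), set("abdh"), set("bcdek"), set("abc")]
minsup = 2
print(sup({"b", "c"}, D) >= minsup)  # True: sup({b,c}) = 4
print(conf({"b", "c"}, {"d"}, D))    # sup({b,c,d}) / sup({b,c}) = 2/4 = 0.5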

Page 5: Department of Computer Science University of California—Los Angeles


2. Why Mine Frequent Sets?

Frequent pattern mining is the foundation for several essential data mining tasks:
• association, correlation, causality
• sequential patterns
• partial periodicity, cyclic/temporal associations

Applications:
• basket data analysis, cross-marketing, catalog design, loss-leader analysis
• clustering, classification, Web log sequences, DNA analysis, etc.
• text mining, e.g., finding multi-word combinations

Page 6: Department of Computer Science University of California—Los Angeles


3. Related Work

1994, Apriori: Rakesh Agrawal et al., IBM San Jose. Bottom-up search, using L(k) => C(k+1).

1995, DHP: Park et al., IBM T.J. Watson. Direct Hashing and Pruning.

1997, DIC: Sergey Brin et al., Stanford Univ. Dynamic Itemset Counting.

1997, MaxClique: Zaki et al., Univ. of Rochester. Uses cliques; L(2) => C(k), k = 3, …, m.

1998, Max-Miner: Roberto Bayardo, IBM San Jose. Top-down pruning.

1998, Pincer-Search: Lin et al., New York Univ. Both bottom-up and top-down search.

2000, FP-tree: Jiawei Han et al. Builds a frequent-pattern tree.

Page 7: Department of Computer Science University of California—Los Angeles


3.1 Apriori Algorithm Example

L1 = {a, b, c, d, e}
C2 = {ab, ac, ad, ae, bc, bd, be, cd, ce, de}
L2 = {ab, ac, ad, bc, bd, be, cd, ce, de}
C3' = {abc, abd, acd, bcd, bce, bde, cde}
C3 = {abc, abd, acd, bcd, bce, bde, cde}
L3 = {abc, abd, bcd, bce, bde, cde}
C4' = {abcd, bcde}
C4 = {bcde}   (abcd is pruned, since its subset acd is not in L3)
L4 = {bcde}

Answer = L1 ∪ L2 ∪ L3 ∪ L4

D:
1: a b c d e f
2: a b c g
3: a b d h
4: b c d e k
5: a b c

Page 8: Department of Computer Science University of California—Los Angeles


Apriori Algorithm

1)  L(1) = { large 1-itemsets };
2)  for ( k = 2; L(k-1) != null; k++ ) {
3)    C(k) = apriori-gen( L(k-1) );     // new candidates
4)    forall transactions t in D {
5)      Ct = subset( C(k), t );         // candidates contained in t
6)      forall candidates c in Ct
7)        c.count++;
8)    }
9)    L(k) = { c in C(k) | c.count >= minsup };
10) }
11) Answer = ∪ L(k);
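The pseudocode maps onto a few lines of Python. The sketch below is an illustration added to this transcript, using set unions for the join step of apriori-gen rather than the prefix join of the original paper; it reproduces the example of the previous slide:

from itertools import combinations

def apriori(D, minsup):
    D = [frozenset(t) for t in D]
    items = {i for t in D for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(1 for t in D if i in t) >= minsup}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # join: unions of two frequent (k-1)-itemsets that form a k-itemset
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in D if c <= t) for c in Ck}
        L.append({c for c, n in counts.items() if n >= minsup})
        k += 1
    return [lk for lk in L if lk]

D = [set("abcdef"), set("abcg"), set("abdh"), set("bcdek"), set("abc")]
for lk in apriori(D, 2):
    print(sorted("".join(sorted(c)) for c in lk))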

Page 9: Department of Computer Science University of California—Los Angeles


3.2 Pincer-Search Algorithm

01. L0 := null; k := 1; C1 := { {i} | i ∈ I0 }

02. MFCS := I0; MFS := null;

03. while Ck != null

04. read database and count supports for Ck and MFCS

05. remove frequent itemsets from MFCS and add them to MFS

06. determine frequent set Lk and infrequent set Sk

07. use Sk to update MFCS

08. generate new candidate set Ck+1 (join, recover, and prune)

09. k := k +1

10. return MFS
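Step 07 is the distinctive part of Pincer-Search. Here is a small sketch of the MFCS update, added for illustration (not the authors' code), assuming itemsets are represented as Python frozensets: every maximal candidate containing an infrequent itemset s is replaced by its subsets that drop one item of s.

def update_mfcs(mfcs, infrequent):
    # mfcs: list of frozensets; infrequent: the itemsets of Sk
    for s in infrequent:
        for m in [m for m in mfcs if s <= m]:
            mfcs.remove(m)
            for item in s:
                child = m - {item}
                # keep MFCS a set of maximal itemsets only
                if not any(child <= other for other in mfcs):
                    mfcs.append(child)
    return mfcs

mfcs = [frozenset("abcde")]
print(update_mfcs(mfcs, [frozenset("ae")]))
# -> the itemsets bcde and abcd, i.e., MFCS = {abcd, bcde} as in the example below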

Page 10: Department of Computer Science University of California—Los Angeles


Pincer-Search Example

L0 = {}, MFCS = {abcdefghk}, MFS = {}
C1 = {a, b, c, d, e, f, g, h, k}
L1 = {a, b, c, d, e}, MFCS = {abcde}, MFS = {}
C2 = {ab, ac, ad, ae, bc, bd, be, cd, ce, de}
L2 = {ab, ac, ad, bc, bd, be, cd, ce, de}, MFCS = {abcd, bcde}, MFS = {}
C3 = {abc, abd, acd, bcd, bce, bde, cde}
L3 = {abc, abd, bcd, bce, bde, cde}, MFCS = {}, MFS = {bcde}
C4' = {abcd, bcde}, C4 = {abcd}
L4 = {}

Answer = L1, L2, L3, L4, MFS

D:
1: a b c d e f
2: a b c g
3: a b d h
4: b c d e k
5: a b c

Page 11: Department of Computer Science University of California—Los Angeles


3.3 FP-Tree

min_support = 3

TID  Items bought               (Ordered) frequent items
100  f, a, c, d, g, i, m, p     f, c, a, m, p
200  a, b, c, f, l, m, o        f, c, a, b, m
300  b, f, h, j, o              f, b
400  b, c, k, s, p              c, b, p
500  a, f, c, e, l, p, m, n     f, c, a, m, p

Header table (item : frequency): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Order the frequent items in frequency-descending order.
3. Scan the DB again and construct the FP-tree.
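The three steps translate into a short two-pass construction. Below is an illustrative Python sketch added to this transcript; it omits the header table's node links and breaks frequency ties arbitrarily, so it is a sketch of the idea rather than a full FP-growth implementation.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(D, minsup):
    freq = Counter(i for t in D for i in t)            # pass 1: count items
    freq = {i: n for i, n in freq.items() if n >= minsup}
    # frequency-descending item order (ties broken arbitrarily here)
    order = sorted(freq, key=freq.get, reverse=True)
    root = Node(None, None)
    for t in D:                                        # pass 2: insert paths
        node = root
        for item in [i for i in order if i in t]:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, order

def show(node, depth=0):                               # print the tree
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

D = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
     set("bcksp"), set("afcelpmn")]
root, order = build_fptree(D, 3)
show(root)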

Page 12: Department of Computer Science University of California—Los Angeles


FP-tree Example

D:
1: a b c d e f
2: a b c g
3: a b d h
4: b c d e k
5: a b c

Ordered frequent items:
a b c d e
a b c
a b d
b c d e
a b c

Header table (item : frequency): a : 4, b : 4, c : 4, d : 3, e : 2

{}
├── a:4
│   └── b:4
│       ├── c:3
│       │   └── d:1
│       │       └── e:1
│       └── d:1
└── b:1
    └── c:1
        └── d:1
            └── e:1

Frequent itemsets are then found by recursively searching the tree, which is not easy.

Page 13: Department of Computer Science University of California—Los Angeles


4. PDA: Basic Idea

[Diagram: the data set D1 (5 transactions) is counted to obtain L1 and ~L1, then decomposed into a smaller data set D2 (4 transactions); calculating and decomposing alternate from pass to pass.]

Page 14: Department of Computer Science University of California—Los Angeles


4.1 PDA: Terms

Definitions:
• Ii, itemset: a set of items, e.g., {1,2,3}
• Pi, pattern: a set of itemsets, e.g., {{1,2,3}, {2,3,4}}
• occ(Pi): the number of times pattern Pi occurs
• t, transaction: a pair (Pi, occ(Pi)), e.g., ({{1,2,3}, {2,3,4}}, 2)
• D, data set: a set of transactions
• D(k): the data set used for generating the k-item frequent sets
• k-item independent: itemsets I1 and I2 are k-item independent if the number of their common items is less than k; e.g., {1,2,3} and {2,3,4} have the common set {2,3}, so they are 3-item independent but not 2-item independent.
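The independence test is a one-liner; this small Python check is added here purely as an illustration:

def k_independent(i1, i2, k):
    # true when the two itemsets share fewer than k items
    return len(set(i1) & set(i2)) < k

print(k_independent({1, 2, 3}, {2, 3, 4}, 3))  # True: only {2,3} in common
print(k_independent({1, 2, 3}, {2, 3, 4}, 2))  # False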

Page 15: Department of Computer Science University of California—Los Angeles


4.2 Decomposing Example

1) Suppose we are given a pattern p = abcdef:1 in D1, where L1 = {a, b, c, d, e} and f ∈ ~L1. To decompose p with ~L1, we simply delete f from p, leaving a new pattern abcde:1 in D2.

2) Suppose a pattern p = abcde:1 in D2 and ae ∈ ~L2. Since ae is infrequent and cannot occur in a future frequent set, we decompose p = abcde:1 into the composite pattern q = abcd, bcde:1 by removing e and a, respectively, from p.

3) Suppose a pattern p = abcd, bcde:1 in D3 and acd ∈ ~L3. Since acd is a subset of abcd, abcd is decomposed into abc, abd, bcd. Their sizes are less than 4, so they do not qualify for D4. The itemset bcde does not contain acd, so it remains unchanged and is included in D4. The result is bcde:1.
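Cases 1) through 3) are instances of one operation: removing an infrequent k-itemset s from an itemset p by dropping one item of s at a time. Here is a minimal sketch of that single step, added for illustration; the full PD-decompose also iterates over several infrequent sets and eliminates duplicate or non-maximal results, which this sketch omits.

def decompose(p, s, k):
    # p: an itemset that may contain the infrequent k-itemset s
    if not s <= p:
        return [p]                   # case 3), itemset bcde: unchanged
    out = []
    for i in sorted(s):              # drop one item of s at a time
        q = p - {i}
        if len(q) > k:               # shorter results cannot reach D(k+1)
            out.append(q)
    return out

print(decompose(frozenset("abcde"), frozenset("ae"), 2))
# -> bcde and abcd, as in case 2)
print(decompose(frozenset("abcd"), frozenset("acd"), 3))
# -> []: abc, abd, bcd all have only 3 items and are dropped, as in case 3)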

Page 16: Department of Computer Science University of California—Los Angeles


4.2 Continued

Split example: t.P = {{1,2,3,4,5,6,7,8}}, and 156 is found to be an infrequent 3-itemset. We split 156 into 15, 16, 56, i.e., we drop one item of 156 at a time. Result: {{1,2,3,4,5,7,8}, {1,2,3,4,6,7,8}, {2,3,4,5,6,7,8}}.

Quick-split example: t.P = {{1,2,3,4,5,6,7,8}}, and the infrequent 3-itemsets are {156, 157, 158, 167, 168, 178, 125, 126, 127, 128, 135, 136, 137, 138, 145, 146, 147, 148}. Build the max-common tree:

[Diagram: the max-common tree built from these infrequent 3-itemsets. Quick-split yields {{2,3,4,5,6,7,8}, {1,2,3,4}}; the two itemsets share only {2,3,4}, so they are 4-item independent.]

Page 17: Department of Computer Science University of California—Los Angeles


4.3 PDA: Algorithm

PD ( transaction-set T )
1:  D1 = { <t, 1> | t ∈ T }; k = 1;
2:  while ( Dk ≠ Φ ) do begin
3:    forall p in Dk do                   // counting
4:      forall k-itemsets s of p.IS do
5:        Sup(s|Dk) += p.Occ;
6:    decide Lk and ~Lk;
7:    Dk+1 = PD-rebuild(Dk, Lk, ~Lk);     // build Dk+1
8:    k++;
9:  end
10: Answer = ∪ Lk
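Lines 3–5 count every k-item subset of every pattern, weighted by the pattern's occurrence count; a subset appearing in several itemsets of the same composite pattern is counted only once, since the pattern stems from the same transactions. An illustrative sketch of this counting step, added here and run on D3 from the upcoming example (slide 4.5):

from collections import Counter
from itertools import combinations

def count_k_subsets(Dk, k):
    sup = Counter()
    for pattern, occ in Dk:          # pattern: a list of frozensets
        seen = set()                 # each subset counts once per pattern
        for itemset in pattern:
            seen |= {frozenset(s) for s in combinations(sorted(itemset), k)}
        for s in seen:
            sup[s] += occ
    return sup

D3 = [([frozenset("abcd"), frozenset("bcde")], 1),
      ([frozenset("abc")], 2),
      ([frozenset("abd")], 1),
      ([frozenset("bcde")], 1)]
sup = count_k_subsets(D3, 3)
print(sup[frozenset("abc")], sup[frozenset("acd")])  # 3 and 1, matching L3 and ~L3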

Page 18: Department of Computer Science University of California—Los Angeles


4.4 PDA: rebuilding

PD-rebuild ( Dk, Lk, ~Lk )
1:  Dk+1 = Φ; ht = an empty hash table;
2:  forall p in Dk do begin
3:    // qk and ~qk can be taken from the previous counting
      qk = { s | s in p.IS ∩ Lk }; ~qk = { s | s in p.IS ∩ ~Lk };
4:    u = PD-decompose(p.IS, ~qk);
5:    v = { s in u | s is k-item independent in u };
6:    add <u − v, p.Occ> to Dk+1;
7:    forall s in v do
8:      if s in ht then ht.s.Occ += p.Occ;
9:      else put <s, p.Occ> into ht;
10: end
11: Dk+1 = Dk+1 ∪ { p in ht };
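For k = 1 the rebuild reduces to deleting the infrequent items and merging identical results in the hash table (lines 7–9); that merging is how "a b c" reaches occurrence 2 in D2 of the next example. A minimal sketch of this special case, added for illustration (the general case uses PD-decompose and the k-item-independence test):

def rebuild1(D1, L1):
    ht = {}
    for itemset, occ in D1:
        q = frozenset(i for i in itemset if i in L1)   # drop items in ~L1
        if q:
            ht[q] = ht.get(q, 0) + occ                 # merge duplicates
    return ht

D1 = [(frozenset("abcdef"), 1), (frozenset("abcg"), 1), (frozenset("abdh"), 1),
      (frozenset("bcdek"), 1), (frozenset("abc"), 1)]
for p, occ in rebuild1(D1, set("abcde")).items():
    print("".join(sorted(p)), ":", occ)
# abcde : 1, abc : 2, abd : 1, bcde : 1 — exactly D2 in the next example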

Page 19: Department of Computer Science University of California—Los Angeles


4.5 PDA Example

D1:
1: a b c d e f : 1
2: a b c g : 1
3: a b d h : 1
4: b c d e k : 1
5: a b c : 1

L1 (IS : Occ): {a} 4, {b} 5, {c} 4, {d} 3, {e} 2
~L1: {f} 1, {g} 1, {h} 1, {k} 1

D2:
1: a b c d e : 1
2: a b c : 2
3: a b d : 1
4: b c d e : 1

L2: {ab} 4, {ac} 3, {ad} 2, {bc} 4, {bd} 3, {be} 2, {cd} 2, {ce} 2, {de} 2
~L2: {ae} 1

D3:
1: abcd, bcde : 1
2: a b c : 2
3: a b d : 1
4: b c d e : 1

L3: {abc} 3, {abd} 2, {bcd} 2, {bce} 2, {bde} 2, {cde} 2
~L3: {acd} 1

D4:
1: b c d e : 2

L4: {bcde} 2
~L4: empty

D5 = Φ

Page 20: Department of Computer Science University of California—Los Angeles


5. Experiments on Synthetic Databases

The benchmark databases are generated by a popular synthetic data generation program from the IBM Quest project.

Parameters:
• n: the number of different items (set to 1000)
• |T|: the average transaction size
• |I|: the average size of the maximal frequent itemsets
• |D|: the number of transactions
• |L|: the number of maximal frequent itemsets

T20-I6-1K: |T| = 20, |I| = 6, |D| = 1K
T20-I6-10K: |T| = 20, |I| = 6, |D| = 10K
T20-I6-100K: |T| = 20, |I| = 6, |D| = 100K

Page 21: Department of Computer Science University of California—Los Angeles


Comparison with Apriori

Figure 7. Execution time comparison between Apriori and PD vs. minimum support. [Two plots, for T10.I4.D100K and T25.I10.D100K: time (s, log scale) against minimum support from 2% down to 0.25%, with one curve for Apriori and one for PD.]

Page 22: Department of Computer Science University of California—Los Angeles


Time Distribution

Figure 8. Execution time comparison between Apriori and PD vs. passes. [Two plots, for T10.I4.D100K and T25.I10.D100K: time (s) per pass, 2nd through 9th, at minsup = 0.25%, with one curve for Apriori and one for PD.]

Page 23: Department of Computer Science University of California—Los Angeles


Scale-Up Experiment

Figure 9. Scalability comparison between Apriori and PD. [Plot for T25.I10: relative time against the number of transactions, 50K to 250K, at minsup = 0.75%, with one curve for Apriori and one for PD-Miner.]

Page 24: Department of Computer Science University of California—Los Angeles


Comparison with FP-tree

Figure 10. Performance comparison between FP-tree and PD for selected minimum supports. [Calibrated relative time against minimum support from 2% down to 0.25%, for D1 = T10.I4.D100K and D2 = T25.I10.D100K, with FP-tree and PD curves for each data set.]

Figure 11. Scalability comparison between FP-tree and PD. [Plot for T25.I10: calibrated relative time against the number of transactions, 60K to 200K, with one curve for FP-tree and one for PD.]

Page 25: Department of Computer Science University of California—Los Angeles


6. Conclusion

In PDA, the number of transactions shrinks quickly to zero. PDA shrinks both the number of transactions and the itemset length:

• transactions: merging identical patterns (summing their occurrence counts)
• itemsets: decomposing with the infrequent sets

Only one scan of the database. No candidate set generation. Long patterns can be found at any iteration.

Page 26: Department of Computer Science University of California—Los Angeles

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.

[2] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98, pp. 85-93.

[3] Zaki, M. J.; Parthasarathy, S.; Ogihara, M.; and Li, W. 1997. New Algorithms for Fast Discovery of Association Rules. In Proc. of the Third Int’l Conf. on Knowledge Discovery in Databases and Data Mining, pp. 283-286.

[4] Lin, D.-I and Kedem, Z. M. 1998. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set. In Proc. of the Sixth European Conf. on Extending Database Technology.

[5] Park, J. S.; Chen, M.-S.; and Yu, P. S. 1996. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM-SIGMOD Conf. on Management of Data, pp. 175-186.

[6] Brin, S.; Motwani, R.; Ullman, J.; and Tsur, S. 1997. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proc. of the 1997 ACM-SIGMOD Conf. On Management of Data, 255-264.

[7] J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets, Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.

[8] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), Dallas, TX, May 2000.

[9] Bomze, I. M., Budinich, M., Pardalos, P. M., and Pelillo, M. The maximum clique problem, Handbook of Combinatorial Optimization (Supplement Volume A), in D.-Z. Du and P. M. Pardalos (eds.). Kluwer Academic Publishers, Boston, MA, 1999.

[10] C. Bron and J. Kerbosch. Finding all cliques of an undirected graph. In Communications of the ACM, 16(9):575-577, Sept. 1973.

[11] Johnson D.B., Chu W.W., Dionisio J.D.N., Taira R.K., Kangarloo H., Creating and Indexing Teaching Files from Free-text Patient Reports. Proc. AMIA Symp 1999; pp. 814-818.

[12] Johnson D.B., Chu W.W., Using n-word combinations for domain specific information retrieval, Proceedings of the Second International Conference on Information Fusion – FUSION’99, San Jose, CA, July 6-9,1999.

[13] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the 21st VLDB Conference, 1995.

[14] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, Seattle, Washington, July 1994.

[15] H. Toivonen. Sampling Large Databases for Association Rules. In Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India, September 1996.
