
Page 1: Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework

Outline

• Introduction
  – Frequent patterns and the Rare Item Problem
  – Multiple Minimum Support Framework
  – Issues with Multiple Minimum Support Framework
• Related Work
• Proposed Approaches
  – Methodology to specify items’ MIS values
  – An algorithm to mine frequent patterns effectively
  – Mining frequent patterns in databases in which items’ frequencies vary widely
  – Mining rare periodic-frequent patterns
• Conclusions and Future Work

Related Work

• The minimum constraint model, which uses multiple minimum support constraints, has been discussed in the literature (KDD 1999, DSS 2006). This model still generates uninteresting frequent patterns if the items’ frequencies within a database vary widely.

• Other approaches consider only rarely occurring items and mine association rules containing only rare items. These approaches fail to discover association rules involving both frequent and rare items (DMKD 2006, MDM 2007, International Journal on Open Problems Computational Mathematics, 2009).


• Relative support, all-confidence, any-confidence, bond, lift, χ² and other interestingness measures have been proposed in the literature to discover both frequent patterns and association rules.

• Each measure has a selection bias that justifies the significance of an association rule.

• There exists no universally acceptable best measure to find interesting association rules in a database.

• Tan et al. (KDD 2001) have proposed techniques for selecting the right measure to mine association rules.

• Akshat et al. (COMAD 2010) have proposed techniques for selecting the right measure to mine rare association rules.



• Constraints have been used in the literature to reduce the number of frequent patterns in a database.

– Examples: maximal, closed, top-k and utility frequent patterns.

• Multi-level frequent pattern mining has been proposed using concept hierarchy.


Methodologies to Specify Items’ MIS values

• Liu et al. (KDD ’99) introduced a percentage-based methodology to specify items’ MIS values.

• Percentage-based methodology:
  – An item’s MIS value is a fixed percentage of its support.

  $MIS(i_j) = \max(\beta \cdot S(i_j),\ LS)$

  where
  $S(i_j)$ = support of an item $i_j$ in $I$,
  $LS$ = lowest MIS value an item can have,
  $\beta$ = user-specified constant in [0, 1].

• This methodology still suffers from the rare item problem.

Rare Item Problem in Percentage Based Methodology

• At a high β value, rare items have MIS values very close to their respective supports. As a result, rules involving rare items are missed.

• At a low β value, frequent items have MIS values far below their supports. This causes combinatorial explosion.

• Example

– Let S(a) = 100 and S(b) = 10.

• If β = 0.9, then MIS(a) = 90 and MIS(b) = 9. For item b, MIS(b) is very close to S(b). Hence, it is difficult for the item ‘b’ to combine with other items and generate frequent patterns.

• If β = 0.5, then MIS(a) = 50 and MIS(b) = 5. Although MIS(b) is now further from its support, the difference between S(a) and MIS(a) is very large. This causes combinatorial explosion.
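The arithmetic of this example can be reproduced with a short sketch. The function name `percentage_mis` and the value LS = 1 are assumptions made for illustration; the example above does not state LS.

```python
def percentage_mis(support, beta, ls):
    """Percentage-based MIS value: max(beta * support, ls)."""
    return max(beta * support, ls)

supports = {"a": 100, "b": 10}   # supports taken from the example
LS = 1                           # assumed lowest MIS; not given in the example

for beta in (0.9, 0.5):
    for item, s in supports.items():
        mis = percentage_mis(s, beta, LS)
        # The gap S - MIS is the slack an item has before it stops being frequent.
        print(f"beta={beta} item={item} MIS={mis:g} gap={s - mis:g}")
```

The gap between support and MIS scales with the support itself (10 versus 1 at β = 0.9, 50 versus 5 at β = 0.5), which is the non-uniformity the slide describes.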

Proposed Methodology to Specify Items’ MIS values

• Observation
  – A non-uniform difference between items’ support and MIS values is the cause of the rare item problem in the percentage-based methodology.

• Idea
  – Ensure a uniform difference between items’ support and MIS values.

• Support difference-based methodology
  – $MIS(i_j) = \max(S(i_j) - SD,\ LS)$
  – where SD is the support difference. SD can be user-specified or derived from the dataset as $SD = \alpha(1 - \beta)$,
  – where α is a statistic such as the mean or median of the items’ supports in the dataset, and β is a user-specified constant in [0, 1].
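A minimal sketch of the support difference-based methodology follows. It assumes that LS acts as a floor on the MIS value, i.e. MIS = max(S − SD, LS), which matches the definition of LS as the lowest MIS an item can have. The function names, the supports, and the values of LS and SD are hypothetical.

```python
from statistics import mean

def support_difference(supports, beta, central=mean):
    """SD = alpha * (1 - beta), with alpha a central statistic of the supports."""
    alpha = central(list(supports.values()))
    return alpha * (1 - beta)

def sd_mis(support, sd, ls):
    """Support difference-based MIS value, with LS as the floor (assumption)."""
    return max(support - sd, ls)

supports = {"a": 100, "b": 10}   # hypothetical supports
LS = 1                           # hypothetical lowest MIS
sd = 5                           # hypothetical user-specified SD

mis = {item: sd_mis(s, sd, LS) for item, s in supports.items()}
gaps = {item: supports[item] - mis[item] for item in supports}
print(mis)    # MIS value per item
print(gaps)   # the gap is the same for every item whose MIS is above LS
```

Unlike the percentage-based rule, the gap between support and MIS is the same constant SD for both the frequent and the rare item.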

Experimental Results

• Datasets
  1. Synthetic dataset
     – Total items: 870
     – Total number of transactions: 100,000
  2. Real-world dataset
     – Total items: 83
     – Total number of transactions: 298

• Parameter values:
  – LS = 0.1
  – α = mean of the supports of all frequent items
  – β = varied at 0.25, 0.5 and 0.9

• Algorithms
  – Apriori algorithm
  – MSApriori: uses the percentage-based methodology
  – IMSApriori: uses the support difference-based methodology

Table 3: SD values used in different datasets.

Experiment 1: Analysis of MIS values specified by both methods

Figure: MIS values specified by percentage-based methodology in synthetic dataset.

Figure: MIS values specified by support difference-based methodology in synthetic dataset.

Experiment 2: Generation of Frequent Patterns

Figure: Generation of frequent patterns in synthetic and retail datasets.


Improved Multiple Minimum Support Based Frequent Pattern Mining Approaches

• The frequent patterns discovered with the multiple minsups framework do not satisfy the downward closure property.

• This increases the search space for mining frequent patterns.

• Liu et al. (KDD ’99) introduced the multiple minimum support Apriori (MSApriori) algorithm to discover frequent patterns with the multiple minsups framework.

• The MSApriori algorithm suffers from the same performance problems as the Apriori algorithm.

• Hence, an FP-growth-like algorithm known as Conditional Frequent Pattern-growth (CFP-growth) was proposed in the literature (DSS 2006).

CFP-growth Algorithm

• The algorithm accepts items’ MIS values as the input parameter.

• Using the items’ MIS values as prior knowledge, it discovers frequent patterns with a single scan of the transactional database.

• The CFP-growth algorithm works as follows:

1. Sort the items in descending order of their MIS values. E.g.

Items  a   c   b   d   e   f   g   h
MIS    10  10  8   7   3   3   3   2

2. Using the sorted list of items, an FP-tree-like structure known as the MIS-tree is constructed by scanning every transaction in the database.

Figure 19: Construction of MIS-tree. (a) Before scanning the database. (b) After scanning first transaction. (c) After scanning second transaction. (d) After scanning every transaction.
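The tree construction can be sketched as a prefix tree in which each transaction is inserted with its items ordered by descending MIS. This is an illustrative sketch, not the published implementation: the class and function names, the tie-breaking by item name, and the three transactions are assumptions, and the header table with node links is omitted.

```python
class Node:
    """One MIS-tree node: an item, its count, and children keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_mis_tree(transactions, mis):
    # Rank items by descending MIS; ties broken by item name (an assumption).
    ranked = sorted(mis, key=lambda i: (-mis[i], i))
    order = {item: rank for rank, item in enumerate(ranked)}
    root = Node()
    support = {item: 0 for item in mis}
    for transaction in transactions:
        node = root
        # Insert the transaction's items along one path, in MIS order.
        for item in sorted(set(transaction), key=order.__getitem__):
            support[item] += 1
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root, support

mis = {"a": 10, "c": 10, "b": 8, "d": 7, "e": 3, "f": 3, "g": 3, "h": 2}
transactions = [["a", "c", "d"], ["b", "a"], ["e", "c", "a"]]  # hypothetical
root, support = build_mis_tree(transactions, mis)
print(root.children["a"].count)                  # transactions sharing prefix 'a'
print(sorted(root.children["a"].children))       # items that follow 'a'
```

Because every transaction is inserted regardless of item frequency, the item supports are collected in the same single pass that builds the tree.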

3. From the MIS-tree, the items that cannot generate any frequent pattern are removed using the following criterion.

“Items whose support is less than the lowest MIS value among all items cannot generate any frequent pattern.”

The lowest MIS value is 2. Therefore, the item ‘h’, whose support is less than 2, is removed from the MIS-tree.

Items  a   c   b   d   e   f   g   h
MIS    10  10  8   7   3   3   3   2
Sup.   12  9   9   8   5   3   2   1
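The pruning criterion above can be checked against the table with a few lines. The function name `cfp_growth_prune` is hypothetical; the MIS and support values are those shown in the table.

```python
mis     = {"a": 10, "c": 10, "b": 8, "d": 7, "e": 3, "f": 3, "g": 3, "h": 2}
support = {"a": 12, "c": 9,  "b": 9, "d": 8, "e": 5, "f": 3, "g": 2, "h": 1}

def cfp_growth_prune(mis, support):
    """Keep items whose support reaches the lowest MIS value among all items."""
    lowest_mis = min(mis.values())
    return [item for item in mis if support[item] >= lowest_mis]

kept = cfp_growth_prune(mis, support)
print(kept)  # ['a', 'c', 'b', 'd', 'e', 'f', 'g']
```

Only ‘h’ is removed; ‘g’ survives because its support (2) equals the lowest MIS value among all items.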

4. The resultant tree is known as the compact MIS-tree.

Figure: Compact MIS-tree created after pruning item ‘h’ from the MIS-tree.

5. The compact MIS-tree is mined using conditional pattern bases to discover the complete set of frequent patterns.

6. Since the downward closure property no longer holds, CFP-growth keeps building conditional pattern bases for a suffix pattern until the conditional pattern base is empty.

Figure: Mining frequent patterns from the MIS-tree.

Performance Issues in CFP-growth

1. The criterion used by CFP-growth to prune items from the MIS-tree still retains some items that cannot generate any frequent pattern.

• CFP-growth prunes the item ‘h’ and considers the items ‘a, b, c, d, e, f and g’ for generating frequent patterns.

• However, ‘g’ cannot generate any frequent pattern, as its support (2) is less than the lowest MIS value among all remaining items (3).

2. It searches some infrequent suffix patterns that cannot generate any frequent pattern at any higher order.

Items  a   c   b   d   e   f   g   h
MIS    10  10  8   7   3   3   3   2
Sup.   12  9   9   8   5   3   2   1

An Improved CFP-growth Algorithm: CFP-growth++

• Key observations:

1. In every frequent pattern, the item having the lowest MIS value is itself frequent (KDIR ’09).

2. In every frequent pattern, all non-empty subsets involving the item having the lowest MIS value are frequent (EDBT ’10).
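The second observation can be checked by brute force on a small database. The sketch below assumes the usual multiple minsups model, in which a pattern is frequent when its support is at least the lowest MIS value among its items; the function names, transactions and MIS values are hypothetical.

```python
from itertools import combinations

def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def is_frequent(pattern, transactions, mis):
    # Assumed model: the minsup of a pattern is the lowest MIS among its items.
    return support(pattern, transactions) >= min(mis[i] for i in pattern)

def check_observation(transactions, mis):
    """True if, for every frequent pattern, each subset that contains its
    lowest-MIS item is also frequent."""
    items = sorted(mis)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            if not is_frequent(pattern, transactions, mis):
                continue
            low = min(pattern, key=lambda i: (mis[i], i))
            rest = sorted(pattern - {low})
            for r in range(len(rest) + 1):
                for sub in combinations(rest, r):
                    subset = frozenset(sub) | {low}
                    if not is_frequent(subset, transactions, mis):
                        return False
    return True

transactions = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("a")]
mis = {"a": 3, "b": 2, "c": 2}
print(check_observation(transactions, mis))
```

The check succeeds because any such subset has the same minsup as the full pattern and at least the same support.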

Correctness of the observations

Four pruning techniques

1. Least minimum support (LMS): Any item whose support is less than the lowest MIS value among all frequent items cannot generate any frequent pattern at higher order.

2. Conditional minimum support: The minsup for any pattern generated from the conditional pattern base of a suffix item is the MIS value of that suffix item.

3. Conditional closure property: If a suffix pattern is infrequent, none of its super-suffix patterns can generate a frequent pattern at higher order.

4. Infrequent leaf node pruning: Leaf nodes belonging to infrequent items can be pruned because their conditional pattern bases cannot generate any frequent pattern.

Working of CFP-growth++ Algorithm

• Steps

1. Construction of MIS-tree

2. Construction of compact MIS-tree

3. Mining compact MIS-tree

• The first step is the same in the CFP-growth and CFP-growth++ algorithms.

• The second and third steps differ between the CFP-growth and CFP-growth++ algorithms.

• The techniques “least minimum support” and “infrequent leaf node pruning” are used to build the compact MIS-tree with only those items that can generate frequent patterns at higher order.

• The techniques “conditional minimum support” and “conditional closure property” are used to reduce the search space.

Step 1: Construction of MIS-tree

• The algorithm constructs the MIS-tree using the user-specified items’ MIS values.

Figure: MIS-tree constructed after scanning every transaction in the database.

Step 2: Construction of Compact MIS-tree

• Using least minimum support, CFP-growth++ prunes all those items that cannot generate any frequent pattern at higher order.

Figure: MIS-tree after completely pruning the items ‘g’ and ‘h’. Note that ‘g’ is not pruned in CFP-growth.
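The effect of least minimum support pruning can be sketched with the MIS and support values used in the CFP-growth example (items a to h). The function name `lms_prune` is hypothetical.

```python
mis     = {"a": 10, "c": 10, "b": 8, "d": 7, "e": 3, "f": 3, "g": 3, "h": 2}
support = {"a": 12, "c": 9,  "b": 9, "d": 8, "e": 5, "f": 3, "g": 2, "h": 1}

def lms_prune(mis, support):
    """Keep items whose support reaches the lowest MIS among frequent items."""
    # An item is frequent when its support is at least its own MIS value.
    frequent = [item for item in mis if support[item] >= mis[item]]
    # Raises ValueError if no item is frequent; then nothing can be mined.
    lms = min(mis[item] for item in frequent)
    kept = [item for item in mis if support[item] >= lms]
    return lms, kept

lms, kept = lms_prune(mis, support)
print(lms, kept)  # 3 ['a', 'c', 'b', 'd', 'e', 'f']
```

Here the least minimum support is 3 rather than 2, so both ‘g’ and ‘h’ are removed. The infrequent item ‘c’ is kept, since its support still reaches the LMS and it may take part in patterns whose minsup is set by a lower-MIS item.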

• Using infrequent leaf node pruning, the leaf nodes of the infrequent items are pruned from the MIS-tree. The resultant tree is known as the compact MIS-tree.

Figure: Compact MIS-tree generated after infrequent leaf node pruning.
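A rough sketch of infrequent leaf node pruning on a small prefix tree is given below. The tree paths, the class and function names, and the bottom-up traversal are assumptions for illustration; node counts and the header table are omitted, and the exact procedure used by CFP-growth++ may differ.

```python
class Node:
    """Minimal tree node: an item and children keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.children = {}

def add_path(root, items):
    """Insert one path of items below the root."""
    node = root
    for item in items:
        if item not in node.children:
            node.children[item] = Node(item)
        node = node.children[item]

def prune_infrequent_leaves(node, infrequent):
    """Remove leaf nodes whose item is infrequent, working bottom-up."""
    for child in list(node.children.values()):
        prune_infrequent_leaves(child, infrequent)
        if not child.children and child.item in infrequent:
            del node.children[child.item]

def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of items."""
    if not node.children and prefix:
        yield prefix
    for child in node.children.values():
        yield from paths(child, prefix + (child.item,))

root = Node()
for path in (["a", "c"], ["a", "b", "d"], ["c", "e"]):   # hypothetical paths
    add_path(root, path)

prune_infrequent_leaves(root, infrequent={"c"})
print(list(paths(root)))  # [('a', 'b', 'd'), ('c', 'e')]
```

The leaf ‘c’ under ‘a’ is removed, while the ‘c’ node that still has the child ‘e’ stays, because it lies on the prefix path of a frequent item.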

Step 3: Mining Compact MIS-tree

• Using conditional minimum support and the conditional closure property, the compact MIS-tree is mined using conditional pattern bases to discover the complete set of frequent patterns.

Figure: Mining Compact MIS-tree Using Conditional Pattern Bases.

Experimental Results

Table 4: Dataset characteristics.

Datasets

• Percentage-based methodology is used for specifying items’ MIS values.

• LS=minsup=0.1

• β = 1/α, with α varied from 1 to 20.

Experiment 1: Generation of Frequent Patterns

Figure: Generation of frequent patterns in different datasets.

Experiment 2: Runtime Requirements

Figure: Runtime taken by various algorithms in different datasets.

Experiment 3: Scalability Test

• β = 0.5 and LS = 0.1

• Experimental procedure

– Dataset: Kosarak

– The dataset was divided into five parts of 0.2 million transactions each.

– The parts were added one after another to form datasets of increasing size.

Figure: Runtime taken by different algorithms.
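The experimental procedure can be sketched as follows. The helper names `cumulative_parts` and `time_runs` are hypothetical, and `mine` stands for whichever mining algorithm is being timed; a list of integers is used here in place of the real transactions.

```python
import time

def cumulative_parts(transactions, parts=5):
    """Yield growing prefixes of the dataset, one more part each time."""
    size = len(transactions) // parts
    for k in range(1, parts + 1):
        # The last prefix takes any remainder so the full dataset is covered.
        end = len(transactions) if k == parts else k * size
        yield transactions[:end]

def time_runs(transactions, mine, parts=5):
    """Time one mining run per cumulative prefix; returns (size, seconds)."""
    results = []
    for chunk in cumulative_parts(transactions, parts):
        start = time.perf_counter()
        mine(chunk)
        results.append((len(chunk), time.perf_counter() - start))
    return results

data = list(range(10))   # stand-in for a list of transactions
sizes = [len(chunk) for chunk in cumulative_parts(data)]
print(sizes)  # [2, 4, 6, 8, 10]
```

Plotting the measured seconds against the prefix sizes gives a scalability curve of the kind the figure reports.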

Summary of the Contributions

| Topic | Existing methodology | Performance problem | Proposed methodology |
|---|---|---|---|
| Specifying items’ MIS values | Percentage-based methodology | Causes the rare item problem, as it does not maintain a uniform difference between items’ support and MIS values | Support difference-based methodology |
| Patterns do not satisfy the downward closure property | CFP-growth | 1. The constructed tree is not efficient. 2. The search space is huge, as it searches using items that cannot generate any frequent pattern at higher order. | CFP-growth++ uses “least minimum support” and “infrequent leaf node pruning” to construct the tree effectively. In addition, it uses “conditional minsup” and “conditional closure property” to reduce the search space. |

Summary of Contributions

| Topic | Existing methodology | Performance problem | Proposed methodology |
|---|---|---|---|
| Not sufficient for databases of widely varying items’ frequencies | Multiple minimum support framework | Generates uninteresting frequent patterns containing both very high and very low frequency items. The items within the pattern are not correlated. | A new interestingness measure, “item-to-pattern difference”, has been introduced to prune such uninteresting frequent patterns. |
| Periodic-frequent pattern mining | Single minimum support and single maximum periodicity framework | The rare item problem | 1. The multiple minimum supports and maximum periodicity framework. 2. A pattern growth algorithm. |


Conclusions and Future Work

• For efficient mining of rare frequent patterns, the notion of “support difference” has been exploited to specify items’ MIS values.

• An improved FP-growth-like approach has been proposed for mining rare frequent patterns. Various heuristics have been exploited to reduce the search space.

• To extract rare frequent patterns from datasets with widely varying items’ frequencies, we introduced a new measure called “item-to-pattern difference” and proposed an efficient FP-growth-like approach.

• Overall, the proposed approaches make it possible to mine interesting frequent patterns or association rules effectively, at the cost of additional effort from the user, especially in specifying items’ MIS values.


• The future work is as follows:

– The interestingness of patterns discovered using various measures needs to be studied.

– Rare item problem in various constraint-based patterns needs to be investigated and addressed.

– The multiple minsups framework needs to be extended to other types of data, such as streams, sequential data and uncertain data.

References

R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 207–216, New York, NY, USA, 1993. ACM.

R. Agrawal and R. Srikant. Mining sequential patterns. In Data Engineering, 1995. Proceedings of the Eleventh International Conference on, pages 3–14, March 1995.

T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: a case study. In KDD ’99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 254–260, New York, NY, USA, 1999. ACM.

E. R. Omiecinski. Alternative interest measures for mining associations in databases. IEEE Trans. on Knowl. and Data Eng., 15(1):57–69, 2003.

S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: generalizing association rules to correlations. SIGMOD Rec., 26(2):265–276, 1997.

H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proc. VLDB Endow., 1:1542–1552, August 2008.

W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases: an overview, 1992.

J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Min. Knowl. Discov., 8(1):53–87, 2004.

J. Gray. The next database revolution. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, SIGMOD ’04, pages 1–4, New York, NY, USA, 2004. ACM.

M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. Mining data streams: a review. SIGMOD Rec., 34:18–26, June 2005.

J. Hipp and G. Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. ACM SIGKDD Explorations Newsletter, 2(1):58–64, 2000.

Y.-H. Hu and Y.-L. Chen. Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism. Decis. Support Syst., 42(1):1–24, 2006.

R. Uday Kiran, P. Krishna Reddy. Mining Rare Association Rules in the Datasets with Widely Varying Items’ Frequencies. In DASFAA (1) 2010, pages 49–62.

R. Uday Kiran, P. Krishna Reddy. Towards Efficient Mining of Periodic-Frequent Patterns in Transactional Databases. In DEXA (2) 2010, pages 194–208.