MINING DISCRIMINATIVE ITEMSETS IN DATA
STREAMS USING DIFFERENT WINDOW MODELS
Mr Majid Seyfi
Associate Professor Shlomo Geva, Associate Professor Yue Xu
Submitted in fulfilment of the requirements for the degree of
Master of Philosophy
Faculty of Science and Engineering
Queensland University of Technology
2018
To my supervisors, parents and friends
Mining Discriminative Itemsets in Data Streams using Different Window Models Page i
© 2018 Queensland University of Technology-QUT, Science and Engineering Faculty Page i
Keywords
Anomaly detection, Association classification, Concept drift, Data mining, Data stream
mining, Discriminative classification, Discriminative itemset, Discriminative rule,
Efficiency, Emerging patterns, Frequent pattern, Heuristic, Optimization, Pattern mining,
Personalization, Prediction mining, Prefix tree, Recommendation, Sliding window model,
Sparse itemsets, Tilted-time window model, Transaction.
Abstract
With the availability of big data in areas such as social networks, online marketing systems and
stock markets, data mining techniques have been successfully used for knowledge discovery. To
improve user satisfaction, businesses commonly create profiles for their customers and keep track
of user activities. Usually, the historical activities of each customer are stored as transactional
data streams. The knowledge embedded in these data streams is applicable to describing and
predicting trends, and can be used for different purposes such as personalization, anomaly
detection, target marketing and forecasting upcoming trends.
Frequent itemset mining, sequential rule mining and decision tree methods are popular
techniques used for description and prediction mining. However, most of the proposed methods
are designed for static datasets or a single data stream. Algorithms based on a single data
stream lose the part of the knowledge that relates to interactions between data streams. The
patterns embedded in, and differences between, data streams can be used to define effective data
mining methods.
This thesis, for the first time, develops algorithms for mining discriminative itemsets in
data streams. Discriminative itemsets are itemsets that are frequent in one data stream and
whose frequencies in that stream are much higher than in the other data streams in the application
domain. Discriminative itemsets are more distinctive and therefore more useful for data stream
comparison. This thesis develops novel discriminative itemset mining algorithms using the
tilted-time window model and the sliding window model in data streams. Determinative
heuristics are applied to the proposed algorithms for efficient mining of discriminative
itemsets in data streams with a refined approximate bound. The extracted
discriminative itemsets can be used to enhance the effectiveness of description and prediction
in data stream applications. The proposed algorithms can also be applied to targeted static
datasets. The empirical analysis shows that the proposed algorithms have good time and space
complexities with high accuracy in large data streams.
In summary, this thesis addresses some of the challenges of pattern mining across more
than one data stream for the purposes of description and prediction in data streams.
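The selection criterion behind discriminative itemsets can be illustrated with a small, naive batch sketch. This is my own illustration, not the thesis's DISTree or DISSparse algorithms; the function name and the parameters `phi` (support threshold) and `theta` (discriminative level) are hypothetical stand-ins for the notation 𝜑 and 𝜃 used in the thesis:

```python
from itertools import combinations
from collections import Counter

def discriminative_itemsets(target, other, phi=0.5, theta=2.0, max_len=3):
    """Return itemsets frequent in `target` whose relative support there
    is at least `theta` times their relative support in `other`."""
    def supports(stream):
        counts = Counter()
        for txn in stream:
            items = sorted(set(txn))
            for k in range(1, min(len(items), max_len) + 1):
                for combo in combinations(items, k):
                    counts[combo] += 1
        return counts

    n1, n2 = len(target), len(other)
    sup1, sup2 = supports(target), supports(other)
    result = {}
    for itemset, c1 in sup1.items():
        r1 = c1 / n1
        if r1 < phi:
            continue  # not frequent in the target stream
        r2 = sup2.get(itemset, 0) / n2
        # itemsets absent from the other stream are treated as
        # maximally discriminative
        if r2 == 0 or r1 / r2 >= theta:
            result[itemset] = (r1, r2)
    return result
```

A real data stream miner would maintain a prefix-tree synopsis incrementally over batches rather than enumerating every sub-itemset per batch; this sketch only makes the selection criterion concrete.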
Table of Contents
Keywords ................................................................................................................................................ ii
Abstract ................................................................................................................................................. iii
Table of Contents ................................................................................................................................... iii
List of Figures ......................................................................................................................................... vi
List of Tables ........................................................................................................................................... ix
List of Abbreviations ................................................................................................................................ x
List of Notations ..................................................................................................................................... xi
Statement of Original Authorship ........................................................................................................ xiv
Acknowledgments ................................................................................................................................. xv
1 CHAPTER 1: INTRODUCTION ........................................................................................................ 1
1.1 Background .................................................................................................................................. 1
1.2 Problem statement and objectives .............................................................................................. 3
1.2.1 Research problems ........................................................................................................... 3
1.2.2 Research objectives .......................................................................................................... 5
1.3 Research significance ................................................................................................................... 8
1.4 Research limitation ...................................................................................................................... 8
1.5 Contributions ............................................................................................................................... 9
1.6 Publications .................................................................................................................................. 9
1.7 Thesis outline ............................................................................................................................. 10
2 CHAPTER 2: LITERATURE REVIEW .............................................................................................. 12
2.1 Data stream mining .................................................................................................................... 12
2.2 Frequent itemset mining in data stream ................................................................................... 13
2.2.1 Window models and update intervals in data stream.................................................... 13
2.2.2 Frequent itemset mining algorithms in data stream ...................................................... 15
2.3 Contrast data mining ................................................................................................................. 18
2.3.1 Emerging patterns .......................................................................................................... 19
2.3.1.1 Different types of emerging patterns ............................................................................. 22
2.3.1.2 Delta-discriminative emerging patterns ......................................................................... 24
2.3.1.3 Differences between emerging patterns and discriminative itemsets ........................... 26
2.3.2 Other contrast patterns .................................................................................................. 27
2.3.3 Discriminative itemset mining in data streams .............................................................. 28
2.3.4 Discriminative item mining in data streams ................................................................... 29
2.4 Association rule mining .............................................................................................................. 32
2.4.1 Association rule mining in data streams ......................................................................... 33
2.4.2 Classification rule mining ................................................................................................ 33
2.5 Verifying the new knowledge contributed to the area .............................................................. 35
2.6 Research gaps and implications ................................................................................................. 37
3 CHAPTER 3: MINING DISCRIMINATIVE ITEMSETS IN DATA STREAMS .......................................... 39
3.1 Existing works ............................................................................................................................ 39
3.2 Research problem ...................................................................................................................... 42
3.2.1 Formal definition ............................................................................................................ 43
3.2.2 Discriminative itemset mining ........................................................................................ 44
3.3 DISTree method ......................................................................................................................... 44
3.3.1 DISTree construction and pruning .................................................................................. 46
3.3.2 DISTree algorithm ........................................................................................................... 51
3.3.3 DISTree summary ............................................................................................................ 54
3.4 DISSparse method ...................................................................................................................... 54
3.4.1 Mining discriminative itemsets using sparse prefix tree ................................................ 56
3.4.1.1 Potential discriminative itemsets generation using minimized DISTree ........................ 61
3.4.1.2 Conditional FP-Tree expansion ....................................................................................... 63
3.4.1.3 Tuning non-discriminative subsets of discriminative itemsets ....................................... 65
3.4.2 DISSparse Algorithm ....................................................................................................... 66
3.4.3 DISSparse Algorithm Complexity .................................................................................... 68
3.4.4 Modified DISSparse and modified DPMiner ................................................................... 68
3.4.5 DISSparse summary ........................................................................................................ 73
3.5 Chapter summary ...................................................................................................................... 73
4 CHAPTER 4: MINING DISCRIMINATIVE ITEMSETS IN DATA STREAMS USING THE TILTED-TIME WINDOW MODEL ........................................................................................................................ 76
4.1 Existing works ............................................................................................................................ 76
4.2 Research problem ...................................................................................................................... 78
4.2.1 Problem formal definition .............................................................................................. 80
4.2.2 Discriminative itemset mining using the tilted-time window model ............................. 83
4.3 Tilted-time window model ......................................................................................................... 84
4.3.1 Tilted-time window model updating .............................................................................. 86
4.3.2 Discriminative itemsets approximate bound .................................................................. 88
4.3.2.1 Maintaining the discriminative itemsets in the tilted-time window model ................... 89
4.3.2.2 Improving the accuracy using relaxation ratio ............................................................... 91
4.3.2.3 Tail pruning in the tilted-time window model ................................................................ 93
4.4 H-DISSparse method .................................................................................................................. 95
4.4.1 H-DISSparse Algorithm ................................................................................................... 95
4.4.2 H-DISSparse Algorithm Complexity ................................................................................ 97
4.5 Chapter summary ...................................................................................................................... 97
5 CHAPTER 5: MINING DISCRIMINATIVE ITEMSETS IN DATA STREAMS USING THE SLIDING WINDOW MODEL ...................................................................................................................... 100
5.1 Existing works .......................................................................................................................... 100
5.2 Research problem .................................................................................................................... 104
5.2.1 Problem formal definition ............................................................................................ 105
5.2.2 Discriminative itemset mining using the sliding window model .................................. 108
5.3 Offline Sliding window model .................................................................................................. 108
5.3.1 Mining discriminative itemsets in sliding window using prefix tree ............................. 109
5.3.2 Incremental offline sliding window .............................................................................. 110
5.3.2.1 Initializing the offline sliding window ........................................................................... 111
5.3.2.2 Stable and updated subsets in offline sliding window ................................................. 113
5.3.2.3 S-DISStream tuning and pruning in offline sliding window .......................................... 121
5.4 Online Sliding window model .................................................................................................. 123
5.4.1 Mining discriminative itemsets in online sliding window using queue structure ......... 123
5.4.2 Improving the accuracy using relaxation ratio ............................................................. 124
5.5 S-DISSparse method ................................................................................................................. 125
5.5.1 S-DISSparse Algorithm .................................................................................................. 126
5.5.2 S-DISSparse Algorithm Complexity ............................................................................... 128
5.6 Chapter summary .................................................................................................................... 128
6 CHAPTER 6: EVALUATION AND ANALYSIS ................................................................................ 131
6.1 Benchmarking .......................................................................................................................... 131
6.1.1 Evaluation benchmarks ................................................................................................ 131
6.1.1.1 Batch processing benchmarks ...................................................................................... 132
6.1.1.2 Data stream processing benchmarks ............................................................................ 134
6.1.2 Evaluation environment ............................................................................................... 134
6.2 Batch processing ...................................................................................................................... 135
6.2.1 Evaluation on synthetic datasets .................................................................................. 135
6.2.2 Evaluation on real datasets .......................................................................................... 142
6.2.3 Discussion on 𝜹-discriminative emerging patterns ...................................................... 145
6.2.3.1 Evaluation on modified DISSparse and modified DPM ................................................. 146
6.2.3.2 Evaluation on real datasets with modified DISSparse and modified DPM ................... 149
6.2.4 Discussion ..................................................................................................................... 149
6.3 Tilted-time window model ....................................................................................................... 150
6.3.1 Evaluation on synthetic datasets .................................................................................. 151
6.3.1.1 Approximation in discriminative Itemsets in the tilted-time window model ............... 155
6.3.1.2 Discriminative Itemsets in the tilted-time window model without tail pruning .......... 157
6.3.2 Evaluation on real datasets .......................................................................................... 158
6.3.2.1 Scalability on datasets with less concept drifts in the tilted-time window model ....... 159
6.3.2.2 Approximation in discriminative Itemsets in the tilted-time window model ............... 162
6.3.3 Discussion ..................................................................................................................... 164
6.4 Sliding window model .............................................................................................................. 165
6.4.1 Evaluation on synthetic datasets .................................................................................. 166
6.4.1.1 Offline sliding discriminative itemsets .......................................................................... 166
6.4.1.2 Online sliding discriminative itemsets .......................................................................... 169
6.4.2 Evaluation on real datasets .......................................................................................... 171
6.4.3 Discussion ..................................................................................................................... 175
6.5 Chapter summary .................................................................................................................... 176
7 CHAPTER 7: CONCLUSIONS ...................................................................................................... 178
7.1 Summary of contributions ....................................................................................................... 179
7.2 Summary of findings ................................................................................................................ 181
7.3 Connections between the three tasks ..................................................................................... 182
7.4 Limitations and the future research issues .............................................................................. 182
REFERENCES .............................................................................................................................. 184
List of Figures
Figure 1.1: Research methodology and thesis structure ........................................................................ 10
Figure 2.1: Landmark window model ................................................................................................... 13
Figure 2.2: Damped window model ...................................................................................................... 13
Figure 2.3: Sliding window model ........................................................................................................ 14
Figure 2.4: Tilted-time window model .................................................................................................. 14
Figure 2.5: Support plan for emerging patterns (Dong and Li 1999) .................................................... 21
Figure 3.1 Header-Table and FP-Tree structures for input batch of transactions ................................. 48
Figure 3.2 Conditional FP-Tree of Header-Table item 𝑎 ..................................................................... 49
Figure 3.3 Header-Table and DISTree structure without pruning (the full prefix tree size is only for
display and is not generated) ........................................................................................................ 50
Figure 3.4 Final DISTree structure and the reported discriminative itemsets ....................................... 51
Figure 3.5 Conditional FP-Tree of Header-Table item 𝑎 associated with the top ancestor on the first
level .............................................................................................................................................. 56
Figure 3.6 Minimized DISTree generated from the left-most subtree in Figure 3.5 ............................. 62
Figure 3.7 Expanded conditional FP-Tree of Header-Table item 𝑎 after processing the first subtree . 64
Figure 3.8 Minimized DISTree generated out of the potential discriminative subsets of the left-most
subtree in conditional FP-Tree for Header-Table item 𝑎 ............................................................. 64
Figure 3.9 Expanded modified conditional FP-Tree of Header-Table item a after processing the
second subtree .............................................................................................................................. 65
Figure 3.10 A sample of discriminative itemsets distribution with different discriminative levels in
market basket monitoring application .......................................................................................... 75
Figure 4.1 Tilted-time window frames .................................................................................................. 81
Figure 4.2 Logarithmic tilted-time window structure (Giannella et al. 2003) ....................................... 84
Figure 4.3 A sample H-DISStream based on Example 3.1 with the built-in tilted-time window model ..... 85
Figure 4.4 Tilted-time window model updating .................................................................................... 87
Figure 5.1 Sliding window model 𝑊 made of three partitions 𝑃 ........................................................ 106
Figure 5.2 Header-Table and S-FP-Tree structures by the first partition 𝑃1 ...................................... 112
Figure 5.3 Header-Table and S-DISStream structures by the first partition 𝑃1 .................................. 113
Figure 5.4 Header-Table and updated S-FP-Tree structures by adding second partition 𝑃2 .............. 113
Figure 5.5 Conditional FP-Tree of Header-Table item 𝑎 updated by partition 𝑃2 ............................. 117
Figure 5.6 Updated S-DISStream after processing 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) in conditional FP-Tree for
Header-Table item 𝑎 .................................................................................................................. 118
Figure 5.7 Expanded conditional FP-Tree of Header-Table item 𝑎 updated by partition 𝑃2 after
processing the first subtree ......................................................................................................... 120
Figure 5.8 Updated S-DISStream after processing potential discriminative subsets of the left-most
subtree in conditional FP-Tree for Header-Table item 𝑎 ........................................................... 120
Figure 5.9 Expanded conditional FP-Tree of Header-Table item a updated by partition 𝑃2 after
processing the second subtree .................................................................................................... 121
Figure 5.10 Updated S-DISStream after processing potential discriminative subsets of the left-most
subtree in conditional FP-Tree for Header-Table item 𝑎 ........................................................... 121
Figure 5.11 Final S-DISStream after offline sliding based on partition 𝑃2 ......................................... 122
Figure 5.12 Transaction-List made of partitions fit in the online sliding window frame 𝑊 ............... 123
Figure 6.1 Scalability with discriminative level 𝜃 for 𝐷1 (support threshold 𝜑 = 0.0001) ............... 136
Figure 6.2 Scalability with support threshold 𝜑 for 𝐷1(discriminative level 𝜃 = 10) ....................... 137
Figure 6.3 Number of the discriminative itemsets with different dataset length ratios (𝑛2/𝑛1) (discriminative level 𝜃 = 10 and support threshold 𝜑 = 0.0001) ............................................. 137
Figure 6.4 Scalability with different dataset length ratios (𝑛2/𝑛1) (discriminative level 𝜃 = 10 and
support threshold 𝜑 = 0.0001) .................................................................................................. 138
Figure 6.5 Scalability with discriminative level 𝜃 for 𝐷2 (support threshold 𝜑 = 0.0001) ............... 138
Figure 6.6 Scalability with discriminative level 𝜃 for 𝐷3 (support threshold 𝜑 = 0.0001) ............... 139
Figure 6.7 Number of the discriminative itemsets with different 𝜃 for 𝐷3 (support threshold 𝜑 = 0.0001) ....................................................................................................................... 139
Figure 6.8 Scalability with discriminative level 𝜃 for 𝐷3 (support threshold 𝜑 = 0.00001) ............. 140
Figure 6.9 Number of the discriminative itemsets with discriminative level 𝜃 for 𝐷3 (support
threshold 𝜑 = 0.0001) ............................................................................................................... 140
Figure 6.10 Frequent items distribution in 𝑆1 ..................................................................................... 141
Figure 6.11 Frequent items distribution in 𝑆2 ..................................................................................... 141
Figure 6.12 Frequent items distribution in discriminative itemsets in the modified 𝐷1 with
discriminative level 𝜃 = 10 and support threshold 𝜑 = 0.0001 ............................................... 141
Figure 6.13 Frequent items distribution in discriminative itemsets in the modified 𝐷1 with
discriminative level 𝜃 = 10 and support threshold 𝜑 = 0.0001 ............................................... 142
Figure 6.14 Scalability with discriminative level 𝜃 for susy dataset (support threshold 𝜑 = 0.01) ... 143
Figure 6.15 Scalability with support threshold 𝜑 for susy dataset (discriminative level 𝜃 = 2) ........ 143
Figure 6.16 Scalability with discriminative level 𝜃 for accident dataset (support threshold 𝜑 = 0.01) ..... 144
Figure 6.17 Scalability with discriminative level 𝜃 for mushroom dataset (support threshold 𝜑 = 0.001) ......................................................................................................................... 145
Figure 6.18 Scalability with 𝑚𝑖𝑛_𝑠𝑢𝑝 for 𝐷1 (δ = 2) ........................................................................ 147
Figure 6.19 Scalability with δ for 𝐷1 (𝑚𝑖𝑛_𝑠𝑢𝑝 = 50) ...................................................................... 147
Figure 6.20 Scalability with 𝑚𝑖𝑛_𝑠𝑢𝑝 for 𝐷3 (δ = 50) ...................................................................... 148
Figure 6.21 Scalability with δ for 𝐷3 (𝑚𝑖𝑛_𝑠𝑢𝑝 = 50) ...................................................................... 148
Figure 6.22 Scalability of batch processing not considering the tilted-time window model updating 153
Figure 6.23 Tilted-time window model updating time complexity ..................................................... 154
Figure 6.24 Time complexity of H-DISTree and H-DISSparse algorithms ........................................ 154
Figure 6.25 H-DISStream structure size .............................................................................................. 155
Figure 6.26 Scalability of H-DISSparse algorithm by relaxation of 𝛼 = 1, 𝛼 = 0.9 and 𝛼 = 0.75 ... 156
Figure 6.27 Number of sub-discriminative itemsets by relaxation of 𝛼 = 0.9 and 𝛼 = 0.75 ............. 156
Figure 6.28 Scalability of H-DISSparse algorithm by eliminating Corollary 4-1 and Corollary 4-3 ..... 158
Figure 6.29 Scalability of batch processing not considering the tilted-time window model updating 161
Figure 6.30 Time complexity of H-DISTree and H-DISSparse algorithms ........................................ 162
Figure 6.31 H-DISStream structure size .............................................................................................. 162
Figure 6.32 Scalability of H-DISSparse algorithm by relaxation of 𝛼 = 1 and 𝛼 = 0.9 .................... 163
Figure 6.33 Number of sub-discriminative itemsets by relaxation of 𝛼 = 0.9 ................................... 163
Figure 6.34 S-DISSparse and DISSparse time complexity for 𝐷1 (window frame 𝑊 = 25) ............. 167
Figure 6.35 S-DISSparse space complexity in offline sliding for 𝐷1 (𝑊 = 25) ................................ 167
Figure 6.36 S-DISSparse and DISSparse time and space complexity for 𝐷2 (window frame 𝑊 = 10) ..... 168
Figure 6.37 S-DISSparse time complexity in online and offline sliding for 𝐷1 (𝑊 = 25) ................ 169
Figure 6.38 S-DISSparse time complexity for online and offline sliding for 𝐷1 (𝑊 = 25) with
different relaxation of 𝛼 ............................................................................................................. 170
Figure 6.39 S-DISStream size for 𝐷1 (𝑊 = 25) by different relaxation of 𝛼 .................................... 170
Figure 6.40 Number of itemsets whose tag is changed for 𝐷1 (𝑊 = 25) by different relaxation of 𝛼 ..... 171
Figure 6.41 Number of discriminative and sub-discriminative itemsets for 𝐷1 (𝑊 = 25) by different
relaxation of 𝛼 ............................................................................................................................ 171
Figure 6.42 S-DISSparse and DISSparse time complexity for susy dataset (window frame 𝑊 = 20) ..... 172
Figure 6.43 S-DISSparse space complexity in offline sliding for susy dataset (𝑊 = 20) .................. 173
Figure 6.44 S-DISSparse time complexity in online and offline sliding for susy dataset (𝑊 = 20) .. 173
Figure 6.45 S-DISStream size for susy dataset (𝑊 = 20) by different relaxation of 𝛼 ...................... 174
Figure 6.46 Number of itemsets whose tag is changed for susy dataset (𝑊 = 20) by different
relaxation of 𝛼 ............................................................................................................ 174
Figure 6.47 Number of discriminative and sub-discriminative itemsets for susy dataset (𝑊 = 20) by
different relaxation of 𝛼 ............................................................................................................. 174
Mining Discriminative Itemsets in Data Streams using Different Window Models Page ix
© 2018 Queensland University of Technology-QUT, Science and Engineering Faculty Page ix
List of Tables
Table 2.1 Data stream frequent itemset mining algorithms ................................................................... 15
Table 3.1 An input batch in data streams .............................................................................................. 48
Table 3.2 Desc-Flist order of frequent items in target data stream 𝑆1 .................................................. 48
Table 5.1 The first input batch in data streams fits in partition 𝑃1 ..................................................... 112
Table 5.2 Desc-Flist order of frequent items in target data stream 𝑆1 in the first batch...................... 112
Table 5.3 The second input batch in data streams fits in partition 𝑃2 ................................................. 113
Table 6.1 The number of discriminative itemsets in the tilted-time window model ........................... 152
Table 6.2 The number of discriminative itemsets in the tilted-time window model ........................... 157
Table 6.3 The number of discriminative itemsets in the tilted-time window model ........................... 160
Table 6.4 The number of discriminative itemsets in the tilted-time window model ........................... 164
List of Abbreviations
DISTree – discriminative tree
DISSparse – discriminative sparse
H-DISSparse – historical discriminative sparse
S-DISSparse – sliding discriminative sparse
FP-Tree – frequent pattern tree
FP-Growth – frequent pattern growth
H-DISStream – historical discriminative stream
S-FP-Tree – sliding frequent pattern tree
S-DISStream – sliding discriminative stream
List of Notations
𝑺𝒊: Data stream i
𝑩: Batch of transactions
𝑻: Transaction made of items
∑: Alphabet set of items
𝒆: Item
𝑰: Itemset
𝒇𝒊(𝑰): Frequency of the itemset I in data stream 𝑆𝑖
𝑰(𝒂𝒏,𝒎): Itemset ending with 𝑎 with frequency of 𝑛 in 𝑆𝑖 and 𝑚 in 𝑆𝑗
𝒏𝒊: Length of data stream 𝑆𝑖
𝒓𝒊(𝑰): Frequency ratio of itemset 𝐼 in data stream 𝑆𝑖
𝜽: Discriminative level
𝝋: Support threshold
𝑹𝒊𝒋(𝑰): Frequency ratio of itemset 𝐼 in target data stream 𝑆𝑖 vs general data stream 𝑆𝑗
𝑫𝑰𝒊𝒋: Discriminative itemsets in a single batch of transactions
𝑵𝑫𝑰𝒊𝒋: Non-discriminative itemsets in a single batch of transactions
𝑫𝒆𝒔𝒄 − 𝑭𝒍𝒊𝒔𝒕: Descending order of items based on their frequencies in the first batch
𝑺𝒖𝒃𝒕𝒓𝒆𝒆𝒓𝒐𝒐𝒕: Branches under the same root in the first level of conditional FP-Tree and ending
with the processing Header-Table item
𝑯𝒆𝒂𝒅𝒆𝒓_𝑻𝒂𝒃𝒍𝒆_𝒊𝒕𝒆𝒎𝒔(𝑺𝒖𝒃𝒕𝒓𝒆𝒆𝒓𝒐𝒐𝒕): The set of Header-Table items which are linked
under their subtree root node using Header-Table links
𝒊𝒕𝒆𝒎𝒔𝒆𝒕𝒔(𝒓𝒐𝒐𝒕, 𝒂): The set of itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item 𝑎 ∊
𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡)
𝐌𝐚𝐱_𝐟𝐫𝐞𝐪𝒊(𝒓𝒐𝒐𝒕, 𝒂): The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the
target data stream 𝑆𝑖
𝐌𝐚𝐱_𝐝𝐢𝐬_𝐯𝐚𝐥𝐮𝐞(𝒓𝒐𝒐𝒕, 𝒂): The maximum discriminative value of 𝐼𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎 ) in
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
𝑷𝒐𝒕𝒆𝒏𝒕𝒊𝒂𝒍(𝑺𝒖𝒃𝒕𝒓𝒆𝒆𝒓𝒐𝒐𝒕): Potential discriminative itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
𝑰𝒏𝒕𝒆𝒓𝒏𝒂𝒍 𝒏𝒐𝒅𝒆𝒓𝒐𝒐𝒕: The internal nodes in the paths between 𝑟𝑜𝑜𝑡 of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 and the
𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡)
𝒊𝒕𝒆𝒎𝒔𝒆𝒕𝒔(𝒓𝒐𝒐𝒕, 𝒊𝒏, 𝒂): The set of itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item
𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) with subset of internal node 𝑖𝑛
𝐌𝐚𝐱_𝐟𝐫𝐞𝐪𝒊(𝒓𝒐𝒐𝒕, 𝒊𝒏, 𝒂): The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
in the target data stream 𝑆𝑖
𝐌𝐚𝐱_𝐝𝐢𝐬_𝐯𝐚𝐥𝐮𝐞(𝒓𝒐𝒐𝒕, 𝒊𝒏, 𝒂): The maximum discriminative value of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) in
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
𝑷𝒐𝒕𝒆𝒏𝒕𝒊𝒂𝒍(𝒊𝒏): Potential discriminative itemsets with subset of internal node 𝑖𝑛 in
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
𝐌𝐚𝐱_𝐟𝐫𝐞𝐪(𝒓𝒐𝒐𝒕, 𝒂): The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the
datasets
𝐌𝐢𝐧_𝐝𝐞𝐥𝐭𝐚_𝐯𝐚𝐥𝐮𝐞(𝒓𝒐𝒐𝒕, 𝒂): The minimum delta discriminative value of 𝐼𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎)
in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
𝐌𝐚𝐱_𝐟𝐫𝐞𝐪(𝒓𝒐𝒐𝒕, 𝒊𝒏, 𝒂): The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
in the datasets
𝐌𝐢𝐧_𝐝𝐞𝐥𝐭𝐚_𝐯𝐚𝐥𝐮𝐞(𝒓𝒐𝒐𝒕, 𝒊𝒏, 𝒂): The minimum delta discriminative value of
𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
𝑾𝒌: Window frame k in the tilted-time window model
𝒇𝒊𝒌(𝑰): Frequency of the itemset I in data stream 𝑆𝑖 in window frame 𝑊𝑘
𝒏𝒊𝒌: Length of data stream 𝑆𝑖 in window frame 𝑊𝑘
𝒓𝒊𝒌(𝑰): Frequency ratio of itemset 𝐼 in data stream 𝑆𝑖 in window frame 𝑊𝑘
𝑹𝒊𝒋𝒌 (𝑰): Frequency ratio of itemset 𝐼 in target data stream 𝑆𝑖 vs general data stream 𝑆𝑗 in window
frame 𝑊𝑘
𝑫𝑰𝒊𝒋𝒌 : Discriminative itemsets in tilted-time window model with 𝑘 window frames
𝒇𝒊𝟎..𝒌(𝑰): Total frequency of the itemset I in data stream 𝑆𝑖 in window frame 𝑊0 to 𝑊𝑘
𝒏𝒊𝟎..𝒌: Total length of data stream 𝑆𝑖 in window frame 𝑊0 to 𝑊𝑘
𝜶: Relaxation threshold
𝑺𝑫𝑰𝒊𝒋𝒌 : Sub-discriminative itemsets in tilted-time window model with 𝑘 window frames
𝑵𝑫𝑰𝒊𝒋: Non-discriminative itemsets in data stream 𝑆𝑖 against data stream 𝑆𝑗 in the tilted-time
window model
𝒇𝒊(𝑰) < 𝒙, 𝒚 >: Frequency of the itemset 𝐼 in the group of continuous batches 𝐵𝑥 to 𝐵𝑦 with
𝑥 ≥ 𝑦, in data stream 𝑆𝑖
𝑷: Partition
𝑾: Sliding window model
𝒇𝒊𝒘(𝑰): Frequency of the itemset I in data stream 𝑆𝑖 in sliding window frame 𝑊
𝒏𝒊𝒘: Length of data stream 𝑆𝑖 in sliding window frame 𝑊
𝒓𝒊𝒘(𝑰): Frequency ratio of itemset 𝐼 in data stream 𝑆𝑖 in sliding window frame 𝑊
𝑹𝒊𝒋𝒘(𝑰): Frequency ratio of itemset 𝐼 in target data stream 𝑆𝑖 vs general data stream 𝑆𝑗 in sliding
window frame 𝑊
𝑫𝑰𝒊𝒋𝒘: Discriminative itemsets in sliding window frame 𝑊
𝑺𝑫𝑰𝒊𝒋𝒘: Sub-discriminative itemsets in sliding window frame 𝑊
𝑵𝑫𝑰𝒊𝒋𝒘: Non-discriminative itemsets in sliding window frame 𝑊
𝑺𝒕𝒂𝒃𝒍𝒆(𝑺𝒖𝒃𝒕𝒓𝒆𝒆𝒓𝒐𝒐𝒕): Stable 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 during offline sliding
𝑺𝒕𝒂𝒃𝒍𝒆(𝒊𝒏): Stable itemsets with subset of 𝑖𝑛 in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 during offline sliding
QUT Verified Signature
August 2018
Acknowledgments
I would like to express my gratitude to all those who gave me the chance to complete
this thesis.
Firstly, I would like to thank Associate Prof Shlomo Geva, as my principal supervisor,
for his specific guidance and encouragement throughout this candidature. Many thanks to my
associate supervisor, Associate Prof Yue Xu, for her generous support and comments on my
work during this research work. I appreciate all the help and support I also received from
Associate Prof Richi Nayak, as co-author of my publications. Thank you for providing your
valuable suggestions and encouragement throughout my study.
Special thanks to the Science and Engineering Faculty, QUT, which has provided me
with a comfortable research environment and thoughtful administrative help. I would like to
express my appreciation to CRC Smart Services and again, the Science and Engineering Faculty,
QUT, for the partial financial funding for this research work.
Copyediting and proofreading services for this thesis were provided and are
acknowledged, according to the guidelines laid out in the University-endorsed national policy
guidelines for the editing of research theses.
Thank you to all my colleagues who shared the research offices with me, and the
members of my research group, for participating in helpful discussions and being important
friends along my study journey. Finally, I would like to extend my heartfelt appreciation to my
parents for their unconditional love. I thank them for their support for my study overseas, and for
their consistently positive advice.
Chapter 1: Introduction
This chapter presents a general outline of the thesis, covering the research
background in Section 1.1 and the research statement in Section 1.2. It highlights the overall
research significance in Section 1.3, the research limitations in Section 1.4 and the contributions in
Section 1.5. It also lists the publications arising from this research in Section 1.6 and the
thesis outline in Section 1.7.
1.1 BACKGROUND
The large datasets collected in different domains give knowledge discovery and data
mining tasks the opportunity to explain natural processes in clearer and more understandable
ways. Turning fast-growing data collections into accessible and actionable knowledge is one of
the main challenges that organizations and individuals face. The main goal of data mining
techniques is to find easy-to-understand, novel patterns that reveal the underlying concepts of
large datasets. These techniques, however, are not applicable in current real applications unless
they enable exploration, explanation and summarization of the datasets in concise, efficient and
understandable forms. Pattern mining is the task of discovering patterns, showing the important
content and structure of the datasets, and presenting them in understandable forms for further use.
Pattern mining is useful in many real-life applications such as personalization (Tseng and
Lin 2006; Djahantighi et al. 2010), anomaly detection (Lin et al. 2010), enhancing marketing
strategies (Hollfelder, Oria and Özsu 2000; Huang et al. 2008; Prasad and Madhavi 2012) and
forecasting upcoming trends (Mori et al. 2005). Companies can make more focused future
plans for their businesses based on descriptions of the patterns observed for each customer or
group of customers, predictions of their subsequent patterns, and information about general
trends. Using this sort of knowledge, they can increase their marketing strength and offer better
services to their customers. In many areas such as social networks, online marketing systems,
stock markets, online news agencies and search engines, users have profiles and their activities
are tracked in the form of transactional data streams. The embedded knowledge in these historical
data streams is used for understanding the current patterns discovered in users' activities as well
as for predicting their future patterns.
A data stream is defined as a sequence of transactions arriving at high speed
over time (Manku and Motwani 2002). Compared to traditional static datasets, data streams
have more complexities: they are big, they grow fast, their embedded knowledge
changes over time, and queries usually need real-time answers (Manku and Motwani 2002).
Currently, data stream mining methods such as frequent itemset mining (Giannella et al. 2003;
Hyuk and Lee 2003; Li, Lee and Shan 2004; Yu et al. 2004; Chi et al. 2004; Chang and Lee
2005; Leung and Khan 2006; Lin, Hsueh and Hwang 2008; Li and Lee 2009; Manku 2016) and
sequential rule mining techniques (Shin and Lee 2008; Çokpınar and Gündem 2012; Ahmed et
al. 2012) have been used for knowledge discovery from data streams. The extracted knowledge
from these methods is used for the purpose of data stream pattern description and data stream
pattern prediction (Hollfelder, Oria and ¨Ozsu 2000; Mori et al. 2005; Tseng and Lin 2006;
Eichinger, Nauck and Klawonn 2006; Aggarwal 2007; Huang et al. 2008; Cheng, Ke and Ng
2008; Lin et al. 2010; Prasad and Madhavi 2012; Manku 2016). However, most of the proposed
methods work on a single data stream (Aggarwal 2007; Cheng, Ke and Ng 2008; Manku 2016).
Using data mining techniques on more than one data stream helps to understand the streams
better, as data streams in the same domain usually relate to each other and to
general trends. Periodic frequent items, frequent itemsets, association rules and changes in
user activities can all be examined in comparison with the general trends.
In dynamic tracing of stock market fluctuations, one may be interested in itemsets that
occur together much more frequently in one stock market than in another. These itemsets are
useful for identifying the specific set of items that are of high interest in one market compared
to the other markets. They can also be useful for anomaly detection and personalization.
In network traffic measurement, in order to detect fraud or intrusion, the concurrent
activities of one user that are more frequent than the same group of activities in the whole
network are investigated; for example, identifying the users who have a set of activities in their
log files more frequently than the log files of the rest of the population. What is looked for is a group
of web pages visited together, by a specific user or group of users, more frequently in comparison
to another user or group, as well as the queries that are asked many more times in one
geographical area compared to another area, or the whole world. This can be used for better
optimization, localization and suggestion. To find the success factors of a project, one looks for
the itemsets that occur more frequently in the documents of the successful project
compared to the failed one. Building a personalized dynamic news delivery service involves looking at a
group of words that are more common in the news read by a specific user, compared to the
same group of words in a collection of all news, and updating the system with changes in user
preferences. An essential issue inherent in all the mentioned applications is to find itemsets that
can distinguish the target stream from all others (Lin et al. 2010). There are different data mining
techniques proposed for dataset comparison.
Emerging patterns (EPs) (Dong and Li 1999) were proposed for differentiating between
datasets by comparing their frequent itemsets. EPs are defined as itemsets whose frequencies
grow significantly from one dataset to another. EPs can highlight the emerging
trends in time-stamped datasets and also show the differences between data classes. The same
idea is useful for differentiating between data streams. In (Lin et al. 2010), three algorithms
were proposed for mining discriminative items in data streams: items are discovered
whose frequencies are much higher in the target data stream than in the general data
stream. Discriminative item mining, however, yields results of limited meaning, as single items
do not carry much knowledge with them. Compared to items, itemsets have more
value in the mining process by themselves; association rules can also be extracted from them
as another kind of valuable knowledge. This emphasizes the need to develop more effective
methods for data mining from more than one data stream.
The 𝛿-discriminative emerging patterns are defined as a special type of useful emerging
pattern (Li, Liu and Wong 2007) that occurs in only one dataset (data class) with almost no
occurrence in the other datasets (data classes). The 𝛿-discriminative emerging patterns are frequent
in the target dataset with frequency less than 𝛿 in the other datasets. Compared to the discriminative
itemsets defined in this thesis, 𝛿-discriminative emerging pattern mining misses some useful
emerging patterns. The discriminative itemsets are determined based on their relative
occurrences in the target and general datasets: if the support of a pattern is relatively different in
the target dataset and the general dataset, the pattern is considered discriminative. However, the 𝛿-
discriminative emerging patterns are determined based on their frequency (i.e., < 𝛿) in the
general dataset; for example, discriminative itemsets that are frequent in both the target dataset
and the general dataset, with relatively different frequencies in the two datasets, are not discovered
as 𝛿-discriminative emerging patterns.
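The distinction can be made concrete with a toy numeric sketch. All counts and thresholds below (the stream sizes, 𝜃, 𝜑 and 𝛿) are invented for illustration and are not taken from the thesis's experiments:

```python
# Hypothetical counts illustrating the distinction above: an itemset that is
# frequent in BOTH datasets can be discriminative by relative support while
# NOT being delta-discriminative, because its raw count in the general
# dataset exceeds delta. Every number here is made up.

n_i, n_j = 10_000, 10_000    # lengths of target stream S_i and general stream S_j
f_i, f_j = 900, 150          # frequency of the itemset I in each stream
theta, phi, delta = 3.0, 0.05, 10

r_i = f_i / n_i              # relative support in the target stream: 0.09
r_j = f_j / n_j              # relative support in the general stream: 0.015
ratio = r_i / r_j            # R_ij(I), approximately 6.0

# discriminative by the ratio-based definition used in this thesis
is_discriminative = r_i >= phi and ratio >= theta      # True
# delta-discriminative only if the general-dataset count stays below delta
is_delta_discriminative = f_j <= delta                 # False, since 150 > 10

print(is_discriminative, is_delta_discriminative)
```

The itemset is frequent in both datasets, yet its relative support differs by a factor of about six, so the ratio-based definition keeps it while 𝛿-discriminative mining discards it.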
1.2 PROBLEM STATEMENT AND OBJECTIVES
The previous section in this chapter describes the motivation of this research and also
identifies the general problems in acquiring user information. This section lists major research
questions that must be addressed in the work of this thesis.
1.2.1 Research problems
In this thesis, algorithms are developed for knowledge discovery from data streams.
These algorithms are proposed to enhance the effectiveness of description and prediction mining
in data stream environments. Working on more than one data stream poses more challenges
than single data stream mining, as the algorithms must deal with the bigger combined size of the
data streams and also with the interactions between them. We use data mining techniques to
mine the discriminative itemsets in data streams in historical and real-time window frames. The
extracted knowledge is used for describing the current trends of the data stream and is recommended
for predicting its future trends using classification techniques.
Motivated by the applications explained in the previous section, the problem of mining
“discriminative itemsets” in data streams is defined following the concepts of discriminative
items (Lin et al. 2010), frequent itemsets (Agrawal and Srikant 1994) and emerging patterns
(Dong and Li 1999). Discriminative itemsets in data streams are the frequent itemsets in one data
stream with much higher frequencies than the same itemsets in another data stream.
Discriminative itemsets are closely related to the frequent itemsets in data streams. The
developed algorithms have to work on more than one data stream and will face the challenges
related to inter-relationships between data streams. The discriminative rules can be defined from
discriminative itemsets as the sequential rules in one data stream that have higher support and
confidence compared to the same sequential rules in the other data streams. These discriminative
rules can be used for prediction mining in data streams using classification techniques. We
develop algorithms based on a batch of transactions, the historical tilted-time window model and
the sliding window model.
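As a point of reference for the definition above, discriminative itemsets in a single pair of batches can be computed by brute force. The transactions, thresholds and the guard against zero frequency in the general stream are illustrative assumptions here; the algorithms developed in this thesis (DISTree, DISSparse and their variants) exist precisely to avoid this exhaustive enumeration:

```python
# Brute-force sketch: enumerate itemsets occurring in the target batch and
# keep those whose relative support is at least theta times their relative
# support in the general batch, subject to a minimum support phi.
from itertools import combinations

def count(itemset, transactions):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def discriminative_itemsets(S_i, S_j, theta, phi):
    n_i, n_j = len(S_i), len(S_j)
    items = sorted(set().union(*S_i))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            I = frozenset(combo)
            r_i = count(I, S_i) / n_i
            # assumed guard: treat a zero count in the general stream as 1
            r_j = max(count(I, S_j), 1) / n_j
            if r_i >= phi and r_i / r_j >= theta:
                result[I] = r_i / r_j
    return result

S_i = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]   # target batch
S_j = [{"a"}, {"c"}, {"b", "c"}, {"c", "d"}]             # general batch
found = discriminative_itemsets(S_i, S_j, theta=2.0, phi=0.5)
print(sorted("".join(sorted(I)) for I in found))  # → ['a', 'ab', 'b', 'bc']
```

The exponential enumeration makes the cost of the naive approach obvious, which motivates the pruning heuristics developed later in the thesis.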
As data arrives in data streams at very high speed, the proposed algorithms
should be designed to handle these large data streams with reasonable time and space
complexity. The pattern mining must be done quickly enough to be useful for real-time
data streams. To skip unpromising processing, the proposed algorithms are modified
based on heuristics learned from the general and specific characteristics of the data streams.
In order to show the importance of the discriminative itemsets, mine the discriminative
itemsets and enhance the performance of discriminative itemset mining tasks, this thesis
addresses the following problems:
How can the discriminative itemsets be used in the classification techniques of large
data streams?
The discriminative itemsets discovered in data streams in this thesis are intended to be
employed in classification techniques for prediction in data streams. The
discriminative itemsets discovered in the tilted-time and sliding window models can be used to
define discriminative rules, i.e., rules with higher support and confidence in the
target data stream compared to the general data stream. We raise this research question here, but it is
not a research task in this thesis.
How can the discriminative itemsets be discovered from data streams?
Many algorithms have been proposed for mining frequent itemsets from a single data
stream. Using these algorithms separately on multiple data streams and merging the results for
comparison is time-consuming and does not meet the requirements of data stream mining
tasks. New algorithms should be defined and developed for processing more than one
data stream in the same scan and using the same data structures. However, the number of
discriminative itemsets is much smaller than the number of frequent itemsets, and the
algorithms should have pruning techniques that lead to fewer itemsets being generated and tested.
This is an emerging new research problem.
How can the discriminative itemsets efficiently describe the trends in data streams in
the historical and real-time window frames?
The designed algorithms should be adapted to the fast-speed nature of the data streams.
The algorithms should use concise data structures for saving the itemsets in the recent or
historical time frames. Heuristics should be proposed to exclude the generation of non-potential
itemset combinations during the process. The aim is to employ these techniques to
achieve high scalability and high accuracy in large and fast-growing data streams. The
defined heuristics can be general, for pattern mining in transactional data streams, or
specific, based on the properties of the target data streams. This leads to the fourth research
question in this thesis:
How well can the discriminative itemset mining be tuned based on the characteristics of the
target datasets?
Data streams in different domains have different characteristics. Based on the
application domain and the input characteristics, the algorithms and data structures can be
optimised by tuning the input parameters either from the beginning or over time, through
learning from history. How to tune and modify the related parameters, such as the support threshold
and the discriminative level, in the designed algorithms is important.
1.2.2 Research objectives
This thesis attempts to discover discriminative itemsets for efficiently and effectively
describing data streams, which can then be adapted for data stream prediction. To achieve
this goal, the thesis emphasizes discriminative itemset mining, as the success of
discriminative rule mining methods largely depends on this step. The discriminative itemsets are
extracted in historical and real-time window frames. The research can be broken down
into the following tasks:
Conducting extensive research to show the importance of discriminative itemsets in
data streams in real applications: Discriminative itemsets in data streams
focus on the differences in trends between data streams. Discriminative itemsets can be
employed in defining classification techniques over more than one data stream. The
existing frequent itemset mining methods in data streams are not applicable to more than
one data stream.
This includes extensive research into the different algorithms proposed for mining frequent
itemsets in data streams, different contrast data mining methods and association classification
mining methods. The importance of the discriminative itemset and its advantages over the frequent
itemset and emerging patterns are discussed, and the application of discriminative itemsets to the
definition of discriminative rules is provided.
Developing algorithms to discover discriminative itemsets in data streams: Existing
work in data stream mining has focused on frequent itemset mining in a single
data stream. Discriminative itemsets are a different type of itemset and involve multiple
data streams. Existing frequent itemset mining methods in data streams are not applicable
for mining discriminative itemsets. The emerging pattern mining methods mainly
work on static datasets and focus on presenting the emerging patterns in
compact forms. New techniques have to be developed.
This includes developing discriminative itemset mining algorithms in data streams. The
first algorithm is developed as a simple extension of the state-of-the-art techniques designed
for frequent itemset mining in a single data stream.
Enhancing the efficiency of discriminative itemset mining based on determinative
heuristics: The discriminative itemsets can be discovered by comparing the frequent
itemsets in the data streams without considering the size and complexity of the data streams;
however, this is not suited to the fast-speed nature of data streams. New, efficient techniques
have to be developed.
This includes developing a discriminative itemset mining algorithm in data streams
based on determinative heuristics, giving concise data structures and a faster mining
process.
Developing discriminative itemset mining algorithms in data streams for historical
time frames: Discriminative itemset mining in historical time frames is important in
applications that require discovering patterns in different time frames across the
history of the data streams.
The algorithm is developed based on the tilted-time window model to show the
discriminative itemsets at different time granularities during the history of the data
streams. The input transactions are grouped into batches, and the results are updated
in an offline state by shifting and merging the window frames.
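The shift-and-merge step can be sketched schematically. The sketch below is a simplified logarithmic tilted-time buffer in the spirit of Giannella et al. (2003), with plain integers standing in for per-itemset frequency summaries; the capacity of two aggregates per frame and the merge policy are illustrative assumptions, not the exact design used by H-DISSparse:

```python
# Logarithmic tilted-time buffer sketch: frames[k] holds at most two
# aggregates at granularity 2**k batches. When a new batch arrives at the
# finest frame, an overflowing frame merges its two OLDEST aggregates and
# carries the merged aggregate into the next, coarser frame.

def insert_batch(frames, batch):
    carry, k = batch, 0
    while carry is not None:
        if k == len(frames):
            frames.append([])          # open a new, coarser frame
        frames[k].insert(0, carry)     # newest aggregate goes in front
        if len(frames[k]) > 2:
            oldest = frames[k].pop()   # the two oldest aggregates ...
            second = frames[k].pop()
            carry = oldest + second    # ... merge and shift to frame k+1
        else:
            carry = None
        k += 1
    return frames

frames = []
for b in [1, 1, 1, 1, 1]:              # five unit batches arrive over time
    insert_batch(frames, b)
print(frames)                          # → [[1], [2, 2]]
```

After five unit batches the structure holds the newest batch at the finest granularity and two coarser two-batch aggregates, so the total count of 5 is preserved while only three aggregates are stored.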
Developing discriminative itemset mining algorithms in data streams for the real-time
frame: Discriminative itemset mining in a real-time frame differs from mining in
historical time frames, as the itemsets have to be discovered and updated in a fixed-size
window frame.
The algorithm is developed based on the sliding window model to show the
discriminative itemsets in a fixed-size recent time frame of the data streams. The transactions
arrive one by one, and the results are updated based on the changes in the transactions that fit in the
sliding window frame.
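The one-by-one update can be sketched for a single tracked itemset over two interleaved streams. The window size, the stream labels "i" and "j", and the tracked itemset are all invented for illustration; S-DISSparse maintains a prefix-tree over all itemsets rather than one counter per itemset:

```python
# Naive sliding-window maintenance for one itemset: each arriving transaction
# enters the window, the oldest one leaves once the window is full, and the
# itemset's per-stream count and the stream lengths are updated incrementally
# instead of being recounted from scratch.
from collections import deque

class SlidingItemsetCount:
    def __init__(self, itemset, window_size):
        self.itemset = frozenset(itemset)
        self.window = deque()               # (stream_id, transaction) pairs
        self.size = window_size
        self.counts = {"i": 0, "j": 0}      # itemset frequency per stream
        self.lengths = {"i": 0, "j": 0}     # transactions per stream in window

    def add(self, stream_id, transaction):
        self.window.append((stream_id, frozenset(transaction)))
        self.lengths[stream_id] += 1
        if self.itemset <= transaction:
            self.counts[stream_id] += 1
        if len(self.window) > self.size:    # evict the oldest transaction
            old_id, old_t = self.window.popleft()
            self.lengths[old_id] -= 1
            if self.itemset <= old_t:
                self.counts[old_id] -= 1

    def ratio(self):
        """R_ij^w(I) over the current window, with a small guard on r_j."""
        r_i = self.counts["i"] / max(self.lengths["i"], 1)
        r_j = self.counts["j"] / max(self.lengths["j"], 1)
        return r_i / max(r_j, 1e-9)
```

Because each arrival updates the counts incrementally and evicts at most one old transaction, the per-transaction cost stays constant regardless of how long the streams run.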
Optimization of discriminative itemset mining: The discriminative itemset mining
techniques are defined in general terms; the algorithms can be optimized based on their input
parameters.
The parameters of the discriminative itemset mining algorithms can be defined generally,
or specifically tuned based on the application domain. The parameters have to be set in smart
ways to mine a useful set of discriminative patterns based on their different support and
discrimination levels.
To achieve these goals, surveys of data stream mining, the different
algorithms proposed for frequent itemset mining in data streams and their categories, and
the data mining techniques defined for more than one data stream are conducted. Emerging
pattern mining, as the main contrast data mining task, is explained extensively. The association
classification rule mining methods in data streams are also discussed briefly, by exploring the
association rule mining methods used for association classification. The problems are then
solved step by step and advanced models are presented based on the proposed algorithms. The
objective of the research in this thesis is to develop optimized methods with more concise data
structures and faster processes for finding the discriminative itemsets in data streams. The
outcomes can be used for prediction in data streams by defining discriminative rules
and adapting the algorithms to the application domain. The contributions are highly
significant and original, especially in the proposed algorithms for large and fast-speed data
streams in the tilted-time and sliding window models.
1.3 RESEARCH SIGNIFICANCE
Data streams are used in many real-life applications such as personalization (Tseng and Lin
2006; Djahantighi et al. 2010), anomaly detection (Lin et al. 2010), marketing (Hollfelder, Oria
and Özsu 2000; Huang et al. 2008; Prasad and Madhavi 2012) and forecasting upcoming trends
(Mori et al. 2005). With the huge amount of knowledge present in data streams, it is
crucial to use data mining to gain insights from them. Data streams grow fast, and
describing their current patterns as well as predicting their future patterns are big
challenges (Wang et al. 2003; Zhu, Wu and Yang 2006; Zhang et al. 2010). Compared to
static datasets, data streams are live currents of data, which are useful for online prediction.
The proposed algorithms for pattern mining can be categorized based on online or offline
data streams, different window models and approximation types (Aggarwal 2007; Cheng, Ke and
Ng 2008; Manku 2016). Most of the current pattern mining methods in data streams work
on single data streams. We use the knowledge extracted from comparing data streams to
highlight the differences between one data stream and the others. The existing
classifier methods applied for prediction mining mostly work on single data streams (Wang et
al. 2003; Zhu, Wu and Yang 2006; Zhang et al. 2010). The discriminative itemsets in data
streams can be used with these classifiers by distinguishing between the data streams. The different
behaviours and trends of the data streams are compared for better differentiation of their patterns.
This thesis works on discriminative itemsets in data streams, recommended for use in
discriminative rule mining for a classifier method based on the different patterns of data
streams.
1.4 RESEARCH LIMITATION
One of the major limitations of this research is that no similar methods exist
for working on multiple data streams that can be used as a benchmark. However, as
discriminative itemsets are closely related to frequent itemsets, we extend a modified
version of the FP-Growth method, proposed for frequent itemset mining in a single data stream,
to work on more than one data stream. The efficiency of our proposed advanced algorithms is
then evaluated against this method. The emerging pattern algorithms can be used as
another benchmark for evaluation purposes. However, they mostly work on static
datasets, and their output patterns differ from those of our proposed methods, both fundamentally
and in detail. As a research work close to the proposed methods in this thesis, we
modify DPMiner (Li, Liu and Wong 2007) to include all the 𝛿-discriminative emerging
patterns (the original method discards a large number of emerging patterns as
redundant). We also modify the definition of the discriminative itemsets, and
consequently the proposed methods, to match the 𝛿-discriminative emerging patterns.
The efficiency of the modified algorithms is then evaluated together.
1.5 CONTRIBUTIONS
This thesis has developed approaches for discovering discriminative itemsets in data
streams under different window models. In particular, the contributions of this thesis are to:

- Survey the extensive research on frequent itemset mining in data streams and contrast
data mining to show the importance of discriminative itemsets in data streams in real
applications, and propose discriminative itemsets for the application of classification
in data streams. Specifically, this covers the literature that uses frequent itemset
mining and contrast data mining for the classification of static datasets and data streams.

- Develop a simple single-pass algorithm, DISTree, for mining discriminative itemsets
in data streams. More specifically, this defines discriminative itemset mining in data
streams and provides a simple method for benchmarking against the advanced
proposed methods.

- Develop an advanced and efficient algorithm, DISSparse, for mining discriminative
itemsets in data streams. More specifically, this develops an efficient method
for large, high-speed data streams.

- Develop an algorithm for mining discriminative itemsets in data streams using
novel data structures and the tilted-time window model. In this part, the efficient H-
DISSparse algorithm is developed for historical discriminative itemset mining.

- Develop a novel algorithm for mining discriminative itemsets over data streams
using the sliding window model. The efficient S-DISSparse algorithm is developed
for mining discriminative itemsets in the sliding window model; it can handle
large datasets in an online data stream growing at high speed.

- Show strategies and principles for parameter setting based on data stream characteristics.
Using experiments on a wide range of datasets with different parameter settings,
principles are demonstrated for tuning the algorithms' parameters.
1.6 PUBLICATIONS
M. Seyfi, S. Geva, R. Nayak, Mining Discriminative Itemsets in Data Streams, Web
Information Systems Engineering – WISE 2014, 125-134.
M. Seyfi, R. Nayak, Y. Xu, S. Geva, Efficient Mining of Discriminative Itemsets, Web
Intelligence – WI 2017, ACM: 451-459.
1.7 THESIS OUTLINE
This study is designed to explore the data mining techniques for pattern mining in data
streams. The study will primarily provide extensive research on frequent itemset mining in data
stream, contrast data mining and classification techniques in static datasets and data streams. The
secondary focus is on developing methods for mining discriminative itemsets in data streams.
The third focus of this study is making the developed algorithms efficient for mining
discriminative itemsets in large and fast-growing data streams. Finally, we show strategies and
principles to tune and modify the related parameters such as support threshold and discriminative
level. This thesis is organised in seven chapters, which follow the structure shown in Figure 1.1.
Figure 1.1: Research methodology and thesis structure
In Chapter 2, a review of the relevant literature is presented. It covers the latest
work in the areas of data stream mining and frequent itemset mining in data streams under
different window models. Next, emerging pattern mining and
discriminative item mining in data streams are reviewed, followed by sequential rule mining in
data streams. Finally, the significance of the work and the new knowledge
contributed to the area are discussed, followed by the research gaps and motivations.
In Chapter 3, the problem of mining discriminative itemsets in data streams is formally
defined, and the DISTree method is proposed as an expansion of the FP-Growth method
to more than one data stream. However, this method does not scale to
large datasets, so the DISSparse method is proposed for efficient mining of discriminative
itemsets. The proposed methods address the problem in a way that can be extended to the tilted-
time and sliding window models. This chapter mainly covers the published papers 'Mining
Discriminative Itemsets in Data Streams' and 'Efficient Mining of Discriminative Itemsets'.
In Chapter 4, the H-DISSparse method is presented for mining discriminative
itemsets using the tilted-time window model on offline data streams. The algorithm
shows results with high accuracy and recall, and remains efficient over a range
of large data streams.
In Chapter 5, the S-DISSparse method is presented for mining discriminative
itemsets in the sliding window model, covering both offline and online updating states. This
method processes the input data streams quickly and efficiently. The proposed S-DISSparse
algorithm has efficient time and space complexity and high accuracy on high-speed data
streams.
In Chapter 6, the four proposed algorithms are tested on various synthetic and real
datasets with different characteristics and sizes. The results are analysed, and the efficiency and
accuracy of the methods are discussed over a wide range of input data streams. Principles are
shown for parameter setting based on data stream characteristics.
In Chapter 7, the conclusions and a summary of the key findings drawn from this
thesis are highlighted as the significant contributions. Limitations are also pointed out, and
future work is discussed.
Chapter 2: Literature review
This chapter presents a critical review of the literature essential to addressing the research
questions introduced in Chapter 1. This review presents and analyses current theories and
methodologies that have been used in the relevant research areas. In so doing, a sound argument is
developed to support the research undertaken in this thesis.
This chapter starts with the definition of data streams and the requirements for
developing data stream mining algorithms in Section 2.1. It continues with the
algorithms developed for frequent itemset mining in a single data stream in Section 2.2.
Then, contrast data mining methods dealing with pattern mining in more than one
data stream are discussed, and emerging patterns, a concept close to discriminative
itemsets, are presented in Section 2.3. Data stream description using association rules is
discussed, opening the new research area of discriminative rule mining, in Section 2.4. The
significance of the research and the new knowledge contributed to the research area are discussed
in Section 2.5. Finally, the limitations of current methods are highlighted to support the research
hypothesis in Section 2.6.
2.1 DATA STREAM MINING
A data stream is defined as a continuous sequence of transactions arriving at
high speed over time (Manku and Motwani 2002). Data streams are considered dynamic
datasets, and compared to traditional static datasets they pose more challenges for
processing. First, the volume of a data stream over its lifetime is huge and grows quickly;
it cannot be maintained and processed in main memory, or even in secondary
storage. Second, most data stream applications need quick, real-time answers.
Third, the knowledge embedded in data streams changes over time: as transactions pass,
their information is lost, and the trends shift with concept drifts. Because of
these characteristics, every algorithm designed for data stream processing should satisfy the
following recommended requirements (Garofalakis, Gehrke and Rastogi 2002;
Manku and Motwani 2002; Hyuk and Lee 2003).
Data stream mining algorithms should work with concise, compressed
data structures that fit in main memory. These data structures should monitor the data streams
in the current state and over a defined period of the history. The high-speed nature of data
streams demands that the designed algorithms be faster than the rate of the incoming data:
new transactions have to be processed as quickly as possible and the
results updated rapidly. The algorithms should follow a single-pass design,
as in fast data streams there is usually no time for a second pass, and heuristics should be
defined and used based on the data stream characteristics and the problem
definition. Queries need short response times, since answers may lose relevance
by the time the stream is updated; this emphasises the need for fast update rates.
The data structures used in the algorithms have to be defined and updated over time
as the streams pass and concept drifts occur. The
knowledge embedded in data streams changes through time, and information is lost as
transactions pass. As a consequence, data stream mining algorithms are designed with a
trade-off between their accuracy and their ability to satisfy the requirements discussed. The
discriminative itemsets are a small subset of frequent itemsets. The section below discusses
frequent itemset mining algorithms in data streams and their different categories in detail.
2.2 FREQUENT ITEMSET MINING IN DATA STREAM
In the last decade, frequent itemset mining algorithms working on a single data stream
have attracted much attention (Manku and Motwani 2002; Hyuk and Lee 2003; Wu, Zhang and
Zhang 2004; Yu et al. 2004; Chang and Lee 2005; Leung and Khan 2006; Lin, Hsueh and
Hwang 2008; Cheng, Ke and Ng 2008; Li and Lee 2009; Guo, Su and Qu 2011; Manku 2016).
These studies can be categorized according to their update intervals, window models
and types of approximation (Cheng, Ke and Ng 2008). Depending on the target application
domain, data stream mining algorithms can process the input transactions in bulk, which
is called offline data stream processing, or process each transaction as it is generated through
time, which is called online data stream processing (Manku and Motwani 2002).
2.2.1 Window models and update intervals in data stream
The frequent itemset mining algorithms in data streams are categorized based on the
landmark window model, sliding window model, damped window model and the tilted-time
window model (Aggarwal 2007; Cheng, Ke and Ng 2008) (Figure 2.1 to Figure 2.4).
Figure 2.1: Landmark window model
Figure 2.2: Damped window model
Figure 2.3: Sliding window model
Figure 2.4: Tilted-time window model
In algorithms designed for the landmark window model (Figure 2.1), the
frequent itemsets are extracted and updated from a specified starting point in the data stream
until now. The starting point can be set anywhere in the data stream and is
treated as the beginning of the window, unchanged for the entire period of data
stream mining (Manku and Motwani 2002). The damped window model (Figure 2.2) follows the
landmark model in using a specified start point; however, in this model
the recent input transactions have higher impact. Weighting functions give
higher weight to newly added transactions, and these algorithms
usually apply decaying factors to the old transactions (Chang and Lee 2003).
Compared to the landmark and damped window models with their static start points, the
sliding window model (Figure 2.3) has a dynamic start point that moves with time. This
window has a fixed size, defined either by a number of transactions or by a
time slide. The window size is set by the user based on the application domain, and the
algorithms ignore data that falls outside the window (Chi et al. 2004). In the tilted-time
window model (Figure 2.4), frequent itemsets are maintained in separate historical window
frames of different sizes. The frames belonging to older times cover bigger
time intervals, and those for newer times are set smaller. The reason for this
variation in the window sizes is that users are usually interested in recent data
mining results at fine granularities and in past results at coarse granularities (Giannella et al.
2003).
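The merging behaviour of a logarithmic tilted-time window can be sketched as follows. This is an illustrative toy (counts for a single itemset, a fixed per-level capacity), not the exact FP-Stream frame scheme:

```python
from collections import deque

class TiltedTimeWindow:
    """Sketch of a logarithmic tilted-time window for one itemset.

    Level 0 holds the most recent batch counts; each higher level holds
    counts over intervals twice as long. When a level overflows, its two
    oldest frames are merged and promoted to the next level. The last
    level is left unbounded in this toy version.
    """
    def __init__(self, levels=4, capacity=2):
        self.levels = [deque() for _ in range(levels)]
        self.capacity = capacity

    def add(self, count):
        self.levels[0].appendleft(count)  # newest batch at the front
        for i in range(len(self.levels) - 1):
            if len(self.levels[i]) > self.capacity:
                # merge the two oldest frames into one coarser frame
                old1 = self.levels[i].pop()
                old2 = self.levels[i].pop()
                self.levels[i + 1].appendleft(old1 + old2)

    def total(self):
        return sum(sum(level) for level in self.levels)
```

Note that no counts are lost by merging; only the time resolution of old frames becomes coarser, matching the intuition that old history is queried at coarse granularity.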
Depending on the application, results are updated offline, by processing
every new batch of incoming transactions, or online, by processing every newly arrived
transaction (Aggarwal 2007). Frequent itemset mining algorithms can discover exact
answers or approximate answers. The approximate algorithms are classified as
false-positive or false-negative (Yu et al. 2004; Aggarwal 2007; Cheng, Ke and
Ng 2008). In false-positive methods, depending on the level of user-defined approximation, a number
of non-frequent itemsets may appear in the final results as frequent, decreasing the
accuracy of the algorithm. In false-negative algorithms, all reported answers are frequent;
however, a number of frequent itemsets may be missed, resulting in lower recall. According to
the different surveys, most of the designed frequent itemset mining algorithms are false-positive (Yu et
al. 2004; Aggarwal 2007; Cheng, Ke and Ng 2008; Li and Lee 2009).
2.2.2 Frequent itemset mining algorithms in data stream
In this section the main frequent itemset mining algorithms are discussed briefly with
their distinct characteristics and limitations. Table 2.1 lists the most popular algorithms and
specifies their main features based on the different categories:
Table 2.1: Data stream frequent itemset mining algorithms

Algorithm                               | Window model | Update interval     | Approximation type
Lossy counting (Manku and Motwani 2002) | Landmark     | Offline data stream | False-positive
estDec (Hyuk and Lee 2003)              | Damped       | Online data stream  | False-positive
FP-Stream (Giannella et al. 2003)       | Tilted-time  | Offline data stream | False-positive
FPDM (Yu et al. 2004)                   | Landmark     | Offline data stream | False-negative
Moment (Chi et al. 2004)                | Sliding      | Online data stream  | Exact answers
DSM-FI (Li, Lee and Shan 2004)          | Landmark     | Offline data stream | False-positive
estWin (Chang and Lee 2005)             | Sliding      | Online data stream  | False-negative
DStree (Leung and Khan 2006)            | Sliding      | Offline data stream | Exact answers
VSMTP (Lin, Hsueh and Hwang 2008)       | Tilted-time  | Offline data stream | False-positive
WSW (Tsai 2009)                         | Sliding      | Online data stream  | Exact answers
Lossy counting (Manku and Motwani 2002) is a one-pass algorithm proposed for
finding the frequent itemsets in a data stream. A threshold θ is defined by the
user for the application domain, and the error rate ε is adjusted by a Chernoff bound. The answers
are guaranteed to contain every itemset with frequency at least θ, and to contain no itemset with
frequency less than (θ − ε). Itemsets with frequency between (θ − ε) and θ form the group of
false positives in the answer set. In false-positive algorithms in general, if there is time for a
second pass over the data stream, all the false-positive answers can be eliminated (Aggarwal
2007).
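The bookkeeping of Lossy counting can be sketched for the simpler case of frequent items; the itemset version processes transactions in batches, but it rests on the same counter-plus-maximum-undercount idea:

```python
import math

def lossy_count(stream, theta, eps):
    """Sketch of Lossy counting for frequent items.

    Keeps (count, delta) per item, where delta bounds the undercount from
    pruning. Guarantees: every item with true frequency >= theta * N is
    reported, and no item with frequency < (theta - eps) * N is reported.
    """
    width = math.ceil(1 / eps)        # bucket width
    counts = {}                       # item -> (count, max undercount delta)
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            c, d = counts[item]
            counts[item] = (c + 1, d)
        else:
            counts[item] = (1, bucket - 1)
        if n % width == 0:            # end of bucket: prune low counters
            for it in list(counts):
                c, d = counts[it]
                if c + d <= bucket:
                    del counts[it]
    # report items that may be frequent
    return {it for it, (c, d) in counts.items() if c >= (theta - eps) * n}
```

For example, `lossy_count(['a', 'b'] * 50, 0.4, 0.1)` reports both `'a'` and `'b'`, while a single stray item appended to that stream is pruned or falls below the reporting bound.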
FPDM (Yu et al. 2004) is a false-negative frequent itemset mining algorithm.
Like Lossy counting, it works on the landmark window model over a
data stream. In this method, non-frequent itemsets never appear in the answer set, but
frequent itemsets with frequency higher than the user-defined threshold θ may be missed. The
approximation in the answer set is controlled by a user-defined error rate ε ∈ (0,1) and
reliability δ ∈ (0,1): ε controls the error bound and δ controls the memory
consumption. As in any other frequent itemset mining algorithm, the exponential number of
candidate itemsets is the main problem, and candidate generation is much larger
when mining more than one data stream.
In the landmark window model, the frequent itemsets are mined over the entire data stream,
whereas in the tilted-time window structure they are maintained at different time
granularities of the stream's history (Giannella et al. 2003). In many applications, users are
more interested in time-related trends and changes in frequent itemsets than in the frequent itemsets
themselves (Giannella et al. 2003). Time-sensitive queries for frequent itemsets are answered
from this historical window structure. FP-Stream (Giannella et al. 2003) is a prefix tree
structure based on the FP-Growth method (Han, Pei and Yin 2000). It uses a
logarithmic-scale tilted-time window, as users are usually interested in recent trends over
short-term periods and historical trends over long-term periods. A logarithmic window
is kept for each prefix tree node to record the frequency of the frequent itemsets in the
recent and historical time frames. The FP-Stream method uses the same support threshold in all
the tilted-time windows. It mines the frequent itemsets in the current time frame and
then merges the results by shifting the old frequent itemsets through the tilted-time window frames.
The VSMTP algorithm (Lin, Hsueh and Hwang 2008) extended this method with changeable
support thresholds in its window model.
Moment (Chi et al. 2004) extracts the closed frequent itemsets in a sliding window
over the data stream. The algorithm assumes the size of the sliding window is set to
the number of transactions that fit in main memory. It uses the heuristic that
in most cases the frequent itemsets are the same in successive sliding windows (Cheng, Ke
and Ng 2008): the boundary between frequent and non-frequent itemsets, and also the
boundary between closed frequent itemsets and other itemsets, move very slowly. Instead
of looking for all the closed frequent itemsets in each sliding window, the algorithm mostly
tries to track these boundaries. A closed enumeration tree (CET) data structure, which
fits in main memory, holds the closed frequent itemsets and the itemsets on the
boundary between closed frequent itemsets and the rest. This structure is updated with each new
incoming transaction, so the method is categorized as online data stream processing.
DSM-FI (Li, Lee and Shan 2004) uses a forest of FP-tree-like structures and mines
approximate frequent itemsets in the landmark window model. estWin (Chang and Lee 2005)
is another algorithm based on the sliding window model. All the transactions in the
current window are stored in a CTL data structure, which fills up with each new transaction;
once full, old transactions are replaced by new incoming
ones. The significant frequent itemsets are maintained in a monitoring prefix tree and
updated as the current window changes. Another proposed tree structure is DStree (Leung and
Khan 2006), in which each node maintains a list of frequency counters, one per
batch of transactions. The algorithm is offline: with each new batch of transactions, the window
slides, the transactions in the oldest batch are deleted, and the counters in every affected node are
shifted. This algorithm is designed for exact frequency counting.
estDec (Hyuk and Lee 2003) is an online data stream mining algorithm that focuses
on recent frequent itemsets more than old ones. The weight of old transactions is
decreased by a defined decaying factor as new transactions arrive. The algorithm stores the
frequent itemsets in a lattice and updates the lattice with every new transaction. The frequency of
the itemsets that are subsets of the new transaction changes according to the damped window
model: their frequencies are first decayed and then increased by one. Next,
the subsets of the current transaction that can become frequent are added to the lattice. In WSW (Tsai
2009), users are allowed to define the number of windows, with a size and weight for each.
This can be useful when certain sections of the data stream are considered more important
than others.
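The damped-window update used by estDec-style algorithms can be sketched as a lazily applied decay. This is an illustration of the idea only; a real implementation maintains a lattice and prunes insignificant itemsets:

```python
from itertools import chain, combinations

def subsets(transaction):
    """All non-empty subsets of a transaction, as sorted tuples."""
    items = sorted(transaction)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1))

class DecayedCounter:
    """Sketch of damped-window counting: every stored count decays by a
    factor per arriving transaction, applied lazily when next touched.
    """
    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = {}   # itemset -> (decayed count, tid of last update)
        self.tid = 0

    def update(self, transaction):
        self.tid += 1
        for itemset in subsets(transaction):
            c, last = self.counts.get(itemset, (0.0, self.tid))
            # decay for the transactions that arrived since the last update
            c *= self.decay ** (self.tid - last)
            self.counts[itemset] = (c + 1.0, self.tid)
```

With decay 0.5, two consecutive occurrences of an item give a count of 0.5 + 1 = 1.5: the older occurrence already weighs half as much as the newer one.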
Summary: Based on the discussion in (Aggarwal 2007), the landmark
model is more fundamental than the other window models. In the sliding window model,
the most challenging situation arises when the sliding window does not fit in main memory.
In the damped model, compared to the landmark model, the frequency
of every itemset has to be adjusted with each new transaction, even for itemsets that are not
subsets of that transaction. The algorithms proposed in this thesis are categorized as follows:
an algorithm for mining discriminative itemsets using the tilted-time window model for
offline data streams, followed by an algorithm for mining discriminative itemsets using the
sliding window model in online data streams. We choose the tilted-time window model to obtain
discriminative itemsets at multiple time granularities, and the sliding window model to obtain
discriminative itemsets in the most recent time.
Expanding the current frequent itemset mining algorithms to work on more than one
data stream raises different challenges that lead to inefficiency. These methods are
designed based on either the basic Apriori idea or FP-Growth algorithms (Aggarwal 2007).
Expanding the Apriori-based idea to discriminative itemset mining
is not feasible. Apriori methods work on the principle of generating and testing
(k+1)-candidate itemsets from the frequent k-itemsets. This
process would be time- and space-consuming for discriminative itemset mining, as all of
the candidate frequent itemsets in the target data stream would have to be generated, tested
and saved in data structures, and then checked against the same
itemsets in the other data streams for their frequencies. By definition, discriminative
itemsets are a much smaller percentage of the total frequent itemsets, especially at larger
discriminative level thresholds. This makes a direct expansion of
frequent pattern mining algorithms inappropriate for discriminative itemset mining in more than one data
stream. Frequent itemset mining methods that follow FP-Growth can be expanded
more readily for discriminative itemset mining; however, generating unnecessary combinations
in the prefix tree structures and checking itemsets for frequency in more than one data stream
would incur unreasonable extra time and space usage.
The prefix tree structures used in the FP-Growth algorithms can be modified with
pruning heuristics that skip the generation of frequent itemsets that cannot lead to
discriminative itemsets. The designed algorithms discover the discriminative itemsets in
the current batch with exact accuracy. The approximations in the historical results, introduced
when merging window frames in the tilted-time window model, are set based on user-defined
thresholds, leading to higher time and space complexities, as discussed in the analysis in
Chapter 6. The approximation in the sliding window model algorithm depends on the sub-
divisions of the window size, giving higher accuracy in application domains with fewer
concept drifts over short time periods. These algorithms are categorized as approximate
methods that guarantee high accuracy of the reported discriminative itemsets; losing a
percentage of the answers while merging different time periods causes lower recall.
The discovered discriminative itemsets are proposed for discriminative rule mining
in data streams, for prediction purposes in classification techniques. Discriminative rules can
be defined from the discriminative itemsets as sequential rules in one data stream that
have higher support and confidence than the same sequential rules in the other data
streams. The discriminative itemsets are measured against support and confidence
thresholds to find the set of useful discriminative rules for prediction mining.
Discriminative itemset mining in data streams focuses on differentiating data streams
based on their trends in different window models. This falls within the wider research topic of
contrast data mining, in which differences between datasets are discovered
based on defined thresholds. The section below discusses contrast data mining, the main
research area for discriminative itemset mining, in detail.
2.3 CONTRAST DATA MINING
Contrast mining is a focused data mining research area concerned with discovering interesting
contrast patterns that state significant differences between datasets (Dong and Bailey 2012).
Discriminative itemsets are a kind of contrast pattern. Contrast patterns show non-trivial
differences between datasets. Emerging patterns (Dong and Li 1999) are among the best-
known contrast patterns, and their distinguishing power has been used for building
powerful classifiers (Dong et al. 1999). In emerging patterns, the degree of change in the supports of
itemsets is what matters; the actual supports of the itemsets are not considered. In contrast, the
discriminative itemsets proposed in this thesis focus on the difference in support rather than the
degree of change. We discover the real support of each discriminative itemset and the
relative differences of supports between datasets explicitly, which provides concrete information
to assist users in making the right decisions. Discriminative itemsets of different cardinalities
are useful for building rule-based classifiers with high accuracy. Most importantly, there are too
many emerging patterns with low supports, which may not be interesting. Discriminative
itemsets are fewer in number, as they must first be frequent in the target dataset.
2.3.1 Emerging patterns
Emerging patterns (EPs) (Dong and Li 1999) are similar in concept to discriminative
itemsets. EPs are defined as itemsets whose frequencies grow significantly from one dataset
𝐷1 to another 𝐷2. EPs can highlight emerging trends in time-stamped datasets and
also show differences between data classes. In most applications, EPs with large
supports are mainly folklore (i.e., already known), while EPs with low to medium support, such as
1%-20%, are of interest. For example, purchase patterns in a company that doubled from
last year to this year, even with low supports, can give good insights to domain
experts. In the medical field, EPs with long length and low supports may add new knowledge to
the field: treatment combinations and symptoms with even a small growth rate from patients who
were not cured to those who were cured can suggest good treatments (i.e., if there is no better
plan).
Finding emerging patterns with small supports is challenging because, first, the
useful Apriori property does not hold and, second, there are too many candidate patterns. For
these two reasons, naïve methods are expensive. However, using some nice properties of the
patterns (i.e., set intervals), an efficient mining method was proposed (Dong and Li 1999). A
collection 𝑆 of sets is interval closed if, whenever 𝑋 and 𝑍 are in 𝑆 and 𝑌 is a set with 𝑋 ⊆ 𝑌 ⊆ 𝑍,
then 𝑌 is in 𝑆. The collections of large itemsets for specific given thresholds are interval closed.
The proposed method (Dong and Li 1999), which has been followed in other studies (Zhang,
Dong and Kotagiri 2000; Fan and Ramamohanarao 2002; Bailey, Manoukian and
Ramamohanarao 2002; Alhammady and Ramamohanarao 2005; Loekito and Bailey 2006; Li,
Liu and Wong 2007; Bailey and Loekito 2010; Yu et al. 2013; Yu et al. 2015), works on the
border definition of the maximal (large) itemsets in the first and second datasets. The large
itemsets are represented by their borders: the pair of the set of minimal itemsets and the set of
maximal itemsets. The borders are much smaller than the collections they represent, and the
algorithms are quick even with a large number of emerging patterns. For example, on the
mushroom dataset from the UCI repository, the algorithm finds 228 EPs for a growth rate
threshold of 2.5 (Dong and Li 1999).
For an ordered pair of datasets 𝐷1 and 𝐷2, the growth rate of an itemset 𝑋 from 𝐷1 to 𝐷2 is
defined as in equation (2.1). Given a growth-rate threshold 𝜌, an itemset 𝑋 is a 𝜌-
emerging pattern from 𝐷1 to 𝐷2 if 𝐺𝑟𝑜𝑤𝑡ℎ𝑅𝑎𝑡𝑒(𝑋) ≥ 𝜌. The main interest in emerging pattern
mining is in the degree of change in supports, rather than the actual supports.

GrowthRate(X) = 0,                  if supp1(X) = 0 and supp2(X) = 0;
                ∞,                  if supp1(X) = 0 and supp2(X) ≠ 0;
                supp2(X)/supp1(X),  otherwise.                        (2.1)
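Equation (2.1) translates directly into code:

```python
import math

def growth_rate(supp1, supp2):
    """Growth rate of an itemset from D1 to D2, following equation (2.1)."""
    if supp1 == 0 and supp2 == 0:
        return 0.0
    if supp1 == 0:
        return math.inf        # appears only in D2
    return supp2 / supp1

def is_emerging(supp1, supp2, rho):
    """X is a rho-emerging pattern from D1 to D2 if GrowthRate(X) >= rho."""
    return growth_rate(supp1, supp2) >= rho
```

For instance, an itemset with supports 0.1 in 𝐷1 and 0.3 in 𝐷2 has growth rate 3, so it is a 2.5-emerging pattern even though both supports are low.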
The EPs are identified by extracting the maximal itemsets separately for each dataset
based on the defined frequency thresholds in the first and second datasets. The maximal
itemsets are then reported between the two borders. The borders are defined using the
lowest and the highest thresholds (Dong and Li 1999), as the Left and Right borders, denoted
<Left, Right>. Working on these two borders limits the answers to the area between <L, R>,
representing a large collection of itemsets; the idea is to decrease the number of candidate itemsets
by defining only the borders of the maximal itemsets. For every EP, the frequency growth lies
between the given borders. Emerging pattern mining methods need to check frequency or frequency
change, but do not care whether the pattern is frequent or not. The discriminative itemset
methods proposed in this thesis first find frequent itemsets and then compare the difference of
the frequencies of the same itemsets in the two datasets.
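To make this contrast concrete, a minimal sketch of such a check is shown below. The formal definition of discriminative itemsets appears in Chapter 3; the parameter names here are assumptions for illustration only:

```python
def is_discriminative(freq_target, freq_other, n_target, n_other,
                      min_support, disc_level):
    """Illustrative check only (the formal definition is in Chapter 3):
    an itemset must first be frequent in the target data stream, and then
    its support must exceed its support in the other stream by a
    discriminative level. Parameter names are hypothetical.
    """
    supp_target = freq_target / n_target
    supp_other = freq_other / n_other if n_other else 0.0
    if supp_target < min_support:      # unlike EPs, must be frequent first
        return False
    return supp_target >= disc_level * supp_other
```

For example, an itemset seen 20 times in 100 target transactions and 5 times in 100 other transactions passes with `min_support=0.1` and `disc_level=2.0`, whereas an itemset with target support 0.05 is rejected outright, however large its relative growth.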
As shown in Figure 2.5, the emerging pattern mining method defines two minimum support
thresholds 𝜃min and 𝛿min for an itemset 𝑋 in the two datasets 𝐷1 and 𝐷2, and extracts the maximal
itemset boundary between these two maximal itemsets (triangle ACE). This area is separated into
three domains: triangle ABG, rectangle BCDG and triangle GDE, respectively. The EPs
extracted from these three areas are merged together as the final result (Dong and Li 1999).
Figure 2.5: Support plan for emerging patterns (Dong and Li 1999)
The EPs in the BCDG rectangle are precisely those itemsets whose support in 𝐷2 is ≥ 𝜃𝑚𝑖𝑛
but in 𝐷1 is < 𝛿𝑚𝑖𝑛. The proposed method in (Dong and Li 1999) uses the borders of
𝐿𝑎𝑟𝑔𝑒𝛿𝑚𝑖𝑛(𝐷1) and 𝐿𝑎𝑟𝑔𝑒𝜃𝑚𝑖𝑛(𝐷2), instead of the collections 𝐿𝑎𝑟𝑔𝑒𝛿𝑚𝑖𝑛(𝐷1) and
𝐿𝑎𝑟𝑔𝑒𝜃𝑚𝑖𝑛(𝐷2) themselves, as inputs of the algorithm. The algorithm derives the emerging
patterns by manipulating only the two borders and producing the border representation of the
derived EPs. The algorithm avoids generating a high number of candidate itemsets and
avoids printing a large number of EPs.
The EPs in the GDE triangle are from candidate itemsets whose support in 𝐷1 is ≥ 𝛿𝑚𝑖𝑛 and
in 𝐷2 is ≥ 𝜃𝑚𝑖𝑛. These candidates are exactly from 𝐿𝑎𝑟𝑔𝑒𝛿𝑚𝑖𝑛(𝐷1) ⋂ 𝐿𝑎𝑟𝑔𝑒𝜃𝑚𝑖𝑛(𝐷2). The
approximate size of the intersection can be estimated by checking the border description. When this
intersection is small, the EPs can be found easily by checking the support of all the candidates in
the intersection. When this intersection is large, it is recursively divided into a
new rectangle and a new triangle, and the border algorithm is applied to the new rectangle until
approximately all the emerging patterns are found. Finding the EPs in the ABG triangle is a hard
task, as there may be many EPs in this area with small support in 𝐷1 or 𝐷2 or both. These are the
large number of itemsets with 𝑠𝑢𝑝𝑝(𝑋) < 𝛿𝑚𝑖𝑛 or 𝑠𝑢𝑝𝑝(𝑋) < 𝜃𝑚𝑖𝑛. Generally, finding all the
EPs is very challenging and algorithms mainly look for a way to show the best approximations.
The key point of using borders is to efficiently represent large collections of itemsets as
emerging patterns. Each interval-closed set of emerging patterns has a unique border <L, R>,
where L is the collection of minimal itemsets and R is the collection of maximal itemsets. The
method uses Max-Miner (Roberto J. Bayardo 1998) for discovering the maximal borders. The
differential procedure is called Border-Diff, and the main algorithm is MBD-LLborder, which uses
Border-Diff as a subroutine (Dong and Li 1999). The algorithm calls Border-Diff multiple
times to derive the EPs.
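The core task of Border-Diff, recovering the minimal itemsets covered by one maximal itemset but by none of the others, can be illustrated with a brute-force sketch (a drastic simplification of the actual procedure in Dong and Li 1999; names are ours):

```python
from itertools import combinations

def border_diff(U, Rs):
    """Minimal subsets of U not contained in any R_i, i.e. the Left border of
    the interval [{}, U] minus the union of the intervals [{}, R_i].
    Brute force over subsets of U in increasing size, for illustration only."""
    U = frozenset(U)
    Rs = [frozenset(r) for r in Rs]
    minimal = []
    for k in range(1, len(U) + 1):
        for cand in combinations(sorted(U), k):
            c = frozenset(cand)
            if any(c <= r for r in Rs):
                continue                  # still covered by some R_i
            if any(m <= c for m in minimal):
                continue                  # a smaller uncovered subset exists
            minimal.append(c)
    return minimal
```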
Considering the large number of emerging patterns and multiple applications and dataset
types, different algorithms use borders for mining different types of emerging patterns. The
different types of emerging patterns are explained in the section below.
2.3.1.1 Different types of emerging patterns
The different types of emerging patterns have been defined (Dong and Bailey 2012) for
capturing the changes or differences between datasets. Jumping emerging patterns (JEPs)
(Bailey, Manoukian and Ramamohanarao 2002) are a special type of emerging patterns: the
itemsets whose support increases from zero in one dataset to non-zero in another dataset. The
jumping emerging pattern (JEP) growth rate must be infinite (i.e., it is present in one dataset and
absent in another). The jumping emerging patterns are useful as a means of discovering inherent
distinctions that exist amongst datasets. The proposed method in (Bailey, Manoukian and
Ramamohanarao 2002) uses a prefix-tree structure as applied in (Han, Pei and Yin 2000) for
frequent itemset mining. However, there are significant differences in the kinds of tree shapes
that lead to efficient mining.
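The defining JEP condition, zero support in one dataset and non-zero in the other, can be sketched by brute-force enumeration (illustrative only; the function name and the `max_len` cap are ours, not from the cited method):

```python
from itertools import combinations

def jumping_emerging_patterns(d1, d2, max_len=2):
    """Itemsets (up to max_len items) absent in d1 but present in d2,
    so their growth rate from d1 to d2 is infinite. Brute-force sketch."""
    def itemsets(transactions):
        found = set()
        for t in transactions:
            items = sorted(set(t))
            for k in range(1, min(max_len, len(items)) + 1):
                found.update(combinations(items, k))
        return found
    return itemsets(d2) - itemsets(d1)
```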
The mining of emerging patterns is harder than that of frequent itemsets. The Apriori
property does not exist for JEPs and thus the algorithm has greater complexity. The JEPs can be
more efficiently discovered than general emerging patterns and are useful in building strong
classifiers (Li, Dong and Ramamohanarao 2001). Since emerging pattern mining deals
with several classes of data, each node in the prefix tree structure must hold the frequency of the
itemset for each class. Multiple transactions sharing the same itemset prefix are compressed and
merged into individual nodes with increased counts. The items in the prefix
tree are ordered by the inverse ratio tree ordering. The intuition for this ordering is that JEPs reside
much higher up in the tree than they would under the descending-frequency ordering of frequent
pattern trees (Han, Pei and Yin 2000). This helps to limit the depth of branch traversals needed to
mine emerging patterns.
The method uses the core Border-Diff function from (Dong and Li 1999) to return
the set of JEPs. The initially constructed prefix tree has a null root node, with each child node
being the root of a subtree named a component tree. Each component tree is traversed downward
along each of its branches. The nodes which contain a non-zero counter for the class for which the
JEPs are being discovered, and zero counters for every other class, are called base nodes. The significance of
these nodes is that the itemset spanning from the root of the branch to the base node is unique to
the class being processed. This itemset is therefore a potential JEP and hence any subset of this
itemset is also potentially a JEP. After identifying a potential JEP, it gathers up all negative
transactions that are related to it (i.e. share the root and base node). Border-diff is then invoked to
identify all actual JEPs contained within the potential JEP. After examining all branches for a
particular component tree, the branches are inserted into the remaining component trees after
removing the initial node of each. The component trees are traversed from the leftmost component
tree to find children as base nodes for potential JEPs. The method examines every potential JEP with
reference to every component tree (and thus every item in the problem), to ensure completeness.
The number of component trees is equal to the number of unique items in the datasets.
The ConsEPMiner (Zhang, Dong and Kotagiri 2000) discovers the emerging patterns
based on two types of constraints, namely external constraints and inherent constraints. The
external constraints are user-given minimums on support, growth rate and growth-rate
improvement. The support and growth rate directly prune the search space. The growth-rate
improvement is used to eliminate the uninteresting emerging patterns. A positive growth-rate
improvement ensures a concise representation of EPs which are not subsumed by previously
discovered patterns. The inherent constraints, namely same subset support, top growth rate and same
origin, are derived from dataset characteristics and properties of EPs. The inherent constraints
are also used for further pruning the search space. The search framework is made of a set
enumeration tree (i.e., SE-tree) which enumerates all the itemsets in breadth-first search. The two
types of constraints are used for limiting the search space by skipping a large number of itemsets.
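A sketch of how the external constraints might be checked for a candidate pattern (the exact form of the growth-rate-improvement test in ConsEPMiner is paraphrased here as the gain over the best subset's growth rate; all names are ours):

```python
def passes_external_constraints(supp: float, gr: float, best_subset_gr: float,
                                min_supp: float, min_gr: float,
                                min_gr_imp: float) -> bool:
    """Check a candidate pattern against user-given minimums on support,
    growth rate, and growth-rate improvement over its best subset so far."""
    return (supp >= min_supp
            and gr >= min_gr
            and gr - best_subset_gr >= min_gr_imp)
```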
Essential jumping emerging patterns (eJEPs) (Fan and Ramamohanarao 2002) are the
itemsets whose support grows from zero in one dataset to higher than a specified threshold in
another dataset, with no subset that is an essential jumping emerging pattern. eJEPs are minimal
itemsets. If the two data classes can be distinguished using fewer attributes, using more may not be
useful and may even add noise. In (Fan and Ramamohanarao 2002) the eJEPs are discovered
using a pattern tree (P-tree). A P-tree is an extended prefix-tree structure storing the quantitative
information about eJEPs. The count of each item for each dataset is registered, and items with larger
support ratios, which are closer to the root, make up the eJEPs. By depth-first search from the root,
the algorithm always finds the shortest one first. This process is completely different from FP-Growth
(Han, Pei and Yin 2000) based methods. It merges nodes during the search to ensure the complete set
of eJEPs is generated. The pattern growth is achieved by concatenating the prefix pattern with
the new one at a deeper level. As the algorithm is interested in the short eJEPs, the depth of the
search is not very deep (i.e., normally 5 to 10). More accurate classifiers are made out of the
smaller eJEPs compared to JEPs.
In (Bailey and Loekito 2010) a method is proposed for mining contrast patterns in
changing data based on the old and the current parts of a data stream. The method is focused on
jumping emerging patterns as a special type of contrast patterns. The minimal JEPs are discovered
in the data stream by adding new transactions and deleting the old transactions. This is different from
the problem of mining discriminative itemsets in data streams in this thesis, as the contrast
patterns are discovered in the old part (i.e., old class) compared to the recent part (i.e., recent
class) of a single data stream. The discriminative itemsets proposed in this thesis are discovered
in multiple data streams changing at the same time.
2.3.1.2 Delta-discriminative emerging patterns
The pattern mining based on support and confidence of itemsets may bring statistical
flaws to the results. The delta-discriminative itemsets are itemsets with ranked statistical merits
under different test statistics such as chi-square, risk ratio, odds ratio, etc. The 𝛿-discriminative
emerging patterns (Li, Liu and Wong 2007) are determined based on a threshold 𝛿. The
DPMiner algorithm (Li, Liu and Wong 2007) can efficiently mine the 𝛿-discriminative emerging
patterns. The algorithm skips the subsets of frequent itemsets if their support in the general dataset
is larger than 𝛿. However, for the discriminative itemsets proposed in this thesis, a subset of a
non-discriminative itemset can be discriminative. The algorithm also skips the redundant itemsets,
defined as the supersets of discriminative itemsets with the same or smaller infinite ratio between the
supports in the target and general datasets. An itemset with infinite ratio has high frequency in one
dataset and zero or very low frequency in the other datasets. Similar to the discriminative
itemsets, the delta-discriminative emerging patterns must be frequent in the target dataset
(positive data class), i.e., 𝑓𝑖 > minimum support, but with frequency less than delta in the
general dataset, i.e., 𝑓𝑗 < 𝛿, where 𝛿 is usually a small integer such as 1 or 2. The Discriminative
Pattern Miner (DPMiner) algorithm (Li, Liu and Wong 2007) efficiently discovers the delta-discriminative
emerging patterns having the maximum frequency in the contrasting classes (Dong
and Bailey 2012).
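The δ-discriminative condition reduces to a simple frequency test; a sketch (the function name is ours, with `f_target` and `f_general` standing for the raw counts 𝑓𝑖 and 𝑓𝑗 in the target and general datasets):

```python
def is_delta_discriminative(f_target: int, f_general: int,
                            min_support: int, delta: int) -> bool:
    """Frequent in the target (positive) class and nearly absent in the
    general class: f_i > minimum support and f_j < delta (delta usually 1 or 2)."""
    return f_target > min_support and f_general < delta
```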
The DPMiner algorithm is proposed based on the concept of equivalence classes. An
equivalence class (EC) is a set of itemsets that always occur together in the same set
of transactions. The equivalence class can be uniquely defined based on a closed pattern and a set
of generators. The closed pattern is a frequent itemset with no proper superset of the same
frequency (Zaki and Hsiao 2002). The generators are the itemsets with smaller frequency
compared to every immediate proper subset (Li et al. 2006). The generators are the minimal
itemsets of the equivalence class and the closed pattern is the maximal itemset. The key
idea in the DPMiner algorithm is to mine the concise representation of the equivalence classes. The
DPMiner employs the 𝛿 constraint to reduce the pattern search space by setting a border of non-𝛿-
discriminative emerging patterns. If an EC is 𝛿-discriminative and non-redundant, then none of
its subset ECs can be 𝛿-discriminative and none of its superset ECs can be non-redundant. The
DPMiner efficiently mines the delta-discriminative emerging patterns by skipping the subsets of
itemsets with 𝑓𝑗 > 𝛿.
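The grouping of itemsets into equivalence classes by their transaction sets, with the closed pattern as the maximal member and the generators as the minimal members, can be sketched by brute force (illustrative only; DPMiner mines the concise representation directly rather than enumerating all itemsets, and the names here are ours):

```python
from itertools import combinations

def equivalence_classes(transactions, max_len=3):
    """Group itemsets by the exact set of transactions (tidset) they occur in.
    Within each class, the closed pattern is the unique maximal itemset and
    the generators are the minimal itemsets. Brute-force sketch."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        items = sorted(set(t))
        for k in range(1, min(max_len, len(items)) + 1):
            for sub in combinations(items, k):
                tidsets.setdefault(sub, set()).add(tid)
    classes = {}
    for itemset, tids in tidsets.items():
        classes.setdefault(frozenset(tids), []).append(itemset)
    result = []
    for tids, members in classes.items():
        closed = max(members, key=len)           # maximal member of the EC
        gens = [m for m in members               # members with no proper
                if not any(set(o) < set(m) for o in members)]  # subset in EC
        result.append({"closed": closed, "generators": gens,
                       "support": len(tids)})
    return result
```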
The δ-discriminative equivalence classes often have a value of infinity under relative
support (i.e., (𝑓𝑖/𝑛𝑖)/(𝑓𝑗/𝑛𝑗) = (𝑛𝑗 ∗ 𝑓𝑖)/(𝑛𝑖 ∗ 𝑓𝑗) = ∞), which is used in many statistical tests.
An equivalence class is redundant if its closed itemset is a superset of the closed itemset of
another EC, as all its itemsets are already subsumed by the subset closed itemset. This
emphasizes that only the most general δ-discriminative equivalence classes are non-redundant.
In the depth-first search the algorithm
stops the pattern mining along the branch whenever it reaches the δ-discriminative equivalence
class. This ignores the low-ranked equivalence classes as redundant. DPMiner is efficient as it
integrates the mining of closed patterns and generators into one depth-first framework.
The definition of the δ-discriminative equivalence classes is close to that of the jumping
emerging patterns (JEPs), but different in concept. A JEP is an itemset which occurs in one data
class and never occurs in the other classes. A δ-discriminative equivalence class is presented as a
group of itemsets, while a JEP is presented as a single itemset. The DPMiner deletes the redundancy
within an equivalence class, while redundancy exists in the JEPs. Also, the δ-discriminative
equivalence class must be frequent, but JEPs can be infrequent.
The CDPM method (Conditional Discriminative Patterns Mining) (He et al. 2017) is
also proposed for discovering a set of significant non-redundant discriminative patterns which
have no similar discrimination in their subsets. The discriminative itemsets whose frequency
difference in the two datasets (or two data classes) is greater than the defined significance threshold
are discovered. Out of these significant discriminative itemsets, only the itemsets whose frequency
difference in all their subsets is greater than the defined local significance threshold are identified.
The CDPM mainly focuses on the effectiveness of the patterns and not on the efficiency of the
algorithm in terms of time and space usage. It discovers the small number of patterns that have
discriminative power which cannot be obtained from their subsets (i.e., subsets of a conditional
discriminative pattern do not have the same discriminative power as the pattern does). The DPMiner (Li,
Liu and Wong 2007) and CDPM (He et al. 2017) discover the discriminative itemsets based on
their statistical measures. Discovering a small set of conditional discriminative patterns without
redundancy is preferable to generating all patterns. However, subsets of non-conditional
discriminative patterns could be discriminative, but will be missed by the CDPM algorithm. The
proposed methods in this thesis discover a complete set of discriminative itemsets based on their
explicit supports in the datasets and not on statistical measures. These are useful for the
applications using support and confidence for pattern mining. The discriminative itemsets are
used for mining discriminative rules based on the defined support threshold and confidence
threshold. The discriminative rules are then used for making rule-based classifiers.
Summary: The δ-discriminative emerging pattern mining skips the subsets of itemsets if their
support in the general dataset is larger than 𝛿. However, for the discriminative itemsets proposed
in this thesis, a subset of a non-discriminative itemset can be discriminative. The δ-
discriminative emerging pattern mining also ignores the low-ranked equivalence classes as redundant.
However, for the discriminative itemsets proposed in this thesis, explicit relative supports are
provided for every discriminative itemset. This supports users with more information for data
stream description and decision making. In the section below the detailed differences between
emerging patterns and discriminative itemsets are discussed.
2.3.1.3 Differences between emerging patterns and discriminative itemsets
The discriminative itemset mining is a novel research problem by mining a complete set
of the itemsets which are frequent in the target dataset based on the support threshold, and are
discriminative in the target dataset compared to the general dataset based on discriminative level
threshold. Despite the similarities between the definition of discriminative itemsets and Emerging
Patterns (EPs), they are different in several ways;
Firstly, in EPs the degree of change in supports of itemsets is important, and the actual
support of itemsets is not considered (Dong and Li 1999). EPs can be infrequent, which could
result in many EPs with low supports. In contrast, discriminative itemsets have to be frequent in
the target dataset based on the support threshold, and be discriminative in the target dataset
compared to the general dataset based on discriminative level.
Secondly, the proposed algorithms for emerging patterns are mostly focused on
representing these patterns in a compact way to avoid examining all the possible itemsets. The
real supports of the EPs are not explicitly presented as they are reported in a group between two
borders using the maximal itemsets. For the purpose of comparison between data streams, the
frequencies of both the highly and the lowly discriminative itemsets have to be known.
Thirdly, the EPs algorithms are mainly proposed for static datasets, except a small
number of works in stream mining (Alhammady and Ramamohanarao 2005; Bailey and Loekito
2010) based on the same idea of border definition. The method of (Alhammady and Ramamohanarao
2005) discovers the EPs in each block of transactions and then discards the block from the process.
Based on the defined minimum thresholds for each dataset, the maximal itemsets are extracted
separately and the borders are then defined. This border definition is not useful for data streams,
as the algorithms designed for data streams have to process the streams in one scan.
Fourthly, the EP methods generate all the combinations of the itemsets, which is not
suitable for discriminative itemset mining methods that aim to generate only the potential
combinations of itemsets. Also, compared with the proposed EP methods based on candidate
itemset generation, a discriminative itemset mining method is more efficient if designed based on
FP-Growth (Han, Pei and Yin 2000).
The 𝛿-discriminative emerging patterns proposed in (Li, Liu and Wong 2007) are
determined based on a threshold 𝛿. The DPMiner algorithm in (Li, Liu and Wong 2007) can
efficiently mine the 𝛿-discriminative emerging patterns by skipping the subset of itemsets if their
support in the general dataset is larger than 𝛿. The discriminative itemsets discussed in this thesis
are relatively discriminative in the target dataset compared to the general dataset. A subset of a
non-discriminative itemset can be discriminative. The delta-discriminative emerging patterns
therefore do not include some of the useful emerging patterns compared to the discriminative
itemsets defined in this thesis.
Discriminative itemsets which are frequent in both the target dataset and the general dataset,
but whose frequencies are relatively different in the two datasets, are not discovered as δ-
discriminative emerging patterns; for example, market basket itemsets that are frequent in all
suburbs but have relatively higher frequency in a target suburb with a smaller average age of
population. The DPMiner skips the redundant itemsets, i.e., supersets of itemsets with the
same infinite ratio between the supports in the target and general datasets. The discriminative
itemsets, in contrast, are discovered with explicit supports in the datasets and without redundancy; for
example, supersets of discriminative itemsets with different cardinalities are also discovered as
discriminative itemsets.
Summary: In the discriminative itemsets proposed in this thesis, the frequencies of
discriminative itemsets are derived and explicitly provided to the user together with the patterns.
The significance of discriminative itemset mining in comparison to the emerging pattern mining
is in the explicit cardinality of the itemsets in the datasets. Every discriminative itemset is
reported with its real supports in the target dataset and general dataset, respectively. Accurate
classification techniques can be defined based on the discriminative rules which are extracted
from the discriminative itemsets.
The proposed algorithms for emerging patterns are mostly focused on representing these
patterns in a compact way to avoid examining all the possible itemsets. Therefore, in the
proposed methods for discriminative itemset mining, the combination of itemset generation in an
FP-Tree should be restricted by defining useful heuristics based on both the general and the
specific features of the target data streams. The data streams should be scanned together and only
once. In contrast to the EP methods, the difference between very highly discriminative
itemsets and other discriminative itemsets is observable through the exact support values of the
itemsets, using the definitions of the support threshold and discriminative level.
2.3.2 Other contrast patterns
In other related research, the HFP-Tree method (Zhu and Wu 2007) is proposed for
discovering relational patterns across multiple databases based on the desired queries. The
relational patterns hold specific relationships that exist between the itemsets in the datasets. This
method finds the desired patterns based on the defined queries; for example, itemsets with
frequency higher than a specific threshold in datasets A and B. This is followed by the H-Stream
(Guo et al. 2011) for mining frequent patterns across multiple data streams in the tilted-time
window model. The H-Stream is an offline method based on FP-Growth (Han, Pei and Yin
2000) that does not restrict the generation of non-potential itemset combinations, which is time
and space consuming for large and fast-growing data streams. There are also research works
following the discriminative sequential changes (Patel, Hsu and Lee 2011; Gao et al. 2016) for
data stream differentiation. Contrast subspace mining has been proposed recently (Duan et al. 2014;
Duan et al. 2016) for discovering the subspaces that maximize the likelihood ratio of a query
in one class against another class; for example, patients with symptoms most similar to the cases of
disease A, and at the same time dissimilar to the cases of disease B.
In the section below, discriminative itemset mining in data streams and its challenges,
usefulness and applications are discussed.
2.3.3 Discriminative itemset mining in data streams
Discriminative pattern mining is a new research area in data stream mining (Lin et al.
2010). The discriminative itemsets are those itemsets that are frequent in the target data stream
and their frequencies in that stream are much higher than other data streams in the application
domain. Compared to the frequent itemsets and sequential patterns, the discriminative itemsets
are more distinctive and more directed for the purpose of comparison between data streams and
carry more valuable information. They can highlight the differences between patterns and trends
of different data streams more clearly and distinguish their major trends and be more useful for
the purpose of prediction mining. The discriminative itemsets can also be discovered for the
purpose of anomaly detection and personalization in the web applications. The generally frequent
itemsets are not distinctive (Lin et al. 2010) and do not suffice for their intended purposes.
In contrast to the numerous algorithms proposed on mining frequent itemsets in single
data streams, there has not been much research done for pattern mining in more than one data
stream. In (Amagata and Hara 2017) a method is proposed for mining top-k closed co-occurrence
patterns across multiple streams using the sliding window model. Pattern mining in more than
one data stream has the following challenges. Compared to the frequent itemset mining
techniques, discriminative itemset mining does not follow the Apriori property, and a subset
of a discriminative itemset can be non-discriminative. The exponential number of itemset
combinations demands high time and space complexities when the frequent itemset mining
algorithms are used on large and fast-growing data streams (Cheng, Ke and Ng 2008). The
designed methods should deal with the exponential number of itemset combinations generated in
more than one data stream, testing them against the discriminative itemset criteria while gaining
results efficiently.
Despite these challenges, discriminative itemset mining is an emerging research area.
Summary: One of the interesting real-world scenarios is the dynamic tracing of transactions
in market basket datasets. The itemsets that occur more frequently in one market compared to the
other markets are of interest. These are useful for identifying the customers or groups of customers
who have high interest in specific items compared to the rest of the population.
itemsets are useful for personalization or anomaly detection as well. Considering only the
frequent itemsets in data streams is not distinctive enough for these purposes, as they may be
generally frequent in other data streams as well (Lin et al. 2010). There are many other examples
that can show the usefulness and significance of discriminative pattern mining in data
streams. In network traffic monitoring, the discriminative patterns show the activity sets that
are more prominent for specific users in comparison to the rest of the group activities in the
whole network. This information is meaningful for anomaly detection. In another example, the
discriminative itemsets can be effectively used in search engines and news delivery services
for the purpose of personalization (Lin et al. 2010).
The essential issue in the above example applications is to find the itemsets that can
distinguish the target stream from all other streams. There is not much research done in pattern
mining in more than one data stream. A couple of methods have followed discriminative
item mining in data streams (Lin et al. 2010; Seyfi 2011), with challenges for expansion to
discriminative itemset mining. Another related research area, as explained above, is Emerging
Patterns (EPs) (Dong and Li 1999) with a close definition to the definition of discriminative
itemsets. The emerging patterns are described as itemsets whose frequencies grow significantly
higher in one dataset in comparison to another one.
2.3.4 Discriminative item mining in data streams
Discriminative items are the items that are frequent in the target data stream and
infrequent in the other data streams, which are collectively defined as the general data stream. In
(Lin et al. 2010) three different methods have been proposed for mining discriminative items in
data streams, namely the frequent-item-based method, the hash-based method and the hybrid method.
In the frequent-item-based method the Space-Saving algorithm (Metwally, Agrawal and
El Abbadi 2005) is used separately on each data stream and the results are then combined to find
the discriminative items. In this method the recall is highly dependent on the minimum support
error. However, because the frequent items are discovered separately in each data stream and the
comparison is done only after processing both data streams, many non-discriminative items may
be counted, i.e., frequent items in the target data stream 𝑆1 which are frequent in the general data
stream 𝑆2 as well. Applying the same method to discriminative itemset mining would cause the
same issue, as many of the frequent itemsets are frequent in both streams, but with a bigger
challenge because of the combinatorial itemset generation in multiple data streams.
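The counter-replacement rule at the heart of the Space-Saving algorithm (Metwally, Agrawal and El Abbadi 2005) can be sketched as follows (a minimal version; the per-counter error bookkeeping of the original is omitted):

```python
def space_saving(stream, k):
    """Space-Saving sketch: keep at most k counters. An untracked item
    evicts the current minimum counter and inherits its count plus one,
    an overestimate bounded by that minimum."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)  # smallest counter
            count = counters.pop(victim)
            counters[item] = count + 1                # inherit and bump
    return counters
```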
In the hash-based method, the two data streams are processed together at the same time, so if one
item is found to be very frequent in the second data stream, its counting is stopped, saving
effort in both data streams. In this approach, items are assigned to different groups
named buckets, and the frequencies of all the items in one bucket are counted together using the
same counter. The buckets are expanded if they reach the discriminative bucket threshold, and the
frequencies of the smaller groups of items are then counted together. The issue in this method is that a
number of the discriminative items may be buried in non-discriminative buckets because of the
group frequency counting in each bucket and the effect of the other items' frequencies in the general
data stream. This results in lower recall for this method, as the discriminative items in the
potential but not expanded higher-level buckets are lost. The same issue would arise for
discriminative itemsets.
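A toy illustration of the shared-bucket idea (not the exact structure of Lin et al. 2010; the hash assignment and the expansion condition here are simplified assumptions of ours):

```python
def bucket_counts(stream_pairs, num_buckets):
    """Items from both streams share bucket counters: each (item, stream_id)
    event updates the bucket's per-stream count. Integer item ids keep the
    example deterministic across runs."""
    buckets = [[0, 0] for _ in range(num_buckets)]
    for item, stream_id in stream_pairs:
        buckets[hash(item) % num_buckets][stream_id] += 1
    return buckets

def potential_buckets(buckets, min_count, ratio):
    """Buckets whose aggregate counts meet a simplified discriminative
    condition become candidates for expansion into finer-grained buckets."""
    return [i for i, (c1, c2) in enumerate(buckets)
            if c1 >= min_count and c1 >= ratio * max(c2, 1)]
```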
The hybrid method tries to discover the concealed discriminative items in the non-discriminative
but potential buckets, improving the performance of the hash-based structure. For this
purpose, using a number of space-saving counters in the target stream, a summary of every
bucket is saved, which can be referred to as a sub-stream of the main stream. Items are assigned
to different expandable buckets and all the items in one bucket are counted together using the
same counter. Using this hybrid structure, in the current bucket, the frequent items of the target
data stream are found and by expanding that bucket, there would be a good chance to find the
discriminative items concealed in the non-discriminative but potential buckets. However, having
frequent items in the current bucket would not be a good heuristic for bucket expansion as the
frequent items may be frequent in the general data stream 𝑆2, as well (Seyfi 2011). The hybrid
method (Lin et al. 2010) is also designed for mining the discriminative items in the landmark
window model. The two data streams are processed together and the items are assigned to the
different buckets. In each bucket, frequency of items are counted by the same counter using the
space-saving algorithm (Metwally, Agrawal and El Abbadi 2005). Buckets are only expanded if
they meet the discriminative bucket conditions. Extending this method to other types of window
models will face different challenges in keeping the window model updated as items come in or
go out of the window frame.
Compared to discriminative itemset mining, these methods have simplicity due to not
dealing with combinatorial explosion. The process of discriminative itemset mining with these
methods would be time and space consuming, especially when dealing with large data streams. All
of the three mentioned methods have been designed for pre-defined thresholds and it is not
possible to change the thresholds after setting them up for the first time. These methods only
generate the discriminative items and not the itemsets.
The hybrid structure could possibly be expanded for discriminative itemset mining by
assigning several itemsets to the same bucket using an improved hash function. However, it will
face major challenges such as the very large exponential number of itemsets generated for each new
transaction, which will require a huge hybrid structure and a very complex hashing process for
assigning the best sets of itemsets to each bucket. This would not be a meaningful expansion, and
updating and restructuring the hybrid structure for the large number of itemsets generated for
window updating will be a complex process. If the hash assignment functions are not employed
properly, the number of concealed discriminative itemsets will increase, resulting in low recall.
The hybrid method also shows the approximate frequencies of the itemsets and the efficiency of
this method highly depends upon the hashing function and the group of itemsets assigned to the
Chapter 2: Literature review Page 31
© 2018 Queensland University of Technology-QUT, Science and Engineering Faculty Page 31
same bucket. Also, because the itemsets in each bucket are counted together, there would be no
separate frequency for every itemset, which is necessary for the historical tilted-time window
model: in the tilted-time window model, the frequency of each discriminative itemset must be
added to the historical windows.
The hierarchical counter approach (Seyfi 2011) is proposed for identifying the exact
frequencies of all items, including the infrequent items, in the datasets and then discovering the
discriminative items. The hierarchical counters hold real values, and each decimal position (a
different power of 10) is used for holding the frequency of one item. When the frequency of an
item passes 9, that digit is reset to zero and the corresponding counter at the next level of the
hierarchy is incremented; the same process applies to all counters in the hierarchy. In contrast
to the hybrid method (Lin et al. 2010), different thresholds can be used at different times. The
unique feature of this method is its counter structure, which appears to be more space efficient
than ordinary counters. As with the hybrid method, this method could also be extended to
discriminative itemset mining using the FP-Growth method, by appointing each group of
itemsets to a hierarchy of counters, and a better approximation in the results seems achievable
with this type of counter. However, defining the hierarchical counter structure for itemsets is
not feasible because of the explosion in the number of itemset combinations in discriminative
itemset mining. Also, appointing and tuning the hierarchical counters for the historical
tilted-time or real-time sliding window models would have high time complexity.
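The digit-packing idea behind the hierarchical counters can be sketched roughly as follows. This is a simplified reconstruction from the description above, not the author's exact data structure; the class name and the fixed number of levels are illustrative assumptions.

```python
class HierarchicalCounter:
    """Pack one decimal digit per item into a shared integer counter.

    The digit at position 10**item of levels[0] holds the units of that
    item's count; when it would pass 9 it resets to 0 and carries into
    the same position of levels[1] (the tens), and so on up the hierarchy.
    """
    def __init__(self, levels=3):
        self.levels = [0] * levels

    def increment(self, item):
        level = 0
        while level < len(self.levels):
            digit = (self.levels[level] // 10 ** item) % 10
            if digit < 9:
                self.levels[level] += 10 ** item
                return
            # digit passes 9: reset it and carry into the next level
            self.levels[level] -= 9 * 10 ** item
            level += 1

    def frequency(self, item):
        # reassemble the exact count from the per-level digits
        return sum(((lvl // 10 ** item) % 10) * 10 ** i
                   for i, lvl in enumerate(self.levels))
```

One shared integer per level thus records the counts of many items at once, which is where the claimed space saving over ordinary per-item counters comes from.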
The superiority of discriminative itemsets over discriminative items can be validated
with different real-world examples. In search engines, the discriminative items are the keywords
that are frequent in the target data stream and infrequent in the other data streams, whereas the
discriminative itemsets are the sets of keywords that appear together more frequently in the
target data stream than in the other data streams; for example, the discriminative itemset
'women's skinny jeans' compared to the discriminative item 'jeans'. Itemsets are more
interesting for the purposes of comparison, personalization and recommendation. Likewise, the
discriminative itemsets in the market basket are more useful than the discriminative items for
recommendation and personalization; for example, the discriminative itemset 'bread, beer, eggs'
compared to the discriminative item 'eggs'.
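A minimal sketch of such a discriminativeness test makes the search-engine example concrete. The ratio-based measure, the function name, the thresholds and the counts below are illustrative assumptions, not the thesis' exact definition.

```python
def discriminative_itemsets(freq_target, freq_general, n_target, n_general,
                            min_sup, theta):
    """Itemsets frequent in the target stream whose support is at least
    theta times their support in the general stream (assumed measure)."""
    result = {}
    for itemset, count in freq_target.items():
        sup_t = count / n_target
        sup_g = freq_general.get(itemset, 0) / n_general
        if sup_t >= min_sup and (sup_g == 0 or sup_t / sup_g >= theta):
            result[itemset] = (sup_t, sup_g)
    return result

# toy query-log counts over 100 transactions in each stream
target  = {("jeans",): 60, ("jeans", "skinny", "women"): 30}
general = {("jeans",): 50, ("jeans", "skinny", "women"): 2}
disc = discriminative_itemsets(target, general, 100, 100,
                               min_sup=0.2, theta=5.0)
# the keyword set is discriminative; 'jeans' alone is frequent everywhere
```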
Summary: Itemsets are preferable because individual items are less meaningful and carry
little knowledge on their own. However, itemset combination generation is a major barrier in
mining discriminative itemsets compared with mining discriminative items. Mining discriminative
itemsets in data streams appears to be far more complex than frequent itemset mining in a
single data stream. The Apriori property of frequent itemsets does not hold for discriminative
itemsets, as a subset of a discriminative itemset can be non-discriminative. Also, the algorithms
must be adapted to larger datasets and must deal with the inter-relationships between the
data streams.
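The failure of the Apriori property can be made concrete with a toy numeric example; the supports and the ratio-based discriminativeness test below are illustrative assumptions.

```python
def ratio(sup_target, sup_general):
    """Frequency ratio used as an (assumed) discriminativeness measure."""
    return float("inf") if sup_general == 0 else sup_target / sup_general

# toy supports (fraction of transactions) in the target and general streams
sup_target  = {("a",): 0.50, ("a", "b"): 0.40}
sup_general = {("a",): 0.45, ("a", "b"): 0.05}

theta = 4.0  # illustrative discrimination threshold on the ratio
is_disc = {x: ratio(sup_target[x], sup_general[x]) >= theta
           for x in sup_target}
# ('a','b') passes (ratio 8.0) while its subset ('a',) does not (ratio ~1.1),
# so Apriori-style pruning by subsets cannot be applied here.
```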
Considering the challenges in extending the current discriminative item mining
methods to discriminative itemset mining, the existing frequent itemset mining algorithms
following the Apriori or FP-Growth concepts are the better options for itemset generation in
more than one data stream. The number of itemsets generated by the FP-Growth algorithm is
lower than in Apriori-based algorithms. However, it should be noted that the concepts of closed
(Chi et al. 2006) and maximal (Farzanyar, Kangavari and Cercone 2012) itemsets, used in some
research to deal with the combinatorial itemset generation problem in data streams, are not
applicable for reducing the itemsets in discriminative itemset mining, as the Apriori property
does not hold.
In this thesis the tilted-time window model and the sliding window model are selected
for mining discriminative itemsets. Mining discriminative itemsets in data streams at multiple
time granularities and in real-time frames can be identified as one of the research gaps in the
area of pattern mining in data streams.
In many applications, rule mining is the next step after pattern mining. Discriminative
rules can be defined from discriminative itemsets as the sequential rules in one data stream that
have higher support and confidence than the same sequential rules in the other data streams.
These discriminative rules can be used for prediction mining in data streams using classification
techniques. Association rules in a single data stream and association classification techniques
are explained in the section below as concepts similar to discriminative rules and discriminative
classification.
2.4 ASSOCIATION RULE MINING
Algorithms for association rule mining consist of two steps. First, they discover
the itemsets that meet the support threshold; second, based on these discovered frequent
itemsets, they derive the association rules that meet the confidence criterion. Several
approaches have been proposed for association rule mining, such as AprioriAll (Peng and Liao
2009), GSP (Agrawal and Srikant 1994), SPADE (Zaki 2001), SPAM (Ayres et al. 2002) and
PrefixSpan (Han et al. 2001). Algorithms proposed for sequential rule mining in data streams
should follow the requirements of data stream mining algorithms, adopting an incremental,
one-pass mining approach and defining a synopsis structure. Also, in order to avoid
the concept drift problem (Wang et al. 2003), the data mining methods have to adapt
to changes in the data distribution (Jiang and Gruenwald 2006).
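The two-step procedure (support-frequent itemsets first, confident rules second) can be illustrated with a brute-force sketch; this is illustrative only, as real miners such as Apriori avoid enumerating every subset.

```python
from itertools import combinations

def association_rules(transactions, min_sup, min_conf):
    """Toy two-step miner: (1) frequent itemsets by support,
    (2) rules X -> Y (Y = itemset minus X) meeting the confidence bound."""
    n = len(transactions)
    # step 1: count every subset of every transaction (brute force)
    counts = {}
    for t in transactions:
        items = frozenset(t)
        for r in range(1, len(items) + 1):
            for s in combinations(sorted(items), r):
                key = frozenset(s)
                counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_sup}
    # step 2: derive rules from frequent itemsets of size >= 2;
    # every subset of a frequent itemset is itself frequent (Apriori)
    rules = []
    for s, sup in frequent.items():
        if len(s) < 2:
            continue
        for r in range(1, len(s)):
            for lhs in combinations(sorted(s), r):
                lhs_set = frozenset(lhs)
                conf = sup / frequent[lhs_set]
                if conf >= min_conf:
                    rules.append((tuple(sorted(lhs_set)),
                                  tuple(sorted(s - lhs_set)), conf))
    return rules
```

For example, over four baskets where bread and butter co-occur in half the transactions, the rule butter -> bread comes out with confidence 1.0.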
The existing algorithms for sequential rule mining in data streams are discussed briefly
in the section below.
2.4.1 Association rule mining in data streams
Several algorithms have been proposed for association rule mining in data streams. In (Shin and
Lee 2008) a method is proposed for mining association rules directly over the changing set of
currently frequent itemsets, which are generated over an online data stream using the estDec
method (Hyuk and Lee 2003). In (Çokpınar and Gündem 2012), a rule mining system is
proposed for mining positive and negative association rules in XML data streams. The rules
extracted by association rule mining techniques are generally positive rules; however, there is
some work on negative association rules (Antonie and Zaïane 2004; Wu, Zhang and Zhang
2004). Itemsets with support greater than the positive threshold are used for mining positive
association rules, and itemsets with support less than the negative threshold are used for mining
negative association rules. The PNRMXS algorithm proposed in (Çokpınar and Gündem 2012)
mines association rules in two steps, frequent itemset mining followed by sequential rule
mining, and is designed for a time-based sliding window model on a data stream. In (Ahmed et
al. 2012) a tree structure, HUS-Tree (high utility stream tree), and an algorithm called HUPMS
(high utility itemset mining over stream data) are proposed for incremental and interactive
high-utility pattern mining over a sliding window in data streams. The negative rule mining
approach introduced in (Yuan et al. 2002) reduces the search space for mining negative rules at
the cost of losing many negative relations in the data.
One of the drawbacks of current methods for sequential rule mining in data streams is
that the user can only define the parameters before execution and cannot adjust them during
the process (Jiang and Gruenwald 2006).
Similar to frequent itemset mining followed by association rule mining, discriminative
itemset mining in data streams can be followed by discriminative rule mining. These
discriminative rules can be used for defining accurate classification techniques: a
discriminative classification can be defined based on discriminative rule mining. In the
section below, association classification, as one of the classification rule mining methods, is
discussed along with its challenges, usefulness and applications.
2.4.2 Classification rule mining
Classification is one of the main data mining techniques used in applications like
personalization, anomaly detection, recommendation and prediction. Many classification
techniques have been developed for discovering a small subset of rules from a dataset that
can be used for prediction purposes in real-world applications. These methods are grouped into
different categories like decision trees (Quinlan 2014), rule learning (Clark and Niblett 1989),
naïve Bayes classification (Duda and Hart 1973) and the statistical approach (Lim, Loh and Shih
2000). Many algorithms have also been proposed for data stream classification (Alhammady and
Ramamohanarao 2005; Aggarwal 2007; Li et al. 2009; Hashemi et al. 2009; Zhao, Wang and Xu
2012; Masud et al. 2012). These methods mainly work by mining patterns from a single data
stream. As with the other data stream mining algorithms, the challenges in developing data
stream classifiers are the fast growth of data streams, unbounded memory requirements and
concept drift (Aggarwal 2007; Manku 2016). These challenges make it necessary to keep the
compact set of rules synchronized with the fast pace of the data streams.
Association classification (AC) (Ma 1998; Li, Han and Pei 2001) is proposed based on
the integration of association rule mining and classification techniques. These are closely
related concepts, with the difference that classification predicts class labels whereas
association rules show the relationships between items in transactions (Thabtah 2007). In
association rule mining the targets are not predetermined, while in classification the classes are
defined sets of targets (Ma 1998). Among the current studies, association classification is a
promising approach that can compete with decision tree, rule induction and probabilistic
classifiers (Ma 1998; Thabtah 2007). AC methods involve four steps: rule ranking, rule pruning,
rule prediction and rule evaluation (Ma 1998).
Rule mining is the most time-consuming step in traditional AC mining approaches
and is not applicable to fast-growing, large data streams. The number of rules discovered
in data streams can also be large, causing the other steps to take longer. There are
several studies on using association classification in data streams (Song and Li 2010; Su, Liu and
Song 2011; Lakshmi and Reddy 2012; Saengthongloun et al. 2013; Waiyamai et al. 2014;
Kompalli and Cherku 2015). Discriminative itemsets, defined as frequent itemsets in a data
stream whose frequency is higher than their frequency in the general data stream, can be used
for discovering discriminative rules as a new set of rules. Discriminative rules are useful for
classification purposes, being smaller in number and carrying better knowledge for
differentiating the data streams. Discriminative classification can be defined following a
concept similar to association classification, based on discriminative rule mining.
Summary: The development of an effective algorithm for mining discriminative rules
depends highly on the effectiveness of the algorithms used for mining discriminative itemsets.
However, the rule mining challenges must also be addressed in the different window models.
These challenges mostly relate to managing the number of transactions that fit in the window
sizes and to updating the results with every incoming transaction.
In this thesis, several new techniques are developed for mining discriminative itemsets
in data streams. In the section below, the new knowledge contributed to the research area
is verified.
2.5 VERIFYING THE NEW KNOWLEDGE CONTRIBUTED TO THE AREA
In this chapter we explained many works related to the problem of mining discriminative
itemsets in data streams. The problem of mining discriminative itemsets in data streams is
motivated by mining frequent itemsets in data streams (Manku and Motwani 2002), mining
discriminative items in data streams (Lin et al. 2010) and mining emerging patterns (Dong and Li
1999). Discriminative itemsets are those itemsets that are frequent in the target data stream
and whose frequencies in that stream are much higher than in the other data streams in the
application domain. We explained the differences between emerging patterns and discriminative
itemsets, and between discriminative items and discriminative itemsets, in detail in the
sections above. Compared to frequent itemsets, discriminative itemsets have sparsity
characteristics: the number of discriminative itemsets is much smaller than the number of
frequent itemsets. However, discriminative itemsets are more directed to the purpose of
comparison between data streams, and they can distinguish the datasets in different real-world
scenarios.
Discriminative itemsets are significant when there is no inherent discrimination in
the datasets (data classes). In many application domains there are inherent conceptual
differences between datasets (data classes); for example, in the mushroom dataset from the UCI
repository (Dheeru and Karra Taniskidou 2017) the two classes are edible and poisonous. In this
type of application, emerging pattern mining is useful. Most of the emerging patterns in such
applications are jumping emerging patterns, i.e., a special type of emerging pattern whose
support increases from zero in one dataset to non-zero in another dataset. We observed that in
emerging pattern mining different techniques are used to reduce the number of emerging
patterns to a subset of useful emerging patterns. This is not useful in applications that do not
have inherent discrimination between the datasets, for example, two different markets in market
basket analysis. In this type of application, discriminative itemsets are discovered with high
frequency in all datasets.
Another significant property of discriminative itemsets is that every discriminative
itemset is discovered with explicit frequencies in the datasets. This is useful for pattern mining
in applications based on support and confidence, for example, rule mining applications.
Emerging pattern mining mainly defines the desired emerging patterns based on ranked
statistical merits under different test statistics such as chi-square, risk ratio and odds ratio
(Dong and Bailey 2012). Emerging pattern mining methods use heuristics to obtain a compact
set of emerging patterns, as discovering a small set of emerging patterns without redundancy is
preferred to generating all patterns. In contrast, the methods proposed in this thesis efficiently
discover a complete set of discriminative itemsets based on their explicit supports in the
datasets rather than on statistical measures.
Also, the discriminative itemset mining algorithms proposed in this thesis deal, for the
first time, with mining discriminative itemsets in data streams. Based on the literature reviewed
in this chapter, there is no similar work on data streams. In this thesis, we contribute methods
for mining discriminative itemsets in datasets in a batch of transactions, mining discriminative
itemsets in data streams using the tilted-time window model, and mining discriminative itemsets
in data streams using the sliding window model.
In the first step, in Chapter 3, the discriminative itemsets are discovered from a batch
of transactions belonging to the datasets, for offline updating of the window models. The most
challenging issue in this step is efficiency. The proposed algorithms for mining discriminative
itemsets in a batch of transactions have to be efficient in dealing with the exponential number of
itemsets generated in more than one data stream. The Apriori property defined for frequent
itemsets does not hold in discriminative itemset mining; that is, not every subset of a
discriminative itemset is discriminative. The proposed algorithms are designed around concise
processing and data structures to save time and space. The discriminative itemsets in a batch of
transactions are discovered and used for updating the different window models in an offline
state. The algorithms must achieve full accuracy and recall in a batch of transactions, i.e., no
false positive or false negative answers. Approximation is not acceptable in batch processing, as
each offline update of the window model would compound the approximation further.
In the second step, in Chapter 4, the discriminative itemsets are discovered from a
batch of transactions belonging to the datasets, for offline updating of the tilted-time window
model. The tilted-time window model is updated in an offline state at the specific time intervals
defined for the batches of transactions. The discriminative itemsets discovered from the new
batch of transactions are saved in the current window frame of the tilted-time window model.
To improve the approximation of discriminative itemsets in the tilted-time window model, the
sub-discriminative itemsets are also defined; these are retained because they may become
discriminative in the future when window frames are merged. The discriminative and
sub-discriminative itemsets in the older window frames are obtained by shifting and merging
the itemsets. The approximation of discriminative itemsets in the tilted-time window model
worsens in the larger window frames, which are produced by merging the approximate smaller
window frames. Determinative properties of the discriminative itemsets in the tilted-time
window model are applied to decrease the number of false positive and false negative answers
in the tilted-time window model.
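The shift-and-merge mechanics of a logarithmic tilted-time window can be sketched as follows. This is a generic reconstruction with at most two summaries per granularity, tracking one itemset's counts; the thesis' exact model and its sub-discriminative bookkeeping are not reproduced.

```python
class TiltedTimeWindow:
    """Sketch of a logarithmic tilted-time window for one itemset's counts.

    slots[i] keeps at most two summaries, each covering 2**i batches;
    when a third arrives, the two oldest merge and shift into slots[i+1],
    so k levels summarise an exponentially growing history.
    """
    def __init__(self):
        self.slots = []

    def add_batch(self, freq):
        carry, level = freq, 0
        while carry is not None:
            if level == len(self.slots):
                self.slots.append([])
            self.slots[level].insert(0, carry)   # newest summary first
            if len(self.slots[level]) > 2:
                # merge the two oldest same-granularity summaries upward
                carry = self.slots[level].pop() + self.slots[level].pop()
                level += 1
            else:
                carry = None

    def total(self):
        # overall frequency across the whole recorded history
        return sum(sum(s) for s in self.slots)
```

After seven unit batches the structure holds summaries at three granularities (1, 2 and 4 batches), which is why approximation error introduced in small frames propagates into the merged, coarser frames.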
In the third step, in Chapter 5, the discriminative itemsets are discovered from a batch
of transactions belonging to the datasets, for offline updating of the sliding window model. The
sliding window model is updated in an offline state at the specific time intervals defined for the
batches of transactions. The new batch of transactions is processed, the discriminative itemsets
are saved in the sliding window frame, and the itemsets belonging to the oldest partition out of
the sliding
window frame are deleted. The proposed algorithm for the offline sliding window model must
achieve full accuracy and recall. Approximation is not accepted in the sliding window model, as
each offline update of the window model would compound the approximation further.
Heuristics are proposed for mining the exact set of discriminative itemsets in the offline sliding
window model. The sub-discriminative itemsets are also saved, as they may become
discriminative in the future as the window frame slides online. Online sliding happens between
two offline updates and is limited to the itemsets already in the window frame; the online
sliding window model mines discriminative itemsets in an online state with approximate results.
Finally, in Chapter 6, the proposed methods are evaluated on several synthetic and
real datasets with different characteristics, for both efficiency and effectiveness. The input
batches of transactions are generated synthetically in a wide variety of settings. Different
parameter settings are tested to show the effect of each parameter on algorithm efficiency and
on the discovered patterns. The proposed methods are compared with the baseline methods for
both efficiency and effectiveness. The different behaviours of the proposed algorithms with
different input data streams and different input parameters are demonstrated, and the
effectiveness of the discriminative itemsets is presented in real-world applications.
In the section below the research gaps are identified to be addressed in this thesis.
2.6 RESEARCH GAPS AND IMPLICATIONS
In spite of the abundance of research on frequent itemset mining in data streams, there
is little research on pattern mining in more than one data stream. Almost all of the
proposed algorithms are developed for a single data stream (Aggarwal 2007; Cheng, Ke and Ng
2008; Manku 2016). Also, most emerging pattern mining algorithms are proposed for static
datasets rather than data streams. Given this research gap, more attention needs to be paid to
the area of pattern mining in more than one data stream. Discriminative itemsets are defined as
the frequent itemsets in the target data stream whose frequencies are much higher than those of
the same itemsets in the other data streams. Compared to frequent itemset and sequential rule
mining in single data streams, discriminative itemsets and rules are more directed to the purpose
of comparison between data streams: the discriminative itemsets distinguish the target data
stream from the other data streams. Emerging patterns are the concept closest to discriminative
itemsets. As with emerging patterns (Dong and Li 1999), mining discriminative itemsets from
more than one data stream can reveal interesting insights. Like EPs in static datasets, the
discriminative itemsets can be used for prediction purposes by defining classifier techniques, or
for highlighting the differences between different data streams.
As discussed before, most of the existing emerging pattern mining methods have been
designed for static datasets (Dong and Li 1999; Fan and Ramamohanarao 2002; Zhang, Dong
and Kotagiri 2000; Bailey, Manoukian and Ramamohanarao 2002; Alhammady and
Ramamohanarao 2005). These methods try to define left and right borders of maximal itemsets
and use them to reach the EPs. They scan each dataset separately and several times, which is
not acceptable in data stream environments. They also suffer from a large number of candidate
itemsets, which again would be a major barrier in data streams. We need to propose algorithms
that avoid processing non-potential itemsets during discriminative itemset mining.
Classification is one of the main applications of emerging patterns. Many
classification techniques have been proposed based on emerging patterns, both for static
datasets (Li, Dong and Ramamohanarao 2000, 2001; Fan and Ramamohanarao 2002, 2003) and
for stream data (Alhammady and Ramamohanarao 2005; Yu et al. 2012). Discriminative
subsequence mining has also been used for classification purposes (Nowozin, Bakir and Tsuda
2007). However, the type and the concept of these patterns are very different from
discriminative itemsets. Discriminative itemset mining in data streams can lead on to
discriminative rule mining and discriminative classification.
A number of methods have been proposed for mining discriminative items in data
streams. Discriminative itemsets are more valuable in the data mining process, as individual
items are less meaningful and carry little knowledge on their own. Also, the discriminative
rules, which are very important for prediction mining, can be extracted from the discriminative
itemsets.
Every proposed algorithm for data stream mining must be effective and efficient in
both time and space to be useful in real-time situations with large, fast-growing data streams.
This research uses the inherent heuristics of the target datasets to optimize the data mining
process. In summary, the following research gaps will be addressed by this research:
Lack of a method for differentiating between data streams by highlighting the
discriminative itemsets.
Lack of a method for efficient discriminative itemset mining at multiple time
granularities.
Lack of a method for efficient discriminative itemset mining in a sliding real-time
window frame.
Lack of efficient discriminative itemset mining methods optimized for the
general and specific characteristics of the data streams in different window
models.
Chapter 3: Mining Discriminative Itemsets in Data Streams
In this chapter, the problem of mining discriminative itemsets in data streams is formally defined.
The comprehensive research problem is outlined and two methods are proposed. The primary
method, called DISTree, is a simple extension based on the basics of FP-Growth (Han, Pei and
Yin 2000). The advanced, efficient method, called DISSparse, is based on the sparse
characteristics of discriminative itemsets. The proposed methods are explained in detail, with a
deep discussion of their limitations in real-world applications. The discriminative itemsets are
discovered from a batch of transactions and used for updating the window models in an offline
state. The proposed methods are proved to achieve full accuracy and recall in mining
discriminative itemsets in one batch of transactions. The proposed methods are extensively
evaluated in Chapter 6 on various datasets of different sizes exhibiting diverse characteristics in
terms of transaction size, frequent-pattern size and number of unique items, using different
thresholds. Empirical analysis shows the much better time and space complexity of the
DISSparse algorithm proposed in (Seyfi et al. 2017) in comparison with the modified version of
the DISTree algorithm proposed in (Seyfi, Geva and Nayak 2014). To the best of our knowledge,
the proposed methods in this chapter are the first algorithms for mining discriminative itemsets.
The chapter starts by describing the existing works in Section 3.1. The mathematical
definition of the research problem, using several notations, is given in Section 3.2. In
Section 3.3 the DISTree method is proposed as the primary method for mining discriminative
itemsets. In Section 3.4 the DISSparse method is proposed as the advanced method for efficient
mining of discriminative itemsets. The chapter concludes with a discussion of the DISTree and
DISSparse methods and their state-of-the-art techniques in Section 3.5. This chapter covers the
papers Mining Discriminative Itemsets in Data Streams (Seyfi, Geva and Nayak 2014) and
Efficient Mining of Discriminative Itemsets (Seyfi et al. 2017).
3.1 EXISTING WORKS
FP-Growth is a well-known method for mining frequent patterns by pattern
fragment growth (Han, Pei and Yin 2000). A large database is compressed into a much smaller
data structure called the FP-Tree. The FP-Tree-based mining method avoids generating a large
number of candidate itemsets by using the pattern fragment growth method, and a
divide-and-conquer strategy decomposes frequent pattern mining into smaller tasks.
The frequent pattern tree, or FP-Tree for short, is a compact prefix-tree structure used for
holding information about the frequent patterns. It is constructed from only the frequent items
during the first scan of the datasets, which removes the need to scan the datasets repeatedly.
Transactions with identical frequent patterns can be merged, with the number of occurrences
recorded as the frequency. To make this possible, the items in every transaction are ordered in a
fixed order. If the frequent items in each transaction are ordered by descending frequency, more
prefixes can be shared between transactions: items with higher frequency have a better chance
of sharing nodes than less frequent items. To facilitate tree traversal, a header table is
constructed from the items, in which each item points to its occurrences in the tree via a linked
list.
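The construction step described above can be sketched as follows. This is a minimal single-dataset version that omits the header table and its linked lists; the names are illustrative.

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count):
    """Insert each transaction's frequent items, ordered by descending
    global frequency, into a prefix tree so common prefixes share nodes."""
    freq = Counter(i for t in transactions for i in set(t))
    # rank: most frequent first, ties broken alphabetically
    rank = {i: (-c, i) for i, c in freq.items() if c >= min_count}
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root
```

Because every transaction is reordered the same way, all transactions containing the most frequent item start from the same child of the root, which is the prefix sharing the text describes.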
Because the frequent items in transactions are shared, the FP-Tree is much smaller than the
original dataset, which allows the subsequent mining to work on a rather small data structure.
Pattern fragment growth based on the FP-Tree starts from frequent length-1 patterns. A
conditional FP-Tree is constructed from the set of frequent items that co-occur with a specific
item as suffix, and mining continues recursively on such trees. The patterns are discovered by
concatenating the suffixes with the new patterns generated from the conditional FP-Tree. The
divide-and-conquer, partitioning-based approach reduces the size of the generated patterns and
also simplifies the generation of long patterns by concatenating shorter ones with the suffix.
Mining discriminative patterns using the FP-Tree suffers from the problem of combinatorial
candidate generation and testing. In FP-Growth, a recursive process obtains the patterns by
pattern fragment growth. However, the pattern fragment growth method cannot be applied
directly to the problem of discriminative itemset mining, as discussed in detail in Section 3.3.
We modify the original FP-Growth method by placing multiple counters in each node of the
prefix tree, i.e., for recording the frequencies in the different datasets. We use a linear algorithm
to discover all the discriminative itemsets, as explained in the full chapter. For efficiency, we
propose two heuristics to eliminate non-potential itemsets from the itemset combination
generation. We compare the methods in terms of time and space usage in Chapter 6.
The Discriminative Pattern Miner (DPMiner) algorithm (Li, Liu and Wong 2007) is a
well-known method for efficient mining of the delta-discriminative emerging patterns
(Dong and Bailey 2012) based on the concept of equivalence classes. An equivalence class (EC)
is a set of itemsets that always occur together in the same set of transactions. The
equivalence classes are uniquely defined by their closed patterns and sets of generators. A
closed pattern is a frequent itemset with no proper superset of the same frequency (Zaki and
Hsiao 2002). Generators are itemsets whose frequency is smaller than that of every
immediate proper subset (Li et al. 2006). The generators are the minimal itemsets in the
equivalence class and the closed pattern is the maximal itemset. The key idea of the DPMiner
algorithm is to mine a concise representation of the equivalence classes.
The 𝛿-discriminative emerging patterns must be frequent in the target dataset (𝑓𝑖 > 𝑚𝑠)
and have frequency less than 𝛿 in the other datasets (𝑓𝑗 < 𝛿). The DPMiner mines the 𝛿-
discriminative emerging patterns by skipping the subsets of itemsets with frequency 𝑓𝑗 > 𝛿
during depth-first search. It also skips the redundant itemsets, i.e., supersets of itemsets with
the same infinite ratio between the supports in the target and general dataset. An itemset with
infinite ratio has high frequency in one dataset and zero or very low frequency in the other
datasets. The DPMiner employs the 𝛿 constraint to reduce the pattern search space by setting a
border of non-𝛿-discriminative emerging patterns. If an EC is 𝛿-discriminative and non-
redundant, then none of its subset ECs can be 𝛿-discriminative and none of its superset ECs can
be non-redundant. This causes a wide difference in the number of discovered itemsets, as the
redundant 𝛿-discriminative emerging patterns are excluded. An equivalence class is redundant if
the closed itemset of one EC is a subset of the closed itemset of another EC, as all its itemsets are
already subsumed by the subset closed itemset. This emphasizes that only the most general 𝛿-
discriminative equivalence classes are non-redundant, and the lower-ranked equivalence
classes are ignored as redundant.
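For illustration of these definitions (a brute-force sketch, not the DPMiner algorithm itself), the following Python fragment groups all itemsets of a toy transaction set into equivalence classes and reports each class's closed pattern and generators; the transactions and names are invented for the example.

```python
from itertools import combinations

# Toy transactions standing in for the target dataset; purely illustrative.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
items = sorted(set().union(*transactions))

def support_set(itemset):
    """Ids of the transactions that contain the itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

# Group itemsets that occur in exactly the same transactions:
# each group is one equivalence class (EC).
classes = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        I = frozenset(combo)
        sup = support_set(I)
        if sup:
            classes.setdefault(sup, []).append(I)

for sup, members in classes.items():
    closed = max(members, key=len)                     # the maximal itemset
    generators = [I for I in members
                  if not any(J < I for J in members)]  # the minimal itemsets
    print(sorted(closed), [sorted(g) for g in generators], len(sup))
```

On this toy data, for instance, {a}, {b} and {a, b} occur in exactly the same three transactions, so {a, b} is the closed pattern of that class and {a}, {b} are its generators.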
The discriminative itemsets defined in this thesis are measured by their relative
occurrences in the target and general dataset; the 𝛿-discriminative emerging patterns, however,
are measured by their frequency (i.e., < 𝛿) in the general dataset. A subset of a non-discriminative
itemset can be discriminative. The 𝛿-discriminative emerging pattern mining also ignores the low-
ranked equivalence classes as redundant. For the discriminative itemsets proposed in
this thesis, in contrast, explicit relative supports are provided for every discriminative itemset.
Discriminative itemset mining is a novel research problem that mines the complete set of
itemsets which are frequent in the target dataset and are relatively discriminative in the target
dataset compared to the general dataset. In this chapter we modify the original DPMiner to
include all the 𝛿-discriminative itemsets. We also modify the DISSparse method proposed in this
chapter, and its heuristics, to find all the 𝛿-discriminative itemsets. We compare these
methods in terms of time and space usage for the desired 𝛿-discriminative emerging patterns in
Chapter 6.
The main contribution of this chapter is two algorithms for mining
discriminative itemsets. The DISTree method (Seyfi, Geva and Nayak 2014) is proposed in
Section 3.3 as a simple expansion of the FP-Growth method. The efficient DISSparse
method (Seyfi et al. 2017) is proposed in Section 3.4 based on two heuristics that skip the non-
discriminative itemsets. The technical details of the proposed algorithms are provided in a way
that distinguishes them from the existing FP-Growth method (Han, Pei and Yin 2000).
3.2 RESEARCH PROBLEM
A data stream is defined as a dynamic, large, ordered sequence of transactions arriving
at high speed over time. In most data stream mining applications, the stream can be read only
once because of limited computing and storage capabilities (Manku and Motwani 2002). Stock
market monitoring, online market basket analysis, network access patterns, search engine
queries in a region, etc., can be modelled as data streams. The discriminative itemsets are
defined as the frequent itemsets in the target data stream that have frequencies much higher
than those of the same itemsets in other data streams. Without loss of generality, we refer to
the other data streams as the 'general data stream' for the sake of simplicity. The
discriminative itemsets are relatively frequent in the target data stream and relatively infrequent
in the general data stream. An essential issue in this research problem is to find the itemsets that
can distinguish the target data stream from all other data streams.
Many real-world scenarios show the significance of discriminative itemset mining in
data streams. In online monitoring of market basket transactions, the itemsets that occur more
frequently in one market compared to the other markets are of interest. These are useful for
identifying the specific sets of items which are of high interest in one market compared to the
other markets. The discovered itemsets are useful for personalization, anomaly detection or
prediction purposes. In web page personalization, the discriminative itemsets are groups of web
pages visited together by specific user groups much more frequently than by other user groups.
In search engines, the discriminative itemsets are the sequences of queries asked with higher
support in one geographical area compared to another area. In network traffic monitoring,
discriminative itemsets are the concurrent activities of one user which are more frequent than
the activities of the rest of the same group across the whole network. Discriminative itemsets
highlight the differences between data streams (Lin et al. 2010; Seyfi, Geva and Nayak 2014;
Seyfi et al. 2017). They can be used in classification methods by distinguishing the trends in the
target data stream from other data streams.
The discriminative itemsets are a sparse subset of the frequent itemsets. There are additional
challenges in mining discriminative itemsets in data streams, as the problem does not follow the
Apriori property defined for frequent itemset mining and a subset of a discriminative itemset can
be non-discriminative. Additionally, any discriminative itemset mining method has to deal with
the exponential number of itemsets generated in more than one dataset. These challenges impose
high time and space complexity. Despite these challenges, discriminative itemset mining is an
emerging research area with great potential. Frequent itemsets can be frequent in all data streams
and thus not distinctive, while discriminative itemsets carry distinctive knowledge and can be
used for comparison between data streams.
In this chapter, discriminative itemset mining is discussed based on a single
batch of transactions 𝐵 made of two data streams 𝑆𝑖 and 𝑆𝑗. Later, in Chapter 4 and
Chapter 5, we discuss the problem in detail using the tilted-time window model and the sliding
window model, respectively.
3.2.1 Formal definition
Let ∑ be the alphabet set of items. A transaction 𝑇 = {𝑒1, … 𝑒𝑖 , 𝑒𝑖+1, … , 𝑒𝑛}, 𝑒𝑖 ∈ ∑, is
defined as a set of items in ∑. The items in a transaction are in alphabetical order by default, for
ease of describing the mining algorithm. The two data streams 𝑆𝑖 and 𝑆𝑗 are defined as the target
and general data stream; each consists of a different number of transactions, i.e., 𝑛𝑖 and 𝑛𝑗 (𝑛𝑖 ≠
𝑛𝑗), respectively. An itemset 𝐼 is defined as a subset of ∑. The itemset frequency is the number of
transactions that contain the itemset. The frequency of the itemset 𝐼 in data stream 𝑆𝑖 is denoted
as 𝑓𝑖(𝐼) and the frequency ratio of itemset 𝐼 in data stream 𝑆𝑖 is defined as 𝑟𝑖(𝐼) = 𝑓𝑖(𝐼)/𝑛𝑖. In
this chapter, if the frequency ratio of itemset 𝐼 in the target data stream 𝑆𝑖 is larger than the
frequency ratio in the general data stream 𝑆𝑗, i.e., 𝑟𝑖(𝐼)/𝑟𝑗(𝐼) > 1, then the itemset 𝐼 can be
considered as a discriminative itemset. Let 𝑅𝑖𝑗(𝐼) be the ratio between 𝑟𝑖(𝐼) and 𝑟𝑗(𝐼), i.e.,
𝑅𝑖𝑗(𝐼) = 𝑟𝑖(𝐼)/𝑟𝑗(𝐼). Obviously, the higher the 𝑅𝑖𝑗(𝐼), the more discriminative the itemset 𝐼 is.
To more precisely define discriminative itemsets, we introduce a user-defined threshold
𝜃 > 1, called the discriminative level threshold, with no upper bound. An itemset 𝐼 is considered
discriminative if 𝑅𝑖𝑗(𝐼) ≥ 𝜃. This is formally defined as:

𝑅𝑖𝑗(𝐼) = 𝑟𝑖(𝐼)/𝑟𝑗(𝐼) = (𝑓𝑖(𝐼)𝑛𝑗)/(𝑓𝑗(𝐼)𝑛𝑖) ≥ 𝜃 (3.1)

The 𝑅𝑖𝑗(𝐼) could be very large even with very low 𝑓𝑖(𝐼). In order to accurately identify
discriminative itemsets which have reasonable frequency, and also to handle the case of 𝑓𝑗(𝐼) = 0,
we introduce another user-specified support threshold, 0 < 𝜑 < 1/𝜃, to eliminate itemsets that
have very low frequency. In this chapter, an itemset 𝐼 is considered as discriminative if its
frequency is at least 𝜑𝜃𝑛𝑖, i.e., 𝑓𝑖(𝐼) ≥ 𝜑𝜃𝑛𝑖, and also 𝑅𝑖𝑗(𝐼) ≥ 𝜃.
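The two conditions can be checked directly. The following is a minimal Python sketch of the test, assuming the case 𝑓𝑗(𝐼) = 0 is treated as an infinite ratio so that only the frequency condition applies; the function name and sample values are illustrative.

```python
def is_discriminative(fi, fj, ni, nj, theta, phi):
    """Test for one itemset: fi, fj are its frequencies in the target and
    general stream; ni, nj are the stream sizes."""
    if fi < phi * theta * ni:    # frequency threshold in the target stream
        return False
    if fj == 0:                  # ratio is infinite; frequency test suffices
        return True
    return (fi * nj) / (fj * ni) >= theta   # R_ij(I) >= theta

# Illustrative values with theta = 2, phi = 0.1, ni = nj = 15
print(is_discriminative(4, 2, 15, 15, theta=2, phi=0.1))    # True
print(is_discriminative(12, 13, 15, 15, theta=2, phi=0.1))  # False
```

With these parameters the frequency bound is 𝜑𝜃𝑛𝑖 = 3, so an itemset with 𝑓𝑖 = 2 is rejected outright even when 𝑓𝑗 = 0.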
Definition 3.1. Discriminative itemsets: Let 𝑆𝑖 and 𝑆𝑗 be two data streams, with
current sizes 𝑛𝑖 and 𝑛𝑗 respectively, that contain varied-length transactions of items in ∑; let
𝜃 > 1 be a user-defined discriminative level threshold and 𝜑 ∈ (0, 1/𝜃) a support threshold. The
set of discriminative itemsets in 𝑆𝑖 against 𝑆𝑗, denoted as 𝐷𝐼𝑖𝑗, is formally defined as:

𝐷𝐼𝑖𝑗 = {𝐼 ⊆ ∑ | 𝑓𝑖(𝐼) ≥ 𝜑𝜃𝑛𝑖 & 𝑅𝑖𝑗(𝐼) ≥ 𝜃} (3.2)
The itemsets that are not discriminative are defined as non-discriminative itemsets:
Definition 3.2. Non-discriminative itemsets: Let 𝑆𝑖 and 𝑆𝑗 be two data streams, with
current sizes 𝑛𝑖 and 𝑛𝑗 respectively, that contain varied-size transactions of items in ∑; let
𝜃 > 1 be a user-defined discriminative level threshold and 𝜑 ∈ (0, 1/𝜃) a support threshold. The
non-discriminative itemsets are the itemsets with frequency less than 𝜑𝜃𝑛𝑖 in the target data
stream 𝑆𝑖, or with frequency ratio in 𝑆𝑖 compared to 𝑆𝑗 less than 𝜃. A non-discriminative itemset
can be ignored if it is not a subset of a discriminative itemset. The set of non-discriminative
itemsets in 𝑆𝑖 against 𝑆𝑗, denoted as 𝑁𝐷𝐼𝑖𝑗, is formally defined as:

𝑁𝐷𝐼𝑖𝑗 = {𝐼 ⊆ ∑ | 𝑓𝑖(𝐼) < 𝜑𝜃𝑛𝑖 𝑜𝑟 𝑅𝑖𝑗(𝐼) < 𝜃} (3.3)
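Definitions 3.1 and 3.2 can be made concrete with a brute-force enumeration over two toy streams. This is only a reference sketch of the search space (exponential in the number of items) that the DISTree and DISSparse methods are designed to prune; the toy streams are invented for illustration.

```python
from itertools import combinations

def mine_discriminative(Si, Sj, theta, phi):
    """Brute-force enumeration of DI_ij per Definition 3.1.
    Si, Sj: lists of transactions (sets of items)."""
    ni, nj = len(Si), len(Sj)
    items = sorted(set().union(*Si, *Sj))
    result = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            I = frozenset(combo)
            fi = sum(1 for t in Si if I <= t)   # frequency in target stream
            fj = sum(1 for t in Sj if I <= t)   # frequency in general stream
            if fi < phi * theta * ni:
                continue                         # fails the support threshold
            if fj == 0 or (fi * nj) / (fj * ni) >= theta:
                result[I] = (fi, fj)             # passes the ratio threshold
    return result

Si = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
Sj = [{"c"}, {"b", "c"}, {"c"}, {"a", "c"}]
print(mine_discriminative(Si, Sj, theta=2, phi=0.1))
```

On these streams, {a}, {b} and {a, b} come out discriminative, while {c} is frequent only in 𝑆𝑗 and is rejected.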
3.2.2 Discriminative itemset mining
The problem of discriminative itemset mining is much more complicated than
frequent itemset mining. The most challenging issue is efficiency. Discriminative
itemset mining algorithms have to deal efficiently with the exponential number of itemsets
generated in more than one data stream. The Apriori property defined for frequent itemsets is
not valid in discriminative itemset mining; that is, not every subset of a discriminative itemset is
discriminative. The non-discriminative itemsets should be ignored during the process if they are
not subsets of discriminative itemsets. This is time and space consuming even in non-streaming
environments. The algorithms have to be designed with concise processes and data
structures to save time and space. To the best of our knowledge, and based on the literature
reviewed in Chapter 2, there is no prior research work on mining discriminative itemsets in data
streams.
Due to the complexity of the discriminative itemset mining process, and building on the
current methods for frequent itemset mining in a single data stream, the primary method is
proposed by expanding the FP-Growth method (Han, Pei and Yin 2000) of frequent itemset
mining, adaptable to different window models. An advanced, efficient method is then proposed
based on the sparse characteristics of the discriminative itemsets. In this chapter, the DISTree
and DISSparse algorithms are developed for mining discriminative itemsets in data streams. The
data structures and the detailed process of the algorithms are discussed using a running example.
The specific characteristics and the limitations of the proposed methods on large and fast-
growing data streams are presented. The DISTree and DISSparse algorithms are evaluated in
Chapter 6, on different input data streams with different characteristics, for their time
and space complexities.
3.3 DISTREE METHOD
The DISTree method (Seyfi, Geva and Nayak 2014) is proposed as a simple expansion of
the basics of the FP-Growth method (Han, Pei and Yin 2000). The DISTree method is developed for
offline mining of the discriminative itemsets in a batch of transactions in the datasets, adaptable
to the different window models. The DISTree algorithm is designed based on several data
structures, either modified from the standard FP-Growth method of frequent itemset mining or
specifically developed for discriminative itemset mining. The data structures are explained
together with their usage in the algorithm. To facilitate describing the method, several important
concepts and constructs are defined below.
FP-Tree: The prefix tree structure proposed in FP-Growth (Han, Pei and Yin 2000)
is used for holding the frequent items of the transactions, sharing branches for their most
common frequent items. The branch from the root to a node is the prefix of the itemsets ending
at the nodes below that node. Each node in the FP-Tree is associated with a counter
showing the frequency of the itemset consisting of the items on the path from the root
to this node. In the DISTree method, the FP-Tree is adapted by adding two counters
to each node, holding the frequencies of the itemsets in the target dataset and the general
dataset; for example, there are two counters associated with each node in the FP-Tree in
Figure 3.1.

In this thesis, we use the sequence from the root to a node to represent an itemset, and the
two associated values indicate the frequency of the itemset in the target dataset and the general
dataset, respectively. For example, 𝑐𝑏7,3 in Figure 3.1 indicates that the frequency of itemset 𝑐𝑏 is 7
in the target dataset and 3 in the general dataset.
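A minimal sketch of such a two-counter prefix-tree node, assuming transactions arrive with their items already ordered; the class and function names are illustrative, not the thesis implementation.

```python
class Node:
    """Prefix-tree node with two counters (target / general frequencies)."""
    def __init__(self, item):
        self.item = item
        self.fi = 0          # frequency in the target stream Si
        self.fj = 0          # frequency in the general stream Sj
        self.children = {}

def insert(root, transaction, stream):
    """Insert an ordered transaction, bumping the counters of its stream."""
    node = root
    for item in transaction:
        node = node.children.setdefault(item, Node(item))
        if stream == "i":
            node.fi += 1
        else:
            node.fj += 1
    return node

root = Node(None)
# cb occurs 7 times in the target and 3 times in the general stream,
# matching the node labelled cb7,3 in Figure 3.1.
for _ in range(7):
    insert(root, ["c", "b"], "i")
for _ in range(3):
    insert(root, ["c", "b"], "j")
node = root.children["c"].children["b"]
print(node.fi, node.fj)   # 7 3
```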
Header-Table: The Header-Table is a tabular structure listing all the items in the
defined alphabet ∑ in processing order, starting from the least frequent item. For fast
traversal of the prefix tree structures, each item is associated with two linked lists, which hold the
itemsets ending with that item in the FP-Tree and DISTree, respectively; for example, the Header-
Table has links to the FP-Tree and DISTree in Figure 3.1 and Figure 3.3, respectively.
DISTree: The DISTree is a prefix tree structure similar to the FP-Tree, with two counters in
each node. The two counters 𝑓𝑖 and 𝑓𝑗 show the frequencies, in the target data stream and the
general data stream respectively, of the itemset in the sequence ending at each particular node.
The DISTree is constructed by traversing the links in the Header-Table
structure following the FP-Growth method (Han, Pei and Yin 2000), generating all the
combinations of itemsets that are frequent in the target data stream and checking whether they
are discriminative based on Definition 3.1. The non-discriminative itemsets identified during the
process are deleted from the DISTree unless they are subsets of discriminative itemsets; for
example, 𝑐12,13 in Figure 3.3 is a non-discriminative itemset, but it is a subset of itemset 𝑐𝑏4,2,
which is a discriminative itemset. The DISTree structure stores the discriminative itemsets with
their intermediate nodes.
Window model: The DISTree method can be applied with different window models.
The tilted-time structure is a window model for showing the recent and historical
answers at different time granularities. A group of transactions arriving within a period of time is
considered a batch of transactions 𝐵. After processing each new batch of transactions, the newly
discovered itemsets are transferred to the tilted-time window structure (Giannella et al. 2003) and
the structure is updated by shifting and merging the older answers in an offline state. The
logarithmic structure of the tilted-time window model shown in Figure 2.4 is held for each
discovered discriminative itemset during the history of the data streams; but, as the number of
discriminative itemsets is much smaller than that of all itemsets present in the data streams, the
space used is not large.
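Under the simplifying assumption that each granularity level keeps at most two batch counts and any overflow is merged (summed) into the next coarser level, the shift-and-merge update can be sketched as follows; this is an illustration of the logarithmic idea, not the exact structure of Figure 2.4.

```python
def shift_tilted(windows, new_count):
    """Push the newest batch count into the tilted-time levels; each level
    keeps at most two counts, and the two oldest are merged upward."""
    carry = new_count
    for level in windows:
        level.append(carry)
        if len(level) <= 2:
            return
        carry = level.pop(0) + level.pop(0)   # merge the two oldest counts
    windows.append([carry])                   # open a new, coarser level

windows = []
for batch_count in [1, 1, 1, 1, 1]:           # five batches, one count each
    shift_tilted(windows, batch_count)
print(windows)                                # [[1], [2, 2]]
```

After five unit batches the structure holds the newest batch at fine granularity and the older history as coarser merged counts, so the number of stored entries grows only logarithmically with the stream length.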
The sliding window model is a structure for showing the answers in a fixed, dynamically
updated period of time. This can be obtained by dividing the window into several smaller
partitions. The discriminative itemsets are discovered in each partition using the DISTree
algorithm and updated in the sliding window model shown in Figure 2.3. The window slides by
processing the new batch of transactions and removing the itemsets related to the oldest batch of
transactions in the offline state.
The process of mining discriminative itemsets using the DISTree method is
explained in the section below, followed by its limitations and shortcomings for larger data
streams. The algorithms using the tilted-time window model and the sliding window model will
be introduced in Chapter 4 and Chapter 5, respectively.
3.3.1 DISTree construction and pruning
The FP-Tree is constructed first for the transactions in both data streams. For every new
transaction in the data stream 𝑆𝑖, either a new prefix or a sub-prefix (i.e., a new branch or a sub-
branch in the tree) is added to the FP-Tree and its frequency pairs are set, or the frequency
pairs of existing prefixes in the FP-Tree are updated if the itemset already appeared in past
transactions. For the running example presented in Table 3.1, the Header-Table and the FP-
Tree structures are shown in Figure 3.1. The process continues by generating the itemsets
which are frequent in the target data stream using FP-Growth (Han, Pei and Yin 2000) and
adding them to the DISTree structure. The non-discriminative itemsets are deleted to save
space if they are not subsets of any discriminative itemsets.
Following the standard FP-Growth method (Han, Pei and Yin 2000), the FP-Tree is
traversed using the Header-Table links from the least frequent item, and the conditional FP-Tree
is built from the patterns in the FP-Tree ending with that item. The conditional FP-Tree is a
miniaturized version of the original FP-Tree, constructed for each item in the Header-Table out
of the frequent items in the FP-Tree paths ending with that item, defined as the item's conditional
patterns; for example, in Figure 3.1 the conditional patterns of item 𝑎 are
𝑐𝑏𝑑𝑎3,2, 𝑐𝑏𝑎1,0, 𝑐𝑒𝑎1,0, 𝑐𝑎0,4, 𝑏𝑎1,0 and 𝑎0,2. The process continues by generating
all the combinations of each single path in a conditional FP-Tree ending with the Header-Table
item, and adding them to the DISTree structure. After processing each conditional FP-Tree, the
DISTree is checked for new discriminative itemsets, based on the defined thresholds, by
traversing the links of the Header-Table item. The non-discriminative itemsets are
deleted to save space if they are not subsets of any discriminative itemsets. After
processing all the Header-Table items, the discovered discriminative itemsets are fully set in
the DISTree structure and can be used for offline updating of the different window models. The
final state of the DISTree structure is shown in Figure 3.4.
Lemma 3-1 (Completeness of DISTree): Itemset combination generation using the
conditional FP-Tree, as defined in the FP-Growth method (Han, Pei and Yin 2000), ensures the
completeness of the discriminative itemsets in the DISTree structure by generating all itemsets
which are frequent in the target data stream 𝑆𝑖.

Proof. The DISTree method, following the basics of FP-Growth (Han, Pei and Yin 2000),
generates the full itemset combinations if they are frequent in the target data stream 𝑆𝑖. The
itemsets that are frequent in the target data stream 𝑆𝑖 are a superset of the discriminative
itemsets and are fully added to, or incrementally updated in, the DISTree structure. Based on
Definition 3.1, the itemsets of interest are frequent in the target data stream 𝑆𝑖 and their
frequency ratio with the general data stream 𝑆𝑗 is 𝜃 times or higher. On the other hand, the non-
discriminative itemsets are deleted, by traversing the recently added or updated itemsets through
the links in the Header-Table, if they are not subsets of any discriminative itemsets.
∎
Lemma 3-2 (Correctness of DISTree): All the itemsets in the DISTree structure are
checked based on Definition 3.1 and Definition 3.2 and tagged as discriminative or non-
discriminative itemsets, respectively.

Proof. The DISTree structure is made of the itemsets that are frequent in the target data stream 𝑆𝑖,
a superset of the discriminative itemsets. All the itemsets in the DISTree structure are tagged as
discriminative or non-discriminative based on Definition 3.1 and Definition 3.2 by traversing
the Header-Table links.
∎
Example 3.1. The DISTree construction is graphically demonstrated using the running
example presented in Table 3.1. Two simple data streams with the same number of transactions
in 𝑆1 and 𝑆2 (𝑛1 = 𝑛2 = 15), are presented in Table 3.1.
Table 3.1 An input batch in data streams
The DISTree method (Seyfi, Geva and Nayak 2014) was originally proposed and tested based on
a single scan, but this considerably increases the size of the data structures and the processing
time. Data stream mining algorithms (Giannella et al. 2003; Chi et al. 2006; Tanbeer et al. 2009)
use two scans to build concise data structures with faster processing time, in which items are
ordered by decreasing frequency as in (Han, Pei and Yin 2000). This ordering is determined by
the frequent items in the first batch of transactions and remains fixed for all remaining batches
in the data streams. In this chapter, we modify the original DISTree method to use two scans.
In the first scan, the frequent items in the target data stream 𝑆1 are found and sorted in
descending order of their frequencies (i.e., the Desc-Flist order in Table 3.2 is constructed out of
the frequent items in Table 3.1). The Desc-Flist is used as the default order for all the prefix tree
structures and also shows the processing order in the Header-Table (e.g., the Header-Table in
Figure 3.1 is processed from the least frequent item in the dataset, item 𝑎). The frequent items in
each input transaction in the datasets are sorted in Desc-Flist order before being added to the
FP-Tree structure (e.g., the FP-Tree in Figure 3.1 is made of transactions with their items in
Desc-Flist order).
Table 3.2 Desc-Flist order of frequent items in target data stream 𝑆1
In this example the discriminative level threshold is set to 𝜃 = 2 and the support
threshold is set to 𝜑 = 0.1. The FP-Tree and the DISTree, together with the Header-Table, are
represented in Figure 3.1 and Figure 3.3, respectively. For simplicity, we only show the Header-
Table links of item 𝑎.
Figure 3.1 Header-Table and FP-Tree structures for input batch of transactions
Based on the FP-Growth method, the conditional patterns and the conditional FP-Tree
are constructed for each item in the Header-Table, following the bottom-up order of the Desc-Flist.
The conditional patterns of a Header-Table item are the sub-patterns in the original FP-Tree
starting from the root node and ending at that item; for example, in Figure 3.1 the conditional
patterns of item 𝑎 are 𝑐𝑏𝑑𝑎3,2, 𝑐𝑏𝑎1,0, 𝑐𝑒𝑎1,0, 𝑐𝑎0,4, 𝑏𝑎1,0 and 𝑎0,2. The conditional FP-Tree is
constructed for each item in the Header-Table from its conditional patterns; for example, the
conditional FP-Tree for item 𝑎 is given in Figure 3.2. In Figure 3.2, the two conditional patterns
𝑐𝑒𝑎1,0 and 𝑐𝑎0,4 are merged together as 𝑐𝑎1,4. The item 𝑒 is not frequent in this conditional FP-
Tree and is not included.
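The conditional counts for item 𝑎 can be reproduced from its conditional patterns. The sketch below merges the pattern counts per prefix item and drops items below the 𝜑𝜃𝑛𝑖 = 3 threshold of the running example; the pattern list is hard-coded from Figure 3.1 for illustration only.

```python
# Conditional patterns of item a from Figure 3.1: (prefix, fi, fj)
cond_patterns = [
    (("c", "b", "d"), 3, 2),   # cbda3,2
    (("c", "b"), 1, 0),        # cba1,0
    (("c", "e"), 1, 0),        # cea1,0
    (("c",), 0, 4),            # ca0,4
    (("b",), 1, 0),            # ba1,0
    ((), 0, 2),                # a0,2
]

# Sum each prefix item's conditional frequencies over all patterns.
counts = {}
for prefix, fi, fj in cond_patterns:
    for item in prefix:
        a, b = counts.get(item, (0, 0))
        counts[item] = (a + fi, b + fj)

# Keep only items frequent in the target stream: phi * theta * n1 = 0.1*2*15
min_count = 0.1 * 2 * 15
frequent = {it: c for it, c in counts.items() if c[0] >= min_count}
print(counts)
print(sorted(frequent))
```

The result keeps 𝑐, 𝑏 and 𝑑 in the conditional FP-Tree of item 𝑎, while 𝑒 (conditional count 1 in the target stream) falls below the threshold, matching the tree in Figure 3.2.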
Figure 3.2 Conditional FP-Tree of Header-Table item 𝑎
The itemsets which are frequent in the target dataset are generated using FP-Growth
(Han, Pei and Yin 2000). The itemset combinations in each single path of the conditional FP-
Tree are generated, and the newly generated itemsets are added to the DISTree structure, either as
new paths or by updating the frequencies of previously added itemsets; for example, in Figure 3.3
itemset 𝑐𝑏𝑎4,2 is generated from 𝑐𝑏𝑎3,2 and 𝑐𝑏𝑎1,0 in Figure 3.2. The non-discriminative
itemsets are ignored, to save space, if they are not subsets of any discriminative itemsets. By
making the conditional FP-Tree for each item in the Header-Table and generating the itemset
combinations in each single path of the conditional FP-Trees, the DISTree structure is
constructed. The DISTree contains all frequent itemsets in the first dataset, whether they are
discriminative or not. The highlighted nodes in Figure 3.3 show the discriminative itemsets.
For simplicity, we only show the Header-Table links of item 𝑎.
Figure 3.3 Header-Table and DISTree structure without pruning (the full prefix tree size is
only for display and is not generated)
DISTree construction for fast-growing data streams would consume a large amount of memory.
For each unique itemset, a new branch has to be added, with multiple counters in
each node for the data stream frequencies, as in Figure 3.3. However, the DISTree is not constructed
in full size: the non-discriminative itemsets are pruned after processing each conditional FP-Tree,
as explained in the paragraph below (i.e., Figure 3.3 is only for displaying the complete itemset
generation in the DISTree). The final DISTree for Example 3.1 is presented in Figure 3.4. The
coloured nodes in Figure 3.3 and Figure 3.4 mark the discriminative itemsets ending at
them. As explained earlier in this chapter, the Apriori property of frequent itemsets is not valid
for discriminative itemsets, and subsets of a discriminative itemset can be non-discriminative.
Saving the full-size DISTree in main memory is not possible for large datasets, so the
DISTree has to be kept compact. For a concise DISTree structure, the non-discriminative
itemsets should be pruned from the itemset tails for each item in the Header-Table, following
the bottom-up order of the Desc-Flist. The tail pruning process is defined for each itemset
ending with 𝑎, denoted as 𝐼(𝑎), generated during the processing of the conditional FP-Tree. If,
based on Definition 3.2, 𝑓𝑖(𝐼(𝑎)) < 𝜑𝜃𝑛𝑖 or 𝑟𝑖(𝐼(𝑎))/𝑟𝑗(𝐼(𝑎)) < 𝜃, then 𝐼(𝑎) is not
discriminative. The itemset 𝐼(𝑎) is tagged as non-discriminative and removed from the DISTree
if it is a leaf node. This deletion saves considerable time and space during DISTree construction
by reducing the size of the DISTree.
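The tail-pruning rule can be sketched on a toy nested-dict tree; the counters for 𝑐(12,13) and 𝑐𝑏(4,2) echo the running example, while the leaf 𝑒(2,5) is hypothetical. This is a minimal sketch under the Example 3.1 parameters, not the thesis implementation.

```python
def prune_tails(node, is_disc):
    """Recursively delete leaf descendants that are non-discriminative.
    A non-discriminative node survives only while it still has children,
    i.e. while it remains the prefix of some discriminative itemset."""
    for item in list(node["children"]):
        child = node["children"][item]
        prune_tails(child, is_disc)
        if not child["children"] and not is_disc(child["fi"], child["fj"]):
            del node["children"][item]

def disc(fi, fj):
    # Definition 3.1 with theta = 2, phi = 0.1, n1 = n2 = 15 (so phi*theta*n1 = 3)
    return fi >= 3 and (fj == 0 or fi / fj >= 2)

tree = {"fi": 0, "fj": 0, "children": {
    "c": {"fi": 12, "fj": 13, "children": {        # non-discriminative c(12,13)
        "b": {"fi": 4, "fj": 2, "children": {}},   # discriminative cb(4,2)
        "e": {"fi": 2, "fj": 5, "children": {}},   # hypothetical non-disc leaf
    }}}}
prune_tails(tree, disc)
print(sorted(tree["children"]["c"]["children"]))   # ['b'] -- e pruned, c kept
```

Note how 𝑐(12,13) is kept as an internal node even though it is non-discriminative, because the discriminative itemset 𝑐𝑏 hangs below it, while the non-discriminative leaf is removed.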
Lemma 3-3 (Concise DISTree structure): The pruning of non-discriminative itemsets
ensures that the DISTree remains a concise data structure.

Proof. The DISTree structure is traversed through the Header-Table links in the
reverse order of the Desc-Flist. The non-discriminative itemsets are deleted from the DISTree if
they are not subsets of discriminative itemsets. The final DISTree structure only holds the
discriminative itemsets and their non-discriminative subsets.
∎
By pruning the non-discriminative itemsets remaining as leaf nodes, the final DISTree
structure for the input dataset in Table 3.1 is obtained, as shown in Figure 3.4. The DISTree in
this example contains one non-discriminative itemset remaining as an internal node, i.e., 𝑐12,13.
After finding the discriminative itemsets in the DISTree structure, the tails are pruned if they are
non-discriminative itemsets. The tail nodes are the nodes in the DISTree with no children, and
they are pruned if they are non-discriminative. A significant difference appears in the size of the
DISTree structure with and without tail pruning (cf. Figure 3.3).
Figure 3.4 Final DISTree structure and the reported discriminative itemsets
The DISTree structure is made of a small subset of the frequent itemsets in the FP-Tree.
The Apriori property of frequent itemsets is not valid for discriminative itemsets, and subsets
of a discriminative itemset can be non-discriminative (e.g., 𝑐12,13). Figure 3.4 also shows the six
discriminative itemsets inside the DISTree for the data streams in Table 3.1. Since all
combinations of itemsets frequent in the target data stream 𝑆𝑖 are generated in the DISTree
structure, and all of them are traversed and tagged as discriminative or non-discriminative, the
proposed DISTree algorithm achieves full accuracy and recall on a batch of transactions in data
streams.
3.3.2 DISTree algorithm
The DISTree algorithm starts by reading the batch of transactions and building the Desc-
Flist, based on the descending order of the item frequencies in the target data stream 𝑆𝑖. The
Desc-Flist order saves space by sharing the paths in the prefix trees, with the most
frequent items at the top. In data stream mining, this Desc-Flist is built from the first batch of
transactions and remains the same for all the upcoming batches in the data streams. Where a
single-pass algorithm is required, this step can be skipped by building the FP-Tree and DISTree
using an alphabetical order of items. The input parameters, the discriminative level 𝜃 and the
support threshold 𝜑, are defined based on the application domain, the data stream characteristics
and sizes, or by domain expert users, as discussed in Chapter 6. The presented DISTree
algorithm is applicable to one batch of transactions.
The transactions are added to the FP-Tree with prefix path sharing of the itemsets. The
DISTree structure is constructed by initializing its root as an empty tree. The DISTree is updated
by traversing the itemsets in the FP-Tree using the Header-Table links, making the
conditional FP-Tree of each Header-Table item and then generating all combinations of each
single path in the conditional FP-Tree. This follows the basics of the standard FP-Growth
method (Han, Pei and Yin 2000) for frequent itemset generation. Based on the Desc-Flist order
of the Header-Table items, starting from the least frequent item, the DISTree is updated by adding
the new itemset combinations generated from the conditional FP-Tree. The new itemsets added to
the DISTree (i.e., ending with the Header-Table item) are then checked based on Definition 3.1 and
Definition 3.2, to be saved as discriminative itemsets or be deleted from the DISTree data
structure as non-discriminative itemsets if they are leaf nodes. By the end of the itemset
combination generation for the batch of transactions, the full DISTree has been tagged based on
the discovered discriminative itemsets.
Algorithm 3.1 (DISTree: Discriminative itemset mining using itemset
generation and test)
Input: (1) The discriminative level threshold 𝜃; (2) The support threshold 𝜑; (3)
The input batch 𝐵 made of transactions with alphabetically ordered items
belonging to data streams 𝑆𝑖 and 𝑆𝑗.
Output: 𝐷𝐼𝑖𝑗, the set of discriminative itemsets in 𝑆𝑖 against 𝑆𝑗 in the batch of
transactions 𝐵.
Begin
1) Scan 𝐵 to generate Desc-Flist and order the items in transactions based on
Desc-Flist;
2) Make FP-Tree for 𝐵 based on expansion of FP-Growth;
3) 𝐷𝐼𝑖𝑗 = { };
4) For each item x in Header-Table do // bottom-up order of Desc-Flist
5) Make conditional FP-Treex based on item x;
6) For each path in conditional FP-Treex do
7) Generate all itemsets ending with Header-Table item x from the path;
8) If itemsets are not in DISTree, add new paths in DISTree, otherwise
update the frequency of these itemsets in DISTree;
9) End for;
10) Check DISTree for discriminative & non-discriminative nodes and
update 𝐷𝐼𝑖𝑗 by discovered discriminative itemsets;
11) Delete non-discriminative leaf nodes;
12) End for;
13) Report discriminative itemsets in 𝐷𝐼𝑖𝑗;
End.
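Step 7 of Algorithm 3.1 follows the standard FP-Growth combination step: every subset of the items on a single conditional FP-Tree path is combined with the processing Header-Table item. A minimal sketch of that step (the tuple representation and the function name are illustrative, not the thesis's data structures):

```python
from itertools import combinations

def path_combinations(path, header_item):
    # All itemsets ending with the processing Header-Table item that can be
    # generated from one single path of the conditional FP-Tree (step 7).
    result = []
    for r in range(len(path) + 1):
        for subset in combinations(path, r):
            result.append(tuple(subset) + (header_item,))
    return result

# e.g., the path c-b above header item a yields: a, ca, ba and cba
print(path_combinations(('c', 'b'), 'a'))
```

Each generated itemset is then looked up in the DISTree and either inserted as a new path or has its frequencies updated (steps 8-9 of Algorithm 3.1).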
In the DISTree algorithm, the parts that attract considerable complexity are the itemset
combination generation for each single path in the conditional FP-Tree, and the DISTree
construction and testing. The number of itemset combinations is exponential, especially in
large datasets. Theorem 3-1 and Theorem 3-2 prove the correctness of the DISTree
method and its minimum space usage, respectively, as below.
Theorem 3-1 (Completeness and correctness of DISTree): Based on Lemma 3-1, all
combinations of frequent itemsets in the target data stream 𝑆𝑖 are generated in the DISTree
structure. Based on Lemma 3-2, all the itemsets in the DISTree structure are traversed and tagged
as discriminative or non-discriminative itemsets. The non-discriminative itemsets that are not a
subset of any discriminative itemset are pruned from the DISTree structure. The complete set of
discriminative itemsets is held in the DISTree structure.
Theorem 3-2 (Concise DISTree structure): Based on Theorem 3-1, the complete set of
itemset combinations is generated in the DISTree structure and correctly tagged as
discriminative and non-discriminative itemsets. Based on Lemma 3-3, the non-discriminative
itemsets are deleted from the DISTree structure if they are not subsets of discriminative itemsets.
Together, these prove the minimum space usage of the DISTree structure.
The efficiency of the DISTree algorithm is discussed in detail by evaluating the
algorithm with several input data streams in Chapter 6. Empirical analysis shows the
performance of the proposed method on different datasets. The method is tested with different
discriminative level thresholds, support thresholds and ratios between the sizes of the two data
streams. However, as discussed there, several issues remain regarding the performance of the
DISTree algorithm in large and fast-growing data streams.
In this thesis, the problem of mining discriminative itemsets in data streams is defined
using two different window models. The proposed DISTree method is used in Chapter 4 for
mining discriminative itemsets in data streams using the tilted-time window model (e.g.,
Figure 2.3). The DISTree method is then used in Chapter 5 for mining discriminative itemsets in
data streams using the sliding window model (e.g., Figure 2.4). Here we briefly explain the
updating process of the two window models after processing a single batch of transactions 𝐵
using the DISTree algorithm.
For mining discriminative itemsets in data streams using the tilted-time window
model, the pre-defined window, made of a batch of input transactions, is set as the current
window frame for offline updating of the window model. The discriminative itemsets are
discovered using the DISTree method and are transferred from the DISTree structure to the first
window frame as the current window frame. The tilted-time structure is updated by shifting and
merging the older window frames into the larger window frames in a logarithmic fashion,
following the basics of the FP-Stream method (Giannella et al. 2003), as explained in Chapter 4.
The discriminative itemsets in the current window frame and the older window frames are
reported separately as the output results over the history of the data streams. The discriminative
itemset mining continues by building the DISTree structure for every new incoming batch of
transactions and offline updating the tilted-time window model.
For mining discriminative itemsets in data streams using the sliding window model,
the sliding window frame can be divided into several smaller partitions for offline sliding of the
window model. The discriminative itemsets are discovered using the DISTree method and are
transferred from the DISTree structure to the sliding window frame by merging with the older
itemsets in the sliding window frame, as explained in Chapter 5. With every new incoming batch
of transactions, the window slides; the itemsets that fall out of the window frame, belonging to
the oldest partitions, are deleted from the window model, and the results are updated in offline
sliding. However, when the sliding window frame becomes full, every new input transaction is
checked for having a subset in the set of discriminative itemsets in the sliding window frame, for
online sliding.
3.3.3 DISTree summary
In this section, the DISTree method, proposed for mining discriminative itemsets in
data streams based on an expansion of the FP-Growth method (Han, Pei and Yin 2000), was
discussed. The DISTree method generates all the itemset combinations ending with each Header-
Table item and adds them to the prefix-tree DISTree structure. The new itemsets are checked by
their frequencies and tagged as discriminative or non-discriminative itemsets. The itemset
combination generation is based on the frequent itemsets in the target data stream. The
conditional FP-Tree is built for each Header-Table item in the bottom-up order of Desc-Flist,
based on the frequent items in the FP-Tree paths ending with that item, called conditional
patterns. All the combinations of each single path in the conditional FP-Tree are generated and
added to the DISTree structure. To control the data structure size, the non-discriminative itemsets
are deleted from the DISTree structure.
The FP-Growth method (Han, Pei and Yin 2000) was originally designed for
frequent itemset mining in a single dataset. The main part of the DISTree method is the itemset
combination generation, limited to the itemsets that are frequent in the target data stream.
However, many itemset combinations are frequent in both data streams and not discriminative.
Based on the conceptual definition, and following the empirical analysis in Chapter 6, the
discriminative itemsets are a small subset of the frequent itemsets. This sparse characteristic of
discriminative itemsets should be exploited to define a novel, efficient method suitable for large,
complex and fast-growing data streams. The process must be adjusted to the fact that the Apriori
property does not hold for discriminative itemset mining, and determinative heuristics have to be
applied to efficiently limit the mining to the potential discriminative itemsets. In the next section
a novel efficient method called DISSparse (Seyfi et al. 2017) is proposed for mining the
discriminative itemsets in data streams.
The precision of the DISTree algorithm was proved by Theorem 3-1, based on its
completeness and correctness. Theorem 3-2 also proved the minimum space usage of the
DISTree structure in main memory. For small and simple datasets, the DISTree method can
be used correctly for offline updating of the different window models.
3.4 DISSPARSE METHOD
The DISSparse method is developed for efficient offline mining of discriminative
itemsets in a batch of transactions (Seyfi et al. 2017). There are two main issues with
the DISTree method, as explained in the previous section. First, it generates the itemset
combinations that are frequent in the target data stream without considering their frequency in
the general data stream. Second, the candidate itemsets remain in the DISTree structure while
they are checked for being discriminative. The DISTree can become massive for large datasets
with long discriminative itemsets.
The key novel idea of the DISSparse method is to limit the discriminative itemset mining
to the potential subtrees of the DISTree structure, instead of mining the entire DISTree structure.
To this end, two heuristics are proposed to determine the potential subtrees and the potential
internal nodes in the subtrees for mining discriminative itemsets. The subtrees which do not have
potential for containing discriminative itemsets are removed without checking. To facilitate
description of the method, some new data structures are defined below.
Conditional FP-Tree: The conditional pattern of an item is the sub-pattern base under
the condition of the Header-Table item's existence in the original FP-Tree. The conditional FP-
Tree is constructed for each item in the Header-Table from its conditional patterns. In contrast with
FP-Growth based methods, the conditional FP-Tree in the DISSparse method keeps, for the nodes
related to the processing Header-Table item, the top ancestor on the first level as the
root above the Header-Table items; for example, node 𝑐 is the top ancestor of the Header-Table
items 𝑎 in the left-most subtree in Figure 3.5, and 𝑐 appears in all Header-Table item nodes in the
left-most subtree. The nodes in the first level of the conditional FP-Tree determine different
subtrees, which are used separately to generate potential discriminative itemset combinations.
A subtree is made of the branches under the same root in the first level of the conditional FP-Tree,
ending with the processing Header-Table item, and is denoted as 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡; for example, the
conditional FP-Tree in Figure 3.5 has three subtrees under roots 𝑐, 𝑏 and 𝑎 (i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐,
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 and 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎, respectively). The Header-Table items are linked under their subtree
root node using Header-Table links.
As defined before, each node represents an itemset starting from the root and ending at
this node, and is annotated with the frequencies of the itemset in the two datasets; for example,
𝑎3,2 in Figure 3.5 indicates that the frequencies of itemset 𝑐𝑏𝑑𝑎 are 3 and 2 in streams 𝑆𝑖 and 𝑆𝑗,
respectively. For simplicity, we use the single header item 𝑎3,2 to denote the itemset 𝑐𝑏𝑑𝑎 and its
frequencies. Let 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) be the set of header items with their
frequencies; for each item 𝑎𝑛,𝑚 ∈ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡), 𝐼(𝑎𝑛,𝑚) is defined as
the itemset ending with 𝑎 whose frequencies in streams 𝑆𝑖 and 𝑆𝑗 are 𝑛 and 𝑚,
respectively; for example, in Figure 3.5, 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) contains 𝑎3,2, 𝑎1,0
and 𝑎1,4; 𝐼(𝑎3,2) = 𝑐𝑏𝑑𝑎, 𝑓𝑖(𝐼(𝑎3,2)) = 3 and 𝑓𝑗(𝐼(𝑎3,2)) = 2. In the DISSparse method, the
conditional FP-Tree will be expanded by its subtrees during the process, as explained in
Section 3.4.1.2.
Based on Definition 3.1 and empirical studies, the discriminative itemsets are a small
subset of frequent itemsets. The DISSparse method is presented with two determinative
heuristics for limiting the mining discriminative itemsets to the potential itemsets in
Section 3.4.1. We prove the full accuracy and recall of the DISSparse method in one batch of
transactions in Section 3.4.2. The efficiency of the DISSparse method is compared with the
proposed DISTree method in Chapter 6, for a single batch of transactions. The method is
evaluated extensively using various thresholds on several large and complex fast-growing data
streams with different characteristics.
3.4.1 Mining discriminative itemsets using sparse prefix tree
Following FP-Growth, the FP-Tree is generated for the incoming batch of transactions
as in Figure 3.1. The conditional patterns and the conditional FP-Tree for each Header-Table item
are then produced following the increasing order of the items' frequency in the Header-Table,
starting from the item with the lowest frequency as in Desc-Flist (i.e., the conditional FP-Tree for
the Header-Table item 𝑎 presented in Figure 3.5). The difference between the conditional FP-Tree
in Figure 3.2 and the conditional FP-Tree in Figure 3.5 is that in Figure 3.5 we process the
subtrees one by one.
Figure 3.5 Conditional FP-Tree of Header-Table item 𝑎 associated with the top ancestor on
the first level
The conditional FP-Tree is traversed from the left-most subtree through the processing
Header-Table item links. The left-most subtree in the conditional FP-Tree and its internal nodes
are checked against two heuristics defined in this section. The heuristics determine
the potential discriminative itemsets in a combination set of itemsets. The two heuristics are
defined based on two important measures: the maximum frequency of the itemsets in a subtree
and the maximum discriminative value of the itemsets in a subtree. We first define the two
measures below, then define the two heuristics.
Let 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) denote the set of itemsets in subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 starting from
𝑟𝑜𝑜𝑡 and ending with a header item 𝑎𝑛,𝑚 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡), e.g.,
𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑐, 𝑎) = {𝐼(𝑎3,2), 𝐼(𝑎1,0), 𝐼(𝑎1,4)}. The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎)
in the target dataset 𝑆𝑖, denoted as Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎 ), is defined as the sum of the frequencies
in 𝑆𝑖 of the itemsets in 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) as below.
Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) = ∑𝑏∈𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡,𝑎) 𝑓𝑖(𝑏)    (3-a)
For simplicity, in the equation above, 𝑓𝑖(𝑏) refers to the frequency in 𝑆𝑖 of an itemset
ending with an item 𝑏 in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡); for example, in Figure 3.5 the
maximum frequency in 𝑆𝑖 of the itemsets in the left-most subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐, which contains three
branches ending with 𝑎3,2, 𝑎1,0 and 𝑎1,4, is equal to 5, i.e., Max_freq𝑖(𝑐, 𝑎) = 5.
Let 𝒮 be the power set of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎), i.e., 𝒮 = 2^𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡,𝑎) consists of
all subsets of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎). For 𝐵 ∈ 𝒮 and 𝐵 ≠ { }, the discriminative value of the
itemsets in 𝐵 is defined below.

Dis_value(𝐵) = 𝑟𝑖(𝐵) / 𝜃, if ∑𝑏∈𝐵 𝑓𝑗(𝑏) = 0;
Dis_value(𝐵) = 𝑟𝑖(𝐵) / 𝑟𝑗(𝐵), if ∑𝑏∈𝐵 𝑓𝑗(𝑏) > 0    (3-b)

where 𝑟𝑖(𝐵) = ∑𝑏∈𝐵 𝑓𝑖(𝑏) / 𝑛𝑖 and 𝑟𝑗(𝐵) = ∑𝑏∈𝐵 𝑓𝑗(𝑏) / 𝑛𝑗. 𝑟𝑖(𝐵) and 𝑟𝑗(𝐵) are called the
relative supports of 𝐵, i.e., the sums of the relative supports of the itemsets in 𝐵 in 𝑆𝑖 and 𝑆𝑗,
respectively.
When ∑𝑏∈𝐵 𝑓𝑗(𝑏) = 0, the itemsets 𝑏 ∈ 𝐵 do not exist in dataset 𝑆𝑗. In this case,
Dis_value(𝐵) is defined as the ratio between the sum of the relative supports of the itemsets of 𝐵
in 𝑆𝑖 and the discriminative level threshold 𝜃. The idea behind this is that the itemset should be
frequent and significant in the target dataset.
When ∑𝑏∈𝐵 𝑓𝑗(𝑏) > 0, which indicates that at least one of the itemsets in 𝐵 does exist in
the general dataset, Dis_value(𝐵) is defined as the ratio between the sum of the relative supports
of the itemsets of 𝐵 in 𝑆𝑖 and the sum of the relative supports of the itemsets of 𝐵 in 𝑆𝑗. The ratio
𝑟𝑖(𝐵) / 𝑟𝑗(𝐵) is called the discriminative value of 𝐵, denoted as 𝑅𝑖𝑗(𝐵) = 𝑟𝑖(𝐵) / 𝑟𝑗(𝐵); for
example, in the left-most subtree in Figure 3.5, for 𝐵 = {𝑐𝑏𝑑𝑎, 𝑐𝑏𝑎, 𝑐𝑎}, 𝑅𝑖𝑗(𝐵) =
(5/15) / (6/15) = 5/6, with 𝑛𝑖 = 𝑛𝑗 = 15.
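Equation (3-b) translates directly into code once each itemset in 𝐵 is reduced to its frequency pair (𝑓𝑖(𝑏), 𝑓𝑗(𝑏)); this pair representation and the function name are assumptions for illustration, a sketch rather than the thesis's implementation:

```python
def dis_value(B, ni, nj, theta):
    # Discriminative value of a non-empty set B of itemsets, per equation
    # (3-b); B is a list of (fi, fj) frequency pairs, one per itemset.
    sum_fi = sum(fi for fi, _ in B)
    sum_fj = sum(fj for _, fj in B)
    ri = sum_fi / ni                 # relative support of B in Si
    if sum_fj == 0:                  # none of the itemsets occur in Sj
        return ri / theta
    rj = sum_fj / nj                 # relative support of B in Sj
    return ri / rj

# B = {cbda, cba, ca} from the left-most subtree of Figure 3.5
print(dis_value([(3, 2), (1, 0), (1, 4)], 15, 15, 2))  # 5/6 ≈ 0.833
```

The same function applied to a singleton 𝐵 gives the discriminative value of an individual itemset.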
The maximum discriminative value of all itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header
item 𝑎, denoted as Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎), is defined below.
Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) = max𝐵∈𝒮 {Dis_value(𝐵)}    (3-c)

where Dis_value(𝐵) is given by equation (3-b).
Obviously, Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) is the highest discriminative value among all
subsets of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎).
In order to determine Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎), all possible itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
will have to be generated as required in equations (3-b) and (3-c). However, the generation of all
possible itemset combinations is time-consuming. An efficient method, as described below, is
designed to find a subset 𝐵 of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) which makes Dis_value(𝐵) maximum among
all subsets.
Let 𝐵𝑚𝑎𝑥 denote the subset 𝐵 with the maximum 𝑅𝑖𝑗(𝐵) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡. Initially, 𝐵𝑚𝑎𝑥 is
formed by summing up the 𝑓𝑖(𝑏) frequencies of the itemsets 𝑏 with 𝑓𝑗(𝑏) = 0.
The frequencies of the itemset 𝑏 with the maximum frequency ratio are then added to 𝐵𝑚𝑎𝑥
only if they increase its discriminative value, i.e., 𝑅𝑖𝑗(𝐵𝑚𝑎𝑥); for example, in Figure 3.5 the
maximum discriminative value of the itemsets in the left-most subtree, 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐, is equal to 2, i.e.,
Max_dis_value(𝑐, 𝑎) = 2, calculated from the sum of the frequencies of the two itemsets ending
with the items 𝑎1,0 and 𝑎3,2 in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐). Following
Definition 3.1, if 𝑓𝑗(𝐵𝑚𝑎𝑥) = 0 then 𝑅𝑖𝑗(𝐵𝑚𝑎𝑥) = 𝑓𝑖(𝐵𝑚𝑎𝑥) / (𝜃𝑛𝑖). The algorithm for calculating
Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) and estimating Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) is presented in
Algorithm 3.2.
Algorithm 3.2 𝐌𝐚𝐱_𝐟𝐫𝐞𝐪𝒊 and 𝐌𝐚𝐱_𝐝𝐢𝐬_𝐯𝐚𝐥𝐮𝐞
Input: (1) 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡; (2) header item
𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡).
Output: (1) Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎); (2) Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎).
Begin
1) Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) = 0; 𝐹𝑖 = 0, 𝐹𝑗 = 0;
2) For each item 𝑏 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) do
3) Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎)+= 𝑓𝑖(𝐼(𝑏));
4) If 𝑓𝑗(𝐼(𝑏)) = 0 then 𝐹𝑖+= 𝑓𝑖(𝐼(𝑏)); Tag 𝑏 as checked;
5) End if;
6) End For;
7) While ∃𝑏 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑏 is unchecked do
8) Find 𝐼(𝑏) with maximum 𝑅𝑖𝑗(𝐼(𝑏));
9) If (𝐹𝑖 + 𝑓𝑖(𝐼(𝑏))) / (𝐹𝑗 + 𝑓𝑗(𝐼(𝑏))) > 𝐹𝑖 / 𝐹𝑗 then 𝐹𝑖 += 𝑓𝑖(𝐼(𝑏)); 𝐹𝑗 += 𝑓𝑗(𝐼(𝑏));
10) End if;
11) Tag 𝑏 as checked;
12) End While;
13) Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) = (𝐹𝑖 ∗ 𝑛𝑗) / (𝐹𝑗 ∗ 𝑛𝑖);
14) Return Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎);
End.
In the above algorithm, the items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) are scanned
in two separate loops. In the first loop, Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) is calculated from the
frequency in 𝑆𝑖 (i.e., 𝑓𝑖(𝐼(𝑏))) of all itemsets in the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡. 𝐹𝑖 is also initialized as
the sum of 𝑓𝑖(𝐼(𝑏)) over the itemsets with 𝑓𝑗(𝐼(𝑏)) = 0. 𝐹𝑖 and 𝐹𝑗 are used for calculating
Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎). In the second loop, using a greedy method, 𝐹𝑖 and 𝐹𝑗 are
updated by adding the frequencies in 𝑆𝑖 and 𝑆𝑗 of the itemset 𝐼(𝑏) with the maximum 𝑅𝑖𝑗(𝐼(𝑏)) if
the frequencies of the itemset 𝐼(𝑏) increase the ratio between 𝐹𝑖 and 𝐹𝑗. The resulting ratio
between 𝐹𝑖 and 𝐹𝑗 is the greatest ratio over all itemsets in the subtree. In the second loop each
item is checked once, and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) is calculated based on the selected items.
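The two loops of Algorithm 3.2 can be sketched as follows, again reducing each itemset ending with the processing header item to its frequency pair (an assumed representation). The greedy second loop accepts the candidate with the highest frequency ratio only when it raises the running discriminative value, mirroring lines 7-12:

```python
def subtree_measures(header_items, ni, nj, theta):
    # header_items: list of (fi, fj) pairs, one per itemset of the subtree
    # ending with the processing Header-Table item.
    def value(Fi, Fj):
        # dis-value of the accumulated selection, per equation (3-b)
        return (Fi / ni) / theta if Fj == 0 else (Fi / ni) / (Fj / nj)

    # first loop: total frequency in Si, and the Sj-absent itemsets
    max_freq_i = sum(fi for fi, _ in header_items)
    Fi = sum(fi for fi, fj in header_items if fj == 0)
    Fj = 0
    # second loop: greedy scan by decreasing frequency ratio fi/fj
    for fi, fj in sorted((b for b in header_items if b[1] > 0),
                         key=lambda b: b[0] / b[1], reverse=True):
        if value(Fi + fi, Fj + fj) > value(Fi, Fj):
            Fi, Fj = Fi + fi, Fj + fj
    return max_freq_i, value(Fi, Fj)

# Subtree_c of Figure 3.5: itemsets a3,2, a1,0 and a1,4 with ni = nj = 15
print(subtree_measures([(3, 2), (1, 0), (1, 4)], 15, 15, 2))  # (5, 2.0)
```

On the worked example, the pairs of 𝑎1,0 and 𝑎3,2 are accumulated (giving 4/2 = 2) while 𝑎1,4 is rejected, matching Max_freq𝑖(𝑐, 𝑎) = 5 and Max_dis_value(𝑐, 𝑎) = 2.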
The first heuristic is formally defined below.
HEURISTIC 3.1. A 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the conditional FP-Tree in terms of a header item 𝑎 is
considered as a potential discriminative subtree denoted as 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) if it
satisfies the following conditions:
1. Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜑𝜃𝑛𝑖
2. Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜃
Where 𝜃 > 1 is the discriminative level threshold, 𝜑 ∈ (0, 1/𝜃) is the support threshold, 𝑛𝑖 is the
size of the target data stream 𝑆𝑖, and 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) is the set of itemsets in subtree
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡).
Lemma 3-4 (Potential discriminative subtree) HEURISTIC 3.1 ensures that none of the
non-potential discriminative subtrees contains any discriminative itemset.
Proof. The first condition in HEURISTIC 3.1 ensures that the sum of the frequencies in
the target data stream S𝑖 of the itemsets in a potential discriminative subtree reaches the
frequency threshold, which implies that the subtree could contain an itemset that is frequent in
S𝑖. The second condition in HEURISTIC 3.1 ensures that the maximum discriminative value of a
potential discriminative subtree is at least the discriminative level 𝜃, which implies that the
subtree could contain an itemset whose discriminative value reaches 𝜃.
For a subtree, if it does not satisfy either of the two conditions, the subtree is considered
a non-potential discriminative subtree. Because a non-potential discriminative subtree breaches
one or both of the conditions, it does not contain any itemset that is frequent in the target data
stream S𝑖, or does not contain any itemset whose discriminative value reaches the
discriminative level 𝜃. According to Definition 3.1, the subtree cannot contain any
discriminative itemset.
∎
In Figure 3.5, the left-most subtree related to the processing Header-Table item 𝑎 under
root node 𝑐 is potential, with Max_freq𝑖(𝑐, 𝑎) = 5 ≥ (𝜑𝜃𝑛𝑖 = 0.1 ∗ 2 ∗ 15 = 3) and
Max_dis_value(𝑐, 𝑎) = 2 ≥ (𝜃 = 2). In this chapter, for the sake of simplicity, the dataset
lengths are omitted from the ratios as 𝑛1 = 𝑛2 = 15. In the case of data streams with different
lengths (i.e., 𝑛2/𝑛1 ≠ 1), the ratios must be multiplied by the constant 𝑛2/𝑛1.
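Once the two measures of a subtree are known, Heuristic 3.1 reduces to a two-condition guard; a minimal sketch with the Figure 3.5 numbers (the function name is illustrative):

```python
def potential_subtree(max_freq_i, max_dis_value, phi, theta, ni):
    # Heuristic 3.1: a subtree is potential only if both the maximum
    # frequency and the maximum discriminative value pass their thresholds.
    return max_freq_i >= phi * theta * ni and max_dis_value >= theta

# Subtree_c in Figure 3.5 with phi = 0.1, theta = 2, ni = 15
print(potential_subtree(5, 2.0, 0.1, 2, 15))  # True
```

Heuristic 3.2 applies the same two conditions to the internal-node measures Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) defined later in this section.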
Using HEURISTIC 3.1, all potential discriminative subtrees can be identified, and all
non-potential subtrees are excluded from the itemset combination generation. A
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) could still contain non-discriminative itemsets. Non-potential subsets
may exist in a 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) as internal nodes, which lie on the paths between the 𝑟𝑜𝑜𝑡 of
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 and the items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡). The internal nodes are
denoted as 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 (e.g., in the left-most subtree of the conditional FP-Tree in
Figure 3.5, 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 has two internal nodes, 𝑏 and 𝑑).
Let 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) denote a set of itemsets in subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with
a header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) with the internal node 𝑖𝑛 in each of the
itemsets, i.e., 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ⊆ 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎), for example, in Figure 3.5,
𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑐, 𝑏, 𝑎) = {𝐼(𝑏3,2), 𝐼(𝑏1,0)}. The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) in
the target dataset 𝑆𝑖 is denoted as Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎); for example, for 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑐, 𝑏, 𝑎),
Max_freq𝑖(𝑐, 𝑏, 𝑎) = 4. In Figure 3.5, the maximum frequency in 𝑆𝑖 of the itemsets containing the
internal node item 𝑑, i.e., 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑐, 𝑑, 𝑎) in the left-most subtree, with the single occurrence
𝐼(𝑑3,2), is equal to 3, i.e., Max_freq𝑖(𝑐, 𝑑, 𝑎) = 3.
The maximum discriminative value of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) is denoted as
Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎); for example, in Figure 3.5 the maximum discriminative value of the
itemsets containing the internal node item 𝑏 in the left-most subtree is equal to 2, i.e.,
Max_dis_value(𝑐, 𝑏, 𝑎) = 2, which is made from the two itemsets 𝐼(𝑏3,2) and 𝐼(𝑏1,0). The
maximum discriminative value of the itemsets containing the internal node item 𝑑 in the left-most
subtree is equal to 1.5, i.e., Max_dis_value(𝑐, 𝑑, 𝑎) = 1.5, which is made from the single itemset
𝐼(𝑑3,2).
The second heuristic is formally defined below.
HEURISTIC 3.2. An internal node 𝑖𝑛 ∈ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 in a potential subtree
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 is considered as potential discriminative internal node denoted as 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) if
it satisfies the following conditions:
1. Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜑𝜃𝑛𝑖
2. Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜃
Where 𝜃 > 1 is the discriminative level threshold, 𝜑 ∈ (0, 1/𝜃) is the support threshold, 𝑛𝑖 is the
size of the target data stream 𝑆𝑖, and 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) is the set of itemsets in subtree
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and
containing the internal node 𝑖𝑛.
Lemma 3-5 (Potential discriminative internal node) HEURISTIC 3.2 ensures that none
of the non-potential discriminative internal nodes occurs in any discriminative itemset.
Proof. The first condition in HEURISTIC 3.2 ensures that the sum of the frequencies in
the target dataset S𝑖 of the itemsets containing the potential discriminative internal node reaches
the frequency threshold, which implies that the subtree could contain a frequent itemset in S𝑖
that includes the internal node. The second condition in HEURISTIC 3.2 ensures that the
maximum discriminative value over these itemsets is at least the discriminative level 𝜃, which
implies that the subtree could contain an itemset including the internal node whose
discriminative value reaches 𝜃.
For an internal node, if it does not satisfy either of the two conditions, the internal node
is considered a non-potential discriminative internal node. Because a non-potential
discriminative internal node breaches one or both of the conditions, the internal node is not
contained in any itemset that is frequent in the target dataset S𝑖, or is not contained in any
itemset whose discriminative value reaches the discriminative level 𝜃. According to
Definition 3.1, the internal node cannot be contained in any discriminative itemset.
∎
In Figure 3.5, the internal node 𝑑 in the left-most subtree 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐), related
to the processing Header-Table item 𝑎 under root node 𝑐, is non-potential, with
Max_freq𝑖(𝑐, 𝑑, 𝑎) = 3 ≥ (𝜑𝜃𝑛𝑖 = 0.1 ∗ 2 ∗ 15) but Max_dis_value(𝑐, 𝑑, 𝑎) = 1.5 ≱ (𝜃 =
2). The non-potential internal nodes are excluded from itemset combination generation in
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) (e.g., the itemsets in 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑐, 𝑑, 𝑎) are not involved in the
minimized DISTree in Figure 3.6).
3.4.1.1 Potential discriminative itemsets generation using minimized DISTree
The DISSparse method does not use the DISTree structure, which is usually very large,
to identify discriminative itemsets. Instead, it identifies potential discriminative subtrees in the
conditional FP-Tree and then discovers discriminative itemsets from those potential discriminative
subtrees, which significantly increases efficiency. The potential discriminative itemsets
identified from a 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the conditional FP-Tree are represented in a minimized DISTree
structure, defined below.
Minimized DISTree: The minimized DISTree is similar to the DISTree structure
defined earlier in the previous section (Seyfi, Geva and Nayak 2014). The size of a minimized
DISTree is bounded by the size of the potential subsets in one potential discriminative subtree of a
conditional FP-Tree; non-potential subsets are ignored when the itemset combinations are
generated (e.g., the minimized DISTree in Figure 3.6 is generated out of the potential
discriminative subsets of the left-most subtree in the conditional FP-Tree in Figure 3.5, without
considering 𝑑, which is a non-potential internal node). The minimized DISTree covers the
itemsets starting with the 𝑟𝑜𝑜𝑡 item of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 as prefix and ending with items in
𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) as postfix (e.g., the minimized DISTree in Figure 3.6
covers the itemsets with prefix 𝑐 and postfix 𝑎, generated out of the potential discriminative
subset of items 𝑐, 𝑏 and 𝑎 in the left-most subtree in the conditional FP-Tree in Figure 3.5). This is
formally defined below.
Let 𝐼 be an itemset in a minimized DISTree for a given potential discriminative subtree,
i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡. For all internal subsets 𝐼′ of itemset 𝐼, i.e., 𝐼′ ⊂ 𝐼, which start immediately
after the 𝑟𝑜𝑜𝑡 item of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 and end before the items in
𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡), the subsets 𝐼′ are included in the minimized DISTree
generated from the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 if 𝐼 = {𝑟𝑜𝑜𝑡} ∪ 𝐼′ ∪ {𝑎}, where
𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡).
It should be noted that in an itemset, the 𝑟𝑜𝑜𝑡 item of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 and the item in
𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) may refer to the same node (i.e., a length-1 itemset); e.g.,
in the right-most subtree under root 𝑎 in Figure 3.5 (i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎), the node 𝑎0,2 is both the
𝑟𝑜𝑜𝑡 of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎 and a member of 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎). This confirms that the
minimized DISTree covers the itemsets in a 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 that satisfy the condition above. We
explain the conditional FP-Tree expansion that obtains all possible subtrees in the next subsection.
The minimized DISTree is generated from the potential discriminative subsets of one
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) for the processing Header-Table item; for example, Figure 3.6 shows
the minimized DISTree for Header-Table item 𝑎, generated out of the potential discriminative
subset of items 𝑐, 𝑏 and 𝑎 in the left-most subtree in the conditional FP-Tree in Figure 3.5 (i.e.,
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐)). As explained before, the minimized DISTree of a subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
covers the itemset combinations starting with the 𝑟𝑜𝑜𝑡 item of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 as prefix and ending
with items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) as postfix. The potential discriminative
itemsets with other prefix items are generated from potential discriminative subtrees with
different 𝑟𝑜𝑜𝑡 items, in separate minimized DISTrees.
Figure 3.6 Minimized DISTree generated from the left-most subtree in Figure 3.5
By generating the potential discriminative itemset combinations from all branches of a
potential discriminative subtree (e.g., the three branches of the left-most subtree
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) in Figure 3.5, i.e., 𝑐𝑏𝑑𝑎3,2, 𝑐𝑏𝑎1,0 and 𝑐𝑎1,4), the minimized DISTree is
traversed through the Header-Table item links for mining the discriminative itemsets based on
Definition 3.1 (e.g., the highlighted node 𝑎4,2 in Figure 3.6 refers to the discriminative itemset
𝑐𝑏𝑎4,2). The conditional FP-Tree is then expanded for processing the next subtree of the
processing Header-Table item, as in the section below.
3.4.1.2 Conditional FP-Tree expansion
In a conditional FP-Tree, except for the left-most subtree (such as the subtree with root 𝑐
in Figure 3.5), a subtree with a particular root item may not contain all the itemsets starting with
that item; for example, the subtree with root 𝑏 in Figure 3.5 does not include the itemset 𝑏𝑑𝑎,
which was included in the left-most subtree. In order to generate all possible discriminative
itemsets, before traversing the next 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 of the processing Header-Table item, the
conditional FP-Tree must be expanded by adding the sub-branches of the current 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
without their 𝑟𝑜𝑜𝑡 item. Each sub-branch is added to the conditional FP-Tree under the 𝑟𝑜𝑜𝑡 node
of one of the remaining subtrees, summing up the frequencies of the itemsets ending with the
items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡); for example, the three sub-branches of the
left-most subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 in Figure 3.5, i.e., 𝑏𝑑𝑎3,2, 𝑏𝑎1,0 and 𝑎1,4, are added to the
conditional FP-Tree as in Figure 3.7, producing three branches under roots 𝑏 and 𝑎, i.e., 𝑏𝑑𝑎3,2,
𝑏𝑎2,0 and 𝑎1,6. To save space, the 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) of the processed
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 are removed from the conditional FP-Tree. The conditional FP-Tree expansion
continues by processing each 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 until no 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 remains.
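The expansion step above can be sketched with branches held as item-tuple to frequency-pair maps; this dictionary representation is an assumption for illustration, and the initial branches of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 and 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎 (𝑏𝑎1,0 and 𝑎0,2) are read off Figure 3.5 as described in the text:

```python
def expand(processed_branches, remaining):
    # Add the sub-branches of the processed subtree (with its root item
    # dropped) under the remaining subtrees, summing the frequencies of
    # identical paths.
    for path, fi, fj in processed_branches:
        sub = path[1:]          # drop the root of the processed subtree
        if not sub:             # nothing left below the root
            continue
        acc = remaining.setdefault(sub, [0, 0])
        acc[0] += fi
        acc[1] += fj
    return remaining

# Figure 3.5 -> Figure 3.7: merge the sub-branches of Subtree_c
rest = {('b', 'a'): [1, 0], ('a',): [0, 2]}
expand([(('c', 'b', 'd', 'a'), 3, 2),
        (('c', 'b', 'a'), 1, 0),
        (('c', 'a'), 1, 4)], rest)
print(rest)  # {('b', 'a'): [2, 0], ('a',): [1, 6], ('b', 'd', 'a'): [3, 2]}
```

The result reproduces the three branches 𝑏𝑑𝑎3,2, 𝑏𝑎2,0 and 𝑎1,6 of Figure 3.7.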
Lemma 3-6 (Completeness of itemset prefixes) For the current processing Header-Table
item, the conditional FP-Tree expansion confirms that all the possible itemset prefixes of this
item can be obtained.
Proof. Each 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 is used for generating the potential discriminative itemsets
with 𝑟𝑜𝑜𝑡 item of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 as prefix and items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) as
postfix. The expanded conditional FP-Tree with sub-branches of the processed 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
ensures that the left-most subtree covers all the itemset prefixes starting with the subtree root
item. The expanded conditional FP-Tree with sub-branches of every processed 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡
obtains a complete set of itemsets in distinct subtrees starting with every possible 𝑟𝑜𝑜𝑡 item. This
ensures the completeness of the conditional FP-Tree in DISSparse method for itemset
combination generation with all prefixes for each header item.
∎
The expanded conditional FP-Tree of Header-Table item 𝑎 after processing the first
subtree (i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) is presented in Figure 3.7; for example, the path 𝑏𝑑𝑎3,2 is added to
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 by expanding the conditional FP-Tree with sub-branch 𝑏𝑑𝑎3,2 of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐. The size
of a conditional FP-Tree may increase through expansion, especially at the beginning. However, the increase is not exponential compared to the original conditional FP-Tree size and, as per the empirical analysis, does not affect the DISSparse algorithm's performance.
Figure 3.7 Expanded conditional FP-Tree of Header-Table item 𝑎 after processing the first
subtree
In Example 3.1, the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 is traversed by 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏)
links. The 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 has two branches ending with the 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏) i.e.,
𝐼(𝑎3,2) and 𝐼(𝑎2,0) as in Figure 3.7. Based on HEURISTIC 3.1, the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 is potential with Max_freq𝑖(𝑏, 𝑎) = 5 ≥ (𝜑𝜃𝑛𝑖 = 0.1 ∗ 2 ∗ 15) and Max_dis_value(𝑏, 𝑎) = 2.5 ≥ (𝜃 = 2). Based on HEURISTIC 3.2, the internal node 𝑑 is non-potential with Max_freq𝑖(𝑏, 𝑑, 𝑎) = 3 ≥ (𝜑𝜃𝑛𝑖 = 0.1 ∗ 2 ∗ 15) but Max_dis_value(𝑏, 𝑑, 𝑎) = 1.5 ≱ (𝜃 = 2).
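Both heuristic tests reduce to a pair of threshold comparisons; a minimal sketch (the function name is ours) using the Example 3.1 numbers quoted above:

```python
def is_potential(max_freq_i, max_dis_value, phi, theta, n_i):
    """HEURISTIC 3.1/3.2 style test: a subtree or internal node is
    potential only if both the frequency bound and the discriminative
    bound hold."""
    return max_freq_i >= phi * theta * n_i and max_dis_value >= theta

# Example 3.1 values: phi = 0.1, theta = 2, n_i = 15
print(is_potential(5, 2.5, 0.1, 2, 15))  # Subtree_b: potential
print(is_potential(3, 1.5, 0.1, 2, 15))  # internal node d: non-potential
```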
The minimized DISTree for the Header-Table item 𝑎 based on potential discriminative
subsets in the left-most subtree in conditional FP-Tree in Figure 3.7, 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏, is generated as
in Figure 3.8 (e.g., the highlighted node 𝑎(5,2) in Figure 3.8 refers to the discriminative itemset
𝑏𝑎5,2).
Figure 3.8 Minimized DISTree generated out of the potential discriminative subsets of the
left-most subtree in conditional FP-Tree for Header-Table item 𝑎
The conditional FP-Tree of the Header-Table item 𝑎 is then expanded by adding the two
sub-branches of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏, 𝑑𝑎3,2 and 𝑎2,0, as in Figure 3.9. Based on HEURISTIC 3.1, the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑑 with one itemset ending with the items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑑) (i.e., 𝐼(𝑎3,2))
is non-potential with Max_freq𝑖(𝑑, 𝑎) = 3 ≥ (𝜑𝜃𝑛𝑖 = 0.1 ∗ 2 ∗ 15) but Max_dis_value(𝑑, 𝑎) = 1.5 ≱ (𝜃 = 2), so the minimized DISTree is not generated. Based on
Lemma 3-6, related to the completeness of itemset prefixes, the conditional FP-Tree must be
expanded by adding the sub-branches of the non-potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 as well; for example, the
conditional FP-Tree in Figure 3.9 is expanded by adding the single sub-branch of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑑,
path 𝑎3,2. The 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎 with one itemset ending with the items in
𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑎) (i.e., 𝐼(𝑎6,8)) is found as non-potential.
Figure 3.9 Expanded modified conditional FP-Tree of Header-Table item a after processing
the second subtree
The potential discriminative itemset combination generation for the processing Header-
Table item is finished if there is no more 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the conditional FP-Tree. Following the
bottom-up order of Desc-Flist, the conditional FP-Tree is then generated for the rest of Header-
Table items, respectively (i.e., item 𝑒 in Example 3.1). In the DISSparse method, by processing the last Header-Table item (i.e., item 𝑐 in Example 3.1), the full set of discriminative itemsets is reported in 𝐷𝐼𝑖𝑗. In data streams, the non-discriminative subsets of discriminative itemsets, as part of the answer set, may become involved in window model updating. The basics of adjusting the frequencies of these itemsets are explained in the section below.
3.4.1.3 Tuning non-discriminative subsets of discriminative itemsets
In contrast to the Apriori property, and distinguishing discriminative itemset mining from frequent itemset mining, non-discriminative itemsets can appear as subsets of discriminative itemsets; for example, the item 𝑐 in Example 3.1 is a subset of discriminative itemsets, but 𝑐 itself is not discriminative. After mining all discriminative itemsets in the datasets, the frequencies of the non-discriminative itemsets that appear as subsets of discriminative itemsets must be adjusted accordingly using the original FP-Tree. In data stream mining, these itemsets may become
involved in window model updating, as discussed in Chapter 4 and Chapter 5. For the sake of clarity, in this thesis the non-discriminative subsets of discriminative itemsets are simply called non-discriminative subsets. Tuning the frequencies of non-discriminative subsets is not a time-consuming process, since the discriminative itemsets are sparse and have a small number of non-discriminative subsets.
Lemma 3-7 (Exact non-discriminative subsets) Tuning the frequencies of the non-discriminative itemsets that appear as subsets of discriminative itemsets using the original FP-Tree ensures the exact frequencies of these itemsets as part of the results that may become involved in window model updating.
Proof. The original FP-Tree is a superset of the conditional FP-Trees and has a full view of all itemsets in the datasets. The exact frequencies of the non-discriminative subsets are collected accurately from their appearances in the original FP-Tree by traversing the Header-Table links.
∎
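A sketch of this tuning step, assuming the exact frequencies have already been collected from the original FP-Tree into a lookup table (the table stands in for the Header-Table link traversal, and all names and values below are illustrative):

```python
from itertools import combinations

def tune_subsets(disc_itemsets, exact_freq):
    """Collect the non-discriminative proper subsets of discriminative
    itemsets and attach their exact (f_i, f_j) frequencies, looked up
    from a table standing in for the original FP-Tree."""
    disc = {frozenset(s) for s in disc_itemsets}
    tuned = {}
    for itemset in disc:
        for r in range(1, len(itemset)):
            for sub in combinations(sorted(itemset), r):
                s = frozenset(sub)
                if s not in disc:
                    tuned[s] = exact_freq[s]
    return tuned

# 'ba' is discriminative; its single-item subsets are not (values illustrative)
freqs = {frozenset("b"): (6, 3), frozenset("a"): (6, 8)}
result = tune_subsets([{"b", "a"}], freqs)
```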
3.4.2 DISSparse Algorithm
The DISSparse algorithm starts by reading the batch of transactions and making the
Desc-Flist based on the descending order of the item frequencies in the target data stream 𝑆𝑖. The
Desc-Flist order is used for saving space by sharing the paths in the prefix tree structures,
including the FP-Tree, conditional FP-Tree and minimized DISTree, with the most frequent items at the top. In data stream mining, this Desc-Flist is made from the first batch of transactions and remains the same for all upcoming batches in the data streams. If a single-pass algorithm is required, this step can be skipped by building the prefix tree structures using an alphabetical order of items. The input parameters, discriminative level 𝜃 and support threshold 𝜑,
are defined based on the application domain, data stream characteristics and sizes or by the
domain expert users as discussed in Chapter 6.
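Building the Desc-Flist amounts to a frequency count over the first batch of the target stream; a minimal sketch, with alphabetical tie-breaking as our own assumption for determinism:

```python
from collections import Counter

def build_desc_flist(target_batch):
    """Order items by descending frequency in the target stream S_i,
    breaking ties alphabetically (an assumption of this sketch)."""
    counts = Counter(item for txn in target_batch for item in txn)
    return sorted(counts, key=lambda it: (-counts[it], it))

print(build_desc_flist([["a", "b", "c"], ["a", "b"], ["a", "d"]]))
# ['a', 'b', 'c', 'd']
```

The resulting order is then reused for every subsequent batch, so shared prefixes in the FP-Tree stay stable across the stream.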
The FP-Tree and Header-Table are built using expansion of FP-Growth (Han, Pei and
Yin 2000) for the batch of transactions 𝐵. Following the bottom-up order of Desc-Flist, the
conditional FP-Tree is built for each Header-Table item. Every 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the conditional
FP-Tree is assessed using HEURISTIC 3.1 and HEURISTIC 3.2 to limit the algorithm to the
potential discriminative itemsets. The minimized DISTree is generated for a
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) by the potential discriminative itemset combinations and
discriminative itemsets are instantly reported in 𝐷𝐼𝑖𝑗. The conditional FP-Tree of the processing
Header-Table item is expanded by the sub-branches of the processed 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 and the process continues while a new 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 remains. After full discovery of the discriminative itemsets, the frequencies of the non-discriminative subsets are tuned using the original FP-Tree.
Algorithm 3.3 (DISSparse: Discriminative itemset mining using Sparse
Prefix Tree)
Input: (1) The discriminative level threshold 𝜃; (2) The support threshold 𝜑; (3)
The input batch 𝐵 made of transactions with alphabetically ordered items
belonging to data streams 𝑆𝑖 and 𝑆𝑗.
Output: 𝐷𝐼𝑖𝑗, a set of discriminative itemsets in 𝑆𝑖 against 𝑆𝑗 in a batch of
transactions 𝐵.
Begin
1) Scan 𝐵 to generate Header-Table and to order the items in Header-Table
by frequency;
2) Make FP-Tree for 𝐵;
3) 𝐷𝐼𝑖𝑗 ={ };
4) For each item x in Header-Table do // x is least-frequent
5) Make conditional FP-Treex based on item x;
6) While conditional FP-Treex has remaining 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 do // left-most order
7) Assess 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 using HEURISTIC 3.1;
8) If 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) then
9) Find 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡) using HEURISTIC 3.2;
10) Generate minimized DISTree based on 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡)
with only potential internal nodes in 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡);
11) Generate discriminative itemsets from the minimized DISTree;
12) Update 𝐷𝐼𝑖𝑗 by adding the discovered discriminative itemsets;
13) End if;
14) Expand conditional FP-Treex by sub-branches of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡;
15) End while;
16) End for;
17) Report discriminative itemsets in 𝐷𝐼𝑖𝑗;
End.
Based on the three lemmas, we can prove that the DISSparse algorithm can find all
correct discriminative itemsets.
Theorem 3-3 (Completeness and correctness of DISSparse): Based on Lemma 3-6,
each conditional FP-Tree obtains a complete set of itemset prefixes for each Header-Table item.
Based on Lemma 3-4 and Lemma 3-5, the potential discriminative itemsets in each potential
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in conditional FP-Tree are generated in a minimized DISTree and discriminative
itemsets are discovered correctly. These prove the completeness and correctness of the
DISSparse method by discovering all the discriminative itemsets and their non-discriminative
subsets.
As discussed, in this thesis the problem of mining discriminative itemsets in data streams
is defined, using two different window models. The proposed DISSparse method is used in
Chapter 4 for mining discriminative itemsets in data streams using the tilted-time window model
(e.g., Figure 2.3). The DISSparse method is then used in Chapter 5 for mining discriminative
itemsets in data streams using the sliding window model (e.g., Figure 2.4). The DISSparse
algorithm is used for processing a single batch of transactions 𝐵 following the updating process
for the two window models, as was explained briefly for DISTree algorithm in Section 3.3.2.
3.4.3 DISSparse Algorithm Complexity
In the DISSparse algorithm, the parts attracting considerable complexity are finding the potential discriminative itemsets, generating the minimized DISTree and expanding the conditional FP-Tree.
The most time-consuming part is finding the potential subtrees. For this part, we used two separate loops. Each item is checked once and Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) is calculated. Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) is calculated based on the selected items using a greedy method with nested loops. The size of the outer loop is the number of items; the inner loop, "find 𝐼(𝑏) with maximum 𝑅𝑖𝑗(𝐼(𝑏))", involves combinations. However, the greedy method decreases the number of combinations 𝐼(𝑏) that must be examined. The number of potential discriminative itemsets generated in a minimized DISTree is much smaller than the exponential number of frequent itemsets in the target dataset 𝑆𝑖 generated in the DISTree method. DISSparse does not use a big intermediate data structure as DISTree does, and discriminative itemsets are instantly discovered from the minimized DISTree. The number of discovered discriminative itemsets is usually smaller than the actual size of the minimized DISTree.
The efficiency of the DISSparse algorithm is evaluated in Chapter 6 with empirical
analysis which shows the performance of the proposed method on different datasets in
comparison to the proposed DISTree method. The method is tested with different discriminative
level thresholds, support thresholds and ratios between the sizes of the two datasets. As discussed there, the DISSparse method has efficient time and space complexity on generally large datasets for mining a large number of both short and long discriminative itemsets.
The 𝛿-discriminative emerging patterns have a definition close to the discriminative itemsets defined in this chapter. However, there are several differences, as discussed earlier. We explained that the discriminative itemsets discovered by the DISSparse algorithm cannot be limited by a static bound on frequency (< 𝛿) in the general dataset. For the purpose of comparison, and as another baseline method, we modify the definition criteria, and consequently the proposed heuristics in the DISSparse algorithm, to discover all the 𝛿-discriminative emerging patterns, as in the section below. We also modify the original DPMiner method (DPM in short) to discover all the 𝛿-discriminative emerging patterns (i.e., including redundant emerging patterns). The methods are compared in Chapter 6 in terms of time and space usage.
3.4.4 Modified DISSparse and modified DPMiner
The DISSparse method is modified to discover the 𝛿-discriminative emerging patterns, following a similar definition as in the DPMiner method. The modification lies in how the subtrees in the conditional FP-Tree are assessed, pruning the non-delta-discriminative itemsets instead of the non-discriminative itemsets. The conditional FP-Tree is traversed from the left-most subtree through the Header-Table item links. The two proposed heuristics in the DISSparse
method are re-defined to find the potential subtrees and potential internal nodes for mining delta-
discriminative itemsets, respectively. The algorithm looks for delta-discriminative itemsets
instead of the discriminative itemsets defined in this chapter; the rest of the algorithm is the same as the original DISSparse algorithm. Each subtree and its internal nodes are checked based on the two new heuristics to identify the potential subtrees for mining 𝛿-discriminative itemsets. A potential subtree is defined based on the two conditions below.
The maximum frequency of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) in the datasets 𝑆𝑖 and 𝑆𝑗, denoted as Max_freq(𝑟𝑜𝑜𝑡, 𝑎), is defined as the sum of the frequencies in 𝑆𝑖 and 𝑆𝑗 of the itemsets in 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎), as below.

Max_freq(𝑟𝑜𝑜𝑡, 𝑎) = ∑_{𝑏 ∈ 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡,𝑎)} (𝑓𝑖(𝑏) + 𝑓𝑗(𝑏))    (3-d)
Let 𝒮 be the power set of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎), 𝒮 = 2^𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡,𝑎), i.e., 𝒮 consists of all subsets of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎). For 𝐵 ∈ 𝒮 and 𝐵 ≠ { }, the delta-discriminative value of the itemsets in 𝐵 is defined below.

Delta_dis(𝐵) = ∑_{𝑏∈𝐵} 𝑓𝑖(𝑏),  if ∑_{𝑏∈𝐵} (𝑓𝑖(𝑏) + 𝑓𝑗(𝑏)) ≥ 𝑚𝑖𝑛_𝑠𝑢𝑝 and ∑_{𝑏∈𝐵} 𝑓𝑖(𝑏) ≤ ∑_{𝑏∈𝐵} 𝑓𝑗(𝑏)
Delta_dis(𝐵) = ∑_{𝑏∈𝐵} 𝑓𝑗(𝑏),  if ∑_{𝑏∈𝐵} (𝑓𝑖(𝑏) + 𝑓𝑗(𝑏)) ≥ 𝑚𝑖𝑛_𝑠𝑢𝑝 and ∑_{𝑏∈𝐵} 𝑓𝑗(𝑏) < ∑_{𝑏∈𝐵} 𝑓𝑖(𝑏)
(3-e)
Delta_dis(𝐵) is the smaller frequency of itemset 𝐵 among the two datasets. The
minimum delta-discriminative value of all itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item 𝑎,
denoted as Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎), is defined below.
Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) = 𝑚𝑖𝑛_{𝐵∈𝒮, 𝐵≠{ }} {Delta_dis(𝐵)}    (3-f)
The Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) can be defined either based on the dataset 𝑆𝑖 or dataset 𝑆𝑗
depending on the frequency of the itemsets in the datasets. Obviously, Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎)
shows the highest discrimination among all subsets of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎).
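The definitions (3-d) to (3-f) can be made concrete by direct enumeration of the power set; the brute-force sketch below is only to illustrate the definitions, since Algorithm 3.4 exists precisely to avoid this enumeration. Each leaf itemset is represented by its (𝑓𝑖, 𝑓𝑗) pair, and the values are illustrative:

```python
from itertools import combinations

def max_freq_and_min_delta_dis(leaves, min_sup):
    """Direct evaluation of Eq. (3-d)-(3-f). leaves are the (f_i, f_j)
    pairs of itemsets(root, a)."""
    mf = sum(fi + fj for fi, fj in leaves)          # Eq. (3-d)
    best = None
    for r in range(1, len(leaves) + 1):
        for B in combinations(leaves, r):           # all non-empty subsets
            sfi = sum(fi for fi, _ in B)
            sfj = sum(fj for _, fj in B)
            if sfi + sfj >= min_sup:                # frequency guard, Eq. (3-e)
                d = min(sfi, sfj)                   # the smaller frequency
                best = d if best is None else min(best, d)
    return mf, best

print(max_freq_and_min_delta_dis([(3, 2), (2, 0), (1, 4)], min_sup=5))
# (12, 1)
```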
In order to determine Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎), all possible itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 will
have to be generated, as required by the above equation. However, generating all possible itemset combinations is time-consuming. An efficient method, described below, is designed to find a subset 𝐵 of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) that makes Delta_dis(𝐵) minimum among all subsets. Let 𝑆𝑓𝑖 and 𝑆𝑓𝑗 be the sums of the frequencies of all itemsets in 𝐵 in datasets 𝑆𝑖 and 𝑆𝑗, respectively. The sum of 𝑆𝑓𝑖 and 𝑆𝑓𝑗 is checked for being at least 𝑚𝑖𝑛_𝑠𝑢𝑝 and with 𝑆𝑓𝑗 ≤ 𝛿 or 𝑆𝑓𝑖 ≤ 𝛿. For the sake of simplicity we only show the 𝑆𝑓𝑗 ≤ 𝛿 case; the complete algorithm covers both 𝑆𝑓𝑗 ≤ 𝛿 and 𝑆𝑓𝑖 ≤ 𝛿.
Initially, 𝐹𝑖 is initialized by summing up the 𝑓𝑖(𝑏) frequencies of the itemsets 𝑏 with
𝑓𝑗(𝑏) = 0. If 𝐹𝑖 is frequent then the subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 is potential.
We delete the frequencies of itemsets from the sum frequencies, one by one, until we find the potential subtree or run out of itemsets. Let the 𝐼(𝑏) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 with maximum ((𝑆𝑓𝑖 + 𝑆𝑓𝑗) − (𝑓𝑖(𝐼(𝑏)) + 𝑓𝑗(𝐼(𝑏)))) / (𝑆𝑓𝑗 − 𝑓𝑗(𝐼(𝑏))) be defined as 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓. The frequencies of the itemset 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓 are deducted from 𝑆𝑓𝑖 and 𝑆𝑓𝑗 each time, until 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 becomes potential or no more itemsets 𝐼(𝑏) remain. The algorithm for calculating Max_freq(𝑟𝑜𝑜𝑡, 𝑎) and estimating Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) is presented in Algorithm 3.4.
Algorithm 3.4 𝐌𝐚𝐱_𝐟𝐫𝐞𝐪_𝐌𝐢𝐧_𝐝𝐞𝐥𝐭𝐚_𝐝𝐢𝐬
Input: (1)𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡; (2) header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡)
Output: (1) Max_freq(𝑟𝑜𝑜𝑡, 𝑎); (2) Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎).
Begin
1. 𝑆𝑓𝑖 = 0, 𝑆𝑓𝑗 = 0; 𝐹𝑖 = 0, 𝐹𝑗 = 0;
2. For each item 𝑏 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) do
3. 𝑆𝑓𝑖+= 𝑓𝑖(𝐼(𝑏)); 𝑆𝑓𝑗+= 𝑓𝑗(𝐼(𝑏));
4. If 𝑓𝑗(𝐼(𝑏)) = 0 then 𝐹𝑖+= 𝑓𝑖(𝐼(𝑏)); End if;
5. End For;
6. If 𝐹𝑖 ≥ 𝑚𝑖𝑛 _𝑠𝑢𝑝 then
7. Max_freq(𝑟𝑜𝑜𝑡, 𝑎) = 𝐹𝑖; Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) = 0;
8. Else
9. While ∃𝑏 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑏 is unchecked do
10. Find 𝐼(𝑏) with maximum ((𝑆𝑓𝑖 + 𝑆𝑓𝑗) − (𝑓𝑖(𝐼(𝑏)) + 𝑓𝑗(𝐼(𝑏)))) / (𝑆𝑓𝑗 − 𝑓𝑗(𝐼(𝑏))) as 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓;
11. If (𝑆𝑓𝑖 + 𝑆𝑓𝑗) ≥ 𝑚𝑖𝑛 _𝑠𝑢𝑝 & 𝑆𝑓𝑗 ≤ 𝛿 then
12. Max_freq(𝑟𝑜𝑜𝑡, 𝑎) = 𝑆𝑓𝑖 + 𝑆𝑓𝑗;
13. Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) = 𝑆𝑓𝑗.
14. Else
15. 𝑆𝑓𝑖−= 𝑓𝑖(𝑚𝑖𝑛_𝑙𝑒𝑎𝑓); 𝑆𝑓𝑗−= 𝑓𝑗(𝑚𝑖𝑛_𝑙𝑒𝑎𝑓);
16. End if;
17. Tag 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓 as checked;
18. End While;
19. End if;
20. Return Max_freq(𝑟𝑜𝑜𝑡, 𝑎) and Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎);
End.
In the above algorithm, the items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) are scanned in two separate loops. In the first loop, Max_freq(𝑟𝑜𝑜𝑡, 𝑎) is calculated based on all the itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡; 𝐹𝑖 is also calculated based on the itemsets with 𝑓𝑗(𝐼(𝑏)) = 0. If 𝐹𝑖 ≥ 𝑚𝑖𝑛_𝑠𝑢𝑝, the subtree is potential. Otherwise, in the second loop, each item is checked once using a greedy method and Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) is calculated based on the selected items. The itemset 𝐼(𝑏) with the smallest ratio 𝑅𝑖𝑗(𝐼(𝑏)) is found as 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓. Before deducting the 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓 frequencies from 𝑆𝑓𝑖 and 𝑆𝑓𝑗, it is checked whether (𝑆𝑓𝑖 + 𝑆𝑓𝑗) ≥ 𝑚𝑖𝑛_𝑠𝑢𝑝 and 𝑆𝑓𝑗 ≤ 𝛿; if so, the subtree is potential, otherwise the 𝑚𝑖𝑛_𝑙𝑒𝑎𝑓 frequencies are deducted from 𝑆𝑓𝑖 and 𝑆𝑓𝑗, respectively.
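Under the assumption that each leaf itemset is represented simply by its (𝑓𝑖, 𝑓𝑗) frequency pair, the greedy procedure of Algorithm 3.4 (the 𝑆𝑓𝑗 ≤ 𝛿 direction only, as in the text) can be sketched as follows; the function name and data layout are ours.

```python
def max_freq_min_delta_dis(leaves, min_sup, delta):
    """Greedy sketch of Algorithm 3.4 (S_fj <= delta direction only).
    leaves: (f_i, f_j) pairs of itemsets ending with header item a.
    Returns (Max_freq, Min_delta_dis) or None if non-potential."""
    sfi = sum(fi for fi, _ in leaves)
    sfj = sum(fj for _, fj in leaves)
    fi_only = sum(fi for fi, fj in leaves if fj == 0)
    if fi_only >= min_sup:                          # lines 6-7
        return fi_only, 0
    remaining = list(leaves)
    while remaining:                                # lines 9-18
        def ratio(leaf):                            # line 10 scoring
            num = (sfi + sfj) - (leaf[0] + leaf[1])
            den = sfj - leaf[1]
            return num / den if den > 0 else float("inf")
        min_leaf = max(remaining, key=ratio)
        if sfi + sfj >= min_sup and sfj <= delta:   # line 11
            return sfi + sfj, sfj                   # lines 12-13
        sfi -= min_leaf[0]                          # line 15
        sfj -= min_leaf[1]
        remaining.remove(min_leaf)                  # line 17
    return None                                     # subtree non-potential
```

For example, with leaves (3,2), (2,0) and (1,4), 𝑚𝑖𝑛_𝑠𝑢𝑝 = 5 and 𝛿 = 2, the leaf (1,4) is deducted first and the subtree becomes potential with Max_freq = 7 and Min_delta_dis = 2.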
The HEURISTIC_Delta 3.1 is defined below.
HEURISTIC_Delta 3.1. 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the conditional FP-Tree in terms of a header item
𝑎 is considered as a potential delta-discriminative subtree denoted as
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) if it satisfies the following conditions:
1. Max_freq(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝑚𝑖𝑛 _𝑠𝑢𝑝
2. Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑎) ≤ 𝛿
Where 𝛿 is a small integer number, 𝑚𝑖𝑛 _𝑠𝑢𝑝 is the support threshold and 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) is
a set of itemsets in subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item
𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡).
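The heuristic itself is just the pair of threshold tests on the quantities returned by Algorithm 3.4; a one-line sketch (function name ours):

```python
def is_potential_delta(max_freq, min_delta_dis, min_sup, delta):
    """The two conditions of HEURISTIC_Delta 3.1: the subtree must be
    frequent overall and contain a subset within the delta bound."""
    return max_freq >= min_sup and min_delta_dis <= delta

print(is_potential_delta(12, 1, min_sup=5, delta=2))  # True: potential
```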
Lemma 3-8 (Potential delta-discriminative subtree) HEURISTIC_Delta 3.1 ensures
that none of the non-potential delta-discriminative subtrees contains any delta-discriminative
itemset.
Proof. The first condition in HEURISTIC_Delta 3.1 ensures that the sum of frequencies
in datasets of itemsets in a potential delta-discriminative subtree is frequent, which implies that
there could be an itemset in datasets in the subtree which is frequent. The second condition in
HEURISTIC_Delta 3.1 ensures that the minimum delta-discriminative value of a potential
discriminative subtree is smaller than the delta-discriminative value 𝛿, which implies that there
could be an itemset in the subtree whose discriminative value is smaller than the delta-
discriminative value 𝛿.
For a subtree, if it fails one or both of the two conditions, it is considered a non-potential delta-discriminative subtree. Such a subtree does not contain any itemset that is frequent in the datasets, or does not contain any itemset whose delta-discriminative value is at most 𝛿. According to the definition, the subtree cannot contain any delta-discriminative itemset.
∎
A potential subtree could still contain non-𝛿-discriminative itemsets, and non-potential subsets may exist in a potential subtree as internal nodes. We define HEURISTIC_Delta 3.2 for the potential internal nodes in the same way as HEURISTIC_Delta 3.1, as below.
HEURISTIC_Delta 3.2. An internal node 𝑖𝑛 ∈ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 in a potential subtree
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 is considered as potential delta-discriminative internal node denoted as
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) if it satisfies the following conditions:
1. Max_freq(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝑚𝑖𝑛 _𝑠𝑢𝑝
2. Min_delta_dis(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≤ 𝛿
Where 𝛿 is a small integer number, 𝑚𝑖𝑛 _𝑠𝑢𝑝 is the support threshold and
𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) is a set of itemsets in subtree 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item
𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) with the internal node 𝑖𝑛 as subset.
Lemma 3-9 (Potential delta-discriminative internal node) HEURISTIC_Delta 3.2
ensures that none of the non-potential delta-discriminative internal nodes would occur in any
delta-discriminative itemset.
Proof. The first condition in HEURISTIC_Delta 3.2 ensures that the sum of
frequencies in datasets of itemsets with subset of the potential delta-discriminative internal node
is frequent, which implies that there could be an itemset in datasets in the subtree with subset of
the potential delta-discriminative internal node which is frequent. The second condition in
HEURISTIC_Delta 3.2 confirms that the minimum delta-discriminative value of the potential
delta-discriminative internal node is smaller than the delta-discriminative value 𝛿, which implies
that there could be an itemset in the subtree with subset of the potential delta-discriminative
internal node whose delta-discriminative value is smaller than the delta-discriminative value 𝛿.
For an internal node, if it fails one or both of the two conditions, it is considered a non-potential delta-discriminative internal node. Such an internal node is not contained in any itemset that is frequent in the datasets, or is not contained in any itemset whose delta-discriminative value is at most 𝛿. According to the definition, the internal node cannot be contained in any delta-discriminative itemset.
∎
The non-potential internal nodes are excluded from itemset combination generation in 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡). The rest of the DISSparse algorithm remains the same as the original DISSparse method proposed in this chapter. Based on the new heuristics, the DISSparse algorithm discovers all the 𝛿-discriminative emerging patterns.
The DPM algorithm is also modified to cover all the delta-discriminative itemsets with any support (i.e., including redundant delta-discriminative emerging patterns). We
generate all the combinations of each delta-discriminative emerging pattern to find all the delta-
discriminative emerging patterns with explicit frequencies. The generated itemsets are checked
for the true frequencies using the hashing function proposed in DPMiner (Li, Liu and Wong
2007). The modified DISSparse and modified DPM are evaluated based on different datasets in
Chapter 6.
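Generating all the combinations of a pattern is a power-set enumeration; a minimal sketch (the check of true frequencies against the DPMiner hashing structure is not reproduced here):

```python
from itertools import combinations

def all_subpatterns(pattern):
    """Enumerate every non-empty combination of a delta-discriminative
    emerging pattern; each candidate would then have its true frequency
    verified (e.g., via the DPMiner hashing step, omitted here)."""
    items = sorted(pattern)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

print(len(all_subpatterns({"a", "b", "c"})))  # 7 non-empty subsets
```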
3.4.5 DISSparse summary
In this section, the DISSparse method was proposed for efficient mining of discriminative itemsets in data streams, following the basics of the FP-Growth method (Han, Pei and Yin 2000) and proposing determinative heuristics. The DISSparse method generates only the potential discriminative itemsets ending with each Header-Table item to discover the discriminative itemsets. Following the FP-Growth method, and as with the DISTree method, the generation of itemset combinations for each Header-Table item is based on the conditional patterns and the conditional FP-Tree made for that item. In the DISSparse method, the conditional FP-Tree is a modified data structure, and itemset generation from the conditional FP-Tree is limited by taking advantage of the sparse characteristics of the discriminative itemsets.
DISSparse does not use a big intermediate data structure as DISTree does, and discriminative itemsets are directly updated to the window model. The process of itemset combination generation is based on the potential discriminative itemsets, eliminating the non-potential discriminative itemsets using HEURISTIC 3.1 and HEURISTIC 3.2. The frequencies of the non-discriminative itemsets that appear as subsets of discriminative itemsets are extracted using the original FP-Tree. The DISSparse method, using the proposed heuristics and potential discriminative itemset generation, ensures efficient and accurate discriminative itemset mining with minimum space usage.
Based on the empirical evaluations in Chapter 6, the DISSparse method with the proposed heuristics exhibits efficient time and space complexity, especially with smaller discriminative level thresholds. In the DISSparse method, the potential discriminative itemsets are discovered in each subtree in the conditional FP-Tree, and the minimized DISTree structure is generated from the potential discriminative itemset combinations for a potential subtree of the Header-Table item. The conditional FP-Tree is expanded during the process to cover the subtrees with all the itemset prefixes. The number of itemsets generated in the DISSparse method is much smaller than the exponential number of frequent itemsets in the target data stream 𝑆𝑖 generated in the DISTree method.
The precision of the DISSparse algorithm was proved by Theorem 3-3 based on its
correctness and completeness. The DISSparse method can be used correctly for offline updating
of the different window models.
3.5 CHAPTER SUMMARY
Considering the two proposed methods in this chapter, DISTree is applicable for small
datasets with simpler complexities. The DISTree method follows the generation of all
combinations of single paths in a conditional FP-Tree for the Header-Table items if they are
frequent in the target data stream. The itemsets generated out of each conditional FP-Tree are
tested based on the definitions of discriminative and non-discriminative itemsets. This is a time-consuming process, as the discriminative itemsets are a sparse subset of the frequent itemsets. The
DISSparse method is proposed for efficient mining of discriminative itemsets using
determinative heuristics for restricting the process to the potential discriminative itemsets. Based
on the discussed theorems, both algorithms report the discriminative itemsets with full accuracy
and recall in a single batch of transactions.
The process of mining discriminative itemsets highly depends on the distributions of
transactions in batches in data streams. Following the principles of FP-Growth-based methods,
the FP-Tree structure is used in DISTree and DISSparse algorithms for holding the input
transactions in a concise way in the main memory. In the DISTree method, the DISTree structure
also stays in the main memory. Although the DISTree size is controlled by pruning the non-
discriminative itemsets, it may still consume considerable memory space for large batches. A big DISTree results from the necessity of checking a large number of itemset combinations, as the discriminative itemsets do not follow the Apriori property. In fact, it may not be feasible to hold the DISTree structure in memory for large batches, so in the DISTree method it is necessary to choose an appropriate size for the single batch of transactions.
In the DISTree method, many of the generated itemsets are frequent in both target data
stream 𝑆𝑖 and general data stream 𝑆𝑗. In the DISSparse method using the proposed heuristics
many non-discriminative itemsets are skipped from itemset combination generation in a minimized DISTree, resulting in significant improvements in mining discriminative itemsets as
reported in Chapter 6. In the DISSparse method, the modified conditional FP-Tree structure is
generated for each processing Header-Table item. The discriminative itemsets are instantly
updated from the minimized DISTree structure, generated for a potential discriminative subtree,
to the window model without the need of a big intermediate data structure as in DISTree method.
These are small data structures as discussed theoretically and also evidenced by empirical
analysis in Chapter 6.
Both the DISTree and DISSparse methods work based on two scans. In the first scan,
the frequent items in the target data stream are found and the items in all input transactions and
the paths in prefix tree structures are ordered based on the descending order of the frequent items.
Several data stream mining algorithms (Giannella et al. 2003; Chi et al. 2006; Tanbeer et al.
2009) use two scans for making the concise data structures and faster processing time, in which
items are ordered by decreasing frequencies as in (Han, Pei and Yin 2000). In (Seyfi, Geva and
Nayak 2014) we proposed the DISTree algorithm based on a single scan, but the performance of the algorithm was highly affected by large data structures and high processing time.
We explained many different real world applications for mining discriminative itemsets.
One of the interesting scenarios is in market basket analyses by looking for itemsets being bought
more frequently in one market compared to the rest of markets. This can be used for
personalization or anomaly detection. Figure 3.10 shows a sample of discriminative itemsets with
different discriminative levels discovered in one market compared to the other markets. Based on the empirical analysis in Chapter 6 on different datasets, the average number of long discriminative itemsets decreases with higher discriminative level thresholds. The information
provided with the discriminative itemsets can be used for better differentiation of the trends in the
target market compared to the trends in rest of the markets. Based on the experiments conducted
in Chapter 6, the discriminative itemsets with very high discriminative levels appear less often in
the total distribution of discriminative itemsets.
Figure 3.10 A sample of discriminative itemsets distribution with different discriminative
levels in market basket monitoring application
The interesting characteristic of discriminative itemsets compared to the frequent
itemsets is that all of the discovered discriminative itemsets are useful and meaningful (e.g., the
frequent itemsets have redundancy, which is limited by the closed frequent itemset). Also, many
of the (closed or) frequent itemsets are frequent in both the target data stream and the general data
stream. They may not be a good source of knowledge for differentiation between data streams.
The difference between the discriminative and frequent itemsets can be highlighted by significant
differences in the number of discriminative itemsets discovered in the data sets in comparison to
the frequent itemsets as reported in Chapter 6.
In this chapter, the multiple data streams were treated as one target data stream 𝑆𝑖 and the
rest of them combined as the general data stream 𝑆𝑗. The methods remain principally the same
even if the multiple data streams need to be considered separately. We list this as future work,
requiring modification of the current implementation for applications with more than two
data streams. In the next chapter, we propose the algorithms for mining discriminative itemsets in
data streams using the tilted-time window model.
Chapter 4: Mining Discriminative Itemsets in
Data Streams using the Tilted-time Window
Model
In this chapter, the problem of mining discriminative itemsets in data streams using the tilted-
time window model is formally defined. The comprehensive research problem is outlined and
one method is proposed. The highly efficient and highly accurate method called H-DISSparse
utilizes the DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3 together with the
tilted-time window model. The proposed method is explained in detail with its novel data
structures and offline updating of the tilted-time window model. In order to achieve the best
approximation in mining discriminative itemsets in data streams, we use the properties of the
discriminative itemsets in the tilted-time window model to propose a novel and efficient method.
The proposed method guarantees approximate support and approximate ratio bounds for
discriminative itemsets in large and fast-growing data streams, with the concise processing
required by real-world applications. In Chapter 6, the proposed method is extensively evaluated
on data streams made of multiple batches of transactions exhibiting diverse characteristics and
with different threshold settings. Empirical analysis shows high efficiency in time and space
usage, with the most refined approximate bounds on discriminative itemsets achieved by the
H-DISSparse algorithm. To the best of our knowledge, the proposed method in this chapter is
the first algorithm for mining discriminative itemsets in data streams using the tilted-time
window model.
The chapter starts by describing the existing works in Section 4.1. The mathematical
definition of the research problem, using several notations, is presented in Section 4.2. The tilted-
time window model and its updating process are discussed in Section 4.3. In Section 4.4 the H-
DISSparse method is proposed as an advanced method for highly efficient and highly accurate
mining of discriminative itemsets in data streams using the tilted-time window model. The
chapter concludes with a discussion of the H-DISSparse method and its state-of-the-art
techniques in Section 4.5.
4.1 EXISTING WORKS
FP-Stream is a well-known method proposed for mining frequent patterns from data streams
using the tilted-time window model (Giannella et al. 2003). Mining frequent patterns from data
streams poses more challenges than mining static datasets, as infrequent patterns can
become frequent later and cannot be ignored. The data structures need to be adjusted regularly
according to pattern frequencies over time. An extended framework is proposed for mining the time-
sensitive frequent patterns in data streams with an approximate support guarantee. The prefix tree
data structure proposed in (Han, Pei and Yin 2000) is an effective and concise data structure for
frequent pattern mining. FP-Stream is a model based on the FP-Tree that consists of an in-memory
pattern tree structure to capture the frequent and sub-frequent itemsets, and a tilted-time window
table for each itemset. By embedding the tilted-time window structure into each node, space
is saved.
Using an efficient algorithm (Giannella et al. 2003), the FP-Stream structure is
maintained and updated over a single data stream to summarize the frequent patterns at multiple
time granularities. Considering the large volume of a dynamic data stream, it is not possible to
maintain all historical frequent patterns in a limited-size window. The frequent patterns
are compressed and stored using a prefix tree structure under a tilted-time window framework
residing in main memory. The patterns in the data stream are updated incrementally with
incoming transactions. The frequent and sub-frequent patterns are maintained in a pattern tree
data structure with a built-in tilted-time window model for each individual pattern. The method is
able to answer time-sensitive queries under the FP-Stream framework over data streams with
an error bound guarantee.
For the completeness of frequent patterns over stream data, the FP-Stream method stores
the information related to infrequent itemsets. Given the minimum support 𝑚𝑖𝑛_𝑠𝑢𝑝 and the
relaxation error rate ɛ, an itemset 𝐼 is frequent if its support is no less than 𝑚𝑖𝑛_𝑠𝑢𝑝; it is
sub-frequent if its support is less than 𝑚𝑖𝑛_𝑠𝑢𝑝 but not less than ɛ; otherwise it is infrequent.
The sub-frequent patterns are maintained because patterns that are not frequent now may become
frequent later. The infrequent patterns are discarded, as they are large in number and losing
information about them does not affect the support of itemsets too much. Incremental updates of
the prefix tree and the embedded tilted-time window model occur when some infrequent patterns
become sub-frequent, or vice versa. The definitions of frequent, sub-frequent, and infrequent
patterns are actually relative to a period of time. For example, a pattern may be sub-frequent over
one period, but it is possible that it becomes infrequent over a longer period.
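The frequent / sub-frequent / infrequent split described above can be sketched as a small helper. This is an illustrative sketch, not FP-Stream's actual code; the function name is ours, and `min_sup` and `eps` correspond to the two thresholds from the text (with ɛ < 𝑚𝑖𝑛_𝑠𝑢𝑝):

```python
def classify_pattern(support: float, min_sup: float, eps: float) -> str:
    """Classify an itemset's support following the FP-Stream scheme
    (Giannella et al. 2003). `support` is the itemset's relative
    frequency over the period in question; assumes eps < min_sup.
    """
    if support >= min_sup:
        return "frequent"
    if support >= eps:          # below min_sup but above the relaxation threshold
        return "sub-frequent"   # kept: it may become frequent later
    return "infrequent"         # discarded: loses little support information
```

Only frequent and sub-frequent patterns are stored in the pattern tree; the infrequent ones are dropped, which is what bounds the error by ɛ.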
In the FP-Stream method (Giannella et al. 2003), the FP-Tree structure is used for saving
the information about the transactions in the current window. The discovered frequent patterns
are saved with their frequency histories in the tilted-time windows related to each pattern in the
pattern tree structure. This tree is similar to the FP-Tree, with the difference that it stores
patterns instead of transactions. Mining the frequent patterns over different time intervals in a
data stream assumes that the transactions can be scanned in a limited-size window at any
moment. The pattern fragment method is used for mining the frequent patterns in the current
window frames. However, the pattern fragment method is not applicable to the problem of discriminative
itemset mining, as it does not follow the Apriori property. We discuss this in detail in
Section 4.3. Using the FP-Stream structure it is possible to mine frequent patterns in the current
window, mine frequent patterns over different time ranges, put different weights on different
windows to mine various kinds of weighted frequent patterns, and mine the evolution of frequent
patterns based on the changes in their occurrences over a sequence of windows.
The superiority of the tilted-time window model over the landmark window model and
the damped window model is discussed as part of the research problem in the next section. The
detailed structure of the logarithmic window model and its updating process, adjusted for mining
discriminative itemsets in data streams, is presented in Section 4.3. The different types of
tilted-time window models are discussed by explaining the incremental frequency merging and
frequency shifting in the window model. We explain the two types of pruning techniques
proposed in (Giannella et al. 2003) for mining frequent itemsets in data streams using the tilted-
time window model. Then we discuss the reasons why these pruning techniques cannot
be used for mining discriminative itemsets from data streams using the tilted-time window model,
and we propose new solutions.
The main contribution of this chapter is one algorithm for mining
discriminative itemsets in data streams using the tilted-time window model. The H-DISSparse
method is an efficient method proposed in this chapter that utilizes the DISSparse method (Seyfi
et al. 2017) proposed in Chapter 3. The method is based on the properties of
the discriminative itemsets in the tilted-time window model. The technical details of the proposed
algorithm are provided in a way that distinguishes it from the DISSparse method.
4.2 RESEARCH PROBLEM
The patterns in data streams are usually time sensitive and, in many applications, users
are more interested in changes in patterns and their trends over time than in the patterns
themselves (Giannella et al. 2003). Patterns appearing in the distant past may no longer be
dominant and may have lost their attraction (e.g., patterns in news delivery services).
Particular groups of patterns appearing in one period of time should not affect the general trend of
the patterns in data streams over the history or the recent past (e.g., patterns related to specific
events). For time-related patterns and the changes in their trends during the history of data streams,
recent patterns are of interest in short time intervals and old patterns in larger time
intervals. These patterns are represented in the tilted-time window model in different time periods
for answering time-sensitive queries. The tilted windows have different sizes, and each one
covers a specific period of time. There are different types of tilted-time window models, such as
the natural tilted-time window model and the logarithmic tilted-time window model, as explained
in detail in Section 4.3.
The discriminative itemsets in the tilted-time window model are defined as the frequent
itemsets in the target data stream whose frequencies are much higher than those of the same
itemsets in the other data streams in different time periods. Without loss of generality, we call the
other data streams a 'general data stream' for the sake of simplicity. The discriminative itemsets are
relatively frequent in the target data stream and relatively infrequent in the general data stream in
each time period during the history. An essential issue in this research problem is finding the
itemsets that can distinguish the target data stream from all other data streams in each time
period.
Many real-world scenarios show the significance of mining
discriminative itemsets in data streams using the tilted-time window model. Monitoring of
market basket transactions could have started a long time ago, and treating the old and new
transactions equally may not be useful for tracing market fluctuations and guiding
the business (e.g., as in the landmark window model (Manku and Motwani 2002)). The fading of
old transactions (e.g., reducing their weight as in the damped window model (Chang and Lee
2003)) may not be enough for applications interested in finding the changes in the
discriminative itemsets and their trends over time. Discriminative itemsets in data
streams made of market basket data, represented in a tilted-time window model, are useful for
identifying the specific sets of items that are of high interest in one market compared to the other
markets in different time periods. They are applicable for showing the relative changes in the data
stream trends in different time periods and for answering time-sensitive queries (e.g., in
applications of personalization, anomaly detection and prediction).
Web page personalization can be optimized by tracking changes in user preferences in different
time periods or during specific events. User groups may show different preferences over time
by visiting groups of web pages much more frequently than other user groups in
different time periods. The sequences of queries with higher support in one geographical area
compared to another area are time related. The discriminative sequences of queries related to
specific events, and the changes in the relative trends in different geographical areas, are monitored
separately in different time periods for better recommendation. Changes in the discriminative
pattern trends during network monitoring in the last few minutes are more valuable for anomaly
detection and network interference prediction than the discriminative patterns themselves.
The data streams are processed in one scan and the tilted-time window model is updated
in an offline state, with approximate support and approximate ratio for discriminative itemsets in
different time periods. The discriminative itemsets are reported with exact support and exact ratio
within one batch of transactions. The discriminative itemsets may occur across the borders of the
tilted-time window frames. The relaxation ratio 𝛼 is defined for saving the itemsets with
approximate frequencies. The discriminative itemsets are reported with approximate support and
approximate ratio between the borders of the tilted-time window model. Multiple scans are not
acceptable in data stream mining (Aggarwal 2007; Han, Pei and Kamber 2011), and therefore the
discriminative itemsets with exact minimum support and exact frequency ratio in data streams
cannot be discovered. The approximation can become worse in the larger time periods by
merging the approximate discriminative itemsets from the smaller time periods, which causes an
increase in the number of false-positives and false-negatives.
The numbers of false-positive and false-negative discriminative itemsets must be
bounded for qualified answers. The discriminative itemsets do not follow the Apriori property
defined for the frequent itemsets, and a subset of a discriminative itemset can be non-
discriminative. The tail pruning techniques proposed in (Giannella et al. 2003) for efficient
frequent itemset mining using the tilted-time window model with a minimum support guarantee are
not applicable to discriminative itemset mining using the tilted-time window model. The
discriminative itemsets are a sparse subset of the frequent itemsets. Using the properties of the
discriminative itemsets in the tilted-time window model, three corollaries are defined. These
guarantee the most refined approximate support and approximate ratio bounds in the tilted-time
window model, with efficient time and space complexity in high-speed data streams.
In this chapter, discriminative itemset mining using the tilted-time window model is
discussed based on two data streams 𝑆𝑖 and 𝑆𝑗, modelled as multiple continuous batches of
transactions denoted as 𝐵1, … , 𝐵ℎ, 𝐵ℎ+1, … , 𝐵𝑛. Later, in Chapter 5, the problem is discussed in
detail using the sliding window model.
4.2.1 Problem formal definition
The formal definition of mining discriminative itemsets presented in Chapter 3 is
expanded for mining discriminative itemsets using the tilted-time window model as defined
below:
Let ∑ be the alphabet of items. A transaction 𝑇 = {𝑒1, … , 𝑒𝑖, 𝑒𝑖+1, … , 𝑒𝑛}, 𝑒𝑖 ∈ ∑, is
defined as a set of items in ∑. The items in the transaction are in alphabetical order by default, for
ease of describing the mining algorithm. The two data streams 𝑆𝑖 and 𝑆𝑗 are defined as the target
and general data streams; each consists of a different number of transactions, i.e., 𝑛𝑖 and 𝑛𝑗,
respectively. A group of input transactions from the two data streams 𝑆𝑖 and 𝑆𝑗 in a pre-defined
time period is set as a batch of transactions 𝐵𝑛, i.e., 𝑛 ≥ 1.
The tilted-time window model is composed of different window frames denoted as 𝑊𝑘,
𝑘 ≥ 0, as in Figure 4.1. Each window frame 𝑊𝑘 refers to a different time period containing
itemsets made of transactions from a different number of batches in the two data streams 𝑆𝑖 and 𝑆𝑗,
with lengths n_i^k and n_j^k, respectively. The current window frame is denoted as 𝑊0.
Figure 4.1 Tilted-time window frames
An itemset 𝐼 is defined as a subset of ∑. The itemset frequency is the number of
transactions that contain the itemset. The frequency of itemset 𝐼 in data stream 𝑆𝑖 in the window
frame 𝑊𝑘 is denoted as f_i^k(I), and the frequency ratio of itemset 𝐼 in data stream 𝑆𝑖 in the
window frame 𝑊𝑘 is defined as r_i^k(I) = f_i^k(I) / n_i^k.

In this chapter, if the frequency ratio of itemset 𝐼 in the target data stream 𝑆𝑖 in the
window frame 𝑊𝑘 is larger than its frequency ratio in the general data stream 𝑆𝑗, i.e.,
r_i^k(I) / r_j^k(I) > 1, then the itemset 𝐼 can be considered as a discriminative itemset in the
window frame 𝑊𝑘. Let R_ij^k(I) be the ratio between r_i^k(I) and r_j^k(I), i.e.,
R_ij^k(I) = r_i^k(I) / r_j^k(I). Obviously, the higher the R_ij^k(I), the more discriminative
the itemset 𝐼 is.
To more accurately define discriminative itemsets, we introduce a user-defined threshold
𝜃 > 1, called the discriminative level threshold, with no upper bound. An itemset 𝐼 is considered
discriminative in the window frame 𝑊𝑘 if R_ij^k(I) ≥ 𝜃. This is formally defined as:

R_ij^k(I) = r_i^k(I) / r_j^k(I) = (f_i^k(I) · n_j^k) / (f_j^k(I) · n_i^k) ≥ 𝜃    (4.1)

The R_ij^k(I) could be very large but with very low f_i^k(I). In order to accurately identify
discriminative itemsets that have a reasonable frequency in the window frame 𝑊𝑘, and also to
cover the case of f_j^k(I) = 0, we introduce another user-specified support threshold,
0 < 𝜑 < 1/𝜃, to eliminate itemsets that have very low frequency in the window frame 𝑊𝑘. In this
chapter, an itemset 𝐼 is considered discriminative if its frequency in the window frame 𝑊𝑘 is
no less than 𝜑𝜃n_i^k, i.e., f_i^k(I) ≥ 𝜑𝜃n_i^k, and also R_ij^k(I) ≥ 𝜃.
Definition 4.1. Discriminative itemsets in the tilted-time window model: Let 𝑆𝑖 and 𝑆𝑗 be
two data streams, with the current sizes n_i^k and n_j^k in a window frame 𝑊𝑘, 𝑘 ≥ 0, that
contain varied-length transactions of items in ∑; let 𝜃 > 1 be a user-defined discriminative level
threshold and 𝜑 ∈ (0, 1/𝜃) a support threshold. The set of discriminative itemsets in 𝑆𝑖 against 𝑆𝑗
in the window frame 𝑊𝑘 of the tilted-time window model, denoted as DI_ij^k, 𝑘 ≥ 0, is
formally defined as:

DI_ij^k = { I ⊆ ∑ | f_i^k(I) ≥ 𝜑𝜃n_i^k  &  R_ij^k(I) ≥ 𝜃 }    (4.2)
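The two conditions of Definition 4.1 can be checked per itemset with a small helper. This is an illustrative sketch in our own naming, not code from the thesis; it tests the support condition first so that the f_j^k(I) = 0 case (unbounded ratio) is handled by the support threshold alone, and compares the ratio in cross-multiplied form to avoid division:

```python
def is_discriminative(f_i: int, f_j: int, n_i: int, n_j: int,
                      theta: float, phi: float) -> bool:
    """Check Definition 4.1 for one itemset in a window frame W_k.

    f_i, f_j: frequencies of the itemset in S_i and S_j within W_k;
    n_i, n_j: stream lengths (transaction counts) within W_k;
    theta > 1: discriminative level threshold; 0 < phi < 1/theta: support threshold.
    """
    if f_i < phi * theta * n_i:      # support condition: f_i^k(I) >= phi*theta*n_i^k
        return False
    if f_j == 0:                     # ratio unbounded; the support condition suffices
        return True
    # ratio condition: R_ij^k(I) = (f_i * n_j) / (f_j * n_i) >= theta
    return f_i * n_j >= theta * f_j * n_i
```

With the example values used later in Figure 4.3 (𝜃 = 2, 𝜑 = 0.1, stream lengths 15 and 15), the itemset with frequencies 4 and 2 passes both conditions, while the one with frequencies 12 and 13 fails the ratio condition.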
The itemsets that are not discriminative in the current window frame 𝑊0 can become
discriminative in some larger window frames in the tilted-time window model (e.g., by merging
the multiple window frames). In order to avoid missing any potential discriminative itemsets in
larger window frames, we propose to identify sub-discriminative itemsets in the tilted-time
window model with a user-specified parameter. The sub-discriminative itemsets are
discovered by a relaxation parameter 𝛼 ∈ (0,1), with more sub-discriminative itemsets for smaller
𝛼. The 𝛼 is defined for approximate support and approximate ratio between the borders of the
tilted-time window model. An itemset 𝐼 is sub-discriminative if it is not discriminative but its
frequency in the target data stream 𝑆𝑖 is not less than 𝛼𝜑𝜃n_i and its ratio is not less than 𝛼𝜃. The
discriminative itemsets are of interest; however, the sub-discriminative itemsets are also tracked
during the process, as they may become discriminative in larger window frames. For the sake
of clarity, in this chapter any notation related to the sum over all tilted window frames is denoted
by 0..m; for example, Σ_{k=0}^{m} f_i^k is denoted as f_i^{0..m}.
Definition 4.2. Sub-discriminative itemsets in the tilted-time window model: Let 𝑆𝑖 and
𝑆𝑗 be two data streams, with the current sizes n_i^k and n_j^k in each window frame 𝑊𝑘, 𝑘 ≥ 0,
that contain varied-length transactions of items in ∑; let 𝑚 be the number of tilted window frames
for the itemset, i.e., 𝑘 ≤ 𝑚; let 𝜃 > 1 be a user-defined discriminative level threshold,
𝜑 ∈ (0, 1/𝜃) a support threshold and 𝛼 ∈ (0,1) a relaxation parameter. The set of
sub-discriminative itemsets in 𝑆𝑖 against 𝑆𝑗 in the tilted-time window model, denoted as SDI_ij,
is formally defined as:

SDI_ij = { I ⊆ ∑ | (f_i^0(I) ≥ 𝛼𝜑𝜃n_i^0 & R_ij^0(I) ≥ 𝛼𝜃) or
(f_i^{0..m}(I) ≥ 𝛼𝜑𝜃n_i^{0..m} & (f_i^{0..m}(I) / f_j^{0..m}(I)) · (n_j^{0..m} / n_i^{0..m}) ≥ 𝛼𝜃) }    (4.3)
The sub-discriminative itemsets are the potential discriminative itemsets in the current
window frame 𝑊0 or in the history of the data streams considering all window frames, i.e., the
sum of frequencies and the sum of data stream lengths over 𝑊0, …, 𝑊𝑘, 𝑊𝑘+1, …, 𝑊𝑚, denoted
as 𝑊0..m. The relaxation parameter 𝛼 ∈ (0,1) is defined for better approximate support and
approximate ratio of discriminative itemsets in the larger window frames, trading the
computational cost against fewer wrong or hidden discriminative itemsets.

The itemsets that are neither discriminative nor sub-discriminative are defined as non-
discriminative itemsets:
Definition 4.3. Non-discriminative itemsets in the tilted-time window model: Let 𝑆𝑖 and
𝑆𝑗 be two data streams, with the current sizes n_i^k and n_j^k in each window frame 𝑊𝑘, 𝑘 ≥ 0,
that contain varied-length transactions of items in ∑; let 𝑚 be the number of tilted window frames
for the itemset, i.e., 𝑘 ≤ 𝑚; let 𝜃 > 1 be a user-defined discriminative level threshold,
𝜑 ∈ (0, 1/𝜃) a support threshold and 𝛼 ∈ (0,1) a relaxation parameter. The set of non-
discriminative itemsets in 𝑆𝑖 against 𝑆𝑗 in the tilted-time window model, denoted as NDI_ij, is
formally defined as:

NDI_ij = { I ⊆ ∑ | (f_i^0(I) < 𝛼𝜑𝜃n_i^0 or R_ij^0(I) < 𝛼𝜃) &
(f_i^{0..m}(I) < 𝛼𝜑𝜃n_i^{0..m} or (f_i^{0..m}(I) / f_j^{0..m}(I)) · (n_j^{0..m} / n_i^{0..m}) < 𝛼𝜃) }    (4.4)
The non-discriminative itemsets are not frequent in the target data stream under the
relaxation 𝛼, or have a frequency ratio of 𝑆𝑖 against 𝑆𝑗 less than 𝛼𝜃, both in the current window
frame 𝑊0 and in the history of the data streams over all window frames 𝑊0..m, i.e., by the sum of
frequencies and the sum of data stream lengths over 𝑊0, …, 𝑊𝑘, 𝑊𝑘+1, …, 𝑊𝑚. The
non-discriminative itemsets are used for tail pruning in the tilted-time window model, as
discussed in Section 4.3.
4.2.2 Discriminative itemset mining using the tilted-time window model
The problem of discriminative itemset mining using the tilted-time window model is
defined for offline batch processing in data streams. The tilted-time window model is updated in
an offline state at the specific time intervals defined for the batches of transactions. Each new batch
of transactions is processed and the discriminative itemsets are saved in the current window
frame 𝑊0 of the tilted-time window model. The sub-discriminative itemsets are also saved, as
they may become discriminative in the future through merging of the window frames. After
discovering the discriminative and sub-discriminative itemsets from the current batch of
transactions in 𝑊0, the discriminative and sub-discriminative itemsets in the older window
frames 𝑊𝑘, 𝑘 > 0, are obtained by shifting and merging the itemsets.
The tilted-time window model updating by shifting and merging is discussed in detail
using running examples that show the significant challenges. The approximation of discriminative
itemsets in the tilted-time window model gets worse in the larger window frames: the
approximate discriminative itemsets in the smaller window frames are shifted and merged into the
larger window frames, which can cause a greater number of false-positives and false-negatives in
the tilted-time window model. The determinative properties of the discriminative itemsets in the
tilted-time window model are applied for the approximate bound guarantee. Considering the
complexities of discriminative itemset mining in data streams using the tilted-time window
model, the advanced, highly accurate and highly efficient H-DISSparse method is proposed by
utilizing the DISSparse method (Seyfi et al. 2017) proposed in Chapter 3 and defining new data
structures adapted for offline updating of the tilted-time window model. The H-DISSparse
algorithm is evaluated on input data streams with different characteristics in Chapter 6,
for its time and space complexity and different approximate support and approximate ratio
bounds in discriminative itemsets. The specific characteristics and the limitations of the proposed
methods in large and fast-growing data streams are discussed.
4.3 TILTED-TIME WINDOW MODEL
The tilted-time window model is a group of built-in tables within the nodes of a prefix
tree structure (i.e., H-DISStream, as presented in Figure 4.3). Each node in the prefix tree
structure has a built-in table for holding the frequencies of the discriminative itemsets in different
time periods called window frames. The nodes in the prefix tree structure, and each window
frame in the built-in tables, have two counters 𝑓𝑖 and 𝑓𝑗 for holding the frequencies of the itemset in
the target data stream 𝑆𝑖 and the general data stream 𝑆𝑗, respectively. The discriminative itemsets
are reported in the different tilted window frames at offline time intervals. The tilted-time
window model grows as new batches of transactions arrive over time. However, based
on its maintenance strategies, the window model remains a compact data structure. The most
recent results are represented in the current window frames with small time intervals, and the older
results are merged into the older window frames with larger time intervals. The tilted-time window
model can be defined based on different window frames, such as natural tilted-time window
frames or logarithmic tilted-time window frames, as described below.
The natural tilted-time window model takes the smallest time
period (e.g., 60 seconds) as the current window frame and builds the larger window frames by
merging the smaller ones (e.g., quarters, hours, days, weeks, etc.); for example, in the natural
tilted-time window model, every 15 one-minute time periods are accumulated as one quarter,
every four quarters are accumulated as one hour, and a day is built by merging 24 one-hour time
periods. For reporting the results over a period of months, at most 59 tilted window frames need
to be maintained (Giannella et al. 2003). The logarithmic tilted-time window model is a more
compact data structure and an alternative way of presenting the natural tilted-time window
model; for example, a batch of transactions in one minute is taken as the smallest time period.
The current window frame shows the discriminative itemsets in the last minute, and it is followed
by the results in the remaining slots covering the next 2 minutes, 4 minutes, 8 minutes, etc. In this
chapter, the logarithmic tilted-time window model is applied in the proposed algorithms, as in
Figure 4.2. In this structure the current window frame 𝑊0 covers the current time period 𝑡, and the
older window frame 𝑊𝑘 covers 2^k·t time periods, 𝑘 ≥ 0.
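The geometry of the logarithmic model can be made concrete with two one-line helpers. This is an illustrative sketch with our own function names; the bound ⌈log2(n) + 1⌉ on the number of frames for n batches is stated later in this section, following (Giannella et al. 2003):

```python
import math

def frame_span(k: int) -> int:
    """Number of base time periods t covered by window frame W_k (W_0 covers 1)."""
    return 2 ** k

def frames_needed(n_batches: int) -> int:
    """Upper bound on the number of logarithmic tilted-time window frames
    needed to summarize n_batches batches: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n_batches) + 1)
```

For instance, a billion batches need only 31 frames, which is why the model stays compact regardless of stream length.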
Figure 4.2 Logarithmic tilted-time window structure (Giannella et al. 2003)
The H-DISStream is a prefix tree structure with the built-in tilted-time window model as
defined below.
H-DISStream: The H-DISStream prefix tree structure holds the discovered
discriminative and sub-discriminative itemsets. Discriminative and sub-discriminative itemsets
share branches in the same H-DISStream structure for their common frequent items (e.g., as in
Figure 4.3). Each path in the H-DISStream, starting from the root of the prefix tree structure, may
represent a subset of multiple discriminative and sub-discriminative itemsets. Each node in the
H-DISStream has two counters 𝑓𝑖 and 𝑓𝑗 for holding the frequencies of an itemset in the target
data stream 𝑆𝑖 and the general data stream 𝑆𝑗, respectively, in the current window frame 𝑊0. The
Header-Table is defined for fast traversal of the prefix tree structure using links connecting the
itemsets ending with identical items, and the nodes are tagged as discriminative,
sub-discriminative or non-discriminative (i.e., subsets of discriminative or sub-discriminative
itemsets). Each H-DISStream node may have a built-in tilted-time window table if the itemset is
discriminative in a larger window frame 𝑊𝑘 (i.e., 𝑘 > 0), sub-discriminative in the history
summary of the window frames (i.e., 𝑊0..m), or appears as a non-discriminative subset of
discriminative or sub-discriminative itemsets in any window frame 𝑊𝑘 (i.e., 0 < 𝑘 ≤ 𝑚).
For the sake of clarity, in this chapter the non-discriminative subsets of discriminative or
sub-discriminative itemsets in any window frame 𝑊𝑘 are called non-discriminative subsets; for
example, in Figure 4.3, 𝑐:12,13 is a non-discriminative itemset which is a subset of other
discriminative itemsets in 𝑊0, considering the discriminative level threshold 𝜃 = 2, the support
threshold 𝜑 = 0.1 and dataset lengths n_1^0 = n_2^0 = 15 in the two data streams. The
H-DISStream and its tilted-time window model are updated at offline time intervals after
processing each new batch of transactions arriving in the pre-defined time intervals; for example,
in Figure 4.3 the discriminative itemset 𝑐𝑏𝑎:4,2 is discovered in the current window frame 𝑊0.
The itemset 𝑐𝑏𝑎 has a built-in table with four entries related to the older window frames 𝑊𝑘,
𝑘 > 0.
Figure 4.3 A sample H-DISStream based on Example 3.1 with the built-in tilted-time
window model
The construction of the H-DISStream data structure (i.e., 𝑊0) starts from the
discriminative itemsets discovered in the first batch of transactions. With every new batch of
transactions processed, the discriminative itemsets are shifted and the tilted windows are merged
together. An itemset is pruned from the H-DISStream if it is non-discriminative in the whole
tilted-time window model and its node is a leaf.
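The node layout just described can be sketched as a small data class. This is a hypothetical rendering for illustration only; the field names are ours, and the thesis' actual implementation may differ:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class HDISStreamNode:
    """Sketch of one H-DISStream prefix-tree node (names are illustrative).

    f_i / f_j hold the itemset's frequencies in S_i and S_j for the current
    window frame W_0; `tilted_windows` is the node's built-in tilted-time
    table, one (f_i, f_j) pair per older window frame W_1..W_m, allocated
    only when the node needs a history.
    """
    item: str
    f_i: int = 0
    f_j: int = 0
    tag: str = "non-discriminative"   # or "discriminative" / "sub-discriminative"
    children: Dict[str, "HDISStreamNode"] = field(default_factory=dict)
    tilted_windows: Optional[List[Tuple[int, int]]] = None
```

Keeping `tilted_windows` as `None` until it is first needed mirrors the text's point that only itemsets with a history (discriminative, sub-discriminative, or a non-discriminative subset in some 𝑊𝑘) carry a built-in table, which keeps the tree compact.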
4.3.1 Tilted-time window model updating
The data streams are defined as continuous batches, each containing a different number
of transactions depending on the speed of the data streams. This is shown as 𝐵1, …, 𝐵ℎ, 𝐵ℎ+1, …,
𝐵𝑛, with 𝐵𝑛 as the most recent batch and 𝐵1 as the oldest one. For each itemset 𝐼, the two counters
f_i(I)<x,y> and f_j(I)<x,y> denote the frequencies of the itemset 𝐼 in data streams 𝑆𝑖 and
𝑆𝑗, respectively, over the group of continuous batches 𝐵𝑥 to 𝐵𝑦 with 𝑥 ≥ 𝑦. For the sake of
clarity, the itemset 𝐼 and the data stream indicators are omitted from the notation (i.e., f_i(I)<x,y>
and f_j(I)<x,y> are denoted as f<x,y>). The frequencies of an itemset in the logarithmic tilted-
time window model are kept during the history as follows:

f<n,n>; f<n−1,n−1>; f<n−2,n−3>; f<n−4,n−7>; …    (4-a)
The ratio between the sizes of two neighbouring windows (i.e., the numbers of batches in two
neighbouring windows) is the window frame growth rate (e.g., in Figure 4.2 the window frame
growth rate is 2). It should be noted that in the logarithmic tilted-time window model
⌈log2(n) + 1⌉ frequency pairs exist, so the number is small even for a large number of
batches (e.g., 10^9 batches require 31 pairs of frequencies) (Giannella et al. 2003). For
updating the tilted-time window model by shifting and merging the older window frames,
intermediate windows denoted as [f_i(I)<x,y>] and [f_j(I)<x,y>] are used as extra
memory space, as shown below.
𝑓 < 𝑛, 𝑛 > stores the frequencies of the itemsets discovered from the current batch
of transactions in 𝑊0 (i.e., 𝑓𝑖⁰(𝐼) and 𝑓𝑗⁰(𝐼) in the H-DISStream prefix tree structure). When
a new batch of transactions is processed, 𝑓 < 𝑛, 𝑛 > is shifted and replaces 𝑓 < 𝑛 − 1, 𝑛 − 1 > in
𝑊1 (i.e., 𝑓𝑖¹(𝐼) and 𝑓𝑗¹(𝐼)), and the most recent frequencies are set to 𝑓 < 𝑛, 𝑛 >. Before
𝑓 < 𝑛 − 1, 𝑛 − 1 > is shifted to the next level, its intermediate window is checked: if the
intermediate window is empty, 𝑓 < 𝑛 − 1, 𝑛 − 1 > is shifted into it; otherwise, the frequencies in
𝑓 < 𝑛 − 1, 𝑛 − 1 > and its intermediate window are added together and shifted to the next level
𝑓 < 𝑛 − 2, 𝑛 − 3 > in 𝑊2 (i.e., 𝑓𝑖²(𝐼) and 𝑓𝑗²(𝐼)). This process continues until shifting stops.
Following (Giannella et al. 2003), 𝐵1 is the oldest batch while 𝑊0 is the most recent
window; hence 𝐵1 lies in 𝑊𝑚 (where 𝑚 is the oldest window) and 𝐵𝑛 lies in 𝑊0. If the batches
and the windows are laid out on one timeline, the batch indices decrease from the current time
towards the past, while the window indices increase from the current time towards the past.
The processing scenario for consecutive batches of transactions is presented in the
example below:
Example 4.1. The first batch of transactions 𝐵1 is processed and the discovered
discriminative itemsets are stored in H-DISStream as 𝑓 < 1,1 >, which is the most
recent window frame (𝑊0) in the tilted-time window model. When the next batch of
transactions 𝐵2 is processed, the itemsets from H-DISStream are shifted to the older window
frame (𝑊1), represented as 𝑓 < 1,1 > in the window frame table, and the newly discovered
itemsets are stored in H-DISStream as 𝑓 < 2,2 > (i.e., 𝑊0). The process continues for another
batch of transactions 𝐵3, after which the discovered itemsets in H-DISStream and its tilted-time
window model are set as 𝑓 < 3,3 >; 𝑓 < 2,2 > [𝑓 < 1,1 >]. Here [𝑓 < 1,1 >] is the
intermediate window frame; on arrival of a new batch of transactions it is merged with
𝑓 < 2,2 > and represented in the tilted-time window model as 𝑓 < 4,4 >; 𝑓 < 3,3 >;
𝑓 < 2,1 >. The full process is represented step by step for the first 10 batches of transactions in
Figure 4.4:
𝑓 < 1,1 >
𝑓 < 2,2 >; 𝑓 < 1,1 >
𝑓 < 3,3 >; 𝑓 < 2,2 > [𝑓 < 1,1 >]
𝑓 < 4,4 >; 𝑓 < 3,3 >; 𝑓 < 2,1 >
𝑓 < 5,5 >; 𝑓 < 4,4 > [𝑓 < 3,3 >]; 𝑓 < 2,1 >
𝑓 < 6,6 >; 𝑓 < 5,5 >; 𝑓 < 4,3 > [𝑓 < 2,1 >]
𝑓 < 7,7 >; 𝑓 < 6,6 > [𝑓 < 5,5 >]; 𝑓 < 4,3 > [𝑓 < 2,1 >]
𝑓 < 8,8 >; 𝑓 < 7,7 >; 𝑓 < 6,5 >; 𝑓 < 4,1 >
𝑓 < 9,9 >; 𝑓 < 8,8 > [𝑓 < 7,7 >]; 𝑓 < 6,5 >; 𝑓 < 4,1 >
𝑓 < 10,10 >; 𝑓 < 9,9 >; 𝑓 < 8,7 > [𝑓 < 6,5 >]; 𝑓 < 4,1 >
Figure 4.4 Tilted-time window model updating
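The shift-and-merge procedure traced in Figure 4.4 can be sketched in code. The following is a minimal illustration rather than the thesis implementation: it tracks a single frequency counter per window (the method keeps one such model per stream, 𝑆𝑖 and 𝑆𝑗, for every itemset), and the class and method names are hypothetical.

```python
class TiltedTimeWindows:
    """Logarithmic tilted-time window model with intermediate windows.

    windows[k] holds a tuple (x, y, freq): the frequency summed over
    batches B_y..B_x (x >= y) in window frame W_k.  buffers[k] is the
    intermediate window [f<x, y>] attached to W_k, or None when empty.
    """

    def __init__(self):
        self.windows = []
        self.buffers = []

    def add_batch(self, n, freq):
        """Insert the frequency counted in the newest batch B_n as W0."""
        if not self.windows:
            self.windows.append((n, n, freq))
            self.buffers.append(None)
            return
        incoming = self.windows[0]          # W0 always shifts out
        self.windows[0] = (n, n, freq)
        k = 1
        while incoming is not None:
            if k == len(self.windows):      # grow the model on demand
                self.windows.append(None)
                self.buffers.append(None)
            current = self.windows[k]
            self.windows[k] = incoming
            if current is None:
                incoming = None             # shifting stops here
            elif self.buffers[k] is None:
                self.buffers[k] = current   # park in the intermediate window
                incoming = None
            else:                           # merge with intermediate, shift on
                _, by, bf = self.buffers[k]
                self.buffers[k] = None
                incoming = (current[0], by, current[2] + bf)
            k += 1
```

Running this for ten unit-frequency batches reproduces the last line of Figure 4.4: 𝑓 < 10,10 >; 𝑓 < 9,9 >; 𝑓 < 8,7 > [𝑓 < 6,5 >]; 𝑓 < 4,1 >.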
The tilted-time window model preserves the previously discovered discriminative and
sub-discriminative itemsets by merging the smaller window frames into larger ones. After
merging, the itemsets inside the larger window frames 𝑊𝑘 (i.e., 𝑘 > 0) must be re-tested under
the updated frequencies and stream lengths based on Definition 4.1, Definition 4.2 and
Definition 4.3, and tagged as discriminative, sub-discriminative or non-discriminative itemsets,
respectively. A sub-discriminative itemset can become discriminative when window frames are
merged, as presented in the example below.
Example 4.2. Consider 𝜃 = 3 and support threshold 𝜑 = 0.1. The itemset 𝐼 with
frequencies 𝑓1⁰(𝐼) = 5 and 𝑓2⁰(𝐼) = 2 in window frame 𝑊0, with lengths 𝑛1⁰ = 10 and
𝑛2⁰ = 10, is non-discriminative. The itemset 𝐼 with frequencies 𝑓1¹(𝐼) = 5 and 𝑓2¹(𝐼) = 1 in
window frame 𝑊1, with lengths 𝑛1¹ = 15 and 𝑛2¹ = 15, is discriminative. By setting a
relaxation of 𝛼 = 0.8, the itemset 𝐼 is not omitted from 𝑊0 and is discovered as a discriminative
itemset in the larger window frame after shifting and merging the window frames 𝑊0 and 𝑊1.
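Example 4.2 can be checked mechanically. The function below assumes that Definitions 4.1–4.3 take the common form used for discriminative itemsets (support condition 𝑓𝑖 ≥ 𝜃𝜑𝑛𝑖 and ratio condition 𝑅 ≥ 𝜃, both relaxed by 𝛼 for sub-discriminative itemsets); the exact definitions appear earlier in the chapter, so this is an illustrative sketch rather than the authoritative test.

```python
import math

def classify(fi, fj, ni, nj, theta, phi, alpha):
    """Tag an itemset in one window frame as discriminative,
    sub-discriminative or non-discriminative (assumed form of
    Definitions 4.1-4.3)."""
    # Length-normalized frequency ratio of target stream over general stream.
    ratio = math.inf if fj == 0 else (fi / ni) / (fj / nj)
    if fi >= theta * phi * ni and ratio >= theta:
        return "discriminative"
    if fi >= alpha * theta * phi * ni and ratio >= alpha * theta:
        return "sub-discriminative"
    return "non-discriminative"
```

With 𝜃 = 3, 𝜑 = 0.1 and 𝛼 = 0.8, the itemset of Example 4.2 comes out sub-discriminative in 𝑊0 (ratio 2.5 < 3 but ≥ 2.4), discriminative in 𝑊1, and discriminative again after merging (frequencies 10 and 3 over lengths 25).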
4.3.2 Discriminative itemsets approximate bound
Pruning techniques were proposed in (Giannella et al. 2003) for the efficient
mining of frequent itemsets using the tilted-time window model under an approximate minimum
support guarantee. These techniques are not applicable to the problem of mining discriminative
itemsets using the tilted-time window model, as briefly explained below.
The first pruning technique in (Giannella et al. 2003) is tail pruning in the tilted-time
window model, which accepts an error threshold bounding the false positive frequent itemsets.
The tail sequences of the oldest tilted-time window frames of an itemset are pruned if the itemset
is not frequent in any of those window frames and not sub-frequent, by the defined error
threshold, in the history of the data stream from the current time period back to any of those
window frames. Based on the claim in (Giannella et al. 2003), the number of false positive
answers is reasonable if the error threshold is set small enough. However, a small error
threshold conflicts with efficiency, as a large number of sub-frequent itemsets with very low
support must be generated and saved in the tilted-time window model. Empirical analysis shows
that the number of frequent itemsets grows exponentially as the support decreases.
Discriminative itemsets are a subset of frequent itemsets, and in many applications
the discriminative itemsets of interest have low support and low ratio (e.g., in anomaly
detection). Moreover, in the research problem of discriminative itemset mining, the itemsets in
each window frame carry at least two frequencies, one for the target data stream and one for the
general data stream. Dropping tail sequences of the oldest tilted-time window frames would
cause discriminative itemsets to be missed when the older window frames are merged, producing
false negatives and lower recall; for example, through the loss of itemsets in window frames with
a possibly high ratio between the target data stream and the general data stream (i.e.,
𝑅𝑖𝑗ᵏ(𝐼) ≫ 1). It can also produce false positives and lower accuracy; for example, through the
loss of itemsets in window frames with a possibly low ratio between the target data stream and
the general data stream (i.e., 𝑅𝑖𝑗ᵏ(𝐼) ≪ 1).
The second pruning technique in (Giannella et al. 2003) is based on the anti-monotone
Apriori property of frequent itemsets. A superset of an itemset has a frequency equal to or less
than that of its subset, and this holds in every window frame of the tilted-time window model.
Hence, if an itemset is not frequent in the current batch, none of its supersets need be examined.
It follows that if the tail of an itemset can be dropped by the tail pruning technique explained
above, then the corresponding tails of all its supersets can be dropped as well. The Apriori
property does not hold for discriminative itemset mining: an itemset can be discriminative in
some window frames while its subsets are non-discriminative in the same window frames.
In this chapter, the properties of discriminative itemsets in the tilted-time window
model are exploited, within efficient time and space usage, to obtain the highest refined
approximate support and ratio bound by minimizing the number of false-positive and false-
negative discriminative itemsets in data streams.
The FP-Tree is defined following the basics of FP-Growth (Han, Pei and Yin 2000); it is
built from the frequent items of the transactions, which share branches for their most common
frequent items. In Chapter 3 this structure was adapted by adding two counters to each node,
holding the frequencies of itemsets in the target dataset and the general dataset. In the method
proposed in this chapter, the FP-Tree is generated for the current batch of transactions as a prefix
tree structure similar to the one built for a single batch in Chapter 3, but without pruning
infrequent items (i.e., the FP-Tree includes all items). Because the FP-Tree includes the items that
are infrequent in the target data stream 𝑆𝑖 in the current batch of transactions, it can be used to
mine the frequencies of the itemsets that are non-discriminative subsets of the discriminative
itemsets in the older window frames. Note that although the FP-Tree includes all items, each
conditional FP-Tree generated during offline batch processing follows the basics of FP-Growth
(Han, Pei and Yin 2000) by pruning items that are infrequent in the target data stream 𝑆𝑖. This
ensures that including all items in the FP-Tree does not add substantial complexity to batch
processing with the DISSparse algorithm. The discovered discriminative itemsets are saved in a
pattern tree with a built-in tilted-time window model, as explained in the section below.
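The two-counter adaptation of the FP-Tree node described above can be sketched as follows (the names are illustrative; the actual Chapter-3 structure also carries Header-Table links, which are omitted here):

```python
class FPNode:
    """FP-Tree node with two counters: the frequency of the node's prefix
    in the target stream S_i and in the general stream S_j."""
    def __init__(self, item):
        self.item = item
        self.freq_i = 0
        self.freq_j = 0
        self.children = {}

def insert(root, transaction, stream):
    """Insert one Desc-Flist-ordered transaction without pruning
    infrequent items, so the tree covers all items as this chapter
    requires; `stream` is "i" or "j"."""
    node = root
    for item in transaction:
        node = node.children.setdefault(item, FPNode(item))
        if stream == "i":
            node.freq_i += 1
        else:
            node.freq_j += 1
```

Transactions from both streams thus share prefix branches while keeping per-stream counts separate.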
4.3.2.1 Maintaining the discriminative itemsets in the tilted-time window model
In the H-DISSparse method proposed in this chapter, the itemsets that are discriminative,
or that appear as non-discriminative subsets (of discriminative itemsets) in at least one window
frame 𝑊𝑘 (i.e., 𝑘 ≥ 0), are saved in the tilted-time window model over the history of the data
streams. The exact set of discriminative itemsets in the current batch of transactions is discovered
by the DISSparse algorithm during batch processing and is held in the current window frame 𝑊0.
The exact frequencies of non-discriminative subsets in the current window frame 𝑊0 are
obtained by traversing the FP-Tree through the Header-Table links for their appearances in the
current batch of transactions. After each batch of transactions is processed, the tilted-time
window frames are updated in the offline state by shifting and merging the itemset frequencies
into the larger window frames.
Property 4.1. The exact set of discriminative itemsets in the current batch of transactions
is held in the current window frame 𝑊0.
This property states that all discriminative itemsets in the current window frame 𝑊0
are saved with their exact frequencies. The discriminative itemsets in the current window frame
𝑊0 are discovered using the DISSparse algorithm (Seyfi et al. 2017) with 100% accuracy and
recall.
Property 4.2. The exact set of non-discriminative subsets (of discriminative itemsets) in
the current batch of transactions is held in the current window frame 𝑊0.
This property states that all non-discriminative itemsets that stay as internal nodes in the
current window frame 𝑊0 (i.e., subsets of discriminative itemsets) are saved with their exact
frequencies. The exact frequencies of these non-discriminative subsets in the current window
frame 𝑊0 are obtained by traversing the FP-Tree through the Header-Table links for their
appearances in the current batch of transactions.
The first corollary is formally defined below.
Corollary 4-1. The exact set of itemsets, including the discriminative itemsets and their non-
discriminative subsets, in the current batch of transactions is held in the current window frame 𝑊0.
Here the current window frame 𝑊0 is the H-DISStream prefix tree structure, and the non-
discriminative subsets are subsets of discriminative itemsets in at least one window frame 𝑊𝑘
(i.e., 𝑘 ≥ 0) in the tilted-time window model during the history of the data streams.
Rationale 4-1. (H-DISStream holds the exact frequencies of itemsets from the time they
are maintained in the tilted-time window model) Corollary 4-1 ensures that any itemset in the H-
DISStream structure and its built-in tilted-time window model has exact frequencies in each
window frame 𝑊𝑘 (i.e., 0 ≤ 𝑘 ≤ 𝑚, where 𝑚 is the oldest window frame related to the itemset).
Proof. The precision of the DISSparse algorithm for mining discriminative itemsets in a
batch of transactions was proved in Chapter 3 based on Theorem 3-3. Property 4.1 ensures that
the exact set of discriminative itemsets is held in the current window frame 𝑊0, and Property 4.2
ensures that the exact frequencies of the non-discriminative subsets are held in the current
window frame 𝑊0. The FP-Tree holds all items, including the items that are infrequent in the
target data stream 𝑆𝑖; this implies that the frequencies of non-discriminative subsets that are
infrequent in 𝑆𝑖 in the current batch of transactions are not missed. Together, the two properties
imply that every itemset in 𝑊0 holds its exact frequencies in the current batch of transactions.
The tilted-time window update shifts and merges the itemset frequencies from the smaller
window frames, which hold the exact frequencies of the itemsets, into the larger window frames.
This ensures the exact frequencies of the itemset in every tilted window frame 𝑊0, 𝑊1,…, 𝑊𝑚
(i.e., 𝑊𝑚 is the oldest window frame related to the itemset).
∎
Based on Corollary 4-1, the exact itemset frequencies are held in each window frame
from the oldest window frame in which the itemset appears in the H-DISStream structure.
However, the itemset frequencies may have been ignored during batch processing in older
window frames (i.e., when the itemset was non-discriminative without any discriminative
superset). The recorded frequencies of an itemset are therefore less than or equal to its actual
frequencies in the history of the data streams over all input batches 𝐵1,…, 𝐵ℎ, 𝐵ℎ+1,…, 𝐵𝑛. Let
𝑊𝑥 be the oldest possible window frame in H-DISStream, and 𝑊𝑚 be the oldest window frame
related to the itemset 𝐼 in H-DISStream (i.e., 𝑚 ≤ 𝑥); then the following statements hold.
$$\sum_{k=0}^{m} f_i^k(I) \;\le\; \sum_{k=0}^{x} f_i^k(I), \qquad \sum_{k=0}^{m} f_j^k(I) \;\le\; \sum_{k=0}^{x} f_j^k(I) \tag{4-b}$$

$$\sum_{k=0}^{m} n_i^k \;\le\; n_i, \qquad \sum_{k=0}^{m} n_j^k \;\le\; n_j \tag{4-c}$$
The conditions above lead to approximate frequencies and approximate ratios for the
discriminative itemsets in the tilted-time window model. Discriminative itemsets whose
approximate frequencies are less than their exact frequencies may be missed in the tilted-time
window model. The approximate ratio can be less than or greater than the actual ratio, depending
on the ratios in the oldest window frames 𝑊𝑘, 𝑚 < 𝑘 ≤ 𝑥 (e.g., 𝑅𝑖𝑗ᵏ(𝐼) ≫ 1 or 𝑅𝑖𝑗ᵏ(𝐼) ≪ 1,
respectively). Based on the above conditions, one of the following two statements holds.
$$\forall\, l,\ 0 \le l \le m \le x:\quad \frac{\sum_{k=l}^{m} f_i^k(I)}{\sum_{k=l}^{m} f_j^k(I)} \times \frac{\sum_{k=l}^{m} n_j^k}{\sum_{k=l}^{m} n_i^k} \;\le\; \frac{\sum_{k=l}^{x} f_i^k(I)}{\sum_{k=l}^{x} f_j^k(I)} \times \frac{\sum_{k=l}^{x} n_j^k}{\sum_{k=l}^{x} n_i^k} \tag{4-d}$$

$$\forall\, l,\ 0 \le l \le m \le x:\quad \frac{\sum_{k=l}^{m} f_i^k(I)}{\sum_{k=l}^{m} f_j^k(I)} \times \frac{\sum_{k=l}^{m} n_j^k}{\sum_{k=l}^{m} n_i^k} \;>\; \frac{\sum_{k=l}^{x} f_i^k(I)}{\sum_{k=l}^{x} f_j^k(I)} \times \frac{\sum_{k=l}^{x} n_j^k}{\sum_{k=l}^{x} n_i^k} \tag{4-e}$$
To improve the accuracy, we define a relaxation ratio for discovering the discriminative
itemsets between the borders of the tilted-time window model, as explained in the section below.
4.3.2.2 Improving the accuracy using relaxation ratio
Data stream mining algorithms must essentially be designed around a single scan
(Aggarwal 2007; Han, Pei and Kamber 2011), as multiple scans of the datasets are often too
expensive. Discriminative itemsets whose approximate ratios are less than their exact ratios may
be missed in the tilted-time window model (i.e., false negatives). Non-discriminative itemsets
whose approximate ratios are greater than their exact ratios may be reported in the tilted-time
window model as discriminative itemsets (i.e., false positives). The approximation can be refined
using a relaxation of 𝛼 for sub-discriminative itemsets based on Definition 4.2. The sub-
discriminative itemsets are discovered during batch processing by modifying the DISSparse
algorithm, and are saved in the current window frame 𝑊0. The two heuristics HEURISTIC 3.1
and HEURISTIC 3.2 proposed in the DISSparse method (Seyfi et al. 2017) are modified by the
relaxation of 𝛼 so as to hold the sub-discriminative itemsets during batch processing. The sub-
discriminative itemsets are saved in the tilted-time window model as potential discriminative
itemsets, following Definition 4.2 and based on the relaxation of 𝛼.
Property 4.3. By modifying Corollary 4-1 using the relaxation of 𝛼, a set of non-
discriminative itemsets in the current batch is discovered as sub-discriminative itemsets and
held in 𝑊0.
This property states that the sub-discriminative itemsets are selected from the non-
discriminative itemsets by choosing the relaxation of 𝛼. The sub-discriminative itemsets in the
current window frame 𝑊0 and in the history of the data streams are discovered to obtain better
approximations of the itemset frequencies and itemset frequency ratios.
Property 4.4. A smaller relaxation of 𝛼 yields a better approximation of the
discriminative itemsets in the tilted-time window model.
This property states that more sub-discriminative itemsets are discovered when a smaller
relaxation of 𝛼 is chosen. This is a trade-off between better approximation of the discriminative
itemsets in the tilted-time window model and computation cost.
The second corollary is formally defined below.
Corollary 4-2. A refined approximate bound on the discriminative itemsets in the tilted-time
window model is obtained by modifying Corollary 4-1 based on the relaxation of 𝛼.
Here 𝛼 is the relaxation threshold for sub-discriminative itemsets, and Corollary 4-1 holds the
exact set of discriminative itemsets and non-discriminative subsets of the current batch of
transactions in the current window frame 𝑊0.
Rationale 4-2. (highest refined approximate bound on discriminative itemsets in the
tilted-time window model) Corollary 4-2 ensures that the approximation of the discriminative
itemsets in the tilted-time window model can be improved by holding the sub-discriminative
itemsets in the model.
Proof. The sub-discriminative itemsets have the potential to become discriminative when
merged into the larger window frames. Holding them improves the approximate bound on the
discriminative itemsets by increasing the number of window frames that hold the exact
frequencies of the itemsets in the tilted-time window model. This results in fewer false positives
and false negatives among the discriminative itemsets in the tilted-time window model.
Considering two relaxations 𝛼 and 𝛼′ with 𝛼 ≤ 𝛼′, the approximate bound for an itemset 𝐼 in the
tilted-time window model is obtained from its exact frequencies from the time it is maintained in
the window model, 𝑊𝑚 and 𝑊𝑚′ respectively, with 𝑚 ≥ 𝑚′.
∎
The size of the pattern tree (i.e., H-DISStream) and its built-in tilted-time window model
can grow large over the history of the data streams. We propose tail pruning techniques that keep
the in-memory data structures at a reasonable size, as explained in the section below.
4.3.2.3 Tail pruning in the tilted-time window model
Discriminative itemsets are a sparse subset of frequent itemsets. The H-DISStream
structure with its built-in tilted-time window frames is, in principle, much smaller than the FP-
Stream used in (Giannella et al. 2003) for frequent itemset mining in data streams using the
tilted-time window model. However, without effective pruning techniques, the H-DISStream
structure can still become unnecessarily large over the history of the data streams. With a
reasonable relaxation of 𝛼, a non-discriminative itemset in the H-DISStream structure has no
potential to become discriminative, within the approximate bound. Tail pruning is applied in
H-DISStream to save space by pruning the non-discriminative itemsets based on Definition 4.3.
Property 4.5. The set of non-discriminative itemsets is defined based on Definition 4.3
and can be tagged for deletion to save space.
This property states that the large number of non-discriminative itemsets can be tagged
for deletion, saving space in the tilted-time window model in data streams.
Property 4.6. The non-discriminative itemsets that stay as leaf nodes are deleted from
the tilted-time window model to save space.
This property states that the large number of non-discriminative itemsets is deleted when
they stay as leaf nodes. This tail pruning yields a large space saving in the tilted-time window
model in data streams.
The third corollary is formally defined below.
Corollary 4-3. An itemset in H-DISStream and its built-in tilted-time window model is
pruned if it is non-discriminative and stays as a tail itemset.
Here a tail itemset is not a subset of any discriminative or sub-discriminative itemset in any
𝑊𝑘 (i.e., 𝑘 ≥ 0), and Definition 4.3 defines the non-discriminative itemsets in data streams in
the tilted-time window model.
Rationale 4-3. (concise H-DISStream structure) Corollary 4-3 ensures that every itemset
held in the tilted-time window model is a discriminative itemset, a sub-discriminative itemset or
a non-discriminative subset in the history of the data streams.
Proof. H-DISStream is a compact prefix tree structure in which branches are shared for
the most common frequent items. The logarithmic built-in tilted-time window model is also a
very compact data structure. Property 4.5 ensures that the discriminative and sub-discriminative
itemsets are not pruned, and Property 4.6 ensures that only non-discriminative itemsets staying as
leaf nodes are pruned, so the non-discriminative subsets are retained. Together these imply that
an itemset is pruned only if it has the least potential to be discriminative in the recent trends of
the data streams. Further space is saved by Corollary 4-3 through iteratively pruning the non-
discriminative itemsets staying as leaf nodes together with any of their direct non-discriminative
subsets that thereby become leaves. In the tilted-time window model, consider itemsets 𝐼 ⊂ 𝐼′
that are both in the H-DISStream structure at the end of batch processing. Let 𝑊0, 𝑊1,…, 𝑊𝑚
and 𝑊0, 𝑊1,…, 𝑊𝑚′ be the window frames maintained in the tilted-time window model for the
itemsets 𝐼 and 𝐼′, respectively. The number of window frames related to the itemset 𝐼 is equal to
or greater than the number of window frames related to the itemset 𝐼′ (i.e., 𝑚 ≥ 𝑚′).
∎
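The iterative leaf pruning argued in the proof can be sketched as a post-order traversal of the prefix tree: pruning a node's subtree first means that a non-discriminative internal node whose children have all been removed is itself pruned in the same pass. The node layout below is hypothetical, not the H-DISStream implementation.

```python
class Node:
    """Minimal H-DISStream-style prefix tree node; `tag` summarizes the
    itemset's status across its maintained window frames."""
    def __init__(self, tag="non-discriminative"):
        # tag: "discriminative" | "sub-discriminative" | "non-discriminative"
        self.tag = tag
        self.children = {}

def tail_prune(node):
    """Post-order tail pruning: delete every non-discriminative node that
    is (or becomes) a leaf; discriminative and sub-discriminative nodes,
    and non-discriminative internal subsets, are retained."""
    for item, child in list(node.children.items()):
        tail_prune(child)       # prune the subtree first
        if not child.children and child.tag == "non-discriminative":
            del node.children[item]
```

For instance, a non-discriminative chain hanging below the deepest discriminative node disappears entirely in one call, while a non-discriminative node with a discriminative descendant survives as an internal subset.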
Periodic changes in the discriminative itemsets are caused by concept drifts in the data
streams, making the discriminative itemsets in neighbouring window frames considerably
different. The pruned non-discriminative itemsets, as explained, are essentially those with the
least potential to become discriminative. In Chapter 6, principles are proposed for setting the
relaxation of 𝛼 based on data stream characteristics.
Claim 4-1. Based on Rationale 4-1, Rationale 4-2 and Rationale 4-3, the highest
refined approximate bound on the discriminative itemsets is achieved efficiently by setting the
relaxation of 𝛼 to a reasonably small value and applying the tail pruning.
Claim 4-1 essentially says that the approximation in discriminative itemset mining in
data streams using the tilted-time window model is improved by saving a greater number of sub-
discriminative itemsets. The sub-discriminative itemsets have the potential to become
discriminative itemsets in the recent history of the data streams. We call this the highest refined
approximate bound on the discriminative itemsets in the tilted-time window model; it is obtained
through a smaller number of false positives and false negatives over the recent history of the
data streams.
In the next section, a single-pass algorithm is proposed for mining discriminative
itemsets in data streams using the tilted-time window model. The prefix tree structure H-
DISStream with its built-in tilted-time window model is used in the algorithm, following the
corollaries defined in this section for updating the tilted-time window model.
4.4 H-DISSPARSE METHOD
In this section, we describe the process of efficiently mining discriminative itemsets
using the tilted-time window model in the H-DISSparse method, with the approximate bound.
The H-DISSparse method utilizes the DISSparse algorithm (Seyfi et al. 2017) proposed
in Chapter 3 together with the tilted-time window model. DISSparse is used for batch processing
during offline updating of the tilted-time window model, with the same data structures, FP-Tree,
Header-Table, conditional FP-Tree and minimized DISTree, as defined in Chapter 3. The
discriminative (and sub-discriminative) itemsets are directly updated in the H-DISStream
structure (i.e., the current window frame 𝑊0), and the tilted-time window model is updated by
shifting and merging the itemsets into the older window frames 𝑊𝑘 (i.e., 𝑘 > 0). The
H-DISSparse method then continues by discovering discriminative and sub-discriminative
itemsets in the next batch of transactions. To the best of our knowledge, the H-DISSparse
method is the first work on efficient mining of discriminative itemsets using the tilted-time
window model with an approximate bound.
4.4.1 H-DISSparse Algorithm
The H-DISSparse algorithm incorporates the three corollaries defined in Section 4.3.2
for efficient discriminative itemset mining using the tilted-time window model with an
approximate bound guarantee. The H-DISStream structure is updated in offline time intervals
when the current batch of transactions 𝐵𝑛 is full (i.e., 𝑛 ≥ 1). The first batch of transactions 𝐵1
is treated differently: all item frequencies are calculated and the Desc-Flist is built in descending
order of the item frequencies. The Desc-Flist order saves space by sharing paths in the prefix
trees, with the most frequent items at the top. This Desc-Flist remains the same for all upcoming
batches in the data streams. The H-DISSparse algorithm is single-pass for the remaining batches
of transactions. The input parameters, the discriminative level 𝜃, the support threshold 𝜑 and the
relaxation of 𝛼, are defined based on the application domain, the data stream characteristics and
sizes, or by domain expert users, as discussed in Chapter 6. The H-DISSparse algorithm is
presented in Algorithm 4.1.
The FP-Tree is built by adding the transactions from 𝐵𝑛 (i.e., the most recent batch of
transactions) without pruning infrequent items. The tilted-time window model is updated for the
larger window frames 𝑊𝑘 (i.e., 𝑘 > 0) by shifting and merging, as explained, based on the
logarithmic tilted-time window frames in (Giannella et al. 2003). Following the DISSparse
algorithm (Seyfi et al. 2017) proposed in Chapter 3, the minimized DISTree is generated from
the itemset combinations of potential discriminative subsets in the conditional FP-Tree built for
each Header-Table item, following the proposed HEURISTIC 3.1 and HEURISTIC 3.2.
HEURISTIC 3.1 and HEURISTIC 3.2 are modified using Corollary 4-2 by setting the relaxation
of 𝛼 for the approximate bound guarantee on the discriminative itemsets in the tilted-time
window model. The itemsets in the minimized DISTree structure are checked against
Definition 4.1, Definition 4.2 and Definition 4.3 and, respectively, saved as discriminative or
sub-discriminative itemsets, or deleted as non-discriminative itemsets if they are leaf nodes.
The H-DISStream structure, as the current window frame 𝑊0, is updated instantly with
the discriminative and sub-discriminative itemsets in each minimized DISTree. The conditional
FP-Tree of the Header-Table item being processed is expanded by the sub-branches of the
processed subtree, and the process continues while there is a new subtree, as explained for the
DISSparse algorithm in Chapter 3. Once the discriminative itemsets in 𝐵𝑛 are fully discovered,
the exact frequencies of the non-discriminative subsets not yet updated in 𝑊0 are tuned using the
FP-Tree, following Corollary 4-1. After the window frame 𝑊0 has been updated, tail pruning is
applied to H-DISStream and its built-in tilted-time window model, following Corollary 4-3. The
discriminative itemsets in the target data stream 𝑆𝑖 against the general data stream 𝑆𝑗 in each
window frame 𝑊𝑘 (i.e., 𝑘 ≥ 0) are reported in offline time intervals in 𝐷𝐼𝑖𝑗ᵏ, and the
H-DISSparse algorithm continues with the new incoming batch of transactions 𝐵𝑛+1.
Algorithm 4.1 (H-DISSparse: Efficient Mining of Discriminative Itemsets in
Data Streams using the Tilted-time Window Model)
Input: (1) The discriminative level 𝜃; (2) the support threshold 𝜑; (3) the
relaxation of 𝛼; and (4) the input batches 𝐵𝑛 (i.e., 𝑛 ≥ 1) made of transactions
with alphabetically ordered items belonging to data streams 𝑆𝑖 and 𝑆𝑗.
Output: 𝐷𝐼𝑖𝑗ᵏ (i.e., 𝑘 ≥ 0), the sets of discriminative itemsets in 𝑆𝑖 against 𝑆𝑗
in the tilted-time window model (H-DISStream structure)
Begin
1) While not end of streams do
2) Read the current batch of transactions 𝐵𝑛;
3) Order the items in the transactions based on the Desc-Flist made from 𝐵1;
4) Build the FP-Tree for 𝐵𝑛 based on the expansion of FP-Growth (i.e.,
including all items);
5) Update H-DISStream as 𝑊0 using the DISSparse algorithm modified based
on Corollary 4-2 (HEURISTIC 3.1 and HEURISTIC 3.2 modified by
Corollary 4-2);
6) Update the window frames 𝑊𝑘 (i.e., 𝑘 > 0) by shifting and merging, as in
the FP-Stream algorithm (Giannella et al. 2003);
7) Tune the non-discriminative subsets in 𝑊0 using the FP-Tree (Corollary 4-1);
8) Apply tail pruning in H-DISStream (Corollary 4-3);
9) Report the discriminative itemsets 𝐷𝐼𝑖𝑗ᵏ for each window frame 𝑊𝑘;
10) End while;
End.
4.4.2 H-DISSparse Algorithm Complexity
In the H-DISSparse algorithm, the part attracting the most complexity is the DISSparse
algorithm, which generates the potential discriminative itemsets. Updating the tilted-time
window model by shifting and merging, tuning the frequencies of the non-discriminative
subsets and applying the tail pruning to the H-DISStream structure have less complexity than
in FP-Stream, owing to the sparsity of discriminative itemsets. In H-DISSparse the tilted-time
window model is updated after every pattern found in the current batch of transactions. The
efficiency of the H-DISSparse algorithm is discussed in detail in Chapter 6 by evaluating the
algorithm on the input data streams. Empirical analysis shows the performance of the proposed
method under different parameter settings (e.g., the relaxation of 𝛼). The efficiency of the
H-DISSparse algorithm is discussed on large and fast-growing data streams for mining
discriminative itemsets with the approximate bound guarantee.
The DISSparse method (Seyfi et al. 2017) proposed in Chapter 3 uses two determinative
heuristics and new concise data structures for efficient discriminative itemset mining in a batch
of transactions. Following Corollary 4-2, the relaxation of 𝛼 is incorporated into
HEURISTIC 3.1 and HEURISTIC 3.2 defined in Chapter 3 for the approximate support and
ratio bound guarantee on the discriminative itemsets in the tilted-time window model.
4.5 CHAPTER SUMMARY
The H-DISSparse method proposed in this chapter is applicable to large datasets. The DISSparse method it uses for offline batch processing mines discriminative itemsets in data streams efficiently, based on the heuristics proposed in Chapter 3. The tail pruning techniques proposed in the FP-Stream method are not applicable to the method proposed in this chapter, so new techniques are defined. The discriminative itemsets are held, with approximate support and approximate ratio, in the proposed novel prefix-tree H-DISStream structure and its built-in tilted-time window model over the history of the data streams.
In this chapter, three corollaries are defined for mining discriminative itemsets in the tilted-time window model with the highest refined approximate bound. The defined corollaries are applied to the H-DISSparse method. The process is efficient in the H-DISSparse method
Chapter 4: Mining Discriminative Itemsets in Data Streams using the Tilted-time Window Model Page 98
based on its efficient DISSparse algorithm, utilized for the batch processing. The corollaries defined in this chapter guarantee that the exact frequencies of the discriminative itemsets, sub-discriminative itemsets and non-discriminative subsets are held from the time they are maintained in the tilted-time window model. The non-discriminative itemsets, left as leaf nodes in H-DISStream, are pruned together with their built-in tilted-time window frames. The highest refined approximate bound is achieved by setting a smaller relaxation for tail pruning in the tilted-time window model. Owing to the compact logarithmic tilted-time window model and the defined tail pruning techniques, the H-DISStream data structure fits in main memory, which makes mining discriminative itemsets in data streams using the tilted-time window model realistic in fast-growing data streams.
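The compactness of the logarithmic tilted-time window model can be illustrated with a small sketch: each level keeps at most two frequency frames, and an overflow merges the two oldest frames into one frame at the next, coarser level, so t batches need only O(log t) counters per itemset. The function name `tilt_insert` and the list-of-lists layout are assumptions for illustration, not the thesis implementation.

```python
def tilt_insert(levels, freq):
    """Insert the newest batch frequency into a logarithmic tilted-time
    window: levels[i] holds at most two frames at granularity 2^i batches."""
    carry = freq
    for level in levels:
        level.insert(0, carry)
        if len(level) <= 2:
            return levels
        carry = level.pop() + level.pop()   # merge the two oldest frames
    levels.append([carry])                  # grow a new, coarser level
    return levels

levels = []
for f in [3, 1, 4, 1, 5]:                   # five batch frequencies
    tilt_insert(levels, f)
# Five batches are now summarized by only three counters across two levels.
```

No frequency mass is lost by the merges: the counters always sum to the total frequency over all batches, while the number of counters grows only logarithmically.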
The H-DISStream structure is stable over time, and discriminative itemsets that appear after processing batches with high concept drift are neutralized by merging into larger window frames. Building the H-DISStream structure can be made more efficient by periodically reordering the data structures based on the new trends in the data streams. The Desc-Flist order derived from the first batch of transactions is the default order for building all the data structures in the algorithms. The efficiency of the algorithms may suffer under this default order if the data streams exhibit high concept drift over time. Data structures such as the FP-Tree, conditional FP-Tree, minimized DISTree and H-DISStream can be updated periodically with a new ordering of frequent items for better efficiency (i.e., a Desc-Flist adjusted to the new trends in frequent items). However, the overhead of restructuring the large H-DISStream structure and its tilted-time window model must be considered.
The proposed method is extensively evaluated in Chapter 6 with datasets exhibiting distinct characteristics. Empirical analysis shows that the proposed H-DISSparse method has efficient time and space complexity with the highest refined approximate bound, based on the defined corollaries and a smaller relaxation 𝛼. Discriminative itemsets that appear with high concept drift in specific batches are neutralized quickly when merged into larger window frames. The in-memory data structures defined for the algorithms stay small, based on the defined corollaries, and remain stable during the process.
We have described many real-world applications for mining discriminative itemsets using the tilted-time window model. One interesting scenario arises in online news delivery services: looking for groups of words that are more frequent in the news read by specific users than in the collection of all news, over different time periods. This can be used for personalization based on a user's high interest during the current period, the history of the data streams or specific events, and for updating the system as user preferences change. Classification techniques can be applied based on the discriminative itemsets in the tilted-time window model. The general trends are not affected by high concept drift in specific periods of time, and different weighting functions can be adjusted to the discriminative
itemsets in each window frame. Interesting queries can be asked about the trends in different time periods. Discriminative itemsets are a sparse subset of frequent itemsets and are applicable to many real-world applications with large data streams (e.g., online market basket analysis, network access patterns).
In this chapter, discriminative itemsets are updated at offline time intervals in the tilted-time window model. In the next chapter, we propose an algorithm for mining discriminative itemsets in data streams using the sliding window model, with offline and online updating states.
Chapter 5: Mining Discriminative Itemsets in Data Streams using the Sliding Window Model Page 100
Chapter 5: Mining Discriminative Itemsets in
Data Streams using the Sliding Window
Model
In this chapter, the problem of mining discriminative itemsets in data streams using the sliding window model is formally defined. The comprehensive research problem is outlined and one method is proposed. The method, called S-DISSparse, applies the efficient DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3 to the sliding window model. The proposed method is explained in detail with its novel data structures and the offline updating sliding window model. In this chapter, novel determinative heuristics are proposed for exact and efficient mining of discriminative itemsets using the offline sliding window model. The proposed heuristics guarantee exact discriminative itemsets in large and fast-growing data streams, with the concise processing required by real-world applications. Online sliding is also proposed for mining discriminative itemsets in data streams in real time with the highest refined approximation. The proposed method is extensively evaluated in Chapter 6 on data streams made of multiple batches of transactions exhibiting diverse characteristics and with different threshold settings. Empirical analysis shows the efficient time and space complexity gained by the S-DISSparse algorithm for mining discriminative itemsets in the offline and online sliding window models. To the best of our knowledge, the proposed method is the first algorithm for mining discriminative itemsets in data streams using the sliding window model.
The chapter starts by describing the existing works in Section 5.1. The mathematical definition of the research problem is given in Section 5.2. The sliding window model and its offline updating process are discussed in Section 5.3. The online sliding window model is discussed in Section 5.4. The S-DISSparse method for mining discriminative itemsets in data streams using the sliding window model is proposed in Section 5.5. The chapter concludes with a discussion of the S-DISSparse method and its state-of-the-art techniques in Section 5.6.
5.1 EXISTING WORKS
Moment is a well-known method for mining closed frequent patterns from data streams using the sliding window model (Chi et al. 2004). A synopsis data structure is designed to monitor the transactions in the sliding window for mining the current closed frequent itemsets. The synopsis data structure cannot monitor all the itemset combinations due to time and
memory limits. On the other hand, monitoring only the frequent itemsets is not enough, as it is then impossible to detect itemsets changing from infrequent to frequent. Moment (Chi et al. 2004) introduces a compact data structure, the closed enumeration tree (CET), for maintaining a dynamic set of itemsets over a sliding window. The itemsets on the boundary between the closed frequent itemsets and the rest of the itemsets are kept in the CET, and the boundary moves with concept drift in the data stream. An itemset's status changes (e.g., from infrequent to frequent) through the boundary. The data structure is small enough to be maintained in memory and updated in real time. The boundary is relatively stable and most itemsets do not often change status.
The CET data structure in the Moment algorithm (Chi et al. 2004) is informative enough for mining any (closed) frequent itemset over a sliding window. With a reasonably large window size and little concept drift in the data stream, most itemsets do not change their status from frequent to infrequent or vice versa. This means the itemsets are mainly updated in their frequency within the window model and do not change status through the addition or deletion of transactions. Also, changes to the entire tree structure are limited to status changes at the boundary nodes.
The Moment algorithm (Chi et al. 2004) holds four types of nodes in the CET: infrequent gateway nodes, unpromising gateway nodes, intermediate nodes, and closed nodes. These node types and the status changes between them are described briefly. Infrequent gateway nodes are infrequent itemsets with a frequent parent or a frequent sibling of the parent. Unpromising gateway nodes are frequent itemsets that have a closed superset with the same support. Intermediate nodes are frequent itemsets, not themselves unpromising gateway nodes, that have a child node with the same support. Closed nodes represent closed itemsets in the sliding window model and can be internal or leaf nodes. Node statuses (itemsets) change as transactions are added to or deleted from the window.
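The four node types can be summarized in a small decision sketch. The predicate names below are illustrative simplifications of the CET conditions described in Chi et al. (2004), not their code.

```python
def classify_cet_node(support, min_sup, has_closed_superset_same_support,
                      has_child_same_support):
    """Map the conditions above onto the four CET node types."""
    if support < min_sup:
        return "infrequent gateway"          # infrequent, kept as boundary
    if has_closed_superset_same_support:
        return "unpromising gateway"         # subsumed by a closed itemset
    if has_child_same_support:
        return "intermediate"                # frequent, not closed, not gateway
    return "closed"                          # closed itemset in the window

# A frequent itemset subsumed by a closed superset of equal support:
label = classify_cet_node(5, 3, True, False)
```

Checking the conditions in this order mirrors the precedence implied by the definitions: a node cannot be an intermediate node if it already qualifies as an unpromising gateway.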
In the Moment algorithm (Chi et al. 2004), when a transaction is added, the node status will most likely not change, and the algorithm simply updates the supports of the itemsets at minimum cost. In specific cases, changes may happen. An infrequent gateway node may become frequent; in this case all its left siblings must be checked for creating new children, and the pruned branches under the node must be re-explored. An unpromising gateway node may become promising; in this case the pruned branches under the node must be re-explored. A closed node remains closed without a change in status, and an intermediate node can become a closed node.
In the Moment algorithm (Chi et al. 2004), when old transactions are deleted, most likely, as with adding a new transaction, the node statuses will not change and only the supports need to be updated. In specific cases, changes may happen. An infrequent gateway node remains infrequent. An unpromising gateway node may change to an infrequent node. The
frequent node can become infrequent as its support decreases; in this case its entire set of descendants is pruned, and all of its left siblings are updated by removing the children obtained from joining with the node. A promising node may become unpromising; in this case a left check of its siblings is necessary. A closed node may become non-closed.
The Moment algorithm (Chi et al. 2004) is fundamentally based on the concept of closed itemsets as a concise representation of frequent itemsets. The concept of closed frequent itemsets is not applicable to discriminative itemset mining, as the latter does not follow the Apriori property.
The estWin algorithm (Chang and Lee 2005) is proposed for mining recently frequent itemsets over an online transactional data stream. Mining over an online data stream is performed with a flexible trade-off between processing time and mining accuracy. The algorithm uses a minimum support threshold and a significant support threshold for mining the frequent itemsets. An itemset is recently frequent if its current support within the sliding window is greater than or equal to the minimum support threshold. An itemset whose current support within the sliding window is greater than the significant support threshold is a significant itemset. The significant itemsets are maintained in main memory in a lexicographic tree structure. Old transactions fall out of range by decreasing the occurrence count of each itemset that appeared in them.
In the estWin algorithm (Chang and Lee 2005), the total number of itemsets monitored in main memory is minimized by two operations: delayed insertion and pruning. An itemset's insertion is delayed until it becomes significant, and itemsets are pruned once they become insignificant. Itemsets with current support much less than the minimum support are not monitored, since they cannot become frequent in the near future; these are considered insignificant itemsets. The support of an itemset that is not monitored in the current window can be estimated from the supports of its currently monitored subsets. All the transactions in the current window are maintained in a structure named CTL (Current Transaction List). estWin has two states: window initialization and window sliding. The window is being initialized while the number of transactions generated so far in the data stream is less than the sliding window size. The window slides when the CTL becomes full; in this state, the new transaction is added and the oldest transaction is extracted from the CTL.
In the estWin algorithm (Chang and Lee 2005), the currently significant itemsets are maintained in a lexicographic tree structure, as in (Wang and Han 2004), called the monitoring tree. Every node in the monitoring tree holds an item and represents the itemset on the path from the root to this item. The algorithm maintains the tuple (pcnt, acnt, err, mtid) in each entry. The pcnt is the maximum possible count of the itemset, an estimate of the number of transactions containing the itemset before it was added to the monitoring tree. The actual count acnt is the number of transactions in the sliding window that contain the itemset after it was inserted. The maximum error count err of the itemset is the error in the maximum possible count pcnt
when the pcnt is estimated. The mtid is the TID of the transaction that caused the insertion of the itemset into the monitoring tree. For an itemset to become a new significant itemset, all of its subsets must currently be significant.
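A monitoring-tree entry with the (pcnt, acnt, err, mtid) fields described above can be sketched as follows. The helper methods are illustrative assumptions; estWin's exact estimation and pruning rules are more involved than this.

```python
from dataclasses import dataclass


@dataclass
class MonitoringEntry:
    pcnt: int   # estimated count before the itemset entered the tree
    acnt: int   # actual count observed after insertion
    err: int    # maximum error of the pcnt estimate
    mtid: int   # TID of the transaction that triggered insertion

    def count_upper_bound(self):
        # Upper bound on the itemset's count in the current window:
        # the pre-insertion estimate plus the exactly observed count.
        return self.pcnt + self.acnt

    def is_significant(self, sig_threshold, window_size):
        # Keep the entry while its bound can still reach significance.
        return self.count_upper_bound() >= sig_threshold * window_size


entry = MonitoringEntry(pcnt=3, acnt=7, err=2, mtid=41)
```

Separating the estimated part (pcnt, with its error bound err) from the exactly counted part (acnt) is what lets estWin trade accuracy for memory: only acnt is guaranteed, and err bounds how wrong the total can be.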
estWin (Chang and Lee 2005) is an approximate method, and the frequencies of itemsets are estimated from the frequencies of their subsets. This Apriori property of subsets does not apply to the problem of discriminative itemset mining.
DSTree (Leung and Khan 2006) is proposed for exact mining of frequent itemsets in a data stream over a sliding window model. A novel tree structure called DSTree (Data Stream Tree) is proposed for maintaining the transaction frequencies in different batches and mining the exact frequent itemsets. As the data stream flows, the fixed-size window defined by the user slides and the DSTree is updated. The frequencies of items are continuously affected by inserting a new batch or removing the oldest batch of transactions. In the DSTree, a list of counters is maintained at each node, and the last entry of the list holds the node's frequency count in the current batch. When the next batch of transactions arrives, this list is shifted forward, so the last entry becomes the second-last entry; at the same time, the frequency count of the oldest batch is removed. These shift and update operations are performed at every window slide.
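The shift-and-drop of the per-node counter list can be sketched as follows. The function name `slide` and the fixed window of three batches are illustrative assumptions, not the DSTree implementation.

```python
def slide(counters, new_batch_count, window_batches):
    """Append the current batch's count; once the fixed-size window of
    window_batches batches is full, drop the oldest batch's count."""
    counters = counters + [new_batch_count]
    if len(counters) > window_batches:
        counters = counters[1:]        # the oldest batch leaves the window
    return counters


counters = []
for batch_count in [4, 2, 7, 5]:       # four batches, window of three
    counters = slide(counters, batch_count, window_batches=3)
```

After the fourth batch, the count 4 of the first batch has left the window, and summing the remaining entries gives the node's exact frequency over the current window, which is what makes DSTree an exact method.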
The proposed DSTree algorithm (Leung and Khan 2006) holds a pointer in each node indicating the last update. When the window slides, the transactions in the oldest batch are deleted by shifting the list of frequency counts using the pointer in each node. The DSTree is independent of the minimum support, and every transaction in the batches is monitored. Frequent itemset mining can then be applied using the FP-Growth method (Han, Pei and Yin 2000). This structure can also be used for mining the maximal or closed itemsets in the sliding window model.
The DSTree algorithm (Leung and Khan 2006) essentially holds the item frequencies over offline sliding of batches of transactions. The FP-Growth algorithm (Han, Pei and Yin 2000) is run on the DSTree structure for mining the frequent itemsets in the offline sliding window model. FP-Growth is designed around the Apriori property and divide and conquer, which do not apply to the problem of discriminative itemset mining.
The main contribution of this chapter is an algorithm for mining discriminative itemsets in data streams using the sliding window model. The S-DISSparse method is proposed for mining exact discriminative itemsets in data streams using the offline sliding window model. Discriminative itemsets with approximate frequencies are discovered using the online sliding window model. The proposed S-DISSparse method is completely novel and utilizes the DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3. The technical details of the proposed algorithm are presented so as to distinguish it from the DISSparse method.
5.2 RESEARCH PROBLEM
The knowledge embedded in data streams changes over time. Processing the recent transactions is important in applications looking for recent patterns in the data streams (e.g., anomaly detection and decision making). Discovering the recent patterns in a finite number of transactions, or a fixed time period, and continuously monitoring the variations in these patterns gives valuable information for data stream processing based on recent trends. Quickly monitoring the gradual changes in recent patterns in fast-growing data streams, especially online, using the sliding window model is useful for answering recent time-restricted queries. The transactions used for pattern mining should be restricted to the most recent ones in a fixed-size window frame, eliminating the effect of obsolete transactions on the information. The transactions in the window frame have to be saved so their effects can be deleted from the results when they leave the window frame as it slides (Chang and Lee 2005).
The discriminative itemsets in the sliding window model are defined as the frequent itemsets in the target data stream whose frequencies are much higher than those of the same itemsets in the other data streams in a fixed recent time period. Without loss of generality, we call the other data streams a 'general data stream' for the sake of simplicity. The discriminative itemsets are relatively frequent in the target data stream and relatively infrequent in the general data stream during the fixed-size sliding window frame. An essential issue in this research problem is to find the itemsets that can distinguish the target data stream from all other data streams in a finite number of recent transactions or a fixed recent time period.
There are many real-world scenarios that show the significance of mining discriminative itemsets in data streams using the sliding window model. Monitoring stock market fluctuations over the most recent transactions in a fixed-size sliding window frame can be useful for quickly detecting changes in recent trends. Discriminative itemsets represented in the sliding window model are useful for comparing data streams by their recent trends, identifying the specific sets of items that are of high interest in one market compared to the others in a fixed recent time period. The itemsets that occur more frequently in the recent transactions of one stock market than in the others can be used for answering recent time-restricted queries. They can show the relative changes in data stream trends in the recent time period and answer real-time queries. Web page personalization can be optimized using changes in user preferences in a fixed recent time period. The sequences of queries with higher support in one geographical area, compared to another, are time related. Changes in discriminative pattern trends during network monitoring in the last few minutes are more valuable for anomaly detection and network interference prediction than the discriminative patterns themselves.
Compared with the other window models, mining discriminative itemsets in data streams using the sliding window model poses additional challenges. The sliding window model has to be updated by adding the large number of itemset combinations from recent transactions and deleting the effects of old transactions from the fixed-size window frame. The challenges are greater in the online sliding window model: with the arrival of every single transaction in the data streams, the window frame has to be updated quickly by adding the itemsets of the recent transaction and deleting the itemsets of the oldest one. Depending on the application, the sliding window frame is defined by a fixed time period or a fixed number of transactions, and the discriminative itemsets are reported in offline and online updating states. The Apriori property defined for frequent itemsets does not hold for discriminative itemsets: a subset of a discriminative itemset can be non-discriminative. The proposed method has to be highly time and memory efficient. The greatest challenge is the generation of compact in-memory data structures that process only the recent potential discriminative itemsets in the offline and online sliding states. Moreover, unsynchronized data streams with different speeds add further challenges to the window updating process.
In this chapter, discriminative itemset mining using the sliding window model is discussed based on two data streams S_i and S_j, modelled as multiple continuous batches of transactions denoted B_1, …, B_h, B_h+1, …, B_n.
5.2.1 Problem formal definition
The formal definition of mining discriminative itemsets presented in Chapter 3 is
expanded for mining discriminative itemsets using the sliding window model as defined below:
Let ∑ be the alphabet set of items. A transaction T = {e_1, …, e_i, e_{i+1}, …, e_n}, e_i ∈ ∑, is defined as a set of items in ∑. The items in a transaction are in alphabetical order by default, for ease of describing the mining algorithm. The two data streams S_i and S_j are defined as the target and general data streams; each consists of a different number of transactions, i.e., n_i and n_j, respectively. A group of input transactions from the two data streams S_i and S_j in a pre-defined time period forms a batch of transactions B_n, n ≥ 1. Let P be a partition fitting an input batch of transactions B. The sliding window frame, denoted W, is made of a fixed number of partitions P_k, k ≥ 1 and P_k ⊆ W (e.g., in Figure 5.1 the sliding window model is made of three partitions), and refers to the fixed recent time period containing itemsets made of transactions in the two data streams S_i and S_j, with lengths n_i^w and n_j^w, respectively. All partitions cover the same width of time period, but the number of transactions per partition in the sliding window frame W varies depending on the speed of the data streams. The window frame W slides in an offline state by adding the itemsets in the recent partition P_new and deleting the itemsets in the oldest partition P_old, as in Figure 5.1.
Figure 5.1 Sliding window model 𝑊 made of three partitions 𝑃
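The partition-based sliding of Figure 5.1 can be sketched with a bounded deque. The per-partition pairs of stream lengths (n_i, n_j) and the helper `slide_window` are illustrative assumptions; real partitions would hold the transactions themselves.

```python
from collections import deque

window = deque(maxlen=3)               # W holds exactly three partitions


def slide_window(window, p_new):
    """Append P_new; a full deque silently evicts P_old, which we return."""
    evicted = window[0] if len(window) == window.maxlen else None
    window.append(p_new)
    return evicted


# Each partition is summarized here as a pair (n_i, n_j) of per-stream
# transaction counts; partitions cover equal time but unequal counts.
for partition in [(10, 40), (12, 35), (9, 50), (14, 38)]:
    p_old = slide_window(window, partition)

# Current stream lengths (n_i^w, n_j^w) inside the window frame W.
lengths = tuple(sum(col) for col in zip(*window))
```

The evicted partition `p_old` matters: its itemsets must be subtracted from the window's frequency counts, which is exactly the deletion step the offline sliding window model performs.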
An itemset I is defined as a subset of ∑. The frequency of an itemset is the number of transactions that contain the itemset. The frequency of itemset I in data stream S_i in the window frame W is denoted f_i^w(I), and the frequency ratio of itemset I in data stream S_i in the window frame W is defined as r_i^w(I) = f_i^w(I) / n_i^w.
In this chapter, if the frequency ratio of itemset I in the target data stream S_i in the sliding window frame W is larger than its frequency ratio in the general data stream S_j, i.e., r_i^w(I) / r_j^w(I) > 1, then itemset I can be considered a discriminative itemset in the sliding window frame W. Let R_ij^w(I) be the ratio between r_i^w(I) and r_j^w(I), i.e., R_ij^w(I) = r_i^w(I) / r_j^w(I). Obviously, the higher R_ij^w(I) is, the more discriminative the itemset I.
To define discriminative itemsets more precisely, we introduce a user-defined threshold θ > 1, called the discriminative level threshold, with no upper bound. An itemset I is considered discriminative in the window frame W if R_ij^w(I) ≥ θ. This is formally defined as:
R_ij^w(I) = r_i^w(I) / r_j^w(I) = (f_i^w(I) · n_j^w) / (f_j^w(I) · n_i^w) ≥ θ    (5.1)
R_ij^w(I) could be very large yet with a very low f_i^w(I). To accurately identify discriminative itemsets that have reasonable frequency in the window frame W, and also to handle the case of f_j^w(I) = 0, we introduce another user-specified support threshold, 0 < φ < 1/θ, to eliminate itemsets that have very low frequency in the window frame W. In this chapter, an itemset I is considered discriminative if its frequency in the window frame W is at least φθ·n_i^w, i.e., f_i^w(I) ≥ φθ·n_i^w, and also R_ij^w(I) ≥ θ.
Definition 5.1. Discriminative itemsets in the sliding window model: Let S_i and S_j be two data streams, with current sizes n_i^w and n_j^w in the sliding window frame W, containing varied-length transactions of items in ∑; let θ > 1 be a user-defined discriminative level threshold and φ ∈ (0, 1/θ) a support threshold. The set of discriminative itemsets in S_i against S_j in the sliding window model in window frame W, denoted DI_ij^w, is formally defined as:
DI_ij^w = {I ⊆ ∑ | f_i^w(I) ≥ φθ·n_i^w & R_ij^w(I) ≥ θ}    (5.2)
The itemsets that are not discriminative in the current state of the sliding window frame W can become discriminative as the window slides, for example when the sliding window frame is updated with transactions coming into or going out of the window frame. In order not to miss any potential discriminative itemsets in the full-size sliding window frame W, we propose to identify sub-discriminative itemsets in the sliding window model with a user-specified parameter. The sub-discriminative itemsets are discovered using a relaxation α ∈ (0,1); a smaller α yields more sub-discriminative itemsets. An itemset I is sub-discriminative if it is not discriminative but its frequency in the target data stream S_i is not less than αφθ·n_i^w and its ratio is not less than αθ. The discriminative itemsets are of primary interest; however, the sub-discriminative itemsets are also tracked during the process, as they may become discriminative as the window slides.
Definition 5.2. Sub-discriminative itemsets in the sliding window model: Let S_i and S_j be two data streams, with current sizes n_i^w and n_j^w in the sliding window frame W, containing varied-length transactions of items in ∑; let θ > 1 be a user-defined discriminative level threshold, φ ∈ (0, 1/θ) a support threshold and α ∈ (0,1) a relaxation parameter. The set of sub-discriminative itemsets in S_i against S_j in the sliding window model W, denoted SDI_ij^w, is formally defined as:
SDI_ij^w = {I ⊆ ∑ | f_i^w(I) ≥ αφθ·n_i^w & R_ij^w(I) ≥ αθ}    (5.3)
The sub-discriminative itemsets are the potential discriminative itemsets in the sliding window frame W of the data streams. The relaxation α ∈ (0,1) is defined for better approximate support and approximate ratio of discriminative itemsets in the online sliding window model, with a trade-off between computational cost and the number of false or missed discriminative itemsets. The discriminative itemsets are reported with exact support and exact ratio in the offline sliding window model, and with approximate support and approximate ratio during online sliding, between two offline slides of the window model.
The itemsets that are neither discriminative nor sub-discriminative are defined as non-discriminative itemsets:
Definition 5.3. Non-discriminative itemsets in the sliding window model: Let S_i and S_j be two data streams, with current sizes n_i^w and n_j^w in the sliding window frame W, containing varied-length transactions of items in ∑; let θ > 1 be a user-defined discriminative level threshold, φ ∈ (0, 1/θ) a support threshold and α ∈ (0,1) a relaxation parameter. The set of non-discriminative itemsets in S_i against S_j in the sliding window model, denoted NDI_ij^w, is formally defined as:
NDI_ij^w = {I ⊆ ∑ | f_i^w(I) < αφθ·n_i^w or R_ij^w(I) < αθ}    (5.4)
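Definitions 5.1-5.3 can be combined into one labelling function. This sketch treats f_j^w(I) = 0 as an infinite ratio, as motivated above, and all names are illustrative.

```python
def label_itemset(f_i, f_j, n_i, n_j, theta, phi, alpha):
    """Label an itemset by Definitions 5.1-5.3, given its frequencies
    f_i, f_j and the stream lengths n_i, n_j within the window frame W."""
    r_i, r_j = f_i / n_i, f_j / n_j
    ratio = float("inf") if r_j == 0 else r_i / r_j     # R_ij^w(I)
    if f_i >= phi * theta * n_i and ratio >= theta:
        return "discriminative"                          # Definition 5.1
    if f_i >= alpha * phi * theta * n_i and ratio >= alpha * theta:
        return "sub-discriminative"                      # Definition 5.2
    return "non-discriminative"                          # Definition 5.3


# theta = 2, phi = 0.1 (< 1/theta), alpha = 0.5, n_i = 100, n_j = 200
label = label_itemset(15, 20, 100, 200, 2, 0.1, 0.5)
```

With these settings an itemset needs f_i ≥ φθn_i = 20 and R ≥ 2 to be discriminative; the example itemset, with f_i = 15 and R = 1.5, meets only the α-relaxed bounds, so it is tracked as sub-discriminative.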
The non-discriminative itemsets are not frequent in the target data stream under the relaxation α, or have a frequency ratio in S_i compared to S_j of less than αθ, in the sliding window
frame W of the data streams. The non-discriminative itemsets are used for tail pruning in the sliding window model, as discussed in Section 5.3.
5.2.2 Discriminative itemset mining using the sliding window model
The problem of discriminative itemset mining using the sliding window model is defined for offline batch processing in data streams. The sliding window model is updated in an offline state at the specific time intervals defined by the batches of transactions. The new batch of transactions is processed, the discriminative itemsets are saved in the sliding window frame W, and the itemsets belonging to the oldest partition, now outside the sliding window frame W, are deleted. The sub-discriminative itemsets are also saved, as they may become discriminative in the future through online sliding of the window frame. After discovering the new discriminative and sub-discriminative itemsets from the current batch of transactions in W, the old itemsets in the sliding window frame W are reassessed for being discriminative or sub-discriminative based on the recent data stream lengths.
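One offline update of the itemset frequencies in W, adding the counts of the new partition and removing those of the evicted one, can be sketched as follows. The flat dictionary of (f_i, f_j) pairs is an illustrative stand-in for the S-DISStream prefix tree, and `offline_slide` is a hypothetical name.

```python
def offline_slide(window_counts, new_partition, old_partition):
    """window_counts, new_partition and old_partition each map an itemset
    to its (f_i, f_j) frequency pair; the window gains the counts of P_new
    and loses those of P_old."""
    for itemset, (fi, fj) in new_partition.items():
        wi, wj = window_counts.get(itemset, (0, 0))
        window_counts[itemset] = (wi + fi, wj + fj)
    for itemset, (fi, fj) in old_partition.items():
        wi, wj = window_counts[itemset]
        wi, wj = wi - fi, wj - fj
        if wi <= 0 and wj <= 0:
            del window_counts[itemset]     # the itemset left the window
        else:
            window_counts[itemset] = (wi, wj)
    return window_counts


w = {("a",): (5, 3), ("a", "b"): (2, 1)}
offline_slide(w, {("a",): (4, 1), ("c",): (3, 0)}, {("a", "b"): (2, 1)})
```

After the updated counts are in place, every surviving itemset would be reassessed against Definitions 5.1-5.3 using the new window lengths n_i^w and n_j^w, which is the reassessment step described above.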
The offline sliding window model is discussed in detail using running examples that show the significant challenges. Two determinative heuristics are proposed for efficient and exact mining of discriminative itemsets in the offline sliding window model. Online sliding happens between two offline slides of the window model. The online sliding window model is discussed in detail for mining discriminative itemsets in an online state with an approximate bound guarantee based on the properties of discriminative itemsets in the sliding window model. Considering the complexity of discriminative itemset mining in data streams using the sliding window model, the novel, efficient S-DISSparse method is proposed, utilizing the DISSparse method (Seyfi et al. 2017) proposed in Chapter 3 and defining new data structures adapted to offline and online updating of the sliding window model. The S-DISSparse algorithm is evaluated in Chapter 6 on input data streams with different characteristics for its time and space complexity in the offline sliding window model. The algorithm is also evaluated for different approximate support and approximate ratio bounds of discriminative itemsets in the online sliding window model. The specific characteristics and limitations of the proposed method on large and fast-growing data streams are discussed.
5.3 OFFLINE SLIDING WINDOW MODEL
The sliding window model is made of itemsets from the recent transactions within the range of the window frame size, held in a prefix tree structure (i.e., S-DISStream, as presented in Figure 5.3). The frequencies of itemsets are held for the full-size sliding window frame, and the discriminative itemsets are reported during offline and online updating of the sliding window model. The offline sliding window model is updated by the new batch of transactions arriving and the oldest batch of transactions going out at offline time intervals in the window frame. The online sliding window model is updated by adding the most recent transaction and deleting the oldest transaction, respectively, in the window frame in real time. The size of the sliding window frame (i.e., 𝑊) can be defined based on a fixed time period or a fixed number of transactions. The size of a sliding window frame based on a fixed time period can vary, as the data streams have different speeds over time. The size of the sliding window frame is defined based on the desired output range in the application domain and the limit of main memory. To facilitate describing the method, several important concepts and constructs are defined in the section below.
5.3.1 Mining discriminative itemsets in sliding window using prefix tree
Two prefix tree structures are defined for holding the transactions and the itemsets in the sliding window model, respectively.
S-FP-Tree: The prefix tree structure proposed in the FP-Growth (Han, Pei and Yin
2000) is used for holding the items of the transactions, without pruning infrequent items, by
sharing the branches for their most common frequent items (i.e., S-FP-Tree includes all items).
The S-FP-Tree is adapted by adding two counters in each node for holding the frequencies of the
itemsets in the target data stream 𝑆𝑖 and the general data stream 𝑆𝑗, respectively; for example,
there are two counters associated with each node in the S-FP-Tree in Figure 5.2. The S-FP-Tree
is updated during window sliding by adding the transactions of the new partition and deleting the
transactions of the oldest partition in the sliding window frame 𝑊. New paths in the S-FP-Tree
are added for the new transactions or the frequencies of paths are updated. The frequencies of
paths in the S-FP-Tree are decreased by deleting the transactions of the oldest partition. The
nodes in the S-FP-Tree are tagged based on their recent status: they are tagged as stable if their frequencies have not changed by adding the new partition or deleting the oldest partition; otherwise, they are tagged as updated.
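The S-FP-Tree bookkeeping described above, two counters per node plus a stable/updated tag touched during window sliding, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis implementation; the class and method names are my own.

```python
class SFPNode:
    """S-FP-Tree node with two counters (target stream S_i, general
    stream S_j) and a stable/updated tag used during window sliding."""
    def __init__(self, item):
        self.item = item
        self.fi = 0           # frequency in the target stream S_i
        self.fj = 0           # frequency in the general stream S_j
        self.updated = False  # False = stable, True = updated
        self.children = {}

    def insert(self, transaction, stream, delta):
        """Add (delta=+1) or delete (delta=-1) one transaction of one
        stream along a shared prefix path; touched nodes become updated."""
        node = self
        for item in transaction:
            node = node.children.setdefault(item, SFPNode(item))
            if stream == 'i':
                node.fi += delta
            else:
                node.fj += delta
            node.updated = True  # the frequency changed during sliding

    def reset_tags(self):
        """Tag every node as stable before the next window sliding."""
        self.updated = False
        for child in self.children.values():
            child.reset_tags()

root = SFPNode(None)
root.insert(['b', 'd', 'a'], 'i', +1)  # transaction of the new partition
root.insert(['b', 'd', 'a'], 'j', +1)
root.insert(['b', 'd', 'a'], 'j', -1)  # transaction of the oldest partition leaves
print(root.children['b'].fi, root.children['b'].fj)  # → 1 0
```

The key property mirrored here is that deleting the oldest partition only decreases counters along existing paths, while both kinds of change mark the touched nodes as updated.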
S-DISStream: The S-DISStream prefix tree structure holds the discovered discriminative itemsets as well as the sub-discriminative itemsets. Both discriminative and sub-discriminative itemsets share branches in the same S-DISStream structure for their most common frequent items (e.g., as in Figure 5.3). Each path in S-DISStream may represent a subset of multiple discriminative and sub-discriminative itemsets starting from the root of the prefix tree structure. Each node in S-DISStream has two counters, 𝑓𝑖 and 𝑓𝑗, holding the frequencies of the itemset in the target data stream 𝑆𝑖 and the general data stream 𝑆𝑗, respectively, in the sliding window frame 𝑊. The itemsets in S-DISStream are made of transactions from the partitions that fit in the sliding window frame 𝑊.
The Header-Table is defined for fast traversal of the prefix tree structures, using links that connect the itemsets ending with identical items. Each Header-Table item node in S-DISStream saves its top ancestor on the first level as the root; for example, node 𝑐 is the top ancestor of all different Header-Table items, including 𝑎, 𝑒, 𝑏 and 𝑐, in the left-most subtree in Figure 5.3, and 𝑐 appears in all nodes in the left-most subtree. The nodes in the first level of S-DISStream
determine different subtrees, each made of a number of itemsets under the same root in the first level of S-DISStream and ending with different Header-Table items. Following notations similar to those of the conditional FP-Tree in the DISSparse method (Seyfi et al. 2017) proposed in Chapter 3, a subtree in
S-DISStream is denoted as 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡; for example, the S-DISStream in Figure 5.3 has three
subtrees under root 𝑐, 𝑏 and 𝑒 (i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐, 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 and 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑒, respectively).
While processing each header item, the set of Header-Table items that are linked under their subtree root node using Header-Table links is denoted as 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡). The nodes are tagged as discriminative, sub-discriminative or non-discriminative (i.e., a subset of discriminative or sub-discriminative itemsets). The S-DISStream is updated in the sliding window model in both offline and online states. The nodes in S-DISStream are tagged based on their recent status: they are tagged as stable if their frequencies have not changed by offline window sliding (i.e., adding the new partition or deleting the oldest partition); otherwise, they are tagged as updated. The nodes in S-DISStream are tagged as online if they are updated during online sliding. The concept of stable nodes in the S-FP-Tree and in S-DISStream refers to the itemsets whose frequencies do not change during window sliding.
The offline processing is used for controlling the generation of stable and non-potential
discriminative itemset combinations. The online processing is used for more up-to-date and
accurate online answers. The discriminative itemsets are discovered in the offline state
periodically, and the mining process continues in the online state. Two new heuristics are applied in the S-DISSparse method for efficient mining of discriminative itemsets in data streams using the sliding window model.
5.3.2 Incremental offline sliding window
The sliding window model can be simply implemented by adapting the DISTree
algorithm (Seyfi, Geva and Nayak 2014) or DISSparse algorithm (Seyfi et al. 2017), proposed in
Chapter 3, in the offline updating state. The recent batch of transactions fitting in a partition of
the sliding window frame 𝑊 is processed for mining discriminative itemsets using DISTree or
DISSparse algorithms and the results are saved in the sliding window model (i.e., S-DISStream).
The sliding window model is updated with the discriminative itemsets discovered from each new batch of transactions fitting in the new partition added to the sliding window frame 𝑊, and by deleting the itemsets belonging to the oldest partition, which falls outside the full-size sliding window frame 𝑊. However, there are several challenges with this naïve approach, as described below.
First, the sliding window frame 𝑊 is made of several partitions, and discovering the discriminative itemsets in a single partition and merging them with the itemsets in the full-size sliding window frame can result in high numbers of false positives and false negatives, which significantly downgrades the output quality. For example, an itemset can be non-discriminative in a partition with a very small frequency ratio (i.e., a much larger frequency in the general data stream than in the target data stream); pruning it during batch processing lowers precision by leaving false positives in the sliding window frame 𝑊. Conversely, a non-discriminative itemset in a partition can become discriminative after merging in the sliding window model, for example, an itemset with a high frequency ratio that is infrequent in the target data stream; pruning it during batch processing lowers recall by causing false negatives in the sliding window frame 𝑊. Second, many itemsets can be discriminative in a single partition and non-discriminative in the sliding window frame 𝑊, causing an inefficient mining process. Third, two batch processes must be done during window sliding, i.e., one for adding the discriminative itemsets of the recent batch and one for deleting the itemsets of the oldest batch in the sliding window frame 𝑊. This may not be realistic if the transactions arrive continuously at high speed.
Data stream mining algorithms basically must be designed based on a single scan (Aggarwal 2007; Han, Pei and Kamber 2011). Algorithm design based on multiple scans is often too expensive, specifically in the sliding window model with its necessity for fast updating. Smaller relaxations for sub-discriminative itemsets can relatively improve the approximation, at the cost of extra complexity. An efficient method should be designed for exact mining of discriminative itemsets by processing the recently updated transactions in the S-FP-Tree and tagging the discriminative and non-discriminative itemsets in S-DISStream based on the recent data stream lengths in the sliding window frame 𝑊.
In this chapter, two new heuristics are proposed based on the status of the S-FP-Tree nodes (i.e., stable or updated during window sliding), within efficient time and space usage, for offline updating of the sliding window model in the S-DISSparse method. The offline sliding window model is described in the sub-sections below.
5.3.2.1 Initializing the offline sliding window
The S-FP-Tree is initialized with the transactions fitting in the first partition 𝑃1, and discriminative itemsets are discovered following the normal process of the DISSparse algorithm (Seyfi et al. 2017) and saved to S-DISStream. All the nodes in the S-FP-Tree and S-DISStream are tagged as stable (i.e., not updated) before adding the recent partition 𝑃𝑛𝑒𝑤 and deleting the oldest partition 𝑃𝑜𝑙𝑑, i.e., in the full-size window frame 𝑊 as in Figure 5.1. The nodes in S-FP-Tree paths are tagged as updated if their frequencies change during window sliding, for example, by adding new transactions or deleting old transactions.
The conditional FP-Tree is built for each Header-Table item based on the item's conditional patterns in the S-FP-Tree. The Header-Table items in the conditional FP-Tree hold the same status as their tags in the S-FP-Tree. The tags in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) in the conditional FP-Tree can show whether the itemsets in a 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 are updated or stable during window sliding. Similar approaches are applied for the updated or stable itemsets with a subset of 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 in a potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡. The two new heuristics are proposed accordingly, based on the recently updated itemsets, for mining discriminative itemsets out of the updated potential discriminative subsets in the conditional FP-Tree. The stable itemsets are checked in S-DISStream and are tagged based on the recent data stream lengths in the sliding window frame 𝑊.
Example 5.1. The S-FP-Tree and S-DISStream constructions and updating are graphically demonstrated using the running example, with two batches of transactions fitting in the first two partitions of the sliding window model, respectively. The first batch (i.e., following Example 3.1), made of data streams 𝑆1 and 𝑆2 (𝑛1 = 𝑛2 = 15), fits in 𝑃1 and is presented in Table 5.1. The second batch, made of data streams 𝑆1 and 𝑆2 (𝑛1 = 𝑛2 = 5), fits in 𝑃2 and is presented in Table 5.3.
Table 5.1 The first input batch in data streams fits in partition 𝑃1
The Desc-Flist order made from the first batch of transactions (i.e., Table 5.1) is generated as in Table 5.2. This Desc-Flist remains the same for all upcoming batches in the data streams.
Table 5.2 Desc-Flist order of frequent items in target data stream 𝑆1 in the first batch
In this example the discriminative level threshold is set to 𝜃 = 2 and the support threshold is set to 𝜑 = 0.1. The S-FP-Tree and S-DISStream structures made from the first partition 𝑃1 are represented in Figure 5.2 and Figure 5.3, respectively. The highlighted nodes in S-DISStream in Figure 5.3 refer to the discriminative itemsets.
Figure 5.2 Header-Table and S-FP-Tree structures by the first partition 𝑃1
Figure 5.3 Header-Table and S-DISStream structures by the first partition 𝑃1
After processing the most recent partition, S-DISStream holds the discriminative itemsets in the offline sliding window model. The S-FP-Tree and S-DISStream structures are then tagged based on their stable and updated subsets for efficient mining, as explained in the section below.
5.3.2.2 Stable and updated subsets in offline sliding window
Before processing the next batch of transactions fitting in 𝑃2, all the nodes in the S-FP-Tree and S-DISStream structures are tagged as stable. Figure 5.4 shows the S-FP-Tree structures after adding the second batch of transactions fitting in 𝑃2 (i.e., as in Table 5.3). The updated nodes in the S-FP-Tree are represented by thick borders; for example, the path 𝑏𝑑𝑎3,1 appears in the S-FP-Tree after adding the new batch of transactions fitting in 𝑃2.
Table 5.3 The second input batch in data streams fits in partition 𝑃2
Figure 5.4 Header-Table and updated S-FP-Tree structures by adding second partition 𝑃2
A 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) in the conditional FP-Tree in the sliding window frame 𝑊 satisfies the two conditions Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜑𝜃𝑛𝑖𝑤 and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜃, where 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎) is the set of itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡).
Let 𝒮 be the power set of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎), i.e., 𝒮 = 2^𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡,𝑎), so 𝒮 consists of all subsets of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎). For 𝐵 ∈ 𝒮 and 𝐵 ≠ { }, the frequency of each itemset in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡, in respect of the data stream 𝑆𝑖 in the sliding window frame 𝑊, is defined below.

𝑓𝑖𝑤(𝐵) = ∑𝑏∈𝐵 𝑓𝑖𝑤(𝑏) (5-a)
For simplicity, in the equation above 𝑓𝑖𝑤(𝑏) refers to the frequency in 𝑆𝑖 of an itemset 𝑏 ∈ 𝐵, where 𝐵 belongs to the power set of 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑎). The frequency of an itemset in a subtree is stable if all 𝑏 ∈ 𝐵 are stable during the offline sliding window model. The discriminative value of each itemset in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the sliding window frame 𝑊 is defined below.
Dis_value(𝐵) = 𝑟𝑖𝑤(𝐵) / 𝜃, if ∑𝑏∈𝐵 𝑓𝑗𝑤(𝑏) = 0
Dis_value(𝐵) = 𝑟𝑖𝑤(𝐵) / 𝑟𝑗𝑤(𝐵), if ∑𝑏∈𝐵 𝑓𝑗𝑤(𝑏) > 0 (5-b)

where 𝑟𝑖𝑤(𝐵) = (∑𝑏∈𝐵 𝑓𝑖𝑤(𝑏)) / 𝑛𝑖𝑤 and 𝑟𝑗𝑤(𝐵) = (∑𝑏∈𝐵 𝑓𝑗𝑤(𝑏)) / 𝑛𝑗𝑤. 𝑟𝑖𝑤(𝐵) is called the relative support of 𝐵 in 𝑆𝑖; 𝑟𝑖𝑤(𝐵) and 𝑟𝑗𝑤(𝐵) are the sums of the relative supports of the itemsets in 𝐵 in 𝑆𝑖 and 𝑆𝑗, respectively.
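Equations (5-a) and (5-b) can be sketched directly. The following is a hedged Python sketch (names are illustrative, not from the thesis); frequencies are given as dictionaries keyed by the itemsets 𝑏 ∈ 𝐵.

```python
def dis_value(B, fi, fj, ni, nj, theta):
    """Discriminative value of a combination B of itemsets (Eq. 5-b):
    the ratio of the summed relative supports in S_i and S_j, with the
    theta denominator when B never occurs in the general stream."""
    sum_fi = sum(fi[b] for b in B)  # f_i^w(B), Eq. 5-a
    sum_fj = sum(fj[b] for b in B)
    ri = sum_fi / ni                # relative support r_i^w(B)
    if sum_fj == 0:
        return ri / theta
    rj = sum_fj / nj                # relative support r_j^w(B)
    return ri / rj

# Frequencies of the two stable itemsets from the running example
# (written here as illustrative string keys):
fi = {'a(3,2)': 3, 'a(1,0)': 1}
fj = {'a(3,2)': 2, 'a(1,0)': 0}
B = ['a(3,2)', 'a(1,0)']
print(dis_value(B, fi, fj, ni=20, nj=20, theta=2))  # → 2.0
```

A combination 𝐵 is then potential in the sense of (5-c) when its summed frequency reaches 𝜑𝜃𝑛𝑖𝑤 and its `dis_value` reaches 𝜃.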
A potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 is stable if all potential discriminative itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 are stable during the offline sliding window model, as below.

Potential itemsets = {𝐵 ∈ 𝒮 ∣ ∑𝑏∈𝐵 𝑓𝑖𝑤(𝑏) ≥ 𝜑𝜃𝑛𝑖𝑤 and Dis_value(𝐵) ≥ 𝜃} (5-c)
In order to find the updated and stable potential discriminative itemsets, all possible itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 would have to be generated. However, the generation of all possible itemset combinations is time-consuming. In this chapter, we propose a simple method for calculating Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) and estimating Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) in the S-DISSparse method.
The itemset is stable if it is summed up from the frequencies of stable itemsets only. Let the 𝐵 with maximum 𝑅𝑖𝑗𝑤(𝐵) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 be defined as 𝐵𝑚𝑎𝑥. Initially, 𝐵𝑚𝑎𝑥 is initialized by summing up the 𝑓𝑖𝑤(𝑏) frequencies of the itemsets 𝑏 with 𝑓𝑗𝑤(𝑏) = 0. The frequencies of the itemset 𝑏 with the maximum frequency ratio are then added to 𝐵𝑚𝑎𝑥 only if they increase its discriminative value, i.e., 𝑅𝑖𝑗𝑤(𝐵𝑚𝑎𝑥). 𝐵𝑚𝑎𝑥 is considered updated if it is summed up with the frequencies of any updated itemset 𝑏. 𝐵𝑚𝑎𝑥 is also considered updated if the discriminative value of 𝐵𝑚𝑎𝑥 summed up with any updated 𝑏 is larger than the discriminative level 𝜃 (i.e., the overall frequencies are only tested, not summed up); for example, in Figure 5.5 the maximum discriminative value of the itemsets in the left-most subtree, 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐, is equal to 2, i.e.,
Max_dis_value(𝑐, 𝑎) = 2, which is calculated by the sum of frequencies of two stable itemsets
ending with the items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐), i.e., 𝑎1,0 and 𝑎3,2.
Following Definition 5.1, if 𝑓𝑗𝑤(𝑏𝑚𝑎𝑥) = 0 then 𝑅𝑖𝑗𝑤(𝑏𝑚𝑎𝑥) = 𝑓𝑖𝑤(𝑏𝑚𝑎𝑥) / (𝜃𝑛𝑖). The Max_freq𝑖(𝑐, 𝑎) = 4, which is calculated as the sum of the frequencies of the same stable itemsets, i.e., 𝐼(𝑎1,0) and 𝐼(𝑎3,2), and 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 is defined as a stable subtree. Algorithm 5.1 is proposed for calculating Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) and for finding the stable 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the S-DISSparse method for mining potential discriminative itemsets. The statement 𝑓𝑤(𝑏𝑚𝑎𝑥) += 𝑓𝑤(𝑏) in the algorithm is, for the sake of simplicity, shorthand for the two statements 𝑓𝑖𝑤(𝑏𝑚𝑎𝑥) += 𝑓𝑖𝑤(𝑏) and 𝑓𝑗𝑤(𝑏𝑚𝑎𝑥) += 𝑓𝑗𝑤(𝑏).
Algorithm 5.1 Stable, updated Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎), Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎)
Input: (1) 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡; (2) header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡).
Output: (1) stable or updated Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎); (2) stable or updated Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎).
Begin
1) Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) = 0; 𝐹𝑖𝑤 = 0; 𝐹𝑗𝑤 = 0;
2) For each item 𝑏 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) do
3) Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) += 𝑓𝑖𝑤(𝐼(𝑏));
4) If 𝐼(𝑏) is updated then Tag Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) as updated; End if;
5) If 𝑓𝑗𝑤(𝐼(𝑏)) = 0 then 𝐹𝑖𝑤 += 𝑓𝑖𝑤(𝐼(𝑏)); Tag 𝑏 as checked;
6) If 𝐼(𝑏) is updated then Tag 𝐹𝑖𝑤 as updated; End if;
7) End if;
8) End for;
9) While ∃𝑏 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑏 is unchecked do
10) Find 𝐼(𝑏) with maximum 𝑅𝑖𝑗𝑤(𝐼(𝑏));
11) If (𝐹𝑖𝑤 + 𝑓𝑖𝑤(𝐼(𝑏))) / (𝐹𝑗𝑤 + 𝑓𝑗𝑤(𝐼(𝑏))) > 𝐹𝑖𝑤 / 𝐹𝑗𝑤 Or (𝐹𝑖𝑤 + 𝑓𝑖𝑤(𝐼(𝑏))) / (𝐹𝑗𝑤 + 𝑓𝑗𝑤(𝐼(𝑏))) > 𝜃 then 𝐹𝑖𝑤 += 𝑓𝑖𝑤(𝐼(𝑏)); 𝐹𝑗𝑤 += 𝑓𝑗𝑤(𝐼(𝑏));
12) End if;
13) Tag 𝑏 as checked;
14) If 𝐼(𝑏) is updated then Tag 𝐹𝑖𝑤 as updated; End if;
15) End While;
16) Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) = (𝐹𝑖𝑤 ∗ 𝑛𝑗𝑤) / (𝐹𝑗𝑤 ∗ 𝑛𝑖𝑤);
17) If 𝐹𝑖𝑤 is updated then Tag Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) as updated; End if;
18) Return stable or updated Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎), Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎);
End.
In the above algorithm, the items in 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) are scanned in two separate loops. In the first loop, Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) is calculated based on all itemsets in the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡. If any 𝐼(𝑏) is updated then Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) is tagged as updated. 𝐹𝑖𝑤 is also initialized as the sum of 𝑓𝑖𝑤(𝐼(𝑏)) over the itemsets with 𝑓𝑗𝑤(𝐼(𝑏)) = 0, and is tagged as updated if any such 𝐼(𝑏) is updated. 𝐹𝑖𝑤 and 𝐹𝑗𝑤 are used for calculating Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎). In the second loop, using a greedy method, 𝐹𝑖𝑤 and 𝐹𝑗𝑤 are updated by adding the frequencies of the itemset 𝐼(𝑏) with maximum 𝑅𝑖𝑗𝑤(𝐼(𝑏)) if those frequencies increase the ratio between 𝐹𝑖𝑤 and 𝐹𝑗𝑤, or if the ratio between 𝐹𝑖𝑤 and 𝐹𝑗𝑤 becomes greater than the discriminative level 𝜃. If any 𝐼(𝑏) is updated then 𝐹𝑖𝑤 is tagged as updated, causing Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) to be tagged as updated as well. In the second loop each item is checked one time, and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) is calculated based on the selected items.
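The two loops can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: each header item 𝑏 is assumed to carry its itemset frequencies and an updated flag as a `(fi, fj, updated)` triple, and the zero-denominator ratio follows the 𝑓𝑗 = 0 convention of Definition 5.1 as cited above; both choices are assumptions of this sketch.

```python
def max_freq_and_dis_value(items, ni, nj, theta):
    """Greedy computation of Max_freq_i(root, a) and Max_dis_value(root, a)
    in the spirit of Algorithm 5.1. `items` holds one (fi, fj, updated)
    triple per itemset I(b) in the subtree."""
    max_freq = sum(fi for fi, _, _ in items)
    freq_updated = any(upd for _, _, upd in items)
    # First loop: seed F_i^w with the itemsets that never occur in S_j.
    Fi = sum(fi for fi, fj, _ in items if fj == 0)
    Fj = 0
    dis_updated = any(upd for _, fj, upd in items if fj == 0)

    def ratio(fi_, fj_):
        # f_j = 0 convention (Definition 5.1): R = f_i / (theta * n_i)
        return fi_ / (theta * ni) if fj_ == 0 else (fi_ / ni) / (fj_ / nj)

    # Second loop: take the unchecked I(b) in decreasing ratio order and
    # add its frequencies only if that raises the F_i/F_j ratio or pushes
    # it above the discriminative level theta.
    for fi, fj, upd in sorted((it for it in items if it[1] > 0),
                              key=lambda it: ratio(it[0], it[1]),
                              reverse=True):
        if ratio(Fi + fi, Fj + fj) > ratio(Fi, Fj) or ratio(Fi + fi, Fj + fj) > theta:
            Fi += fi
            Fj += fj
        if upd:
            dis_updated = True
    max_dis = ratio(Fi, Fj)  # equals (F_i * n_j) / (F_j * n_i) when F_j > 0
    return max_freq, freq_updated, max_dis, dis_updated

# Stable itemsets I(a1,0) and I(a3,2) from the running example:
items = [(1, 0, False), (3, 2, False)]
print(max_freq_and_dis_value(items, ni=20, nj=20, theta=2))
# → (4, False, 2.0, False): Max_freq = 4, Max_dis_value = 2, both stable
```

On the running example this reproduces Max_freq𝑖(𝑐, 𝑎) = 4 and Max_dis_value(𝑐, 𝑎) = 2 from the two stable itemsets, with both quantities tagged stable.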
All discriminative itemsets in a stable 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 are stable in the sliding window frame 𝑊 compared to the current state in S-DISStream. A potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 can have been non-potential before offline sliding, with different data stream lengths in the sliding window frame 𝑊, and new discriminative itemsets may be discovered in the sliding window frame 𝑊, for example, by decreasing the length of the target data stream 𝑆𝑖 or increasing the data stream length ratio 𝑛𝑗𝑤/𝑛𝑖𝑤. A 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) that satisfies the conditions Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜑𝜃𝑛𝑖𝑤 and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜃 with a smaller frequency in the target data stream 𝑆𝑖, or a smaller frequency ratio in the target data stream 𝑆𝑖 vs the general data stream 𝑆𝑗, compared to the last size of the sliding window frame 𝑊 (i.e., the last offline window sliding), is not stable. For the sake of clarity, this part is not represented in Algorithm 5.1, considering that all partitions are of the same size and contain an equal number of transactions. The algorithm has to be modified by holding the lengths of the data streams at the last offline sliding, and comparing the Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) calculated with the recent and the old data stream lengths.
A potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the sliding window model is processed in a different way.
The first heuristic is formally defined below.
HEURISTIC 5.1. A potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 is stable denoted as 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) if all
potential discriminative itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 that satisfy the following conditions are stable:
1. Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜑𝜃𝑛𝑖𝑤
2. Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) ≥ 𝜃
Where 𝜃 > 1 is the discriminative level threshold, 𝜑 𝜖 (0, 1 𝜃⁄ ) is the support threshold, 𝑛𝑖𝑤 is
the size of target data stream 𝑆𝑖 in the sliding window frame 𝑊 and Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑎) and
Max_dis_value(𝑟𝑜𝑜𝑡, 𝑎) are stable if any itemset in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 that satisfies the two
conditions is stable.
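The way HEURISTIC 5.1 is used can be sketched as a gate in front of combination generation. This is an illustrative Python sketch of the decision only (the function name and the three outcome labels are my own, not the thesis's); it assumes the stable/updated flags produced by Algorithm 5.1.

```python
def process_subtree(max_freq_i, freq_updated, max_dis_value, dis_updated,
                    ni, theta, phi):
    """Decide how to treat a subtree during offline sliding.
    A subtree that is not potential is skipped entirely; a potential but
    stable subtree (HEURISTIC 5.1) only has its existing itemsets
    re-tagged in S-DISStream; otherwise combinations are regenerated."""
    potential = max_freq_i >= phi * theta * ni and max_dis_value >= theta
    if not potential:
        return 'skip'
    if not (freq_updated or dis_updated):
        return 'retag-stable'        # re-tag by the recent stream lengths only
    return 'generate-combinations'   # may contain new discriminative itemsets

# Subtree_c from the running example: potential but stable.
print(process_subtree(4, False, 2.0, False, ni=20, theta=2, phi=0.1))  # → retag-stable
```

The efficiency gain comes from the middle branch: a stable potential subtree avoids the exponential combination generation and only touches the already-stored itemsets in S-DISStream.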
Lemma 5-1 (Stable subtree) HEURISTIC 5.1 ensures that any discriminative itemset in
a 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) is stable in the sliding window model.
Proof. The two conditions in HEURISTIC 5.1 ensure that any itemset that is frequent in
the target data stream 𝑆𝑖 and has discriminative value larger than the discriminative level 𝜃 is
stable in the sliding window model. This implies that all discriminative itemsets in a
𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) exist in the sliding window model by processing the
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) during previous offline sliding in the window model. The itemset
combinations in a 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) are tagged as discriminative or non-discriminative in
the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 of S-DISStream, based on the recent data stream lengths in sliding window
frame 𝑊. This implies that the discriminative itemsets are discovered based on the recent data
stream lengths that have been changed by adding the new partition and deleting the oldest
partition in the sliding window frame 𝑊.
For a subtree, if any of the two conditions is updated, the subtree is considered a potential discriminative subtree and the potential discriminative itemset combinations are generated from the subtree, as it may contain new discriminative itemsets.
∎
In Figure 5.5, the left-most subtree, related to processing Header-Table item 𝑎 under root node 𝑐 (i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐), is stable, as the only subset of itemsets in the subtree that satisfies the conditions in HEURISTIC 5.1, i.e., 𝐼(𝑎3,2) and 𝐼(𝑎1,0) with 𝑎3,2, 𝑎1,0 ∈ 𝐵 and 𝐵 ∈ 𝒮, is stable, as in the conditions below.

∑𝑏∈𝐵 𝑓𝑖𝑤(𝑏) = 3 + 1 = 4 ≥ (𝜑𝜃𝑛𝑖𝑤 = 0.1 ∗ 2 ∗ 20 = 4) (5-d)

Dis_value(𝐵) = (∑𝑏∈𝐵 𝑓𝑖𝑤(𝑏) / ∑𝑏∈𝐵 𝑓𝑗𝑤(𝑏)) ∗ (𝑛𝑗𝑤 / 𝑛𝑖𝑤) = ((3 + 1) / (2 + 0)) ∗ (20 / 20) = 4/2 = 2 ≥ (𝜃 = 2) (5-e)
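The arithmetic of conditions (5-d) and (5-e) can be checked mechanically; the following is a trivial Python sketch of the running example's numbers (the variable names are illustrative).

```python
# Stable itemsets I(a3,2) and I(a1,0): (f_i, f_j) pairs in window W
freqs = [(3, 2), (1, 0)]
ni = nj = 20
theta, phi = 2, 0.1

sum_fi = sum(fi for fi, _ in freqs)  # 3 + 1 = 4
sum_fj = sum(fj for _, fj in freqs)  # 2 + 0 = 2
print(sum_fi >= phi * theta * ni)              # condition (5-d) → True
print((sum_fi / sum_fj) * (nj / ni) >= theta)  # condition (5-e) → True
```

Both conditions hold with equality here (4 ≥ 4 and 2 ≥ 2), which is why 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 qualifies as potential yet stable.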
In this chapter, for the sake of simplicity, the dataset lengths are omitted from the ratios, as 𝑛1 = 𝑛2. In the case of data streams with different lengths (i.e., 𝑛2/𝑛1 ≠ 1), the ratios must be multiplied by the constant 𝑛2/𝑛1. The conditional FP-Tree of Header-Table item 𝑎, made out of the S-FP-Tree updated with partition 𝑃2, is presented in Figure 5.5.
Figure 5.5 Conditional FP-Tree of Header-Table item 𝑎 updated by partition 𝑃2
The 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) is excluded from itemset combination generation. The itemset combinations of a 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) in the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 of S-DISStream are traversed using Header-Table links and tagged as discriminative or non-discriminative based on the recent data stream lengths in the sliding window frame 𝑊; for example, in 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) the itemset 𝐼(𝑎4,2), i.e., 𝑐𝑏𝑎4,2, in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 of S-DISStream ending with 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐), is tagged as discriminative as in Figure 5.6.
Figure 5.6 Updated S-DISStream after processing 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) in conditional FP-Tree
for Header-Table item 𝑎
A 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) in a 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) (i.e., 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡) in the conditional FP-Tree in the sliding window frame 𝑊 satisfies the two conditions Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜑𝜃𝑛𝑖𝑤 and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜃, where 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) is the set of itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 ending with a header item 𝑎 ∊ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) with the internal node 𝑖𝑛 as a subset.
Let 𝒮 be the power set of 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡, i.e., 𝒮 = 2^𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡, and let an itemset 𝐼 with a subset 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡, ending with 𝑎 ∈ 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡), be denoted as 𝐼(𝑖𝑛). The frequency of each itemset in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 with a subset of internal node 𝑖𝑛, in respect of the data stream 𝑆𝑖 in the sliding window frame 𝑊, is defined below (i.e., 𝐵 ∈ 𝒮).

𝑓𝑖𝑤(𝐵) = ∑𝑏∈𝐵 𝑓𝑖𝑤(𝑏) (5-f)
The frequency of an itemset in a subtree is stable if all 𝑏 ∈ 𝐵 are stable during the offline sliding window model. A potential internal node 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 is stable if all potential discriminative itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 with internal node 𝑖𝑛 as a subset are stable during the offline sliding window model.
All discriminative itemsets in a 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 with a subset of a stable 𝑖𝑛 are stable in the sliding window frame 𝑊 compared to the current state in S-DISStream. A potential internal node 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 can have been non-potential before offline sliding, with different data stream lengths in the sliding window frame 𝑊, and new discriminative itemsets may be discovered in the sliding window frame 𝑊, for example, by decreasing the length of the target data stream 𝑆𝑖 or increasing the data streams' length ratio 𝑛𝑗𝑤/𝑛𝑖𝑤. A 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) that satisfies the conditions Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜑𝜃𝑛𝑖𝑤 and Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜃 with a smaller frequency in the target data stream 𝑆𝑖, or a smaller frequency ratio in the target data stream 𝑆𝑖 vs the general data stream 𝑆𝑗, compared to the last size of the sliding window frame 𝑊 (i.e., the last offline window sliding), is not stable.
The potential internal node 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 in the sliding window model is
processed in a different way.
The second heuristic is formally defined below.
HEURISTIC 5.2. An internal node 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 is stable denoted as 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) if
all potential discriminative itemsets in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 with internal node 𝑖𝑛 as subset that satisfy
the following conditions are stable.
1. Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜑𝜃𝑛𝑖𝑤
2. Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) ≥ 𝜃
Where 𝜃 > 1 is the discriminative level threshold, 𝜑 𝜖 (0, 1 𝜃⁄ ) is support threshold, 𝑛𝑖𝑤 is the
size of target data stream 𝑆𝑖 in the sliding window frame 𝑊 and Max_freq𝑖(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) and
Max_dis_value(𝑟𝑜𝑜𝑡, 𝑖𝑛, 𝑎) are stable if any itemset in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 with internal node 𝑖𝑛 as
subset that satisfies the two conditions is stable.
Lemma 5-2 (Stable internal node) HEURISTIC 5.2 ensures that any discriminative
itemset in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡with subset of a 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) is stable in the sliding window model.
Proof. The two conditions in HEURISTIC 5.2 ensure that any itemset with subset of
internal node 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 that is frequent in the target data stream 𝑆𝑖 and has
discriminative value larger than the discriminative level 𝜃 is stable in the sliding window model.
This implies that all discriminative itemsets with the subset of 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) exist in the sliding
window model by processing the 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) in a potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 during a previous
offline sliding of the window model. The itemset combinations with a 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) are tagged as
discriminative or non-discriminative in the 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 of S-DISStream, based on the recent
data stream lengths in sliding window frame 𝑊. This implies that the discriminative itemsets are
discovered based on the recent data stream lengths that have been changed by adding the new
partition and deleting the oldest partition in the sliding window frame 𝑊.
For an 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡, if any of the two conditions is updated, the internal node is considered a potential discriminative internal node and the potential discriminative itemset combinations with the internal node as a subset are generated from the subtree, as it may contain new discriminative itemsets.
∎
Every itemset with a 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) as a subset is stable in the sliding window model. For example, in Figure 5.5, in the left-most subtree related to processing Header-Table item 𝑎 under root node 𝑐 (i.e., 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐), the internal node 𝑏 is stable, as the only subset of itemsets in the subtree that satisfies the conditions in HEURISTIC 5.2, i.e., made of 𝐼(𝑏3,2) and 𝐼(𝑏1,0) with 𝑏3,2, 𝑏1,0 ∈ 𝐵 and 𝐵 ∈ 𝒮, is stable. The 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) is excluded from itemset combination generation. The itemset
combinations with a subset of 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 of S-DISStream are traversed using
Header-Table links and tagged as discriminative or non-discriminative based on the recent data
stream lengths in the sliding window frame 𝑊; for example, in Figure 5.6 the itemset 𝑐𝑏𝑎4,2 in
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 of S-DISStream ending with 𝐻𝑒𝑎𝑑𝑒𝑟_𝑇𝑎𝑏𝑙𝑒_𝑖𝑡𝑒𝑚𝑠(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐) with subset of
internal node 𝑏, is tagged as discriminative.
Following the running Example 5.1, the conditional FP-Tree of Header-Table item 𝑎 is expanded by the sub-branches of 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑐 (i.e., 𝑏𝑑𝑎3,2, 𝑏𝑎1,0 and 𝑎1,5), as in Figure 5.7. The 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏) and its 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) are not stable, and S-DISStream is updated with the new itemset combinations generated out of the potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏, as in Figure 5.8. The discriminative itemsets (i.e., 𝑏𝑑𝑎4,2 and 𝑏𝑎8,3) are discovered in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏. Potential discriminative itemsets from processing the old partitions may already exist in S-DISStream and be overwritten; for example, in Figure 5.8 the itemset 𝑏𝑎5,2, which exists in S-DISStream from processing partition 𝑃1 as in Figure 5.3, is overwritten with the new frequencies as 𝑏𝑎8,3.
Figure 5.7 Expanded conditional FP-Tree of Header-Table item 𝑎 updated by partition 𝑃2
after processing the first subtree
Figure 5.8 Updated S-DISStream after processing potential discriminative subsets of the left-
most subtree in conditional FP-Tree for Header-Table item 𝑎
The conditional FP-Tree of Header-Table item 𝑎 is then expanded by sub-branches of
𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑏 (i.e., 𝑑𝑎6,3 and 𝑎2,0) as in Figure 5.9. The 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑑) is not stable, and S-
DISStream is updated by the discriminative itemset 𝑑𝑎6,3 as in Figure 5.10.
Figure 5.9 Expanded conditional FP-Tree of Header-Table item a updated by partition 𝑃2
after processing the second subtree
Figure 5.10 Updated S-DISStream after processing potential discriminative subsets of the
left-most subtree in conditional FP-Tree for Header-Table item 𝑎
Following the bottom-up order of Desc-Flist, the conditional FP-Tree is then generated
for the rest of the Header-Table items respectively (i.e., item 𝑒 in Example 5.1 as in Table 5.2). The
new discriminative itemsets in each potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 are inserted into S-DISStream. The tags
of itemsets in S-DISStream that belong to the 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) are updated
based on the recent data stream lengths in the sliding window frame 𝑊. The frequencies of
itemsets in S-DISStream that are not updated (i.e., those belonging to a non-potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 or
with a subset of a non-potential 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡) must be adjusted based on their
appearances in the S-FP-Tree, followed by updating the tags of itemsets, as explained in the section
below.
5.3.2.3 S-DISStream tuning and pruning in offline sliding window
After processing the last conditional FP-Tree (e.g., the conditional FP-Tree of item 𝑐 in
Example 5.1), the offline sliding continues by checking the itemsets in S-DISStream that have not
been updated. The itemsets that belong to a non-potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡, or with a subset of a non-
potential 𝑖𝑛 ∊ 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡, are not updated during batch processing. The itemsets in S-
DISStream are checked by traversing the Header-Table links; the frequencies of the
itemsets that have not been updated are tuned based on their appearances in the S-FP-Tree, and the
itemsets are tagged as discriminative or non-discriminative based on the recent data stream lengths.
In contrast to the Apriori property, and distinguishing discriminative itemset mining from
frequent itemset mining, non-discriminative itemsets can appear as subsets of discriminative
itemsets. For example, the items 𝑐 and 𝑑 in Example 5.1 are subsets of discriminative itemsets as
in Figure 5.11, but they are not discriminative. The frequencies of non-discriminative itemsets
appearing as subsets of discriminative itemsets must also be set accordingly using the S-FP-
Tree. These itemsets may become involved in the online sliding window, as explained in
Section 5.4. Tuning the frequencies of the not-updated itemsets and non-discriminative subsets is
not a time-consuming process, considering that the discriminative itemsets are sparse, with a small
number of non-discriminative subsets.
Lemma 5-3 (Exact non-discriminative subsets) Tuning the frequencies of the non-
discriminative itemsets appearing as subsets of discriminative itemsets using the S-FP-Tree ensures
the exact frequencies of these itemsets in S-DISStream, which may become involved in the online
window model updating.
Proof. The S-FP-Tree is the superset of the conditional FP-Trees and has a full view of all
itemsets in the datasets in the sliding window frame 𝑊. The exact frequencies of the non-
discriminative subsets are collected accurately from their appearances in the S-FP-Tree by
traversing the Header-Table links.
∎
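The frequency collection in the proof can be illustrated with a standard FP-tree node-link traversal. This is a minimal sketch under assumed structures (an `FPNode` class and a `header_table` dict of node-link chain heads); the thesis' S-FP-Tree keeps a pair of frequencies per node, which is simplified to a single count here.

```python
# Sketch: collect the exact frequency of an itemset by following the
# Header-Table node-link chain of its last item and checking each branch.
class FPNode:
    def __init__(self, item, freq, parent=None):
        self.item, self.freq, self.parent = item, freq, parent
        self.node_link = None   # next node carrying the same item

def exact_frequency(header_table, itemset):
    """Sum the counts of every branch that contains all items of `itemset`."""
    total = 0
    node = header_table.get(itemset[-1])
    while node is not None:
        branch = set()
        p = node.parent
        while p is not None:             # climb towards the root
            branch.add(p.item)
            p = p.parent
        if set(itemset[:-1]) <= branch:  # all prefix items lie on this branch
            total += node.freq
        node = node.node_link
    return total
```

In the S-FP-Tree the same traversal is run per stream, so both the target and general frequencies of each non-discriminative subset are recovered exactly.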
Tail pruning is applied in S-DISStream for space saving. An itemset in S-DISStream
in the sliding window model is pruned if it is non-discriminative and remains a leaf node. Tail
pruning ensures that S-DISStream maintains only the discriminative itemsets and the non-
discriminative subsets in the sliding window frame 𝑊. The final S-DISStream after offline
sliding by partition 𝑃2 as in Table 5.3 and tail pruning is presented in Figure 5.11, with the eight
discriminative itemsets as listed in the table. Tail pruning is also applied in the S-FP-Tree structure,
following the same process, for deleting the old transactions that are outside the sliding window frame
𝑊 and have zero frequencies in the data streams.
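The tail pruning rule above, removing a node once it is non-discriminative and a leaf, can be sketched as a bottom-up traversal. The `Node` class and its `discriminative` tag are illustrative assumptions standing in for the S-DISStream nodes:

```python
# Illustrative tail pruning: prune bottom-up so that a chain of
# non-discriminative nodes collapses once its discriminative leaves are gone.
class Node:
    def __init__(self, discriminative, children=None):
        self.discriminative = discriminative
        self.children = children or {}   # item -> child Node

def tail_prune(node):
    """Prune the subtree of `node`; return True if `node` itself is prunable."""
    node.children = {item: child for item, child in node.children.items()
                     if not tail_prune(child)}
    # Prunable once it is a leaf and tagged non-discriminative.
    return not node.children and not node.discriminative
```

Because children are pruned before the node itself is tested, an internal non-discriminative node is removed as soon as all of its discriminative descendants are gone, matching the leaf-only pruning rule.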
Figure 5.11 Final S-DISStream after offline sliding based on partition 𝑃2
S-DISStream is proposed for dynamically monitoring the discovered discriminative
itemsets in an offline state. S-DISStream is updated by every new incoming batch of
transactions that fits in partition 𝑃𝑛𝑒𝑤 in the sliding window frame 𝑊, and by deleting the itemsets
that belong to the oldest partition 𝑃𝑜𝑙𝑑 falling out of the full-size sliding window frame 𝑊. The S-
DISStream structure is constructed and updated in the offline sliding window model by adapting
the efficient mining of discriminative itemsets. The online sliding window model is defined for
approximate online monitoring of the discriminative itemsets in the sliding window frame 𝑊
between two offline slides, as in the section below.
5.4 ONLINE SLIDING WINDOW MODEL
In offline sliding, the discriminative itemsets are reported only after the offline
processing of the new partition is finished, which may take time, depending on the size of the datasets
and their characteristics, and may not be acceptable for some real-time applications. Online sliding
is defined to discover the discriminative itemsets and report the change in trends in an online state
when any new single transaction arrives. To facilitate describing the method, several important
concepts and constructs are defined in the section below.
5.4.1 Mining discriminative itemsets in online sliding window using queue structure
The queue structure is defined for holding the transactions in the online sliding window
model, made of partitions with different sizes.
Transaction-List: This is a queue structure for keeping track of the transactions in the
online sliding window model, as in Figure 5.12. For each transaction that fits in the online sliding
window frame 𝑊, it holds the partition number, whether the transaction belongs to the target data
stream 𝑆𝑖 or the general data stream 𝑆𝑗, and a link to the transaction's node in the S-FP-Tree. The
Transaction-List contains only the recent transactions that fit in the defined online sliding window
frame 𝑊. The rest of the transactions are deleted when the window frame 𝑊 slides, as in Figure 5.12.
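A Transaction-List along these lines can be sketched with a bounded queue. The entry fields (partition number, stream label, S-FP-Tree node reference) follow the description above; the class itself and its method names are illustrative assumptions:

```python
from collections import deque

# Minimal Transaction-List sketch: a queue of the transactions inside the
# online sliding window frame W. Each entry records the partition number,
# which stream (S_i or S_j) the transaction belongs to, and a reference to
# its node in the S-FP-Tree.
class TransactionList:
    def __init__(self, window_size):
        self.window_size = window_size   # capacity of the window frame W
        self.queue = deque()

    def add(self, partition_id, stream_label, node_ref):
        """Append the newest transaction; once W is full, return the evicted
        oldest transaction (so its online itemsets can be decremented)."""
        evicted = None
        if len(self.queue) == self.window_size:
            evicted = self.queue.popleft()
        self.queue.append((partition_id, stream_label, node_ref))
        return evicted
```

Returning the evicted entry lets the caller perform the decrement side of the online slide described in the next section.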
Figure 5.12 Transaction-List made of partitions fit in the online sliding window frame 𝑊
The online sliding window model operates between two offline slides of the window
model and is limited to the itemsets in S-DISStream (i.e., the online sliding window frame 𝑊). The
itemsets in S-DISStream are the potential itemsets to change their tags during the online
sliding window model. The frequencies and tags of the itemsets existing in S-DISStream are
updated during online sliding, and no new itemset is generated. The window frame 𝑊 slides in
the online state between offline slides, while a new batch is loaded with transactions. Every
transaction updates the S-FP-Tree and is linked by the Transaction-List for online sliding. During
online sliding, every new transaction in the recent partition (i.e., 𝑃𝑛𝑒𝑤 as in Figure 5.1) in window
frame 𝑊 is checked for having a subset in S-DISStream by traversing the Header-Table
links. Subsets of the new transaction that exist in S-DISStream, called online itemsets, are used for online
sliding by increasing the itemset frequencies and updating the tags in S-DISStream; for
example, a discriminative itemset may become non-discriminative.
With each new incoming transaction, the oldest transaction in the Transaction-List is
deleted, taking its online itemsets out of the sliding window frame 𝑊, if it belongs to the oldest
partition (i.e., 𝑃𝑜𝑙𝑑 as in Figure 5.1). The online itemsets of the old transaction (i.e., its subsets that exist in S-
DISStream) are used for online sliding by decreasing the itemset frequencies and updating the
tags in S-DISStream. The online sliding continues for every new transaction until the end of
the new partition. The itemsets in S-DISStream that are updated during online sliding are
tagged as online. The online itemsets in S-DISStream hold the exact frequencies in the sliding
window frame 𝑊; however, they must be re-tagged after offline sliding, based on the recent data
stream lengths, during S-DISStream tuning and pruning as in Section 5.3.2.3. The S-DISStream
structure is updated in the online sliding window model by proposing one new corollary, as in the
section below.
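One online slide can be sketched as follows. For illustration only, S-DISStream is modelled as a flat dictionary mapping itemsets to [target, general] frequency pairs, and subsets are enumerated by brute force; the thesis instead traverses Header-Table links over the prefix-tree structure, so no exhaustive subset enumeration is needed there.

```python
from itertools import combinations

# Sketch of online sliding: only subsets of the transaction that already
# exist in S-DISStream (the "online itemsets") are updated; no new itemset
# is ever created during online sliding.
def online_slide(s_disstream, transaction, stream_idx, delta):
    """delta = +1 for a transaction entering P_new, -1 for one leaving P_old;
    stream_idx is 0 for the target stream S_i, 1 for the general stream S_j."""
    items = sorted(transaction)
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            if subset in s_disstream:      # only existing itemsets are updated
                s_disstream[subset][stream_idx] += delta
```

The same routine covers both sides of the slide: it is called with `delta=+1` for the newest transaction and with `delta=-1` for the transaction evicted from the Transaction-List.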
5.4.2 Improving the accuracy using relaxation ratio
HEURISTIC 5.1 and HEURISTIC 5.2 are modified by the relaxation 𝛼 for
holding the sub-discriminative itemsets in the sliding window frame 𝑊. The sub-discriminative
itemsets are saved in the sliding window model as potential discriminative itemsets, following
Definition 5.2 and based on the relaxation 𝛼.
Property 5.1. Modifying HEURISTIC 5.1 and HEURISTIC 5.2 based on the relaxation
𝛼 obtains the sub-discriminative itemsets.
This property says that the sub-discriminative itemsets are discovered, by choosing the
relaxation 𝛼, from among the non-discriminative itemsets. The sub-discriminative itemsets in the sliding
window frame 𝑊 are discovered for a better approximation of the itemset frequencies and itemset
frequency ratios in the online sliding window model.
Property 5.2. Using a smaller relaxation 𝛼 yields a better approximation of the
discriminative itemsets in the online sliding window model.
This property says that more sub-discriminative itemsets are discovered by choosing
a smaller relaxation 𝛼. This is a trade-off between a better approximation of the discriminative
itemsets in the online sliding window model and the computation cost.
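The trade-off can be made concrete with a simplified classification rule. The exact discriminative criterion (frequency and ratio thresholds over the two stream lengths) is defined in the earlier chapters; the ratio test below, as well as the function and parameter names, are illustrative assumptions only.

```python
# Simplified sketch: classify an itemset from its frequencies in the target
# and general streams. An itemset whose length-normalised frequency ratio
# reaches theta is discriminative; one reaching the relaxed bound
# alpha * theta (with 0 < alpha < 1) is kept as sub-discriminative.
def classify(freq_i, freq_j, len_i, len_j, theta, alpha):
    ratio = (freq_i / len_i) / max(freq_j / len_j, 1e-12)  # guard freq_j == 0
    if ratio >= theta:
        return "discriminative"
    if ratio >= alpha * theta:
        return "sub-discriminative"   # held for the online approximation
    return "non-discriminative"
```

Lowering `alpha` admits more sub-discriminative itemsets into S-DISStream, improving the online approximation at the cost of extra space and processing, which is exactly the trade-off stated in Property 5.2.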
The corollary is formally defined below.
Corollary 5-1. A refined approximate bound on the discriminative itemsets in the online
sliding window model is obtained by modifying HEURISTIC 5.1 and HEURISTIC 5.2
based on the relaxation 𝛼, where 𝛼 is the relaxation threshold for sub-discriminative itemsets, and
HEURISTIC 5.1 and HEURISTIC 5.2 are defined for potential discriminative itemset combination
generation during the offline sliding window model.
Rationale 5-1. (Highest refined approximate bound on discriminative itemsets in the
online sliding window model) Corollary 5-1 ensures that the approximation of the discriminative
itemsets in the online sliding window model may be improved by holding the sub-discriminative
itemsets in the S-DISStream structure in the online sliding window model.
Proof. The sub-discriminative itemsets improve the approximate bound on the discriminative
itemsets by increasing the number of potential discriminative itemsets under the relaxation 𝛼. We
call this the highest refined approximate bound on the discriminative itemsets in the online sliding
window model. Corollary 5-1 is more effective when the discriminative itemsets are stable
across neighbouring partitions, with fewer concept drifts present in the datasets.
∎
5.5 S-DISSPARSE METHOD
In this section we describe the process of efficient mining of discriminative itemsets using
the sliding window model in the S-DISSparse method, by effectively dealing with the explosion in
the number of generated itemsets in the online and offline states.
The S-DISSparse method utilizes the DISSparse algorithm (Seyfi et al. 2017) proposed
in Chapter 3 with the offline sliding window model. The DISSparse algorithm is used for the
batch processing in the offline updating of the sliding window model, using the S-FP-Tree and S-
DISStream structures proposed in this chapter, and the Header-Table, conditional FP-Tree and
minimized DISTree as defined in Chapter 3. The discriminative (and sub-discriminative) itemsets
are directly updated in the S-DISStream structure (i.e., the sliding window frame 𝑊), and the sliding
window model is updated in the offline and online states. The S-DISSparse method continues by
discovering the discriminative and sub-discriminative itemsets for the next batch of transactions that fits
in the new partition. To the best of our knowledge, the S-DISSparse method is the
first work on efficient mining of discriminative itemsets using the sliding window model with an
approximate bound guarantee.
5.5.1 S-DISSparse Algorithm
The S-DISSparse algorithm is presented by incorporating the two heuristics and one
corollary proposed in this chapter, for efficient discriminative itemset mining using the
sliding window model. The S-DISStream structure is updated with the exact set of discriminative
itemsets in offline time intervals, when the current batch of transactions 𝐵𝑛 is full (i.e., n ≥ 1).
The S-DISStream structure is updated with an approximate set of discriminative itemsets in a real-
time frame while the current batch 𝐵𝑛 (i.e., n > 1) is being loaded with transactions. The first batch of
transactions 𝐵1 is treated differently, by calculating all the item frequencies and making the Desc-
Flist based on the descending order of the item frequencies. The Desc-Flist order is used for
saving space by sharing the paths in the prefix trees, with the most frequent items at the top. This
Desc-Flist remains the same for all the upcoming batches in the data streams. The S-DISSparse
algorithm is single-pass for the rest of the batches of transactions. The input parameters,
discriminative level 𝜃, support threshold 𝜑 and relaxation 𝛼, are defined based on the
application domain, the data stream characteristics and sizes, or by domain expert users, as
discussed in Chapter 6.
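Constructing the Desc-Flist from the first batch can be sketched as a frequency count followed by a descending sort; the tie-breaking rule used here is an illustrative assumption, not part of the thesis definition.

```python
from collections import Counter

# Sketch: build the Desc-Flist from the first batch B1 by counting item
# occurrences over the combined streams and sorting by descending frequency.
# The resulting fixed order is reused for all later batches, so prefix-tree
# paths share their most frequent items near the root.
def make_desc_flist(batch):
    counts = Counter()
    for transaction in batch:
        counts.update(set(transaction))  # count each item once per transaction
    return [item for item, _ in
            sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
```

Since the order is frozen after 𝐵1, later concept drift can degrade the path sharing; the chapter summary discusses periodically rebuilding the order as a possible remedy.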
The S-FP-Tree is updated by adding the transactions from the recent batch of
transactions 𝐵𝑛 (i.e., the most current batch of transactions) that fits in partition 𝑃𝑛𝑒𝑤, without pruning
infrequent items, and by making the Transaction-List. The first partition is processed using the
DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3, and S-DISStream is generated
from the discriminative itemsets and non-discriminative subsets in the transactions fitting in partition 𝑃1.
With every new transaction in 𝑃𝑛𝑒𝑤, the Transaction-List is updated. The online sliding window is
updated by the online itemsets in S-DISStream (i.e., increasing the frequency of online itemsets in 𝑃𝑛𝑒𝑤
and decreasing the frequency of online itemsets in 𝑃𝑜𝑙𝑑). The online itemsets in S-DISStream are
tagged as discriminative or non-discriminative based on their updated frequencies and the data
stream lengths. By the end of the online sliding of 𝑃𝑛𝑒𝑤, 𝑃𝑜𝑙𝑑 is checked for online sliding of
the remaining transactions (i.e., when 𝑃𝑜𝑙𝑑 has a larger number of transactions than 𝑃𝑛𝑒𝑤).
During offline sliding, S-DISStream is updated by the discriminative itemsets in each
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and its 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛) in an offline state. The tags in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in
S-DISStream are updated by checking the itemsets in 𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) in
S-DISStream based on the recent data stream lengths. The sub-discriminative itemsets are
discovered for a better approximation bound on the discriminative itemsets in the online sliding
window model, by the relaxation 𝛼 and based on HEURISTIC 5.1 and HEURISTIC 5.2
modified using Corollary 5-1. By the end of the offline sliding of 𝑃𝑛𝑒𝑤, the exact frequencies of the
non-discriminative subsets not updated in S-DISStream are tuned based on their appearances in
the S-FP-Tree, and tail pruning is applied in the S-DISStream and S-FP-Tree structures. The online
itemsets in S-DISStream are also tagged based on the recent data stream lengths, and the process
continues with the next partition. The discriminative itemsets in the target data stream 𝑆𝑖 against the
general data stream 𝑆𝑗 are reported in offline time intervals in the sliding window frame 𝑊 in
𝐷𝐼𝑖𝑗𝑊, and the S-DISSparse algorithm continues with the new incoming batch of transactions 𝐵𝑛+1.
Algorithm 5.2 (S-DISSparse: Mining Discriminative Itemsets in Data
Streams using the Sliding Window Model)
Input: (1) The discriminative level 𝜃; (2) The support threshold 𝜑; (3) The
relaxation 𝛼; and (4) incoming batches of transactions that fit in partitions 𝑃, with
alphabetically ordered items, belonging to data streams 𝑆𝑖 and 𝑆𝑗.
Output: 𝐷𝐼𝑖𝑗𝑊, the set of discriminative itemsets in 𝑆𝑖 against 𝑆𝑗 in the sliding
window frame 𝑊 (S-DISStream structure), in online and offline states.
Begin
1) Make S-FP-Tree based on 𝐵1 that fits in 𝑃1 and update Transaction-List;
2) Process 𝑃1 using DISSparse algorithm (Seyfi et al. 2017) and make S-
DISStream;
3) While not end of streams do
4) Untag S-FP-Tree and S-DISStream;
5) While not end of partition 𝑃𝑛𝑒𝑤 do // Online sliding window
6) Update S-FP-Tree and Transaction-List by new transaction;
7) If added transaction in partition 𝑃𝑛𝑒𝑤 in 𝑊 has online itemset then
8) Update online itemsets in S-DISStream; // increase frequency
9) If deleted transaction in partition 𝑃𝑜𝑙𝑑 in 𝑊 has online itemset then
10) Update online itemsets in S-DISStream; // decrease frequency
11) End While;
12) Delete remained transactions in partition 𝑃𝑜𝑙𝑑 in online state;
13) Update S-DISStream by discriminative itemsets in every
𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑃𝑜𝑡𝑒𝑛𝑡𝑖𝑎𝑙(𝑖𝑛);
14) Update tags in 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in S-DISStream for itemsets in every
𝑆𝑡𝑎𝑏𝑙𝑒(𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡) and 𝑆𝑡𝑎𝑏𝑙𝑒(𝑖𝑛) based on HEURISTIC 5.1 and
HEURISTIC 5.2 modified by Corollary 5-1;
15) Tune non-discriminative subsets and tag online itemsets in S-DISStream
and apply tail pruning;
16) Report discriminative itemsets 𝐷𝐼𝑖𝑗𝑊 in sliding window frame 𝑊;
17) End while;
End.
Theorem 5-1 below proves the precision of the S-DISSparse method.
Theorem 5-1 (Completeness and correctness of S-DISSparse): Based on Theorem 3-3,
the DISSparse method (Seyfi et al. 2017) discovers the exact set of discriminative itemsets in
offline sliding states. Based on Lemma 5-1 and Lemma 5-2, the updated potential discriminative
itemsets in each potential 𝑆𝑢𝑏𝑡𝑟𝑒𝑒𝑟𝑜𝑜𝑡 in the conditional FP-Tree and its 𝐼𝑛𝑡𝑒𝑟𝑛𝑎𝑙 𝑛𝑜𝑑𝑒𝑟𝑜𝑜𝑡 are
generated completely in S-DISStream, and all stable discriminative itemsets are tagged in S-
DISStream correctly based on the recent data stream lengths. Based on Lemma 5-3, the
frequencies of the non-discriminative itemsets that appear as subsets of discriminative itemsets are
collected accurately. These prove the completeness and correctness of the S-DISSparse method
in discovering all the discriminative itemsets and their non-discriminative subsets in the offline
sliding window model.
5.5.2 S-DISSparse Algorithm Complexity
In the S-DISSparse algorithm, the significant part attracting considerable complexity is
the generation of the potential discriminative itemsets and the updating of the tags of stable
itemsets in the S-DISStream structure. Tuning the frequencies of the non-discriminative subsets in
S-DISStream and applying tail pruning in S-DISStream and the S-FP-Tree have less
complexity, considering the sparsity of the discriminative itemsets. The online sliding in lines 5 to
12 is based on a quick search method on the S-DISStream structure. The offline sliding is based
on the potential subtrees and the potential internal nodes (i.e., the updated itemsets). The stable
itemsets in the S-DISStream structure are also checked based on a quick search method and tagged as
discriminative or non-discriminative itemsets.
The efficiency of the S-DISSparse algorithm is discussed in detail by evaluating the
algorithm with the input data streams in Chapter 6. Empirical analysis shows the performance of
the proposed method under different parameter settings (e.g., the relaxation 𝛼). The
efficiency of the S-DISSparse algorithm is discussed on large and fast-growing data streams, for
exact mining of discriminative itemsets in the offline sliding window model, and with an approximate
bound guarantee in the online sliding window model.
5.6 CHAPTER SUMMARY
The S-DISSparse method proposed in this chapter is applicable to large datasets.
The DISSparse method utilized for offline batch processing is efficient for mining discriminative
itemsets in data streams, based on the heuristics proposed in this chapter. The sliding window
frame is divided into smaller partitions for offline sliding, and it is also updated by online
transactions for online sliding. The exact discriminative itemsets in the data streams are held in the
proposed novel S-DISStream structure in offline time intervals, with approximate discriminative
itemsets in an online real-time frame. The S-DISSparse algorithm has a complex process: every
single transaction is checked for online updating, and full partitions are processed for offline
sliding. The online sliding happens via the online itemsets of the recent and the oldest transactions in
the Transaction-List. The offline sliding happens by updating the sliding window model with the
recently updated transactions and tagging the itemsets in S-DISStream based on the recent data
stream lengths in the sliding window frame.
The offline sliding is used for the efficient mining of batch processing, and the online
updating is used for the quick reporting of the discriminative itemsets. The usability of the online
sliding highly depends on the concept drifts in the input data streams and also on their input rate.
Considering the incoming partitions in the data streams, the window frame is
updated with high performance if the adjacent partitions have the least concept drift. In data
streams with high concept drift, the sliding window frame is mainly updated through the offline
sliding state, and the online sliding state adds overhead to the mining process. The sliding
window frame size, the number of partitions and the size of each partition can be defined
based on the specific characteristics of the input datasets and on domain expert knowledge.
In this chapter, two determinative heuristics and one corollary are proposed for mining
discriminative itemsets in the exact offline sliding window model, and with the highest refined
approximate bound in the online sliding window model. The proposed heuristics applied to the S-
DISSparse method are efficient. The proposed heuristics in this chapter guarantee to hold the
exact frequencies of the discriminative itemsets, sub-discriminative itemsets and non-
discriminative subsets in the offline sliding window model. The non-discriminative itemsets
staying as leaf nodes in S-DISStream are pruned from the sliding window model. The highest
refined approximate bound is achieved by setting a smaller relaxation in the online sliding
window model. Following the defined tail pruning techniques, the S-DISStream data structure
can fit in the main memory, ascertaining that mining discriminative itemsets in data
streams using the sliding window model is realistic for fast-growing data streams.
The S-DISStream structure is stable over time, and the discriminative itemsets that
appear after processing batches with high concept drift are neutralized by merging in the full-
size sliding window frame. The process of building the S-DISStream structure can become more
efficient by periodically reordering the data structures based on the new trends in the data streams.
The Desc-Flist order made from the first batch of transactions is the default order for
making all the data structures in the algorithm. The efficiency of the algorithm may be affected
by this default order in the case of high concept drift in the data streams over time. The
data structures such as the S-FP-Tree, conditional FP-Tree, minimized DISTree and S-DISStream can
be updated periodically with a new ordering of frequent items for better efficiency (i.e., a Desc-Flist
adjusted based on the new trends in frequent items). However, the overhead of restructuring the large
S-DISStream structure in the sliding window model must be considered.
The proposed method is extensively evaluated with datasets exhibiting distinct
characteristics in Chapter 6. Based on the experimental results, the S-DISSparse algorithm
exhibits efficient time and space complexity on large, complex datasets when
we choose the partition size as a small percentage of the sliding window frame size. The S-
DISSparse algorithm reports the discriminative itemsets with full accuracy and recall in offline
sliding. The approximate results are reported in the online sliding window frame with higher
accuracy and recall when a smaller relaxation 𝛼 is chosen. The process of discriminative
itemset mining in the algorithm is highly dependent on the type of datasets and on how the itemsets
are distributed in the streams. The discriminative itemsets that appear with high concept drift
in specific batches are neutralized quickly during merging in the full-size sliding window frame.
The in-memory data structures defined for the algorithm efficiently stay small, based on the
proposed heuristics, and they stabilize during the process.
We explained many different real-world applications for mining discriminative itemsets
using the sliding window model. The sliding window model is useful for real-world applications
that have attracted high attention recently. One interesting scenario is network
monitoring for intrusion detection: looking for a set of activities happening more
frequently in one network compared to the rest of the networks can be used for personalization or
anomaly detection. The discriminative itemsets in the sliding window model can be useful for
monitoring the recent patterns in fast data streams. The sliding window model displays
the recent discriminative itemsets in the fixed-size window frame, which is updatable in the offline
and online states.
In this chapter, the exact discriminative itemsets are updated in offline time intervals in the
sliding window model, and the approximate discriminative itemsets are updated in a real-time frame in
the online sliding window model. In future work, we propose algorithms for classification based
on the discriminative itemsets in data streams using different window models. In the next
chapter, we evaluate the proposed algorithms for mining discriminative itemsets in data streams,
using different input datasets and based on different parameter settings.
Chapter 6: Evaluation and Analysis Page 131
© 2018 Queensland University of Technology-QUT, Science and Engineering Faculty Page 131
Chapter 6: Evaluation and Analysis
This chapter details all the results of the study in this thesis. The experimental results of different
datasets with different characteristics, on the four presented algorithms, are reported and analysed
in detail. This chapter contains a full discussion of the results with reference to the literature. For
each result, similarities and differences to the findings in the literature review are discussed. This
chapter also includes theory building.
This chapter details the different types of experiments for evaluating the presented
algorithms for mining discriminative itemsets in a batch of transactions, mining discriminative
itemsets using the tilted-time window model and mining discriminative itemsets using the sliding
window model. The algorithms are evaluated based on different criteria, including time and space
complexities, scalability and sensitivity analysis, by varying the input parameters. The
benchmarking and the types of datasets used for the experiments are explained in Section 6.1. The
DISTree and DISSparse algorithms, proposed for mining discriminative itemsets in one batch of
transactions, are evaluated in Section 6.2. The H-DISSparse algorithm, proposed for mining
discriminative itemsets using the tilted-time window model, is evaluated in Section 6.3. The S-
DISSparse algorithm, proposed for mining discriminative itemsets using the sliding window
model, is evaluated in Section 6.4. The chapter is finalized with the chapter summary in Section 6.5.
6.1 BENCHMARKING
Several methods have been proposed with concepts close to those of the methods in
this thesis. We list the current similar methods with a brief discussion of their definitions and
their proposed algorithms. We choose the best algorithms for benchmarking the
proposed batch processing algorithms and data stream processing algorithms, respectively.
6.1.1 Evaluation benchmarks
Mining discriminative itemsets in data streams is a new topic, and there are not many research
works in this area. In (Lin et al. 2010), three different methods have been proposed for mining
discriminative items in data streams, namely the frequent-item-based, hash-based and hybrid
methods. These three methods are considered the first research work on mining discriminative
items in data streams and are used as benchmarks for each other for evaluation purposes. In
Chapter 2 we confirmed that the methods proposed in (Lin et al. 2010; Seyfi 2011; Guo et al.
2011) cannot be used for benchmarking in discriminative itemset mining. Also, based on the
literature reviewed in Chapter 2, there is no prior research work on mining discriminative itemsets in
data streams. Out of the current methods for batch processing, we choose two of the
algorithms for benchmarking purposes, presenting sufficient arguments in the
section below.
6.1.1.1 Batch processing benchmarks
The DISTree method proposed in this thesis is the first method for mining discriminative
itemsets (Seyfi, Geva and Nayak 2014). The DISTree method is an adapted version of FP-
Growth (Han, Pei and Yin 2000) modified for more than one dataset. The standard FP-Growth
method is adapted to work on more than one dataset, as well as to prune the non-discriminative
itemsets instead of the infrequent itemsets. The DISSparse method is the advanced, efficient method
for mining discriminative itemsets (Seyfi et al. 2017). The proposed DISTree and DISSparse
algorithms in this thesis are completely novel and mine the complete set of the discriminative
itemsets, which are frequent in the target dataset based on the minimum support threshold, and
discriminative in the target dataset compared to the general dataset based on the discriminative level
threshold. Therefore, in the evaluation in Section 6.2, the DISTree algorithm is chosen as a
baseline model to compare with the proposed DISSparse algorithm.
The DISTree method is used as the first baseline for the DISSparse method; the
discriminative itemsets discovered by the two methods are the same. The accuracy of the
DISTree method is confirmed by the completeness of FP-Growth (Han, Pei and Yin 2000) and
by verifying the correctness of discriminative and non-discriminative itemsets through a full
traversal of the DISTree. Based on the empirical analysis in this chapter, DISTree takes roughly
two times longer than the basic FP-Growth. Discriminative itemsets do not follow the Apriori
property, so the divide-and-conquer strategy of FP-Growth cannot be applied directly; there are
also extra processes and data structures that work with the larger combined input of more than
one dataset. The DISSparse method is based on determinative heuristics that mine
discriminative itemsets efficiently by limiting itemset generation to the potentially
discriminative subsets. The accuracy of the DISSparse method is confirmed by the completeness
of its generation of potentially discriminative itemset combinations and the correctness of its
discriminative and non-discriminative itemsets.
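As a reference point for the definitions above, discriminative itemset mining can be sketched by exhaustive enumeration. This is an illustrative reconstruction only, not the thesis algorithms (DISTree and DISSparse avoid this enumeration with FP-Tree-based structures and heuristics); it assumes the combined minimum support 𝜑𝜃𝑛1 introduced later in this chapter, and the tiny streams are hypothetical.

```python
from itertools import combinations

def mine_discriminative(target, general, phi, theta):
    """Brute-force sketch of the discriminative-itemset definition: an
    itemset I is reported when it is frequent in the target dataset
    (f_i(I) at least the combined minimum support phi*theta*n_i) and its
    relative frequency ratio f_i(I)*n_j / (f_j(I)*n_i) exceeds theta."""
    n_i, n_j = len(target), len(general)
    items = sorted({x for t in target for x in t})
    min_sup = phi * theta * n_i
    found = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            c = set(cand)
            f_i = sum(1 for t in target if c <= t)   # support in target
            f_j = sum(1 for t in general if c <= t)  # support in general
            # cross-multiplied ratio test avoids division by zero
            if f_i >= min_sup and f_i * n_j > theta * f_j * n_i:
                found[cand] = (f_i, f_j)
    return found

# Hypothetical target stream S1 (n_i = 4) and general stream S2 (n_j = 8)
target = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
general = [{"a"}, {"b"}, {"c"}, {"a", "c"}, {"b", "c"}, {"c"}, {"a"}, {"b"}]
print(mine_discriminative(target, general, phi=0.1, theta=2))
```

Note that the single items a, b and c are frequent in both streams and are filtered out by the ratio test, while {a, b} and {b, c} pass it; this is the behaviour the exact algorithms reproduce without enumerating all 2^|Σ| candidates.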
The concept of discriminative itemsets is very close to that of emerging patterns, and
many algorithms have been proposed for mining different types of emerging patterns. We choose one of the
state-of-the-art emerging pattern mining algorithms for benchmarking against the discriminative
itemset mining algorithm proposed in this thesis. However, both methods need some
modifications for a fair comparison.
ConsEPMiner (Zhang, Dong and Kotagiri 2000) reduces the cost of emerging
pattern mining by enforcing several constraints, including a user-defined minimum support,
growth rate and growth-rate improvement. Nevertheless, ConsEPMiner is not efficient when the
minimum support is low. Also, it is unable to handle datasets with very high dimensions, for
example market basket datasets. The epMiner (Loekito and Bailey 2006) is proposed for
mining EPs from high-dimensional datasets by employing the 𝛼 and 𝛽 constraints to reduce the
pattern search space. The epMiner algorithm mines minimal patterns occurring frequently (i.e.,
in more than 𝛼% of transactions) in the positive class and infrequently (i.e., in less than 𝛽% of
transactions) in the negative class.
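The 𝛼/𝛽 constraints can be expressed as a small predicate. The fractional form of the thresholds and the strictness of the inequalities below are assumptions for illustration; the exact conventions are in the original paper.

```python
def satisfies_alpha_beta(f_pos, n_pos, f_neg, n_neg, alpha, beta):
    """Sketch of the alpha/beta support constraints described for epMiner:
    a pattern qualifies when its relative support exceeds alpha in the
    positive class and stays below beta in the negative class.
    Thresholds are given as fractions here (e.g. alpha=0.2 for 20%)."""
    return (f_pos / n_pos) > alpha and (f_neg / n_neg) < beta

# A pattern seen in 30 of 100 positive and 2 of 200 negative transactions
print(satisfies_alpha_beta(30, 100, 2, 200, alpha=0.2, beta=0.05))
```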
The Discriminative Pattern Miner (DPMiner) algorithm (Li, Liu and Wong 2007) is able
to discover the δ-discriminative emerging patterns that have the maximum frequency in the
contrasting classes. The DPMiner algorithm is much faster than epMiner and, based on the
literature reviewed, it is the most efficient method for mining emerging patterns (Dong and Bailey
2012). DPMiner finds the δ-discriminative emerging patterns, which occur in only one of the
classes with almost no occurrence in any other class, i.e., the itemset appears fewer than 𝛿 times in the
rest of the classes, where 𝛿 is usually a small integer such as 1 or 2. DPMiner employs the 𝛼
and 𝛽 constraints as in (Loekito and Bailey 2006) to reduce the pattern search space; however, it
does not find some of the useful emerging patterns that the proposed DISTree and
DISSparse methods discover. The difference between these algorithms lies in the measures
defined for the discriminative itemsets: the discriminative itemsets proposed in this thesis are
relatively discriminative in the target dataset compared to the general dataset.
The ratio between the frequencies of a discriminative itemset in the target dataset (𝑓𝑖)
and the general dataset (𝑓𝑗) in the research problem of discriminative itemset mining is relative,
i.e., 𝑓𝑖(𝐼) ∙ 𝑛𝑗 / (𝑓𝑗(𝐼) ∙ 𝑛𝑖) > 𝜃. This measure does not follow the Apriori property, as a superset of a non-
discriminative itemset can still be discriminative. The δ-discriminative emerging patterns have a non-
relative, static frequency measure in the negative class, i.e., (< 𝛿). The problem of mining δ-
discriminative emerging patterns is based on the equivalence classes defined by closed patterns
and a set of generators. The (< 𝛿) measure of 𝛿-discriminative emerging patterns follows the
Apriori property, by finding a border with non-𝛿-discriminative patterns above the border and
redundant patterns below it.
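The failure of the Apriori property under the relative measure can be shown numerically. The counts below are hypothetical, chosen only to demonstrate that the ratio is not anti-monotone.

```python
def ratio(f_i, f_j, n_i, n_j):
    """Relative frequency ratio f_i(I)*n_j / (f_j(I)*n_i), treated as
    infinite when the itemset never occurs in the general dataset."""
    return float("inf") if f_j == 0 else (f_i * n_j) / (f_j * n_i)

theta = 2
# Hypothetical counts over a target stream of n_i = 4 transactions and a
# general stream of n_j = 4 transactions.
# {a}: 4 occurrences in the target, 3 in the general stream -> ratio 4/3
print(ratio(4, 3, 4, 4) > theta)   # {a} is not discriminative
# {a, b}: 2 occurrences in the target, none in the general stream
print(ratio(2, 0, 4, 4) > theta)   # its superset {a, b} is discriminative
# A failed ratio test on an itemset therefore says nothing about its
# supersets, so Apriori-style pruning of whole branches is unavailable.
```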
Although the DPMiner algorithm (Dong and Bailey 2012) mines discriminative
itemsets with 𝑓𝑖 + 𝑓𝑗 > 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑠𝑢𝑝𝑝𝑜𝑟𝑡, i.e., the itemset must be frequent in the datasets, it
has a specific requirement that 𝑓𝑗 < 𝛿 or 𝑓𝑖 < 𝛿, which is different from the measure used by
DISSparse to determine discriminative itemsets. In Chapter 3, we modified DISSparse
and its proposed heuristics to find all the δ-discriminative itemsets, i.e., including redundant
patterns. We also modified the original DPMiner to include all the δ-discriminative itemsets.
We compare these methods in terms of time and space usage for the desired δ-discriminative
emerging patterns. The proposed data stream mining algorithms in this thesis are designed on
the basis of the proposed batch processing algorithms. We define the evaluation benchmarks
for the data stream mining algorithms in the section below.
6.1.1.2 Data stream processing benchmarks
The DISTree and DISSparse algorithms are proposed for mining discriminative itemsets
from a single batch of transactions, which can be used for offline updating of the different
window models in data streams. The H-DISSparse algorithm is proposed for mining
discriminative itemsets in data streams using the tilted-time window model by utilizing the
DISSparse method. The precision of the discriminative itemset mining algorithms using the
tilted-time window model is affected by merging the results from multiple batches of
transactions. In Chapter 4, a set of corollaries was defined to improve the precision based on the
properties of discriminative itemset mining in data streams using the tilted-time window
model. In Section 6.3, the algorithm constructed based on the basic DISTree method is used as a
benchmark for the algorithm constructed based on the efficient DISSparse method.
The S-DISSparse algorithm is proposed for mining discriminative itemsets in data
streams using the sliding window model by utilizing the DISSparse method. The discriminative
itemset mining algorithm using the sliding window model is proposed with full accuracy and
recall by following the recent changes in the data structures during offline window
sliding. In Chapter 5, a set of heuristics and a corollary were defined to improve the precision of
the discriminative itemset mining algorithms in data streams during online window sliding.
In Section 6.4, the original algorithm constructed on the basis of the DISSparse method
is used as a benchmark for the efficient algorithm constructed based on the DISSparse method for
the sliding window model.
6.1.2 Evaluation environment
All the algorithms were implemented in C++ and the experiments were conducted on a
desktop computer with an Intel Core (TM) Duo E2640 2.8GHz CPU and 8GB of main memory
running 64-bit Microsoft Windows 7 Enterprise. All the synthetic datasets used in the
experiments were generated using the IBM synthetic data generator (Agrawal and Srikant 1994).
The input datasets are made of two data streams 𝑆1 and 𝑆2 generated with different sizes, ratios
and numbers of unique items. The 𝑇:𝐼:𝐷 format describes a dataset with 𝑇 as the
average transaction length, 𝐼 as the average length of the large itemsets and 𝐷 as the number of
transactions. We used the same 𝑇 for both 𝑆1 and 𝑆2 to indicate that both data streams belong to
the same domain and exhibit similar behaviour. 𝑆1 and 𝑆2 were generated with different 𝐼, as
there are more large itemsets in 𝑆2, which is made of several smaller data streams; this also
supports the generation of a larger number of discriminative itemsets. The real datasets used in the
experiments were mainly obtained from the UCI repository (Dheeru and Karra Taniskidou
2017). The details of the real datasets used in the experiments are presented in the related
sub-sections.
It should be noted that, in our experiments, runtime means the total execution time,
i.e., the period between input and output, rather than the CPU time measured in the experiments in
some of the literature. The reported runtimes and space usages include the construction time and space
of the FP-Tree structure that holds the input transactions in a concise way; this is
an overhead shared by all the algorithms. For simplicity, we define the combination
𝜑𝜃𝑛1 as the minimum support. The number of discriminative itemsets in the datasets grows exponentially
as the minimum support is lowered. The datasets contain abundant mixtures of short and
long discriminative itemsets, with a small number of very long discriminative itemsets alongside a
large number of short ones.
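The combined threshold can be made concrete with a small sketch; the numeric values below are illustrative, matching the 𝐷1-style settings used in the experiments that follow.

```python
def minimum_support(phi, theta, n1):
    """Combined minimum support phi * theta * n1 used throughout the
    experiments: the support threshold phi, the discriminative level theta
    and the target-stream size n1 jointly determine the effective cut-off
    on an itemset's absolute frequency in the target stream."""
    return phi * theta * n1

# With phi = 0.0001 and a target stream of n1 = 10,000 transactions,
# raising theta from 10 to 15 raises the absolute frequency cut-off,
# which is why phi and theta have a similar effect on complexity.
print(minimum_support(0.0001, 10, 10_000))
print(minimum_support(0.0001, 15, 10_000))
```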
6.2 BATCH PROCESSING
In this section, we evaluate the DISTree and DISSparse algorithms on a
single batch of transactions. The two algorithms are evaluated on a wide
range of datasets exhibiting various characteristics and with different parameter settings. Three
main input datasets, made of a target data stream 𝑆1 and a general data stream 𝑆2 with different sizes,
average transaction lengths, average lengths of the large itemsets and numbers of transactions,
have been generated with 1𝑘 and 10𝑘 unique items. The general data stream 𝑆2 is
typically much bigger than the target data stream 𝑆1 as it combines multiple datasets.
In this section, strategies and principles are also recommended for tuning the parameters based on
the application domain and data stream characteristics.
6.2.1 Evaluation on synthetic datasets
The scalability of the DISTree and DISSparse algorithms is presented over different ranges
of the discriminative level 𝜃, the support threshold 𝜑, the ratio between 𝑆1 and 𝑆2 (𝑛2/𝑛1) and the
number of unique items in the alphabet Σ.
Dataset 𝑫𝟏 is generated with 𝑆1 as 𝑇25:𝐼10:𝐷10𝐾 and 𝑆2 as 𝑇25:𝐼15:𝐷50𝐾, limited
to 1𝐾 unique items. The number of discriminative itemsets goes up exponentially at lower
minimum supports. The scalability of DISSparse is compared with DISTree for different
discriminative levels 𝜃 and a fixed support threshold 𝜑 = 0.01%, as presented in Figure 6.1.
Figure 6.1 Scalability with discriminative level 𝜃 for 𝐷1 (support threshold 𝜑 = 0.0001)
DISSparse scales much better than DISTree, especially at the lower minimum supports
reached by decreasing 𝜃, when a greater number of discriminative itemsets with lower minimum support
and frequency ratio are discovered. The two algorithms behave similarly at large minimum
supports, when there are fewer discriminative itemsets (i.e., for 𝜃 = 10 in Figure 6.1 the
number of discriminative itemsets is less than 200k). At large minimum supports, both DISTree
and DISSparse prune many items that are infrequent in the target data stream 𝑆1 while
building the conditional FP-Tree. Figure 6.1 also shows the exponential growth in the space usage of DISTree
at smaller minimum supports, caused by the exponential number of generated itemset
combinations.
To show the efficiency gained by the proposed heuristics, the experiments are repeated
on 𝐷1 with different discriminative levels 𝜃 and a fixed support threshold 𝜑 = 0.01%,
testing the DISSparse algorithm separately, once with HEURISTIC 3.1 eliminated and once
with HEURISTIC 3.2 eliminated. Without the proposed heuristics, the data structures and processes in the DISSparse
algorithm are not efficient and DISSparse does not scale
well compared to DISTree (i.e., the time complexity of DISSparse without the
heuristics generally scales similarly to DISTree). The space usage of DISSparse without the
heuristics is still better than that of DISTree. The main contribution to the DISSparse
efficiency comes from HEURISTIC 3.2; however, the effect of HEURISTIC 3.1 becomes
clear at the lower minimum supports.
The scalability of the DISTree and DISSparse algorithms is tested with different support
thresholds 𝜑 and a fixed discriminative level 𝜃 = 15, as presented in Figure 6.2. The time and
space complexity of DISSparse and DISTree with a changing support threshold 𝜑 and a fixed
discriminative level 𝜃 = 15 follow the same patterns as in Figure 6.1. The support threshold 𝜑 has
a similar effect to the discriminative level 𝜃 on the time and space complexity, as these two parameters
together define the minimum support of discriminative itemsets.
Figure 6.2 Scalability with support threshold 𝜑 for 𝐷1(discriminative level 𝜃 = 10)
The next experiments are run on datasets defined in the same way as 𝐷1 but with
different length ratios between the sizes of 𝑆1 and 𝑆2. The number of transactions in 𝑆2 is
changed from 10𝑘 to 100𝑘. In this setting, a greater number of itemsets fit the definition of
discriminative itemsets as the ratio 𝑛2/𝑛1 increases gradually from smaller values, as presented in
Figure 6.3. The scalability of DISSparse and DISTree with a changing ratio 𝑛2/𝑛1 is
presented in Figure 6.4. The runtime of DISSparse increases mainly as a consequence of the rise in
the number of discriminative itemsets. The linear increase in the time and space complexity of
DISTree at larger ratios is caused by the requirement of bigger data structures with longer
processing time (i.e., the dataset with 𝑛2/𝑛1 = 10 is ten times bigger than the dataset with 𝑛1 = 𝑛2, as in
Figure 6.4).
Figure 6.3 Number of the discriminative itemsets with different dataset length ratios (𝑛2/𝑛1)
(discriminative level 𝜃 = 10 and support threshold 𝜑 = 0.0001)
Figure 6.4 Scalability with different dataset length ratios (𝑛2/𝑛1) (discriminative level
𝜃 = 10 and support threshold 𝜑 = 0.0001)
Dataset 𝑫𝟐 is generated with 𝑆1 as 𝑇25:𝐼10:𝐷10𝐾 and 𝑆2 as 𝑇25:𝐼15:𝐷50𝐾, limited
to 10𝐾 unique items. The transactions in 𝐷2 are made of sparse items drawn from the 10𝐾 unique items,
and the number and average length of the discriminative itemsets decrease. Both algorithms prune
many items that are infrequent in 𝑆1 while building the conditional FP-Tree, and DISTree scales well with
a time complexity close to DISSparse but larger space usage, as in Figure 6.5.
Figure 6.5 Scalability with discriminative level 𝜃 for 𝐷2 (support threshold 𝜑 = 0.0001)
Dataset 𝑫𝟑 is generated with 𝑆1 as 𝑇25:𝐼10:𝐷100𝐾 and 𝑆2 as 𝑇25:𝐼15:𝐷500𝐾,
limited to 10𝐾 unique items. The 𝑛1 is much bigger in 𝐷3 than in the other datasets, which
caused a larger minimum support and consequently fewer discriminative itemsets (e.g.,
for minimum support 𝜑𝜃𝑛1 = 50..100, the number of discriminative itemsets for different
ratios 𝜃 varies from three million down to a hundred thousand, as in Figure 6.7). In this setting, both
algorithms again prune many items that are infrequent in 𝑆1 while building the conditional FP-Tree, as in
Figure 6.6.
Figure 6.6 Scalability with discriminative level 𝜃 for 𝐷3 (support threshold 𝜑 = 0.0001)
Figure 6.7 Number of the discriminative itemsets with different 𝜃 for 𝐷3 (support threshold
𝜑 = 0.0001)
DISSparse scales very well for large datasets with a prolific number of discriminative
itemsets at smaller minimum supports, as in Figure 6.8. The experiments are conducted on 𝐷3
using a parameter setting homogeneous with the experiments reported in Figure 6.1, i.e., the support
threshold is set to 𝜑 = 0.00001 so that the minimum support is 𝜑𝜃𝑛1 = 0.00001 ∗ 𝜃 ∗
100,000 = 𝜃. Both DISSparse and DISTree show smooth growth in their time complexity as
the discriminative level 𝜃 decreases, as in Figure 6.8. However, when the minimum support
becomes very small, DISTree becomes intolerable because of the exponential number of discriminative
itemsets (i.e., from 6𝑀 for 𝜃 = 35 to 11𝑀 for 𝜃 = 25, as in Figure 6.9).
Figure 6.8 Scalability with discriminative level 𝜃 for 𝐷3 (support threshold 𝜑 = 0.00001)
Figure 6.9 Number of the discriminative itemsets with discriminative level 𝜃 for 𝐷3 (support
threshold 𝜑 = 0.00001)
To show the relationship between the distribution of the frequent items in each dataset and
the distribution of items in the discriminative itemsets, experiments are conducted on a modified
version of 𝐷1 in which the distribution of frequent items is changed. The frequent items in 𝑆1 deliberately
start from the items with smaller identifiers, as in Figure 6.10; for example, item 0 is the most
frequent item and item 999 is the least frequent item in 𝑆1. The frequent items in 𝑆2
deliberately start from the items with larger identifiers, as in Figure 6.11; for example, item
999 is the most frequent item and item 0 is the least frequent item in 𝑆2.
Figure 6.10 Frequent items distribution in 𝑆1
Figure 6.11 Frequent items distribution in 𝑆2
The experiments are conducted on the modified 𝐷1 with the support threshold set to
𝜑 = 0.0001 and the discriminative level 𝜃 = 10. The distribution of items in the discriminative
itemsets (i.e., two hundred and ten thousand discriminative itemsets under this parameter setting) is
presented in Figure 6.12. The frequency of items in discriminative itemsets is divided by ten
for the sake of clarity. As Figure 6.12 shows, the discriminative itemsets are mainly made of items that have high
frequency in 𝑆1 and lower frequency in 𝑆2.
Figure 6.12 Frequent items distribution in discriminative itemsets in the modified 𝐷1 with
discriminative level 𝜃 = 10 and support threshold 𝜑 = 0.0001
To examine the distribution of the frequent items in consistent datasets
with similar item distributions, further experiments are conducted on 𝐷1, again with a modified
distribution of frequent items. This time, the frequent items in both 𝑆1 and 𝑆2 deliberately start
from the items with smaller identifiers, as in Figure 6.13; for example, item 0 is the most
frequent item and item 999 is the least frequent item in both 𝑆1 and 𝑆2.
Figure 6.13 Frequent items distribution in discriminative itemsets in the modified 𝐷1 with
discriminative level 𝜃 = 10 and support threshold 𝜑 = 0.0001
The experiments are conducted on the modified 𝐷1 with a similar parameter setting (i.e.,
𝜑 = 0.0001 and 𝜃 = 10). The distribution of items in the discriminative itemsets is presented
in Figure 6.13. The number of discriminative itemsets decreased by ten percent compared to the previous
experiments (i.e., one hundred and ninety thousand discriminative itemsets under this
parameter setting). The discriminative itemsets are still made of items that have high
frequency in 𝑆1, but there are fewer of them.
6.2.2 Evaluation on real datasets
To evaluate the proposed DISSparse algorithm on real applications, we ran experiments
on several real datasets: the susy and mushroom datasets from the UCI repository (Dheeru
and Karra Taniskidou 2017) and the accident dataset, all provided in (Fournier-Viger et al. 2016).
The selected datasets are dense (i.e., transactions have values for each attribute)
with fewer sparsity characteristics than the synthetic datasets. For this reason, we set
the parameters so as to show the scalability at the most informative scales.
The susy dataset contains high-level features derived by physicists to help discriminate
between two classes, defined as signal and background. It relates to particles detected in a
particle accelerator, based on Monte Carlo simulations. This dataset is made of five million
instances; the first column is the class label, followed by eighteen features. The transactions
are made of about one hundred and ninety unique items. We selected the first one hundred thousand instances
for the scale of the experiments (i.e., 2% of the dataset). As Figure 6.14 shows, DISSparse scales
better than DISTree, especially at lower minimum supports, when a greater number of
discriminative itemsets are discovered. The number of discriminative itemsets ranges from 17
thousand for 𝜃 = 1.75 down to 2 thousand for 𝜃 = 3. An interesting feature of this dataset,
compared to the synthetic market basket datasets, is that the majority of the discriminative itemsets
have high frequencies in both datasets. We observed fewer patterns with zero or small
frequency in the general dataset. This shows the significance of the proposed DISSparse
algorithm for applications with no inherent discrimination. This feature also exists in most of
the real applications we explained in Chapter 3.
Figure 6.14 Scalability with discriminative level 𝜃 for susy dataset (support threshold
𝜑 = 0.01)
We ran the experiments with different 𝜑 and a fixed 𝜃 = 2, as presented in Figure 6.15.
The 𝜑 has a similar effect to 𝜃 on the time and space complexity.
Figure 6.15 Scalability with support threshold 𝜑 for susy dataset (discriminative level 𝜃 = 2)
The accident dataset is anonymized traffic accident data obtained from the National
Institute of Statistics (NIS) for the region of Flanders (Belgium) for the period 1991-2000
(Fournier-Viger et al. 2016). This dataset is also dense, with large and varied transaction
sizes. The transactions are made of about four hundred and fifty items. We used the full dataset
(i.e., three hundred and forty thousand instances) for the experiments, taking the first column as
the class label. DISSparse scales slightly better than DISTree at lower minimum supports, as
in Figure 6.16. In this dataset we observed many discriminative itemsets with high frequency in
both datasets (i.e., the two classes of data). However, there are many patterns with zero frequency in
the general dataset compared to the patterns in the susy dataset. The interesting feature of this dataset
is that the discriminative itemsets are a mixture of itemsets with high frequencies in both datasets,
and itemsets with high frequency in one dataset and zero frequency in the other.
Figure 6.16 Scalability with discriminative level 𝜃 for accident dataset (support threshold
𝜑 = 0.01)
The mushroom dataset includes descriptions of hypothetical samples corresponding to
23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species is identified as
definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter
class was combined with the poisonous one. The Guide clearly states that there is no simple rule
for determining the edibility of a mushroom. This dataset is made of 8124 instances; the first
column is the class label, followed by 22 features. The transactions are made of about one hundred and
twenty unique items. We used the full dataset for the experiments.
For the mushroom dataset, the two algorithms scale similarly in terms of time complexity,
with better space usage for the DISSparse algorithm, as in Figure 6.17. We found that for datasets
(i.e., different classes of data) with inherent discrimination, like the edible and poisonous classes in
the mushroom dataset, the algorithm finds many discriminative itemsets even at
very high minimum supports and discriminative level thresholds.
Figure 6.17 Scalability with discriminative level 𝜃 for mushroom dataset (support threshold
𝜑 = 0.001)
The DISSparse algorithm works efficiently for discriminative itemsets with low
supports and low discriminative level thresholds when there is a large number of items in the
datasets. This is very important in the applications we explained in Chapter 3. However, the
mushroom dataset has very few unique items (i.e., fewer than 120) and does not have
the sparsity characteristics (i.e., the number of discriminative itemsets is not necessarily much
less than the number of frequent itemsets). Because of this, most of the subtrees in the conditional
FP-Tree are considered potential and cannot be ignored.
6.2.3 Discussion on 𝜹-discriminative emerging patterns
Discriminative itemsets are frequent in the target dataset, with relatively different
supports in the target dataset and the general dataset. The δ-discriminative emerging patterns are
frequent in the target dataset but infrequent (i.e., frequency < 𝛿) in the other datasets.
DPMiner (Li, Liu and Wong 2007) discovers the equivalence classes (ECs) and employs the 𝛿
constraint to reduce the pattern search space by setting a border between non-𝛿-discriminative and
redundant emerging patterns. It excludes the subsets with frequencies larger than 𝛿 in the
general dataset and the supersets with lower frequency in the datasets. We observed big differences
in the number of discovered discriminative itemsets between DPMiner and DISSparse for
several different parameter settings. The wide difference in the number of discovered itemsets lies
in the redundant 𝛿-discriminative emerging patterns. A redundant itemset is a superset of a δ-
discriminative itemset with the same infinite ratio between its supports in the target and general
datasets (e.g., the itemset {𝑎, 𝑏, 𝑐} with 𝑓𝑖 = 10 and 𝑓𝑗 = 0 and the itemset {𝑎, 𝑏} with 𝑓𝑖 = 15 and
𝑓𝑗 = 0 are both considered discriminative by DISSparse, while for DPMiner {𝑎, 𝑏} is considered 𝛿-
discriminative but the itemset {𝑎, 𝑏, 𝑐} is not, because {𝑎, 𝑏, 𝑐} is a superset of the discriminative
itemset {𝑎, 𝑏} with lower frequency in both datasets 𝑆𝑖 and 𝑆𝑗, and is thus considered redundant).
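The example above can be rendered as a toy check. The dataset sizes (n_i = n_j = 1000) and thresholds are hypothetical, and DPMiner's equivalence-class-based redundancy pruning is simplified here to a plain subset test on the two itemsets; this is a sketch of the contrast between the two criteria, not either algorithm.

```python
# Frequencies from the example: itemset -> (f_i in target, f_j in general)
patterns = {("a", "b"): (15, 0), ("a", "b", "c"): (10, 0)}
n_i, n_j, theta, delta, min_sup = 1000, 1000, 2, 1, 5

def relative_ok(f_i, f_j):
    """DISSparse-style relative measure: frequent in the target stream
    and frequency ratio above theta (infinite when f_j = 0)."""
    ratio = float("inf") if f_j == 0 else (f_i * n_j) / (f_j * n_i)
    return f_i >= min_sup and ratio > theta

# The relative measure accepts both itemsets
dissparse = [p for p, (fi, fj) in patterns.items() if relative_ok(fi, fj)]

# A DPMiner-style pass keeps {a, b} but drops its lower-frequency
# superset {a, b, c} as redundant (shortest itemsets considered first)
kept = []
for iset, (fi, fj) in sorted(patterns.items(), key=lambda kv: len(kv[0])):
    redundant = any(set(k) < set(iset) for k in kept)
    if fj < delta and not redundant:
        kept.append(iset)

print(dissparse)  # both itemsets
print(kept)       # only ('a', 'b')
```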
Discriminative itemsets can be frequent in both the target dataset and the general dataset, without
any static limit on their frequency in the general dataset, but DPMiner does not generate itemsets
that are frequent in the general dataset (e.g., with 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑠𝑢𝑝 = 0.01, 𝛿 = 5 and
𝜃 = 2, DISSparse discovers the itemset {𝑎, 𝑏, 𝑐} with 𝑓𝑖 = 100 and 𝑓𝑗 = 50 as a
discriminative itemset, but DPMiner skips this itemset because 𝑓𝑗 = 50 > 𝛿). Discriminative
itemsets are useful when the itemsets are frequent in all datasets and when the
support and confidence of the itemsets are needed explicitly; for example, itemsets in market
baskets may appear frequently in all suburbs with relatively higher frequency in a target suburb
with a different aging population. For a fair comparison with the DISSparse algorithm, we modified
the original DPMiner algorithm to cover all the δ-discriminative emerging patterns (i.e., including
redundant emerging patterns). Also, as explained earlier, the discriminative itemsets
discovered by the DISSparse algorithm cannot be limited by the static frequency bound (< 𝛿) in
the general dataset. We therefore modified the definition criteria, and consequently the heuristics
proposed in the DISSparse algorithm in Chapter 3, to discover all the δ-discriminative emerging patterns, as in
the section below.
6.2.3.1 Evaluation on modified DISSparse and modified DPM
The scalability of the modified algorithms is tested with different 𝑚𝑖𝑛_𝑠𝑢𝑝 and a fixed
𝛿 = 2 on Dataset 𝑫𝟏, as presented in Figure 6.18. The modified DISSparse scales much better than
the modified DPM, especially at lower 𝑚𝑖𝑛_𝑠𝑢𝑝 when a greater number of δ-
discriminative itemsets are discovered. The algorithms behave similarly at large 𝑚𝑖𝑛_𝑠𝑢𝑝,
when there are fewer discriminative itemsets (i.e., for 𝑚𝑖𝑛_𝑠𝑢𝑝 = 60 and 𝑚𝑖𝑛_𝑠𝑢𝑝 =
65 the number of δ-discriminative itemsets is less than 100k). At large 𝑚𝑖𝑛_𝑠𝑢𝑝, both the
modified DPM and the modified DISSparse prune many items that are infrequent in the datasets
while building the conditional FP-Tree. Figure 6.18 shows the exponential growth in the time and
space usage of the modified DPM at smaller 𝑚𝑖𝑛_𝑠𝑢𝑝, caused by the exponential number
of generated itemset combinations.
Figure 6.18 Scalability with 𝑚𝑖𝑛_𝑠𝑢𝑝 for 𝐷1 (δ = 2)
The scalability of the modified algorithms is tested with different δ values and a fixed
𝑚𝑖𝑛_𝑠𝑢𝑝 = 50 on Dataset 𝑫𝟏, as presented in Figure 6.19. In this experiment
we observed only a very small increase in the time and space usage of both algorithms as δ
grows.
Figure 6.19 Scalability with δ for 𝐷1 (𝑚𝑖𝑛_𝑠𝑢𝑝 = 50)
For a comparison on a bigger dataset with higher sparsity characteristics, the scalability of
the modified algorithms is tested with different 𝑚𝑖𝑛_𝑠𝑢𝑝 and a fixed 𝛿 = 50 on Dataset 𝑫𝟑, as
presented in Figure 6.20. We observed similar behaviour to that on Dataset 𝑫𝟏, with much better
time and space complexity for the modified DISSparse than the modified DPM, especially at lower
𝑚𝑖𝑛_𝑠𝑢𝑝. Figure 6.20 shows the exponential growth in the time and space usage of the modified DPM at
smaller 𝑚𝑖𝑛_𝑠𝑢𝑝, caused by the exponential number of generated itemset combinations.
Figure 6.20 Scalability with 𝑚𝑖𝑛_𝑠𝑢𝑝 for 𝐷3 (δ = 50)
The scalability of the modified algorithms is tested with different δ values and a fixed
𝑚𝑖𝑛_𝑠𝑢𝑝 = 450 on Dataset 𝑫𝟑, as presented in Figure 6.21. As for
Dataset 𝑫𝟏, we observed only a very small increase in the time and space usage of both algorithms as
δ grows.
Figure 6.21 Scalability with δ for 𝐷3 (𝑚𝑖𝑛_𝑠𝑢𝑝 = 450)
We modified the DISTree algorithm to find the δ-discriminative itemsets for
comparison with the modified DISSparse and the modified DPM algorithms.
However, the time and space usage of the modified DISTree is far out of range and cannot
be shown at the scales used for reporting the modified DISSparse and the
modified DPM.
6.2.3.2 Evaluation on real datasets with modified DISSparse and modified DPM
The modified DISSparse algorithm and the modified DPM algorithm behave differently on the three real datasets used in this thesis. The mushroom dataset has inherently discriminative classes. The modified DPM is much more efficient on the mushroom dataset compared to the modified DISSparse. The definition of the 𝛿-discriminative itemsets is generally based on a small 𝛿 value (i.e., 0 or 1) (Li, Liu and Wong 2007). Mining discriminative itemsets with small 𝛿 values is useful in datasets with inherently discriminative classes, where discriminative itemsets are frequent in one class and have few occurrences in the other classes. Most of the 𝛿-discriminative itemsets in this type of dataset are jumping emerging patterns, with high frequency in one class and zero frequency in the other classes. On this dataset, the modified DPM finds the 𝛿-discriminative itemsets with small 𝛿 much faster than the modified DISSparse.
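The 𝛿-discriminative condition described above can be expressed as a simple membership test. The following sketch is an illustrative reading of the definition in (Li, Liu and Wong 2007); the function and parameter names are ours, and the minimum-support requirement in the target class is included for completeness:

```python
def is_delta_discriminative(counts, target, min_sup, delta):
    """Check whether an itemset's per-class counts satisfy the
    delta-discriminative condition: frequent in the target class and
    at most delta occurrences in every other class."""
    if counts.get(target, 0) < min_sup:
        return False
    return all(c <= delta for cls, c in counts.items() if cls != target)
```

For example, with 𝛿 = 1 and 𝑚𝑖𝑛_𝑠𝑢𝑝 = 50, an itemset counted as {'A': 120, 'B': 1} is 𝛿-discriminative for class A, while {'A': 120, 'B': 7} is not.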
The modified DISSparse is more efficient when the dataset has no inherently discriminative classes, as in the susy and accident datasets. For these datasets, a large 𝛿 must be set for mining 𝛿-discriminative itemsets, which is against the definition in the original DPM (Li, Liu and Wong 2007). The definition of discriminative itemsets proposed in this thesis is based on the relative differences of the supports across datasets (i.e., the classes of the dataset). This favours the modified DISSparse when large 𝛿 values are set for mining 𝛿-discriminative itemsets. Even with large 𝛿 values in the modified DISSparse, the original DISSparse algorithm still finds more discriminative itemsets with different relative supports. The DISSparse method proposed in this thesis targets applications similar to the susy and accident datasets, with no inherent discrimination in the dataset. Most of the discriminative itemsets in this type of dataset are frequent in all data classes, with much higher frequency in one class than in the others. Several real applications of discriminative itemset mining were explained in Chapter 1.
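The relative-difference notion used by DISSparse can be sketched as a support-ratio test between the target stream and the general stream. The exact formulation in Chapter 3 may differ; the names and the ratio form below are illustrative assumptions:

```python
def is_discriminative(f1, n1, f2, n2, phi, theta):
    """Ratio-based check used in this sketch: relative support in the
    target stream S1 must reach phi, and the ratio of relative supports
    between S1 and S2 must reach the discriminative level theta."""
    s1 = f1 / n1          # relative support in the target stream
    s2 = f2 / n2          # relative support in the general stream
    if s1 < phi:
        return False      # not frequent enough in the target stream
    return s2 == 0 or s1 / s2 >= theta
```

Under this reading, an itemset can be frequent in every class (unlike a 𝛿-discriminative itemset) and still qualify, as long as its relative support in the target stream is 𝜃 times higher than in the general stream.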
6.2.4 Discussion
The general outcome from the experiments discussed in the previous sub-sections is explained here. The DISSparse algorithm with the proposed heuristics exhibits efficient time and space complexity, specifically when discriminative itemsets with lower minimum supports and frequency ratios are of interest. Setting the support threshold 𝜑 and the discriminative level 𝜃, while considering a reasonable size for the input batch of transactions 𝑛1, is very important in real applications. These three parameters (i.e., 𝜑, 𝜃 and 𝑛1) set the minimum support for the frequency of discriminative itemsets in the target dataset. Setting these parameters is highly
related to the application and to domain experts, considering the limited computing and storage capabilities. A smaller input batch of transactions must be used for applications that need continuous updating at short time intervals, especially for discriminative itemsets with small minimum supports and ratios. The discovered patterns may expire and lose interest as time goes by and new transactions are added to the data streams. The algorithms were also tested on simpler synthetic datasets with an average transaction length 𝑇 of 10 and 15. The number of discriminative itemsets is much smaller, even with smaller minimum supports and frequency ratios, and DISTree scales well, close to DISSparse. However, the space usage of DISTree still grew exponentially for smaller minimum supports, as observed in the reported experiments above. The number of unique items in the alphabet Σ can change the distribution of sparse transactions and has to be considered when setting batch sizes for good scalability.
The algorithms designed for data stream processing should use a single scan over the input datasets, as explained in the literature review in Chapter 2. In (Seyfi, Geva and Nayak 2014), we proposed an early algorithm with a single scan for mining discriminative itemsets. In this algorithm the transactions are not sorted based on the ordering of the frequent items, and the FP-Tree is made of full-size transactions, which leads to a huge data structure. This also affects the size of the other data structures, as they do not use common prefix sharing for time and space saving. In the designed experiments, this is not comparable with the DISTree and DISSparse methods, whose data structures are constructed in descending order of item frequency. The results were unacceptable even on simple and small datasets, showing the necessity of another scan for sorting the items in descending order of their frequencies, as in (Giannella et al. 2003).
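The extra scan discussed above simply reorders each transaction by descending global item frequency before FP-Tree insertion, so that transactions with common frequent items share tree paths. A minimal sketch, with hypothetical names:

```python
from collections import Counter

def sort_transactions(transactions):
    """Second pass over a batch: count global item frequencies, then
    reorder every transaction in descending order of item frequency
    (ties broken by item id) so FP-Tree paths share common prefixes."""
    freq = Counter(item for t in transactions for item in t)
    return [sorted(t, key=lambda i: (-freq[i], i)) for t in transactions]
```

Because every transaction is ordered consistently, any two transactions that share their most frequent items map to the same initial FP-Tree path, which is what keeps the tree compact.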
The emerging pattern mining algorithms cannot be directly used as an efficiency baseline for the DISSparse algorithm, as the essence and concept of EPs differ from those of discriminative itemsets. We report the itemsets with their exact frequencies, whereas EP mining methods report the patterns between borders. JEPs are the patterns whose support is non-zero in one dataset and zero in the other; eJEPs are the minimal JEPs, so JEPs are the more general notion.
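The JEP condition quoted above is straightforward to express; a minimal illustrative check (the function names are ours, not from any EP mining library):

```python
def is_jep(support_d1, support_d2):
    """Jumping emerging pattern for D1: non-zero support in D1 and
    zero support in D2."""
    return support_d1 > 0 and support_d2 == 0

def jeps(pattern_supports):
    """Filter a {pattern: (support_d1, support_d2)} mapping down to
    the jumping emerging patterns of D1."""
    return {p for p, (s1, s2) in pattern_supports.items() if is_jep(s1, s2)}
```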
6.3 TILTED-TIME WINDOW MODEL
In this section we evaluate the H-DISSparse algorithm using data streams modelled as
multiple batches of transactions. The H-DISSparse algorithm is evaluated with different
parameter settings for the highest refined approximate bound in discriminative itemsets in the
tilted-time window model. The data streams made of different numbers of transactions were
generated using the IBM synthetic data generator (Agrawal and Srikant 1994). The main dataset
is generated with 𝑆1 as 𝑇25: 𝐼10: 𝐷320𝐾 and 𝑆2 as 𝑇25: 𝐼15: 𝐷1600𝐾 limited to 1𝐾 unique
items. The data streams are modelled as 32 continuous batches of the same size (i.e., for the sake of clarity), with 𝑇25: 𝐼10: 𝐷10𝐾 and 𝑇25: 𝐼15: 𝐷50𝐾 belonging to the target data stream 𝑆1 and the general data stream 𝑆2, respectively. The ratio between the sizes of 𝑆1 and 𝑆2 is also the same for all 32 batches (i.e., 𝑛2/𝑛1 = 5). It should be noted that in real applications the input batches could have different sizes and characteristics, as data streams have different speeds and characteristics over time. In this section, strategies and principles are recommended for tuning the parameters based on the application domains and the data stream characteristics.
6.3.1 Evaluation on synthetic datasets
The scalability of H-DISSparse is presented with offline updating of the tilted-time window model after processing each batch of transactions. In this section, during all experiments, the discriminative level 𝜃 = 10 and the support threshold 𝜑 = 0.01%, and the scalability of the algorithm is tested with different relaxations of 𝛼. It is assumed that, while the new batch is being loaded with transactions, the H-DISStream updating can be done by processing the current batch of transactions. This works well as long as the algorithm is faster than the rate of the incoming data streams. Table 6.1 shows the number of discriminative itemsets in the tilted-time window model after processing each batch, considering 𝛼 = 1 (i.e., no sub-discriminative itemsets).
Table 6.1 The number of discriminative itemsets in the tilted-time window model
The numbers of discriminative itemsets in the batches (i.e., presented in 𝑊0) are different because of the distributions of the transactions. The embedded knowledge and the trends in data streams change over time with the concept drifts. In Table 6.1 the number of discriminative itemsets in the current window frame 𝑊0 changes dramatically when processing the batches 𝐵1 and 𝐵9, respectively. This strongly affects the algorithm's scalability, as in Figure 6.22 and Figure 6.24. The discriminative itemsets in the larger tilted window frames are usually sparser. The data streams in the larger window frames 𝑊𝑘 (i.e., 𝑘 > 0) have greater lengths, and the discriminative itemsets that appeared with high concept drifts are neutralized. For the sake of clarity, the scalability of the algorithm is first represented by the time and space complexities for processing the batch of transactions (i.e., not considering tilted-time window model updating), as in
Figure 6.22. We used the DISTree algorithm (Seyfi, Geva and Nayak 2014) and DISSparse
algorithm (Seyfi et al. 2017) for processing the batch of transactions as discussed in Chapter 4.
Figure 6.22 Scalability of batch processing not considering the tilted-time window model
updating
The variations in batch processing time and space are caused by the concept drifts in the transaction distribution of the batches. The variations are smaller in the DISSparse algorithm than in the DISTree algorithm. However, the DISSparse algorithm also has high time and space complexities for processing the batches 𝐵1 and 𝐵9, which have a high number of discriminative itemsets.
The tilted-time window model updating time for the algorithm is represented in Figure 6.23. The H-DISSparse algorithm mainly uses less time for the tilted-time window model updating than for batch processing. The fluctuations are mainly due to updating the tilted-time window with the different numbers of discriminative itemsets discovered in the batches. The high growths in the time usage of the algorithm are due to the wide tail pruning in the H-DISStream structure caused by high concept drifts in the old batches; for example, after processing 𝐵3 a large number of non-discriminative itemsets is pruned during the tail pruning process. These itemsets mainly appeared in the tilted-time window model after processing 𝐵1, with its high concept drifts, as in Table 6.1.
Figure 6.23 Tilted-time window model updating time complexity
We ran the full algorithm with the DISTree method for the batch processing, called H-DISTree. Obviously, the H-DISSparse algorithm is more efficient and scalable for batches of transactions with different characteristics. The full time complexity of the H-DISTree and H-DISSparse algorithms is represented in Figure 6.24. The H-DISTree time complexity is highly affected even by small concept drifts (e.g., the batch processing for 𝐵28 is completely out of the tolerable range). For the rest of the experiments in this section, the scalability of the H-DISSparse algorithm is represented by the full time complexity of mining discriminative itemsets in the tilted-time window model.
Figure 6.24 Time complexity of H-DISTree and H-DISSparse algorithms
The H-DISStream size, as the biggest data structure in the designed algorithms, is presented in Figure 6.25. Despite the batches with high concept drifts (e.g., batches of transactions with a large number of discriminative itemsets), the H-DISStream size tends to become stable, with very small growth, as a larger number of batches in the data streams is processed. The high growth in the size of H-DISStream caused by concept drifts is quickly neutralized by processing the new batches and applying tail pruning (e.g., the growth in H-DISStream size caused by 𝐵9 is neutralized by processing the next few batches). Following the compact logarithmic tilted-time window model and applying the tail pruning as in Corollary 4-3, the H-DISStream size stays small as an in-memory data structure. The periodic drops in the size of H-DISStream are caused by merging the tilted window frames. This can be seen more clearly in Figure 6.28 after processing 𝐵8, 𝐵16, 𝐵24, 𝐵27 and 𝐵32.
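The periodic merging behaviour can be illustrated with the classic logarithmic tilted-time window update (Giannella et al. 2003), where frames behave like a binary counter: two frames of the same capacity merge into one frame of double capacity. This is a simplified sketch per itemset count; the exact frame layout used in H-DISStream may differ:

```python
def push_batch(frames, count):
    """Insert the newest batch count into a logarithmic tilted-time
    window.  frames is newest-first; each entry is (capacity, total).
    Two frames of equal capacity merge into one frame of double
    capacity, like a carry in a binary counter, so only O(log n)
    frames are kept over the lifetime of the stream."""
    frames.insert(0, (1, count))
    i = 0
    while i + 1 < len(frames) and frames[i][0] == frames[i + 1][0]:
        cap, c1 = frames.pop(i)      # newer of the two equal frames
        _, c2 = frames[i]            # older frame absorbs it
        frames[i] = (cap * 2, c1 + c2)
    return frames
```

The cascading merges explain the periodic drops in structure size: at batch counts where many carries fire at once, several fine-grained frames collapse into a single coarse frame.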
Figure 6.25 H-DISStream structure size
6.3.1.1 Approximation in discriminative itemsets in the tilted-time window model
In this section the scalability of the H-DISSparse algorithm, as a highly accurate and
highly efficient method for mining discriminative itemsets using the tilted-time window model
with the highest refined approximate bound, is evaluated with different parameter settings. The
H-DISTree algorithm is not scalable for processing large data streams with the highest
approximate bound and cannot be evaluated in this section.
Three corollaries are defined in Chapter 4 for mining discriminative itemsets using the tilted-time window model with the highest refined approximate bound. The relaxation 𝛼 in Corollary 4-2 is set for the highest refined approximate bound on discriminative itemsets in the tilted-time window model. The H-DISSparse time usage and the H-DISStream size are represented in Figure 6.26 for different settings of the relaxation 𝛼 (i.e., 𝛼 = 1, 𝛼 = 0.9 and 𝛼 = 0.75). H-DISSparse scales well with the relaxation 𝛼 = 0.9, with an improvement in approximate discriminative itemsets compared to 𝛼 = 1. The H-DISSparse scalability with a smaller relaxation 𝛼 (e.g., 𝛼 = 0.75) is more sensitive to the concept drifts in the data streams, with higher variations in time and space complexity in Figure 6.26.
Figure 6.26 Scalability of H-DISSparse algorithm by relaxation of 𝛼 = 1, 𝛼 = 0.9 and
𝛼 = 0.75
Figure 6.27 shows the number of sub-discriminative itemsets with the relaxation settings 𝛼 = 0.9 and 𝛼 = 0.75, respectively. The sub-discriminative itemsets are considered overhead for the algorithm and can increase exponentially when a very small relaxation 𝛼 is set (e.g., with 𝛼 = 0.75 the average number of sub-discriminative itemsets in the batches is greater than 1 million).
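A plausible reading of the relaxation is that itemsets whose discriminative ratio falls between 𝛼 · 𝜃 and 𝜃 are kept as sub-discriminative overhead. The exact condition in Corollary 4-2 may differ, so this classification sketch is illustrative only:

```python
def classify(ratio, theta, alpha):
    """Classify an itemset by its discriminative ratio.  With the
    relaxation alpha in (0, 1], itemsets reaching alpha * theta but not
    theta are kept as sub-discriminative overhead; alpha = 1 keeps no
    sub-discriminative itemsets."""
    if ratio >= theta:
        return "discriminative"
    if ratio >= alpha * theta:
        return "sub-discriminative"
    return "pruned"
```

This makes the trade-off visible: a smaller 𝛼 widens the sub-discriminative band, improving the approximate bound at the cost of maintaining many extra itemsets.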
Figure 6.27 Number of sub-discriminative itemsets by relaxation of 𝛼 = 0.9 and 𝛼 = 0.75
Table 6.2 shows the number of discriminative itemsets in the tilted-time window model
by processing each batch considering 𝛼 = 0.9.
Table 6.2 The number of discriminative itemsets in the tilted-time window model
6.3.1.2 Discriminative itemsets in the tilted-time window model without tail pruning
The tail pruning defined in Corollary 4-3 is applied to the H-DISStream structure for space saving, by pruning the least potentially discriminative itemsets in the tilted-time window model. Corollary 4-1 is defined for obtaining the exact frequencies of itemsets from the time they have been maintained in the tilted-time window model. The exact frequencies of the non-discriminative subsets are obtained by traversing the FP-Tree through the Header-Table links for their appearances in the current batch of transactions. Figure 6.28 shows the scalability of the original H-DISSparse algorithm eliminating Corollary 4-1 and Corollary 4-3, respectively. Eliminating Corollary 4-1 adds more time complexity to a few batches (e.g., 𝐵3, 𝐵9, 𝐵11 and
𝐵17), mainly because wrong discriminative itemsets are added to the process. By eliminating Corollary 4-1, the H-DISStream size is much larger than in the original algorithm, as H-DISStream stores many wrong discriminative itemsets. This causes more tail pruning and a higher approximation in the discriminative itemsets in the tilted-time window model. The size of H-DISStream eliminating Corollary 4-1 is hidden under the H-DISStream size eliminating Corollary 4-3 in Figure 6.28.
Figure 6.28 Scalability of H-DISSparse algorithm by eliminating Corollary 4-1 and
Corollary 4-3
The H-DISSparse algorithm shows higher time complexity when eliminating Corollary 4-3, as a consequence of bigger data structures. However, at two points (i.e., after processing 𝐵3 and 𝐵11) the time complexity decreases, caused by eliminating the wide tail pruning of the discriminative itemsets that appeared with concept drifts. The H-DISStream size becomes much bigger over time. The periodic drops in H-DISStream size after processing 𝐵8, 𝐵16, 𝐵24, 𝐵27 and 𝐵32 are caused by merging the larger window frames in the tilted-time window model.
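Corollary 4-3 operates in the spirit of FP-Stream tail pruning (Giannella et al. 2003): the oldest window frames of an itemset are dropped while their counts stay below an error-bounded threshold. The exact condition in the thesis may differ; this is a simplified single-threshold sketch with hypothetical names:

```python
def tail_prune(frame_counts, frame_sizes, eps):
    """Drop the longest run of oldest frames (kept at the end of the
    lists) whose relative frequency stays below the error bound eps.
    Returns the pruned (counts, sizes) pair for one itemset."""
    keep = len(frame_counts)
    while keep > 0 and frame_counts[keep - 1] < eps * frame_sizes[keep - 1]:
        keep -= 1
    return frame_counts[:keep], frame_sizes[:keep]
```

An itemset whose frames are all pruned disappears from the structure entirely, which is what keeps H-DISStream small in memory at the cost of approximate counts for itemsets that later regain support.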
6.3.2 Evaluation on real datasets
To evaluate the proposed H-DISSparse algorithm on real applications, we ran the experiments on real datasets. The susy dataset from the UCI repository (Dheeru and Karra Taniskidou 2017), provided in (Fournier-Viger et al. 2016), was used. The selected dataset is dense (i.e., transactions have values for each attribute), with less sparsity compared to the synthetic market basket datasets. For this reason, we set the parameters to show the scalability at the best scales. The susy dataset contains high-level features derived by physicists to help discriminate between two classes, defined as signal and background. It is related to
particles detected using a particle accelerator, based on Monte Carlo simulations. This dataset is made of five million instances; the first column is the class label, followed by eighteen features. The transactions are made of about one hundred and ninety unique items. We selected 32 batches, each made of fifty thousand instances, for the scale of the experiments. The H-DISSparse algorithm is evaluated with different parameter settings for the highest refined approximate bound on discriminative itemsets in the tilted-time window model.
6.3.2.1 Scalability on datasets with less concept drift in the tilted-time window model
The scalability of H-DISSparse is presented with offline updating of the tilted-time window model after processing each batch of transactions. In this section, during all experiments, the discriminative level 𝜃 = 2 and the support threshold 𝜑 = 1%, and the scalability of the algorithm is tested with different relaxations of 𝛼. It is assumed that, while the new batch is being loaded with transactions, the H-DISStream updating can be done by processing the current batch of transactions. This works well as long as the algorithm is faster than the rate of the incoming data streams. Table 6.3 shows the number of discriminative itemsets in the tilted-time window model after processing each batch, considering 𝛼 = 1 (i.e., no sub-discriminative itemsets).
Table 6.3 The number of discriminative itemsets in the tilted-time window model
The numbers of discriminative itemsets in the batches (i.e., presented in 𝑊0) are different because of the distributions of the transactions. The embedded knowledge and the trends in these data streams do not change much over time through concept drifts. In Table 6.3 the number of discriminative itemsets in the current window frame 𝑊0 does not change greatly. The discriminative itemsets in the larger tilted window frames are in almost the same numbers as in the smaller tilted window frames. In the susy dataset the discriminative itemsets in the adjacent tilted window frames are similar, and the same discriminative itemsets are discovered in the larger merged tilted window frames. The discriminative itemsets that appeared in adjacent smaller frames are
merged as new discriminative itemsets in the larger window frames. The data streams in the larger window frames 𝑊𝑘 (i.e., 𝑘 > 0) have greater lengths.
For the sake of clarity, the scalability of the algorithm is first represented by time and
space complexities for processing the batch of transactions (i.e., not considering tilted-time
window model updating) as in Figure 6.29. We used the DISTree algorithm (Seyfi, Geva and
Nayak 2014) and DISSparse algorithm (Seyfi et al. 2017) for processing the batch of transactions
as discussed in Chapter 4.
Figure 6.29 Scalability of batch processing not considering the tilted-time window model
updating
The small variations in batch processing time and space are caused by the differences in the number of discriminative itemsets in the batches. The algorithm mainly uses less time for the tilted-time window model updating than for the batch processing. In the tilted-time window updating there are no fluctuations, mainly because of the similar numbers of discriminative itemsets discovered in the continuous batches. There is no wide tail pruning, in contrast to the experiments with the synthetic datasets, as this real dataset does not have high concept drifts in the continuous batches.
We ran the full algorithm with the DISTree method for the batch processing, called H-DISTree. Obviously, the H-DISSparse algorithm is more efficient and scalable for batches of transactions with different characteristics. The full time complexity of the H-DISTree and H-DISSparse algorithms is represented in Figure 6.30. The H-DISTree time complexity is affected even by small concept drifts. For the rest of the experiments in this section, the scalability of the H-DISSparse algorithm is represented by the full time complexity of mining discriminative itemsets in the tilted-time window model.
Figure 6.30 Time complexity of H-DISTree and H-DISSparse algorithms
The H-DISStream size, as the biggest data structure in the designed algorithms, is presented in Figure 6.31. Following the compact logarithmic tilted-time window model and applying the tail pruning as in Corollary 4-3, the H-DISStream size stays small as an in-memory data structure. The periodic drops in the size of H-DISStream are caused by merging the tilted window frames. This can be seen clearly after processing 𝐵8, 𝐵16, 𝐵24, 𝐵27 and 𝐵32.
Figure 6.31 H-DISStream structure size
6.3.2.2 Approximation in discriminative itemsets in the tilted-time window model
In this section the scalability of the H-DISSparse algorithm, as a highly accurate and
highly efficient method for mining discriminative itemsets using the tilted-time window model
with the highest refined approximate bound, is evaluated with different parameter settings. The
H-DISTree algorithm is not scalable for processing large data streams with the highest
approximate bound and cannot be evaluated in this section.
Three corollaries are defined in Chapter 4 for mining discriminative itemsets using the tilted-time window model with the highest refined approximate bound. The relaxation 𝛼 in Corollary 4-2 is set for the highest refined approximate bound on discriminative itemsets in the tilted-time window model. The H-DISSparse time usage and the H-DISStream size are represented in Figure 6.32 for different settings of the relaxation 𝛼 (i.e., 𝛼 = 1 and 𝛼 = 0.9). The H-DISSparse time and space usage roughly doubles with the relaxation 𝛼 = 0.9, with an improvement in approximate discriminative itemsets compared to 𝛼 = 1.
Figure 6.32 Scalability of H-DISSparse algorithm by relaxation of 𝛼 = 1 and 𝛼 = 0.9
Figure 6.33 shows the number of sub-discriminative itemsets with the relaxation setting 𝛼 = 0.9. The sub-discriminative itemsets are considered overhead for the algorithm and can increase exponentially when a very small relaxation 𝛼 is set.
Figure 6.33 Number of sub-discriminative itemsets by relaxation of 𝛼 = 0.9
Table 6.4 shows the number of discriminative itemsets in the tilted-time window model
by processing each batch considering 𝛼 = 0.9.
Table 6.4 The number of discriminative itemsets in the tilted-time window model
The improvements in the algorithm's accuracy and recall can be seen by comparing Table 6.3 and Table 6.4. In the susy dataset the discriminative itemsets in the adjacent tilted window frames are similar, and by using a smaller relaxation 𝛼, more discriminative itemsets are discovered in the merged tilted window frames.
6.3.3 Discussion
The general outcome from the experiments on the H-DISSparse algorithm discussed in the previous sub-sections is explained here. The H-DISSparse algorithm with the defined corollaries exhibits efficient time and space complexity for mining discriminative itemsets using the tilted-time
window model. The highest refined approximate bound on discriminative itemsets in the tilted-time window model is obtained efficiently, based on Corollary 4-1, Corollary 4-2 and Corollary 4-3, with the smaller relaxation 𝛼. Setting the relaxation 𝛼 together with the other parameters (i.e., the support threshold 𝜑, the discriminative level 𝜃 and the input batch size 𝑛1) is very important in real applications. A proper size has to be chosen for the current window frame (i.e., 𝑊0) so that the tilted-time window model is updated at reasonable time intervals. This is highly dependent on the application and domain experts, considering the limited computing and storage capabilities and the approximate bound on the false positive discriminative itemsets.
The changes in trend caused by the concept drifts in the batches are neutralized quickly in the tilted-time window model, and the in-memory H-DISStream structure is maintained efficiently during the lifetime of the data streams. The in-memory H-DISStream without tail pruning (i.e., Corollary 4-3) is not efficient, even considering a compact logarithmic tilted-time window model. The FP-Tree structure made of the transactions in one batch, considering all items, is used efficiently in the H-DISSparse algorithm. It stays an efficient in-memory structure and does not affect the algorithm's scalability, as in the batch processing the infrequent items are pruned in the conditional FP-Tree. The H-DISTree algorithm is more sensitive to the concept drifts and is not efficient even without considering sub-discriminative itemsets (i.e., relaxation 𝛼 = 1). The main part of the time and space complexity in the algorithms is related to the batch processing, although the tail pruning of the non-discriminative itemsets that appeared in the old batches can add complexity to the algorithms as well.
6.4 SLIDING WINDOW MODEL
In this section we evaluate the S-DISSparse algorithm using data streams modelled as
multiple batches of transactions. The S-DISSparse algorithm is evaluated with different datasets
and parameter settings for mining exact discriminative itemsets in the sliding window model by
the offline updating state. The S-DISSparse algorithm is evaluated with different parameter
settings for the highest refined approximate bound in discriminative itemsets in the sliding
window model by the online updating state. The data streams made of different numbers of
transactions were generated using the IBM synthetic data generator (Agrawal and Srikant 1994).
The first dataset, called 𝐷1, is generated with 𝑆1 as 𝑇25: 𝐼10: 𝐷60𝐾 and 𝑆2 as 𝑇25: 𝐼15: 𝐷300𝐾, limited to 10𝐾 unique items. To show the unsynchronized behaviour of the data streams in the online sliding window model, we wrote a simple program to mix the two data streams together based on their size ratio. The data streams in 𝐷1 are modelled as 30 continuous batches of the same size (i.e., for the sake of clarity), with 𝑇25: 𝐼10: 𝐷2𝐾 and 𝑇25: 𝐼15: 𝐷10𝐾 belonging to the target data stream 𝑆1 and the general data stream 𝑆2, respectively. The ratio between the sizes of 𝑆1 and 𝑆2 in 𝐷1 is the same for all 30 batches (i.e., 𝑛2/𝑛1 = 5).
In the designed experiment we consider that the data streams have the same speeds in the specified time periods and that the batches contain an equal number of transactions, for the sake of simplicity. It should be noted that in real applications the input batches could have different sizes and characteristics, as the data streams have different speeds and characteristics over time. The second dataset, called 𝐷2, is generated with 𝑆1 as 𝑇25: 𝐼10: 𝐷90𝐾 and 𝑆2 as 𝑇25: 𝐼15: 𝐷450𝐾, limited to 10𝐾 unique items. The data streams in 𝐷2 are modelled as 18 continuous batches of the same size, with 𝑇25: 𝐼10: 𝐷5𝐾 and 𝑇25: 𝐼15: 𝐷25𝐾 belonging to the target data stream 𝑆1 and the general data stream 𝑆2, respectively. The ratio between the sizes of 𝑆1 and 𝑆2 in 𝐷2 is the same for all 18 batches (i.e., 𝑛2/𝑛1 = 5). In this section, strategies and principles are recommended for tuning the parameters based on the application domains and the data stream characteristics.
6.4.1 Evaluation on synthetic datasets
The scalability of the S-DISSparse algorithm is presented within the offline and online sliding window models. The time complexity of S-DISSparse is presented by transaction processing in the online state and by full-size partition processing in the offline state. In the proposed S-DISSparse algorithm, each batch of input transactions is set as a partition in the sliding window frame 𝑊. The scalability of S-DISSparse is presented within offline sliding of the window frame 𝑊 by the recent batch of transactions, and within online sliding of the window frame 𝑊 by every recent transaction. In this section, during all experiments, the discriminative level 𝜃 = 25 and the support threshold 𝜑 = 0.002%. In the experiments with 𝐷1, the 25 recent partitions are fitted in the sliding window frame 𝑊 (i.e., 𝑊 = 25). It is assumed that, while the new batch is being loaded with transactions, the S-DISStream updating can be done by processing the current batch of transactions. This works well as long as the algorithm is faster than the rate of the incoming data streams.
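The offline sliding step described above (add the recent partition, delete the oldest once the frame is full) can be sketched with a bounded deque; the class and names are illustrative, not the S-DISSparse data structures:

```python
from collections import deque

class SlidingWindow:
    """Partition-based sliding window frame W: each new batch of
    transactions becomes one partition; once W partitions are held,
    every slide adds the recent partition and deletes the oldest."""
    def __init__(self, w):
        self.partitions = deque(maxlen=w)

    def slide(self, batch):
        evicted = None
        if len(self.partitions) == self.partitions.maxlen:
            evicted = self.partitions[0]   # oldest partition leaves W
        self.partitions.append(batch)      # recent partition enters W
        return evicted
```

Returning the evicted partition mirrors the offline update: the mined structures must subtract the oldest partition's contribution while adding the recent partition's, which is why slides that both add and delete (𝑃26 to 𝑃30 below) cost more than slides that only add.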
6.4.1.1 Offline sliding discriminative itemsets
The time complexity of the S-DISSparse method in the offline updating sliding window model, by each partition in 𝐷1, is presented in Figure 6.34. The sliding window frame 𝑊 is initialized with the first 20 partitions in 𝐷1, for the sake of clarity. This is recommended, as a large number of discriminative itemsets may be discovered by processing the initial partitions with small data stream lengths in the sliding window frame 𝑊. The sliding window model is then updated in an offline state by every new partition that is fitted in the window frame 𝑊. The scalability of the S-DISSparse algorithm is compared with the DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3 for mining discriminative itemsets in a batch of transactions. The DISSparse algorithm is used as a benchmark by processing the full-size window frame 𝑊 after each offline window model sliding.
Figure 6.34 S-DISSparse and DISSparse time complexity for 𝐷1 (window frame 𝑊 = 25)
The efficiency of S-DISSparse is generally better than that of DISSparse in the offline sliding window model. The difference is clearest during the offline slide at partition 𝑃23 (i.e., 450 seconds for the DISSparse method). During the slide at this partition, the window frame 𝑊 is updated only by adding the most recent partition. Offline sliding at partitions 𝑃26 to 𝑃30 using S-DISSparse is less efficient than at partitions 𝑃21 to 𝑃25, because during these slides the window frame 𝑊 is updated both by adding the most recent partition and by deleting the oldest one.
For some partitions, offline window sliding with S-DISSparse shows no efficiency gain over DISSparse; for example, the offline slides at partitions 𝑃26 and 𝑃30 show similar time usage for both algorithms. At these points S-DISSparse scales the same as DISSparse, since most of the transactions in the sliding window model are updated during these slides. The variation in the time complexity of the algorithms across partitions is caused by concept drifts in the transaction distributions of the different batches.
The space complexity of the S-DISSparse algorithm in the offline updating sliding window model, measured per partition, is presented in Figure 6.35. The space usage of S-DISStream and the S-FP-Tree is reported during offline sliding, as these are the largest data structures used by the S-DISSparse algorithm. The space used by batch processing itself (i.e., during each offline slide) is not monitored, as it is much smaller than the size of these structures. The S-FP-Tree and S-DISStream are held in main memory throughout the sliding window model. The time for building the S-FP-Tree is small and is included in the total processing time.
Figure 6.35 S-DISSparse space complexity in offline sliding for 𝐷1 (𝑊 = 25)
The S-DISStream structure is the largest data structure in the designed algorithm. Despite batches with high concept drift (e.g., a batch of transactions that adds a large number of discriminative itemsets to the sliding window frame 𝑊), the S-DISStream size tends to stabilize as the window slides over a larger number of batches. Owing to its compact prefix-tree structure and the applied tail pruning, S-DISStream stays small as an in-memory data structure. The number of discriminative itemsets discovered in the sliding window frame 𝑊 in these experiments is between 2.5 and 3 million (see Figure 6.41), across the different transaction distributions observed while the window slides.
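The idea of tail pruning on a prefix tree can be sketched as below. This is a hypothetical simplification: the thesis's S-DISStream applies a comparable rule to bound its in-memory size, while the exact pruning conditions are defined elsewhere in the thesis.

```python
def tail_prune(node, min_count):
    """Recursively drop prefix-tree entries whose counts fell below min_count.

    A node with surviving children is kept even when its own count is low,
    so shared internal prefixes stay intact; only low-count tails vanish.
    Each node is a dict: {"count": int, "children": {item: node}}.
    Returns True when the node should be kept by its parent.
    """
    node["children"] = {
        item: child for item, child in node["children"].items()
        if tail_prune(child, min_count)
    }
    return bool(node["children"]) or node.get("count", 0) >= min_count
```

Pruning only at the tails is what lets the structure shrink without losing the prefixes that frequent itemsets still share.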
To show the effect of a different window size 𝑊, we ran the S-DISSparse algorithm on the second dataset 𝐷2, which has larger partitions. The sliding window frame 𝑊 is set to the same total number of transactions as in the first experiment (i.e., 𝑊 = 10). Following the first experiment, and for clarity, the sliding window frame 𝑊 is initialized with the first 8 partitions of 𝐷2. The time and space complexity of the S-DISSparse method in the offline updating sliding window model, measured per partition of 𝐷2, is presented in Figure 6.36.
Figure 6.36 S-DISSparse and DISSparse time and space complexity for 𝐷2 (window frame
𝑊 = 10)
In this setting the S-DISSparse algorithm shows no efficiency gain over DISSparse in the offline sliding window model, because the larger partitions update a greater share of the transactions in the sliding window frame 𝑊. This must be considered when choosing a suitable sliding window frame size for the application domain and data stream characteristics: the sliding window frame should be set much larger than the average size of a single partition.
We have conducted experiments with several datasets and with data streams of different characteristics. The DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3 is highly efficient and scales well when the sliding window frame 𝑊 is small relative to the average size of a single partition. S-DISSparse scales well when the sliding window frame 𝑊 is much larger than the average partition and when a large number of discriminative itemsets with small supports and ratios are of interest. S-DISSparse is evaluated for mining discriminative itemsets using the online sliding window model in the section below.
6.4.1.2 Online sliding discriminative itemsets
The scalability of the algorithm is tested with different relaxations 𝛼 in the online sliding of the window frame 𝑊. The time complexity of the S-DISSparse method under offline and online processing is presented in Figure 6.37. Online sliding does not add high time complexity to the S-DISSparse algorithm, and it even decreases the overall time complexity for some partitions; for example, the time complexity of the online sliding window model is lower than that of the offline sliding window model at partition 𝑃21. The non-discriminative subsets (of discriminative itemsets) in S-DISStream are updated during online sliding, which reduces the time S-DISSparse later spends tuning the frequencies of these subsets in S-DISStream.
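The role of the relaxation 𝛼 can be sketched as a three-way classification. This is a hypothetical helper under the assumption, consistent with the experiments, that sub-discriminative itemsets are those whose ratio clears 𝛼·𝜃 but not 𝜃; with 𝛼 = 1 the middle band is empty and no sub-discriminative itemsets are kept.

```python
def classify(support_ratio, theta=25, alpha=0.9):
    """Classify an itemset's frequency ratio against the discriminative
    level theta and its relaxation alpha (sketch; defaults follow the
    D1 experiments). Smaller alpha keeps more sub-discriminative
    itemsets in S-DISStream to improve the online approximation."""
    if support_ratio >= theta:
        return "discriminative"
    if support_ratio >= alpha * theta:
        return "sub-discriminative"
    return "non-discriminative"
```

For instance, with 𝜃 = 25 and 𝛼 = 0.9, a ratio of 23 falls in the sub-discriminative band (23 ≥ 22.5), whereas with 𝛼 = 1 it would be discarded.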
Figure 6.37 S-DISSparse time complexity in online and offline sliding for 𝐷1 (𝑊 = 25)
To improve the approximation of the discriminative itemsets in the online updating sliding window model, we set the relaxation to 𝛼 = 0.9 and 𝛼 = 0.75, respectively. The S-DISSparse time complexity during offline sliding, online sliding with 𝛼 = 1 (i.e., no sub-discriminative itemsets), and online sliding with 𝛼 = 0.9 and 𝛼 = 0.75 is shown in Figure 6.38. With 𝛼 = 0.75, the S-DISSparse time complexity is more sensitive to concept drifts in the data streams; for example, it increases with higher variation at the window slides for partitions 𝑃22 and 𝑃26.
Figure 6.38 S-DISSparse time complexity for online and offline sliding for 𝐷1 (𝑊 = 25)
with different relaxation of 𝛼
The space usage of S-DISStream during offline sliding with 𝛼 = 1, 𝛼 = 0.9 and 𝛼 = 0.75 in the S-DISSparse algorithm is shown in Figure 6.39. The size of the S-FP-Tree structure is not affected by the relaxation 𝛼, since the S-FP-Tree holds all items (including items that are infrequent in the sliding window frame 𝑊). The S-DISStream size increases with smaller relaxations 𝛼, with small variations caused by concept drifts in the data streams as the window frame 𝑊 slides. A greater number of sub-discriminative itemsets must be saved in the S-DISStream structure so that they can improve the approximation of discriminative itemsets in the online sliding window model. Note that the Transaction-List structure used in the online sliding window model is small compared to the S-FP-Tree and S-DISStream and is not presented, for the sake of clarity.
Figure 6.39 S-DISStream size for 𝐷1 (𝑊 = 25) by different relaxation of 𝛼
The online sliding window model is used for more up-to-date and accurate online answers. The number of itemsets whose tag changes during online sliding (e.g., from discriminative to non-discriminative or vice versa) is presented in Figure 6.40. This number increases with smaller relaxations 𝛼 (i.e., 𝛼 = 0.9 and 𝛼 = 0.75). However, choosing a smaller relaxation 𝛼 does not always improve the approximation of discriminative itemsets in the online sliding window model: depending on the concept drifts in the data streams, new discriminative itemsets may be discovered in the sliding window frame 𝑊 that do not yet exist in S-DISStream. A reasonably small relaxation 𝛼, chosen on the basis of the application domain and data stream characteristics, therefore gives the better approximation of discriminative itemsets in the online sliding window model.
Figure 6.40 Number of itemsets whose tag changed for 𝐷1 (𝑊 = 25) by different relaxation of 𝛼
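Counting the itemsets whose tag flips between two window states can be sketched as follows. This is a hypothetical helper for illustration; the thesis tracks these changes inside its own structures rather than in plain dictionaries.

```python
def count_tag_changes(before, after):
    """Count itemsets whose discriminative tag flipped during an online
    slide, in either direction. before/after map each itemset to True
    when it is tagged discriminative; only itemsets present in both
    states are compared (sketch, not the thesis code)."""
    return sum(1 for itemset in before.keys() & after.keys()
               if before[itemset] != after[itemset])
```

This is the quantity plotted per partition in the tag-change experiments; newly appearing or pruned itemsets would need separate handling.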
The exact number of discriminative itemsets after each offline slide of the window model is shown in Figure 6.41, together with the number of sub-discriminative itemsets for different relaxations 𝛼 (i.e., 𝛼 = 0.9 and 𝛼 = 0.75). This also shows that a smaller relaxation 𝛼 may not always improve the approximation of discriminative itemsets in the online sliding window model. As in Figure 6.41, with the smaller relaxation 𝛼 = 0.75 the number of sub-discriminative itemsets grows even larger than the number of discriminative itemsets, which adds overhead and hurts the algorithm's scalability.
Figure 6.41 Number of discriminative and sub-discriminative itemsets for 𝐷1 (𝑊 = 25) by
different relaxation of 𝛼
6.4.2 Evaluation on real datasets
To evaluate the proposed S-DISSparse algorithm on real applications, we ran experiments on real datasets. The susy dataset from the UCI repository (Dheeru and Karra Taniskidou 2017), as provided in (Fournier-Viger et al. 2016), was used. This dataset is dense (i.e., every transaction has a value for each attribute), with less sparsity than the synthetic market basket datasets; for this reason we set the parameters to show scalability at the most favourable scales. The susy dataset consists of high-level physics features designed to help discriminate between two classes, signal and background, for particles detected with a particle accelerator, based on Monte Carlo simulations. This dataset is
made of five million instances; the first column is the class label, followed by eighteen features. The transactions contain about one hundred and ninety unique items. For the scale of these experiments, we selected 25 batches of twenty thousand instances each. The S-DISSparse algorithm is evaluated with different parameter settings for mining discriminative itemsets in the offline and online sliding window models. Throughout the experiments in this section, the discriminative level is 𝜃 = 2.5 and the support threshold is 𝜑 = 1%. In the experiments with the susy dataset, the 20 most recent partitions fit in the sliding window frame 𝑊 (i.e., 𝑊 = 20).
The time complexity of the S-DISSparse method in the offline updating sliding window model, measured per partition of the susy dataset, is presented in Figure 6.42. For clarity, the sliding window frame 𝑊 is initialized with the first 15 partitions of the susy dataset. This is recommended because a large number of discriminative itemsets may be discovered when processing the initial partitions, while the data stream lengths in the sliding window frame 𝑊 are still small. The sliding window model is then updated in an offline state by every new partition that enters the window frame 𝑊. The scalability of the S-DISSparse algorithm is compared with the DISSparse algorithm (Seyfi et al. 2017) proposed in Chapter 3 for mining discriminative itemsets in a batch of transactions. DISSparse is used as a benchmark by processing the full-size window frame 𝑊 after each offline slide of the window model.
Figure 6.42 S-DISSparse and DISSparse time complexity for susy dataset (window frame
𝑊 = 20)
On the susy dataset, the efficiency of S-DISSparse is not better than that of DISSparse in the offline sliding window model. This is mainly because of the small number of unique items in the transactions (i.e., 190 unique items in the susy dataset). The sparsity of the dataset is limited, so the subsets are updated by every new batch entering, or old batch leaving, the sliding window model.
The space complexity of the S-DISSparse algorithm in the offline updating sliding window model, measured per partition, is presented in Figure 6.43. The space usage of S-DISStream and the S-FP-Tree is reported during offline sliding, as these are the largest data structures used by the S-DISSparse algorithm. The space used by batch processing itself (i.e., during each offline slide) is not monitored, as it is much smaller than the size of these structures.
The S-FP-Tree and S-DISStream structures are held in main memory throughout the sliding window model.
Figure 6.43 S-DISSparse space complexity in offline sliding for susy dataset (𝑊 = 20)
S-DISSparse is also evaluated for mining discriminative itemsets using the online sliding window model. The scalability of the algorithm is tested with different relaxations 𝛼 in the online sliding of the window frame 𝑊; online sliding does not add high time complexity to the S-DISSparse algorithm. To improve the approximation of the discriminative itemsets in the online updating sliding window model, we set the relaxation to 𝛼 = 0.9. The S-DISSparse time complexity during offline sliding, online sliding with 𝛼 = 1 (i.e., no sub-discriminative itemsets) and online sliding with 𝛼 = 0.9 is shown in Figure 6.44.
Figure 6.44 S-DISSparse time complexity in online and offline sliding for susy dataset
(𝑊 = 20)
The space usage of S-DISStream during offline sliding with 𝛼 = 1 and 𝛼 = 0.9 in the S-DISSparse algorithm is shown in Figure 6.45. The size of the S-FP-Tree structure is not affected by the relaxation 𝛼, since the S-FP-Tree holds all items (including items that are infrequent in the sliding window frame 𝑊). The S-DISStream size increases slightly with the smaller relaxation 𝛼 as the window frame 𝑊 slides. A greater number of sub-discriminative itemsets must be saved in the S-DISStream structure so that they can improve the approximation of discriminative itemsets in the online sliding window model. Note that the Transaction-List structure used in the online sliding window model is small compared to the S-FP-Tree and S-DISStream and is not presented, for the sake of clarity.
Figure 6.45 S-DISStream size for susy dataset (𝑊 = 20) by different relaxation of 𝛼
The number of itemsets whose tag changes during online sliding (e.g., from discriminative to non-discriminative or vice versa) is presented in Figure 6.46. This number increases with the smaller relaxation 𝛼 (i.e., 𝛼 = 0.9).
Figure 6.46 Number of itemsets whose tag changed for susy dataset (𝑊 = 20) by different relaxation of 𝛼
The exact number of discriminative itemsets after each offline slide of the window model is shown in Figure 6.47, together with the number of sub-discriminative itemsets for the relaxation 𝛼 = 0.9.
Figure 6.47 Number of discriminative and sub-discriminative itemsets for susy dataset
(𝑊 = 20) by different relaxation of 𝛼
In the section below we discuss the outcomes of the evaluation of the proposed S-DISSparse algorithm.
6.4.3 Discussion
The general outcomes of the experiments on the S-DISSparse algorithm discussed in the previous sub-sections are summarized here. In this section we evaluated the proposed single-pass algorithm for mining discriminative itemsets in data streams using the sliding window model. The algorithm uses three in-memory data structures, S-FP-Tree, S-DISStream and Transaction-List, for offline and online sliding respectively. The discriminative itemsets are discovered in the offline state, and the process continues in the online state for each new transaction. Online sliding by new transactions, combined with periodic offline sliding by new partitions, results in efficient memory consumption and time complexity, with good approximation for large and fast-growing data streams. The number of discriminative itemsets generated is significantly smaller than the number of frequent itemsets, which makes discriminative itemsets more useful for data stream discrimination.
By setting the relaxation 𝛼 to smaller values, the algorithm also saves the sub-discriminative itemsets, at the cost of increased time and space complexity. Sub-discriminative itemsets have the potential to become discriminative in the full-size sliding window model, so a smaller relaxation 𝛼 yields a better approximation in the online sliding window frame. The relaxation 𝛼 should be set to a value at which the increase in time and space complexity remains acceptable for the data stream application domain.
Based on the reported experiments, the S-DISStream size may grow over the history of the data streams. In the batch processing experiments of Section 6.2 we discussed in detail how to set optimal batch sizes. This is very important for algorithms based on the sliding window model, as the exact discriminative itemsets should be updated and reported within reasonable time periods. In our experiments we used a homogeneous set of batches in the data streams; in real applications, data streams have varying speeds and distributions over time, resulting in batches of different sizes and complexities. This must be considered when defining batch sizes and parameter settings. S-DISSparse is an efficient method for mining discriminative itemsets using the sliding window model in large datasets with high complexity. The discriminative itemset mining process is highly dependent on the type of dataset and on how the itemsets are distributed in the streams.
The sliding window model is updated in both offline and online states. The usability of online sliding depends strongly on the concept drifts in the input data streams and on their input rate. Considering the incoming partitions, the window frame is updated in the online state with high performance if adjacent partitions exhibit little concept drift. In data streams with high concept drift, the sliding window frame is mainly updated through the offline sliding state, and the online sliding state adds overhead to the mining process. The sliding window frame size, as well as the number and size of partitions, can be defined based on the specific characteristics of the input datasets and on domain expert knowledge.
6.5 CHAPTER SUMMARY
The experiments conducted in this chapter show that mining discriminative itemsets is practical for large and fast-growing data streams. Large numbers of discriminative itemsets with small minimum supports and frequency ratios can be discovered efficiently.
The DISTree and DISSparse algorithms were proposed for mining discriminative itemsets in a batch of transactions from data streams. These algorithms perform well when discriminative itemsets with higher minimum supports and frequency ratios are of interest. The DISSparse algorithm remains highly efficient when mining large numbers of discriminative itemsets with smaller minimum supports and frequency ratios, whereas the DISTree algorithm fails on larger, more complex datasets when an exponential number of discriminative itemsets must be discovered under the chosen parameter settings. The FP-Tree is the data structure used in both algorithms for holding the input transactions; it may become large and cause overhead when processing a large, complex batch of transactions. The time and space complexity of both algorithms varies across batches with the distribution of transactions, although DISSparse is less affected than DISTree. The efficiency of the DISSparse algorithm is closely tied to the proposed heuristics for mining discriminative itemsets from the potential discriminative subsets.
The most important factor in the proposed batch processing algorithms is the size of each batch of transactions. The proposed algorithms are used for offline updating of the different window models, and in data stream applications the update intervals for the output results must not incur high delay. Both algorithms have full accuracy and recall when mining the discriminative itemsets in a batch of transactions, whereas the offline updating of a window model can introduce approximation in the discriminative itemsets. The approximation can be improved by setting a larger batch size for the offline updating of the window models; this must be weighed against the desired outputs and the algorithms' complexity within the available memory and CPU limits.
The H-DISSparse algorithm was proposed for mining discriminative itemsets in data streams using the tilted-time window model. H-DISSparse uses the efficient DISSparse method for offline batch processing and can efficiently handle large and fast-growing data streams. The tilted-time updating window model in H-DISSparse keeps the window model small, owing to its compact logarithmic structure. The effects of high concept drift in the data streams are neutralized in the tilted-time window model by merging the smaller window frames into the larger window frames. The H-DISStream in-memory structure stays small over the history of the data streams by applying the proposed tail pruning technique. A larger number of sub-discriminative itemsets improves the approximation when mining discriminative itemsets using the tilted-time window model. However, adding more sub-discriminative itemsets to the tilted-time window model may add high overheads, both in batch processing and in tilted-time window updating.
The S-DISSparse algorithm was proposed for mining discriminative itemsets in data streams using the sliding window model. S-DISSparse uses the DISSparse method with new heuristics for the offline sliding window model. These heuristics, based on the recently updated subsets, are efficient when the sliding window frame is much larger than each partition entering or leaving it. The S-DISStream in-memory data structure stays small as the window slides. The online sliding window model reports the updated discriminative itemsets and the changes in itemset tags. A larger number of sub-discriminative itemsets improves the approximation when mining discriminative itemsets using the online window model. However, adding more sub-discriminative itemsets to the sliding window model may add high overheads to the offline sliding part, and a larger number of sub-discriminative itemsets may not improve the approximation in the online sliding window model, since recently emerged discriminative itemsets may not yet exist in the S-DISStream data structure.
An interesting characteristic of mining discriminative itemsets in data streams is that a large number of discriminative itemsets can be discovered efficiently. These itemsets can be used in different real-world applications such as classification, optimization and decision making. The biggest challenge is to find the best parameter settings given the CPU and memory limits and the desired outputs. The principles proposed in this chapter, together with the knowledge of application domain experts, can be used to set parameters for the different window models. Concept drifts in the data streams must be taken into account when setting the smaller parameters.
In this chapter, data streams were generated and modelled using a synthetic data generator, and real data streams were modelled using datasets obtained from the UCI repository. Other real data streams may pose extra challenges and impose additional limits on the experiments, depending on the application domain. In the next chapter, we conclude the thesis on mining discriminative itemsets in data streams using different window models and discuss future work.
Chapter 7: Conclusions Page 178
© 2018 Queensland University of Technology-QUT, Science and Engineering Faculty Page 178
Chapter 7: Conclusions
Pattern mining is one of the most interesting research topics in data mining. In data streams, the patterns in the target datasets are analysed over specified time periods. We extended frequent itemset mining techniques to the new area of discriminative itemset mining. Discriminative itemsets show the distinguishing features of the target data stream in comparison to the general trends in the other data streams, and their number is generally much smaller than the number of frequent itemsets. Discriminative itemsets aim to characterize the target data stream distinctly from the rest of the streams in the collection. In this thesis, we proposed four algorithms for finding discriminative itemsets in data streams based on two different window models. We defined several in-memory data structures with effective pruning processes and heuristics for saving time and space; all structures generated and used during the mining process are designed to consume minimal time and space. The proposed methods have been extensively evaluated with datasets exhibiting distinct characteristics, and the data structures generated during the process fit in main memory. The results confirm that mining discriminative itemsets is realistic in fast-growing data streams.
First, we carried out extensive research on the importance of discriminative itemsets in data streams. We surveyed the different algorithms proposed for frequent itemset mining in data streams, followed by contrast data mining methods and association classification methods. We then defined the problem of mining discriminative itemsets in data streams: itemsets that are frequent in one data stream and whose frequencies in that stream are much higher than in the other data streams. To address this, we proposed a simple algorithm for mining discriminative itemsets in a single batch of the data streams, followed by a heuristic-based algorithm for mining them efficiently. We then extended the method to mining discriminative itemsets in data streams under the tilted-time window model, in which the discriminative itemsets discovered in each batch of transactions are merged periodically. The scalability of the proposed methods was analysed with different datasets and parameter settings, showing acceptable time and space usage on fast, large data streams, especially for the heuristic-based method.
Following the proposed methods for mining discriminative itemsets in a single batch of transactions, we proposed a further algorithm for mining discriminative itemsets in data streams over the sliding window model. The proposed method was analysed using different datasets and multiple parameter settings. Discriminative itemset mining poses more challenges than frequent itemset mining, especially in the sliding window model: during window frame sliding, the algorithms must handle the combinatorial explosion of itemsets in the data streams entering and leaving the window frame. Novel in-memory data structures are defined for processing discriminative itemsets in a combined offline and online sliding window model. We used offline processing to control the generation of non-potential itemsets, and online processing for online monitoring of the data streams. The empirical analysis shows that the proposed algorithm achieves efficient time and space complexity on online data streams growing at fast speed. In the future, we plan to develop methods for discovering discriminative rules using discriminative itemsets, with the aim of proposing a classifier focused on the distinguishing features of data streams.
This chapter concludes the thesis. First, the contributions of the thesis are summarized. Then, the findings drawn from the thesis are described. Finally, the limitations of the current work and directions for future work are presented. We will pursue the research problem in future by defining discriminative rules, which will be used to propose a novel classification technique called discriminative classification.
7.1 SUMMARY OF CONTRIBUTIONS
This thesis integrates the concept of discriminative itemsets with data streams. It extends the research in contrast data mining to discriminative itemset mining. The proposed methods focus on overcoming the weaknesses of the existing state-of-the-art methods for discriminative itemset mining in three types of tasks. The thesis also shows the importance of optimized parameter settings in these three types of tasks. The research shortcomings are as follows:
• Lack of research in contrast data mining for discriminative itemset mining
• Lack of a method for discriminative itemset mining in data streams
• Lack of an efficient method for mining discriminative itemsets in the tilted-time window model
• Lack of an efficient method for mining discriminative itemsets in the sliding window model
• Lack of an efficient discriminative itemset mining method for different window models, optimized for the general and specific characteristics of the target data stream trends compared to the other data streams
The above-mentioned shortcomings are overcome by:
• Carrying out extensive research on contrast data mining for discriminative itemset mining
• Employing a simple method of discriminative itemset mining
• Employing an advanced heuristic-based method for discriminative itemset mining
• Employing an efficient method for discriminative itemset mining in data streams using the tilted-time window model
• Employing an efficient method for discriminative itemset mining in data streams using the sliding window model
• Extensive evaluation of the proposed methods using different datasets and parameter settings
The main contributions of this thesis are summarised below:
Developing extensive research to show the importance of discriminative
itemsets in data streams in the real applications.
o Extensive research in different frequent itemset mining methods in data
streams and contrast data mining methods is done. The importance of
the discriminative itemset and its superiority to the frequent itemset is
discussed, and the application of discriminative itemset for the
definition of the discriminative rule is provided.
Developing a simple method of discriminative itemset mining
o After defining the concept of discriminative itemset mining in data
streams, we proposed a simple algorithm for mining discriminative
itemsets in data streams. The proposed algorithm works well only for
small datasets or within specific parameter settings.
o The time and space complexity of the algorithm are analysed on
synthetically generated and real datasets.
Developing an efficient heuristic-based method for discriminative itemset
mining
o After defining the first method of discriminative itemset mining in a
single batch of transactions, we develop an efficient method by defining
a heuristic that is scalable to real-world large datasets.
o The algorithm shows acceptable time and space complexity on datasets
of different sizes and with different parameter settings.
Developing a method of discriminative itemset mining in data streams in the
tilted-time window model
o The efficient method defined previously is used for mining discriminative
itemsets in the tilted-time window model.
o The efficient algorithm defined for mining discriminative itemsets in data
streams using the tilted-time window model works well for both small
and large datasets.
o The tilted-time window model is used for saving the historical
discriminative itemsets.
Developing a method of discriminative itemset mining in data streams in the
sliding window model
o The efficient method defined previously is used for mining discriminative
itemsets using the sliding window model.
o The efficient algorithm defined for mining discriminative itemsets in
data streams using the sliding window model works well for both small
and large datasets.
o The sliding window model is used for saving the offline and online real-
time discriminative itemsets.
A comprehensive analysis is done for all four algorithms on datasets
generated using a synthetic data generator and on real datasets.
o Different parameter settings are tested for mining discriminative
itemsets based on different input datasets.
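To make the single-batch task in the contributions above concrete, the following is a minimal brute-force sketch, not the optimised algorithms developed in this thesis. Under the assumed definition, an itemset is reported as discriminative when it is frequent enough in the target batch and its support there exceeds a user-chosen multiple of its support in the general batch; all function names, thresholds, and sample data are illustrative assumptions.

```python
from itertools import combinations

def discriminative_itemsets(target_batch, general_batch,
                            min_sup=0.2, min_ratio=2.0, max_len=3):
    """Brute-force sketch: report itemsets frequent in the target batch whose
    relative support is at least min_ratio times their support in the general batch."""
    def support(batch, itemset):
        # Fraction of transactions in the batch containing every item of itemset.
        return sum(1 for t in batch if itemset <= t) / len(batch)

    items = sorted({i for t in target_batch for i in t})
    result = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            sup_t = support(target_batch, itemset)
            if sup_t < min_sup:
                continue
            sup_g = support(general_batch, itemset)
            # Itemsets absent from the general stream are trivially
            # discriminative (akin to "jumping" emerging patterns).
            ratio = sup_t / sup_g if sup_g > 0 else float("inf")
            if ratio >= min_ratio:
                result[itemset] = ratio
    return result

# Illustrative batches of transactions (sets of items).
target = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
general = [{"a"}, {"b"}, {"c"}, {"a", "c"}]
found = discriminative_itemsets(target, general)
```

The exhaustive enumeration is exponential in the number of items; the prefix-tree and heuristic methods contributed by the thesis exist precisely to avoid this cost.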
7.2 SUMMARY OF FINDINGS
The main findings from this thesis are summarised as follows:
In response to the first research question, extensive research was done on
frequent itemset mining in data streams and on contrast data mining,
situating this research within the literature. The importance of
discriminative itemsets in data streams in real applications was emphasised,
and discriminative itemsets were proposed for the application of
classification in data streams.
o It covered the literature that uses frequent itemset mining and
contrast data mining for the classification of static datasets and data
streams.
In response to the second research question, the concept of discriminative
itemsets was proposed, and a simple algorithm based on an extension of
FP-Growth (Han, Pei and Yin 2000) was implemented for mining
discriminative itemsets in a single batch of transactions. This was followed
by an advanced, efficient method for mining discriminative itemsets in a
single batch of transactions.
o These itemsets are used for differentiating between the target data
stream and the general data stream.
In response to the third research question, the concepts of discriminative
itemsets in the tilted-time window model and the sliding window model
were proposed. One method was implemented by extending the proposed
single-batch method for efficient mining of discriminative itemsets in the
tilted-time window model, and another was implemented by extending the
proposed efficient single-batch method for efficient mining of
discriminative itemsets in the sliding window model.
o These algorithms are scalable to real-world data streams
modelled as continuous batches of transactions.
In response to the fourth research question, the proposed algorithms were
tested with several synthetically generated datasets and real datasets with
different features and parameter settings.
o These algorithms can be customised by different parameter settings
based on the dataset characteristics.
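The tilted-time window model referred to in the findings above can be illustrated with a small sketch. This is an assumed logarithmic scheme in the spirit of Giannella et al. (2003), not the data structure developed in this thesis: level k holds summaries that each cover 2^k batches, so recent history stays at fine granularity while older history is merged coarsely, keeping memory logarithmic in the stream length.

```python
from collections import Counter

class TiltedTimeWindow:
    """Sketch of a logarithmic tilted-time window (assumed scheme): each level
    keeps at most two summaries; overflow merges the two oldest and promotes
    the merged summary to the next, coarser level."""
    def __init__(self):
        self.levels = []  # levels[k]: list of at most 2 Counter summaries, newest first

    def add_batch(self, counts):
        carry = Counter(counts)
        k = 0
        while carry is not None:
            if len(self.levels) <= k:
                self.levels.append([])
            self.levels[k].insert(0, carry)   # newest summary at the front
            if len(self.levels[k]) <= 2:
                carry = None
            else:
                # Merge the two oldest summaries and promote them upward.
                old2, old1 = self.levels[k].pop(), self.levels[k].pop()
                carry = old1 + old2
                k += 1

    def total(self, itemset):
        # Historical count of an itemset aggregated across all granularities.
        return sum(c[itemset] for level in self.levels for c in level)

# Illustrative use: five batches, each carrying a count for one itemset.
window = TiltedTimeWindow()
for n in range(1, 6):
    window.add_batch({frozenset({"a"}): n})
```

After five batches the structure holds the newest batch at the finest level and two merged summaries above it, while `total` still recovers the exact aggregate count.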
7.3 CONNECTIONS BETWEEN THE THREE TASKS
This thesis contributes the definition of discriminative itemset mining in data streams
and four methods for mining discriminative itemsets in different window models. Discriminative
itemsets in the tilted-time window model and the sliding window model can be used for
discriminative classification.
The discriminative itemsets are also useful for description mining in data streams in
different window models. The discriminative itemsets can be used for mining discriminative
rules, and discriminative rules in combination with highly frequent itemsets can be used for
defining discriminative classification techniques. Such techniques can be used for prediction
mining in a target data stream using the distinguishing features of discriminative rules, and
they are fast and efficient enough for fast-growing data streams.
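The step from discriminative itemsets to discriminative rules described above can be sketched as follows. The check assumes the definition given later in this chapter (a rule with higher support and confidence in the target stream than in the general stream); the rule-generation strategy, names, and sample data are illustrative assumptions rather than the method developed in this thesis.

```python
from itertools import combinations

def support(batch, itemset):
    # Fraction of transactions in the batch containing every item of itemset.
    return sum(1 for t in batch if itemset <= t) / len(batch)

def discriminative_rules(itemset, target_batch, general_batch):
    """Sketch: split a discriminative itemset into antecedent -> consequent and
    keep rules whose support AND confidence are higher in the target stream."""
    rules = []
    items = sorted(itemset)
    for r in range(1, len(items)):
        for ante in combinations(items, r):
            a = frozenset(ante)
            sup_t, sup_g = support(target_batch, itemset), support(general_batch, itemset)
            sup_at, sup_ag = support(target_batch, a), support(general_batch, a)
            conf_t = sup_t / sup_at if sup_at else 0.0
            conf_g = sup_g / sup_ag if sup_ag else 0.0
            if sup_t > sup_g and conf_t > conf_g:
                rules.append((a, itemset - a, conf_t))
    return rules

# Illustrative batches: {a, b} co-occurs more often in the target stream.
target = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
general = [{"a"}, {"a", "b"}, {"b"}, {"c"}]
rules = discriminative_rules(frozenset({"a", "b"}), target, general)
```

Each kept rule records its antecedent, consequent, and target-stream confidence; a classifier could rank candidate rules by that confidence when labelling incoming transactions.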
7.4 LIMITATIONS AND THE FUTURE RESEARCH ISSUES
This thesis addresses discriminative itemset mining techniques in data streams. The
proposed methods are analysed on synthetic datasets generated by the IBM synthetic data
generator and on real datasets provided in the UCI repository.
One limitation of the proposed methods concerns concept drift in the transactions of the
data streams used for the experiments. Under concept drift, the algorithms may incur long
delays in their runtimes. Another limitation is the size of the main memory, which restricts
the length of the window. The tilted-time window model has to be regularly restructured and
optimised based on the recent trends so that it fits in the main memory. A further limitation
is the large number of discriminative itemsets with low supports.
The discriminative itemsets discovered in data streams in this thesis can be employed in
classification techniques for prediction mining in large data streams. The discriminative
itemsets discovered in the tilted-time and sliding window models can be used for the definition
of discriminative rules, defined as rules with higher support and confidence in the target data
stream compared to the general data stream. The developed discriminative itemset mining
algorithms are therefore recommended for discriminative rule mining in data streams. The
discriminative rules in the historical tilted-time and sliding window models can be adapted
for defining new classification techniques for prediction mining in data streams.
References
Aggarwal, Charu C. 2007. Data streams: models and algorithms. Vol. 31: Springer Science &
Business Media.
Agrawal, Rakesh and Ramakrishnan Srikant. 1994. "Fast algorithms for mining association rules in
large databases." In Proceedings of the 20th International Conference on Very Large Data
Bases VLDB, edited, 487-499
Ahmed, Chowdhury Farhan, Syed Khairuzzaman Tanbeer, Byeong-Soo Jeong and Ho-Jin Choi. 2012.
"Interactive mining of high utility patterns over data streams." Expert Systems with
Applications 39 (15): 11979-11991. doi: 10.1016/j.eswa.2012.03.062.
Alhammady, Hamad and Kotagiri Ramamohanarao. 2005. "Mining Emerging Patterns and
Classification in Data Streams." The Proceedings of IEEE/WIC/ACM International
Conference on Web Intelligence: 272-275 doi: 10.1109/WI.2005.96.
Amagata, Daichi and Takahiro Hara. 2017 "Mining Top-k Co-Occurrence Patterns across Multiple
Streams." IEEE Transactions on Knowledge and Data Engineering 29 (10): 2249 - 2262.
doi: 10.1109/TKDE.2017.2728537.
Antonie, Maria-Luiza and Osmar R. Zaïane. 2004. "Mining positive and negative association rules:
An approach for confined rules." In Proceedings of the Knowledge Discovery in Databases:
PKDD 2004, edited by J. F. Boulicaut, F. Esposito, F. Giannotti and D. Pedreschi, 27-38.
Ayres, Jay, Johannes Gehrke, Tomi Yiu and Jason Flannick. 2002. "Sequential Pattern Mining using
A Bitmap Representation." In Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining, edited, 429-435 doi:
10.1145/775047.775109.
Bailey, James and Elsa Loekito. 2010. "Efficient incremental mining of contrast patterns in changing
data." Information processing letters 110 (3): 88-92. doi: 10.1016/j.ipl.2009.10.012.
Bailey, James, Thomas Manoukian and Kotagiri Ramamohanarao. 2002. "Fast Algorithms for Mining
Emerging Patterns." In Proceedings of the 6th European Conference on Principles of Data
Mining and Knowledge Discovery, edited, 39-50.
Chang, Joong Hyuk and Won Suk Lee. 2003. "Finding recent frequent itemsets adaptively over online
data streams." In Proceedings of the ninth ACM SIGKDD international conference on
Knowledge discovery and data mining, edited, 487-492: ACM. doi: 10.1145/956750.956807
Chang, Joong Hyuk and Won Suk Lee. 2005. "estWin: Online data stream mining of recent frequent
itemsets by sliding window method." Journal of Information Science 31 (2): 76-90. doi:
10.1177/0165551505050785.
Cheng, James, Yiping Ke and Wilfred Ng. 2008. "A survey on algorithms for mining frequent
itemsets over data streams." Knowledge and Information Systems 16 (1): 1-27. doi:
10.1007/s10115-007-0092-4.
Chi, Yun, Haixun Wang, S Yu Philip and Richard R Muntz. 2004. "Moment: Maintaining closed
frequent itemsets over a stream sliding window." In Fourth IEEE International Conference
on Data Mining ICDM '04, edited, 59-66. doi: 10.1109/ICDM.2004.10084.
Chi, Yun, Haixun Wang, S Yu Philip and Richard R Muntz. 2006. "Catch the moment: maintaining
closed frequent itemsets over a data stream sliding window." Knowledge and Information
Systems 10 (3): 265-294. doi: 10.1007/s10115-006-0003-0.
Clark, Peter and Tim Niblett. 1989. "The CN2 induction algorithm." Machine learning 3 (4): 261-283.
doi: 10.1023/a:1022641700528.
Çokpınar, Samet and Taflan İmre Gündem. 2012. "Positive and negative association rule mining on
XML data streams in database as a service concept." Expert Systems with Applications 39
(8): 7503-7511. doi: 10.1016/j.eswa.2012.01.128.
Dheeru, Dua and Efi Karra Taniskidou. 2017. "UCI Machine Learning Repository."
Djahantighi, Farhad Siasar, Mohammad-Reza Feizi-Derakhshi, Mir Mohsen Pedram and Zohreh
Alavi. 2010. "An Effective Algorithm for Mining Users Behaviour in Time-Periods."
European Journal of Scientific Research 40 (1): 81-90.
Dong, Guozhu and James Bailey. 2012. Contrast Data Mining: Concepts, Algorithms, and
Applications: CRC Press.
Dong, Guozhu and Jinyan Li. 1999. "Efficient Mining of Emerging Patterns: Discovering Trends and
Differences." In Proceedings of the fifth ACM SIGKDD international conference on
Knowledge discovery and data mining, edited, 43-52. doi: 10.1145/312129.312191.
Dong, Guozhu, Xiuzhen Zhang, Limsoon Wong and Jinyan Li. 1999. "CAEP: Classification by
Aggregating Emerging Patterns." Berlin, Heidelberg, edited, 30-42: Springer Berlin
Heidelberg.
Duan, Lei, Guanting Tang, Jian Pei, James Bailey, Guozhu Dong, Akiko Campbell and Changjie
Tang. 2014. "Mining Contrast Subspaces." In Advances in Knowledge Discovery and Data
Mining: 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014.
Proceedings, Part I, Cham, edited by Vincent S. Tseng, Tu Bao Ho, Zhi-Hua Zhou, Arbee L.
P. Chen and Hung-Yu Kao, 249-260: Springer International Publishing. doi: 10.1007/978-3-
319-06608-0_21.
Duan, Lei, Guanting Tang, Jian Pei, James Bailey, Guozhu Dong, Vinh Nguyen, Akiko Campbell and
Changjie Tang. 2016. "Efficient discovery of contrast subspaces for object explanation and
characterization." Knowledge and Information Systems 47 (1): 99-129. doi: 10.1007/s10115-
015-0835-6.
Duda, Richard O and Peter E Hart. 1973. Pattern classification and scene analysis. Vol. 3: Wiley
New York.
Eichinger, Frank, Detlef D. Nauck and Frank Klawonn. 2006. "Sequence Mining for Customer
Behaviour Predictions in Telecommunications." In Proceedings of the Workshop on
Practical Data Mining: Applications, Experiences and Challenges, Berlin, Germany, edited.
Fan, Hongjian and Kotagiri Ramamohanarao. 2002. "An Efficient Single-Scan Algorithm for Mining
Essential Jumping Emerging Patterns for Classification." In Proceedings of the 6th Pacific-
Asia Conference on Advances in Knowledge Discovery and Data Mining, edited, 456-462
Fan, Hongjian and Kotagiri Ramamohanarao. 2003. "A bayesian approach to use emerging patterns
for classification." In Proceedings of the 14th Australasian database conference-Volume 17,
edited, 39-48: Australian Computer Society, Inc.
Farzanyar, Zahra, Mohammadreza Kangavari and Nick Cercone. 2012. "Max-FISM: Mining
(recently) maximal frequent itemsets over data streams using the sliding window model."
Computers & Mathematics with Applications 64 (6): 1706-1718. doi:
10.1016/j.camwa.2012.01.045.
Fournier-Viger, Philippe, Jerry Chun-Wei Lin, Antonio Gomariz, Ted Gueniche, Azadeh Soltani,
Zhihong Deng and Hoang Thanh Lam. 2016. "The SPMF Open-Source Data Mining Library
Version 2." In Machine Learning and Knowledge Discovery in Databases: European
Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016,
Proceedings, Part III, 36-40. Cham: Springer International Publishing. doi: 10.1007/978-3-
319-46131-1_8.
Gao, Chao, Lei Duan, Guozhu Dong, Haiqing Zhang, Hao Yang and Changjie Tang. 2016. "Mining
Top-k Distinguishing Sequential Patterns with Flexible Gap Constraints." In Web-Age
Information Management: 17th International Conference, WAIM 2016, Nanchang, China,
June 3-5, 2016, Proceedings, Part I, edited by Bin Cui, Nan Zhang, Jianliang Xu, Xiang Lian
and Dexi Liu, 82-94. Cham: Springer International Publishing. doi: 10.1007/978-3-319-
39937-9_7.
Giannella, Chris, Jiawei Han, Jian Pei, Xifeng Yan and Philip S. Yu. 2003. "Mining frequent patterns
in data streams at multiple time granularities." Next generation data mining 212: 191-212.
Guo, Jing, Peng Zhang, Jianlong Tan and Li Guo. 2011. "Mining frequent patterns across multiple
data streams." In Proceedings of the 20th ACM international conference on Information and
knowledge management, edited, 2325-2328: ACM. doi: 10.1145/2063576.2063957.
Guo, Lichao, Hongye Su and Yu Qu. 2011. "Approximate mining of global closed frequent itemsets
over data streams." Journal of the Franklin Institute-Engineering and Applied Mathematics
348 (6): 1052-1081. doi: 10.1016/j.jfranklin.2011.04.006.
Han, Jiawei, Jian Pei and Micheline Kamber. 2011. Data mining: concepts and techniques: Elsevier.
Han, Jiawei, Jian Pei, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal and MC
Hsu. 2001. "Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern
growth." In proceedings of the 17th international conference on data engineering (ICDE
2001) edited, 215-224.
Han, Jiawei, Jian Pei and Yiwen Yin. 2000. "Mining frequent patterns without candidate generation."
In ACM Sigmod Record, edited, 1-12: ACM. doi: 10.1145/335191.335372.
Hashemi, Sattar, Ying Yang, Zahra Mirzamomen and Mohammadreza Kangavari. 2009. "Adapted
one-versus-all decision trees for data stream classification." Knowledge and Data
Engineering, IEEE Transactions on 21 (5): 624-637. doi: 10.1109/TKDE.2008.181.
He, Zengyou, Feiyang Gu, Can Zhao, Xiaoqing Liu, Jun Wu and Ju Wang. 2017. "Conditional
discriminative pattern mining." Information Sciences 375 (C): 1-15. doi:
10.1016/j.ins.2016.09.047.
Hollfelder, Silvia, Vincent Oria and M. Tamer Özsu. 2000. "Mining user behaviour for resource
prediction in interactive electronic malls." In IEEE International Conference on Multimedia
and Expo ICME 2000, edited, 863-866
Huang, Lan, Chun-guang Zhou, Yu-qin Zhou and Zhe Wang. 2008. "Research on Data Mining
Algorithms for Automotive Customers' Behavior Prediction Problem." In 2008 Seventh
International Conference on Machine Learning and Applications, edited, 677-681. doi:
10.1109/ICMLA.2008.23.
Jiang, Nan and Le Gruenwald. 2006. "Research issues in data stream association rule mining." ACM
Sigmod Record 35 (1): 14-19. doi: 10.1145/1121995.1121998.
Kompalli, Prasanna Lakshmi and Ramesh Kumar Cherku. 2015. "Efficient Mining of Data Streams
Using Associative Classification Approach." International Journal of Software Engineering
and Knowledge Engineering 25 (03): 605-631. doi: 10.1142/s0218194015500059.
Lakshmi, K Prasanna and CRK Reddy. 2012. "Compact Tree for Associative Classification of Data
Stream Mining." International Journal of Computer Science Issues (IJCSI) 9 (2).
Leung, Carson Kai-Sang and Quamrul I Khan. 2006. "DSTree: A tree structure for the mining of
frequent sets from data streams." In Sixth International Conference on Data Mining
ICDM'06, edited, 928-932. doi: 10.1109/ICDM.2006.62
Li, Hua-Fu and Suh-Yin Lee. 2009. "Mining frequent itemsets over data streams using efficient
window sliding techniques." An International Journal Expert Systems with Applications 36
(2): 1466-1477 doi: 10.1016/j.eswa.2007.11.061.
Li, Hua-Fu, Suh-Yin Lee and Man-Kwan Shan. 2004. "An efficient algorithm for mining frequent
itemsets over the entire history of data streams." In Proceeding of first international
workshop on knowledge discovery in data streams, edited. doi: 10.1016/j.eswa.2007.11.061.
Li, Jinyan, Guozhu Dong and Kotagiri Ramamohanarao. 2000. "Instance-based classification by
emerging patterns." In Principles of Data Mining and Knowledge Discovery, 191-200:
Springer. doi: 10.1007/3-540-45372-5_19.
Li, Jinyan, Guozhu Dong and Kotagiri Ramamohanarao. 2001. "Making use of the most expressive
jumping emerging patterns for classification." Knowledge and Information systems 3 (2):
131-145. doi: 10.1007/3-540-45571-X_29.
Li, Jinyan, Haiquan Li, Limsoon Wong, Jian Pei and Guozhu Dong. 2006. "Minimum description
length principle: generators are preferable to closed patterns." Paper presented at the
Proceedings of the 21st national conference on Artificial intelligence - Volume 1, Boston,
Massachusetts. AAAI Press.
Li, Jinyan, Guimei Liu and Limsoon Wong. 2007. "Mining statistically important equivalence classes
and delta-discriminative emerging patterns." In Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery and data mining, edited, 430-439: ACM.
doi: 10.1145/1281192.1281240.
Li, Wenmin, Jiawei Han and Jian Pei. 2001. "CMAR: Accurate and efficient classification based on
multiple class-association rules." In Proceedings IEEE International Conference on Data
Mining (ICDM '01), edited, 369-376: IEEE.
Li, Xiaoli, S Yu Philip, Bing Liu and See-Kiong Ng. 2009. "Positive Unlabeled Learning for Data
Stream Classification." In SDM, edited, 257-268: SIAM. doi: 10.1137/1.9781611972795.23.
Lim, Tjen-Sien, Wei-Yin Loh and Yu-Shan Shih. 2000. "A comparison of prediction accuracy,
complexity, and training time of thirty-three old and new classification algorithms." Machine
learning 40 (3): 203-228.
Lin, Ming-Yen, Sue-Chen Hsueh and Sheng-Kun Hwang. 2008. "Interactive mining of frequent
itemsets over arbitrary time intervals in a data stream." In The nineteenth conference on
Australasian database ADC '08, edited, 15-21.
Lin, Zhenhua, Bin Jiang, Jian Pei and Daxin Jiang. 2010. "Mining discriminative items in multiple
data streams." World Wide Web 13 (4): 497-522. doi: 10.1007/s11280-010-0094-0.
Loekito, Elsa and James Bailey. 2006. "Fast mining of high dimensional expressive contrast patterns
using zero-suppressed binary decision diagrams." Paper presented at the Proceedings of the
12th ACM SIGKDD international conference on Knowledge discovery and data mining,
Philadelphia, PA, USA. ACM. doi: 10.1145/1150402.1150438.
Liu, Bing, Wynne Hsu and Yiming Ma. 1998. "Integrating classification and association rule mining." In
Proceedings of the 4th, edited.
Manku, Gurmeet Singh. 2016. "Frequent Itemset Mining over Data Streams." In Data Stream
Management: Processing High-Speed Data Streams, edited by Minos Garofalakis, Johannes
Gehrke and Rajeev Rastogi, 209-219. Berlin, Heidelberg: Springer Berlin Heidelberg. doi:
10.1007/978-3-540-28608-0_10.
Manku, Gurmeet Singh and Rajeev Motwani. 2002. "Approximate Frequency Counts over Data
Streams." In Proceedings of the 28th international conference on Very Large Data Bases,
edited, 346-357: VLDB Endowment.
Masud, Mohammad M, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W Hamlen and
Nikunj C Oza. 2012. "Facing the reality of data stream classification: coping with scarcity of
labeled data." Knowledge and information systems 33 (1): 213-244. doi: 10.1007/s10115-
011-0447-8.
Metwally, Ahmed, Divyakant Agrawal and Amr El Abbadi. 2005. "Efficient computation of frequent
and top-k elements in data streams." In International Conference on Database Theory,
edited, 398-412: Springer. doi: 10.1007/978-3-540-30570-5_27.
Garofalakis, Minos, Johannes Gehrke and Rajeev Rastogi. 2002. "Querying and mining data streams:
you only get one look." In Tutorial notes of the 28th International Conference on Very Large
Databases, Hong Kong, China.
Mori, Taketoshi, Aritoki Takada, Hiroshi Noguchi, Tatsuya Harada and Tomomasa Sato. 2005.
"Behavior prediction based on daily-life record database in distributed sensing space." In
IEEE/RSJ International Conference on Intelligent Robots and Systems, Vols 1-4, edited,
1703-1709: IEEE. doi: 10.1109/iros.2005.1545244.
Nowozin, Sebastian, Gokhan Bakir and Koji Tsuda. 2007. "Discriminative subsequence mining for
action classification." In 11th International Conference on Computer Vision, edited, 1-8:
IEEE. doi: 10.1109/ICCV.2007.4409049
Patel, Dhaval, Wynne Hsu and Mong Li Lee. 2011. "Discriminative Mutation Chains in Virus
Sequences." In Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International
Conference on, edited, 9-16: IEEE. doi: 10.1109/ICTAI.2011.11.
Peng, Wen-Chih and Zhung-Xun Liao. 2009. "Mining sequential patterns across multiple sequence
databases." Data & Knowledge Engineering 68 (10): 1014-1033.
Prasad, U. Devi and S. Madhavi. 2012. "Prediction of Churn Behavior of Bank Customers Using Data
Mining Tools." Business Intelligence Journal 5 (1): 96-101
Quinlan, J Ross. 2014. C4.5: programs for machine learning: Elsevier.
Bayardo, Roberto J., Jr. 1998. "Efficiently mining long patterns from databases." Paper presented at
the Proceedings of the 1998 ACM SIGMOD international conference on Management of
data, Seattle, Washington, USA. ACM. doi: 10.1145/276304.276313.
Saengthongloun, Bordin, Thanapat Kangkachit, Thanawin Rakthanmanon and Kitsana Waiyamai.
2013. "AC-Stream: Associative classification over data streams using multiple class
association rules." In Computer Science and Software Engineering (JCSSE), 2013 10th
International Joint Conference on, edited, 223-228: IEEE. doi:
10.1109/JCSSE.2013.6567349.
Seyfi, Majid. 2011. "Mining discriminative items in multiple data streams with hierarchical counters
approach." In Fourth International Workshop on Advanced Computational Intelligence
(IWACI), 2011, edited, 172-176 IEEE. doi: 10.1109/IWACI.2011.6159996.
Seyfi, Majid, Shlomo Geva and Richi Nayak. 2014. "Mining Discriminative Itemsets in Data
Streams." In International Conference on Web Information Systems Engineering, edited,
125-134: Springer. doi: 10.1007/978-3-319-11749-2_10
Seyfi, Majid, Richi Nayak, Yue Xu and Shlomo Geva. 2017. "Efficient mining of discriminative
itemsets." Paper presented at the Proceedings of the International Conference on Web
Intelligence, Leipzig, Germany. ACM. doi: 10.1145/3106426.3106429.
Shin, Se Jung and Won Suk Lee. 2008. "On-line generation association rules over data streams."
Information and Software Technology 50 (6): 569-578. doi: 10.1016/j.infsof.2007.06.005.
Song, Zhen-Hui and Yi Li. 2010. "Associative classification over Data Streams." In Information
Engineering and Computer Science (ICIECS), 2010 2nd International Conference on, edited,
1-4: IEEE.
Su, Li, Hong-yan Liu and Zhen-Hui Song. 2011. "A new classification algorithm for data stream."
International Journal of Modern Education and Computer Science (IJMECS) 3 (4): 32. doi:
10.5815/ijmecs.2011.04.05.
Tanbeer, Syed Khairuzzaman, Chowdhury Farhan Ahmed, Byeong-Soo Jeong and Young-Koo Lee.
2009. "Sliding window-based frequent pattern mining over data streams." Information
sciences 179 (22): 3843-3865. doi: 10.1016/j.ins.2009.07.012.
Thabtah, Fadi. 2007. "A review of associative classification mining." The Knowledge Engineering
Review 22 (01): 37-65. doi: 10.1017/s0269888907001026.
Tsai, Pauray SM. 2009. "Mining frequent itemsets in data streams using the weighted sliding window
model." Expert Systems with Applications 36 (9): 11617-11625. doi:
10.1016/j.eswa.2009.03.025.
Tseng, Vincent S and Kawuu W Lin. 2006. "Efficient mining and prediction of user behavior patterns
in mobile web systems." Information and Software Technology 48 (6): 357-369. doi:
10.1016/j.infsof.2005.12.014.
Waiyamai, Kitsana, Thanapat Kangkachit, Bordin Saengthongloun and Thanawin Rakthanmanon.
2014. "ACCD: Associative Classification over Concept-Drifting Data Streams." In Machine
Learning and Data Mining in Pattern Recognition, 78-90: Springer. doi: 10.1007/978-3-319-
08979-9_7
Wang, Haixun, Wei Fan, Philip S. Yu and Jiawei Han. 2003. "Mining Concept-Drifting Data Streams
using Ensemble Classifiers." In Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, edited, 226-235 doi:
10.1145/956750.956778.
Wang, Jianyong and Jiawei Han. 2004. "BIDE: Efficient Mining of Frequent Closed Sequences."
Paper presented at the Proceedings of the 20th International Conference on Data
Engineering. IEEE Computer Society.
Wu, Xindong, Chengqi Zhang and Shichao Zhang. 2004. "Efficient mining of both positive and
negative association rules." ACM Transactions on Information Systems (TOIS) 22 (3): 381-
405. doi: 10.1145/1010614.1010616.
Yu, Jeffery Xu, Zhihong Chong, Hongjun Lu and Aoying Zhou. 2004. "False positive or false
negative: mining frequent tenets from high speed transactional data streams." In Thirtieth
International conference on Very large data bases VLDB 04, edited, 204-215
Yu, Kui, Wei Ding, Dan A Simovici and Xindong Wu. 2012. "Mining emerging patterns by streaming
feature selection." In Proceedings of the 18th ACM SIGKDD international conference on
Knowledge discovery and data mining, edited, 60-68: ACM. doi: 10.1145/2339530.2339544.
Yu, Kui, Wei Ding, Dan A. Simovici, Hao Wang, Jian Pei and Xindong Wu. 2015. "Classification
with Streaming Features: An Emerging-Pattern Mining Approach." ACM Transactions on
Knowledge Discovery from Data (TKDD) 9 (4): 1-31. doi: 10.1145/2700409.
Yu, Kui, Wei Ding, Hao Wang and Xindong Wu. 2013. "Bridging causal relevance and pattern
discriminability: Mining emerging patterns from high-dimensional data." IEEE Transactions
on Knowledge and Data Engineering 25 (12): 2721-2739. doi: 10.1109/TKDE.2012.218.
Yuan, Xiaohui, Bill P. Buckles, Zhaoshan Yuan and Jian Zhang. 2002. "Mining negative association
rules." In Seventh International Symposium on Computers and Communications, edited, 623-
628
Zaki, Mohammed J. 2001. "SPADE: An efficient algorithm for mining frequent sequences." Machine
learning 42 (1-2): 31-60. doi: 10.1023/A:1007652502315.
Zaki, Mohammed J. and Ching-Jui Hsiao. 2002. "CHARM: An Efficient Algorithm for Closed
Itemset Mining." In Proceedings of the 2002 SIAM International Conference on Data
Mining, edited, 457-473. doi: 10.1137/1.9781611972726.27.
Zhang, Peng, Xingquan Zhu, Jianlong Tan and Li Guo. 2010. "Classifier and cluster ensembles for
mining concept drifting data streams." In IEEE 10th International Conference on Data
Mining ICDM'10, edited, 1175-1180. doi: 10.1109/ICDM.2010.125.
Zhang, Xiuzhen, Guozhu Dong and Ramamohanarao Kotagiri. 2000. "Exploring Constraints to
Efeciently Mine Emerging Patterns from Large High-dimensional Datasets." In Proceedings
of the sixth ACM SIGKDD international conference on Knowledge discovery and data
mining, edited, 310-314 doi: 10.1145/347090.347158.
Zhao, Li, Lei Wang and Qingzheng Xu. 2012. "Data stream classification with artificial endocrine
system." Applied Intelligence 37 (3): 390-404. doi: 10.1007/s10489-011-0334-8.
Zhu, Xingquan and Xindong Wu. 2007. "Discovering relational patterns across multiple databases." In
2007 IEEE 23rd International Conference on Data Engineering (ICDE 2007), edited, 726-
735: IEEE. doi: 10.1109/ICDE.2007.367918.
Zhu, Xingquan, Xindong Wu and Ying Yang. 2006. "Effective Classification of Noisy Data Streams
with Attribute-Oriented Dynamic Classifier Selection." Knowledge and Information Systems
archive 9 (3): 339-363 doi: 10.1007/s10115-005-0212-y.