Enriching XML Documents Clustering by using
Concise Structure and Content
By
Sangeetha Kutty, MCIS (Auckland University of Technology, New Zealand),
B.Eng. (University of Madras, India)
Thesis submitted for the degree of Doctor of Philosophy to the
Faculty of Science and Technology at
Queensland University of Technology
Brisbane, Queensland, Australia
2011
Keywords
XML documents, clustering, frequent subtree mining, structure, content, Vector Space
Model (VSM), tensors, Tensor Space Model (TSM), paths, graphs, trees, subtrees, induced
subtrees, embedded subtrees, constraints, closed, maximal, apriori, prefix-based pattern
growth, matricization, INEX, Wikipedia, ACM dataset, IEEE dataset, random
projection, random indexing.
Abstract
With the growing number of XML documents on the Web, it becomes essential to effectively
organise these XML documents in order to retrieve useful information from them. A
possible solution is to apply clustering to the XML documents to discover knowledge that
promotes effective data management, information retrieval and query processing. However,
many issues arise in discovering knowledge from these types of semi-structured documents
due to their heterogeneity and structural irregularity. Most of the existing research on
clustering techniques focuses on only one feature of the XML documents, either their
structure or their content, due to scalability and complexity problems. The knowledge
gained in the form of clusters based on structure alone or content alone is not suitable for
real-life datasets. It therefore becomes essential to include both the structure and content of
XML documents in order to improve the accuracy and meaning of the clustering solution.
However, the inclusion of both these kinds of information in the clustering process results in
a huge overhead for the underlying clustering algorithm because of the high dimensionality
of the data.
The overall objective of this thesis is to address these issues by: (1) proposing methods
to utilise frequent pattern mining techniques to reduce the dimensionality of the data;
(2) developing models to effectively combine the structure and content of XML documents; and (3) utilising
the proposed models in clustering. This research first determines the structural similarity
in the form of frequent subtrees and then uses these frequent subtrees to represent the
constrained content of the XML documents in order to determine the content similarity.
A clustering framework with two types of models, implicit and explicit, is developed.
The implicit model uses a Vector Space Model (VSM) to combine the structure and
the content information. The explicit model uses a higher order model, namely a 3-
order Tensor Space Model (TSM), to explicitly combine the structure and the content
information. This thesis also proposes a novel incremental technique to decompose large-
sized tensor models, and utilises the decomposed solution for clustering the XML documents.
The proposed framework and its components were extensively evaluated on several
real-life datasets exhibiting extreme characteristics to understand the usefulness of the pro-
posed framework in real-life situations. Additionally, this research evaluates the outcome
of the clustering process on the collection selection problem in information retrieval on
the Wikipedia dataset. The experimental results demonstrate that the proposed frequent
pattern mining and clustering methods outperform the related state-of-the-art approaches.
In particular, the proposed framework of utilising frequent structures for constraining the
content shows an improvement in accuracy over content-only and structure-only clustering
results. The scalability evaluation experiments conducted on large-scale datasets clearly
show the strengths of the proposed methods over state-of-the-art methods.
In summary, this thesis contributes to effectively combining the structure and
the content of XML documents for clustering, in order to improve the accuracy of the
clustering solution. In addition, it addresses the research gaps in
frequent pattern mining to generate efficient and concise frequent subtrees with various
node relationships that could be used in clustering.
Table of Contents
Keywords iii
Abstract v
List of Figures xi
List of Tables xv
Glossary xvii
Statement of Original Authorship xix
Publications xxi
Acknowledgements xxiii
Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Research aims & objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Research significance and contributions . . . . . . . . . . . . . . . . . . . . 9
1.5 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2 Background and Literature Review 13
2.1 XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Data models for XML mining . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.4 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 XML clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Based on structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 29
2.3.1.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.1.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.1.4 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Based on content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.2.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 36
2.3.3 Based on structure and content . . . . . . . . . . . . . . . . . . . . . 37
2.3.3.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 37
2.3.3.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.3.3 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.3.4 Tensor Space Model (TSM) . . . . . . . . . . . . . . . . . . 41
2.3.4 Research gaps in XML clustering . . . . . . . . . . . . . . . . . . . . 43
2.4 Frequent pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.1 An overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.2 Frequent pattern mining methods . . . . . . . . . . . . . . . . . . . . 47
2.4.2.1 Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . 47
2.4.2.2 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.2.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.4.2.4 Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.3 Research gaps in frequent pattern mining . . . . . . . . . . . . . . . 59
2.5 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 3 Research Design 63
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Research Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.1 Phase-One: Pre-processing . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2 Phase-Two: Frequent Pattern Mining . . . . . . . . . . . . . . . . . 66
3.2.3 Phase-Three (a): Clustering using VSM . . . . . . . . . . . . . . . . 67
3.2.4 Phase-Three (b): Clustering using TSM . . . . . . . . . . . . . . . . 67
3.3 Experiment Set-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.2 Real-life Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4.2.1 Small-sized real-life dataset . . . . . . . . . . . . . . . . . . 69
3.4.2.2 Medium-sized real-life dataset . . . . . . . . . . . . . . . . 70
3.4.2.3 Large-sized real-life datasets . . . . . . . . . . . . . . . . . 71
3.5 Evaluation measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5.1 Frequent pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5.3 Collection selection evaluation using NCCG measure . . . . . . . . . 80
3.6 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.6.1 Frequent pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.6.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.6.2.1 Based on representations . . . . . . . . . . . . . . . . . . . 84
3.6.2.2 Based on other clustering methods from INEX . . . . . . . 85
3.6.2.3 Clustering using different tensor decompositions . . . . . . 88
3.7 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 4 Frequent Pattern Mining of XML Documents 91
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Pre-Processing of the structure in XML documents . . . . . . . . . . . . . . 93
4.3 Types of subtrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.1 Concise Frequent Induced (CFI) subtrees . . . . . . . . . . . . . . . 95
4.3.2 Concise Frequent Embedded (CFE) subtrees . . . . . . . . . . . . . 99
4.4 Frequent subtree mining: Background . . . . . . . . . . . . . . . . . . . . . 102
4.4.1 The 1-Length frequent subtree generation . . . . . . . . . . . . . . . 102
4.4.2 Projecting the dataset using the prefix trees . . . . . . . . . . . . . . 104
4.5 Concise frequent subtree mining: Proposed techniques . . . . . . . . . . . . 106
4.5.1 Search space reduction using the backward scan . . . . . . . . . . . 107
4.5.2 Node extension concise checking . . . . . . . . . . . . . . . . . . . . 108
4.6 Methods using the proposed techniques for generating concise frequent sub-
trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.6.1 Generating concise frequent induced subtrees . . . . . . . . . . . . . 111
4.6.1.1 Prefix-based Closed Induced Tree Miner (PCITMiner) . . . 114
4.6.1.2 Prefix-based Maximal Induced Tree Miner (PMITMiner) . 115
4.6.1.3 Length Constrained Prefix-based Closed Induced Tree Miner
(PCITMinerConst) . . . . . . . . . . . . . . . . . . . . . . 116
4.6.1.4 Length Constrained Prefix-based Maximal Induced Tree
Miner (PMITMinerConst) . . . . . . . . . . . . . . . . . . 117
4.6.2 Generating concise frequent embedded subtrees . . . . . . . . . . . . 118
4.6.2.1 Prefix-based Closed Embedded Tree Miner (PCETMiner) . 119
4.6.2.2 Prefix-based Maximal Embedded Tree Miner (PMETMiner) . 119
4.6.2.3 Length Constrained Prefix-based Closed Embedded Tree
Miner (PCETMinerConst) . . . . . . . . . . . . . . . . . . 120
4.6.2.4 Length Constrained Prefix-based Maximal Embedded Tree
Miner (PMETMinerConst) . . . . . . . . . . . . . . . . . . 121
4.7 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.7.1 Evaluation of frequent pattern mining methods on synthetic datasets 123
4.7.2 Evaluation of frequent pattern mining methods on real-life datasets 128
4.7.2.1 On small-sized real-life dataset . . . . . . . . . . . . . . . . 128
4.7.2.2 On medium-sized real-life dataset . . . . . . . . . . . . . . 130
4.7.2.3 On large-sized real-life datasets . . . . . . . . . . . . . . . . 131
4.8 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.8.1 Algorithmic Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.8.2 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.9 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Chapter 5 XML Clustering 145
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.2 Hybrid Clustering of XML Documents (HCX) Methodology: An Overview 146
5.2.1 Hybrid Clustering of XML documents using the Vector Space Model
(HCX-V) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.2 Hybrid Clustering of XML documents using the Tensor Space Model
(HCX-T) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.3 Using the Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . 148
5.3.1 Identifying the coverage of concise frequent subtrees . . . . . . . . . 149
5.3.2 Pre-processing of the structure-constrained content of XML documents 153
5.3.3 Representation of the structure-constrained content in ICF . . . . . 155
5.3.4 Similarity measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.4 Using the Tensor Space Model (TSM) . . . . . . . . . . . . . . . . . . . . . 158
5.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.4.1.1 Tensor concepts . . . . . . . . . . . . . . . . . . . . . . . . 159
5.4.1.2 Tensor operations . . . . . . . . . . . . . . . . . . . . . . . 161
5.4.1.3 Tensor decomposition techniques . . . . . . . . . . . . . . . 163
5.4.2 Modelling in tensor space – An overview . . . . . . . . . . . . . . . . 165
5.4.3 Generation of structure features for TSM . . . . . . . . . . . . . . . 168
5.4.4 Generation of content features for TSM . . . . . . . . . . . . . . . . 169
5.4.5 The TSM representation, decomposition and clustering . . . . . . . 171
5.5 Empirical evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
5.5.1 Accuracy of clustering methods . . . . . . . . . . . . . . . . . . . . . 175
5.5.1.1 ACM dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.5.1.2 DBLP dataset . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.5.1.3 INEX2007 dataset . . . . . . . . . . . . . . . . . . . . . . . 181
5.5.1.4 INEX IEEE dataset . . . . . . . . . . . . . . . . . . . . . . 183
5.5.1.5 INEX 2009 dataset . . . . . . . . . . . . . . . . . . . . . . 183
5.5.2 Sensitivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
5.5.3 Time complexity analysis . . . . . . . . . . . . . . . . . . . . . . . . 188
5.5.4 Scalability analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.6 Discussion and summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Chapter 6 Conclusion 201
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.2 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.3 Summary of findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.4 Limitations and future extensions . . . . . . . . . . . . . . . . . . . . . . . . 205
6.5 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Bibliography 209
Appendices 227
Chapter A Details of the real-life datasets 227
Chapter B Empirical Evaluation of Frequent Mining results 231
List of Figures
1.1 Using clustering in Information Retrieval . . . . . . . . . . . . . . . . . . . 4
1.2 A sample XML dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Classification of XML data . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Sample DTD (conf.dtd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Sample XSD (conf.xsd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Sample XML document (conf.xml) . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Classification of XML models . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 An example of (a) a dense representation; (b) a sparse representation of an
XML dataset modelled in VSM using their feature frequency . . . . . . . . 21
2.7 Sample XML fragments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 (a) A graph; (b) a labelled graph; (c) a directed graph . . . . . . . . . . . . 24
2.9 Graph representation of conf.dtd . . . . . . . . . . . . . . . . . . . . . . . . 25
2.10 Tree representation of the XML document given in Figure 2.4 . . . . . . . . 26
2.11 Paths derived from XML document model (in Figure 2.4): (a) A complete
path; (b) a partial path; (c) a complete path with text node . . . . . . . . . 27
2.12 Hierarchy of XML frequent pattern mining . . . . . . . . . . . . . . . . . . 45
2.13 Example of a subtree from the sample XML dataset in Figure 1.2 . . . . . 52
2.14 (a) A tree; (b) an induced subtree; (c) an embedded subtree . . . . . . . . . 54
2.15 A sample tree using node labels as alphabets instead of the tag names . . . 54
2.16 (a) a document tree dataset; (b) frequent patterns and their projections in
that dataset using pattern-growth approach . . . . . . . . . . . . . . . . . . 55
3.1 Research design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 The pre-processing phase for structure of XML documents . . . . . . . . . . 93
4.2 (a) a document tree DTp; (b) Prefix trees of (a) . . . . . . . . . . . . . . . . 103
4.3 Algorithm for generating concise frequent subtrees . . . . . . . . . . . . . . 111
4.4 Function Fre for generating concise frequent subtrees . . . . . . . . . . . . 112
4.5 Classification of the proposed methods . . . . . . . . . . . . . . . . . . . . . 113
4.6 Runtime and number of subtrees comparison on F5 dataset . . . . . . . . . 124
4.7 Runtime and number of length constrained frequent concise subtrees com-
parison on F5 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.8 Runtime and number of subtrees comparison on the D10 dataset . . . . . . 127
4.9 Runtime and number of subtrees comparison on ACM dataset . . . . . . . . 129
4.10 Runtime and number of length constrained frequent concise subtrees com-
parison on ACM dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.11 Runtime and number of subtrees comparison on DBLP dataset . . . . . . . 130
4.12 Runtime and number of length constrained frequent concise subtrees com-
parison on DBLP dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.13 Runtime and number of subtrees comparison on INEX2007 dataset . . . . . 132
4.14 Runtime and number of length constrained frequent induced subtrees com-
parison on INEX2007 dataset at 20% . . . . . . . . . . . . . . . . . . . . . . 133
4.15 Runtime and number of length constrained frequent induced subtrees com-
parison on INEX2007 dataset at 50% . . . . . . . . . . . . . . . . . . . . . . 133
4.16 Runtime and number of length constrained frequent induced subtrees com-
parison on INEX IEEE dataset at 20% and 50% . . . . . . . . . . . . . . . 134
4.17 Runtime and number of length constrained frequent embedded subtrees
comparison on INEX IEEE dataset at 50% . . . . . . . . . . . . . . . . . . 134
4.18 Runtime and number of length constrained frequent induced subtrees com-
parison on INEX 2009 dataset at 20% and 50% . . . . . . . . . . . . . . . . 135
4.19 Runtime and number of length constrained frequent embedded subtrees
comparison on the INEX 2009 dataset at 50% . . . . . . . . . . . . . . . . . 136
4.20 Comparison of the runtimes vs number of frequent subtrees on ACM, DBLP
and INEX 2007 datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.21 Comparison of the runtimes vs number of frequent subtrees on INEX IEEE
and INEX 2009 datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.1 Hybrid Clustering of XML documents (HCX) methodology . . . . . . . . . 147
5.2 High level definition of HCX-V approach . . . . . . . . . . . . . . . . . . . . 149
5.3 Sparse representation of an XML dataset modelled in VSM using their term
frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.4 Comparison of VSM and TSM: (a) sample XML document; (b) concise
frequent subtrees; (c) Vector Space Model (VSM) for (a) and (b) using
HCX-V; and (d) Tensor Space Model (TSM) for (a) and (b). . . . . . . . . . 158
5.5 Comparison of vector, matrix and tensor . . . . . . . . . . . . . . . . . . . 160
5.6 Fibers of a mode-3 tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.7 Slices of a mode-3 tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.8 Mode-1 matricization of a mode-3 tensor . . . . . . . . . . . . . . . . . . . . 162
5.9 Mode-n matricization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.10 Visualisation of a mode-3 tensor for the XML document dataset . . . . . . 166
5.11 High level definition of HCX-T approach . . . . . . . . . . . . . . . . . . . . 167
5.12 Illustration of Random Indexing (RI) on a mode-3 tensor resulting in a
randomly reduced tensor Tr. . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.13 Progressive Tensor Creation and Decomposition algorithm (PTCD) . . . . 173
5.14 Results of clustering on the ACM dataset using 5 categories . . . . . . . . . 176
5.15 Results of clustering on the ACM dataset using 2 categories . . . . . . . . . 177
5.16 Results of clustering on the ACM dataset using different types of subtrees
for HCX-V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.17 Results of clustering on the ACM dataset using different types of subtrees
for HCX-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.18 Impact of RI on the quality of individual clusters on ACM dataset . . . . . 178
5.19 Results of clustering on the DBLP dataset . . . . . . . . . . . . . . . . . . . 179
5.20 Results of clustering on different types of concise frequent subtrees on the
DBLP dataset using HCX-V . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.21 Results of clustering on different types of concise frequent subtrees on the
DBLP dataset using HCX-T . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
5.22 Results of clustering methods using different types of concise frequent sub-
trees on the INEX 2007 dataset using HCX-V . . . . . . . . . . . . . . . . . 182
5.23 A comparison of the NCCG values of the different clustering methods on
the INEX 2009 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.24 A comparison of the number of clusters on the NCCG values . . . . . . . . 185
5.25 A comparison of the different clustering methods on the INEX 2009 dataset
using cumulative recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
5.26 Cumulative gain for the topic id 2009005 . . . . . . . . . . . . . . . . . . . . 186
5.27 Cumulative gain for the topic id 2009043 . . . . . . . . . . . . . . . . . . . . 187
5.28 Sensitivity of length constraint on the micro- and macro-purity values for
INEX IEEE dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.29 Scalability of HCX-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.30 Scalability of PTCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.31 Scalability of the decomposition in PTCD . . . . . . . . . . . . . . . . . . . 189
5.32 Comparison of the proposed clustering methods over the state-of-the-art
clustering methods on the large-sized datasets . . . . . . . . . . . . . . . . . 191
5.33 Comparison of different types of concise frequent subtrees on clustering
based on datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
5.34 Comparison of different types of concise frequent subtrees on clustering . . 192
5.35 Comparison of tensor decomposition algorithms . . . . . . . . . . . . . . . . 193
5.36 A comparison of the average of all metrics in the chosen real-life datasets . 195
5.37 A comparison of the number of terms in the chosen real-life datasets . . . . 195
5.38 A comparison of the weighting schemes - tf-idf and BM-25 . . . . . . . . . 198
List of Tables
2.1 VSM generated from the structure of XML document given in Figure 2.4 . 21
2.2 Transactional data model generated from the content of XML document
given in Figure 2.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Comparison of different types of clustering methods using structure of XML
documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4 Popular tensor decomposition algorithms based on CP and Tucker . . . . . 42
2.5 Classifications of frequent tree mining methods . . . . . . . . . . . . . . . . 52
3.1 Synthetic datasets and their parameters . . . . . . . . . . . . . . . . . . . . 69
3.2 Details of categories in the ACM dataset . . . . . . . . . . . . . . . . . . . . 70
3.3 Details of ACM and DBLP datasets . . . . . . . . . . . . . . . . . . . . . . 71
3.4 Details of categories in the DBLP dataset . . . . . . . . . . . . . . . . . . . 71
3.5 Details of categories in the INEX IEEE dataset . . . . . . . . . . . . . . . . 72
3.6 Details of categories in the INEX 2007 dataset . . . . . . . . . . . . . . . . 73
3.7 Details of large-sized datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.8 Details of the top-20 categories in the INEX 2009 dataset using Wikipedia
categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.9 Details of the top-20 categories in the INEX 2009 dataset using ad hoc queries 75
3.10 Benchmarks for frequent pattern mining methods . . . . . . . . . . . . . . . 84
3.11 Benchmarks for clustering methods . . . . . . . . . . . . . . . . . . . . . . . 84
4.1 Document tree dataset example (DT ) . . . . . . . . . . . . . . . . . . . . . 95
4.2 Frequent induced subtrees generated from DT (in Table 4.1) using prefix-
pattern growth approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Closed Frequent Induced subtrees generated from DT (in Table 4.1) . . . . 97
4.4 Maximal Frequent Induced subtrees generated from DT (in Table 4.1) . . . 97
4.5 Length Constrained Closed Frequent Induced subtrees generated from DT
(in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.6 Length Constrained Maximal Frequent Induced subtrees generated from
DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7 Frequent embedded subtrees generated from DT (in Table 4.1) using prefix-
pattern growth methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.8 Closed Frequent Embedded (CFE) subtrees generated from DT (in Table
4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.9 Maximal Frequent Embedded (MFE) subtrees generated from DT (in Table
4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.10 Length Constrained Closed Frequent Embedded (CFEConst) subtrees gen-
erated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.11 Length Constrained Maximal Frequent Embedded (MFEConst) subtrees
generated from DT (in Table 4.1) . . . . . . . . . . . . . . . . . . . . . . . . 102
4.12 < A1 − 1 > projected instances dataset . . . . . . . . . . . . . . . . . . . . 105
4.13 < B1 − 1 > projected instances dataset . . . . . . . . . . . . . . . . . . . . 105
4.14 < A1B2 − 1− 1 > projected instances dataset . . . . . . . . . . . . . . . . . 105
4.15 Runtime comparison of length constrained subtrees on the D10 dataset . . 126
4.16 Length constrained subtrees in the D10 dataset . . . . . . . . . . . . . . . . 127
4.17 Summary of frequent pattern mining results on synthetic datasets . . . . . 136
4.18 Summary of frequent pattern mining results on real-life datasets . . . . . . 137
5.1 Tensor notations and descriptions . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2 Summary of the term size and tensor entries in INEX 2009 and INEX IEEE
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
5.3 Impact of dimensionality reduction on the clustering results on the ACM
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.4 Impact of dimensionality reduction on the clustering results on the DBLP
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.5 Results of clustering on the INEX2007 dataset . . . . . . . . . . . . . . . . 181
5.6 Results of clustering on the INEX IEEE dataset using 18 categories . . . . 183
5.7 Results of clustering on the INEX 2009 dataset . . . . . . . . . . . . . . . . 184
5.8 Details of ad hoc queries with large categories . . . . . . . . . . . . . . . . . 185
5.9 Constraint lengths for the real-life datasets . . . . . . . . . . . . . . . . . . 197
A.1 Details of all the categories in INEX 2009 dataset using Wikipedia categories 228
A.2 Details of all categories in INEX 2009 dataset using ad hoc queries ordered
by the topic Id . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
B.1 Runtime comparison of Length Constrained Subtrees on F5 dataset . . . . 231
B.2 Length Constrained Subtrees in F5 dataset . . . . . . . . . . . . . . . . . . 232
Glossary of terms and abbreviations
CFE Closed Frequent Embedded Subtrees.
CFI Closed Frequent Induced Subtrees.
Content in XML document The text between the start and end tags in an XML
document. For instance, in <Title>Data Mining</Title>, the content is the text “Data
Mining” between the tags <Title> and </Title>.
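Under this definition, the content of an element can be read off with any XML parser; the following is a minimal sketch using Python's standard library, applied to the <Title> example from the definition:

```python
import xml.etree.ElementTree as ET

# Parse the example element from the definition above.
doc = ET.fromstring("<Title>Data Mining</Title>")

# The content is the text between the start and end tags.
content = doc.text
# content == "Data Mining"
```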
Frequent Patterns Patterns that occur more often than a user-specified threshold
(minimum support or min supp) in a given dataset.
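The notion of support can be illustrated with a minimal sketch. It assumes, purely for illustration, that each transaction is a set of items and that support is the fraction of transactions containing a pattern; the thesis itself mines frequent subtrees, which generalise this idea to tree structures:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_supp):
    """Return itemsets whose support meets the user-specified min_supp.

    Support is the fraction of transactions that contain the itemset.
    Enumerating every subset is feasible only for tiny examples.
    """
    n = len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_supp}

docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
patterns = frequent_itemsets(docs, min_supp=2 / 3)
# ("a",) occurs in all 3 transactions; ("b", "c") in only 1 of 3,
# so it falls below the threshold and is not frequent.
```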
Frequent Pattern Mining A data mining task focussing on the extraction of frequent
patterns from a given dataset.
HCX-V Hybrid Clustering of XML documents using Vector Space Model (VSM).
HCX-T Hybrid Clustering of XML documents using Tensor Space Model (TSM).
INEX INitiative for Evaluation of XML Retrieval.
IR Information Retrieval.
LSI Latent Semantic Indexing.
MFE Maximal Frequent Embedded Subtrees.
MFI Maximal Frequent Induced Subtrees.
PCA Principal Component Analysis.
PCETMiner Prefix-based Closed Embedded Tree Miner.
PCETMinerConst Length Constrained Prefix-based Closed Embedded Tree Miner.
PCITMiner Prefix-based Closed Induced Tree Miner.
PCITMinerConst Length Constrained Prefix-based Closed Induced Tree Miner.
PMETMiner Prefix-based Maximal Embedded Tree Miner.
PMETMinerConst Length Constrained Prefix-based Maximal Embedded Tree Miner.
PMITMiner Prefix-based Maximal Induced Tree Miner.
PMITMinerConst Length Constrained Prefix-based Maximal Induced Tree Miner.
RI Random Indexing.
Structure in XML document The element tags and their nesting, which dictate the
structure of an XML document.
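Since the nesting of tags is what defines structure, it can be made concrete by listing tag paths; a minimal sketch follows, using illustrative tags loosely modelled on the conf.xml example (the exact element names are assumptions):

```python
import xml.etree.ElementTree as ET

xml = "<conf><title>ICDM</title><year>2010</year></conf>"
root = ET.fromstring(xml)

def paths(node, prefix=""):
    """Yield tag paths; the nesting of element tags is the structure."""
    here = prefix + "/" + node.tag
    yield here
    for child in node:  # children in document order
        yield from paths(child, here)

structure = list(paths(root))
# ['/conf', '/conf/title', '/conf/year']
```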
Subtree A tree formed from a subset of the nodes of another tree that preserves the hierarchical relationships among those nodes.
SVD Singular Value Decomposition.
TSM Tensor Space Model.
VSM Vector Space Model.
XML eXtensible Markup Language.
XML Frequent Pattern Mining Mining of XML documents for frequent patterns
which are structure-oriented, content-oriented, or a combination of both.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements
for an award at this or any other higher education institution. To the best of my knowledge
and belief, the thesis contains no material previously published or written by another
person except where due reference is made and except for one of the evaluation measures
for clustering, collection selection evaluation, discussed in Section 3.5.3 in Chapter 3, which
was developed in collaboration with other volunteers in the clustering task in the INEX
forum. Also, the concept of cumulative recall plots for clustering discussed in subsection
5.5.1.5 in Section 5.5 in Chapter 5 was developed by Chris de Vries, a team member in
the INEX forum.
Signature:
Date:
Publications Derived from this Thesis
1. Kutty, S., R. Nayak, and Y. Li. XML documents clustering using tensor space
model, in proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD 2011), Shenzhen, China (to appear in 2011).
2. Kutty, S., T. Tran, and R. Nayak, A study of XML models for data mining: repre-
sentations, methods, and issues in XML data mining: models, methods, and appli-
cations, A. Tagarelli, Editor, Idea Group Inc., USA (to appear in 2011).
3. Kutty, S., R. Nayak, and Y. Li, Utilising semantic tags in XML clustering, in Focused
Retrieval and Evaluation, S. Geva, J. Kamps, and A. Trotman, Editors. 2010,
Springer Berlin / Heidelberg. p. 416-425.
4. Kutty, S., R. Nayak, and Y. Li. XML documents clustering using tensor space
model-A preliminary study, in proceedings of the IEEE ICDM 2010 Workshop on
Optimization Based Methods for Emerging Data Mining Problems (OEDM ’10).
2010, Sydney, Australia.
5. Kutty, S., R. Nayak, T. Tran, and Y. Li. Clustering XML documents using frequent
subtrees, in Advances in Focused Retrieval, S. Geva, J. Kamps, and A. Trotman,
Editors. 2009, Springer Berlin / Heidelberg. p. 436-445.
6. Kutty, S., R. Nayak, and Y. Li, XCFS: an XML documents clustering approach using
both the structure and the content, in proceedings of the 18th ACM conference on
Information and knowledge management. 2009, ACM: Hong Kong, China. p. 1729-
1732.
7. Kutty, S., R. Nayak, and Y. Li, HCX: an efficient hybrid clustering approach for XML
documents, in proceedings of the 9th ACM symposium on Document engineering.
2009, ACM: Munich, Germany. p. 94-97.
8. Kutty, S., R. Nayak, and Y. Li, XML data mining: process and applications, in
Handbook of Research on Text and Web Mining Technologies, M. Song and Y.-F.
Wu, Editors. 2008, Idea Group Inc., USA.
9. Kutty, S., R. Nayak, T. Tran, and Y. Li. Clustering XML documents using closed
frequent subtrees: A structural similarity approach, in Advances in Focused Re-
trieval, N. Fuhr, J. Kamps, M. Lalmas and A. Trotman, Editors. 2008, Springer
Berlin / Heidelberg. p. 183-194.
10. Kutty, S., R. Nayak, and Y. Li, PCITMiner: prefix-based closed induced tree miner
for finding closed induced frequent subtrees, in proceedings of the sixth Australasian
conference on Data mining and analytics - Volume 70. 2007, Australian Computer
Society, Inc.: Gold Coast, Australia. p. 151-160.
Acknowledgements
I would like to express my sincere gratitude and deep appreciation to my principal
supervisor, Dr. Richi Nayak, for her continuous guidance, encouragement, and support
throughout this research. The achievements in this thesis would not have been possible without
her supervision. I also thank Prof. Yuefeng Li, my associate supervisor, for his valuable
suggestions concerning my research and for helping me with my publications. I am thankful
to the QUT faculty-based award (QUTFBA) for funding me and my research.
Thanks to Dr. Lei Zhou for providing the PrefixTreeISpan and PrefixTreeESpan for
benchmarking purposes. Also, a special thanks to Prof. Mohammed Javeed Zaki for the
TreeMinerV.
My thanks also go to the volunteers of the INEX forum for the availability of the
INEX datasets and the evaluation methods that have been used in this thesis. My special
thanks to QUT’s High Performance Computing (HPC) team for providing the facilities to
conduct experiments on HPC systems.
I am grateful to my husband, Mr. Anand Kutty and my children Preeti Kutty and
Deepti Kutty for their love, understanding, support and patience. I am indebted to
my parents-in-law, Dr. Kutty Venkatesan and Dr. Suguna Venkatesan for their constant
support and advice during difficult times. My special thanks to my beloved parents,
Mr. R. K. Srinivasan and Mrs. R. Udayakumari for their love, encouragement and support
throughout my studies.
Further thanks go to my colleagues in the Faculty of Science and Technology, Mrs. Dinesha
Weragama, Mr. Reza Hassanzadeh, Mrs. Esther Ge, Mr. Rakesh Rawat, Mr. Daniel
Emerson, Mr. Aishwarya Bose and Mr. Paul de Braak for creating a friendly environment
to share our knowledge. My special thanks to Mrs. Tien Tran for her suggestions and
co-operation throughout this research.
Thank you everyone for providing me with the opportunity to do this research.
Chapter 1
Introduction
With increasingly distributed intranets and with the massive growth of the Internet, XML
(eXtensible Markup Language) has now become a ubiquitous standard for information
representation and exchange for both intranet and Internet [120]. Due to the simplicity
and flexibility of XML, a diverse variety of applications ranging from scientific literature
and technical documents to handling news summaries [104] utilise XML in information
representation and exchange. More than 50 domain-specific languages have been devel-
oped based on XML [25], such as MovieXML for encoding movie scripts, GraphML for
exchanging graph structured data, Geography Markup Language (GML) for expressing ge-
ographical features and interchanging them over the Internet, Twitter Markup Language
(TML) for structuring Twitter streams, Chemical Markup Language, Mathematical
Markup Language (MathML) [101] and many others. XML has also been used to represent
the web-based free-content encyclopedia known as Wikipedia, which has more than 3.4
million XML documents.
The increased popularity of XML has raised many issues regarding how to effectively
manage XML data and retrieve XML documents from large collections. A possible
solution to the problem of handling large XML collections is to
group similar XML documents. This task of grouping in data mining is referred to as
clustering. Clustering groups unlabelled data into smaller groups according to their
commonality, without any prior knowledge about the dataset. The clustering
of similar XML documents has been perceived as potentially being one of the more effective
solutions to improve document handling by facilitating better information retrieval, data
indexing, data integration and query processing [109].
In spite of its potential, there are several challenges in clustering XML documents.
Unlike the clustering of text documents or flat data, clustering of XML documents is
an intricate process [69] and consequently the most commonly used clustering methods
for text clustering cannot be used for clustering these documents. This is because
XML documents are semi-structured in nature: they have a flexible structure, and
their content carries the semantics. The semi-structured nature of XML data
requires the computation of similarity by including their structural similarity. However,
the inclusion of structure increases the dimensionality that the clustering method needs
to handle.
XML gained its popularity because of its structure and its inherent flexibility in repre-
senting content. However, most of the XML clustering methods adopt a naïve approach,
utilising either only the content features while ignoring the structure features, or only the
structure features while ignoring the content. Nevertheless, these methods, with their single-feature
focus, have a significant cost associated with them, since all the valuable information that
is embedded in the documents is potentially lost. Hence, they tend to falsely group docu-
ments that are similar in only one of the two features. To correctly identify similarity among documents,
the clustering process should use both their structure and their content information.
This research focuses on finding whether combining the structure and the content of
XML documents improves the accuracy of the clustering results. With the explosion in
the number of XML documents, clustering just the content of XML documents itself is
expensive. To combine the structure of XML documents along with the content will add
more complexities. Hence, it is essential to have an effective and efficient pre-processing
technique to reduce the dimensionality of XML documents for such a combination. This
research identifies ways of utilising the frequent patterns to reduce the dimension and
combine both the structure and the content of XML documents for use in clustering. The
structure and the content of the XML documents can be combined either implicitly or
explicitly. These combinations not only allow for a reduction in the dimensionality of
the terms but also prove efficient in improving the quality of the clustering solution
over varied types of datasets.
This dissertation will explore two main areas. Firstly, it looks into frequent pattern
mining to generate concise frequent patterns for the efficient pre-processing of XML doc-
uments for clustering. Secondly, it looks into the clustering of XML documents, com-
bining the structure and the content in two ways, implicit and explicit,
by employing the concise frequent patterns generated beforehand. The methods proposed in
both frequent pattern mining and clustering are evaluated against other state-of-the-art
methods using a number of datasets showing diverse characteristics.
1.1 Motivation
With the growing number of XML documents on the Internet and organisational intranets,
it becomes essential to effectively organise these XML documents in order to
retrieve useful information from them. In the absence of such an effective organisation,
a query must search the entire collection of XML documents. This
search will not only be computationally expensive but could also result in poor response
times for even simple queries.
In order to effectively manage the XML documents collection, it is indispensable to
apply clustering methods on these documents to group them based on their similarity.
Figure 1.1 shows an information retrieval scenario using the clustering of XML documents.
In this scenario, a user has an information need and hence makes a request using a query
to the Information Retrieval (IR) system. Instead of searching the entire collection, the
efficiency and the precision of the search engine can be improved if the retrieval system
searches only a subset of the collection in the form of clusters of documents.
Figure 1.1: Using clustering in Information Retrieval
In addition to applications in IR, clustering can also be applied to discover knowledge
for effective data/schema management, web mining and query processing. In spite of
the benefits of using XML, clustering XML documents is not as trivial a process as
clustering text documents. Many challenges arise in the clustering of XML
documents due to the nature of these documents. They are:
• The presence of two main features. Unlike text documents, which are unstructured,
XML documents are semi-structured in nature and contain two features – structure
and content. The structure of an XML document is used to store its content; hence
the clustering method should not be applied to only one feature but should instead
be applied to both of these features.
• A hierarchical relationship. The structure of XML documents maintains a hierarchical
relationship among their elements. Hence, this relationship should be preserved
while clustering.
• User-defined tags. XML allows users to create their own tags. This flexibility
in design results in polysemy problems. The same tag name can convey different
meanings based on the context in different XML documents. For example, the tag
“bank” can mean “a financial institution”, “a river bank” or as a verb “to rely upon”.
The example shown in Figure 1.2 reveals the importance of using both the structure
and content features for XML clustering. Figure 1.2 shows the fragments of six XML
documents from the publishing domain: the XML fragments shown in (a), (b), (c) and
(e) share the same structure and the fragments in (d) and (f) share a similar structure.
It can be noted in Figure 1.2 that although the fragments in (a) and (b) have a
similar structure to fragments in (c) and (e), these two sets of fragments differ in their
content. Utilising a clustering method based only on the structure similarity of XML
documents will result in two clusters about “Books” and “Conference Articles”. However,
this fails to further distinguish the documents in the “Books” cluster based on the subjects
“Biology” and “Data Mining”, resulting in meaningless clusters. On the other hand,
utilising a clustering method based only on content similarity provides clusters based only
on the subject and not on the type of publication and hence fails to distinguish between
“Books” and “Conference Articles”. In order to derive meaningful clusters, these fragments
should be analysed in terms of both their structure and content similarity. Clustering the
XML documents by considering the structure and content features together will result
in three clusters, namely “Books on Data Mining (DM)”, “Books on Biology (Bio)” and
“Conference articles on Data Mining” having (a) and (b), (c) and (e), and (d) and (f)
fragments respectively. These kinds of meaningful clusters could be used for the effective
storage and retrieval of XML documents.

Figure 1.2: A sample XML dataset
The clustering task on XML documents involves grouping the XML documents with-
out any prior knowledge according to the structure and content similarities among them.
Clustering methods utilising only the structure features of the documents cannot accu-
rately group the documents that are similar in structure but diverse in content. On the
other hand, clustering methods utilising only the content features of the documents con-
sider the documents as a “bag of words” and ignore the structure features [73]. The
disadvantage of these types of methods is that when there are two documents that are
similar in content but different in structure, these may be falsely grouped as belonging
to the same group. Therefore, this thesis proposes to develop clustering methods for XML
documents by considering both the structure and the content features.
Often the XML documents are represented in the Vector Space Model (VSM) to be
processed for clustering [94]. VSM is a model for representing documents as vectors of
identifiers. The inclusion of both structure and content features in VSM results in very
high dimensionality for the input matrix. The application of clustering on this matrix
becomes an expensive task in terms of memory consumption and computational time for
very large datasets. To mitigate this problem, it is vital to reduce the dimensions of
the input data and create a suitable data model without compromising the accuracy of
clustering results.
This research proposes a method of utilising frequent patterns to address the dimen-
sionality explosion caused by the combination. These frequent patterns generated using
frequent pattern mining methods are used to reduce the size of the input data matrix.
These patterns also aid in the creation of an effective data model for clustering the XML
documents by capturing the relationship between their structure and content.
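As a minimal sketch of this idea (illustrative only; the actual mining methods are proposed in Chapter 4, and the function name and support threshold below are hypothetical), features occurring in fewer documents than a minimum support can be pruned before the VSM matrix is built:

```python
from collections import Counter

def frequent_vsm(docs, min_support):
    """Build a document-feature frequency matrix restricted to frequent features.

    docs: list of token lists; min_support: minimum number of documents a
    feature must occur in to be retained as a dimension.
    """
    # Document frequency of each feature
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only features meeting the support threshold (the frequent patterns)
    vocab = sorted(t for t, c in df.items() if c >= min_support)
    index = {t: i for i, t in enumerate(vocab)}
    # One frequency vector per document over the reduced vocabulary
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1
        matrix.append(row)
    return vocab, matrix

docs = [["book", "title", "mining"],
        ["book", "title", "biology"],
        ["conference", "title", "mining"]]
vocab, matrix = frequent_vsm(docs, min_support=2)
# "biology" and "conference" occur in only one document each and are pruned,
# shrinking the matrix from 5 columns to 3.
```

The same pruning applies whether the features are content terms or structural paths, which is why frequent pattern mining doubles as a pre-processing step for clustering.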
Further, this research uses the Tensor Space Model (TSM), a higher dimensional model,
for modelling the XML documents by directly capturing their structure and content rela-
tionships. It also proposes scalable tensor decomposition techniques to effectively analyse
the relationships between the structure and content of the XML documents and to aid in
clustering these documents. Finally, this research evaluates the output of the clustering
process, the clusters of XML documents, for information retrieval.
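The TSM idea can be sketched as follows (a toy example with hypothetical axes and values; the actual model and its decomposition are developed in Chapter 5). A collection is modelled as a third-order tensor, for example documents × structural features × terms, and matricization unfolds it into a matrix so that ordinary matrix techniques can be applied:

```python
import numpy as np

# A toy third-order tensor: 2 documents x 3 structural features x 4 terms.
# Entry T[d, s, t] could hold the frequency of term t occurring under
# structural feature s in document d (values here are arbitrary).
T = np.arange(24).reshape(2, 3, 4)

def matricize(tensor, mode):
    """Unfold a tensor along the given mode into a matrix (one common
    convention for mode-n matricization: rows index the chosen mode,
    columns enumerate all combinations of the remaining modes)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Mode-0 unfolding: one row per document, columns enumerate
# (structural feature, term) pairs.
X = matricize(T, 0)
```

Decomposition techniques then operate on such unfoldings; the column ordering differs between conventions, but any fixed ordering preserves the tensor's entries.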
1.2 Research questions
The following research questions have been examined in this research:
• How to cluster XML documents effectively?
– How can the structure and content of XML documents be combined for im-
proving the accuracy of a clustering solution?
– How does a clustering method using both of these features perform on real-life
datasets?
– How does the clustering method using both of these features compare with
clustering methods using a single-feature focus?
– How to handle the high dimensionality resulting from the combination? Do
the dimensionality reduction techniques incur information loss?
• How to utilise frequent patterns to control the dimensionality of the combination of
structure and content features?
– Are the state-of-the-art frequent pattern mining methods scalable for large-
sized real-life XML datasets? If not, how to improve the efficiency and the
effectiveness of these methods?
1.3 Research aims & objectives
The objective of this research is to provide methods for more effective and meaningful
grouping of XML documents. To achieve this objective, the concept of clustering is utilised
to group similar XML documents based on the common structure and content that they
share. The high dimensionality due to the combination of these two features is reduced
by employing the XML frequent pattern mining methods proposed in this research.
The proposed research can be broken down into two separate tasks, namely:
• XML frequent pattern mining: Develop frequent pattern mining methods to
mine for concise representations of frequent subtrees from the structure of XML
documents represented as trees. Generate different types of subtrees, with a parent-child
relationship or with an ancestor-descendant relationship, to identify hidden similarities
based on these relationships. Also, analyse the effectiveness of these subtrees in
capturing the structural commonalities for use in clustering.
• XML clustering: Develop hybrid clustering methods to group the documents
based on the content corresponding to the concise frequent structures in each document.
Explore different forms of representation, implicit and explicit, by use of the
Vector Space Model and higher-order models such as the Tensor Space Model
respectively. Also, analyse the suitability of the different models for clustering
various types of XML document collections.
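The distinction between the parent-child and ancestor-descendant relationships mentioned above can be sketched as follows (the tree and its labels are hypothetical; the actual tree encodings are described in Chapter 4):

```python
# Toy tree as a child -> parent map (the root has parent None):
#        conf
#       /    \
#   paper    year
#     |
#   title
parent = {"conf": None, "paper": "conf", "year": "conf", "title": "paper"}

def is_ancestor(tree, a, b):
    """Return True if node a is a proper ancestor of node b in the tree."""
    node = tree[b]
    while node is not None:
        if node == a:
            return True
        node = tree[node]
    return False

# The edge (conf, title) is admissible in an *embedded* subtree, which only
# requires an ancestor-descendant relationship, but not in an *induced*
# subtree, which requires a direct parent-child edge:
embedded_ok = is_ancestor(parent, "conf", "title")   # ancestor-descendant holds
induced_ok = parent["title"] == "conf"               # parent-child does not
```

Because every parent-child edge is also an ancestor-descendant pair, every induced subtree is embedded, but not vice versa; mining embedded subtrees therefore uncovers similarities that induced subtrees miss.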
1.4 Research significance and contributions
As XML has become a popular standard for data exchange on the Internet, it is essential
to store XML documents effectively to facilitate easy management of the XML documents.
Hence, this research work contributes to the existing body of literature by providing more
accurate grouping using clustering methods that are applicable to real-life
datasets. By using not only the structure but also the content similarity among the XML
documents, the accuracy of the clustering solution can be improved.
This research makes important contributions to clustering by proposing two novel
approaches of non-linearly combining the structure and the content of XML documents
using implicit and explicit combinations. The proposed clustering methods are extensively
evaluated on real-life datasets to analyse their performance on these datasets and to un-
derstand the suitability of the proposed clustering methods in practice. The results were
also evaluated on the collection selection problem in information retrieval which is based
on query results from manual assessors.
Moreover, in contrast to the previous research in XML clustering, this research con-
tributes by proposing a novel way of using a high dimensional data model, the Tensor
Space Model (TSM) [63], to explicitly capture non-linearly both the structure and the
content of XML documents. It also provides an incremental decomposition technique to
decompose large-sized dense tensors efficiently and effectively.
This research makes a vital contribution by efficiently reducing the dimensionality of
the dataset for clustering by utilising only the frequent patterns in the XML documents
as well as their corresponding content. This research also attempts to bridge the gaps
in frequent pattern mining by proposing various types of concise frequent pattern mining
methods to generate frequent subtrees based on the node relationship and conciseness.
This research also evaluates the proposed frequent pattern mining methods against several
state-of-the-art methods to show their strengths and weaknesses in both synthetic and
real-life datasets. Additionally, the effectiveness of the proposed frequent pattern mining
methods was evaluated on clustering.
By converging two parallel fields in data mining, frequent pattern mining and clus-
tering, this research has enriched the knowledge as well as bridged the gaps in these two
fields.
1.5 Thesis overview
This thesis is designed to explore the use of combining the structure and the content
features in XML documents for clustering. The study will primarily focus on developing
clustering methods for XML documents to identify interesting knowledge effectively. The
secondary focus of this thesis is to develop frequent pattern mining methods for reducing
the dimensionality of the input matrix for clustering. Furthermore, this thesis attempts
to evaluate the possibility of enhancing information retrieval by the use of data mining
outputs.
The remainder of the thesis is organised as follows:
• Chapter 2 reviews recent developments in the main topics of this research, both
clustering and frequent pattern mining. It also covers XML and the data models
for XML mining. This chapter provides an insight into the state-of-the-art methods
and their weaknesses in both clustering and frequent pattern mining, which has
helped to identify the research gaps in both of these tasks.
• Chapter 3 describes the research design used in this thesis, including the experimental
design, a detailed description of the datasets and the various evaluation metrics
used to evaluate the frequent pattern mining and clustering methods. It also
discusses the benchmarks used to evaluate the proposed methods.
• Chapter 4 covers the proposed frequent pattern mining methods for the purpose
of clustering. It includes details about the prefix-based pattern growth approach
for generating the different types of concise frequent subtrees, and details the various
techniques for efficiently mining them. It also proposes new types of concise
subtrees suitable for mining very large and dense datasets. Finally, it presents the
empirical evaluation on both synthetic and real-life
datasets and analyses the results.
• Chapter 5 presents the proposed methods for combining the structure and the
content of the XML documents. It begins with the proposal of the hybrid clustering
methodology called Hybrid Clustering of XML documents (HCX) for non-linearly
combining the structure and the content features of XML documents. Within this
methodology, Hybrid Clustering of XML documents using the VSM (HCX-V) uses
the implicit combination, while Hybrid Clustering of XML documents using the
Tensor Space Model (HCX-T) uses the explicit combination. The proposed clustering
methods are evaluated using the metrics defined in Chapter 3. The empirical results
and the analysis of the experiments comparing the proposed methods against the
benchmarks on the various datasets are also covered in this chapter.
• Chapter 6 presents the final conclusions, summarises the findings, and lists the
main contributions of the work developed in this thesis. A few research extensions
from this work are also identified.
Chapter 2
Background and Literature Review
The main focus of this chapter is twofold: to provide the background knowledge and
to present a critical review of the related work relevant to the two pattern mining tasks,
namely frequent pattern mining and clustering on XML data. This chapter begins with
an introduction to XML data to provide an overview of the data domain considered
in this research. Section 2.1 introduces the concept of XML data
and explains how it differs from its counterparts, such as text and unstructured data,
with regard to data mining. Section 2.2 describes the various data models that have been used
for modelling XML for mining. There exist several data models, such as the Vector Space
Model (VSM), paths, trees and graphs, and each of them is discussed in detail in this
section.
Further, this chapter analyses the various XML clustering methods to date according
to the data models, the similarity measures and the methods that are used for clustering in
Section 2.3. Finally, the related works pertaining to the pre-processing step for clustering,
that is frequent pattern mining, are covered in detail. This chapter also provides a review
of the literature related to the various frequent pattern mining methods based on the
different data models and frequent pattern generation techniques. This chapter concludes
by presenting the limitations of the related works, identifying the research gaps that
need to be addressed in this research.
2.1 XML
The eXtensible Markup Language (XML) is a markup language defined by the World
Wide Web Consortium (W3C) for improved data representation and exchange over its
predecessors. XML is a simplified form of the Standard Generalized Markup Language
(SGML) [1, 98, 62]. SGML is a notation that has been widely used for a number of years for
professional document preparation [62]. However, SGML was found to be cumbersome
and difficult to learn, as it attempted to provide many features and flexibilities.
One example of SGML’s flexibility is that it allowed the absence of end tags based on the
context. These disadvantages have resulted in the development of XML which was simpler
and less flexible than SGML.
XML differs from the popular HyperText Markup Language (HTML). XML is used
to describe the content, while HTML is used to describe the format and display of the
same content. Apart from this, XML allows for user-defined tags and hence has a much
more flexible structure than HTML, which uses only pre-defined tags. Using the user-
defined tags, XML could specify not only the data but also the structure of the data.
These tags also help to create nesting to show how various elements present in XML data
are integrated into other elements. Due to this, XML data are often referred to as self-
describing. Also, the XML data can be represented in a common data format which helps
the processing and displaying of these data in an application and platform independent
way. Sol [99] has highlighted the four major benefits of using XML language:
� XML separates data from presentation which means making changes to the display
14
of data does not affect the XML data;
� Searching for data in XML documents becomes easier as search engines can parse
the description-bearing tags of the XML documents;
� An XML tag is human readable; even a person with no knowledge of XML language
can still read an XML document; and
� Complex structures and relations of data can be encoded using XML.
There are two types of XML data: XML schema definition and XML document as
shown in Figure 2.1. An XML schema definition contains the structure and data definitions
of XML documents [2]. An XML document, on the other hand, is an instance of the XML
schema that contains the data content represented in a structured format.
XML Data
XML Document XML Schema
Structure Content DTD XSD
Ill-formed Well-formed Valid
Figure 2.1: Classification of XML data
The provision of the XML schema definition with XML documents makes it different
from the other types of semi-structured data such as HTML and BibTeX. The schema
imposes restrictions on the syntax and structure of XML documents. The two most
popular XML document schema languages are Document Type Definition (DTD) and
XML-Schema Definition (XSD). XSD, an enhancement of DTD, consists of the following
features:
• Extensibility to future additions;
• Greater richness and usefulness;
• Ease of learning, as XSD is written in XML;
• Wider range of data type support; and
• New features such as namespace support.
Figures 2.2 and 2.3 show DTD and XSD examples respectively. An example of a simple
XML document conforming to these schemas is shown in Figure 2.4. An XML document
can be either ill-formed, well-formed, or valid, according to how it abides by the XML schema
definition. An ill-formed document does not have a fixed structure, which implies that
it does not conform to the XML syntax rules; for example, it may lack an XML declaration
statement or contain more than one root element. A well-formed document conforms to the
XML syntax rules and may have a document schema without actually conforming
to it. It contains exactly one root element, and sub-elements are properly nested within
each other. Finally, a valid XML document is a well-formed document which conforms to
a specified XML schema definition [100].
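As a minimal sketch of the well-formedness distinction (illustrative only; checking validity against a DTD or XSD would additionally require a validating parser such as lxml), Python's standard library parser rejects ill-formed input:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A properly nested single root element is well-formed:
ok = is_well_formed("<conf><title>SIAM</title></conf>")
# Two root elements violate the XML syntax rules, so this is ill-formed:
bad = is_well_formed("<conf></conf><conf></conf>")
```

A document passing this check may still be invalid if it does not conform to its declared schema; validity is a strictly stronger property than well-formedness.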
Each XML document can be divided into two parts – markup constructs and content.
A markup construct consists of the characters that are marked up using "<" and ">".
The content is the set of characters that is not included in the markup. There are two types
of markup constructs: tags (or elements) and attributes. Tags are the markup constructs
delimited by a start tag (such as "<conf>") and an end tag (such as "</conf>"); examples
are conf, title, year and editor in the conf.xml document in Figure 2.4. On the other hand,
attributes are markup constructs consisting of a name/value pair that exists within a start tag. In
<!ELEMENT conf (id, title, year, editor?, paper*)>
<!ATTLIST conf id ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT editor (person*)>
<!ELEMENT paper (title, author, references?)>
<!ELEMENT author (person*)>
<!ELEMENT person (name, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT references (paper*)>
Figure 2.2: Sample DTD (conf.dtd)
the running example, id is the attribute and SIAM10 is its value. Examples of content
are “SIAM Data Mining Conference”, “2010” and “Bing Liu”.
Due to the widespread use of XML documents for various applications such as data
transformation, integration and retrieval, there has been a great deal of interest in obtain-
ing useful information from a large collection of XML documents by mining the data [81]. Mining on
XML data can be broadly classified into four major categories namely XML classification
[136], XML clustering [86, 104, 136], XML frequent pattern mining [54, 107, 122, 138] and
XML association rules mining (or link analysis) [17, 116]. Among these, XML clustering
and XML frequent pattern mining tasks have been more popular amongst researchers due
to their usability in varied application domains. Also, XML frequent pattern mining is one
of the first tasks in generating association rules. In the following sections, the literature on
these two popular data mining tasks, frequent pattern mining and clustering, which are
also the focus of this thesis will be reviewed.
Before going into the details of the mining tasks, it is essential to review the literature
on data models to understand how XML documents could be effectively represented for
mining. The following section reviews the various data models that have been used for
mining XML documents.
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.conferences.org"
            xmlns="http://www.conferences.org"
            elementFormDefault="qualified">
  <xsd:element name="conf">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="id" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="year" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="editor" minOccurs="0" maxOccurs="unbounded"/>
        <xsd:element ref="paper" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="editor">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="paper">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="title" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="author" minOccurs="1" maxOccurs="unbounded"/>
        <xsd:element ref="references" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="author">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="person" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="person">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="name" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="email" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="references">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="paper" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="id" type="xsd:string"/>
  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="email" type="xsd:string"/>
</xsd:schema>
Figure 2.3: Sample XSD (conf.xsd)
<?xml version="1.0"?>
<!DOCTYPE conf SYSTEM "conf.dtd">
<conf id="SIAM10">
  <title>SIAM Data Mining Conference</title>
  <year>2010</year>
  <editor>
    <person>
      <name>Bing Liu</name>
      <email>[email protected]</email>
    </person>
  </editor>
  <paper>
    <title>MACH: Fast Randomized Tensor Decompositions</title>
    <author>
      <person>
        <name>Charalampos E. Tsourakakis</name>
        <email>[email protected]</email>
      </person>
    </author>
    <references>
      <paper>
        <title>Unsupervised multiway data analysis: A literature survey</title>
        <author>
          <person>
            <name>Acar E</name>
            <email>[email protected]</email>
          </person>
        </author>
        <author>
          <person>
            <name>Yener B</name>
            <email>[email protected]</email>
          </person>
        </author>
      </paper>
    </references>
  </paper>
</conf>
Figure 2.4: Sample XML document (conf.xml)
2.2 Data models for XML mining
To suit the objectives and the needs of XML mining methods, XML data has been repre-
sented in various models. The XML data models can be classified into four major categories
namely Vector Space Model, graph, tree and path models as illustrated in Figure 2.5.
Figure 2.5: Classification of XML models
2.2.1 Vector Space Model (VSM)
The Vector Space Model (VSM) was initially proposed for representing text documents
or objects as vectors of features. It has been used widely in both frequent pattern
mining and clustering for modelling XML documents [17, 116, 36, 114]. When the VSM
is used for XML mining, a feature of the document structure is a substructure, which
can be a tag, subpath, subtree or subgraph, while a feature of the document content is
a term, that is, a pre-processed word.
In VSM, each of the documents is represented as a feature vector containing either
binary (0 or 1), frequency or weighted values. There are two ways of representing the
XML document in VSM: dense and sparse. In the dense representation of an XML document
collection, each document vector contains an entry for every feature in the collection;
if a document does not contain a feature, then the entry for that feature has a value of 0.

(a)
     f1  f2  f3  f4  f5  f6
d1    1   1   0   2   6   0
d2    0   0   0   2   0   0
d3    0   0   3   0   0   1

(b)
d1  (f1, 1) (f2, 1) (f4, 2) (f5, 6)
d2  (f4, 2)
d3  (f3, 3) (f6, 1)

Figure 2.6: An example of (a) a dense representation; (b) a sparse representation of
an XML dataset modelled in VSM using their feature frequency

The sparse VSM representation retains only the non-zero values along with
the feature (tag or term) id. This improves computational efficiency, especially for
sparse datasets in which the non-zero values are few compared to the zeroes. Figure 2.6
gives examples of a dense and a sparse representation using only the frequency of each
feature, and shows that the sparse representation is smaller than the dense one. Mining
using the sparse representation is more efficient when zeroes dominate; however, if there
is a greater number of non-zeroes, then the sparse representation can incur additional
overhead, as the feature indices are also stored.
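As a small sketch of this trade-off, assuming the toy frequencies of Figure 2.6, the dense vectors can be converted to sparse (feature id, frequency) pairs and the stored value counts compared:

```python
# Dense vs sparse VSM: the sparse form keeps only non-zero entries,
# at the cost of storing the feature indices alongside the frequencies.
dense = {
    "d1": [1, 1, 0, 2, 6, 0],
    "d2": [0, 0, 0, 2, 0, 0],
    "d3": [0, 0, 3, 0, 0, 1],
}

def to_sparse(vector):
    """Keep only non-zero entries as (1-based feature id, frequency) pairs."""
    return [(k + 1, v) for k, v in enumerate(vector) if v != 0]

sparse = {doc: to_sparse(vec) for doc, vec in dense.items()}
print(sparse["d1"])  # [(1, 1), (2, 1), (4, 2), (5, 6)]

# Dense stores one value per feature; sparse stores two values
# (index and frequency) per non-zero entry.
dense_cells = sum(len(v) for v in dense.values())        # 18
sparse_cells = sum(2 * len(v) for v in sparse.values())  # 14
```

Here the sparse form wins, but a dataset with mostly non-zero entries would tip the count the other way, which is the overhead noted above.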
Table 2.1 shows a sparse VSM using the tags of the sample XML document given in
Figure 2.4. It can be seen that this model is not only simple but also enables an easy
representation of data and allows simpler management and analysis of the data. However,
a serious problem arises due to the nature of this model when it is used in mining, since
it does not consider the hierarchical relationship between the tags of the XML document.
Consequently, mining methods using this model for XML data representation can provide
inaccurate results.
Table 2.1: VSM generated from the structure of the XML document given in Figure 2.4

Document Id | Tags
1 | conf 1, title 3, year 1, editor 1, person 4, name 2, email 4, paper 2, author 3, reference 1
Consider the example fragments from the sample document given in Figure 2.7: the
fragment <name>"Bing Liu"</name> in Figure 2.7 (a) and the fragment
<name>"Charalampos E. Tsourakakis"</name> in Figure 2.7 (b). Both fragments contain
the same tag set; however, the former refers to the editor's name and the latter to
the author's name.
(a)
<editor>
  <person>
    <name>Bing Liu</name>
    <email>[email protected]</email>
  </person>
</editor>

(b)
<author>
  <person>
    <name>Charalampos E. Tsourakakis</name>
    <email>[email protected]</email>
  </person>
</author>

Figure 2.7: Sample XML fragments
If frequent pattern mining is applied to the XML tags alone, then <name></name> will be
output as a frequent structure. In reality, the tag <name></name> is not a frequent
structure, as its parents are different. Hence, to avoid inaccurate results, it is
essential to consider the hierarchical relationships among the data items, not just
the tag names.
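The effect described above can be sketched by counting bare tags versus root-to-node paths; the path strings below are illustrative encodings of the fragments in Figure 2.7, not a representation used in the cited works:

```python
# Tag-only counting conflates occurrences of <name> under different parents;
# counting full root-to-node paths keeps the two contexts apart.
from collections import Counter

# Each fragment as a list of (root-to-node path, tag) observations.
fragments = [
    [("conf/editor/person/name", "name"),
     ("conf/editor/person/email", "email")],
    [("conf/paper/author/person/name", "name"),
     ("conf/paper/author/person/email", "email")],
]

tag_counts = Counter(tag for frag in fragments for _, tag in frag)
path_counts = Counter(path for frag in fragments for path, _ in frag)

print(tag_counts["name"])                      # 2 -> looks "frequent"
print(path_counts["conf/editor/person/name"])  # 1 -> the contexts are distinct
```

With a support threshold of 2, tag-only mining would report `name` as frequent, while path-qualified mining correctly reports neither context as frequent.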
VSM models the content of an XML document in the same way as the tag representation.
Common pre-processing techniques such as stop-word removal and stemming [89, 91] are
applied to the content to identify the unique words to be represented in the VSM. An
example is shown in Table 2.2 using the XML document in Figure 2.4: words such as
"on", "for", "a" and "of" are removed as stop words, and the remaining words are
reduced to their stems using stemming techniques. The words do not include the tag
names in the XML document; therefore, it is not clear whether the word "bing" refers
to an author or an editor of the paper. This may result in imprecise knowledge
discovery.
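A minimal sketch of this pre-processing pipeline follows; the stop-word list and suffix-stripping rules are illustrative stand-ins, not the stemmers of [89, 91]:

```python
# Content pre-processing sketch: lowercase, drop stop words, then apply a
# crude suffix-stripping stemmer (a toy stand-in for e.g. Porter stemming).
STOP_WORDS = {"on", "for", "a", "of", "the", "and"}

def crude_stem(word):
    """Strip one common suffix, keeping at least a 3-character stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("A survey of tensor decompositions for data mining"))
# ['survey', 'tensor', 'decomposition', 'data', 'min']
```

The over-aggressive stem "min" for "mining" illustrates why real systems use carefully designed stemmers; the mangled terms "mineng" and "survei" in Table 2.2 show the same behaviour in the stemmer used there.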
Thus, it is essential to include not only the hierarchical structural relationships among
the data items but also the content while mining XML documents.

Table 2.2: Transactional data model generated from the content of the XML document given in Figure 2.4

Document Id | Terms
1 | siam10 1, siam 1, data 1, mineng 1, 2010 1, bing 1, liu 1, liub 1, mach 1, fast 1, random 1, tensor 1, decomposition 1, charalampos 1, tsourakakis 1, ctsourak 1, cmu 1, unsupervised 1, multiwai 1, data 1, analysis 1, literature 1, survei 1, acar 1, acare 1, cs 3, rpi 2, edu 3, yener 2

One such model which preserves the hierarchical information is the graph. The following sub-
section will provide details of using this model.
2.2.2 Graphs
Due to the extensive research into graph mining methods for semi-structured data, many of
these methods have been applied to XML mining [5]. In these methods, the XML tags, along
with their hierarchical relationships, are modelled as graphs.

Graph: A graph can be defined as a triple (V, E, f), where V represents the set of nodes
(or vertices), E is the set of edges, and f: E → V × V is a mapping function. The nodes are
the elements of the XML documents, and the edge set E consists of the links that connect the
nodes so as to represent parent-child relationships.
As illustrated in Figure 2.8, there are different types of graphs, among which the
popular types are:
1. labelled graph or unlabelled graph;
2. directed or undirected graph;
3. cyclic or acyclic; and
4. connected or disconnected graph.
A labelled graph, denoted by (V, E, f, Σ, L), contains an additional alphabet Σ of node
labels (P, Q, R, S, T, U in Figure 2.8) and edge labels (A, B, C, D, E, F), with a
labelling function L that assigns the labels to the vertices and edges.

Figure 2.8: (a) A graph; (b) a labelled graph; (c) a directed graph

A graph is directed if each edge indicates the order in which its two vertices are
connected; otherwise it is undirected. A cyclic graph is one that contains a path whose
first and last vertices are the same. If all the vertices are connected in such a way
that there exists a path from every node to every other node in the graph, then it is a
connected graph; otherwise it is disconnected.
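These graph properties can be sketched in code; the adjacency structure and node names below are hypothetical, loosely following the labels of Figure 2.8:

```python
# A small directed graph as adjacency lists, with simple cycle and
# connectivity checks for the properties defined above.
edges = {"P": ["Q"], "Q": ["R"], "R": ["P"], "S": []}  # P -> Q -> R -> P cycles

def has_cycle(adj):
    """Detect a cycle in a directed graph by depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in adj}

    def visit(v):
        colour[v] = GREY
        for w in adj[v]:
            if colour[w] == GREY or (colour[w] == WHITE and visit(w)):
                return True  # back edge found: a cycle exists
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in adj)

def is_connected(adj):
    """Check connectivity, ignoring edge direction."""
    undirected = {v: set() for v in adj}
    for v, ws in adj.items():
        for w in ws:
            undirected[v].add(w)
            undirected[w].add(v)
    seen, stack = set(), [next(iter(adj))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(undirected[v] - seen)
    return len(seen) == len(adj)

print(has_cycle(edges))     # True: P -> Q -> R -> P
print(is_connected(edges))  # False: S is isolated
```

The cycle check is exactly what makes schema graphs such as Figure 2.9 (with its reference cycle on 'paper') harder to mine than trees.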
Graph models are often used to represent the schema of an XML document rather than the
document itself, due to the presence of cyclic relationships in a schema. The labelled
graph representation of the schema shown in Figure 2.2 is given in Figure 2.9, where
ovals represent nodes with labels and circles represent nodes without labels. It can be
noted from the graph representation that there is a cyclic reference to the element
'paper' from the element 'references'.
2.2.3 Trees
Often XML documents occur naturally as trees or can easily be converted into trees using
node-splitting methodology [9]. Trees are a form of graph that is acyclic (contains no
cycles) and connected.

Figure 2.9: Graph representation of conf.dtd
Tree: A tree is denoted as T = (V, v0, E, f), where (1) V is the set of nodes; (2) v0 is
the root node, which has no incoming edges; (3) E is the set of edges in the tree T;
(4) f: E → V × V is a mapping function.
Given two trees T = (V, v0, E, f) and T′ = (V′, v1, E′, f′), T and T′ are isomorphic,
written T ∼= T′, if there exists a bijective node mapping φ: V → V′ with φ(v0) = v1 such
that (u, v) ∈ E ⇔ (φ(u), φ(v)) ∈ E′. Such a map φ is called an isomorphism [34]. Hence,
two labelled trees T and T′ are isomorphic to each other if there exists a one-to-one
mapping from T to T′ that preserves the root, the node labels, and both the adjacency
and the non-adjacency of the vertices.
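One way to test this definition, assuming rooted labelled trees with child order ignored, is to compare canonical encodings; this is a sketch of the idea, not the procedure of [34]:

```python
# Rooted labelled tree isomorphism via canonical forms: two trees are
# isomorphic iff their canonical encodings coincide. Trees are
# (label, [children]) tuples; sorting the children ignores their order.
def canonical(tree):
    label, children = tree
    return (label, tuple(sorted(canonical(c) for c in children)))

t1 = ("conf", [("title", []), ("year", [])])
t2 = ("conf", [("year", []), ("title", [])])
t3 = ("conf", [("title", [("year", [])])])

print(canonical(t1) == canonical(t2))  # True: same tree up to child order
print(canonical(t1) == canonical(t3))  # False: different parent-child structure
```

For ordered trees, the `sorted` call would simply be dropped, so that child order is preserved by the encoding.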
XML parsers are used to extract the structure features, the content features, or both
from XML documents. Among the XML parsers, Document Object Model (DOM) parsers can be
used to derive the tree structure (a node tree) of a given XML document, with the
elements, attributes and text defined as nodes.

Figure 2.10: Tree representation of the XML document given in Figure 2.4

XML documents can be modelled
as unranked or ordered labelled trees where labels correspond to XML tags which may
or may not carry semantic information. More precisely, when considering only the tree
structure, it should be noted that each node can have an arbitrary number of children,
the children of a given node are ordered and each node has a label in the vocabulary of
the tags [19].
For example, using the example XML document shown in Figure 2.4, a tree-based
model for this XML document can be derived as shown in Figure 2.10. In this figure,
the oval shapes indicate the tags and the terms are represented using rectangular boxes.
This tree representation contains both structure and content of XML documents with the
leaf node representing the terms. Therefore, to represent only the structure of the XML
documents, the terms are removed in which case the leaf node represents a tag.
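A sketch of this derivation, using Python's standard ElementTree (DOM-like) parser on an abbreviated, hypothetical cut of conf.xml:

```python
# Parse an XML document and derive its structure-only tree: the text nodes
# are discarded so that the leaves represent tags, as described above.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<conf><title>SIAM Data Mining Conference</title>"
    "<editor><person><name>Bing Liu</name></person></editor></conf>"
)

def structure_only(elem):
    """Return the tag tree with all character data removed."""
    return (elem.tag, [structure_only(child) for child in elem])

print(structure_only(doc))
# ('conf', [('title', []), ('editor', [('person', [('name', [])])])])
```

Keeping `elem.text` at the leaves instead would give the combined structure-and-content tree of Figure 2.10.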
2.2.4 Paths
The XML elements can also be modelled as paths which maintain the hierarchical rela-
tionship among their nodes.
Path: Given a tree T = (V, v0, E, f), a path P of length j is a sequence of nodes
(v0, v1, . . . , vj) such that (vi−1, vi) ∈ E for all 1 ≤ i ≤ j.
A path could be either a partial path or a complete path. A partial path contains the
edges in sequential order from the root node to a node in the document; a complete path
(or unique path) contains the edges in sequential order from the root node to a leaf node.
A leaf node is a node that encloses the content or text. In Figure 2.11, (a) and (b) show
an example of a complete and a partial path model respectively for the structure of the
XML document using its tags. A complete path can have more than one partial path with
varying lengths.
Figure 2.11: Paths derived from the XML document model (in Figure 2.4): (a) a complete path; (b) a partial path; (c) a complete path with text node
As shown in Figure 2.11(c), paths can be used to model both the structure and the
content by considering the leaf node as the text node. However, this kind of model results
in repeated paths for different text nodes. For instance, if there were another editor,
"Malcom Turn", for these conference proceedings, then the path with the text node for
this editor would be the same as that for the editor "Bing Liu"; the only difference
would be in the text node. Hence, to reduce the redundancy in the structure and to
capture the sibling relationships, the structure of XML documents can be modelled as
trees or graphs, as discussed above.
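The complete/partial path distinction can be sketched as follows, on a hypothetical tag tree mirroring the editor branch of the running example:

```python
# Complete paths run from the root to a leaf; every root-to-node prefix of a
# complete path is a partial path. Trees are (label, [children]) tuples.
tree = ("conf", [("editor", [("person", [("name", []), ("email", [])])])])

def complete_paths(node, prefix=()):
    label, children = node
    path = prefix + (label,)
    if not children:
        return [path]
    return [p for child in children for p in complete_paths(child, path)]

def partial_paths(node):
    """All root-to-node prefixes of the complete paths."""
    prefixes = set()
    for path in complete_paths(node):
        for i in range(1, len(path) + 1):
            prefixes.add(path[:i])
    return prefixes

print(complete_paths(tree))
# [('conf', 'editor', 'person', 'name'), ('conf', 'editor', 'person', 'email')]
print(len(partial_paths(tree)))  # 5 distinct root-to-node prefixes
```

The two complete paths share the partial path ('conf', 'editor', 'person'), which is exactly the overlap that common-path clustering methods exploit.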
This discussion of the various data models which represent the XML data for mining
leads into the following sections, which will look into the two major mining focus areas in
this research: clustering and frequent pattern mining.
2.3 XML clustering
The increased use of XML documents for data representation and exchange has attracted
a great deal of interest among researchers for efficient data management and retrieval [67].
Clustering has been perceived by the research community as a task for offering an efficient
data management solution [104]. The clustering process of XML data plays a dominant
role in many data applications such as information retrieval, data integration, document
classification, web mining and query processing [111].
Clustering is used to explore interrelationships among a collection of documents which
results in homogeneous clusters [64]. In these homogeneous clusters, the documents within
one cluster are more similar to each other than the documents belonging to a different
cluster. Clustering on XML documents can be performed by exploring the interrela-
tionships using the features inherent in these documents. This could be based on their
structure features or content features or a combination of structure and content features.
Most of the previous works on XML clustering focus on utilising either the structure
[4, 27, 50, 110, 60, 69, 114] or the content of XML documents [129, 36, 111]. Clustering
using both structure and content features has received significant attention recently in an
attempt to improve the accuracy of the clustering solution. In this section, the related
works on clustering will be presented based on these features. Furthermore, the related
works on each of these features are organised according to the data models (described
in Section 2.2) used to represent them for clustering.
2.3.1 Based on structure
Due to the semi-structured nature of XML documents, an extensive number of methods have
been proposed for clustering XML documents based on their structure. These methods
assume that documents having a similar structure belong to the same group, and that
structure is the only feature required for clustering. They can be divided according to
the data model used to represent the documents for clustering.
2.3.1.1 Vector Space Model (VSM)
Though VSM is a popular model for representing XML documents based on their content
especially for text-centric documents, it has also been used to model the structure of
these documents. In the work by Doucet and Lehtonen [36], vectors representing the tag
names for the documents have been used to represent the XML documents for clustering.
However, as discussed before, this model ignores the hierarchical relationship between the
tag names. The work by Vercoustre et al. [114] models the XML documents in VSM
by using the frequency of the paths present in it. In contrast, Leung et al. [73]
identify the common paths in the document collection and use them to represent the XML
documents as a boolean vector over these common paths. A similar clustering method,
Closed Frequent Structures-based Progressive Clustering (CFSPC) [69], utilises VSM to
represent the common subtrees as a boolean vector. Though this model is simple, it can
suffer from the typical disadvantages of boolean vectors, such as the absence of partial
matching and the difficulty of ranking documents which have similar substructures but of
varying lengths.
Once the XML dataset is represented in VSM, similarity between a pair of documents
can be measured using various distance measures such as Cosine, Euclidean, Manhat-
tan, Jaccard, Dice, Simple matching and Overlap [33]. A comprehensive survey of these
measures can be found in [20]. The most common similarity measure for calculating the
distance between vectors is a cosine measure. The cosine between two vectors, di and dj ,
representing two XML documents is given by,
\cos\theta = \frac{d_i \cdot d_j}{|d_i|\,|d_j|} \qquad (2.1)
The cosine similarity measures the angle between the two document vectors, that is,
whether the two documents point in the same direction. Another distance measure, which
also captures the magnitude of the documents, is the Euclidean distance; it measures the
geometric distance between the document vectors. The Euclidean distance between two
documents is given by:

E_{d_i,d_j} = \sqrt{\sum_{k=1}^{m} (dt^k_i - dt^k_j)^2} \qquad (2.2)
The clusters produced using this distance tend to be spherical in nature. On the other
hand, Manhattan distance computes the distance between two data points in a grid-like
path. The Manhattan distance between two data points is the sum of the differences of
their corresponding components as given by:
M_{d_i,d_j} = \sum_{k=1}^{m} |dt^k_i - dt^k_j| \qquad (2.3)
Other distance measures, such as Simple matching, Overlap, Jaccard and Dice, which are
usually applied to sets, can also be used for vectors. These measures are all based on
the intersection of the feature sets of the two vectors di and dj; however, their
denominators differ. For instance, the Overlap measure is computed as the intersection
of the two feature sets over the size of the smaller set, whereas the Jaccard measure is
calculated as the intersection of the two feature sets over the size of their union.
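These measures can be sketched directly from Equations 2.1-2.3 and the set-based definitions above, on two toy vectors:

```python
# Similarity and distance measures on two toy document vectors:
# cosine (Eq. 2.1), Euclidean (Eq. 2.2), Manhattan (Eq. 2.3), and the
# set-based Jaccard and Overlap measures on the non-zero feature sets.
import math

di = [1, 1, 0, 2, 6, 0]
dj = [0, 1, 0, 2, 0, 1]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    """Intersection over union of the non-zero feature sets."""
    fa = {k for k, v in enumerate(a) if v}
    fb = {k for k, v in enumerate(b) if v}
    return len(fa & fb) / len(fa | fb)

def overlap(a, b):
    """Intersection over the smaller non-zero feature set."""
    fa = {k for k, v in enumerate(a) if v}
    fb = {k for k, v in enumerate(b) if v}
    return len(fa & fb) / min(len(fa), len(fb))

print(manhattan(di, dj))         # 1 + 0 + 0 + 0 + 6 + 1 = 8
print(round(jaccard(di, dj), 3)) # |{f2, f4}| / |{f1, f2, f4, f5, f6}| = 0.4
```

Note how the vector measures are dominated by the large f5 frequency, while the set-based measures ignore magnitudes entirely.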
2.3.1.2 Graphs
The XML clustering methods utilising the graph structure can be grouped into two types:
node clustering and graph clustering. Flake et al. [38] and Aggarwal et al. [5] give a
good overview of XML graph clustering methods. The node clustering methods attempt
to group the underlying nodes with the use of a distance (or similarity) value based on
the edges. In this case, the edges of the graph are labelled with numerical distance
values, which are used to create clusters of nodes. On the other hand, the graph
clustering methods use the underlying structure as a whole and calculate the similarity
between two different graphs. This task is more challenging than node clustering because
of the need to match the structures of the underlying graphs, and then to use these
structures for clustering purposes. However, the graph clustering methods using the
underlying structure do not make any assumptions about the structure and hence tend to
provide better results [5].
A popular graph clustering method for XML documents is S-GRACE [117], which uses a
hierarchical clustering method based on the ROCK method [44] for similarity computation.
S-GRACE computes the distance between two graphs by measuring their common set of nodes
and edges. Firstly, S-GRACE scans the XML documents and computes their s-graphs; the
s-graph of two documents is the set of their common nodes and edges. The s-graphs of
all documents are then stored in a structure table called SG, which contains two fields
of information: a bit string representing the edges of an s-graph, and the ids of all
the documents whose s-graphs are represented by this bit string. Once the SG is
constructed, clustering can be performed on the bit strings. By exploiting the links
(common neighbours) between s-graphs, the best pair of clusters is selected and then
merged in a hierarchical manner.
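The bit-string idea behind this approach can be sketched as follows; the edge sets and the encoding below are illustrative, and the actual SG table construction in [117] differs:

```python
# Sketch of encoding a document's structure as a bit string over a global
# edge vocabulary, so that common edges can be compared on the bit strings.
doc_edges = {
    "doc1": {("conf", "title"), ("conf", "editor"), ("editor", "person")},
    "doc2": {("conf", "title"), ("conf", "paper"), ("paper", "author")},
}

# Global, ordered edge vocabulary over the whole collection.
vocabulary = sorted(set().union(*doc_edges.values()))

def bit_string(edges):
    return "".join("1" if e in edges else "0" for e in vocabulary)

signatures = {doc: bit_string(edges) for doc, edges in doc_edges.items()}

def common_edges(a, b):
    """Count edges present in both documents' bit strings."""
    return sum(1 for x, y in zip(signatures[a], signatures[b]) if x == y == "1")

print(signatures["doc1"], signatures["doc2"])  # 10110 01101
print(common_edges("doc1", "doc2"))            # only ("conf", "title") is shared
```

Because every document's structure collapses to a fixed-length bit string, pairwise comparisons become cheap bitwise operations regardless of document size.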
Another method, which borrows techniques from artificial neural networks and uses
graphs for XML clustering, is the Graph Self-Organizing Map (GraphSOM) [45], which
allows the encoding of the XML structure in the form of graphs. As GraphSOMs require
training, they are trained to cluster XML-formatted documents based on topological
information in the tags and on the type of XML tag embedded in the document. Though
this type of clustering can handle complex structures, it is reported to have poor
accuracy [30, 137] on Wikipedia datasets.
2.3.1.3 Trees
Due to the complexity of graph clustering caused by the presence of cyclic relationships
between nodes, clustering XML documents using tree models has gained popularity over
the graph model. Clustering using tree models is one of the well-established fields of
XML clustering. Several clustering methods modelling the XML data as trees have been
developed to determine XML data similarity, and well-established tree edit distance
methods have been extended to compute the similarity between XML documents.
The tree edit distance is based on dynamic programming techniques for a string-to-
string correction problem [115]. The tree edit distance essentially involves three edit
operations, namely changing, deleting, and inserting a node, to transform one tree into
another tree. The tree edit distance between two trees is the minimum cost between the
costs of all possible tree edit sequences based on a cost model. The basic intuition behind
this technique is that the XML documents with the minimum distance are likely to be
similar and hence they can be clustered together.
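This intuition can be made concrete with a small, unoptimised tree edit distance: the standard recursion on ordered forests with unit-cost insert, delete and relabel operations (a memoised sketch, not the optimised algorithms used in the cited works):

```python
# Ordered tree edit distance via the classic rightmost-root recursion on
# forests. Trees are (label, (children...)) tuples; all operations cost 1.
from functools import lru_cache

def size(forest):
    return sum(1 + size(t[1]) for t in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    if not f1:
        return size(f2)          # insert every node of f2
    if not f2:
        return size(f1)          # delete every node of f1
    t1, t2 = f1[-1], f2[-1]      # rightmost trees of each forest
    delete = forest_dist(f1[:-1] + t1[1], f2) + 1
    insert = forest_dist(f1, f2[:-1] + t2[1]) + 1
    relabel = (forest_dist(f1[:-1], f2[:-1])
               + forest_dist(t1[1], t2[1])
               + (0 if t1[0] == t2[0] else 1))
    return min(delete, insert, relabel)

def tree_edit_distance(a, b):
    return forest_dist((a,), (b,))

t1 = ("conf", (("title", ()), ("year", ())))
t2 = ("conf", (("title", ()), ("editor", ())))
print(tree_edit_distance(t1, t1))  # 0
print(tree_edit_distance(t1, t2))  # 1: relabel year -> editor
```

The quadratic (and worse) cost of computing such distances for every document pair is precisely the scalability problem discussed next.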
Some of the clustering techniques that use the tree edit distance are those of Nierman
and Jagadish [87] and Dalamagas et al. [26]. In [87, 26], the tree edit distance is used
to compute the structural similarity between each pair of documents; XML documents with
a minimum distance are considered to be similar. A study showed that XML document
clustering using tree summaries provides high accuracy [26]: the structural summaries of
the XML documents were extracted and used to compute the tree edit distance. However,
this type of similarity computation requires a quadratic number of comparisons between
the elements in the documents, resulting in prohibitive computational complexity. Also,
this similarity computation may lead to incorrect results, as the calculated tree edit
distance can be large for very similar trees of different sizes conforming to the same
schema [124]. To resolve this issue, an efficient element similarity measure was
introduced in [86] based on the level-wise similarity of the nodes.
It utilises a novel global criterion function, LevelSim, which measures the similarity
at the clustering level by utilising the hierarchical relationships between the elements
of documents. Elements at different levels are allocated different weights, and by
counting the common elements that share common ancestors, the hierarchical relationships
of the elements are also considered in this measure. An improvement to this method,
XCLS, is XEdge [10], which uses the same level-wise similarity but applies it to edges
instead of nodes to capture more of the hierarchical relationships.
There are other clustering methods which avoid modifying the tree structure, as the tree
edit distance methods do, by breaking the paths of tree-structured data into a collection
of macro-path sequences, where each macro-path contains a tag name, its attributes, data
types and content. A similarity matrix of the XML documents is then generated based on
the macro-path similarity technique, and clustering of the XML documents is performed on
this similarity matrix with the support of approximate tree inclusion and isomorphic tree
similarity [97]. Many other approaches have also utilised the idea of tree similarity,
for XML document change detection [119] and for extracting schema information from an
XML document, such as those proposed in Garofalakis et al. [43] and Moh et al. [79].
Besides mining the structural similarity of whole trees, other techniques have been
developed to mine frequent subtree patterns from a collection of trees [107]. The method
proposed by Termier et al. [107] consists of two steps: first, it clusters the trees
based on the occurrence of the same pairs of labels in the ancestor relation, using the
apriori heuristic; after the trees are clustered, a maximal common tree is computed to
measure the commonality of each cluster to all the trees.
2.3.1.4 Paths
There have been several XML clustering methods determining structural similarity based
on the paths shared between documents [73, 84]. The paths model represents the structure
of the document as a collection of paths. A clustering method measures the similarity
between XML documents by finding the common paths [83].
One of the common techniques for identifying the common paths is to apply frequent
pattern mining on the collection of paths to extract the frequent paths of a constrained
length and to use these frequent paths as representatives for the cluster. This technique
has been utilised by Hwang and Ryu [50] and by XProj [4]. XProj clusters the XML
documents by extracting the frequent substructures in each of the clusters; it converts
a tree structure into a sequence (or path) of node labels and extracts the frequent
subsequences or subpaths.
Another simple method of finding XML data similarity according to common paths
is by treating the paths as a feature for the VSM model [36]. Other methods such as
XSDCluster [85], PCXSS [84] and XClust [72] adopt the concept of schema matching for
finding the similarity between paths. The path similarity is obtained by considering the
semantic and structural similarity of a pair of elements appearing in two paths. The
path measures in these methods are computationally expensive, since they consider many
attributes of the elements, such as their data types and constraints.
As paths do not include the sibling relations between the nodes in a tree, this type of
model may result in information loss when used for clustering XML documents. Also,
path-based frequent pattern mining methods may fail to provide concise substructures for
XML documents with a high branching factor.
Table 2.3 summarises the different XML document clustering methods based on the
various models. This comparison will help to understand the different types of models and
the similarity measures that have been used in the literature on clustering XML documents
using their structure.
2.3.2 Based on content
There have also been several clustering methods developed that use only the content
features of XML documents. These are especially suitable for text-centric XML documents
that have less structure information and more content. Most of these clustering methods
focus on representing the content in VSM with very little focus on other data models due
to the simplicity of the VSM model.
Table 2.3: Comparison of different types of clustering methods using the structure of XML documents

Models | Methods | Similarity approach
VSM | Vercoustre et al. [114] | Euclidean distance on paths
VSM | Leung et al. [73] | Euclidean distance on paths
VSM | CFSPC [69] | Cosine similarity on frequent patterns
Graph | S-GRACE [117] | Distance measure based on s-graphs
Graph | GraphSOM [45] | Euclidean distance
Tree | Nierman and Jagadish [87] | Tree edit distance
Tree | Dalamagas [26] | Tree edit distance
Path | XProj [4] | Frequent paths
Path | Doucet and Ahonen [36] | Euclidean distance
Path | PCXSS [84] | Path similarity
Path | XEdge [10] | Edge similarity
Path | XMine [82] | Path similarity
Path | XML C [50] | Path similarity
Path | XClust [72] | Path similarity
2.3.2.1 Vector Space Model (VSM)
VSM is commonly used by the XML clustering methods that focus on content features [95].
The techniques using this model treat the content of the XML documents as a bag of
words, similar to text documents, and then cluster them. Doucet et al. [36] represent
the content of the documents in VSM and apply the k-means algorithm to group them. For
large datasets, clustering XML data using all of the content can be expensive due to
the presence of a large number of terms.
Recently, a content-based clustering method called the Cover-Coefficient Based
Clustering Methodology (C3M) [7] was proposed for clustering XML documents based on
their content. It is a single-pass partitioning-type clustering method which measures
the probability of selecting a document given a term that has been selected from
another document. It proposes two approaches, term-centric and document-centric index
pruning, which generate compact representations of the term or document indices
respectively and represent them in VSM. Finally, the documents with these reduced
representations in the VSM are clustered.
Using only the content of XML documents is suitable for documents which have a similar
structure, in which case the structure can be ignored. However, due to the prevalence
of heterogeneous XML documents in real-life datasets, using only the content of the XML
documents for clustering is not sufficient to provide an effective clustering solution.
In order to improve the effectiveness, researchers have resorted to utilising the
semantic dictionary WordNet to measure the synonym similarity of the keywords shared
between two documents [104]. However, this approach is not suitable for documents which
contain the same words but are not related because they belong to different themes.
The majority of these methods focus on clustering the XML documents by identifying
structure or content similarity between them. However, as pointed out earlier in the
chapter, for some datasets it becomes essential to include both the structure and the
content similarity in order to identify clusters.
2.3.3 Based on structure and content
Methods with a single-feature focus, using either structure or content, tend to falsely
group documents that are similar in only one of these features. To correctly identify
similarity among documents, the clustering process should use both their structure and
their content information. However, most of the clustering methods using both the
structure and content features of XML documents adopt naïve methods of combining them
linearly, due to the complexity inherent in the process of combining them. The
following subsections detail the different methods based on the models used.
2.3.3.1 Vector Space Model (VSM)
The VSM, due to its simplicity, can also be used to model both the structure and the
content of the XML documents. A representation which links the structure and the content
features together is the Structured Link Vector Model (SLVM) [128], which represents
both these features of XML documents using vector linking. In the SLVM, an XML
document Di is defined as a matrix Di ∈ R^(n×m), Di = ⟨Di(1), Di(2), . . . , Di(m)⟩,
where m is the number of elements and Di(l) ∈ R^n is the TF-IDF feature vector
representing the element e_l, given as Di(l)(j) = TF(t_j, Di, e_l) ∗ IDF(t_j) for all
j = 1 to n, where TF(t_j, Di, e_l) is the frequency of the term t_j in the element e_l of
Di. An improvement of this model using the concepts of LSI is the SLVM-LSI [56].
The Common Rare Pattern (CRP) and 4-length Rare Pattern (4RP) clustering methods
proposed by Yao and Zerida [131] use the VSM to represent paths. Each path contains
the edges in sequential order from a node to a term in the data content. These methods
create a large number of features; for instance, if a document has 5 distinct terms and
each of these terms occurs in two different paths, then there will be 10 different features
altogether.
The clustering method by Tran et al. [109] also models the structure and the content of
the XML documents in VSM. In this model, structure and content similarity are calculated
independently for each pair of documents and then these two similarity components are
combined using a weighted linear combination approach to determine a single similarity
score between documents. This approach is based on latent semantic analysis to develop
a kernel to incorporate content as well as the structure of XML documents. However,
using this method for clustering does have three key limitations. First, the approach
combines the structure and the content linearly, which may result in poor accuracy as the
relationship between the structure and the content is not used. It also causes difficulties in
tuning the parameters for combining the two features; such methods therefore require
extensive experiments to identify the best parameter settings for a good clustering
solution. Second, the computational requirements for building the kernel are high, since
the method has to compute the pair-wise similarity between the objects to be clustered.
As a result, it can only be applied to relatively small datasets (a few thousand documents)
and cannot be used to effectively cluster a real-life corpus of XML documents. Third, by
relying only on the pair-wise similarities between documents, this method tends to produce
suboptimal clustering solutions, especially when the similarities are low relative to the
cluster sizes. The key reason is that such clustering methods can determine the overall
similarity of a collection of objects (i.e., a cluster) only through measures derived from
the pair-wise similarities (e.g., the average, median, or minimum pair-wise similarity).
Unless the overall similarity among the members of a cluster is high, these measures are
quite unreliable because they cannot capture what is common between the different objects
in the collection [139].
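The weighted linear combination criticised here can be sketched as follows; the similarity values and the weight alpha are illustrative:

```python
def combined_similarity(struct_sim, content_sim, alpha=0.5):
    """Linearly combine structure and content similarity. The weight alpha
    must be tuned experimentally, which is one of the criticisms raised in
    the text: the structure-content mapping itself is never used."""
    return alpha * struct_sim + (1 - alpha) * content_sim

# Two documents with low structure but high content similarity:
score = combined_similarity(0.2, 0.9, alpha=0.5)
print(round(score, 2))
```

The single scalar hides whether the similar content actually occurred under similar structure, which is exactly the information a linear combination discards.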
In the approach proposed in this thesis, the common substructures of the XML documents
are used to extract their content, which is then represented in the VSM. This method helps
not only to capture the relationship between the structure and the content but also to
remove the uncommon content, which acts as outlier material affecting the accuracy of the
clustering solution.
2.3.3.2 Trees
The Semantic XML Clustering (SemXClust) method [103] is the seminal work to cluster
semantically related XML documents by utilising both the structural information and the
content. This method represents XML documents in a set of tree tuples with the structure
and the content features enriched by the support of an ontology knowledge base to create a
set of semantically cohesive and smaller-sized documents. These tree tuples are modelled
as transactions and transactional clustering methods are then applied.
In contrast, based on the node relationships in the tree structure of the XML
documents, different types of subtrees have been identified. These subtrees are then used in
[67, 68] to extract the content of the XML documents. These methods make no
assumption about the structure among the elements of the XML documents, and they also
generate structural summaries of them. Since these subtrees are not only frequent but
also concise, the content extracted using them is precise and therefore improves the
accuracy of the clustering solution.
2.3.3.3 Paths
To capture the hierarchical structure and the content of XML documents, some researchers
[114, 131] have attempted to include the text along with the path representation
in order to cluster XML documents using both structure and content features. Results
from their studies show that, for both these methods, clustering performance (F1-measure)
on the Wikipedia dataset degrades when the structure is included with the content, in
comparison to representing the content alone. However, this is a data-specific characteristic,
and the authors [131] extract all the paths and their corresponding content, which results
in an explosion in dimensionality. Nevertheless, it remains essential to utilise the
relationship between the structure and the content in clustering.
Approaches using the linear combination of structure and content in the VSM, or using
paths, often fail to scale even for small collections of a few hundred documents, and in some
situations this has resulted in poor accuracy [114]. A linear combination of the structure
and content features of XML documents cannot perform effectively, since the mapping
between the structure and its corresponding content is lost. The content and structure
features inherent in an XML document should be modelled in a way that preserves the
mapping between a path or tree and its content, so that it can be used in further analysis.
One such model is a multi-dimensional model combining these two features.
There has been limited research on clustering using multi-dimensional aspects of the
documents due to the complexity and the explosion of data produced as a result of
combining these multi-dimensional features. The complexity becomes worse when dealing
with large-sized datasets. Nevertheless, a BitCube representation [133] was used to cluster
and query XML documents with paths, words and documents as the three dimensions of
the BitCube. Each entry in the BitCube records the presence or absence of a given
word in a path of a document. The XML document collection is first partitioned top-down
into small bitcubes based on the paths, and then the smaller bitcubes are clustered using
the bit-wise distance and their popularity measures. However, this method uses all the
paths in the XML documents and hence might incur a heavy computational cost for a
very large collection of documents.
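A minimal sketch of the BitCube idea, with hypothetical documents, paths and words; the partitioning into smaller bitcubes and the popularity measures used in [133] are omitted:

```python
# A BitCube indexed as [document][path][word]; entries record presence/absence.
# The documents, paths and words below are illustrative assumptions.
documents = ["d1", "d2"]
paths = ["/article/title", "/article/body"]
words = ["xml", "mining"]

bitcube = [[[0] * len(words) for _ in paths] for _ in documents]

def set_bit(doc, path, word):
    bitcube[documents.index(doc)][paths.index(path)][words.index(word)] = 1

set_bit("d1", "/article/title", "xml")
set_bit("d2", "/article/title", "xml")
set_bit("d2", "/article/body", "mining")

def bitwise_distance(i, j):
    """Number of (path, word) cells on which two documents differ."""
    return sum(
        bitcube[i][p][w] != bitcube[j][p][w]
        for p in range(len(paths)) for w in range(len(words))
    )

print(bitwise_distance(0, 1))
```

Because every path in every document contributes a slice, the cube grows with the total number of distinct paths, which is the scalability concern raised above.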
2.3.3.4 Tensor Space Model (TSM)
The Tensor Space Model (TSM) helps to alleviate the disadvantages inherent in the VSM
by directly preserving the relationship between the structure and the content of the XML
documents. In the TSM, the content is stored together with its corresponding structure,
and hence the model can be used to analyse the relationship between the two.
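A minimal sketch of a third-order TSM as a sparse (document, path, term) frequency tensor; the documents, paths and terms are illustrative assumptions:

```python
from collections import defaultdict

# Sparse 3rd-order tensor: (document, path, term) -> frequency.
tsm = defaultdict(int)
occurrences = [
    ("d1", "/article/title", "xml"),
    ("d1", "/article/title", "xml"),
    ("d1", "/article/body", "mining"),
    ("d2", "/article/title", "clustering"),
]
for doc, path, term in occurrences:
    tsm[(doc, path, term)] += 1

# Unlike the VSM, the path under which a term occurred is preserved,
# so the structure-content relationship survives the representation.
print(tsm[("d1", "/article/title", "xml")])
```

Matricization (unfolding) of such a tensor along the document mode would recover a document-by-(path, term) matrix without losing the pairing of a term with its path.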
Traditionally, tensors have been widely used in physics for stress and strain analysis.
TSMs have been successfully used in representing and analysing multi-dimensional data
in signal processing [80], web mining [78] and many other fields [112]. Tensor clustering is
a multi-way data analysis task which is currently gaining importance in the data mining
community. The simplest tensor clustering scenario, co-clustering or bi-clustering, where
two dimensions are simultaneously clustered, is well established [8, 104]. The authors of [14]
proposed a method for multi-way clustering on tensors by extending the co-clustering
technique from matrices to tensors using relational graphs. More recently, the
approximation-based Combination Tensor Clustering method [55] was proposed, which
clusters along each of the dimensions and then represents the cluster centres in the tensor.
These co-clustering techniques capture only the 2-way relationships among the features
and ignore the dependence among multiple dimensions in clustering, which may result in
a loss of information while grouping the objects.
Several decomposition algorithms have been developed [37] to analyse the TSM and
to derive correlations and relationships from different features represented in TSM. There
are two broad families of tensor decompositions, namely CANDECOMP/PARAFAC (CP)
[61] and TUCKER [113]. CP is a higher-order analogue of Singular Value Decomposition
(SVD) or Principal Component Analysis (PCA). The CP solutions are not unique, due to
the heavy dependence of CP on the initial guess, whereas HOSVD and Tucker tend to provide
unique solutions. Table 2.4 presents a summary of other tensor decomposition algorithms
based on CP and Tucker. Acar and Yener [37] and Kolda and Bader [63] present a detailed
survey of all these decomposition algorithms.
Table 2.4: Popular tensor decomposition algorithms based on CP and Tucker
Based on    Other tensor decompositions
Tucker      Memory Efficient Tucker (MET), MACH, High-Order Singular Value Decomposition (HOSVD), High-Order Orthogonal Iteration (HOOI)
CP          Individual Differences in Scaling (INDSCAL), Implicit Slice Canonical Decomposition (IMSCAND)
Recently, the Incremental Tensor Analysis (ITA) methods [102], such as STA (Streaming
Tensor Analysis), DTA (Dynamic Tensor Analysis) and WTA (Window-based Tensor
Analysis), were also proposed to deal with large datasets. These methods are efficient in
decomposing sparse tensors (density ≤ 0.001%). However, when large dimensions exist
or the tensor is dense, these techniques fail to decompose it. Real-life XML documents,
when represented in the TSM, are dense, with about 127M (where “M” denotes million)
non-zero entries and over 1M terms. Hence, these decomposition algorithms cannot
be applied. A memory-efficient implementation of Tucker, MET [58], was proposed to
avoid the intermediate blow-up in tensor factorisation. Recently, a random decomposition
technique, MACH [112], was proposed that is suitable for large dense datasets. The number
of entries in the tensor is randomly reduced using Achlioptas-McSherry's technique [3] to
decrease the density of the dataset. However, when MACH is used to cluster dense datasets,
reducing the density in this way might cause documents of smaller length to be ignored
and grouped into a single cluster, in spite of differences in their structure and content.
It can clearly be seen that there are several contributions to research on the TSM, but
only limited research exists on using the TSM for clustering XML documents. To the best
of the author's knowledge, only the authors of [96] have applied tensor clustering to a
semi-structured document dataset, using IMSCAND (Implicit Slice Canonical Decomposition).
They utilised six pre-defined similarity values in a tensor model to group bibliographic
data, such as the similarity among words in abstracts, between names of authors, keywords,
words in the title, co-citations and co-reference information. This type of assumption
prevents IMSCAND from being applied to other types of XML document collections. In
contrast, this research conducts tensor clustering on XML documents without any prior
assumption, which is appropriate for clustering a large number of documents of a
diverse nature.
2.3.4 Research gaps in XML clustering
A review of the literature has revealed that the following research gaps in XML Clustering
exist:
• Most of the existing methods [110, 4, 114, 73, 65] use either the structure or the
content of the XML documents for clustering, but not both.
• Techniques which attempt to cluster the XML documents using both of these features
fail to scale for very large datasets [114] or result in poor accuracy due to the linear
combination [109, 36, 132].
• Most of the clustering methods [36, 109, 104] have relied on the two-dimensional
VSM for grouping the XML documents, with limited research on using the TSM [96] for
clustering.
In order to address these research gaps in clustering XML documents using
both their structure and their content, a feature selection method is required to reduce
the dimensionality of the combination. Utilising all the structure and content features of
the XML documents is infeasible for a very large number of documents; hence, it is
essential to identify not all the features but only the common patterns among them. One
of the popular techniques for identifying common patterns is frequent pattern mining,
which has already been used as a kernel function in classification [136], clustering [4],
association rule mining and sequential rule mining. The following section reviews the
research on frequent pattern mining.
2.4 Frequent pattern mining
Frequent pattern mining was first introduced along with association rule mining [6] to
analyse customer-buying behaviour from retail transaction databases. Frequent pattern
mining in these databases involves identifying patterns that occur quite often, and hence
these patterns are called frequent. In general, frequent patterns are sets of items, or
itemsets, in transactional databases. Frequent patterns are called subsequences, subtrees
and subgraphs when extracted from sequential databases, trees and graphs respectively.
This section briefly introduces frequent pattern mining on XML documents, presents
the various data models that have been used for frequent pattern mining, and reviews the
methods that have been used to generate frequent patterns useful for clustering.
2.4.1 An overview
Frequent pattern mining on XML documents involves identifying the common patterns
based on a user-defined support threshold, often referred to as the minimum support and
denoted by min_supp. Frequent pattern mining can be defined as follows:
Given an XML dataset (a collection of XML documents) D = {D1, D2, . . . , Dn}, find
the frequent patterns P = {p1, p2, . . . , pr} such that for every pi ∈ P, freq(pi) ≥ min_supp,
where freq(pi) is the percentage of documents in D that contain pi.
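The definition above can be sketched directly; itemsets stand in for generic patterns, and the toy dataset is illustrative:

```python
def frequent_patterns(dataset, candidates, min_supp):
    """Return the candidates whose support (fraction of documents
    containing the pattern) meets min_supp, per the definition above."""
    n = len(dataset)
    result = {}
    for p in candidates:
        freq = sum(p <= doc for doc in dataset) / n  # p <= doc: subset test
        if freq >= min_supp:
            result[p] = freq
    return result

# Toy dataset of three "documents", each a set of items.
D = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]
cands = [frozenset({"a"}), frozenset({"a", "c"}), frozenset({"d"})]
print(frequent_patterns(D, cands, min_supp=0.6))
```

For subtree or subgraph patterns, only the containment test `p <= doc` changes; the support computation is the same.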
Due to the simplicity of this mining task, there have been several works conducted
on frequent pattern mining. Figure 2.12 presents a taxonomy of frequent pattern mining
based on the features of the XML documents.
Figure 2.12: Hierarchy of XML frequent pattern mining (structure mining, subdivided into intra-structure and inter-structure mining; content mining, subdivided into content analysis and structural clarification; and combined structure and content mining)
XML Frequent Structure Mining
Mining of frequent structures can be broadly divided into two categories, namely
intra-structure mining and inter-structure mining (refer to Figure 2.12). Intra-structure
mining deals with finding structural information within an XML document; knowledge
is gained concerning the internal structure of XML documents, that is, their schema
definitions. In contrast, inter-structure mining is concerned with mining a set
of documents and identifying frequent substructures among them. As XML documents are
often viewed as trees or graphs due to their hierarchical structure, the result of frequent
pattern mining on their structure will be a set of subtrees or subgraphs; for example, an
<employee> tag often contains a <salary> tag. This information could be used in the
information retrieval domain to quickly locate the salary details when queried.
XML Frequent Content Mining
The content of XML documents basically refers to the text between the start and
the end tag. For instance, in <Title> Data Mining </Title>, content refers to the text
“Data Mining” between the start tag <Title> and the end tag </Title>. Based on the
purpose, XML frequent content mining is classified into content analysis and structural
clarification. Content analysis is similar to the relational database mining for identifying
a frequently occurring instance of a relation. For instance, it can be used to identify
frequently occurring items in a transaction. Structural clarification, on the other hand,
helps to distinguish two structurally similar documents based on their content, as in the
case of homographs.
XML Frequent Structure and Content Mining
Apart from mining structure and content separately, a new category of mining the
structure and content together was introduced in [66]. To apply frequent pattern mining
on both these features, either the partial or the full structure of the XML documents,
along with its content, can be used. Hence, in contrast to content mining, structure and
content mining methods do not enforce the restriction that the structure of the document
should be fixed or constant. Similar to content mining, combined structure and content
frequent pattern mining supports both structural clarification and content analysis.
However, a number of challenges exist when mining the combined structure and content
information. The major challenge is scalability: the data to be mined becomes huge
because both the content and the structure information must be stored.
Based on the previous discussion, it is clear that applying frequent pattern mining
to both features together is not useful, as this will also result in a huge explosion of data.
As the structure is used to store the content, dimensionality reduction can be achieved by
applying frequent pattern mining to the structure of the XML documents. Hence, the
following subsections review the frequent pattern mining literature based on the structure
of the XML documents, using the various models discussed in Section 2.2.
2.4.2 Frequent pattern mining methods
The way the model of an XML document is perceived for frequent pattern mining forms
the basis for how it is mined for frequent patterns. For instance, if the XML document is
viewed as a vector of tags, then the application of frequent pattern mining results in a
list of frequently occurring tags. Another view of XML documents is as graphs or trees,
where each XML document, represented in a string format, corresponds to a transaction.
The application of frequent pattern mining on trees or graphs results in frequent subtrees
or subgraphs. This subsection looks at frequent pattern mining methods based on the
structure of the XML documents.
2.4.2.1 Vector Space Model (VSM)
Inspired by frequent itemset mining, XML data can be represented as a vector of tags in
a sparse VSM, as this mimics the characteristics of a transaction database which records
items and their occurrences. However, applying frequent pattern mining to the sparse
VSM representation of XML documents uses the standard techniques for frequent itemset
mining; it is not specific to XML data and therefore loses the relationships between
the tags.
In the sparse VSM, each XML document in the dataset is represented based on either
its tags or its content. The frequent pattern mining method begins with a complete scan
of the tags in the VSM to identify the 1-length frequent tags by testing the
support of each tag. The 1-length frequent tags are combined to form 2-length
candidate tags. These candidate tags are tested to verify whether they are frequent or not.
The process is repeated with the 2-length frequent tags until no more frequent
tags can be found. This approach to frequent pattern generation is referred to
as generate-and-test, as the (k+1)-length candidates are generated by joining the frequent
k-length patterns and then tested to determine whether they are actually frequent. In
order to reduce the number of candidates generated, the apriori heuristic is applied.
The basic intuition of this heuristic is that any subset of a frequent pattern must also
be frequent. Hence, while generating the candidates, only the frequent subpatterns are
used and the infrequent subpatterns are ignored.
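The generate-and-test process with apriori pruning described above can be sketched on sets of tags; the documents and tag names are illustrative:

```python
from itertools import combinations

def apriori_tags(docs, min_supp):
    """Generate-and-test mining of frequent tag sets with apriori pruning;
    each document is modelled as a set of tags (an illustrative model)."""
    n = len(docs)
    support = lambda s: sum(s <= d for d in docs) / n
    frequent = [{frozenset([t]) for d in docs for t in d
                 if support(frozenset([t])) >= min_supp}]
    k = 1
    while frequent[-1]:
        # Join k-length frequent sets to build (k+1)-length candidates...
        candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                      if len(a | b) == k + 1}
        # ...and prune any candidate with an infrequent k-length subset (apriori).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1]
                             for s in combinations(c, k))}
        frequent.append({c for c in candidates if support(c) >= min_supp})
        k += 1
    return [s for level in frequent for s in level]

docs = [{"title", "author"}, {"title", "author", "year"}, {"title", "year"}]
result = apriori_tags(docs, min_supp=2 / 3)
print(sorted(len(s) for s in result))
```

Each `while` iteration corresponds to one full support-counting pass, which is exactly the repeated dataset scanning that pattern-growth approaches later avoid.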
As the apriori heuristic generates candidates, it can become expensive when there are
a very large number of documents containing a large number of tags. In order
to overcome the problem of candidate generation, an approach called pattern-growth
was proposed, which adopts the “divide-and-conquer” heuristic to recursively partition the
dataset based on the frequent patterns generated and then mines each of the partitions
for frequent patterns. This technique is especially suitable for datasets which have large
numbers of documents and are dense in nature.
Gu, Hwang, and Ryu [50] used the VSM to identify frequent patterns. This model has
also been used by Braga, Campi, Ceri, Klemettinen, and Lanzi [17] and by Wan and
Dobbie [116] to generate frequent patterns which were then used to create association
rules from XML documents.
2.4.2.2 Graphs
Frequent graph mining is popular for XML schemas. This type of graph mining can
also be applied to XML datasets in which the documents are linked to each other.
Frequent graph mining can be defined as follows:
Given a graph dataset D, find the subgraphs g such that freq(g) ≥ min_supp, where
freq(g) is the percentage of graphs in D that contain g.
Similar to frequent itemset mining, frequent graph mining methods count the
support of the 1-length subgraphs by scanning the dataset and identifying the 1-length
frequent subgraphs. The next step is candidate generation, which uses either
joins or extensions of the 1-length frequent subgraphs to generate candidates. If a candidate
is generated by joining two frequent subgraphs, the technique is referred to as join;
if a candidate is generated by extending a subgraph with a 1-length frequent node, it is
referred to as extension. As there can be many possible ways to join or extend a
node, the rightmost node extension technique was introduced [12] to limit the number of
candidates generated and to avoid redundancy. In this popular technique, only
the rightmost node is extended, hence avoiding the repeated generation of the same
candidates. After generating the candidates, their support is determined by scanning
through the dataset. This process of generate-and-test is repeated until no more candidates
can be generated.
Based on the graph traversal, graph mining methods can be classified into breadth-first
and depth-first methods. Some of the apriori-based graph miners which belong to
the breadth-first category are AGM (Apriori-based Graph Mining) [51], AcGM
(Apriori-based connected Graph Mining) [53] and FSG [32]. These methods use
the apriori heuristic to reduce the search space. On the other hand, gSpan [128] and FFSM
[48] use depth-first traversal to find frequent subgraphs.
One of the well-known problems with graph mining methods is subgraph isomorphism,
which is NP-complete [42]. Subgraph isomorphism is the problem of deciding whether a
given graph G contains a subgraph that is isomorphic to another graph H. Given a graph
G = (Vg, Eg) and a graph H = (Vh, Eh), G is said to contain H iff there exists an
injective mapping f : Vh → Vg such that for every edge (x, y) ∈ Eh, (f(x), f(y)) ∈ Eg.
Since, in frequent graph mining methods, the candidate subgraphs are tested against the
graph dataset, the support counting step is an instance of the subgraph isomorphism
problem. Due to the huge number of candidate checks required, it takes an exponential
amount of time to identify frequent subgraphs.
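The containment test can be sketched as a brute-force search for an injective edge-preserving mapping, which also illustrates the exponential cost of candidate checking; the graphs are illustrative:

```python
from itertools import permutations

def contains_subgraph(g_nodes, g_edges, h_nodes, h_edges):
    """Brute-force subgraph isomorphism: search for an injective mapping of
    H's nodes into G's nodes that preserves every edge of H. The search is
    exponential in |H|, which is why candidate testing is so expensive."""
    for image in permutations(g_nodes, len(h_nodes)):
        f = dict(zip(h_nodes, image))
        if all((f[x], f[y]) in g_edges for (x, y) in h_edges):
            return True
    return False

# G: a triangle on a, b, c; H: a single edge x -> y.
G = (["a", "b", "c"], {("a", "b"), ("b", "c"), ("a", "c")})
H = (["x", "y"], {("x", "y")})
print(contains_subgraph(G[0], G[1], H[0], H[1]))
```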
However, some of the recent techniques, such as Biased Apriori-based Graph Mining
(B-AGM) [52], provide results in an acceptable time. In spite of the advancements
in graph mining, these methods have often been criticised for producing generic rather
than accurate results, and they incur an expensive canonisation step to transform the data
into a uniform representation suitable for mining [135]. As a result, researchers have
shifted their attention to tree mining methods, using trees instead of graphs and adapting
graph mining techniques for frequent pattern mining of XML data, which is discussed in
the following subsection.
2.4.2.3 Trees
Similar to frequent graph mining, the objective of applying frequent pattern mining on
XML documents represented in the tree data model is to identify frequent subtrees present
in the data. Frequent tree mining methods can be classified based on several factors such
as tree, subtree representation, traversal strategy, canonical representation, tree mining
approach adopted and the type of candidate generation techniques used for apriori-based
methods. Table 2.5 provides an outline of them.
The initial work on frequent tree mining was undertaken by Zaki [134], who proposed a
method to discover all frequent subtrees in a forest (i.e., a large collection of ordered trees).
There are two steps involved in this method. First, it enumerates candidate k-length
subtrees and counts their frequency by performing a depth-first search. Second,
(k+1)-length subtrees are generated from those k-length subtrees whose frequency is
greater than a threshold. With a small alteration, this method can also be applied to
discover frequent subtrees in unordered labelled trees.
In frequent tree mining of XML data, it has been noted from the sample XML dataset
(in Figure 1.2) that often the entire tree will not be frequent as in fragments (d) and
(e) due to the nodes “ConfLoc” and “ConfYear” respectively. Rather there is a good
possibility that parts of these trees could be frequent. The parts of such trees are referred
to as subtrees. A subtree from the sample XML dataset is shown in Figure 2.13.
There are different types of subtrees based on:
• Node relationship – induced and embedded
• Conciseness – closed and maximal
Based on the node relationship, the subtrees could have either parent-child or
ancestor-descendant relationships among their nodes, resulting in induced and embedded
subtrees respectively. On the other hand, based on conciseness, frequent subtrees can be
closed or maximal, depending on whether a frequent supertree with the same support exists.

Table 2.5: Classifications of frequent tree mining methods

Based on                  Types                             Methods
Tree representation       Free tree                         FreeTreeMiner [23]
                          Rooted unordered tree             uFreqt [88]
                          Rooted ordered tree               Unot [13]
Subtree representation    Induced subtree                   FREQT [12], uFreqt [88], HybridTreeMiner [21], Unot [13], PrefixTreeISpan [141]
                          Embedded subtree                  TreeMinerV [136], TMp [35], X3-Miner [106]
Traversal strategy        Depth-first                       FREQT [12], HBMFP [77], TreeMinerV [136], uFreqt [88]
                          Breadth-first                     FreeTreeMiner [23], X3-Miner [106]
                          Depth-first & breadth-first       TreeMiner [134], Gaston [88], HybridTreeMiner [21]
Canonical representation  Pre-order string encoding         TreeMiner [134]
                          Level-wise encoding               HybridTreeMiner [21]
Candidate generation      Extension                         FREQT [12], Unot [13]
                          Join                              TreeMiner [134], PathJoin [123]
                          Combination of extension and join HybridTreeMiner [21]

Figure 2.13: Example of a subtree (a Conference node with children ConfTitle, ConfAuthor and ConfName) from the sample XML dataset in Figure 1.2
Node relationship
The two types of subtrees based on the node relationship are induced subtrees and
embedded subtrees.
Induced subtree
For a tree T with node set V and edge set E, a tree T′ with node set V′ and edge set E′
is an induced subtree of T if and only if (1) V′ ⊆ V; (2) E′ ⊆ E; and (3) the labelling of
the nodes of V′ and edges of E′ in T′ is preserved in T. In simpler terms, an induced
subtree T′ is a subtree which preserves the parent-child relationships among the vertices
of the tree T.
Embedded subtree
For a tree T with node set V and edge set E, a tree T′ with node set V′ and edge set E′
is an embedded subtree of T if and only if (1) V′ ⊆ V; (2) the labelling of the nodes of
V′ in T′ is preserved in T; (3) for every edge (v1, v2) ∈ E′, v1 is an ancestor of v2 in T;
and (4) for v1, v2 ∈ V′, preorder(v1) < preorder(v2) in T′ iff preorder(v1) < preorder(v2)
in T. In other words, an embedded subtree T′ preserves the ancestor-descendant
relationships among the vertices of the tree T.
Figure 2.14 shows an induced and an embedded subtree generated from the tree T.
It can be seen that the subtree in Figure 2.14(b) preserves the parent-child relationship,
while in Figure 2.14(c), though the node “Book” is not the parent of “Name”, it is its
ancestor.
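The distinction between the two definitions can be sketched on a tree stored as a child-to-parent map; this is an illustrative simplification in which labels and the preorder constraint are omitted:

```python
# Tree as a child -> parent map (a fragment resembling Figure 2.14(a)).
parent = {"Title": "Book", "Author": "Book", "Name": "Author"}

def is_ancestor(tree, anc, node):
    """Walk up the parent chain to test an ancestor-descendant relationship."""
    while node in tree:
        node = tree[node]
        if node == anc:
            return True
    return False

# Induced subtrees keep parent-child edges; embedded subtrees only need
# the ancestor-descendant relationship to hold in the original tree.
induced_edge_ok = parent.get("Name") == "Author"        # parent-child edge
embedded_edge_ok = is_ancestor(parent, "Book", "Name")  # ancestor-descendant
print(induced_edge_ok, embedded_edge_ok)
```

The edge (Book, Name) is valid in an embedded subtree but not in an induced one, mirroring the Figure 2.14(c) example.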
Based on the traversal strategy, frequent tree mining methods can be classified
into three categories, namely depth-first, breadth-first, or a combination of both. In order
to traverse the trees, they are often represented as strings, with the representation
starting from the root node and proceeding to the child nodes either in a breadth-first
(left-to-right) or a depth-first (top-to-bottom) fashion. ‘-1’ is used to mark the end of
the nodes in a level, or to indicate backtracking.

Figure 2.14: (a) A tree; (b) an induced subtree; (c) an embedded subtree

The tree in Figure 2.15 (a simplified version of the tree in Figure 2.14(a), using
alphabetic node labels instead of tag names) can be represented as
< A B -1 C E -1 -1 D F -1 -1 -1 > in both level-wise encoding and pre-order string
encoding. The signs < and > represent the start and end of the representation respectively.
Figure 2.15: A sample tree using node labels as alphabets instead of tag names (A is the root with children B, C and D; E is a child of C and F is a child of D)
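The encoding of the tree in Figure 2.15 can be reproduced with a short pre-order traversal that emits '-1' on backtracking:

```python
def preorder_encoding(tree, root):
    """Pre-order string encoding with '-1' as the backtrack marker."""
    parts = [root]
    for child in tree.get(root, []):
        parts += preorder_encoding(tree, child)
    parts.append("-1")  # backtrack after visiting this subtree
    return parts

# The tree of Figure 2.15: A with children B, C (child E) and D (child F).
tree = {"A": ["B", "C", "D"], "C": ["E"], "D": ["F"]}
enc = "< " + " ".join(preorder_encoding(tree, "A")) + " >"
print(enc)
```

This produces the same string < A B -1 C E -1 -1 D F -1 -1 -1 > given in the text.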
Similar to frequent itemset mining on transactional datasets, there are two popular
frequent tree mining approaches:
• generate-and-test
• divide-and-conquer
As the name implies, the generate-and-test approach generates (k+1)-length candidates
from frequent k-length patterns and tests, against the dataset, whether the generated
candidates are frequent or not. One of the disadvantages of this approach is the enormous
number of candidate checks required, and hence the many scans of the dataset.

Figure 2.16: (a) a document tree dataset; (b) frequent patterns and their projections
in that dataset using the pattern-growth approach

(a) DT:
Tid   Tree
1     A B C -1 D -1 -1 -1
2     A B C -1 -1 F -1 -1
3     A F D -1 -1 K -1 -1

(b) Projections:
CondDB_A: 1: B C -1 D -1 -1; 2: B C -1 -1 F -1; 3: F D -1 -1
CondDB_B: 1: C -1 D -1; 2: C -1 F -1
To overcome this disadvantage, a pattern-growth approach [46] was proposed which
divides the search space based on the frequent patterns generated from the previous phase.
Consider a document tree dataset DT as shown in Figure 2.16 with the tree ids and their corresponding trees. Let us mine DT for frequent subtrees with a min supp ≥ 2. A scan is conducted on DT to identify frequent patterns, say ai, aj , ..., ak, and based on the discovered frequent patterns, the dataset is projected by extracting the patterns that follow each frequent pattern. For instance, in Figure 2.16, the frequent patterns are A, B and C. The conditional datasets are CondDBA, CondDBB and CondDBC for A, B and C respectively. Each projected dataset is then scanned to discover its frequent items. Based on the frequent substructures discovered, the dataset is projected again and this process is repeated until there are no more frequent substructures.
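The projection step above can be sketched on sequences rather than trees (a simplified, hypothetical analogue; real tree projection must additionally track node positions and parent-child links). The toy dataset mirrors the pre-order node labels of the trees in Figure 2.16:

```python
from collections import Counter

def pattern_growth(db, prefix, min_supp, results):
    """Pattern growth on sequences: find frequent items, then recurse into
    each conditional (projected) database instead of generating candidates."""
    counts = Counter(item for seq in db for item in set(seq))
    for item, supp in counts.items():
        if supp < min_supp:
            continue
        pattern = prefix + [item]
        results.append((pattern, supp))
        # conditional database: the part of each sequence after `item`
        cond_db = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        pattern_growth([s for s in cond_db if s], pattern, min_supp, results)
    return results

db = [["A", "B", "C", "D"], ["A", "B", "C", "F"], ["A", "F", "D", "K"]]
for pattern, supp in sorted(pattern_growth(db, [], 2, [])):
    print(pattern, supp)
```

Unlike generate-and-test, no candidate is ever constructed and then discarded: the recursion only ever extends patterns that are already known to be frequent in their conditional database.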
The TreeMiner method proposed by [135] adopts the generate-and-test approach to
generate embedded subtrees. However, it uses a vertical format to ease the support
counting for subtrees. Some of the other frequent pattern mining methods that use a generate-and-test strategy are FREQT [12], HBMFP [77], TreeMinerV [136] and uFreqt [88], which mine induced frequent subtrees. On the other hand, XSpanner and Chopper [117]
utilise pattern-growth approaches to generate embedded subtrees.
Conciseness
General frequent pattern mining methods have often been criticized for producing a very large number of frequent subtrees, which causes difficulties in understanding and
interpreting the results. In order to reduce the number of frequent subtrees and to derive
meaningful information from these frequent patterns, two popular concise representations
were proposed: closed and maximal.
Closed subtree
In a given document tree dataset, DT = {DT1,DT2,DT3, ...,DTn}, a frequent subtree DT ′ is said to be closed iff there exists no frequent subtree DT ′′ such that DT ′′ ⊃ DT ′ and supp(DT ′′) = supp(DT ′). In other words, no proper supertree of a closed subtree has the same support. This property is called closure.
Maximal subtree
In a given tree dataset, DT = {DT1,DT2,DT3, ...,DTn}, a frequent subtree DT ′ is said to be maximal iff there exists no frequent subtree DT ′′ such that DT ′′ ⊃ DT ′; that is, none of its proper supertrees is frequent. This property is called maximality.
Unlike closure, maximality does not impose the strict restriction of having the same support; hence it results in M ≤ C ≤ F where:
M = Number of Maximal frequent subtrees
C = Number of Closed frequent subtrees
F = Number of Frequent subtrees
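A minimal sketch of the two filters, assuming patterns are modelled as itemsets with known supports (a deliberate simplification of subtrees; subset containment stands in for subtree containment):

```python
def concise(freq):
    """Filter a dict {pattern (frozenset): support} down to its closed and
    maximal members.  A pattern is closed if no proper super-pattern has the
    same support; maximal if no proper super-pattern is frequent at all."""
    closed = {p for p, s in freq.items()
              if not any(p < q and freq[q] == s for q in freq)}
    maximal = {p for p in freq if not any(p < q for q in freq)}
    return closed, maximal

freq = {frozenset("A"): 3, frozenset("B"): 2, frozenset("C"): 2,
        frozenset("AB"): 2, frozenset("AC"): 2}
closed, maximal = concise(freq)
assert len(maximal) <= len(closed) <= len(freq)   # M <= C <= F
print(len(maximal), len(closed), len(freq))
# prints: 2 3 5
```

In this example B and C are absorbed by AB and AC (same support), so they are not closed; A is closed but not maximal because its frequent supersets have a lower support.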
There are some methods that generate concise representations using only the generate-
and-test approach. Among them the popular ones are PathJoin [123], CMTreeMiner [24]
and DryadeParent [108]. PathJoin [123] uses a compact data structure called FST-forest to
generate only maximal frequent subtrees. On the other hand, CMTreeMiner was the first
method that was proposed to discover all closed and maximal frequent labelled induced
subtrees without first discovering all frequent subtrees. CMTreeMiner uses two pruning
techniques: left-blanket and right-blanket pruning. The blanket of a tree is defined as the
set of immediate supertrees that are frequent, where an immediate supertree of a tree T
is a tree that has one more node than T . The left-blanket of a tree T is the blanket where
the node added is not in the rightmost path of T (the path from the root to the rightmost
node of T ). The right-blanket of a tree T is the blanket where the node added is in the
rightmost path of T . CMTreeMiner computes, for each candidate tree, the set of trees
that are occurrence-matched with its blanket’s trees. If this set is not empty, two pruning
techniques using the left-blanket and right-blanket are applied. If it is empty, then they
check if the set of trees that are transaction-matched but not occurrence-matched with
its blanket’s trees is also empty. If this is the case, there is no supertree with the same
support and then the tree is closed. CMTreeMiner is a labelled tree method and it was
not designed for unlabelled trees. As pointed out by the authors of CMTreeMiner, if the number of distinct nodes is very large then the memory usage of CMTreeMiner is expected to increase and hence its performance is expected to deteriorate.
Arimura and Uno proposed Cloatt [11], which applies closed mining on attribute trees, a subclass of labelled ordered trees that can also be regarded as a fragment of description logic with functional roles only. Additionally, in these attribute trees no two sibling nodes can have the same label, and the trees are defined using a relaxed tree inclusion. In the literature, closed frequent path mining methods also exist [127, 118].
However, due to the presence of sibling relationships in trees, directly extending these path mining methods to tree mining is not suitable.
Termier et al. proposed DryadeParent [108] as a closed frequent attribute tree mining
method to achieve performance comparable to CMTreeMiner. The DryadeParent method
is based on the computation of tiles (closed frequent attribute trees of depth 1) in the data
and on an efficient hooking strategy that reconstructs the closed frequent trees from these
tiles. Whereas CMTreeMiner [24] uses a classical generate-and-test approach to build
candidate trees edge by edge, the hooking strategy of DryadeParent finds a complete
depth level at each iteration and does not need tree mapping tests. The authors in [108]
claim that their experiments have shown that DryadeParent is faster than CMTreeMiner
in most settings and that the performances of DryadeParent are robust with respect to the
structure of the closed frequent trees to find, whereas the performances of CMTreeMiner
are biased toward trees having most of their edges on their rightmost branch. As attribute trees are trees in which two sibling nodes cannot have the same label, DryadeParent is not appropriate for dealing with real-life datasets where sibling nodes do have the same labels. To the best of our knowledge, no approach exists that applies the pattern-growth technique, which is particularly suited to dense datasets, to mine both induced and embedded subtrees.
2.4.2.4 Paths
The ability to capture the hierarchical relationship between the tags has facilitated the use
of paths for frequent pattern mining. The path model can also be used to represent the
structure of the XML document as discussed earlier. Similar to the frequent tree mining,
frequent paths are discovered. Firstly, a scan of the dataset is conducted to identify the
frequent 1-length paths, each of which is just a single node. These frequent nodes are combined to form 2-length paths. Testing is then carried out to verify how often they occur in the dataset. If they occur at least min supp times, then the paths are considered frequent. This technique is much more suitable for partial paths than for complete paths, as the frequency of complete paths could often be very low and hence there might not be sufficient frequent paths to output. Often a large set of subpaths is generated, especially for lower support thresholds or on dense datasets. To reduce the number of common subpaths, a new threshold called the maximum support threshold (max supp) was introduced to avoid the generation of very common subpaths, as these do not provide any interesting or new knowledge [4]. In spite of the advancement in frequent path mining, the frequent paths generated do not capture the sibling relationship between nodes and hence may incur information loss.
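A sketch of this thresholded path counting, assuming each document is pre-parsed into its set of root-to-node tag paths (the document collection and tag names below are our own illustrative assumptions):

```python
from collections import Counter

def frequent_paths(docs, min_supp, max_supp):
    """Count, per document, every prefix of every path, then keep the paths
    whose document frequency lies in [min_supp, max_supp].  The upper
    threshold drops very common subpaths that carry no new knowledge."""
    df = Counter()
    for paths in docs:
        # all partial paths (prefixes), counted at most once per document
        partial = {p[:i] for p in paths for i in range(1, len(p) + 1)}
        df.update(partial)
    return {p: f for p, f in df.items() if min_supp <= f <= max_supp}

docs = [
    {("article", "title"), ("article", "author", "name")},
    {("article", "title"), ("article", "abstract")},
    {("article", "author", "name")},
]
print(frequent_paths(docs, min_supp=2, max_supp=2))
```

Here the path ("article",) occurs in all three documents and is discarded by max supp, illustrating how the upper threshold suppresses ubiquitous subpaths.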
2.4.3 Research gaps in frequent pattern mining
The following lists the research gaps in frequent pattern mining:
• Lack of efficient methods to generate concise frequent subtrees using a prefix-based pattern growth approach.
• Lack of methods that could provide concise frequent subtrees with a parent-child relationship (induced) or an ancestor-descendant (embedded) relationship.
• Lack of testing on real-life large-sized datasets.
• Lack of testing on the effectiveness of the concise frequent subtrees.
2.5 Summary and discussion
This chapter has reviewed the literature relevant to the two focus areas of this research
namely XML clustering and frequent pattern mining. It has provided details of the various
models of representing the XML documents, their features and the techniques for clustering
and frequent pattern mining. From the literature review, the following limitations are
ascertained based on the current XML clustering and frequent pattern mining methods:
• Lack of effective clustering approaches that could combine the structure and the content of XML documents non-linearly.
– Lack of effective measures to control the increase in dimensionality due to the combination of structure and content features.
– Lack of higher-dimension models for clustering XML documents.
– Limited clustering methods using the outcome of frequent pattern mining meth-
ods in clustering the XML documents. Even these clustering methods have used
only one feature and not both features of XML documents. Moreover, these
methods are limited to very small datasets [4].
• Lack of efficient frequent pattern mining methods that could scale to datasets with a large number of documents and/or a very high branching factor, which is the nature of dense datasets.
– To the best of the author’s knowledge, no frequent subtree mining approach
exists that could provide concise frequent subtrees using prefix-based pattern
growth methods for mining real-life datasets.
– Lack of concise frequent pattern mining methods that could generate concise
frequent subtrees with different types of node relationships and conciseness.
In order to get around the above-mentioned limitations, frequent subtrees should be
used to identify the common substructures and to utilise them to extract the content from
the XML documents. By doing so, the structure is utilised and its corresponding content is
used for clustering. There could be two ways of representing the combination: implicitly
using the VSM or explicitly using the TSM. These types of combinations can help to reduce the complexity and improve the scalability of applying clustering to XML documents by restricting the input to only the content constrained by the structure features.
2.6 Chapter summary
This chapter has reviewed the current state-of-the-art research in the problem areas ad-
dressed within this thesis. As mentioned in Chapter 1, the two main problem areas are
clustering and frequent pattern mining on semi-structured data, XML. This chapter has
provided an overview of XML, its features, its benefits over other semi-structured data
and the current research on XML mining.
This chapter also examined the state of the research into the specific clustering tasks
addressed in this thesis. They are the data models of XML for the purpose of clustering
and calculating their similarity measures. A survey of the main clustering methods for
XML was presented which helped to identify that the existing XML clustering methods
rely either on the structure or on its content and hence could result in poor accuracy. A
general remark is that due to the complexity and scalability issues, both of these features
are not included in the clustering process. A common challenge is to identify an effective
approach to combine these two features, structure and content, and combat the complexity
without sacrificing the accuracy of the clustering solution.
The remainder of the chapter examined the state of the research into a frequent pat-
tern mining problem, with the main focus on tree-structured data sources and on the
sub-problem of concise frequent subtree generation. It was noted that XML document
structures can be represented as tree structures essentially in the form of rooted, ordered
and labelled trees. However, different types of frequent subtrees exist such as induced
and embedded subtrees based on the type of relationship between the nodes. A general
overview of the theoretical foundations of topics related to frequent subtree mining was first provided. This makes it possible to distinguish the important aspects of the tree mining problem and to understand the importance of tree-structured data over other representations.
Two different frequent pattern generation approaches, apriori and pattern growth, were
discussed. Strengths and weaknesses of these approaches were indicated, followed by an
overview of the existing frequent subtree mining methods with respect to the types of
subtrees mined. Finally, the research gaps in both clustering and frequent pattern mining
were presented. This has led the research in this thesis and the following chapters will
describe how this research addresses these limitations effectively.
Chapter 3
Research Design
3.1 Introduction
This chapter describes the research design used in developing the proposed frequent pat-
tern mining and clustering methods. The frequent pattern mining methods for generating
different types of concise frequent patterns are discussed in Chapter 4. Chapter 5 pro-
vides the clustering methods for non-linearly combining the structure and content of XML
documents.
In this chapter, the datasets that have been used to benchmark these proposed frequent
pattern mining methods and clustering methods over the state-of-the-art methods are
presented. A wide range of both synthetic datasets and the real-life datasets exhibiting
variations in their characteristics were chosen to evaluate the impact of the various features
in a dataset on the proposed methods. Some of the characteristics are the nodes’ branching
factor, depth of the trees and the size of the dataset. This chapter also covers a wide
range of evaluation metrics that were used to measure the effectiveness of the approaches
proposed in this thesis.
Finally, several state-of-the-art methods in both frequent pattern mining and clustering
that have been used for evaluating the effectiveness of the proposed techniques have been
provided.
3.2 Research Design
The aim of this research from a clustering perspective is to develop an efficient clustering
method for providing meaningful and accurate clusters. From the frequent pattern mining
perspective, the aim of this research is to develop efficient and effective frequent patterns
that are useful in capturing the structural similarity and that aid in clustering of XML
documents. As illustrated in Figure 3.1, there are three major phases in the proposed
research. Each of them is described as follows.
3.2.1 Phase-One: Pre-processing
The first phase is pre-processing, which includes extracting the structure and content
of the XML documents and representing them in the form of trees. This is one of the
most important and time-consuming tasks in any data mining project, as it typically requires several iterations. The main purpose of this phase is to
provide a suitable representation of XML documents for the mining techniques such as
frequent pattern mining and clustering in the subsequent phases. The first step in this
phase involves extracting the structure and content of XML documents. The structure
and content of XML documents require several pre-processing steps. For instance, the
structure of XML documents needs to be parsed and converted into a tree-like structure
and then represented in a suitable form for mining tasks. An analysis of many XML
documents revealed that they contain redundant information in their structure. To reduce
[Figure: the research design comprises Phase 1, pre-processing (modelling the structure of XML documents as document trees and pre-processing them); Phase 2, frequent pattern mining (concise frequent subtree mining using prefix-based pattern growth techniques); Phase 3a, clustering using the VSM (extracting the content using the concise frequent subtrees, representing the content constrained by them in the Vector Space Model and applying a clustering algorithm); and Phase 3b, clustering using the TSM (clustering the concise frequent subtrees, representing the content constrained by them in the Tensor Space Model, applying a tensor decomposition algorithm and clustering the decomposed values into clusters 1 to N).]
Figure 3.1: Research design
the information overload, these redundant structures should be identified and removed.
The content of XML documents also requires pre-processing such as stemming, stopword removal and short-word removal. The output of this phase is the document trees and
the processed content. The next phase of this research is to apply frequent pattern mining
techniques on the generated document trees, based on only the structure features of the
XML documents.
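The pre-processing steps of this phase can be sketched as follows, using Python's standard XML parser (a toy illustration: the stopword list and the suffix-stripping stemmer are stand-ins for the real resources used, and the sample document is our own):

```python
import xml.etree.ElementTree as ET

STOPWORDS = {"the", "of", "and", "a", "an", "in", "on", "for"}  # tiny sample list

def stem(word):
    """Toy suffix-stripping stemmer (a stand-in for e.g. Porter stemming)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(xml_text, min_len=3):
    """Extract the structure (tag paths) and the cleaned content terms."""
    root = ET.fromstring(xml_text)
    paths, terms = [], []
    def walk(node, path):
        path = path + [node.tag]
        paths.append(tuple(path))
        # element text only; tail text is ignored for brevity
        for w in (node.text or "").lower().split():
            w = stem(w.strip(".,;:"))
            if len(w) >= min_len and w not in STOPWORDS:
                terms.append(w)
        for child in node:
            walk(child, path)
    walk(root, [])
    return paths, terms

doc = "<article><title>Clustering of XML documents</title></article>"
print(preprocess(doc))
# prints: ([('article',), ('article', 'title')], ['cluster', 'xml', 'document'])
```

The tag paths feed the frequent pattern mining phase, while the cleaned terms become the content vocabulary used later for clustering.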
3.2.2 Phase-Two: Frequent Pattern Mining
The main aim of this research is to combine the structure and the content features of XML
documents for clustering. However, representing the structure and the content features
of XML documents in VSM is an expensive task and could cause information overload
for a clustering algorithm. Hence to reduce the overload, this research proposed to utilise
frequent pattern mining techniques to identify the prominent subtrees and use them to
extract the relevant content.
This phase generates the concise frequent structure of the XML documents to create a useful form for the clustering phase. In order to extract the content of
the XML documents using the structure-based frequent patterns, two types of frequent
subtrees exist, namely induced and embedded subtrees, preserving the parent-child and the
ancestor-descendant relationships respectively. This research makes use of both types of
subtrees, assuming that embedded subtrees may expose some of the hidden relationships
that are not identified using induced subtrees. The number of generated subtrees after
applying the frequent tree mining algorithms to document trees is usually very large. It
is essential to reduce the number of frequent subtrees generated by identifying only the
concise representations such as closed and maximal.
This thesis proposes four frequent pattern mining algorithms for generating the closed
and maximal frequent induced and embedded subtrees. Utilising the concise frequent
subtrees generated from the frequent pattern mining algorithms, the content from the XML
documents is extracted and represented in the pre-cluster form suitable for clustering. By
doing so, the dimension of the input data matrix for clustering is reduced.
3.2.3 Phase-Three (a): Clustering using VSM
This phase involves implicitly representing the structure and the content of XML docu-
ments using the frequent subtrees. It performs a non-linear combination of structure and
content features by utilizing the concise frequent subtrees generated from the previous
phase to extract the content and represent them in a Vector Space Model (VSM). Fi-
nally, a clustering algorithm is then applied on the VSM to create the required number of
clusters.
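The implicit combination can be sketched as follows, approximating a frequent subtree by the set of tag paths it covers (an assumption made for brevity; the thesis operates on subtrees, and the documents below are hypothetical):

```python
def constrained_vsm(docs, frequent_paths):
    """Build a VSM (term-by-document count matrix) using only the content
    that appears under the concise frequent structures; everything else is
    dropped, shrinking the matrix the clustering algorithm must handle."""
    vocab = sorted({t for doc in docs for p, terms in doc.items()
                    if p in frequent_paths for t in terms})
    matrix = []
    for doc in docs:
        counts = {}
        for p, terms in doc.items():
            if p in frequent_paths:
                for t in terms:
                    counts[t] = counts.get(t, 0) + 1
        matrix.append([counts.get(t, 0) for t in vocab])
    return vocab, matrix

docs = [
    {("article", "title"): ["xml", "cluster"], ("article", "note"): ["draft"]},
    {("article", "title"): ["xml", "mining"]},
]
vocab, matrix = constrained_vsm(docs, {("article", "title")})
print(vocab)    # ['cluster', 'mining', 'xml'] -- 'draft' is excluded
print(matrix)   # [[1, 0, 1], [0, 1, 1]]
```

The structure never appears as an explicit dimension here: it acts only as a filter on which content enters the vector space, which is why this combination is called implicit.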
3.2.4 Phase-Three (b): Clustering using TSM
Clustering using the TSM phase proposes a novel way of combining the structure and the
content using a multi-dimensional model. It begins with clustering the concise frequent
subtrees and then extracting the content corresponding to them. The extracted content,
along with the structure and the document id, is then represented in a higher-order Tensor
Space Model (TSM). Unlike the VSM, the TSM involves representing both the features
– structure and the content – in an explicit manner for a given document. A tensor
decomposition algorithm is then applied on the tensor; the resulting decomposed values
provide the structure and content similarity between the documents. Clustering is then
applied on the decomposed values to generate the required number of clusters.
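A small sketch of matricization (mode-n unfolding), the step that turns the tensor into a matrix on which standard decomposition algorithms such as SVD can operate. The unfolding order below follows one common convention (conventions differ in the literature), and the toy document-by-structure-by-term tensor is our own:

```python
def unfold(tensor, mode):
    """Matricization: unfold a 3-way nested-list tensor along `mode` so
    that standard matrix algorithms can be applied to it."""
    I, J, K = len(tensor), len(tensor[0]), len(tensor[0][0])
    if mode == 0:   # documents x (structure * term)
        return [[tensor[i][j][k] for k in range(K) for j in range(J)]
                for i in range(I)]
    if mode == 1:   # structure x (term * document)
        return [[tensor[i][j][k] for k in range(K) for i in range(I)]
                for j in range(J)]
    return [[tensor[i][j][k] for j in range(J) for i in range(I)]
            for k in range(K)]

# toy tensor: 2 documents x 2 frequent subtrees x 2 terms
T = [[[1, 0], [0, 2]],
     [[3, 0], [0, 0]]]
print(unfold(T, 0))   # each row is one document's flattened slice
# prints: [[1, 0, 0, 2], [3, 0, 0, 0]]
```

In the mode-0 unfolding each document becomes a row, so row-wise similarity of the unfolded (and then decomposed) matrix reflects combined structure-and-content similarity between documents.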
In order to evaluate the proposed methods developed for both frequent pattern mining and clustering, the following section discusses their experiment set-up.
3.3 Experiment Set-Up
Experiments were conducted on the QUT High Performance Computing system, with a
RedHat Linux operating system, 16GB of RAM and a 3.4GHz 64bit Intel Xeon processor
core. C++, C# and Matlab were used for the implementation of the proposed algorithms.
To be consistent with the other frequent tree mining algorithms for the purpose of bench-
marking, C++ was used to implement the proposed frequent tree mining algorithms. C#
was used for parsing the XML documents and extracting their structure and content. For creating and manipulating the tensors, a toolbox was available in Matlab and hence this programming language was used. Matlab was also used to develop the proposed tensor decomposition algorithm; Python was used as the scripting language to evaluate the results.
3.4 Datasets
Both the synthetic and real-life datasets were used in the evaluation of the mining tech-
niques. The synthetic datasets were primarily used to benchmark some of the existing
frequent tree mining algorithms against the proposed methods on their runtime and scalability performance. The real-life datasets include datasets of varied sizes, ranging from small to large.
3.4.1 Synthetic Datasets
Zaki's tree generator1 has often been used to generate synthetic datasets for benchmarking tree mining algorithms. Using this generator, two synthetic
datasets were generated, namely the F5 and D10 datasets, with the parameters as indicated
1http://www.cs.rpi.edu/˜zaki/software
in Table 3.1, where “f” represents the fan out factor, “d” the depth of the tree, “n” the
number of unique labels for the trees, “m” the total number of nodes in a parent tree and
“t” indicates the number of trees.
Table 3.1: Synthetic datasets and their parameters
Name | Description
F5   | -f 5 -d 10 -n 100 -m 100 -t 100000
D10  | -f 10 -d 10 -n 100 -m 100 -t 100000
Studies have indicated that the performance of some of the existing frequent subtree mining methods degrades for datasets having a high branching factor [108]. To evaluate the performance of the proposed frequent pattern mining algorithms against the current state-of-the-art algorithms, two datasets, F5 and D10, with fan-out factors of 5 and 10 respectively, were generated.
3.4.2 Real-life Datasets
Five real-life datasets have been used for benchmarking both frequent subtree mining and
clustering methods. These are classified based on the size of the dataset into the following
groups:
• Small-sized real-life dataset;
• Medium-sized real-life dataset; and
• Large-sized real-life datasets.
3.4.2.1 Small-sized real-life dataset
The ACM dataset is a small-sized real-life dataset that contains 140 XML documents
corresponding to two DTDs, IndexTermsPage.dtd and OrdinaryIssuePage.dtd (with about
70 XML documents for each DTD), similar to the setup in XProj [4]. It does not contain any schema definitions such as XSD or DTD. Also, this dataset contains both semantic tags and formatting tags. Table 3.2 provides the two sets of categories for this dataset.

Table 3.2: Details of categories in the ACM dataset
Category types    | Categories                         | # Documents
Structure-Only    | IndexTermsPages                    | 70
                  | OrdinaryIssuePages                 | 70
Structure-Content | DTD-based                          | 70
                  | General                            | 7
                  | Mobile computing                   | 3
                  | Database Management Systems (DBMS) | 42
                  | Others                             | 18
Previous researchers [4] have used the ACM dataset to cluster the documents into two
groups according to their structural similarity. To compare the proposed work with theirs,
experiments were conducted with two cluster categories according to structural similarity.
It is comparatively easy and straightforward to group this dataset according to structural similarity, as the documents come with two different schema definitions.
More complexity has been added in the second set of experiments by conducting the
structure-and-content-based clustering. This experimental design utilises expert knowl-
edge and is based on 5 groups considering both the structural and the content features of
XML documents. The first category is based on the document structure and the remaining
four categories are based on the document content, namely General, Mobile computing,
Database Management Systems (DBMS) and Others.
3.4.2.2 Medium-sized real-life dataset
The dataset used in this thesis is a subset of journal articles and conference papers from
the original XML DBLP archive. Table 3.3 shows the details of the DBLP and Table 3.4
shows the categories that have been used to split the documents in this dataset.
The DBLP archive is a digital bibliography on computer science containing journal articles, conference papers, books, book chapters and theses. DBLP exhibits a certain structural variety different from other datasets. It is characterized by a small average depth and offers quite short text descriptions (e.g., author names, paper titles, conference names). Table 3.4 lists the 8 categories in the DBLP dataset, with articles mostly from books, conferences and journals.

Table 3.3: Details of ACM and DBLP datasets
Attributes                   | ACM   | DBLP
No. of Docs                  | 140   | 3882
No. of tags                  | 38    | 32
No. of internal nodes        | 2070  | 28674
Max length of a document     | 45    | 25
Average length of a document | 14    | 7
No. of distinct terms        | 7135  | 10766
Total No. of words           | 38141 | 75742
Size of the collection       | 1 MB  | 4.36 MB
Presence of formatting tags  | No    | No
Presence of Schema           | Yes   | No
Number of Categories         | 5     | 8

Table 3.4: Details of categories in the DBLP dataset
Category Name    | # Documents
Books            | 1282
Conference       | 1664
Journals         | 783
Miscellaneous    | 2
Persons          | 13
Phd              | 74
Technical report | 29
World wide web   | 35
3.4.2.3 Large-sized real-life datasets
The three datasets that belong to this group are:
• INEX IEEE dataset;
• INEX 2007 dataset; and
• INEX 2009 dataset.
These datasets, each of more than 5000 documents, were obtained from the clustering
task in the INitiative for the Evaluation of XML Retrieval (INEX)2. INEX is a collaborative forum bringing together researchers from many fields to evaluate their methods
2http://www.inex.otago.ac.nz/
in XML Mining and IR, using real-life datasets such as Wikipedia and IEEE proceedings.
The clustering task in this forum began in 2002 with the IEEE proceedings. In 2005 this
collection of IEEE proceedings was expanded with more IEEE proceedings and in 2006
the IEEE collection was complemented with an XML dump of the Wikipedia, which was
later updated in 2009. The Wikipedia dataset used in 2008 was considered highly unstable [137] and had a very small number of labels.
INEX IEEE dataset
The IEEE collection version 2.2, which has been used in the INEX document mining
track 2006, consists of 6054 articles originally published in 23 different IEEE journals
from 2002 to 2004. The articles follow a complex schema that includes front matter, back
matter, section headings, text formatting tags, and mathematical formula [104].
Table 3.5 provides the details of the categories in this dataset with 6 thematic labels
and 2 structural labels. The thematic or content labels are Computer, Graphics, Hard-
ware, Artificial Intelligence (AI), Internet and Parallel Computing. The structural labels
are IEEE Transactions and IEEE Journals. For instance, the “tc” category belongs to the “Transactions” structural label and the “Computer” content/thematic label. In simple words, “tc” denotes the IEEE Transactions on Computers.
Table 3.5: Details of categories in the INEX IEEE dataset
Content/Structure | Computer           | Graphics | Hardware | AI     | Internet | Parallel
Transactions      | tc, ts             | tg       | tp, tk   | -      | -        | td
Journals          | an, co, cs, it, so | cg       | dt, mi   | ex, mu | lc       | pd
INEX 2007 dataset
The INEX 2007 Wikipedia clustering task corpus contains 48,305 documents. These
documents have deep structures and a high branching factor. The document set does not contain any schema definitions such as XSD or DTD. Also, this dataset contains both semantic tags and formatting tags.
Table 3.6 lists the categories in the INEX 2007 dataset. There are 21 categories which are not well balanced: some categories are large (Portal:Law comprises about 25% of the documents), while others are small (Portal:Music is very small with only 0.5% of the documents) [41]. Having these 21 categories helps to identify how the proposed models behave in the presence of small categories and also to identify ambiguous categories such as Portal:Pornography and Portal:Sexuality, or Portal:Christianity and Portal:Spirituality.
Table 3.6: Details of categories in the INEX 2007 dataset
Id      | Category                | # Documents
2112299 | Portal:Law              | 12105
1597184 | Portal:Literature       | 8418
1484914 | Portal:Sports and games | 7267
1480358 | Portal:Art              | 3884
1886386 | Portal:Physics          | 2659
3091788 | Portal:Christianity     | 2234
2773006 | Portal:Chemistry        | 2314
1685758 | Portal:History          | 1588
3091127 | Portal:Spirituality     | 1329
2914908 | Portal:Sexuality        | 1219
2879927 | Portal:War              | 1130
1620218 | Portal:Archaeology      | 660
1507239 | Portal:Aviation         | 617
2328885 | Portal:Formula One      | 591
1486363 | Portal:Astronomy        | 555
1895383 | Portal:Trains           | 484
2635947 | Portal:Comics           | 304
2314377 | Portal:University       | 275
2257163 | Portal:Pornography      | 241
2263642 | Portal:Writing          | 230
474166  | Portal:Music            | 201
INEX 2009 dataset
The INEX 2009 clustering task corpus, containing 54,575 documents, was used in this research as there were a number of submissions to the clustering task against which the proposed clustering methods could be evaluated. The subset contained 5,243 unique entity tags and 1,900,072 unique terms. Table 3.7 shows the details of the INEX 2009 dataset used in this research.
As shown in Table 3.7 there are two sets of categories in this dataset. The first set
of categories is derived from Wikipedia categories and the top-20 of the categories are
listed in Table 3.8. A complete list of all the categories in this set is provided in Appendix
A.1. In this category set, a document may belong to multiple categories, and most documents do belong to more than one. Hence, the sum of the number of documents in each of the categories exceeds the total number of documents in the collection.
Table 3.7: Details of large-sized datasets
Attributes                   | INEX IEEE | INEX 2007  | INEX 2009
No. of Docs                  | 6054      | 48,305     | 54,575
No. of tags                  | 165       | 5814       | 34,686
No. of internal nodes        | 472,351   | 4,487,819  | 15,128,407
Max length of a document     | 691       | 659        | 10347
Average length of a document | 78        | 19         | 277
No. of distinct terms        | 114,976   | 535,351    | 1,900,072
Total No. of words           | 3,695,550 | 16,682,466 | 21,480,198
Size of the collection       | 272MB     | 360 MB     | 2.94GB
Presence of formatting tags  | Yes       | Yes        | Yes
Presence of Schema           | Yes       | Yes        | Yes
Number of Categories         | 18        | 15         | 4052
Table 3.8: Details of the top-20 categories in the INEX 2009 dataset using Wikipedia categories
Id | Category         | # Documents
1  | People           | 15359
2  | Society          | 12663
3  | Geography        | 9065
4  | Culture          | 9033
5  | Politics         | 8589
6  | History          | 8035
7  | Nature           | 5788
8  | Countries        | 5724
9  | Applied sciences | 5568
10 | Humanities       | 5205
11 | Business         | 3734
12 | Technology       | 3584
13 | Science          | 3378
14 | Arts             | 2837
15 | Historical eras  | 2780
16 | Health           | 2760
17 | Entertainment    | 2521
18 | Belief           | 2417
19 | Life             | 2301
20 | Language         | 2140
The second set of categories in INEX 2009 dataset is used to evaluate the collection
selection problem (discussed in Section 3.5). It is based on the 52 topics in the ad hoc
queries posed by the volunteers in the INEX forum. Table 3.9 lists the top-20 queries that
were used and the number of documents that were found to be relevant to the query. The
full list of all the categories in this set is provided in Appendix A.2. Among the 52 queries
there were about 22 queries which had fewer than 5 relevant documents. Using this set of categories helps to identify a very accurate clustering solution that could be useful for information retrieval.

Table 3.9: Details of the top-20 categories in the INEX 2009 dataset using ad hoc queries
Id      | Query Title                                                       | # Documents
2009043 | NASA missions                                                     | 135
2009005 | Chemists physicists scientists alchemists periodic table elements | 82
2009093 | French revolution                                                 | 40
2009013 | Native American Indian wars against colonial Americans            | 33
2009039 | Roman architecture                                                | 27
2009063 | D-Day normandy invasion                                           | 27
2009040 | Steam engine                                                      | 25
2009055 | European union expansion                                          | 24
2009036 | Notting Hill Film actors                                          | 22
2009051 | Rabindranath Tagore Bengali literature                            | 18
2009023 | “Plays of Shakespeare”+Macbeth                                    | 16
2009076 | Sociology and social issues and aspects in science fiction        | 14
2009105 | Musicians Jazz                                                    | 10
2009035 | Bermuda Triangle                                                  | 9
2009061 | France second world war normandy                                  | 9
2009064 | Stock exhange insider trading crime                               | 9
2009096 | Eiffel                                                            | 9
2009001 | Nobel prize                                                       | 8
2009011 | Olive oil health benefit                                          | 8
2009033 | Al-Andalus taifa kingdoms                                         | 8
3.5 Evaluation measures
Distinct evaluation measures were used for both the frequent pattern mining phase and
clustering phase in this research. For evaluating frequent pattern mining methods, the
commonly used metrics such as the runtime and the number of frequent patterns generated for various support thresholds (min supp) are utilised. On the other hand, purity, the F1 measure and NMI were used for evaluating the effectiveness of the clustering solutions produced by the proposed clustering methods and by the clustering benchmarks. The
execution times for the decomposition algorithms are also used to evaluate the clustering
methods.
Although the results of manual relevance assessment for information retrieval are available
for the INEX 2009 dataset, the existing evaluation metrics cannot exploit them to evaluate
the clustering methods for information retrieval. Hence, a new measure called the Normalized
Cumulative Cluster Gain (NCCG), introduced in [81] as part of the INEX 2009 clustering
task, was utilised for evaluating the effectiveness of the clustering solution for the problem
of collection selection.
3.5.1 Frequent pattern mining
Two evaluation measures are used to assess the performance of the frequent pattern
mining methods: the runtime of the methods and the number of frequent patterns they
generate.
Runtime (λ) in seconds
This is the time taken to complete the generation of the frequent patterns from the
given dataset for a given support threshold (min supp). It is measured in seconds.
Number of Frequent Patterns (ρ)
This is the total number of frequent patterns generated from a given dataset for a
given support threshold (min supp).
3.5.2 Clustering
This research focuses on using purity, F1 and NMI measures to evaluate the clustering
methods.
Purity
The standard criterion of purity is used to determine the quality of clusters by mea-
suring the extent to which each cluster contains documents primarily from one category.
The simplicity and popularity of this measure mean that it was used as the only
evaluation measure for the clustering tasks in INEX 2006 and INEX 2009. In general,
the larger the value of purity, the better the clustering solution.
Let ω = {w_1, w_2, ..., w_K} denote the set of clusters for the dataset D and ξ =
{c_1, c_2, ..., c_J} represent the set of categories. The purity of a cluster w_k is defined as:

P(w_k) = max_j |w_k ∩ c_j| / |w_k|    (3.1)

where w_k is the set of documents in cluster w_k and c_j is the set of documents in
category c_j. The numerator is the number of documents from the category that occurs
most often in cluster w_k, and the denominator is the number of documents in cluster w_k.
The purity of the clustering solution ω can be calculated based on micro-purity and
macro-purity. Micro-purity of the clustering solution ω is obtained as a weighted sum of
individual cluster purity. Macro-purity is the unweighted arithmetic mean based on the
total number of categories [29].
Micro-Purity(ω, ξ) = Σ_{k=1}^{K} P(w_k) · |w_k| / Σ_{k=1}^{K} |w_k|    (3.2)

Macro-Purity(ω, ξ) = Σ_{k=1}^{K} P(w_k) / J    (3.3)
F1-measure
Another standard measure used to evaluate the clustering solution is the F1-measure.
It accounts not only for the documents that are correctly grouped together in a cluster
but also for the documents that are misclassified with respect to the cluster.
In order to calculate the F1-measure, pairwise decisions over documents are considered.
There are two types of correct decisions: a True Positive (TP) decision assigns two similar
documents to the same cluster, and a True Negative (TN) decision assigns two dissimilar
documents to different clusters. There are also two types of error decisions: a False
Positive (FP) decision assigns two dissimilar documents to the same cluster, and a False
Negative (FN) decision assigns two similar documents to different clusters [76]. The TN
decisions are not used in calculating the F1-measure.
Using the TP, FP and FN decisions, the precision and the recall for the micro-F1 are
defined as:

precision_micro-F1 = Σ_{j=1}^{J} TP_j / Σ_{j=1}^{J} (TP_j + FP_j)    (3.4)

recall_micro-F1 = Σ_{j=1}^{J} TP_j / Σ_{j=1}^{J} (TP_j + FN_j)    (3.5)

The precision and the recall for the macro-F1 are defined as:

precision_macro-F1 = (1/J) Σ_{j=1}^{J} [ TP_j / (TP_j + FP_j) ]    (3.6)

recall_macro-F1 = (1/J) Σ_{j=1}^{J} [ TP_j / (TP_j + FN_j) ]    (3.7)

where TP_j is the number of documents in category c_j that exist in cluster w_k, FP_j is
the number of documents that are not in category c_j but exist in cluster w_k, and FN_j
is the number of documents that are in category c_j but do not exist in cluster w_k.
F1 can now be defined as:

F1 = 2 · precision · recall / (precision + recall)    (3.8)

Micro-F1 = 2 · precision_micro-F1 · recall_micro-F1 / (precision_micro-F1 + recall_micro-F1)    (3.9)

Macro-F1 = 2 · precision_macro-F1 · recall_macro-F1 / (precision_macro-F1 + recall_macro-F1)    (3.10)
Normalized Mutual Information (NMI)
Another evaluation measure is the Normalized Mutual Information (NMI), which
quantifies the trade-off between the quality of the clusters and the number of clusters
[76].
NMI [76] is defined as:

NMI(ω, ξ) = I(ω; ξ) / ( [H(ω) + H(ξ)] / 2 )    (3.11)

I(ω; ξ) = Σ_k Σ_j P(w_k ∩ c_j) log [ P(w_k ∩ c_j) / (P(w_k) P(c_j)) ]
        = Σ_k Σ_j (|w_k ∩ c_j| / N) log [ N |w_k ∩ c_j| / (|w_k| |c_j|) ]    (3.12)

where P(w_k), P(c_j) and P(w_k ∩ c_j) indicate the probabilities of a document being in
cluster w_k, in category c_j, and in both w_k and c_j respectively, and N is the total
number of documents.

H(ω) is the entropy, a measure of uncertainty, given by:

H(ω) = -Σ_k P(w_k) log P(w_k) = -Σ_k (|w_k| / N) log (|w_k| / N)    (3.13)
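A minimal sketch of Equations 3.11-3.13 is shown below; the function name `nmi` and the label-list inputs are illustrative choices, and the logarithm base is irrelevant since it cancels in the ratio.

```python
from collections import Counter
from math import log

def nmi(clusters, categories):
    """Normalized Mutual Information (Equations 3.11-3.13).

    clusters, categories: parallel lists giving each document's
    cluster id and ground-truth category id.
    """
    n = len(clusters)
    w = Counter(clusters)                      # |w_k|
    c = Counter(categories)                    # |c_j|
    wc = Counter(zip(clusters, categories))    # |w_k ∩ c_j|

    # I(ω; ξ) = Σ_k Σ_j (|w_k ∩ c_j|/N) log(N|w_k ∩ c_j| / (|w_k||c_j|))
    i_wc = sum((m / n) * log(n * m / (w[k] * c[j]))
               for (k, j), m in wc.items())

    # H(ω) = -Σ_k (|w_k|/N) log(|w_k|/N), and likewise H(ξ).
    h_w = -sum((m / n) * log(m / n) for m in w.values())
    h_c = -sum((m / n) * log(m / n) for m in c.values())

    return i_wc / ((h_w + h_c) / 2)
```

A clustering that exactly reproduces the categories attains the maximum NMI of 1.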
3.5.3 Collection selection evaluation using NCCG measure
This evaluation measure was used in evaluating the INEX 2009 dataset [81] and is based on
Van Rijsbergen’s clustering hypothesis. Van Rijsbergen and his co-workers [92] conducted
an intensive study of the clustering hypothesis in information retrieval,
which states that documents which are similar to each other may be expected to be relevant
to the same requests; dissimilar documents, conversely, are unlikely to be relevant to the
same requests. If the hypothesis holds true, then relevant documents will appear in a small
number of clusters and the document clustering solution can be evaluated by measuring
the spread of relevant documents for the given set of queries.
To test this hypothesis on a real-life dataset, the INEX 2009 dataset, the clustering
task was evaluated by determining the quality of clusters relative to the optimal collection
selection [81]. Collection selection involves splitting a collection into subsets and
recommending which subset should be searched for a given query. This allows a search
engine to search fewer documents, resulting in improved runtime performance compared
with searching the entire collection.
The evaluation of collection selection was conducted using the manual query assess-
ments for a given set of queries from the INEX 2009 Ad Hoc track [81]. The manual
query assessment is called the relevance judgment in Information Retrieval (IR) and has
been used to evaluate ad hoc retrieval of documents. It involves defining a query based on
the information need, a search engine returning results for the query and humans judging
whether the results returned by the search engine are relevant to the information need.
Better clustering solutions in this context will tend to (on average) group together
relevant results for (previously unseen) ad hoc queries. Real ad hoc retrieval queries and
their manual assessment results are utilised in this evaluation. This approach evaluates
the clustering solutions relative to a very specific objective – clustering a large document
collection in an optimal manner in order to satisfy queries while minimising the search
space. The metric used for evaluating the collection selection is called the Normalized
Cumulative Cluster gain (NCCG) [81].
The NCCG is used to calculate the score of the best possible collection selection
according to a given clustering solution of n clusters. The score is better when the
query result set contains more cohesive clusters. The Cumulative Gain of a Cluster
(CCG) is calculated by counting the number of documents of the cluster that appear in
the relevant set returned for a topic by the manual assessors:

CCG(c, t) = Σ_{i=1}^{n} Rel_i    (3.14)
For a clustering solution for a given topic, a (sorted) vector CG is created representing
each cluster by its CCG value. Clusters containing no relevant documents are represented
by a value of zero. The cumulated gain for the vector CG is calculated, which is then
normalized on the ideal gain vector. Each clustering solution c is scored for how well it
has split the relevant set into clusters using CCG for the topic t.
SplitScore(t, c) = [ Σ_{i=1}^{|CG|} cumsum(CG)_i ] / n_r²    (3.15)

where n_r is the number of relevant documents in the returned result set for the topic t
and cumsum denotes the cumulative sum.
A worst-case split is assumed to place each relevant document in a distinct cluster. Let
CG1 be the vector that contains the cumulative gain of every cluster under this split.

MinSplitScore(t, c) = [ Σ_{i=1}^{|CG1|} cumsum(CG1)_i ] / n_r²    (3.16)
The normalized cluster cumulative gain (nCCG) for a given topic t and a clustering
solution c is given by:

nCCG(t, c) = [ SplitScore(t, c) − MinSplitScore(t, c) ] / [ 1 − MinSplitScore(t, c) ]    (3.17)
The mean and the standard deviation of the nCCG score over all the topics for a
clustering solution are then calculated.
Mean(nCCG(c)) = Σ_{t=1}^{T} nCCG(t, c) / T    (3.18)

Std Dev(nCCG(c)) = sqrt( Σ_{t=1}^{T} [ nCCG(t, c) − Mean(nCCG(c)) ]² / T )    (3.19)

where T is the total number of topics.
The NCCG value varies from 0 to 1. A larger NCCG value for a given clustering
solution is better, since it indicates that a greater number of relevant documents are
clustered together. Further details of this metric can be found in [81].
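Equations 3.15-3.17 can be sketched for a single topic as follows. This is only a literal transcription of the formulas: the descending sort order of the CCG vector follows our reading of [81] and is an assumption, as are the function names.

```python
def split_score(ccg, n_r):
    """Equations 3.15/3.16: sum of the cumulative sums of the CCG
    vector, normalised by the squared number of relevant documents.

    ccg: one CCG value per cluster (zero for clusters with no
         relevant documents); n_r: number of relevant documents.
    """
    cg = sorted(ccg, reverse=True)  # assumed descending order, per [81]
    running, total = 0, 0
    for g in cg:
        running += g                # cumulative sum so far
        total += running            # Σ cumsum(CG)
    return total / n_r ** 2

def nccg(t_split, t_min_split):
    """Equation 3.17: normalise a split score against the worst split."""
    return (t_split - t_min_split) / (1 - t_min_split)
```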
Decomposition Time (λd)
This is the time in seconds taken to decompose the tensor model that has
been built using the structure and the content of XML documents.
3.6 Benchmarks
This section details the benchmarks used for evaluating both the frequent pattern min-
ing and clustering methods. The aim of evaluating the proposed methods against these
benchmarks is to understand the strengths and weaknesses of both the proposed methods
and these benchmarks for the chosen datasets (detailed in the previous section). Not only
were the state-of-the-art methods in both the frequent pattern mining and clustering tasks
used as benchmarks, but additional methods were also created to serve as benchmarks in
order to understand the effectiveness of the proposed methods.
3.6.1 Frequent pattern mining
The experiments were evaluated against other state-of-the-art methods. In frequent
pattern mining, methods such as MB3-Miner [105], TreeMinerV [134], PrefixTreeISpan [141]
and PrefixTreeESpan [140] are used to benchmark the proposed frequent pattern mining
methods. Among them, TreeMinerV and MB3-Miner are representatives of the
generate-and-test approach. The PrefixTreeISpan and PrefixTreeESpan methods adopt
a prefix-based pattern growth algorithm to generate frequent induced and embedded
subtrees respectively. Table 3.10 details the benchmarks for frequent pattern mining
methods in terms of the type of subtrees, the generation approach and the distinct
advantage of each benchmark.
Table 3.10: Benchmarks for frequent pattern mining methods

Name                   Type of Subtrees  Generation approach          Distinct advantage
MB3-Miner [105]        Induced           Generate-and-test            Tree Model Guided candidate generation to reduce candidate enumeration
TreeMinerV [134]       Embedded          Generate-and-test            Simplicity
PrefixTreeISpan [141]  Induced           Prefix-based pattern growth  Suitable for dense datasets
PrefixTreeESpan [140]  Embedded          Prefix-based pattern growth  Suitable for dense datasets
3.6.2 Clustering
To clearly understand the strengths and weaknesses of the proposed hybrid clustering
methods, various clustering representations, other clustering methods from INEX and
clustering methods using different decomposition techniques were used as benchmarks in
this research. This subsection details each of them. Table 3.11 lists the benchmarks for
clustering methods.
Table 3.11: Benchmarks for clustering methods

Based on                                         Name
Representations                                  SO, CO, S+C
Clustering methods from INEX                     BilWeb-CO [81], PCXSS [84], CRP and 4RP [131], Doucet et al. [36]
Clustering using different tensor decompositions CP, Tucker, MACH
3.6.2.1 Based on representations
The objective of this set of comparisons was to evaluate whether the combination of
structure and content used by the proposed clustering methods is better than structure-only
and content-only representations. Furthermore, it is also used to determine whether the
proposed type of combination is better than a linear or naive way of combining the structure and
the content of XML documents for clustering.
The following are the representations that are used for comparing the outputs of the
proposed clustering method on various real-life datasets.
Structure Only (SO) Representation
An input matrix D × CF is generated, where D represents the documents and CF
represents the list of concise frequent induced subtrees that have been used. Each
document is represented by the CF subtrees that are present in it.
Content Only (CO) Representation
The content of the XML documents is represented in a matrix D × Terms, where the
Terms are obtained after pre-processing techniques such as stop-word removal, stemming
and integer removal. Each entry in the matrix contains the frequency of a term in a
document.
Structure and Content (S+C) Representation
In this representation, the structure and content features for the documents are rep-
resented in a matrix by concatenating the occurrences of the structure features (CF sub-
trees) and the content features (terms) side by side for a document. It is represented as
[D × CF ;D × Terms].
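Assembling [D × CF; D × Terms] amounts to concatenating each document's structure row and content row side by side. A minimal sketch, with a hypothetical function name and toy counts:

```python
def s_plus_c(d_cf, d_terms):
    """Concatenate the structure matrix (D x CF) and the content
    matrix (D x Terms) row by row into the S+C representation.

    d_cf, d_terms: lists of rows, one row per document, in the
    same document order.
    """
    return [row_cf + row_terms for row_cf, row_terms in zip(d_cf, d_terms)]
```

Each resulting row has |CF| + |Terms| entries, so the clustering algorithm sees both feature types for every document.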
3.6.2.2 Based on other clustering methods from INEX
This research aims to evaluate the proposed methods against the available clustering
results of methods from other participants in the INEX forum: BilWeb-CO [81],
PCXSS [84], CRP, 4RP and Word descriptor [131], Doucet et al. [36], and the
Self-Organising Map for Structured Data (SOM-SD), Contextual SOM-SD (CSOM-SD)
and GraphSOM (Graph-based SOM) approaches. Each of these clustering methods is
discussed in detail as follows:
BilWeb-CO [81]
This methodology was proposed by a research team from the Bilkent Web Databases
Research Group, which used the Cover-Coefficient based Clustering Methodology (C3M)
for clustering the XML documents of the INEX 2009 dataset detailed in Section 3.4. C3M
is a single-pass partitioning clustering method based on the probability of selecting
a document given a term that has been selected from another document. To scale to the
a document given a term that has been selected from another document. To scale for the
large collection of documents in the INEX 2009 dataset, a compact representation of the
documents was generated by adapting term-centric and document-centric index pruning
techniques. They cluster the documents with these compact representations for various
pruning levels, again using the C3M method.
Progressively Clustering XML by Semantic and Structural Similarity (PCXSS)
[84]
This clustering method uses only the structural similarity between XML documents.
It defines a similarity measure called CPSim (Common Path Similarity), which is computed
between an XML document and the documents of a cluster and is based on the following
criteria:
• The number of common nodes between the document and the documents of the
  cluster;
• The number of common nodes in a given path of the XML document; and
• The ordering of the nodes of the XML document.
CPSim computed from the aforementioned criteria is then used by this incremental
clustering method.
Common Rare Path (CRP), 4RP (4-length Rare Path) and Word descriptor
[131]
The authors, Yao and Zarida, utilised the paths to cluster XML documents using both
structure and content features. Their experimental methods were applied on the INEX
2007 dataset. Three methods were proposed, namely CRP, 4RP and Word descriptor,
based on the paths. In the CRP clustering method, the complete root path which is the
full path starting from the root node to the text node is used to measure the similarity
between two XML documents. In the 4RP clustering method, a partial path of length 4
(containing 4 nodes) which contains the content in the text node is used to measure the
similarity between XML documents. Finally, the word descriptor utilises only the words
present in the document. All of these methods used a combination of partitional and
agglomerative clustering approaches to cluster the documents.
Doucet et al. [36]
This method represents the structure and the content features of the XML documents
in a VSM and then directly applies the K-means algorithm for clustering. It assigns a
weight to each of the structure and content features in order to integrate the two feature
types linearly in the VSM.
SOM-based approaches [60, 45]
Three approaches based on the neural network model SOM are used to benchmark the
proposed methods, namely SOM-SD, CSOM-SD and GraphSOM. The SOM-SD focuses
only on clustering using the structural properties, whereas CSOM-SD also uses the
contextual information. These two approaches were used in the evaluation of the INEX
IEEE dataset. GraphSOM is an extension of CSOM-SD that utilises a graph structure to
model the XML documents. By modelling the documents as graphs, the authors in [45]
claim that they can avoid the information loss inherent in vector-based representations.
3.6.2.3 Clustering using different tensor decompositions
In order to understand the impact of the progressive tensor decomposition algorithm
proposed in this thesis (detailed in Chapter 5), three clustering methods were used. These
clustering methods use the same methodology as the proposed clustering method
using tensors, HCX-T, but replace the proposed progressive decomposition algorithm
with state-of-the-art decomposition algorithms, namely CANDECOMP/PARAFAC (CP),
Tucker and MACH.
CP
This method uses the CANDECOMP/PARAFAC decomposition algorithm for decom-
posing the tensor model created using the structure and the content of the XML docu-
ments. In this method, the left singular matrix resulting from applying CP decomposition
on the tensor is used as an input for K-means clustering.
Tucker
This method uses the same tensor model as the Clustering using CP method but uses
the Tucker decomposition algorithm for decomposing the tensor.
MACH
This method uses the recent scalable, randomized decomposition technique, MACH
decomposition (discussed in the previous chapter), to decompose the tensor. MACH
randomly projects the original tensor to a reduced tensor with a smaller percentage of
entries (10% of the original tensor, as specified in [112]) and then uses the Tucker
decomposition to decompose the reduced tensor.
Using these representations and clustering methods, the proposed clustering methods
will be benchmarked using the evaluation metrics detailed in this chapter.
3.7 Chapter summary
This chapter has presented the experimental design for the experiments that will be con-
ducted in this research. It has also analysed the various datasets – synthetic and real-life
datasets – based on their attributes, which will provide a good understanding of the exper-
imental results for both frequent pattern mining and clustering. The details of the choice
of the datasets were also covered. The evaluation metrics that will be used for benchmark-
ing the proposed methods were also presented, along with the benchmarks that will be
used for comparing the proposed frequent pattern mining and clustering methods detailed
in Chapter 4 and 5 respectively.
An extensive empirical study will be carried out by varying the pre-processing steps,
the support threshold for frequent pattern mining and the clustering technique to achieve
the best clustering results for the datasets used. In addition, the clustering results from
both phases will be compared to understand the impact of clustering techniques using
the VSM and the TSM.
Chapter 4
Frequent Pattern Mining of XML
Documents
4.1 Introduction
This chapter introduces the frequent pattern mining methods that have been developed
in this thesis to generate concise frequent subtrees from a set of XML documents. It
proposes a suite of frequent subtree mining methods which generate concise representations
of induced and embedded subtrees using the prefix-based pattern growth approach. An
in-depth empirical analysis is conducted to evaluate the efficiency of the proposed methods
on both synthetic and real-life datasets against the state-of-the-art frequent pattern mining
methods.
Discovering frequent subtrees has practical significance such as improving user under-
standing about a data source, helping database indexing and access method design and
serving as the first step in classifying and clustering tree-structured data [67, 135]. The
main aim of applying frequent pattern mining on the structure of XML documents in this
research is to get the concise representation of the structure of the document collection.
91
This permits a reduction in dimensionality for the subsequent application, which in this
research is clustering. Hence, in this phase the frequent subtrees are generated for a
given user-defined support threshold. For the purpose of clustering, the content of the
XML documents is then extracted using these frequent subtrees.
However, as the size of the document trees grows, the number of frequent subtrees usually
grows exponentially, especially when there is a very large number of nodes in a tree.
There are two consequences resulting from this exponential growth. Firstly, it
causes difficulties in analysing the results. The sheer number of generated patterns makes it
difficult to have a comprehensive explanation. In a way, it defeats the purpose of applying
frequent pattern mining, that is, getting frequent or common patterns that can explain the
dataset. Secondly, the frequent subtree mining algorithm could become intractable. An
attempt to overcome this problem by increasing the support threshold could result in the
loss of important and interesting patterns. Hence, to alleviate the explosion in the number
of frequent subtrees, this research aims to generate concise representations by restricting
the number of frequent subtrees.
This chapter begins with the basic pre-processing steps required to convert an XML
document into tree structure suitable for frequent subtree mining. It then discusses the
different types of subtrees based on their conciseness, relationships between the nodes and
the constraints that could be applied on them. The details of frequent subtree mining
methods using the pattern growth techniques are then provided to understand the basics
of the prefix-based pattern growth approach and the benefits of it over the “generate-
and-test” approach. The remainder of the chapter provides the details about the various
methods that are developed in this thesis to generate concise frequent subtrees using the
prefix-based pattern growth approach. It further details the individual methods using
these techniques to efficiently generate the different types of concise frequent subtrees.
The experimental section evaluates the proposed methods and compares them with the
state-of-the-art methods. Finally, the analysis of the experimental results is presented in
the discussion section.
4.2 Pre-Processing of the structure in XML documents
Most of the tree mining methods cannot be applied directly on XML documents. They
often need a pre-processing step, which uses the XML documents as input and outputs a
rooted, ordered, labelled tree for each document. The rooted, ordered and labelled tree
reflects the tree structure of the original XML document, where the node labels are the
tags of the XML document. Then the documents will be further transformed depending
on the document model used by the frequent pattern mining task.
The pre-processing of the structure of XML documents involves three sub-phases as
shown in Figure 4.1. They are:
• Parsing;
• Representation; and
• Duplicate branches removal.
Figure 4.1: The pre-processing phase for the structure of XML documents
Parsing
Parsing of XML documents can be done using a Simple API for XML (SAX) or a
Document Object Model (DOM) parser. These two parsers take very different approaches:
a SAX parser provides a sequence of tags or terms in their order of occurrence in the XML
document, while a DOM parser provides a hierarchical object model in the form of a tree
of nodes, where the nodes are the tags of the XML document. Since the DOM parser
preserves the hierarchical relationships among the nodes, it is used in this research to
extract the tree structure of the XML documents.
Each XML document in the dataset is parsed and modelled as a rooted labelled ordered
document tree. The document tree is rooted and labelled since a root node always exists
in the document tree and all the nodes are labelled using the tag names. The left-to-right
ordering is preserved among the child nodes of a given parent in the document tree and
therefore they are ordered.
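A minimal sketch of this step using Python's built-in DOM parser is shown below. The tuple representation (tag, children) is an illustrative choice, not the data structure used in the thesis; child order preserves the left-to-right order of the elements.

```python
from xml.dom.minidom import parseString

def document_tree(xml_text):
    """Parse an XML document into a rooted, labelled, ordered tree.

    Each node is a pair (tag, [children]); only element nodes are
    kept, so the node labels are exactly the XML tags.
    """
    ELEMENT_NODE = 1

    def build(node):
        children = [build(ch) for ch in node.childNodes
                    if ch.nodeType == ELEMENT_NODE]
        return (node.tagName, children)

    dom = parseString(xml_text)
    return build(dom.documentElement)
```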
Representation
The document trees need to be represented in a way that is suitable for mining in the
subsequent phase. A popular representation for trees, the depth-first string format [22],
is used to represent the document trees. The depth-first string encoding traverses a tree
in depth-first order: it represents the depth-first traversal of a given document tree in a
string-like format in which a “-1” is emitted after each node's subtree to represent
backtracking.

In a document tree dataset DT, for a document tree DT_i with only one node having
a label X, the depth-first string of DT_i is S(DT_i) = < X >. For a document tree
DT_i with multiple nodes having labels X, Y, Z and K, the depth-first string is of the
form S(DT_i) = < X^a Y^b Z^c -1 ... K^n -1 -1 -1 >, where the superscripts
a, b, c, ..., n are the increasing positions of the nodes in the pre-order traversal of
the tree.
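The encoding can be sketched as follows, using the (tag, children) tuple form and following the convention of the tables in this chapter, where a backtrack marker is emitted after every node, including the root; position superscripts are omitted for brevity.

```python
def depth_first_string(tree):
    """Depth-first (pre-order) string encoding of a labelled ordered
    tree, with a "-1" emitted on every backtrack.

    tree: a pair (label, [children]).
    """
    label, children = tree
    parts = [label]
    for child in children:
        # Recurse into each subtree in left-to-right order.
        parts.append(depth_first_string(child))
    parts.append("-1")  # backtrack out of this node
    return " ".join(parts)
```

For instance, tree 1 of Table 4.1 (A with children B and E, and B with child C) encodes to "A B C -1 -1 E -1 -1".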
Duplicate branches removal
Many real-life datasets contain a large number of document trees with duplicate
branches. These duplicate branches carry repeated information and cause additional
overheads in the mining process due to their redundancy. Hence, they need to be
removed. In order to remove the duplicate branches, the document tree is converted
into a series of root-to-leaf paths. The duplicate paths are then identified by string
matching and removed, and the remaining paths are combined to reconstruct the
document tree.
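The path-deduplication step can be sketched as below; the function name is hypothetical, and each path is represented simply as a list of node labels joined into a string for the matching step.

```python
def remove_duplicate_branches(paths):
    """Drop repeated root-to-leaf paths by exact string matching,
    keeping the first occurrence and preserving the original order.

    paths: list of paths, each a list of node labels from the root.
    """
    seen, unique = set(), []
    for path in paths:
        key = "/".join(path)  # string form used for the matching
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```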
4.3 Types of subtrees
In general, frequent pattern mining methods are applied to generate two types of subtrees,
induced and embedded, which preserve the parent-child and the ancestor-descendant re-
lationships among their nodes respectively. This section defines various types of concise
representations based on induced and embedded subtrees. The different types of subtrees
and the concepts of the proposed frequent subtree mining approach will be explained
using the example document tree dataset DT shown in Table 4.1.
Table 4.1: Document tree dataset example (DT)

Tree Id   Tree Pre-order string
1         < A^1 B^2 C^3 -1 -1 E^4 -1 -1 >
2         < A^1 B^2 C^3 -1 -1 E^4 F^5 -1 -1 -1 >
3         < A^1 E^2 F^3 -1 G^4 -1 -1 >
4.3.1 Concise Frequent Induced (CFI) subtrees
There are two types of concise frequent induced subtrees, closed and maximal, as defined
in Chapter 2. This research defines these concise representations of induced subtrees as
follows.
Definition: Closed Frequent Induced (CFI) subtree
In DT, let there be two frequent induced subtrees DT′ and DT′′. The frequent induced
subtree DT′ is said to be closed with respect to DT′′ iff (1) DT′ ⊃t DT′′, where ⊃t denotes
the supertree relationship; (2) supp(DT′) = supp(DT′′); (3) there exists no supertree of
DT′ having the same support as DT′; and (4) DT′ is an induced supertree of DT′′. This
property is called the induced closure, and DT′ is a CFI subtree in DT.
Definition: Maximal Frequent Induced (MFI) subtree
In DT, let two frequent induced subtrees DT′ and DT′′ exist. The frequent induced
subtree DT′ is said to be maximal with respect to DT′′ iff (1) DT′ ⊃t DT′′ and
supp(DT′) ≥ supp(DT′′); (2) there exists no frequent supertree of DT′; and (3) DT′ is
the induced supertree of DT′′. This property is called the induced maximality, and DT′
is an MFI subtree in DT.
These two types of subtrees and their benefits will be explained using the running
example dataset DT in Table 4.1. Consider Table 4.2, which lists the frequent subtrees
generated using the prefix-based pattern growth approach at a support threshold of
min_supp = 2. It can be seen that subtrees such as (< B^1 -1 >: 2), (< C^1 -1 >: 2),
(< A^1 B^2 -1 -1 >: 2) and (< B^1 C^2 -1 -1 >: 2) are subtrees of
(< A^1 B^2 C^3 -1 -1 -1 >: 2) with the same support. On applying the closure property,
it can be seen from Table 4.3 that there are 3 CFI subtrees for the example DT, in
comparison to the 13 frequent induced subtrees shown in Table 4.2. It is interesting
to note that closure has reduced the number of frequent induced subtrees from 13 to 3
in this example. As only the subtrees with the same support are checked for closure,
this results in no loss of information.
On the other hand, by applying maximality, it can be seen from Table 4.4 that there
are 2 MFI subtrees in comparison to the 13 frequent induced subtrees shown in Table 4.2.
The subtree with two nodes, (< A^1 E^2 -1 -1 >: 3), is an induced subtree of the subtree
with three nodes, (< A^1 E^2 F^3 -1 -1 -1 >: 2). Hence, by applying maximality,
(< A^1 E^2 -1 -1 >: 3) can be eliminated, as its supertree, (< A^1 E^2 F^3 -1 -1 -1 >: 2),
is present.

Table 4.2: Frequent induced subtrees generated from DT (in Table 4.1) using the prefix-based pattern growth approach

No. of Nodes   Frequent Induced Subtrees
1              (< A^1 -1 >: 3), (< B^1 -1 >: 2), (< C^1 -1 >: 2), (< E^1 -1 >: 3), (< F^1 -1 >: 2)
2              (< A^1 B^2 -1 -1 >: 2), (< B^1 C^2 -1 -1 >: 2), (< A^1 E^2 -1 -1 >: 3), (< E^1 F^2 -1 -1 >: 2)
3              (< A^1 B^2 C^3 -1 -1 -1 >: 2), (< A^1 B^2 -1 E^3 -1 -1 >: 2), (< A^1 E^2 F^3 -1 -1 -1 >: 2)
4              (< A^1 B^2 C^3 -1 -1 E^4 -1 -1 >: 2)

Table 4.3: Closed Frequent Induced subtrees generated from DT (in Table 4.1)

No. of Nodes   Closed Frequent Induced Subtrees
2              (< A^1 E^2 -1 -1 >: 3)
3              (< A^1 E^2 F^3 -1 -1 -1 >: 2)
4              (< A^1 B^2 C^3 -1 -1 E^4 -1 -1 >: 2)

Table 4.4: Maximal Frequent Induced subtrees generated from DT (in Table 4.1)

No. of Nodes   Maximal Frequent Induced Subtrees
3              (< A^1 E^2 F^3 -1 -1 -1 >: 2)
4              (< A^1 B^2 C^3 -1 -1 E^4 -1 -1 >: 2)
The research now proposes a new type of concise frequent subtrees based on their
length, the Length Constrained Concise Frequent Induced subtree. The length constrained
concise frequent induced subtrees are used in this method for the following reasons:
• Extracting all the concise frequent induced subtrees is computationally expensive
  for datasets with a high branching factor;
• Not all concise frequent induced subtrees are required when they are utilised to
  retrieve the content for clustering; and
• For some datasets, the longer concise frequent induced subtrees can become too
  specific and hence can degrade the quality of the clustering solutions that use
  these patterns.
Now the length constrained frequent closed and maximal induced subtrees will be
discussed.
Definition: Length Constrained Closed Frequent Induced (CFIConst) subtree
In DT, for a given support threshold, min_supp, and a length constraint, const, let
there be two frequent subtrees DT′ and DT′′. The frequent subtree DT′ is closed with
respect to DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′); (2) there exists no
DT* ⊃t DT′ such that supp(DT*) = supp(DT′); (3) DT′ is an induced supertree of DT′′;
and (4) len(DT′) ≤ const. This property is called the length constrained induced closure,
and DT′ is a length constrained CFI subtree in DT, denoted by CFIConst.
Definition: Length Constrained Maximal Frequent Induced (MFIConst) subtree
In DT, for a given min supp and const, let there be two frequent subtrees DT′ and DT′′. The frequent subtree DT′ is maximal with respect to DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) ≥ supp(DT′′), (2) ∄ DT∗ ⊃t DT′ such that supp(DT∗) ≥ min supp and supp(DT∗) ≠ supp(DT′), (3) DT′ is the induced supertree of DT′′, and (4) len(DT′) ≤ const. This property is called the length constrained induced maximality and DT′ is a length constrained MFI subtree in DT, denoted by MFIConst.
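Clause (4) of the above definitions can be illustrated with a minimal Python sketch. Note that this only shows the length test itself as a post-filter over pre-order token encodings; in the proposed miners the constraint is enforced during projection, which is why Table 4.5 contains subtrees that a plain filter over Table 4.3 would not recover. The token-list encoding and the function names are illustrative assumptions, not part of the thesis's implementation.

```python
def tree_length(preorder):
    """Number of labelled nodes in a pre-order encoding ('-1' marks a backtrack)."""
    return sum(1 for tok in preorder if tok != "-1")

def within_length(subtrees, const):
    """Keep only the (subtree, support) pairs whose length does not exceed const."""
    return [(t, s) for t, s in subtrees if tree_length(t) <= const]

# With const = 3, the 4-node subtree <A1B2C3-1-1E4-1-1> is excluded.
cfi = [(["A", "E", "-1", "-1"], 3),
       (["A", "B", "C", "-1", "-1", "E", "-1", "-1"], 2)]
kept = within_length(cfi, 3)
```

For instance, `tree_length` of the encoding of (&lt; A1B2C3 − 1− 1E4 − 1− 1 &gt;: 2) is 4, so it fails the const = 3 test, matching the discussion of Tables 4.5 and 4.6 below.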
Tables 4.5 and 4.6 use the running example in Table 4.1 to show the length constrained closed and maximal frequent induced subtrees with const=3 respectively.
Table 4.5: Length Constrained Closed Frequent Induced subtrees generated from DT (in Table 4.1)
#Nodes Length Constrained Closed Frequent Induced subtrees
2 (< A1E2 − 1− 1 >: 3)
3 (< A1B2C3 − 1− 1− 1 >: 2), (< A1E2F 3 − 1− 1− 1 >: 2), (< A1B2 − 1E3 − 1− 1 >: 2)
It should be noted that in this example the length constrained concise frequent in-
duced subtrees produce a greater number of concise frequent induced subtrees but the
Table 4.6: Length Constrained Maximal Frequent Induced subtrees generated from DT (in Table 4.1)
#Nodes Length Constrained Maximal Frequent Induced subtrees
3 (< A1B2C3 − 1− 1− 1 >: 2), (< A1E2F 3 − 1− 1− 1 >: 2), (< A1B2 − 1E3 − 1− 1 >: 2)
length of each generated subtree is controlled. This is because the threshold condition on the constraint length, const = 3, avoids the generation of the subtree (< A1B2C3−1−1E4−1−1 >: 2), which is the supertree for (< A1B2C3−1−1−1 >: 2) and (< A1B2 − 1E3 − 1 − 1 >: 2). However, if the constraint length (const) is set to 4 in this example, all the concise frequent subtrees could be discovered. Hence, determining the correct constraint length helps to avoid excessive information loss and to improve the computational efficiency of the length constrained CFI frequent pattern mining methods.
4.3.2 Concise Frequent Embedded (CFE) subtrees
The concise frequent induced subtrees discussed in the previous subsection present a strict
relationship among their nodes by using only a parent-child relationship. This reduces the
possibility of discovering some hidden similarities between the trees. In order to increase
the prospect of identifying the hidden similarity, the embedded subtrees are utilised to
impose a less strict relationship by allowing the ancestor-descendant relationship among
their nodes [18, 135]. As the embedded subtrees identify hidden relationships, the number
of embedded subtrees generated is larger than the number of induced subtrees and this
could result in an information explosion when the average depth of the tree is large. In
order to control the number of embedded subtrees, it is essential to generate only the
concise representations called the Concise Frequent Embedded subtrees.
As was the case with concise frequent induced subtrees, this research defines four types of concise frequent embedded subtrees using the ancestor-descendant relationship among their nodes.
Definition: Closed Frequent Embedded (CFE) subtree
In DT, let there be two frequent embedded subtrees DT′ and DT′′. The frequent embedded subtree DT′ is closed with respect to DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) ∄ DT∗ ⊃t DT′ such that supp(DT∗) = supp(DT′), and (3) DT′ is the embedded supertree of DT′′. This property is called an embedded closure and DT′ is the CFE subtree in DT.
Definition: Maximal Frequent Embedded (MFE) subtree
In DT, let there be two frequent embedded subtrees DT′ and DT′′. The frequent embedded subtree DT′ is said to be maximal with respect to DT′′ iff (1) DT′ ⊃t DT′′, (2) ∄ DT∗ ⊃t DT′ such that supp(DT∗) ≥ min supp and supp(DT∗) ≠ supp(DT′), and (3) DT′ is the embedded supertree of DT′′. This property is called an embedded maximality and DT′ is the MFE subtree in DT.
Definition: Length Constrained Closed Frequent Embedded (CFEConst) subtree
In DT, for a given min supp and const, let there be two frequent embedded subtrees DT′ and DT′′. The frequent embedded subtree DT′ is closed with respect to DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) ∄ DT∗ ⊃t DT′ such that supp(DT∗) = supp(DT′), (3) DT′ is the embedded supertree of DT′′, and (4) len(DT′) ≤ const. This property is called the length constrained embedded closure and DT′ is the length constrained CFE subtree in DT.
Definition: Length Constrained Maximal Frequent Embedded (MFEConst) subtree
In DT, for a given min supp and const, let there be two frequent subtrees DT′ and DT′′. The frequent subtree DT′ is maximal with respect to DT′′ iff (1) DT′ ⊃t DT′′ and supp(DT′) = supp(DT′′), (2) ∄ DT∗ ⊃t DT′ such that supp(DT∗) ≥ min supp and supp(DT∗) ≠ supp(DT′), (3) DT′ is the embedded supertree of DT′′, and (4) len(DT′) ≤ const. This property is called the length constrained embedded maximality and DT′ is the length constrained MFE subtree in DT.
Considering the running example dataset DT in Table 4.1, it can be seen that there are
15 frequent embedded subtrees in DT, as listed in Table 4.7, as compared to the 13 frequent
induced subtrees listed in Table 4.2. It can be noted that (< A1B2C3−1−1E4−1−1 >: 2)
is the supertree of the 8 subtrees that have the same support value of 2 and can replace
them. Similarly, (< A1E2 − 1− 1 >: 3) and (< A1F 2 − 1− 1 >: 2) can also replace their
respective subtrees having the same support.
Table 4.7: Frequent embedded subtrees generated from DT (in Table 4.1) using prefix-pattern growth methods
#Nodes Frequent Embedded Subtrees
1 (< A1 − 1 >: 3), (< B1 − 1 >: 2), (< C1 − 1 >: 2),(< E1 − 1 >: 3), (< F 1 − 1 >: 2)
2 (< A1B2 − 1− 1 >: 2), (< B1C2 − 1− 1 >: 2),(< A1C2 − 1− 1 >: 2), (< A1E2 − 1− 1 >: 3), (< A1F 2 − 1− 1 >: 2)
3 (< A1B2C3 − 1− 1− 1 >: 2), (< A1B2 − 1E3 − 1− 1 >: 2),(< A1C2 − 1E3 − 1− 1 >: 2), (< A1E2F 3 − 1− 1− 1 >: 2)
4 (< A1B2C3 − 1− 1E4 − 1− 1 >: 2)
The CFE subtrees result set will be freqT (DT ): (< A1E2 − 1 − 1 >: 3), (< A1F 2 − 1 − 1 >: 2), (< A1B2C3 − 1 − 1E4 − 1 − 1 >: 2), as shown in Table 4.8. The MFE subtrees result set will be freqT (DT ): (< A1F 2 − 1 − 1 >: 2), (< A1B2C3 − 1 − 1E4 − 1 − 1 >: 2), as shown in Table 4.9; the subtree (< A1E2 − 1 − 1 >: 3) is not maximal as its supertree is frequent. Tables 4.10 and 4.11 show the length constrained closed and maximal embedded subtrees generated for the sample XML dataset with the length constraint const=3.
Table 4.8: Closed Frequent Embedded (CFE) subtrees generated from DT (in Table 4.1)
#Nodes Closed Frequent Embedded Subtrees
2 (< A1E2 − 1− 1 >: 3), (< A1F 2 − 1− 1 >: 2)
4 (< A1B2C3 − 1− 1E4 − 1− 1 >: 2)
Table 4.9: Maximal Frequent Embedded (MFE) subtrees generated from DT (in Table 4.1)
#Nodes Maximal Frequent Embedded Subtrees
2 (< A1F 2 − 1− 1 >: 2)
4 (< A1B2C3 − 1− 1E4 − 1− 1 >: 2)
Table 4.10: Length Constrained Closed Frequent Embedded (CFEConst) subtrees generated from DT (in Table 4.1)
#Nodes Length Constrained Closed Frequent Embedded Subtrees
2 (< A1E2 − 1− 1 >: 3)
3 (< A1B2C3 − 1− 1− 1 >: 2), (< A1B2 − 1E3 − 1− 1 >: 2), (< A1C2 − 1E3 − 1− 1 >: 2), (< A1E2F 3 − 1− 1− 1 >: 2)
Table 4.11: Length Constrained Maximal Frequent Embedded (MFEConst) subtrees generated from DT (in Table 4.1)
#Nodes Length Constrained Maximal Frequent Embedded subtrees
3 (< A1B2C3 − 1− 1− 1 >: 2), (< A1B2 − 1E3 − 1− 1 >: 2), (< A1C2 − 1E3 − 1− 1 >: 2), (< A1E2F 3 − 1− 1− 1 >: 2)
The next section provides the background of prefix-based pattern growth for frequent subtree mining, to explain the process of the prefix-based pattern growth technique and its benefits over the “generate-and-test” approach.
4.4 Frequent subtree mining: Background
The prefix-based pattern growth for mining frequent subtrees involves the following three
phases:
• The 1-Length frequent subtree generation;
• Projecting the dataset using the prefix trees; and
• Mining the prefix tree projected dataset.
4.4.1 The 1-Length frequent subtree generation
For a given user-defined minimum support (min supp), the prefix-based subtree growth technique starts with a scan of the document tree dataset DT to determine the 1-Length frequent subtrees whose support is at least min supp. A 1-Length frequent subtree, containing a single node, is represented using the following subtree pre-order string format: SubX = (< Xa − 1 >: Supp). The subtree pre-order representation includes an element called Supp to indicate the support value of the subtree.
With the running example in Table 4.1, applying this step with a user-defined support threshold of min supp=2 results in the following 1-Length frequent subtrees: (< A1 − 1 >: 3), (< B1 − 1 >: 2), (< C1 − 1 >: 2), (< E1 − 1 >: 3) and (< F 1 − 1 >: 2). The subtree (< G1 − 1 >: 1) is infrequent as it occurs only once, which is less than the min supp value, and hence it is not included in the output.
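This first scan can be sketched in a few lines of Python. The trees of the running example are encoded here as pre-order token lists (with '-1' closing a node); their exact shapes are an assumption reconstructed from the projected instances in Tables 4.12 to 4.14, and the function name is illustrative only.

```python
from collections import Counter

# Running example DT (shapes assumed from Tables 4.12-4.14); '-1' closes a node.
db = [
    ["A", "B", "C", "-1", "-1", "E", "-1", "-1"],            # tree 1: A(B(C), E)
    ["A", "B", "C", "-1", "-1", "E", "F", "-1", "-1", "-1"], # tree 2: A(B(C), E(F))
    ["A", "E", "F", "-1", "-1", "G", "-1", "-1"],            # tree 3: A(E(F), G)
]

def one_length_frequent(db, min_supp):
    """Document support of every label, counted at most once per tree."""
    supp = Counter()
    for tree in db:
        supp.update({tok for tok in tree if tok != "-1"})
    return {lab: s for lab, s in supp.items() if s >= min_supp}

freq = one_length_frequent(db, 2)
```

With min supp = 2, the sketch reproduces the five frequent labels listed above and discards the label 'G', which occurs in only one tree.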
Definition: Prefix-Tree
Let there be a document tree DTp with m nodes and let Tp be a tree with n nodes where
n ≤ m. The pre-order scanning of the document tree DTp from its root until its n-th node
results in a tree Tj. If Tj is isomorphic (or structurally identical) to Tp, then Tp is called
the prefix-tree of DTp .
Figure 4.2 shows the prefix-trees for the document tree DTp illustrated in Figure 4.2(a). Following the pre-order string representation, the 4 prefix-trees rooted at A, containing 1, 2, 3 and 4 nodes, are identified for the document tree DTp. As the node ‘E’ follows the prefix tree < A1B2C3 − 1− 1− 1 > in pre-order, < A1E2 − 1− 1 > is not a prefix-tree of DTp.
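The enumeration of prefix-trees follows directly from the definition: take the first n nodes in pre-order and close every branch still open. The following sketch, over the same token-list encoding assumed earlier (the function name is illustrative), reproduces the four prefix-trees of Figure 4.2.

```python
def prefix_trees(preorder):
    """All prefix-trees of a document tree: for each node taken in pre-order,
    keep the tokens seen so far and close every still-open branch with '-1'."""
    toks, depth, result = [], 0, []
    for tok in preorder:
        toks.append(tok)
        if tok == "-1":
            depth -= 1
        else:
            depth += 1
            result.append(toks + ["-1"] * depth)  # copy + closing tokens
    return result

# Document tree of Figure 4.2(a): A(B(C), E).
pts = prefix_trees(["A", "B", "C", "-1", "-1", "E", "-1", "-1"])
```

The result contains exactly the 1-, 2-, 3- and 4-node prefix-trees of the text, and no tree such as < A1E2 − 1− 1 >, since 'E' is only reached after B's subtree in pre-order.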
[Figure shows the document tree DTp and its four prefix-trees.]
Figure 4.2: (a) a document tree DTp; (b) Prefix trees of (a)
4.4.2 Projecting the dataset using the prefix trees
The next step in this process involves projecting the dataset using the prefix trees. The
process is started by using the 1-Length frequent subtrees as prefix-trees. To build the
Tp-prefix-projected dataset, every document tree in DT is checked to establish whether it
contains the prefix-tree Tp. If a document tree DTp contains Tp then its projected instance
for Tp is constructed.
Definition: The Prefix-tree projected instance
Consider a prefix tree Tp with n nodes. If a document tree DTp ∈ DT with m nodes (m ≥ n) exists such that Tp ⊂t DTp, then the Tp-prefix projected instance of DTp is the pre-order scan of DTp from the (n+1)-th node to the m-th node.
The Prefix-tree projected dataset can be defined as follows.
Definition: The Prefix-tree projected dataset
A Tp-prefix-tree projected dataset is obtained by constructing the Tp-prefix-tree projected
instances for all the document trees in DT.
In the running example with min supp=2, DT is projected using the generated 1-Length frequent subtrees as prefix-trees. The prefix-trees of the document tree with tree-id 1 (DT1) are < A1 − 1 >, < A1B2 − 1− 1 >, < A1B2C3 − 1− 1− 1 > and < A1B2C3 − 1− 1E4 − 1− 1 >.
Tables 4.12, 4.13 and 4.14 provide the projected instances dataset of the prefix-trees
< A1−1 >, < B1−1 > and < A1B2−1−1 > respectively. To improve the efficiency, the
projected instances from the infrequent 1-Length subtrees are eliminated. It can be noted
in Table 4.12 that the tree with Tree Id 3 does not contain the node ‘G’ as it is infrequent
and hence this node is eliminated in projection. The generated projected instances are
mined using the technique detailed in the following subsection.
Table 4.12: < A1 − 1 > projected instances dataset
Tree Id Pre-order strings of Trees
1 B2C3 − 1− 1E4 − 1
2 B2C3 − 1− 1E4F 5 − 1− 1
3 E2F 3 − 1− 1
Table 4.13: < B1 − 1 > projected instances dataset
Tree Id Pre-order strings of Trees
1 C3 − 1
2 C3 − 1
Table 4.14: < A1B2 − 1− 1 > projected instances dataset
Tree Id Pre-order strings of Trees
1 C3 − 1
2 C3 − 1
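Under the simplifying assumption that the 1-Length prefix is the root label, as is the case for < A1 − 1 > in the running example, the projected dataset of Table 4.12 can be reproduced with the following Python sketch. The tree shapes, the removal of infrequent nodes before projection, and the names prune and project_root are illustrative assumptions; general occurrence matching for deeper prefix-trees is more involved.

```python
# Running example trees (shapes assumed from Tables 4.12-4.14); '-1' closes a node.
db = [
    ["A", "B", "C", "-1", "-1", "E", "-1", "-1"],            # tree 1
    ["A", "B", "C", "-1", "-1", "E", "F", "-1", "-1", "-1"], # tree 2
    ["A", "E", "F", "-1", "-1", "G", "-1", "-1"],            # tree 3
]

def prune(tree, frequent):
    """Remove every infrequent node together with its entire subtree."""
    out, skip = [], 0
    for tok in tree:
        if skip:                      # inside a skipped subtree
            skip += -1 if tok == "-1" else 1
            continue
        if tok != "-1" and tok not in frequent:
            skip = 1                  # start skipping this subtree
            continue
        out.append(tok)
    return out

def project_root(db, stem, frequent):
    """<stem>-projected dataset when the 1-length prefix is the root label:
    each matching tree contributes its pre-order remainder, without the
    final '-1' that closes the stem node."""
    proj = {}
    for tid, tree in enumerate(db, 1):
        t = prune(tree, frequent)
        if t and t[0] == stem:
            proj[tid] = t[1:-1]
    return proj

proj = project_root(db, "A", {"A", "B", "C", "E", "F"})
```

The projected instances match Table 4.12, including the elimination of the infrequent node 'G' from tree 3 noted in the text.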
4.4.3 Mining the prefix-tree projected dataset
As the next step in the prefix-pattern growth, each of the prefix-tree projected datasets is mined to identify the Growth Nodes (GNs).
Definition: Growth Node (GN)
Given two prefix-trees Tp and T′p with m and m+1 nodes respectively, where Tp is the prefix of T′p, if a node n occurs in T′p but not in Tp, then the node n is the Growth Node (GN) of Tp with respect to T′p.
If a GN is frequent then the prefix-tree with the GN forms the frequent subtree.
For each of the frequent GNs the corresponding projection is constructed and mined
recursively until there are no more frequent GNs to be projected.
Mining the prefix-tree projected dataset of < A1 − 1 > in the running example (Table 4.12), there are two GNs, namely the nodes ‘B’ and ‘E’, as they are the children of < A1 − 1 >; this example uses only a parent-child relationship and not an ancestor-descendant relationship. If the latter relationship were considered, then ‘C’ and ‘F’ would also be considered as GNs. The nodes ‘B’ and ‘E’ are frequent, as they have a support value no less than min supp. These frequent GNs combine with the prefix-tree < A1 − 1 > to
form their corresponding new prefix-trees < A1B2 − 1 − 1 >, < A1E2 − 1 − 1 >. These
prefix-trees will then be used to identify further GNs. For instance, for the partitioned dataset < A1 − 1 > provided in Table 4.12, the GNs are the nodes labelled ‘B’ and ‘E’, as they occur as the first nodes in the projected instances. The supports of the GNs ‘B’ and ‘E’ are 2 and 3 respectively in the < A1 − 1 > prefix-projected dataset; hence ‘B’ and ‘E’ are frequent GNs. Using these two frequent GNs, two separate projections are constructed and mined for the frequent subtrees. Table 4.14 shows the projection for < A1B2 − 1 − 1 >, but the projection for < A1B2 − 1C3 − 1− 1 > is empty and hence this projection is terminated.
The projection for < A1B2 − 1 − 1 > is then mined recursively until all the frequent
subtrees are identified.
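The identification of frequent GNs under the parent-child relationship can be sketched as follows, over the < A1 − 1 >-projected dataset of Table 4.12 in the token-list encoding assumed earlier. A child of the prefix-tree's last node appears at relative depth 0 of a projected instance; the function name is illustrative only.

```python
from collections import Counter

# <A1-1>-projected dataset from Table 4.12, as pre-order token lists.
proj = {
    1: ["B", "C", "-1", "-1", "E", "-1"],
    2: ["B", "C", "-1", "-1", "E", "F", "-1", "-1"],
    3: ["E", "F", "-1", "-1"],
}

def frequent_growth_nodes(proj, min_supp):
    """Frequent GNs under the parent-child relationship: labels occurring at
    relative depth 0 of a projected instance, support counted once per tree."""
    supp = Counter()
    for inst in proj.values():
        depth, seen = 0, set()
        for tok in inst:
            if tok == "-1":
                depth -= 1
            else:
                if depth == 0:   # a child of the prefix-tree's last node
                    seen.add(tok)
                depth += 1
        supp.update(seen)
    return {lab: s for lab, s in supp.items() if s >= min_supp}

gns = frequent_growth_nodes(proj, 2)
```

The sketch recovers exactly the two frequent GNs 'B' (support 2) and 'E' (support 3) discussed above; 'C' and 'F' sit at deeper relative depths and would only qualify under the ancestor-descendant relationship.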
So far, the basic process of generating frequent subtrees using the prefix-pattern growth
technique has been discussed. The following section provides the details of the techniques
for generating concise frequent subtrees on both induced and embedded subtrees. It will
also present the algorithm using these techniques and the frequent subtree mining methods
for generating the different types of subtrees.
4.5 Concise frequent subtree mining: Proposed techniques
Unlike the situation in itemset mining, generating closed or maximal subtrees from trees
is a challenge, due to the presence of hierarchical relationships and the need to preserve
these relationships while generating concise frequent subtrees. A naïve approach to the
generation of closed or maximal frequent subtrees is first to generate all the frequent
subtrees and then to eliminate the subtrees based on their support by checking the closure
or maximality. This is an expensive task when there are a large number of frequent
subtrees generated, or when the frequent subtree mining could not be completed. Moreover, identifying concise frequent subtrees using this naïve approach adds an extra step and results in computational overhead to the frequent subtree mining
process. Hence, it is essential to identify an efficient method that can provide the concise
result set as well as improve the efficiency of the frequent subtree mining process. There
are a number of concise pattern mining approaches proposed for frequent itemset and sequential mining [117, 127]. Unlike itemsets or sequences, trees can have multiple branches; hence, closure checking using the traditional techniques cannot be applied to tree structured datasets.
This thesis proposes two techniques for effectively mining concise frequent subtrees
using the pattern-growth approach discussed below. These involve:
1. Search space reduction using the backward scan; and
2. Node extension concise checking.
4.5.1 Search space reduction using the backward scan
This technique is applied after the generation of 1-Length frequent subtrees. It conducts
a backward scan of the document tree dataset, DT , to reduce the search space using the
following conditions:
Condition 1: Backward scan
Let there be two 1-Length prefix trees Tp and T′p, with nodes labelled v and v′ respectively, in DT. If v′ is the parent or an ancestor node of v in all the trees in DT, then the projection of Tp is stopped, as the projection of the subtree T′p based on v′ will include all the subtrees generated using the prefix tree Tp.
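Condition 1, restricted to the parent case for simplicity, can be sketched as follows over the running example (tree shapes and function names are illustrative assumptions).

```python
db = [
    ["A", "B", "C", "-1", "-1", "E", "-1", "-1"],            # tree 1
    ["A", "B", "C", "-1", "-1", "E", "F", "-1", "-1", "-1"], # tree 2
    ["A", "E", "F", "-1", "-1", "G", "-1", "-1"],            # tree 3
]

def parents_of(db, label):
    """Collect the parent labels under which `label` occurs across DT."""
    parents = set()
    for tree in db:
        stack = []  # labels on the path from the root to the current node
        for tok in tree:
            if tok == "-1":
                stack.pop()
            else:
                if stack and tok == label:
                    parents.add(stack[-1])
                stack.append(tok)
    return parents

def skip_projection(db, label):
    """Condition 1 (parent case only): skip projecting `label` when every
    occurrence has the same parent, whose projection subsumes it."""
    return len(parents_of(db, label)) == 1
```

For example, 'E' always occurs under 'A', so projecting < E1 − 1 > is unnecessary once < A1 − 1 > is projected; the root label 'A' has no parent and is therefore always projected.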
As the projection of the parent or the ancestor node includes the projection of the
child node or the descendant node respectively, this condition aids in reducing the number
of projections, and hence the search space is reduced effectively. However, applying this condition alone does not yield the complete set of concise frequent subtrees. Due to the nature of the repeated projections, there is a possibility that some of the generated subtrees are not concise. To check whether a generated frequent subtree is concise, node extension concise checking is performed.
4.5.2 Node extension concise checking
The concise checking tests for the existence of one of the two properties, closure or maximality, in the tree dataset. It is applied to determine whether a subtree extended by a node is concise or not. If the extended subtree is found to be concise, then the original subtree is not considered concise.
According to the definitions of the concise frequent subtrees, a prefix-tree Tp is not
concise if at least one prefix-tree T ′p exists with the same support as that of Tp or with
a support greater than or equal to min supp. With the use of the prefix-based subtree
growth technique to generate frequent subtrees, the prefix-tree T ′p can occur in two possible
ways:
1. In the same prefix-projected dataset as Tp; and
2. In a different prefix-projected dataset from Tp.
Considering the example tree dataset, the prefix-tree T′p = < A1B2 − 1− 1 > occurs in the same prefix-projected dataset as Tp = < A1 − 1 >. The conciseness of Tp with respect to T′p can be checked by using the Growth Node (defined in Section 4.4) extension closure checking or the Growth Node extension maximality checking condition, according to the type of concise trees being generated.
Condition 2a: Growth Node extension closure checking
A prefix tree Tp can be extended to T′p in the same prefix-projected dataset using its GNs. If any of the GNs for a given prefix-projected dataset has the same support as Tp, then Tp is not closed.
Condition 2b: Growth Node extension maximality checking
A prefix tree Tp can be extended to T′p in the same prefix-projected dataset using its GNs. If any of the GNs for a given prefix-projected dataset has a support greater than or equal to min supp, then Tp is not maximal.
The growth node extension checking is not a computationally expensive step as it
involves checking only the support of the GN , which is a 1-Length frequent subtree in the
projected dataset. This technique can be used to reduce the number of frequent subtrees
to generate concise frequent subtrees.
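Conditions 2a and 2b reduce to simple tests over GN supports, as the following sketch shows (the predicate names are illustrative, not part of the thesis's implementation).

```python
def fails_closure(prefix_supp, gn_supps):
    """Condition 2a: Tp is not closed if some GN extension keeps its support."""
    return any(s == prefix_supp for s in gn_supps)

def fails_maximality(min_supp, gn_supps):
    """Condition 2b: Tp is not maximal if some GN extension is itself frequent."""
    return any(s >= min_supp for s in gn_supps)
```

In the running example, the prefix-tree < A1 − 1 > has support 3 and GN supports {‘B’: 2, ‘E’: 3}: it fails closure because of 'E' and fails maximality because 'B' is frequent at min supp = 2.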
Let us consider the next type of extension, where the extension of Tp occurs in a
different prefix-projected dataset from Tp. To check for the conciseness of the prefix-trees,
the following conditions are applied.
Condition 3a: Ancestor Node extension closure checking
If, for a prefix-tree Tp with m nodes, there exists a prefix-tree T′p with the same m nodes and an additional node b, having the same support as Tp, in a different prefix-projected dataset from Tp, then Tp is not closed and b is the ancestor node extension of Tp.
Condition 3b: Ancestor Node extension maximality checking
If, for a prefix-tree Tp with m nodes, there exists a prefix-tree T′p with the same m nodes and an additional node b, having a support ≥ min supp, in a different prefix-projected dataset from Tp, then Tp is not maximal and b is the ancestor node extension of Tp.
In order to efficiently check for conciseness under ancestor node extensions, a technique called “maintain-and-test” is deployed. A naïve approach is to check all the ancestor node extensions based on their support; however, this is an expensive operation as it involves a large number of checks. To reduce the number of checks, an additional parameter, the sum of the tree IDs, is used in checking for closure or maximality. To apply this technique, first check whether, for a given prefix-tree Tp, there exists an ancestor node extension by a node b in a different projected dataset, resulting in a T′p having the same support and sum of tree IDs as Tp. If such a T′p exists, the trees are checked for closure or maximality. The “maintain-and-test” approach reduces the number of checks as it avoids checking all the prefix-trees with the same support, and hence reduces the computational overhead. Also, this concise checking technique is applied as new frequent subtrees are generated, which helps to reduce the number of concise checks. The algorithm and its recursive function are presented in Figures 4.3 and 4.4 respectively.
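One way to realise the “maintain-and-test” bookkeeping is to index candidates by the pair (support, sum of tree IDs), so that an extension is only compared against candidates sharing that signature. This is a loose sketch of the idea, not the thesis's data structure; the name signature_index and the candidate tuples are assumptions.

```python
from collections import defaultdict

def signature_index(candidates):
    """Bucket candidate subtrees by (support, sum of tree IDs): an ancestor
    node extension only needs a closure test against candidates that share
    this signature, not against every subtree with the same support."""
    index = defaultdict(list)
    for tree, supp, tree_ids in candidates:
        index[(supp, sum(tree_ids))].append(tree)
    return index

# Hypothetical candidates: (pre-order encoding, support, supporting tree IDs).
cands = [(("A", "B", "-1", "-1"), 2, (1, 2)),
         (("A", "E", "-1", "-1"), 3, (1, 2, 3)),
         (("A", "F", "-1", "-1"), 2, (2, 3))]
idx = signature_index(cands)
```

Here the two support-2 candidates fall into different buckets, (2, 3) and (2, 5), so neither would be tested against the other even though their supports match.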
4.6 Methods using the proposed techniques for generating
concise frequent subtrees
This section discusses the use of the proposed techniques explained in the preceding sec-
tion in order to generate the different types of concise frequent subtrees. Figure 4.5 shows
the classification of the proposed methods for generating the two types of concise repre-
sentations and based on the node relationships of the subtrees. The following subsections
Input: Document Dataset: D, Document Tree Dataset: DT, Minimum Support: min supp, Length Constraint: const
Output: Concise Frequent Subtrees: CFS
begin
    1. Scan DT and find all 1-Length frequent subtrees f = {f1, f2, ..., fn};
    2. for the node b in every frequent subtree fi in f do
        if there exists the same ancestor node c for fi in all the document trees then
            Do not construct the prefix-tree projected dataset for fi;
        else
            i. Find all occurrences of fi in DT and construct the < fi − 1 >-projected dataset (i.e. ProDS(DT, < fi − 1 >)) by collecting all corresponding projected instances in DT;
            ii. Apply Fre(< fi − 1 >, 1, ProDS(DT, < fi − 1 >), min supp, supp(fi)) to mine the projected dataset until no more GNs can be found;
            iii. Obtain the concise frequent subtrees CFSProDS from the Fre function on the projected dataset ProDS;
            iv. Insert CFSProDS into CFS;
        end
    end
end
Figure 4.3: Algorithm for generating concise frequent subtrees
discuss in detail how each of the proposed frequent pattern mining methods utilises the concise generation techniques proposed earlier.
4.6.1 Generating concise frequent induced subtrees
As induced subtrees preserve only the parent-child relationships, generating concise fre-
quent induced subtrees using the techniques explained in the previous section requires
imposing only the parent-child relationship and not the ancestor-descendant relationship.
For all the methods for concise frequent induced subtrees generation, the process begins
with the generation of 1-Length frequent subtrees and the identification of the prefix-tree
projected datasets based on the generated 1-Length frequent subtrees. The search space
reduction condition 1 is then applied using the generated 1-Length frequent subtrees.
Function Fre(Tp, n, ProDS(DT, Tp), min supp, prepat supp)
Input: prefix-tree: Tp, length of Tp: n, < Tp >-projected dataset: ProDS(DT, Tp), minimum support threshold: min supp, support of the previous subtree used to generate this projected dataset: prepat supp
Output: Concise Frequent Subtrees (CFSProDS)
begin
    1. Scan ProDS(DT, Tp) once to find all the 1-Length frequent GNs (GN0, ..., GNk) according to Condition 1;
    2. Set output = true;
    3. Count the support of all GNs;
    4. if supp(GN0 || GN1 || ... || GNk) == supp(Tp) then
        The subtree is not a CFS; output = false;
    end
    5. for each GNi in GN do
        if GNi is frequent then
            i. Extend Tp with GNi to form the prefix tree T′p;
            ii. if output then
                Insert T′p into CFSProDS;
            end
        else
            i. Check T′p for the occurrence of any of its subtrees with the same support and sum of tree IDs in the output;
            ii. if there exists any such subtree of T′p then
                Remove the subtree of T′p and insert T′p into CFSProDS;
            end
        end
    end
    6. Find all occurrences of GNi in ProDS(DT, Tp) and construct the < T′p >-projected dataset (i.e. ProDS(DT, T′p)) by collecting all corresponding projected instances in ProDS(DT, Tp);
    7. Call Fre(T′p, n + 1, ProDS(DT, T′p), min supp, prepat supp) using the newly created T′p;
end
Figure 4.4: Function Fre for generating concise frequent subtrees
[Figure 4.5 classifies the proposed prefix-based pattern growth methods along two dimensions: the node relationship (concise induced vs. concise embedded subtrees) and the conciseness (closed, maximal, and their length constrained variants). For concise induced subtrees: PCITMiner (CFI), PMITMiner (MFI), PCITMinerConst (CFIConst) and PMITMinerConst (MFIConst). For concise embedded subtrees: PCETMiner (CFE), PMETMiner (MFE), PCETMinerConst (CFEConst) and PMETMinerConst (MFEConst).]
Figure 4.5: Classification of the proposed methods
Using these 1-Length frequent subtrees, the frequent node labels in the document trees
are identified. The frequent node labels in every document tree are checked for their
parent nodes. If a given 1-Length frequent subtree has the same parent node in all the document trees, then it will not be used in projecting the dataset, as the projection created by its parent node includes the projection created by the 1-Length frequent subtree itself. Hence, such 1-Length frequent subtrees can be removed from the set of 1-Length frequent subtrees, thereby reducing the search space.
In the example dataset, the subtree having the node labelled ‘A’ is a root node in all the trees in DT, so it is not checked for its parent node. However, the subtrees containing the internal nodes ‘B’, ‘C’, ‘E’ and ‘F’ are checked for their parent nodes in all the document trees in DT. This checking reveals that each of these nodes has the same parent in every tree (‘A’ for ‘B’ and ‘E’, ‘B’ for ‘C’, and ‘E’ for ‘F’); hence there is no need to project the subtrees (< B1 − 1 >: 2), (< C1 − 1 >: 2), (< E1 − 1 >: 3) and (< F 1 − 1 >: 2), as the projection of (< A1 − 1 >: 3) includes the projections of all the other subtrees. By excluding the projections of the internal nodes, the number of subtrees and the number of projections required are significantly reduced.
Due to the reduced search space, the efficiency of the method can be improved.
The concise checking techniques are applied using the generated 1-Length frequent
subtrees. The concise checking techniques depend on the type of subtree generated hence
they are described separately for each type of subtrees.
4.6.1.1 Prefix-based Closed Induced Tree Miner (PCITMiner)
As the PCITMiner generates Closed Frequent Induced (CFI) subtrees, the next task fo-
cuses on utilising the Growth Node (GN) extension closure checking condition (condition
2a) to generate the CFI subtrees. The GN for induced subtrees is based on the parent-child relationship, so these nodes are essentially the child nodes of the prefix tree Tp. To check for closure, the support of each of the GNs is stored. If any GN exists that has the same support as that of Tp, then Tp is not output.
As can be seen from Table 4.12 for the < A1−1 > projected dataset, the growth nodes
are ‘B’ and ‘E’. The supp(‘B’)=2 and supp(‘E’)=3. It can be seen that the GN , ‘E’, has
the same support as that of the prefix tree < A1 − 1 >, hence this subtree is not output
as a CFI subtree. This is due to the fact that the prefix-tree < A1 − 1 > with the GN
‘E’ could result in a subtree < A1E2−1−1 > which will be the supertree for < A1−1 >,
having the same support as that of < A1 − 1 >.
Finally, the condition 3a for ancestor node extension is used in PCITMiner as a parent
node extension and the closure checking is applied when the prefix-tree occurs in a dif-
ferent projected dataset. To efficiently check for closure for parent node extensions using
the “maintain-and-test” technique, the sum of the tree IDs, is included. To apply this
technique, it must first be checked whether, for a given prefix-tree Tp , a parent node
extension of a node b exists in a different projected dataset, resulting in T ′p having the
same support and sum of tree IDs as Tp. If it exists then it is checked for closure. The
“maintain-and-test” approach reduces the number of checks as it avoids checking all the
prefix-trees with the same support, and thus reduces the computational overhead.
From Table 4.12, consider Tp = < A1B2 − 1E3 − 1− 1 > and the prefix-tree T′p = < A1B2C3 − 1− 1E4 − 1− 1 >. It can be noted that T′p is generated from the prefix-tree < A1B2C3 − 1− 1− 1 > and not from Tp; hence T′p occurs in a different projected dataset from Tp. This shows how condition 3a helps to identify this type of ancestor node closure.
4.6.1.2 Prefix-based Maximal Induced Tree Miner (PMITMiner)
PMITMiner adopts similar techniques to PCITMiner, but it generates MFI subtrees, which are concise by maximality and preserve the parent-child relationship among their nodes.
To generate MFI subtrees using PMITMiner, the next task after applying search space
reduction technique is to apply a Growth Node (GN) extension maximality checking
condition (condition 2b). The GN for induced subtrees is based on the parent-child
relationship; these nodes are essentially the child nodes of the prefix tree Tp. To check for maximality, if any GN with a support of at least min supp is present, then Tp is not output as it is not maximal. This check differs from the closure check, where GNs with the same support as that of the prefix tree Tp are checked.
Table 4.12 shows that in the < A1 − 1 > projected dataset, the growth nodes are ‘B’ and ‘E’, with supp(‘B’)=2 and supp(‘E’)=3 respectively. As the conciseness is based on maximality, applying condition 2b checks whether the GNs are frequent (support no less than min supp) instead of checking them for the same support as in closure. Hence, the GN ‘B’, having a support no less than min supp, is frequent, and the prefix tree < A1 − 1 > is not maximal.
Finally, the ancestor node extension maximality checking condition (condition 3b) is used in PMITMiner as a parent node extension (as induced subtrees maintain the parent-child relationship among their nodes), and the maximality checking is applied when the prefix-tree occurs in a different projected dataset.
The “maintain-and-test” technique is utilised, with the testing against min supp based only on the support values, as the sum of the tree IDs is not required for checking this condition.
condition. To apply this technique, it is first checked whether for a given prefix-tree Tp,
a parent node extension of a node b exists in a different projected dataset resulting in T ′p
having a support greater than min supp. If it exists, then only the prefix-tree is checked
for maximality.
The discussion on the concise frequent induced subtrees leads to the discussion of the
length constrained concise induced subtrees, PCITMinerConst and PMITMinerConst.
4.6.1.3 Length Constrained Prefix-based Closed Induced Tree Miner (PCIT-
MinerConst)
The PCITMinerConst method adopts the same pruning and extension checking techniques as PCITMiner; however, each generated subtree is checked for its length. This length is a user-defined parameter. Usually this parameter is set to a value greater than 2, as a subtree with a length of 1 is a single node and a subtree with a length of 2 is a path.
The PCITMinerConst method begins with the search space reduction technique, similar to
PCITMiner in which only the 1-Length frequent subtrees are generated. Hence, the con-
straint checking is not carried out in this technique. The constraint checking is performed
after applying both of the node extension checking techniques.
After applying condition 2a for growth node extension closure checking, if the tree Tp is frequent and its length does not exceed the user-defined length threshold, then the generated Tp is output as a CFIConst subtree; lastly, the projections for the projected dataset are terminated when the length of the generated Tp equals the length threshold.
Finally, the condition 3a for ancestor node extension is used in PCITMinerConst as
a parent node extension, and the closure checking is applied, together with the length
threshold, when the prefix-tree occurs in a different projected dataset. The “maintain-and-test”
technique, which uses the sum of the tree IDs and the support, is included to check for
closure. To apply this technique, it is first checked whether, for a given prefix-tree Tp, a parent
node extension of a node b exists in a different projected dataset, resulting in T′p having
the same support and sum of tree IDs as Tp. If such an extension exists, the prefix-tree is
checked for closure. This process is repeated until no more CFIConst subtrees can be generated.
4.6.1.4 Length Constrained Prefix-based Maximal Induced Tree Miner (PMITMinerConst)
Similar to the PCITMinerConst method, PMITMinerConst adopts the same pruning and
extension checking techniques for generating the length constrained maximal frequent
induced (MFIConst) subtrees. The search space reduction technique generates only the
1-Length frequent subtrees, so no constraint checking is carried out at this stage; it is
performed after both of the node extension checking techniques.
After applying conditions 2b and 3b, if the length of the tree Tp is greater than the
user-defined length threshold, the generated Tp is output as an MFIConst subtree and
the projections of the projected dataset are terminated. Finally, the condition 3b for
ancestor node extension is used in PMITMinerConst as a parent node extension, and the
maximality checking is applied when the prefix-tree occurs in a different projected dataset.
The “maintain-and-test” technique is utilised, with the testing based only on min_supp.
4.6.2 Generating concise frequent embedded subtrees
Generating concise frequent embedded subtrees with the proposed techniques involves
utilising ancestor-descendant relationships. However, checking these relationships is
much more difficult than checking parent-child relationships, as there can be different
levels of descendants for a given ancestor.
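The difference in cost between the two relationship checks can be sketched as follows; an illustrative Python fragment, with hypothetical names, in which a tree is a dict mapping each child label to its parent label.

```python
# Sketch contrasting induced (parent-child) and embedded (ancestor-descendant)
# relationship checks. The embedded check must walk every level up to the
# root, which is why it is costlier than the single-step induced check.

def is_parent(tree, a, b):
    """Induced relation: a is the direct parent of b."""
    return tree.get(b) == a

def is_ancestor(tree, a, b):
    """Embedded relation: a lies anywhere on the path from b to the root."""
    node = tree.get(b)
    while node is not None:
        if node == a:
            return True
        node = tree.get(node)
    return False

t = {"A": None, "B": "A", "C": "B"}   # chain A -> B -> C
```

Here `is_parent(t, "A", "C")` is false but `is_ancestor(t, "A", "C")` is true; the loop in `is_ancestor` is what makes embedded mining visit many more candidate node pairs.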
As with induced subtrees, the generation of 1-Length frequent subtrees is conducted
by scanning the document dataset, because no hierarchical relationships exist in these
1-Length frequent subtrees. Applying the search space reduction condition involves checking
the ancestor nodes. However, checking all the ancestor nodes is an expensive task; hence,
a heuristic for searching the ancestor nodes is proposed: an ancestor node n is adopted
only if its support is equal to or greater than the support of the descendant node. This
heuristic reduces the number of ancestors that have to be searched.
Another simple heuristic checks whether a node n appears in every tree in DT. If it
does, the node is not checked for its ancestors, since it could be a root node. This
heuristic is applicable only to specific datasets, as discussed further in Section 4.8.
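The two heuristics can be sketched together as follows. This is an illustrative Python sketch with hypothetical names, assuming trees are dicts mapping child label to parent label (None marks the root) and support counts the trees containing a label.

```python
# Sketch of the two heuristics that limit the ancestor search for embedded
# subtree extension: (1) only ancestors at least as frequent as the
# descendant are adopted; (2) a label occurring in every tree is not
# searched for ancestors at all, as it may be the root (dataset-specific).

def support(label, trees):
    return sum(label in t for t in trees)

def candidate_ancestors(label, trees):
    dsupp = support(label, trees)
    if dsupp == len(trees):       # heuristic 2: possible root, skip search
        return set()
    ancestors = set()
    for t in trees:               # collect ancestors of `label` per tree
        node = t.get(label)
        while node is not None:
            ancestors.add(node)
            node = t.get(node)
    # Heuristic 1: prune ancestors less frequent than the descendant.
    return {a for a in ancestors if support(a, trees) >= dsupp}

trees = [
    {"A": None, "B": "A", "C": "B"},
    {"A": None, "B": "A", "D": "A"},
    {"A": None, "C": "A"},
]
```

On this toy dataset the root label "A" (support 3 of 3) is never searched, while the ancestors of "C" collapse to the small pruned set {"A", "B"}.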
For example, among the 1-Length frequent embedded subtrees (< A1 − 1 >: 3),
(< B1 − 1 >: 2), (< C1 − 1 >: 2), (< E1 − 1 >: 3), (< F 1 − 1 >: 2), the subtree having
the node labelled ‘A’ is a root node in all the trees in DT, so it is not checked for
its ancestors. Nevertheless, the subtrees containing the internal nodes ‘B’, ‘C’, ‘E’ and
‘F ’ are checked for their ancestor nodes in all the document trees in DT. This reveals
that the ancestor node is ‘A’ in all the trees, so there is no need to project these subtrees
separately: the projection of (< A1 − 1 >: 3) includes the projections of the subtrees
(< B1 − 1 >: 2), (< C1 − 1 >: 2), (< E1 − 1 >: 3) and (< F 1 − 1 >: 2).
4.6.2.1 Prefix-based Closed Embedded Tree Miner (PCETMiner)
The Growth Node (GN) extension closure checking condition (condition 2a) is applied to
identify whether any GNs exist in the same projected dataset that have the same support
as the prefix-tree Tp. If any GN has a support equal to that of Tp, then Tp is not output
as a CFE subtree.
In the running example, when the projected dataset of the prefix-tree < A1 − 1 > is
examined, the node ‘E’ is found to have the same support as the prefix-tree. Hence,
< A1 − 1 > is not a CFE subtree and is not output.
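The growth-node checks for closure (condition 2a) and maximality (condition 2b, introduced below for PMETMiner) can be contrasted on the running example. This Python fragment is illustrative only; the supports mirror the example, where the prefix-tree rooted at ‘A’ has support 3 and the growth node ‘E’ matches it.

```python
# Growth-node (GN) checks in a prefix-tree's projected dataset:
#   condition 2a (closure):    fails if some GN has the SAME support as Tp;
#   condition 2b (maximality): fails if ANY GN is frequent at all, which is
#   the cheaper of the two tests (a single frequency comparison per GN).

def violates_closure(prefix_supp, gn_supports):
    return any(s == prefix_supp for s in gn_supports.values())

def violates_maximality(gn_supports, min_supp):
    return any(s >= min_supp for s in gn_supports.values())

gn_supports = {"B": 2, "C": 2, "E": 3, "F": 2}  # GNs under the prefix 'A'
prefix_supp, min_supp = 3, 2

is_closed = not violates_closure(prefix_supp, gn_supports)   # 'E' matches Tp
is_maximal = not violates_maximality(gn_supports, min_supp)  # frequent GNs exist
```

Both flags come out false for this prefix: ‘E’ alone blocks closure, while every frequent growth node blocks maximality.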
The ancestor node extension closure checking condition (condition 3a) is applied to
verify the ancestor nodes in a different projected dataset. If, in one prefix-projected dataset,
a prefix-tree Tp with m nodes exists and, in a different prefix-projected dataset, a prefix-tree
T′p exists with the same m nodes and an additional node b, having the same support
as Tp, then Tp is not closed, as the additional node b in T′p is the ancestor node
extension of Tp.
The “maintain-and-test” technique uses the sum of the tree IDs and the support as a
hashing function to identify whether any supertree T′p ⊃t Tp exists that has the same
support as Tp. If one exists, then Tp is not output, by the closure property.
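The hashing step can be sketched as follows; an illustrative Python sketch under the assumption that each pattern carries its explicit set of tree IDs, with all function names hypothetical. Patterns occurring in exactly the same trees share both the tree-ID sum and the support, so the containment test only needs to run inside one hash bucket; a coincidental collision (different tree sets, same sum and size) is simply rejected by that same containment test.

```python
# Sketch of the "maintain-and-test" closure check: candidates are maintained
# in buckets keyed by (sum of tree IDs, support); the supertree test is then
# confined to a single bucket instead of the whole candidate set.

from collections import defaultdict

def closed_patterns(patterns):
    """patterns: list of (edge_set, tid_set); returns the closed edge sets."""
    buckets = defaultdict(list)
    for edges, tids in patterns:                 # "maintain" phase
        buckets[(sum(tids), len(tids))].append(edges)
    closed = []
    for edges, tids in patterns:                 # "test" phase, one bucket
        bucket = buckets[(sum(tids), len(tids))]
        if not any(edges < other for other in bucket):
            closed.append(edges)
    return closed

pats = [
    (frozenset({("A", "B")}), {1, 2, 3}),              # absorbed below
    (frozenset({("A", "B"), ("B", "C")}), {1, 2, 3}),  # same trees, closed
    (frozenset({("A", "D")}), {2, 3}),                 # different bucket
]
closed = closed_patterns(pats)
```

Here the one-edge pattern shares the bucket key (6, 3) with its supertree and is absorbed, while the pattern in the (5, 2) bucket is never even compared against the others.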
4.6.2.2 Prefix-based Maximal Embedded Tree Miner (PMETMiner)
The Growth Node (GN) extension maximality checking condition (condition 2b) is applied
to identify whether any GNs exist in the same projected dataset that are frequent with
the prefix-tree Tp. This step is computationally inexpensive, as it only involves checking
whether each GN is frequent. If any frequent GN exists, then Tp is not output as an
MFE subtree.
In the running example, the projected dataset of the prefix-tree < A1 − 1 > contains
several GNs, namely the nodes ‘B’, ‘C’, ‘E’ and ‘F ’. They are all frequent, so
< A1 − 1 > is not an MFE subtree and is not output.
The ancestor node extension maximality checking condition (condition 3b) is applied
to verify the ancestor nodes in a different projected dataset. If, in one prefix-projected
dataset, a prefix-tree Tp with m nodes exists and, in a different prefix-projected dataset, a
prefix-tree T′p exists with the same m nodes and an additional node b having a support
greater than min_supp, then Tp is not maximal, as the additional node b in T′p is the
ancestor node extension of Tp. As with PCETMiner, the “maintain-and-test” technique
is used; however, instead of using both the sum of the tree IDs and the support value,
only the support value serves as the hashing function to identify any frequent T′p, in
which case Tp is not output.
4.6.2.3 Length Constrained Prefix-based Closed Embedded Tree Miner (PCETMinerConst)
The PCETMinerConst method adopts the same techniques as PCETMiner; however, each
generated subtree is checked for its length. In the search space reduction step, only the
1-Length frequent subtrees are generated, so no constraint checking is carried out at this
stage; it is performed after both of the node extension checking techniques.
After applying conditions 2a and 3a, if the length of the tree Tp is greater than the
user-defined length threshold, then the generated Tp is output as a CFEConst subtree and
the projections of the projected dataset are terminated.
Finally, condition 3a for ancestor node extension is used in PCETMinerConst, and the
closure checking is applied when the prefix-tree occurs in a different projected dataset.
The “maintain-and-test” technique is utilised, with the testing based on both the sum of
the tree IDs and the support of the subtrees. The pruning provided by the length threshold
terminates the frequent subtree mining process earlier than in PCETMiner.
4.6.2.4 Length Constrained Prefix-based Maximal Embedded Tree Miner
(PMETMinerConst)
Similar to the PCETMinerConst method, PMETMinerConst adopts the same pruning and
extension checking techniques as PMETMiner; however, each generated subtree is checked
for its length. In the search space reduction step, only the 1-Length frequent subtrees are
generated, so no constraint checking is carried out at this stage; it is performed after both
of the node extension checking techniques.
After applying conditions 2b and 3b, if the length of the tree Tp is greater than the
user-defined length threshold, the generated Tp is output as an MFEConst subtree and
the projections of the projected dataset are terminated.
Finally, the condition 3b for ancestor node extension is used in PMETMinerConst, and
the maximality checking is applied when the prefix-tree occurs in a different projected
dataset. The “maintain-and-test” technique is utilised, with the testing based on min_supp.
To understand the effectiveness of the proposed methods for concise frequent sub-
tree mining, they are benchmarked against other state-of-the-art frequent subtree mining
methods on both synthetic and real-life datasets using the evaluation measures discussed
in Chapter 3.
4.7 Empirical evaluation
The experiments for frequent subtree mining were conducted to evaluate the effectiveness
of the proposed concise frequent subtree mining methods against the prefix-based pattern
growth methods PrefixTreeISpan [141] and PrefixTreeESpan [140], which generate induced
and embedded subtrees respectively. In addition, experiments were conducted to compare
the proposed methods with another state-of-the-art apriori-based generate-and-test mining
method, TreeMinerV, and with the enumeration-based method MB3-Miner [105] for
generating embedded subtrees, as discussed in Chapter 3.
All the proposed methods were written in C++ with STL support and compiled using the
Intel compiler with -O3 optimisations. They were evaluated on both the synthetic and the
real-life datasets detailed in Chapter 3. Moreover, the experiments were conducted by
constraining the length of the concise frequent subtrees generated using the length
constrained concise frequent subtree mining methods. The length constraint (const) was
varied from 3 to 11; the lower bound was set to 3 because a subtree of length 2 is a path
and a subtree of length 1 is a single node (or tag).
This section is designed to perform a comparison of the following sets based on the
evaluation metrics and the datasets discussed in Chapter 3:
• Concise frequent subtrees vs frequent subtrees using prefix-based pattern mining
methods;
• The prefix-based approach vs the generate-and-test approach;
• Closed vs maximal frequent pattern mining methods;
• Induced vs embedded frequent pattern mining methods; and
• Concise frequent pattern mining vs constrained concise frequent pattern mining
methods.
The experimental study is conducted based on the divisions of the datasets discussed
in the research design (Chapter 3). They are:
• Synthetic datasets
– On F5 dataset; and
– On D10 dataset.
• Real-life datasets
– On small-sized real-life dataset;
– On medium-sized real-life dataset; and
– On large-sized real-life datasets.
4.7.1 Evaluation of frequent pattern mining methods on synthetic datasets
In this subsection, the proposed concise frequent pattern mining methods are compared
in terms of runtime and number of patterns on the two synthetic datasets, F5 and D10.
The difference between these two datasets is their branching factor, with D10 having a
higher fan-out (branching factor) than F5.
On F5 dataset
Figure 4.6: Runtime and number of subtrees comparison on F5 dataset. [Plots, all against minimum support (min_supp in %): (a) runtime (in secs) of induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtime (in secs) of embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE).]
Figure 4.6(a) shows that PCITMiner performs much faster than PrefixTreeISpan;
PCETMiner performs faster than both TreeMinerV and PrefixTreeESpan, especially at
lower support thresholds, as shown in Figure 4.6(b), and performs almost on a par with
MB3-Miner. In spite of the very large number of subtrees in this dataset, both PCITMiner
and PCETMiner perform better than the other methods, as shown in Figures 4.6(c)
and (d).
A comparison of the length constrained mining methods for various constraint lengths
is shown in Figure 4.7 (further results are presented in Appendices B.1 and B.2). It clearly
reveals that the length constrained concise frequent induced subtree miners generate patterns
in less time than the concise frequent embedded subtree miners. This is due to the time
taken to identify the embedded relationships between the nodes and also from applying
Figure 4.7: Runtime and number of length constrained frequent concise subtrees comparison on F5 dataset. [Plots, all against constraint length (3 to 11), with series PCITMinerConst, PMITMinerConst, PCETMinerConst and PMETMinerConst for runtime and CFIConst, MFIConst, CFEConst and MFEConst for counts: (a) runtime at 2% min_supp; (b) runtime at 10% min_supp; (c) number of subtrees at 2% min_supp; (d) number of subtrees at 10% min_supp.]
the conciseness on them. A comparison of the runtime of the length constrained concise
frequent subtree miners with their corresponding concise frequent subtree miners shows
that fewer length constrained frequent induced subtrees are produced, due to the early
termination based on the constraint length. The CFEConst subtrees, on the other hand,
are greater in number due to the various combinations of the nodes and the differences
in their support values, although still fewer than the frequent embedded subtrees. The
MFEConst subtrees, since maximality is based only on the support threshold and not on
the individual support values, are fewer in number than the CFEConst subtrees. It can
also be noted that the number of subtrees varies at lower support thresholds, whereas
there is no variation at higher support thresholds, even at 10%, because the dataset
contains far fewer long trees. In summary, the frequent subtrees can be generated faster,
and with an increase in the length of the frequent subtrees, the number of maximal
subtrees generated is reduced.
On D10 dataset
Figure 4.8 shows that PCITMiner generates concise frequent subtrees faster than its
counterpart PCETMiner on the D10 dataset. This is due to the larger number of concise
frequent embedded subtrees generated in comparison with the concise frequent induced
subtrees. The same reason applies to the embedded subtree miners MB3-Miner and
TreeMinerV, which are faster at the lowest support threshold of 1% on D10. However, on
the same dataset, at supports of 2% and above, PCETMiner is almost equal to or faster
than MB3-Miner and TreeMinerV. PCETMiner also performs much faster than the
prefix-based pattern growth method for frequent embedded subtrees, PrefixTreeESpan,
while also achieving a 13-fold reduction in the number of subtrees generated. In general,
the proposed PCITMiner and PCETMiner methods produce far fewer patterns than the
other induced and embedded subtree miners.
Table 4.15: Runtime (in secs) comparison of length constrained subtrees on the D10 dataset

Min_supp  Const  PCITMinerConst  PMITMinerConst  PCETMinerConst  PMETMinerConst
2          3     0.76            0.75            1.03            1.04
2          5     1.04            1.05            2.09            2.11
2          7     1.12            1.13            2.42            2.42
2          9     1.10            1.13            2.43            2.43
2         11     1.14            1.12            2.45            2.45
4          3     0.58            0.58            0.61            0.61
4          5     0.71            0.73            0.78            0.79
4          7     0.73            0.73            0.79            0.79
4          9     0.75            0.74            0.79            0.79
4         11     0.75            0.75            0.80            0.80
6          3     0.52            0.50            0.49            0.49
6          5     0.60            0.59            0.61            0.61
6          7     0.58            0.58            0.58            0.59
6          9     0.61            0.60            0.60            0.60
6         11     0.60            0.60            0.58            0.58
8          3     0.43            0.44            0.33            0.33
8          5     0.43            0.44            0.34            0.34
8          7     0.43            0.44            0.34            0.34
8          9     0.43            0.44            0.34            0.35
8         11     0.44            0.44            0.34            0.34
10         3     0.43            0.44            0.33            0.33
10         5     0.43            0.44            0.34            0.34
10         7     0.43            0.44            0.34            0.34
10         9     0.43            0.44            0.34            0.34
10        11     0.44            0.44            0.34            0.34
It is interesting to note from Tables 4.15 and 4.16 that, for the constrained frequent
pattern mining methods, fewer MFEConst subtrees are produced than CFIConst subtrees.
This shows the strength of maximality even while preserving the embedded relationship.
With respect to the runtime, PCITMinerConst and PMITMinerConst perform
Figure 4.8: Runtime and number of subtrees comparison on the D10 dataset. [Plots, all against minimum support (min_supp in %): (a) runtime of induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtime of embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE).]
Table 4.16: Number of length constrained subtrees in the D10 dataset

Min_supp  Const  PCITMinerConst  PMITMinerConst  PCETMinerConst  PMETMinerConst
2          3     17              14              42              39
2          5     26              13              83              56
2          7     27               9              56              28
2          9     25               7              51              23
2         11     25               7              51              23
4          3      9               7              15              12
4          5     10               4              15               8
4          7     10               4              15               8
4          9     10               4              15               8
4         11     10               4              15               8
6          3      6               4              11               8
6          5      7               2              11               5
6          7      7               2              11               5
6          9      7               2              11               5
6         11      7               2              11               5
8          3      3               2               4               2
8          5      3               2               4               2
8          7      3               2               4               2
8          9      3               2               4               2
8         11      3               2               4               2
10         3      3               2               4               2
10         5      3               2               4               2
10         7      3               2               4               2
10         9      3               2               4               2
10        11      3               2               4               2
the same.
A comparison of the results on the two synthetic datasets, F5 and D10, reveals that,
with respect to the branching factor, the proposed concise frequent pattern mining methods
are scalable at very low support thresholds and can reduce the number of subtrees by up
to 90%, especially for embedded subtrees. In spite of the high branching factor resulting
in a very large number of frequent subtrees, due to the many possible ancestor-descendant
relationships it allows, the proposed methods produce a reduced number of subtrees with
runtimes almost comparable to those of the other benchmarks. Hence, the proposed
methods are also suitable for datasets with a high branching factor.
4.7.2 Evaluation of frequent pattern mining methods on real-life datasets
This subsection evaluates the proposed frequent pattern mining methods on the real-life
datasets ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009, using the evaluation
metrics discussed in Chapter 3.
4.7.2.1 On small-sized real-life dataset
The runtime and number-of-subtrees comparisons of the frequent pattern mining
methods on the ACM dataset are shown in Figure 4.9.
It can be noted that the number of MFI subtrees generated was 3 at the 10% and 20%
support thresholds and 4 at higher thresholds. As the documents essentially came from
two DTDs, there is not much variation in the MFI subtrees. This commonality, on the
other hand, cannot be identified using only the frequent induced subtrees. If the grouping
is based on the structure of the documents, then the MFI subtrees can easily be used to
identify the structural clusters. The impact of the length constraints is shown in
Figure 4.10; however, there is only a very minimal difference in the number of length
constrained concise frequent patterns and in the runtime.
Figure 4.9: Runtime and number of subtrees comparison on ACM dataset. [Plots, all against minimum support (min_supp in %): (a) runtime of induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtime of embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE).]
Figure 4.10: Runtime and number of length constrained frequent concise subtrees comparison on ACM dataset. [Plots, all against constraint length (3 to 9), with series PCITMinerConst, PMITMinerConst, PCETMinerConst and PMETMinerConst for runtime and CFIConst, MFIConst, CFEConst and MFEConst for counts: (a) runtime at 10% min_supp; (b) runtime at 30% min_supp; (c) number of subtrees at 10% min_supp; (d) number of subtrees at 30% min_supp.]
Figure 4.11: Runtime and number of subtrees comparison on DBLP dataset. [Plots, all against minimum support (min_supp in %): (a) runtime of induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtime of embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, MB3-Miner, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE).]
4.7.2.2 On medium-sized real-life dataset
It can be seen from Figure 4.11 that the proposed methods identified the concise
frequent subtrees within a reasonable duration for the DBLP dataset. A comparison of the
runtimes of the concise frequent subtree miners reveals only a minimal difference
(≈ 0.1 secs). This is because the average length of the subtrees in the dataset is only 25
(as mentioned in Chapter 3) and the frequent patterns do not have much variation in
their node relationships.
The comparison of length constrained frequent subtrees at the 10% and 30% support
thresholds in Figure 4.12 shows that PCITMinerConst and PMITMinerConst take almost
the same amount of time to produce subtrees of varied lengths, but the number of
MFIConst subtrees produced by PMITMinerConst is much smaller than the number of
CFIConst subtrees. This implies that there are a large number of subtrees with different
support values and
Figure 4.12: Runtime and number of length constrained frequent concise subtrees comparison on DBLP dataset. [Plots, all against constraint length (3 to 11), with series PCITMinerConst, PMITMinerConst, PCETMinerConst and PMETMinerConst for runtime and CFIConst, MFIConst, CFEConst and MFEConst for counts: (a) runtime at 10% min_supp; (b) runtime at 30% min_supp; (c) number of subtrees at 10% min_supp; (d) number of subtrees at 30% min_supp.]
maximality replaces them with their supertrees. However, PMITMinerConst takes almost
the same time as PCITMinerConst due to the large number of maximality checks required
for these subtrees.
4.7.2.3 On large-sized real-life datasets
In this subsection, experiments were conducted on the large-sized real-life datasets
INEX 2007, INEX IEEE and INEX 2009 to evaluate the efficiency of the proposed
methods against the benchmarks.
INEX 2007 dataset
Figure 4.13 shows that the proposed concise frequent subtree mining methods were
scalable for this large dataset, even with a maximum subtree length of 48. The benchmarks
PrefixTreeESpan and TreeMinerV were scalable only at higher support thresholds on this
dataset and failed at lower support thresholds. Though PrefixTreeESpan was scalable, it
took longer than PCETMiner, and TreeMinerV failed to produce any results for support
thresholds below 40%. This behaviour demonstrates the advantage of applying the
proposed methods to generate concise frequent subtrees over the benchmarks that generate
all frequent subtrees. It is interesting to note that, at a support threshold of 10%, there
are 10 times more CFE subtrees than CFI subtrees. Figures 4.14 and 4.15 show the length
constrained frequent subtree miners at support thresholds of 20% and 50% respectively.
Figure 4.13: Runtime and number of subtrees comparison on INEX 2007 dataset. [Plots, all against minimum support (min_supp in %): (a) runtime of induced subtree miners (PCITMiner, PMITMiner, PrefixTreeISpan); (b) runtime of embedded subtree miners (PCETMiner, PMETMiner, PrefixTreeESpan, TreeMinerV); (c) number of frequent induced subtrees (CFI, MFI, FI); (d) number of frequent embedded subtrees (CFE, MFE, FE).]
On the other large-sized datasets, INEX IEEE and INEX 2009, the average length of
the subtrees is greater than 100 and the number of documents is large (greater than
5000); therefore, none of the unconstrained frequent pattern mining methods could be
applied. However, to identify concise frequent subtrees, the proposed length constrained
Figure 4.14: Runtime and number of length constrained frequent subtrees comparison on INEX 2007 dataset at 20% min_supp. [Plots, all against constraint length (3 to 11): (a) runtime of the induced miners (PCITMinerConst, PMITMinerConst); (b) runtime of the embedded miners (PCETMinerConst, PMETMinerConst); (c) number of induced subtrees (CFIConst, MFIConst); (d) number of embedded subtrees (CFEConst, MFEConst).]
Figure 4.15: Runtime and number of length constrained frequent subtrees comparison on INEX 2007 dataset at 50% min_supp. [Plots, all against constraint length (3 to 9): (a) runtime of the length constrained subtree miners (PCITMinerConst, PMITMinerConst, PCETMinerConst, PMETMinerConst); (b) number of subtrees (CFIConst, MFIConst, CFEConst, MFEConst).]
concise frequent subtree miners, PCITMinerConst, PMITMinerConst, PCETMinerConst
and PMETMinerConst, are applied.
INEX IEEE dataset
Although the number of document trees was only 6054, the average length of a document
was more than 75. This resulted in a very dense dataset, and none of the benchmarks
for frequent pattern mining were able to mine frequent patterns.
Figure 4.16: Runtime and number of length constrained frequent induced subtrees comparison on INEX IEEE dataset at 20% and 50%. [Plots, all against constraint length (3 to 11): (a) runtime of the induced miners (PCITMinerConst, PMITMinerConst) at 20% min_supp and (b) at 50% min_supp; (c) number of induced subtrees (CFIConst, MFIConst) at 20% min_supp and (d) at 50% min_supp.]
Figure 4.17: Runtime and number of length constrained frequent embedded subtrees comparison on INEX IEEE dataset. [Plots, all against minimum support (min_supp in %): (a) runtime of the embedded miners (PCETMinerConst, PMETMinerConst); (b) number of embedded subtrees at const=3 (CFEConst, MFEConst).]
Also, it took longer than 2 hours for PCITMiner, PMITMiner, PCETMiner and
PMETMiner to mine for frequent subtrees; hence the mining process was terminated.
However, the length constrained concise frequent subtree miners were applied, and the
results are reported in Figures 4.16 and 4.17.
INEX 2009 dataset
Figures 4.18 and 4.19 show that all the length constrained induced and embedded
subtree miners are scalable on the INEX 2009 dataset for varying constraint lengths.
This shows that these concise frequent pattern mining methods can be applied to such
highly dense and deeply structured datasets. The miners were able to mine frequent
subtrees even as the constraint length increased, at both lower and higher support
thresholds.
[Figure: four plots against constraint length (3, 5, 7, 9, 11) on the INEX 2009 dataset: (a) runtime (in secs) of PCITMinerConst and PMITMinerConst at 20% min_supp; (b) the same at 50% min_supp; (c) number of length constrained subtrees (CFIConst, MFIConst) at 20% min_supp; (d) the same at 50% min_supp.]
Figure 4.18: Runtime and number of length constrained frequent induced subtrees comparison on the INEX 2009 dataset at 20% and 50%
Tables 4.17 and 4.18 provide a comparison of the performance of the proposed methods
and the benchmarks, on both the synthetic and real-life datasets where λ and ρ denote the
runtime and the number of patterns generated for each of the methods. “F” indicates that
the methods fail to scale for the respective datasets. In this setting, a support threshold of 2% is used for the synthetic datasets; 20% for ACM, DBLP, INEX IEEE and INEX 2009; and 30% for INEX 2007 (as TreeMinerV fails below this threshold).

[Figure: two plots against minimum support (min_supp in %) on the INEX 2009 dataset: (a) runtime (in secs) of PCETMinerConst and PMETMinerConst (const=3); (b) number of length constrained subtrees (CFEConst, MFEConst) at const=3.]
Figure 4.19: Runtime and number of length constrained frequent embedded subtrees comparison on the INEX 2009 dataset (const=3)
Table 4.17: Summary of frequent pattern mining results on synthetic datasets

Methods              F5            D10
                     λ      ρ      λ      ρ
PCITMinerConst       0.33   17     0.76   17
PMITMinerConst       0.33   9      0.75   14
PCETMinerConst       0.6    21     1.03   45
PMETMinerConst       0.6    23     1.04   87
PCITMiner            0.51   21     0.73   32
PMITMiner            0.57   7      0.66   6
PCETMiner            0.65   21     1.55   51
PMETMiner            0.65   10     1.56   23
PrefixTreeISpan      0.5    44     0.73   72
PrefixTreeESpan      0.5    289    2.64   104
MB-3Miner            0.83   289    1.47   104
TreeMinerV           1.16   289    1.64   104
From the empirical evaluation on the runtime and the number of frequent subtrees
(summarised in Figures 4.20 and 4.21), it is clear that the proposed methods for concise
frequent subtrees such as PCITMiner, PMITMiner, PCETMiner and PMETMiner perform
better than PrefixTreeISpan [141] and PrefixTreeESpan [140], not only in reducing the
number of subtrees but also in reducing runtimes. However, these methods took longer on the INEX IEEE and the INEX 2009 datasets, which contain longer documents, and hence the constrained concise frequent pattern mining methods were applied on these datasets.
MB3-Miner performs the same as TreeMinerV on ACM dataset, as shown in Figure
4.20 (a). On the DBLP dataset, MB3-Miner clearly outperforms TreeMinerV; however, for
Table 4.18: Summary of frequent pattern mining results on real-life datasets

Methods              ACM           DBLP         INEX 2007       INEX IEEE     INEX 2009
                     λ      ρ      λ     ρ      λ        ρ      λ      ρ      λ        ρ
PCITMinerConst       0.12   842    0.3   109    2.4      32     1.97   169    104.6    263
PMITMinerConst       0.15   841    0.3   103    3.65     20     1.97   132    72.87    210
PCETMinerConst       10.09  1861   0.54  232    11.97    134    44.31  2006   3347.16  3657
PMETMinerConst       10.16  1203   0.54  186    13.1     134    44.31  2006   3349.17  3619
PCITMiner            0.12   19     0.14  34     13.91    70     F      F      F        F
PMITMiner            0.14   3      0.17  6      13.9     11     F      F      F        F
PCETMiner            2.5    18     0.34  73     1503     1002   F      F      F        F
PMETMiner            2.5    6      0.34  10     1503.3   253    F      F      F        F
PrefixTreeISpan      0.12   3265   0.36  337    19.6     665    F      F      F        F
PrefixTreeESpan      1.54   69519  0.42  448    1761.47  11705  F      F      F        F
MB-3Miner            1      69519  0.13  448    F        F      F      F      F        F
TreeMinerV           1.02   69519  0.25  448    13677.1  11705  F      F      F        F
[Figure: log-scale plots of runtime (in secs) against the number of frequent subtrees for PCITMinerConst, PMITMinerConst, PCETMinerConst, PMETMinerConst, PCITMiner, PMITMiner, PCETMiner, PMETMiner, PrefixTreeISpan, PrefixTreeESpan, MB3Miner and TreeMinerV. Panels: (a) ACM; (b) DBLP; (c) INEX 2007.]
Figure 4.20: Comparison of the runtimes vs number of frequent subtrees on ACM, DBLP and INEX 2007 datasets
[Figure: log-scale plots of runtime (in secs) against the number of frequent subtrees for the same twelve miners. Panels: (a) INEX IEEE; (b) INEX 2009.]
Figure 4.21: Comparison of the runtimes vs number of frequent subtrees on INEX IEEE and INEX 2009 datasets
lower support thresholds it performs faster than PrefixTreeESpan, and in some situations it
is almost similar to PCETMiner and PMETMiner. However, at higher support thresholds,
the latter two methods seem to be faster than MB3-Miner. Also, the MB3-Miner could not
scale for the INEX 2007 dataset, but PrefixTreeESpan and PCETMiner could efficiently
mine for frequent subtrees, even for lower support thresholds, in less than 45 seconds.
The memory consumption of the MB3-Miner exceeded the experimental set-up limit of
16GB of memory, even for supports of about 50%, and therefore the mining process was
terminated. From this empirical evaluation, it can be seen that MB3-Miner is not scalable
for large-sized datasets and for datasets with deeper trees and/or a large branching factor.
However, it is efficient for small and medium-sized datasets such as ACM and DBLP, which
have fewer documents and narrower, shorter branches than large-sized datasets have. In
addition, MB3-Miner could not scale for either the INEX 2009 or the INEX IEEE datasets
and hence their results were not reported.
4.8 Discussion and summary
This section discusses the algorithmic design of the proposed frequent pattern mining
methods and the empirical evaluation of these methods and the benchmarks conducted in
this chapter.
4.8.1 Algorithmic Design
The algorithmic complexity of the concise frequent pattern mining algorithms is deter-
mined as O(d ∗ s ∗ m) where d represents the number of documents, s is the number
of 1-Length concise frequent subtrees, and m is the number of iterations of the function
Fre in the concise frequent pattern mining algorithm. For the length constrained concise
frequent pattern mining algorithm, m is terminated early if the length of the generated
subtrees is equal to const.
4.8.2 Empirical Evaluation
• As can be seen from the mining results, frequent pattern mining methods such as TreeMinerV and MB3-Miner were not scalable for the INEX 2007, INEX IEEE and INEX 2009 datasets. These datasets had a larger number of nodes and longer document trees than the other datasets. The proposed methods were able to mine for concise frequent subtrees on these large-sized datasets with the aid of tree length constraints.
• The results from the larger datasets clearly indicate that post-processing frequent subtrees to generate concise frequent subtrees is not practically feasible, since the frequent subtree mining methods could not complete the mining process due to the explosion in the number of frequent subtrees generated. The process of generating concise frequent subtrees should instead be embedded within the frequent pattern mining process, as is done in this thesis.
• As a parent-child relationship is a subset of an ancestor-descendant relationship, a larger number of embedded subtrees is generated than induced subtrees. In some datasets, there was about a 10-fold increase in the number of embedded subtrees compared to induced subtrees.
• The maximality property reduces the number of subtrees compared to closed and frequent subtrees. Unlike the methods for generating closed frequent subtrees, the methods for generating maximal frequent subtrees, such as PMITMiner, PMETMiner, PMITMinerConst and PMETMinerConst, avoid checking the GNs for the same support and hence are more effective than closed frequent subtree mining methods in reducing the number of frequent patterns. This is because maximality is a less stringent condition than closure, and an individual maximality check is faster. However, on small datasets the number of checks required means that identifying maximal patterns takes almost the same time as closure checking; on datasets with more patterns, the number of maximality checks grows accordingly, so maximality checking can take longer than closure checking. Though maximality results in a reduced number of subtrees, it is essential to identify whether this causes any information loss in future applications. Hence, in the next section, a comparison of clustering accuracy is conducted on both maximal and closed frequent subtrees.
• Furthermore, since the length constrained subtree mining methods are efficient in performance, their effectiveness was evaluated by conducting a selectivity analysis. The length constrained subtree mining methods produce length constrained concise frequent patterns which may not be fully concise for the given dataset due to early termination. This depends on the nature of the dataset; at lower const values, fewer subtrees are generated. However, in some datasets, such as ACM and DBLP, when the length of the subtrees is between 4 and 8, the number of length constrained concise subtrees is greater than the number of concise subtrees. Hence, it is essential to understand the impact of these generated subtrees on the clustering task. Moreover, it is also vital to understand whether any information loss occurs in using the length constrained subtrees in future applications.
• The empirical analysis reveals that the proposed concise frequent subtree mining methods are scalable not only for 100,000 documents (the F5 and D10 datasets) but also for datasets with a high branching factor, with document trees of more than 10,000 nodes and containing about 34,000 tags (the INEX 2009 dataset).
• In spite of studies indicating that the performance of some existing closed frequent subtree mining methods degrades for datasets with a high branching factor [107], the experimental results show that the response time of PCITMiner on datasets with highly branched trees (F5 and D10) is lower than that of PrefixTreeISpan.
• A comparison of the prefix-based pattern growth and generate-and-test approaches reveals that most methods using the generate-and-test approach cannot scale to dense datasets that methods using the prefix-based pattern growth approach can handle.
4.9 Chapter Conclusion
This chapter has provided the details for the proposed concise frequent subtree mining
methods. The pre-processing steps required to extract trees from the XML data were
presented. An overview of the various types of concise frequent subtrees was presented
along with the details of the constraint parameters. Several concise frequent subtree min-
ing methods were proposed using the two optimisation techniques, search space reduction
using backward scan and node extension checking.
Furthermore, all the proposed methods were evaluated on both synthetic and real-life
datasets exhibiting extreme characteristics. These methods were also benchmarked against
the existing state-of-the-art frequent pattern mining methods for both the induced and
the embedded frequent subtrees. While doing so, various parameters, such as the runtime
required to generate the concise frequent subtrees, the number of frequent subtrees and
the number of projections required to generate them on various support thresholds, were
also examined. This chapter has also conducted an in-depth sensitivity analysis of the
length constraint on the runtime, the number of projections and the number of frequent
subtrees.
Chapter 5
XML Clustering
5.1 Introduction
The discussion in Chapter 2 established the merit of combining both the structure and the content features for XML clustering. Chapter 2 also laid the foundation of a hybrid XML clustering approach that would benefit from the use of frequent subtrees in the clustering process of XML documents. Chapter 4, on frequent pattern mining approaches, emphasised the importance of concise patterns not only in reducing the runtime but also in deriving meaningful information. The focus of this chapter is to present a novel clustering methodology that utilises the concise frequent subtrees and their corresponding content implicitly and explicitly.
This chapter explains and analyses the XML clustering methodology which has been
developed in this thesis. The chapter begins with an overall presentation of the proposed
Hybrid Clustering of XML documents (HCX) methodology, giving details of each of the
steps in the subsequent sections. The second section explains the details of the HCX using
the Vector Space Model to implicitly express the relationship between the structure and
the content features in XML documents. The third section introduces the novel approach
using the Tensor Space Model to explicitly express the relationship between these two
features. An in-depth analysis of the two clustering approaches over other state-of-the-art
approaches were conducted using real-life datasets. The final section provides a discussion
of the analysis of the proposed clustering approaches.
5.2 Hybrid Clustering of XML Documents (HCX) Methodology: An Overview
Figure 5.1 provides the overview of the HCX methodology for clustering the XML doc-
uments. It begins by utilising the concise frequent subtree mining methods to generate
different types of concise frequent subtrees. Using one of these types of concise frequent subtrees, the methodology extracts the content from the XML documents and represents both of these features for clustering.
structure and the content of the XML documents, the Vector Space Model (VSM) and the
Tensor Space Model (TSM). Each of the two models adopts different techniques to utilise
the generated concise frequent subtrees to extract the content from the XML documents.
The clustering methods using the VSM and the TSM are called the Hybrid Clustering of
XML documents using the Vector Space Model (HCX-V) and Hybrid Clustering of XML
documents using the Tensor Space Model (HCX-T) respectively.
5.2.1 Hybrid Clustering of XML documents using the Vector Space
Model (HCX-V)
This method involves non-linearly combining the structure and the content of XML docu-
ments in the VSM. HCX-V begins by identifying the documents that contain the frequent
subtrees and then extracting their corresponding content. The content thus extracted is
[Figure: flowchart of the HCX methodology. XML documents feed a concise frequent subtree mining step using prefix-based pattern growth, producing CFI, MFI, CFE, MFE and their constrained variants CFIConst, MFIConst, CFEConst, MFEConst. HCX-V branch: identify the coverage of the documents for the generated concise frequent subtrees; extract content from the XML documents using coverage and represent it in Pre-cluster Form (PCF); combine the PCFs to form the Intermediate Cluster Form (ICF) in the Vector Space Model (VSM); apply a partitional clustering algorithm to obtain Clusters 1 to N. HCX-T branch: cluster the concise frequent subtrees; identify the coverage of the documents for the clusters of concise frequent subtrees; extract content using coverage into PCFs; generate the ICF in the Tensor Space Model (TSM); apply Random Indexing (shown in italics as optional); apply a tensor decomposition algorithm and cluster the decomposed values.]

Figure 5.1: Hybrid Clustering of XML documents (HCX) methodology
represented in an Intermediate Cluster Form (ICF ), which is a VSM. The ICF combines
the structural commonalities in the form of concise frequent subtrees with the content
information of XML documents to group the XML documents together. Hence, this phase
utilises an implicit combination of the structure and the content of the XML documents.
5.2.2 Hybrid Clustering of XML documents using the Tensor Space
Model (HCX-T)
HCX-T combines the structure and the content of XML documents in a novel way using a
multi-dimensional model. This involves clustering the concise frequent subtrees and then
using these clusters to extract the content. The extracted content is then represented in
the three dimensions of XML documents: the document id, the structure and the content in a higher-order tensor model. Unlike the VSM, the TSM involves representing both the features – the structure and the content – in an explicit manner for a given document. As indicated in italics in Figure 5.1, a Random Indexing step can also be included. This, however, is applied only for very large datasets where the number of features is too high.
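Random Indexing is a standard dimensionality reduction technique; the sketch below illustrates the general idea only and is not the thesis configuration (the function names and the parameters dim and nnz are illustrative assumptions). Each feature is assigned a sparse ternary index vector, and a document's reduced representation is the frequency-weighted sum of the index vectors of its features.

```python
import random

def random_index_vectors(features, dim=10, nnz=4, seed=0):
    """Assign each feature a sparse ternary index vector: `nnz` randomly
    chosen positions set to +1 or -1, the rest zero."""
    rng = random.Random(seed)
    index = {}
    for f in features:
        vec = [0.0] * dim
        for pos in rng.sample(range(dim), nnz):
            vec[pos] = rng.choice((-1.0, 1.0))
        index[f] = vec
    return index

def project(doc_counts, index, dim=10):
    """Reduce a {feature: frequency} document to a dim-length vector by
    summing the frequency-weighted index vectors of its features."""
    out = [0.0] * dim
    for f, freq in doc_counts.items():
        for i, v in enumerate(index[f]):
            out[i] += freq * v
    return out
```

Because the index vectors are nearly orthogonal in high dimensions, the projection approximately preserves similarities while fixing the representation size regardless of how many features occur.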
5.3 Using the Vector Space Model (VSM)
This subsection details the use of the Vector Space Model (VSM) [95] for non-linearly
combining the structure and the content for clustering. The VSM is a model for repre-
senting text documents or any objects as vectors of identifiers, for example, terms. When
using the VSM for XML clustering, the feature of the document content is a set of terms,
and the feature of the document structure is a set of substructures such as tags, paths,
subtrees, or subgraphs. In this research, the concise frequent subtrees will be used to
represent the document structure.
Figure 5.2 provides the high level definition of Hybrid Clustering of XML documents
using a Vector Space Model (HCX-V). The first task in this clustering method is to identify
the coverage of the concise frequent subtrees that are generated by applying the proposed
concise frequent pattern growth methods (in Chapter 4) on XML datasets.
Input: Document Dataset: D; Document Tree Dataset: DT; Minimum Support: min_supp; Length Constraint: const; Number of Clusters: c; Concise Frequent Subtrees: {CF1, . . . , CFj}
Output: Clusters: {Clust1, . . . , Clustc}
begin
    for every document Di ∈ D do
        1. Identify the coverage of the document tree DTi, δ(DTi) = {CF1, . . . , CFj′};
        2. for every CFj′ ∈ δ(DTi) do
               i. Extract the structure-constrained content in Di,
                  C(Di, CFj′) = {C(N1), . . . , C(Nm)}. The set of terms
                  C(Ni) = {t1, . . . , tk′} ∈ T, where T is the term list for the document Di;
               ii. Represent the PCF of Di as a vector of the sum of occurrences of the
                   unique terms in C(Di, δ(DTi));
           end
    end
    3. Combine the PCFs of all Di in D to form the ICF;
    4. Divide the ICF into two clusters Clustx and Clusty;
    5. while the similarity criterion of the content collection clusters Clustx and Clusty
       is greater than a threshold do
           Bisect Clustx and Clusty until the number of clusters obtained is c;
       end
end

Figure 5.2: High level definition of the HCX-V approach
5.3.1 Identifying the coverage of concise frequent subtrees
The coverage of the document trees is identified from the given set of concise frequent
subtrees for the document trees dataset. The concept of coverage of the document trees
for a given concise representation is defined below.
Definition: Coverage of a document tree
Let there be a document tree dataset DT , which on applying the proposed concise
frequent pattern mining methods results in a set of concise frequent subtrees, CF =
{CF1, . . . , CFj}. A document tree DTi ∈ DT is said to be covered by a CF subtree CFj′ ∈ CF if DTi preserves the same relationship among its nodes as that of CFj′. The coverage (δ) of a document tree DTi ∈ DT, denoted by δ(DTi), is the set of concise frequent subtrees {CF1, . . . , CFj′} ⊆ CF, where j′ ≤ j, that covers DTi; that is, δ(DTi) = {CF1, . . . , CFj′} if DTi preserves the same relationship among its nodes as that of each subtree in {CF1, . . . , CFj′}.
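As an illustrative sketch of the coverage computation, assume that document trees and concise frequent subtrees are simplified to sets of (parent label, child label) edges, so that "preserving the same relationship" reduces to edge-set containment. This is an assumption for illustration only; the actual method checks full induced or embedded subtree relationships.

```python
def coverage(doc_edges, concise_subtrees):
    """Return the coverage of a document tree: the names of all concise
    frequent subtrees whose node relationships are preserved in the tree.
    Trees are simplified to sets of (parent_label, child_label) edges,
    so 'preserved' reduces to edge-set containment here."""
    return {name for name, edges in concise_subtrees.items()
            if edges <= doc_edges}
```

For example, a document tree with edges {(book, title), (book, author), (author, name)} would be covered by any subtree whose edges all appear in that set.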
In the case of closed frequent subtrees that are both induced and embedded, there
exist some overlapping subtrees in the coverage of the document trees which are defined
below.
Definition: Overlapping subtrees
Let there be two closed frequent subtrees, CFg, CFh ∈ δ(DTi). These two closed frequent subtrees are called overlapping subtrees in δ(DTi) iff (1) CFg ⊂t CFh with either an induced or an embedded relationship, and (2) supp(CFg) = α and supp(CFh) = β, where α ≠ β.
Let there be a document tree dataset DT on which applying frequent subtree mining with a given support threshold s results in the frequent subtree result set O = {CF′1, CF′2, CF′3}, such that CF′2 ⊃t CF′1 and CF′2 ⊃t CF′3. The set contains three frequent subtrees having supports of s, s and s + 1 respectively. Based on the closure property, the closed subtrees will be CF′2 and CF′3, as there is a difference in the support values of CF′2 and CF′3. However, by maximality, there is only one maximal subtree, CF′2. If there is a document tree DTi such that DTi ⊃t CF′2 and DTi ⊃t CF′3, then similar content corresponding to both CF′2 and CF′3 would have to be extracted. In order to avoid
this redundancy in content, and to improve the efficiency of extraction, these overlapping subtrees are identified and only the supertrees are retained. In this situation, CF′2 is retained but CF′3 is removed, because the content of CF′2 includes the content of CF′3.
It should also be noted that overlapping subtrees do not occur for maximal subtrees, since the maximal subtrees capture only the supertrees, as shown in the example. Overlapping subtrees occur due to the presence of closed frequent subtrees having different supports and subtree relationships with each other. If such overlapping subtrees exist, then the subtree CFg is removed from δ(DTi). This process of removing overlapping subtrees is conducted to avoid redundancy, as these overlapping subtrees convey the same information for the given document. Thus, the removal process improves computational efficiency, and no disadvantage is expected from removing the overlapping subtrees.
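Under the same simplified edge-set representation used above (an assumption for illustration, not the thesis data structures), the overlap-removal step can be sketched as:

```python
def remove_overlaps(delta, supp):
    """Drop from the coverage every subtree that is a proper subtree of
    another covered subtree with a different support, keeping only the
    supertrees. `delta` maps subtree names to edge sets; `supp` maps
    subtree names to their supports."""
    kept = set(delta)
    for g, g_edges in delta.items():
        for h, h_edges in delta.items():
            # proper containment with differing supports => overlapping pair
            if g != h and g_edges < h_edges and supp[g] != supp[h]:
                kept.discard(g)
    return {name: delta[name] for name in kept}
```

Applied to the example above, the subtree playing the role of CF′3 (a proper subtree of CF′2 with a different support) is removed and only the supertree CF′2 survives.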
The next task in this method involves extracting the document content according to
its coverage. This document content is called the structure-constrained content. The
structure-constrained content contained within subtrees according to the coverage of a
DTi, δ(DTi), is defined below.
Definition: Structure-Constrained content features of an XML document according to its coverage.
The structure-constrained content features of a given CFj′ ∈ δ(DTi), C(Di, CFj′), of an XML document Di are a collection of node values corresponding to the node labels in CFj′ of Di. Further, the structure-constrained content features of the set of concise frequent subtrees in δ(DTi), C(Di, δ(DTi)), that represent an XML document Di are a collection of node values corresponding to node labels in its coverage δ(DTi).
The structure-constrained content of a CFj′ in Di ∈ D is retrieved from the XML
document Di. The sum of occurrences of the terms in the structure-constrained content
features is computed according to the coverage of the document. Given the coverage
of the document DTi, δ(DTi) = {CF1, . . . , CFj′} the structure-constrained content of
δ(DTi) is a collection of node values corresponding to the δ(DTi) given by C(Di, δ(DTi)) =
{C(Di, CF1), . . . , C(Di, CFj′)}, where each C(Di, CFj) ∈ C(Di, δ(DTi)) is C(Di, CFj) = {C(N1), . . . , C(Nm)} with m nodes, and C(Nm′) is the node value of node Nm′, where m′ ≤ m. The node value C(Nm′) of a node (or tag) in Di is a vector of terms, {t1, . . . , tk′}, that the node contains. Each of the terms is obtained after pre-processing the node values; the pre-processing steps involved in obtaining the terms will be discussed in subsection 5.3.2. For a given term tk′ in C(Di, CFj) = {t1, . . . , tk′}, the sum of the occurrences of the term tk′ over all concise frequent subtrees in δ(DTi) is computed as:

    ς(tk′) = Σ_{j=1}^{j′} tk′(CFj)        (5.1)

where tk′(CFj) denotes the number of occurrences of the term tk′ in C(Di, CFj).
The resulting vector, called the Pre-Cluster Form (PCF), containing the sum of occurrences of all the terms for δ(DTi), is generated for each document Di in the collection D. All these PCFs are combined in a matrix form called the Intermediate Cluster Form (ICF), which is essentially a VSM. The ICF is a matrix of the form D × T, where D represents the documents, T represents the terms present in all δ(DTi), and the value of a matrix cell is ς(tk′), the sum of the occurrences of the term in its vector.
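The construction of the PCFs and the ICF can be sketched as follows, assuming the structure-constrained content has already been extracted as one term list per covering subtree (the input layout and the function name are illustrative assumptions, not the thesis implementation):

```python
from collections import Counter

def build_icf(docs_content):
    """Build the ICF (document x term matrix). `docs_content` maps a
    document id to a list of term lists, one per covering concise frequent
    subtree. Each PCF sums term occurrences over all subtrees in the
    coverage (Equation 5.1); the PCFs are then stacked into a matrix."""
    pcfs = {d: Counter(t for subtree_terms in content for t in subtree_terms)
            for d, content in docs_content.items()}
    terms = sorted({t for pcf in pcfs.values() for t in pcf})
    icf = [[pcfs[d][t] for t in terms] for d in sorted(pcfs)]
    return terms, icf
```

A term that occurs in several covering subtrees of the same document is counted once per occurrence, so its cell value accumulates across subtrees, exactly as in Equation (5.1).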
As discussed before, in order to obtain the term from the node values the pre-processing
steps in the following subsection are applied.
5.3.2 Pre-processing of the structure-constrained content of XML doc-
uments
Similar to the pre-processing of the structure (as discussed in Section 4.2), the pre-processing phase of the structure-constrained content of XML documents involves four stages:
1. Stop-word removal
2. Stemming
3. Integer removal
4. Shorter length words removal
Stop-word removal
Stop words are words that are considered poor index terms and hence need to be removed prior to performing data analysis. Traditionally, stop words consist of terms which occur very frequently, such as 'the', 'of' and 'and'. In order to remove these stop words, a stop list (or stop word list) is generated and used to filter out words that make poor index terms [92].
The most common stop list available for English text, from Christopher Fox, contains 421 words [39]. Fox's stop list includes variants of the stop words, such as the word 'group' with its variants 'groups', 'grouped' and 'grouping'. It should be noted that not all common words can be considered stop words. For example, 'not' is a common word but it represents negation, so removing it changes the meaning of the sentence. Hence, care should be taken in choosing the common stop list.
Instead of choosing a common stop list, a list that suits the dataset under investigation should be considered. For example, the use of a common stop list causes the removal of the word 'back' even though 'back' (a part of the body) is a useful term in the medical domain. It is therefore essential to customise the stop list using domain-specific knowledge in order to avoid removing words that are important for specific domains. In this research, the stop word list has been customised by considering the tag names and the content of the XML documents for each relevant dataset.
Stemming
Stemming is a process that removes affixes (suffixes or prefixes) from words and/or replaces them with the root word. For example, the word 'students' becomes 'student' and the word 'says' becomes 'say'. Several well-known stemming algorithms have been developed [47, 74, 89, 91], and the strength and similarity of different stemming algorithms have been evaluated in [40]. The stemming process not only reduces the variety of words, and thereby the storage size, but also increases the performance of information retrieval systems [15]. Studies have also demonstrated that stemming enhances recall, a common measure in information retrieval [93]. This research uses the Porter stemming algorithm [91] for affix removal. The major reasons for using this algorithm are its simplicity [121] and the effectiveness of the results it produces.
Integer removal
Due to the huge size of some of the datasets in this research (the INEX IEEE, INEX 2007 and INEX 2009 datasets), a very large number of unique terms occur. Hence, it is essential to reduce the dimensionality of the dataset without incurring information loss. A careful analysis of the datasets revealed that they contained a large number of integers which did not contribute to the semantics of the documents, so these were removed in the pre-processing step.
154
Shorter length words removal
Based on the analysis of the datasets involved in this research, words with fewer than
4 characters were considered meaningless and were thus removed.
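The four stages can be sketched as a single pipeline. Note that the stop list here is a tiny illustrative one rather than the customised list used in the research, and `crude_stem` is a rough stand-in for the Porter stemmer actually used:

```python
STOP_WORDS = {"the", "of", "and", "that", "this"}  # tiny illustrative stop list

def crude_stem(word):
    """Very rough suffix stripping; the research uses the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    """Apply the four stages: stop-word removal, stemming, integer removal
    and removal of words shorter than 4 characters."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:        # 1. stop-word removal
            continue
        tok = crude_stem(tok)        # 2. stemming
        if tok.isdigit():            # 3. integer removal
            continue
        if len(tok) < 4:             # 4. shorter length words removal
            continue
        out.append(tok)
    return out
```

The ordering matters: stemming before the length filter means a word like 'says' is first reduced to 'say' and then dropped as too short.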
5.3.3 Representation of the structure-constrained content in ICF
After the pre-processing of the extracted content, the content is represented in a sparse VSM representation which retains only the non-zero values along with their term ids (the first element of each pair in Figure 5.3). This representation improves the efficiency of computation, especially for sparse datasets where the number of non-zero values is much smaller than the number of zeroes.

d1: (1,1) (4,2) (5,6)    d2: (1,3) (2,1) (4,2)    d3: (1,1) (2,7) (3,3) (4,2) (5,1) (6,1)

Figure 5.3: Sparse representation of an XML dataset modelled in VSM using term frequency (pairs of term id and frequency)
Often, the use of raw term frequency in the VSM suffers from the major issue that all terms are considered equally important; some terms therefore have little or no discriminating power when used for clustering the documents. Consider a document dataset from a publication domain in which the term "author" occurs in almost every document. In order to reduce the effect of such frequently occurring terms, weights are applied to the terms, and these term weights are used instead of the raw term frequencies when representing the terms in the matrix. Various schemes exist to compute the weights of the terms in the VSM. This research uses two schemes, term frequency-inverse document frequency (tf-idf) and Okapi BM-25, to weight the terms in documents.
Term weighting
The most popular term weighting scheme is the term frequency-inverse document
frequency (tf-idf) weighting. The weight vector for a given document Di is w(Di) = {w1,i, w2,i, . . . , wn,i}, where wtk′,i is the weight of the term tk′ and is calculated as:

    wtk′,i = tft · log(|D| / |d : tk′ ∈ d|)        (5.2)

where tft is the term frequency of the given term tk′ in document Di, and log(|D| / |d : tk′ ∈ d|) is the inverse document frequency (idf). |D| is the total number of documents in the XML dataset; |d : tk′ ∈ d| is the number of documents containing the term tk′. The idf of a rare term is high, while that of a frequent term is low.
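Equation (5.2) can be computed directly from per-document term counts; a minimal sketch (the dictionary-based document representation is an assumption for illustration):

```python
import math

def tf_idf(term, doc_counts, all_doc_counts):
    """Weight of `term` in one document, following Equation (5.2):
    tf * log(|D| / |{d : term in d}|). `doc_counts` is the {term: frequency}
    map of one document; `all_doc_counts` is the list of such maps for D."""
    tf = doc_counts.get(term, 0)
    df = sum(1 for d in all_doc_counts if term in d)
    if tf == 0 or df == 0:
        return 0.0  # term absent from the document or the collection
    return tf * math.log(len(all_doc_counts) / df)
```

A term occurring in every document gets idf = log(1) = 0, so its weight vanishes regardless of its frequency, which is exactly the behaviour wanted for terms like "author" above.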
Another popular weighting scheme is Okapi BM-25, which utilises similar concepts to those of tf-idf. However, this weighting scheme has two tuning parameters, K1 and b, which influence the effect of term frequency and document length respectively; the default values are K1 = 2 and b = 0.75. BM-25 weighting depends on three parameters: the Collection Frequency Weight (CFW), the term frequency (tft, defined before) and the Normalized Document Length (NDL).
The Collection Frequency Weight (CFW) for the term tk′ is

    CFW = log |D| − log(|d : tk′ ∈ d|)        (5.3)
The Normalized Document Length for a given document Di is the ratio of the length of the document Di to the average length of a document in D:

    NDL(Di) = DL(Di) / avg(DL(D))        (5.4)
where DL(Di) is the length of a document Di and avg(DL(D)) is the average length of a
document in D.
Combining these three parameters, the BM-25 weight for a given term tk′ is given by:

    Bf = (CFW · tft · (K1 + 1)) / (K1 · ((1 − b) + b · NDL(Di)) + tft)        (5.5)
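Equations (5.3) to (5.5) combine into a single weight; a minimal sketch with the stated defaults K1 = 2 and b = 0.75:

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=2.0, b=0.75):
    """BM-25 weight for a term, following Equations (5.3)-(5.5):
    CFW = log|D| - log(df),  NDL = doc_len / avg_doc_len,
    Bf  = CFW * tf * (k1 + 1) / (k1 * ((1 - b) + b * NDL) + tf)."""
    cfw = math.log(n_docs) - math.log(df)          # Equation (5.3)
    ndl = doc_len / avg_doc_len                    # Equation (5.4)
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)  # (5.5)
```

The denominator saturates the term-frequency contribution (unlike the linear tf in tf-idf) and penalises documents longer than average through NDL.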
5.3.4 Similarity measures
Once the terms in the structure-constrained content are represented in the ICF , which
is a VSM, a clustering method is applied on the ICF to generate the required number of
clusters. The similarity between each pair of PCFs in the ICF is computed. Let there
be two vectors of terms, di and dj, in the given ICF matrix for two documents Di and
Dj respectively. The similarity between the two vectors, di and dj, is computed using the
cosine similarity function,
cos θ =di.dj
|di||dj |(5.6)
The repeated bisection partitional clustering method is used in this thesis [59]. This
method divides the ICF into two groups and then selects one of the groups according to a
clustering criterion function and bisects further. This process is repeated until the desired
number of clusters is achieved. During each step of bisection, the cluster is bisected so
that the resulting 2-way clustering solution locally optimises a particular criterion function
[59].
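As a sketch of how Equation 5.6 is applied, the helper below computes the cosine similarity of two sparse term vectors held as dictionaries; the representation and names are illustrative, not the thesis code.

```python
import math

def cosine_similarity(di, dj):
    """Cosine similarity (Equation 5.6) of two sparse term vectors held
    as {term: weight} dicts; an illustrative helper."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

sim = cosine_similarity({"origin": 0.4, "species": 0.7},
                        {"origin": 0.3, "darwin": 0.9})
```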
5.4 Using the Tensor Space Model (TSM)
Using the ICF in the VSM to represent the structure and the content of XML documents
implicitly might not be enough to model both the structure and the content features of
XML documents effectively. This is due to the loss of direct or explicit mapping between
the structure and its corresponding content. Figure 5.4 shows that, for a given document
(Figure 5.4(a)) and a set of concise frequent subtrees CF1 and CF2, representing the content in
the VSM using HCX-V captures the structure implicitly; however, the explicit relationship
between these two features is lost. For instance, if there is one document in the collection
which contains “John Murray” as the publisher name and another document in the same
collection that contains “John Murray” as the author name, the similarity of the terms in
the VSM could put these two documents together; however, the structure of the documents
makes their context different.
Figure 5.4: Comparison of VSM and TSM: (a) sample XML document; (b) concise frequent subtrees; (c) Vector Space Model (VSM) for (a) and (b) using HCX-V; and (d) Tensor Space Model (TSM) for (a) and (b). The sample document in (a) is:

<Book Id="B105"> <Title>On the Origin of Species</Title> <Author> <Name>Charles Darwin</Name> </Author> <Publisher> <Name>John Murray</Name> <Place>London</Place> </Publisher> <Year>1859</Year> </Book>
Thus, the content and the structure features inherent in an XML document should be
modelled in a manner that ensures that the mapping between the content of the subtree
could be preserved and used in further analysis. Hence in this section, a novel method
of representing the XML documents in the Tensor Space Model (TSM) and utilising the
TSM for clustering is proposed. In the TSM, the content corresponding to its structure
is stored, which helps to analyse the relationship between the structure and the content.
By utilising the TSM for clustering, not only the intrinsic properties of the structure and
the content features but also the relations between these two features are used.
5.4.1 Background
Firstly, the preliminaries of tensors are provided. Tensor notations and conventions used
in this research are akin to the notations used by previous works [37, 49, 57, 112]. Table
5.1 provides the tensor notations.
Table 5.1: Tensor notations and descriptions

Notation   Definition and Description
a, b       Scalars
a, b       Vectors (bold lowercase)
A, B       Matrices
Aij        Element (i, j) of a matrix A
T          Tensor, shown using calligraphic font
Tijk       Entry (i, j, k) of a tensor T
×n         n-mode product
◦          Vector outer product
N          Number of orders (also called modes, ways or ranks)
5.4.1.1 Tensor concepts
The tensor concepts will be defined and described in this subsection.
Definition: Tensor
A tensor T is a multi-mode array. The mode of a tensor is the number of dimensions,
also known as orders or ways (used interchangeably in this thesis). Figure 5.5 compares
the mode-1 (vector), mode-2 (matrix) and mode-3 tensors.
Figure 5.5: Comparison of vector, matrix and tensor
Definition: Norm of a Tensor
The norm of a mode-N tensor T ∈ R^{I1×I2×...×IN} is defined as the square root of the
sum of the squared entries t in the tensor T:

‖T‖ = √( Σ_{i1=1..I1} Σ_{i2=1..I2} ... Σ_{iN=1..IN} t²_{i1 i2 ... iN} )   (5.7)
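Equation 5.7 is simply the Euclidean norm of the vector of all tensor entries, regardless of the number of modes; a quick numpy illustration (with an arbitrary toy tensor):

```python
import numpy as np

T = np.arange(1, 9, dtype=float).reshape(2, 2, 2)  # a tiny mode-3 tensor
norm = np.sqrt((T ** 2).sum())                     # Equation 5.7
# identical to the Euclidean norm of the flattened entries
assert np.isclose(norm, np.linalg.norm(T.ravel()))
```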
Definition: Tensor Fiber
A tensor fiber is a one-dimensional fragment of a tensor, obtained by fixing all indices
but one. Fibers are the higher-order or higher-mode analogue of rows and columns,
as shown in Figure 5.6. A column vector is a mode-1 or column fiber,
denoted by t:jk. A row vector is a mode-2 or row fiber, denoted by ti:k. Finally, the tube
vector is a mode-3 or tube fiber, denoted by tij:.
Figure 5.6: Fibers of a mode-3 tensor: (a) column fibers; (b) row fibers; (c) tube fibers
Definition: Tensor Slice
A tensor slice is a two-dimensional fragment of a tensor, obtained by fixing all indices
but two. The three types of slices of a mode-3 tensor T , horizontal, lateral, and frontal,
are denoted by Ti::, T:j:, and T::k respectively, as shown in Figure 5.7.
Figure 5.7: Slices of a mode-3 tensor: (a) horizontal slices; (b) lateral slices; (c) frontal slices
5.4.1.2 Tensor operations
The two main types of tensor operations related to this research will be discussed here.
They are:
• Matricization
• n-mode (matrix) product
Matricization
For some computations, it is necessary to treat the entire tensor in matrix form. To do
this, the process of matricization is applied by rearranging the elements of a tensor into
a matrix, as shown in Figure 5.8. Essentially, this means that the mode-1 fibers of T are
mapped to the columns of matrix T(1), so that mode 1 indexes the rows of this matrix
while modes 2 and 3 together index its columns.
Figure 5.8: Mode-1 matricization of a mode-3 tensor

There are several ways [61, 63, 70] of ordering the columns for matricization, but this
ordering is not important as long as the same ordering is retained in further calculations
[63]. This process is also referred to as “unfolding” or “flattening”. There are three ways
of matricizing a mode-3 tensor [70], as shown in Figure 5.9.
Figure 5.9: Mode-n matricization of a mode-3 tensor
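The unfolding operation described above can be sketched in a few lines of numpy; the helper name `unfold` and the toy tensor are illustrative.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: the mode-n fibers become the columns of the
    result. This uses one fixed column ordering; as noted above, any
    ordering works as long as it is used consistently."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)   # a small mode-3 tensor
T1 = unfold(T, 0)                    # shape (2, 12)
T2 = unfold(T, 1)                    # shape (3, 8)
T3 = unfold(T, 2)                    # shape (4, 6)
```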
n-mode (matrix) product
The n-mode (matrix) product of a tensor T ∈ R^{I1×I2×...×IN} with a matrix X ∈ R^{J×In},
denoted by T ×n X, is a tensor of size I1×I2×...×In−1×J×In+1×...×IN. Essentially,
this means that each mode-n fiber is multiplied by the matrix X. Elementwise, it is
given by

(T ×n X)_{i1 ... in−1 j in+1 ... iN} = Σ_{in=1..In} t_{i1 i2 ... iN} x_{j in}   (5.8)

The n-mode (matrix) product of a tensor T with the matrix X is equivalent to
multiplying X by the appropriate flattening of T, which is expressed as:

Y = T ×n X  ⟺  Y(n) = X T(n)   (5.9)
Some interesting points to note are:

• if the modes of multiplication are different, then the order of multiplication is irrelevant, as shown in the equation below:

T ×m P ×n Q = T ×n Q ×m P   if m ≠ n   (5.10)

• if a tensor T is multiplied by two matrices along the same mode n, then the following equation holds true:

T ×n P ×n Q = T ×n (QP)   (5.11)
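A sketch of the n-mode product, computed through the matricization identity of Equation 5.9, and a numerical check of the commutation property of Equation 5.10; the helper name and toy sizes are illustrative.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """n-mode product T x_n X, computed via the matricization identity
    Y(n) = X T(n) (Equation 5.9). Illustrative sketch, not thesis code."""
    # unfold along `mode`: mode-n fibers become the columns of t_n
    t_n = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    y_n = matrix @ t_n                      # Y(n) = X T(n)
    # fold back, with I_n replaced by J = matrix.shape[0]
    other = [s for i, s in enumerate(tensor.shape) if i != mode]
    return np.moveaxis(y_n.reshape([matrix.shape[0]] + other), 0, mode)

T = np.random.rand(3, 4, 5)
P = np.random.rand(6, 4)                    # multiplies along mode 1
Q = np.random.rand(7, 5)                    # multiplies along mode 2
# products along different modes commute (Equation 5.10)
left = mode_n_product(mode_n_product(T, P, 1), Q, 2)
right = mode_n_product(mode_n_product(T, Q, 2), P, 1)
assert np.allclose(left, right)
```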
5.4.1.3 Tensor decomposition techniques
In order to analyse the tensors, decomposition techniques are applied in a manner similar
to SVD. SVD is a well-known factorisation given by
X = UΣVT (5.12)
SVD decomposes a matrix into a sum of mode-1 matrices. In other words, the matrix
X ∈ R^{I×J} can also be expressed as a minimal sum of mode-1 matrices:

X = σ1(u1 ◦ v1) + σ2(u2 ◦ v2) + ... + σr(ur ◦ vr)   (5.13)

where ui ∈ R^I and vi ∈ R^J for i = 1, 2, ..., r; ui and vi are the i-th columns of U and
V respectively. The numbers σi on the diagonal of the diagonal matrix Σ are the
singular values of X, where r is the mode or rank of the matrix X.
Two of the main applications of decompositions are Principal Component Analysis
(PCA) and Latent Semantic Indexing (LSI). Extending SVD to higher-mode tensors is
complicated, since the mode concept for tensors becomes indistinct.
In essence, the purpose of tensor decomposition is to rewrite the tensor as a sum of
mode-1 tensors. For a tensor T ∈ R^{I×J×K}, it could be expressed as:

T = (u1 ◦ v1 ◦ w1) + (u2 ◦ v2 ◦ w2) + ... + (ur ◦ vr ◦ wr)   (5.14)

where ui ∈ R^I, vi ∈ R^J and wi ∈ R^K for i = 1, 2, ..., r.
The minimal representation for the tensor SVD is not always orthogonal, which
implies that the vectors ui, vi and wi do not necessarily form orthonormal sets. For this
reason, tensor decomposition imposes no orthogonality constraint on these vectors.
Tensor decompositions enable an overview of the relationships that can be further
used in clustering. There are several tensor decomposition techniques, amongst which the
most popular are the CANDECOMP/PARAFAC (CP) [61] and Tucker [113] decomposi-
tions. CP decomposes a tensor as a sum of rank-one tensors (or vectors), and the Tucker
decomposition is the higher-order form of principal component analysis [63].
CP decomposition of a tensor T is given by

T ≈ Σ_{r=1..m} a_r ◦ b_r ◦ c_r   (5.15)

where m is a positive integer, ◦ represents the vector outer product (which means that each
element of the tensor is the product of the corresponding vector elements) and a_r ∈ R^I,
b_r ∈ R^J and c_r ∈ R^K.
Tucker decomposes a tensor into a core tensor multiplied (or transformed) by a matrix
along each mode. Hence, in the three-way case where T ∈ R^{I×J×K}, it becomes

T ≈ Y ×1 A ×2 B ×3 C = Σ_{p=1..P} Σ_{q=1..Q} Σ_{r=1..R} g_pqr (a_p ◦ b_q ◦ c_r)   (5.16)

In this equation, A ∈ R^{I×P}, B ∈ R^{J×Q} and C ∈ R^{K×R} are the factor matrices
(which are usually orthogonal). These factor matrices are the principal components in
each mode. The tensor Y ∈ R^{P×Q×R} is called the core tensor and its entries show the
level of interaction between the different components [63].
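To make Equation 5.15 concrete, the sketch below reconstructs a mode-3 tensor from given CP factor matrices with a single einsum over the shared rank index. The factors are random placeholders; no fitting procedure (such as alternating least squares) is performed.

```python
import numpy as np

# Rank, sizes and factors are arbitrary placeholders
I, J, K, m = 3, 4, 5, 2
A = np.random.rand(I, m)   # columns are the vectors a_r
B = np.random.rand(J, m)
C = np.random.rand(K, m)

# sum over r of the outer products a_r o b_r o c_r, as one einsum
T = np.einsum('ir,jr,kr->ijk', A, B, C)
# entry (i, j, k) is sum_r A[i, r] * B[j, r] * C[k, r]
assert np.isclose(T[0, 0, 0], sum(A[0, r] * B[0, r] * C[0, r] for r in range(m)))
```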
5.4.2 Modelling in tensor space – An overview
This subsection looks at modelling the XML documents in the TSM. Given the document
set D, its corresponding set of Concise Frequent (CF) subtrees and the set of terms for
each CF subtree (T), the collection of XML documents is now represented as a mode-3
tensor T ∈ R^{D×CF×T}, as shown in Figure 5.10. The tensor is populated with the number
of occurrences of the structure-constrained terms tk ∈ {t1, . . . , tK} corresponding to the
CFj ∈ {CF1, . . . , CFJ} for a document Di ∈ {D1, . . . , Dn}.
As mentioned in Chapter 2, TSM has a critical problem: a TSM is not scalable for very
Figure 5.10: Visualisation of a mode-3 tensor for the XML document dataset
large and dense datasets, since capturing all the terms corresponding to concise frequent
subtrees results in a very large-sized tensor. Hence, to alleviate the problem of scalability,
two optimisation techniques are applied on the dimensions, CF and T , to reduce the size
of the tensor.
Figure 5.11 provides an overview of the HCX-T method. It begins by grouping into
structural clusters the concise frequent subtrees generated using one of the proposed con-
cise frequent subtree mining methods. The use of structural clusters helps in grouping
similar concise frequent subtrees together. These structural clusters are then used to ex-
tract the content features from the documents. Once the structure and content features
are obtained for each document, the documents are represented in the TSM along with
their structure and content features. The next task is to decompose the created TSM to
obtain factorised matrix Uη. Lastly, the K-means algorithm or a partitional clustering
algorithm is applied on the left singular matrix for the “Document” dimension Uη to
obtain the clusters of documents.
Input: XML Document Dataset D, Document Tree Dataset DT, Number of Clusters c, RI Vector Length γ, Concise Frequent Subtrees CF = {CF1, CF2, . . . , CFJ} and Number of Required Dimensions η
Output: Clusters {Clust1, . . . , Clustc}

1. Form clusters of similar concise frequent subtrees in CF, CFSC = {CFSC1, . . . , CFSCh}, h ≤ J, where each CFSCh is a set of similar CF subtrees, using the CLOPE algorithm;
2. for every document Di ∈ D do
       Identify the CFSCs existing in the document tree DTi ∈ DT, δ(DTi) = {CFSCl, . . . , CFSCh};
       for every CFSCj in δ(DTi) do
           Retrieve the structure-constrained content in Di, C(Di, CFSCj) = C(N1), . . . , C(Nm). The term set C(Nm) = {t1, . . . , tk} ⊆ T, where T is the term list in D;
       end
   end
3. Apply random indexing using random vectors of length γ on the term collection to reduce the term space to T′;
4. Form a tensor T ∈ R^{D×CFSC×T′}, where each tensor element is the number of times a term tk occurs in CFSCj for a given document Di, as represented in C(Di, CFSCj);
5. Apply the proposed tensor decomposition algorithm PTCD to the tensor T and obtain the resulting left singular matrix Uη;
6. Apply clustering on Uη to generate the c clusters;

Figure 5.11: High-level definition of the HCX-T approach
5.4.3 Generation of structure features for TSM
One of the frequent pattern mining methods proposed in Chapter 4 is used to generate
concise frequent subtrees. However, as stated in [126], a small change in the support
threshold (particularly at lower support thresholds) may, depending on the nature of the
dataset, generate hundreds of concise frequent patterns that cannot be pruned by
concise frequent pattern mining alone. Capturing the
content with all the concise frequent subtrees and representing them in the TSM will be
more expensive than using the VSM, since in HCX-V the concise frequent subtrees for
each document are joined in its coverage, based on the structure of the document. The
content corresponding to the joined subtrees in that document is retrieved and represented
in the VSM.
Moreover, it can be seen from Figure 5.11 that checking whether the mined CF subtrees
exist in a given document tree or not is a computationally expensive operation. This
problem arises due to the graph isomorphism problem. This step can be optimised by using
a group of similar subtrees based on the similarity of the subtrees, and then retrieving the
content corresponding only to the group of similar CF , instead of comparing each CF
tree against all documents. This step helps to reduce the computational complexity of
checking whether the CF subtrees are present in a given document tree.
The CLOPE [130] algorithm for clustering transactional data has been modified to
include subtrees rather than items in order to conduct the substructure clustering of the
CF trees based on the similarity of the subtrees. The cluster of CF subtrees, Concise
Frequent Subtree Cluster (CFSC), becomes a tensor dimension for representing and
analysing XML documents. Let CFSC be a set of such clusters, where each cluster CFSCj
is given by {CFp, CFq, CFr}, with CFp, CFq, CFr ∈ CF.
5.4.4 Generation of content features for TSM
The Concise Frequent Subtree Cluster, CFSC, representing the structure of the XML doc-
uments is used in retrieving the structure-constrained content from the XML documents.
The coverage of a CFSCj ∈ CFSC and its constrained content for the given document
Di will now be defined. Compared with the content features of an XML document, the
structure-constrained content features include the node values corresponding only to the
node labels of CF subtrees that form CFSCj.
Definition 1: Structure-Constrained content features of an XML document
according to the CFSC that covers it.
The structure-constrained content features of a given CFSCj, C(Di, CFSCj) of an
XML document Di, are a collection of node values corresponding to the node labels in the
CFSCj where CFSCj is a cluster of CF subtrees corresponding to DTi.
The node value C(Ni) of a node (or tag) of a CFSCj ∈ CFSC in Di is a vector
of terms {t1, . . . , tk} that the node contains. These terms are obtained after stop-
word removal and stemming. Firstly, the CF subtrees corresponding to the CFSCj =
{CFr, . . . , CFs} for a given document Di are flattened into their nodes {N1, . . . , Nm} ∈ N,
where N is the list of nodes in DT. Then the node values of {N1, . . . , Nm} are accumulated
and their occurrences for a document Di are recorded.
In large datasets, the number of terms in the structure-constrained content is very
large in tensor space, as shown in Table 5.2, with the actual values given in parentheses
(“M” denotes millions of entries). These are the numbers of terms obtained even after
stop-word removal and stemming.
Table 5.2: Summary of the term size and tensor entries in the INEX 2009 and INEX IEEE datasets

Dataset      Term Size             # Non-zero Tensor Entries
INEX 2009    ≈ 1M (1,026,857)      ≈ 127M (127,961,025)
INEX IEEE    ≈ 0.18M (176,407)     ≈ 14M (14,053,618)
Random Indexing
To reduce this very large term space, the popular term space reduction technique
Random Indexing (RI) is applied. RI techniques have been favoured by many
researchers due to their simplicity and low computational complexity [75], in comparison
to other popular dimensionality reduction methods such as SVD and PCA, which are
computationally expensive at about O(mn²) for an m × n matrix. In RI, each term in the
original space is given a randomly generated index vector, as shown in Figure 5.12. These
index vectors are sparse in nature and have ternary values (0, -1 and 1). The sparsity of the
index vectors is controlled via a seed length that specifies the number of randomly selected
non-zero features.
Equation 5.17, proposed by Achlioptas [3], is used to generate the distribution for creating
the random index vector for every term in the structure-constrained content of CFSC.

r_ij = √3 × { +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 }   (5.17)
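A minimal sketch of index-vector generation: it follows the seed-length mechanism described above (a fixed number of randomly placed non-zeros, balanced between +1 and −1) rather than sampling each coordinate independently as in Equation 5.17, and it omits the √3 scaling, which does not affect cosine similarities. All names are illustrative.

```python
import random

def random_index_vector(length, seed_len, rng=random):
    """One sparse ternary index vector: `seed_len` randomly placed
    non-zeros, alternating +1 and -1, the rest zeros. Illustrative
    variant of the Equation 5.17 distribution, without the sqrt(3)
    scaling factor."""
    vec = [0] * length
    for n, pos in enumerate(rng.sample(range(length), seed_len)):
        vec[pos] = 1 if n % 2 == 0 else -1
    return vec

# one index vector per term; summing the vectors of the terms occurring
# in a CFSC for a document gives its reduced representation
v = random_index_vector(length=6, seed_len=2)
```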
For a given document Di, the index vectors of length l for all the terms corresponding
to the given CFSCj are added. This concept of RI on a tensor can be explained using
Figure 5.12. Consider a tensor T ∈ R^{3×2×4} (Figure 5.12(a)) with 3 documents, 2 CFSCs,
4 terms and 7 non-zero entries. The entries in the tensor correspond to the occurrences
of a given term in the given CFSC for the document. Using Equation 5.17, the random
index vectors of length 6 for the 4 terms are generated as shown in Figure 5.12(b). Let us
consider the document D1 with three tensor entries a121 = 1, a123 = 1 and a124 = 1 that
correspond to CFSC2 and three terms Term1, Term3 and Term4. For the three terms
in D1 their random vectors (from Figure 5.12(b)) are added. The resulting vector (a12:)
for D1, given in Figure 5.12(c), contains two non-zero entries. The sparse representation
of the resulting vector is obtained by using only the non-zero entries in the vector, which
finally results in two tensor entries a123 = 1 and a124 = −1. Figure 5.12(d) shows the
final reduced tensor Tr in sparse representation containing 6 non-zero entries. From this
example, it can be seen that even for such a small dataset, the RI technique could reduce
the term space.
It can be seen that the number of entries in the randomly reduced T , Tr, is less than
its original T and that Tr maintains the shape of T as it retains the similarity that exists
between D2 and D3. The index vectors in RI are sparse; hence they require less memory
and can be added faster. The randomly-reduced structure-constrained content of
CFSC becomes another tensor mode for representing and analysing XML documents.
5.4.5 The TSM representation, decomposition and clustering
Given the tensor T , the next task is to find the hidden relationships between the dimen-
sions. A tensor decomposition algorithm enables an overview of the relationships that can
be further used in clustering. However, as already mentioned in Chapter 2, most of these
decomposition techniques cannot be applied on very large or dense tensor as the tensors
cannot be loaded into memory. To alleviate this problem, the tensors need to be built and
unfolded or matricized incrementally. The process of matricization or unfolding along the
1-mode of T will result in a matrix T(1). This means that the mode-1 fibers (the higher-order
analogue of rows and columns) are set as the columns of the resulting matrix.

Figure 5.12: Illustration of Random Indexing (RI) on a mode-3 tensor resulting in a randomly reduced tensor Tr.
Now the proposed Progressive Tensor Creation and Decomposition (PTCD) shown in
Figure 5.13 is applied to progressively build and then decompose the tensor using Singular
Value Decomposition (SVD). PTCD takes as input the tensor data file, TF , that contains
the entries to build the tensor. The input also includes the size of the block that is used
to build the tensor and the number of modes in the given tensor.
Input: Tensor data file TF, Block size b, Number of modes (orders) M where m ∈ {1, 2, . . . , M} and Number of required dimensions η
Output: Matricized tensor T(1) and left singular matrix with η dimensions for mode 1, Uη

begin
    1. for every T(m) ∈ {T(1), T(2), . . . , T(M)} do
           Initialize T(m) = ∅;
       end
    2. Divide TF into blocks of size b;
    3. for every block b do
           Create tensor Tb;
           for m = 1 to M do
               // Matricize the tensor
               T′(m) = unfold Tb along its m-th mode;
               // Update the mode-m matricized tensor
               T(m) = T(m) + T′(m);
           end
       end
    4. Compute SVD on T(1) with η dimensions, T(1) ≈ Uη Ση Vη^T;
end

Figure 5.13: Progressive Tensor Creation and Decomposition algorithm (PTCD)
The motivation for this new tensor decomposition algorithm is that other decompositions
compute and store fully formed matrices, which are dense and hence cannot
scale to very large tensors. In contrast, PTCD stores the sparse matrices generated
progressively and enables further processing to be performed on the tensor. PTCD builds
the tensor progressively by unfolding the tensor entries for the user-defined block size b
to a sparse matrix T′(m), where mode m ∈ {1, 2, . . . ,M} and M is the number of modes.
This unfolded matrix, T′(m), is then used to update the final sparse matrix T(m). After
updating all the tensor entries to the final matrix T(m), it is then decomposed using SVD
which results in the left singular matrix, Uη and right singular matrix, Vη. The tensor
decomposition results in clustering of the data as theoretically and empirically proved by
Huang et al. [49]. K-means clustering is then applied on Uη to generate the required
number of clusters c. K-means is used after tensor decomposition to assign every
document to one of the c clusters and to benchmark the clustering solution against
other state-of-the-art clustering methods.
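A compact sketch of the PTCD idea, restricted to the mode-1 matricization: tensor entries are consumed in blocks, accumulated into a sparse structure, and the final matrix is decomposed with SVD. The dict-based sparse matrix, the densification and all names are for illustration only; a real implementation would keep the matrix sparse and use a truncated sparse SVD.

```python
import numpy as np

def ptcd_mode1(entries, shape, block_size, eta):
    """Sketch of PTCD restricted to mode 1: accumulate the mode-1
    matricized tensor block by block in a dict acting as a sparse
    matrix, then decompose it with SVD. `entries` holds (i, j, k, value)
    tuples; the names and block handling are illustrative."""
    sparse = {}                             # (row, col) -> value, running T(1)
    for start in range(0, len(entries), block_size):
        for i, j, k, v in entries[start:start + block_size]:
            col = j * shape[2] + k          # modes 2 and 3 index the columns
            sparse[(i, col)] = sparse.get((i, col), 0.0) + v
    # densified only for this small demo; a real implementation would keep
    # the matrix sparse and use a truncated sparse SVD instead
    T1 = np.zeros((shape[0], shape[1] * shape[2]))
    for (r, c), v in sparse.items():
        T1[r, c] = v
    U, s, Vt = np.linalg.svd(T1, full_matrices=False)
    return U[:, :eta]                       # U_eta; cluster documents on it

entries = [(0, 0, 0, 1.0), (0, 1, 2, 1.0), (1, 0, 0, 2.0), (2, 1, 1, 1.0)]
U = ptcd_mode1(entries, shape=(3, 2, 3), block_size=2, eta=2)
```

K-means (or any partitional method) can then be run on the rows of `U`, one row per document.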
5.5 Empirical evaluation
Having discussed the proposed clustering methods, HCX-V and HCX-T, in the previous
sections, this section conducts their empirical evaluation. The main focus of this section
is:
• to empirically investigate the accuracy of the clustering solution, that is, understanding how the proposed clustering methods behave on different datasets;
• to understand how the implicit and explicit combination of structure and content of XML documents works for the different real-life datasets;
• to provide a basis of comparison for different types of subtrees on the clustering solution;
• to understand the scalability of the proposed clustering methods;
• to investigate the sensitivity of the parameters used in the proposed methods; and
• to analyse how the clustering solution will be useful in other applications, such as the collection selection problem.
The experiments presented here illustrate the suitability of the proposed methodology
for various types of XML datasets detailed in Chapter 3. The ACM dataset is chosen to
understand how the proposed techniques scale for small-sized datasets and to evaluate the
clustering solution using categories generated from both the structure and the content of
the XML documents. The DBLP dataset is used to compare the efficiency of the proposed
clustering methods with shorter-length documents. Furthermore, the DBLP dataset con-
tains documents from varied sources such as conferences, books, journals, theses, technical
reports and web pages. INEX IEEE, INEX 2007 and INEX 2009 datasets were chosen to
demonstrate the applicability of the proposed methods on large real-life datasets based on
Wikipedia pages. These three datasets have also been used for benchmarking clustering
tasks in the INEX forum. Also, the clustering task submission results on these datasets
were used as benchmarks to evaluate the efficiency of the proposed clustering methods.
The following subsections conduct the analysis of the accuracy, sensitivity, time com-
plexity and scalability of the clustering methods.
5.5.1 Accuracy of clustering methods
One of the goals of developing the clustering methods is to improve the accuracy of the
clustering solution for a range of real-life datasets over other representations and state-
of-the-art clustering methods. This section conducts the analysis of the different types of
real-life datasets considered in this thesis, based on the accuracy of the clustering solution.
It uses the evaluation metrics detailed in Chapter 3, such as purity, F1-measure and NMI.
5.5.1.1 ACM dataset
Figure 5.14 shows the clustering results conducted on the ACM dataset using both HCX-V
and HCX-T for 5 categories.
Figure 5.14: Results of clustering on the ACM dataset using 5 categories: (a) Purity; (b) F1-measure; (c) NMI. The methods compared are HCX-T, HCX-V, MACH, Tucker, CP, S+C, CO and SO.
The results clearly indicate that HCX-T performs better than HCX-V for various types
of subtrees. Also, the solution provided by using the Tucker decomposition is not suitable
for this dataset.
Figure 5.15 shows that most of the clustering methods achieve a perfect purity of 1
for the structure-based category, as it is a simple category based on the DTD of the XML
documents. However, the S+C combination fails to provide a perfect solution, which
shows that it is not suitable when the clustering solution depends only on the structure of the
XML documents.
Figures 5.16 and 5.17 provide a comparison of the accuracy of the clustering solution
over the different types of concise frequent subtrees. In this dataset, it is interesting to
note that the use of CFI subtrees, rather than any other type of subtrees, improves the
Figure 5.15: Results of clustering on the ACM dataset using 2 categories: (a) Purity; (b) F1; (c) NMI
accuracy of the clustering solution in the TSM. However, the clustering solution produced
by HCX-T performs much better even with the CFI subtrees. On the contrary, HCX-V
performs better with CFIConst subtrees. This could be due to the presence of shorter
length implicit relationships between XML documents. The use of longer patterns present
in the CFI subtrees in HCX-V adversely impacts the accuracy of the clustering solution.
This shows the effectiveness of CFIConst subtrees over CFI subtrees, which, as shown in
the previous Chapter, are faster to generate.
Figure 5.16: Results of clustering on the ACM dataset using different types of subtrees for HCX-V: (a) Purity; (b) F1; (c) NMI
Figure 5.17: Results of clustering on the ACM dataset using different types of subtrees for HCX-T
A comparison of HCX-T with and without the random indexing option was conducted
to assess the impact of random indexing. From Table 5.3, it is clear that the random
indexing option is not only suitable for reducing the term size, which is especially
required for large datasets, but also improves the accuracy of the clustering solution.
Furthermore, the time taken to decompose, given by λd, is reduced for randomly indexed
datasets due to the reduced term size in the tensor, as shown in Table 5.3. The impact of
dimensionality reduction using RI on the individual clusters is also studied in Figure 5.18,
which clearly shows that the reduction does not affect the quality of these solutions.
Table 5.3: Impact of dimensionality reduction on the clustering results on the ACM dataset

Methods             Micro-purity  Macro-purity  Micro-F1  Macro-F1  NMI   λd
HCX-T with RI       0.91          0.92          0.91      0.91      0.82  0.5
HCX-T without RI    0.83          0.93          0.87      0.83      0.73  4.5
Figure 5.18: Impact of RI on the quality of individual clusters on the ACM dataset: (a) Purity; (b) F1-measure
5.5.1.2 DBLP dataset
Next, the clustering methods are evaluated on the medium-sized DBLP dataset. Figure
5.19 shows that on the DBLP dataset, HCX-T performs better than all the other
clustering methods. A comparison of the decomposition algorithms reveals that both CP
and PTCD perform equally well, but Tucker fails to provide comparable results, with a
noticeable drop in Micro-purity, Macro-F1 and NMI values.
Figure 5.19: Results of clustering on the DBLP dataset: (a) Purity; (b) F1-measure; (c) NMI. The methods compared are HCX-T, HCX-V, MACH, Tucker, CP, S+C, CO and SO.
Figures 5.20 and 5.21 provide a comparison of the accuracy of the clustering solutions
generated by HCX-V and HCX-T using various types of subtrees. They again confirm
the results observed on the ACM dataset: CFIConst performs better than CFI.
It is interesting to note that, in HCX-T, with the use of CFE subtrees, the clustering
results are similar to those of using CFIConst. This could be due to the nature of this
dataset, with shorter trees resulting in fewer ancestor-descendant relationships.

Figure 5.20: Results of clustering on different types of concise frequent subtrees on the DBLP dataset using HCX-V: (a) Purity; (b) F1-measure; (c) NMI
[Figure 5.21: Results of clustering on different types of concise frequent subtrees on
the DBLP dataset using HCX-T. (a) Micro- and Macro-purity; (b) Micro- and Macro-F1;
(c) NMI, for the same eight subtree types.]
The results provided in Table 5.4 show that on the DBLP dataset, with the random
indexing option, the quality of the clusters is improved and the decomposition time
is also reduced, confirming the results on the ACM dataset.
Table 5.4: Impact of dimensionality reduction on the clustering results on the DBLP dataset

Methods             Micro-purity  Macro-purity  Micro-F1  Macro-F1  NMI   λd
HCX-T with RI       0.98          0.94          0.97      0.47      0.72  3.4
HCX-T without RI    0.93          0.92          0.93      0.45      0.64  8.6
5.5.1.3 INEX2007 dataset
The results of clustering shown in Table 5.5 reveal that on the INEX 2007 dataset, HCX-V
performs much better than HCX-T. HCX-V is also better than other representations such
as S+C, SO and CO methods.
Table 5.5: Results of clustering on the INEX 2007 dataset

Methods              Micro-Purity  Macro-Purity
HCX-T                0.27          0.28
HCX-V                0.59          0.66
S+C                  0.36          0.43
MACH                 0.25          0.25
Tucker               Fails         Fails
CP                   Fails         Fails
CO                   0.55          0.64
SO                   0.25          0.26
CRP [131]            0.44          0.49
4RP [131]            0.42          0.49
GraphSOM [45]        0.26          0.27
Word-descriptor [41] 0.58          0.67
Two main reasons emerge for the decreased accuracy of HCX-T:

• the nature of the categories in this dataset; and

• the nature of the XML documents.

As mentioned in Chapter 3, the 21 categories are based on the content and not on the
structure of the XML documents. Furthermore, the XML documents are in eXtensible
HyperText Markup Language (XHTML) format rather than plain XML. XHTML is an
extension of HTML that uses XML syntax, so these documents have more formatting
tags and fewer semantic tags.
The very large number of formatting tags and the small number of semantic tags
have impacted the performance of combining both the structure and the content of
the XML documents for HCX-T on this dataset. HCX-V utilises the structure implicitly,
and its representation for clustering uses the content; this has resulted in HCX-V
performing better than HCX-T. Furthermore, by excluding the content of the infrequent
subtrees, the clustering solutions produced by HCX-V also outperform CO.
Figure 5.22 shows that, amongst the different types of concise frequent subtrees, the
CFI subtrees perform the best when used in clustering.
[Figure 5.22: Results of clustering methods using different types of concise frequent
subtrees on the INEX 2007 dataset using HCX-V. (a) Micro- and Macro-purity; (b) Micro-
and Macro-F1; (c) NMI.]
A comparison of the various subtrees using the TSM was not conducted as there was
minimal difference (≈ ± 0.002) in their accuracy.
5.5.1.4 INEX IEEE dataset
A comparison of the clustering methods on the INEX IEEE dataset, shown in Table 5.6,
indicates that the proposed HCX-T is not only scalable for this large-sized and deeply
structured dataset but also provides a more accurate clustering solution than the other
clustering methods that use both structure and content.
Table 5.6: Results of clustering on the INEX IEEE dataset using 18 categories

Methods            Micro-F1  Macro-F1
HCX-T              0.35      0.34
HCX-V              0.29      0.27
S+C                0.27      0.21
CP                 Fails     Fails
Tucker             Fails     Fails
MACH               0.21      0.20
CO                 0.19      0.16
SO                 0.14      0.12
Nayak et al. [31]  0.18      0.12
Doucet et al. [31] 0.35      0.29
SOM-SD [60]        0.38      0.34
CSOM-SD [60]       0.13      0.09
This dataset contains categories based on both structure and content, and the results
clearly indicate that techniques which use only the content fail to provide better results.
This dataset has been chosen for sensitivity analysis and further results will be discussed
in Section 5.5.2.
5.5.1.5 INEX 2009 dataset
The experimental results for clustering on the INEX 2009 dataset shown in Table 5.7
reveal that HCX-V performs the best of all the methods. The clustering methods using
the decomposition algorithms Tucker and CP fail to scale to this dataset, even with
random indexing applied to it. This is due to the very large number of documents (54K);
however, the PTCD algorithm could decompose the large tensor in less than 100 seconds.
This shows the scalability of the PTCD algorithm. However, the clustering results produced
by HCX-T are lower in quality in comparison to HCX-V. This shows that implicitly
capturing the relationship between the content and the structure is well suited for this
dataset. Though the tags are semantic in nature, there is a very large number of tags
(34,686) and hence the direct mapping between the structure and the content is lost.
Table 5.7: Results of clustering on the INEX 2009 dataset

Methods        Micro-purity  Macro-purity
HCX-V          0.49          0.53
HCX-T          0.39          0.40
S+C            0.35          0.36
Tucker         Fails         Fails
CP             Fails         Fails
MACH           0.36          0.34
CO             0.38          0.38
SO             0.36          0.35
BilWeb-CO [81] 0.37          0.38
An evaluation of the clustering solution using the collection selection measure, NCCG,
in Figure 5.23 shows that, again, HCX-V performs better than the other methods. By
searching only 20% of the documents, HCX-V achieves NCCG scores of up to 0.8. Also,
HCX-T performs much better than the clustering methods that use a linear combination
of the structure and content features and the methods that use only one feature.
[Figure 5.23: A comparison of the NCCG values of the different clustering methods on
the INEX 2009 dataset. NCCG values are plotted against the percentage of documents
searched (10-100%) for HCX-V, BilWeb-CO, HCX-T, CO, SO and S+C.]
The comparison conducted in Figure 5.24 of the number of clusters against the NCCG
values demonstrates that, as the number of clusters increases, the NCCG value drops.
This shows that, for this dataset, a smaller number of clusters is more suitable for
the proposed clustering methods.
In order to visualise the efficiency of the clustering methods with respect to the number
[Figure 5.24: A comparison of the number of clusters (0-1000) on the NCCG values.]
of clusters searched to identify the relevant documents, cumulative recall plots were used.¹
These plots use the percentage of clusters searched and the fraction of relevant documents.
In these plots, a recall value is recorded only once the documents in a cluster have been
seen, which implies that until the first cluster is searched the recall value remains 0.
This plot, shown in Figure 5.25, clearly demonstrates that HCX-V achieves an earlier
and higher recall in comparison to the other methods. It is also interesting to note that,
by searching only 3% of the clusters, nearly 75% of the relevant documents can be found,
which supports the cluster hypothesis proposed by van Rijsbergen [92].
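The construction of such a cumulative recall curve can be sketched as follows. The ranking of clusters by the number of relevant documents they hold, and the toy distribution, are assumptions for illustration; the published plots may use a different search order.

```python
def cumulative_recall(relevant_per_cluster, total_relevant):
    """Fraction of relevant documents found after searching each cluster.

    Clusters are visited in descending order of the relevant documents
    they hold (an assumed, best-case search order)."""
    found = 0
    recall = []
    for r in sorted(relevant_per_cluster, reverse=True):
        found += r
        recall.append(found / total_relevant)
    return recall

# toy example: 100 relevant documents spread over 6 clusters
curve = cumulative_recall([70, 20, 5, 3, 1, 1], 100)
print(curve[0])  # 0.7
```

When most relevant documents concentrate in a few clusters, as the cluster hypothesis predicts, the curve rises steeply after the first clusters searched.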
A further analysis of the two large categories for the ad hoc queries with topic ids
2009005 and 2009043 (detailed in Table 5.8) is shown in Figures 5.26 and 5.27.
Table 5.8: Details of ad hoc queries with large categories

Topic Id  Title                                       Query
2009005   chemists physicists scientists alchemists   //article[about(., periodic table elements chemists
          periodic table elements                     physicists scientists alchemists)]
                                                      (person|chemist|alchemist|scientist|physicist)
2009043   NASA missions                               //group[about(.//missions, NASA)]
These figures clearly show that, for topic ids 2009005 and 2009043, the HCX-V method
could identify 70% and 50% of the relevant documents respectively by searching only 1%
of the clusters. It should be noted that, by identifying these relevant clusters, searching
the entire document collection can be avoided, which in turn helps to improve the
efficiency of retrieving the documents.

¹Detailed cumulative recall plots can be found in http://inex.de-vries.id.au/media/other/cumulative recall/subset/
[Figure 5.25: A comparison of the different clustering methods on the INEX 2009 dataset
using cumulative recall: fraction of relevant documents against the percentage of clusters
searched for HCX-V, HCX-T, S+C, BilWeb-CO, CO and SO.]
[Figure 5.26: Cumulative gain for the topic id 2009005: fraction of relevant documents
against the percentage of clusters searched.]
[Figure 5.27: Cumulative gain for the topic id 2009043: fraction of relevant documents
against the percentage of clusters searched.]

5.5.2 Sensitivity analysis
In order to analyse the sensitivity of the proposed methods, the INEX IEEE dataset was
chosen. There are two reasons for this choice: (1) it is one of the large-sized datasets
having longer patterns, and (2) the categories in the INEX IEEE dataset are based on
both the structure and the content. This dataset was utilised to analyse the sensitivity
of the length constraint and min_supp values on purity. Experiments were conducted by
varying the length constraint (const) of the CF subtrees from 3 to 10 for support
thresholds from 10% to 30%. Figure 5.28 indicates that, as the length constraint
increases, the micro-purity and macro-purity values drop, especially at the 10% and 30%
support thresholds. Moreover, a length constraint of over 7 has a negative impact on
purity. With longer patterns, the content corresponding to the CF subtrees becomes more
specific and hence results in lower accuracy than the content corresponding to shorter
subtrees. This shows the suitability of constraining the CF subtrees in PCITMinerConst.
[Figure 5.28: Sensitivity of the length constraint on the micro- and macro-purity values
for the INEX IEEE dataset. (a) Micro-purity and (b) Macro-purity against the length of
the CFI subtrees (3-10) for min_supp values of 10%, 20% and 30%.]
5.5.3 Time complexity analysis
This section analyses the time complexity of HCX-V and HCX-T. That of HCX-V
comprises two components: the extraction of content using the concise frequent subtrees,
and the application of clustering. The time complexity of HCX-V is therefore O(dm) +
O(dt), where d represents the number of documents, m is the number of concise frequent
subtrees, and t is the number of terms.
The time complexity of HCX-T comprises four major components: the clustering of CF
subtrees, random indexing, matricization, and decomposition in PTCD. This is determined
as O(drp) + O(tkγ) + O(drγ) + O(dk′γ), where r is the number of structure-based clusters,
p is the number of similarity computation iterations, γ is the size of the random index
vector, and k and k′ are the non-zero entries per column in the tensor before random
indexing and in the matricized tensor after random indexing respectively. The time
complexity of PTCD itself is O(drγ) + O(dk′γ), which covers the cost of matricization
along the n-mode and the sparse Singular Value Decomposition (SVD) [16, 90] respectively.
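These two PTCD steps, n-mode matricization followed by an SVD of the unfolded matrix, can be sketched on a toy dense tensor. NumPy's dense SVD stands in for the sparse SVD that PTCD would use at scale; the tensor shape and mode names are illustrative only.

```python
import numpy as np

def matricize(tensor, mode):
    """Mode-n unfolding: rows indexed by the chosen mode, columns by the
    remaining modes combined."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# toy documents x terms x structure-clusters tensor
rng = np.random.default_rng(0)
T = rng.random((6, 5, 4))

X = matricize(T, 0)                    # unfold along the document mode
U, s, Vt = np.linalg.svd(X, full_matrices=False)
rank2 = (U[:, :2] * s[:2]) @ Vt[:2]    # truncated (rank-2) reconstruction

print(X.shape)                         # (6, 20)
print(rank2.shape == X.shape)          # True
```

At real scale the matricized tensor is sparse, so a Lanczos-based sparse SVD would replace `np.linalg.svd`; only the leading singular vectors are kept for clustering.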
5.5.4 Scalability analysis
All the real-life datasets, ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009, were
used for the scalability analysis, with the number of clusters set equal to the number of
categories (5, 8, 18, 21 and 40 respectively). This analysis was conducted using 1000
documents, with min_supp at 10%, const at 5, and γ at 100. The reason for this setting
is to understand how the proposed clustering method HCX-T performs with PTCD on
datasets of an extreme nature.
It can be seen from Figures 5.29, 5.30 and 5.31 that both HCX-T and PTCD scale nearly
linearly with the dataset size. The PTCD algorithm includes two main steps: (1) loading
the tensor file into memory by matricization, and (2) decomposing the matrices using SVD.

[Figure 5.29: Scalability of HCX-T: time (in secs) against the replication factor (2-10)
for the ACM, DBLP, INEX IEEE, INEX 2007 and INEX 2009 datasets.]

[Figure 5.30: Scalability of PTCD: time (in secs) against the replication factor for the
same datasets.]

[Figure 5.31: Scalability of the decomposition in PTCD: time (in secs) against the
replication factor for the same datasets.]
It can be seen from Figures 5.30 and 5.31 that minimal time is spent on the decomposition,
while a large chunk of the time is spent on loading the tensor file into memory.
5.6 Discussion and summary
This section discusses and summarises the empirical evaluation conducted for the cluster-
ing methods based on the proposed HCX methodology.
• Comparison of Structure Only (SO), Content Only (CO) and Structure and Content
features for clustering, and the effect of the proposed clustering methods HCX-V vs
HCX-T.

Evaluating the accuracy of the clustering solutions produced on all the datasets
using SO, CO, S+C, HCX-V and HCX-T reveals that the non-linear combination of
the structure and the content features performs better than using only one feature
or combining the two feature matrices together as in the S+C method. This shows
that a relationship exists between the structure and the content of XML documents,
and hence utilising this relationship improves the accuracy of the clustering solution.
The experimental evaluation on the chosen five real-life datasets reveals that HCX-T
performs better in three datasets, ACM, DBLP and INEX IEEE, and HCX-V on the
rest, INEX 2007 and INEX 2009. It is interesting to note that the HCX-V method
performs better in comparison to HCX-T for the two Wikipedia datasets used in
this research.
This raises the issue of when to choose HCX-V over HCX-T for a new dataset. For
datasets with fewer semantic tags, where the desired grouping is based more heavily
on the content of the documents than on their structure, HCX-V can be used. However,
if the datasets have more semantic tags than syntactic tags and the desired grouping
is based on both the structure and the content of the XML documents, then HCX-T
can be used.
• Comparison of the proposed methods over the state-of-the-art clustering methods
Figure 5.32 shows the comparison of the proposed clustering methods over the state-
of-the-art clustering methods. As these results are from the INEX forum, the clus-
tering submissions for each of the datasets were not evaluated on all the evaluation
metrics. For instance, the submissions for the INEX IEEE dataset were evaluated
only on Macro-purity, those for INEX 2007, on Micro- and Macro-F1 values, and
those for INEX 2009 on Micro- and Macro-purity values.
[Figure 5.32: Comparison of the proposed clustering methods over the state-of-the-art
clustering methods on the large-sized datasets. (a) Micro- values for HCX-V, HCX-T,
CRP, 4RP, SOM and BilWeb-Co on the INEX 2007 and INEX 2009 datasets; (b) Macro-
values for HCX-V, HCX-T, Tran et al., Doucet et al., CRP, 4RP, SOM and BilWeb-Co
on the INEX IEEE, INEX 2007 and INEX 2009 datasets.]
These results clearly show that HCX-V outperforms state-of-the-art clustering methods
such as CRP, 4RP and SOM on the INEX 2007 dataset. It also outperforms the BilWeb-Co
method [81] on the INEX 2009 dataset. On the other hand, HCX-T outperforms the
methods proposed by Tran et al. (PCXSS) [84] and Doucet et al. [36], which shows
that the proposed clustering methods improve the accuracy over other clustering methods
that use only the structure features or that linearly combine the structure and content.
• Comparison of the different types of node relationships (induced and embedded) of
the frequent subtrees in clustering

Figures 5.33 and 5.34 show the relative ranking of the different types of concise frequent
subtrees in clustering. The relative ranking is obtained by setting a scale of 1 to 8,
where the best performing method gets a value of 8 and the worst method gets a value
of 1, for both HCX-V and HCX-T. However, for the INEX IEEE and INEX 2009 datasets,
the range was set from 1 to 4, as the full concise frequent subtrees could not scale to
these datasets.
[Figure 5.33: Comparison of the different types of concise frequent subtrees on clustering,
grouped by subtree type, showing the relative ranking for the ACM, DBLP, INEX 2007,
INEX IEEE and INEX 2009 datasets.]
[Figure 5.34: Comparison of the different types of concise frequent subtrees on clustering,
grouped by dataset, showing the relative ranking of CFI, MFI, CFE, MFE, CFIConst,
MFIConst, CFEConst and MFEConst.]
These figures indicate that, in most of the datasets, using CFIConst subtrees for
clustering performs much better than using the other types of subtrees. Although CFI
subtrees provide the best result on the ACM dataset using HCX-T, the relative rank
reflects not only the results based on HCX-T but also those based on HCX-V, in which
the CFI subtrees perform poorly. Although clustering methods using CFEConst subtrees
perform relatively better than those using MFIConst in three datasets (ACM, DBLP and
INEX 2007), they fail to provide similar results on the INEX IEEE and INEX 2009
datasets.
These results demonstrate that, for the chosen XML datasets, in most instances the
induced subtrees not only provide a more accurate solution than the embedded subtrees
but also require less time to extract the content. This could be due either to the
extraction of too many hidden relationships, which adds noise, or to the fact that these
embedded relationships are not suitable for identifying the correct categories.
• Impact of the proposed tensor decomposition and state-of-the-art decomposition
algorithms

The comparison of the performance of all the decomposition algorithms on the various
datasets is shown in Figure 5.35, with relative ranks ranging from 1 to 4, with 4 being
the best score. This shows that the PTCD algorithm, with its progressive tensor building
option and decomposition, is not only scalable but is also able to provide more accurate
results than the popular Tucker decomposition algorithm [113] in particular.
[Figure 5.35: Comparison of tensor decomposition algorithms: relative ranking of PTCD,
CP, MACH and Tucker on the ACM, DBLP, INEX 2007, INEX IEEE and INEX 2009 datasets.]
In the ACM and DBLP datasets, the CP decomposition algorithm performs on a par with
PTCD in HCX-T. In spite of its potential to provide accurate results, CP fails to scale
to any of the large-sized datasets, even after reducing the term space to approximately
1000 terms with the random indexing option.
Another decomposition algorithm, MACH [112], applies random indexing to reduce the
number of tensor entries and is especially designed to suit dense datasets, which allows
it to scale to large-sized datasets. However, this often results in poor performance:
some datasets have small-sized documents in some categories, and applying this algorithm
can remove a number of tensor entries in those categories, resulting in the small-sized
categories being combined together. This shows the scalability and efficiency of PTCD
over the state-of-the-art decomposition algorithms.
• Impact of Random Indexing in HCX-T

The option of random indexing is proposed for the models using the TSM to reduce the
term space for large-sized datasets. However, the experimental results shown in Tables
5.3 and 5.4 reveal that the clustering results produced using RI are better than those
produced without RI even for the small and medium-sized ACM and DBLP datasets, which
do not require size reduction. Apart from improving the accuracy, using RI helps to
reduce the decomposition times, even for these small and medium-sized datasets.

However, an interesting problem in using RI is how to choose the value γ for the seed
length in random indexing. The Johnson-Lindenstrauss result can be used to obtain a
bound for γ, given by γ = ⌈4(ε²/2 − ε³/3)⁻¹ ln n⌉ [28].
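The bound, and the random indexing step it sizes, can be sketched as below. The γ computation follows the formula above; the sparse ternary index vectors follow common random indexing practice and are an assumption about the thesis implementation, as are the example terms.

```python
import math
import numpy as np

def jl_seed_length(n, eps):
    """Johnson-Lindenstrauss bound for the random-index dimension gamma."""
    return math.ceil(4.0 / (eps**2 / 2 - eps**3 / 3) * math.log(n))

gamma = jl_seed_length(54000, 0.5)     # e.g. the INEX 2009 document count
print(gamma)

# Random indexing: each term gets a sparse ternary index vector of length
# gamma; a document vector is the sum of the index vectors of its terms.
rng = np.random.default_rng(1)

def index_vector(gamma, nonzeros=4):
    v = np.zeros(gamma)
    pos = rng.choice(gamma, size=nonzeros, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

terms = {t: index_vector(gamma) for t in ["xml", "cluster", "tensor"]}
doc = terms["xml"] + terms["tensor"]   # a document containing two terms
print(doc.shape[0] == gamma)           # True
```

Smaller ε tightens the distortion guarantee but inflates γ, which is the accuracy/size trade-off behind the λd columns of Tables 5.3 and 5.4.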
• Impact of feature reduction using HCX

One of the objectives of this research is to combine the structure and the content
of XML documents in a non-linear approach that aids in feature reduction. Hence, in
order to clearly understand the impact of the proposed HCX methodology on reducing
the features, an analysis was conducted.
It is evident from Figures 5.36 and 5.37 that in spite of a drastic reduction in the
number of features for the proposed clustering methods, there is an improvement in
performance over the CO and S+C representations.
[Figure 5.36: A comparison of the average of the purity, F1 and NMI metrics for HCX-V,
HCX-T, CO, S+C and SO on the chosen real-life datasets.]
[Figure 5.37: A comparison of the number of terms (log scale) used for clustering by
HCX-V, CO and S+C in the chosen real-life datasets.]
• Comparison of the different types of concise frequent subtrees on clustering

In most of the datasets, among the concise frequent subtrees, the closed frequent induced
subtrees provide better results than the other types of subtrees, except on the DBLP
dataset. Though the MFI subtrees are fewer and could be produced in almost the same
time as the CFI subtrees, their clustering performance on some datasets is low, as some
of the common subtrees are not determined and their corresponding content is therefore
not used in clustering. The drop in performance is noticeable on the ACM dataset with
HCX-T and on the INEX 2007 dataset with HCX-V.
Let us analyse the two concise representations, maximal and closed frequent subtrees,
to understand the poor performance of MFI relative to CFI. Let there be a document
tree dataset DT on which applying frequent subtree mining with a given support threshold
s results in a frequent subtree result set O = {DT′1, DT′2, DT′3}. Let these three
frequent subtrees have supports of s+1, s+1 and s respectively, with DT′1 ⊂t DT′2 and
DT′2 ⊂t DT′3. Applying the definition of closed frequent subtrees to the result set O,
supp(DT′1) = supp(DT′2), so DT′1 can be removed from the output, as its supertree DT′2
includes the information contained in DT′1. As supp(DT′2) ≠ supp(DT′3), the frequent
subtree DT′2 is closed: there are some document trees which contain DT′2 but not DT′3.
Also, DT′3 has no frequent supertree and is therefore closed. Hence, DT′2 and DT′3 are
the two closed frequent subtrees in DT.

The three subtrees then need to be checked to see which ones are maximal. According
to the definition of maximal frequent subtrees, DT′3 is the only maximal frequent
subtree, because DT′3 ⊃t DT′1, DT′2 and maximal frequent subtrees do not consider
differences in support values. That is the reason for the reduced number of maximal
subtrees in comparison to the closed subtrees (of which there are two). The total number
of output patterns is smaller under the maximal frequent subtree representation;
however, this representation suffers from information loss: only s document trees
contain DT′3, so by using DT′3 for extracting content, the document trees that do not
contain DT′3 but do contain DT′2 are ignored. This information loss could have caused
the poor accuracy observed for the maximal subtrees in clustering.

Therefore, the comparison of these two concise representations reveals that, though
the maximal frequent subtree representation provides a reduced pattern set, it can
result in information loss. Alternatively, the closed frequent subtree representation
provides a concise pattern set without any information loss, since the closure property
eliminates only the redundant information. This clearly explains the poorer accuracy
of the maximal frequent subtrees relative to the closed frequent subtrees.
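The same closure and maximality definitions can be illustrated with frequent itemsets, which are simpler than subtrees but behave identically with respect to the closed and maximal representations; this is an analogy for illustration, not the thesis's tree miner.

```python
from itertools import combinations

# toy transactions; minimum support count = 2
transactions = [{"a"}, {"a", "b"}, {"a", "b", "c"}, {"a", "b", "c"}]
min_supp = 2
items = {"a", "b", "c"}

def support(itemset):
    return sum(itemset <= t for t in transactions)

frequent = [set(c) for r in range(1, 4)
            for c in combinations(sorted(items), r)
            if support(set(c)) >= min_supp]

# closed: no proper superset with the SAME support
closed = [f for f in frequent
          if not any(f < g and support(g) == support(f) for g in frequent)]
# maximal: no frequent proper superset at all (support differences ignored)
maximal = [f for f in frequent if not any(f < g for g in frequent)]

print(sorted(map(sorted, closed)))   # [['a'], ['a', 'b'], ['a', 'b', 'c']]
print(sorted(map(sorted, maximal)))  # [['a', 'b', 'c']]
```

Here three patterns are closed but only one is maximal: using only the maximal pattern {a, b, c} (support 2) ignores the transaction that contains {a, b} but not {a, b, c}, the same information loss argued above for MFI subtrees.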
• Impact of the constraint length of the frequent subtrees on clustering

The sensitivity analysis reveals that, with an increase in the constraint length, a drop
in performance occurs on the INEX IEEE dataset; however, the drop is not noticeable
on the other datasets. The time taken to perform the experiments increased dramatically,
as it takes longer to extract the content corresponding to longer subtrees. Table 5.9
shows the constraint lengths that gave the best results for the concise frequent subtrees.
Table 5.9: Constraint lengths for the real-life datasets
Datasets const
ACM 22
DBLP 3
INEX 2007 18
INEX IEEE 5
INEX 2009 6
• Impact of the support threshold

The sensitivity analysis shows that, with an increase in the support threshold beyond
30% of the dataset, the accuracy drops. This is because there are fewer subtrees and
the corresponding content is too little to provide a good clustering solution. However,
in most situations there is little variation in accuracy with the change in support
threshold. It should be noted that decreasing the support threshold results in a very
large number of frequent subtrees; however, with the strength of the concise frequent
subtrees, this large number of frequent subtrees can be controlled. Also, the ability
to mine at a lower support threshold helps in identifying a large number of fine-grained
clusters.
• Comparison of the weighting schemes tf-idf and BM25

A comparison of the two weighting schemes, tf-idf and BM25, shown in Figure 5.38,
reveals that BM25 performs better than tf-idf on most of the datasets. The difference
is most noticeable on the INEX 2009 dataset, with about a 5% improvement over its
counterpart tf-idf.
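The two weighting schemes can be sketched with their standard formulas; the BM25 parameters k1 = 1.2 and b = 0.75 are common defaults and an assumption here, as the exact parameterisation used in the thesis experiments is not stated.

```python
import math

docs = [["xml", "clustering", "xml"],
        ["tensor", "decomposition"],
        ["xml", "tensor", "clustering", "clustering"]]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N

def df(term):
    return sum(term in d for d in docs)

def tf_idf(term, doc):
    tf = doc.count(term)
    return tf * math.log(N / df(term))

def bm25(term, doc, k1=1.2, b=0.75):
    # Robertson-style BM25 with length normalisation against avgdl
    tf = doc.count(term)
    idf = math.log((N - df(term) + 0.5) / (df(term) + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

d = docs[2]
print(round(tf_idf("clustering", d), 3))
print(round(bm25("clustering", d), 3))
```

Unlike tf-idf, BM25 saturates the term-frequency contribution and penalises long documents, which plausibly explains its advantage on the heterogeneous INEX 2009 collection.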
[Figure 5.38: A comparison of the weighting schemes tf-idf and BM25: average of the
purity, F1 and NMI metrics on the ACM, DBLP, INEX 2007, INEX IEEE and INEX 2009
datasets.]
• Impact of clustering on information retrieval

With the availability of real ad hoc retrieval queries and their manual assessment
results for the INEX 2009 dataset, this dataset was chosen for evaluating the effect
of clustering on information retrieval. The results clearly show that HCX-V is
effective in clustering, identifying the relevant documents earlier than the other
clustering methods. This also shows the effectiveness of clustering for information
retrieval.
5.7 Conclusion
This chapter has presented a clustering methodology which uses two models to combine
the structure and the content of XML documents. It has also reported a series of
experiments that empirically evaluate HCX in the context of clustering XML documents.
The purpose of the experiments is to show the effectiveness, accuracy, scalability and
applicability of the proposed methodology on various real-life XML datasets.
The first section presented the overall HCX methodology, with each of the two models
described in detail in the subsequent two sections.
The contributions made in this chapter can be summarised as:
• a clustering method (HCX-V) using the VSM to combine the structure and the content
implicitly;

• a clustering method (HCX-T) that utilises the tensor model to efficiently combine
the content and structure features of XML documents;

• a randomised tensor decomposition algorithm for large-sized tensors; and

• an experimental analysis of the proposed methods for their accuracy, time complexity
and scalability.
The experimental results over the real-life datasets show that by adopting the HCX
methodology for clustering the XML documents, the accuracy of the clustering solution
can be improved as it captures non-linearly the relationship between the structure and
the content of the XML documents. Moreover, this methodology helps in reducing the
number of terms used for clustering and hence improves the performance of the clustering
methods.
The successful application and competitive results of HCX, even on very large-sized
datasets, demonstrate its effectiveness on real-life XML datasets: on datasets with a
large number of documents (such as INEX 2007 and INEX 2009), on datasets with longer
trees (such as INEX IEEE and INEX 2009), and on datasets whose categories are based
on only one feature (such as ACM with 2 categories (structure) and INEX 2007 (content)).
Most importantly, the consistency of the results obtained in both the frequent pattern
mining and clustering methods shows that this complete framework is suitable for
real-life datasets. The improvement in accuracy of the clustering methods over many
state-of-the-art methods demonstrates that combining the structure and content of XML
documents for clustering is useful in domains such as information retrieval, for the
problem of collection selection.
Chapter 6
Conclusion
6.1 Overview
With the increasing popularity of XML, there has been an explosion in the number of
XML documents both on the Internet and in intranets. In order to effectively manage
these large collections of XML documents, clustering has been identified as an effective
solution. However, there exist several challenges in the clustering of XML documents
that need to be addressed in order to provide an accurate clustering solution. Hence,
the main objective of this research is to improve accuracy by exploring and developing
different clustering methods utilising both the structure and the content of XML
documents on real-life datasets.
This research has proposed clustering methods to capture the structure and the content
of XML documents. These clustering methods efficiently capture only the concise frequent
structures of the XML documents generated using the proposed frequent pattern mining
methods. Also, this research has investigated the effect of using the content-only, structure-
only, and content-and-structure features using various models for clustering the XML
documents. Furthermore, the research also applies the results of the proposed clustering
methods to improve the effectiveness of collection selection in ad hoc information retrieval.
6.2 Summary of contributions
This thesis provides an overview of the XML data, XML frequent pattern mining methods
and the clustering methods using various models such as the VSM and the TSM. Based
on a literature review of current work, the following shortcomings were noted:
• lack of efficient concise frequent subtree mining methods suitable for real-life
datasets;

• lack of non-linear approaches to combine the structure and the content of XML
documents in clustering; and

• lack of efficient dimensionality reduction methods to combat the increase in
dimensionality due to the combination.
This thesis has focused on overcoming these shortcomings by proposing a number of
efficient concise frequent subtree mining methods and XML clustering methods using
novel representation models.
The main contributions are summarised below:
• Developed concise frequent pattern mining methods based on different types of subtrees.
This thesis has bridged the gap found in the literature about frequent pattern min-
ing methods by proposing varied types of subtrees based on the node relationship
and conciseness. It has also proposed these methods using the prefix-based pattern
growth approach, which is particularly suited to dense datasets. Furthermore, this thesis has proposed methods to generate new types of concise frequent subtrees based on their length, which are specifically appropriate for many large-sized datasets. It has also compared the proposed frequent pattern mining methods with many state-of-the-art methods to demonstrate their efficiency.
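The distinction between closed and maximal patterns that underpins these concise representations can be illustrated with a small itemset analogue (the supports below are invented for illustration; the thesis itself mines subtrees, not itemsets):

```python
# Sketch: among frequent patterns, a "closed" pattern has no superpattern
# with the same support; a "maximal" pattern has no frequent superpattern
# at all. Hypothetical pattern -> support table:
frequent = {
    frozenset("a"): 4,
    frozenset("ab"): 4,
    frozenset("b"): 5,
    frozenset("abc"): 2,
}

def supersets(p):
    return [q for q in frequent if p < q]  # proper frequent superpatterns

closed = {p for p in frequent
          if all(frequent[q] < frequent[p] for q in supersets(p))}
maximal = {p for p in frequent if not supersets(p)}

print(sorted("".join(sorted(p)) for p in closed))   # ['ab', 'abc', 'b']
print(sorted("".join(sorted(p)) for p in maximal))  # ['abc']
```

Here {a} is pruned from the closed set because {a, b} has the same support; the maximal set is the most compact but discards the supports of its subpatterns.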
• Evaluated the efficiency of concise frequent subtrees in clustering.
This research has evaluated the efficiency of utilising concise frequent subtrees in the
clustering process. It has also conducted an in-depth analysis of the nature of each
of the subtrees and their effectiveness in obtaining the clustering solution.
• Developed a novel methodology of combining the structure and the content of XML documents both implicitly and explicitly.
Instead of adopting the traditional linear method for combining the structure and
content of XML documents, this thesis has proposed a novel methodology of using
structure features to derive content features. Doing so not only helps to capture the
relationship between these two features, but also reduces the dimensionality resulting
from the combination. The proposed non-linear approach could also be extended to
other domains for combining two or more features.
Also, this thesis has proposed a novel method of using a multi-dimensional data model, the Tensor Space Model (TSM), to capture the two features of the XML documents explicitly. The experimental results also indicate that capturing the features explicitly helps to improve accuracy.
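As a rough illustration of this idea (not the thesis implementation), a TSM can be viewed as a third-order array with modes for documents, structural features and content terms; the counts below are invented:

```python
import numpy as np

# Hypothetical example: 3 documents, 2 structural features (e.g. frequent
# subtrees), 4 content terms. Entry X[d, s, t] counts how often term t
# occurs under structural feature s in document d.
X = np.zeros((3, 2, 4))
X[0, 0, 1] = 2.0   # doc 0: term 1 appears twice under structure feature 0
X[0, 1, 3] = 1.0
X[1, 0, 1] = 1.0
X[2, 1, 2] = 3.0

# One way to matricize along the document mode: flatten the structure and
# term modes so standard matrix methods (SVD, k-means) can be applied.
X1 = X.reshape(3, 2 * 4)
print(X1.shape)  # (3, 8)
```

Keeping the modes separate until decomposition is what lets the model capture structure and content explicitly, rather than mixing them in a single term vector.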
• Proposed a progressive tensor decomposition algorithm to effectively scale to very
large numbers of documents.
This thesis has introduced a novel and scalable tensor decomposition method for
decomposing very large datasets. In many situations, particularly with large dense datasets, tensors could not be used because the existing decomposition algorithms fail to scale. However, the proposed PTCD algorithm has made feasible the decomposition of tensors built even for large-sized datasets of about 50K documents. Furthermore, the results obtained on small-sized datasets were comparable with those of the popular CANDECOMP/PARAFAC (CP) decomposition algorithm.
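The chunk-wise idea behind such progressive decomposition can be sketched as follows; this is only an illustration of processing a matricized tensor in slices, not the exact PTCD algorithm:

```python
import numpy as np

def progressive_svd(chunks, rank):
    """Fold each chunk of a matricized tensor into a running rank-`rank`
    basis instead of decomposing the full matrix at once, so only one
    chunk plus a small summary is ever held in memory."""
    basis = None
    for chunk in chunks:                      # chunk: (docs_in_chunk, features)
        stacked = chunk if basis is None else np.vstack([basis, chunk])
        _, s, vt = np.linalg.svd(stacked, full_matrices=False)
        k = min(rank, len(s))
        basis = s[:k, None] * vt[:k]          # keep a rank-k summary
    return basis                               # (rank, features) semantic space

rng = np.random.default_rng(0)
data = rng.random((100, 20))                   # 100 documents, 20 features
chunks = np.array_split(data, 5)               # processed 20 documents at a time
space = progressive_svd(chunks, rank=4)
print(space.shape)  # (4, 20)
```

The memory cost per step is bounded by the chunk size plus the rank-k summary, which is what allows the decomposition to scale where a single SVD of the full matricized tensor would not.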
• Studied the impact of clustering on ad hoc retrieval using a very large real-life dataset.
On the collection selection problem, the cluster hypothesis of van Rijsbergen [92] was studied using the documents judged relevant to each query by manual assessors. The overall results of this study established that
– the cluster hypothesis holds for this collection; and
– the clustering solution using both the structure and the content of XML documents provides better results than using only one feature.
6.3 Summary of findings
The main findings from this thesis can be summarised as follows:
• HCX provides improved accuracy (up to 12% improvement) compared to structure-only, content-only, the linear combination of structure and content, and other state-of-the-art approaches;
• HCX-T shows an improvement in accuracy over HCX-V for XML datasets that have the following characteristics:
– A strong relationship between structure and content features;
– Categories relying upon both structure and content; and
– More semantic tags than formatting tags;
• HCX-V is preferred over HCX-T for XML datasets with the following characteristics:
– A weak relationship between structure and content features;
– Categories based mostly on the content; and
– A combination of both formatting and semantic tags.
• With respect to the effectiveness of concise frequent subtrees based on node relationship in clustering, induced subtrees are preferred over embedded subtrees. In particular, Closed Frequent Induced (CFI) subtrees are preferred over Maximal Frequent Induced (MFI) subtrees, since information loss occurs with the use of MFI in clustering. Additionally, the length-constrained subtrees demonstrate an improvement in accuracy when the complete concise subtrees cannot be generated, especially for larger trees.
• By eliminating the content corresponding to infrequent substructures, the dimensionality of the input (content) data is reduced and the accuracy of the clustering solution is improved.
6.4 Limitations and future extensions
This research focuses mainly on the clustering of XML documents using a tree-based
model. Several extensions can be made to improve these currently proposed methods in
the future, such as extending the clustering methods so that they could be applied to
different types of semi-structured data.
This research has explored the use of the structure and content of XML documents for clustering; future research should focus on using other features, such as links between documents and semantic relationships between documents, while creating the TSM, and on studying the impact of these features on the clustering accuracy of these documents.
The accuracy of the proposed methods suffers on datasets which contain more formatting tags than semantic tags; hence, future work could include an automatic semantic tagging system to create more meaningful tags. The YAGO ontology used to create semantic tags in the INEX 2009 dataset could also be used to create semantic tags for the other XML datasets.
The proposed frequent pattern mining methods were able to mine a large document collection with a good response time; however, there is always room for improvement, especially in efficiency. With the improvement in computing resources, the efficiency of the frequent pattern mining methods could be further improved by parallelising them. The nature of frequent pattern mining methods based on the prefix-based pattern growth approach strongly supports the idea that each of the projections based on the frequent patterns could be mined in parallel. This would improve the performance of these methods and alleviate the problem of running them on a single machine.
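The independence of prefix projections can be sketched with a simple sequence analogue (the toy database and `minsup` below are invented; the thesis methods operate on subtrees):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch (not the thesis miner): in prefix-based pattern
# growth, the database is projected by each frequent prefix, and each
# projected database can be mined independently -- hence in parallel.
database = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
minsup = 2

def project(db, prefix):
    """Suffixes of the transactions that contain the prefix item."""
    return [t[t.index(prefix) + 1:] for t in db if prefix in t]

def mine_projection(prefix):
    """Count frequent extensions of one prefix within its projection."""
    counts = Counter(item for t in project(database, prefix) for item in t)
    return {(prefix, item): n for item, n in counts.items() if n >= minsup}

frequent_items = [i for i, n in
                  Counter(x for t in database for x in t).items() if n >= minsup]
with ThreadPoolExecutor() as pool:
    patterns = {}
    for result in pool.map(mine_projection, frequent_items):
        patterns.update(result)
print(patterns)  # {('a', 'b'): 2, ('a', 'c'): 2, ('b', 'c'): 2}
```

Because no projection reads another projection's state, the workers need no coordination beyond merging their results, which is exactly what makes this mining strategy amenable to parallel or distributed execution.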
Further future work could include mining patterns such as concise sequential subtrees and episode subtrees from datasets which contain sequential information. This thesis has utilised frequent subtree mining in clustering; future work will also focus on creating tree-based association rules with structure and content information to aid information retrieval. In addition, future work will attempt to utilise Non-negative Matrix Factorization (NMF) in the proposed PTCD algorithm instead of Singular Value Decomposition (SVD), as NMF imposes no orthogonality constraint on the derived semantic space [71]. NMF also uses only non-negative values for all the latent semantic vectors [125], so it could further reduce the computational complexity.
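A minimal sketch of the NMF alternative, using the standard Lee–Seung multiplicative updates rather than anything from the thesis, illustrates the non-negativity of the resulting semantic vectors:

```python
import numpy as np

def nmf(V, k, iters=200, eps=1e-9):
    """Minimal NMF via multiplicative updates: V ~= W @ H with W, H >= 0.
    A sketch of the factorisation suggested as future work, not the
    thesis implementation."""
    rng = np.random.default_rng(1)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H

V = np.random.default_rng(2).random((30, 12))   # non-negative toy data
W, H = nmf(V, k=3)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, W.min() >= 0 and H.min() >= 0)
```

Unlike the SVD, every entry of `W` and `H` stays non-negative throughout the updates, so the latent semantic vectors remain directly interpretable as additive combinations of features.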
6.5 Final remarks
Clustering of XML documents has been quite popular among researchers in recent years.
The newly proposed ideas of combining the structure and content of XML documents have generated a great deal of interest in clustering XML documents. The use of frequent
subtree mining for generating tree summaries and utilising them for clustering opens the
possibility of gaining more useful representations. This research increases the potential
of applying frequent pattern mining to various other aspects of knowledge discovery and
data mining tasks. The importance of the work presented in this thesis has been demonstrated through publications in conferences, book chapters and workshops. This chapter has
summarised the key findings and the contributions it has made to the research commu-
nity. It has also identified various future extensions in both frequent pattern mining and
clustering that could be applied to the proposed methods.
Bibliography
[1] International standard ISO 8879: Information processing - text and office systems -
standard generalised markup language (SGML), 1986.
[2] S. Abiteboul, P. Buneman, and D. Suciu. Data on the web: from relations to semi-
structured data and XML. Morgan Kaufmann, San Francisco, California, 2000.
[3] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approxima-
tions. J. ACM, 54(2):1–19, 2007.
[4] C. C. Aggarwal, N. Ta, J. Wang, J. Feng, and M. J. Zaki. Xproj: a framework for
projected structural clustering of XML documents. In Proceedings of the 13th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
46–55. ACM, San Jose, California, USA, 2007.
[5] C. C. Aggarwal and H. Wang. Graph data management and mining: A survey of
algorithms and applications. In Managing and Mining Graph Data, pages 13–68.
2010.
[6] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery
of association rules. In Advances in knowledge discovery and data mining, pages
307–328. American Association for Artificial Intelligence, 1996.
[7] I. Altingovde, D. Atilgan, and A. Ulusoy. Exploiting index pruning methods for
clustering XML collections. In S. Geva, J. Kamps, and A. Trotman, editors, Focused
Retrieval and Evaluation, volume 6203 of Lecture Notes in Computer Science, pages
379–386. Springer Berlin / Heidelberg, 2010.
[8] A. Anagnostopoulos, A. Dasgupta, and R. Kumar. Approximation algorithms for co-
clustering. In Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems (PODS), pages 201–210, New York,
NY, USA, 2008. ACM.
[9] R. Anderson. Professional XML. Wrox Press Ltd, Birmingham, England, 2000.
[10] P. Antonellis, C. Makris, and N. Tsirakis. XEdge: clustering homogeneous and
heterogeneous XML documents using edge summaries. In Proceedings of the 2008
ACM symposium on Applied computing, SAC ’08, pages 1081–1088, New York, NY,
USA, 2008. ACM.
[11] H. Arimura and T. Uno. An output-polynomial time algorithm for mining frequent
closed attribute trees. In S. Kramer and B. Pfahringer, editors, Inductive Logic Pro-
gramming, volume 3625 of Lecture Notes in Computer Science, pages 1–19. Springer
Berlin / Heidelberg, 2005.
[12] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Satamoto, and S. Arikawa. Efficient
substructure discovery from large semi-structured data. In SIAM International Con-
ference on Data Mining (SDM), 2002.
[13] T. Asai, H. Arimura, T. Uno, and S. Nakano. Discovering frequent substructures in
large unordered trees. In The 6th International Conference on Discovery Science,
2003.
[14] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A generalized
maximum entropy approach to bregman co-clustering and matrix approximation.
In Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining (KDD), pages 509–514, New York, NY, USA, 2004. ACM.
[15] M. W. Berry, Z. Drmac, and E. R. Jessup. Matrices, vector spaces, and information
retrieval. SIAM Review, 41(2):335–362, 1999.
[16] E. Bingham and H. Mannila. Random projection in dimensionality reduction: ap-
plications to image and text data. In Proceedings of the seventh ACM SIGKDD
international conference on Knowledge discovery and data mining (KDD), pages
245–250, New York, NY, USA, 2001. ACM.
[17] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. L. Lanzi. A tool for extracting
XML association rules. In Proceedings. 14th IEEE International Conference on Tools
with Artificial Intelligence, 2002. (ICTAI 2002)., pages 57–64, 2002.
[18] B. Bringmann. To see the wood for the trees: Mining frequent tree patterns, 2006.
[19] L. Candillier, L. Denoyer, P. Gallinari, M. C. Rousset, A. Termier, and A. M. Ver-
coustre. Mining XML documents. 2007.
[20] S. H. Cha. Comprehensive survey on distance/similarity measures between probabil-
ity density functions. International J. Mathematical Models and Methods in Applied
Sciences, 1(4):300–307, 2007.
[21] Y. Chi, R. R. Muntz, S. Nijssen, and J. N. Kok. Frequent subtree mining - an
overview. J. Fundam. Inf., 66(1-2):161–198, 2004.
[22] Y. Chi, S. Nijssen, R. R. Muntz, and J. N. Kok. Frequent subtree mining- an
overview. In J. Fundamenta Informaticae, volume 66, pages 161–198. IOS Press,
2005.
[23] Y. Chi, Y. Yang, and R. R. Muntz. Indexing and mining free trees. In IEEE
International Conference on Data Mining (ICDM), pages 509–512, 2003.
[24] Y. Chi, Y. Yang, Y. Xia, and R. R. Muntz. CMTreeMiner: Mining both closed and
maximal frequent subtrees. In The Eighth Pacific Asia Conference on Knowledge
Discovery and Data Mining (PAKDD). 2004.
[25] R. Cover. Xml applications and initiatives. http://xml.coverpages.org/
xmlApplications.html, 2005.
[26] T. Dalamagas, T. Cheng, K. Winkel, and T. Sellis. A methodology for clustering
XML documents by structure. Information Systems, 31(3):187–228, 2006.
[27] T. Dalamagas, T. Cheng, K. Winkel, and T. K. Sellis. Clustering XML documents
by structure. In SETN, 2004.
[28] S. Dasgupta and A. Gupta. An elementary proof of the johnson-lindenstrauss lemma.
Technical report, 1999.
[29] C. De Vries and S. Geva. Document clustering with k-tree. In S. Geva, J. Kamps,
and A. Trotman, editors, Advances in Focused Retrieval, volume 5631 of Lecture
Notes in Computer Science, pages 420–431. Springer Berlin / Heidelberg, 2009.
[30] L. Denoyer and P. Gallinari. Report on the XML mining track at INEX 2005
and INEX 2006: categorization and clustering of XML documents. SIGIR Forum,
41(1):79–90, 2007.
[31] L. Denoyer, P. Gallinari, and A. M. Vercoustre. Report on the XML mining track
at INEX 2005 and INEX 2006. In INEX 2006, pages 432–443, Dagstuhl Castle,
Germany, 2006.
[32] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based ap-
proaches for classifying chemical compounds. In IEEE International Conference on
Data Mining (ICDM), pages 35–42, 2003.
[33] M. M. Deza and E. Deza. Encyclopedia of distances. Springer, 1 edition, July 2009.
[34] R. Diestel. Graph Theory (Graduate Texts in Mathematics). Springer, 3rd edition,
August 2005.
[35] D. Dongjie, M. Zhixin, X. Yusheng, and L. Li. Mining tree patterns using frequent
2-subtree checking. In Second International Symposium on Knowledge Acquisition
and Modeling, 2009. KAM ’09., volume 2, pages 162–165, nov. 2009.
[36] A. Doucet and M. Lehtonen. Unsupervised classification of text-centric XML docu-
ment collections. In 5th International Workshop of the Initiative for the Evaluation
of XML Retrieval, INEX, pages 497–509, 2006.
[37] E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey.
IEEE Transactions on Knowledge and Data Engineering, 21(1):6–20, 2009.
[38] G. Flake, R. Tarjan, and M. Tsioutsiouliklis. Graph clustering and minimum cut
trees. Internet Mathematics, 1(4):305–408, 2003.
[39] C. Fox. A stop list for general text. ACM SIGIR Forum, 24(1-2):19–35, 1989.
[40] W. B. Frakes and C. Fox. Strength and similarity of affix removal stemming algo-
rithms. ACM SIGIR Forum, 37(1):26–30, 2003.
[41] N. Fuhr, M. Lalmas, A. Trotman, and J. Kamps. Focused Access to XML documents.
In 6th International Workshop of the Initiative for the Evaluation of XML Retrieval,
INEX 2007. Selected Papers, Lecture Notes in Computer Science (LNCS), Dagstuhl
Castle, Germany, December 17-19, 2007.
[42] M. Garey and D. Johnson. Computers and intractability : a guide to the theory of
NP-completeness. W. H. Freeman, San Francisco, 1979.
[43] M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: a
system for extracting document type descriptors from XML documents. volume 29,
pages 165–176. ACM, 2000.
[44] S. Guha, R. Rastogi, and K. Shim. Rock: a robust clustering algorithm for categor-
ical attributes. pages 512 –521, Mar. 1999.
[45] M. Hagenbuchner, A. Tsoi, A. Sperduti, and M. Kc. Efficient clustering of structured
documents using graph self-organizing maps. In N. Fuhr, J. Kamps, M. Lalmas, and
A. Trotman, editors, Focused Access to XML Documents, volume 4862 of Lecture
Notes in Computer Science, pages 207–221. Springer Berlin / Heidelberg, 2008.
[46] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
In Proceedings of the 2000 ACM SIGMOD international conference on Management
of data, pages 1–12. ACM Press, Dallas, Texas, United States, 2000.
[47] D. Harman. How effective is suffixing? J. American Society for Information Science,
42(1):7–15, 1991.
[48] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the
presence of isomorphism. In Proceedings of the IEEE International Conference on
Data Mining, pages 549–559. IEEE Computer Society, 2003.
[49] H. Huang, C. Ding, D. Luo, and T. Li. Simultaneous tensor subspace selection and
clustering: The equivalence of high order SVD and k-means clustering. In Proceeding
of the 14th ACM SIGKDD international conference on Knowledge discovery and
data mining (KDD), pages 327–335, New York, NY, USA, 2008. ACM.
[50] J. H. Hwang and K. H. Ryu. Clustering and retrieval of XML documents by structure.
In O. Gervasi, M. Gavrilova, V. Kumar, A. Lagan, H. Lee, Y. Mun, D. Taniar,
and C. Tan, editors, Computational Science and Its Applications ICCSA 2005,
volume 3481 of Lecture Notes in Computer Science, pages 925–935. Springer Berlin
/ Heidelberg, 2005.
[51] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining
frequent substructures from graph data. In Proceedings of the 4th European Con-
ference on Principles and Practice of Knowledge Discovery in Databases (PKDD),
pages 13–23, Lyon, France, 2000.
[52] A. Inokuchi, T. Washio, and H. Motoda. A general framework for mining frequent
subgraphs from labeled graphs. Fundamenta Informaticae, 66(1-2):53–82, 2005.
[53] A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining
frequent connected subgraphs. Technical report, IBM Research, Tokyo Research
Laboratory, 2002.
[54] J. Paik, D. R. Shin, and U. Kim. EFoX: A scalable method for extracting frequent subtrees.
In International Conference on Computational Science, 2005.
[55] S. Jegelka, S. Sra, and A. Banerjee. Approximation algorithms for tensor clustering.
In Algorithmic Learning Theory, pages 368–383. 2009.
[56] J. W. Jian, W. K. Cheung, and X. O. Chen. Integrating element and term semantics
for similarity-based XML document clustering. IEEE / WIC / ACM International
Conference on Web Intelligence (WI), pages 222–228, 2005.
[57] J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos. Incremental tensor
analysis: Theory and applications. ACM Trans. Knowl. Discov. Data, 2(3):1–37,
2008.
[58] T. G. Kolda and J. Sun. Scalable tensor decompositions for multi-aspect data mining.
In ICDM 2008: Proceedings of the 8th IEEE International Conference on Data
Mining, pages 363–372, December 2008.
[59] G. Karypis. CLUTO – software for clustering high-dimensional datasets, 2007.
[60] M. Kc, M. Hagenbuchner, A. C. Tsoi, F. Scarselli, A. Sperduti, and M. Gori. XML
document mining using contextual Self-organizing Maps for structures. In 5th In-
ternational Workshop of the Initiative for the Evaluation of XML Retrieval, INEX,
pages 510–509, Dagstuhl Castle, Germany, 2005.
[61] H. A. L. Kiers. Towards a standardized notation and terminology in multiway
analysis. J. Chemometrics, 14(3):105–122, 2000.
[62] N. Klarlund, T. Schwentick, and D. Suciu. Xml: Model, schemas, types, logics,
and queries. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for
Emerging Applications of Databases, pages 1–41. Springer, 2003.
[63] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM
Review, 51(3):455–500, 2009.
[64] S. B. Kotsiantis and P. E. Pintelas. Recent advances in clustering: A brief survey.
WSEAS Transactions on Information Science and Applications, 1:73–81, 2004.
[65] L. Kurgan, W. Swiercz, and K. J. Cios. Semantic mapping of XML tags using
inductive machine learning. In 11th International Conference on Information and
Knowledge Management (CIKM), Virginia, USA, 2002.
[66] S. Kutty, R. Nayak, and Y. Li. XML data mining: Process and applications. In
M. Song and Y. F. Wu, editors, Handbook of Research on Text and Web Mining
Technologies. Idea Group Inc., USA, 2008.
[67] S. Kutty, R. Nayak, and Y. Li. HCX: an efficient hybrid clustering approach for XML
documents. In DocEng ’09: Proceedings of the 9th ACM symposium on Document
engineering, pages 94–97, New York, NY, USA, 2009. ACM.
[68] S. Kutty, R. Nayak, and Y. Li. XCFS: an XML documents clustering approach using
both the structure and the content. In Proceedings of the 18th ACM conference on
Information and knowledge management (CIKM), CIKM ’09, pages 1729–1732, New
York, NY, USA, 2009. ACM.
[69] S. Kutty, T. Tran, R. Nayak, and Y. Li. Clustering XML documents using closed
frequent subtrees: A structural similarity approach. In N. Fuhr, J. Kamps, M. Lal-
mas, and A. Trotman, editors, Focused Access to XML Documents, volume 4862 of
Lecture Notes in Computer Science, pages 183–194. Springer Berlin / Heidelberg,
2008.
[70] L. D. Lathauwer, B. D. Moor, and J. Vandewalle. A multilinear singular value
decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.
[71] D. Lee and W. W. Chu. Comparative analysis of six XML schema languages. In
ACM SIGMOD Record, volume 29, pages 76–87, 2000.
[72] L. M. Lee, L. H. Yang, W. Hsu, and X. Yang. XClust: Clustering XML schemas for
effective integration. In CIKM 2002, Virginia, November 2002.
[73] H. P. Leung, F. L. Chung, S. C. F. Chan, and R. Luk. XML document clustering
using common XPath. In International Workshop on Challenges in Web Information
Retrieval and Integration (WIRI ’05), pages 91–96, 2005.
[74] J. Lovins. Development of a stemming algorithm. Mechanical Translation and
computational Linguistics, 11(1):23–31, 1968.
[75] M. Sahlgren. An introduction to random indexing. In Methods and Applications of
Semantic Indexing Workshop at the 7th International Conference on Terminology
and Knowledge Engineering, TKE 2005, 2005.
[76] C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval.
Cambridge University Press, NY, USA, 2008.
[77] Q. Mei and Y. Liu. A tree structure frequent pattern mining algorithm based on
hybrid search strategy and bitmap. In IEEE International Conference on Intelligent
Computing and Intelligent Systems, 2009. ICIS 2009., volume 1, pages 452–456,
2009.
[78] A. Mirzal. Weblog clustering in multilinear algebra perspective. International Jour-
nal of Information Technology, 15(1), 2009.
[79] C. H. Moh, E. P. Lim, and W. K. Ng. DTD-Miner: a tool for mining DTD from XML
documents. In the 2nd International Workshop on Advance Issues of E-Commerce
and Web-Based Information Systems, 2000.
[80] D. Muti and S. Bourennane. Survey on tensor signal algebraic filtering. Signal
Process, 87(2):237–249, 2007.
[81] R. Nayak, C. de Vries, S. Kutty, and S. Geva. Report on the XML mining track
clustering task at INEX 2009. In S. Geva, J. Kamps, and A. Trotman, editors,
Focused Retrieval and Evaluation. Springer, 2010.
[82] R. Nayak and W. Iryadi. XMine: A methodology for mining XML structure. In
X. Zhou, J. Li, H. Shen, M. Kitsuregawa, and Y. Zhang, editors, Frontiers of WWW
Research and Development - APWeb 2006, volume 3841 of Lecture Notes in Com-
puter Science, pages 786–792. Springer Berlin / Heidelberg, 2006.
[83] R. Nayak and W. Iryadi. XML schema clustering with semantic and hierarchical
similarity measures. Knowledge-Based Systems, 20(4):336–349, 2006.
[84] R. Nayak and T. Tran. A progressive clustering algorithm to group the XML Data
by structural and semantic Similarity. International Journal of Pattern Recognition
and Artificial Intelligence (IJPRAI), 21(4):723–743, 2007.
[85] R. Nayak and F. B. Xia. Automatic integration of heterogeneous xml-schemas. In
Proceedings of the International Conferences on Information Integration and Web-
based Applications and Services, 2004.
[86] R. Nayak and S. Xu. XCLS: A fast and effective clustering algorithm for heteroge-
nous XML documents. In Proceedings of the Pacific-Asia Conference on Knowledge
Discovery and Data Mining (PAKDD), pages 292–302, Singapore, 2006.
[87] A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents.
In 5th International Conference on Computational Science (ICCS’05), Wisconsin,
USA, 2002.
[88] S. Nijssen and J. Kok. Efficient discovery of frequent unordered trees. In Proceedings
of International Workshop on Mining Graphs, Trees, and Sequences, 2003.
[89] C. D. Paice. Another Stemmer. ACM SIGIR Forum, 24(3):56–61, 1990.
[90] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic
indexing: a probabilistic analysis. In Proceedings of the seventeenth ACM SIGACT-
SIGMOD-SIGART symposium on Principles of database systems, PODS ’98, pages
159–168, New York, NY, USA, 1998. ACM.
[91] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[92] C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
[93] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-
Hill, Inc., New York, NY, USA, 1986.
[94] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-
Hill Book Co., New York, 1989.
[95] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing.
Communication of ACM, 18(11):613–620, 1975.
[96] T. M. Selee, T. G. Kolda, W. P. Kegelmeyer, and J. D. Griffin. Extracting clusters
from large datasets with multiple similarity measures using IMSCAND. In M. L.
Parks and S. S. Collis, editors, CSRI Summer Proceedings 2007, Technical Report
SAND2007-7977, Sandia National Laboratories, Albuquerque, NM and Livermore,
CA, pages 87–103, December 2007.
[97] Y. Shen and B. Wang. Clustering schemaless XML document. In 11th International
Conference on Cooperative Information System, 2003.
[98] J. M. Smith and R. Stutely. SGML: The User’s Guide to ISO 8879. Ellis Horwood
Ltd, Chichester, 1988.
[99] S. Sol. Advantages of XML: moving beyond format, 1998.
[100] I. Stuart. XML Schema: a brief introduction, 2004.
[101] F. M. Suchanek, A. S. Varde, R. Nayak, and P. Senellart. The hidden web, xml and
the semantic web: scientific data management perspectives. pages 534–537. ACM,
2011.
[102] J. T. Sun, Z. Chen, H. J. Zeng, Y. C. Lu, C. Y. Shi, and W. Y. Ma. Supervised latent
semantic indexing for document categorization. In IEEE International Conference
on Data Mining (ICDM), pages 535–538, 2004.
[103] A. Tagarelli and S. Greco. Toward semantic XML clustering. In Proceedings of
SIAM International Conference on Data Mining (SDM), pages 188–199, 2006.
[104] A. Tagarelli and S. Greco. Semantic clustering of XML documents. ACM Transac-
tions on Information Systems, 28(1):1–56, 2009.
[105] H. Tan, S. T. Dillon, F. Hadzic, L. Feng, and E. Chang. MB3-Miner: mining
eMBedded subTREEs using Tree Model Guided candidate generation. In Proceedings
of the 1st International Workshop on Mining Complex Data, held in conjunction
with ICDM05, 2005.
[106] H. Tan, T. Dillon, L. Feng, E. Chang, and F. Hadzic. X3-Miner: Mining patterns
from XML database. In Proceedings of Data Mining ’05, Skiathos, 2005.
[107] A. Termier, M.-C. Rousset, and M. Sebag. TreeFinder: a first step towards XML
data mining. In ICDM 2002. Proceedings. 2002 IEEE International Conference on
Data Mining, 2002., pages 450–457, 2002.
[108] A. Termier, M.-C. Rousset, M. Sebag, K. Ohara, T. Washio, and H. Motoda. Effi-
cient mining of high branching factor attribute trees. In Fifth IEEE International
Conference on Data Mining., page 4 pp., 2005.
[109] T. Tran, S. Kutty, and R. Nayak. Utilizing the structure and content information for
xml document clustering. In S. Geva, J. Kamps, and A. Trotman, editors, Advances
in Focused Retrieval, volume 5631 of Lecture Notes in Computer Science, pages
460–468. Springer Berlin / Heidelberg, 2009.
[110] T. Tran and R. Nayak. Evaluating the performance of the XML document clustering
by structure only. In 5th International Workshop of the Initiative for the Evaluation
of XML Retrieval, INEX, pages 473–484, Dagstuhl Castle, Germany, 2006.
[111] T. Tran, R. Nayak, and P. Bruza. Document clustering using incremental and
pairwise approaches. In N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman, editors,
Focused Access to XML Documents, volume 4862 of Lecture Notes in Computer
Science, pages 222–233. Springer Berlin / Heidelberg, 2008.
[112] C. E. Tsourakakis. MACH: Fast randomized tensor decompositions. In The SIAM
Data Mining Conference (SDM), pages 689–700, Columbus, Ohio, USA, 2010.
[113] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika,
31:279–311, 1966.
[114] A. M. Vercoustre, M. Fegas, S. Gul, and Y. Lechevallier. A flexible structured-
based representation for XML document mining. In Advances in XML Information
Retrieval and Evaluation, pages 443–457. 2006.
[115] R. A. Wagner and M. J. Fischer. The String-to-String Correction Problem. J. ACM,
21:168–173, January 1974.
[116] J. W. W. Wan and G. Dobbie. Mining association rules from XML data using
XQuery. In Proceedings of the second workshop on Australasian information security,
Data Mining and Web Intelligence, and Software Internationalisation, pages 169–
174. Australian Computer Society, Dunedin, New Zealand, 2004.
[117] C. Wang, M. Hong, J. Pei, H. Zhou, W. Wang, and B. Shi. Efficient pattern-growth
methods for frequent tree pattern mining. In H. Dai, R. Srikant, and C. Zhang,
editors, Advances in Knowledge Discovery and Data Mining, volume 3056 of Lecture
Notes in Computer Science, pages 441–451. Springer Berlin / Heidelberg, 2004.
[118] J. Wang and J. Han. BIDE: Efficient mining of frequent closed sequences. In
Proceedings of the 20th International Conference on Data Engineering, pages 79–90.
IEEE Computer Society, 2004.
[119] Y. Wang, D. J. DeWitt, and J. Y. Cai. X-Diff: An effective change detection algo-
rithm for XML documents. In IEEE International Conference on Data Engineering,
2003.
[120] E. Wilde and R. J. Glushko. XML fever. Communications of the ACM, 51:40–46, July 2008.
[121] P. Willett. The porter stemming algorithm: Then and now. Program: Electronic
Library and Information Systems, 40(3):219–223, 2006.
[122] C. N. Win and K. H. S. Hla. Mining frequent patterns from XML data. In 6th Asia-
Pacific Symposium on Information and Telecommunication Technologies (APSITT),
pages 208–212, 2005.
[123] Y. Xiao, J. F. Yao, Z. Li, and M. H. Dunham. Efficient data mining for maximal
frequent subtrees. In Proceedings of the IEEE International Conference on Data
Mining (ICDM), pages 379–386, Washington, DC, USA, 2003. IEEE Computer So-
ciety.
[124] G. Xing, Z. Xia, and J. Guo. Clustering XML documents based on structural
similarity. In R. Kotagiri, P. Krishna, M. Mohania, and E. Nantajeewarawat, editors,
Advances in Databases: Concepts, Systems and Applications, volume 4443 of Lecture
Notes in Computer Science, pages 905–911. Springer Berlin / Heidelberg, 2007.
[125] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix
factorization. In Proceedings of the 26th annual international ACM SIGIR conference
on Research and development in informaion retrieval, SIGIR ’03, pages 267–273,
New York, NY, USA, 2003. ACM.
[126] X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: a profile-
based approach. In Proceedings of the eleventh ACM SIGKDD international conference
on Knowledge discovery in data mining (KDD), pages 314–323, New York,
NY, USA, 2005. ACM.
[127] X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large
datasets. In SIAM International Conference on Data Mining (SDM), pages 166–177,
2003.
[128] J. Yang and X. Chen. A semi-structured document model for text mining. J.
Computer Science and Technology, 17(5):603–610, 2002.
[129] J. Yang, W. K. Cheung, and X. Chen. Learning the kernel matrix for XML document
clustering. In e-Technology, e-Commerce and e-Service, 2005.
[130] Y. Yang, X. Guan, and J. You. CLOPE: a fast and effective clustering algorithm
for transactional data. In Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining (KDD), pages 682–687, New
York, NY, USA, 2002. ACM.
[131] J. Yao and N. Zerida. Rare patterns to improve path-based clustering of Wikipedia
articles. In N. Fuhr, M. Lalmas, and A. Trotman, editors, Pre-proceedings of the
Sixth Workshop of the Initiative for the Evaluation of XML Retrieval, pages 224–231,
Dagstuhl, Germany, 2007.
[132] J. Yao and N. Zerida. Rare patterns to improve path-based clustering. In 6th
International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX
2007, Dagstuhl Castle, Germany, Dec 17-19, 2007.
[133] J. Yoon, V. Raghavan, and L. Kerschberg. BitCube: clustering and statistical
analysis for XML documents. In Thirteenth International Conference on Scientific
and Statistical Database Management, Fairfax, Virginia, 2001.
[134] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery and data mining
(KDD), pages 71–80. ACM Press, Edmonton, Alberta, Canada, 2002.
[135] M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications.
IEEE Transactions on Knowledge and Data Engineering, 17(8):1021–1035, 2005.
[136] M. J. Zaki and C. C. Aggarwal. XRules: an effective structural classifier for XML
data. In Proceedings of the ninth ACM SIGKDD international conference on Knowl-
edge discovery and data mining (KDD), pages 316–325. ACM Press, Washington,
D.C., 2003.
[137] S. Zhang, M. Hagenbuchner, A. Tsoi, and A. Sperduti. Self-organizing maps for the
clustering of large sets of labeled graphs. In Advances in Focused Retrieval, volume 5631
of Lecture Notes in Computer Science, pages 469–481. Springer Berlin / Heidelberg,
2009.
[138] W. S. Zhang, D. X. Liu, and J. P. Zhang. A novel method for mining frequent
subtrees from XML data. In Z. R. Yang, H. Yin, and R. Everson, editors, Intelligent
Data Engineering and Automated Learning (IDEAL 2004), volume 3177 of Lecture
Notes in Computer Science, pages 300–305. Springer Berlin / Heidelberg, 2004.
[139] Y. Zhao and G. Karypis. Data clustering in life sciences. Molecular Biotechnology,
31:55–80, 2005.
[140] L. Zou, Y. Lu, H. Zhang, and R. Hu. PrefixTreeESpan: a pattern growth algorithm
for mining embedded subtrees. In K. Aberer, Z. Peng, E. Rundensteiner, Y. Zhang,
and X. Li, editors, Web Information Systems - WISE 2006, volume 4255 of Lecture
Notes in Computer Science, pages 499–505. Springer Berlin / Heidelberg, 2006.
[141] L. Zou, Y. Lu, H. Zhang, R. Hu, and C. Zhou. Mining frequent induced subtrees by
prefix-tree-projected pattern growth. In Seventh International Conference on Web-
Age Information Management Workshops, 2006. WAIM ’06., pages 18–26, 2006.
Appendix A
Details of the real-life datasets
This appendix details the two sets of categories used in the INEX 2009 dataset. Table A.1
lists the categories derived from the Wikipedia categories, and Table A.2 lists the categories
derived from the ad hoc queries, ordered by their topic Id.
Table A.1: Details of all the categories in the INEX 2009 dataset using Wikipedia categories

Id   Category                   Number of documents
 1   People                     15359
 2   Society                    12663
 3   Geography                   9065
 4   Culture                     9033
 5   Politics                    8589
 6   History                     8035
 7   Nature                      5788
 8   Countries                   5724
 9   Applied sciences            5568
10   Humanities                  5205
11   Business                    3734
12   Technology                  3584
13   Science                     3378
14   Arts                        2837
15   Historical eras             2780
16   Health                      2760
17   Entertainment               2521
18   Belief                      2417
19   Life                        2301
20   Language                    2140
21   Environment                 2116
22   Places                      1935
23   Fields of history           1876
24   Human geography             1564
25   Recreation                  1431
26   Disambiguation pages        1368
27   Information                 1342
28   Companies                   1330
29   Vocabulary                  1286
30   Pharmaceutical sciences     1238
31   Religion                    1216
32   Science stubs               1202
33   Law                         1158
34   Agriculture                 1148
35   Biology                     1143
36   Literature                  1131
37   Debuts                      1116
38   Military                    1095
39   Space                       1049
40   Government                  1027
Table A.2: Details of all the categories in the INEX 2009 dataset using ad hoc queries, ordered by topic Id

Id        Query Title                                                         # Documents
2009001   Nobel prize                                                           8
2009002   Best movie                                                            2
2009005   Chemists physicists scientists alchemists periodic table elements    82
2009006   Opera singer Italian Spanish -soprano                                 7
2009010   Applications bayesian networks bioinformatics                         2
2009011   Olive oil health benefit                                              8
2009012   Vitiligo pigment disorder cause treatment                             7
2009013   Native American Indian wars against colonial Americans               33
2009020   IBM computer                                                          5
2009022   Szechwan dish food cuisine                                            7
2009023   “Plays of Shakespeare”+Macbeth                                       16
2009026   Generalife gardens                                                    2
2009028   Fastest speed bike scooter car motorcycle                             6
2009033   Al-Andalus taifa kingdoms                                             8
2009035   Bermuda Triangle                                                      9
2009036   Notting Hill Film actors                                             22
2009039   Roman architecture                                                   27
2009040   Steam engine                                                         25
2009041   The Scythians                                                         3
2009042   Sun Java                                                              1
2009043   NASA missions                                                       135
2009046   Penrose tiles tiling theory                                           5
2009047   “Kali’s child” criticisms reviews Psychoanalysis of
          Ramakrishna’s mysticism                                               2
2009051   Rabindranath Tagore Bengali literature                               18
2009054   Tampere region tourist attractions                                    5
2009055   European union expansion                                             24
2009061   France second world war normandy                                      9
2009062   Social network group selection                                        1
2009063   D-Day normandy invasion                                              27
2009064   Stock exhange insider trading crime                                   9
2009065   Sunflowers Vincent van Gogh                                           1
2009066   Folk metal groups finland                                             1
2009068   China great wall                                                      2
2009070   Health care reform plan                                               2
2009071   Earthquake prediction                                                 2
2009073   Web link network analysis                                             1
2009076   Sociology and social issues and aspects in science fiction           14
2009079   Dangerous paraben bisphenol-A                                         3
2009082   South african nature reserve                                          1
2009087   History bordeaux                                                      2
2009089   World wide web history                                                6
2009092   Ski +waxing -water -wave                                              1
2009093   French revolution                                                    40
2009096   Eiffel                                                                9
2009104   Lunar mare formation mechanism                                        6
2009105   Musicians Jazz                                                       10
2009108   Sustainability indicators metrics                                     8
2009109   Circus acts skills                                                    4
2009110   Paul is dead hoax theory                                              2
2009111   Europe solar power facility                                           2
2009113   Toy Story Buzz Lightyear 3D rendering Computer Generated Imagery      2
2009115   virtual museums                                                       2
Appendix B
Empirical Evaluation of Frequent Mining Results
Table B.1: Runtime comparison of Length Constrained Subtrees on F5 dataset

Min supp   Const   PCITMiner-Const   PMITMiner-Const   PCETMiner-Const   PMETMiner-Const
 2           3        0.33              0.33              0.6               0.6
             5        0.43              0.43              0.6               0.6
             7        0.44              0.45              0.61              0.61
             9        0.45              0.45              0.59              0.59
            11        0.45              0.44              0.58              0.58
 4           3        0.3               0.32              0.34              0.34
             5        0.36              0.36              0.4               0.4
             7        0.37              0.37              0.4               0.4
             9        0.37              0.37              0.39              0.39
            11        0.37              0.37              0.41              0.41
 6           3        0.24              0.24              0.24              0.24
             5        0.2               0.2               0.26              0.26
             7        0.25              0.25              0.26              0.26
             9        0.25              0.25              0.26              0.26
            11        0.26              0.26              0.26              0.26
 8           3        0.25              0.25              0.24              0.24
             5        0.24              0.24              0.24              0.24
             7        0.24              0.24              0.24              0.24
             9        0.24              0.24              0.23              0.23
            11        0.24              0.24              0.24              0.24
10           3        0.22              0.22              0.2               0.2
             5        0.23              0.23              0.19              0.19
             7        0.23              0.23              0.18              0.18
             9        0.23              0.23              0.2               0.2
            11        0.23              0.23              0.2               0.2
Table B.2: Number of Length Constrained Subtrees in F5 dataset

Min supp   Const   PCITMiner-Const   PMITMiner-Const   PCETMiner-Const   PMETMiner-Const
 2           3        17                9                 21                23
             5        17                7                 21                10
             7        17                7                 21                10
             9        17                7                 21                10
            11        17                7                 21                10
 4           3         8                6                 14                11
             5         9                4                 11                 5
             7         9                4                 11                 5
             9         9                4                 11                 5
            11         9                4                 11                 5
 6           3         5                3                  8                 5
             5         6                2                  7                 2
             7         6                2                  7                 2
             9         6                2                  7                 2
            11         6                2                  7                 2
 8           3         4                3                  6                 4
             5         4                2                  5                 2
             7         4                2                  5                 2
             9         4                2                  5
            11         4                2                  5                 2
10           3         3                2                  4                 2
             4         5                2                  4                 2
             6         7                2                  4                 2
             8         9                2                  4                 2
            11         3                2                  4                 2