UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Pattern Mining on Data Warehouses: A Domain Driven Approach
Andreia Liliana Perdigão da Silva
Supervisor: Doctor Cláudia Martins Antunes
Thesis approved in public session to obtain the PhD degree in Information Systems and Computer Engineering
Jury final classification: Pass with Merit
Jury
Chairperson: Chairman of the IST Scientific Board
Members of the Committee:
Doctor Sebastián Ventura Soto, Associate Professor, University of Córdoba, Spain
Doctor Francisco José Moreira Couto, Professor Associado, Faculdade de Ciências, Universidade de Lisboa
Doctor Alípio Mário Guedes Jorge, Professor Associado, Faculdade de Ciências, Universidade do Porto
Doctor Cláudia Martins Antunes, Professora Auxiliar, Instituto Superior Técnico, Universidade de Lisboa
Doctor Alexandre Paulo Lourenço Francisco, Professor Auxiliar, Instituto Superior Técnico, Universidade de Lisboa
Doctor Sara Alexandra Cordeiro Madeira, Professora Auxiliar, Instituto Superior Técnico, Universidade de Lisboa
Funding Institutions
Fundação para a Ciência e a Tecnologia
2014
Resumo
Um desafio crescente do data mining prende-se com a capacidade de lidar com grandes quantidades de dados complexos e dinâmicos. Em muitas aplicações reais, os dados complexos estão organizados em múltiplas tabelas de dados, relacionadas entre si, o que torna a sua análise como um todo mais difícil e desafiante. Uma forma comum de representar um modelo multi-dimensional é através de um esquema em estrela, que consiste numa tabela de factos central, que liga um conjunto de tabelas de dimensão. Esta tabela de factos guarda normalmente um conjunto enorme de registos, que torna quase impossível ter todos os dados em memória. Mais ainda, nem todos os dados podem estar disponíveis a priori, uma vez que novos dados estão, muito provavelmente, continuamente a ser gerados. Outro problema comum dos algoritmos de descoberta de padrões é o facto de estes gerarem um elevado número de padrões, independentes dos conhecimentos do utilizador. Este tão grande número de resultados e a sua falta de foco dificultam a interpretação e selecção de resultados, e por isso limitam a utilização destas técnicas para apoio à decisão.
Neste trabalho, argumenta-se que é possível descobrir padrões em dados modelados num esquema em estrela de modo eficiente, bem como incorporar restrições de domínio no processo de descoberta, para focar os resultados no conhecimento de domínio existente e nas expectativas dos utilizadores. De modo a demonstrar a validade desta tese, é proposto um novo algoritmo – StarFP-Stream, que combina técnicas de descoberta de padrões em várias tabelas com técnicas para fluxos contínuos de dados (ou streams). Este algoritmo é capaz de explorar eficientemente grandes e crescentes quantidades de dados de um esquema em estrela, e em vários níveis de agregação. Também são propostos dois algoritmos – CoPT e CoPT4Streams, para introduzir restrições numa tabela de dados estáticos ou num stream, respectivamente. Os algoritmos usam uma estrutura em árvore compacta, e são capazes de acelerar a incorporação de qualquer tipo de restrições, evitando testes desnecessários e eliminando mais cedo os padrões inválidos. Finalmente, também é definido um conjunto de restrições desenhadas para um esquema em estrela, e é proposto um novo algoritmo – D2StarFP-Stream, para introduzir essas restrições na descoberta de padrões multi-dimensionais.
Além disso, os algoritmos são avaliados sobre conjuntos de dados artificiais e reais, tanto do domínio de vendas, como na saúde e na educação.
Abstract
A growing challenge in data mining is the ability to deal with complex, voluminous and dynamic data. In many real-world applications, complex data is organized in multiple inter-related database tables, which makes its analysis as a whole more difficult and challenging. A very common multi-dimensional model is the star schema, which consists of a central fact table linking a set of dimension tables. This fact table usually stores a massive number of records, which makes it almost impossible to keep all the data in main memory. Furthermore, not all data may be available a priori, since new data is most likely being continuously generated. Another problem of pattern discovery algorithms is that they generate a huge number of patterns, independent of user expertise. Such a large number of results, and their lack of focus, hinder the interpretation and selection of results, and therefore make it harder to use these results for decision support.
In this work we argue that it is possible to efficiently and effectively mine large amounts of data modeled as a star schema, as well as to incorporate domain constraints into the discovery process, to focus the results according to the domain knowledge and user expectations. In order to demonstrate the validity of this thesis, we propose a new algorithm – StarFP-Stream – that combines multi-relational and data streaming techniques, and is able to mine a large and growing star schema efficiently, at any aggregation level. We also propose two algorithms – CoPT and CoPT4Streams – for pushing constraints into static and growing single tables, respectively. Both algorithms make use of a compact tree structure, and are able to speed up the incorporation of any type of constraint, by avoiding unnecessary tests and pruning invalid patterns earlier. Finally, we also define a set of constraints designed for star schemas, and propose a new algorithm – D2StarFP-Stream – that is able to incorporate these constraints into multi-dimensional mining.
Additionally, we evaluate our algorithms over both artificial and real data, in the sales, healthcare
and education domains.
Palavras-Chave
Keywords
Palavras-Chave
Descoberta de Informação
Descoberta de Padrões
Exploração de Armazéns de Dados
Esquemas em Estrela
Descoberta de Informação em Dados Multi-Relacionais
Descoberta de Informação em Fluxos Contínuos de Dados
Árvores de Padrões
Conhecimento de Domínio
Incorporação de Restrições
Restrições Multi-Dimensionais
Keywords
Data Mining
Pattern Mining
Mining Data Warehouses
Star Schemas
Multi-Relational Data Mining
Mining Data Streams
Pattern-Trees
Domain Knowledge
Constrained Mining
Multi-Dimensional Constraints
Acknowledgments
I would like to thank all who have been present and have contributed to this thesis in so many ways. First and foremost, I would like to thank my adviser, Professor Cláudia Antunes, for all the support, encouragement, guidance and confidence she gave me, and for the countless hours of talk during these five years.
I would also like to thank my colleagues and friends (in alphabetical order): David Duarte, Nuno Lopes and Rui Henriques. They have contributed to this work in the form of insightful discussions, collaboration and key advice.
A big thank you also to my conference friends, for making the experience of attending conferences
more enjoyable and productive, technically and socially.
A special thank you to Filipe, for his presence, patience, encouragement and support. He also con-
tributed to this work with more technical discussions and guidance.
This work was financially supported in part by FCT (Fundação para a Ciência e a Tecnologia) under grant SFRH/BD/64108/2009 and research projects educare (PTDC/EIA-EIA/110058/2009) and D2PM (PTDC/EIA-EIA/110074/2009). I am very grateful and indebted for their support.
Finally, a big thank you to my family and all other friends for their continuous support and encour-
agement through all these years.
Contents
1 Introduction 1
1.1 Open Issues in Multi-Relational Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Finding Patterns on Star Schemas: An Introduction 7
2.1 The Core of the Multi-Dimensional Model: a Star Schema . . . . . . . . . . . . . . . . . . 8
2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Challenges and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Finding Patterns on Large Star Schemas 15
3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 MRPM over Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 StarFP-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Rationale behind the star stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Algorithm StarFP-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.6 Strengths and Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.7 Comparison with Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.8 Time Sensitive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 The Groundwork on Domain Driven Data Mining 35
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Inductive Logic Programming - Discussion and Arguments . . . . . . . . . . . . . 37
4.1.2 Domain Driven Data Mining – Discussion and Arguments . . . . . . . . . . . . . . 37
4.1.3 Semantic Data Mining – Discussion and Arguments . . . . . . . . . . . . . . . . . 38
4.2 Domain Knowledge Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Constrained Pattern Mining: Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 A new Framework for Constrained Pattern Mining . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Constraint Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Constraint Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.8 Constrained Pattern Mining Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8.1 Properties vs. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8.2 Categories vs. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8.3 Data Sources vs. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.9 Discussion and Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Pushing Constraints into Pattern Mining 61
5.1 Pushing Constraints into a Static Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.1 Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.2 Constraint Pushing Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.3 Algorithm CoPT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Pushing Constraints into a Dynamic Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . 68
5.2.1 Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2.2 Constraint Pushing Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.3 Algorithm CoPT4Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Towards the Incorporation of Constraints into Multi-Dimensional Mining . . . . . . . . . . 75
5.3.1 Transactional vs. Non-Transactional Data . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.2 Constraints in Star Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Pushing Star Constraints into Pattern Mining over Star Schemas . . . . . . . . . . 79
5.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Mining Stars with Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1 Constraining Business Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 D2StarFP-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.4 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 Conclusions and Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6 A Case Study in Healthcare 89
6.1 The Hepatitis Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 The Hepatitis Multi-Dimensional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.1 Building the Star Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.2 Understanding the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 Hepatitis Application Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 Finding Discriminant Patterns and Association Rules . . . . . . . . . . . . . . . . . . . . 97
6.5.1 Interesting Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Improving Prediction using Multi-Dimensional Patterns . . . . . . . . . . . . . . . . . . . 100
6.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6.2 Methodology into Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.6.3 Analysis of Multi-Relational Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.6.4 Enriched Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.7 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7 A Case Study in Education 107
7.1 The Educare Multi-Dimensional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2 Predicting Student Grades Using Multi-Dimensional Patterns . . . . . . . . . . . . . . . . 109
7.2.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2.2 Methodology into practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2.3 Analysis of Multi-Relational Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.4 Enriched Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8 Conclusions and Future Work 115
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
List of Figures
2.1 Star Internet Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 An example of a pattern-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 StarFP-Stream example: Part of the pattern-tree resulting from the first batch . . . . . . 22
3.3 StarFP-Stream example: DimFP-tree of dimensions Customer and Product . . . . . . . . 23
3.4 StarFP-Stream example: Super FP-tree of the second batch . . . . . . . . . . . . . . . . . 23
3.5 StarFP-Stream example: Part of the final pattern-tree . . . . . . . . . . . . . . . . . . . . 24
3.6 AW experiments: Number of patterns returned and precision . . . . . . . . . . . . . . . . 30
3.7 AW experiments: Average and detailed pattern-tree size . . . . . . . . . . . . . . . . . . . 31
3.8 AW experiments: Average and detailed update time . . . . . . . . . . . . . . . . . . . . . 32
3.9 AW experiments: Average maximum memory per batch . . . . . . . . . . . . . . . . . . . 33
4.1 A framework for constrained pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 CoPT : Time with AM constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 CoPT : Checks with AM constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 CoPT : Time with M constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 CoPT : Checks with M constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 CoPT : Time with Mixed constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 CoPT : Checks with Mixed constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.7 CoPT4Streams: Average size of the pattern-tree . . . . . . . . . . . . . . . . . . . . . . . 73
5.8 CoPT4Streams: Average time needed to update the pattern-tree . . . . . . . . . . . . . . 73
5.9 CoPT4Streams: Average number of constraint checks . . . . . . . . . . . . . . . . . . . . 73
5.10 Example of transactional and corresponding non-transactional data . . . . . . . . . . . . . 76
5.11 A star schema, showing transactional and non-transactional data . . . . . . . . . . . . . . 77
5.12 D2StarFP-Stream: Average size of the pattern-tree . . . . . . . . . . . . . . . . . . . . . . 86
5.13 D2StarFP-Stream: Average maximum memory needed . . . . . . . . . . . . . . . . . . . . 86
5.14 D2StarFP-Stream: Average update time of the pattern-tree . . . . . . . . . . . . . . . . . 86
6.1 Hepatitis relational model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Hepatitis star schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Hepatitis: Number of exams per patient (female and male) . . . . . . . . . . . . . . . . . 93
6.4 Hepatitis: Number of exams per patient diagnosed with hepatitis B, C or still undiagnosed 93
6.5 Hepatitis: Distribution of exams per stage of hepatitis . . . . . . . . . . . . . . . . . . . . 93
6.6 Hepatitis: Number of exams per patient, at each stage of hepatitis . . . . . . . . . . . . . 94
6.7 Hepatitis Star : Number of patterns returned and precision . . . . . . . . . . . . . . . . . . 95
6.8 Hepatitis Star : Average and detailed pattern-tree size . . . . . . . . . . . . . . . . . . . . 95
6.9 Hepatitis Star : Average and detailed update time . . . . . . . . . . . . . . . . . . . . . . . 96
6.10 Hepatitis Star : Average maximum memory per batch . . . . . . . . . . . . . . . . . . . . 96
6.11 The multi-dimensional methodology for enriching classification . . . . . . . . . . . . . . . 100
6.12 Classification in Hepatitis: Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.13 Classification in Hepatitis: Size of the trees . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 An example of an educational data-warehouse . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2 Classification in Educare: Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Classification in Educare: Size of the trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
List of Tables
2.1 Star Internet Sales: Dimension Tables Product, Customer and Sales Territory . . . . . . 9
2.2 Star Internet Sales: Sales Orders Fact Table . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 StarFP-Stream example: A subset of the final patterns . . . . . . . . . . . . . . . . . . . . 24
3.2 Correspondence between StarFP-Stream and SWARM representations . . . . . . . . . . . 26
3.3 AW experiments: A summary of the dataset characteristics . . . . . . . . . . . . . . . . . 29
3.4 AW experiments: Batches corresponding to each error . . . . . . . . . . . . . . . . . . . . 29
4.1 Advantages and disadvantages of the different forms of domain knowledge representations 42
4.2 Content constraints and respective properties . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Structural constraints and respective properties . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Algorithms for each constraint property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Algorithms for each constraint category . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 Differences on mining transactional and non-transactional data . . . . . . . . . . . . . . . 76
6.1 Hepatitis: Important exams and corresponding thresholds and categories . . . . . . . . . . 92
6.2 A summary of the Hepatitis star characteristics . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Hepatitis Star : Batches corresponding to each error . . . . . . . . . . . . . . . . . . . . . . 94
6.4 Hepatitis Star : Some examples of the patterns found . . . . . . . . . . . . . . . . . . . . . 98
6.5 Hepatitis Star : Some examples of the association rules found . . . . . . . . . . . . . . . . 99
6.6 Classification in Hepatitis: Some examples of the multi-relational patterns found . . . . . 104
7.1 Educare: Some examples of patterns found for the Enrollment Star . . . . . . . . . . . . . 111
7.2 Educare: Some examples of patterns found for the Teaching QA Star . . . . . . . . . . . . 111
Chapter 1
Introduction
To tackle the rapid growth of data, the area of Data Mining [FPSM92] emerged with the goal of creating methods and tools capable of analyzing these data and extracting useful information that companies can exploit and apply to their businesses. Finding frequent patterns in data has become an important and widely studied task in data mining, since it allows us to find different types of interesting relations among data, such as association rules [AS94, PH02], correlations [BMS97], sequences [PHW07], multi-dimensional patterns [D03, SA10], episodes [MTIV97], emerging patterns [DL99], etc. Pattern mining (PM) is also recognized as an important tool that helps in data pre-processing, and in other data mining tasks like classification [LHM98, SA14b, SA14a] and clustering [LSW97].
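As a minimal illustration of the frequent pattern mining task mentioned above, the following Python sketch enumerates frequent itemsets in an Apriori-style, level-wise manner (the toy market baskets and the support threshold are invented for illustration; this is a didactic sketch, not an algorithm from this thesis):

```python
def frequent_itemsets(transactions, min_support):
    """Return every itemset contained in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    frequent = {}
    # Level 1: candidate itemsets are the individual items.
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    while candidates:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Apriori step: (k+1)-candidates are built only from frequent k-itemsets,
        # because every subset of a frequent itemset must itself be frequent.
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == len(a) + 1})
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
patterns = frequent_itemsets(baskets, min_support=2)
# e.g. {bread, milk} is frequent (2 baskets), while {milk, butter} is not (1 basket)
```

The anti-monotonicity of support is what makes the level-wise pruning sound, and it is the same property that later motivates pushing other anti-monotone constraints into the search.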
Despite the great advances in this field, the challenges imposed by the era of big data continue to defy these algorithms. Indeed, big data brought a completely new context to operate in, changing the nature of data not only from static to dynamic, but also from tabular to more complex data sources, such as social networks (expressed as graphs) and data warehouses (expressed as multi-dimensional models).
In this new context, and more than ever, users need effective and efficient ways to mine this more complex and growing data, so that results can actually be used for decision support in real-world problems. Data Warehouses (DW) are an example of data repositories that emerged to make data analysis easier, by clearly separating the representation of business dimensions and events into a set of different, but related, tables [Inm96]. However, despite this ultimate goal and the advances of data mining algorithms, many of them are designed to deal with one single table and cannot be reused across domains.
One of the major challenges of mining multiple tables is how to join and create the tuples to be mined during the mining process. The most common approach is to join all the tables into one before mining, and then apply an existing and efficient single-table algorithm. However, in large applications, this initial join may not be computationally feasible, and even when it is, the resulting table is so large and sparse that it adds a huge overhead to the already expensive mining process [NFW02].
The area of Multi-Relational Data Mining (MRDM) [D03] was born from the need to explore data stored in multiple interrelated tables, and aims at developing efficient data mining techniques that are able to discover frequent patterns involving multiple tables, in their original structure. Early advances were driven by Inductive Logic Programming (ILP) [DR97, MEL01, NK01], and in recent years the most common techniques have been extended to the multi-relational case.
1.1 Open Issues in Multi-Relational Pattern Mining
Despite the progress in the area, just a few algorithms developed for MRDM are dedicated to the multi-dimensional model present in DW, the star schema [CJS00, NFW02, XX06, SA11]. A star schema [KR02] consists of a central fact table containing the business events, and a set of surrounding dimension tables, comprising the specific data about each business dimension.
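To make the structure concrete, a star schema can be pictured as one fact table of foreign keys and measures, plus one lookup table per dimension; the join-before-mining alternative then flattens everything into one wide table. The sketch below uses a hypothetical sales star, not data from this thesis:

```python
# Dimension tables: one row per business entity, keyed by a surrogate id.
customers = {1: {"name": "Ana", "city": "Lisboa"},
             2: {"name": "Rui", "city": "Porto"}}
products = {10: {"name": "laptop", "category": "computers"},
            20: {"name": "mouse", "category": "accessories"}}

# Fact table: one row per business event, holding foreign keys and measures.
sales_facts = [
    {"customer_id": 1, "product_id": 10, "amount": 999.0},
    {"customer_id": 1, "product_id": 20, "amount": 19.0},
    {"customer_id": 2, "product_id": 20, "amount": 19.0},
]

def denormalize(facts, customers, products):
    """Join-before-mining: flatten the star into one wide table.

    Every fact row is widened with all attributes of every dimension,
    which is exactly the overhead multi-relational mining tries to avoid."""
    wide = []
    for f in facts:
        row = dict(f)
        row.update({"customer_" + k: v for k, v in customers[f["customer_id"]].items()})
        row.update({"product_" + k: v for k, v in products[f["product_id"]].items()})
        wide.append(row)
    return wide

flat = denormalize(sales_facts, customers, products)
```

Each of the three fact rows becomes a wide row repeating the full dimension descriptions, which is why the flattened table grows so quickly with the number and size of dimensions.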
The main problem with existing algorithms is that they are often not scalable with respect to the number of events and relations in the database. And in the case of DW, there is an urgent need for algorithms able to deal with large datasets, due to their growing nature: records are added over time, but never deleted. In a sense, DW can be compared to data streams, since they continuously grow over time, as new records are added to the fact table for each event occurrence.
To the best of our knowledge, there are only two algorithms able to mine multiple data stream tables [FCAM09, HYXW09]. Both are based on ILP, and therefore suffer from the same limitations as other ILP techniques, such as requiring all tables in Prolog form. They also suffer from the candidate generation bottleneck, well known in traditional pattern mining [AS94]. There is therefore a need for new and more efficient algorithms.
Another problem of pattern mining algorithms, and also of MRDM, is that they generate a huge number of patterns (thousands or more), independent of user expertise. Such a large number of results, and their lack of focus, hinder the interpretation and selection of results, and therefore make it harder to use these results for decision support. In fact, this is one of the reasons why pattern mining techniques are not more widely accepted and applied in real businesses.
Several ways have been proposed to minimize these bottlenecks, and the use of domain knowledge is the most accepted and common approach to focus the algorithms on areas where they are more likely to gain information and return more interesting results [BJ05]. This knowledge-driven data mining has gained attention in recent years, and the ways to represent and use domain knowledge have evolved from simple user interactions and annotations, through the use of constraints, to the use of domain ontologies. These new forms of representation are a promising way to guide data mining algorithms through the analysis of more complex and multi-dimensional data, since they can make explicit the existing dependencies and relations between business dimensions. However, the problem of efficiently mining multi-dimensional data with domain knowledge remains unsolved.
The bulk of the research in this area is centered on constrained data mining, since constraints are easily defined and interpreted by users, capturing application semantics and user expectations. Moreover, they can also be efficiently incorporated into the algorithms, to guide them and to filter both the search space and the results [Bay05].
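As an example of how a constraint can filter the search space rather than only the results, the sketch below pushes an anti-monotone budget constraint (sum of item prices ≤ budget) into a depth-first itemset enumeration: once a set violates the budget, none of its supersets is ever generated. The item names, prices and budget are hypothetical, and the routine is our own didactic simplification:

```python
def mine_with_budget(items, prices, budget):
    """Enumerate itemsets whose total price stays within budget.

    With non-negative prices, sum(prices) <= budget is anti-monotone: if an
    itemset violates it, every superset does too, so that branch is cut
    immediately instead of being checked afterwards."""
    results = []

    def extend(prefix, total, rest):
        results.append(prefix)
        for i, item in enumerate(rest):
            new_total = total + prices[item]
            if new_total <= budget:  # push the constraint into the search
                extend(prefix + [item], new_total, rest[i + 1:])
            # else: prune - no superset of prefix + [item] can satisfy the budget

    extend([], 0.0, items)
    return [r for r in results if r]  # drop the empty prefix

prices = {"laptop": 999.0, "mouse": 19.0, "keyboard": 49.0}
valid = mine_with_budget(sorted(prices), prices, budget=100.0)
# only {keyboard}, {mouse} and {keyboard, mouse} survive; laptop exceeds the budget
```

The same pruning idea underlies the constraint-pushing strategies discussed in Chapters 4 and 5, where the properties of each constraint determine how early it can be checked.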
As far as we know, there is no work that integrates constraints into multi-relational mining. This integration is not straightforward, since star schemas contain both transactional (the fact table) and non-transactional (the dimension tables) data, and existing constrained algorithms are designed only for transactional data, hence cannot be directly reused on the whole star.
1.2 Thesis Statement
In this dissertation, we argue that it is possible to efficiently and effectively find patterns in
large amounts of data modeled as a star schema, as well as to incorporate constraints into
those algorithms, to focus the mining results according to the domain knowledge and user
expectations, and therefore deliver less, but more interesting patterns.
The analysis of this thesis statement leads to some questions that need to be addressed:
• How can we mine data modeled as a star schema directly? MRDM intends to develop algorithms for mining multiple tables directly, and there are some algorithms designed for star schemas. We discuss these algorithms and their respective strategies in Chapter 3, as well as the challenges and limitations of existing work.
• What is the difference between mining large amounts of data in a star schema, and a smaller star schema? Mining large quantities of data imposes new challenges on the mining process, for both single and multiple tables, due to memory and time limitations. In order to optimize memory usage and time spent, and actually be able to mine big data, one well-known approach is to use data streaming techniques. In this work, we argue that it is possible to integrate MRDM and data streaming techniques for mining large star schemas. We describe this approach in more detail in Chapter 3.
• What does effectively mean? Existing algorithms are able to mine data modeled as a star schema directly, but they do not scale well with the number of dimensions and records in the database. In this sense, most of them may not even be able to finish the mining process when there are large amounts of data. Furthermore, denormalizing the whole schema into one table and using an efficient traditional pattern mining algorithm is not a solution, since this extra step may be infeasible for big quantities of data. Thus, effectively means that we can actually finish the discovery process on large star schemas, keeping results updated.
• What does efficiently mean? There are several well-known and efficient pattern mining algorithms designed for single tables. Mining multiple tables directly should perform better than denormalizing the tables into one and using one of those algorithms, since it skips the overhead of the extra joining step. Therefore, efficiently means that, in general, a multi-relational pattern mining approach should take less time than a join-before-mining approach.
• How can domain knowledge and user expectations be captured through constraints? There are several
forms of domain knowledge representation, with constraints being the most widely used and well-known.
In Chapter 4, we present a discussion of the use of domain knowledge, as well as a new framework
for constrained pattern mining that helps organize and understand constraints and their use.
• How can we efficiently incorporate constraints into the algorithms? There are many different types
of constraints, which hinders the definition of general algorithms. However, most constraints
share a set of properties that allows for defining efficient strategies (explained in Chapter 4). In this
work, we argue that constraints can be efficiently pushed as a post-processing step, by making
use of an efficient tree structure – the pattern-tree. We present this in more detail in Chapter 5.
• Which constraints can we push into the mining of a star schema and how? Mining a star schema
introduces new challenges to the constrained process, since star schemas contain both transactional
and non-transactional data, and therefore existing algorithms cannot be applied directly. Also, the
structure of the star schema itself encompasses more opportunities for constraining, and therefore
we define a set of constraints for a star schema, named Star Constraints, as well as a set of strategies
to incorporate them (Chapter 5).
The validation of this thesis will be carried out through four main procedures. First, by comparing
the performance of our multi-relational algorithms with their non-multi-relational counterparts, i.e. with
algorithms that need to denormalize the tables into one before mining. Second, by comparing the number
of patterns returned by constrained and unconstrained algorithms. Third, by evaluating the interest of the
discovered multi-relational patterns in two case studies with real data (Chapters 6 and 7), using a set of
different measures to select the best patterns, using them to enrich classification training data,
and analyzing whether they improve prediction accuracy. Finally, in order to test the different parameters, a
performance evaluation is carried out over synthetic data in each chapter.
1.3 Contributions
In order to demonstrate the validity of this thesis, we propose a new method, Star Frequent-Pattern
Stream (StarFP-Stream), which combines MRDM and data streaming techniques for frequent pattern
mining in large star schemas [SA12b]. StarFP-Stream does not materialize the join between the tables,
and adopts a pattern growth strategy, therefore not suffering from the candidate generation bottleneck,
unlike the only two related algorithms in the literature. It is also able to correctly aggregate the business
events, and therefore finds patterns at the right aggregation level [SA12a]. By using a strategy similar
to the one followed when mining data streams, the algorithm is able to mine both large and growing datasets
modeled as star schemas, avoiding multiple scans of the data and optimizing both memory usage
and performance. It estimates an approximate frequency of items, based on the number of times they
occur since they first appeared, and on a user-defined maximum error threshold. Only the estimated
frequent patterns are kept in an efficient prefix-tree summary structure, called pattern-tree.
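This frequency-estimation strategy is similar in spirit to the classical lossy counting scheme for data streams. The following is a minimal, single-item sketch of that scheme, written for illustration only; it is not the actual StarFP-Stream implementation, and the function name and parameters are ours:

```python
import math

def lossy_count(stream, epsilon):
    """Approximate item frequencies over a stream with maximum error epsilon * N.
    Items are counted from the bucket where they first appear; entries whose
    estimated count falls below the error bound are pruned at bucket boundaries."""
    bucket_width = math.ceil(1 / epsilon)
    counts = {}   # estimated count per item
    deltas = {}   # maximum possible undercount at insertion time
    for n, item in enumerate(stream, start=1):
        current_bucket = math.ceil(n / bucket_width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = current_bucket - 1
        if n % bucket_width == 0:  # prune infrequent entries at each boundary
            for it in list(counts):
                if counts[it] + deltas[it] <= current_bucket:
                    del counts[it], deltas[it]
    return counts
```

The `deltas` entry bounds how much an item may have been undercounted before it was first tracked, which is what guarantees that the estimated support never deviates from the true support by more than the user-defined error threshold times the number of records seen.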
Experiments show that StarFP-Stream is accurate and efficient [SA14d], and demonstrate that it
greatly outperforms its single-table predecessor in terms of time, when the latter is applied to a
joined table. In this manner, our algorithm overcomes the join-before-mining approach.
To address the second goal of this thesis, and as a starting point before addressing the multi-dimensional
case, we propose two efficient algorithms for pushing constraints into a pattern-tree as a post-processing
step: Constraint pushing into a Pattern-Tree (CoPT) [SA13a] and CoPT4Streams [SA13b], designed for
single table datasets and for single table data streams, respectively. By using the pattern-tree structure,
both algorithms are able to optimize the incorporation of any constraint, avoiding unnecessary tests and
eliminating invalid patterns earlier, according to the properties of the constraints. Experiments show
that the algorithms are efficient and effective, even for constraints with small selectivity, when compared
to a baseline approach that does not take constraint properties into account.
We then analyze in more detail the challenges and prospects of constrained multi-dimensional mining
[SA13c], and propose a set of constraints defined according to the star schema. We also analyze a set
of strategies for incorporating constraints in star schemas, based on constraint properties and on the
structure of the star itself.
Finally, we also propose an algorithm, named D2StarFP-Stream, for pushing the defined star constraints
into the discovery of patterns over large and growing star schemas. It is an extension of StarFP-Stream
that is able not only to minimize the bottlenecks of the former, by returning and keeping fewer results,
but also to focus those results on the existing domain knowledge.
Experiments show that the algorithm is memory efficient, requiring smaller summary structures and
less memory, and that it surpasses the unconstrained StarFP-Stream. They also show that it takes less
time per batch as the selectivity of the constraints increases.
The developed algorithms are also evaluated in two case studies over real data, one in the healthcare
domain [SA14c, SA14a], and another in the educational domain [SA14b].
1.4 Outline
This dissertation contains eight chapters. The first (Chapter 1) motivates this thesis and presents a summary
of its main goals and contributions. The thesis statement is also presented, along with an explanation of
each claim made.
Chapter 2 introduces the main concepts of pattern mining and presents a detailed description of the
multi-relational pattern mining problem on star schemas. In order to better understand the domain, we
formally present the concepts of a star schema and the corresponding dimensions and facts, and we define
the notation used in the rest of the dissertation.
In Chapter 3, the first algorithm, StarFP-Stream, is proposed for directly mining a large star schema
using data streaming techniques. The flow of the algorithm is illustrated with an example, and a
detailed analysis of its complexity, strengths and weaknesses is also presented. This chapter also
compares StarFP-Stream with the related work, demonstrating the importance and novelty of our
algorithm, and discusses how it can be adapted to a time-sensitive model. Finally, the chapter presents
the performance evaluation over a synthetic star schema.
Chapter 4 discusses how domain knowledge has been used in data mining. The existing forms of domain
knowledge representation are described, along with an examination of their advantages and disadvantages.
The main portion of this chapter is dedicated to the description of the proposed framework for constrained
pattern mining. Existing constrained algorithms are organized and explained based on the different types
of constraints, on their properties, and on the nature of the data sources being mined. The end of the
chapter presents the open issues in this area.
In Chapter 5, a new strategy for pushing constraints into pattern mining, through the use of a
pattern-tree, is proposed. In particular, we propose two algorithms, CoPT and CoPT4Streams, for
mining static tables and data streams, respectively. A performance study of each algorithm is presented
for constraints with different selectivities and different properties. The chapter then discusses the
difference between introducing constraints into multi-relational and into traditional pattern mining, and
presents a solution for overcoming those differences. It first describes a set of constraints defined based on
the star schema, and then proposes a set of strategies for pushing these constraints into multi-relational
pattern mining. At the end of the chapter, a new algorithm, named D2StarFP-Stream, is proposed for
incorporating the previously defined star constraints into the mining of growing multi-dimensional star
schemas, thereby fulfilling the goal of this thesis. Some experimental results are also presented here, for
constraints with different selectivities.
Chapters 6 and 7 present the results obtained in two case studies using real data, the first in the
healthcare domain and the second in the educational domain. In both studies, we first show the performance
evaluation of our algorithms, and then we evaluate the quality of the discovered multi-relational
patterns by using them to enrich classification data and examining whether they improve predictions.
This dissertation concludes in Chapter 8, where a summary of this thesis and of the results achieved is
presented. Moreover, some guidelines for future research are suggested.
Chapter 2
Finding Patterns on Star Schemas:
An Introduction
The rapid development of the Internet and the evolution of technologies made companies realize that they
can benefit from them to improve their businesses and gain competitive advantage. To cope with the rapid
growth of data, everywhere and in a great variety of fields, the area of Data Mining (DM) emerged with
the goal of creating methods and tools capable of analyzing these data and extracting useful information
that companies can exploit and apply to their businesses.
Data mining [FPSM92] is formally defined as the nontrivial extraction of implicit, previously unknown,
and potentially useful information from data. We can say that data mining is a set of techniques that help
obtain appropriate, accurate and useful information automatically, which we cannot find with standard
query tools and statistical analysis. Fundamentally, traditional data mining is the analysis of a table
of data, i.e. a set of instances described by a fixed set of attributes, for the construction of a model
to explain these data. The discovered model is then evaluated and confronted with the expectations
of the user, essentially measuring the model’s capability to explain both data already known and data
as yet unknown.
Association rules (AR) [AIS93] were first introduced in 1993 and correspond to an important data
mining paradigm that helps to discover patterns that conceptually represent causality among discrete
entities (or items) [ZO98]. Given a set of transactions, where each transaction is a set of objects (called
items), an association rule is an expression of the form X ⇒ Y , where X and Y are sets of items (called
itemsets) [Sri96]. The intuitive meaning of such a rule is that database records which contain X tend to
contain Y , with a certain probability.
In order to find these trends, there is first the need to find the items and sets of items that co-occur most
frequently; only then are association rules built. These frequent occurrences are called patterns, and
finding patterns, as shown in many studies (e.g. [AS94]), is significantly more costly in terms of time
than the rule generation step [Pei02]. In this sense, the bulk of existing work in this area is centered
on the task of finding frequent patterns in data, a task known as Pattern Mining (PM) or Frequent Itemset
Mining (FIM). There have been great advances in PM, and it now allows for the discovery of several types
of relations besides association rules [AS94, PH02], such as correlations [BMS97], sequences [PHW07],
multi-dimensional patterns [D03, SA10], episodes and emerging patterns [MTIV97, DL99], etc.
Despite these advances, the new era of big data brought new challenges and requirements to existing
techniques. Nowadays, we have unbounded quantities of the most diversified data, in many different
domains, and there is a great and increasing need for tools that are able to efficiently integrate and
analyze these data for decision support.
In fact, the data storage paradigm has changed in the last decade, from operational databases to data
repositories that make it easier to analyze data and to find useful information. Data warehouses (DW)
are an example of such repositories, which clearly separate the representation of business dimensions and
events into a set of different, but related, tables [Inm96].
Multi-Relational Data Mining (MRDM) [D03] is an area that aims at the discovery of patterns that
involve multiple tables, in their original structure, i.e. without joining all the tables before mining. In
recent years, the most common mining techniques have been extended to the multi-relational context,
but few are dedicated to the multi-dimensional model most common in DW, the star schema [SA10,
FCAM09, HYXW09], and they are often not scalable. Therefore, finding efficient and effective ways of
dealing with this kind of data is still a challenge.
In this chapter, we first define in detail the multi-dimensional model – the star schema – that is the
main object of this thesis (Section 2.1). Then, we present the problem statement for multi-relational
pattern mining over star schemas (Section 2.2), as well as the challenges introduced by this domain and
the existing related work (Section 2.3).
2.1 The Core of the Multi-Dimensional Model: a Star Schema
A star schema is a multidimensional model that models data as a set of facts, each describing an event or
occurrence, characterized by a particular combination of dimensions and a set of measures. An example
of a star schema can be seen in Fig. 2.1. It contains four dimension tables: Product, Date, Customer and
Sales Territory, and one fact table, registering some sales.
Figure 2.1: Star Internet Sales.
To help understand the definitions and the flow of the algorithm, we describe a simplified example
of the contents of a database following the star schema in Fig. 2.1. Tables 2.1 and 2.2 present the contents
of the dimensions and of the fact table, respectively (dimension Date is omitted here, since it can be
inferred from the key).
In the context of a database, a table contains one or more descriptive fields, called attributes, and each
row consists of a set of values for those attributes. A table can therefore be seen as a simple set of
(attribute, value) pairs, corresponding to the characteristics of the data under analysis. As dimensions
provide the context for facts, they should also contain a single primary key, which can be used as a
foreign key in the fact table.
Table 2.1: Dimension Tables Product, Customer and Sales Territory.
Product
  ProductKey   Name            Category    Color
  p1           Mountain Bike   Bike        Black
  p2           Road Bike       Bike        Red
  p3           Bike Shorts     Clothes     Multi
  p4           Gloves          Utilities   Black
  p5           Mountain Seat   Seats
  p6           Road Seat       Seats
  p7           Mountain Tire   Tires
  p8           Road Tire       Tires

Customer
  CustomerKey   Status   Gender
  c1            M        F
  c2            S        M
  c3            S        M
  c4            M        M
  c5            M        M
  c6            S        F

Sales Territory
  TerritoryKey   Country   Group
  s1             USA       America
  s2             Canada    America
  s3             UK        Europe
  s4             France    Europe
Table 2.2: Internet Sales Orders Fact Table
  Sales Order   DateKey    ProductKey   CustomerKey   TerritoryKey
  1             20040510   p1           c1            s1
  2             20040821   p6           c2            s3
  2             20040821   p8           c2            s3
  3             20040907   p3           c3            s1
  4             20050803   p2           c4            s4
  4             20050803   p6           c4            s4
  4             20050803   p8           c4            s4
  5             20060213   p5           c5            s3
  5             20060213   p7           c5            s3
  6             20060217   p5           c1            s1
  6             20060217   p7           c1            s1
  7             20060509   p2           c5            s4
  7             20060509   p3           c5            s4
  8             20060515   p5           c3            s1
  8             20060515   p7           c3            s1
  9             20060527   p1           c6            s2
  9             20060527   p4           c6            s2
  9             20060527   p5           c6            s2
  10            20060930   p5           c2            s1
  10            20060930   p7           c2            s1
Definition 1. A Dimension table D is a set of tuples (tid_D, X), with tid_D the primary key (also referred
to as transaction id) and X a set of (attribute, value) pairs.

Definition 2. A Fact table FT is composed of a set of tuples with n foreign keys, connecting it to the
n dimensions that provide context to its records: (tid_D1, tid_D2, ..., tid_Dn). A fact table may also
contain one or more numerical measurement fields, called facts or measures, that quantify some property
of the events.
As can be seen in our example, (p1, Name=Mountain Bike, Category=Bike, Color=Black) is a tuple
of dimension Product, which is referenced in the fact table in sales orders 1 and 9.
Definition 3. A Star schema S is a tuple (D1, D2, ..., Dn, FT), composed of one fact table and the
corresponding n dimension tables.
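Definitions 1–3 can be rendered directly as data structures. The sketch below is our own illustrative encoding of a fragment of Tables 2.1 and 2.2: each dimension maps a primary key to its set of (attribute, value) pairs, and the fact table is a list of foreign-key tuples.

```python
# Dimension tables: tid_D -> set of (attribute, value) pairs (Definition 1).
product = {
    "p1": {("Name", "Mountain Bike"), ("Category", "Bike"), ("Color", "Black")},
    "p4": {("Name", "Gloves"), ("Category", "Utilities"), ("Color", "Black")},
}
customer = {
    "c1": {("Status", "M"), ("Gender", "F")},
    "c6": {("Status", "S"), ("Gender", "F")},
}

# Fact table: tuples of foreign keys, one per dimension (Definition 2).
# Here each row is (SalesOrder, ProductKey, CustomerKey).
fact_table = [
    (1, "p1", "c1"),
    (9, "p1", "c6"),
    (9, "p4", "c6"),
]

# The star schema is the tuple of all dimensions plus the fact table (Definition 3).
star = (product, customer, fact_table)
```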
For simplicity, and since most pattern mining techniques do not deal with numerical values, we do
not consider measures in this work. Nevertheless, they can be included and treated like other attributes,
by first transforming measures into categorical values (e.g. partitioning into ranges [RS98]), and then
considering them as an additional dimension, as usually done in OLAP (OnLine Analytical Processing).
While dimension tables contain the characteristics of the business entities, like products and clients,
usually unchanged or slowly changing, the fact table records the events of the business, like the sales,
which are characterized by some combination of the attributes in dimensions (the context). The way to
understand the fact table depends on the meaning of a business event (or business fact).
In general, each row of the fact table corresponds to one business event (e.g. one sale per row).
However, it is common to have a control number, such as an order number, that allows us to group
the rows that were generated as part of the same business event (e.g. one or more rows per sale).
These control identifiers are usually represented as degenerate (or empty) dimensions, containing only
a primary key (the control number or id) and no descriptive attributes. For this reason, they usually
do not have an associated physical table; instead, the id is put directly into the fact table (e.g. the sales
order number in Fig. 2.1). In the presence of a degenerate dimension, this key/id can act as a primary key
of the fact table, since these keys separate the different business events (rows with the same degenerate
key correspond to the same event). In our example, in the first order, product p1 was bought alone, but
in the second order, p6 and p8 were bought together. Moreover, we have 10 orders, and therefore 10
business facts.
Note that a degenerate key can be seen as an aggregation key, since it indicates which facts should be
aggregated in order to form an event. Similarly, we can think of aggregating the facts by any other key
(or combination of keys). For example, if we consider ProductKey as the aggregation key, we combine
all sales of the same product, and can therefore find the common characteristics and behaviors of
those products’ buyers (product profiles). We could also consider the pair (CustomerKey, DateKey) as
the aggregation key, and find, e.g., which types of products are being bought by particular customers
each season (customer seasonal profiles).
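The idea of an aggregation key can be sketched as follows, over a fragment of the fact table in Table 2.2 (the helper `aggregate` and its signature are ours, for illustration only):

```python
from collections import defaultdict

# Fact rows: (SalesOrder, ProductKey, CustomerKey) — a subset of Table 2.2.
facts = [
    (1, "p1", "c1"), (2, "p6", "c2"), (2, "p8", "c2"),
    (6, "p5", "c1"), (6, "p7", "c1"),
]

def aggregate(facts, key_index):
    """Group fact rows by the chosen aggregation key (see Section 2.1)."""
    groups = defaultdict(list)
    for row in facts:
        groups[row[key_index]].append(row)
    return dict(groups)

# Aggregating by the degenerate key (SalesOrder) recovers the business events:
events = aggregate(facts, 0)
# Aggregating by ProductKey instead combines all sales of the same product:
by_product = aggregate(facts, 1)
```

Grouping by SalesOrder yields one group per business event (e.g. order 2 with two rows), whereas grouping by ProductKey supports the product-profile analysis described above.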
2.2 Problem Statement
Frequent pattern mining aims at enumerating all frequent patterns that conceptually represent relations
among discrete entities (or items). Depending on the complexity of these relations, different types of
patterns arise, with the transactional patterns being the most common. A transactional pattern is just
a set of items that occur together frequently. A well-known example is a market-basket, the set of items
that are bought in the same transaction by a significant number of customers.
In this context:
Definition 4. An item i corresponds to one pair (attribute, value). An itemset X is a set of items.
Itemsets can be:
• intra-dimensional – if all items belong to the same dimension;
• inter-dimensional – if items belong to more than one dimension.
An example of an intra-dimensional itemset is (Country=UK, Group=Europe), i.e. the European
country UK. On the other hand, the itemsets (Color=Red, Gender=F), i.e. red products bought by
female customers, and (Semester=2, Category=Seats, Country=UK), i.e. seats bought by clients from
the UK in the second semester, are examples of inter-dimensional itemsets.
Events in the fact table can also be aggregated according to some entity or entities, so that we can
discover frequent behaviors or profiles. For example, if we aggregate the facts in the star Internet Sales
per customer, we can discover sets of products bought together, by particular customers. In this sense,
itemsets can also be:
• Aggregated – if they result from the aggregation of events of the fact table, i.e. if they contain
combinations of items with the same attribute.
An example of an aggregated itemset is (Category=Tires, Category=Seats), i.e. tires bought together
with seats. Note that aggregated itemsets are a special case of intra-dimensional ones: items belong to
the same dimension, and may also belong to the same attribute of that dimension.
Let I_D be the set of all items of dimension D, and let I = ⋃_{j=1}^{n} I_Dj = {i1, i2, ..., im} be the
set of all items. We assume that all items are unique (e.g. by adding the name of the dimension before
the attribute name).
The support of an itemset is defined as the number of its occurrences in the database. In the case of a
star schema, we have to consider that the number of occurrences of one item in some dimension depends
on the number of occurrences of the corresponding transactions in the fact table.
So, for an intra-dimensional itemset X of dimension D, let us define T_D(X) as the set of all primary
keys of transactions in D that contain X. The support of X is the number of different business facts that
contain each of those keys. Let getFacts : T_D(X) → T_FT define a function that gives the business facts
that contain all the keys in T_D(X). Hence, sup(X) = |⋃_{t ∈ T_D(X)} getFacts(t)|. Following our example,
X = {Color=Black} is an intra-dimensional itemset of table Product that corresponds to both products
p1 and p4 (T_D(X) = {p1, p4}). Therefore, sup(X) = |getFacts(p1) ∪ getFacts(p4)|. Since p1 appears in
orders 1 and 9, and p4 in order 9, we can conclude that sup(X) = |{1, 9}| = 2. Note that keys only count
once per business fact, i.e. although, e.g., client c2 appears twice in order 2, this corresponds to just one
order, and therefore it counts once.
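This computation can be sketched in a few lines over the fact table of Table 2.2, restricted to the ProductKey column (`get_facts` is our illustrative stand-in for the getFacts function):

```python
# Fact rows as (SalesOrder, ProductKey) pairs — a subset of Table 2.2.
fact_table = [
    (1, "p1"), (2, "p6"), (2, "p8"), (3, "p3"),
    (4, "p2"), (4, "p6"), (4, "p8"),
    (9, "p1"), (9, "p4"), (9, "p5"),
]

def get_facts(key):
    """Business facts (order numbers) whose rows contain the given key."""
    return {order for order, pk in fact_table if pk == key}

# X = {Color=Black} corresponds to products p1 and p4: T_D(X) = {p1, p4}.
t_d = {"p1", "p4"}
support = len(set().union(*(get_facts(t) for t in t_d)))
# p1 appears in orders 1 and 9, and p4 in order 9, so sup(X) = |{1, 9}| = 2.
```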
Inter-dimensional itemsets contain items from multiple dimensions, so they can be defined as
X = ⋃_{j=1}^{n} X_Dj, where X_Dj ⊆ I_Dj, i.e. X is the union of n intra-dimensional itemsets. Note
that, using this definition, X is an intra-dimensional itemset if all X_Dj are empty, except one. Thus,
T_Dj(X_Dj) (or T_Dj, for short) is the set of all primary keys of transactions in Dj that contain X_Dj.
X occurs if all X_Dj occur, which means that some key from every T_Dj must occur.

Definition 5. The support of an itemset X is the number of different business facts that contain at least
one key from each T_Dj:

sup(X) = |⋃_{T ∈ ⊗_{j=1}^{n} T_Dj} getFacts(T)|

In the equation, ⊗ T_Dj gives all combinations composed of one key from each T_Dj, i.e. from each
dimension. For example, in X = {Color=Black, Gender=F}, the first item comes from dimension
Product, and appears twice: T_Product = {p1, p4}; the other item comes from dimension Customer,
and T_Customer = {c1, c6}. For X to occur, some of those products must have been bought by
some of those clients, therefore sup(X) = |getFacts({p1, c1}) ∪ getFacts({p1, c6}) ∪ getFacts({p4, c1}) ∪
getFacts({p4, c6})| = |{1} ∪ {9} ∪ ∅ ∪ {9}| = 2. In this example, two female clients bought black products.
Let N be the total number of business facts in S.
Definition 6. A pattern is a frequent itemset, i.e. an itemset whose support is greater than or equal to
a user-defined minimum support threshold, σ ∈ ]0, 1]:

X is a pattern if sup(X) ≥ σ × N
Naturally, patterns can also be intra-dimensional, inter-dimensional and aggregated.
The problem of multi-relational frequent pattern mining over star schemas is to find all patterns in a
star S.
Since a star schema is a particular case of a relational database, hereinafter we refer to our problem as
multi-dimensional pattern mining (an equivalent of multi-relational pattern mining over star schemas),
whose goal is to find all multi-dimensional patterns in a star.
2.3 Challenges and Related Work
In order to deal with multiple tables, pattern mining somehow has to join the different tables, creating
the tuples to be mined. One option, which allows for the use of existing single-table algorithms, is to
join all the tables into one before mining (a step also known as propositionalization or denormalization).
At first glance, it may seem easy to join the tables into one and then run the mining process on the
joined result [NFW02]. However, when multiple tables are joined, the resulting table is much larger
and sparser, with an explosion of attributes, value repetitions and null values, making the mining process
more expensive and time-consuming.
Denormalizing the star in Fig. 2.1 would result, for example, in one table with almost 20 columns
(the SalesOrderNumber, plus all attributes of all dimensions and all measures) and as many rows as the
fact table. In that table, each row of each dimension is replicated as many times as the corresponding
keys appear in the fact table.
There are two major problems. First, in large applications, the join of all related tables often cannot
realistically be computed, because of the distributed nature of the data: large dimension tables and
many-to-many relationships blow up the join. Second, even if the join can be computed, the multifold
increase, in both size and dimensionality, presents a huge overhead to the already expensive pattern
mining process:
1. The number of columns will be close to the sum of the number of columns in the individual tables,
or much more if there are degenerate dimensions (since in this case the fact table has several rows
for the same event, and therefore all attributes of all records must be associated with that event in
the denormalized table);
2. If the join result is stored on disk, the I/O cost will increase significantly due to the multiple scanning
steps in data mining;
3. For mining frequent itemsets of small sizes, a large portion of the I/O cost is wasted on reading
full records containing irrelevant dimensions;
4. The joined table will eventually have many repetitions of the same values. With several tables, a
value stored (once) in some table can simply be referenced several times, at low memory cost; this
is not possible with just one table. Moreover, these repetitions of values may distort the computation
of the measures of interest, and therefore hinder the discovery of really interesting patterns;
5. There will also be many missing/null values, since each entity may have a different number of
associated records (for example, the products sold in each transaction).
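The blow-up can be illustrated with a toy sketch over a fragment of the running example (names and layout are ours): joining the dimensions into the fact table replicates every dimension row once per referencing fact row, carries every column with every record, and turns missing attributes into nulls.

```python
# Toy dimension tables: key -> attribute values (Color of p5 is missing).
product = {"p1": ("Mountain Bike", "Bike", "Black"),
           "p5": ("Mountain Seat", "Seats", None)}
customer = {"c1": ("M", "F"), "c6": ("S", "F")}

# Fact rows: (SalesOrder, ProductKey, CustomerKey).
fact_table = [
    (1, "p1", "c1"), (6, "p5", "c1"), (9, "p1", "c6"), (9, "p5", "c6"),
]

# Join-before-mining: every dimension row is copied into every fact row
# that references it, so the column count approaches the sum over tables.
denormalized = [
    (order,) + product[pk] + customer[ck]
    for order, pk, ck in fact_table
]
# c1's attributes now appear twice, p5's null Color is replicated, and every
# record carries all columns — the repetitions and nulls described above.
```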
Research in this area has shown that methods following the mine-before-join philosophy usually
outperform methods following the join-before-mine approach, even when the latter adopts the fastest
known single-table algorithms [NFW02].
One of the great potential benefits of MRDM is the ability to automate this process to a significant
extent. Fulfilling this potential requires solving the significant efficiency problems that arise when at-
tempting to mine directly from a relational database, as opposed to from a single pre-extracted flat file
[Dom03].
In recent years, the most common types of patterns and approaches considered in data mining have
been extended to the multi-relational case and have been successfully applied to a number of different
problems in a variety of areas [DR97, NK01, D03, RV04, Kan05]. However, just a few are able to deal
with star schemas directly [CJS00, NFW02, XX06, SA10].
Historically there have been two major approaches to research in artificial intelligence: one based on
logic representations, and one focused on statistical ones. While the first is able to deal better with the
complexity of the real world, the second is better when dealing with uncertainty [DKP+06a]. In fact, the
most common approach for pattern mining is based on statistics.
Even so, the first multi-relational methods were developed following the logical approach, in particular
by the Inductive Logic Programming (ILP) community, about ten years ago, with WARMR [DR97],
SPADA [MEL01] and FARMER [NK01] being the most representative ones. As stated by those authors,
ILP approaches achieve a good accuracy in data analysis, but they are usually not scalable with respect
to the number of relations and attributes in the database. Therefore, they are inefficient for databases
with large schemas. Nevertheless, there has been an effort to minimize this bottleneck by making use
of optimization techniques like parallelization and distribution (see [FSC05] for a detailed survey), and
also sampling [ACTM11]. Another drawback of ILP approaches, despite their powerful representation
capabilities (which go beyond our star schema), and a reason for their not being widely used, is that they
require all data in a declarative language, such as Prolog. Fortunately, there are already some tools that
automatically translate a relational database into these representations, easing the use of these
algorithms. In this work, however, we opted for a statistical approach.
Few approaches were designed for frequent pattern mining over star schemas:
An apriori-based algorithm was introduced by Jensen and Soparkar (2000) [CJS00]: it first generates
frequent tuples in each single table using a slightly modified version of Apriori [AS94], and then looks
for frequent tuples whose items belong to different tables via a multi-dimensional count array. It does
not construct the whole joined table, and processes each row as it is formed, thus avoiding the storage
cost of the joined table. Cristofor and Simovici (2001) [CS01] eliminated the explosion of candidates
present in Jensen’s algorithm, and are also able to produce the local patterns existing among
attributes of the same table, i.e. patterns that are frequent with respect to their dimension table, but
not with respect to the relationship (or fact) table.
Ng et al. (2002) [NFW02] proposed an efficient algorithm that first mines each table separately, and
then two tables at a time, to find patterns from multiple tables. The idea is to perform local mining on
each dimension table, and then “bind” two dimension tables at each iteration, i.e. mine all frequent
itemsets with items from two different tables without joining them. After binding, those two tables are
virtually combined into one, which is then bound to the next dimension table.
Xu and Xie (2006) [XX06] presented a novel algorithm, MultiClose, which first converts all dimension
tables to a vertical data format, and then mines each of them locally with a closed-pattern algorithm. The
patterns are stored in two-level hash trees, which are then traversed in pairs to find multi-table patterns.
StarFP-Growth, proposed by Silva and Antunes [SA10], is a pattern-growth method based on FP-Growth
[HPY00]. Its main idea is to construct a tree for each dimension (named DimFP-Tree), according
to the global support of its items (i.e. depending on the number of times the corresponding keys appear in
the fact table). Then, the algorithm builds a global FP-Tree structure, named Super FP-Tree, combining
the branches of the DimFP-Trees according to the facts. All the multi-dimensional patterns are then
retrieved by traversing this tree using FP-Growth.
There are other algorithms for finding multi-relational frequent itemsets [RV04, Kan05]; however, they
consider just one common attribute at a time, and the patterns discovered by those methods will not
reflect the co-occurrences among dimensions in a star schema.
Chapter 3
Finding Patterns on Large Star
Schemas
The ability to mine complex data has been recognized as one of the goals for the future of data mining
[YW06], and dealing with multi-relational, large and growing data has received increasing attention in
recent years, with deep advances in mining data streams.
Data Warehouses (DW) meet both these lines of research since, apart from having multiple inter-
related tables, records are added over time but never deleted. Dimension tables are usually large, but
not too large, and slowly changing compared to fact tables. In some manner, a DW can be compared to
a data stream, in the sense that it grows continuously over time, as new records are added to the fact
table for each event occurrence.
Indeed, due to their growing nature, in order to mine DW efficiently, we propose to adopt a strategy
similar to the one followed when mining data streams: avoid multiple scans of the dataset, optimize
memory usage, and use a small constant time per record [LLH11].
This brings new challenges to both MRDM and data stream mining, since existing algorithms for
mining multiple relations are usually neither scalable in the number of records nor able to deal with
new data as it arrives, while existing algorithms for mining data streams are designed for a single data
table [GHP+03, LLH11].
To the best of our knowledge, there are only two algorithms able to mine multiple data stream
tables; both are based on ILP, and therefore suffer from the same limitations as other ILP techniques.
They also suffer from the candidate generation bottleneck, well known in traditional pattern mining.
In this chapter, we propose a method, Star Frequent-Patterns on Streams (StarFP-Stream), that
combines MRDM and data streaming techniques for frequent pattern mining in large star schemas.
StarFP-Stream does not materialize the join between the tables and adopts a pattern-growth strategy
[HPY00], therefore avoiding the candidate generation bottleneck that affects the only two related
algorithms in the literature [FCAM09, HYXW09]. By using a strategy similar to the one followed when
mining data streams, the algorithm is able to mine both large and growing star schemas, avoiding
multiple scans of the data and optimizing memory usage and time spent.
StarFP-Stream was first proposed in [SA12b] and then updated in [SA12a] to correctly aggregate
the business events in the presence of degenerate dimensions, and therefore to find patterns at the
right aggregation level. In this work, we present an overview of StarFP-Stream: we describe the
algorithm in detail and illustrate it with the example of Chapter 2, extracted from a star schema used in
the experiments. We also present an analysis of our algorithm’s complexity, strengths and weaknesses,
and a comparison with the related work.
In Section 3.1 we formally define the problem of finding frequent itemsets over growing star schemas.
Section 3.2 reviews existing work on multi-relational data streams, and the proposed method is described,
exemplified and analyzed in Section 3.3. The performance of our algorithm is evaluated over a DW, and
results are presented in Section 3.4. Finally, Section 3.5 concludes the chapter with a discussion and
some open issues.
3.1 Problem Statement
The definitions presented in Chapter 2 consider that the star is static and that the database is mined all
together. In order to consider growing star schemas, we need to extend those definitions.
Let us now assume that the tables are data streams, where new business facts continuously arrive.
We now have what we call a star stream, but only the fact table needs to be treated as an actual stream
(the fact stream)1.
Definition 7. A fact stream FS = B1 ∪ B2 ∪ ... ∪ Bk is a sequence of batches, where Bk is the current
batch, B1 the oldest one, and each batch is a set of business facts. Let N be the current length of the
stream, i.e. the number of business facts seen so far.
Note that if the fact table is not a stream, we can still treat it like one, by dividing it into batches.
Following our example in Chapter 2, we can consider that the fact stream (Table 2.2) is composed of two
batches – B1 with the first 5 business facts (gray rows), and B2 with the other 5 (white rows).
As it is unrealistic to hold all streaming data in the limited main memory, data streaming algorithms
have to sacrifice the correctness of their results by allowing some items and itemsets to be discarded. This
means that the support calculated for each item is an approximate value (denoted by sup′), instead of
the real value. These counting errors should be as small as possible, while still allowing an effective
use of memory by discarding very infrequent items.
In data streaming algorithms, these errors are bounded by a user defined maximum error threshold,
ε ∈ [0, 1[, such that ε ≪ σ, i.e. it is much lower than the minimum support threshold. Therefore, the
difference between the real and approximated support should be at most εN .
Definition 8. An approximate pattern is an itemset whose estimated support is greater than or equal
to the minimum support threshold minus the maximum error allowed.
X is an approximate pattern if sup′(X) ≥ (σ − ε)×N
Given σ and ε, the problem of multi-relational frequent pattern mining over star streams consists of
finding all approximate patterns in the star S.
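Definition 8 reduces to a simple threshold test on the estimated support. As a minimal illustration (the function name and parameters below are ours, not from the text):

```python
def is_approximate_pattern(estimated_support, sigma, epsilon, n):
    """Definition 8: X is an approximate pattern if sup'(X) >= (sigma - epsilon) * N."""
    return estimated_support >= (sigma - epsilon) * n

# With sigma = 0.5, epsilon = 0.2 and N = 10 facts seen so far,
# the effective threshold is (0.5 - 0.2) * 10 = 3 occurrences.
assert is_approximate_pattern(3, 0.5, 0.2, 10)
assert not is_approximate_pattern(2, 0.5, 0.2, 10)
```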
3.2 Related Work
Although there exist some algorithms that are able to find multi-dimensional patterns in star schemas
(Section 2.3), they are often not scalable. Indeed, due to the growing nature of data warehouses, it is
necessary to adopt a strategy similar to the one followed when mining data streams. However, most
existing algorithms for data streams are designed for a single table [GHP+03, LLH11].
A data stream is an ordered sequence of instances that are constantly being generated and collected.
The nature of these streaming data makes the mining process different from traditional data mining in
several aspects:
1Dimensions can also be streams. But since, for a foreign key to appear in the fact table, it must already be created and populated in the corresponding dimension, only the fact table needs to be considered a stream.
1. Each element should be examined at most once and as fast as possible;
2. Memory usage should be limited, even though new data elements are continuously arriving;
3. The generated results should always be available and up to date;
4. Frequency errors on results should be as small as possible.
This implies the creation and maintenance of a memory-resident summary data structure that stores
only the information strictly necessary to avoid losing patterns [LLH11]. Hence, data stream mining
algorithms have to sacrifice the correctness of their results by allowing some counting errors.
Existing approaches can be deterministic or probabilistic: deterministic if they only allow an error in
the frequency counts, but guarantee that all real frequent patterns are returned (i.e. there are no false
negatives); and probabilistic if, besides an error, they also allow a probability of failure, i.e. there is a
probability that some real patterns are not returned (there might be false negatives). In this work, we
decided to focus on deterministic algorithms, so that we can have the guarantee that we do not miss any
real pattern.
The first proposed algorithm was Lossy Counting [MM02]. It divides the data stream into batches and
maintains frequent items in a set summary structure along with their estimated frequency and maximum
error. The algorithm guarantees that: (1) all itemsets whose true frequency exceeds σN are reported
(there are no false negatives); (2) frequencies are underestimated by at most εN ; and (3) false positives
have a true frequency of at least (σ − ε)N .
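To make these guarantees concrete, the core of Lossy Counting can be sketched for single items as follows. This is a simplified, illustrative version (the original algorithm also handles itemsets), and all names are ours:

```python
def lossy_counting(stream, epsilon):
    """Maintain (count, max_error) per item; prune at batch boundaries.

    Guarantees: counts are underestimated by at most epsilon * N, and
    no item with true frequency above epsilon * N is permanently lost."""
    width = round(1 / epsilon)               # batch (bucket) width
    counts = {}                              # item -> [count, max_error]
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]   # could have been missed before
        if n % width == 0:                   # batch boundary: prune
            for it in [i for i, (f, d) in counts.items() if f + d <= bucket]:
                del counts[it]
            bucket += 1
    return counts

def frequent(counts, sigma, epsilon, n):
    """Report items that may reach frequency sigma * N (no false negatives)."""
    return {i for i, (f, d) in counts.items() if f >= (sigma - epsilon) * n}
```

At every batch boundary, entries whose estimated count plus maximum error do not exceed the current batch id are dropped, which is exactly what bounds the undercount by εN.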
Giannella et al. [GHP+03] presented a novel algorithm, called FP-Streaming, that adapts FP-
Growth [HPY00] to mine frequent itemsets in time-sensitive data streams and gives the same guarantees
as Lossy Counting. They make use of the FP-tree structure and its compression properties to maintain
time-sensitive frequency information about patterns. The stream is divided into batches and a tree
structure (called FP-stream) is updated at every batch boundary. Each node in this tree represents a
pattern (from the root to the node) and its frequency is stored in the node, in the form of a tilted-time
window table, which keeps frequencies for several time intervals. The tilted-time windows give a
logarithmic overview of the frequency history of each pattern, allowing the algorithm to address queries
that request frequent itemsets over arbitrary time intervals, rather than only over the entire stream
(called a landmark model). It can also be used in this latter case, without temporal information and
with the same guarantees, by storing only one frequency in each node instead of a time table (let us
call this simpler version SimpleFP-Stream).
Some other algorithms were proposed to mine frequent itemsets in data streams (see [LLH11] for a
more exhaustive survey), but most of them are adaptations of the strategies applied in the algorithms
above.
3.2.1 MRPM over Data Streams
To the best of our knowledge, there are only two works on multi-relational frequent pattern mining over
data streams. Both are based on ILP; hence, to deal with multi-relational databases, these algorithms
need all data in Prolog form: a set of predicates over variables and constants representing the relations
and attributes in the database. Both consider that there exists one relation (i.e. one table) that
represents the target relation, which is the main subject or unit of analysis. The patterns found
represent the frequent relations between the other tables/attributes and the target.
In Fumarola et al. [FCAM09], SWARM, a Sliding Window Algorithm for Relational Pattern Mining
over data streams, is proposed. SWARM is a deterministic approach for data streams and is based on
a sliding window model, i.e. the stream is divided into a set of batches (or slides) from which a window
with the most recent ones is kept. The idea is to find all patterns in each slide by building an SE-tree
(Set Enumeration tree). This tree starts with the target relation as the root, and nodes are iteratively
expanded, by adding the predicates that have some variable in common (candidate generation), and then
evaluated (support check). A global SE-tree is used to keep the patterns for the window. It stores a
sliding vector in each node with the support of the respective patterns for each slide of the window, so
that when a new slide arrives, the support vector is shifted to remove the expired support and the tree is
pruned to eliminate unknown patterns.
Hou et al. [HYXW09] presented RFPS (Relational Frequent Patterns in Streams), a probabilistic
approach based on period sampling, for finding relational patterns over a sliding time window of a
relational data stream. Since it is based on WARMR [DR97], it needs the database in prolog form.
RFPS is an apriori-based algorithm [AS94] that first generates and tests candidates with the help of a
Patterns Joint Tree (with the possible refinements of predicates), and then maintains frequent patterns
in a virtual stream tree, based on a periodical sampling probability.
After presenting our algorithm, we compare it with these two approaches in more detail, and show
where StarFP-Stream differs and why it is useful.
3.3 StarFP-Stream
StarFP-Stream is a MRDM algorithm that is able to find approximate frequent dimensional patterns
in large databases following a star schema. It is able to deal with degenerate dimensions and to
aggregate the rows of the fact table into business facts, making it possible to mine the star at the right
business level. At the same time, it is also an algorithm for mining multiple dimensional data streams,
and hence able to mine growing star schemas.
In this work, we will assume a landmark model, i.e. that patterns are measured from the start of
the stream up to the current moment. We discuss later how StarFP-Stream can be adapted to a
time-sensitive model.
StarFP-Stream combines the multi-dimensional strategies of StarFP-Growth [SA10] with the data
streaming strategies of FP-Streaming [GHP+03], and guarantees that all real frequent itemsets are re-
turned. It does not materialize the join of the tables, making use of the star properties, and it processes
one batch of data at a time, maintaining and updating frequent itemsets in a pattern-tree structure.
As required of data streaming algorithms, StarFP-Stream can be asked, at any point in time, to
produce a list of the current frequent itemsets along with their estimated frequencies.
3.3.1 Rationale behind the star stream
As noted above, in a star stream, only the fact table needs to be treated as a stream (denoted as the fact
stream), since when a new fact arrives, the corresponding occurring transactions must have already been
added to the corresponding dimensions.
In this work, we follow the philosophy of the streaming algorithm Lossy Counting [MM02] for the
division of data into batches and for guaranteeing the maximum error. We explain these ideas in detail
below.
This fact stream is conceptually divided into k batches of ⌈1/ε⌉ business facts each, so that the
batch id (1..k) directly reflects the maximum error threshold, i.e. k = εN, with N = k|B| the number of
facts seen so far. This means that, to be frequent, an item must appear more than k times (the
equivalent of once per batch).
All items that appear more than σN times are frequent with respect to the entire stream. Items
that appear fewer than σN times but more than k times are possibly frequent and have to be
maintained, since they may become frequent later.
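A quick numeric check of this relation, using the parameters of the running example (ε = 0.2, hence |B| = 5):

```python
epsilon = 0.2
batch_size = round(1 / epsilon)    # |B| = 5 business facts per batch
assert batch_size == 5

# After k complete batches, N = k * |B| facts have been seen,
# and the batch id k equals the error bound epsilon * N.
for k in (1, 2, 3):
    n = k * batch_size
    assert abs(k - epsilon * n) < 1e-9   # k = eps * N: "once per batch"
```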
Lemma 1. Items that appear only k times (or fewer) are infrequent and can be discarded, because even
if they reappear in later batches and become frequent, the loss of support will not significantly affect
the calculated support, i.e. the difference between the estimated and real frequencies is at most εN.
Proof. Consider an itemset I that first occurs in batch Bj. Let f denote the real frequency of I and f′
its estimated frequency after the current batch Bi (with j ≤ i ≤ k), and let ∆ = j − 1 be the maximum
error of I (i.e. the number of times it could have appeared and been ignored before Bj). Itemsets that
are frequent since the first batch have ∆ = 0 and f′ = f; otherwise, they can have been discarded in
the first ∆ batches. Therefore, f ≤ f′ + ∆. And since ∆ ≤ i − 1 ≤ εN, we can state that
f ≤ f′ + ∆ ≤ f′ + εN.
Lemma 2. All patterns with f′ + ∆ ≥ σN are returned; therefore: (1) there are no false negatives, i.e.
all real patterns are returned; and (2) all false positives returned are guaranteed to have a support
above (σ − ε)N.
Proof. Real patterns have true frequency f ≥ σN. Since f′ + ∆ ≥ f, returning all patterns with
f′ + ∆ ≥ σN will include all real patterns. Similarly, since f′ + ∆ ≤ f + εN, then for all returned
patterns, f + εN ≥ σN ⇔ f ≥ (σ − ε)N, i.e. the frequencies of the returned patterns are off σ by at
most ε.
This lemma guarantees that the recall, i.e. the percentage of real patterns that are returned, is 100%
(no false negatives). However, the precision, i.e. the percentage of returned patterns that are real, is
normally below 100%, because itemsets whose real frequency lies between (σ − ε)N and σN may also
be returned (there are some false positives).
3.3.2 Pattern-Tree
In order to make the storage and search for patterns efficient, estimated frequent itemsets are kept in a
prefix-tree summary structure, called pattern-tree, along with their corresponding support and error.
As a prefix-tree, all prefixes are stored, so that we can easily search for any prefix. As an example, if
the itemset (a, b, c) is in the pattern-tree, then both a and (a, b) are also in that tree, and they share the
same branch. Since we are storing patterns, by anti-monotonicity all prefixes of a pattern are also
patterns, as are all subsets in its strict powerset. Using the same example, if (a, b, c) : 5 is a pattern
with support 5, then a, b, c, (a, b), (a, c) and (b, c) are all patterns (with support equal to or higher
than 5), and therefore all are put in the pattern-tree. In this case, we have 4 branches in the tree: one
with a, (a, b) and (a, b, c); one with a and (a, c) (note that the node a is the same node in both
branches); another with b and (b, c); and finally another branch with only c. This example of a
pattern-tree is presented in Figure 3.1.
Figure 3.1: An example of a pattern-tree, corresponding to pattern (a, b, c) and all subsets.
In this sense, a pattern-tree is a prefix-tree structure where every node corresponds to a pattern,
composed of the items from the root to that node, with the estimated support and error (f′ and ∆)
attached to the node.
Since patterns that share the same prefix also share the same nodes in the tree, the tree is usually
much smaller than a list or table of all patterns, and searching for an itemset is usually much faster.
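Such a pattern-tree can be sketched as a dictionary-based prefix tree; the class and method names below are illustrative, not from the thesis, and the inserted supports are the illustrative values of the (a, b, c) example:

```python
class PatternTree:
    """Prefix tree where each node holds a pattern's estimated support f'
    and maximum error delta; the pattern is the item path from the root."""

    def __init__(self):
        self.children = {}   # item -> PatternTree
        self.support = 0     # estimated frequency f'
        self.error = 0       # maximum error delta

    def insert(self, pattern, support, error):
        """Add (or update) a pattern; shared prefixes share nodes."""
        node = self
        for item in pattern:
            node = node.children.setdefault(item, PatternTree())
        first_time = node.support == 0
        node.support += support
        if first_time:
            node.error = error           # error is fixed at first insertion

    def find(self, pattern):
        node = self
        for item in pattern:
            node = node.children.get(item)
            if node is None:
                return None
        return node

    def query(self, threshold, prefix=()):
        """All patterns with f' + delta >= threshold (cf. Lemma 2)."""
        result = []
        for item, child in self.children.items():
            p = prefix + (item,)
            if child.support + child.error >= threshold:
                result.append((p, child.support))
            result.extend(child.query(threshold, p))
        return result

# Pattern (a, b, c) and all its subsets share branches in the tree.
tree = PatternTree()
for pattern, support in [(("a",), 8), (("a", "b"), 6), (("a", "b", "c"), 5),
                         (("a", "c"), 5), (("b",), 7), (("b", "c"), 6), (("c",), 6)]:
    tree.insert(pattern, support, 0)
assert tree.find(("a", "b")).support == 6   # the prefix (a, b) is its own pattern
```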
3.3.3 Algorithm StarFP-Stream
The main idea of the algorithm is to have a local tree structure for each dimension that will store the
occurring transactions as new business facts arrive. These trees, called DimFP-Trees, contain the
intra-dimensional itemsets of the current batch. When |B| facts have arrived, these local trees are
combined into a global tree, called the Super FP-Tree, that will contain the inter-dimensional itemsets.
This tree is then used to extract the multi-dimensional patterns of the current batch, which are, in
turn, used to update the global pattern-tree described above.
The detailed algorithm is presented in Algorithms 1 and 2.
Algorithm 1 StarFP-Stream Pseudocode
Input: Star Stream S, error rate ε
Output: Approximate frequent itemsets with threshold σ, whenever the user asks
 1: i = 1, |B| = 1/ε, flist and ptree are empty
 2: B1 ← the first |B| business facts
 3: L ← StarFP-Growth(B1, support = ε|B| + 1)
 4: flist ← frequent items in B1, sorted by minimum support
 5: for all patterns P ∈ L do
 6:     insert P in the ptree with max error i − 1
 7: N = |B|, discard B1 and L
 8: i = i + 1, initialize n DimFP-trees to empty
 9: for all arriving business facts f = (tidD1, tidD2, ..., tidDn) do
10:     N = N + 1
11:     for all dimensions Dj do
12:         T ← transaction of Dj with tidDj
13:         insert T in the DimFP-treej
14:         flist ← append new items introduced by T
15:     if all business facts of Bi arrived then
16:         super-tree ← combineDimFP-trees(DimFP-trees, Bi)
17:         FP-Growth-for-streams(super-tree, ∅, ptree, i)
18:         discard the super-tree
19:         tail-pruning(ptree.Root, i)
20:         i = i + 1, initialize n DimFP-trees to empty

function combineDimFP-trees(DimFP-trees dim-trees, batch of business facts Bi)
    fptree ← new FP-tree
    for all business facts f ∈ Bi do
        for all dimensions Dj do
            T ← branch of dim-treej with tidDj
            sort items in T according to flist
            insert T in fptree
    return fptree

function tail-pruning(pattern-tree node R, batch id i)
    for all children C of R do
        if C.support + C.error ≤ i then
            remove C from the tree
        else
            tail-pruning(C, i)
When mining a star as a whole, items are ordered in support-descending order, which is known to
enhance the compactness of the trees [HPY00]. However, when dealing with streams, the order of
items cannot depend on their support, not only because we do not know the alphabet of items from the
beginning and we only see one transaction at a time, but also because items appear and disappear and
their supports change. Therefore, the list of items (flist) is dynamic, with items appended as they appear,
so that all patterns in the pattern-tree are sorted according to their order of appearance. In this sense,
we decided to process the first batch separately, as a whole, using StarFP-Growth [SA10], to initialize
both the order of items and the pattern-tree (rows 2–6 in Algorithm 1). Note that, while processing this
batch, we only need to scan the transactions of dimensions that appear in the batch, instead of the
whole dimensions.
After the first batch, all next business facts are processed as they arrive (rows 11–14 in Algorithm
1). So, every time one fact arrives, it is scanned and the transaction corresponding to each key is stored
in the respective DimFP-Tree. These local trees are simple prefix-trees (like the pattern-tree) that allow
us to compress the intra-dimensional itemsets of each batch. Besides, they also contain a header table
with the occurring keys, and a link from them to the corresponding branches in the tree, so that given a
foreign key, we can easily reach and regenerate the respective transaction. Note that we cannot discard
any item that appears in the current batch, since we do not yet know whether it is frequent. Thus,
all items found must be in the DimFP-tree. We can say that the DimFP-tree is a compact and efficient
representation of the dimension table for the current batch.
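The role of a DimFP-tree (compressing a batch's occurring transactions while keeping a header link from each foreign key to its branch) can be sketched as follows. All names are ours, and the branch bookkeeping is simplified:

```python
class DimNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.children = {}            # item -> DimNode
        self.count = 0

class DimFPTree:
    """Prefix tree over a dimension's occurring transactions, with a
    header table linking each foreign key to the last node of its branch."""

    def __init__(self):
        self.root = DimNode()
        self.header = {}              # foreign key -> last node of its branch

    def insert(self, key, items):
        node = self.root
        for item in items:            # shared prefixes are compressed
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = DimNode(item, node)
            child.count += 1
            node = child
        self.header[key] = node

    def transaction(self, key):
        """Regenerate the transaction for a foreign key via parent links."""
        items, node = [], self.header[key]
        while node.item is not None:
            items.append(node.item)
            node = node.parent
        return list(reversed(items))
```

Given a foreign key seen in an arriving fact, `transaction()` regenerates the dimension transaction by following parent links, which is the kind of lookup needed when the batch is later combined into the Super FP-Tree.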
When a batch is complete (rows 15–20 in Algorithm 1), the DimFP-Trees are combined to form
a global tree, called Super FP-Tree. In this step we have to scan the facts of the batch a second time,
otherwise we would not know which intra-dimension itemsets of the different DimFP-Trees occur together.
However, a fact is just a set of tids, therefore this extra scan is not significant. So the Super FP-Tree
is constructed by first looking for the co-occurring foreign keys of a fact in each DimFP-Tree, and then
joining the corresponding branches (see function combineDimFP-trees). Hence, it will contain all the
inter-dimension itemsets of the respective batch.
Algorithm 2 FP-Growth-for-streams Pseudocode
Input: FP-tree fptree, Itemset α, Pattern-tree ptree, Current batch id i
 1: if fptree = ∅ then
 2:     return
 3: else if fptree contains a single path P then
 4:     for all β ∈ P(P) do
 5:         processPattern(ptree, α ∪ β : min[support(nodes ∈ β)], i)
 6: else
 7:     for all a ∈ Header(fptree) do
 8:         β ← α ∪ a : a.support
 9:         if processPattern(ptree, β, i) = false then
10:             proceed to the next a
11:         else
12:             treeβ ← conditional fptree on a
13:             FP-Growth-for-streams(treeβ, β, ptree, i)

function processPattern(pattern-tree ptree, itemset I, batch id i)
    if I ∈ ptree then
        P ← last node of I in ptree
        P.support ← increment by I.support
        if P.support + P.error ≤ i then
            return false            // Type II Pruning
    else if I.support > ε|B| then
        insert I in ptree with support = I.support and maximum error = i − 1
    else
        return false                // Type I Pruning
    return true
The Super FP-Tree is then mined using the FP-Growth algorithm, presented in Algorithm 2, modified
as follows.
For each mined itemset I (see function processPattern):
1. If it is not in the pattern-tree (i.e. its f′ ≤ i − 1, according to Lemma 1), test Type I Pruning:
if I occurs only once in Bi and it is not in the pattern-tree, it is infrequent; thus we do not
insert it in the pattern-tree (Lemma 1) and we can stop mining the supersets of I (by the
anti-monotone property [AS94]: if I is infrequent, all supersets of I are also infrequent).
Otherwise, insert I into the tree with its number of occurrences in Bi and maximum error ∆ = i − 1.
2. If I is in the pattern-tree (i.e. its f′ > i − 1):
(a) Update its frequency, by adding the number of occurrences in Bi;
(b) Test Type II Pruning: if f′ + ∆ ≤ i, it will be deleted later because it is infrequent (Lemma
1), so we can stop mining the supersets of I. Otherwise, FP-Growth continues with I.
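The two pruning tests can be condensed into a single decision function over a pattern store. The sketch below uses our notation: `ptree` is simplified to a dictionary mapping itemsets to [f′, ∆], and the return value means "continue mining supersets". The example values follow the worked example of the next section (Year=2006 occurs 5 times in the second batch; Category=Clothes occurs once):

```python
def process_pattern(ptree, itemset, count_in_batch, i, batch_size, epsilon):
    """Update ptree for one itemset mined from batch B_i; return False to
    stop exploring supersets (Type I / Type II pruning), True otherwise."""
    entry = ptree.get(itemset)
    if entry is not None:                         # already tracked: update f'
        entry[0] += count_in_batch
        if entry[0] + entry[1] <= i:              # Type II: will be tail-pruned
            return False
        return True
    if count_in_batch > epsilon * batch_size:     # more than "once per batch"
        ptree[itemset] = [count_in_batch, i - 1]  # new entry, max error i - 1
        return True
    return False                                  # Type I: seen once, not kept

# Second batch of the running example (i = 2, |B| = 5, epsilon = 0.2):
ptree = {}
assert process_pattern(ptree, ("Year=2006",), 5, 2, 5, 0.2)            # inserted
assert ptree[("Year=2006",)] == [5, 1]
assert not process_pattern(ptree, ("Category=Clothes",), 1, 2, 5, 0.2) # Type I
```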
After mining the batch, we can discard the Super FP-Tree and prune the pattern-tree by Tail
Pruning: remove all nodes whose f′ + ∆ ≤ i (Lemma 1). The pattern-tree is now updated and contains
all approximate frequent itemsets up to that batch. The next steps consist only of preparing the next
batch and waiting for new facts.
Whenever there are no more batches, or whenever a user asks for a list of the current frequent
itemsets, we just need to scan the pattern-tree and return all itemsets with f′ + ∆ ≥ σN (Lemma 2).
3.3.4 Example
To illustrate the flow of StarFP-Stream, we use the example presented in Tables 2.1 and 2.2. Let
the minimum support threshold be 50% of the database and the maximum error 20%. For this error,
the fact table is divided into 2 batches of 5 business facts each (|B| = 1/ε). In a static environment,
without error, a support of 50% means that an itemset is frequent if it occurs in at least 5 business
facts (σN).
The algorithm starts by processing the first batch separately. Fig. 3.2 illustrates part of the pattern-
tree with the resulting patterns. For example, pattern (Gender=M) has a support of 4 business facts, and
pattern (Gender=M,Category=Seats,Category=Tires) of 3 facts. Both were added to the pattern-tree in
the first batch, therefore ∆ = 0. An example of an infrequent item is (Year=2006). Since it occurs only
once, it was ignored and not inserted in the pattern-tree.
Figure 3.2: Part of the pattern-tree resulting from the first batch. Gray nodes represent patterns whose f′ + ∆ ≥ σN = 2.5, and therefore are returned by the algorithm. White nodes have σN > f′ + ∆ ≥ 2, so they are not returned, but cannot be discarded.
Next, for each arriving business fact, the respective occurring transactions are inserted in the corre-
sponding compact DimFP-trees. Using our example, when the 6th sales order arrives, transactions p5
and p7 are inserted into the DimFP-tree of dimension Product; transaction c1 into the tree of dimension
Customer; and so forth. The DimFP-trees of dimensions Customer and Product are illustrated in Fig.
3.3, as if all business facts of the second batch had already arrived (for simplicity, we omitted the product
names). We can see that products p1, p2, p3, p4, p5 and p7 occurred, and that the corresponding DimFP-
tree maintains a link of each key to the respective path in the tree, which facilitates further searches.
Note that each of these trees contains all possible intra-dimensional patterns of the current batch.
Figure 3.3: DimFP-Trees of dimensions Customer (left) and Product (right), at the end of the second batch.
Figure 3.4: Super FP-tree of the second batch.
When all 5 business facts of this batch have arrived, the DimFP-trees are combined into one, the Super
FP-tree presented in Fig. 3.4, by scanning again the business facts, looking for the keys in the DimFP-
trees and joining the co-occurring paths (for convenience, nodes with items from the same dimension are
presented together, instead of ordered according to the flist). Using our example again, when scanning
sales order 7, all paths corresponding to client c5, products p2 and p3, territory s4 and date 20060509 are
joined in only one path (the leftmost path in the figure). By doing this, the Super FP-tree puts together
the intra-dimensional itemsets and forms inter-dimensional ones.
The Super FP-tree is then mined using the described adaptation of the FP-Growth algorithm and the
pattern-tree is updated. Fig. 3.5 shows a subset of the final pattern-tree. During this mining, discovered
patterns that are already in the pattern-tree are updated. For example, the itemset (Gender=M) was
already in the pattern-tree, and occurs 3 times in the second batch. It now has a frequency of 7, and since
it is higher than the maximum error (2 for the second batch), we can keep mining its supersets (Type II
Pruning). Also, newly discovered patterns are added to the tree. For example, the itemset (Year =
2006) was not in the pattern-tree (although it appeared once in the first batch). Since its frequency in
this batch is 5, higher than the error, it is inserted in the tree with estimated support 5 and ∆ = 1
(note that f′ + ∆ = 6 = f). Itemsets that occur only once and were not in the pattern-tree, such as
(Category = Clothes), are neither added nor further explored (Type I Pruning).
Finally, the pattern-tree is pruned by Tail Pruning, to eliminate all infrequent itemsets (f′ + ∆ ≤ 2).
One example is (Gender = M, Year = 2004):2, which appeared in the first pattern-tree but not in the
second one.
Figure 3.5: Part of the final pattern-tree.
These steps are then repeated for new facts.
If we ask for a list of the current frequent itemsets, a scan of the pattern-tree returns all itemsets
with f′ + ∆ ≥ σN = 5. In our example, the complete pattern-tree would have 175 nodes, but would
return only 17 patterns. A subset of the final patterns returned by the algorithm, corresponding to the
pattern-tree in Fig. 3.5, can be found in Table 3.1.
Table 3.1: A subset of the final patterns.
Pattern                                                   Dimensions
(Year=2006):5                                             Date
(TerritoryGroup=America):6                                Territory
(TerritoryGroup=America, Country=USA):5                   Territory
(Category=Seats):7                                        Product
(Category=Seats, Category=Tires):6                        Product
(Category=Seats, TerritoryGroup=America):4                Product, Territory
(Category=Seats, TerritoryGroup=America, Year=2006):4     Product, Territory, Date
(Gender=M):7                                              Customer
(Gender=M, Category=Seats):5                              Customer, Product
(Gender=M, Category=Seats, Category=Tires):5              Customer, Product
Note that the algorithm found both intra- and inter-dimensional patterns. For example, the pattern
(Category=Seats, Category=Tires):6 is an intra-dimensional pattern stating that seats and tires were
bought together in 6 internet sales (60%). Notice that this pattern relates different products that
co-occurred, and therefore could only be found because we aggregated the sales by the degenerate
dimension (sales order number). An example of an inter-dimensional pattern is (Category=Seats,
TerritoryGroup=America, Year=2006):4, relating the dimensions Product, Sales Territory and Date.
Note that this pattern has real f = 5, but f′ = 4, since it appeared only once in the first batch and was
discarded. However, it was not missed, because f′ + ∆ = 5. Although pattern (Category=Bike):4 has
the same f′, it was not returned because its ∆ is zero (and it was indeed not a real pattern).
3.3.5 Complexity Analysis
Since we work with one batch at a time, which corresponds to a certain number of facts and the
corresponding transactions in each dimension, we can assume that we work with a smaller star at each
point in time. Let that conditional star be SB. Let |B| be the number of business facts in a batch, and
mi the number of rows in the fact table necessary to represent fact i in that batch. The size of the
respective fact table is |FT|B = n × ∑_{i=1}^{|B|} mi primary keys, with n the number of dimensions
(note that in a star without degenerate dimensions mi = 1, and the size of the fact table is n × |B|).
Let also tdiB be the number of transactions of dimension i that occur in this batch. The size of each
dimension is |Di|B = tdiB × cdi, with cdi the number of columns of dimension i. Then the size of the
star is |S|B = |FT|B + ∑_{i=1}^{n} |Di|B.
Joining the tables before mining would result in a much larger table, whose size would be the number of rows in the batch times the sum of all columns in all dimensions: Σ_{i=1}^{|B|} m_i × Σ_{j=1}^{n} cd_j. This would have a negative impact on the memory needed, as well as on the time, not just because of the extra pre-processing step for the creation of this table, but also because of all the steps involving scans of the transactions.
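The two sizes can be compared directly from these formulas. The following Python sketch is only an illustration, with made-up batch numbers (the m, td and cd values below are hypothetical):

```python
def star_size(n_dims, m, td, cd):
    """|S|_B = |FT|_B + sum_i |D_i|_B, with |FT|_B = n * sum_i m_i
    and |D_i|_B = td_i * cd_i."""
    fact_table = n_dims * sum(m)
    dimensions = sum(t * c for t, c in zip(td, cd))
    return fact_table + dimensions

def joined_size(m, cd):
    """Size of the materialized join: batch rows times all columns."""
    return sum(m) * sum(cd)

# Hypothetical batch: 10 business facts, 2 fact-table rows each,
# over 4 dimensions (td and cd values are made up for illustration).
m = [2] * 10
td = [5, 6, 4, 3]
cd = [7, 8, 6, 3]
print(star_size(4, m, td, cd))   # 4*20 + 116 = 196
print(joined_size(m, cd))        # 20 * 24 = 480
```

Even in this tiny example the materialized join is more than twice the star, because every fact row repeats every dimension column.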
Each new business fact i that arrives is treated separately in O(m_i × Σ_{j=1}^{n} cd_j), since the occurring transactions of each dimension only need to be scanned once, to insert them in the corresponding DimFP-tree.
When a batch is complete, we need to take 3 steps: (1) combine the trees; (2) run FP-Growth on the
combined tree; and (3) prune the pattern-tree.
In (1), the DimFP-trees are combined in O(Σ_{i=1}^{|B|} m_i × Σ_{j=1}^{n} (cd_j log(cd_j) + cd_j)). This step consists of one scan of the fact table, to find and fetch the branches of the trees of the different dimensions that co-occur in each fact. The co-occurring items are then sorted and inserted in a Super FP-tree. Sorting and inserting are bounded by the number of columns in the dimensions.
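The sort-and-insert step can be sketched with a minimal prefix tree. This is a simplification for illustration only (the actual Super FP-tree also keeps header links and item counts per dimension); the global item order below is hypothetical:

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def insert_cooccurring(root, items, order):
    """Sort the items that co-occur in one business fact by the global
    order and insert the resulting path, incrementing node counts."""
    rank = {item: i for i, item in enumerate(order)}
    node = root
    for item in sorted(items, key=rank.__getitem__):
        node = node.children.setdefault(item, Node())
        node.count += 1
    return root

order = ["Category=Seats", "Gender=M", "Year=2006"]
root = Node()
insert_cooccurring(root, {"Gender=M", "Category=Seats"}, order)
insert_cooccurring(root, {"Year=2006", "Category=Seats"}, order)
# Both facts share the prefix node "Category=Seats", whose count is now 2.
print(root.children["Category=Seats"].count)
```

Because the items are always inserted in the same global order, facts that share items share prefixes, which is what keeps the tree compact.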
This Super FP-tree is then used in (2) to run the FP-Growth algorithm [for more details, see HPYM04], with just one difference: for each possible pattern, we have to first look for it in the pattern-tree, and insert it if it is not there, which is O(2 × m_i × Σ_{j=1}^{n} cd_j). Both searching and inserting in the tree are linear operations and depend on the size of the pattern, which is, in the worst case, the set of all items occurring in one business fact.
Finally, step (3) consists of scanning the pattern-tree to remove currently infrequent items (tail-pruning). If a node is infrequent, there is no need to scan its children, because they are infrequent too. However, in the worst case (e.g. when there are no infrequent nodes), we have to scan all nodes in the tree.
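Tail-pruning can be sketched as a recursive removal. This sketch assumes, Lossy-Counting style, that a node is currently infrequent when its f + ∆ does not exceed the number of batches processed so far; the field names f, delta and children are illustrative, not the thesis implementation:

```python
class PNode:
    def __init__(self, f=0, delta=0):
        self.f = f
        self.delta = delta
        self.children = {}

def tail_prune(node, batches_seen):
    """Drop every infrequent child; its whole subtree goes with it,
    since a superset cannot be more frequent than its subset."""
    node.children = {item: child for item, child in node.children.items()
                     if child.f + child.delta > batches_seen}
    for child in node.children.values():
        tail_prune(child, batches_seen)

root = PNode()
a = root.children["a"] = PNode(f=10, delta=1)
a.children["b"] = PNode(f=2, delta=1)      # 2 + 1 <= 5: pruned
root.children["c"] = PNode(f=3, delta=0)   # 3 + 0 <= 5: pruned with subtree
tail_prune(root, 5)
print(sorted(root.children))               # ['a']
```

Note how the anti-monotonicity of support is what allows skipping the subtree of a pruned node entirely.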
3.3.6 Strengths and Weaknesses
As a data streaming algorithm, StarFP-Stream gives the following guarantees, in line with [MM02, GHP+03]:
• All itemsets whose true frequency exceeds σN are returned (there are no false negatives);
• No itemset whose true frequency is less than (σ − ε)N is returned;
• And estimated frequencies are less than true frequencies by at most εN .
As a multi-relational algorithm for star schemas, StarFP-Stream guarantees that it mines the star
directly, without materializing the join of the tables, and that all multi-dimensional patterns are returned.
Like any algorithm, StarFP-Stream also has some limitations:
• As an FP-Growth [HPY00] based algorithm, it has to scan the facts twice: first to know which transactions of dimensions occur, and second to combine them at the end of a batch. However, a fact is just a set of tids, so the time needed for each scan and the memory needed to keep it are not significant when compared to scanning and storing transactions of items;
• And the pattern-tree tends to be very large, since it has to keep all frequent and potentially frequent patterns. Nevertheless, its size tends to stabilize as the batches increase, and it is able to return the patterns for every minimum support σ ≥ ε, anytime.
3.3.7 Comparison with Related Work
As mentioned in Section 3.2.1, there are only two algorithms for relational pattern mining over data streams: RPFS [HYXW09] and SWARM [FCAM09]. However, they are not directly comparable with StarFP-Stream.
While our algorithm is deterministic, RPFS is a probabilistic approach that only uses a sample of the data. Deterministic approaches allow an error in the frequency counts, but guarantee that all real frequent patterns are returned (i.e. there are no false negatives). On the contrary, probabilistic approaches, besides an error, also allow a probability of failure, i.e. there is a probability that some real patterns are not returned (there might be false negatives). Therefore, we cannot make a fair comparison between StarFP-Stream and RPFS.
On the other hand, SWARM is deterministic, but the transformation of the data into their input is not straightforward, nor is the correspondence between their results and ours. We would need both an extra pre-processing and an extra post-processing step. We analyze this in more detail below.
The authors define 3 types of predicates: key predicates, for target objects; structural predicates, for relations between objects; and property predicates, for values taken by properties of an object (binary) or of a relation between two objects (ternary). Table 3.2 presents a comparison between our representation and SWARM's. Capitalized letters correspond to variables, which can take any value.
Table 3.2: Correspondence between StarFP-Stream and SWARM representations.

StarFP-Stream                                                  SWARM
Item: (color, "Black")                                         Property predicate (with variables): color(P, "Black")
Dim. transaction: {p1, (color, "Black")}                       Property predicate (with values): color(p1, "Black")
Fact key: sale order 1                                         Key predicate: order(1)
Fact: {1, p1, c1}                                              Binary structural predicates: {order(1), soldWhat(1, p1), soldTo(1, c1)}
Measure: {1, p1, c1, m}                                        Ternary structural predicate: quantity(1, p1, m)
Intra-dimensional pattern: (color, "Black")                    {order(O), soldWhat(O, P), color(P, "Black")}
Inter-dimensional pattern: {(color, "Black"), (gender, "M")}   {order(O), soldWhat(O, P), color(P, "Black"), soldTo(O, C), gender(C, "M")}
As can be seen there, items correspond to property predicates. Each transaction of a dimension is therefore a set of property predicates, one per attribute. In our star schema case, we assume that our targets are business facts and that structural predicates correspond to relations between the facts and entities in dimensions. In this sense, a fact table can be mapped to their representation as a set of N key predicates (one per fact) and n × N binary structural predicates (one per dimension, per fact). If there are measures in the fact table, they can also be represented by ternary structural predicates. However, there are no n-ary predicates, so they cannot represent measures that depend on more than two objects.
Every time a new order is made, all the predicates relating to this order must flow through the stream. So, for example, order {1, 20040510, p1, c1, s1} should be transformed to {order(1), soldWhen(1, 20040510), soldWhat(1, p1), soldTo(1, c1), soldWhere(1, s1)}, along with all property predicates related to the entities in question, such as category(p1, "Bike") and color(p1, "Black") from dimension Product, and status(c1, "M") and gender(c1, "F") from Customer.
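This mapping of one business fact into predicates can be sketched directly. The sketch below is only illustrative; the predicate names (soldWhen, soldWhat, soldTo, soldWhere) follow the example above:

```python
def fact_to_predicates(order_id, tids):
    """Translate one business fact into SWARM-style predicates: one key
    predicate plus one binary structural predicate per dimension tid.
    tids maps the structural predicate name to the dimension tid."""
    predicates = [f"order({order_id})"]
    for relation, tid in tids.items():
        predicates.append(f"{relation}({order_id},{tid})")
    return predicates

print(fact_to_predicates(1, {"soldWhen": 20040510, "soldWhat": "p1",
                             "soldTo": "c1", "soldWhere": "s1"}))
# ['order(1)', 'soldWhen(1,20040510)', 'soldWhat(1,p1)',
#  'soldTo(1,c1)', 'soldWhere(1,s1)']
```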
In SWARM, patterns must relate the target with other relations and attribute values, and therefore they are built by expanding the key predicate with the other related predicates. In this sense, SWARM finds first-order patterns, as shown in the table. In the last example, our pattern stating that "Black" products are frequently bought by male customers is translated to: orders of products whose color is black, sold to clients of gender male, are frequent.
However, while expanding the predicates, the algorithm does not allow repetitions, which means that it cannot deal with degenerated dimensions, like in our example, and therefore it cannot always find patterns at the right business level (e.g. {(category, "Seats"), (category, "Tires")}). Hence, it is not comparable to our algorithm.
3.3.8 Time Sensitive Model
StarFP-Stream can also be extended to a time-sensitive data stream paradigm. By being aware of time, outdated data can be discarded, more recent patterns can be given more weight than older ones, and patterns can be examined at different granularities. This is very important in many real-world applications, where changes of patterns and their trends are more interesting than the patterns themselves (e.g. shopping and fashion trends, Internet bandwidth usage, etc.).
There are several ways to achieve this, in light of existing works [GHP+03, LLH11]. One simple approach is to keep more than one frequency per pattern (i.e. per node in the pattern-tree), corresponding to the support of the pattern in each of the most recent batches. Whenever a new batch arrives, we can just shift the supports and discard the oldest one. The estimated frequency of each pattern can be the sum of the stored frequencies, and therefore patterns that were frequent before, but not recently, can be discarded. We can also consider a weighted sum of the frequencies, giving a higher weight to more recent batches.
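A minimal sketch of this sliding scheme, keeping the last k per-batch supports per node (the particular weights below are an illustrative choice, not prescribed by the thesis):

```python
def shift_supports(supports, newest):
    """Discard the oldest stored support and append the newest batch's."""
    return supports[1:] + [newest]

def weighted_frequency(supports, weights):
    """Weighted sum of the stored supports; higher weight = more recent."""
    return sum(s * w for s, w in zip(supports, weights))

supports = [6, 3, 1]                    # last 3 batches, oldest first
supports = shift_supports(supports, 0)  # pattern absent from the new batch
print(supports)                         # [3, 1, 0]
print(weighted_frequency(supports, [0.2, 0.3, 0.5]))  # favors recent batches
```

With these weights, a pattern that stops appearing fades out quickly, which is exactly the intended recency effect.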
Another option is to consider more elaborate time divisions, such as the logarithmic time windows adopted by Giannella et al. [GHP+03]. In this case, we can consider periods (called windows) of different time granularities (e.g. day, month, year) and, instead of discarding the oldest frequencies, aggregate them. In this sense, when a new period starts, the shift of frequencies corresponds to adding them to the group of frequencies at a higher granularity (e.g. a pattern has support h on the last day, so when a new day comes, h is added to the number of times it has appeared in the last month). We can then give more importance to patterns that are frequent in more recent windows, and discard the oldest windows if their support is not significant.
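A sketch of that roll-up step, with hypothetical day/month/year windows per pattern (an illustration of the idea, not the tilted-window structure of [GHP+03] itself):

```python
def roll_up(windows, new_day_support):
    """When a new day starts, fold the finished day's support into the
    month window instead of discarding it, then start the new day count."""
    windows["month"] += windows["day"]
    windows["day"] = new_day_support
    return windows

w = {"day": 3, "month": 10, "year": 120}
print(roll_up(w, 0))   # {'day': 0, 'month': 13, 'year': 120}
```

The month-to-year transition would follow the same pattern, so finer windows stay small while older history is kept, only coarser.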
3.4 Performance Evaluation
This section presents the experiments conducted to evaluate the performance of our data streaming algorithm. Our goal is to evaluate the accuracy, time and memory usage, and to show that:
1. StarFP-Stream is capable of directly mining a star schema with degenerated dimensions, at the right aggregation level;
2. Our algorithm has a high accuracy and does not miss any real pattern;
3. The time needed for each batch is less than the time needed to denormalize each fact before mining the batch;
4. Mining the star directly is better than joining before mining, with and without degenerated dimensions.
We assume a landmark model, where all patterns are equally relevant, regardless of when they appear in the data. Therefore, we test StarFP-Stream against an adaptation of FP-Streaming for landmark models, which we call SimpleFP-Stream, as described in Section 3.2. Since SimpleFP-Stream does not deal with stars directly, it has to join the tables into one. And since it will have a star stream as input, with business facts arriving continuously, it denormalizes each business fact when it arrives (i.e. it goes to every dimension and joins all the transactions corresponding to the tids of the business fact in question), before mining it.
We also implemented FP-Growth, so that we could run it on all the data and compare its returned patterns (the exact patterns) with those of StarFP-Stream (the approximate patterns), to evaluate accuracy.
We tested the algorithms with a sample of the AdventureWorks 2008 Data Warehouse, described below. In order to analyze the algorithms in the absence and presence of degenerated dimensions, we first test the performance of both algorithms on a traditional star, ignoring the degenerated dimension, and then considering it, so that the algorithms can aggregate the facts related to the same business transaction and find patterns at the right business level.
In this work we analyze the accuracy of the results, as well as the behavior of the pattern-tree and the time and memory used by each algorithm. Experiments were conducted varying both minimum support and maximum error thresholds: σ ∈ {50%, 40%, 30%, 20%} and ε ∈ {10%, 5%, 4%, 3%, 2%, 1%}2. Dimension tables were kept in memory and the fact table is read as new facts are needed. Note that the course of the mining process of streaming algorithms does not depend on the minimum support defined, only on the maximum error allowed. The support only influences the pattern extraction from the pattern-tree, which, in turn, is ready for the extraction of patterns that surpass any requested support (σ ≥ ε). Since the size of the batches is defined by the error (|B| = ⌈1/ε⌉), by varying the error we are varying the batch size.
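The batch sizes used in these experiments follow directly from |B| = ⌈1/ε⌉, as a quick check confirms:

```python
import math

errors = [0.10, 0.05, 0.04, 0.03, 0.02, 0.01]
batch_sizes = [math.ceil(1 / e) for e in errors]
print(batch_sizes)   # [10, 20, 25, 34, 50, 100]
```

These are exactly the per-batch business fact counts listed in Table 3.4.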
The computer used to run the experiments was an Intel Xeon E5310 1.60GHz (Quad Core), with
2GB of RAM. The operating system used was GNU/Linux amd64 and the algorithms were implemented
using the Java Programming language (Java Virtual Machine version 1.6.0 24).
3.4.1 Data Description
We tested the algorithms in one sample of the AdventureWorks 2008 Data Warehouse3, created by
Microsoft especially to support data mining scenarios. The AdventureWorks DW is from an artificial
company that manufactures and sells metal and composite bicycles to North American, European and
Asian commercial markets.
In this work, we analyze a sample of the star Internet sales, considering four dimensions: Customer,
Product, Date and Sales Territory (who bought what, when and where), as it is shown in Fig. 2.1. Each
dimension has only one primary key and other attributes (no foreign keys). Numerical attributes were
excluded (except year and semester in dimension Date), as well as translations and other personal textual
attributes, like addresses, phone numbers, emails, names and descriptions.
This star contains information about more than 60 thousand individual customer Internet sales, from
July 2001 to July 2004. The fact table has the keys of those four dimensions and a control number (other
attributes were removed). The control number, attribute SalesOrderNumber, is a degenerated dimension,
that indicates which products were bought in the same sales order. There are 27600 Internet sales orders
and 60399 rows (individual sales) in the fact table.
In order to evaluate the performance of the algorithms in the absence and presence of degenerated
dimensions, we have chosen two stars to use in these experiments: (1) AW T-Star – the traditional star,
2A common way to define the error is ε = 0.1σ [LLH11]. Additionally, we use a larger error to see how much worse the results are, and a smaller error to see the improvements.
3AdventureWorks Sample Data Warehouse is available at http://sqlserversamples.codeplex.com/
i.e. the star in Fig. 2.1 without the degenerated attribute SalesOrderNumber ; and (2) AW D-Star – the
degenerated star, i.e. the star as it is presented in Fig. 2.1.
Table 3.3 presents a summary of the dataset characteristics.
Table 3.3: A summary of the dataset characteristics.
                                      AW T-Star       AW D-Star
Number of facts                       60,400          27,600
Number of transactions per fact       1               [1; 8]
Number of attributes per dimension    [2; 7]          [2; 7]
Number of entries per dimension       [12; 18,485]    [12; 18,485]
3.4.2 Experimental Results
The results obtained in these experiments describe the performance of the algorithms when dealing with
traditional star schemas (T-Star) and with stars that have degenerated dimensions (D-Star). We first
discuss the accuracy of the results, and then the size of the pattern-tree. Finally, we present the time
and memory used by each algorithm.
In traditional stars, each row in the fact table corresponds to one business fact. In degenerated stars, several rows in the fact table may correspond to the same business fact. This means that there are fewer business facts than rows, and therefore the D-Star will have fewer batches, which will probably have different sizes.
For a better understanding of the domain of each experiment, the number of batches and their size,
corresponding to each error, are shown in Table 3.4.
Table 3.4: Batches corresponding to each error.

Error    Business Facts per Batch    Number of Batches (T-Star)    Number of Batches (D-Star)
10%      10                          6039                          2760
5%       20                          3019                          1308
4%       25                          2415                          1104
3%       34                          1776                          812
2%       50                          1207                          552
1%       100                         603                           276
Accuracy
The accuracy of the results is influenced by both error and support thresholds. Therefore, we conducted
tests on StarFP-Stream varying both. Note that the resulting patterns of StarFP-Stream and SimpleFP-
Stream are the same (the algorithms only differ in how they manipulate the data). The exact patterns
were given by FP-Growth (with all facts as input) and were compared with the approximate ones.
We know that as the minimum support decreases, the number of patterns increases, since we require
fewer occurrences of an item for it to be frequent. And as the maximum error increases, the number of
patterns returned also tends to increase, because although we can discard more items, we have to return
more possible patterns to make sure we do not miss any real one. As expected, this is verified in these
experiments, as can be seen in Fig. 3.6 (left).
[Figure omitted: charts of the number of patterns returned and of the precision, for varying support (x-axis) and error, for (a) AW T-Star and (b) AW D-Star.]
Figure 3.6: Number of patterns returned (left) and precision (right), by StarFP-Stream.
It is interesting to see that the algorithms return more patterns for AW D-Star than for AW T-Star. For example, for a support of 30%, mining each row as a single fact requires patterns that appear in more than 18 thousand rows. By aggregating per degenerated key, we can instead find the products and sets of products that are bought together in more than 8280 sales orders. This increases the number of patterns returned, since items appearing more than 8280 times but fewer than 18 thousand are infrequent in the first case, but frequent when aggregating.
We can see in the charts that, although StarFP-Stream returns more patterns than the exact ones, it returns just a few more. The precision helps evaluate this, measuring the proportion of real patterns among the patterns returned by the streaming algorithm.
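Precision here is simply the fraction of the returned patterns that are also exact patterns; a quick sketch:

```python
def precision(returned, exact):
    """Proportion of the streaming algorithm's patterns that are real."""
    return len(returned & exact) / len(returned) if returned else 1.0

# Hypothetical sets: 4 returned patterns, of which 3 are exact.
print(precision({"a", "b", "c", "d"}, {"a", "b", "c"}))   # 0.75
```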
Fig. 3.6 (right) presents the precision as the support varies. These results depend on the data characteristics, namely on the number of hidden patterns and on the history of occurrences of items across the batches processed. We can see that, as the error increases, the precision decreases, for all support thresholds. In other words, the smaller the error, the fewer non-real patterns are returned. The overall results show that the precision is always above 60%.
In the case of the AW T-Star, the algorithm achieved its best results for a support of 40% (100% precision for errors between 1% and 5%, and 93% for an error of 10%). This may mean that patterns that appear more than 40% of the time are well defined and consequently are monitored early during processing. For a support of 50%, all errors achieved the same precision of 83%. The precision for the AW D-Star is similar, achieving better results for smaller errors than the T-Star.
The recall of StarFP-Stream (and SimpleFP-Stream) is proved theoretically to be 100% (see Section 3.3.1), meaning that there are no false negatives, i.e. there are no real patterns that the algorithm considers infrequent. The size of the batches is defined in terms of the error, so that we can discard the first n occurrences of an item if n is less than the current number of batches, and still not lose any real pattern. This was also verified in these experiments.
In terms of accuracy, we can state that the streaming results are accurate and achieve a high precision.
Pattern-Tree
The pattern-tree is the key element of these algorithms, since it is the summary structure that holds all the possible patterns. The maximum error and the characteristics of the data influence its size, which in turn influences the time and memory needed. The minimum support only matters when extracting the patterns out of the pattern-tree, and does not influence its size.
Since both algorithms use the same rules to construct the pattern-tree, it is equivalent in both cases. Therefore, we only present the results of the pattern-tree constructed by StarFP-Stream.
[Figure omitted: pattern-tree size charts — (a) average size per error for AW T-Star, (b) size per batch with 3% error for AW T-Star, (c) average size per error for AW D-Star, (d) size per batch with 3% error for AW D-Star.]
Figure 3.7: Average (left) and detailed (right) pattern-tree size.
Fig. 3.7 (left) shows, for each error, the average size of the pattern-tree after processing a batch. It confirms that, as the error decreases, the size of the pattern-tree increases. This is explained by the fact that, for higher errors, the batches are smaller and the algorithms can discard many more possible patterns than for lower errors. Although it is a summary structure, the pattern-tree is still very large, with thousands of nodes.
With AW D-Star, the Super FP-tree and the pattern-tree are substantially larger and different: when aggregating, they have fewer but longer paths, because of the co-occurrences of items of the same table in the same transaction.
Fig. 3.7 (right) shows the detailed size of the pattern-trees, for a fixed error of 3%. In the AW T-Star chart, we can see that the pattern-tree is larger in the first batches, but tends to stabilize a few batches later. This behavior is common to all errors, and reveals that many patterns were frequent only in the beginning. In contrast, the pattern-tree of AW D-Star is smaller in the beginning. This happens because, until sales order number 5400, customers only bought one product at a time, and therefore there are fewer patterns and no co-occurrences of more than one product in the same sale.
In both cases, the spikes are caused by the introduction of recently appearing itemsets, followed by their removal a few batches later, in the pruning step. It is interesting to see that, despite the spikes, the trees always return to the same size. This might indicate that the patterns are well defined and remain consistent across the batches.
These results are important to understand the fluctuations in time and space described below.
Time
With respect to data streams, processing time is usually analyzed in two ways [LLH11]: the time
required to process one batch (update time) and the time needed to return the patterns for a given
support (query time). The first is the elapsed time from the reading of a transaction to the update of
the pattern-tree. The second is the time needed to scan the pattern-tree.
The minimum support does not influence the construction of the pattern-tree, and thus does not affect
the update time. In contrast, the maximum error influences both update and query time. Neither time
should depend on the total number of transactions.
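The two measurements can be sketched as follows; this is an illustrative harness rather than the original implementation, and `process_batch` and `scan_pattern_tree` are simplified stand-ins (hypothetical names) for the algorithms' actual routines:

```python
import time

def process_batch(batch, pattern_tree):
    # Placeholder for the real per-batch update (build and merge the FP-tree).
    for transaction in batch:
        key = frozenset(transaction)
        pattern_tree[key] = pattern_tree.get(key, 0) + 1

def scan_pattern_tree(pattern_tree, min_support, total):
    # Placeholder for the real pattern-tree scan at query time.
    return [p for p, c in pattern_tree.items() if c >= min_support * total]

def timed_run(batches, min_support):
    pattern_tree, total = {}, 0
    update_times = []
    for batch in batches:
        start = time.perf_counter()          # update time: read -> tree update
        process_batch(batch, pattern_tree)
        update_times.append(time.perf_counter() - start)
        total += len(batch)
    start = time.perf_counter()              # query time: scan the pattern-tree
    patterns = scan_pattern_tree(pattern_tree, min_support, total)
    query_time = time.perf_counter() - start
    return update_times, query_time, patterns
```

Measured this way, the update time is per batch (so it should not grow with the stream), while the query time is a single scan of the summary structure.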
[Charts omitted: four panels — (a) Average time – AW T-Star; (b) Time with 3% error – AW T-Star; (c) Average time – AW D-Star; (d) Time with 3% error – AW D-Star. Each plots Time (s) against the error or the batch number, for SimpleFP-Stream and StarFP-Stream.]
Figure 3.8: Average (left) and detailed (right) update time.
Fig. 3.8 (left) shows the average update time per batch, for both algorithms and for all errors. For
consistency, we do not take into account the time needed to process the first batch, since it is processed
separately.
We can see that SimpleFP-Stream demands, on average, more time than StarFP-Stream. The difference
is even larger when there is a degenerated dimension (AW D-Star), since the former has to denormalize
several rows before mining each business fact. This demonstrates that, for star streams, denormalizing
before mining takes more time than mining the star schema directly, especially in the presence of
degenerated dimensions, corroborating our goal and one of the goals of MRDM.
The update time should tend to be constant and should not depend on the number of transactions. This
can be verified in Fig. 3.8 (right), which shows in detail the time needed per batch, for a 3% error. There,
we can see that the update time tends to be constant as more batches are processed.
The higher values in the first batches for AW T-Star, and the lower values in the same batches for
AW D-Star, are directly related to the size of the pattern-tree. Around sale order 5400 (batch 160), the
data change: customers start buying more than one product at a time. Without aggregations (AW T-Star),
the algorithms are able to prune almost half the patterns and therefore need less time to process each
batch. With the degenerated dimension (AW D-Star), the aggregations start only at this point, so the
batches become larger from here on, and the algorithms need more time to process each batch.
In summary, these charts reflect that, as the error decreases, the batches become larger and more
time is needed to process them. They show that the time needed tends to be constant, depending
mainly on the size of the batches and on the size of the current pattern-tree. StarFP-Stream performs
better and needs less time to process each batch, overcoming the “join before mining” approach
(SimpleFP-Stream), both with and without degenerated dimensions.
The query time turned out to be negligible compared to the update time, always taking less than
0.005 seconds. As the error decreases, the pattern-tree grows, and the time needed to extract the
patterns also increases, but only by milliseconds. The same happens with the minimum support: the
lower the support, the more patterns have to be returned.
Memory
The space or memory used by the algorithms was also studied. It depends on the intermediate structures
used by the algorithms, and it is strongly related to the size of the pattern-tree (and therefore to
the error bound). To analyze the maximum memory per batch, we measured the memory used by the
algorithms for each batch, right before discarding the Super FP-Tree and performing the pruning step.
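This measurement protocol can be sketched with Python's `tracemalloc`; the `build_super_fp_tree` and `prune` callables below are hypothetical stand-ins for the algorithms' actual steps, not the original code:

```python
import tracemalloc

def max_memory_per_batch(batches, build_super_fp_tree, prune):
    """Sample peak memory right before the Super FP-Tree is discarded
    and the pruning step runs, mirroring the measurement protocol above."""
    peaks = []
    tracemalloc.start()
    state = {}                                      # persistent pattern-tree state
    for batch in batches:
        tree = build_super_fp_tree(batch, state)    # per-batch Super FP-Tree
        _, peak = tracemalloc.get_traced_memory()   # peak bytes since last reset
        peaks.append(peak)                          # sampled before discard/prune
        prune(state)
        del tree                                    # discard the Super FP-Tree
        tracemalloc.reset_peak()
    tracemalloc.stop()
    return peaks
```

Sampling at this point captures the worst case per batch: both the summary structure and the transient Super FP-Tree are in memory simultaneously.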
[Charts omitted: two panels — (a) AW T-Star; (b) AW D-Star. Each plots Memory (Mb) against the error, for SimpleFP-Stream and StarFP-Stream.]
Figure 3.9: Average maximum memory per batch.
Fig. 3.9 presents the average maximum memory per batch. We can see that it increases as the error
decreases, and that the two algorithms perform very similarly. With the star without degenerated
dimensions (AW T-Star), StarFP-Stream needs slightly more memory per batch than SimpleFP-Stream,
which was expected, since the former has to construct a DimFP-Tree for each dimension, while the latter
puts the denormalized facts in just one FP-tree. With AW D-Star, this difference is not even visible.
We can also see that both algorithms need more memory per batch when dealing with degenerated
dimensions, due to the storage of larger trees.
Naturally, as in other pattern mining algorithms, the memory used increases exponentially as the
error decreases, since the error defines what is considered frequent: the smaller the error, the more has
to be kept. However, just as with time, the memory needed tends to stabilize and not depend on the
number of batches processed so far, as required by the data streaming paradigm. This memory behavior
is directly related to the size of the pattern-tree.
3.5 Discussion and Conclusions
In this chapter we described an algorithm, named StarFP-Stream, for mining patterns on very large
data repositories modeled as a star schema. The algorithm finds frequent patterns at some level of
aggregation, and it is able to deal with degenerated dimensions by aggregating the rows in the fact table
that correspond to the same business fact, while still mining the star directly.
Experimental results show that StarFP-Stream is as accurate as its predecessors, achieving precision
above 60% and 100% recall. The pattern-tree tends to be very large, but its size tends to be stable,
and it is able to return the patterns for every minimum support, at any time. The time and memory
needed by the algorithm tend to be constant and do not depend on the total number of transactions
processed so far, but only on the size of the batches and on the size of the current pattern-tree, which
in turn depends on the characteristics of the data. StarFP-Stream greatly outperforms SimpleFP-Stream
in terms of time. We can therefore conclude that our algorithm overcomes the “join before mining”
approach.
Despite being an efficient algorithm, able to mine large and growing star schemas, StarFP-Stream still
suffers from the main bottleneck of pattern mining: it returns a huge number of unfocused patterns. In
a data streaming environment this problem is even more visible, since streaming algorithms must keep
an even higher number of possibly frequent patterns. To tackle this problem, the most studied approach
is to incorporate domain knowledge into the pattern mining algorithms, helping to filter the patterns
and obtain fewer and more interesting results, from the user and application points of view. In our
case, it can greatly reduce the size of the pattern-tree, and therefore the memory needed per batch, as
well as the time needed, since we would process smaller pattern-trees. Accomplishing this, however, is not
straightforward, since the introduction of domain knowledge, such as constraints, has so far been tackled
over single transactional tables. To the best of our knowledge, there is no approach for incorporating
domain knowledge into multi-relational pattern mining. We discuss and make progress on this in the
following chapters of this dissertation.
Another path for improvement is the creation of a parallelized version of StarFP-Stream, which could
significantly reduce the time needed and increase the throughput of the algorithm. We could parallelize
the processing of each fact in a batch; when the batch is complete and while its Super FP-Tree is being
mined, the next batch can already be collected and its facts processed in parallel. There have been some
efforts in the parallelization of traditional pattern mining, in particular of the base FP-Growth algorithm
[LWZ+08], which may also serve as a basis for parallelizing the mining of the Super FP-Tree in our
StarFP-Stream algorithm.
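The pipelining idea could be sketched as follows; `process_fact` and `mine` are hypothetical stand-ins for the per-fact processing and the Super FP-Tree mining, and this is only a sketch of the proposed parallelization, not part of the current StarFP-Stream:

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(batches, process_fact, mine):
    """Overlap the mining of batch i with the fact processing of batch i+1."""
    results = []
    with ThreadPoolExecutor() as pool:
        mining_future = None
        for batch in batches:
            # Process the facts of the current batch in parallel; meanwhile,
            # the previous batch's mining task may still be running in the pool.
            processed = list(pool.map(process_fact, batch))
            if mining_future is not None:
                results.append(mining_future.result())
            mining_future = pool.submit(mine, processed)
        if mining_future is not None:
            results.append(mining_future.result())
    return results
```

The key property is that `pool.submit(mine, ...)` returns immediately, so the main loop can move on to collecting and processing the next batch while mining proceeds in the background.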
Chapter 4
The Groundwork on Domain Driven
Data Mining
Despite the advances and the recognizable value of pattern mining for finding different types of interesting
relations among data, from association rules [AS94, PH02] to sequences [PHW07] and emerging
patterns [MTIV97, DL99], it tends not to be widely used in real world applications.
One of the main reasons, and one of the common criticisms of pattern mining, is the fact that it
generates a huge number of patterns, independently of user expertise, making it very hard to understand
and use the results [SVA97, HCXY07].
The truth is that, if we only ask for patterns with high support, only already known patterns are
found, or none at all; if, instead, the support is set too low, the number of patterns explodes, and it is
very difficult to distinguish the few really useful patterns among the many uninteresting ones. A balance
is required, and ways to limit the number of results are needed.
Several strategies to cut down the number of patterns returned have already been proposed and tested,
namely: (1) reducing the set of attributes to consider [BGMP05]; (2) identifying only the k-best patterns;
(3) mining condensed representations [PBTL99, Zak00a, BBR00]; (4) using sampling and approximation
techniques; and (5) using background knowledge to limit the results to what is unknown.
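Strategy (3), mining condensed representations, can be illustrated with a minimal closed-itemset filter: an itemset is closed if no proper superset has the same support. The sketch below assumes the frequent itemsets and their supports are already computed; it is a didactic illustration, not one of the cited algorithms:

```python
def closed_itemsets(frequent):
    """frequent: dict mapping frozenset -> support count.
    Returns only the closed itemsets: those with no proper
    superset having exactly the same support."""
    closed = {}
    for itemset, support in frequent.items():
        if not any(itemset < other and support == s
                   for other, s in frequent.items()):
            closed[itemset] = support
    return closed
```

The closed itemsets are a lossless summary: every frequent itemset and its support can be recovered from them, yet the set returned is often much smaller.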
Nonetheless, we need more than just reducing the number of patterns. Another bottleneck of pattern
mining is the lack of focus on user expectations and the weak actionability of the results. User expectations
are subjective and not easy to convert to a machine-readable form [GSD07]. However, they are related
to the domain and to the business goals. In this sense, the use of domain knowledge in the search
process is recognized as a promising approach to solve this problem.
The existing knowledge of a specific application domain, regarding the important aspects of the
business, such as its structure, data, processes, people and goals, is referred to as domain knowledge. This
knowledge usually belongs to experts, and depends on their experience and view of the business. Different
experts in the same domain may have different perspectives of the business, and therefore different domain
knowledge.
Being able to explore this knowledge makes it possible not only to focus the results, but also to reduce
their number, hence improving their interpretability and actionability, and eventually improving business
leverage and the acceptance of pattern mining.
With respect to the use of domain knowledge in data mining, there are some fundamental issues that
need attention:
• How to capture domain knowledge? Domain knowledge is usually implicit and tacit. How to extract
this knowledge from people is non-trivial, especially because different people may have different
perspectives of the business and its goals, and also because they are not always aware of what
they know, or do not know how to transmit it (the knowledge acquisition bottleneck);
• How to represent domain knowledge? In order to use this knowledge, it is necessary to formalize it
in some human- and machine-readable representation. Several ways have been proposed to represent
this knowledge, such as annotations, introduced directly by users or experts; constraints over the
search and results space; and ontologies, modeling the relations between important concepts;
• How to involve domain knowledge in the data mining process? Algorithms must use this knowledge
to constrain the search and the results to what is important. How to use this knowledge efficiently
is still a considerable challenge. On the one hand, if we restrict too much, we may reduce the
discovery process to a simple hypothesis testing task, that can only find already known patterns
[HG02]. On the other hand, if we restrict too little, we go back to a traditional pattern mining
process, and may find too many uninteresting patterns. Another concern is that there are several
different representations, and the ways to incorporate each one may also vary. Furthermore, we
also have to be careful not to depend on one specific domain, or else the algorithms cannot be
reused;
• How to evaluate the interestingness and novelty of the resulting patterns? This problem is common
to traditional data mining, and involves the creation of interestingness measures that evaluate the
results based on the existing knowledge, and the selection of only those that are novel.
The problems of acquiring and representing domain knowledge are outside the scope of this work.
Rather, in this thesis we will focus on how the existing domain knowledge representations have been
incorporated into the pattern mining task.
The work on the introduction of domain knowledge or semantics in pattern mining has been increasing
in the last decade. In a broader view, existing algorithms mostly use this knowledge in the pre- or post-
processing step of the Knowledge Discovery in Databases (KDD) process [FPSS96]:
As a pre-processing task, domain knowledge has been used mainly to reduce the search space or to
improve the quality of data. This can be achieved, for example, by filtering the original data, selecting only
the most important records [BGMP05]; by enriching the data with related concepts from the background
knowledge [HF95, SA96, SVA97]; and by replacing concepts and missing values based on the domain
knowledge. While these techniques help filter and improve the quality of the data to mine, they may
need a lot of time to be tuned. Additionally, they might eliminate important data, and they do not
guarantee that the algorithms will return fewer and novel results.
As a post-processing step, domain knowledge has been used to evaluate the interestingness and novelty
of the discovered patterns, so that only the best (and new) results are shown to the users [WJL03, JS04,
JS05, PT98, CLZ07]. Using domain knowledge only as a post-processing step means that all data has
to be processed to find all patterns, and then all patterns must be processed again to evaluate and filter
them. Thus, it may not be the most efficient strategy.
A more balanced approach is to use the domain knowledge during the data mining phase itself. In this
direction, algorithms are able to prune the search space and filter the results “on the fly”, and therefore
return fewer and more focused results, while needing less time and memory than other approaches.
Mainly, existing approaches incorporate domain knowledge into the discovery process to avoid generating
uninteresting candidates or following uninteresting paths, and to avoid testing all
data [SA95, SVA97, PH02, BJ05, Ant09b, ME09b].
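To illustrate this "during mining" strategy, the sketch below pushes an anti-monotone constraint (here, a hypothetical "total price within a budget") into level-wise candidate generation, so violating itemsets and all their supersets are never counted; it is a didactic sketch, not one of the cited algorithms:

```python
from itertools import combinations

def mine_with_constraint(transactions, min_count, price, budget):
    """Level-wise frequent-itemset mining that pushes the anti-monotone
    constraint sum(price[i] for i in itemset) <= budget into the search."""
    def ok(itemset):
        return sum(price[i] for i in itemset) <= budget

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items) if ok([i])]
    frequent = []
    while level:
        counts = {c: sum(c <= set(t) for t in transactions) for c in level}
        survivors = [c for c, n in counts.items() if n >= min_count]
        frequent.extend(survivors)
        # Join step: candidates one item larger, pruned by the constraint
        # before any counting pass over the data.
        level = {a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == len(a) + 1 and ok(a | b)}
    return frequent
```

Because the constraint is anti-monotone (if an itemset violates it, so do all its supersets), checking it at candidate generation is safe: no valid pattern is lost, and whole branches of the search are never explored.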
In this chapter we analyze in detail the existing approaches to the incorporation of domain knowledge
in the pattern mining process, and present a new global view of the work done in this area. In particular,
we propose a new framework for constrained pattern mining that helps organize the existing strategies
for incorporating constraints in the search process, based on the semantics and properties of existing
constraints, as well as on the data sources being constrained.
Section 4.1 presents the background on the use of domain knowledge in pattern mining, and section
4.2 describes the different forms of domain knowledge that have been used, along with their advantages
and disadvantages. Sections 4.4 to 4.8 present in detail the framework for constrained pattern mining
and the existing related work. Finally, section 4.9 concludes with some discussion and open
issues.
4.1 Background
The use of domain knowledge has been explored in data mining since its early years, in a somewhat
independent manner across different areas, such as inductive logic programming, semantic data mining,
and, more recently, domain driven data mining.
In this section we discuss the approaches and goals of each of these areas.
4.1.1 Inductive Logic Programming – Discussion and Arguments
Inductive Logic Programming (ILP) is a well known and studied paradigm of machine learning, concerned
with inducing classification rules from examples and background knowledge, all expressed in logical
representations, such as Prolog programs [NCW97, LM04, LE09]. It was born from the intersection of
Concept Learning and Logic Programming, with the goal of prediction within the representation
framework of Horn Clausal Logic (HCL).
The fact that all information must be written in declarative languages (like Prolog and Datalog) is
one of the drawbacks of ILP approaches, and one of the reasons why they are not widely used. Nevertheless,
their structure promotes the representation and use of domain knowledge. There are many ILP algorithms
that are able to introduce this knowledge into the discovery process (see, for example, [RR04, MEL01,
LM04, LR98, RV00, LM03, Lis05]).
ILP techniques must also deal with the tradeoff between the expressiveness and the efficiency of the
representations used. Studies show that current algorithms scale relatively well as the amount of
background knowledge increases, but they do not scale well with the number of relations involved, nor,
in some cases, with the complexity of the patterns being searched [D96, LM04].
4.1.2 Domain Driven Data Mining – Discussion and Arguments
The methodology of Domain Driven Data Mining, D3M, was proposed recently [CZ06, CZZ+07, CLZ07,
CYZZ10b, CYZZ10a], defending an urgent need for Actionable Knowledge Discovery (AKD) to support
businesses and applications.
The motivation behind D3M is the gap between academic objectives (innovation, performance and
generalization) and business goals (problem solving), and between academic outputs and business
expectations [CYZZ10a]. For data mining to be better accepted and advantageously applied in real
businesses and applications, it is necessary to create methods and tools capable of analyzing real world
data and extracting actionable knowledge, i.e., useful information that can be (as far as possible) directly
converted into decision-making actions. The term “actionability” measures the ability of a pattern to
prompt a user to take concrete actions to his or her advantage in the real world [CLZ07].
To achieve this, data mining must involve the ubiquitous intelligence surrounding the business problem,
such as human intelligence, domain intelligence, and network and organizational/social intelligence
[CYZZ10a]. Therefore, its proponents advocate a paradigm shift from data-centered knowledge discovery
to domain-driven actionable knowledge discovery.
Research in D3M has centered on the proposal of methods dedicated to specific domains, with a
special emphasis on the actionability of the results. The specificity of those methods makes it difficult
to apply them to other domains, and the need for a standard methodology able to incorporate domain
knowledge in the mining process remains an open issue. We can argue that existing work in D3M is
more centered on the actionability of results in some domain than on the reusability of the proposed
strategies.
4.1.3 Semantic Data Mining – Discussion and Arguments
The name Semantic Data Mining (SDM) has been used to denote several approaches to data mining, in
a not very consistent way:
1. Defining the semantics of DM [PDS08, Ant09b] (or semantic meta-mining [NVTL09, JLL10]).
Defining the semantics of the DM process itself may help in understanding the actual process, as well
as the dependencies between the several approaches. By identifying and formally representing the
respective inputs, outputs, configurations and even workflows, it is possible to discover problems and
solutions, and to envision more efficient strategies;
2. Extracting semantics from data [Set10, EC07]. This may be seen as the original goal of DM, which
is to discover useful knowledge from data. In this case, this knowledge or semantics usually takes the
form of keywords or features that give meaning to data;
3. Mining semantic data. With the explosive growth of the Semantic Web and of Internet resources,
there is more and more semantic information available worldwide. Mining this semantic information
directly may improve its understanding and use [TLT08, LVS+11];
4. Adding semantics to data. This approach is also powered by the growth of the Semantic Web, and
is best known as semantic annotation [DP08, Liu10]. By enriching data with semantics, it is possible
to help users understand the data, and to use it to get better results;
5. Using the existing semantics of some domain to guide DM algorithms. By incorporating the
knowledge inherent to each domain, data mining techniques are able to focus the search and modeling
process, and find more interesting results. Despite the advances, most of the existing work in this
area is designed for some specific domain, and therefore cannot be reused. Also, capturing and
representing the semantics of some domain is not straightforward, but there has been an effort
to create and use increasingly expressive forms of representation of domain knowledge
[Ant07, NVTL09, JLL10].
In this work we consider the last approach, and analyze in detail the use of domain knowledge to
guide DM algorithms in the search for more focused results. We focus mainly on the existing generic
forms of domain knowledge representation, and on the strategies created to incorporate these forms. The
motivation is that, by being able to use generic representations, the algorithms can be applied to any
real problem, and still guide the discovery process through the specific knowledge of that domain.
4.2 Domain Knowledge Representations
Modeling has been one of the core parts of information science, both in information systems and in
artificial intelligence. In both, it is generally accepted that, without a good model, no system works
adequately.
The advances in the areas of modeling and knowledge representation allow us to use mature formalisms
to represent existing knowledge, making it possible to explore those models to guide the discovery
process.
The use of domain knowledge in data mining has been a topic of extensive research, and several
representations have been proposed and analyzed, from simpler forms of knowledge, like annotations, to
more elaborated and expressive representations, like ontologies.
Each form of representation allows the formalization of more or less complex kinds of knowledge, and
therefore has its advantages and disadvantages, and can be used in different ways to guide the mining
process. Usually, the more complex the model, the harder it is to incorporate it efficiently.
The domain knowledge used by existing data mining approaches can be divided into: human interactions,
annotations (or labels), constraints, and graph-based models. Over time, several strategies have been
proposed to incorporate these forms of knowledge representation, some general, but most of them
ad hoc.
Human Interactions Techniques based on human interactions (known as interactive approaches)
involve the user or expert in the actual discovery process, letting them direct the flow of the
algorithms and influence the selection of results.
The reasoning behind these approaches is that, from the user's point of view, pattern discovery
is an interactive and iterative process [Bou04]. Users define the data to analyze, choose the
desired parameters and thresholds, and interpret and evaluate the quality and applicability of the
results.
However, the actual discovery phase is usually a black box, and therefore it is difficult to trace back
the results and find out which parameters or constraints led to the interesting ones. This leads
to another problem, related to the difficulty of choosing the best parameters and constraints.
Users do not always know exactly what they want a priori, and this black-box approach makes
it very inefficient to try different values (it is necessary to run the algorithms again with the
new parameters). To overcome this, we need interactive approaches, capable of involving the user
in the discovery process, and able to use their feedback iteratively and incrementally
[NDD99, GB00, GMV11].
Active learning techniques also require human interactions to help in the learning process. The
main idea is to make the system iteratively ask an oracle (e.g. the user or the expert) for new
information (e.g. labels or evaluations); from the answers, the system learns what the user's
knowledge and expectations are [Set10, XSMH06].
Annotations One simple form of attaching domain knowledge to data is to add labels or annotations
that characterize the context in which the user or expert wants the data to be
analyzed.
These labels can be, for example, the ratings given by customers on a social network, “relevant” and
“not relevant” tags, insights about important objects and possible relations, and desired categories.
In more complex or critical domains, like genetics, fraud detection, speech recognition and media,
these labels must be given by experts. However, with the rapid growth of data, most of the time it
is too expensive and time-consuming to ask humans to label all data [Set10]. Techniques like active
learning try to automate this labeling process, by iteratively and interactively learning how to label
new data points. But how are these labels used in DM to find useful models?
Labels have been used mostly by classification and semi-supervised clustering techniques, to train
or initialize the models that then categorize unknown or unlabeled instances [BBM02, SA12c].
Normally, the more labels used as background knowledge, the better the results. That is why
classification results are generally more accurate than those of semi-supervised approaches, which
are, in turn, better than unsupervised ones. However, the algorithms also depend on the quality of
the labels, and this does not guarantee that the results are more interesting, or that the algorithms
return fewer patterns/smaller models [SA12c].
Constraints The most used way to represent user expectations is through the definition of constraints
[Bay05]. Essentially, constraints are filters on the data or on the results that capture application
semantics and allow users to somehow control the search process and focus the algorithms on
what is really interesting. There are many types of constraints, from simple constraints that limit
the items appearing in patterns [SVA97], to more complex constraints requiring that patterns
conform to a regular expression [GRS99].
The work on constrained pattern mining is the most extensive and widely used, so we describe
it in more detail in the next section. We analyze the different types of constraints, as well as
their properties and the strategies for their incorporation in pattern mining. We also propose a new
framework to describe constrained pattern mining algorithms based on three dimensions:
constraint categories, constraint properties and data sources.
Graph-based Models Graph-based representations are a valuable and more expressive source of
domain knowledge, since they are able to capture the conceptual structure of the domain, and model,
in a more intuitive (and visual) way, the existing concepts and relations. Examples of graph-based
representations are taxonomies (or concept hierarchies), ontologies, Bayesian and Markov networks.
• Taxonomies are hierarchies of concepts, that can be seen as directed acyclic graphs (DAG),
containing the is-a relations existing between the concepts of the database. They have been
used in pattern mining in several ways: to enrich concepts in data with their ancestors in the
taxonomy, and therefore avoid redundant processing and duplicates [SA95, AS94]; and to find
patterns in all hierarchical levels, by mining one level each time, and use the results to mine
the next (more specific) level [HF95, MEL01, LM04].
• Ontologies are content theories about the objects, their properties and relations, that are
possible in a specified domain of knowledge, forming the heart of any system of knowledge
representation for that domain [CJB99]. They can be seen as extensions of taxonomies,
since they can represent not only the is-a relations between concepts, but also other types
of relations, hierarchies between relations, and axioms, that constrain the interpretation of
concepts [SHB06].
In a pragmatic view, an ontology mainly defines a directed graph, with concepts represented by
nodes and relations by edges, which can be efficiently traversed by domain-independent search
algorithms.
The use of ontologies in data mining with the purpose of finding more interesting results
is recent, and a great part of the existing works are ad-hoc applications to specific problems.
They have been used as a pre-processing step, to categorize and enrich data [KLSP07], or
as a post-processing step, to filter patterns or association rules based on the relations of
their concepts in the ontology and on user defined constraints [MGB08]. There are also some
approaches that are able to use ontologies to influence the discovery process itself, either
by defining a set of constraints based on the ontology and using those constraints to avoid
generating invalid [Ant07, Ant09b] or uninteresting candidates (that are too distant from each
other) [ME09b, ME09a], or by using it to replace instances by corresponding concepts and
using the relations to grow only valid patterns [JLL07].
• Bayesian networks encode the joint distribution over a set of attributes, and provide well-
understood inference mechanisms that ease the computation of the probability of arbitrary
events (in this case, combinations of concepts) in the network [JN07]. Bayesian networks
are an easily interpretable alternative language for expressing background knowledge, and are
used in frequent pattern mining to find whether the discovered knowledge is entailed by the
previously available knowledge [JS04, JS05].
• Markov Logic Networks (MLN) are one recent and promising example of a graph-based model
that can roughly be seen as Bayesian networks with weights [Dom07]. They are able to model
all possible worlds, with the corresponding dependencies, probabilities and weights, and the
probability of each world depends on the sum of the product of the weight of each formula
(combination of concepts and relations) with the number of corresponding instantiations that
are true in that world.
Markov networks are a powerful representation for joint distributions, but learning them from
data is extremely difficult, and therefore they have not been widely used. The Alchemy sys-
tem [KSR+07] includes inference and learning algorithms for MLNs, and has been used for
knowledge-rich data mining in several domains [DKP+06b], like information extraction, link
prediction, entity resolution and social network analysis.
Table 4.1 summarizes the analysis of the advantages and disadvantages of each of the above domain
knowledge representations.
It is important to note that existing approaches for each kind of representation are interrelated and
cannot be perfectly separated. One reason is that there are similar forms of knowledge formalization,
and therefore strategies for one may serve for another, with small adaptations. Another
reason is that it is possible to reformulate the problem of mining with one type of knowledge as
a similar problem with another type of knowledge. For example, we can define constraints from most
other knowledge representations, like ontologies, and in this manner it is possible to incorporate
ontologies using constraints and constrained algorithms.
4.3 Constrained Pattern Mining: Problem Definition
The oldest and most studied constraint in pattern mining is the minimum support threshold [AS94],
which states that, to be interesting, a pattern must occur more often than the given threshold. In
fact, what we call traditional pattern mining corresponds to the discovery of frequent itemsets from data.
Therefore, the minimum support is not usually considered as a constraint, but as a strong measure that
should be the basis of all other pattern mining approaches. In this sense, constrained pattern mining is
perceived as the use of constraints beyond the minimum support, i.e. the discovery of frequent itemsets
from data that satisfy some constraint.
In a normal constrained problem, we are dealing with one single table. In this sense, we follow the
notation used in previous chapters, but define the main concepts for a traditional single table environment.
Formally, let I = {i1, i2, . . . , im} be a set of distinct literals, called items. A subset of items is denoted
as an itemset. A superset of an itemset X is also an itemset, containing all items in X and more. The
support of an itemset X is its number of occurrences in the dataset. In this context, an itemset is frequent
Table 4.1: Advantages and disadvantages of the different forms of domain knowledge representations.

Human Interactions
Advantages:
• Facilitate interpretation and evaluation;
• Provide traceability of results;
• Easy to re-try with different parameters;
• Results are in accordance with user expectations;
• No need to express knowledge beforehand.
Disadvantages:
• Users do not always know what they want;
• Labor intensive for complex domains;
• Not easy to re-use in different domains;
• Users must learn how to interact with the system;
• There is no interface perfect for every user.

Annotations
Advantages:
• Results are more accurate, even for small fractions of labels;
• Results are more likely to be in accordance with the context;
• Normally, the more labels, the better the results.
Disadvantages:
• Labeling all data is too expensive and time consuming;
• Not effective for unbalanced datasets;
• The choice of seeds may negatively influence the results;
• Labels may be wrong;
• Using only one label is limited, and using multiple labels is not trivial.

Constraints
Advantages:
• Constraints capture application semantics;
• Allow the user to control the mining process;
• Reduce the number of results;
• Increase the efficiency of the algorithms;
• Improve the interpretability of results.
Disadvantages:
• Restricting too much leads to a simple hypothesis testing;
• Constraining too little leads to an explosion of results and less efficiency;
• More complex constraints are not trivially incorporated into the algorithms.

Graph-based Models
Advantages:
• More expressive power and easy to extend;
• Formally represent the experts' view of the domain;
• In general, are more intuitive representations of concepts and relations;
• Results more interesting according to the model;
• Independent from mining methods (as opposed to constraints).
Disadvantages:
• More computational complexity;
• Need for a graphical notation that can be used by mining methods;
• Algorithms must deal with multiple relations and mappings for the same concepts;
• There may be multiple models for the same domain, and the choice may influence the results;
• The more complex the models, the more difficult to understand them.
if its support is no less than a predefined minimum support threshold, σ ∈ [0, 1]: sup(X) ≥ σ ×N , with
N the total number of transactions in data.
Definition 9. A constraint C is a predicate on the powerset of I [PHL01], i.e. C : 2^I → {true, false}.
An itemset X satisfies a constraint C if C(X) = true.
A pattern corresponds to a frequent itemset that satisfies the constraint C, i.e. sup(X) ≥ σ × N ∧
C(X) = true. Given σ and C, the problem of constrained frequent pattern mining is to find all
patterns in a dataset that satisfy the imposed constraint.
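The problem just defined can be illustrated with a naive generate-and-test sketch in Python (the dataset and the item constraint below are made up, and practical algorithms push the constraint into the search rather than filtering at the end):

```python
from itertools import combinations

# Naive sketch of constrained frequent pattern mining: enumerate all
# candidate itemsets, keep those with sup(X) >= sigma * N that also
# satisfy a user-supplied constraint C. Data and constraint illustrative.

def constrained_patterns(transactions, sigma, constraint):
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    patterns = []
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            x = set(cand)
            support = sum(1 for t in transactions if x <= t)
            if support >= sigma * n and constraint(x):
                patterns.append((frozenset(x), support))
    return patterns

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
# item constraint: every pattern must contain item "a"
found = constrained_patterns(data, sigma=0.5, constraint=lambda x: "a" in x)
```

Here {b, c} is frequent (support 2 out of 4) but is filtered out because it violates the constraint.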
4.4 A new Framework for Constrained Pattern Mining
In this section we gather the different constraints proposed in the literature and analyze them in terms
of their semantics and properties. We also examine closely the existing strategies for their incorporation
into pattern discovery, and how these constraints and strategies are adapted to different types of data
sources. In this sense, we propose the framework for constrained pattern mining presented in Fig. 4.1. This
framework is a classification scheme to organize and analyze constrained pattern mining algorithms, based
on three different perspectives that influence the choice of the strategies to use: constraint categories,
constraint properties and data sources.
Figure 4.1: A framework for constrained pattern mining.
Categories: According to the semantics of the constraints, we can divide them into a set of different
categories. These categories are defined based on what is being constrained. As an example, we
may apply constraints over the items, the values of the items, the relations among items, etc.
Properties: Apart from the categories, constraints can be categorized by a set of properties, according
to their behavior when adding items to or removing items from an itemset. For example, there are
constraints that, once violated by one itemset, are always violated by any of its supersets (such
constraints are called anti-monotonic). These behaviors, or properties, allow us to define and apply more
generic and efficient strategies in order to introduce these constraints into the discovery process.
Data Sources: The nature of the data sources may also influence the constraints and the strategies that
may be used. Data can be tabular or multi-relational, and the source can be dynamic or static.
The challenges introduced by these more complex types of data require, for example, the definition
of new types of constraints, or/and the nullity of some assumptions, like the persistency of data.
Associated with each constraint are also the strategies used to incorporate it. These strategies
depend on the category and properties of the constraints, but also on the data source. There are many
ad hoc strategies, designed for some specific constraints, and some more generic approaches, designed for
constraints following a specific property. However, there is no algorithm able to efficiently incorporate all
types of constraints, and the great majority is designed for tabular and static data sources only.
In the next sections we describe the framework in more detail.
4.5 Constraint Categories
According to the semantics and form of constraints, they can be divided into the following categories1. Let
P be a pattern, and P.attr be the value of all elements of the pattern P for attribute attr (e.g. P.price
corresponds to the price of all products in P ):
1. Content constraints: These constraints correspond to filters over the content of the discovered
patterns. They are conditions over the value of the items that would appear (or not) in the resulting
patterns. They try to capture the semantics of the application and introduce it into the mining
process.
(a) Item constraints: They express conditions on the presence or absence of some items in the final
patterns [SVA97]. These were the first constraints proposed beyond the minimum support.
For example, a school teacher may be interested in patterns relating his discipline with others:
maths ∈ P . Thus, these constraints allow for the discovery of patterns that relate some specific
known items with others unknown. From another perspective, a school teacher may also be
interested in patterns containing only the students of his discipline, instead of all students:
P ⊆ {s1, s2, ..., sn}, with si the students in question. So, they also allow for the discovery of
unknown frequent relations between the known elements.
(b) Value constraints: These constraints assume that a value is associated with each item, and
limit this value for every element of a pattern [NLHP98].
For example, a market customer may only be interested in products whose price is less than
a specific value. In this manner, the constraint P.price ≤ e 100 will only return patterns of
products with price not exceeding e 100.
Another interesting application of value constraints is weighted pattern mining, where items
have a weight that shows their importance. We can establish a weight constraint P.weight ≥ w
to indicate to the algorithm that we are only interested in itemsets with a weight higher than
w [YL05].
(c) Aggregate constraints: These constraints also assume that a value is associated with each item,
and that several aggregate functions (e.g. sum, average, max, min) can be used over these
values [NLHP98]. An aggregate constraint limits the value of aggregate functions over the set
of items in the patterns.
For example, a marketing analyst may be interested in products for undergraduate students,
and therefore the maximum age in the target audience of products in each pattern should be
18 years (max(P.age) ≤ 18). Or he can be interested in sets of products with an average price
no higher than a given value (avg(P.price) ≤ v).
Formally, aggregate functions can be divided into three categories [HKP11]: distributive,
algebraic and holistic. Distributive functions can be computed in a distributed manner, i.e.
applying them to each partition and then applying them to those partition results is the
same as applying them to all data without partitioning (e.g. min, max, count and sum).
Algebraic functions can be obtained by applying some algebraic operator to two or more
results from distributive functions (e.g. avg can be computed as sum/count; other examples
include min_N, max_N, variation and standard deviation). Due to these properties, distributive
aggregate constraints are usually easier to push into PM algorithms [NLHP98]. Algebraic
aggregate constraints need more attention, because we can only confirm that they are satisfied
after computing the sub-functions over all elements in the itemsets. Even so, some efficient
techniques were proposed to deal with such constraints [PHL01, PH02, WJY+05, ZCD07].
Finally, aggregate functions can also be holistic, meaning that there is no algebraic function
that characterizes their computation (e.g. mode and median). Holistic aggregate constraints
are more difficult to push, i.e. it is difficult to create a generalized strategy to push them.
1 In this work we adopt and extend the notation presented by Ng et al. [NLHP98].
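The three categories of aggregate functions can be illustrated with a small sketch (the item prices below are made up):

```python
# Illustrative checks of aggregate constraints over item values.
# Distributive functions (sum, min, max, count) can be maintained
# incrementally while an itemset grows; algebraic ones (avg) are derived
# from distributive parts (sum / count); holistic ones (median) need
# all the values at once. Prices are made up.

price = {"a": 10.0, "b": 80.0, "c": 40.0}

def satisfies_sum(itemset, limit):
    # distributive: sum(P.price) <= limit
    return sum(price[i] for i in itemset) <= limit

def satisfies_avg(itemset, limit):
    # algebraic: avg(P.price) <= limit, computed as sum / count
    values = [price[i] for i in itemset]
    return sum(values) / len(values) <= limit

def satisfies_median(itemset, limit):
    # holistic: median(P.price) <= limit, needs all the values
    values = sorted(price[i] for i in itemset)
    mid = len(values) // 2
    median = values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2
    return median <= limit
```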
2. Structural constraints: These constraints define conditions on the content and on the structure
of data [Ant09a].
(a) Length constraints: They specify a limit on the length of the patterns, i.e. on the maximum
or minimum number of items in each pattern.
For example, a sales analyst may be interested in patterns with at most 5 products (patterns
with more items usually have lower support, and are not significant): |P | ≤ 5. This accelerates
the mining process and also limits the number of results.
(b) Sequence constraints: The most studied structural constraints have been represented as regular
expressions.
Formal languages, such as regular and context free languages, provide a simple, natural syntax
for the specification of sequences, and have sufficient expressive power for specifying a wide
range of constraints [GRS99, AO02, PHW07]. Enforcing regular expressions (RE) into the
mining process minimizes the computational cost, by focusing only on sequences that can
potentially be in the final answer set. RE constraints are specified as RE over the set of items
using the established set of regular expression operators (like disjunctions). They specify the
possible (or the most interesting) combinations of items, and the order they should have. A
sequential pattern satisfies the constraint if it is accepted by the equivalent deterministic finite
automaton or push-down automaton.
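A minimal sketch of this acceptance test, with a hand-built, purely illustrative DFA for the regular expression a (b | c)* d:

```python
# Checking a sequence against an RE constraint by running it through an
# equivalent deterministic finite automaton. The DFA below is hand-built
# and illustrative; it accepts sequences matching a (b | c)* d.

dfa = {
    ("start", "a"): "middle",
    ("middle", "b"): "middle",
    ("middle", "c"): "middle",
    ("middle", "d"): "accept",
}
accepting = {"accept"}

def satisfies_re(sequence):
    state = "start"
    for symbol in sequence:
        state = dfa.get((state, symbol))
        if state is None:       # no transition: constraint violated
            return False
    return state in accepting
```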
(c) Network constraints: One promising type of constraints are network constraints, which are
defined based on the characteristics of domain knowledge in the form of a network (a graph-
based representation like taxonomies and ontologies). These networks model the concepts
existing in the domain, as well as the (hierarchical and possibly non-hierarchical) relations
between these concepts. Each item in the database can be mapped to the corresponding
concept in the network. This means that we can restrict both the concepts associated with
each item, as well as the relations between items in an itemset [Ant08, Ant09b].
i. Conceptual constraints: They express conditions on the presence or absence of some con-
cepts in the patterns. One concept is said to be present in an itemset if it contains some
item that is mapped to that concept in the network. These constraints are like item constraints,
but instead of specific items, we are looking for the specific concepts to which
they are mapped. In this sense, we can specify, for example, that one or one set of con-
cepts must (or cannot) be present in patterns, or restrict the possible concepts to one
specific accepted (or unaccepted) set. Most content constraints (and others, like sequence
constraints) can also be applied to concepts instead of items.
ii. Taxonomical constraints: These constraints establish restrictions based on the family ties
among concepts, defined by some taxonomy [Ant08]. We can require, for example, that
the concepts in a pattern belong to the family of a specific concept, or that they belong to the
same family (same family constraints). If there are multiple hierarchies, we may want to
find patterns whose concepts belong not to the same family, but to a closer family
(close family constraints). We can also define constraints to require that the concepts in
patterns must belong to some or to the same hierarchical level (level constraints).
iii. Relational constraints: If the network also models non-taxonomical relations, we can re-
strict the type and number of relations between items in each pattern. Two items are
related if the concepts for which they are mapped are related in the network. The sim-
plest relational constraint is to limit the presence or absence of some relations in patterns.
But we may also create constraints based on the connectivity between items. For example,
(1) all items must be related to at least one other item (weakly connected), or to all others
(strongly connected); and (2) there must be a chain of relations between items (softly
connected).
iv. Distance constraints: Distance constraints limit the number of indirect relations that
connect two concepts (and therefore two items) [ME09b]. For concepts related in more
than one way, the distance is the smallest one, i.e. the lowest number of edges between
two concepts. These constraints allow us to define to what extent the user considers
two items related, and therefore how important that relation is. As an example, we can
consider that relations with more than three indirections are not important, and therefore
we can impose a maximum distance between concepts. Distance constraints also allow us
to guarantee that the items in each pattern share the same context, by being all related.
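A possible sketch of a maximum distance constraint, computing distances by breadth-first search over a small, illustrative concept network:

```python
from collections import deque

# Distance between two concepts = fewest edges between them in the
# (undirected view of the) domain network. Graph is illustrative.

network = {
    "milk": {"dairy"},
    "dairy": {"milk", "cheese", "food"},
    "cheese": {"dairy"},
    "food": {"dairy", "fruit"},
    "fruit": {"food"},
}

def distance(a, b):
    """Breadth-first search for the shortest path between two concepts."""
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in network[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")

def satisfies_max_distance(itemset, limit):
    # every pair of concepts must be within the given distance
    return all(distance(x, y) <= limit
               for x in itemset for y in itemset if x < y)
```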
(d) Temporal constraints: These constraints restrict the resulting patterns based on the temporal
dimension. They allow us to find temporal and sequential patterns, analyze their evolution
over time, limit the duration and gap between events, etc. [SA96, Zak00b, PHW02, AO03].
Temporal constraints are usually defined in databases where each transaction has a timestamp,
and each pattern is a frequent ordered sequence of time stamped itemsets.
i. Duration constraints: They limit the time between the oldest and newest event in the
pattern, i.e. they indicate that the timestamp difference between the first and the last
transactions in the pattern must be longer or shorter than a given period.
For example, for short-term pattern analysis, we may impose a limit of at most 3 months,
and for long-term analysis, we may say that we are interested in patterns where the
duration is at least 1 year.
ii. Gap constraints: Gap constraints define the maximum or minimum time interval between
consecutive events in each pattern, i.e. the timestamp difference between every two adja-
cent transactions must be longer or shorter than a given gap value.
For example, in a medical domain, doctors may specify that the maximum gap between
two exams must be 6 months to obtain relevant patterns, so that they help on a correct
diagnosis or treatment.
iii. Periodical constraints: These constraints define a periodicity in which patterns should
hold. This concept was first introduced by Ozden [ORS98] for association rules, in which
the time dimension is divided into equally spaced user-defined time intervals, and a rule
is said to be cyclic if it holds for a fixed periodicity along the whole length of the sequence
of time intervals. This allows for the discovery of seasonal patterns.
For example, in an educational domain, teachers may be interested only in patterns that
occur every semester during exams (low number of students in classes, or high affluence
to office hours).
When mining temporal databases, one can also consider other types of constraints, like lifespan
and growth constraints. Lifespan constraints impose a limit on the lifetime of items in patterns,
i.e. they define a maximum or minimum time interval between the first and the last appearance
of each item in the database. And growth constraints were proposed to capture emerging
patterns and their evolution over time [DL99]. A pattern is emergent if its support increased
more than a given threshold in the most recent time interval. Thus, a growth constraint
defines a limit on the growth rate of patterns (i.e. the ratio of the support of the pattern in
the most recent period over its support in the previous time period). Convergent and divergent
constraints look for patterns whose period shrinks or grows along time [BA14].
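Duration and gap constraints can be sketched as simple checks over the ordered timestamps of a candidate pattern (timestamps below are illustrative, in days):

```python
# Checking duration and gap constraints over a time-stamped sequence of
# itemsets; only the (sorted) timestamps matter here. Data is made up.

def satisfies_duration(timestamps, max_duration):
    # time between the oldest and the newest event
    return timestamps[-1] - timestamps[0] <= max_duration

def satisfies_gap(timestamps, max_gap):
    # time between every two adjacent events
    return all(b - a <= max_gap for a, b in zip(timestamps, timestamps[1:]))

visits = [0, 50, 120]   # days of three transactions of one patient
```

With `visits`, a 180-day duration limit holds, but a 60-day maximum gap is violated by the 70-day interval between the last two events.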
(e) Other : Other structural constraints are being proposed in domains like graph pattern mining
[ZYHY07] (density ratio, density, diameter, edge and vertex connectivity), defined according
to the number of edges and vertices on the graphs.
3. Interestingness measures: Interestingness measures are constraints that impose quantitative
conditions over the set of items in the pattern or rule. They rank the results by their usefulness
and utility, according to some user-chosen function. Usually, only the results that surpass a user
defined threshold are considered interesting, and therefore only those are presented to the final user.
The choice of the best interestingness measure, as well as the choice of the threshold that separates
interesting from uninteresting patterns, are two non-trivial problems of pattern mining that may
have a great impact on the quality of the results.
The best known interestingness measure is the minimum support, which has been used since the
first proposal of pattern mining [AS94]. Establishing a minimum support threshold allows us to draw
a limit on the support beneath which we consider itemsets infrequent, and therefore not interesting
information that can be discarded. It gives pattern mining several important advantages [Bay05],
since it preserves the discovery of unknown and important patterns and improves the efficiency of
the algorithms during mining. However, it suffers from some limitations that are more evident
when we start dealing with larger and denser datasets. The main limitation is that the results
may be redundant and numerous; they are also not user-oriented, and thus may not
correspond to user expectations.
Other interestingness measures have appeared trying to improve the quality of the results that are
returned to the user. Most of them are still not user oriented, but provide a good way of reducing
the number of results. Examples are rule based measures, such as confidence [AS94], correlation
(including lift, cosine, χ2 and all confidence) [AS94, BMS97, HCXY07] and the improvement of
a rule [Bay05, BA99]. They measure the interestingness of association rules, which are generated
based on the patterns found by pattern mining.
Most existing interestingness measures, with the exception of the minimum support, are used only
to evaluate the resulting patterns. However, Bayardo [Bay05] showed that some of these measures
can be rewritten in a form that is composed of elements of other forms of constraints, so that they
can be used during the mining process.
Some studies have also been conducted to find interesting and unexpected patterns, based on
what is already known. Padmanabhan and Tuzhilin [PT98] use probability-based belief to describe
user confidence in unexpected rules. Wang and Lakshmanan [WJL03] are able to capture the
unexpectedness and strength of a rule. Jaroszewicz and Simovici [JS04] define the interestingness
of an itemset as the absolute difference between its support in the data and its expected support.
The user defines a minimum threshold, and the reasoning behind it is that, if the difference is small,
the itemset is uninteresting, since it is already known [JS05]. More recently, Cao et al. [CLZ07]
introduced knowledge actionability to measure the ability of a pattern to be converted to a concrete
action in the real world.
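As an illustration, two of the rule-based measures mentioned above, confidence and lift, can be computed directly from itemset supports (the counts below are made up):

```python
# Rule-based interestingness measures for a rule A -> B, computed from
# absolute supports: sup(A ∪ B), sup(A), sup(B), and N transactions.
# All counts below are illustrative.

def confidence(sup_ab, sup_a):
    # fraction of transactions with A that also contain B
    return sup_ab / sup_a

def lift(sup_ab, sup_a, sup_b, n):
    # ratio of the observed joint support to the support expected
    # if A and B were independent; lift = 1 means no correlation
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))
```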
4.6 Constraint Properties
There are several different constraints, which hinders the creation of algorithms able to incorporate
them without being specific to some particular constraint. Fortunately, studies show that constraints have
some properties that allow for efficient and generic strategies to prune the search space and improve the
performance of the algorithms.
These “nice” properties [PH02] are:
1. Anti-monotonicity: A constraint is said to be anti-monotone if and only if, whenever an itemset X
violates it, so does any superset of X. Also, a disjunction or a conjunction of anti-monotonic
constraints is also an anti-monotonic constraint.
For example, assume an item constraint saying that all items must belong to an accepted set of
items V (X ⊆ V ). If an itemset X violates it, it means that it contains some item i /∈ V . All
supersets of X will have that item, and therefore all supersets will violate the constraint.
The best known and simplest example of an anti-monotone constraint is the minimum support
threshold [AS94]. As an anti-monotone constraint, if an itemset is infrequent, so are all of its supersets.
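A minimal sketch of how this property is exploited: candidates are generated only from itemsets that satisfied the anti-monotone constraint (here, the minimum support), and then tested; by anti-monotonicity, no frequent itemset is missed. The transactions below are illustrative.

```python
# Level-wise (Apriori-style) sketch exploiting anti-monotonicity:
# itemsets that violate the minimum support are never extended, so the
# search space is pruned. Transactions are made up.

def frequent_itemsets(transactions, min_sup):
    def support(x):
        return sum(1 for t in transactions if x <= t)
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    result = list(level)
    while level:
        # build size k+1 candidates only from surviving size-k itemsets,
        # then test their support
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_sup]
        result.extend(level)
    return result

transactions = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"d"}]
freq = frequent_itemsets(transactions, min_sup=2)
```

Here the infrequent items c and d are pruned at the first level, so no superset containing them is ever counted.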
2. Monotonicity: A constraint is said to be monotonic if and only if, whenever an itemset X satisfies it, so
does any superset of X [GLW00]. Conjunctions and disjunctions of monotonic constraints are still
monotonic, and monotonic constraints can be seen as the negation of anti-monotonic constraints.
Following the example above, imagine now an item constraint defining that every pattern must
contain at least the items from a set V (V ⊆ X). If an itemset violates it, i.e. does not contain
all the required items, a superset can satisfy it by introducing the missing items of V. However, if an itemset
satisfies the constraint (i.e. it contains all the required items), all supersets also satisfy it, because
they contain the same items and more.
3. Succinctness: In its essence, a constraint is succinct if it is possible to enumerate all possible
patterns, based on the powersets of the elements of the alphabet of items [NLHP98].
A simple example is the value constraint X.price ≤ €100. It is a succinct constraint because
we can select from the alphabet all items with price ≤ €100 using the selection predicate: I1 =
ρ_{price ≤ €100}(Items), and the itemsets that satisfy the constraint are exactly those in the strict
powerset of I1: 2^I1. Another example is the item constraint {a} ⊆ X. We can select all items from
the alphabet that are not a, using the predicate: I2 = ρ_{item ≠ a}(Items), and say that all itemsets
resulting from the powerset of I2 do not contain a, and cannot be patterns. It is a succinct constraint
since we can define that the itemsets that satisfy it are, exactly, 2^Items − 2^I2 (the powerset of all
items in the alphabet, except the powerset of I2).
Formally, a succinct constraint is defined as follows:
• An itemset X ⊆ Items is a succinct set if it can be expressed as ρ_p(Items), for some selection
predicate p;
• SP ⊆ 2^Items is a succinct powerset (SP) if there is a fixed number of succinct sets
I1, I2, ..., Ik ⊆ Items, such that SP can be expressed as unions and differences of the strict
powersets of I1, I2, ..., Ik;
• A constraint is succinct if the set of itemsets that satisfy it is a succinct powerset.
A succinct constraint can be considered a special case of conjunctions of anti-monotonic and mono-
tonic constraints.
Another characteristic of these constraints is that we can easily define a function that generates
the members of a satisfying itemset: the member generating function (or MGF). In our first example,
the MGF is simply {X | X ⊆ I1 ∧ X ≠ ∅} (all non-empty subsets of I1). In the second example,
the MGF is {X1 ∪ X2 | X1 = {a} ∧ X2 ⊆ I2} (the union of itemset {a} with all subsets of
I2).
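A sketch of a member generating function for the first example above (X.price ≤ €100), enumerating the satisfying itemsets directly instead of generating and testing; the alphabet and prices below are made up:

```python
from itertools import combinations

# MGF for the succinct constraint X.price <= 100: the satisfying
# itemsets are exactly the non-empty subsets of the items that
# individually satisfy the selection predicate. Prices are made up.

price = {"a": 20.0, "b": 150.0, "c": 99.0}

def nonempty_subsets(items):
    s = sorted(items)
    return [set(c) for r in range(1, len(s) + 1) for c in combinations(s, r)]

i1 = {i for i in price if price[i] <= 100}   # the succinct set I1
satisfying = nonempty_subsets(i1)            # members generated directly
```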
4. Prefix-monotonicity: A constraint is prefix-monotone2 if there is an order of items that allows
the algorithms to treat it as anti-monotonic or monotonic. By fixing an order on items, each
transaction can be seen as a sequence, and therefore we can use the notion of prefixes and suffixes,
as the first or last items in the ordered transaction, respectively.
A constraint is prefix-monotone if it is prefix anti-monotonic or prefix monotonic. Formally, a
constraint C is prefix anti-monotonic (resp. prefix monotonic) if there is an order R over the
set of items such that, assuming each itemset X = i1i2...in is ordered according to R, whenever
an itemset X violates (resp. satisfies) C, so does any itemset with X as prefix
(X′ = X ∪ {in+1} = i1i2...inin+1).
For example, an aggregate constraint like C ≡ avg(X) ≥ 20 is neither monotonic, nor anti-
monotonic, nor succinct. But, if we order the items in value-descending order (and assume only
positive values), an itemset X = i1i2...in has a higher average than any extension X′ = i1i2...inin+1.
This means that, if X violates C, so do all itemsets with X as prefix. Thus, C is prefix anti-monotonic.
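A simplified sketch of this idea, exploring only the single value-descending chain of prefixes and pruning as soon as the average drops below the threshold (item values are illustrative):

```python
# Exploiting prefix anti-monotonicity for avg(X) >= threshold: with
# items in value-descending order, extending a prefix can only lower
# the average, so a violating prefix is pruned immediately. This sketch
# is simplified to one chain of prefixes; values are made up.

value = {"a": 50.0, "b": 30.0, "c": 10.0, "d": 5.0}
order = sorted(value, key=value.get, reverse=True)  # value-descending

def prefixes_satisfying_avg(threshold):
    kept, prefix, total = [], [], 0.0
    for item in order:
        prefix = prefix + [item]
        total += value[item]
        if total / len(prefix) < threshold:
            break               # prune: no longer prefix can satisfy C
        kept.append(list(prefix))
    return kept
```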
5. Mixed Monotonicity: Leung and Sun [LS12] recently proposed the concept of mixed monotone
constraints, to define constraints that are both anti-monotonic and monotonic at the same time,
for different groups of possible values (positive and negative).
Formally, let Item denote the set of items, which is divided into two disjoint groups based on the
sign of their attribute values: ItemP , the set of items with positive value (including 0), and ItemN ,
with negative value. Then, a constraint is mixed monotone if, for any itemset X: (a) whenever
X satisfies C, all supersets of X formed by adding items from one specific group, also satisfy C
(monotonic for that group); and (b) whenever X violates C, all supersets of X formed by adding
items from another group, also violate C (anti-monotonic for the other group).
This property was proposed in particular for aggregate constraints using the sum function, where
items may contain negative numerical values. The aggregate constraint sum(X) ≥ v, for example,
2Prefix-monotone constraints were first proposed under the name of convertible constraints [PH02]. Since we can convert other constraints using several approaches (like using relaxations), we use the term prefix-monotone to designate the constraints that are convertible due to the order of items.
is monotonic for positive values (including zero), and anti-monotonic for negative values. The aggre-
gate constraint sum(X) ≤ v is, on the contrary, anti-monotonic for positive values and monotonic
for negative ones.
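A small check of this property for C ≡ sum(X) ≥ v (the value v = 10 and the itemsets are illustrative): adding items from ItemP preserves satisfaction (monotone on that group), while adding items from ItemN preserves violation (anti-monotone on the other group).

```python
# Sketch of mixed monotonicity for C = sum(X) >= v.

def C(itemset, v=10):
    return sum(itemset) >= v

satisfying = [8, 5]                      # sum = 13, satisfies C
assert C(satisfying)
assert all(C(satisfying + [p]) for p in [0, 1, 7])      # ItemP extensions

violating = [4, 2]                       # sum = 6, violates C
assert not C(violating)
assert all(not C(violating + [n]) for n in [-1, -5])    # ItemN extensions
```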
Tables 4.2 and 4.3 associate these properties with the content and structural categories, respectively.
Table 4.2: Content constraints and respective properties (∗ means it depends on the function). [Table body garbled in extraction; it marks, for each constraint and each applicable operator θ, which of the properties AM, M, Succinct, PAM, PM and Mix hold. Rows cover the content categories: Item constraints (x θ P, X θ P and P θ X, with θ ∈ {∈, ∉, ⊆, ⊈}), Value constraints (min(P) θ v and max(P) θ v, with θ ∈ {<, ≤, >, ≥}), Aggregate constraints (sum(P) θ v over positive or negative values, sum(P) θ v over positive and negative values, which is only Mix, and avg(P) θ v), Algebraic constraints (f(P) θ v, for prefix decreasing or prefix increasing f), and the Holistic constraint median(P) θ v.]
Table 4.3: Structural constraints and respective properties. [Table body garbled in extraction; it marks, for each constraint and each applicable operator θ, which of the properties AM, M, Succinct, PAM, PM and Mix hold. Rows cover the structural categories: Length (|P| θ v), Sequence (regular expressions, with no property), Conceptual – Taxonomical (like Item constraints) and Relational (same-family, close-family and level(P) θ l), Network – Relational (weakly, softly or strongly-connected, like Item constraints) and Distance (distance(P) θ v), and Temporal – Gap (gap(P) θ v) and Duration (duration(P) θ v).]
4.7 Data Sources
Despite the advances in constrained pattern mining, the great majority of existing work is only concerned
with tabular data. However, the rapid growth of data, in both quantity and variety of data
structures, has brought new requirements to data mining techniques.
On the one hand, in many real-world applications data appear in the form of continuous data streams, as
opposed to traditional static datasets. In this sense, data sources can be:
Static: When we are in the presence of static data sources, we can make some assumptions over these
data that ease the definition and incorporation of constraints into the mining algorithms. These
assumptions are: (1) all data are available from the beginning, and therefore we can know in
advance, for example, what are all possible items in the dataset (the alphabet); (2) no new data
will appear, and therefore decisions are generally taken based on all data and are persistent. For
example, after reading the available data, infrequent items are effectively infrequent and can be
eliminated. If these items could appear later, they could become frequent, which would invalidate
the former decision of their deletion; (3) since all data are available, we can usually make several
passes over data; and (4) there are typically fewer memory and time limitations.
Continuous: Data sources are continuous, or data streams, if they are continuously being generated
and collected [MM02]. The nature of streaming data makes the mining process different
from traditional data mining in several aspects, as discussed in Section 3.2: (1) each element should
be examined at most once and as fast as possible; (2) memory usage should be limited, even
though new data elements are continuously arriving; and (3) the results generated should always be
available and up to date. This means that only the information strictly necessary to avoid
losing patterns should be kept [LLH11], and the rest must be deleted. This introduces errors in
frequency counting, and these errors should be kept as small as possible. Moreover, none of the
assumptions valid for static datasets can be made: data are not all available a
priori, and conclusions may not be persistent (items reported as infrequent may become frequent
later).
On the other hand, this new era has made it necessary to create new and more efficient ways to store and
analyze these data. In fact, the data storage paradigm has changed, from operational databases to data
repositories that make it easier to analyze data and to find useful information.
Despite this change, and the fact that most real-world applications involve multiple tables, and
eventually multiple data sources, existing algorithms for constrained pattern mining are only able to deal
with a single data table. Conversely, existing algorithms for mining multiple tables (described
in Chapter 2) are not able to deal with constraints.
As noted in Section 2.3, dealing with multiple tables introduces a set of new challenges to pattern
mining algorithms, and even more to constrained mining, due to the nature of the data. Multi-relational
models contain not only transactional data (the occurring events or transactions) but also non-transactional
data (the characteristics of entities) [SA13c]; thus, when mining these models, we are mining both types
of data. The problem is that existing constraints were proposed for transactional data (the goal is to
constrain the co-occurrences of entities based on their characteristics, not to constrain the co-occurrences
of the characteristics themselves), and this requires the adaptation of both the constraints and the algorithms.
Despite these difficulties, mining these multi-relational models is also an opportunity for constrained
mining. The existing relations between tables may lead to the definition of new constraints based on the
structure of the models, so that we can guide the algorithms through the relations that are most
interesting from the user's point of view.
New and more complex data types have also appeared and become popular, such
as social networks and other graph-based models. Research on data mining over these data sources
is increasing, but there is still much work to do regarding the incorporation of constraints into the mining
of these sources.
4.8 Constrained Pattern Mining Algorithms
Enforcing these constraints in pattern mining is not trivial, and depends heavily on the constraints
in question.
Performing an exhaustive search is not a viable solution, mostly due to the size of the search space. A
naive approach starts by running an existing traditional pattern mining algorithm, and only then tests
the constraints and filters out the patterns that do not satisfy them. However, most of the time, the itemsets
that satisfy some constraint are far fewer than the ones resulting from a traditional pattern mining run.
Therefore, this first step is unnecessarily time-consuming in a constrained environment.
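The naive generate-then-filter approach can be sketched as follows (the data, the `mine_frequent` helper and the constraint "a" ∈ X are all illustrative, not from the thesis); the wasteful part is that everything is mined before the constraint is even looked at.

```python
# Toy generate-then-filter: mine all frequent itemsets exhaustively,
# then post-filter with the constraint "a" in X.
from itertools import combinations

def mine_frequent(transactions, minsup):
    """Exhaustive toy miner: all itemsets with support >= minsup."""
    items = sorted({i for t in transactions for i in t})
    result = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = set(cand)
            if sum(s <= t for t in transactions) >= minsup:
                result.append(s)
    return result

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
frequent = mine_frequent(transactions, minsup=2)
constrained = [s for s in frequent if "a" in s]   # the wasteful second step
```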
Several algorithms have been proposed in the literature for the integration of constraints into pattern
mining, some designed for a particular type of constraint, others more general, designed for all
constraints following some “nice” property. Nearly all algorithms were proposed for single transactional
tables, although there are some proposals able to deal with data streams [LK06].
4.8.1 Properties vs. Algorithms
1. Anti-monotonicity: Apriori-like [AS94], pattern-growth [HPY00] and vertical [Zak00b]
methods all use the anti-monotonicity of minimum support to stop exploring itemsets that are not
frequent. Their idea is to start from frequent length-1 itemsets and iteratively find longer frequent
itemsets. Apriori-like methods iteratively generate all candidates (supersets) of current frequent
itemsets, and test them for frequency. Infrequent itemsets are discarded and therefore not used in
the next candidate generation step. Pattern-growth methods recursively grow frequent smaller pat-
terns to longer ones, based on the co-occurrences of items in the database, with no need to generate
all candidates. Infrequent itemsets are also discarded and are not grown. Vertical algorithms first
transform the database into a vertical data format, in which, instead of having a set of items per
transaction id, they have a set of transaction ids per item. The number of ids per item corresponds
to the support of that item, so these methods do not need to scan the database to count
the support of items, nor that of larger itemsets. The strategy is similar to pattern-growth algorithms,
and larger itemsets are formed by intersecting the sets of ids of the corresponding smaller frequent
itemsets.
In this manner, it is fairly intuitive to push other anti-monotone constraints into those approaches:
we can only use the itemsets that satisfy the constraint to generate/grow longer itemsets, i.e. we
can discard all itemsets that do not satisfy the constraint, because their supersets will also violate
it [PH02]. This strategy is the basis of existing algorithms, when in the presence of anti-monotone
constraints [NLHP98, PH00, PH02, BJ05].
Anti-monotonicity, if used actively, can drastically reduce the search space. It is the strongest
property, being the one that allows the algorithms to prune the most, with the least effort, minimizing
the computational cost while maximizing the efficacy of the results. However, it
is not possible to ensure the efficiency of pushing this type of constraint, since it depends on its
selectivity [Bou04], i.e. the rate of itemsets that can be discarded: the less selective a constraint is,
the less efficient the pruning.
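The pruning strategy can be sketched with a toy level-wise search (the prices, transactions and the anti-monotone constraint max(X.price) ≤ 100 are illustrative assumptions of ours): itemsets that violate the constraint are discarded and never extended, exactly like infrequent ones.

```python
# Level-wise search pushing an anti-monotone constraint: candidates that
# are infrequent OR violate the constraint are pruned and never extended.
price = {"a": 20, "b": 150, "c": 60}
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
minsup = 2

def frequent(s):
    return sum(s <= t for t in transactions) >= minsup

def satisfies(s):                        # anti-monotone: max price bound
    return max(price[i] for i in s) <= 100

level = [s for s in (frozenset([i]) for i in price)
         if frequent(s) and satisfies(s)]
answers = list(level)
while level:
    candidates = {a | b for a in level for b in level
                  if len(a | b) == len(a) + 1}
    level = [s for s in candidates if frequent(s) and satisfies(s)]
    answers.extend(level)
```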
2. Monotonicity: In the case of a monotonic constraint, we cannot discard itemsets that violate
it, because its supersets can satisfy it. However, when we find an itemset that satisfies it, we can
automatically generate all possible supersets of that itemset and return those that are frequent,
without further testing for the constraint. Thus, monotonic constraints can also be used to improve
the efficiency of pattern mining, by avoiding multiple unnecessary tests.
The basic strategy is to find the frequent k-itemsets and, for those that satisfy the constraint, skip
the constraint test when generating/growing frequent (k+1)-itemsets [PHL01, BJ05].
It has been shown that the strategy for anti-monotonic constraints is more powerful, since it can
eliminate many more itemsets early than the monotonic strategy [GRS99]. But again, it depends on
the constraint selectivity.
3. Succinctness: A succinct constraint is, at the same time, succinct and anti-monotonic (e.g.
X.price ≤ €100) or succinct and monotonic (e.g. {a} ⊆ X). The way to push them depends
on that:
If the constraint is both succinct and anti-monotonic, we can prune from the beginning the items
that do not satisfy it (i.e. use only the elements of the respective member generating function –
MGF). In our first example, we can discard all items with price higher than €100, because no such
item will satisfy the constraint.
If the constraint is succinct and monotonic, we cannot eliminate items, but we know, from the
MGF, which items and combinations satisfy the constraint. Therefore we can start with the possible
values of the first member, and from there generate candidates by joining these values with the next
member, one by one. In our second example, the corresponding MGF is {X1 ∪ X2 | X1 = {a} & X2 ⊆ I2},
and since the first member must be the element a (it satisfies the constraint by itself), we just
have to join it with the other values from the second member X2 to form the other possible patterns,
with no constraint check.
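Member generation for this second example can be sketched as follows (the alphabet I2 is an illustrative assumption): every candidate is assembled directly from the MGF, so no constraint check is ever needed.

```python
# Sketch of member generation for the succinct monotone constraint
# {a} ⊆ X: candidates come straight from the MGF
# {X1 ∪ X2 | X1 = {a} & X2 ⊆ I2}.
from itertools import chain, combinations

I2 = ["b", "c", "d"]

def powerset(xs):
    return chain.from_iterable(combinations(xs, k)
                               for k in range(len(xs) + 1))

candidates = [frozenset({"a"}) | frozenset(x2) for x2 in powerset(I2)]
assert all("a" in c for c in candidates)   # satisfied by construction
```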
Succinct constraints were first proposed by Ng et al. [NLHP98], as well as an apriori-based al-
gorithm, called CAP (Constrained APriori), implementing the strategies above. Later on, Leung
et al. [LLN02] proposed FPS (FP-tree based mining of Succinct constraints), which uses the same
strategy but with a pattern-growth approach. These strategies are the basis for pushing succinct
constraints [BJ05].
4. Prefix-monotonicity: These constraints may seem straightforward to push into pattern mining
algorithms, since they can be treated as anti-monotonic or monotonic, just by imposing the correct
order of items in all itemsets. However, there is one main difference: one cannot discard all itemsets
that violate the constraint, because an itemset may violate it as a prefix, but still be the suffix of a valid
prefix. For example, the itemset X = {20, 10} does not satisfy the constraint C ≡ avg(X) ≥ 20.
However, the itemset X ′ = {30, 20, 10}, with X as a suffix, satisfies it.
These constraints were proposed by Pei et al. [PHL01], as well as a pattern-growth algorithm, FIC
(Frequent Itemset mining with Convertible constraints)3, with a strategy similar to the algorithm
PrefixSpan [PHMA+01] for sequential pattern mining. In the presence of a prefix anti-monotonic
constraint (FICA), the idea is to keep all frequent itemsets, but only grow the itemsets that satisfy
the constraint (valid prefixes). Apriori algorithms can also adopt this strategy, by keeping
all frequent itemsets (even if they violate the constraint) and only generating candidates with valid
prefixes.
In the presence of prefix monotonic constraints (FICM), all frequent itemsets must be kept too,
but, as soon as some frequent itemset satisfies the constraint (a valid prefix), algorithms do not need
to test any itemset with it as a prefix; they just grow them (i.e. generate all supersets with that prefix)
and return them [PHL01], after confirming their frequency.
3A first draft of this algorithm was proposed in [PH00], under the name CFG (Constrained Frequent pattern Growth).
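The prefix anti-monotonic growth strategy can be sketched as follows (the values and the avg(X) ≥ 20 constraint are illustrative, and support counting is omitted for brevity): every itemset reached is recorded, but only valid prefixes are grown further.

```python
# Sketch of the FICA idea: items in value-descending order; every itemset
# reached is kept, but invalid prefixes (avg < 20) are never extended.
items = sorted([30, 25, 10, 5], reverse=True)   # value-descending order

def grow(prefix, rest, out):
    out.append(prefix)                           # kept, even if it violates C
    if sum(prefix) / len(prefix) < 20:
        return                                   # invalid prefix: never grown
    for j, item in enumerate(rest):
        grow(prefix + [item], rest[j + 1:], out)

explored = []
for j, item in enumerate(items):
    grow([item], items[j + 1:], explored)
```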
An important contribution of this property is the fact that regular expressions in sequential pattern
mining are prefix-monotone constraints. This means that they can be treated with a very similar
strategy. Pei et al. [PHW02] took advantage of this and proposed Prefix-Growth, a pattern-growth
algorithm that recursively grows longer sequences from smaller ones, but only projects sequences
that form a valid prefix. The same authors also presented an overview of constraint-based sequential
pattern mining [PHW07], where they state that Prefix-Growth achieves better performance than
other ad hoc algorithms for regular expressions [SA96, GRS99, Zak00b], and is able to push more
constraints.
5. Mixed Monotonicity: Leung and Sun [LS12] also proposed the algorithm FPM (Frequent Pattern
mining for Mixed monotone constraints). FPM is a pattern-growth algorithm that is able to exploit
the properties of prefix-trees and to include mixed monotone constraints in a quite simple way.
The idea is to first divide the items into positive and negative sets (ItemP and ItemN ), and then
order the items in ItemP in ascending order of value, and those in ItemN in descending order. The mining
process proceeds iteratively, starting with the monotonic group and only then piecing together the
anti-monotonic group (in the case of sum(P ) ≥ v, it starts with ItemP and then ItemN ). Thus,
while mining the monotonic group, if an itemset satisfies the constraint, no checking is needed
for the supersets composed of items of that group. When the processing of this group finishes,
the algorithm adds items from the anti-monotonic group, one by one, and if the resulting itemsets
violate the constraint, it stops exploring them.
This strategy can be applied in a wide range of domains, including financial markets and air tem-
perature, to correctly and efficiently find patterns where constraints involve manipulating negative
values.
Table 4.4 presents a summary of the algorithms that implement the strategies above, taking
advantage of the properties of the constraints.
Table 4.4: Algorithms designed to incorporate constraints that follow specific properties. Note that algorithms for prefix-monotone or for conjunctions of constraints are also able to deal with simple AM or M constraints.

Properties | Algorithms
Anti-Monotonicity and Monotonicity | CAP [NLHP98]
Succinctness | CAP [NLHP98], FPS [LLN02]
Prefix-Monotonicity | CFG [PH00], FIC [PHL01], Prefix-Growth [PH02]
Mixed-Monotonicity | FPM [LS12]
Conjunctions of Anti-monotone and Monotone Constraints | G [BJ00], BMS+ and BMS* [GLW00], Molfea [RK01], DualMiner [BGKW03], ExAnte [BGMP05], MUSIC [SC05], [RJLM10]
There are some constraints that do not have nice properties for pushing (i.e. they are neither
anti-monotonic, monotonic, succinct, prefix-monotone nor mixed monotone), for example, combinations of
monotonic and anti-monotonic constraints and most of the existing interestingness measures [Bay05].
These constraints are not easily pushed into the pattern mining process, and an exhaustive search is not
an efficient solution, since the number of frequent itemsets can still be much higher than the number of
those that satisfy the constraint. Fortunately, some strategies have been proposed to deal with such
constraints, trying to take advantage of the benefits of constraint properties.
One widely used approach is to introduce constraint relaxations (weaker constraints) [NLHP98,
GRS99, AO05] that allow the algorithms to prune some of the search space and therefore make the
discovery more efficient. These relaxations depend on the constraint, but there has been a major effort
to find relaxations that have nice properties. The idea is thus to run a more efficient algorithm over the
data using the relaxation, and then to perform an exhaustive search on the results (instead of on all data).
Since relaxations are weaker than the original constraint (though stronger than just using frequency
pruning), the results must always be tested against the constraint so that only valid itemsets are returned.
Another approach is to use more than one strategy (one after the other or simultaneously). However,
as highlighted by Boulicaut and Jeudy [BJ00], if we are dealing with a conjunction of monotonic and
anti-monotonic constraints, we face a tradeoff between anti-monotonic and monotonic pruning. This may
happen because, when a monotonic constraint is pushed, it might save tests on monotonic constraints.
But, the results of those tests could have led to more effective anti-monotonic pruning [SVA97, GRS99].
As an example, pushing the monotonic constraint length(P ) ≥ 10 would avoid the generation of itemsets
of size less than 10. However, there would then be many candidates of size 10 or more, and all of
them would have to be tested against the anti-monotonic constraint. If the smaller itemsets had been tested
against the anti-monotonic constraint, many itemsets of size higher than 10 might already have been pruned,
and therefore not tested.
The identification of a good strategy for pushing these constraints requires a priori knowledge of
the constraint selectivity, which is generally not available [Bou04]. Boulicaut and Jeudy [BJ00]
also proposed a strategy (and the G algorithm) that may help dealing with these conjunctions, by
choosing the order of constraint pushing based on their selectivity and evaluation cost. With this in
mind, Bonchi et al. [BGMP03] proposed an adaptive strategy, ACP (Adaptive Constraint Pushing),
that is able to dynamically give more importance to anti-monotonic or monotonic pruning to
maximize efficiency, depending on the ratio of itemsets found infrequent. The same authors also proposed
ExAnte [BGMP05], a pre-processing algorithm that reduces the data by repeatedly eliminating
all itemsets that violate the monotone constraints and then all that violate the frequency or the anti-
monotone constraints. It can be followed by any efficient traditional pattern mining algorithm, but it
requires several scans over the data.
Other algorithms were proposed based on version spaces and on border representations. Essentially,
it has been realized that, for example, the space of solutions of a monotonic constraint is completely
characterized by its set (or border) of maximally specific elements. Likewise, the space of solutions of
an anti-monotonic constraint is completely characterized by its set (or border) of maximally general
elements. The idea is that, given a conjunction of an anti-monotonic and a monotonic constraint, it is
possible to start a level-wise search from the minimal itemsets that satisfy the monotonic constraint,
until reaching the maximal itemsets satisfying the anti-monotonic constraint [Bou04]. These properties
have been exploited by level-wise algorithms [MT97] to mine conjunctions, as in G [BJ00], BMS+
and BMS* [GLW00], Molfea [RK01], MUSIC [SC05] and DualMiner [BGKW03], and by the algorithm
proposed by De Raedt et al. [RJLM10] to mine arbitrary expressions over anti-monotonic and monotonic
constraints.
4.8.2 Categories vs. Algorithms
Besides the algorithms described above, there are some algorithms designed specifically for some particular
constraint category. Table 4.5 summarizes these algorithms.
Table 4.5: Algorithms designed to incorporate content and structural constraint categories.

Categories | Algorithms
Content – Item and Value | MultipleJoins, Reorder and Direct [SVA97]; WFIM (weights) [YL05]
Content – Aggregate | DnA and BP-cubing [WJY+05, ZCD07]
Structural – Sequence (and length) | SPIRIT [GRS99]; [AO02]; Sim [CMB02]; Re-Hackle [ALB03]; ε-accepts [AO04]
Structural – Network | Onto4AR framework [Ant08]; D2Apriori [Ant09b]; SemAware [ME09b]
Structural – Temporal (gap and duration) | GSP [SA96]; C-SPADE [Zak00b]; Gen-PrefixSpan [AO03]; (episodes) [MTIV97], MBD-LLBorder [DL99]; (cycles) sequential and interleaved [ORS98]
Srikant and Agrawal [SA95] were the first to introduce item constraints, the first constraints other than
minimum support. They proposed three apriori-based algorithms – MultipleJoins, Reorder and Direct
– that are able to deal with boolean combinations of these constraints, i.e. C = D1 ∧ D2 ∧ ... ∧ Dm,
where each Di = Ci1 ∨ Ci2 ∨ ... ∨ Cin , and each Cij is an item constraint of the form i ∈ S or i ∉ S.
Despite being composed of anti-monotonic and monotonic constraints, these combinations are themselves
neither anti-monotonic nor monotonic. Nevertheless, each individual Cij is a simple item constraint. The
first two algorithms proposed
(MultipleJoins and Reorder) use an anti-monotonic relaxation of the constraint, by finding first an itemset
S such that all patterns (i.e. valid itemsets) must have some item from S. Candidate generation is
optimized by joining only itemsets whose prefixes or suffixes contain items from S. Reorder is an
optimization of MultipleJoins that reorders itemsets so that items from S appear first. Direct
pushes the complete constraint, at the cost of a more complex candidate generation phase [BJ05]. The
idea of the algorithm is to explore the smaller constraints comprising the main constraint and join the
frequent itemsets that satisfy each one separately.
An alternative ad-hoc strategy for mining aggregate constraints was proposed with BP-Cubing
(Bound Prune Cubing) [WJY+05], an algorithm of the family of DnA (Divide and Approximate) al-
gorithms [ZCD07]. These follow a divide-and-approximate approach that first divides the search space
into subspaces (group-by partitions) and then seeks individual constraint approximations in each sub-
space to achieve the best results. They also propose the integration of more aggregate functions, like sum
of squares, positive sum, negative sum and variance.
Some algorithms were proposed to deal with sequences that accept gap and duration constraints. The
first was GSP [SA96] (Generalized Sequential Patterns), an apriori-based algorithm that organizes the
data according to given time windows, and generates (k + 1)-sequences by joining two k-sequences
whenever the (k − 1)-suffix of one equals the (k − 1)-prefix of the other. The authors also proposed the
integration of taxonomies, by extending transactions with the ancestors of items. C-SPADE [Zak00b]
is an extension that outperforms GSP and allows other constraints, like length, minimum gap and item
constraints. Gen-PrefixSpan [AO03] is a pattern-growth algorithm with the same goal as GSP, but without
the candidate generation bottleneck. Over time, other algorithms for temporal constraints were proposed,
for example, to find episodes (sets of events that must occur close to each other) [MTIV97, DL99] and
cycles [ORS98].
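The GSP-style join above can be sketched on flat tuples (real GSP sequences are lists of itemsets; this simplification is ours): two k-sequences join when dropping the first element of one yields the same (k−1)-sequence as dropping the last element of the other.

```python
# Sketch of the GSP candidate join on flat tuples.
def gsp_join(s1, s2):
    if s1[1:] == s2[:-1]:
        return s1 + (s2[-1],)
    return None

assert gsp_join(("a", "b"), ("b", "c")) == ("a", "b", "c")
assert gsp_join(("a", "b"), ("c", "d")) is None
```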
The first algorithms for mining with regular expression (RE) constraints did not make use of the
prefix-monotone property, and hence they had to create some relaxations, in order to achieve a balance
between the efficiency of the algorithms and the effective push of the constraints. An example is the family
of the apriori-like SPIRIT (Sequential Pattern mIning with Regular expressIons consTraints) algorithms
[GRS99]. SPIRIT(N) only requires that all elements in the pattern appear in the RE, which is a simple
anti-monotonic item constraint relaxation. SPIRIT(L) only generates sequences that are legal w.r.t. some
state of the RE automaton, i.e. the corresponding transitions must be possible in the automaton. For SPIRIT(V),
possible patterns must be valid suffixes, i.e. sequences of transitions that lead to a final state. Finally, SPIRIT(R)
enforces the complete constraint, and only generates valid sequences. The first relaxations are easier
to push into the algorithms; however, weaker constraints prune fewer possible patterns than stronger
ones. Therefore, all versions except R must test all results against the original constraint before returning
them.
Antunes and Oliveira [AO02] followed the SPIRIT ideas and adapted them to deal with context free grammars
(CFG), through the use of pushdown automata. These grammars are more powerful than REs, since
they can express the same and more languages. The authors also show that the increase in the expressive
power of the language used for specifying constraints does not impair the performance of the algorithms.
The same authors also proposed the pattern-growth algorithm ε-accepts [AO04], to find sequences that
approximately conform to a CFG, by allowing some insertions, deletions or replacements in the middle of
the sequences. Capelle et al. [CMB02] proposed an apriori-like algorithm with a similar goal. They
assume a reference sequence, given by the user, and calculate the similarity of the discovered sequences
with the reference (i.e. the number of differences). The sequences that surpass a similarity threshold
are returned.
Some adaptive strategies have also appeared for pushing REs. The algorithm RE-Hackle [ALB03]
represents the RE in a tree structure called a Hackle-tree, containing one node per operator (disjunction,
concatenation, '∗' operator) and one path per combination of these operators in the RE. This tree is
scanned at each candidate generation step, and an extraction function (depending on the operator) is
used in each node to extract the valid candidates. From these candidates, the frequent ones are used for the
next generation.
Antunes [Ant07, Ant08, Ant09b] was the first to propose the introduction of ontologies in DM as a
constrained mining problem, defining a set of ontology constraints, along with the framework Onto4AR
(Ontologies for Association Rules). The goal of this framework is to work with any ontology, and therefore
in any domain, allowing users to choose the ontology constraints to incorporate in the mining
process. The framework allows not only for the introduction of some of the network constraints described
above, but also of weak and strong compositions of those constraints. The same author also proposed the
algorithms D2Apriori (Domain Driven Apriori) and D2FP-Growth, which first acquire domain knowledge
from the knowledge base within the ontology, and then instantiate the constraints, read the data creating
its representation, and finally identify frequent constrained patterns.
Mabroukeh and Ezeife [ME09b] proposed SemAware, a generic framework for sequential pattern min-
ing that integrates semantic information, in the form of ontologies, into all phases of the web usage mining
process. It defines an apriori-based algorithm that prunes candidate generation and the search space ac-
cording to the semantic distance between objects and a maximum distance threshold, which can be user
specified or automatically calculated from the minimum support and the number of edges in the ontology.
Before the mining process, a matrix with the topological distance between concepts in the ontology is
built. During the mining process, sequences and candidates exceeding the allowed maximum
distance are pruned, with no need for support counting (anti-monotonicity property). The same authors
have extended this work to take into account weights in the relations. By combining the distance matrix
with a weight matrix, they define a weighted distance constraint, corresponding to a weighted sum of
the distances between two concepts times the weights in that path. In [ME09a], the distance matrix
is combined with the transition probability matrix of a Markov model, so that the same algorithm
(SemAware) is able to guide a Markov process.
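The distance-based pruning can be sketched as follows (the concept names, distance values and helper names are illustrative, not from SemAware): candidates whose consecutive concepts are farther apart in the ontology than the allowed maximum are pruned without any support counting.

```python
# Sketch of SemAware-style semantic pruning with a precomputed
# (symmetric) topological distance matrix.
dist = {("login", "search"): 1, ("search", "checkout"): 2,
        ("login", "checkout"): 3}

def distance(a, b):
    return dist.get((a, b)) or dist.get((b, a))

def pruned(candidate, max_dist):
    return any(distance(a, b) > max_dist
               for a, b in zip(candidate, candidate[1:]))

assert pruned(["login", "checkout"], max_dist=2) is True
assert pruned(["login", "search", "checkout"], max_dist=2) is False
```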
4.8.3 Data Sources vs. Algorithms
As noted before, the great majority of algorithms proposed for constrained pattern mining were designed
for mining one single and static data table.
Leung et al. [LK06] were the first to propose the integration of data streams with constrained mining,
with two algorithms, ApproxCFPS and ExactCFPS (Approximated and Exact Constrained Frequent
Patterns for Streams). These algorithms are able to push succinct constraints deep into the algorithm
FP-Streaming [GHP+03], and to find all approximate or exact patterns, respectively, in data streams.
The ideas are simple: for succinct anti-monotonic constraints, remove all single items that violate the
constraint before processing each transaction; for succinct monotonic constraints, divide the items of
each batch of transactions into mandatory and optional items, and order the transactions so that
mandatory items appear first, so that only itemsets that start with the mandatory items need to be
mined. While the algorithms efficiently push constraints into data stream mining,
they are only able to handle constraints that are succinct.
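The two pre-processing ideas can be sketched as follows (the helper names and data are ours, not from the ApproxCFPS/ExactCFPS papers):

```python
# Sketch of the two succinct-constraint pre-processing steps for streams.

def drop_violating(transaction, violates):
    """Succinct anti-monotone: remove violating items before processing."""
    return {i for i in transaction if not violates(i)}

def mandatory_first(transaction, mandatory):
    """Succinct monotone: order items so mandatory ones come first."""
    return sorted(transaction, key=lambda i: (i not in mandatory, i))

t = {"a", "b", "c", "d"}
assert drop_violating(t, lambda i: i == "b") == {"a", "c", "d"}
assert mandatory_first(t, mandatory={"c"}) == ["c", "a", "b", "d"]
```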
To the best of our knowledge, there is no algorithm designed for pushing constraints into the mining
of multiple tables. This incorporation is not straightforward, since we are usually in the presence of
multiple types of entities and events, with different characteristics, and the support of items depends
on the inter-relations between them. As noted in Section 2.3, one of the common ways to deal with
multiple tables is to join all of them into a single table and apply a single-table pattern mining
algorithm. However, even if this pre-processing step can be computed, the denormalized table usually
contains both transactional and non-transactional data, and the usual goal is to mine and find the common
characteristics of the transacted entities, as opposed to the goal of constrained pattern mining, which is to
find the common entities transacted together (only considering the entities whose characteristics satisfy
the constraints). In this sense, since most existing constraints were proposed for transactional data,
applying them to a multi-relational domain requires the adaptation of the algorithms.
4.9 Discussion and Open Issues
The use of domain knowledge in data mining has been recognized as one of the 10 most important
challenges in DM [YW06], not only because this domain knowledge represents the semantics of the domain
and user expectations, but also because, by introducing it into the discovery process, it is possible to
guide the algorithms through the discovery of more interesting and focused results. It is, therefore, one
promising approach to minimize two drawbacks of pattern mining: the large number of results, and their
lack of focus on user expectations.
This area of data mining guided by domain knowledge has been evolving, and several representations
have been proposed and analyzed, including human interactions, annotations, constraints, taxonomies,
ontologies, and other forms. Each successive form of representation allows the formalization of more
complex knowledge, and therefore each has its advantages and disadvantages and can be used in
different ways to guide the mining process. It is important to note, though, that the more complex the
model, the more difficult it is to understand, to deal with, and to incorporate in the mining process.
The most explored form of domain knowledge is domain constraints. Essentially, they are filters on
the data or results that capture application semantics and user needs in an intuitive manner. Several
types of constraints have been proposed and, depending on their properties, some general and several
ad-hoc strategies have appeared (see Tables 4.4 and 4.5), which have already been extended and applied
to a variety of problems and domains.
The use of constraints has been increasingly associated with other areas. One interesting example
is the algorithm U-FPS, proposed by [LB09] to deal with constrained frequent pattern mining from
uncertain data. It is able to represent user beliefs about the presence or absence of items in data, and
also to push constraints deep into the discovery process – succinct [LB09] or (prefix-)monotone [LHB10]
constraints. More recently, some authors established a correspondence between pattern mining and constraint
programming (SAT solvers) [RGN08, NJG11]. The advantage of these approaches is that the definition of
constraints is independent of the SAT solver, i.e. we can try several SAT solvers for the same constraint
specifications, and we can use the same SAT solver to solve different constraints. The major problem
stems from the non-trivial specification and mapping of the domain and constraints (as we already know
them) to a language that can be used by the solver. Fortunately, this problem is being addressed by
techniques like the one proposed by [NJG11].
Despite all the advantages and opportunities in the use of constraints, it is important not to forget their
tradeoff: pushing constraints that are too restrictive may reduce the discovery process to a simple
hypothesis-testing approach. Some approaches already take this tradeoff into account and propose
solutions, such as the use of constraint relaxations.
Despite the great advances in the use of domain knowledge in pattern mining, it is clear that
there are several research paths to follow. One open discussion is when to push the domain knowledge.
Pushing it as a pre-processing step reduces the data to analyze, but may eliminate important
data; pushing it as a post-processing step is discovery preserving, but requires all data to be analyzed; and
incorporating domain knowledge during the actual discovery process allows us to gradually reduce the
search space and guide the mining only through promising paths, thus avoiding processing all data and,
if used wisely, avoiding the elimination of potentially interesting data. Therefore, there is a need to develop
and extend algorithms that are able to push constraints during the pattern mining process.
Also, apart from those using constraints, most of the existing algorithms for other forms of knowledge
representation do not allow the discovery of more complex patterns, such as sequential and temporal patterns.
Furthermore, there are many ad-hoc approaches and few general strategies. This generally hinders
their application to different domains, and their actual use for decision support. The need for an
integration theory is undeniable.
Finally, existing algorithms are mainly designed for transactional and single-table databases. There
is a need to create new constraints and adapt these strategies (or create new ones) to deal with more
complex and demanding data, such as other forms of structured data, like multi-relational models,
graphs and XML files, whose inherent structure can be exploited.
In Chapter 5, we address and discuss in more detail these last two open issues.
Chapter 5
Pushing Constraints into Pattern
Mining
The previous chapters of this dissertation discuss two important lines of research for pattern mining. On
one hand, the mining of large, growing and multi-relational databases is increasingly important in this
era of big data, where all kinds of data are continuously being generated and made available. On the
other hand, constrained mining is very important for the pattern mining task, since it can significantly
improve the results and applicability of these techniques, by decreasing the number of patterns returned
and by focusing the discovery process on areas where it is more likely to find interesting information.
There is therefore a need to integrate these two areas and, despite the great advances in each separate
area, much work remains to be done to incorporate constraints into the mining of more complex and
dynamic databases.
In this sense, we first propose two efficient and general algorithms for pushing constraints with any
property. Both algorithms incorporate any constraint as a post-processing step into a pattern-tree
(the same structure used by our multi-dimensional algorithm StarFP-Stream). The first algorithm is
called Constraint pushing into a Pattern-Tree (CoPT ) [SA13a], and the second CoPT4Streams [SA13b].
They are designed for single table (and static) datasets and for single table data streams, respectively. By
using the pattern-tree structure, both algorithms are able to optimize the incorporation of any constraint,
avoiding unnecessary tests and eliminating invalid patterns earlier, according to the properties of the
constraints. Experiments show that the algorithms are efficient and effective, for all constraint properties,
and even for constraints with small selectivity.
Afterwards, we analyze in detail the incorporation of constraints in a multi-dimensional domain, and
propose a set of constraints that can be applied when mining a star schema (Star Constraints): entity
type, entity, attribute and measure constraints. Based on the strategies proposed for the algorithms
CoPT and CoPT4Streams, as well as on the related work described in Chapter 4, we also propose a set
of approaches for pushing the above constraints in pattern mining over star schemas.
To the best of our knowledge, there is no work on the incorporation of constraints in multi-relational
mining, and therefore this work is an important first step towards this integration.
In this chapter we first present in detail the two algorithms for pushing constraints into a pattern-tree
(Sections 5.1 and 5.2). Then, in Section 5.3, we analyze the difficulties of pushing constraints in the
multi-relational domain, define the star constraints, and propose a set of strategies for incorporating
those constraints in the discovery of patterns over star schemas. Finally, Section 5.5 discusses and
concludes the chapter.
5.1 Pushing Constraints into a Static Pattern-Tree
As described in Section 4.3, the problem of constrained pattern mining is to find all frequent itemsets
that satisfy some constraint.
In this section we propose a set of strategies to push constraints with nice properties into pattern
mining, through the use of a pattern-tree structure. These are post-processing strategies that, combined
with the properties of the pattern-tree, make it possible to efficiently filter the results according to any
constraint.
We also propose an algorithm, called CoPT (Constraint Pushing into a Pattern-Tree), that implements
these strategies and is able to incorporate any of those constraints efficiently, therefore returning fewer
and more interesting results. Since it is a post-processing algorithm, any traditional pattern mining
algorithm can first be used to search for frequent itemsets, and its results, kept in a pattern-tree, can be
processed directly by CoPT.
5.1.1 Pattern-Tree
A pattern-tree, as first described for StarFP-Stream in Section 3.3, is a compact prefix tree structure
that holds information about patterns.
At its core, each node contains an item and a support, and edges link items that occur together,
forming the itemsets. Therefore, each node in the pattern-tree corresponds to an itemset, composed of
the items from the root to this node, with the support attached to it. Note that each node may contain
other fields, if needed, such as the error when mining data streams. In this sense, the children of a given
node (its subtree) correspond to the supersets of the respective itemset.
As a prefix tree, itemsets that share the same prefix also share the nodes corresponding to that
prefix. Since frequent items are often shared among many patterns, the tree is usually much smaller
than storing the itemsets in a list or a table, and the search for an itemset is usually much faster.
Note that if (a, b, c) : 5 is a frequent itemset, then a, b, c, (a, b), (a, c) and (b, c) are also frequent,
with support greater than or equal to 5, and therefore they are also in the pattern-tree. This means
that, for each itemset in the tree, all elements of its strict powerset are also in the tree. This may seem
undesirable or redundant at first glance, but it is an important property that facilitates the pruning of
the tree while searching for constraint satisfaction.
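To make the structure concrete, the following is a minimal Java sketch of a pattern-tree (illustrative only; class and method names are not the thesis implementation). Each node stores an item and a support, and inserting an already-ordered itemset reuses shared prefixes:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal pattern-tree sketch: each node holds an item and a support;
// the path from the root to a node spells out an itemset.
class PatternTree {
    static class Node {
        int item;
        int support;
        List<Node> children = new ArrayList<>();
        Node(int item, int support) { this.item = item; this.support = support; }
    }

    final Node root = new Node(-1, 0); // dummy root

    // Insert an already-ordered itemset with its support, sharing prefixes.
    void insert(int[] itemset, int support) {
        Node cur = root;
        for (int item : itemset) {
            Node next = null;
            for (Node c : cur.children)
                if (c.item == item) { next = c; break; }
            if (next == null) {
                next = new Node(item, support);
                cur.children.add(next);
            }
            cur = next;
        }
        cur.support = support; // support of the full itemset
    }

    // Total number of nodes (excluding the dummy root).
    int size() { return size(root) - 1; }
    private int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }
}
```

For example, inserting (a), (a, b) and (a, b, c) creates only three nodes, because the common prefixes are shared.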
5.1.2 Constraint Pushing Strategies
In order to push constraints into a pattern-tree, we define a set of strategies that can be used, based on
constraint properties. A naive approach is to perform a simple depth-first search (DFS) to traverse the
tree and test all nodes for all types of constraints (note that, when we test a node for a constraint, we
mean that we test the itemset corresponding to that node). However, not all nodes need to be tested.
For example, if the itemset of a node violates an anti-monotonic constraint, no superset will satisfy it,
and therefore there is no need to test the children of that node, nor to keep them in the tree. Hence,
we can take advantage of constraint properties and perform a constrained DFS, stopping the search at
some points and avoiding unnecessary tests.
Another possible approach is to push the constraint right before inserting each itemset in the pattern-
tree. However, while this may be better in terms of memory, because the pattern-tree would be smaller,
it means that we have to test every itemset. By scanning the tree instead, we may skip the constraint
checking of many itemsets.
Furthermore, constraints can be used, not only to filter the results, but also to prune the pattern-tree
and remove invalid itemsets for future accesses.
Next, we describe the strategies for pushing constraints satisfying each property.
Anti-Monotonicity (AM):
Pushing an AM constraint (CAM ) is straightforward. While performing a DFS, if the node:
(a) Satisfies CAM : keep it in the tree and return it as a pattern;
(b) Violates CAM : there is no need to search its subtree because all supersets also violate the constraint.
Therefore we can prune the tree and remove this node, as well as all of its children.
Monotonicity (M):
To incorporate a monotonic constraint (CM ), we cannot remove nodes that violate it, because the
supersets of this node (its children) may satisfy it. So, while traversing the tree, if the node:
(a) Satisfies CM : keep it in the tree and return it as a pattern. Do the same for each node in its subtree,
without testing for the constraint; (Note that if we are just pruning the tree, not yet returning the
patterns, we do not even need to scan the subtree, because all supersets satisfy the constraint, and
there is nothing to remove.)
(b) Violates CM : If it is a leaf node (has no supersets), we can remove it, as well as all parents that
become a leaf because of this elimination. If it is not a leaf, continue the search to its children,
since they can satisfy the constraint.
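The two strategies above can be sketched as recursive prunings over a prefix tree of patterns. The Java sketch below is illustrative (the node type and names are assumptions, and `Predicate` stands for the constraint check on an itemset); it is not the thesis implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the AM and M pruning strategies over a prefix tree of patterns.
class ConstrainedDfs {
    static class Node {
        int item;
        List<Node> children = new ArrayList<>();
        Node(int item) { this.item = item; }
    }

    // Anti-monotonic: if an itemset violates C, no superset can satisfy it,
    // so the whole subtree is pruned. Returns true if the node must be removed.
    static boolean pushAM(Node n, List<Integer> prefix, Predicate<List<Integer>> c,
                          List<List<Integer>> out) {
        List<Integer> cur = new ArrayList<>(prefix);
        cur.add(n.item);
        if (!c.test(cur)) return true;           // prune node and subtree
        out.add(cur);                            // itemset satisfies C
        n.children.removeIf(ch -> pushAM(ch, cur, c, out));
        return false;
    }

    // Monotonic: a violating node may still have satisfying supersets, so we
    // keep descending; once an itemset satisfies C, all supersets do too.
    static boolean pushM(Node n, List<Integer> prefix, Predicate<List<Integer>> c,
                         List<List<Integer>> out) {
        List<Integer> cur = new ArrayList<>(prefix);
        cur.add(n.item);
        if (c.test(cur)) {
            collectAll(n, cur, out);             // subtree needs no more checks
            return false;
        }
        n.children.removeIf(ch -> pushM(ch, cur, c, out));
        return n.children.isEmpty();             // violating leaf: remove it
    }

    // Return a node and all of its descendants as patterns, without testing C.
    static void collectAll(Node n, List<Integer> cur, List<List<Integer>> out) {
        out.add(cur);
        for (Node ch : n.children) {
            List<Integer> next = new ArrayList<>(cur);
            next.add(ch.item);
            collectAll(ch, next, out);
        }
    }
}
```

Note how, in `pushM`, returning true for an emptied node makes the removal of parents that become leaves cascade naturally up the recursion.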
Succinctness (S):
In the presence of a succinct constraint, we can apply the strategies for CAM or CM , depending on
whether it is succinct anti-monotonic (CSAM ) or succinct monotonic (CSM ), respectively. However,
the succinctness of a constraint allows us to know, from the outset, which items satisfy the constraint
and which do not. We can therefore exploit this property to obtain a more efficient search.
With this in mind, we first divide the items into two groups: items that satisfy, or are necessary for
the satisfaction of, the constraint, Is; and items that violate, or are not required for the satisfaction of,
the constraint, Iv. Then, before inserting itemsets into the pattern-tree, we order them according to
those groups.
CSAM : With a SAM constraint, single items that violate it can be discarded. If we order the items
in itemsets so that Iv appears before Is (Iv closer to the root and Is closer to the leaves), when
applying the CAM strategy we only need to check the first level of the pattern-tree. If a node
violates the constraint, remove it and its subtree; if it satisfies the constraint, all of its children will
also satisfy it, because they belong to Is, so we can return all of them as patterns without testing
the constraint.
CSM : In the case of a SM constraint, Is contains the mandatory items and Iv the optional items. If an
itemset with items from Is satisfies the constraint, all of its supersets formed by adding items from
Is or Iv also satisfy it. Itemsets with items only from Iv violate the constraint. In this sense, if
we order itemsets so that items from Is appear before items from Iv, when applying the CM
strategy we only need to do it until the first node from Iv. This is because, if we arrive at such a
node and still need to test the constraint, it means the constraint has not been satisfied by items
from Is, and the next items cannot satisfy it either, because they are optional; therefore we do not
need to test anything further.
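The ordering step for succinct constraints can be illustrated as follows. This is a sketch under the assumption that the constraint already tells us which items belong to the group that must come first (Iv for SAM, Is for SM); names are illustrative:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Set;

// Sketch of the item ordering used for succinct constraints: items in the
// given 'first' group are placed before the remaining items (Iv first for
// SAM constraints, Is first for SM constraints).
class SuccinctOrder {
    // Sorts an itemset so that items in 'first' precede the remaining items;
    // ties are broken by the item value (any fixed total order would do).
    static void order(Integer[] itemset, Set<Integer> first) {
        Arrays.sort(itemset, Comparator
                .comparing((Integer i) -> first.contains(i) ? 0 : 1)
                .thenComparing(i -> i));
    }
}
```

For instance, for a SM constraint whose mandatory items are Is = {2, 5}, ordering the itemset (1, 2, 3, 5) yields (2, 5, 1, 3), so the CM strategy only needs to test nodes until the first optional item.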
Prefix-Monotonicity (P ):
Since prefix-monotone constraints can only be treated as AM (CPAM ) or M (CPM ) constraints if the
items follow a particular order, we just need to sort the itemsets according to that order before inserting
them in the pattern-tree, and then apply the CAM or CM strategy, respectively. Otherwise, we have to
traverse the whole tree and check all nodes for the constraint.
Mixed-Monotonicity (Mix):
Mixed-monotone constraints (CMix) are both AM and M , for different groups of values. In this case,
we just have to divide the items into those groups, IAM and IM , and put IM before IAM in the tree,
i.e. sort itemsets so that items from the IM group appear above items from IAM . The idea is to start
with the CM strategy until a node satisfies the constraint or a node from IAM appears. From that node
on, we can apply the CAM strategy and prune invalid nodes from its subtree. So, for each node, start
with the monotone strategy:
1. Monotone strategy: If the itemset:
(a) Satisfies CMix: Keep it in the tree and return it as a pattern. We can now change to the
anti-monotone strategy and proceed;
(b) Violates CMix: If it is a leaf, remove it, as well as all parents that become a leaf. If it is a
node from IAM , remove it, and all its sub-tree. Otherwise, continue to its children.
2. Anti-monotone strategy: If the itemset satisfies the constraint, keep it in the tree and return it as a
pattern. If it violates the constraint, prune the tree from this node removing it and all its children.
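As a concrete example of such a partition, consider a constraint of the form sum(S) ≥ v over items carrying values of either sign: adding a non-negative item can never break satisfaction (the monotone direction, IM ), while adding a negative item can never repair a violation (the anti-monotone direction, IAM ). The sketch below is illustrative (class and method names are assumptions, not the thesis implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch for a mixed-monotone constraint sum(S) >= v over items with values
// of either sign: non-negative values form the monotone group (IM), placed
// above the anti-monotone group (IAM) of negative values in the tree.
class MixedPartition {
    static void partition(int[] values, List<Integer> iM, List<Integer> iAM) {
        for (int v : values) {
            if (v >= 0) iM.add(v);  // monotone group: goes above IAM
            else iAM.add(v);        // anti-monotone group
        }
    }

    // The constraint check itself: does the itemset's sum reach v?
    static boolean satisfiesSumGeq(List<Integer> itemset, int v) {
        int sum = 0;
        for (int i : itemset) sum += i;
        return sum >= v;
    }
}
```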
Combinations of constraints:
Most of the time, combinations of constraints that individually have these nice properties do not
themselves have nice properties. Pushing constraints with no nice properties means that the whole tree
needs to be traversed, and all nodes must be tested. Nevertheless, there are three important aspects.
First, disjunctions or conjunctions of anti-monotonic (resp. monotonic) constraints are also anti-
monotonic (resp. monotonic) constraints. Therefore, we can push them all at the same time (as one
single constraint), with the exact CAM (resp. CM ) strategy.
Second, for other properties, we need to sort the items according to the order that allows us to take
the most advantage of the property. When we have more than one constraint that needs an order, if the
orders are compatible (i.e. neither changes the order of the items of the other), it is possible to apply
the respective strategies at the same time. However, if the orders are not compatible, we cannot apply
any of the strategies above.
Finally, since we prune the pattern-tree to remove itemsets that violate a constraint, the pattern-tree
is generally smaller after applying some strategy. In this sense, we can still efficiently push several
constraints, one after another, each over the pattern-tree resulting from pushing the previous constraint.
And we can do it in an efficient order, by pushing AM constraints first, and then M ones. Constraints
with no nice properties (or with incompatible orders) can be pushed at the end, over the smallest
pattern-trees.
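The first aspect can be made concrete very simply: two AM predicates combined by conjunction (or disjunction) yield a predicate that is itself AM and can be pushed as one constraint. A small illustrative sketch:

```java
import java.util.List;
import java.util.function.Predicate;

// Conjunctions (or disjunctions) of anti-monotonic constraints are themselves
// anti-monotonic, so they can be pushed together as one single predicate.
class CombinedConstraints {
    static Predicate<List<Integer>> and(Predicate<List<Integer>> c1,
                                        Predicate<List<Integer>> c2) {
        return s -> c1.test(s) && c2.test(s);
    }
}
```

For example, combining the AM constraints sum(S) ≤ 10 and |S| ≤ 2 gives one AM predicate, pushed with the CAM strategy exactly as a single constraint would be.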
5.1.3 Algorithm CoPT
Since the strategies presented above share many similarities, they can be combined into one single
generic strategy. We therefore propose the algorithm CoPT (Constraint Pushing into a Pattern-Tree),
which is able to efficiently and effectively push any constraint into a pattern-tree.
Algorithm 3 CoPT Pseudocode
Input: Support σ, Dataset D, Constraint C
Output: All frequent itemsets that satisfy C

if C has order then
    order ← best order for C
p-tree ← empty tree with order order
run a pattern mining algorithm with σ and D, and insert results into the p-tree
L ← pushConstraint(p-tree, C)
return L

Patterns ← pushConstraint(Pattern-Tree p-tree, Constraint C)
    L ← ∅
    for all Node N, children of the root of p-tree do
        remove? ← push(N, C, {}, L)
        if remove? is true then
            remove N from root
    return L

boolean ← push(Node N, Constraint C, Itemset itset, Patterns L)
    isPattern? ← true, current ← itset ∪ N.item : N.support
    if Constraint is not null then
        if C is Succinct and N.item ∈ C.Iv then
            return true    // remove this node
        if current satisfies C then
            if C is Monotonic or C is Succinct then
                if C is Mixed then
                    change C to AM for next children
                else
                    C ← null    // no need to test any children
        else
            if C is Anti-monotonic then
                return true
            isPattern? ← false
    if isPattern? is true then
        L ← L ∪ current
    for all Node T, children of N do
        remove? ← push(T, C, current, L)
        if remove? is true then
            remove T from N
    if isPattern? is false and N is leaf then
        return true
    return false
The pseudo-code of the algorithm is presented in Algorithm 3.
Essentially, to push a constraint, CoPT first checks the item order required by that constraint and
creates an empty pattern-tree with it (if there is no such order, items are put in the pattern-tree in
support-descending order, which is known to improve the compactness of the tree [HPY00]). Then a
traditional pattern mining algorithm can run over the dataset to get the frequent itemsets. While it
runs, its results are inserted in the pattern-tree (note that the mining algorithm does not need any
change; only the pattern-tree knows how to sort and insert the itemsets). After that, we can push the
constraint into the pattern-tree.
In function push, for each node, current corresponds to the itemset composed of the items from the
root to this node and, until proven otherwise, it is a pattern. If there is no constraint to check (e.g. a
CM already satisfied), the itemset is added as a pattern and the same is done for all of its children.
Otherwise, (1) if the constraint C is succinct (SAM or SM) and the node violates it, the node can be
removed; (2) if current satisfies C: (a) C is mixed and we can change the strategy to AM ; (b) C is
monotonic and no child needs testing; or (c) C is succinct AM , and only the first level of the tree needs
testing; (3) if current violates C, it is not a pattern, and if C is AM we can prune the tree from this
node. After checking the constraints, if the node was not pruned, we can test its children. Finally, after
pushing C into the children, if the node is not a pattern and is a leaf, we can remove it.
5.1.4 Performance Evaluation
The goal of these experiments is to analyze the behavior of our algorithm in the presence of all types
of constraints, and to show that CoPT is able to effectively and efficiently push them into a pattern-tree,
taking advantage of their properties.
In these experiments we use a transaction database automatically generated by the program developed
at the IBM Almaden Research Center [AS94]. The dataset has 10k transactions, with an average of 25
items per transaction and a domain of 1000 items (with values from zero to 1000). In addition, in order
to test the mixed-monotone constraint, we consider an equivalent dataset with negative values, obtained
by making the values vary from −500 to 500.
We analyze the time needed to push the constraints on these datasets, as well as the size of the pruned
pattern-tree and the number of constraint checks the algorithm needs to make. Since the behavior of the
algorithm can depend on the selectivity of the constraints, we also use it in our experiments. Selectivity
is defined as the ratio of frequent itemsets that violate the constraint over the total number of frequent
itemsets, i.e. how much we can filter. We therefore test CoPT with several constraints with different
selectivities, varying from 10% to 90%. We also tested several minimum supports; since the results are
consistent, we present those for a support of 0.5%, and the results presented correspond to the average
of several runs with different constraints of equivalent selectivity. Also, as a baseline for comparison,
we test our algorithm against a version that checks all nodes for the constraints (i.e. that does not take
constraint properties into account), named CoPT+.
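The selectivity measure defined above can be computed directly over the set of frequent itemsets. A small sketch (names illustrative):

```java
import java.util.List;
import java.util.function.Predicate;

// Selectivity = (# frequent itemsets that violate the constraint)
//             / (total # frequent itemsets), i.e. how much we can filter.
class Selectivity {
    static double of(List<List<Integer>> frequent, Predicate<List<Integer>> c) {
        long violating = frequent.stream().filter(s -> !c.test(s)).count();
        return (double) violating / frequent.size();
    }
}
```

For example, if one of every three frequent itemsets violates the constraint, the selectivity is about 33%.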
The traditional pattern mining algorithm used was FP-Growth [HPY00], since it is an efficient
algorithm that does not suffer from the candidate generation problem. The computer used to run the
experiments was an Intel Core i7 CPU at 2GHz (Quad Core), with 8GB of RAM, running Mac OS X
Server 10.7.5, and the algorithms were implemented in Java (JVM version 1.6.0 37).
Experimental Results
As the core of our algorithm, the pattern-tree plays an important role in these experiments. Independently
of the constraint, in most cases the pattern-tree after pushing the constraint is smaller than the
original one, because it does not contain leaves that violate it. As the selectivity increases, more itemsets
violate the constraint, and therefore more can be discarded from the tree. In the case of an AM
constraint (AM , SAM or PAM), the number of nodes in the final pattern-tree corresponds to the
number of frequent itemsets that satisfy the constraint (the number of patterns). In the case of M
constraints (M , SM , PM and Mix), this might not be true, since nodes that violate the constraint have
to be kept if some superset satisfies it.
In fact, the time needed by the traditional unconstrained pattern mining algorithm corresponds to
the bulk of the total time: about 5 hours for these settings. Once the patterns are in a pattern-tree,
and due to its compact nature, it is fast (compared to pattern mining) to look for patterns that satisfy
some constraint, even for constraints with no nice properties (CoPT+) and with low selectivity. Figures
5.1, 5.3 and 5.5 show the time needed for pushing AM , M and Mix constraints into a pattern-tree,
respectively. We can see there that pushing constraints taking their properties into account (CoPT )
takes less time than testing all nodes (CoPT+), for every constraint property. For all AM and succinct
constraints, as the selectivity increases, the time needed to prune the tree decreases, since more violating
itemsets can be eliminated earlier. On the contrary, for M and SM constraints the time needed tends
to increase, because they take longer to find itemsets that satisfy the constraint (so that they can stop
checking it). The time is therefore related to the number of constraint checks.
These constraint checks are also an important part of the algorithm since, in theory, taking advantage
of constraint properties results in fewer tests. Figures 5.2, 5.4 and 5.6 show interesting results about
that. For AM constraints (AM and PAM), the number of tests decreases as the selectivity increases,
because the number of itemsets that violate the constraint, and can therefore be discarded, increases.
For M constraints (both M and PM) the trend is reversed. This happens because the M strategy only
stops checking when itemsets satisfy the constraint: if more itemsets violate it (higher selectivity), more
itemsets need to be tested. Using the succinctness of constraints brings the highest improvements, both
in time needed and in constraint checks avoided. The number of tests for succinct constraints does not
depend on the selectivity, because only the nodes of the first level of the tree need to be tested (in this
case, about 800 nodes). Note that the tree has more than 300 thousand nodes, and only 800 need to be
checked. Finally, Mix constraints show a “mix” of the behaviors of M and AM constraints. As the
selectivity increases, more itemsets belonging to both groups of values violate the constraint; the more
violating itemsets from IAM , the more can be pruned, but the more violating itemsets from IM , the
more constraint checks
are required. Hence, there is a tradeoff between both strategies.
[Charts omitted. Figure 5.1: Time with AM. Figure 5.2: Checks with AM. Figure 5.3: Time with M.
Figure 5.4: Checks with M. Figure 5.5: Time with Mixed. Figure 5.6: Checks with Mixed. Each figure
plots time (ms), number of nodes (thousands) or number of constraint checks (thousands) against
selectivity (0%–100%), for CoPT+ and the respective CoPT variants.]
5.1.5 Discussion and Conclusions
In this section, we propose a new set of post-processing strategies for pushing constraints into pattern
mining, through the use of the efficient pattern-tree structure. These strategies take advantage of
constraint properties, so that we can filter earlier the frequent itemsets that satisfy each constraint and
avoid unnecessary tests. We also propose a general algorithm, named CoPT , that combines the defined
strategies and is able to push any constraint into a pattern-tree, while still taking advantage of its
properties.
Experimental results show that the algorithm is effective and efficient. It needs only a small amount
of time to push the constraint and prune the pattern-tree (when compared to the time needed by the
pattern mining algorithm), even for constraints with small selectivity, and it checks far fewer nodes and
needs less time than an approach that does not take constraint properties into account.
Despite the benefits of CoPT , it is a post-processing approach. This means that some traditional
pattern mining algorithm must run first to discover all frequent itemsets. This usually takes a long
time and results in a large number of frequent itemsets that then need to be evaluated again. A path for
improvement is to create a more balanced approach and use the strategies proposed here to filter itemsets
during the actual discovery process.
An important contribution of CoPT is the fact that it uses the same pattern-tree structure that is
used by our algorithm StarFP-Stream. However, it makes some assumptions that are not valid when we
move to a streaming environment, such as that all data are available from the beginning. In the streaming
case, new data are continuously arriving, which means that we do not know the alphabet and order of
items a priori; furthermore, we need to analyze to what extent we can remove itemsets from the tree,
given that the same itemsets can appear later on.
5.2 Pushing Constraints into a Dynamic Pattern-Tree
In this section we adapt and discuss the set of strategies proposed above for pushing constraints into
stream pattern mining, through the use of the pattern-tree structure. The problem of constrained pattern
mining over data streams is to find all approximate patterns (with estimated support higher than the
threshold) that satisfy some constraint.
We also propose a generic algorithm, called CoPT4Streams (Constraint Pushing into a Pattern-Tree
for Streams), that combines and implements these strategies and is able to dynamically discover all
patterns that satisfy any user-defined constraint. CoPT4Streams pushes constraints into the pattern-tree
structure at each batch boundary in an efficient way, by taking advantage of the properties of constraints,
and filters the patterns and potential patterns in that tree, resulting in a much smaller summary, and
therefore in less memory and time needed.
Since it is an algorithm that is applied to the pattern-tree, any data streaming algorithm can be used
along with our CoPT4Streams, provided that it uses a pattern-tree as its summary data structure.
5.2.1 Pattern-Tree
As described above for CoPT, a pattern-tree is a compact prefix tree structure that holds information
about patterns. In the streaming environment, this tree also contains information about the error associated
with each pattern.
In this context, each node of a pattern-tree contains an item, an approximate support and a maximum
error, and edges link items that occur together, forming the patterns. Therefore, each node in a pattern-
tree corresponds to an approximate pattern, composed of the items from the root to this node, and the
estimated support and error attached to this node.
5.2.2 Constraint Pushing Strategies
As we are integrating data streams and constraints, some questions arise. Note that the pattern-tree
must be updated in every batch, to renew the current approximate frequent itemsets, and therefore the
order in which the items of patterns are inserted in the tree must remain the same across batches.
1. Data are not available a priori, and so we do not know all possible items at the beginning. In the
cases where the order of items matters (e.g. for prefix-monotone constraints), new items that should
be placed between already known items may appear. Is it possible to efficiently take advantage of
constraint properties, even when the order of items changes?
2. In a static application, invalid itemsets could be removed from the tree, since they do not satisfy
the constraint (for both AM and M constraints). In a data stream, these itemsets could reappear in
following batches, and valid supersets of current invalid itemsets could also appear later (in the case
of M constraints). Can we, at some batch, remove itemsets in the tree that violate the constraint?
Given these differences, the main question is:
• Can we use the same strategies as in the algorithm CoPT?
The answer is yes to both questions above, with small adaptations, essentially because for a pattern
to appear in the pattern-tree (i.e. to be approximately frequent), all of its subsets must appear too. We
will delve into these questions further ahead.
We assume that constraints have fixed parameters (for example, min(X) < v, in which X is an itemset
and v is a fixed threshold), i.e. parameters do not depend on the number of transactions seen so far, and
do not change across different batches (e.g. we do not consider constraints like min(X) < min(all items
seen so far)). This makes the satisfaction of constraints permanent, meaning that, if an itemset satisfies
(resp. violates) a constraint in some batch, it always satisfies (resp. violates) the same constraint in any
later batch.
Anti-Monotonicity:
For pushing an AM constraint (CAM ) we can use the same strategy as used for mining static data tables
(as in CoPT ). The only difference is that we do not have to return any pattern as a result, since we are
pushing constraints at the end of a batch, to filter the pattern-tree for the next batch.
The reason we can apply the same strategy is that, for AM constraints, itemsets that violate the
constraint can be removed, because they will never satisfy the constraint. Even if they reappear in later
batches because they are frequent, they will be removed again, since they violate the constraint (answer
to question 2).
Recalling the strategy:
While performing a DFS, if the node satisfies CAM , keep it in the tree and proceed to its children; if
it violates CAM , there is no need to search its subtree because all supersets also violate the constraint.
Therefore, prune the tree and remove this node, as well as all of its children.
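This strategy can be sketched over a minimal pattern-tree. The node layout and the particular constraint max(X) ≤ v are assumptions of this sketch (any AM constraint works the same way); since every ancestor of a visited node has already passed the test, it suffices to check the node's own item.

```java
import java.util.*;

// A minimal sketch of the CAM strategy, under two assumptions: a hypothetical
// pattern-tree of integer items, and the AM constraint max(X) <= v (if an
// itemset violates it, so does every superset).
class AmPrune {
    static class Node {
        final int item;
        final List<Node> children = new ArrayList<>();
        Node(int item) { this.item = item; }
        Node child(int it) { Node c = new Node(it); children.add(c); return c; }
    }

    // DFS: a child whose item exceeds v turns every itemset through it into a
    // violator, so the child and its whole subtree are removed without further
    // tests; surviving children are visited recursively.
    static void prune(Node node, int v) {
        node.children.removeIf(c -> c.item > v);
        for (Node c : node.children) prune(c, v);
    }

    static int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }

    static int demo() {
        Node root = new Node(-1);      // virtual root, holds no item
        Node a = root.child(1);
        a.child(2).child(5);           // {1,2,5} violates max(X) <= 4
        a.child(3);
        root.child(6).child(2);        // {6} violates: subtree {6,2} goes too
        prune(root, 4);
        return size(root) - 1;         // nodes kept besides the root
    }

    public static void main(String[] args) {
        System.out.println(demo());    // 3 nodes survive: {1}, {1,2}, {1,3}
    }
}
```

Note that the subtree below node 6 is never tested: the AM property makes a single check sufficient to discard it.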
Monotonicity:
To incorporate a monotone constraint (CM ), we can also adopt a strategy similar to the one proposed for
CoPT. Since we do not have to return results, when we find a satisfying itemset, we do not need to traverse
its supersets to return them. In this sense, we save not only time on constraint checks, but also time on
traversing the tree.
Answering question 2 again: for M constraints, all itemsets with no supersets in the tree (leaves)
that violate the constraint can be removed, because they will never satisfy the constraint. Note that,
if some valid superset appears in later batches, it means that both that itemset and the superset are
frequent, and therefore both will appear in the tree, in the same branch. However, only the superset will
be returned as a pattern, because it is the only valid one. In summary, there is no need to keep an invalid
itemset in the tree while it has no valid supersets.
Recalling the strategy:
If a node satisfies CM , keep it in the tree and do not scan its subtree, because all supersets
satisfy the constraint and there is nothing to remove; if it violates CM and is a leaf node (has
no supersets), remove it, as well as all ancestors that become leaves because of this elimination.
If it violates the constraint but is not a leaf, continue the search into its children (we cannot
remove it, because the supersets of this node, its children, may satisfy the constraint).
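A matching sketch of the CM strategy, assuming nonnegative integer items and the monotone constraint sum(X) ≥ v (once an itemset satisfies it, every superset does): satisfying nodes are kept without scanning their subtrees, violating leaves are removed, and ancestors that become empty as a result fall with them.

```java
import java.util.*;

// Sketch of the CM strategy over a hypothetical pattern-tree, with the
// monotone constraint sum(X) >= v (items assumed nonnegative).
class MPrune {
    static class Node {
        final int item;
        final List<Node> children = new ArrayList<>();
        Node(int item) { this.item = item; }
        Node child(int it) { Node c = new Node(it); children.add(c); return c; }
    }

    // Returns false when this node must be removed. A node whose path-sum
    // already reaches v is kept without scanning its subtree (all supersets
    // satisfy too); a violating leaf, or a violating node whose whole subtree
    // was removed, is dropped.
    static boolean keep(Node node, int pathSum, int v) {
        int sum = pathSum + node.item;
        if (sum >= v) return true;                      // satisfies: stop scanning here
        node.children.removeIf(c -> !keep(c, sum, v));  // violating inner node: recurse
        return !node.children.isEmpty();                // leaf after pruning => remove
    }

    static int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }

    static int demo() {
        Node root = new Node(0);        // virtual root contributes nothing
        Node a = root.child(2);
        a.child(5);                     // {2,5}: sum 7 >= 6, kept untested below
        a.child(1);                     // {2,1}: sum 3, violating leaf, removed
        root.child(3);                  // {3}: violating leaf, removed
        root.children.removeIf(c -> !keep(c, 0, 6));
        return size(root) - 1;          // nodes kept besides the root
    }

    public static void main(String[] args) {
        System.out.println(demo());     // 2 nodes survive: {2} and {2,5}
    }
}
```

The invalid itemset {2} stays in the tree only because it still has the valid superset {2,5} below it, exactly as the strategy prescribes.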
Succinctness:
Recall that a succinct constraint allows us to know, by looking at single items, which of them satisfy (Is)
or violate (Iv) the constraint. With this in mind, before inserting itemsets into the pattern-tree, we
can order their contents according to those two groups of items.
In this sense, succinct constraints relate to question 1, since they need the items to be sorted. However,
in this streaming environment, we do not know the overall order a priori, and therefore new items from
the first group may appear and need to be placed before all already known items from the second group
(e.g. for a CSAM , at some batch, Iv = {a} and Is = {b}; if an item c appears and belongs to Iv (so that
Iv = {a, c} and Is = {b}), and itemset abc occurs, the order of items to be inserted in the pattern-tree
should be acb).
This poses no problem, as long as the relative order of existing items does not change, because if an
itemset with new items appears in the tree, all of its subsets will also appear, and all subsets that do
not include these new items keep the same order (using the example above, the itemsets a, b, c, ab, ac,
cb and acb must all be in the tree, and, as can be noted, the itemsets without item c maintain the order,
such as ab in this case).
Therefore, we can follow the CoPT strategy proposed for static datasets, but without the need for
returning itemsets.
Recalling the strategy:
CSAM : With a SAM constraint, order items in itemsets so that Iv appears before Is. At each
batch boundary, apply the CAM strategy only on the first level of the pattern-tree. If the node
violates the constraint, remove it and its sub-tree; if the node satisfies the constraint, all of its
children will also satisfy it, because they belong to Is, so skip testing for the constraint.
CSM : In the case of a SM constraint, order itemsets so that items from Is (mandatory) appear
before items from Iv (optional). When applying the CM strategy after each batch, only apply it
until the first node from Is that satisfies the constraint is found (since then all supersets satisfy
it); otherwise, apply it until the first node from Iv, because if we arrive at a node from this
group and still need to test the constraint, it means the constraint has not been satisfied by the
items from Is, and the following items will not satisfy it either, because they are optional. In
this case, do not test this node or any of its children, and remove them from the tree.
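The incremental ordering discussed above, where new items join Iv or Is without disturbing the relative order of already known items, can be sketched as follows (class and method names are illustrative, not the thesis's implementation). It reproduces the CSAM example: Iv = {a}, Is = {b}, a new item c joining Iv, and itemset abc being ordered as acb.

```java
import java.util.*;

// Sketch of the item ordering for succinct constraints in a stream: items are
// appended to their group's order as they first appear, and every itemset is
// ordered by concatenating the Iv order with the Is order.
class SuccinctOrder {
    private final List<String> ivOrder = new ArrayList<>(); // items violating the constraint alone
    private final List<String> isOrder = new ArrayList<>(); // items satisfying it alone

    // Register a newly seen item in its group, after the items already known.
    void register(String item, boolean satisfiesAlone) {
        (satisfiesAlone ? isOrder : ivOrder).add(item);
    }

    // Order an itemset: all items from Iv first (for CSAM), then items from Is.
    List<String> order(Set<String> itemset) {
        List<String> out = new ArrayList<>();
        for (String it : ivOrder) if (itemset.contains(it)) out.add(it);
        for (String it : isOrder) if (itemset.contains(it)) out.add(it);
        return out;
    }

    static String demo() {
        SuccinctOrder o = new SuccinctOrder();
        o.register("a", false);  // Iv = {a}
        o.register("b", true);   // Is = {b}
        o.register("c", false);  // new item joins Iv; relative order of a, b unchanged
        return String.join("", o.order(new HashSet<>(Arrays.asList("a", "b", "c"))));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints acb
    }
}
```

Because c is appended after a inside its group, itemsets without c (such as ab) keep their previous order, which is the stability property the text relies on.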
Prefix-Monotonicity:
Prefix-monotone constraints can only be treated as AM (CPAM ) or M (CPM ) constraints if items are
ordered in a particular way. Answering question 1: as for succinct constraints, this order is not a
problem, as long as the relative order of existing items does not change.
Therefore, we just need to follow the same approach as CoPT.
Sort the itemsets according to the correct order before inserting them in the pattern-tree, and
apply the CAM or CM strategy, respectively.
Mixed-Monotonicity:
As for mixed-monotone constraints (CMix), the answer to questions 1 and 2 follow the same reasoning
as explained above for prefix-monotone and succinct constraints.
In this sense, we can also apply CoPT.
Divide the items, as they appear, into two groups: anti-monotone IAM and monotone IM , and
put IM before IAM in the tree. Then start with the CM strategy, until a node satisfies it, or a
node from IAM appears. From that node on, apply the CAM strategy to all of its supersets
(children) and prune invalid nodes from its sub-tree.
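As a concrete case, consider the mixed constraint sum(X) ≥ v over items whose values may be negative (as in the dataset variant used later in the experiments): positive items can only raise the sum (the monotone group IM), while negative items can only lower it (the anti-monotone group IAM). The required ordering can be sketched as follows, with illustrative names:

```java
import java.util.*;

// Sketch of the item partition for a mixed constraint such as sum(X) >= v
// over possibly negative integer items: nonnegative items form the monotone
// group IM and are placed before the anti-monotone group IAM of negative items.
class MixedOrder {
    static List<Integer> order(List<Integer> itemset) {
        List<Integer> im = new ArrayList<>();   // IM: items that can only help the sum
        List<Integer> iam = new ArrayList<>();  // IAM: items that can only hurt it
        for (int it : itemset) (it >= 0 ? im : iam).add(it);
        List<Integer> out = new ArrayList<>(im);
        out.addAll(iam);                        // IM before IAM in the tree
        return out;
    }

    static String demo() {
        return order(Arrays.asList(-3, 5, 2, -1)).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [5, 2, -3, -1]
    }
}
```

With this order, a branch starts in IM (where the CM strategy applies) and switches permanently to IAM (where the CAM strategy applies), which is exactly what the mixed strategy needs.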
5.2.3 Algorithm CoPT4Streams
Based on the discussion above, we propose an extension of CoPT, called CoPT4Streams (Constraint
Pushing into a Pattern-Tree for Streams), that is able to efficiently and effectively push any constraint
into a pattern-tree, when mining data streams.
The idea is to run CoPT4Streams over the pattern-tree resulting from the mining of each batch, and
to use the resulting smaller tree to mine the next batches. By doing this, the algorithm is able to filter
what is really interesting for the users, and to keep smaller summary structures, which results in
improvements in both memory and time needed, as well as in the number of patterns returned.
Since constraint satisfaction is permanent, we can perform an extra optimization (besides using con-
straint properties) and compute the satisfaction of each node only once, e.g. by keeping a flag in each
node indicating whether it satisfies or violates the constraint. Thus, we can avoid re-checking the
constraint for nodes that remain in the tree from one batch to another (nodes closer to the root).
Essentially, to push a constraint, CoPT4Streams works as follows. For each batch, each approximate
pattern discovered by the streaming algorithm is ordered according to the order of items for that
constraint, if one exists, and inserted in the tree (if there is no such order, items are put in the
pattern-tree in support-descending order [HPY00]).
At each batch boundary, we can push the constraint C into the pattern-tree by scanning the tree
according to the constraint property. For each node, if the node is new in the tree (i.e. it was never
checked for the constraint), then, in the case of succinct or mixed constraints, we can first see whether
the item in the node belongs to the second group of items. If so, the node can be discarded (the
constraint was not satisfied by the first group of items), along with its children. Otherwise, and for the
other types of constraints, we check the constraint (and store the result in the satisfaction flag of the
node).
When we know the result of the constraint checking:
1. If the itemset corresponding to this node satisfies C:
(a) C is mixed and we can change the strategy to AM ;
(b) C is monotonic and no child needs testing; or
(c) C is succinct AM , and also no child needs testing (only the first level of the tree is checked).
2. If the itemset violates C, it is not a pattern, and, if C is AM (including SAM and PAM), we can
prune the tree from here.
After checking the constraint, if the node was not pruned, we can test its children. Finally, after
pushing C into the children, if the node is not a pattern and is a leaf, we can remove it. Note that this
final node pruning is performed for every constraint, even for constraints with no “nice” properties; in
this latter case, however, all nodes need to be tested.
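The per-node satisfaction flag can be sketched as follows (the concrete node layout is an assumption of the sketch): because satisfaction is permanent across batches, a node that survives from one batch to the next never pays for a second constraint evaluation.

```java
// Sketch of the cached satisfaction flag kept in each pattern-tree node.
// Since constraint parameters are fixed, the verdict computed in the first
// batch where the node appears remains valid in every later batch.
class FlaggedNode {
    enum Flag { UNKNOWN, SATISFIES, VIOLATES }

    private Flag flag = Flag.UNKNOWN;
    int checksPerformed = 0;

    // 'evaluate' stands for the (possibly expensive) constraint evaluation on
    // the itemset ending at this node; it runs at most once per node.
    boolean check(boolean evaluate) {
        if (flag == Flag.UNKNOWN) {            // first batch this node is seen in
            checksPerformed++;
            flag = evaluate ? Flag.SATISFIES : Flag.VIOLATES;
        }
        return flag == Flag.SATISFIES;         // later batches reuse the cached verdict
    }

    static int demo() {
        FlaggedNode n = new FlaggedNode();
        for (int batch = 0; batch < 100; batch++) n.check(true);
        return n.checksPerformed;              // the constraint was evaluated only once
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1
    }
}
```

Nodes close to the root tend to persist across many batches, so this is where the saving is largest.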
5.2.4 Performance Evaluation
The goal of these experiments is to analyze the behavior of our algorithm in the presence of a data stream
and all types of constraints, and to show that CoPT4Streams is able to effectively and efficiently push
them into a pattern-tree at each batch, taking advantage of their properties.
In these experiments, similar to the experiments with CoPT, we use a database automatically gen-
erated by the program developed at IBM Almaden Research Center [AS94]. The dataset has 100k
transactions, with an average of 10 items per transaction and a domain of 1000 items (with values from
zero to 1000). In addition, in order to test the mixed-monotone constraint, we consider an equivalent
dataset but with negative values (by making values vary from −500 to 500).
Recall that the higher the selectivity, the more we can filter and the fewer patterns are returned;
conversely, the lower the selectivity, the more patterns need to be kept and returned (and the closer we
get to the problems of unconstrained techniques). Therefore, we test CoPT4Streams with several
constraints with different selectivities, varying from 10% to 90%.
We also tested several minimum supports and errors, and since the results are consistent, we present the
results for a support of 0.1% and an error of 0.01% (a common way to define the error is ε = 0.1σ).
The results presented correspond to the average of several runs with different constraints of equivalent
selectivity. Also, as a baseline, we test our algorithm against CoPT4Streams+, a version that checks all
nodes for the constraints (i.e. that does not take constraint properties into account).
The data stream algorithm used was SimpleFP-Stream (a simplification of FP-Streaming [HPY00],
presented and used in Chapter 3 for evaluating StarFP-Stream). It was chosen because it is an efficient
algorithm for single-table data streams that does not suffer from the candidate generation problem,
and keeps current patterns in a pattern-tree. The size of each batch is defined by |B| = 1/ε, which
corresponds to 100 batches of 1000 transactions each. The computer used to run the experiments
was an Intel Core i7 CPU at 2GHz (Quad Core), with 8GB of RAM, running Mac OS X Server 10.7.5,
and the algorithm was implemented in Java (JVM version 1.6.0_37).
By definition, data streaming techniques return more patterns than traditional algorithms for static
datasets, and the higher the error allowed, the more patterns are returned and the less accurate they
are. By incorporating constraints into data streams, we can filter not only the patterns returned, but
also the patterns that must be kept in memory, improving the performance of the algorithms in terms
of time, memory and results.
Experimental Results
We first analyze the average size of the pruned pattern-tree. When applying constraints, more itemsets
can be discarded, and therefore the pattern-tree is smaller than in an unconstrained environment. In
turn, a smaller pattern-tree in every batch may have an impact on the time needed to update the tree
and on the number of constraint checks the algorithm needs to make. Remember that the update time
is perceived as the time needed to process one batch of transactions until the complete update of the
pattern-tree (Section 3.4.2). Since the trends are the same whether a constraint is AM or M, figs. 5.7
to 5.9 show the average results in the presence of AM (an average over AM, SAM and PAM) and M
(an average over M, SM and PM) constraints. The only difference is that, in the unconstrained case, as
well as for the simple AM and M constraints, there is no need to sort the items in the patterns.
On the other hand, succinct, prefix- and mixed-monotone constraints require that items are put in the
pattern-tree sorted according to some specific order. This means that all itemsets must be sorted
beforehand, which results in a time overhead that depends on that order.

Figure 5.7: Average size of the pattern-tree, per batch, after pushing the constraint.

Figure 5.8: Average time needed per batch to update the pattern-tree.

Figure 5.9: Average number of constraint checks per batch.
As expected, as the selectivity increases, more itemsets can be removed from the tree, and therefore the
pattern-tree is smaller, as is the time needed to update it. We can also confirm in fig. 5.7 that AM
constraints allow us to prune many more itemsets than M constraints, leading to much smaller pattern-
trees. This is explained by the fact that itemsets that violate M constraints but have supersets that
satisfy them cannot be discarded from the tree. For similar reasons, AM constraints need, on average,
less time to update the pattern-tree than M constraints. Fig. 5.8 also shows that pushing AM or M
constraints into the pattern-tree results in a decrease of the update time, even when the selectivity is
low. Since CoPT4Streams+ needs to check all nodes for the constraint, it needs somewhat more time
to update the pattern-tree.
In fig. 5.9, we analyze the average number of constraint checks. We can state that pushing constraints
is always better, even with the naive approach, CoPT4Streams+, due to the smaller pattern-trees carried
from one batch to another. Nevertheless, taking constraint properties into account to avoid constraint
checks (CoPT4Streams) requires significantly fewer constraint checks. It is interesting to see that the
trends are the same for both the static (Section 5.1.4) and streaming cases. As the selectivity increases,
the number of constraint checks for AM constraints decreases, since the number of itemsets that can be
discarded increases. On the contrary, for M constraints, the number of tests increases along with the
selectivity. This happens because the M strategy only stops checking when itemsets satisfy the
constraint, and if there are more items that violate it, more itemsets need to be tested.
The behavior of mixed constraints is consistent with the trends presented above: pushing them into
the pattern-trees results in much smaller trees, and therefore fewer constraint checks and less update
time, compared with both the unconstrained and the CoPT4Streams+ algorithms. As the selectivity
increases, the number of patterns in the trees decreases, as does the time needed to process them. The
number of constraint checks tends to be constant, independently of the selectivity of the constraints.
5.2.5 Discussion and Conclusions
In this section, we analyzed a set of strategies for pushing constraints into stream pattern mining, through
the use of the efficient pattern-tree structure. These strategies take advantage of constraint properties, so
that we can filter out, as early as possible, the frequent itemsets that do not satisfy each constraint, and
avoid unnecessary tests. Doing this for each batch of transactions greatly decreases the size of the
pattern-trees that need to be maintained in this streaming environment, and therefore helps to focus the
pattern mining task and to return far fewer, but more interesting, results. We also propose a general
algorithm, named CoPT4Streams, that combines the defined strategies and is able to dynamically push
any constraint into a pattern-tree, while still taking advantage of constraint properties.
Experimental results show that the algorithm is effective and efficient. The pattern-trees maintained
are much smaller, which generally results in less time needed. It also checks far fewer nodes and needs
less time than an approach that does not take constraint properties into account.
Despite the benefits of CoPT4Streams, like CoPT it is a post-processing approach (applied after the
processing of each batch), which requires that an unconstrained algorithm run first to discover all possible
frequent patterns. This usually takes considerable time, and results in a large quantity of frequent
itemsets that need to be put in the pattern-tree and evaluated again later on. A more balanced approach
is to adapt the strategies proposed here to filter itemsets during the actual discovery process.
An important contribution of CoPT4Streams is that it is able to push constraints into the same
pattern-tree structure that is used by our algorithm StarFP-Stream, and to maintain it in the streaming
environment. However, it cannot be directly applied to our multi-dimensional domain, since there are
differences in the content of the pattern-trees. While in the traditional case we have itemsets that
correspond to transactions of some entity, in the case of a star schema we have transactions of more than
one type of entity, and we are in the presence of both transactional and non-transactional data. This
requires some adaptations and a deeper analysis of these differences.
5.3 Towards the Incorporation of Constraints into
Multi-Dimensional Mining
As seen throughout this dissertation, multi-relational pattern mining algorithms are able to directly mine
more than one table, and to find patterns that relate the characteristics of all tables. However, they are
still not able to push constraints into the discovery process. Conversely, although constrained mining
algorithms are able to incorporate constraints to deliver more interesting results, they cannot deal with
more than one table. There is therefore a need for the integration of these two areas of pattern mining.
This integration is not straightforward, since the two approaches view and treat data differently.
On the one hand, most of the existing constrained techniques are designed for mining transactional data.
On the other hand, in the case of a star schema, we are dealing with two types of data: transactional
and non-transactional. While the fact table records transactional data (the business events), dimensions
store non-transactional data (the characteristics of business entities). Since there are differences in mining
these two kinds of data tables, existing constrained algorithms cannot be directly used over star schemas,
and existing multi-relational algorithms cannot be directly used for pushing constraints.
To the best of our knowledge, there is no work that makes this integration. Hence, we discuss in this
section some naturally arising questions:
• Is it possible to integrate these two areas of pattern mining?
• What are the differences and emerging challenges?
• Can we use traditional constraints in this multi-dimensional environment?
• And finally, can existing algorithms be applied or adapted to find frequent constrained patterns in
a star schema? If so, how?
We argue that it is possible to combine these two paradigms, and we answer these questions in the
course of this section.
We first describe the differences in mining transactional and non-transactional data, and then how
these differences can be overcome. We then discuss how constraints may be interpreted in this multi-
dimensional domain, by proposing Star Constraints, and also how they can be introduced into the mining
process.
5.3.1 Transactional vs. Non-Transactional Data
Mining patterns on transactional and non-transactional data is different, but those differences stem only
from the interpretation and meaning of items and patterns. Fig. 5.10, along with Table 5.1, shows these
variations. While, in the transactional case, each item corresponds to an entity (e.g. a product), in the
non-transactional case we are mining pairs (attribute, value) (e.g. price = 30€). This means that patterns
have different interpretations in each case: sets of entities frequently transacted together, or sets of
characteristics common to a frequent number of entities, for the transactional and non-transactional
cases, respectively.
These differences are not visible in traditional pattern mining, since algorithms work with items and
itemsets, independently of their meaning. Therefore, any algorithm is able to run over both data types
and discover the existing patterns. However, the distinction matters in a constrained environment, since
we can restrict both entities and their attributes, and it is expected that every element of a pattern can
be tested against the constraint.
Let us analyze this in more detail. Assume a transactional table (e.g. Fig. 5.10, left). Each element of
a pattern is an entity, and therefore we can check the value of any attribute for every element.
Figure 5.10: A transactional data table (left), modeling the products that were bought together in the same
transaction, and the associated non-transactional data table (right), describing the characteristics of those
products.
Table 5.1: Differences in mining transactional and non-transactional data.

            Transactional                              Non-Transactional
Cell/Item   Entity                                     (Attribute = Value)
            e.g. p1                                    e.g. (Price=30€)
Row         The set of entities transacted at the      The set of characteristics of one single
            same time                                  entity
Itemset     A set of entities transacted together      A set of characteristics of some entity
            e.g. X = {p1, p4}                          e.g. X = {(Price=30€), (Color=Black)}
Pattern     A set of entities transacted together      A set of characteristics shared by a
            frequently, e.g. X ∧ sup(X) ≥ σ × N        frequent number of entities
This means, for example, that if we only want products with price lower than 20€, we can just test the
price of every product in each pattern and eliminate those patterns that contain any product with a
price above the maximum (note that this can also be done during the discovery process, in a similar way,
instead of as a post-processing step). Using the example in Fig. 5.10, a pattern like {p1, p2} is rejected,
because Price(p1) > 20€, but a pattern such as {p2, p3, p4} is accepted, since all prices satisfy the
constraint.
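This element-wise test can be sketched directly, with the prices taken from Fig. 5.10 (the class and helper names are illustrative):

```java
import java.util.*;

// Post-processing check for a transactional pattern: every product in the
// pattern must have Price < 20 EUR (prices from Fig. 5.10).
class ProductFilter {
    static final Map<String, Integer> PRICE = new HashMap<>();
    static {
        PRICE.put("p1", 30); PRICE.put("p2", 10); PRICE.put("p3", 5);
        PRICE.put("p4", 15); PRICE.put("p5", 20);
    }

    // A single entity violating the attribute constraint rejects the pattern.
    static boolean accept(List<String> pattern, int maxPrice) {
        for (String p : pattern)
            if (PRICE.get(p) >= maxPrice) return false;
        return true;
    }

    static String demo() {
        return accept(Arrays.asList("p1", "p2"), 20) + " "
             + accept(Arrays.asList("p2", "p3", "p4"), 20);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // false true: {p1,p2} rejected, {p2,p3,p4} accepted
    }
}
```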
If we assume a non-transactional data table, we cannot reason in the same way, because the elements of
patterns are attributes, not entities. Following the example above, but using the non-transactional table,
we could have frequent patterns like {(Price = 5€), (Color = Blue)} and {(Color = Black)} (blue 5€
products and black products are frequent). If we wanted to apply the same constraint, Price < 20€, we
could say that the first pattern satisfies the constraint (since it is an intersection of 5€ products with
others), but we could not guarantee that the second one resulted only from processing products with
price lower than 20€. This means that pushing constraints into non-transactional data cannot be done,
as simply as before, as a post-processing step. In this case, when restricting some attribute, the entire
rows (products) where that attribute does not satisfy the constraint should not be considered for
support, to guarantee that all attributes in patterns result only from the processing of valid entries. Note
also that we are not mining entities that were transacted at the same time, and therefore constraints
like sum(X.attribute) ≥ v (or other aggregate constraints) do not make sense when mining only non-
transactional data.
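The difference can be sketched as follows: the support of (Color = Black) is counted only over rows that pass the attribute constraint, so it cannot be inflated by invalid products (data from Fig. 5.10; the integer encoding of colors is an assumption of the sketch):

```java
// Sketch of constraint pushing over non-transactional data: rows violating the
// attribute constraint are dropped BEFORE support counting, so every surviving
// (attribute, value) pair comes only from valid entities.
class RowFilter {
    // rows of the Product dimension: {price, colorCode} with 0=Black, 1=Red, 2=Blue
    static final int[][] ROWS = {{30, 0}, {10, 1}, {5, 2}, {15, 0}, {20, 2}};

    // Support of (Color = Black) counted only over rows with Price < maxPrice.
    static int blackSupport(int maxPrice) {
        int support = 0;
        for (int[] row : ROWS)
            if (row[0] < maxPrice && row[1] == 0) support++;
        return support;
    }

    public static void main(String[] args) {
        System.out.println(blackSupport(20));                // 1: only p4 (15 EUR) counts
        System.out.println(blackSupport(Integer.MAX_VALUE)); // 2: unconstrained, p1 and p4
    }
}
```

Post-processing the pattern {(Color = Black)} could not tell these two supports apart, which is exactly why the filtering must happen before counting.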
5.3.2 Constraints in Star Schemas
In the case of a star schema, we have both data types in synergy. An example of a star schema,
corresponding to the example above, is shown in Fig. 5.11.
Figure 5.11: A star schema, containing both transactional (fact table) and non-transactional data (dimensions
Product and Customer).
In a star, we have more than one entity type (e.g. products and customers), represented by each
dimension. Each dimension describes the set of entities of that type (e.g. each product) through the
use of some attributes (e.g. price and color). The fact table relates the entities of each dimension with
each other and with a set of measures (e.g. final price and quantity) that characterize the corresponding
transaction (e.g. the sale). In this sense, we can define a set of Star Constraints, composed of a constraint
for each of these aspects: entity type, entity, attribute and measure.
Let dim(it) be a mapping function that returns the dimension to which item it belongs (e.g. p1
belongs to dimension Product, therefore dim(p1) = Product). Let also dim.attr(it) be a function that
gives the value of attribute attr (which belongs to dimension dim) associated with item it (e.g.
Product.Price(p1) returns 30€, the price of product p1).
Constraints over entity type (dimension constraints)
Since we have more than one entity type, we may be interested in the presence or absence of certain
types. So, for all patterns X, the dimension of each element should be valid:
C(X) = (∀ el ∈ X . dim(el) ∈ {D1, ..., Dj}). (or ∉)
For example, we may only want to mine products and customers, and ignore other dimensions (i.e.
dim(el) ∈ {Product, Customer}). Therefore, a pattern like {p1, c2} (customer c2 buys product p1) is
accepted, but {p1, c2, t1} (customer c2 buys product p1 in territory t1) would not be, because it contains
one element that does not belong to the accepted dimensions. Note that, if {p1, c2, t1} is frequent, its
subsets are also frequent, such as {p1, c2}, which will be returned as patterns. Therefore, by eliminating
the first itemset because it is not interesting, we will not lose interesting patterns.
Following our proposed framework, dimension constraints are succinct constraints, since we know, at
each point in time, all possible accepted patterns, based on the current alphabet. They can also be seen
as conceptual constraints, since they restrict the dimension associated with items. And finally, they are
designed for multi-dimensional datasets.
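As a minimal sketch, a dimension constraint reduces to a membership test over every element of a pattern (hypothetical Python names; `dim_of` stands in for the dim mapping function):

```python
def satisfies_dimension_constraint(pattern, dim_of, accepted):
    """C(X): every element of the pattern X must belong to an
    accepted dimension."""
    return all(dim_of[el] in accepted for el in pattern)

# Items from the running example and their dimensions.
dim_of = {"p1": "Product", "c2": "Customer", "t1": "SalesTerritory"}
accepted = {"Product", "Customer"}

print(satisfies_dimension_constraint({"p1", "c2"}, dim_of, accepted))        # True
print(satisfies_dimension_constraint({"p1", "c2", "t1"}, dim_of, accepted))  # False
```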
Constraints over entities
Additionally, we may restrict the presence of some specific entities (or instances) of one type. These
constraints can be made over each dimension or over the fact table, since both have information about
entities:
C(X) = ({en1, ..., enj} ⊆ X). (or ⊈)
For example, we may be interested only in the specific products p1 and p2. Hence we could define the
constraint {p1, p2} ⊆ X to filter out all patterns that do not contain both desired products.
To situate entity constraints in the framework, they correspond to the traditional item constraints
that have a succinct property, and they are applied to star schemas.
Constraints over attributes
Regarding the attributes of dimensions, we may want to limit the value of some attribute (of one dimen-
sion) for each entity:
C(X) = (∀ el ∈ X . dim.attr(el) ≤ v). (or ≥, =, ≠)
For example, we may only want customers under 30 years old, and therefore, all customers in patterns
must satisfy the constraint Customer.Age(el) < 30.
Note that, if the attribute is not numeric, we may similarly define the constraint as:
C(X) = (∀ el ∈ X . dim.attr(el) ∈ {v1, ..., vj}). (or ∉)
E.g. if we want products with color black or blue, i.e. Product.Color(el) ∈ {Black, Blue}.
These simple attribute constraints are value constraints, and are also succinct.
We may also want to limit the aggregate value of some attribute, for one set of entities:
C(X) = (agg(dim.attr(X)) ≤ v). (or ≥, =, ≠)
where the aggregate function ∈ {sum, avg, min, max, ...}, and dim.attr, applied to an itemset X, returns
all values for that attribute attr, for all items in X.
For example, if we are interested in patterns resulting from sales where the sum of their products’
price is less than 20 €, we could use the constraint sum(Product.Price(X)) < 20 €. A pattern such as
{p1, p2} is not accepted (the sum of the prices is 40 €), but {p2, p3} is (the sum is 15 €).
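This aggregate attribute constraint can be sketched as follows (a hypothetical Python illustration; `prices` stands in for Product.Price):

```python
# Product prices from the running example (in euros).
prices = {"p1": 30, "p2": 10, "p3": 5, "p4": 15, "p5": 20}

def satisfies_sum_constraint(pattern, values, limit):
    """C(X) = (sum(dim.attr(X)) < limit): the summed attribute values
    of all items in the pattern must stay below the limit."""
    return sum(values[it] for it in pattern) < limit

print(satisfies_sum_constraint({"p1", "p2"}, prices, 20))  # False (sum is 40)
print(satisfies_sum_constraint({"p2", "p3"}, prices, 20))  # True  (sum is 15)
```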
These constraints correspond to aggregate constraints; they also have nice properties, depending
on the aggregate function, but should only be applied to the transactional data.
Constraints over measures (fact constraints)
Measures are (mostly) numeric values that characterize the business events or transactions. Therefore,
fact constraints can only be applied to the transactional data in the fact table. And in order to incorporate
them, we need to be able to track back the transactions that originated each pattern, i.e. the transactions
that give support to a pattern (e.g. pattern {p2} occurred in orders 1 and 2, but pattern {p1, p2} occurred
only on order 1). By having these transactions, we can retrieve the correct value of a measure.
Let us denote measure(trans, it) a function that retrieves from the fact table the value for measure
measure corresponding to transaction trans and item it (following the same example, FPrice(1, p2) =
10 €, since the final price of product p2 in sales order 1 was 10 €, and similarly, FPrice(2, p2) = 9 €).
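The measure function can be sketched as a lookup keyed by transaction and item (a hypothetical in-memory fact table with the values of the example):

```python
# Hypothetical in-memory fact table: (order, product) -> measures.
FACTS = {
    (1, "p1"): {"FPrice": 25, "Qnt": 1},
    (1, "p2"): {"FPrice": 10, "Qnt": 1},
    (1, "p3"): {"FPrice": 5,  "Qnt": 3},
    (2, "p2"): {"FPrice": 9,  "Qnt": 2},
}

def measure(trans, item, name):
    """measure(trans, it): the value of measure `name` for the fact
    relating transaction `trans` and item `it`."""
    return FACTS[(trans, item)][name]

print(measure(1, "p2", "FPrice"))  # 10
print(measure(2, "p2", "FPrice"))  # 9
```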
Using fact constraints, we can limit both the value and the aggregate value of some measure. In the
first case, the value of the measure must be valid for all transactions that gave support to a pattern:
C(X) = (∀ trans, el ∈ X . measure(trans, el) ≤ v). (or ≥, =, ≠)
As an example, we may only want sales of products that were bought in sets of 2 or more, at the
same time (Qnt(trans, el) ≥ 2).
In the second case, the aggregate value of the measure in question must be valid for the set of
transactions that gives support to the pattern:
C(X) = (∀ trans ∈ X . agg(measure(trans)) ≤ v), (or ≥, =, ≠)
with measure(trans) the set of measure values for all elements of the pattern.
We may be interested only in sales of more than 4 products at the same time (sum(Qnt(trans)) ≥ 4).
Measure constraints are value or aggregate constraints, that are designed for the measure attribute
present in the fact table. They also have nice properties, with a reasoning similar to attribute constraints,
but they may only be applied to the transactional data in a star schema.
5.3.3 Pushing Star Constraints into Pattern Mining over Star Schemas
In order to push the above constraints into the discovery of frequent patterns over star schemas, we
may take different approaches: mine only one non-transactional data table (one dimension), mine only
transactional data (the fact table), or mine both data at the same time. Clearly, the last approach is
more difficult, but the one that fulfills the goal of multi-dimensional pattern mining.
We discuss below what is the difference between these approaches, namely, what patterns are expected
to be obtained, what constraints can be used, and how can they be incorporated in the search process,
as well as how existing algorithms should be adapted.
Mining one dimension
By mining one single dimension, we are mining its attributes, and therefore we are able to find common
characteristics of the respective business entity (e.g. most male customers are over 30 years old, or
blue products are usually cheaper than others). We can even go one step further, and use the fact table
to calculate the support of each entity before mining the dimension. This way, we can find the common
characteristics of the most transacted entities.
When mining one dimension, we can apply constraints over entities and constraints over the value of
this dimension’s attributes, and therefore limit the discovered characteristics to what is really interesting.
It does not make sense to apply entity type constraints, since we are mining only one dimension, neither
aggregate constraints, since there are no co-occurrences. We also cannot apply measure constraints,
because there are no transactions.
As explained in section 5.3.1, there is no algorithm designed for non-transactional data, like di-
mensions. However, and since these constraints (entity and attribute value) are mostly succinct anti-
monotonic, we may take a very simple approach and eliminate, from the beginning, all entities (rows)
that do not satisfy the constraint, and then apply any single table pattern mining algorithm. For example,
for dimension Product and for the attribute constraint Price < 20 €, we may eliminate all rows of
products with a higher price, as a pre-processing step, because no pattern with one of those products
will satisfy the constraint (here we want to know the common characteristics of cheap products).
The same can be done for entity constraints.
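This pre-processing strategy can be sketched as a simple row filter (hypothetical Python names; any single-table miner would then run on the result):

```python
def prefilter_dimension(rows, constraint):
    """Drop, up front, every entity (row) of the dimension that violates
    a succinct anti-monotonic constraint; any single table pattern
    mining algorithm can then be applied to what remains."""
    return [row for row in rows if constraint(row)]

# Product dimension rows from the running example (prices in euros).
product_rows = [
    {"Product": "p1", "Price": 30, "Color": "Black"},
    {"Product": "p2", "Price": 10, "Color": "Red"},
    {"Product": "p3", "Price": 5,  "Color": "Blue"},
]

cheap = prefilter_dimension(product_rows, lambda r: r["Price"] < 20)
print([r["Product"] for r in cheap])  # ['p2', 'p3']
```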
Mining the fact table
When mining the fact table, we are only mining transactional data, i.e. the entities transacted together.
This allows us to find the sets of entities that co-occur frequently. Existing constrained algorithms deal
with only one single entity type, and this means that all entities of a pattern belong to the same type,
and therefore all can be tested for the constraint. In the case of a fact table, entities in patterns may
belong to the same entity type (e.g. product p1 is usually bought along with product p2) or to different
types (e.g. customer c1 often buys product p1).
To apply constraints over the entity type, we may simply eliminate or ignore all entities that do not
satisfy the constraint. For example, if we only want to mine products, we may ignore other dimensions,
such as customers, and all entities in the fact table belonging to those dimensions.
Existing constrained algorithms can be used over the fact table for introducing entity and attribute
constraints. In this case, the constraint is checked for the entities that belong to the dimension of that
attribute, and the value for the attribute in question is retrieved in the corresponding dimension table.
Measure constraints are different from others, since instead of a property of an entity, they are a
property of a transaction. One hypothesis is to consider measures as entities, and use existing constrained
algorithms to discover co-occurrences of measures (like for attributes). By doing this, and considering
the quantity measure, we could find, for example, that whoever buys 4 units of the same product usually
buys 3 units of another product, in the same sale.
Another approach would be to associate to each pattern the transactions that gave it support (e.g. a
pattern {p2} occurred in the sale orders {1, 2, ...}), as is being done in areas like genome analysis [MPP07,
SV11]. By doing this, we could apply the measure constraints by retrieving the value of the measure from
the fact table, based on the transaction numbers and entities, and checking if it satisfies the constraint.
Using the example, we could find that whoever buys 4 units of product p1 usually buys 3 units of product p2. By
keeping the transactions, it would be possible to incorporate measure constraints, either during the
discovery process, or as a post-processing step, with some adaptations of existing traditional constrained
algorithms. However, this approach goes against the philosophy of data streams, in which records can
only be seen once, and therefore could not be applied for finding patterns on growing star schemas.
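The idea of associating each pattern with its supporting transactions can be sketched with tid-lists (hypothetical Python names, using the orders of the running example):

```python
from collections import defaultdict

def tid_lists(transactions):
    """Map each item to the set of transaction ids that support it.
    The supporting transactions of an itemset are then the
    intersection of its items' tid-lists."""
    tids = defaultdict(set)
    for tid, items in transactions.items():
        for it in items:
            tids[it].add(tid)
    return tids

orders = {1: {"p1", "p2", "p3"}, 2: {"p2", "p4"}, 3: {"p1", "p3", "p4", "p5"}}
tids = tid_lists(orders)

print(sorted(tids["p2"]))               # [1, 2] -> {p2} occurred in orders 1 and 2
print(sorted(tids["p1"] & tids["p2"]))  # [1]    -> {p1, p2} occurred only in order 1
```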
Mining the star
By mining both dimensions and fact table, we are able to discover how the common characteristics of
different entity types relate to each other, i.e. how the elements of both dimensions co-occur. We could
find, for example, that blue products are often bought by male customers.
The constraints described above may all be used when mining the whole star. However, here we are
not mining entities but attributes, which means that we cannot directly apply existing constrained
algorithms and strategies. Still, we can look at each type of constraint and devise a way of pushing it.
For entity type constraints, we can simply eliminate or ignore the dimensions in question. This can be
done as a post-processing step, by discarding all itemsets that contain some item (a pair attribute–value)
whose attribute belongs to a non-accepted dimension. Or it can be done beforehand, by only exploring
the entities of the fact table that belong to accepted dimensions.
For entity constraints, we should discard and not count for support all rows of the fact table that
contain invalid entities. This could be done during the first steps of the discovery process: for each fact,
only count the support for its entities if they are all valid (or else, discard the entire fact). Introducing
these constraints as a post-processing step is not straightforward. Since we process all facts, we will
probably have patterns that result from the processing of invalid entities. We would need, for example,
to keep track of the entities that support each item in each pattern, but it requires much extra processing
and memory. We could also keep track of the facts that give support to each pattern, and when testing
the constraint, check if the entities transacted in those facts are valid. If not, discard the pattern. Still,
this approach is not appropriate for star streams, and even for static star schemas, it is less efficient than
incorporating entity constraints during the mining process.
Limiting the value of some attribute is trickier, since patterns are sets of pairs (attribute, value), that
may mix different attributes of an entity, as well as attributes of different entity types. So, if we want
to limit the value of some attribute of a specific dimension, all entities of that dimension whose attribute
value does not satisfy the constraint should be ignored. This means that both the rows corresponding to
those entities in the dimension in question, as well as the rows of the fact table that contain them, should
not count for support. As an example, if we have the constraint Customer.Age(el) < 30, customer c2
violates the constraint, and therefore no attribute of c2 should count for support, as well as no order
made by this customer. We could, for example, when mining each fact, not explore entities whose
constrained attribute violates the constraint. As a post-processing step, it requires keeping the facts
that support each pattern and, when testing the constraint, checking whether the entities of those facts
satisfy it (e.g. whether all customers of those facts are under 30 years old). If we want to constrain the
aggregate value of some attribute, we can take the same post-processing approach.
Finally, measure constraints can be integrated in a way similar to that described in the previous section,
since they refer to a transaction and not to an entity.
5.3.4 Discussion
We can now answer the questions posed in the beginning of the section: Is it possible to integrate
constrained mining with star schemas? Yes. We showed here that it is possible and important to
integrate the multi-dimensional and constrained paradigms. By combining these two areas, it is possible
to improve the results of multi-dimensional techniques, not only by limiting the number of patterns, but
also by focusing these results on user needs, defined by the means of constraints.
What are the emerging challenges? Each one of these areas has its own set of challenges, and therefore
joining these paradigms results in a mix of them. The main ones are the fact that we are dealing with
more than one table at the same time, and that a star schema usually contains both transactional and
non-transactional data. This hinders the use of existing constrained algorithms for mining the whole star as
one, as well as the adaptation of multi-dimensional algorithms for pushing existing constraints.
Can we use traditional constraints in this multi-dimensional environment? Traditional constraints
are constraints over entities (or over their values for some attribute). Since we always have transactional
data in a star (the fact table), representing the transactions of entities, we can also apply these constraints.
However, we are also often in the presence of more than one entity type, and of non-transactional data,
which means that new constraints need to be defined for this environment. In this thesis we proposed
four types of constraints: entity type, entity, attribute and measure constraints.
And finally, can existing algorithms be applied or adapted to find frequent constrained patterns in a
star schema? If so, how? Since existing algorithms for constrained pattern mining are only able to deal
with one transactional table, they can be applied to the fact table, with some adaptations to deal with
different entity types. However, they cannot be used to push constraints into the whole star. Even so,
depending on the constraint, there are some adaptations that can be made. The extension of CoPT
and CoPT4Streams to incorporate constraints in multi-dimensional mining is possible, but requires some
major adaptations: it needs to be able to track which dimension each item belongs to, as well as to retrieve
the values for attributes from the corresponding dimensions. Since they push constraints as a post-
processing step, they also need to keep a record of what transactions gave support to each pattern, so
that it guarantees that they do not result from the processing of invalid entities. This extra storage
of transactions has been applied to other areas [MPP07], but it is still not efficient, resulting not only
in extra memory, but also in extra processing time for the discovery process. More importantly, this
approach cannot even be applied to streaming data, since transactions can only be seen once. Also,
as stated above, they require an unconstrained pattern mining run over the data first, storing the
patterns in a pattern-tree, which means that the patterns have to be processed twice. Nevertheless, they
take constraint properties into account, and therefore they do not need to test all patterns this second
time.
In this sense, there is a need for a new algorithm that can somehow use the strategies defined through-
out this chapter for an efficient incorporation of constraints into the mining of multiple tables. In the
next section, we propose a new algorithm, that adapts StarFP-Stream (Chapter 3) to incorporate star
constraints into the pattern mining of large and growing star schemas.
5.4 Mining Stars with Constraints
As seen above, the application of constraints into the mining of the whole star depends heavily on the
transactions that give support to patterns. Therefore, especially in a streaming environment, it is not
feasible to verify their satisfiability in a post-processing step, since algorithms would have to keep the
transactions (the facts) along with the patterns, for further access.
Furthermore, in the traditional paradigm of basket analysis, items in patterns correspond to entities,
all of the same type (e.g. all products). The bulk of the work on constrained pattern mining is in fact
along this paradigm. However, when we move to a relational domain, we start having several dimensions,
and entities of different types (e.g. customers' characteristics, sellers, products, etc.). The star schema
is one of these cases, where items in patterns are pairs (attribute, value) that belong to an entity, from
some dimension, and thus each branch of the pattern-tree (i.e. a pattern) contains a mix of items, of
different dimensions. This makes it also difficult to apply the strategies of CoPT or CoPT4Streams, that
rely on the fact that items are all entities, and of the same type.
However, instead of constraining the items in patterns, what we want is to constrain the transactions
that support patterns. That is, if we are only interested in patterns related to a set of entities (entity
constraints), we can consider only the facts where those entities were transacted, and therefore other
entities will not appear in patterns. Similarly, if we are not interested in entities with particular
characteristics (attribute constraints), we can discard all transactions with invalid entities, so that they do not
appear in (or influence) the final results.
In fact, as mentioned before, most star constraints have succinct properties, and therefore they can
be incorporated more efficiently as a pre-processing step, or in this streaming case, during the processing
of each arriving transaction.
In this sense, we propose the algorithm Domain Driven Star FP-Stream, or D2StarFP-Stream, that is
an extension of Star FP-Stream (Section 3.3) that is able to incorporate star constraints over transactions
into the mining of the whole star.
We first formally define how to apply the Star Constraints over the transactions, and then present
the proposed constrained multi-dimensional algorithm. Finally, we also present a performance evaluation
over the AdventureWorks DW, and end with some discussion and conclusions.
5.4.1 Constraining Business Facts
In a star schema, a transaction corresponds to one business fact, which may or may not contain more than
one row in the fact table (depending on whether a degenerated dimension exists). Recall that one fact is one
single row in the fact table, and one business fact is the set of facts that correspond to the same business
transaction (e.g. sale order). Also, a fact is a set of foreign keys (entities), one for each dimension.
In the presence of a dimension constraint, the only thing that is needed is to ignore entities (and
items) from invalid dimensions. For example, if we are not interested in dimension SalesTerritory, when
processing a fact, we can simply ignore or discard all entities of that dimension. This is similar to perform
a full roll-up on the SalesTerritory dimension, in OLAP operations.
In the presence of any other star constraint C (entity, attribute or measure), defined in Section 5.3.2,
one fact is valid if its entities satisfy the constraint. That is:
Valid(fact) = (∀ en ∈ fact . C(en) = true)
But when we consider a business fact with more than one fact, we may want to constrain it in
three different ways – (1) to consider the whole business fact only if all facts satisfy the constraint, (2) to
consider the whole business fact if at least one satisfies, or (3) to consider just the facts of the business fact
that satisfy the constraint and ignore invalid ones (which is equivalent to perform a slice on a particular
dimension, on OLAP operations).
Let us define, as an example, a business fact with two sales, bf = {(p1, d, c, t), (p2, d, c, t)} (customer
c bought, on day d and store t, both products p1 and p2), where product p1 has (Color = "Blue") and
product p2 has (Color = "Red"). Let us have the following attribute constraint asking for products of
color blue: C(en) = (dim(en) = Product ⇒ color(en) = "Blue"). In this case, the first fact is valid:
Valid((p1, d, c, t)) = true, since the only entity of dimension Product is p1, and it has color blue (note
that the constraint C applied to entities of other dimensions is always true, since they are not the
ones being constrained. Nevertheless, in practice, there is no need to test them, since in a fact we know
in which column the entity of each specific dimension is). On the other hand, the second fact is invalid:
Valid((p2, d, c, t)) = false, since product p2 has a color different from blue.
We define the three following validity properties, and a function named getValidFacts that can be
applied to a business fact and returns the set of individual facts that should be considered for mining,
based on the validity property of the star constraint, in each case:
All if All Valid: getValidFacts(bf, ALL VALID) = {f ∈ bf | ∀ g ∈ bf . Valid(g) = true}
If all facts satisfy the constraint, the whole business fact is valid, and all facts should be considered.
This allows us to find patterns that involve only complete valid transactions.
Following the example above, we may be interested in sales that only included blue products, and
therefore, the business fact in question is not valid, and should be discarded, since it contains one
product of an invalid color. We would find, e.g. what types of customers only buy blue products,
or if there is a specific season where customers buy blue products.
All if One Valid: getValidFacts(bf, ONE VALID) = {f ∈ bf | ∃ g ∈ bf . Valid(g) = true}
If at least one fact satisfies the constraint, the whole business fact is valid, and all facts should be
considered. This allows us to find patterns that are frequent along with the valid entities, such as
what type of other entities are transacted along with the valid ones.
Using the same example, we may want sales that include at least one blue product, and therefore,
the business fact in question is valid, and both products p1 and p2 should be considered. We could
find, for example, what other types of products are bought along with blue products.
Only Valid: getValidFacts(bf, ONLY VALID) = {f ∈ bf | Valid(f) = true}
In this case, a business fact is always valid, unless it contains no transaction with a valid entity.
However, we should only consider the valid individual facts, i.e. the individual transactions of valid
entities. This allows the finding of patterns associated with the transactions of valid entities, such
as profiles of customers that buy specific types of products, or what is common for specific types
of customers.
Following the example, we are interested in sales of blue products, and thus the sale order is valid,
since it contains one sale of a blue product, and only this transaction should be considered. We
could discover what types of customers buy blue products, in general.
Note that, if there is no degenerate dimension (i.e. no aggregations of facts), each business fact is
composed of exactly one fact, and therefore there is no need to distinguish between the validity properties
over business facts (all three have the same result). The same happens when the constraint C is over an
entity that does not change within one business fact (e.g. the customer in a sale order is always the same).
In this case, all three validity properties will also result in the same set of valid facts. Using the example
above, for a constraint C(en) = (dim(en) = Customer ⇒ age(en) < 30), asking for customers under 30
years old, we only have to test the customer of the first fact (because it is the same in every fact). In this
case, if customer c is under 30 years old, the whole business fact should be considered, and discarded
otherwise.
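The three validity properties can be sketched as a single dispatch function (a hypothetical Python illustration of getValidFacts, using the blue-products example):

```python
ALL_VALID, ONE_VALID, ONLY_VALID = "ALL_VALID", "ONE_VALID", "ONLY_VALID"

def get_valid_facts(business_fact, valid, mode):
    """Return the facts of a business fact that should be considered
    for mining, according to the chosen validity property."""
    if mode == ALL_VALID:   # all facts valid, or discard the whole business fact
        return list(business_fact) if all(valid(f) for f in business_fact) else []
    if mode == ONE_VALID:   # one valid fact is enough to keep them all
        return list(business_fact) if any(valid(f) for f in business_fact) else []
    return [f for f in business_fact if valid(f)]  # ONLY_VALID: keep only valid facts

# Example from the text: p1 is blue, p2 is red; constrain to blue products.
color = {"p1": "Blue", "p2": "Red"}
bf = [("p1", "d", "c", "t"), ("p2", "d", "c", "t")]
valid = lambda fact: color[fact[0]] == "Blue"

print(get_valid_facts(bf, valid, ALL_VALID))   # [] (one product is not blue)
print(get_valid_facts(bf, valid, ONE_VALID))   # both facts kept
print(get_valid_facts(bf, valid, ONLY_VALID))  # only the p1 fact kept
```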
5.4.2 D2Star FP-Stream
In this section we propose a new algorithm, Domain Driven Star FP-Stream, or D2StarFP-Stream, that
is an extension of Star FP-Stream (proposed in Section 3.3) that is able to incorporate star constraints
over transactions into the mining of the whole star.
The main idea is to push star constraints as new business facts arrive, and build both the DimFP-Trees
and the StarFP-Tree only with valid ones (according to one of the validity properties defined above, over
business facts). By doing this, these trees will only have the content of valid transactions, and therefore
the global mining step (combining and processing these trees) can be performed as for the unconstrained
StarFP-Stream, with no change required. The final patterns will also satisfy the constraints, with no
checking needed, because: in the ALL VALID and ONLY VALID cases, patterns will not contain
invalid entities (they were discarded), and even if they are composed only of pairs (attribute, value) from
unconstrained dimensions, it is guaranteed that these came from valid transactions; in the ONE VALID
case, invalid entities are not discarded if there is one valid entity in the set, but patterns, as a
whole, will satisfy the constraint, because the set of entities from the constrained dimension satisfies it.
The pseudocode of the algorithm is presented in Algorithm 4.
Algorithm 4 D2StarFP-Stream Pseudocode
Input: Star Stream S, error rate ε, Star constraint C
Output: Approximate frequent items with threshold σ that satisfy the constraint C (and respective validity property), whenever the user asks
1: i = 1, |B| = 1/ε, N = 0, flist and ptree are empty
2: ValidDim ← getValidDimensions(S, C)  // Dimension constraints
3: initialize one DimFP-tree for each ValidDim to empty
4: for all arriving business fact bf = (tidD1, tidD2, ..., tidDn, m1, ..., mp) do
5:   N = N + 1
6:   bf = getValidFacts(bf, C)  // Application of the validity properties and of the other star constraints
7:   for all Dimension Dj in ValidDim do
8:     T ← transaction of Dj with tidDj
9:     insert T in the DimFP-treej
10:    flist ← append new items introduced by T
11:  if all business facts of Bi arrived then
12:    super-tree ← combineDimFP-trees(DimFP-trees, Bi)
13:    FP-Growth-for-streams(super-tree, ∅, ptree, i)
14:    discard the super-tree
15:    tail-pruning(ptree.Root, i)
16:    i = i + 1, initialize n DimFP-trees to empty
We can see in line 6 that, after receiving a business fact, the first thing to do is to check the validity
the whole transaction, based on the validity property of the star constraint. This depends, as seen in
Section 5.4.1, on the validity of the individual facts, which in turn depends on the satisfaction of the star
constraint C by the corresponding entities.
In this sense, the incorporation of each star constraint is performed as follows: For pushing a dimension
constraint, the algorithm just needs to ignore the entities in facts that belong to invalid dimensions, and
only needs to build the DimFP-Trees of valid ones. This is implemented in lines 2 (we know from the
beginning which are the valid dimensions), 3 and 7.
The testing of other star constraints (entity, attribute and measure) is performed while testing the
validity of business facts and facts (function getV alidFacts, line 6). For pushing entity constraints,
the algorithm only needs to test the entities in facts, against the accepted or unwanted entities. As
for the incorporation of attribute constraints, when checking the validity of facts, we need to test, for
all entities of the dimension being constrained, what is the value for that attribute. In this sense, the
algorithm, for each entity of that dimension, goes to the corresponding dimension table and checks the
value corresponding to that entity and attribute. Finally, for measure constraints, the algorithm does
not need to test any entity, only the measure values of each fact, and check if those values satisfy the
constraint. In the case of a constraint over the aggregated value of a measure, we only have to compute
the aggregation of the respective measure, for the whole business fact (e.g. sum of all quantities), and
check if the result satisfies the constraint.
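The check of a constraint over the aggregated value of a measure, for a whole business fact, can be sketched as follows (hypothetical Python names and fact layout):

```python
def satisfies_aggregate_measure(business_fact, measure_name, minimum):
    """Aggregate measure constraint: sum the measure over all facts of
    the business fact and compare the result against the threshold."""
    return sum(fact[measure_name] for fact in business_fact) >= minimum

# Sale order 1 from the running example: quantities 1, 1 and 3.
bf = [{"Product": "p1", "Qnt": 1},
      {"Product": "p2", "Qnt": 1},
      {"Product": "p3", "Qnt": 3}]

print(satisfies_aggregate_measure(bf, "Qnt", 4))  # True (total quantity is 5)
```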
5.4.3 Experimental Results
In order to test the performance of D2StarFP-Stream, we use the same experimental setup as for the eval-
uation of its unconstrained counterpart, StarFP-Stream (Section 3.4). And the goal of these experiments
is to compare both, and analyze whether adding the constraints to StarFP-Stream minimizes the bottleneck
of the size of the pattern-tree while, at the same time, improving the memory and time needed to
process each batch.
In summary, we tested the algorithms with a sample of the AdventureWorks 2008 Data Warehouse,
with the star schema shown in Fig. 2.1, with the degenerated attribute SalesOrderNumber (AW D-Star),
in order to test the several validity properties for business facts.
We analyzed the behavior of the pattern-tree and the time and memory used by each algorithm, and
we conducted experiments varying both minimum support and maximum error thresholds. Since results
are similar, and we want to compare the constrained versus the unconstrained approach, we only present
here the results for 3% of error.
Since the behavior of the algorithms may vary with the selectivity of the constraints, as shown in
Sections 5.1.4 and 5.2.4, we test the algorithm D2StarFP-Stream with several constraints, with different
selectivities. However, since we are constraining transactions, and not patterns, we measure the selectivity
of star constraints as the ratio of entities that violate the constraint, i.e. the number of entities of the
constrained dimension that are invalid, over the total number of entities of that dimension.
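This notion of selectivity can be sketched as follows (hypothetical attribute values, not the actual AdventureWorks data):

```python
def entity_selectivity(entities, constraint):
    """Selectivity of a star constraint: the fraction of entities of the
    constrained dimension that violate the constraint."""
    invalid = sum(1 for e in entities if not constraint(e))
    return invalid / len(entities)

# Hypothetical Customer.Age values; constraint: age under 30.
ages = [27, 38, 45, 22, 31]
print(entity_selectivity(ages, lambda a: a < 30))  # 0.6 (3 of 5 customers violate)
```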
Figures 5.12 and 5.13 show the average pattern-tree size per batch and maximum memory needed,
respectively.
As expected, the size of the pattern tree decreases with the incorporation of constraints, even for
constraints with small selectivity. We can see that, for example, for 50% of selectivity, the pattern tree
is 4 times smaller than in the unconstrained case. This shows that pushing constraints minimizes the
bottleneck of StarFP-Stream, by reducing the number of patterns that must be kept in the pattern
tree. And by having smaller trees, D2StarFP-Stream needs less memory to keep them, outperforming its
unconstrained counterpart.
Figure 5.12: Average size of the pattern-tree, per batch, for 3% of error, and for entity and attribute constraints over a non degenerated dimension.
Figure 5.13: Average maximum memory needed, per batch, for 3% of error, and for entity and attribute constraints over a non degenerated dimension.
Following the same reasoning, the smaller the pattern-trees, the less time needed to process them.
Figure 5.14 shows this decrease with the increase of the selectivity of the constraints.
Figure 5.14: Average update time of the pattern-tree, per batch, for 3% of error, and for entity and attribute constraints over a non degenerated dimension.
Despite this decrease, we can see that, for constraints with small selectivity, D2StarFP-Stream needs more time to process one batch than the unconstrained algorithm. This happens because pushing constraints requires an extra step for checking the validity of business facts and the satisfaction of the star constraints. This extra validation step results in more overall time per batch with small selectivities, since in these cases few entities are invalid, so all of them need to be tested and most of them will remain in the tree. However, this extra time is compensated as more facts are discarded: for 50% of selectivity, for example, D2StarFP-Stream takes, on average, half the time to process each batch compared to the unconstrained StarFP-Stream.
5.4.4 Discussion and Conclusions
In this section we proposed a new algorithm, D2StarFP-Stream, for pushing star constraints into the discovery of patterns over a large and growing star schema. The algorithm is an extension of the unconstrained StarFP-Stream (Section 3.3) that returns fewer and more interesting results, according to the constraints. By being able to incorporate star constraints, D2StarFP-Stream eliminates invalid transactions earlier and keeps smaller pattern-trees, therefore minimizing the bottleneck of the unconstrained algorithm.
Experimental results show that the algorithm is memory efficient, and that it results not only in smaller pattern-trees, but also in less memory needed, even for constraints with small selectivity. Despite the extra time introduced by the validation step, which results in more time per batch for small selectivities, this overhead is diluted for more selective constraints, making the algorithm take less time per batch than the unconstrained one.
5.5 Conclusions and Open Issues
In this chapter, the algorithm CoPT [SA13a] was proposed for post-pushing constraints into pattern
mining. The algorithm uses a prefix tree structure to store the frequent itemsets, and then pushes
constraints deep into this tree, taking advantage of the constraint properties in question. Despite being
a post-processing algorithm, it is able to push constraints satisfying all known properties, still taking
advantage of them.
We also proposed an extension of the algorithm CoPT [SA13a] for data streams, named CoPT4Streams. The idea is to use any data streaming algorithm and to store all current patterns in a pattern-tree. Constraints are pushed at each batch boundary, resulting in a smaller summary structure at every batch, and therefore in less time and memory needed. We also show that violating itemsets can always be removed at each batch, without losing patterns. Like CoPT, this algorithm is able to efficiently incorporate constraints that follow any constraint property, still taking advantage of them.
However, both algorithms are designed for one single data table.
We also analyzed in detail the integration of multi-dimensional mining with constrained mining, and defined in this chapter a set of constraints for star schemas – the star constraints: entity type, entity, attribute and measure constraints. We also discussed and proposed a set of strategies for pushing these star constraints into multi-dimensional mining algorithms, and showed that it is possible to incorporate constraints into the mining of multiple tables.
By being post-processing algorithms, both CoPT and CoPT4Streams cannot be directly applied to
the mining of a star schema. However, they both can be applied to the mining of the fact table, with small
adaptations to deal with different entity types and retrieve the values from the corresponding dimensions.
In order to mine the whole star with constraints, we proposed the algorithm D2StarFP-Stream. It is able to push all star constraints as new business facts arrive, guaranteeing that invalid transactions are eliminated and do not contribute to support. By constraining the business transactions according to the desired validation property for events, the algorithm is also able to mine the star schema at the right business and aggregation level.
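The core idea behind this elimination step – discarding business facts whose dimension entities violate a star constraint before they can contribute to support – can be sketched as follows (a simplified illustration; the record layout and names are hypothetical, and the actual algorithm performs these checks incrementally over the stream):

```python
def filter_valid_facts(facts, dimensions, constraints):
    """Keep only the business facts whose referenced dimension entities
    satisfy every star constraint, so that invalid transactions never
    contribute to support counting."""
    valid = []
    for fact in facts:
        # A fact is valid only if every constrained dimension entity
        # it references satisfies the respective predicate.
        if all(pred(dimensions[dim][fact[dim]])
               for dim, pred in constraints.items()):
            valid.append(fact)
    return valid

# Toy star: a Patient dimension, facts referencing it by key, and an
# entity constraint keeping only female patients.
dims = {"patient": {1: {"gender": "F"}, 2: {"gender": "M"}}}
facts = [{"patient": 1, "exam": "GPT"}, {"patient": 2, "exam": "ALB"}]
only_female = {"patient": lambda e: e["gender"] == "F"}
# filter_valid_facts(facts, dims, only_female) keeps only the first fact
```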
To the best of our knowledge, this is the first approach dedicated to the integration of these two areas.
Despite being an important step, there is still room for progress, namely in finding ways to push more complex constraints, such as sequence and temporal constraints (which is possible, since we have transactions, and therefore time and sequences), as well as other more complex forms of domain knowledge, such as ontologies.
The algorithm D2StarFP-Stream can also be improved, for example, in terms of how the validation of the constraints is made. By finding more efficient ways of making these constraint checks, it is possible to minimize the overhead of this extra step. Optimization techniques, such as parallelization and integration with the database (for faster access to attribute values), could also bring benefits to D2StarFP-Stream.
Chapter 6
A Case Study in Healthcare
Huge amounts of data are continuously being generated in the healthcare system. The analysis of these data is essential, since it may help in many areas of healthcare management, such as evaluating treatment effectiveness, understanding causes and effects, anticipating future demanded resources, predicting patients' behaviors and best treatments, defining best practices, etc. [KT05, KW06]. Due to the nature of this information, the results of these analyses may make the difference, by decreasing healthcare costs and, at the same time, improving the quality of healthcare services and of patients' lives.
Healthcare data are usually massive, and too sparse and complex to be analyzed by hand with traditional methods. In the last decades, data mining has begun to address this area, providing the technology and approaches to transform huge and complex data into useful information for decision making [KT05]. Data mining (DM) [FPSM92] has been successfully applied to many different subfields of healthcare management, with results that proved very useful to all parties involved [KT05, KW06].
One of the characteristics of the data collected in the healthcare domain is their high dimensionality.
They include patient personal attributes, resource management data, medical test results, conducted
treatments, hospital and financial data, etc. Thus, healthcare organizations must capture, store and
analyze these multi-dimensional data efficiently.
Multi-Relational Data Mining, or MRDM [D03], is therefore a promising approach for analyzing healthcare data, since its goal is to discover frequent relations that involve multiple tables (or dimensions), in their original structure, i.e. without joining all the tables before mining.
In this chapter, we present a case study on the healthcare domain, showing how existing data can be
explored. The case is based on the use of the Hepatitis dataset, created by Chiba University Hospital,
containing information about 771 patients having hepatitis B or C, and more than 2 million examinations
dating from 1982 to 2001. This dataset is organized in a relational model that may help data storage, but that hinders data analysis, since data are scattered across different tables and it is not easy to inter-relate the data in a timeline.
In this work, we propose a multi-dimensional model for the Hepatitis dataset that makes efficient analysis and knowledge extraction possible. We also present some statistics, in order to better understand the distributions of the data in this domain. After modeling the dataset through a multi-dimensional model, we analyze the application of data mining to these models, and present the results of applying the MRDM algorithm StarFP-Stream [SA12a] to the proposed model.
Section 6.1 describes the Hepatitis dataset and Section 6.2 proposes a multi-dimensional model for the Hepatitis data – the Hepatitis star – in order to promote their analysis for decision making. We first show an evaluation of the performance of applying StarFP-Stream to the Hepatitis star (Section 6.3), and then we present two applications of MRDM with these healthcare data. In the first, we use our algorithm to find discriminant patterns and association rules (Section 6.5) to understand the relations between the laboratory examinations and the two types of hepatitis; in the second, we show that StarFP-Stream can be used to find inter-dimensional and aggregated patterns that are able to characterize patient exam behaviors. These, in turn, may be used as classification features to predict if a patient has hepatitis or not, which type, and even the stage of the hepatitis (Section 6.6). Finally, Section 6.7 discusses and concludes the chapter.
6.1 The Hepatitis Dataset
The Hepatitis dataset1 contains information about laboratory examinations and treatments performed on patients with hepatitis B and C, who were admitted to Chiba University Hospital in Japan. There are 771 patients, and more than 2 million examinations dating from 1982 to 2001, from about 900 different types of blood and urine exams. The dataset also contains data about the biopsies (about 695 biopsy results) and interferon treatments (about 200) performed on patients. Biopsies reveal the true existence of hepatitis and the respective fibrosis stage. However, they are invasive procedures, and therefore there is an interest in finding other indicators that allow for the detection of hepatitis in a less invasive way. Interferon treatments have also been seen and used as an effective way to treat hepatitis C, although they have tough side effects, and their efficacy is not yet proved. Hence, there is the need to understand the impact of this treatment.
[Figure: the tables of the Hepatitis relational model, centered on the patient – Patient, Interferon Therapy, Results of Biopsy, Hematological Analysis, In-Hospital Examination and Out-Hospital Examination.]
Figure 6.1: Hepatitis relational model [PRV05].
The hepatitis dataset is composed of several data tables, modeled in a relational schema centered on the patient. This model is shown in Figure 6.1. Each patient may have performed some biopsies, several hematological analyses, and in-hospital and out-hospital exams, and may have also been under interferon therapy. Each one of these aspects is stored in a different table and is independent of the others.
Despite being modular, this schema does not facilitate the analysis of these data, for several reasons: (1) the various exams – in-hospital, out-hospital and hematological analyses – are not directly related, although the same type of exam may be present in more than one table; (2) relating exams with each other, or exams with biopsies or interferon therapy, requires joining the tables for a common analysis. This process of joining the tables is time consuming and non-trivial, and the resulting table hinders the analysis, since it contains a lot of redundant data, as well as many missing values; and (3) time is not directly modeled, and therefore there is no easy way to understand the interconnection between co-occurring events (e.g. exam results during interferon therapy), nor the disease evolution. Moreover, most data are distributed irregularly, both in time and per patient, making a direct analysis unfeasible.
The work presented in [PRV05] is the first step towards the multi-dimensional analysis of the hepatitis data. The authors use a multi-relational algorithm to connect biopsies and urine exams, and to generate association rules that estimate the stage of liver fibrosis based on lab tests. However, they are only able to mine two dimensions of the relational model at a time, and therefore they cannot relate the biopsies with, for example, both the blood tests (Hematological Analysis) and the other tests (In- and Out-Hospital Examinations).
1The Hepatitis dataset was made available as part of the ECML/PKDD 2005 Discovery Challenge: http://lisp.vse.cz/challenge/CURRENT/
6.2 The Hepatitis Multi-Dimensional Model
As stated before, one of the characteristics of the data collected in the healthcare domain is their high dimensionality. In the case of the Hepatitis dataset, we have administrative data such as patients' features (gender and date of birth), the pathological classification of the disease (given by biopsy results), the duration of interferon therapy, and temporal data about the blood and urine tests performed on patients. Note that we could have more data, such as treatment and test costs, hospital data related to out-hospital exams, information about the doctors in charge of patients, etc., which would increase the dimensionality of the dataset and the complexity of the model.
One efficient way to store high-dimensional data is through the use of a multi-dimensional model – a
star schema, in particular. A star schema clearly divides the different dimensions of a domain into a set
of separated data tables, interrelated by a central table, representing the occurring events. In the case of
the Hepatitis data, we can identify several dimensions – patient, biopsy, possible exams and date – and
events correspond to patient examinations.
Figure 6.2: Hepatitis star schema.
In this sense, one of the possible star schemas that can be defined is proposed in Figure 6.2. The star schema is composed of 4 dimensions (Patient, Biopsy, Exam Type and Date) and one central fact table that corresponds to the Examination Results. Each dimension is independent and contains the respective characteristics (Patient contains patients' features, and Exam Type contains data about possible exams, like upper and lower bounds and units). By analyzing the central table, we can understand the relation between all dimensions: one patient P, with active biopsy B, performed exam E on date D. The result of this event was r (given by attribute Result in the central table), and at the moment of this examination, interferon therapy was (or was not) being administered (attribute InInterferonTherapy?).
Adding new dimensions to this star schema is straightforward. For example, we could add dimensions Hospital and Doctor just by adding the respective keys to the central table, and each event in that table would correspond to one exam E, performed on patient P, with active biopsy B, on date D, in hospital H with doctor Doc.
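As an illustration only (the field names beyond Result and InInterferonTherapy? are hypothetical, and the thesis implementation is in Java), one row of the central fact table could be represented as:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ExaminationFact:
    """One row of the central fact table of the Hepatitis star:
    one foreign key per dimension, plus the two fact attributes
    (Result and InInterferonTherapy?)."""
    patient_key: int             # Patient dimension
    biopsy_key: Optional[int]    # Biopsy dimension (None: no active biopsy)
    exam_type_key: str           # Exam Type dimension, e.g. "GPT"
    date_key: date               # Date dimension
    result: str                  # categorized result, e.g. "N", "H", "VH"
    in_interferon_therapy: bool  # exam made during interferon therapy?

# Patient 1 performed a GPT exam on 1995-03-02 with a high result,
# no active biopsy, and outside interferon therapy:
fact = ExaminationFact(1, None, "GPT", date(1995, 3, 2), "H", False)
```

Adding a dimension such as Hospital would amount to one more key field in this record.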
6.2.1 Building the Star Schema
In order to build our star schema, we had to perform a pre-processing phase to join exam data from the
different tables and improve their quality.
First, we decided to reduce data and select only the most significant exams, based on the report
carried out by [WSYT03]. These exams are GOT, GPT, ZTT, TTT, T-BIL, D-BIL, I-BIL, ALB, CHE,
T-CHO, TP, WBC, RBC, HGB, HCT, MCV and PLT. In this sense, dimension Exam Type contains the
known data about these exams (i.e. code, bounds and units). The reason for this data reduction is that
other exams are so rare that one cannot draw any conclusion based on them. Another reason is the fact
that, due to the lack of domain knowledge, we can only interpret the results of these exams (as normal
or abnormal results). Dimension Patient is equivalent to the original table in the Hepatitis dataset, and
Biopsy contains only the possible outputs of biopsies (type can be B or C, the fibrosis stage varies from
0 to 4, and respective activity from 0 to 3). Note that dimension Date contains all dates from 1982 to
2001 and is trivial to generate.
Since these exams are spread across the Hematological Analysis, In-Hospital and Out-Hospital Examination tables, each row of these tables corresponds to one event (one examination) in the central table of the star schema. Exam results were then categorized into 7 degrees: extremely, very or simply high (UH, VH, H), normal (N), and low, very or extremely low (L, VL, UL). The thresholds and categories for each of the selected exams are described in [WSYT03], and presented in Table 6.1. When a patient had more than one result for the same type of exam in one day, the results were averaged.
Table 6.1: Important exams and corresponding thresholds and categories in the Hepatitis data [WSYT03].
medical test (thresholds) | categories
GOT (40, 100, 200), GPT (40, 100, 200), ZTT (12, 24, 36), TTT (5, 10, 15) | N, H, VH, UH
T-BIL (1.2, 2.4, 3.6), D-BIL (0.3, 0.6, 0.9), I-BIL (0.9, 1.8, 2.7) | N, H, VH, UH
ALB (3.0, 3.9, 5.1, 6.0), CHE (100, 180, 430, 510) | VL, L, N, H, VH
T-CHO (90, 125, 220, 255), TP (5.5, 6.5, 8.2, 9.2) | VL, L, N, H, VH
WBC (2, 3, 4, 9), PLT (50, 100, 150, 350) | UL, VL, L, N, H
RBC (3.75, 5.0), HGB (12, 18), HCT (36, 50), MCV (84, 95) | L, N, H
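The table's threshold-to-category mapping amounts to a sorted-threshold lookup: k thresholds delimit k+1 categories. A minimal sketch for a few of the exams follows (the boundary handling is an assumption, since the table does not state whether threshold values are inclusive):

```python
import bisect

# Thresholds and categories for a few of the selected exams,
# transcribed from Table 6.1 [WSYT03]:
EXAM_CATEGORIES = {
    "GPT": ((40, 100, 200), ("N", "H", "VH", "UH")),
    "ALB": ((3.0, 3.9, 5.1, 6.0), ("VL", "L", "N", "H", "VH")),
    "PLT": ((50, 100, 150, 350), ("UL", "VL", "L", "N", "H")),
    "RBC": ((3.75, 5.0), ("L", "N", "H")),
}

def categorize(exam, value):
    """Map a raw exam result to its category: the value's position
    among the sorted thresholds selects one of the k+1 categories."""
    thresholds, categories = EXAM_CATEGORIES[exam]
    return categories[bisect.bisect_right(thresholds, value)]

# categorize("GPT", 35) -> "N"; categorize("GPT", 150) -> "VH"
# categorize("PLT", 90) -> "VL"
```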
Fibrosis is considered stable 500 days before and 500 days after a biopsy [WSYT03]. Therefore, for each examination in the central table, the corresponding active biopsy is the most recent one performed for the patient within the 500-day interval (or none).
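This active-biopsy selection can be sketched as follows (the biopsy record layout is illustrative, and taking the most recent biopsy within the window is our reading of the rule above):

```python
from datetime import date, timedelta

def active_biopsy(biopsies, exam_date, window_days=500):
    """Return the patient's most recent biopsy whose date lies within
    500 days of the examination date, or None if there is no such
    biopsy (biopsies are active 500 days before and after they are
    conducted)."""
    window = timedelta(days=window_days)
    candidates = [b for b in biopsies if abs(b["date"] - exam_date) <= window]
    return max(candidates, key=lambda b: b["date"]) if candidates else None

b1 = {"date": date(1995, 1, 1), "stage": 1}
b2 = {"date": date(1999, 1, 1), "stage": 2}
# An exam on 1995-06-01 falls within b1's window only.
# An exam on 2001-01-01 is more than 500 days after both biopsies,
# so it has no active biopsy.
```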
Finally, interferon therapy data were also integrated in the multi-dimensional star, by marking all examinations in the central table made during the administration of this therapy (using the information in the Interferon Therapy table of the relational model).
6.2.2 Understanding the Data
After building the star schema for the Hepatitis dataset as described above, the result was a central table with almost 600 thousand examinations performed, for 722 patients (the other 50 patients performed none of the most significant exams, and therefore remain in the Patient dimension but are not present in the central table).
In order to better understand the domain in question, Figure 6.3 shows the distribution of the exams per patient. We can see that there are patients with just a few exams, and other patients with more than 2500 exams. However, on average, each patient performed about 500 to 700 exams. Also, only 30% of all patients are female, but women perform, on average, more exams than men.
Figure 6.3: Number of exams per patient (female and male).
Figure 6.4: Number of exams per patient diagnosed with hepatitis B, C or still undiagnosed (Unknown).
Of these patients, 234 have not performed any biopsy, which means that they have not yet been diagnosed with any type of hepatitis. The number of examinations performed on patients with hepatitis B, C or none is shown in Figure 6.4. Note that, of all patients, only 27.5% were diagnosed with hepatitis B at some point in time, 40% with hepatitis C, and the remaining 32.5% have no biopsy. We can see in that figure that patients with hepatitis C perform many more exams than patients with hepatitis B. One possible explanation is the fact that hepatitis C has been treated with interferon therapy, and therefore more exams (and biopsies) are performed to check if the condition improves.
Also, patients with no biopsy performed far fewer exams than the others. This may indicate that they did not undertake a biopsy because doctors thought these patients were not infected with hepatitis B or C, and therefore the biopsy was not necessary.
Figure 6.5 presents the variation of the number of exams per stage of hepatitis (fibrosis). A value of 0 means that there is no fibrosis, and 4 that the stage of the fibrosis is severe. Note that only about one fifth of the total examinations (about 137 thousand) were performed while there was a valid biopsy (biopsies are active 500 days before and after they are conducted). The others may correspond to patients that never performed a biopsy, or to other patients, before, between or after the conducted biopsies.
[Figure data: B 7.0%, C 16.4%, Unknown 76.6%; Hepatitis B stages – B1 37%, B2 33%, B3 17%, B4 13%; Hepatitis C stages – C0 3%, C1 50%, C2 15%, C3 14%, C4 18%.]
Figure 6.5: Distribution of exams per stage of hepatitis (i.e. exams performed while there was a valid biopsy indicating the fibrosis stage).
As expected, there are more cases of hepatitis in their early stages than in severe ones. In the case of
hepatitis C, 50% of all performed exams correspond to patients in stage 1 of fibrosis. This means that, in
order to find correlations between exams and fibrosis stages, we are analyzing patterns that are common
to a very small percentage of data.
Figure 6.6: Number of exams per patient, at each stage of hepatitis.
The number of exams per patient at each stage of hepatitis does not vary much, as can be seen in Figure 6.6. Furthermore, it is stable for patients with hepatitis C, with the exception of stage 0 (no fibrosis). This can again be explained by the application of interferon therapy and the respective evolution checks.
6.3 Performance Evaluation
In order to measure the performance of our algorithm StarFP-Stream over real data, we replicated the experiments made over the AdventureWorks DW. The goal is to evaluate the accuracy, time and memory usage, and to show that StarFP-Stream is accurate and performs better than the join-before-mining approach.
Similarly, we assume a landmark model, and we test our multi-relational approach, StarFP-Stream, against SimpleFP-Stream (as described in Section 3.4), which denormalizes business facts as they arrive.
We tested the algorithms over the Hepatitis Star in Figure 6.2. This star has no degenerated dimen-
sions, and therefore each row in the fact table corresponds to one business fact. Table 6.2 presents a
summary of the dataset characteristics.
Since the Hepatitis Star contains ten times more facts than the AW T-star, we used lower errors to
get larger batches. Also, the frequency of each item globally is much smaller (patients perform different
exams), and hence we had to use lower supports too, to achieve similar amounts of patterns. In this
sense, experiments were conducted varying both the minimum support and maximum error thresholds:
σ ∈ {5%, 2%, 1%, 0.5%} and ε ∈ {1%, 0.5%, 0.1%, 0.05%, 0.01%}. By varying the error, we are varying
batch sizes. Table 6.3 shows the size and number of batches corresponding to each error.
Table 6.2: A summary of the Hepatitis star characteristics.
Number of facts: 580,000
Number of transactions per fact: 1
Number of attributes per dimension: [2; 4]
Number of entries per dimension: [52; 772]
Table 6.3: Batches of Hepatitis facts, corresponding to each error.
Error | |B| | N. Batches
1% | 100 | 5,800
0.1% | 1,000 | 580
0.01% | 10,000 | 58
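The relation between the error, the batch size and the number of batches can be reproduced assuming, as in lossy-counting-style algorithms, a batch size of ⌈1/ε⌉ transactions:

```python
import math

def batches_for_error(n_facts, error):
    """Batch size |B| = ceil(1/error) and the corresponding number of
    batches over a stream of n_facts business facts."""
    batch_size = math.ceil(1 / error)
    return batch_size, math.ceil(n_facts / batch_size)

# For the 580,000 Hepatitis facts (cf. Table 6.3):
# error 1%    -> |B| = 100,    5,800 batches
# error 0.1%  -> |B| = 1,000,    580 batches
# error 0.01% -> |B| = 10,000,    58 batches
```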
The computer and settings used to run the experiments were the same: an Intel Xeon E5310 1.60GHz (Quad Core), with 2GB of RAM. The operating system used was GNU/Linux amd64, and the algorithms were implemented using the Java programming language (Java Virtual Machine version 1.6.0_24).
6.3.1 Experimental Results
In these experiments we analyze the accuracy of the results, as well as the behavior of the pattern-tree
and the time and memory used by each algorithm.
In terms of accuracy, we compared the patterns returned by StarFP-Stream with the exact patterns,
given by FP-Growth (with the complete denormalized table as input). Recall that the patterns returned
by both StarFP-Stream and SimpleFP-Stream are the same (they only differ in how they manipulate the
data), thus we only present these results for our algorithm.
Figure 6.7 shows the number of patterns returned, along with the precision (the ratio of real patterns over the returned ones).
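Precision here is simply the fraction of returned patterns that also occur in the exact result. A minimal sketch, with patterns represented as sets of items:

```python
def precision(returned_patterns, exact_patterns):
    """Fraction of the returned patterns that are real, i.e. that also
    appear in the exact result (here, FP-Growth's output)."""
    if not returned_patterns:
        return 1.0
    exact = {frozenset(p) for p in exact_patterns}
    hits = sum(1 for p in returned_patterns if frozenset(p) in exact)
    return hits / len(returned_patterns)

# Two of the three returned patterns are real:
# precision([{"a"}, {"a", "b"}, {"c"}], [{"a"}, {"a", "b"}]) -> 2/3
```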
(a) Number of patterns returned
(b) Precision
Figure 6.7: Hepatitis Star (results for 1% error only appear when ε ≪ σ).
Note that there are no results for supports of 1% and 0.5% with an error of 1% (and less), because, by definition, ε ≪ σ. Using an ε ≥ σ would cause the algorithm to return all possible patterns stored in the pattern-tree; the results would explode and would not be significant.
As expected, as the minimum support decreases, more patterns are returned, since we demand fewer occurrences for an itemset to be frequent. Also, as the error increases, more patterns are returned, because more items can be eliminated, and therefore more possible patterns have to be returned. This results in lower precision, since the number of false positives increases.
Figure 6.8 presents an analysis of the size of the pattern-tree. As the error decreases, the size of the trees increases (Figure 6.8a). Also, despite being a summary structure, the tree is very large, with thousands of nodes.
(a) Average size – Hepatitis Star
[Charts: Pattern-Tree Size (thousands of nodes) per Batch, for a 0.1% error.]
(b) Size with 0.1% error – Hepatitis Star
Figure 6.8: Average (left) and detailed (right) pattern-tree size.
As in the AdventureWorks case, the pattern-tree follows the same trends. The pattern-tree of the Hepatitis Star also grows over the first batches, but then tends to stabilize (Figure 6.8b). However, the Hepatitis case shows more fluctuations around the main tendency, which may mean that its patterns are not as well defined. This behavior is common to all errors.
The pattern-tree is the most important structure, since it holds all possible patterns, and therefore it influences both the time and the memory needed.
Figure 6.9 shows the time needed to process each batch (update time). Even for an error of 0.1% (which imposes a batch size of 1000 facts), the time needed to process each batch is just a couple of seconds (as with the AW T-Star for a 3% error). We can also state that StarFP-Stream needs, on average, less time than SimpleFP-Stream (Figure 6.9a), which confirms that denormalizing before mining takes more time than mining the star schema directly.
[Charts: average update Time (s) vs. Error, for SimpleFP-Stream and StarFP-Stream.]
(a) Average time – Hepatitis Star
[Charts: update Time (s) per Batch, for SimpleFP-Stream and StarFP-Stream.]
(b) Time with 0.1% error – Hepatitis Star
Figure 6.9: Average (left) and detailed (right) update time.
Even though the time in Figure 6.9b increases slightly as new batches are processed, we can see that it tends to become constant. Since there are many items with a small frequency (due to the data characteristics), there are many infrequent itemsets that must be removed, and this increase in time may be caused simply by memory management.
The analysis of the maximum memory needed per batch is shown in Figure 6.10. It is strongly related to the pattern-tree, and therefore to the error bound.
[Charts: average maximum Memory (Mb) per batch vs. Error, for SimpleFP-Stream and StarFP-Stream.]
Figure 6.10: Average maximum memory per batch.
We can see that our algorithm needs somewhat more memory than SimpleFP-Stream, because the former needs to create a DimFP-Tree for each dimension, while the latter puts the denormalized facts into one single FP-Tree. We can also see, once again, that the memory needed increases exponentially as the error decreases, because the smaller the error, the more has to be kept in the pattern-tree. However, just as with time, the memory needed tends to stabilize and not depend on the number of batches processed so far.
6.4 Hepatitis Application Goals
For the analysis of the hepatitis star schema, we decided to address two topics of interest, suggested for
this dataset:
1. To discover the differences between patients with hepatitis B and C;
2. To evaluate whether laboratory examinations can be used to estimate the stage of liver fibrosis.
This second topic is of particular importance, since biopsies are invasive for patients, and therefore doctors try to avoid them.
By using the star in Figure 6.2 we are able to relate the exams (and other dimensions) with the type of hepatitis, as well as with the fibrosis stage. We can look for examination results that are common (frequent) along with hepatitis B and/or C, and see the differences (goal 1). Similarly, we can look for frequent exam results for each fibrosis stage (goal 2), and then use those patterns to help classify other patients with similar results.
In order to tackle these topics, we follow two approaches: (1) discovering discriminant patterns and association rules; and (2) finding inter-dimensional and aggregated patterns and using them to enrich classification, potentially improving prediction results.
By following the first approach, we can understand whether there are examination results that are characteristic of some type of hepatitis, or that are connected to particular stages of hepatitis. On the other hand, by using the second methodology, we are able not only to find interesting sets of frequent examination results, but also to use these multi-relational patterns to improve the prediction of whether a patient is infected with hepatitis, with which type, and at what stage. Thus, if the predictions of classification models improve in the presence of these patterns, we are also evaluating and demonstrating the importance of our multi-relational patterns, and therefore the importance of our algorithms.
6.5 Finding Discriminant Patterns and Association Rules
At first glance, the approach for finding the discriminant patterns may seem straightforward: apply StarFP-Stream to the hepatitis star and choose all patterns that relate some examination result with hepatitis types and fibrosis stages. However, as seen in section 6.2.2, less than a quarter of all examinations have an active biopsy associated. In particular, 16% of examinations correspond to hepatitis C, and only 7% to hepatitis B.
First, this means that, to find the hepatitis type B as a frequent item (the same is valid for hepatitis C), we have to select a very low support; and, to find some examination that is frequent along with hepatitis B, we have to set the support even lower. Furthermore, if we look at the frequency of examinations corresponding to each fibrosis stage, the supports become lower still. This leads to many uninteresting and possibly misleading patterns.
Second, if all data contributes to the support, highly frequent patterns (> 16% + 7% = 23%) are frequent because they co-occur more in data with no biopsy information. This means that they are not interesting, because they cannot discriminate any type of hepatitis (at most, they can discriminate the non-existence of hepatitis, if they are not frequent for any type of hepatitis).
In this sense, we decided to constrain the data, and applied StarFP-Stream with low supports:
1. To all examinations with hepatitis B – referred to as B;
2. To all examinations with hepatitis C – referred to as C;
3. To all examinations with no biopsy data – referred to as None.
This way, we found three sets of patterns: B, C and None. We then generated the association rules (with their respective support, confidence and lift measures) based on the discovered patterns (again, three sets of rules: B, C and None).
Next, for the analysis, we categorized the patterns and rules as discriminant or non-discriminant. A pattern is discriminant if it belongs to group B and/or C, but not to group None, i.e. if it is frequent for some type of hepatitis but not for patients that have not yet been diagnosed. Additionally, a pattern that belongs only to group None is also discriminant, since it may be a good indicator that a patient does not have hepatitis. Patterns that belong to some hepatitis group and, at the same time, to group None are non-discriminant, and thus not interesting. Discriminant patterns may be used to address goal 1, i.e. to understand the differences between hepatitis B and C.
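The categorization above reduces to a membership test over the three pattern groups. A minimal sketch (the group contents below are illustrative, not the actual result sets):

```python
# Categorize a pattern as discriminant or not, following the rules above:
# discriminant if frequent for B and/or C but not for None, or if frequent
# only for None. `groups` maps each group name to its set of frequent patterns.

def categorize(pattern, groups):
    in_b = pattern in groups["B"]
    in_c = pattern in groups["C"]
    in_none = pattern in groups["None"]
    if (in_b or in_c) and not in_none:
        kinds = [name for name, flag in (("B", in_b), ("C", in_c)) if flag]
        return "Yes (" + " and ".join(kinds) + ")"
    if in_none and not (in_b or in_c):
        return "Yes (None)"
    return "No"

# Toy groups, mirroring the structure (not the content) of Table 6.4.
groups = {"B": {"GPT_UH", "GOT_VH"}, "C": {"GOT_VH"}, "None": {"ALB_L"}}
print(categorize("GOT_VH", groups))  # Yes (B and C)
print(categorize("ALB_L", groups))   # Yes (None)
```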
Finally, we analyzed association rules that implicate some stage of fibrosis, to understand if the stage
can be estimated by examination results (goal 2).
6.5.1 Interesting Patterns
Table 6.4 presents a subset of the frequent patterns found, with information about results and fibrosis. As expected, the supports of these patterns are very low (around 1% of the group) in all three groups. In fact, in these data, we found that the patterns with higher support tend to be non-discriminant (such as normal values for most of the examinations).
Table 6.4: Some examples of the patterns found in the hepatitis dataset.

                                     Support in
    Pattern                          B     C     None   Discriminant?
    1  (Result=RBC_H)                1%                 Yes (B)
    2  (Result=GPT_UH)               1%                 Yes (B)
    3  (Result=GPT_VH)               1%    2%    1%     No
    4  (Result=GPT_H)                2%    2%    3%     No
    5  (Result=GOT_VH)               1%    1%           Yes (B and C)
    6  (Result=GOT_H)                3%    3%    3%     No
    7  (Result=HCT_H)                2%    2%    1%     No
    8  (Result=CHE_VL)               4%          1%     No *
    9  (Result=ALB_L)                             1%    Yes (None)
    10 (Result=PLT_VL)                            1%    Yes (None)
    11 (Sex=M,Result=GPT_VH)         1%    1%           Yes (B and C)
    12 (Sex=M,Result=CHE_VL)         3%                 Yes (B)
    13 (Fibrosis=1,Result=CHE_VL)    1%                 Yes (B)
    14 (Fibrosis=1,Result=GPT_H)           1%           Yes (C)
    15 (Fibrosis=1,Result=GOT_H)           1%           Yes (C)
However, by analyzing the differences between groups, we can find some possibly interesting and discriminant examinations. For example, we find that ultra high (UH) values for the GPT test only appear in the hepatitis B set (more than 1% of the time), but as the value lowers, the test stops being discriminant. Other examples are patterns 9 and 10, which may indicate that low values in the ALB and PLT tests are good markers for not having hepatitis (note that, in these data, not having information about a biopsy does not mean that a person does not have hepatitis, but it may be an indicator for finding the relations that make doctors think there is no need for a biopsy; nevertheless, this would need further analysis).
Pattern 8 is marked with an ∗ because, as can be noted, it has 4% support for hepatitis B and only 1% in group None, and is therefore considered non-discriminant. But, if we look at patterns 12 and 13, very low (VL) values for the CHE test may be an indicator of hepatitis B, meaning that the occurrences of pattern 8 in the None group may be outliers (or not yet diagnosed hepatitis B patients).
The only discriminant patterns that relate exam results and the fibrosis stage concern fibrosis stage 1 (patterns 13 to 15 in the table), because of the extremely low supports of the other fibrosis stages. In fact, high (H) values in the GPT and GOT exams are not, by themselves, discriminant of hepatitis C (patterns 4 and 6). At most, they may be able to discriminate the fibrosis stage in patients already diagnosed with hepatitis C.
6.5.2 Association Rules
Table 6.5: Some examples of the association rules found in the hepatitis dataset.

    AR                                           Conf.    Lift   Discr.?
    1  (Result=GOT_H) ⟹ (Fibrosis=1)             48.10%   0.96   No
    2  (Result=GPT_H) ⟹ (Fibrosis=1)             51.35%   1.03   No
    3  (BirthDecade=1960) ⟹ (Fibrosis=0)         19.90%   6.92   No
    4  (BirthDecade=1960) ⟹ (Fibrosis=1)         62.32%   1.25   No
    5  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=1)   51.25%   1.02   Yes (C)
    6  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=2)   15.07%   0.99   Yes (C)
    7  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=3)   13.81%   0.99   Yes (C)
    8  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=4)   17.57%   0.98   Yes (C)
Table 6.5 presents a subset of the frequent association rules found, of the form X ⇒ Fibrosis, with X any other item.
In order to address the second goal, we wanted to find all rules in which some examination result implies some fibrosis stage. Rules 1 and 2 are examples of such rules. However, their confidence is around 50%, which means that these rules are not unexpected and are probably too tied to the data in question. The lift is also too close to 1, confirming that these rules are not interesting. Indeed, both antecedents were non-discriminant (as seen in Table 6.4), as are these rules. All other rules of this form are equivalent and, furthermore, can only estimate fibrosis stage 1. This means that, with these data, no examination result can, by itself, predict the stage of fibrosis, for either type of hepatitis.
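For reference, the confidence and lift used to evaluate these rules can be computed directly from pattern supports. A minimal sketch with hypothetical counts (not the actual hepatitis supports):

```python
# Confidence and lift of an association rule X => Y, from absolute supports.

def confidence(sup_xy: int, sup_x: int) -> float:
    """P(Y|X): fraction of transactions containing X that also contain Y."""
    return sup_xy / sup_x

def lift(sup_xy: int, sup_x: int, sup_y: int, n: int) -> float:
    """Ratio of observed co-occurrence to that expected under independence.
    A lift close to 1 means X gives almost no information about Y."""
    return (sup_xy / n) / ((sup_x / n) * (sup_y / n))

# Hypothetical example: 1000 transactions; X in 200, Y in 480, both in 100.
print(confidence(100, 200))       # 0.5
print(lift(100, 200, 480, 1000))  # ≈ 1.04: X barely changes the odds of Y
```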
Rules 3 and 4 are examples of rules with a slightly higher lift. They indicate that 20% of the patients born in the 60s (i.e. who were examined when 20 to 40 years old) had hepatitis at fibrosis stage 0, and 62% of them at fibrosis stage 1. However, these rules have low confidence, and therefore we cannot conclude that there is a relation between the age of the patients and the stage of the hepatitis.
Finally, rules 5 to 8 show that there are attributes that, although discriminant, are not good predictors of the stage of fibrosis. In these examples, women born in the 30s may indicate any stage from 1 to 4, but with small confidences (with the exception of stage 1, which is explained by the fact that there are more instances of that stage) and bad lifts.
In [PRV05], the authors only generate and analyze the confidence of association rules of the form Examination Result → Fibrosis. However, besides the confidence of those rules being low (in most cases), neither the support nor the lift of those rules was analyzed. As shown here, rules of that form have a small confidence (rules 1 and 2) and also a lift too close to 1 (and a very low support), which means there are too few examples and these rules may not be significant.
These poor results mean that further analysis of these data is needed, in a different and more structured way. They also show that there are some possible tendencies but that, alone, examination results cannot predict the fibrosis stage of hepatitis patients.
6.6 Improving Prediction using Multi-Dimensional Patterns
Classification is a data mining task widely used for predicting future outcomes. As an example, in this
healthcare domain, it can be used to predict if a patient is infected with hepatitis or not, as well as to
predict the type and stage of hepatitis.
Classification algorithms create a prediction model based on the existing data (training data), for
which we know a set of features and also the outcomes, and then use this model to predict the unknown
outcomes of new data, based on their observed features.
There are several algorithms and models proposed for classification, and they have been applied to a vast number of different domains. However, despite these advances, the relations between attributes are not considered in existing approaches. In fact, in a multi-relational domain there are implicit relations in the data that are easily modeled through a relational schema, but that cannot be easily modeled in one single training table. Naturally, if we could somehow incorporate these relations into classification, prediction results would be likely to improve.
If we consider the multi-relational (MR) patterns described above, both inter-dimensional and aggregated patterns represent the relations between entities. This means that, if a record (or event) in the data satisfies a MR pattern, we can say that this record encloses the relations represented by the pattern.
In this sense, MR patterns can be seen as a compact way to model the relationships in the data, and can be used as features to enrich data for classification. The simplest way to incorporate these patterns is to pre-process the individual records to verify the satisfaction of each identified pattern, and to extend these records with one boolean attribute per pattern, corresponding to whether the respective record satisfies each pattern. In this manner, what is multi-relational by nature becomes tabular, without losing the dependencies identified before, and traditional classifiers are applicable without the need for any adaptation.
In this thesis, we claim that we can use multi-relational patterns to enrich classification data in the healthcare domain and improve prediction, as done before with sequential patterns [BA11] and frequent graphs [PA09].
We first describe the methodology in detail, and then put it into practice with several experiments, showing that running classification over these enriched data improves not only the accuracy of the predictions, but also the classification models built.
6.6.1 Methodology
The general process is illustrated in Figure 6.11, and is divided into four main steps: multi-dimensional
pattern mining, pattern filtering, data enrichment and classification.
[Diagram: Star Schemas (Individual Performances) → Multi-Dimensional Pattern Mining → Frequent Patterns → Pattern Filtering → N Best Patterns → Data Enrichment → Enriched Individual Performances → Classification → Prediction Model]
Figure 6.11: The multi-dimensional methodology for enriching classification.
The main idea is to make use of a MRDM algorithm to find inter-dimensional and aggregated patterns that are able to characterize different entities and their behaviors. These patterns may, in turn, be filtered and used as classification features to predict some outcome, depending on the different dimensions considered.
Given a star schema (or a constellation of star schemas) containing the individual performances (such
as the Hepatitis star given in Figure 6.2, recording all the examinations performed), we propose to apply
the next steps:
1. Multi-Dimensional Pattern Mining: This step consists in running an algorithm for multi-dimensional pattern mining, such as StarFP-Growth [SA11] or StarFP-Stream [SA12a], over each star schema. By doing this, we are able to find the frequent patterns – intra-dimensional, inter-dimensional and aggregated – related to each star.
For example, running a MRDM algorithm over the Hepatitis Star allows us to find frequent patient behaviors related to the results of the examinations they performed (e.g. sets of results that are frequent together, for each particular hepatitis).
2. Pattern Filtering: After finding the patterns, the next step is to filter them and choose the N best ones. We can either filter the patterns of each star separately, choosing the N best of each set, or rate the set of all patterns and choose the N best global ones.
Note that, in theory, using all patterns with at least 2 items to enrich the training data for classification should achieve the best results. However, this would eventually lead to overfitting of the models found, which would in turn lead to poor results when classifying new instances. In this sense, we should choose only those patterns that achieve a higher information gain.
First, we are only interested in the patterns that can model the multi-dimensional relations between entities; therefore, we only want inter-dimensional and aggregated patterns (i.e. patterns with items from more than one dimension and items resulting from the aggregation of facts). Then, in order to choose the N best patterns, we have to filter and rate the patterns according to some interestingness measure. We define five filters:
Support: The support of a pattern is the number of times it occurs. Therefore, the higher the support of a pattern, the more events share the characteristics it represents, and the more likely it is to cover more of the entities we want to classify.
In this sense, using a support filter, we order the patterns by descending support and choose the N patterns with the highest support.
However, the patterns with the highest support are the smallest ones (the number of times a pattern occurs is greater than or equal to the number of times its super-patterns occur), and therefore those that represent smaller relations.
Also, in this healthcare domain, if exam results are shared by a high number of patients, it may mean that they are not discriminant of the type of hepatitis or its stage;
Size: On the other hand, the largest patterns model more multi-dimensional relations than smaller ones, and hence may be more interesting for improving classification.
Using a size filter, we order the patterns by descending size and choose the N largest patterns.
The downside is that these patterns tend to have the smallest supports, covering a very small part of the data;
Closed: One characteristic of patterns is that, if one is frequent, all of its subsets are also frequent (anti-monotonicity), which means that some patterns might be redundant. Thus, if we eliminate the redundant patterns, the final set of chosen patterns is more likely to be interesting.
A pattern is closed if none of its immediate supersets has the same support (if some has, this pattern is not interesting). In this case, we are only interested in the closed patterns. Using a closed filter, we consider only the closed patterns, and choose those with the highest support;
One of the problems of the above measures is that they do not take into account how correlated the items are, or how much gain they bring over what is already known. Thereby, we define the next two filters.
Rough Independence: According to probability theory, two events A1 and A2 are independent if P(A1 ∩ A2) = P(A1)P(A2). If two events are independent, the occurrence of one does not influence the probability of the other; therefore, patterns that contain these two events are not interesting. More than two events are mutually independent if P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2)...P(An) holds for every subset of those events. Taking this into consideration, for this work we define a rough independence measure:

RInd({A1, A2, ..., An}) = P(A1 ∩ A2 ∩ ... ∩ An) / (P(A1)P(A2)...P(An))

If RInd is 1, the elements of the pattern are roughly independent, and therefore less important. The higher the value of RInd, the more dependent the elements are, and the more important the pattern.
Therefore, using the rough independence filter, patterns are ordered by decreasing value of |RInd − 1|, and only the N patterns with the highest difference are chosen.
Note that, in order to guarantee the mutual independence of a pattern, it would be necessary to measure RInd on all of its subsets (the power set of its elements).
Rough Chi-square (χ2): Chi-square is an interestingness measure that evaluates the correlation between variables [BMS97]. Generally, the more correlated the variables, the more interesting the relations. The chi-square of two variables is defined as:

χ2 = Σ(i=1..n, j=1..m) (observed_ij − expected_ij)^2 / expected_ij

in which observed_ij is the observed support of values i and j, and expected_ij is the expected probability of those values if the variables were independent.
In this work we define a rough chi-square measure to evaluate the correlation of the elements in a pattern:

Rχ2({A1, ..., An}) = (support(A1 ∩ ... ∩ An) − P(A1)...P(An))^2 / (P(A1)...P(An))

The higher the value of Rχ2, the more roughly correlated the elements of the pattern are, and therefore the more interesting it is.
In this sense, using a rough chi-square filter, patterns are ordered in decreasing order of Rχ2, and the N with the highest measure are chosen.
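The two rough measures can be sketched directly from the definitions above. A minimal sketch; the item probabilities below are illustrative, not taken from the hepatitis data:

```python
# Rough independence (RInd) and rough chi-square of a pattern, as defined
# above. `item_prob` maps each item to its observed probability P(Ai);
# `pattern_prob` is the observed P(A1 ∩ ... ∩ An).
from math import prod

def rough_independence(pattern, pattern_prob, item_prob):
    expected = prod(item_prob[a] for a in pattern)
    return pattern_prob / expected  # 1.0 => items look independent

def rough_chi_square(pattern, pattern_prob, item_prob):
    expected = prod(item_prob[a] for a in pattern)
    return (pattern_prob - expected) ** 2 / expected

def top_n_by_rind(patterns, probs, item_prob, n):
    # Rank by |RInd - 1|: the farther from independence, the better.
    return sorted(patterns,
                  key=lambda p: abs(rough_independence(p, probs[p], item_prob) - 1),
                  reverse=True)[:n]

# Toy probabilities: the first pair co-occurs exactly as often as expected
# under independence (0.25 = 0.5 * 0.5); the second co-occurs far more often.
item_prob = {"GOT_H": 0.5, "GPT_H": 0.5, "ZTT_H": 0.3}
probs = {("GOT_H", "GPT_H"): 0.25, ("GOT_H", "ZTT_H"): 0.28}
print(top_n_by_rind(list(probs), probs, item_prob, 1))  # [('GOT_H', 'ZTT_H')]
```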
3. Data Enrichment: Once we have the best patterns, we can use them as features for classification training.
The simplest way to incorporate these patterns in the classification process is to pre-process the individual records (the original training data) to verify the satisfaction of each identified pattern. This verification results in a new extended record, where the multi-dimensional patterns are represented as boolean attributes – true or false – according to whether the entity satisfies the particular pattern. In this manner, what is multi-dimensional by nature becomes tabular, without losing the dependencies identified before, and traditional classifiers are applicable without any adaptation.
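The enrichment step described above can be sketched as follows; the records and patterns are toy examples, not the actual hepatitis tables:

```python
# Extend each record with one boolean attribute per selected pattern,
# true when the record contains every item of the pattern.

def satisfies(items: set, pattern: tuple) -> bool:
    return set(pattern) <= items

def enrich(records, patterns):
    # Each record is (identifier columns, set of items it contains).
    return [ids + [satisfies(items, p) for p in patterns]
            for ids, items in records]

patterns = [("GOT_H", "GPT_H"), ("Sex=M", "MCV_H")]
records = [(["p1"], {"GOT_H", "GPT_H", "PLT_L"}),
           (["p2"], {"GOT_H", "MCV_H"})]
print(enrich(records, patterns))
# [['p1', True, False], ['p2', False, False]]
```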
4. Classification: We can then finally run classification algorithms on these enriched data and observe the results. We expect to achieve not only better predictions, but also better models (in particular, smaller ones).
6.6.2 Methodology into Practice
In order to analyze the hepatitis dataset and achieve our goals, we decided to follow the methodology
described above: (1) run multi-relational pattern mining over the Hepatitis star schema; (2) filter the
best inter-dimensional and aggregated patterns; (3) enrich the classification data (baseline) with these
patterns; and (4) run classification over both the baseline and this enriched dataset, and compare the
results (the average of the predictions, and the size of the models built).
For the first step, we decided to run the StarFP-Stream algorithm over the Examination Results star schema. So that we could understand the behavior of patients and discover frequent sets of exam results, the algorithm aggregates into one single record all the exams of the same patient while each particular biopsy is valid, i.e. each pair (patient, biopsy) of the central fact table is considered as a single event. By doing this, we are able to discover not only frequent exam results (like traditional pattern mining algorithms), but also sets of results that co-occur frequently. We can find, for example, that patients with hepatitis B frequently have high results in the GOT and GPT exams and, at the same time, low results in PLT.
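The aggregation per (patient, biopsy) pair can be sketched as follows; the fact rows are illustrative, not the actual fact table:

```python
# Merge all fact rows sharing the same (patient, biopsy) key into one
# event holding the set of exam results observed while the biopsy is valid.
from collections import defaultdict

def aggregate_events(facts):
    events = defaultdict(set)
    for patient, biopsy, result in facts:
        events[(patient, biopsy)].add(result)
    # Sort each result set so the output is deterministic.
    return {key: sorted(results) for key, results in events.items()}

facts = [("p1", "b1", "GOT_H"), ("p1", "b1", "GPT_H"), ("p1", "b2", "PLT_L")]
print(aggregate_events(facts))
# {('p1', 'b1'): ['GOT_H', 'GPT_H'], ('p1', 'b2'): ['PLT_L']}
```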
After finding the patterns, we tested our approach with all the filters proposed above, and with different numbers of selected patterns.
For this case study, since our goal is to predict the type or the stage of hepatitis based on exam results, the baseline used for classification is a table composed of the patient information (Patient dimension) and the results of the 17 most significant exams identified in section 6.2. We then defined two similar baselines, Type and Fib, and applied this methodology to both: the first to predict whether a patient is infected with some type of hepatitis, and the second to predict the stage of the hepatitis, if present. In this sense, the class of baseline Type is the type of hepatitis (B, C or None), and the class of baseline Fib is the stage of fibrosis (from 0 to 5).
Once we have the best N patterns, we extend the baseline table by adding N boolean attributes.
Each value for these attributes is true if the patient satisfies the pattern, or false otherwise.
Finally, we applied the classification algorithm C4.5 to these enriched datasets and compared the results against the same algorithm applied to the baselines. The classification results presented are the average of several 10-fold cross-validations.
We used our implementation of StarFP-Stream (described in Chapter 3), and the C4.5 implementation
available in Weka.
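The evaluation protocol (averaging accuracy over 10-fold cross-validation) can be sketched as follows; a trivial majority-class classifier stands in for C4.5, and the labels are synthetic:

```python
# Average accuracy over k-fold cross-validation. Each fold is held out once
# as the test set; a majority-class predictor is trained on the rest. This
# only illustrates the protocol, not the actual C4.5 experiments.
from collections import Counter

def kfold_accuracy(labels, k=10):
    folds = [labels[i::k] for i in range(k)]  # simple interleaved split
    accuracies = []
    for i, test in enumerate(folds):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        majority = Counter(train).most_common(1)[0][0]
        accuracies.append(sum(y == majority for y in test) / len(test))
    return sum(accuracies) / len(accuracies)

# Synthetic class distribution, loosely shaped like Type (C, B, None).
labels = ["C"] * 70 + ["B"] * 20 + ["None"] * 10
print(kfold_accuracy(labels))  # 0.7: the majority class covers 70% of cases
```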
6.6.3 Analysis of Multi-Relational Patterns
Table 6.6 presents a subset of the frequent patterns found. For simplicity, we only present the patterns
in which the exam results have abnormal values (i.e. low or high results).
The first five patterns are intra-dimensional and contain only one item. The first, for example, means that the exam named GOT is frequent, appearing 975 times in these settings with a high value (H). Also, the data contains 290 diagnoses of hepatitis C (pattern 5).
Table 6.6: Some examples of the multi-relational patterns found in the hepatitis dataset.

    Pattern                                                        Support
    1  (Result=GOT_H)                                              975
    2  (Result=WBC_L)                                              536
    3  (Result=IBBIL_H)                                            495
    4  (Result=CHE_VL)                                             491
    5  (Type=C)                                                    290
    6  (Result=ZTT_H,Result=DBBIL_H,Result=TTT_H)                  577
    7  (Result=GOT_H,Result=GPT_H,Result=ZTT_H,Result=MCV_H)       586
    8  (Result=GOT_H,Result=GPT_H,Result=DBBIL_H,Result=CHE_VL)    386
    9  (Result=GOT_H,Result=GPT_H,Result=PLT_L,Result=WBC_L)       362
    10 (Result=GOT_VH,Result=GPT_VH,Result=ZTT_H,Result=WBC_L)     355
    11 (Sex=M,Result=MCV_H)                                        579
    12 (Sex=M,Result=MCV_H,Result=HCT_H)                           469
    13 (Type=None,Fibrosis=None,Result=DBBIL_H)                    534
    14 (Type=None,Result=GOT_H,Result=GPT_H)                       508
    15 (Type=C,Result=GOT_H,Result=GPT_H,Result=ZTT_H)             236
The next five patterns are aggregated patterns, since they represent frequent sets of examination results; they are discovered because we aggregated the data in the star schema per (patient, biopsy) pair. We can see that the GOT and GPT exams frequently appear together, with very similar (and high) results. We can also observe that, e.g., more than 300 patient diagnoses have high results in both GOT and GPT and, at the same time, low results in the PLT and WBC exams.
Patterns 11 to 15 are inter-dimensional patterns, which contain items from more than one dimension.
Of these, the first two relate the Patient and Exam dimensions, and the rest relate the Biopsy
with the examinations and corresponding results. In these examples, we note that, of the 800 biopsy
diagnoses of male patients, almost 600 correspond to examinations with a high value in exam MCV.
The last three patterns relate the type of hepatitis and the stage of the fibrosis with examination results.
As an example, a high value for the D-BIL exam was frequently associated with no hepatitis cases.
The last patterns suggest that a higher value for both the GOT and GPT tests, along with ZTT,
indicates that the patient has hepatitis C, since more than 80% of the cases of hepatitis C show these
values. However, we can see in pattern 14 that these values are also associated with not having hepatitis.
This may be evidence that these values or tests are not discriminative (corroborating the results of our first
approach, presented in Section 6.5). We also have to recall that, in these data, not having information
about a biopsy does not mean that a person does not have hepatitis; it means only that the person has not
been diagnosed yet. Even so, it may help uncover the relations that lead doctors to decide that
a biopsy is not needed.
6.6.4 Enriched Classification Results
Figures 6.12a and 6.12b show the accuracy of the classification step over the baselines Type and Fib,
respectively, and over the corresponding datasets enriched with the multi-relational patterns from the
pattern mining step.
When we add the patterns that represent patient exam behaviors, we can see in the figures that the
accuracy improves in both cases, as expected. Although small, the improvements indicate that patterns
are chosen instead of specific exams, which may result in models with less overfitting, and therefore
in more accurate predictions on new instances. Also, the results show that, in general, the larger the
number N of best patterns chosen, the better the accuracy.
When analyzing the different filters, there are small fluctuations, but both the rough independence
and rough chi-square filters achieved the best results in both baselines. Choosing the patterns
[Plots omitted: accuracy (%) as a function of the N best patterns (50 to 5000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline Type. (b) Baseline Fib.
Figure 6.12: Accuracy for baselines and respective extensions with MR patterns.
with higher support is the approach that brings the smallest improvements, because those patterns are
the smallest ones, and might not be discriminative of patients with different types or stages of hepatitis.
The closed filter is similar to the support filter, while the size filter achieves intermediate results. These
tendencies occur in both baselines.
Figures 6.13a and 6.13b analyze the size of the trees created by the classifier (i.e. the size of the
models).
[Plots omitted: size of the tree as a function of the N best patterns (50 to 5000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline Type. (b) Baseline Fib.
Figure 6.13: Size of the trees for baselines and respective extensions with MR patterns.
We can see that for both baselines, also as expected, the trees resulting from classifying the enriched
datasets are smaller than the base tree (with up to 300 fewer nodes for N = 500 patterns when we
are predicting the type of hepatitis). The tendencies of the different filters are the same: both rough
independence and rough chi-square filters result in the smallest trees, which means that they choose the
patterns that bring more information gain to the models (these patterns are therefore chosen instead of
individual examination results). Again, on the contrary, the support and closed filters are the ones that
achieve the smallest improvements in the size of the models.
6.7 Discussion and Conclusions
In this chapter, we presented a case study in the healthcare domain. Using the Hepatitis dataset, we
showed how these data can be modeled and explored in a multi-dimensional model to promote decision
support. We also discussed the use of multi-relational data mining algorithms to mine this model, as well
as the use of the results to improve classification.
The performance evaluation of StarFP-Stream over the Hepatitis star schema corroborates and validates
the results obtained over the fictitious AdventureWorks DW (Section 3.4): the algorithm is accurate
and needs less time than the approach of denormalizing before mining.
Results over the Hepatitis dataset show that it is possible to mine these data and find interesting
relations between dimensions. However, due to the nature and distributions of these data, the interesting
patterns found in the first approach have very low support, and therefore further analysis was needed.
Our study of the discovered association rules concluded that the examination results present
in the hepatitis dataset, without aggregating data per patient and biopsy, cannot predict the fibrosis
stage, mainly due to the very low supports.
Results achieved in our second approach show that we can discover structured patterns from the
multi-relational model, and find frequent sets of examination results that are common to some type of
hepatitis or that lead to some fibrosis stage. Classification experiments validate our claim – by enriching
the training data with the discovered multi-relational patterns, it is possible not only to improve the
accuracy of classification, but also to create better and smaller models, meaning that multi-relational
patterns are chosen as key features instead of specific examination results.
The methodology used is simple and general: it may be applied to any healthcare data warehouse or
star schema, and also to different domains. Another benefit of this methodology is that
any algorithm can be applied for multi-relational data mining, as well as different classifiers.
This application also demonstrates the importance and applicability of multi-dimensional patterns.
As future work, and in order to overcome the difficulties of this dataset, other paths must be taken. One
of the problems stems from the lack of data and from their quality. The hepatitis dataset contains more than
30% of patients that did not undergo any biopsy (undiagnosed), and more than 75% of examinations for
which there is no information about an active biopsy. A better understanding of why these
patients have not undergone a biopsy requires domain knowledge, and may help partition the data
and improve the results. In line with the above, this dataset contains a very low number of instances for
each type and stage of hepatitis. There is a need for the integration and analysis of more data in this
domain.
The use of different approaches may also lead to better outcomes, such as infrequent pattern min-
ing [ZY07], for finding rare patterns, or sequential and temporal pattern mining, for analyzing the
evolution of the disease.
We can also try to understand the use of interferon therapy, aggregating the data per patient in
(and out of) interferon therapy, and finding the differences between the frequent patterns that occur before,
during and after the administration of that therapy (again applying the same algorithm, StarFP-Stream).
Chapter 7
A Case Study in Education
The long history of education as an institution, along with the need to record student results as proof
of their credentials, has led to huge amounts of data, requiring automatic means for exploring them. In
general, these data mainly describe the courses taken by students and their corresponding grades, but
also information about the teachers involved in each course, and when and where the educational
process happened. With the spread of the information society, the variety of records has enlarged considerably,
and nowadays records encompass all kinds of items, from the learning materials available and used, to the
answers given to specific questions. Indeed, while present throughout its history, the multi-dimensionality
of these data is even clearer nowadays.
Educational data mining provides a first opportunity for exploring these data, offering adequate
tools for predicting student performance and dropouts, but also for understanding student behaviors.
However, despite the encouraging results, few approaches have been dedicated to exploring the multi-
dimensionality of the data, and the vast majority are restricted to just one, or at most two, dimensions. To
our knowledge, there is no proposal that addresses the problem in a multi-dimensional context, for example
predicting student results in a particular course given the entire context, such as the teacher involved,
the history of the course, the learning materials, or the time and place of the occurrence.
The main reason for this lack of interest is certainly the difficulty of mining multi-dimensional data.
Indeed, the huge amounts of data made joining the different dimensions (recorded in separate
tables) infeasible until the advent of big data exploration. But even in this new era, mining these huge
tables is not straightforward, due to their nature. As explained in Section 2.3, joining the tables into one
would result in a huge table with many attributes (the combination of all attributes of all dimensions),
many repetitions and possibly many missing values. In the educational domain, this join is even
harder, since each entity, such as a student, has a different number of associated events. For example,
students can attend a different number of courses and can fail some enrollments, resulting in a
different number of enrollments per student. Teachers can also lecture different courses, and each
course can be taught by a different number of teachers. The possible combinations are usually
far too many, and therefore there is a strong need for approaches able to explore these multi-dimensional
data without having to join the different tables.
In order to deal with this, some approaches use feature selection as a preprocessing step
to reduce the number of attributes to consider in the classification step [MVCRV13]. In these techniques,
the goal is to identify the data attributes that have the greatest effect on the output variable, and to use
only those as classification features. In this work, we follow another approach: we argue that we can
use multi-dimensional patterns to enrich classification data and, with this, improve prediction results
and deal with the high dimensionality of the data in this domain. The reasoning is that these patterns
capture the existing relations between the dimensions and between the instances, and therefore, by adding
them as features to the classification data, we can transmit these dependencies. Consequently, in some sense,
classification algorithms become able to take the multiple dimensions into account.
In this chapter, we apply to educational data the same multi-dimensional methodology for enriching
classification described in the hepatitis case study (Section 6.6.1). We illustrate the interest of our
application on the prediction of student results, when students enroll in a given course taught by a particular
teacher. Experimental results on a real educational case study reveal improvements both in prediction and
in the classification model built, when compared with two baseline models.
The rest of the chapter is organized as follows. Section 7.1 presents the educational multi-dimensional
model used in this case study. The application of the methodology to the proposed educational star
schema is described in detail in Section 7.2, along with an analysis of the multi-dimensional patterns
found and of the classification results achieved. Finally, Section 7.3 discusses and concludes the case
study.
7.1 TheEducare Multi-Dimensional Model
To the best of our knowledge, a characteristic of educational data that has not been addressed
properly is their multi-dimensionality. Indeed, the educational process encompasses a set of different
entities, each characterized by a distinct set of attributes (the dimensions). Students, teachers and courses
are clear examples of such dimensions. The educational process occurs at the intersection of these dimensions,
with the materialization of its events. Examples of such events are the lessons attended by some student on
some day, for a specific course with a particular teacher, or simply the grade achieved by some student in
some course for a particular enrollment.
Multi-dimensional models, such as star schemas or constellations (i.e. sets of star schemas), are
recognized as the most usual schemas for modeling these kinds of data, and are commonly used for modeling
data warehouses [KR02]. An example of a multi-dimensional model designed for educational data is shown in
Figure 7.1.
Figure 7.1: An example of an educational data-warehouse.
In this example we have two star schemas: one modeling student enrollments in courses, here-
inafter called the Enrollments Star, and another modeling teaching quality assurance (QA) surveys, called
the Teaching QA Star. In the fact table of the Enrollments Star, each student enrollment in a course in a
particular term is recorded, along with the corresponding grade achieved, if approved. The second star
contains the grades given by students to their teachers, in anonymized surveys carried out at the end of
each term. In this sense, each tuple in the Teaching QA Star records the average grade for a specific QA
item (or question), given to some teacher when teaching some course in a specific term for a determined
lesson type (note that these surveys are anonymous, and therefore there is no information about the
students that answered them). As can be seen in the figure, the dimensions Program, Course and Term are
shared by both star schemas.
By mining these multi-dimensional data we can, among other things, discover relations between
dimensions in the context of some event (for example, the types of students that achieve better
grades in different types of courses), as well as understand dimension behaviors, e.g. the most frequent
sets of course results.
In this case study we used data from the Information Systems and Computer Engineering program,
offered at Instituto Superior Tecnico, University of Lisbon, Portugal. From the data warehouse
created, we chose the two stars in Figure 7.1: the Enrollments Star and the Teaching QA Star,
modeling student performances in their enrollments, and teacher evaluation for their lectures, respectively.
7.2 Predicting Student Grades Using Multi-Dimensional Patterns
Our main goal is to test our multi-dimensional methodology using the star schemas in Figure 7.1, for
predicting student results in courses of more advanced years (3rd to 5th), based on the frequent behaviors
found in the first two years of the program and on the performance of teachers. With these experiments,
we want to show that it is possible to take the multi-dimensionality of educational data into account, and
that enriching the data with the multi-dimensional patterns improves classification results.
Data relative to teaching quality assurance is only available from 1995 to 1999, and therefore, to
achieve correct results, we decided to find frequent behaviors only until 1998 and to uncover student results
in 1999. The goal is, therefore, to predict student results in 1999, on a subset of the 10 most representative
courses from the 3rd to 5th years of the program (let this set be called Courses3-5). Thus, we are only
interested in the students and teachers involved. There were more than 650 students enrolled in some of
those courses in 1999, and 36 teachers lecturing those classes. In total, there were 1830 enrollments in
those conditions. Student grades were also categorized as A, B, C, D, F or Failure.
In order to evaluate our proposal, we compared the classification of our enriched data against two baselines
(without patterns), described next. During the pattern filtering phase, we also varied both the number of
patterns chosen and the different filters applied, to understand the variation of the results. The classification
results presented are the average of several 10-fold cross validations.
We also tested our methodology when using the Enrollments Star and the Teaching QA Star. The
multi-dimensional pattern mining algorithm used in these experiments was StarFP-Stream [SA13a];
we used our implementation of it (described in Chapter 3), and the C4.5 implementation
available in Weka.
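The evaluation protocol can be sketched as follows; the learner is abstracted away behind `train` and `classify` (in practice we used C4.5 via Weka), and the toy majority-class learner and data values below are purely illustrative:

```python
# Sketch of the evaluation protocol: average the accuracy over several
# repetitions of 10-fold cross validation. The learner is a placeholder;
# any classifier can be plugged in behind `train` and `classify`.
import random

def ten_fold_accuracy(data, train, classify, repeats=3, seed=0):
    rng = random.Random(seed)
    accs = []
    for _ in range(repeats):
        d = list(data)
        rng.shuffle(d)
        folds = [d[i::10] for i in range(10)]          # 10 disjoint folds
        for i, test in enumerate(folds):
            training = [x for j, f in enumerate(folds) if j != i for x in f]
            model = train(training)
            hits = sum(classify(model, x) == x[-1] for x in test)
            accs.append(hits / len(test))
    return sum(accs) / len(accs)

# Toy run: each row is (features, label), with a majority-class "learner".
data = [("a", 0)] * 30 + [("b", 1)] * 10
majority = lambda rows: max({r[-1] for r in rows},
                            key=[r[-1] for r in rows].count)
print(ten_fold_accuracy(data, majority, lambda m, x: m))  # 0.75
```

Averaging over several shuffled repetitions, as above, reduces the variance introduced by any single fold assignment.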
7.2.1 Baselines
For this case study, we decided to define two different baselines, which are then enriched with our
methodology and used to test the improvements.
Since our goal is to predict student results in 1999 for those 10 courses (Courses3-5), our baselines
contain all enrollments and grades of students that were enrolled in at least one of those courses
during that year. As noted above, there were 1830 enrollments in those conditions, and therefore both
baselines contain 1830 records.
The first baseline (denoted B1) consists of a table composed of the student information, the student's
average grade over the first two years of the program, until 1998, the information about the enrolled course
(from Courses3-5) whose result we want to predict, and the information about the main teacher (conceptually,
it consists of the join of the student, course and teacher dimensions, plus the student's average grade).
In this simple baseline, the only information about former student performance is the average grade;
it is therefore expected not to achieve a very good accuracy, and to improve significantly after
being extended with the multi-dimensional patterns.
A second baseline (denoted B2) consists of a table with the student information, the student's grade in
every course of the 1st and 2nd years (let them be called Courses1-2), until 1998, the information about
the enrolled course (from Courses3-5) in 1999 whose result we want to predict, and the respective main
teacher. Students that did not enroll in some course (in Courses1-2) are marked with an "NE" (not enrolled)
value. This baseline is like B1, but instead of keeping just the average of former years, it contains the
specific grades achieved in those courses. In this way, it is able to model some student behavior, and
is therefore expected to achieve better accuracy results by itself. What we want to show is that,
even so, adding the multi-dimensional patterns may improve not only the model, but also the
accuracy of the classifier. On the one hand, it may improve the model, because it is likely to result in smaller
trees; the reason is that these patterns, especially the aggregated ones, encapsulate several courses and can be
selected instead of several specific grades. On the other hand, it can improve the prediction accuracy, because
multi-dimensional patterns condense information, keeping only what is important. Without these patterns,
classification algorithms choose specific grades to build the model, which may lead to overfitting.
7.2.2 Methodology into Practice
In order to analyze these educational data, we decided to follow the methodology described above (Sec-
tion 6.6.1): (1) run multi-relational pattern mining over both the Enrollments Star and Teaching QA
Star schemas; (2) filter the best inter-dimensional and aggregated patterns; (3) enrich the classification
data (baselines) with these patterns; and (4) run classification over both the baselines and the enriched
datasets, and compare the results (the average of the predictions, and the size of the models built).
During phase one, for multi-dimensional pattern mining, we applied our algorithm StarFP-Stream to
each of the stars.
For finding student behaviors, only the most representative courses from the 1st and 2nd years were
taken into account (23 courses), from 1990 to 1998. In these first years all courses are mandatory, so
more than 17 thousand enrollments in this period were used for pattern mining. The data in
the fact table of the Enrollments Star was aggregated per (student, term) pair, so that we could find
the frequent sets of courses attended (both passed and failed) per term.
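This aggregation amounts to a group-by over the fact table; a minimal sketch follows (the rows are illustrative, not actual records from the data warehouse):

```python
# Sketch of the aggregation step: group fact-table rows by the pair
# (student, term), so that each transaction holds the set of course
# results of one student in one term. Row values are illustrative.
from collections import defaultdict

fact_rows = [
    ("s1", "1996/1", "sub=SIBD"),
    ("s1", "1996/1", "sub=PLF"),
    ("s1", "1997/1", "sub=AN_F"),   # a failed course in another term
    ("s2", "1996/1", "sub=SIBD"),
]

transactions = defaultdict(set)
for student, term, item in fact_rows:
    transactions[(student, term)].add(item)

# Each transaction is now one input itemset for the pattern mining step.
print(sorted(transactions[("s1", "1996/1")]))  # ['sub=PLF', 'sub=SIBD']
```

Each resulting itemset is then one transaction for the frequent pattern mining step.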
For finding teacher behaviors (Teaching QA Star), only the most representative courses from the 3rd,
4th and 5th years were used (the 10 courses in Courses3-5), from 1995 to 1998, inclusive. There were
1088 survey questions answered during this period. Surveys have 10 questions, here numbered from 1
to 10, and grades range from 1 (worst) to 5 (best). While mining the data for this star schema, records
in the fact table were aggregated per survey id (QASurvey), since each survey gathers the questions
evaluating the performance of one teacher, for a specific course, during one term. In this way, we can
find, e.g., frequent sets of assessments (grades per question) given by students to their teachers (as an
example, we can find that some teachers never arrive late, and/or are always available to answer
students' doubts).
The resulting patterns were then filtered with each of the proposed filters (Section 6.6.1), and the
best N were used to enrich the baselines. In this step, patterns were added to the baseline tables as
features (columns), and records that satisfied (or did not satisfy) a pattern were marked with true (or
false) in the corresponding feature.
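For illustration, ranking the mined patterns and keeping the N best can be sketched as below; only the support and size criteria are shown (the rough-set based filters would plug into the same interface), and the example patterns merely echo the style of Table 7.1:

```python
# Sketch of the filtering step: score each (itemset, support) pattern with
# a filter and keep the N best. Only the "support" and "size" filters are
# shown; rough independence and rough chi-square would be two more key
# functions with the same shape. Pattern values are illustrative.

def top_n(patterns, n, key):
    return sorted(patterns, key=key, reverse=True)[:n]

mined = [
    (frozenset({"sub=SIBD", "sub=PLF"}), 447),
    (frozenset({"sub=SIBD", "sub=PLF", "sub=AM3", "sub=PEst", "sub=AN"}), 168),
    (frozenset({"season=2", "sub=AN"}), 126),
]

best_by_support = top_n(mined, 2, key=lambda p: p[1])
best_by_size = top_n(mined, 1, key=lambda p: (len(p[0]), p[1]))
print([s for _, s in best_by_support])   # [447, 168]
print(len(best_by_size[0][0]))           # 5 items in the largest pattern
```

Swapping the `key` function is all that is needed to switch between filters, which is why varying the filter (and N) in the experiments is inexpensive.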
Since we have two star schemas, and therefore two sets of frequent multi-relational patterns, we tested
adding the patterns of each star separately, and also adding the N/2 best patterns of each together.
By doing this, we want to test what is more relevant for predicting student grades: the behaviors of
students, the performance of teachers, or both.
The classification algorithm C4.5 was then applied to these enriched datasets, and the results are
presented below.
7.2.3 Analysis of Multi-Relational Patterns
Some examples of patterns found for the Enrollments Star can be seen in Table 7.1. We can see there
that 117 students failed course F2 and had a bad grade (the lowest, F) in AN in the same term. Also,
the 3rd pattern indicates that it is frequent to pass SIBD, PLF, AM3, PEst and AN in a single
term. The last pattern shows that it is common to fail the AN course in the second season. This last
pattern is inter-dimensional, since it relates terms and subjects, while the others are aggregated patterns.
Table 7.1: Some examples of patterns found for the Enrollments Star.
Pattern                                          Support
(sub=F2, grade=AN_F)                             117
(sub=FEX, sub=AED)                               169
(sub=SIBD, sub=PLF, sub=AM3, sub=PEst, sub=AN)   168
(sub=SIBD, sub=PLF)                              447
(season=2, sub=AN)                               126
Examples of patterns for the Teaching QA Star are presented in Table 7.2. In this table, the first
pattern indicates that it is frequent to have a grade of 5 in question 8; the second (an aggregated pattern)
says that it is common to have grade 4 in questions 3, 4, 6, 7 and 9. The last pattern, for example,
is an inter-dimensional pattern, indicating that teachers of course M usually have grade 3 in question 5.
Table 7.2: Some examples of patterns found for the Teaching QA Star.
Pattern                                                  Support
(grade=8_5)                                              29
(grade=9_4, grade=6_4, grade=7_4, grade=3_4, grade=4_4)  11
(grade=4_3, grade=5_3, grade=9_3)                        8
(grade=8_5, subject=Comp)                                7
(grade=5_3, subject=M)                                   7
7.2.4 Enriched Classification Results
Figures 7.2a and 7.2b show the accuracy of the classification step over baselines B1 and B2, respectively,
and over the corresponding datasets enriched with patterns from student behaviors (i.e. patterns of the
Enrollments Star).
As expected, since B2 has more information about the background of the student, it achieves better
accuracy than B1 (a 35% improvement). It is interesting to see that we can predict 50% of student grades
[Plots omitted: accuracy (%) as a function of the N best patterns (10 to 1000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline 1. (b) Baseline 2.
Figure 7.2: Accuracy for both baselines and respective extensions with student behaviors.
based solely on their characteristics and average grade from years 1 and 2 (B1), and 85% if we know the
grades of the courses in which they enrolled those years (B2). When we add the patterns that represent
student behaviors, we can see in the figures that the accuracy improves in both cases, as expected. In B1
the improvement is huge, of about 35%, because we are adding behavior information about students that
was not present before. In B2, it allows classification to achieve an accuracy of 90%. Although only 4%,
this improvement indicates that patterns are chosen instead of specific courses, which may result in
models with less overfitting, and therefore in more accurate predictions on new instances. Also, the results
show that, in general, the more patterns we use to enrich the training data, the better the accuracy.
When analyzing the different filters, there are small fluctuations, but both the support and closed filters
achieved the best results. Choosing the largest patterns as the best ones (the size filter) is the approach
that brings the smallest improvements, because very few students satisfy them (very small coverage). Both
the rough independence and rough chi-square filters achieve intermediate results. These tendencies occur
in both the B1 and B2 baselines.
Figures 7.3a and 7.3b analyze the size of the trees created by the classifier (i.e. the size of the model).
[Plots omitted: size of the tree as a function of the N best patterns (10 to 1000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline 1. (b) Baseline 2.
Figure 7.3: Size of the trees for both baselines and respective extensions with student behaviors.
We can see that for B2, also as expected, the trees resulting from classifying the enriched datasets
are smaller than the base tree (with up to 300 fewer nodes for N = 250 patterns). In the B1 case, the
models of the enriched datasets are larger than the baseline, mainly because the baseline does not have
much information, and when we add patterns they are chosen for building the tree. Nevertheless, for
similar values of accuracy (85%), the tree for B1 is much smaller than the tree for B2. The tendencies
of the different filters are the same: both the support and closed filters result in smaller trees earlier in B2,
and in slightly larger trees in B1.
Results using the baselines enriched with teacher performances revealed that these patterns are not
very important for predicting student results. In these extended datasets, the accuracy is very close to
that of the baselines, and therefore the corresponding figures are not presented here.
7.3 Discussion and Conclusions
In this chapter, we presented a case study in the educational domain. Using a sample of a data warehouse
from the educare project, we discussed the use of multi-relational data mining algorithms to mine this model,
as well as the use of the results to improve classification.
Experiments on these real data show that it is possible to take into account the multi-dimensionality
of the educational data, and that by applying the multi-dimensional methodology, we are able not only
to discover frequent behaviors, but also to use those behaviors to improve the prediction of student grades.
We applied the method to more than one related star schema, which allowed us to find structured
patterns for both students and teachers, such as frequent sets of courses (and grades) for which students
were approved (or not), and frequent sets of teacher assessments.
As in the hepatitis case study, classification results show that prediction accuracy improves when
enriching datasets with the multi-dimensional patterns relating to student behaviors, and that the models
built are also smaller.
We show again in this chapter that the employed methodology is simple and general, and may be
applied to any educational data warehouse or star schema, as well as to different domains.
Chapter 8
Conclusions and Future Work
In this dissertation we have proposed a new algorithm for finding multi-dimensional patterns in large and
growing databases modeled as star schemas. The algorithm, named StarFP-Stream, combines MRDM
with data streaming techniques, and is able to mine a star schema directly, without materializing the
join between the tables. Moreover, by using a strategy similar to the one followed in mining data streams,
it can effectively mine large star schemas, as well as DWs, by dealing with their growing nature.
The pattern-tree used by the algorithm allows it to continuously store and update the current patterns
in an efficient way, keeping them up to date and accessible, anytime.
Another important contribution of StarFP-Stream is that it correctly handles degenerate
dimensions, by aggregating the rows in the fact table that correspond to the same business
event, and it is also able to find multi-dimensional patterns at other levels of aggregation (either by
some dimension or by a combination of dimensions).
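As a minimal illustration of this idea (a toy sketch, not the StarFP-Stream implementation; the tables, keys and support threshold are all hypothetical), the following Python snippet aggregates the fact rows that share the same degenerate key into one business event, and then counts multi-dimensional itemset supports without ever materializing the join:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: a fact table keyed by a degenerate dimension
# (order_id), referencing two dimensions by surrogate keys.
facts = [  # (order_id, product_key, store_key)
    (1, "p1", "s1"), (1, "p2", "s1"),
    (2, "p1", "s2"),
    (3, "p1", "s1"), (3, "p2", "s1"),
]
product_dim = {"p1": "bread", "p2": "milk"}
store_dim = {"s1": "Lisbon", "s2": "Porto"}

# Step 1: aggregate fact rows that share the same business event
# (the degenerate dimension); dimension values are fetched per key,
# so no joined table is ever built.
events = {}
for order, pk, sk in facts:
    tx = events.setdefault(order, set())
    tx.add(("product", product_dim[pk]))
    tx.add(("store", store_dim[sk]))

# Step 2: count the support of multi-dimensional itemsets per event.
support = Counter()
for tx in events.values():
    items = sorted(tx)
    for r in range(1, len(items) + 1):
        for itemset in combinations(items, r):
            support[itemset] += 1

# Keep itemsets occurring in at least 2 business events.
frequent = {s: c for s, c in support.items() if c >= 2}
```

Here, orders 1 and 3 collapse into identical transactions, so the multi-dimensional itemset {product=bread, product=milk, store=Lisbon} has support 2, while it would be miscounted if each fact row were treated as a separate transaction.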
There are only two other algorithms in the literature for relational pattern mining over data streams.
However, one is a probabilistic approach and therefore does not return all real patterns, while the other is
not able to deal with degenerate dimensions or other aggregations. In this sense, they are not directly
comparable with StarFP-Stream.
Performance analysis over several star schemas shows that our algorithm is accurate and efficient, and
that it does not depend on the number of transactions processed so far. Experiments also show that
StarFP-Stream outperforms its single-table predecessor in terms of time. Thus, we can say that our
algorithm surpasses the join-before-mining approach.
In order to tackle the incorporation of domain knowledge, we have also proposed in this work two
efficient and general algorithms for pushing constraints into a pattern-tree. CoPT and CoPT4Streams
are designed for single table (and static) datasets and for single table data streams, respectively. By
using the pattern-tree structure, both algorithms are able to optimize the incorporation of constraints.
The idea is to take advantage of constraint properties to avoid unnecessary tests and to eliminate invalid
patterns earlier, while traversing the tree.
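This pruning idea can be sketched as follows (a simplified illustration with a hypothetical price constraint, not the actual CoPT pattern-tree traversal): once an anti-monotone constraint fails for a pattern, every superset of that pattern is discarded without being tested at all.

```python
# Hypothetical item prices used by the example constraint.
prices = {"a": 5, "b": 20, "c": 3}

def max_price_ok(itemset, limit=10):
    # sum(price) <= limit is anti-monotone: adding items can only
    # increase the sum, so a violation propagates to all supersets.
    return sum(prices[i] for i in itemset) <= limit

patterns = [frozenset(s) for s in
            [{"a"}, {"b"}, {"c"}, {"a", "c"}, {"a", "b"}, {"a", "b", "c"}]]

valid, failed = [], set()
for p in sorted(patterns, key=len):      # visit smaller sets first
    if any(f <= p for f in failed):      # superset of a known failure:
        continue                         # eliminated without testing
    if max_price_ok(p):
        valid.append(p)
    else:
        failed.add(p)

# {b} fails the constraint, so {a, b} and {a, b, c} are never tested;
# valid keeps {a}, {c} and {a, c}.
```

The same property is what lets a tree traversal cut an entire subtree as soon as one node violates an anti-monotone constraint, instead of testing every pattern individually.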
In the streaming case, the advantages of this approach are even more visible, since constraints are
pushed at each batch boundary, resulting in a smaller pattern-tree for every subsequent batch, and therefore
in less time and memory needed in the overall discovery process.
Experiments show that both algorithms are effective and efficient for all constraint properties, even
for constraints with low selectivity, when compared to an approach that does not take these
properties into account.
For the integration of multi-dimensional mining with constrained mining, we first defined a set of
constraints for star schemas – the star constraints: entity type, entity, attribute and measure constraints.
These constraints capture the relations in a star schema and the aspects that can be restricted. Then,
we also proposed a set of strategies for pushing these star constraints into multi-dimensional mining
algorithms, and showed that it is possible to incorporate constraints into the mining of multiple tables.
Being post-processing algorithms, CoPT and CoPT4Streams cannot be directly applied to
the mining of a star schema. However, they can both be applied to the mining of the fact table, with small
adaptations to deal with different entity types and to retrieve the values from the respective dimensions.
We have also proposed the algorithm D2StarFP-Stream, an adaptation of StarFP-Stream for
incorporating star constraints into the mining of large and growing star schemas. By incorporating
constraints, the algorithm is able to maintain smaller summary structures, minimizing the bottleneck of
its counterpart and therefore returning fewer, but more interesting, results.
To the best of our knowledge, this is the first approach dedicated to the incorporation of constraints
into multi-dimensional pattern mining.
Experiments over real-world datasets validate our claims, and demonstrate the utility, efficacy and efficiency
of StarFP-Stream. Of particular interest are the experiments on enriching classification data with multi-
dimensional patterns. Results show that prediction accuracy improves when we add the discovered
patterns, and that they are chosen as key features instead of pre-existing data. This demonstrates
the interest and applicability of multi-dimensional patterns.
8.1 Future Work
Despite the advances made in this dissertation, it opens several opportunities for future research, from
the multi-relational, the data streaming and the constrained perspectives.
Considering a Time Sensitive Model: In many real world applications, changes in patterns and their
trends are more interesting than patterns themselves (e.g. shopping and fashion trends, Internet
bandwidth usage, resource allocation, etc.). Therefore, an important way of improvement is to
extend StarFP-Stream to a time sensitive model. We discuss this in section 3.3.8.
Finding Structured and Temporal Patterns: Since a DW is a historical repository of data, time is a dimen-
sion that is always present. Events arrive continuously in time, and are stored in the fact
table. In this sense, DWs have a sequential nature, and therefore finding other types of patterns,
such as sequences and temporal regularities (e.g. periodicities), is possible and could bring benefits
to multi-relational pattern mining.
Indeed, despite the advancements, there is still the need for creating algorithms capable of finding
structured and temporal patterns in a multi-dimensional context.
Pushing Structural Constraints: Along the line of the above, since we have time and sequences, it
also makes sense to constrain the temporal and sequential aspects of the facts in a DW. These
structural constraints allow us to, for example, limit the gap between events, perform short-term
(or long-term) analysis, specify interesting combinations or orders of items, etc.
Therefore, there is also an interest in defining these constraints in the multi-dimensional environ-
ment, and developing algorithms capable of incorporating them.
Pushing Graph-Based Domain Knowledge and Network Constraints: Graph-based represen-
tations are a valuable and more expressive source of domain knowledge, and are increasingly
available nowadays. These representations, such as ontologies, capture the conceptual structure of
the domain and model the existing concepts and relations in a more intuitive way (note that they
are models of the domain, not of the data).
One way to incorporate this knowledge is through structural network constraints. As described in
section 4.5, by mapping items to domain concepts, these constraints allow us to filter the existing
relations (both taxonomical and non-taxonomical) between items, as well as the concepts and
distances. In the presence of a graph-based domain model, these constraints can also be defined
over the star schema.
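A structural network constraint of this kind can be sketched as follows (a toy illustration under assumed data: the ontology, the item-to-concept mapping and the distance threshold are all hypothetical): items are mapped to concepts, and a pattern is kept only if every pair of its items lies within a maximum distance in the concept graph.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy ontology as an undirected concept graph.
graph = {
    "dairy": {"food"}, "bakery": {"food"},
    "food": {"dairy", "bakery"},
    "soap": {"hygiene"}, "hygiene": {"soap"},
}
item_concept = {"milk": "dairy", "bread": "bakery", "soap": "soap"}

def distance(a, b):
    # BFS shortest path between two concepts; None if disconnected.
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for n in graph.get(node, ()):
            if n not in seen:
                seen.add(n)
                queue.append((n, d + 1))
    return None

def within_distance(pattern, max_d=2):
    # Network constraint: every pair of items must map to concepts
    # at most max_d edges apart in the domain model.
    for x, y in combinations(pattern, 2):
        d = distance(item_concept[x], item_concept[y])
        if d is None or d > max_d:
            return False
    return True

patterns = [{"milk", "bread"}, {"milk", "soap"}]
filtered = [p for p in patterns if within_distance(p)]
```

With this toy model, {milk, bread} survives (dairy and bakery are two edges apart through food), while {milk, soap} is discarded because the concepts are disconnected in the domain graph.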
Work in this area is increasing [KLSP07, MGB08, Ant09b, ME09b], and results show that the
discovered patterns are more interesting when we filter them according to what we already know
based on the domain model. This means that one important step forward is to find ways of
incorporating these graph-based domain models into the mining of multiple relations.
Optimizing StarFP-Stream: Another path for improvement is to optimize our algorithm.
For example, parallelizing StarFP-Stream can significantly improve the time needed, as well as
increase the throughput of the algorithm. In this case, we can parallelize the processing of each
fact in a batch (since all it does is keep the transactions in the corresponding DimFP-Trees).
Also, at each batch boundary, while those trees are being processed and results mined, the new
batch may already be collected and facts can be processed in parallel. There have already been
some efforts in the parallelization of traditional pattern mining, in particular of the base FP-Growth
algorithm [LWZ+08], which may also serve as a basis for parallelizing the mining of the SuperFP-
Tree of our StarFP-Stream algorithm.
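The batch-level parallelization described above can be sketched as follows (an illustrative toy only, not StarFP-Stream itself: the per-fact work is replaced by a simple item count, and the names are hypothetical). Each fact of a batch is processed independently, and the partial results are merged at the batch boundary:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# A toy batch of facts: (product_key, store_key) pairs.
batch = [("p1", "s1"), ("p2", "s1"), ("p1", "s2"), ("p2", "s2")]

def process_fact(fact):
    # In StarFP-Stream this step would insert the fact's transaction
    # into the corresponding DimFP-Trees; here we just count the
    # keys it touches, to keep the sketch self-contained.
    product, store = fact
    return Counter({product: 1, store: 1})

# Facts in a batch are independent, so they can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_fact, batch))

# Merge the partial results at the batch boundary.
totals = sum(partials, Counter())
```

The same idea extends to overlapping stages: while one batch's trees are being mined, the facts of the next batch can already be collected and dispatched to the pool.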
Furthermore, the pattern-tree is usually huge, and therefore it might not fit in main memory. A
good next step would be to find a way to store and manage it on disk.
Finally, another path for improvement is to integrate StarFP-Stream with database management
systems (DBMS), in order to retrieve data from the dimensions. This might be very important,
since our algorithm assumes all dimensions are in main memory, which may not be possible in real-
world large DWs. By integrating it with the DBMS, whenever a new fact arrives StarFP-Stream
can ask the database for the corresponding transactions, saving significantly on memory needs.
Bibliography
[ACTM11] Annalisa Appice, Michelangelo Ceci, Antonio Turi, and Donato Malerba. A parallel, distributed algorithm for relational frequent pattern discovery from very large data sets. Intell. Data Anal., 15(1):69–88, January 2011.
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD 93), pages 207–216. ACM, 1993.
[ALB03] Hunor Albert-Lorincz and Jean-Francois Boulicaut. Mining frequent sequential patterns under regular expressions: A highly adaptive strategy for pushing constraints. In Proc. of the 3rd SIAM Int. Conf. on Data Mining (SDM 03), pages 316–320, San Francisco, CA, USA, 2003. Springer-Verlag.
[Ant07] Claudia Antunes. Onto4ar: a framework for mining association rules. In Workshop on Constraint-Based Mining and Learning in the Int. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDDW-CMILE 07), page 37, Warsaw, Poland, 2007. Springer.
[Ant08] Claudia Antunes. An ontology-based framework for mining patterns in the presence of background knowledge. In Proc. of Int. Conf. on Advanced Intelligence (ICAI 08), pages 163–168, Beijing, China, 2008. Post and Telecom Press.
[Ant09a] Claudia Antunes. Mining patterns in the presence of domain knowledge. In Proc. of the 11th Int. Conf. on Enterprise Information Systems (ICEIS 09), pages 188–193, Milan, Italy, 2009. Springer.
[Ant09b] Claudia Antunes. Pattern mining over star schemas in the onto4ar framework. In Proc. of the 2009 Int. Workshop on Semantic Aspects in Data Mining (SADM 09), pages 453–458, Washington, DC, USA, 2009. IEEE Computer Society.
[AO02] Claudia Antunes and Arlindo Oliveira. Inference of sequential association rules guided by context-free grammars. In Proc. of the 6th Int. Conf. on Grammatical Inference (ICGI 2002), pages 289–293, Amsterdam, 2002. Springer.
[AO03] Claudia Antunes and Arlindo Oliveira. Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In Proc. of the 3rd Int. Conf. on Machine Learning and Data Mining in Pattern Recognition (MLDM 03), pages 239–251, Leipzig, Germany, 2003. Springer-Verlag.
[AO04] Claudia Antunes and Arlindo L. Oliveira. Sequential pattern mining with approximated constraints. In Proc. of IADIS Int. Applied Computing Conf. (AC 04), pages 131–138, Lisbon, Portugal, 2004. IADIS Press.
[AO05] Claudia Antunes and Arlindo Oliveira. Constraint relaxations for discovering unknown sequential patterns. In Knowledge Discovery in Inductive Databases: 3rd Int. Workshop, KDID 2004 (Revised Selected and Invited Papers), pages 11–32, 2005.
[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB 94: Proc. of the 20th Intern. Conf. on Very Large Data Bases, pages 487–499, San Francisco, USA, 1994. Morgan Kaufmann.
[BA99] Roberto J. Bayardo and Rakesh Agrawal. Mining the most interesting rules. In Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 99), pages 145–154, San Diego, California, United States, 1999. ACM.
[BA11] Joana Barracosa and Claudia Antunes. Anticipating teachers performance. In Proc. of Int. Workshop on Knowl. Discovery on Educational Data (KDDinED@KDD). ACM, 2011.
[BA14] Antonio Barreto and Claudia Antunes. Mining compact but non-lossy convergent patterns over time series. In Proceedings of the International Work-Conference on Time Series (ITISE 14), 2014.
[Bay05] Roberto J. Bayardo. The hows, whys, and whens of constraints in itemset and rule discovery. In Proc. of the 2004 European Conf. on Constraint-Based Mining and Inductive Databases, pages 1–13, Hinterzarten, Germany, 2005. Springer-Verlag.
[BBM02] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proc. of the Nineteenth Int. Conf. on Machine Learning (ICML 02), pages 27–34, Sydney, Australia, 2002. Morgan Kaufmann Publishers Inc.
[BBR00] Jean-Francois Boulicaut, Artur Bykowski, and Christophe Rigotti. Approximation of frequency queries by means of free-sets. In Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD 00), pages 75–85, London, UK, 2000. Springer-Verlag.
[BGKW03] Cristian Bucila, Johannes Gehrke, Daniel Kifer, and Walker M. White. Dualminer: A dual-pruning algorithm for itemsets with constraints. Data Min. Knowl. Discov., 7(3):241–272, 2003.
[BGMP03] Francesco Bonchi, Fosca Giannotti, Alessio Mazzanti, and Dino Pedreschi. Adaptive constraint pushing in frequent pattern mining. In Proc. of the 7th Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD 03), pages 47–58, Cavtat-Dubrovnik, Croatia, 2003. Springer Berlin Heidelberg.
[BGMP05] Francesco Bonchi, Fosca Giannotti, Alessio Mazzanti, and Dino Pedreschi. Exante: A preprocessing method for frequent-pattern mining. IEEE Intelligent Systems, 20(3):25–31, 2005.
[BJ00] Jean-Francois Boulicaut and Baptiste Jeudy. Using constraints for itemset mining: Should we prune or not? In Actes des 16emes Journees Bases de Donnees Avancees (BDA 00), Blois, France, 2000.
[BJ05] Jean-Francois Boulicaut and Baptiste Jeudy. Constraint-based data mining. In The Data Mining and Knowledge Discovery Handbook, pages 399–416. Springer, 2005.
[BMS97] Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: generalizing association rules to correlations. SIGMOD Rec., 26(2):265–276, 1997.
[Bou04] Jean-Francois Boulicaut. Inductive databases and multiple uses of frequent itemsets: The cinq approach. In Database Support for Data Mining Applications, pages 1–23, Berlin, Germany, 2004. Springer.
[CJB99] B. Chandrasekaran, John R. Josephson, and V. Richard Benjamins. What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1):20–26, 1999.
[CJS00] Viviane Crestana-Jensen and Nandit Soparkar. Frequent itemset counting across multiple tables. In PADKK 00: Proc. of the 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Current Issues and New Applications, pages 49–61, London, 2000. Springer.
[CLZ07] Longbing Cao, Dan Luo, and Chengqi Zhang. Knowledge actionability: satisfying technical and business interestingness. Int. J. Bus. Intell. Data Min., 2(4):496–514, December 2007.
[CMB02] Matthieu Capelle, Cyrille Masson, and Jean-Francois Boulicaut. Mining frequent sequential patterns under a similarity constraint. In Proc. of the Third Intern. Conf. on Intelligent Data Engineering and Automated Learning (IDEAL 02), pages 1–6, London, UK, 2002. Springer-Verlag.
[CS01] Laurentiu Cristofor and Dan Simovici. Mining association rules in entity-relationship modeled databases. Technical report, 2001.
[CYZZ10a] Longbing Cao, P. Yu, C. Zhang, and H. Zhang. Data Mining for Business Applications. Springer, 2010.
[CYZZ10b] Longbing Cao, P. Yu, C. Zhang, and Y. Zhao. Domain driven data mining. Springer, 2010.
[CZ06] Longbing Cao and Chengqi Zhang. Domain-driven data mining: A practical methodology. Int. Journal of Data Warehousing and Mining (IJDWM), 2(4):49–65, 2006.
[CZZ+07] Longbing Cao, Chengqi Zhang, Yanchang Zhao, Philip S. Yu, and Graham Williams. Dddm2007: Domain driven data mining. SIGKDD Explor. Newsl., 9(2):84–86, 2007.
[DKP+06a] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla. Unifying logical and statistical ai. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI 06), pages 2–7. AAAI Press, 2006.
[DKP+06b] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla. Unifying logical and statistical ai. In Proc. of the 21st Int. Conf. on Artificial Intelligence - Volume 1 (AAAI 06), pages 2–7, Boston, Massachusetts, 2006. AAAI Press.
[DL99] Guozhu Dong and Jinyan Li. Efficient mining of emerging patterns: discovering trends and differences. In Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 99), pages 43–52, San Diego, California, United States, 1999. ACM.
[Dom03] Pedro Domingos. Prospects and challenges for multi-relational data mining. SIGKDD Explor. Newsl., 5(1):80–83, 2003.
[Dom07] Pedro Domingos. Toward knowledge-rich data mining. Data Min. Knowl. Discov., 15(1):21–28, 2007.
[DP08] C. Diamantini and D. Potena. Semantic annotation and services for kdd tools sharing and reuse. In Proc. of the 2008 IEEE Int. Conf. on Data Mining Workshops (ICDMW 08), pages 761–770, Pisa, Italy, 2008. IEEE.
[DR97] L. Dehaspe and L. De Raedt. Mining association rules in multiple relations. In ILP 97: Proc. of the 7th Intern. Workshop on Inductive Logic Programming, pages 125–132, London, UK, 1997. Springer.
[D96] Saso Dzeroski. Inductive logic programming and knowledge discovery in databases. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 117–152. MIT Press, 1996.
[D03] Saso Dzeroski. Multi-relational data mining: an introduction. SIGKDD Explor. Newsl., 5(1):1–16, 2003.
[EC07] Gonenc Ercan and Ilyas Cicekli. Using lexical chains for keyword extraction. Inf. Process. Manage., 43(6):1705–1714, 2007.
[FCAM09] Fabio Fumarola, Anna Ciampi, Annalisa Appice, and Donato Malerba. A sliding window algorithm for relational frequent patterns mining from data streams. In Proc. of the 12th Intern. Conf. on Discovery Science, pages 385–392. Springer, 2009.
[FPSM92] William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. Knowledge discovery in databases: an overview. AI Mag., 13(3):57–70, 1992.
[FPSS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37–54, 1996.
[FSC05] Nuno Fonseca, Fernando Silva, and Rui Camacho. Strategies to parallelize ilp systems. In Proc. of the 15th Int. Conf. on Inductive Logic Programming (ILP 05), pages 136–153, Berlin, Heidelberg, 2005. Springer-Verlag.
[GB00] Bart Goethals and Jan Van den Bussche. On supporting interactive association rule mining. In Proc. of the 2nd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 00), pages 307–316, London, UK, 2000. Springer-Verlag.
[GHP+03] Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, and Philip S. Yu. Mining frequent patterns in data streams at multiple time granularities: Next generation data mining. AAAI/MIT, 2003.
[GLW00] G. Grahne, L. V. S. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. In Proc. of the 16th Int. Conf. on Data Engineering, pages 512–521, 2000.
[GMV11] Bart Goethals, Sandy Moens, and Jilles Vreeken. Mime: a framework for interactive visual pattern mining. In Proc. of the 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 11), pages 757–760, San Diego, California, USA, 2011. ACM.
[GRS99] Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proc. of the 25th Int. Conf. on Very Large Data Bases (VLDB 99), pages 223–234, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[GSD07] Warwick Graco, Tatiana Semenova, and Eugene Dubossarsky. Toward knowledge-driven data mining. In Proc. of the 2007 Int. Workshop on Domain Driven Data Mining (DDDM 07), pages 49–54, San Jose, California, 2007. ACM.
[HCXY07] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov., 15(1):55–86, August 2007.
[HF95] Jiawei Han and Yongjian Fu. Discovery of multiple-level association rules from large databases. In Proc. of the 21th Int. Conf. on Very Large Data Bases (VLDB 95), pages 420–431, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[HG02] Jochen Hipp and Ulrich Guntzer. Is pushing constraints deeply into the mining algorithms really what we want?: an alternative approach for association rule mining. SIGKDD Explor. Newsl., 4(1):50–55, 2002.
[HKP11] Jiawei Han, M. Kamber, and Jian Pei. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Elsevier Science, 2011.
[HPY00] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD 00: Proc. of the 2000 ACM SIGMOD, pages 1–12, New York, NY, USA, 2000. ACM.
[HPYM04] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53–87, 2004.
[HYXW09] Wei Hou, Bingru Yang, Yonghong Xie, and Chensheng Wu. Mining multi-relational frequent patterns in data streams. In BIFE 09: Proc. of the Second Intern. Conf. on Business Intelligence and Financial Engineering, pages 205–209, 2009.
[Inm96] W. H. Inmon. Building the data warehouse (2nd ed.). John Wiley & Sons, Inc., New York, NY, USA, 1996.
[JLL07] Joanna Jozefowska, Agnieszka Lawrynowicz, and Tomasz Lukaszewski. A study of the semintec approach to frequent pattern mining. In Bettina Berendt, Dunja Mladenic, Marco de Gemmis, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtech Svatek, and Filip Zelezny, editors, Knowledge Discovery Enhanced with Semantic and Social Information, volume 220 of Studies in Computational Intelligence, pages 37–51. Springer, 2007.
[JLL10] Joanna Jozefowska, Agnieszka Lawrynowicz, and Tomasz Lukaszewski. The role of semantics in mining frequent patterns from knowledge bases in description logics with rules. Theory Pract. Log. Program., 10(3):251–289, 2010.
[JN07] Finn Jensen and Thomas Nielsen. Bayesian Networks and Decision Graphs. Springer Publishing Company, Incorporated, 2nd edition, 2007.
[JS04] Szymon Jaroszewicz and Dan A. Simovici. Interestingness of frequent itemsets using bayesian networks as background knowledge. In Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 04), pages 178–186, Seattle, WA, USA, 2004. ACM.
[JS05] Szymon Jaroszewicz and Tobias Scheffer. Fast discovery of unexpected patterns in data, relative to a bayesian network. In Proc. of the 11th ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining (KDD 05), pages 118–127, Chicago, Illinois, USA, 2005. ACM.
[Kan05] Juveria Kanodia. Structural advances for pattern discovery in multi-relational databases. Master's thesis, Rochester Institute of Technology, Rochester, NY, 2005.
[KLSP07] Yen-Ting Kuo, Andrew Lonie, Liz Sonenberg, and Kathy Paizis. Domain ontology driven data mining: a medical case study. In Proc. of the 2007 Int. Workshop on Domain Driven Data Mining (DDDM 07), pages 11–17, San Jose, California, 2007. ACM.
[KR02] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit - The Complete Guide to Dimensional Modeling. John Wiley & Sons, Inc., New York, USA, 2nd edition, 2002.
[KSR+07] Stanley Kok, M. Sumner, Matthew Richardson, Parag Singla, Hoifung Poon, D. Lowd, and Pedro Domingos. The alchemy system for statistical relational ai. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2007. http://alchemy.cs.washington.edu.
[KT05] Hian Koh and Gerald Tan. Data mining applications in healthcare. Journal of Healthcare Information Management, 19(2):64–71, 2005.
[KW06] Harleen Kaur and Siri Wasan. Empirical study on applications of data mining techniques in healthcare. Journal of Computer Science, 2(2):194–200, 2006.
[LB09] Carson Kai-Sang Leung and Dale A. Brajczuk. Efficient algorithms for mining constrained frequent patterns from uncertain data. In Proc. of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U 09), pages 9–18, Paris, France, 2009. ACM.
[LE09] Francesca Lisi and Floriana Esposito. On ontologies as prior conceptual knowledge in inductive logic programming. In Bettina Berendt, Dunja Mladenic, Marco de Gemmis, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtech Svatek, and Filip Zelezny, editors, Knowledge Discovery Enhanced with Semantic and Social Information, volume 220 of Studies in Computational Intelligence, pages 3–17. Springer, 2009.
[LHB10] Carson Kai-Sang Leung, Boyu Hao, and Dale Brajczuk. Mining uncertain data for frequent itemsets that satisfy aggregate constraints. In Proc. of the 2010 ACM Symposium on Applied Computing (SAC 10), pages 1034–1038, Sierre, Switzerland, 2010. ACM.
[LHM98] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proc. of the 1998 Intern. Conf. on Knowledge Discovery and Data Mining (KDD 98), pages 80–86, New York, NY, USA, 1998. AAAI Press.
[Lis05] Francesca Lisi. Principles of inductive reasoning on the semantic web: a framework for learning in al-log. In Proc. of the 3rd Int. Conf. on Principles and Practice of Semantic Web Reasoning (PPSWR 05), pages 118–132, Berlin, Germany, 2005. Springer-Verlag.
[Liu10] Haishan Liu. Towards semantic data mining. In Proc. of the 9th Int. Semantic Web Conf. (ISWC 10), 2010.
[LK06] Carson Kai-Sang Leung and Quamrul Khan. Efficient mining of constrained frequent patterns from streams. In Proc. of the 10th Int. Database Engineering and Applications Symposium (IDEAS 06), volume 0, pages 61–68, Delhi, India, 2006. IEEE Computer Society.
[LLH11] Hongyan Liu, Yuan Lin, and Jiawei Han. Methods for mining frequent items in data streams: an overview. Knowl. Inf. Syst., 26(1):1–30, 2011.
[LLN02] Carson Kai-Sang Leung, Laks Lakshmanan, and Raymond Ng. Exploiting succinct constraints using fp-trees. SIGKDD Explor. Newsl., 4(1):40–49, 2002.
[LM03] Francesca Lisi and Donato Malerba. Bridging the gap between horn clausal logic and description logics in inductive learning. In Proc. of Advances in Artificial Intelligence, 8th Congress of the Italian Association for Artificial Intelligence (AI*IA 03), pages 53–64, Pisa, Italy, 2003. Springer.
[LM04] Francesca Lisi and Donato Malerba. Inducing multi-level association rules from multiple relations. Machine Learning, 55(2):175–210, 2004.
[LR98] Alon Levy and Marie-Christine Rousset. Combining horn rules and description logics in carin. Artif. Intell., 104(1-2):165–209, 1998.
[LS12] Carson Kai-Sang Leung and Lijing Sun. A new class of constraints for constrained frequent pattern mining. In Proc. of the 27th Annual ACM Symposium on Applied Computing (SAC 12), pages 199–204, Trento, Italy, 2012. ACM.
[LSW97] Brian Lent, Arun Swami, and Jennifer Widom. Clustering association rules. In Proc. of the 13th Intern. Conf. on Data Engineering (ICDE 97), pages 220–231, Birmingham, U.K., 1997. IEEE Computer Society.
[LVS+11] Nada Lavrac, Anze Vavpetic, Larisa N. Soldatova, Igor Trajkovski, and Petra Kralj Novak. Using ontologies in semantic data mining with segs and g-segs. In Proc. of the 14th Int. Conf. on Discovery Science (DS 11), pages 165–178, Finland, 2011.
[LWZ+08] Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang. Pfp: Parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys 08), pages 107–114, New York, NY, USA, 2008. ACM.
[ME09a] Nizar Mabroukeh and Christie Ezeife. Semantic-rich markov models for web prefetching. In Proc. of the IEEE Int. Conf. on Data Mining Workshops (ICDMW 09), pages 465–470, Miami, Florida, USA, 2009.
[ME09b] Nizar Mabroukeh and Christie Ezeife. Using domain ontology for semantic web usage mining and next page prediction. In Proc. of the 18th ACM Conf. on Information and Knowledge Management (CIKM 09), pages 1677–1680, Hong Kong, China, 2009. ACM.
[MEL01] Donato Malerba, Floriana Esposito, and Francesca A. Lisi. A logical framework for frequent pattern discovery in spatial data. In FLAIRS Conference, pages 557–561, Florida, USA, 2001. AAAI Press.
[MGB08] Claudia Marinica, Fabrice Guillet, and Henri Briand. Post-processing of discovered association rules using ontologies. In Proc. of the 2008 Int. Workshop on Domain Driven Data Mining (DDDM 08), pages 126–133, Pisa, Italy, 2008. IEEE Computer Society.
[MM02] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In VLDB 02: Proc. of the 28th Intern. Conf. on Very Large Data Bases, pages 346–357, Hong Kong, China, 2002. Morgan Kaufman.
[MPP07] Ricardo Martinez, Claude Pasquier, and Nicolas Pasquier. Genminer: Mining informative association rules from genomic data. In Proc. of the IEEE Intern. Conf. on Bioinformatics and Biomedicine (BIBM 2007), pages 15–22. IEEE Computer Society, 2007.
[MT97] Heikki Mannila and Hannu Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov., 1(3):241–258, 1997.
[MTIV97] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3):259–289, 1997.
[MVCRV13] Carlos Marquez-Vera, Alberto Cano, Cristobal Romero, and Sebastian Ventura. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl. Intell., 38(3):315–330, 2013.
[NCW97] Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming. Springer-Verlag, Secaucus, NJ, USA, 1997.
[NDD99] Biswadeep Nag, Prasad M. Deshpande, and David J. DeWitt. Using a knowledge cache for interactive discovery of association rules. In Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 99), pages 244–253, San Diego, California, United States, 1999. ACM.
[NFW02] Eric Ka Ka Ng, Ada Wai-Chee Fu, and Ke Wang. Mining association rules from stars. In ICDM 02: Proc. of the 2002 IEEE International Conf. on Data Mining, pages 322–329, Japan, 2002. IEEE.
[NJG11] Siegfried Nijssen, Aida Jimenez, and Tias Guns. Constraint-based pattern mining in multi-relational databases. In ICDM Workshops, pages 1120–1127, Vancouver, BC, Canada, 2011. IEEE Computer Society.
[NK01] Siegfried Nijssen and Joost N. Kok. Faster association rules for multiple relations. In IJCAI 01: Proc. of the 17th Intern. Joint Conf. on Artificial Intelligence, volume 2, pages 891–896, San Francisco, CA, USA, 2001. Morgan Kaufmann.
[NLHP98] Raymond Ng, Laks Lakshmanan, Jiawei Han, and Alex Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of Data, pages 13–24, Seattle, Washington, United States, 1998. ACM.
[NVTL09] Petra Novak, Anze Vavpetic, Igor Trajkovski, and Nada Lavrac. Towards semantic data mining with g-segs. In Proc. of the 11th Int. Multiconference Information Society (IS 09), 2009.
[ORS98] Banu Ozden, Sridhar Ramaswamy, and Abraham Silberschatz. Cyclic association rules. In Proc. of the 14th Int. Conf. on Data Engineering (ICDE 98), pages 412–421, Washington, DC, USA, 1998. IEEE Computer Society.
[PA09] Miguel Pironet and Claudia Antunes. Classification for fraud detection with social network analysis. Technical report, Instituto Superior Tecnico, Universidade de Lisboa, Portugal, 2009.
[PBTL99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Efficient mining of association rules using closed itemset lattices. Inf. Syst., 24(1):25–46, 1999.
[PDS08] Pance Panov, Saso Dzeroski, and Larisa Soldatova. Ontodm: An ontology of data mining. In Proc. of the 2008 IEEE Int. Conf. on Data Mining Workshops (ICDMW 08), pages 752–760, Washington, DC, USA, 2008. IEEE Computer Society.
[Pei02] Jian Pei. Pattern-growth methods for frequent pattern mining. PhD thesis, Simon Fraser University, Burnaby, BC, Canada, 2002. Adviser: Jiawei Han.
[PH00] Jian Pei and Jiawei Han. Can we push more constraints into frequent pattern mining? In Proc. of the Sixth ACM SIGKDD Intern. Conf. on Knowledge Discovery and Data Mining (KDD 00), pages 350–354, Boston, Massachusetts, USA, 2000. ACM.
[PH02] Jian Pei and Jiawei Han. Constrained frequent pattern mining: a pattern-growth view. SIGKDD Explor. Newsl., 4(1):31–39, 2002.
[PHL01] Jian Pei, Jiawei Han, and Laks V. S. Lakshmanan. Mining frequent itemsets with convertible constraints. In Proc. of the 17th Int. Conf. on Data Engineering (ICDE 01), pages 433–442, Washington, DC, USA, 2001. IEEE Computer Society.
[PHMA+01] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Meichun Hsu. Prefixspan: Mining sequential patterns by prefix-projected growth. In Proc. of the 17th Int. Conf. on Data Engineering (ICDE 01), pages 215–224, Washington, DC, USA, 2001. IEEE Computer Society.
[PHW02] Jian Pei, Jiawei Han, and Wei Wang. Mining sequential patterns with constraints in large databases. In Proc. of the 2002 ACM Int. Conf. on Information and Knowledge Management (CIKM 02), pages 18–25, McLean, VA, USA, 2002.
[PHW07] Jian Pei, Jiawei Han, and Wei Wang. Constraint-based sequential pattern mining: the pattern-growth methods. J. Intell. Inf. Syst., 28(2):133–160, 2007.
[PRV05] Luciene Pizzi, Marcela Ribeiro, and Marina Vieira. Analysis of hepatitis dataset using multirelational association rules. In ECML/PKDD 2005 Discovery Challenge, Porto, Portugal, 2005.
[PT98] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering unexpected patterns. In Proc. of the 4th Int. Conf. on Knowledge discovery in data mining (KDD 98), pages 94–100. AAAI Press, 1998.
[RGN08] Luc De Raedt, Tias Guns, and Siegfried Nijssen. Constraint programming for itemset mining. In Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 08), pages 204–212, New York, NY, USA, 2008. ACM.
[RJLM10] Luc De Raedt, Manfred Jaeger, Sau Lee, and Heikki Mannila. A theory of inductive query answering. In Sašo Džeroski, Bart Goethals, and Panče Panov, editors, Inductive Databases and Constraint-Based Data Mining, pages 79–103. Springer New York, 2010.
[RK01] Luc De Raedt and Stefan Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence - Volume 2 (IJCAI 01), pages 853–859, Seattle, WA, USA, 2001. Morgan Kaufmann Publishers Inc.
[RR04] Luc De Raedt and Jan Ramon. Condensed representations for inductive logic programming. In Proc. of the 9th Int. Conf. on Principles of Knowledge Representation and Reasoning, pages 438–446. AAAI Press, 2004.
[RS98] Rajeev Rastogi and Kyuseok Shim. Mining optimized association rules with categorical and numeric attributes. In ICDE, pages 503–512, 1998.
[RV00] Céline Rouveirol and Véronique Ventos. Towards learning in CARIN-ALN. In Proc. of the 10th Int. Conf. on Inductive Logic Programming (ILP 00), pages 191–208, London, UK, 2000. Springer-Verlag.
[RV04] Marcela Xavier Ribeiro and Marina Teresa Pires Vieira. A new approach for mining association rules in data warehouses. In FQAS, pages 98–110, 2004.
[SA95] Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Proc. of the 21st Int. Conf. on Very Large Data Bases (VLDB 95), pages 407–419, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[SA96] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology: Advances in Database Technology (EDBT 96), pages 3–17, London, UK, 1996. Springer-Verlag.
[SA10] Andreia Silva and Cláudia Antunes. Pattern mining on stars with FP-growth. In MDAI 2010: Proc. of the 7th International Conference on Modeling Decisions for Artificial Intelligence, pages 175–186, Perpignan, France, 2010. Springer.
[SA11] Andreia Silva and Cláudia Antunes. Mining stars with FP-growth: a case study on bibliographic data. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 19(Supplement-1):65–91, 2011.
[SA12a] Andreia Silva and Cláudia Antunes. Finding patterns in large star schemas at the right aggregation level. In MDAI 2012: Proc. of the 9th International Conference on Modeling Decisions for Artificial Intelligence, pages 329–340. Springer, 2012.
[SA12b] Andreia Silva and Cláudia Antunes. Mining patterns from large star schemas based on streaming algorithms. In Roger Lee, editor, Computer and Information Science 2012: Studies in Computational Intelligence, volume 429, pages 139–150. Springer, 2012.
[SA12c] Andreia Silva and Cláudia Antunes. Semi-supervised clustering: A case study. In Proc. of the 8th Int. Conf. on Machine Learning and Data Mining in Pattern Recognition (MLDM 12), pages 252–263, Berlin, Germany, 2012. Springer.
[SA13a] Andreia Silva and Cláudia Antunes. Pushing constraints into a pattern tree. In Proc. of the 10th Int. Conf. on Modeling Decisions for Artificial Intelligence (MDAI 13), Barcelona, Spain, November 2013. Springer.
[SA13b] Andreia Silva and Cláudia Antunes. Pushing constraints into data streams. In 2nd Int. Workshop on Big Data, Streams and Heterogeneous Source Mining (BigMine 13), pages 79–86. ACM, August 2013.
[SA13c] Andreia Silva and Cláudia Antunes. Towards the integration of constrained mining with star schemas. In Proc. of the 13th IEEE Int. Conf. on Data Mining Workshops (ICDMW 13), pages 413–420. IEEE Computer Society, December 2013.
[SA14a] Andreia Silva and Cláudia Antunes. Finding multi-dimensional patterns in healthcare. In MLDM 14: Proc. of the 10th Int. Conf. on Machine Learning and Data Mining, St. Petersburg, Russia, 2014. Springer.
[SA14b] Andreia Silva and Cláudia Antunes. Mining multi-dimensional patterns for student modeling. In EDM 14: Proc. of the 7th Int. Conf. on Educational Data Mining, London, UK, 2014.
[SA14c] Andreia Silva and Cláudia Antunes. Multi-dimensional pattern mining: A case study in healthcare. In ICEIS 14: Proc. of the 16th Int. Conf. on Enterprise Inf. Systems, Lisbon, Portugal, 2014. Morgan Kaufmann.
[SA14d] Andreia Silva and Cláudia Antunes. Multi-relational pattern mining over data streams. Under review for publication in the International Journal of Data Mining and Knowledge Discovery, 2014.
[SC05] Arnaud Soulet and Bruno Crémilleux. An efficient framework for mining flexible constraints. In TuBao Ho, David Cheung, and Huan Liu, editors, Advances in Knowledge Discovery and Data Mining, volume 3518 of Lecture Notes in Computer Science, pages 661–671. Springer Berlin Heidelberg, 2005.
[Set10] Burr Settles. Active learning literature survey. Computer sciences technical report, University of Wisconsin-Madison, 2009 (updated in 2010).
[SHB06] Gerd Stumme, Andreas Hotho, and Bettina Berendt. Semantic web mining: State of the art and future directions. Web Semantics: Science, Services and Agents on the World Wide Web, 4(2):124–143, 2006.
[Sri96] Ramakrishnan Srikant. Fast algorithms for mining association rules and sequential patterns. PhD thesis, The University of Wisconsin, Madison, 1996. Supervisor: Jeffrey F. Naughton.
[SV11] Akdes Serin and Martin Vingron. DeBi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms for Molecular Biology, 6:18, 2011.
[SVA97] Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal. Mining association rules with item constraints. In Proc. of the 3rd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 97), pages 67–73, California, USA, 1997. AAAI Press.
[TLT08] Igor Trajkovski, Nada Lavrač, and Jakub Tolar. SEGS: Search for enriched gene sets in microarray data. J. of Biomedical Informatics, 41(4):588–601, 2008.
[WJL03] Ke Wang, Yuelong Jiang, and Laks V. S. Lakshmanan. Mining unexpected rules by pushing user dynamics. In Proc. of the 9th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 03), pages 246–255, Washington, D.C., 2003. ACM.
[WJY+05] Ke Wang, Yuelong Jiang, Jeffrey Xu Yu, Guozhu Dong, and Jiawei Han. Divide-and-approximate: A novel constraint push strategy for iceberg cube mining. IEEE Trans. on Knowl. and Data Eng., 17(3):354–368, 2005.
[WSYT03] Takeshi Watanabe, Einoshin Suzuki, Hideto Yokoi, and Katsuhiko Takabayashi. Application of PrototypeLines to chronic hepatitis data. In ECML/PKDD 2003 Discovery Challenge, Cavtat, Croatia, 2003.
[XSMH06] Dong Xin, Xuehua Shen, Qiaozhu Mei, and Jiawei Han. Discovering interesting patterns through user's interactive feedback. In Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 06), pages 773–778, Philadelphia, PA, USA, 2006. ACM.
[XX06] Li-Jun Xu and Kang-Lin Xie. A novel algorithm for frequent itemset mining in data warehouses. Journal of Zhejiang University - Science A, 7(2):216–224, 2006.
[YL05] Unil Yun and John J. Leggett. WFIM: Weighted frequent itemset mining with a weight range and a minimum weight. In SDM, 2005.
[YW06] Qiang Yang and Xindong Wu. 10 challenging problems in data mining research. Int. Journal of Inf. Technology and Decision Making, 5(4):597–604, 2006.
[Zak00a] Mohammed Zaki. Generating non-redundant association rules. In Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 00), pages 34–43, New York, NY, USA, 2000. ACM.
[Zak00b] Mohammed Zaki. Sequence mining in categorical domains: incorporating constraints. In Proc. of the 9th Int. Conf. on Information and knowledge management (CIKM 00), pages 422–429, McLean, Virginia, United States, 2000. ACM.
[ZCD07] Xiuzhen Zhang, Pauline Lienhua Chou, and Guozhu Dong. Efficient computation of iceberg cubes by bounding aggregate functions. IEEE Trans. Knowl. Data Eng., 19(7):903–918, 2007.
[ZO98] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. In Workshop on research issues in Data Mining and Knowledge Discovery (DMKD 98), pages 1–8. ACM Press, 1998.
[ZY07] Ling Zhou and Stephen Yau. Efficient association rule mining among both frequent and infrequent items. Computers and Mathematics with Applications, 54(6):737–749, 2007.
[ZYHY07] Feida Zhu, Xifeng Yan, Jiawei Han, and Philip S. Yu. gPrune: a constraint pushing framework for graph pattern mining. In Proc. of the 11th Pacific-Asia Conf. on Advances in knowledge discovery and data mining (PAKDD 07), pages 388–400, Nanjing, China, 2007. Springer-Verlag.