UNIVERSIDADE DE LISBOA
INSTITUTO SUPERIOR TÉCNICO
Pattern Mining on Data Warehouses: A Domain Driven Approach
Andreia Liliana Perdigão da Silva
Supervisor: Doctor Cláudia Martins Antunes
Thesis approved in public session to obtain the PhD degree in Information Systems and Computer Engineering
Jury final classification: Pass with Merit
Jury
Chairperson: Chairman of the IST Scientific Board
Members of the Committee:
Doctor Sebastián Ventura Soto, Associate Professor, University of Córdoba, Spain
Doctor Francisco José Moreira Couto, Professor Associado, Faculdade de Ciências, Universidade de Lisboa
Doctor Alípio Mário Guedes Jorge, Professor Associado, Faculdade de Ciências, Universidade do Porto
Doctor Cláudia Martins Antunes, Professora Auxiliar, Instituto Superior Técnico, Universidade de Lisboa
Doctor Alexandre Paulo Lourenço Francisco, Professor Auxiliar, Instituto Superior Técnico, Universidade de Lisboa
Doctor Sara Alexandra Cordeiro Madeira, Professora Auxiliar, Instituto Superior Técnico, Universidade de Lisboa
Funding Institutions
Fundação para a Ciência e a Tecnologia
2014
Resumo
Um desafio crescente do data mining prende-se com a capacidade de lidar com grandes quantidades de dados complexos e dinâmicos. Em muitas aplicações reais, os dados complexos estão organizados em múltiplas tabelas de dados, relacionadas entre si, o que torna a sua análise como um todo mais difícil e desafiante. Uma forma comum de representar um modelo multi-dimensional é através de um esquema em estrela, que consiste numa tabela de factos central, que liga um conjunto de tabelas de dimensão. Esta tabela de factos guarda normalmente um conjunto enorme de registos, que torna quase impossível ter todos os dados em memória. Mais ainda, nem todos os dados podem estar disponíveis a priori, uma vez que novos dados estão, muito provavelmente, continuamente a ser gerados. Outro problema comum dos algoritmos de descoberta de padrões é o facto de estes gerarem um elevado número de padrões, independentes dos conhecimentos do utilizador. Este tão grande número de resultados e a sua falta de foco dificultam a interpretação e selecção de resultados, e por isso limitam a utilização destas técnicas para apoio à decisão.
Neste trabalho, argumenta-se que é possível descobrir padrões em dados modelados num esquema em estrela de modo eficiente, bem como incorporar restrições de domínio no processo de descoberta, para focar os resultados no conhecimento de domínio existente e nas expectativas dos utilizadores. De modo a demonstrar a validade desta tese, é proposto um novo algoritmo – StarFP-Stream, que combina técnicas de descoberta de padrões em várias tabelas com técnicas para fluxos contínuos de dados (ou streams). Este algoritmo é capaz de explorar eficientemente grandes e crescentes quantidades de dados de um esquema em estrela, e em vários níveis de agregação. Também são propostos dois algoritmos – CoPT e CoPT4Streams, para introduzir restrições numa tabela de dados estáticos ou num stream, respectivamente. Os algoritmos usam uma estrutura em árvore compacta, e são capazes de acelerar a incorporação de qualquer tipo de restrições, evitando testes desnecessários e eliminando mais cedo os padrões inválidos. Finalmente, também é definido um conjunto de restrições desenhadas para um esquema em estrela, e é proposto um novo algoritmo – D2StarFP-Stream, para introduzir essas restrições na descoberta de padrões multi-dimensionais.
Além disso, os algoritmos são avaliados sobre conjuntos de dados artificiais e reais, tanto do domínio de vendas, como na saúde e na educação.
Abstract
A growing challenge in data mining is the ability to deal with complex, voluminous and dynamic data. In many real-world applications, complex data is organized in multiple inter-related database tables, which makes its analysis as a whole more difficult and challenging. A very common multi-dimensional model is the star schema, which consists of a central fact table linking a set of dimension tables. This fact table usually stores a massive number of records, which makes it almost impossible to keep all the data in main memory. Furthermore, not all data may be available a priori, since new data is most likely being continuously generated. Another problem of pattern discovery algorithms is that they generate a huge number of patterns, independent of user expertise. Such a large number of results, and their lack of focus, hinder the interpretation and selection of results, and therefore make it harder to use these results for decision support.
In this work we argue that it is possible to efficiently and effectively mine large amounts of data modeled as a star schema, as well as to incorporate domain constraints into the discovery process, to focus the results according to the domain knowledge and user expectations. In order to demonstrate the validity of this thesis, we propose a new algorithm – StarFP-Stream – that combines multi-relational and data streaming techniques, and is able to mine a large and growing star schema efficiently, at any aggregation level. We also propose two algorithms – CoPT and CoPT4Streams – for pushing constraints into static and growing single tables, respectively. Both algorithms make use of a compact tree structure, and are able to speed up the incorporation of any type of constraint, by avoiding unnecessary tests and pruning invalid patterns earlier. Finally, we also define a set of constraints designed for star schemas, and propose a new algorithm – D2StarFP-Stream – that is able to incorporate these constraints into multi-dimensional mining.
Additionally, we evaluate our algorithms over both artificial and real data, in the sales, healthcare
and education domains.
Palavras-Chave
Keywords
Palavras-Chave
Descoberta de Informação
Descoberta de Padrões
Exploração de Armazéns de Dados
Esquemas em Estrela
Descoberta de Informação em Dados Multi-Relacionais
Descoberta de Informação em Fluxos Contínuos de Dados
Árvores de Padrões
Conhecimento de Domínio
Incorporação de Restrições
Restrições Multi-Dimensionais
Keywords
Data Mining
Pattern Mining
Mining Data Warehouses
Star Schemas
Multi-Relational Data Mining
Mining Data Streams
Pattern-Trees
Domain Knowledge
Constrained Mining
Multi-Dimensional Constraints
Acknowledgments
I would like to thank all who have been present and have contributed to this thesis in so many ways. First and foremost, I would like to thank my adviser, Professor Cláudia Antunes, for all the support, encouragement, guidance and confidence she gave me, and for the countless hours of talk during these five years.
I would also like to thank my colleagues and friends (in alphabetical order): David Duarte, Nuno Lopes and Rui Henriques. They have contributed to this work in the form of insightful discussions, collaboration and key advice.
A big thank you also to my conference friends, for making the experience of attending conferences
more enjoyable and productive, technically and socially.
A special thank you to Filipe, for his presence, patience, encouragement and support. He also con-
tributed to this work with more technical discussions and guidance.
This work was financially supported in part by FCT (Fundação para a Ciência e a Tecnologia) under grant SFRH/BD/64108/2009 and research projects educare (PTDC/EIA-EIA/110058/2009) and D2PM (PTDC/EIA-EIA/110074/2009). I am very grateful and indebted for their support.
Finally, a big thank you to my family and all other friends for their continuous support and encour-
agement through all these years.
Contents
1 Introduction 1
1.1 Open Issues in Multi-Relational Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Finding Patterns on Star Schemas: An Introduction 7
2.1 The Core of the Multi-Dimensional Model: a Star Schema . . . . . . . . . . . . . . . . . . 8
2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Challenges and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Finding Patterns on Large Star Schemas 15
3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 MRPM over Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 StarFP-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Rationale behind the star stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.2 Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Algorithm StarFP-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.6 Strengths and Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.7 Comparison with Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.8 Time Sensitive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 The Groundwork on Domain Driven Data Mining 35
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Inductive Logic Programming - Discussion and Arguments . . . . . . . . . . . . . 37
4.1.2 Domain Driven Data Mining – Discussion and Arguments . . . . . . . . . . . . . . 37
4.1.3 Semantic Data Mining – Discussion and Arguments . . . . . . . . . . . . . . . . . 38
4.2 Domain Knowledge Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Constrained Pattern Mining: Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 A new Framework for Constrained Pattern Mining . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Constraint Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Constraint Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.8 Constrained Pattern Mining Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8.1 Properties vs. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.8.2 Categories vs. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8.3 Data Sources vs. Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.9 Discussion and Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Pushing Constraints into Pattern Mining 61
5.1 Pushing Constraints into a Static Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.1 Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.2 Constraint Pushing Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.3 Algorithm CoPT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Pushing Constraints into a Dynamic Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . 68
5.2.1 Pattern-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2.2 Constraint Pushing Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.3 Algorithm CoPT4Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Towards the Incorporation of Constraints into Multi-Dimensional Mining . . . . . . . . . . 75
5.3.1 Transactional vs. Non-Transactional Data . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.2 Constraints in Star Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Pushing Star Constraints into Pattern Mining over Star Schemas . . . . . . . . . . 79
5.3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Mining Stars with Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1 Constraining Business Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 D2StarFP-Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.4 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 Conclusions and Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6 A Case Study in Healthcare 89
6.1 The Hepatitis Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 The Hepatitis Multi-Dimensional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.1 Building the Star Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.2 Understanding the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 Hepatitis Application Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 Finding Discriminant Patterns and Association Rules . . . . . . . . . . . . . . . . . . . . 97
6.5.1 Interesting Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Improving Prediction using Multi-Dimensional Patterns . . . . . . . . . . . . . . . . . . . 100
6.6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6.2 Methodology into Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.6.3 Analysis of Multi-Relational Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.6.4 Enriched Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.7 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7 A Case Study in Education 107
7.1 The Educare Multi-Dimensional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2 Predicting Student Grades Using Multi-Dimensional Patterns . . . . . . . . . . . . . . . . 109
7.2.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2.2 Methodology into practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2.3 Analysis of Multi-Relational Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.4 Enriched Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.3 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8 Conclusions and Future Work 115
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
List of Figures
2.1 Star Internet Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 An example of a pattern-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 StarFP-Stream example: Part of the pattern-tree resulting from the first batch . . . . . . 22
3.3 StarFP-Stream example: DimFP-tree of dimensions Customer and Product . . . . . . . . 23
3.4 StarFP-Stream example: Super FP-tree of the second batch . . . . . . . . . . . . . . . . . 23
3.5 StarFP-Stream example: Part of the final pattern-tree . . . . . . . . . . . . . . . . . . . . 24
3.6 AW experiments: Number of patterns returned and precision . . . . . . . . . . . . . . . . 30
3.7 AW experiments: Average and detailed pattern-tree size . . . . . . . . . . . . . . . . . . . 31
3.8 AW experiments: Average and detailed update time . . . . . . . . . . . . . . . . . . . . . 32
3.9 AW experiments: Average maximum memory per batch . . . . . . . . . . . . . . . . . . . 33
4.1 A framework for constrained pattern mining . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 CoPT : Time with AM constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 CoPT : Checks with AM constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 CoPT : Time with M constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 CoPT : Checks with M constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 CoPT : Time with Mixed constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 CoPT : Checks with Mixed constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.7 CoPT4Streams: Average size of the pattern-tree . . . . . . . . . . . . . . . . . . . . . . . 73
5.8 CoPT4Streams: Average time needed to update the pattern-tree . . . . . . . . . . . . . . 73
5.9 CoPT4Streams: Average number of constraint checks . . . . . . . . . . . . . . . . . . . . 73
5.10 Example of transactional and corresponding non-transactional data . . . . . . . . . . . . . 76
5.11 A star schema, showing transactional and non-transactional data . . . . . . . . . . . . . . 77
5.12 D2StarFP-Stream: Average size of the pattern-tree . . . . . . . . . . . . . . . . . . . . . . 86
5.13 D2StarFP-Stream: Average maximum memory needed . . . . . . . . . . . . . . . . . . . . 86
5.14 D2StarFP-Stream: Average update time of the pattern-tree . . . . . . . . . . . . . . . . . 86
6.1 Hepatitis relational model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Hepatitis star schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Hepatitis: Number of exams per patient (female and male) . . . . . . . . . . . . . . . . . 93
6.4 Hepatitis: Number of exams per patient diagnosed with hepatitis B, C or still undiagnosed 93
6.5 Hepatitis: Distribution of exams per stage of hepatitis . . . . . . . . . . . . . . . . . . . . 93
6.6 Hepatitis: Number of exams per patient, at each stage of hepatitis . . . . . . . . . . . . . 94
6.7 Hepatitis Star : Number of patterns returned and precision . . . . . . . . . . . . . . . . . . 95
6.8 Hepatitis Star : Average and detailed pattern-tree size . . . . . . . . . . . . . . . . . . . . 95
6.9 Hepatitis Star : Average and detailed update time . . . . . . . . . . . . . . . . . . . . . . . 96
6.10 Hepatitis Star : Average maximum memory per batch . . . . . . . . . . . . . . . . . . . . 96
6.11 The multi-dimensional methodology for enriching classification . . . . . . . . . . . . . . . 100
6.12 Classification in Hepatitis: Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.13 Classification in Hepatitis: Size of the trees . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 An example of an educational data-warehouse . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.2 Classification in Educare: Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Classification in Educare: Size of the trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
List of Tables
2.1 Star Internet Sales: Dimension Tables Product, Customer and Sales Territory . . . . . . 9
2.2 Star Internet Sales: Sales Orders Fact Table . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 StarFP-Stream example: A subset of the final patterns . . . . . . . . . . . . . . . . . . . . 24
3.2 Correspondence between StarFP-Stream and SWARM representations . . . . . . . . . . . 26
3.3 AW experiments: A summary of the dataset characteristics . . . . . . . . . . . . . . . . . 29
3.4 AW experiments: Batches corresponding to each error . . . . . . . . . . . . . . . . . . . . 29
4.1 Advantages and disadvantages of the different forms of domain knowledge representations 42
4.2 Content constraints and respective properties . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Structural constraints and respective properties . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Algorithms for each constraint property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Algorithms for each constraint category . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 Differences on mining transactional and non-transactional data . . . . . . . . . . . . . . . 76
6.1 Hepatitis: Important exams and corresponding thresholds and categories . . . . . . . . . . 92
6.2 A summary of the Hepatitis star characteristics . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Hepatitis Star : Batches corresponding to each error . . . . . . . . . . . . . . . . . . . . . . 94
6.4 Hepatitis Star : Some examples of the patterns found . . . . . . . . . . . . . . . . . . . . . 98
6.5 Hepatitis Star : Some examples of the association rules found . . . . . . . . . . . . . . . . 99
6.6 Classification in Hepatitis: Some examples of the multi-relational patterns found . . . . . 104
7.1 Educare: Some examples of patterns found for the Enrollment Star . . . . . . . . . . . . . 111
7.2 Educare: Some examples of patterns found for the Teaching QA Star . . . . . . . . . . . . 111
Chapter 1
Introduction
To tackle the rapid growth of data, the area of Data Mining [FPSM92] emerged with the goal of creating methods and tools capable of analyzing these data and extracting useful information that companies can exploit and apply to their businesses. Finding frequent patterns in data has become an important and widely studied task in data mining, since it allows us to find different types of interesting relations among data, such as association rules [AS94, PH02], correlations [BMS97], sequences [PHW07], multi-dimensional patterns [D03, SA10], episodes [MTIV97], emerging patterns [DL99], etc. Pattern mining (PM) is also recognized as an important tool that helps in data pre-processing, and in other data mining tasks like classification [LHM98, SA14b, SA14a] and clustering [LSW97].
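As a minimal illustration of the frequent pattern mining task mentioned above, the following Python sketch enumerates frequent itemsets in an Apriori-style, level-wise manner (the toy market baskets and the support threshold are invented for illustration; this is a didactic sketch, not an algorithm from this thesis):

```python
def frequent_itemsets(transactions, min_support):
    """Return every itemset contained in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    frequent = {}
    # Level 1: candidate itemsets are the individual items.
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    while candidates:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Apriori step: (k+1)-candidates are built only from frequent k-itemsets,
        # because every subset of a frequent itemset must itself be frequent.
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == len(a) + 1})
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
patterns = frequent_itemsets(baskets, min_support=2)
# e.g. {bread, milk} is frequent (2 baskets), while {milk, butter} is not (1 basket)
```

The anti-monotonicity of support is what makes the level-wise pruning sound, and it is the same property that later motivates pushing other anti-monotone constraints into the search.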
Despite the great advances in this field, the challenges imposed by the era of big data continue to defy these algorithms. Indeed, big data brought a completely new context to operate in, changing the nature of data not only from static to dynamic, but also from tabular to more complex data sources, such as social networks (expressed as graphs) and data warehouses (expressed as multi-dimensional models).
In this new context, and more than ever, users need effective and efficient ways to mine this more complex and growing data, so that results can actually be used for decision support in real-world problems. Data Warehouses (DW) are an example of data repositories that emerged to make data analysis easier, by clearly separating the representation of business dimensions and events into a set of different, but related, tables [Inm96]. However, despite this ultimate goal and the advances of data mining algorithms, many of them are designed to deal with one single table and cannot be reused across domains.
One of the major challenges of mining multiple tables is how to join and create the tuples to be mined during the mining process. The most common approach is to join all the tables into one before mining, and then apply an existing and efficient single-table algorithm. However, in large applications, this initial join may not be computationally feasible, and even when it is, the resulting table is so large and sparse that it adds a huge overhead to the already expensive mining process [NFW02].
The area of Multi-Relational Data Mining (MRDM) [D03] was born from the need to explore data stored in multiple interrelated tables, and aims at developing efficient data mining techniques that are able to discover frequent patterns involving multiple tables, in their original structure. Early advances were driven by Inductive Logic Programming (ILP) [DR97, MEL01, NK01], and in recent years the most common techniques have been extended to the multi-relational case.
1.1 Open Issues in Multi-Relational Pattern Mining
Despite the progress in the area, just a few algorithms developed for MRDM are dedicated to the multi-dimensional model present in DW, the star schema [CJS00, NFW02, XX06, SA11]. A star schema [KR02] consists of a central fact table containing the business events, and a set of surrounding dimension tables, comprising the specific data about each business dimension.
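To make the structure concrete, a star schema can be pictured as one fact table of foreign keys and measures, plus one lookup table per dimension; the join-before-mining alternative then flattens everything into one wide table. The sketch below uses a hypothetical sales star, not data from this thesis:

```python
# Dimension tables: one row per business entity, keyed by a surrogate id.
customers = {1: {"name": "Ana", "city": "Lisboa"},
             2: {"name": "Rui", "city": "Porto"}}
products = {10: {"name": "laptop", "category": "computers"},
            20: {"name": "mouse", "category": "accessories"}}

# Fact table: one row per business event, holding foreign keys and measures.
sales_facts = [
    {"customer_id": 1, "product_id": 10, "amount": 999.0},
    {"customer_id": 1, "product_id": 20, "amount": 19.0},
    {"customer_id": 2, "product_id": 20, "amount": 19.0},
]

def denormalize(facts, customers, products):
    """Join-before-mining: flatten the star into one wide table.

    Every fact row is widened with all attributes of every dimension,
    which is exactly the overhead multi-relational mining tries to avoid."""
    wide = []
    for f in facts:
        row = dict(f)
        row.update({"customer_" + k: v for k, v in customers[f["customer_id"]].items()})
        row.update({"product_" + k: v for k, v in products[f["product_id"]].items()})
        wide.append(row)
    return wide

flat = denormalize(sales_facts, customers, products)
```

Each of the three fact rows becomes a wide row repeating the full dimension descriptions, which is why the flattened table grows so quickly with the number and size of dimensions.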
The main problem with existing algorithms is that they are often not scalable with respect to the number of events and relations in the database. And in the case of DW, there is an urgent need for algorithms able to deal with large datasets, due to their growing nature: records are added over time, but never deleted. In a sense, DW can be compared to data streams, since they continuously grow over time, as new records are added to the fact table for each event occurrence.
To the best of our knowledge, there are only two algorithms able to mine multiple data stream tables [FCAM09, HYXW09]. Both are based on ILP, and therefore suffer from the same limitations as other ILP techniques, such as requiring all tables in Prolog form. They also suffer from the candidate generation bottleneck, well known in traditional pattern mining [AS94]. There is therefore a need for new and more efficient algorithms.
Another problem of pattern mining algorithms, and also of MRDM, is that they generate a huge number of patterns (thousands or more), independent of user expertise. Such a large number of results, and their lack of focus, hinder the interpretation and selection of results, and therefore make it harder to use these results for decision support. In fact, this is one of the reasons why pattern mining techniques are not more widely accepted and applied in real businesses.
Several ways have been proposed to minimize these bottlenecks, and the use of domain knowledge is the most accepted and common approach to focus the algorithms on areas where they are more likely to gain information and return more interesting results [BJ05]. This knowledge-driven data mining has gained attention in recent years, and the ways to represent and use domain knowledge have evolved from simple user interactions and annotations, through the use of constraints, to the use of domain ontologies. These new forms of representation are a promising way to guide data mining algorithms through the analysis of more complex and multi-dimensional data, since they can make explicit the existing dependencies and relations between business dimensions. However, the problem of efficiently mining multi-dimensional data with domain knowledge remains unsolved.
The bulk of the research in this area is centered on constrained data mining, since constraints are easily defined and interpreted by users, capturing application semantics and user expectations. Moreover, they can also be efficiently incorporated into the algorithms, to guide them and to filter both the search space and the results [Bay05].
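As an example of how a constraint can filter the search space rather than only the results, the sketch below pushes an anti-monotone budget constraint (sum of item prices ≤ budget) into a depth-first itemset enumeration: once a set violates the budget, none of its supersets is ever generated. The item names, prices and budget are hypothetical, and the routine is our own didactic simplification:

```python
def mine_with_budget(items, prices, budget):
    """Enumerate itemsets whose total price stays within budget.

    With non-negative prices, sum(prices) <= budget is anti-monotone: if an
    itemset violates it, every superset does too, so that branch is cut
    immediately instead of being checked afterwards."""
    results = []

    def extend(prefix, total, rest):
        results.append(prefix)
        for i, item in enumerate(rest):
            new_total = total + prices[item]
            if new_total <= budget:  # push the constraint into the search
                extend(prefix + [item], new_total, rest[i + 1:])
            # else: prune - no superset of prefix + [item] can satisfy the budget

    extend([], 0.0, items)
    return [r for r in results if r]  # drop the empty prefix

prices = {"laptop": 999.0, "mouse": 19.0, "keyboard": 49.0}
valid = mine_with_budget(sorted(prices), prices, budget=100.0)
# only {keyboard}, {mouse} and {keyboard, mouse} survive; laptop exceeds the budget
```

The same pruning idea underlies the constraint-pushing strategies discussed in Chapters 4 and 5, where the properties of each constraint determine how early it can be checked.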
As far as we know, there is no work that integrates constraints into multi-relational mining. This integration is not straightforward, since star schemas contain both transactional (the fact table) and non-transactional (the dimension tables) data, and existing constrained algorithms are designed only for transactional data, hence cannot be directly reused on the whole star.
1.2 Thesis Statement
In this dissertation, we argue that it is possible to efficiently and effectively find patterns in
large amounts of data modeled as a star schema, as well as to incorporate constraints into
those algorithms, to focus the mining results according to the domain knowledge and user
expectations, and therefore deliver less, but more interesting patterns.
The analysis of this thesis statement leads to some questions that need to be addressed:
• How can we mine data modeled as a star schema directly? MRDM intends to develop algorithms for mining multiple tables directly, and there are some algorithms designed for star schemas. We discuss these algorithms and their respective strategies in Chapter 3, as well as the challenges and limitations of existing work.
• What is the difference between mining large amounts of data in a star schema, and a smaller star schema? Mining large quantities of data imposes new challenges on the mining process, for both single and multiple tables, due to memory and time limitations. In order to optimize memory usage and time spent, and actually be able to mine big data, one well-known approach is to use data streaming techniques. In this work, we argue that it is possible to integrate MRDM and data streaming techniques for mining large star schemas. We describe this approach in more detail in Chapter 3.
• What does effectively mean? Existing algorithms are able to mine data modeled as a star schema directly, but they do not scale well with the number of dimensions and records in the database. In this sense, most of them may not even be able to finish the mining process when there are large amounts of data. Furthermore, denormalizing the whole schema into one table and using an efficient traditional pattern mining algorithm is not a solution, since this extra step may be infeasible for big quantities of data. Thus, effectively means that we can actually finish the discovery process on large star schemas, keeping results updated.
• What does efficiently mean? There are several well-known and efficient pattern mining algorithms designed for single tables. Mining multiple tables directly should perform better than denormalizing the tables into one and using one of those algorithms, since it skips the overhead of the extra joining step. Therefore, efficiently means that, in general, a multi-relational pattern mining approach should take less time than a join-before-mining approach.
• How can domain knowledge and user expectations be captured through constraints? There are several
forms of domain knowledge representation, with constraints being the most widely used and well-known.
In Chapter 4, we present a discussion of the use of domain knowledge, as well as a new framework
for constrained pattern mining that helps organize and understand constraints and their use.
• How can we efficiently incorporate constraints into the algorithms? There are many different types
of constraints, which hinders the definition of general algorithms. However, most constraints
share a set of properties that allows for defining efficient strategies (explained in Chapter 4). In this
work, we argue that constraints can be efficiently pushed as a post-processing step, by making
use of an efficient tree structure – the pattern-tree. We present this in more detail in Chapter 5.
• Which constraints can we push into the mining of a star schema and how? Mining a star schema
introduces new challenges to the constrained process, since star schemas contain both transactional
and non-transactional data, and therefore existing algorithms cannot be applied directly. Also, the
structure of the star schema itself encompasses more opportunities for constraining, and therefore
we define a set of constraints for a star schema, named Star Constraints, as well as a set of strategies
to incorporate them (Chapter 5).
The validation of this thesis will be carried out through four main procedures. First, by comparing
the performance of our multi-relational algorithms with their non-multi-relational counterparts, i.e. with
algorithms that need to denormalize the tables into one before mining. Second, by comparing the number
of patterns returned by constrained and unconstrained algorithms. Third, by evaluating the interest of the
discovered multi-relational patterns in two case studies with real data (Chapters 6 and 7), using a set of
different measures to select the best patterns, using them to enrich classification training data,
and analyzing whether they improve prediction accuracy. Finally, in order to test the different parameters, a
performance evaluation is carried out over synthetic data in each chapter.
1.3 Contributions
In order to demonstrate the validity of this thesis, we propose a new method, Star Frequent-Pattern
Stream (StarFP-Stream), which combines MRDM and data streaming techniques for frequent pattern
mining in large star schemas [SA12b]. StarFP-Stream does not materialize the join between the tables,
and adopts a pattern growth strategy, therefore not suffering from the candidate generation bottleneck,
unlike the only two related algorithms in the literature. It is also able to correctly aggregate the business
events, and therefore finds patterns at the right aggregation level [SA12a]. By using a strategy similar
to the one followed when mining data streams, the algorithm is able to mine both large and growing datasets
modeled as star schemas, avoiding multiple scans of the data and optimizing both memory usage
and performance. It estimates an approximate frequency of items, based on the number of times they
occur since they first appeared, and on a user-defined maximum error threshold. Only the estimated
frequent patterns are kept in an efficient prefix-tree summary structure, called pattern-tree.
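This frequency-estimation strategy is similar in spirit to the classical lossy counting scheme for data streams. The following is a minimal, single-item sketch of that scheme, written for illustration only; it is not the actual StarFP-Stream implementation, and the function name and parameters are ours:

```python
import math

def lossy_count(stream, epsilon):
    """Approximate item frequencies over a stream with maximum error epsilon * N.
    Items are counted from the bucket where they first appear; entries whose
    estimated count falls below the error bound are pruned at bucket boundaries."""
    bucket_width = math.ceil(1 / epsilon)
    counts = {}   # estimated count per item
    deltas = {}   # maximum possible undercount at insertion time
    for n, item in enumerate(stream, start=1):
        current_bucket = math.ceil(n / bucket_width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = current_bucket - 1
        if n % bucket_width == 0:  # prune infrequent entries at each boundary
            for it in list(counts):
                if counts[it] + deltas[it] <= current_bucket:
                    del counts[it], deltas[it]
    return counts
```

The `deltas` entry bounds how much an item may have been undercounted before it was first tracked, which is what guarantees that the estimated support never deviates from the true support by more than the user-defined error threshold times the number of records seen.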
Experiments show that StarFP-Stream is accurate and efficient [SA14d], and demonstrate that it
greatly outperforms its single-table predecessor in terms of time, when the latter is applied to a
joined table. In this manner, our algorithm overcomes the join-before-mining approach.
To address the second goal of this thesis, and as a starting point before addressing the multi-dimensional
case, we propose two efficient algorithms for pushing constraints into a pattern-tree as a post-processing
step: Constraint pushing into a Pattern-Tree (CoPT) [SA13a] and CoPT4Streams [SA13b], designed for
single table datasets and for single table data streams, respectively. By using the pattern-tree structure,
both algorithms are able to optimize the incorporation of any constraint, avoiding unnecessary tests and
eliminating invalid patterns earlier, according to the properties of the constraints. Experiments show
that the algorithms are efficient and effective, even for constraints with small selectivity, when compared
to a baseline approach that does not take constraint properties into account.
We then analyze in more detail the challenges and prospects of constrained multi-dimensional mining
[SA13c], and propose a set of constraints defined according to the star schema. We also analyze a set
of strategies for incorporating constraints in star schemas, based on constraint properties and on the
structure of the star itself.
Finally, we also propose an algorithm, named D2StarFP-Stream, for pushing the defined star constraints
into the discovery of patterns over large and growing star schemas. It is an extension of StarFP-Stream
that is able not only to minimize the bottlenecks of the former, by returning and keeping fewer results,
but also to focus those results on the existing domain knowledge.
Experiments show that the algorithm is memory efficient, requiring smaller summary structures and
less memory, and that it surpasses the unconstrained StarFP-Stream. They also show that it takes less
time per batch as the selectivity of the constraints increases.
The developed algorithms are also evaluated in two case studies over real data, one in the healthcare
domain [SA14c, SA14a], and another in the educational domain [SA14b].
1.4 Outline
This dissertation contains eight chapters. The first (Chapter 1) motivates this thesis and presents a summary
of its main goals and contributions. The thesis statement is also presented, along with an explanation of
each claim made.
Chapter 2 introduces the main concepts of pattern mining and presents a detailed description of the
multi-relational pattern mining problem on star schemas. In order to better understand the domain, we
formally present the concepts of a star schema and the corresponding dimensions and facts, and we define
the notation used in the rest of the dissertation.
In Chapter 3, the first algorithm, StarFP-Stream, is proposed for directly mining a large star schema
using data streaming techniques. The flow of the algorithm is illustrated with an example, and a
detailed analysis of its complexity, strengths and weaknesses is also presented. This chapter also
compares StarFP-Stream with the related work, demonstrating the importance and novelty of our
algorithm, and discusses how it can be adapted to a time-sensitive model. Finally, the chapter presents
the performance evaluation over a synthetic star schema.
Chapter 4 discusses how domain knowledge has been used in data mining. The existing forms of domain
knowledge representation are described, along with an examination of their advantages and disadvantages.
The main portion of this chapter is dedicated to the description of the proposed framework for constrained
pattern mining. Existing constrained algorithms are organized and explained based on the different types
of constraints, on their properties, and on the nature of the data sources being mined. The end of the
chapter presents the open issues in this area.
In Chapter 5, a new strategy for pushing constraints into pattern mining, through the use of a
pattern-tree, is proposed. In particular, we propose two algorithms, CoPT and CoPT4Streams, for
mining static tables and data streams, respectively. A performance study of each algorithm is presented
for constraints with different selectivities and different properties. The chapter then discusses the
difference between introducing constraints into multi-relational and into traditional pattern mining, and
presents a solution for overcoming those differences. It first describes a set of constraints defined based on
the star schema, and then proposes a set of strategies for pushing these constraints into multi-relational
pattern mining. At the end of the chapter, a new algorithm, named D2StarFP-Stream, is proposed for
incorporating the previously defined star constraints into the mining of growing multi-dimensional star
schemas, thereby fulfilling the goal of this thesis. Some experimental results are also presented here, for
constraints with different selectivities.
Chapters 6 and 7 present the results obtained in two case studies using real data, the first in the
healthcare domain and the second in the educational domain. In both studies, we first show the performance
evaluation of our algorithms, and then we evaluate the quality of the discovered multi-relational
patterns by using them to enrich classification data and examining whether they improve predictions.
This dissertation concludes in Chapter 8, where a summary of this thesis and of the results achieved is
presented. Moreover, some guidelines for future research are suggested.
Chapter 2
Finding Patterns on Star Schemas:
An Introduction
The rapid development of the Internet and the evolution of technologies made companies realize that they
can benefit from them to improve their businesses and gain competitive advantage. To cope with the rapid
growth of data, everywhere and in a great variety of fields, the area of Data Mining (DM) emerged with
the goal of creating methods and tools capable of analyzing these data and extracting useful information
that companies can exploit and apply to their businesses.
Data mining [FPSM92] is formally defined as the nontrivial extraction of implicit, previously unknown,
and potentially useful information from data. We can say that data mining is a set of techniques that help
obtain appropriate, accurate and useful information automatically, which we cannot find with standard
query tools and statistical analysis. Fundamentally, traditional data mining is the analysis of a table
of data, i.e. a set of instances described by a fixed set of attributes, for the construction of a model
to explain these data. The discovered model is then evaluated and confronted with the expectations
of the user, essentially measuring the model’s capability to explain both data already known and data
as yet unknown.
Association rules (AR) [AIS93] were first introduced in 1993 and correspond to an important data
mining paradigm that helps to discover patterns that conceptually represent causality among discrete
entities (or items) [ZO98]. Given a set of transactions, where each transaction is a set of objects (called
items), an association rule is an expression of the form X ⇒ Y , where X and Y are sets of items (called
itemsets) [Sri96]. The intuitive meaning of such a rule is that database records which contain X tend to
contain Y , with a certain probability.
In order to find these trends, there is first the need to find the items and sets of items that co-occur most
frequently; only then are association rules built. These frequent occurrences are called patterns, and
finding patterns, as shown in many studies (e.g. [AS94]), is significantly more costly in terms of time
than the rule generation step [Pei02]. In this sense, the bulk of existing work in this area is centered
on the task of finding frequent patterns in data, a task known as Pattern Mining (PM) or Frequent Itemset
Mining (FIM). There have been great advances in PM, and it now allows for the discovery of several types
of relations besides association rules [AS94, PH02], such as correlations [BMS97], sequences [PHW07],
multi-dimensional patterns [D03, SA10], episodes and emerging patterns [MTIV97, DL99], etc.
Despite these advances, the new era of big data brought new challenges and requirements to existing
techniques. Nowadays, we have unbounded quantities of the most diversified data, in many different
domains, and there is a great and increasing need for tools that are able to efficiently integrate and
analyze these data for decision support.
In fact, the data storage paradigm has changed in the last decade, from operational databases to data
repositories that make it easier to analyze data and to find useful information. Data warehouses (DW)
are an example of such repositories, which clearly separate the representation of business dimensions and
events into a set of different, but related, tables [Inm96].
Multi-Relational Data Mining (MRDM) [D03] is an area that aims at the discovery of patterns that
involve multiple tables, in their original structure, i.e. without joining all the tables before mining. In
recent years, the most common mining techniques have been extended to the multi-relational context,
but few are dedicated to the multi-dimensional model most common in DW, the star schema [SA10,
FCAM09, HYXW09], and they are often not scalable. Therefore, finding efficient and effective ways of
dealing with this kind of data is still a challenge.
In this chapter, we first define in detail the multi-dimensional model – the star schema – that is the
main object of this thesis (Section 2.1). Then, we present the problem statement for multi-relational
pattern mining over star schemas (Section 2.2), as well as the challenges introduced by this domain and
the existing related work (Section 2.3).
2.1 The Core of the Multi-Dimensional Model: a Star Schema
A star schema is a multidimensional model that models data as a set of facts, each describing an event or
occurrence, characterized by a particular combination of dimensions and a set of measures. An example
of a star schema can be seen in Fig. 2.1. It contains four dimension tables: Product, Date, Customer and
Sales Territory, and one fact table, registering some sales.
Figure 2.1: Star Internet Sales.
To help understand the definitions and the flow of the algorithm, we describe a simplified example
of the contents of a database following the star schema in Fig. 2.1. Tables 2.1 and 2.2 present the contents
of the dimensions and of the fact table, respectively (dimension Date is omitted here, since it can be
inferred from the key).
In the context of a database, a table contains one or more descriptive fields, called attributes, and each
row consists of a set of values for those attributes. A table can therefore be seen as a simple set of
(attribute, value) pairs, corresponding to the characteristics of the data under analysis. As dimensions
provide the context for facts, they should also contain a single primary key, which can be used as a
foreign key in the fact table.
Table 2.1: Dimension Tables Product, Customer and Sales Territory.
Product
  ProductKey   Name            Category    Color
  p1           Mountain Bike   Bike        Black
  p2           Road Bike       Bike        Red
  p3           Bike Shorts     Clothes     Multi
  p4           Gloves          Utilities   Black
  p5           Mountain Seat   Seats
  p6           Road Seat       Seats
  p7           Mountain Tire   Tires
  p8           Road Tire       Tires

Customer
  CustomerKey   Status   Gender
  c1            M        F
  c2            S        M
  c3            S        M
  c4            M        M
  c5            M        M
  c6            S        F

Sales Territory
  TerritoryKey   Country   Group
  s1             USA       America
  s2             Canada    America
  s3             UK        Europe
  s4             France    Europe
Table 2.2: Internet Sales Orders Fact Table
  Sales Order   DateKey    ProductKey   CustomerKey   TerritoryKey
  1             20040510   p1           c1            s1
  2             20040821   p6           c2            s3
  2             20040821   p8           c2            s3
  3             20040907   p3           c3            s1
  4             20050803   p2           c4            s4
  4             20050803   p6           c4            s4
  4             20050803   p8           c4            s4
  5             20060213   p5           c5            s3
  5             20060213   p7           c5            s3
  6             20060217   p5           c1            s1
  6             20060217   p7           c1            s1
  7             20060509   p2           c5            s4
  7             20060509   p3           c5            s4
  8             20060515   p5           c3            s1
  8             20060515   p7           c3            s1
  9             20060527   p1           c6            s2
  9             20060527   p4           c6            s2
  9             20060527   p5           c6            s2
  10            20060930   p5           c2            s1
  10            20060930   p7           c2            s1
Definition 1. A Dimension table D is a set of tuples (tid_D, X), with tid_D the primary key (also referred
to as transaction id) and X a set of (attribute, value) pairs.

Definition 2. A Fact table FT is composed of a set of tuples with n foreign keys, connecting it to the
n dimensions that provide context to its records: (tid_D1, tid_D2, ..., tid_Dn). A fact table may also
contain one or more numerical measurement fields, called facts or measures, that quantify some property
of the events.
As can be seen in our example, (p1, Name=Mountain Bike, Category=Bike, Color=Black) is a tuple
of dimension Product, which is referenced in the fact table in sales orders 1 and 9.
Definition 3. A Star schema S is a tuple (D1, D2, ..., Dn, FT), composed of one fact table and the
corresponding n dimension tables.
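Definitions 1–3 can be rendered directly as data structures. The sketch below is our own illustrative encoding of a fragment of Tables 2.1 and 2.2: each dimension maps a primary key to its set of (attribute, value) pairs, and the fact table is a list of foreign-key tuples.

```python
# Dimension tables: tid_D -> set of (attribute, value) pairs (Definition 1).
product = {
    "p1": {("Name", "Mountain Bike"), ("Category", "Bike"), ("Color", "Black")},
    "p4": {("Name", "Gloves"), ("Category", "Utilities"), ("Color", "Black")},
}
customer = {
    "c1": {("Status", "M"), ("Gender", "F")},
    "c6": {("Status", "S"), ("Gender", "F")},
}

# Fact table: tuples of foreign keys, one per dimension (Definition 2).
# Here each row is (SalesOrder, ProductKey, CustomerKey).
fact_table = [
    (1, "p1", "c1"),
    (9, "p1", "c6"),
    (9, "p4", "c6"),
]

# The star schema is the tuple of all dimensions plus the fact table (Definition 3).
star = (product, customer, fact_table)
```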
For simplicity, and since most pattern mining techniques do not deal with numerical values, we do
not consider measures in this work. Nevertheless, they can be included and treated like other attributes,
by first transforming measures into categorical values (e.g. partitioning into ranges [RS98]), and then
considering them as an additional dimension, as usually done in OLAP (OnLine Analytical Processing).
While dimension tables contain the characteristics of the business entities, like products and clients,
usually unchanged or slowly changing, the fact table records the events of the business, like the sales,
which are characterized by some combination of the attributes in dimensions (the context). The way to
understand the fact table depends on the meaning of a business event (or business fact).
In general, each row of the fact table corresponds to one business event (e.g. one sale per row).
However, it is common to have a control number, such as an order number, that allows us to group
the rows that were generated as part of the same business event (e.g. one or more rows per sale).
These control identifiers are usually represented as degenerate (or empty) dimensions, containing only
a primary key (the control number or id) and no descriptive attributes. For this reason, they usually
do not have an associated physical table; instead, the id is put directly into the fact table (e.g. the sales
order number in Fig. 2.1). In the presence of a degenerate dimension, this key/id can act as a primary key
of the fact table, since these keys separate the different business events (rows with the same degenerate
key correspond to the same event). In our example, in the first order, product p1 was bought alone, but
in the second order, p6 and p8 were bought together. Moreover, we have 10 orders, and therefore 10
business facts.
Note that a degenerate key can be seen as an aggregation key, since it indicates which facts should be
aggregated in order to form an event. Similarly, we can think of aggregating the facts by any other key
(or combination of keys). For example, if we consider ProductKey as the aggregation key, we combine
all sales of the same product, and can therefore find the common characteristics and behaviors of
those products’ buyers (product profiles). We could also consider the pair (CustomerKey, DateKey) as
the aggregation key, and find, e.g., which types of products are being bought by particular customers
each season (customer seasonal profiles).
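The idea of an aggregation key can be sketched as follows, over a fragment of the fact table in Table 2.2 (the helper `aggregate` and its signature are ours, for illustration only):

```python
from collections import defaultdict

# Fact rows: (SalesOrder, ProductKey, CustomerKey) — a subset of Table 2.2.
facts = [
    (1, "p1", "c1"), (2, "p6", "c2"), (2, "p8", "c2"),
    (6, "p5", "c1"), (6, "p7", "c1"),
]

def aggregate(facts, key_index):
    """Group fact rows by the chosen aggregation key (see Section 2.1)."""
    groups = defaultdict(list)
    for row in facts:
        groups[row[key_index]].append(row)
    return dict(groups)

# Aggregating by the degenerate key (SalesOrder) recovers the business events:
events = aggregate(facts, 0)
# Aggregating by ProductKey instead combines all sales of the same product:
by_product = aggregate(facts, 1)
```

Grouping by SalesOrder yields one group per business event (e.g. order 2 with two rows), whereas grouping by ProductKey supports the product-profile analysis described above.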
2.2 Problem Statement
Frequent pattern mining aims at enumerating all frequent patterns that conceptually represent relations
among discrete entities (or items). Depending on the complexity of these relations, different types of
patterns arise, with the transactional patterns being the most common. A transactional pattern is just
a set of items that occur together frequently. A well-known example is a market-basket, the set of items
that are bought in the same transaction by a significant number of customers.
In this context:
Definition 4. An item i corresponds to one pair (attribute, value). An itemset X is a set of items.
Itemsets can be:
• intra-dimensional – if all items belong to the same dimension;
• inter-dimensional – if items belong to more than one dimension.
An example of an intra-dimensional itemset is (Country=UK, Group=Europe), i.e. the European
country UK. On the other hand, the itemsets (Color=Red, Gender=F), i.e. red products bought by
female customers, and (Semester=2, Category=Seats, Country=UK), i.e. seats bought by clients from
the UK in the second semester, are examples of inter-dimensional itemsets.
Events in the fact table can also be aggregated according to some entity or entities, so that we can
discover frequent behaviors or profiles. For example, if we aggregate the facts in the star Internet Sales
per customer, we can discover sets of products bought together, by particular customers. In this sense,
itemsets can also be:
• Aggregated – if they result from the aggregation of events of the fact table, i.e. if they contain
combinations of items with the same attribute.
An example of an aggregated itemset is (Category=Tires, Category=Seats), i.e. tires bought together
with seats. Note that aggregated itemsets are a special case of intra-dimensional ones: items belong to
the same dimension, and may also belong to the same attribute of that dimension.
Let I_D be the set of all items of dimension D, and let I = ⋃_{j=1}^{n} I_Dj = {i1, i2, ..., im} be the
set of all items. We assume that all items are unique (e.g. by adding the name of the dimension before
the attribute name).
The support of an itemset is defined as the number of its occurrences in the database. In the case of a
star schema, we have to consider that the number of occurrences of one item in some dimension depends
on the number of occurrences of the corresponding transactions in the fact table.
So, for an intra-dimensional itemset X of dimension D, let us define T_D(X) as the set of all primary
keys of transactions in D that contain X. The support of X is the number of different business facts that
contain each of those keys. Let getFacts : T_D(X) → T_FT define a function that gives the business facts
that contain all the keys in T_D(X). Hence, sup(X) = |⋃_{t ∈ T_D(X)} getFacts(t)|. Following our example,
X = {Color=Black} is an intra-dimensional itemset of table Product that corresponds to both products
p1 and p4 (T_D(X) = {p1, p4}). Therefore, sup(X) = |getFacts(p1) ∪ getFacts(p4)|. Since p1 appears in
orders 1 and 9, and p4 in order 9, we can conclude that sup(X) = |{1, 9}| = 2. Note that keys only count
once per business fact, i.e. although, e.g., client c2 appears twice in order 2, this corresponds to just one
order, and therefore it counts once.
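This computation can be sketched in a few lines over the fact table of Table 2.2, restricted to the ProductKey column (`get_facts` is our illustrative stand-in for the getFacts function):

```python
# Fact rows as (SalesOrder, ProductKey) pairs — a subset of Table 2.2.
fact_table = [
    (1, "p1"), (2, "p6"), (2, "p8"), (3, "p3"),
    (4, "p2"), (4, "p6"), (4, "p8"),
    (9, "p1"), (9, "p4"), (9, "p5"),
]

def get_facts(key):
    """Business facts (order numbers) whose rows contain the given key."""
    return {order for order, pk in fact_table if pk == key}

# X = {Color=Black} corresponds to products p1 and p4: T_D(X) = {p1, p4}.
t_d = {"p1", "p4"}
support = len(set().union(*(get_facts(t) for t in t_d)))
# p1 appears in orders 1 and 9, and p4 in order 9, so sup(X) = |{1, 9}| = 2.
```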
Inter-dimensional itemsets contain items from multiple dimensions, so they can be defined as
X = ⋃_{j=1}^{n} X_Dj, where X_Dj ⊆ I_Dj, i.e. X is the union of n intra-dimensional itemsets. Note
that, using this definition, X is an intra-dimensional itemset if all X_Dj are empty, except one. Thus,
T_Dj(X_Dj) (or T_Dj, for short) is the set of all primary keys of transactions in Dj that contain X_Dj.
X occurs if all X_Dj occur, which means that some key from every T_Dj must occur.

Definition 5. The support of an itemset X is the number of different business facts that contain at least
one key from each T_Dj:

sup(X) = |⋃_{T ∈ ⊗_{j=1}^{n} T_Dj} getFacts(T)|

In the equation, ⊗ T_Dj gives all combinations composed of one key from each T_Dj, i.e. from each
dimension. For example, in X = {Color=Black, Gender=F}, the first item comes from dimension
Product, and appears twice: T_Product = {p1, p4}; the other item comes from dimension Customer,
and T_Customer = {c1, c6}. For X to occur, some of those products must have been bought by
some of those clients, therefore sup(X) = |getFacts({p1, c1}) ∪ getFacts({p1, c6}) ∪ getFacts({p4, c1}) ∪
getFacts({p4, c6})| = |{1} ∪ {9} ∪ ∅ ∪ {9}| = 2. In this example, two female clients bought black products.
Let N be the total number of business facts in S.
Definition 6. A pattern is a frequent itemset, i.e. an itemset whose support is greater than or equal to
a user-defined minimum support threshold, σ ∈ ]0, 1]:

X is a pattern if sup(X) ≥ σ × N
Naturally, patterns can also be intra-dimensional, inter-dimensional and aggregated.
The problem of multi-relational frequent pattern mining over star schemas is to find all patterns in a
star S.
Since a star schema is a particular case of a relational database, hereinafter we refer to our problem as
multi-dimensional pattern mining (an equivalent of multi-relational pattern mining over star schemas),
whose goal is to find all multi-dimensional patterns in a star.
2.3 Challenges and Related Work
In order to deal with multiple tables, pattern mining somehow has to join the different tables, creating
the tuples to be mined. One option, which allows for the use of existing single-table algorithms, is to
join all the tables into one before mining (a step also known as propositionalization or denormalization).
At first glance, it may seem easy to join the tables into one and then run the mining process on the
joined result [NFW02]. However, when multiple tables are joined, the resulting table is much larger
and sparser, with an explosion of attributes, value repetitions and null values, making the mining process
more expensive and time-consuming.
Denormalizing the star in Fig. 2.1 would result, for example, in one table with almost 20 columns
(the SalesOrderNumber, plus all attributes of all dimensions and all measures) and as many rows as the
fact table. In that table, each row of each dimension is replicated as many times as the corresponding
keys appear in the fact table.
There are two major problems. First, in large applications, the join of all related tables often cannot
realistically be computed, because of the distributed nature of the data: large dimension tables and
many-to-many relationships blow up the join. Second, even if the join can be computed, the multifold
increase, in both size and dimensionality, presents a huge overhead to the already expensive pattern
mining process:
1. The number of columns will be close to the sum of the number of columns in the individual tables,
or much more if there are degenerate dimensions (since in this case the fact table has several rows
for the same event, and therefore all attributes of all records must be associated with that event in
the denormalized table);
2. If the join result is stored on disk, the I/O cost will increase significantly due to the multiple scanning
steps in data mining;
3. For mining frequent itemsets of small sizes, a large portion of the I/O cost is wasted on reading
full records containing irrelevant dimensions;
4. The joined table will eventually have many repetitions of the same values. With several tables, a
value stored (once) in some table can simply be referenced several times, at low memory cost; this
is not possible with just one table. Moreover, these repetitions of values may distort the computation
of the measures of interest, and therefore hinder the discovery of really interesting patterns;
5. There will also be many missing/null values, since each entity may have a different number of
associated records (for example, the products sold in each transaction).
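The blow-up can be illustrated with a toy sketch over a fragment of the running example (names and layout are ours): joining the dimensions into the fact table replicates every dimension row once per referencing fact row, carries every column with every record, and turns missing attributes into nulls.

```python
# Toy dimension tables: key -> attribute values (Color of p5 is missing).
product = {"p1": ("Mountain Bike", "Bike", "Black"),
           "p5": ("Mountain Seat", "Seats", None)}
customer = {"c1": ("M", "F"), "c6": ("S", "F")}

# Fact rows: (SalesOrder, ProductKey, CustomerKey).
fact_table = [
    (1, "p1", "c1"), (6, "p5", "c1"), (9, "p1", "c6"), (9, "p5", "c6"),
]

# Join-before-mining: every dimension row is copied into every fact row
# that references it, so the column count approaches the sum over tables.
denormalized = [
    (order,) + product[pk] + customer[ck]
    for order, pk, ck in fact_table
]
# c1's attributes now appear twice, p5's null Color is replicated, and every
# record carries all columns — the repetitions and nulls described above.
```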
Research in this area has shown that methods following the mine-before-join philosophy usually
outperform methods following the join-before-mine approach, even when the latter adopts the fastest
known single-table algorithms [NFW02].
One of the great potential benefits of MRDM is the ability to automate this process to a significant
extent. Fulfilling this potential requires solving the significant efficiency problems that arise when at-
tempting to mine directly from a relational database, as opposed to from a single pre-extracted flat file
[Dom03].
In recent years, the most common types of patterns and approaches considered in data mining have
been extended to the multi-relational case and have been successfully applied to a number of different
problems in a variety of areas [DR97, NK01, D03, RV04, Kan05]. However, just a few are able to deal
with star schemas directly [CJS00, NFW02, XX06, SA10].
Historically there have been two major approaches to research in artificial intelligence: one based on
logic representations, and one focused on statistical ones. While the first is able to deal better with the
complexity of the real world, the second is better when dealing with uncertainty [DKP+06a]. In fact, the
most common approach for pattern mining is based on statistics.
Even so, the first multi-relational methods were developed following the logical approach, in particular
by the Inductive Logic Programming (ILP) community, about ten years ago, with WARMR [DR97],
SPADA [MEL01] and FARMER [NK01] being the most representative ones. As stated by those authors,
ILP approaches achieve a good accuracy in data analysis, but they are usually not scalable with respect
to the number of relations and attributes in the database. Therefore, they are inefficient for databases
with large schemas. Nevertheless, there has been an effort to minimize this bottleneck by making use
of optimization techniques like parallelization and distribution (see [FSC05] for a detailed survey), and
also sampling [ACTM11]. Another drawback of ILP approaches, despite their powerful representation
capabilities (which go beyond our star schema), and a reason for their not being widely used, is that they
require all data in a declarative language, such as Prolog. Fortunately, there are already some tools that
automatically translate a relational database into these representations, easing the use of these
algorithms. In this work, however, we opted for a statistical approach.
Few approaches were designed for frequent pattern mining over star schemas:
An apriori-based algorithm was introduced by Jensen and Soparkar (2000) [CJS00]: it first generates
frequent tuples in each single table using a slightly modified version of Apriori [AS94], and then looks
for frequent tuples whose items belong to different tables via a multi-dimensional count array. It does
not construct the whole joined table, and processes each row as it is formed, thus avoiding the storage
cost of the joined table. Cristofor and Simovici (2001) [CS01] eliminated the explosion of candidates
present in Jensen’s algorithm, and are also able to produce the local patterns existing among
attributes of the same table, i.e. patterns that are frequent with respect to their dimension table, but
not with respect to the relationship (or fact) table.
Ng et al. (2002) [NFW02] proposed an efficient algorithm that first mines each table separately, and
then two tables at a time, to find patterns from multiple tables. The idea is to perform local mining on
each dimension table, and then “bind” two dimension tables at each iteration, i.e. mine all frequent
itemsets with items from two different tables without joining them. After binding, those two tables are
virtually combined into one, which is then bound to the next dimension table.
Xu and Xie (2006) [XX06] presented a novel algorithm, MultiClose, which first converts all dimension
tables to a vertical data format, and then mines each of them locally with a closed-pattern algorithm. The
patterns are stored in two-level hash trees, which are then traversed in pairs to find multi-table patterns.
StarFP-Growth, proposed by Silva and Antunes [SA10], is a pattern-growth method based on FP-Growth
[HPY00]. Its main idea is to construct a tree for each dimension (named DimFP-Tree), according
to the global support of its items (i.e. depending on the number of times the corresponding keys appear in
the fact table). Then, the algorithm builds a global FP-Tree structure, named Super FP-Tree, combining
the branches of the DimFP-Trees according to the facts. All the multi-dimensional patterns are then
retrieved by traversing this tree using FP-Growth.
There are other algorithms for finding multi-relational frequent itemsets [RV04, Kan05]; however, they
consider just one common attribute at a time, and the patterns discovered by those methods will not
reflect the co-occurrences among dimensions in a star schema.
Chapter 3
Finding Patterns on Large Star
Schemas
The ability to mine complex data has been recognized as one of the goals for the future of data mining
[YW06], and dealing with multi-relational, large and growing data has received increasing attention in
recent years, with deep advances in mining data streams.
Data Warehouses (DW) meet both these lines of research since, apart from having multiple inter-
related tables, records are added over time but never deleted. Dimension tables are usually large, but
not too large, and slowly changing compared to fact tables. In some manner, a DW can be compared to
a data stream, in the sense that it grows continuously over time, as new records are added to the fact
table for each event occurrence.
Indeed, due to their growing nature, in order to mine DW efficiently, we propose to adopt a strategy
similar to the one followed when mining data streams: avoid multiple scans of the dataset, optimize
memory usage, and use a small constant time per record [LLH11].
This brings new challenges to both MRDM and data stream mining, since existing algorithms for
mining multiple relations are usually neither scalable in the number of records nor able to deal with
new data as it arrives, while existing algorithms for mining data streams are designed for a single data
table [GHP+03, LLH11].
To the best of our knowledge, there are only two algorithms able to mine multiple data stream
tables; both are based on ILP, and therefore suffer from the same limitations as other ILP techniques.
They also suffer from the candidate generation bottleneck, well known in traditional pattern mining.
In this chapter, we propose a method, Star Frequent-Patterns on Streams (StarFP-Stream), that
combines MRDM and data streaming techniques for frequent pattern mining in large star schemas.
StarFP-Stream does not materialize the join between the tables and adopts a pattern-growth strategy
[HPY00], therefore avoiding the candidate generation bottleneck that affects the only two related
algorithms in the literature [FCAM09, HYXW09]. By using a strategy similar to the one followed when
mining data streams, the algorithm is able to mine both large and growing star schemas, avoiding
multiple scans of the data and optimizing memory usage and time spent.
StarFP-Stream was first proposed in [SA12b] and then updated in [SA12a] to correctly aggregate
the business events in the presence of degenerate dimensions, and therefore to find patterns at the
right aggregation level. In this work, we present an overview of StarFP-Stream: we describe the
algorithm in detail and illustrate it with the example of Chapter 2, extracted from a star schema used in
the experiments. We also present an analysis of our algorithm’s complexity, strengths and weaknesses,
and a comparison with the related work.
In Section 3.1 we formally define the problem of finding frequent itemsets over growing star schemas.
Section 3.2 reviews existing work on multi-relational data streams, and the proposed method is described,
exemplified and analyzed in Section 3.3. The performance of our algorithm is evaluated over a DW, and
results are presented in Section 3.4. Finally, Section 3.5 concludes the chapter with a discussion and
some open issues.
3.1 Problem Statement
The definitions presented in Chapter 2 consider that the star is static and that the database is mined all
together. In order to consider growing star schemas, we need to extend those definitions.
Let us now assume that the tables are data streams, where new business facts continuously arrive.
We now have what we call a star stream, but only the fact table needs to be treated as an actual stream
(the fact stream)1.
Definition 7. A fact stream FS = B1 ∪ B2 ∪ ... ∪ Bk is a sequence of batches, where Bk is the current
batch, B1 the oldest one, and each batch is a set of business facts. Let N be the current length of the
stream, i.e. the number of business facts seen so far.
Note that if the fact table is not a stream, we can still treat it like one, by dividing it into batches.
Following our example in Chapter 2, we can consider that the fact stream (Table 2.2) is composed of two
batches – B1 with the first 5 business facts (gray rows), and B2 with the other 5 (white rows).
As it is unrealistic to hold all streaming data in the limited main memory, data streaming algorithms
have to sacrifice the correctness of their results by allowing some items and itemsets to be discarded. This
means that the support calculated for each item is an approximate value (denoted by sup′), instead of
the real value. These counting errors should be as small as possible, while still allowing an effective
use of memory by discarding very infrequent items.
In data streaming algorithms, these errors are bounded by a user defined maximum error threshold,
ε ∈ [0, 1[, such that ε ≪ σ, i.e. it is much lower than the minimum support threshold. Therefore, the
difference between the real and approximated support should be at most εN .
Definition 8. An approximate pattern is an itemset whose estimated support is greater than or equal
to the minimum support threshold minus the maximum error allowed.
X is an approximate pattern if sup′(X) ≥ (σ − ε)×N
Given σ and ε, the problem of multi-relational frequent pattern mining over star streams consists of
finding all approximate patterns in the star S.
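Definition 8 reduces to a simple threshold test on the estimated support. As a minimal illustration (the function name and parameters below are ours, not from the text):

```python
def is_approximate_pattern(estimated_support, sigma, epsilon, n):
    """Definition 8: X is an approximate pattern if sup'(X) >= (sigma - epsilon) * N."""
    return estimated_support >= (sigma - epsilon) * n

# With sigma = 0.5, epsilon = 0.2 and N = 10 facts seen so far,
# the effective threshold is (0.5 - 0.2) * 10 = 3 occurrences.
assert is_approximate_pattern(3, 0.5, 0.2, 10)
assert not is_approximate_pattern(2, 0.5, 0.2, 10)
```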
3.2 Related Work
Although there exist some algorithms that are able to find multi-dimensional patterns in star schemas
(Section 2.3), they are often not scalable. Indeed, due to the growing nature of data warehouses, it is
necessary to adopt a strategy similar to the one followed when mining data streams. However, most
existing algorithms for data streams are designed for a single table [GHP+03, LLH11].
A data stream is an ordered sequence of instances that are constantly being generated and collected.
The nature of these streaming data makes the mining process different from traditional data mining in
several aspects:
1Dimensions can also be streams. But since, for a foreign key to appear in the fact table, it must already be created and populated in the corresponding dimension, only the fact table needs to be considered a stream.
1. Each element should be examined at most once and as fast as possible;
2. Memory usage should be limited, even though new data elements are continuously arriving;
3. The generated results should always be available and up to date;
4. Frequency errors on results should be as small as possible.
This implies the creation and maintenance of a memory-resident summary data structure that stores
only the information strictly necessary to avoid losing patterns [LLH11]. Hence, data stream mining
algorithms have to sacrifice the correctness of their results by allowing some counting errors.
Existing approaches can be deterministic or probabilistic: deterministic if they only allow an error in
the frequency counts, but guarantee that all real frequent patterns are returned (i.e. there are no false
negatives); and probabilistic if, besides an error, they also allow a probability of failure, i.e. there is a
probability that some real patterns are not returned (there might be false negatives). In this work, we
decided to focus on deterministic algorithms, so that we can have the guarantee that we do not miss any
real pattern.
The first proposed algorithm was Lossy Counting [MM02]. It divides the data stream into batches and
maintains frequent items in a set summary structure along with their estimated frequency and maximum
error. The algorithm guarantees that: (1) all itemsets whose true frequency exceeds σN are reported
(there are no false negatives); (2) frequencies are underestimated by at most εN ; and (3) false positives
have a true frequency of at least (σ − ε)N .
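To make these guarantees concrete, the core of Lossy Counting can be sketched for single items as follows. This is a simplified, illustrative version (the original algorithm also handles itemsets), and all names are ours:

```python
def lossy_counting(stream, epsilon):
    """Maintain (count, max_error) per item; prune at batch boundaries.

    Guarantees: counts are underestimated by at most epsilon * N, and
    no item with true frequency above epsilon * N is permanently lost."""
    width = round(1 / epsilon)               # batch (bucket) width
    counts = {}                              # item -> [count, max_error]
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]   # could have been missed before
        if n % width == 0:                   # batch boundary: prune
            for it in [i for i, (f, d) in counts.items() if f + d <= bucket]:
                del counts[it]
            bucket += 1
    return counts

def frequent(counts, sigma, epsilon, n):
    """Report items that may reach frequency sigma * N (no false negatives)."""
    return {i for i, (f, d) in counts.items() if f >= (sigma - epsilon) * n}
```

At every batch boundary, entries whose estimated count plus maximum error do not exceed the current batch id are dropped, which is exactly what bounds the undercount by εN.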
Giannella et al. [GHP+03] presented a novel algorithm, called FP-Streaming, that adapts FP-
Growth [HPY00] to mine frequent itemsets in time-sensitive data streams and gives the same guarantees
as Lossy Counting. They make use of the FP-tree structure and its compression properties to maintain
time-sensitive frequency information about patterns. The stream is divided into batches and a tree
structure (called FP-stream) is updated at every batch boundary. Each node in this tree represents a
pattern (from the root to the node) and its frequency is stored in the node, in the form of a tilted-time
window table, which keeps frequencies for several time intervals. The tilted-time windows give a
logarithmic overview of the frequency history of each pattern, allowing the algorithm to address queries
that request frequent itemsets over arbitrary time intervals, rather than only over the entire stream
(called a landmark model). It can also be used in this latter case, without temporal information and
with the same guarantees, by storing only one frequency in each node instead of a time table (let us
call this simpler version SimpleFP-Stream).
Some other algorithms were proposed to mine frequent itemsets in data streams (see [LLH11] for a
more exhaustive survey), but most of them are adaptations of the strategies applied in the algorithms
above.
3.2.1 MRPM over Data Streams
To the best of our knowledge, there are only two works on multi-relational frequent pattern mining over
data streams. Both are based on ILP; hence, to deal with multi-relational databases, these algorithms
need all data in Prolog form: a set of predicates over variables and constants representing the relations
and attributes in the database. Both consider that there exists one relation (i.e. one table) that
represents the target relation, which is the main subject or unit of analysis. The patterns found
represent the frequent relations between the other tables/attributes and the target.
In Fumarola et al. [FCAM09], SWARM, a Sliding Window Algorithm for Relational Pattern Mining
over data streams, is proposed. SWARM is a deterministic approach for data streams and is based on
a sliding window model, i.e. the stream is divided into a set of batches (or slides) from which a window
with the most recent ones is kept. The idea is to find all patterns in each slide by building an SE-tree
(Set Enumeration tree). This tree starts with the target relation as the root, and nodes are iteratively
expanded, by adding the predicates that have some variable in common (candidate generation), and then
evaluated (support check). A global SE-tree is used to keep the patterns for the window. It stores a
sliding vector in each node with the support of the respective patterns for each slide of the window, so
that when a new slide arrives, the support vector is shifted to remove the expired support and the tree is
pruned to eliminate unknown patterns.
Hou et al. [HYXW09] presented RFPS (Relational Frequent Patterns in Streams), a probabilistic
approach based on period sampling, for finding relational patterns over a sliding time window of a
relational data stream. Since it is based on WARMR [DR97], it needs the database in prolog form.
RFPS is an apriori-based algorithm [AS94] that first generates and tests candidates with the help of a
Patterns Joint Tree (with the possible refinements of predicates), and then maintains frequent patterns
in a virtual stream tree, based on a periodical sampling probability.
After presenting our algorithm, we compare it with these two approaches in more detail, and show
where StarFP-Stream differs and why it is useful.
3.3 StarFP-Stream
StarFP-Stream is a MRDM algorithm that is able to find approximate frequent dimensional patterns
in large databases following a star schema. It is able to deal with degenerate dimensions and to
aggregate the rows of the fact table into business facts, making it possible to mine the star at the right
business level. At the same time, it is also an algorithm for mining multiple dimensional data streams,
and hence able to mine growing star schemas.
In this work, we will assume a landmark model, i.e. that patterns are measured from the start of
the stream up to the current moment. We discuss later how StarFP-Stream can be adapted to a
time-sensitive model.
StarFP-Stream combines the multi-dimensional strategies of StarFP-Growth [SA10] with the data
streaming strategies of FP-Streaming [GHP+03], and guarantees that all real frequent itemsets are re-
turned. It does not materialize the join of the tables, making use of the star properties, and it processes
one batch of data at a time, maintaining and updating frequent itemsets in a pattern-tree structure.
As required of data streaming algorithms, StarFP-Stream can be asked, at any point in time, to
produce a list of the current frequent itemsets along with their estimated frequencies.
3.3.1 Rationale behind the star stream
As noted above, in a star stream, only the fact table needs to be treated as a stream (denoted as the fact
stream), since when a new fact arrives, the corresponding occurring transactions must have already been
added to the corresponding dimensions.
In this work, we follow the philosophy of the streaming algorithm Lossy Counting [MM02] for the
division of data into batches and for guaranteeing the maximum error. We explain these ideas in detail
below.
This fact stream is conceptually divided into k batches of ⌈1/ε⌉ business facts each, so that the
batch id (1..k) directly reflects the maximum error threshold, i.e. k = εN, with N = k|B| the number of
facts seen so far. This means that, to be frequent, an item must appear more than k times (the
equivalent of once per batch).
All items that appear more than σN times are frequent with respect to the entire stream. Items
that appear fewer than σN times but more than k times are possibly frequent and have to be
maintained, since they may become frequent later.
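A quick numeric check of this relation, using the parameters of the running example (ε = 0.2, hence |B| = 5):

```python
epsilon = 0.2
batch_size = round(1 / epsilon)    # |B| = 5 business facts per batch
assert batch_size == 5

# After k complete batches, N = k * |B| facts have been seen,
# and the batch id k equals the error bound epsilon * N.
for k in (1, 2, 3):
    n = k * batch_size
    assert abs(k - epsilon * n) < 1e-9   # k = eps * N: "once per batch"
```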
Lemma 1. Items that appear only k times (or fewer) are infrequent and can be discarded, because even
if they reappear in later batches and become frequent, the loss of support will not significantly affect
the calculated support, i.e. the difference between the estimated and real frequencies is at most εN.
Proof. Consider an itemset I that first occurs in batch Bj. Let f denote the real frequency of I and f′
its estimated frequency after the current batch Bi (with j ≤ i ≤ k), and let ∆ = j − 1 be the maximum
error of I (i.e. the number of times it could have appeared and been ignored before Bj). Itemsets that
are frequent since the first batch have ∆ = 0 and f′ = f; otherwise, they can have been discarded in
the first ∆ batches. Therefore, f ≤ f′ + ∆. And since ∆ ≤ i − 1 ≤ εN, we can state that
f ≤ f′ + ∆ ≤ f′ + εN.
Lemma 2. All patterns with f′ + ∆ ≥ σN are returned; therefore: (1) there are no false negatives, i.e.
all real patterns are returned; and (2) all false positives returned are guaranteed to have a support
above (σ − ε)N.
Proof. Real patterns have true frequency f ≥ σN. Since f′ + ∆ ≥ f, returning all patterns with
f′ + ∆ ≥ σN will include all real patterns. Similarly, since f′ + ∆ ≤ f + εN, then for all returned
patterns, f + εN ≥ σN ⇔ f ≥ (σ − ε)N, i.e. the frequencies of the returned patterns are off σ by at
most ε.
This lemma guarantees that the recall, i.e. the percentage of real patterns that are returned, is 100%
(no false negatives). However, the precision, i.e. the percentage of returned patterns that are real, is
normally below 100%, because itemsets whose real frequency lies between (σ − ε)N and σN may also
be returned (there are some false positives).
3.3.2 Pattern-Tree
In order to make the storage and search for patterns efficient, estimated frequent itemsets are kept in a
prefix-tree summary structure, called pattern-tree, along with their corresponding support and error.
As a prefix-tree, all prefixes are stored, so that we can easily search for any prefix. As an example, if
the itemset (a, b, c) is in the pattern-tree, then both a and (a, b) are also in that tree, and they share the
same branch. Since we are storing patterns, by anti-monotonicity all prefixes of a pattern are also
patterns, as are all subsets in its strict powerset. Using the same example, if (a, b, c) : 5 is a pattern
with support 5, then a, b, c, (a, b), (a, c) and (b, c) are all patterns (with support equal to or higher
than 5), and therefore all are put in the pattern-tree. In this case, we have 4 branches in the tree: one
with a, (a, b) and (a, b, c); one with a and (a, c) (note that the node a is the same node in both
branches); another with b and (b, c); and finally another branch with only c. This example of a
pattern-tree is presented in Figure 3.1.
Figure 3.1: An example of a pattern-tree, corresponding to pattern (a, b, c) and all subsets.
In this sense, a pattern-tree is a prefix-tree structure where every node corresponds to a pattern,
composed of the items from the root to that node, with the estimated support and error (f′ and ∆)
attached to the node.
Since patterns that share the same prefix also share the same nodes in the tree, the tree is usually
much smaller than a list or table of all patterns, and searching for an itemset is usually much faster.
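Such a pattern-tree can be sketched as a dictionary-based prefix tree; the class and method names below are illustrative, not from the thesis, and the inserted supports are the illustrative values of the (a, b, c) example:

```python
class PatternTree:
    """Prefix tree where each node holds a pattern's estimated support f'
    and maximum error delta; the pattern is the item path from the root."""

    def __init__(self):
        self.children = {}   # item -> PatternTree
        self.support = 0     # estimated frequency f'
        self.error = 0       # maximum error delta

    def insert(self, pattern, support, error):
        """Add (or update) a pattern; shared prefixes share nodes."""
        node = self
        for item in pattern:
            node = node.children.setdefault(item, PatternTree())
        first_time = node.support == 0
        node.support += support
        if first_time:
            node.error = error           # error is fixed at first insertion

    def find(self, pattern):
        node = self
        for item in pattern:
            node = node.children.get(item)
            if node is None:
                return None
        return node

    def query(self, threshold, prefix=()):
        """All patterns with f' + delta >= threshold (cf. Lemma 2)."""
        result = []
        for item, child in self.children.items():
            p = prefix + (item,)
            if child.support + child.error >= threshold:
                result.append((p, child.support))
            result.extend(child.query(threshold, p))
        return result

# Pattern (a, b, c) and all its subsets share branches in the tree.
tree = PatternTree()
for pattern, support in [(("a",), 8), (("a", "b"), 6), (("a", "b", "c"), 5),
                         (("a", "c"), 5), (("b",), 7), (("b", "c"), 6), (("c",), 6)]:
    tree.insert(pattern, support, 0)
assert tree.find(("a", "b")).support == 6   # the prefix (a, b) is its own pattern
```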
3.3.3 Algorithm StarFP-Stream
The main idea of the algorithm is to have a local tree structure for each dimension that will store the
occurring transactions as new business facts arrive. These trees, called DimFP-Trees, contain the
intra-dimensional itemsets of the current batch. When |B| facts have arrived, these local trees are
combined into a global tree, called the Super FP-Tree, that will contain the inter-dimensional itemsets.
This tree is then used to extract the multi-dimensional patterns of the current batch, which are, in
turn, used to update the global pattern-tree described above.
The detailed algorithm is presented in Algorithms 1 and 2.
Algorithm 1 StarFP-Stream Pseudocode
Input: Star Stream S, error rate ε
Output: Approximate frequent itemsets with threshold σ, whenever the user asks
 1: i = 1, |B| = 1/ε, flist and ptree are empty
 2: B1 ← the first |B| business facts
 3: L ← StarFP-Growth(B1, support = ε|B| + 1)
 4: flist ← frequent items in B1, sorted by minimum support
 5: for all patterns P ∈ L do
 6:     insert P in the ptree with max error i − 1
 7: N = |B|, discard B1 and L
 8: i = i + 1, initialize n DimFP-trees to empty
 9: for all arriving business facts f = (tidD1, tidD2, ..., tidDn) do
10:     N = N + 1
11:     for all dimensions Dj do
12:         T ← transaction of Dj with tidDj
13:         insert T in the DimFP-treej
14:         flist ← append new items introduced by T
15:     if all business facts of Bi arrived then
16:         super-tree ← combineDimFP-trees(DimFP-trees, Bi)
17:         FP-Growth-for-streams(super-tree, ∅, ptree, i)
18:         discard the super-tree
19:         tail-pruning(ptree.Root, i)
20:         i = i + 1, initialize n DimFP-trees to empty

function combineDimFP-trees(DimFP-trees dim-trees, batch of business facts Bi)
    fptree ← new FP-tree
    for all business facts f ∈ Bi do
        for all dimensions Dj do
            T ← branch of dim-treej with tidDj
            sort items in T according to flist
            insert T in fptree
    return fptree

function tail-pruning(pattern-tree node R, batch id i)
    for all children C of R do
        if C.support + C.error ≤ i then
            remove C from the tree
        else
            tail-pruning(C, i)
When mining a star as a whole, items are ordered in support-descending order, which is known to
enhance the compactness of the trees [HPY00]. However, when dealing with streams, the order of
items cannot depend on their support, not only because we do not know the alphabet of items from the
beginning and we only see one transaction at a time, but also because items appear and disappear and
their supports change. Therefore, the list of items (flist) is dynamic, with items appended as they appear,
so that all patterns in the pattern-tree are sorted according to their order of appearance. In this sense,
we decided to process the first batch separately, as a whole, using StarFP-Growth [SA10], to initialize
both the order of items and the pattern-tree (rows 2–6 in Algorithm 1). Note that, while processing this
batch, we only need to scan the transactions of dimensions that appear in the batch, instead of the
whole dimensions.
After the first batch, all next business facts are processed as they arrive (rows 11–14 in Algorithm
1). So, every time one fact arrives, it is scanned and the transaction corresponding to each key is stored
in the respective DimFP-Tree. These local trees are simple prefix-trees (like the pattern-tree) that allow
us to compress the intra-dimensional itemsets of each batch. Besides, they also contain a header table
with the occurring keys, and a link from them to the corresponding branches in the tree, so that given a
foreign key, we can easily reach and regenerate the respective transaction. Note that we cannot discard
any item that appears in the current batch, since we do not yet know whether it is frequent. Thus,
all items found must be in the DimFP-tree. We can say that the DimFP-tree is a compact and efficient
representation of the dimension table for the current batch.
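The role of a DimFP-tree (compressing a batch's occurring transactions while keeping a header link from each foreign key to its branch) can be sketched as follows. All names are ours, and the branch bookkeeping is simplified:

```python
class DimNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.children = {}            # item -> DimNode
        self.count = 0

class DimFPTree:
    """Prefix tree over a dimension's occurring transactions, with a
    header table linking each foreign key to the last node of its branch."""

    def __init__(self):
        self.root = DimNode()
        self.header = {}              # foreign key -> last node of its branch

    def insert(self, key, items):
        node = self.root
        for item in items:            # shared prefixes are compressed
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = DimNode(item, node)
            child.count += 1
            node = child
        self.header[key] = node

    def transaction(self, key):
        """Regenerate the transaction for a foreign key via parent links."""
        items, node = [], self.header[key]
        while node.item is not None:
            items.append(node.item)
            node = node.parent
        return list(reversed(items))
```

Given a foreign key seen in an arriving fact, `transaction()` regenerates the dimension transaction by following parent links, which is the kind of lookup needed when the batch is later combined into the Super FP-Tree.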
When a batch is complete (rows 15–20 in Algorithm 1), the DimFP-Trees are combined to form
a global tree, called Super FP-Tree. In this step we have to scan the facts of the batch a second time,
otherwise we would not know which intra-dimension itemsets of the different DimFP-Trees occur together.
However, a fact is just a set of tids, therefore this extra scan is not significant. So the Super FP-Tree
is constructed by first looking for the co-occurring foreign keys of a fact in each DimFP-Tree, and then
joining the corresponding branches (see function combineDimFP-trees). Hence, it will contain all the
inter-dimension itemsets of the respective batch.
Algorithm 2 FP-Growth-for-streams Pseudocode
Input: FP-tree fptree, Itemset α, Pattern-tree ptree, Current batch id i
 1: if fptree = ∅ then
 2:     return
 3: else if fptree contains a single path P then
 4:     for all β ∈ P(P) do
 5:         processPattern(ptree, α ∪ β : min[support(nodes ∈ β)], i)
 6: else
 7:     for all a ∈ Header(fptree) do
 8:         β ← α ∪ a : a.support
 9:         if processPattern(ptree, β, i) = false then
10:             proceed to the next a
11:         else
12:             treeβ ← conditional fptree on a
13:             FP-Growth-for-streams(treeβ, β, ptree, i)

function processPattern(pattern-tree ptree, itemset I, batch id i)
    if I ∈ ptree then
        P ← last node of I in ptree
        P.support ← increment by I.support
        if P.support + P.error ≤ i then
            return false            // Type II Pruning
    else if I.support > ε|B| then
        insert I in ptree with support = I.support and maximum error = i − 1
    else
        return false                // Type I Pruning
    return true
The Super FP-Tree is then mined using the FP-Growth algorithm, presented in Algorithm 2, modified
as follows.
For each mined itemset I (see function processPattern):
1. If it is not in the pattern-tree (i.e. its f′ ≤ i − 1, according to Lemma 1), test Type I Pruning:
if I occurs only once in Bi and it is not in the pattern-tree, it is infrequent; thus we do not
insert it in the pattern-tree (Lemma 1) and we can stop mining the supersets of I (by the
anti-monotone property [AS94]: if I is infrequent, all supersets of I are also infrequent).
Otherwise, insert I into the tree with its number of occurrences in Bi and maximum error ∆ = i − 1.
2. If I is in the pattern-tree (i.e. its f′ > i − 1):
(a) Update its frequency, by adding the number of occurrences in Bi;
(b) Test Type II Pruning: if f′ + ∆ ≤ i, it will be deleted later because it is infrequent (Lemma
1), so we can stop mining the supersets of I. Otherwise, FP-Growth continues with I.
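The two pruning tests can be condensed into a single decision function over a pattern store. The sketch below uses our notation: `ptree` is simplified to a dictionary mapping itemsets to [f′, ∆], and the return value means "continue mining supersets". The example values follow the worked example of the next section (Year=2006 occurs 5 times in the second batch; Category=Clothes occurs once):

```python
def process_pattern(ptree, itemset, count_in_batch, i, batch_size, epsilon):
    """Update ptree for one itemset mined from batch B_i; return False to
    stop exploring supersets (Type I / Type II pruning), True otherwise."""
    entry = ptree.get(itemset)
    if entry is not None:                         # already tracked: update f'
        entry[0] += count_in_batch
        if entry[0] + entry[1] <= i:              # Type II: will be tail-pruned
            return False
        return True
    if count_in_batch > epsilon * batch_size:     # more than "once per batch"
        ptree[itemset] = [count_in_batch, i - 1]  # new entry, max error i - 1
        return True
    return False                                  # Type I: seen once, not kept

# Second batch of the running example (i = 2, |B| = 5, epsilon = 0.2):
ptree = {}
assert process_pattern(ptree, ("Year=2006",), 5, 2, 5, 0.2)            # inserted
assert ptree[("Year=2006",)] == [5, 1]
assert not process_pattern(ptree, ("Category=Clothes",), 1, 2, 5, 0.2) # Type I
```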
After mining the batch, we can discard the Super FP-Tree and prune the pattern-tree by Tail
Pruning: remove all nodes whose f′ + ∆ ≤ i (Lemma 1). The pattern-tree is now updated and contains
all approximate frequent itemsets up to that batch. The next steps consist only of preparing the next
batch and waiting for new facts.
Whenever there are no more batches, or whenever a user asks for a list of the current frequent
itemsets, we just need to scan the pattern-tree and return all itemsets with f′ + ∆ ≥ σN (Lemma 2).
3.3.4 Example
To illustrate the flow of StarFP-Stream, we use the example presented in Tables 2.1 and 2.2. Let
the minimum support threshold be 50% of the database and the maximum error 20%. For this error,
the fact table is divided into 2 batches of 5 business facts each (|B| = 1/ε). In a static environment,
without error, a support of 50% means that an itemset is frequent if it occurs in at least 5 business
facts (σN).
The algorithm starts by processing the first batch separately. Fig. 3.2 illustrates part of the pattern-
tree with the resulting patterns. For example, pattern (Gender=M) has a support of 4 business facts, and
pattern (Gender=M,Category=Seats,Category=Tires) of 3 facts. Both were added to the pattern-tree in
the first batch, therefore ∆ = 0. An example of an infrequent item is (Year=2006). Since it occurs only
once, it was ignored and not inserted in the pattern-tree.
Figure 3.2: Part of the pattern-tree resulting from the first batch. Gray nodes represent patterns whose f′ + ∆ ≥ σN = 2.5, and therefore are returned by the algorithm. White nodes have σN > f′ + ∆ ≥ 2, so they are not returned, but cannot be discarded.
Next, for each arriving business fact, the respective occurring transactions are inserted in the corre-
sponding compact DimFP-trees. Using our example, when the 6th sales order arrives, transactions p5
and p7 are inserted into the DimFP-tree of dimension Product; transaction c1 into the tree of dimension
Customer; and so forth. The DimFP-trees of dimensions Customer and Product are illustrated in Fig.
3.3, as if all business facts of the second batch had already arrived (for simplicity, we omitted the product
names). We can see that products p1, p2, p3, p4, p5 and p7 occurred, and that the corresponding DimFP-
tree maintains a link of each key to the respective path in the tree, which facilitates further searches.
Note that each of these trees contains all possible intra-dimensional patterns of the current batch.
Figure 3.3: DimFP-Trees of dimensions Customer (left) and Product (right), at the end of the second batch.
Figure 3.4: Super FP-tree of the second batch.
When all 5 business facts of this batch have arrived, the DimFP-trees are combined into one, the Super
FP-tree presented in Fig. 3.4, by scanning again the business facts, looking for the keys in the DimFP-
trees and joining the co-occurring paths (for convenience, nodes with items from the same dimension are
presented together, instead of ordered according to the flist). Using our example again, when scanning
sales order 7, all paths corresponding to client c5, products p2 and p3, territory s4 and date 20060509 are
joined in only one path (the leftmost path in the figure). By doing this, the Super FP-tree puts together
the intra-dimensional itemsets and forms inter-dimensional ones.
The Super FP-tree is then mined using the described adaptation of the FP-Growth algorithm and the
pattern-tree is updated. Fig. 3.5 shows a subset of the final pattern-tree. During this mining, discovered
patterns that are already in the pattern-tree are updated. For example, the itemset (Gender=M) was
already in the pattern-tree, and occurs 3 times in the second batch. It now has a frequency of 7, and since
it is higher than the maximum error (2 for the second batch), we can keep mining its supersets (Type II
Pruning). Also, newly discovered patterns are added to the tree. For example, the itemset (Year =
2006) was not in the pattern-tree (although it appeared once in the first batch). Since its frequency in
this batch is 5, higher than the error, it is inserted in the tree with estimated support 5 and ∆ = 1
(note that f′ + ∆ = 6 = f). Itemsets that occur only once and were not in the pattern-tree, such as
(Category = Clothes), are neither added nor further explored (Type I Pruning).
Finally, the pattern-tree is pruned by Tail Pruning, to eliminate all infrequent itemsets (f′ + ∆ ≤ 2).
One example is (Gender = M, Year = 2004):2, which appeared in the first pattern-tree but not in the
second one.
Figure 3.5: Part of the final pattern-tree.
These steps are then repeated for new facts.
If we ask for a list of the current frequent itemsets, a scan of the pattern-tree returns all itemsets
with f′ + ∆ ≥ σN = 5. In our example, the complete pattern-tree would have 175 nodes, but would
return only 17 patterns. A subset of the final patterns returned by the algorithm, corresponding to the
pattern-tree in Fig. 3.5, can be found in Table 3.1.
Table 3.1: A subset of the final patterns.
Pattern                                                   Dimensions
(Year=2006):5                                             Date
(TerritoryGroup=America):6                                Territory
(TerritoryGroup=America, Country=USA):5                   Territory
(Category=Seats):7                                        Product
(Category=Seats, Category=Tires):6                        Product
(Category=Seats, TerritoryGroup=America):4                Product, Territory
(Category=Seats, TerritoryGroup=America, Year=2006):4     Product, Territory, Date
(Gender=M):7                                              Customer
(Gender=M, Category=Seats):5                              Customer, Product
(Gender=M, Category=Seats, Category=Tires):5              Customer, Product
Note that the algorithm found both intra- and inter-dimensional patterns. For example, the pattern
(Category=Seats, Category=Tires):6 is an intra-dimensional pattern stating that seats and tires were
bought together in 6 internet sales (60%). Notice that this pattern relates different products that
co-occurred, and therefore could only be found because we aggregated the sales by the degenerate
dimension (sales order number). An example of an inter-dimensional pattern is (Category=Seats,
TerritoryGroup=America, Year=2006):4, relating the dimensions Product, Sales Territory and Date.
Note that this pattern has real f = 5, but f′ = 4, since it appeared only once in the first batch and was
discarded. However, it was not missed, because f′ + ∆ = 5. Although pattern (Category=Bike):4 has
the same f′, it was not returned because its ∆ is zero (and it was indeed not a real pattern).
3.3.5 Complexity Analysis
Since we work with one batch at a time, which corresponds to a certain number of facts and the
corresponding transactions in each dimension, we can assume that we work with a smaller star at each
point in time. Let that conditional star be SB. Let |B| be the number of business facts in a batch, and
mi the number of rows in the fact table necessary to represent fact i in that batch. The size of the
respective fact table is |FT|B = n × ∑_{i=1}^{|B|} mi primary keys, with n the number of dimensions
(note that in a star without degenerate dimensions mi = 1, and the size of the fact table is n × |B|).
Let also tdiB be the number of transactions of dimension i that occur in this batch. The size of each
dimension is |Di|B = tdiB × cdi, with cdi the number of columns of dimension i. Then the size of the
star is |S|B = |FT|B + ∑_{i=1}^{n} |Di|B.
Joining the tables before mining would result in a much larger table, whose size would be the number of rows in the batch times the sum of all columns in all dimensions: Σ_{i=1}^{|B|} m_i × Σ_{j=1}^{n} cd_j. This would have a negative impact on the memory needed, as well as on the time, not just because of the extra pre-processing step for the creation of this table, but also because of all the steps involving scans of the transactions.
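The two sizes can be compared directly from these formulas. The following Python sketch is only an illustration, with made-up batch numbers (the m, td and cd values below are hypothetical):

```python
def star_size(n_dims, m, td, cd):
    """|S|_B = |FT|_B + sum_i |D_i|_B, with |FT|_B = n * sum_i m_i
    and |D_i|_B = td_i * cd_i."""
    fact_table = n_dims * sum(m)
    dimensions = sum(t * c for t, c in zip(td, cd))
    return fact_table + dimensions

def joined_size(m, cd):
    """Size of the materialized join: batch rows times all columns."""
    return sum(m) * sum(cd)

# Hypothetical batch: 10 business facts, 2 fact-table rows each,
# over 4 dimensions (td and cd values are made up for illustration).
m = [2] * 10
td = [5, 6, 4, 3]
cd = [7, 8, 6, 3]
print(star_size(4, m, td, cd))   # 4*20 + 116 = 196
print(joined_size(m, cd))        # 20 * 24 = 480
```

Even in this tiny example the materialized join is more than twice the star, because every fact row repeats every dimension column.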
Each new business fact i that arrives is treated separately in O(m_i × Σ_{j=1}^{n} cd_j), since the occurring transactions of each dimension only need to be scanned once, to insert them in the corresponding DimFP-tree.
When a batch is complete, we need to take 3 steps: (1) combine the trees; (2) run FP-Growth on the
combined tree; and (3) prune the pattern-tree.
In (1), the DimFP-trees are combined in O(Σ_{i=1}^{|B|} m_i × Σ_{j=1}^{n} (cd_j log(cd_j) + cd_j)). This step consists of one scan of the fact table, to find and fetch the branches of the trees of the different dimensions that co-occur in each fact. The co-occurring items are then sorted and inserted in a Super FP-tree. Sorting and inserting are bounded by the number of columns in the dimensions.
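The sort-and-insert step can be sketched with a minimal prefix tree. This is a simplification for illustration only (the actual Super FP-tree also keeps header links and item counts per dimension); the global item order below is hypothetical:

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def insert_cooccurring(root, items, order):
    """Sort the items that co-occur in one business fact by the global
    order and insert the resulting path, incrementing node counts."""
    rank = {item: i for i, item in enumerate(order)}
    node = root
    for item in sorted(items, key=rank.__getitem__):
        node = node.children.setdefault(item, Node())
        node.count += 1
    return root

order = ["Category=Seats", "Gender=M", "Year=2006"]
root = Node()
insert_cooccurring(root, {"Gender=M", "Category=Seats"}, order)
insert_cooccurring(root, {"Year=2006", "Category=Seats"}, order)
# Both facts share the prefix node "Category=Seats", whose count is now 2.
print(root.children["Category=Seats"].count)
```

Because the items are always inserted in the same global order, facts that share items share prefixes, which is what keeps the tree compact.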
This Super FP-tree is then used in (2) to run the FP-Growth algorithm [for more details, see HPYM04], with just one difference: for each possible pattern, we have to first look for it in the pattern-tree, and insert it if it is not there, which is O(2 × m_i × Σ_{j=1}^{n} cd_j). Both searching and inserting in the tree are linear operations and depend on the size of the pattern, which is, in the worst case, the set of all items occurring in one business fact.
Finally, step (3) consists of scanning the pattern-tree to remove currently infrequent items (tail-pruning). If a node is infrequent, there is no need to scan its children, because they are infrequent too. However, in the worst case (e.g. when there are no infrequent nodes), we have to scan all nodes in the tree.
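Tail-pruning can be sketched as a recursive removal. This sketch assumes, Lossy-Counting style, that a node is currently infrequent when its f + ∆ does not exceed the number of batches processed so far; the field names f, delta and children are illustrative, not the thesis implementation:

```python
class PNode:
    def __init__(self, f=0, delta=0):
        self.f = f
        self.delta = delta
        self.children = {}

def tail_prune(node, batches_seen):
    """Drop every infrequent child; its whole subtree goes with it,
    since a superset cannot be more frequent than its subset."""
    node.children = {item: child for item, child in node.children.items()
                     if child.f + child.delta > batches_seen}
    for child in node.children.values():
        tail_prune(child, batches_seen)

root = PNode()
a = root.children["a"] = PNode(f=10, delta=1)
a.children["b"] = PNode(f=2, delta=1)      # 2 + 1 <= 5: pruned
root.children["c"] = PNode(f=3, delta=0)   # 3 + 0 <= 5: pruned with subtree
tail_prune(root, 5)
print(sorted(root.children))               # ['a']
```

Note how the anti-monotonicity of support is what allows skipping the subtree of a pruned node entirely.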
3.3.6 Strengths and Weaknesses
As a data streaming algorithm, StarFP-Stream gives the following guarantees, in line with [MM02, GHP+03]:
• All itemsets whose true frequency exceeds σN are returned (there are no false negatives);
• No itemset whose true frequency is less than (σ − ε)N is returned;
• And estimated frequencies are less than true frequencies by at most εN .
As a multi-relational algorithm for star schemas, StarFP-Stream guarantees that it mines the star
directly, without materializing the join of the tables, and that all multi-dimensional patterns are returned.
Like any algorithm, StarFP-Stream also has some limitations:
• As an FP-Growth [HPY00] based algorithm, it has to scan the facts twice: first to know which transactions of dimensions occur, and second to combine them at the end of a batch. However, a fact is just a set of tids, so the time needed for each scan and the memory needed to keep it are not significant when compared to scanning and storing transactions of items;
• And the pattern-tree tends to be very large, since it has to keep all frequent and potentially frequent patterns. Nevertheless, its size tends to stabilize as the batches increase, and it is able to return the patterns for every minimum support σ ≥ ε, anytime.
3.3.7 Comparison with Related Work
As mentioned in Section 3.2.1, there are only two algorithms for relational pattern mining over data streams: RPFS [HYXW09] and SWARM [FCAM09]. However, they are not directly comparable with StarFP-Stream.
While our algorithm is deterministic, RPFS is a probabilistic approach that only uses a sample of the data. Deterministic approaches allow an error in the frequency counts, but guarantee that all real frequent patterns are returned (i.e. there are no false negatives). On the contrary, probabilistic approaches, besides an error, also allow a probability of failure, i.e. there is a probability that some real patterns are not returned (there might be false negatives). Therefore, we cannot make a fair comparison between StarFP-Stream and RPFS.
On the other hand, SWARM is deterministic, but the transformation of the data into their input is not straightforward, nor is the correspondence between their results and ours. We would need both an extra pre-processing and an extra post-processing step. We analyze this in more detail below.
The authors define 3 types of predicates: key predicates, for target objects; structural predicates, for relations between objects; and property predicates, for values taken by properties of an object (binary) or of a relation between two objects (ternary). Table 3.2 presents a comparison between our representation and SWARM's. Capitalized letters correspond to variables, which can take any value.
Table 3.2: Correspondence between StarFP-Stream and SWARM representations.

StarFP-Stream                                                  SWARM
Item: (color, "Black")                                         Property predicate (with variables): color(P, "Black")
Dim. transaction: {p1, (color, "Black")}                       Property predicate (with values): color(p1, "Black")
Fact key: sale order 1                                         Key predicate: order(1)
Fact: {1, p1, c1}                                              Binary structural predicates: {order(1), soldWhat(1, p1), soldTo(1, c1)}
Measure: {1, p1, c1, m}                                        Ternary structural predicate: quantity(1, p1, m)
Intra-dimensional pattern: (color, "Black")                    {order(O), soldWhat(O, P), color(P, "Black")}
Inter-dimensional pattern: {(color, "Black"), (gender, "M")}   {order(O), soldWhat(O, P), color(P, "Black"), soldTo(O, C), gender(C, "M")}
As can be seen there, items correspond to property predicates. Each transaction of a dimension is therefore a set of property predicates, one per attribute. In our star schema case, we assume that our targets are business facts and that structural predicates correspond to relations between the facts and entities in dimensions. In this sense, a fact table can be mapped to their representation as a set of N key predicates (one per fact) and n × N binary structural predicates (one per dimension, per fact). If there are measures in the fact table, they can also be represented by ternary structural predicates. However, there are no n-ary predicates, so they cannot represent measures that depend on more than two objects.
Every time a new order is made, all the predicates relating to this order must flow through the stream. So, for example, order {1, 20040510, p1, c1, s1} should be transformed to {order(1), soldWhen(1, 20040510), soldWhat(1, p1), soldTo(1, c1), soldWhere(1, s1)}, along with all property predicates related to the entities in question, such as category(p1, "Bike") and color(p1, "Black") from dimension Product, and status(c1, "M") and gender(c1, "F") from Customer.
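This mapping of one business fact into predicates can be sketched directly. The sketch below is only illustrative; the predicate names (soldWhen, soldWhat, soldTo, soldWhere) follow the example above:

```python
def fact_to_predicates(order_id, tids):
    """Translate one business fact into SWARM-style predicates: one key
    predicate plus one binary structural predicate per dimension tid.
    tids maps the structural predicate name to the dimension tid."""
    predicates = [f"order({order_id})"]
    for relation, tid in tids.items():
        predicates.append(f"{relation}({order_id},{tid})")
    return predicates

print(fact_to_predicates(1, {"soldWhen": 20040510, "soldWhat": "p1",
                             "soldTo": "c1", "soldWhere": "s1"}))
# ['order(1)', 'soldWhen(1,20040510)', 'soldWhat(1,p1)',
#  'soldTo(1,c1)', 'soldWhere(1,s1)']
```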
In SWARM, patterns must relate the target with other relations and attribute values, and therefore they are built by expanding the key predicate with the other related predicates. In this sense, SWARM finds first-order patterns, as shown in the table. In the last example, our pattern stating that "Black" products are frequently bought by male customers is translated to: orders of products whose color is black, sold to clients of gender male, are frequent.
However, while expanding the predicates, the algorithm does not allow repetitions, which means that it cannot deal with degenerated dimensions, like in our example, and therefore it cannot always find patterns at the right business level (e.g. {(category, "Seats"), (category, "Tires")}). Hence, it is not comparable to our algorithm.
3.3.8 Time Sensitive Model
StarFP-Stream can also be extended to a time-sensitive data stream paradigm. By being aware of time, outdated data can be discarded, more recent patterns can be given more weight than older ones, and patterns can be examined at different granularities. This is very important in many real-world applications, where changes of patterns and their trends are more interesting than the patterns themselves (e.g. shopping and fashion trends, Internet bandwidth usage, etc.).
There are several ways to achieve this, in light of existing works [GHP+03, LLH11]. One simple approach is to keep more than one frequency per pattern (i.e. per node in the pattern-tree), corresponding to the support of the pattern in each of the most recent batches. Whenever a new batch arrives, we can just shift the supports and discard the oldest one. The estimated frequency of each pattern can be the sum of the stored frequencies, and therefore patterns that were frequent before, but not recently, can be discarded. We can also consider a weighted sum of the frequencies, giving a higher weight to more recent batches.
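A minimal sketch of this sliding scheme, keeping the last k per-batch supports per node (the particular weights below are an illustrative choice, not prescribed by the thesis):

```python
def shift_supports(supports, newest):
    """Discard the oldest stored support and append the newest batch's."""
    return supports[1:] + [newest]

def weighted_frequency(supports, weights):
    """Weighted sum of the stored supports; higher weight = more recent."""
    return sum(s * w for s, w in zip(supports, weights))

supports = [6, 3, 1]                    # last 3 batches, oldest first
supports = shift_supports(supports, 0)  # pattern absent from the new batch
print(supports)                         # [3, 1, 0]
print(weighted_frequency(supports, [0.2, 0.3, 0.5]))  # favors recent batches
```

With these weights, a pattern that stops appearing fades out quickly, which is exactly the intended recency effect.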
Another option is to consider more elaborate time divisions, such as the logarithmic time windows adopted by Giannella et al. [GHP+03]. In this case, we can consider periods (called windows) of different time granularities (e.g. day, month, year) and, instead of discarding the oldest frequencies, aggregate them. In this sense, when a new period starts, the shift of frequencies corresponds to adding them to the group of frequencies at a higher granularity (e.g. a pattern has support h on the last day, so when a new day comes, h is added to the number of times it has appeared in the last month). We can then give more importance to patterns that are frequent in more recent windows, and discard the oldest windows if their support is not significant.
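A sketch of that roll-up step, with hypothetical day/month/year windows per pattern (an illustration of the idea, not the tilted-window structure of [GHP+03] itself):

```python
def roll_up(windows, new_day_support):
    """When a new day starts, fold the finished day's support into the
    month window instead of discarding it, then start the new day count."""
    windows["month"] += windows["day"]
    windows["day"] = new_day_support
    return windows

w = {"day": 3, "month": 10, "year": 120}
print(roll_up(w, 0))   # {'day': 0, 'month': 13, 'year': 120}
```

The month-to-year transition would follow the same pattern, so finer windows stay small while older history is kept, only coarser.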
3.4 Performance Evaluation
This section presents the experiments conducted to evaluate the performance of our data streaming algorithm. Our goal is to evaluate the accuracy, time and memory usage, and to show that:
1. StarFP-Stream is capable of directly mining a star schema with degenerated dimensions, at the right aggregation level;
2. Our algorithm has a high accuracy and does not miss any real pattern;
3. The time needed for each batch is less than the time needed to denormalize each fact before mining the batch;
4. Mining the star directly is better than joining before mining, with and without degenerated dimensions.
We assume a landmark model, where all patterns are equally relevant, regardless of when they appear in the data. Therefore, we test StarFP-Stream against an adaptation of FP-Streaming for landmark models, which we call SimpleFP-Stream, as described in Section 3.2. Since SimpleFP-Stream does not deal with stars directly, it has to join the tables into one. And since it will have a star stream as input, with business facts arriving continuously, it denormalizes each business fact when it arrives (i.e. it goes to every dimension and joins all the transactions corresponding to the tids of the business fact in question), before mining it.
We also implemented FP-Growth, so that we could run it on all the data and compare its returned patterns (the exact patterns) with those of StarFP-Stream (the approximate patterns), to evaluate accuracy.
We tested the algorithms with a sample of the AdventureWorks 2008 Data Warehouse, described below. In order to analyze the algorithms in the absence and presence of degenerated dimensions, we first test the performance of both algorithms on a traditional star, ignoring the degenerated dimension, and then considering it, so that the algorithms can aggregate the facts related to the same business transaction and find patterns at the right business level.
In this work we analyze the accuracy of the results, as well as the behavior of the pattern-tree and the time and memory used by each algorithm. Experiments were conducted varying both minimum support and maximum error thresholds: σ ∈ {50%, 40%, 30%, 20%} and ε ∈ {10%, 5%, 4%, 3%, 2%, 1%}2. Dimension tables were kept in memory and the fact table is read as new facts are needed. Note that the course of the mining process of streaming algorithms does not depend on the minimum support defined, only on the maximum error allowed. The support only influences the pattern extraction from the pattern-tree, which, in turn, is ready for the extraction of patterns that surpass any requested support (σ ≥ ε). Since the size of the batches is defined by the error (|B| = ⌈1/ε⌉), by varying the error we are varying the batch size.
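The batch sizes used in these experiments follow directly from |B| = ⌈1/ε⌉, as a quick check confirms:

```python
import math

errors = [0.10, 0.05, 0.04, 0.03, 0.02, 0.01]
batch_sizes = [math.ceil(1 / e) for e in errors]
print(batch_sizes)   # [10, 20, 25, 34, 50, 100]
```

These are exactly the per-batch business fact counts listed in Table 3.4.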
The computer used to run the experiments was an Intel Xeon E5310 1.60GHz (Quad Core), with
2GB of RAM. The operating system used was GNU/Linux amd64 and the algorithms were implemented
using the Java Programming language (Java Virtual Machine version 1.6.0 24).
3.4.1 Data Description
We tested the algorithms in one sample of the AdventureWorks 2008 Data Warehouse3, created by
Microsoft especially to support data mining scenarios. The AdventureWorks DW is from an artificial
company that manufactures and sells metal and composite bicycles to North American, European and
Asian commercial markets.
In this work, we analyze a sample of the star Internet sales, considering four dimensions: Customer,
Product, Date and Sales Territory (who bought what, when and where), as it is shown in Fig. 2.1. Each
dimension has only one primary key and other attributes (no foreign keys). Numerical attributes were
excluded (except year and semester in dimension Date), as well as translations and other personal textual
attributes, like addresses, phone numbers, emails, names and descriptions.
This star contains information about more than 60 thousand individual customer Internet sales, from
July 2001 to July 2004. The fact table has the keys of those four dimensions and a control number (other
attributes were removed). The control number, attribute SalesOrderNumber, is a degenerated dimension,
that indicates which products were bought in the same sales order. There are 27600 Internet sales orders
and 60399 rows (individual sales) in the fact table.
In order to evaluate the performance of the algorithms in the absence and presence of degenerated
dimensions, we have chosen two stars to use in these experiments: (1) AW T-Star – the traditional star,
2A common way to define the error is ε = 0.1σ [LLH11]. Additionally, we use a larger error to see how much worse the results are, and a smaller error to see the improvements.
3AdventureWorks Sample Data Warehouse is available at http://sqlserversamples.codeplex.com/
i.e. the star in Fig. 2.1 without the degenerated attribute SalesOrderNumber ; and (2) AW D-Star – the
degenerated star, i.e. the star as it is presented in Fig. 2.1.
Table 3.3 presents a summary of the dataset characteristics.
Table 3.3: A summary of the dataset characteristics.
                                      AW T-Star       AW D-Star
Number of facts                       60,400          27,600
Number of transactions per fact       1               [1; 8]
Number of attributes per dimension    [2; 7]          [2; 7]
Number of entries per dimension       [12; 18,485]    [12; 18,485]
3.4.2 Experimental Results
The results obtained in these experiments describe the performance of the algorithms when dealing with
traditional star schemas (T-Star) and with stars that have degenerated dimensions (D-Star). We first
discuss the accuracy of the results, and then the size of the pattern-tree. Finally, we present the time
and memory used by each algorithm.
In traditional stars, each row in the fact table corresponds to one business fact. In degenerated stars, several rows in the fact table may correspond to the same business fact. This means that there are fewer business facts than rows, and therefore the D-Star will have fewer batches, which will probably have different sizes.
For a better understanding of the domain of each experiment, the number of batches and their size,
corresponding to each error, are shown in Table 3.4.
Table 3.4: Batches corresponding to each error.

Error    Business Facts per Batch    Number of Batches (T-Star)    Number of Batches (D-Star)
10%      10                          6039                          2760
5%       20                          3019                          1308
4%       25                          2415                          1104
3%       34                          1776                          812
2%       50                          1207                          552
1%       100                         603                           276
Accuracy
The accuracy of the results is influenced by both error and support thresholds. Therefore, we conducted
tests on StarFP-Stream varying both. Note that the resulting patterns of StarFP-Stream and SimpleFP-
Stream are the same (the algorithms only differ in how they manipulate the data). The exact patterns
were given by FP-Growth (with all facts as input) and were compared with the approximate ones.
We know that as the minimum support decreases, the number of patterns increases, since we require
fewer occurrences of an item for it to be frequent. And as the maximum error increases, the number of
patterns returned also tends to increase, because although we can discard more items, we have to return
more possible patterns to make sure we do not miss any real one. As expected, this is verified in these
experiments, as can be seen in Fig. 3.6 (left).
[Figure omitted: charts of the number of patterns returned and of the precision, for varying support (x-axis) and error, for (a) AW T-Star and (b) AW D-Star.]
Figure 3.6: Number of patterns returned (left) and precision (right), by StarFP-Stream.
It is interesting to see that the algorithms return more patterns for AW D-Star than for AW T-Star. For example, for a support of 30%, mining each row as a single fact requires patterns that appear in more than 18 thousand rows. By aggregating per degenerated key, we can instead find the products and sets of products that are bought together in more than 8280 sales orders. This increases the number of patterns returned, since items appearing more than 8280 times but fewer than 18 thousand are infrequent in the first case, but frequent when aggregating.
We can see in the charts that, although StarFP-Stream returns more patterns than the exact ones, it returns just a few more. The precision helps evaluate this, measuring the proportion of real patterns among the patterns returned by the streaming algorithm.
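Precision here is simply the fraction of the returned patterns that are also exact patterns; a quick sketch:

```python
def precision(returned, exact):
    """Proportion of the streaming algorithm's patterns that are real."""
    return len(returned & exact) / len(returned) if returned else 1.0

# Hypothetical sets: 4 returned patterns, of which 3 are exact.
print(precision({"a", "b", "c", "d"}, {"a", "b", "c"}))   # 0.75
```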
Fig. 3.6 (right) presents the precision as the support varies. These results depend on the data characteristics, namely on the number of hidden patterns and on the history of occurrences of items across the batches processed. We can see that, as the error increases, the precision decreases, for all support thresholds. In other words, the smaller the error, the fewer non-real patterns are returned. The overall results show that the precision is always above 60%.
In the case of the AW T-Star, the algorithm achieved its best results for a support of 40% (100% precision for errors between 1% and 5%, and 93% for an error of 10%). This may mean that patterns that appear more than 40% of the time are well defined and consequently are monitored early during processing. For a support of 50%, all errors achieved the same precision of 83%. The precision for the AW D-Star is similar, achieving better results for smaller errors than the T-Star.
The recall of StarFP-Stream (and SimpleFP-Stream) is proved theoretically to be 100% (see Section 3.3.1), meaning that there are no false negatives, i.e. there are no real patterns that the algorithm considers infrequent. The size of the batches is defined in terms of the error, so that we can discard the first n occurrences of an item if n is less than the current number of batches, and still not lose any real pattern. This was also verified in these experiments.
In terms of accuracy, we can state that the streaming results are accurate and achieve a high precision.
Pattern-Tree
The pattern-tree is the key element of these algorithms, since it is the summary structure that holds all the possible patterns. The maximum error and the characteristics of the data influence its size, which in turn influences the time and memory needed. The minimum support only matters when extracting the patterns out of the pattern-tree, and does not influence its size.
Since both algorithms use the same rules to construct the pattern-tree, it is equivalent in both cases. Therefore, we only present the results of the pattern-tree constructed by StarFP-Stream.
[Figure omitted: pattern-tree size charts — (a) average size per error for AW T-Star, (b) size per batch with 3% error for AW T-Star, (c) average size per error for AW D-Star, (d) size per batch with 3% error for AW D-Star.]
Figure 3.7: Average (left) and detailed (right) pattern-tree size.
Fig. 3.7 (left) shows, for each error, the average size of the pattern-tree after processing a batch. It confirms that, as the error decreases, the size of the pattern-tree increases. This is explained by the fact that, for higher errors, the batches are smaller and the algorithms can discard many more possible patterns than for lower errors. Although it is a summary structure, the pattern-tree is still very large, with thousands of nodes.
With AW D-Star, the Super FP-tree and the pattern-tree are substantially larger and different: when aggregating, they have fewer but longer paths, because of the co-occurrences of items of the same table in the same transaction.
Fig. 3.7 (right) shows the detailed size of the pattern-trees, for a fixed error of 3%. In the AW T-Star chart, we can see that the pattern-tree is larger in the first batches, but tends to stabilize a few batches later. This behavior is common to all errors, and reveals that many patterns were frequent only in the beginning. In contrast, the pattern-tree of AW D-Star is smaller in the beginning. This happens because, until sales order number 5400, customers only bought one product at a time, and therefore there are fewer patterns and no co-occurrences of more than one product in the same sale.
In both cases, the spikes are caused by the introduction of recently appearing itemsets, followed by their removal a few batches later, in the pruning step. It is interesting to see that, despite the spikes, the trees always return to the same size. This might indicate that the patterns are well defined and remain consistent across the batches.
These results are important to understand the fluctuations in time and space described below.
Time
With respect to data streams, processing time is usually analyzed in two ways [LLH11]: the time
required to process one batch (update time) and the time needed to return the patterns for a given
support (query time). The first is the elapsed time from the reading of a transaction to the update of
the pattern-tree. The second is the time needed to scan the pattern-tree.
The minimum support does not influence the construction of the pattern-tree, and thus does not affect
the update time. In contrast, the maximum error influences both update and query time. Neither time
should depend on the total number of transactions.
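The two measurements can be sketched as follows; this is an illustrative harness rather than the original implementation, and `process_batch` and `scan_pattern_tree` are simplified stand-ins (hypothetical names) for the algorithms' actual routines:

```python
import time

def process_batch(batch, pattern_tree):
    # Placeholder for the real per-batch update (build and merge the FP-tree).
    for transaction in batch:
        key = frozenset(transaction)
        pattern_tree[key] = pattern_tree.get(key, 0) + 1

def scan_pattern_tree(pattern_tree, min_support, total):
    # Placeholder for the real pattern-tree scan at query time.
    return [p for p, c in pattern_tree.items() if c >= min_support * total]

def timed_run(batches, min_support):
    pattern_tree, total = {}, 0
    update_times = []
    for batch in batches:
        start = time.perf_counter()          # update time: read -> tree update
        process_batch(batch, pattern_tree)
        update_times.append(time.perf_counter() - start)
        total += len(batch)
    start = time.perf_counter()              # query time: scan the pattern-tree
    patterns = scan_pattern_tree(pattern_tree, min_support, total)
    query_time = time.perf_counter() - start
    return update_times, query_time, patterns
```

Measured this way, the update time is per batch (so it should not grow with the stream), while the query time is a single scan of the summary structure.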
[Charts omitted: four panels — (a) Average time – AW T-Star; (b) Time with 3% error – AW T-Star; (c) Average time – AW D-Star; (d) Time with 3% error – AW D-Star. Each plots Time (s) against the error or the batch number, for SimpleFP-Stream and StarFP-Stream.]
Figure 3.8: Average (left) and detailed (right) update time.
Fig. 3.8 (left) shows the average update time per batch, for both algorithms and for all errors. For
consistency, we do not take into account the time needed to process the first batch, since it is processed
separately.
We can see that SimpleFP-Stream demands, on average, more time than StarFP-Stream. The difference
is even larger when there is a degenerated dimension (AW D-Star), since the former has to denormalize
several rows before mining each business fact. This demonstrates that, for star streams, denormalizing
before mining takes more time than mining the star schema directly, especially in the presence of
degenerated dimensions, corroborating our goal and one of the goals of MRDM.
The update time should tend to be constant and should not depend on the number of transactions. This
can be verified in Fig. 3.8 (right), which shows in detail the time needed per batch, for a 3% error. There,
we can see that the update time tends to be constant as more batches are processed.
The higher values in the first batches for AW T-Star, and the lower values in the same batches for
AW D-Star, are directly related to the size of the pattern-tree. Around sale order 5400 (batch 160), the
data change: customers start buying more than one product at a time. Without aggregations (AW T-Star),
the algorithms are able to prune almost half the patterns and therefore need less time to process each
batch. With the degenerated dimension (AW D-Star), the aggregations start only at this point, so the
batches become larger from here on, and the algorithms need more time to process each batch.
In summary, these charts reflect that, as the error decreases, the batches become larger and more
time is needed to process them. They show that the time needed tends to be constant, depending
mainly on the size of the batches and on the size of the current pattern-tree. StarFP-Stream performs
better and needs less time to process each batch, overcoming the “join before mining” approach
(SimpleFP-Stream), both with and without degenerated dimensions.
The query time turned out to be negligible compared to the update time, always taking less than
0.005 seconds. As the error decreases, the pattern-tree grows, and the time needed to extract the
patterns also increases, but only by milliseconds. The same happens with the minimum support: the
lower the support, the more patterns have to be returned.
Memory
The space or memory used by the algorithms was also studied. It depends on the intermediate structures
used by the algorithms, and it is strongly related to the size of the pattern-tree (and therefore to
the error bound). To analyze the maximum memory per batch, we measured the memory used by the
algorithms for each batch, right before discarding the Super FP-Tree and performing the pruning step.
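This measurement protocol can be sketched with Python's `tracemalloc`; the `build_super_fp_tree` and `prune` callables below are hypothetical stand-ins for the algorithms' actual steps, not the original code:

```python
import tracemalloc

def max_memory_per_batch(batches, build_super_fp_tree, prune):
    """Sample peak memory right before the Super FP-Tree is discarded
    and the pruning step runs, mirroring the measurement protocol above."""
    peaks = []
    tracemalloc.start()
    state = {}                                      # persistent pattern-tree state
    for batch in batches:
        tree = build_super_fp_tree(batch, state)    # per-batch Super FP-Tree
        _, peak = tracemalloc.get_traced_memory()   # peak bytes since last reset
        peaks.append(peak)                          # sampled before discard/prune
        prune(state)
        del tree                                    # discard the Super FP-Tree
        tracemalloc.reset_peak()
    tracemalloc.stop()
    return peaks
```

Sampling at this point captures the worst case per batch: both the summary structure and the transient Super FP-Tree are in memory simultaneously.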
[Charts omitted: two panels — (a) AW T-Star; (b) AW D-Star. Each plots Memory (Mb) against the error, for SimpleFP-Stream and StarFP-Stream.]
Figure 3.9: Average maximum memory per batch.
Fig. 3.9 presents the average maximum memory per batch. We can see that it increases as the error
decreases, and that the two algorithms perform very similarly. With the star without degenerated
dimensions (AW T-Star), StarFP-Stream needs slightly more memory per batch than SimpleFP-Stream,
which was expected, since the former has to construct a DimFP-Tree for each dimension, while the latter
puts the denormalized facts in just one FP-tree. With AW D-Star, this difference is not even visible.
We can also see that both algorithms need more memory per batch when dealing with degenerated
dimensions, due to the storage of larger trees.
Naturally, as in other pattern mining algorithms, the memory used increases exponentially as the
error decreases, since the error defines what is considered frequent: the smaller the error, the more has
to be kept. However, just as with time, the memory needed tends to stabilize and not depend on the
number of batches processed so far, as required by the data streaming paradigm. This memory behavior
is directly related to the size of the pattern-tree.
3.5 Discussion and Conclusions
In this chapter we described an algorithm, named StarFP-Stream, for mining patterns on very large
data repositories modeled as a star schema. The algorithm finds frequent patterns at some level of
aggregation, and it is able to deal with degenerated dimensions by aggregating the rows in the fact table
that correspond to the same business fact, while still mining the star directly.
Experimental results show that StarFP-Stream is as accurate as its predecessors, achieving precision
above 60% and 100% recall. The pattern-tree tends to be very large, but its size tends to be stable,
and it is able to return the patterns for every minimum support, at any time. The time and memory
needed by the algorithm tend to be constant and do not depend on the total number of transactions
processed so far, but only on the size of the batches and on the size of the current pattern-tree, which
in turn depends on the characteristics of the data. StarFP-Stream greatly outperforms SimpleFP-Stream
in terms of time. We can therefore conclude that our algorithm overcomes the “join before mining”
approach.
Despite being an efficient algorithm, able to mine large and growing star schemas, StarFP-Stream still
suffers from the main bottleneck of pattern mining: it returns a huge number of unfocused patterns. In
a data streaming environment this problem is even more visible, since streaming algorithms must keep
an even higher number of possibly frequent patterns. To tackle this problem, the most studied approach
is to incorporate domain knowledge into the pattern mining algorithms, helping to filter the patterns
and obtain fewer and more interesting results, from the user and application points of view. In our
case, it can greatly reduce the size of the pattern-tree, and therefore the memory needed per batch, as
well as the time needed, since we would process smaller pattern-trees. Accomplishing this, however, is not
straightforward, since the introduction of domain knowledge, such as constraints, has so far been tackled
over single transactional tables. To the best of our knowledge, there is no approach for incorporating
domain knowledge into multi-relational pattern mining. We discuss and make progress on this in the
following chapters of this dissertation.
Another path for improvement is the creation of a parallelized version of StarFP-Stream, which could
significantly reduce the time needed and increase the throughput of the algorithm. We could parallelize
the processing of each fact in a batch; when the batch is complete and while its Super FP-Tree is being
mined, the next batch can already be collected and its facts processed in parallel. There have been some
efforts in the parallelization of traditional pattern mining, in particular of the base FP-Growth algorithm
[LWZ+08], which may also serve as a basis for parallelizing the mining of the Super FP-Tree in our
StarFP-Stream algorithm.
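The pipelining idea could be sketched as follows; `process_fact` and `mine` are hypothetical stand-ins for the per-fact processing and the Super FP-Tree mining, and this is only a sketch of the proposed parallelization, not part of the current StarFP-Stream:

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(batches, process_fact, mine):
    """Overlap the mining of batch i with the fact processing of batch i+1."""
    results = []
    with ThreadPoolExecutor() as pool:
        mining_future = None
        for batch in batches:
            # Process the facts of the current batch in parallel; meanwhile,
            # the previous batch's mining task may still be running in the pool.
            processed = list(pool.map(process_fact, batch))
            if mining_future is not None:
                results.append(mining_future.result())
            mining_future = pool.submit(mine, processed)
        if mining_future is not None:
            results.append(mining_future.result())
    return results
```

The key property is that `pool.submit(mine, ...)` returns immediately, so the main loop can move on to collecting and processing the next batch while mining proceeds in the background.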
Chapter 4
The Groundwork on Domain Driven
Data Mining
Despite the advances and the recognizable value of pattern mining for finding different types of interesting
relations among data, from association rules [AS94, PH02] to sequences [PHW07] and emerging
patterns [MTIV97, DL99], it tends not to be widely used in real world applications.
One of the main reasons, and one of the common criticisms of pattern mining, is the fact that it
generates a huge number of patterns, independently of user expertise, making it very hard to understand
and use the results [SVA97, HCXY07].
The truth is that, if we only ask for patterns with high support, only already known patterns are
found, or none at all; if, instead, the support is set too low, the number of patterns explodes, and it is
very difficult to distinguish the few really useful patterns among the many uninteresting ones. A balance
is required, and ways to limit the number of results are needed.
Several strategies to cut down the number of patterns returned have already been proposed and tested,
namely: (1) reducing the set of attributes to consider [BGMP05]; (2) identifying only the k-best patterns;
(3) mining condensed representations [PBTL99, Zak00a, BBR00]; (4) using sampling and approximation
techniques; and (5) using background knowledge to limit the results to what is unknown.
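Strategy (3), mining condensed representations, can be illustrated with a minimal closed-itemset filter: an itemset is closed if no proper superset has the same support. The sketch below assumes the frequent itemsets and their supports are already computed; it is a didactic illustration, not one of the cited algorithms:

```python
def closed_itemsets(frequent):
    """frequent: dict mapping frozenset -> support count.
    Returns only the closed itemsets: those with no proper
    superset having exactly the same support."""
    closed = {}
    for itemset, support in frequent.items():
        if not any(itemset < other and support == s
                   for other, s in frequent.items()):
            closed[itemset] = support
    return closed
```

The closed itemsets are a lossless summary: every frequent itemset and its support can be recovered from them, yet the set returned is often much smaller.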
Nonetheless, we need more than just reducing the number of patterns. Another bottleneck of pattern
mining is the lack of focus on user expectations and the weak actionability of the results. User expectations
are subjective and not easy to convert to a machine-readable form [GSD07]. However, they are related
to the domain and to the business goals. In this sense, the use of domain knowledge in the search
process is recognized as a promising approach to solve this problem.
The existing knowledge of a specific application domain, regarding the important aspects of the
business, such as its structure, data, processes, people and goals, is referred to as domain knowledge. This
knowledge usually belongs to experts, and depends on their experience and view of the business. Different
experts in the same domain may have different perspectives of the business, and therefore different domain
knowledge.
Being able to explore this knowledge makes it possible not only to focus the results, but also to reduce
their number, hence improving their interpretability and actionability, and eventually improving business
leverage and the acceptance of pattern mining.
With respect to the use of domain knowledge in data mining, there are some fundamental issues that
need attention:
• How to capture domain knowledge? Domain knowledge is usually implicit and tacit. How to extract
this knowledge from people is non-trivial, especially because different people may have different
perspectives of the business and its goals, and also because they are not always aware of what
they know, or do not know how to transmit it (the knowledge acquisition bottleneck);
• How to represent domain knowledge? In order to use this knowledge, it is necessary to formalize it
in some human- and machine-readable representation. Several ways have been proposed to represent
this knowledge, such as annotations, introduced directly by users or experts; constraints over the
search and results space; and ontologies, modeling the relations between important concepts;
• How to involve domain knowledge in the data mining process? Algorithms must use this knowledge
to constrain the search and the results to what is important. How to use this knowledge efficiently
is still a considerable challenge. On the one hand, if we restrict too much, we may reduce the
discovery process to a simple hypothesis testing task, that can only find already known patterns
[HG02]. On the other hand, if we restrict too little, we go back to a traditional pattern mining
process, and may find too many uninteresting patterns. Another concern is that there are several
different representations, and the ways to incorporate each one may also vary. Furthermore, we
also have to be careful not to depend on one specific domain, or else the algorithms cannot be
reused;
• How to evaluate the interestingness and novelty of the resulting patterns? This problem is common
to traditional data mining, and involves the creation of interestingness measures that evaluate the
results based on the existing knowledge, and the selection of only those that are novel.
The problems of acquiring and representing domain knowledge are outside the scope of this work.
Rather, in this thesis we will focus on how the existing domain knowledge representations have been
incorporated into the pattern mining task.
The work on the introduction of domain knowledge or semantics in pattern mining has been increasing
in the last decade. In a broader view, existing algorithms mostly use this knowledge in the pre- or post-
processing step of the Knowledge Discovery in Databases (KDD) process [FPSS96]:
As a pre-processing task, domain knowledge has been used mainly to reduce the search space or to
improve the quality of data. This can be achieved, for example, by filtering the original data, selecting only
the most important records [BGMP05]; by enriching the data with related concepts from the background
knowledge [HF95, SA96, SVA97]; and by replacing concepts and missing values based on the domain
knowledge. While these techniques help filter and improve the quality of the data to mine, they may
need a lot of time to be tuned. Additionally, they might eliminate important data, and they do not
guarantee that the algorithms will return fewer and novel results.
As a post-processing step, domain knowledge has been used to evaluate the interestingness and novelty
of the discovered patterns, so that only the best (and new) results are shown to the users [WJL03, JS04,
JS05, PT98, CLZ07]. Using domain knowledge only as a post-processing step means that all data has
to be processed to find all patterns, and then all patterns must be processed again to evaluate and filter
them. Thus, it may not be the most efficient strategy.
A more balanced approach is to use the domain knowledge during the data mining phase itself. In this
direction, algorithms are able to prune the search space and filter the results “on the fly”, and therefore
return fewer and more focused results, while needing less time and memory than other approaches.
Mainly, existing approaches incorporate domain knowledge into the discovery process to avoid generating
uninteresting candidates or following uninteresting paths, and to avoid testing all
data [SA95, SVA97, PH02, BJ05, Ant09b, ME09b].
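To illustrate this "during mining" strategy, the sketch below pushes an anti-monotone constraint (here, a hypothetical "total price within a budget") into level-wise candidate generation, so violating itemsets and all their supersets are never counted; it is a didactic sketch, not one of the cited algorithms:

```python
from itertools import combinations

def mine_with_constraint(transactions, min_count, price, budget):
    """Level-wise frequent-itemset mining that pushes the anti-monotone
    constraint sum(price[i] for i in itemset) <= budget into the search."""
    def ok(itemset):
        return sum(price[i] for i in itemset) <= budget

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in sorted(items) if ok([i])]
    frequent = []
    while level:
        counts = {c: sum(c <= set(t) for t in transactions) for c in level}
        survivors = [c for c, n in counts.items() if n >= min_count]
        frequent.extend(survivors)
        # Join step: candidates one item larger, pruned by the constraint
        # before any counting pass over the data.
        level = {a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == len(a) + 1 and ok(a | b)}
    return frequent
```

Because the constraint is anti-monotone (if an itemset violates it, so do all its supersets), checking it at candidate generation is safe: no valid pattern is lost, and whole branches of the search are never explored.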
In this chapter we analyze in detail the existing approaches to the incorporation of domain knowledge
in the pattern mining process, and present a new global view of the work done in this area. In particular,
we propose a new framework for constrained pattern mining that helps organize the existing strategies
for incorporating constraints in the search process, based on the semantics and properties of existing
constraints, as well as on the data sources being constrained.
Section 4.1 presents the background on the use of domain knowledge in pattern mining, and section
4.2 describes the different forms of domain knowledge that have been used, along with their advantages
and disadvantages. Sections 4.4 to 4.8 present in detail the framework for constrained pattern mining
and the existing related work. Finally, section 4.9 concludes with some discussion and open
issues.
4.1 Background
The use of domain knowledge has been explored in data mining since its early years, in a somewhat
independent manner across different areas, such as inductive logic programming, semantic data mining,
and, more recently, domain driven data mining.
In this section we discuss the approaches and goals of each of these areas.
4.1.1 Inductive Logic Programming – Discussion and Arguments
Inductive Logic Programming (ILP) is a well known and studied paradigm of machine learning, concerned
with inducing classification rules from examples and background knowledge, all expressed in logical
representations, such as Prolog programs [NCW97, LM04, LE09]. It was born from the intersection of
Concept Learning and Logic Programming, with the goal of prediction within the representation
framework of Horn Clausal Logic (HCL).
The fact that all information must be written in declarative languages (like Prolog and Datalog) is
one of the drawbacks of ILP approaches, and one of the reasons why they are not widely used. Nevertheless,
their structure promotes the representation and use of domain knowledge. There are many ILP algorithms
that are able to introduce this knowledge into the discovery process (see, for example, [RR04, MEL01,
LM04, LR98, RV00, LM03, Lis05]).
ILP techniques must also deal with the tradeoff between the expressiveness and the efficiency of the
representations used. Studies show that current algorithms scale relatively well as the amount of
background knowledge increases, but they do not scale well with the number of relations involved, nor,
in some cases, with the complexity of the patterns being searched [D96, LM04].
4.1.2 Domain Driven Data Mining – Discussion and Arguments
The methodology of Domain Driven Data Mining, D3M, was proposed recently [CZ06, CZZ+07, CLZ07,
CYZZ10b, CYZZ10a], defending an urgent need for Actionable Knowledge Discovery (AKD) to support
businesses and applications.
The motivation behind D3M is the gap between academic objectives (innovation, performance and
generalization) and business goals (problem solving), and between academic outputs and business
expectations [CYZZ10a]. For data mining to be better accepted and advantageously applied in real
businesses and applications, it is necessary to create methods and tools capable of analyzing real world
data and extracting actionable knowledge, i.e., useful information that can be (as far as possible) directly
converted into decision-making actions. The term “actionability” measures the ability of a pattern to
prompt a user to take concrete actions to his or her advantage in the real world [CLZ07].
To achieve this, data mining must involve the ubiquitous intelligence surrounding the business problem,
such as human intelligence, domain intelligence, and network and organizational/social intelligence
[CYZZ10a]. Therefore, its proponents advocate a paradigm shift from data-centered knowledge discovery
to domain-driven actionable knowledge discovery.
Research in D3M has centered on the proposal of methods dedicated to specific domains, with a
special emphasis on the actionability of the results. The specificity of those methods makes it difficult
to apply them to other domains, and the need for a standard methodology able to incorporate domain
knowledge in the mining process remains an open issue. We can argue that existing work in D3M is
more centered on the actionability of results in some domain than on the reusability of the proposed
strategies.
4.1.3 Semantic Data Mining – Discussion and Arguments
The name Semantic Data Mining (SDM) has been used to denote several approaches to data mining, in
a not very consistent way:
1. Defining the semantics of DM [PDS08, Ant09b] (or semantic meta-mining [NVTL09, JLL10]).
Defining the semantics of the DM process itself may help in understanding the actual process, as well
as the dependencies between the several approaches. By identifying and formally representing the
respective inputs, outputs, configurations and even workflows, it is possible to discover problems and
solutions, and to envision more efficient strategies;
2. Extracting semantics from data [Set10, EC07]. This may be seen as the original goal of DM, which
is to discover useful knowledge from data. In this case, this knowledge or semantics usually takes the
form of keywords or features that give meaning to data;
3. Mining semantic data. With the explosive growth of the Semantic Web and of Internet resources,
there is more and more semantic information available worldwide. Mining this semantic information
directly may improve its understanding and use [TLT08, LVS+11];
4. Adding semantics to data. This approach is also powered by the growth of the Semantic Web, and
is best known as semantic annotation [DP08, Liu10]. By enriching data with semantics, it is possible
to help users understand the data, and to use it to get better results;
5. Using the existing semantics of some domain to guide DM algorithms. By incorporating the
knowledge inherent to each domain, data mining techniques are able to focus the search and modeling
process, and find more interesting results. Despite the advances, most of the existing work in this
area is designed for some specific domain, and therefore cannot be reused. Also, capturing and
representing the semantics of some domain is not straightforward, but there has been an effort
to create and use increasingly expressive forms of representation of domain knowledge
[Ant07, NVTL09, JLL10].
In this work we consider the last approach, and analyze in detail the use of domain knowledge to
guide DM algorithms in the search for more focused results. We focus mainly on the existing generic
forms of domain knowledge representation, and on the strategies created to incorporate these forms. The
motivation is that, by being able to use generic representations, the algorithms can be applied to any
real problem, and still guide the discovery process through the specific knowledge of that domain.
4.2 Domain Knowledge Representations
Modeling has been one of the core parts of information science, both in information systems and in
artificial intelligence. In both, it is generally accepted that, without a good model, no system works
adequately.
The advances in the areas of modeling and knowledge representation allow us to use mature formalisms
to represent existing knowledge, making it possible to explore those models to guide the discovery
process.
The use of domain knowledge in data mining has been a topic of extensive research, and several
representations have been proposed and analyzed, from simpler forms of knowledge, like annotations, to
more elaborated and expressive representations, like ontologies.
Each form of representation allows the formalization of more or less complex kinds of knowledge, and
therefore has its advantages and disadvantages, and can be used in different ways to guide the mining
process. Usually, the more complex the model, the harder it is to incorporate it efficiently.
The domain knowledge used by existing data mining approaches can be divided into: human interactions,
annotations (or labels), constraints, and graph-based models. Over time, several strategies have been
proposed to incorporate these forms of knowledge representation, some general, but most of them
ad hoc.
Human Interactions Techniques based on human interactions (known as interactive approaches)
involve the user or expert in the actual discovery process, letting them direct the flow of the
algorithms and influence the selection of results.
The reasoning behind these approaches is that, from the user's point of view, pattern discovery
is an interactive and iterative process [Bou04]. Users define the data to analyze, choose the
desired parameters and thresholds, and interpret and evaluate the quality and applicability of the
results.
However, the actual discovery phase is usually a black box, and therefore it is difficult to trace back
the results and find out which parameters or constraints led to the interesting ones. This leads
to another problem, related to the difficulty of choosing the best parameters and constraints.
Users do not always know exactly what they want a priori, and this black-box approach makes
it very inefficient to try different values (it is necessary to run the algorithms again with the
new parameters). To overcome this, we need interactive approaches, capable of involving the user
in the discovery process, and able to use their feedback iteratively and incrementally
[NDD99, GB00, GMV11].
Active learning techniques also require human interactions to help in the learning process. The
main idea is to make the system iteratively ask an oracle (e.g. the user or the expert) for new
information (e.g. labels or evaluations); from the answers, the system learns what the user's
knowledge and expectations are [Set10, XSMH06].
Annotations One simple form of attaching domain knowledge to data is to add labels or annotations
that characterize the context in which the user or expert wants the data to be
analyzed.
These labels can be, for example, the ratings given by customers on a social network, “relevant” and
“not relevant” tags, insights about important objects and possible relations, and desired categories.
In more complex or critical domains, like genetics, fraud detection, speech recognition and media,
these labels must be given by experts. However, with the rapid growth of data, most of the time it
is too expensive and time-consuming to ask humans to label all data [Set10]. Techniques like active
learning try to automate this labeling process, by iteratively and interactively learning how to label
new data points. But how are these labels used in DM to find useful models?
Labels have been used mostly by classification and semi-supervised clustering techniques, to train
or initialize the models that then categorize unknown or unlabeled instances [BBM02, SA12c].
Normally, the more labels used as background knowledge, the better the results. That is why
classification results are generally more accurate than those of semi-supervised approaches, which
are, in turn, better than unsupervised ones. However, the algorithms also depend on the quality of
the labels, and this does not guarantee that the results are more interesting, or that the algorithms
return fewer patterns/smaller models [SA12c].
Constraints The most used way to represent user expectations is through the definition of constraints
[Bay05]. Essentially, constraints are filters on the data or on the results that capture application
semantics and allow users to somehow control the search process and focus the algorithms on
what is really interesting. There are many types of constraints, from simple constraints that limit
the items appearing in patterns [SVA97], to more complex constraints requiring that patterns
conform to a regular expression [GRS99].
The work on constrained pattern mining is the most extensive and widely used, so we describe
it in more detail in the next section. We analyze the different types of constraints, as well as
their properties and the strategies for their incorporation in pattern mining. We also propose a new
framework to describe constrained pattern mining algorithms based on three dimensions:
constraint categories, constraint properties and data sources.
Graph-based Models Graph-based representations are a valuable and more expressive source of
domain knowledge, since they are able to capture the conceptual structure of the domain, and model,
in a more intuitive (and visual) way, the existing concepts and relations. Examples of graph-based
representations are taxonomies (or concept hierarchies), ontologies, Bayesian and Markov networks.
• Taxonomies are hierarchies of concepts, that can be seen as directed acyclic graphs (DAG),
containing the is-a relations existing between the concepts of the database. They have been
used in pattern mining in several ways: to enrich concepts in data with their ancestors in the
taxonomy, and therefore avoid redundant processing and duplicates [SA95, AS94]; and to find
patterns in all hierarchical levels, by mining one level each time, and use the results to mine
the next (more specific) level [HF95, MEL01, LM04].
• Ontologies are content theories about the objects, their properties and relations, that are
possible in a specified domain of knowledge, forming the heart of any system of knowledge
representation for that domain [CJB99]. They can be seen as extensions of taxonomies,
since they can represent not only the is-a relations between concepts, but also other types
of relations, hierarchies between relations, and axioms, that constrain the interpretation of
concepts [SHB06].
In a pragmatic view, an ontology mainly defines a directed graph, with concepts represented by
nodes and relations by edges, which can be efficiently traversed by domain-independent search
algorithms.
The use of ontologies in data mining with the purpose of finding more interesting results
is recent, and a great part of the existing works are ad-hoc applications to specific problems.
They have been used as a pre-processing step, to categorize and enrich data [KLSP07], or
as a post-processing step, to filter patterns or association rules based on the relations of
their concepts in the ontology and on user defined constraints [MGB08]. There are also some
approaches that are able to use ontologies to influence the discovery process itself, either
by defining a set of constraints based on the ontology and using those constraints to avoid
generating invalid [Ant07, Ant09b] or uninteresting candidates (that are too distant from each
other) [ME09b, ME09a], or by using it to replace instances by corresponding concepts and
using the relations to grow only valid patterns [JLL07].
• Bayesian networks encode the joint distribution over a set of attributes, and provide well-
understood inference mechanisms that ease the computation of the probability of arbitrary
events (in this case, combinations of concepts) in the network [JN07]. Bayesian networks
are an easily interpretable alternative language for expressing background knowledge, and are
used in frequent pattern mining to find whether the discovered knowledge is entailed by the
previously available knowledge [JS04, JS05].
• Markov Logic Networks (MLN) are one recent and promising example of a graph-based model
that can roughly be seen as Bayesian networks with weights [Dom07]. They are able to model
all possible worlds, with the corresponding dependencies, probabilities and weights, and the
probability of each world depends on the sum of the product of the weight of each formula
(combination of concepts and relations) with the number of corresponding instantiations that
are true in that world.
Markov networks are a powerful representation for joint distributions, but learning them from
data is extremely difficult, and therefore they have not been widely used. The Alchemy sys-
tem [KSR+07] includes inference and learning algorithms for MLNs, and has been used for
knowledge-rich data mining in several domains [DKP+06b], like information extraction, link
prediction, entity resolution and social network analysis.
Table 4.1 summarizes the analysis of the advantages and disadvantages of each of the above domain
knowledge representations.
It is important to note that existing approaches for each kind of representation are interrelated and
cannot be perfectly separated. One reason is that there are similar forms of knowledge formalization,
and therefore strategies for one may serve for another, with small adaptations. Another
reason is that it is possible to reformulate the problem of mining with one type of knowledge as
a similar problem with another type of knowledge. For example, we can define constraints from most
other knowledge representations, like ontologies, and in this manner it is possible to incorporate
ontologies using constraints and constrained algorithms.
4.3 Constrained Pattern Mining: Problem Definition
The oldest and most studied constraint in pattern mining is the minimum support threshold [AS94],
which states that, to be interesting, a pattern must occur more often than the given threshold. In
fact, what we call traditional pattern mining corresponds to the discovery of frequent itemsets from data.
Therefore, the minimum support is not usually considered as a constraint, but as a strong measure that
should be the basis of all other pattern mining approaches. In this sense, constrained pattern mining is
perceived as the use of constraints beyond the minimum support, i.e. the discovery of frequent itemsets
from data that satisfy some constraint.
In a normal constrained problem, we are dealing with one single table. In this sense, we follow the
notation used in previous chapters, but define the main concepts for a traditional single table environment.
Formally, let I = {i1, i2, . . . , im} be a set of distinct literals, called items. A subset of items is denoted
as an itemset. A superset of an itemset X is also an itemset, containing all items in X and more. The
support of an itemset X is its number of occurrences in the dataset. In this context, an itemset is frequent
Table 4.1: Advantages and disadvantages of the different forms of domain knowledge representations.

Human Interactions
Advantages:
• Facilitate interpretation and evaluation;
• Provide traceability of results;
• Easy to re-try with different parameters;
• Results are in accordance with user expectations;
• No need to express knowledge beforehand.
Disadvantages:
• Users do not always know what they want;
• Labor intensive for complex domains;
• Not easy to re-use in different domains;
• Users must learn how to interact with the system;
• There is no interface perfect for every user.

Annotations
Advantages:
• Results are more accurate, even for small fractions of labels;
• Results are more likely to be in accordance with the context;
• Normally, the more labels, the better the results.
Disadvantages:
• Labeling all data is too expensive and time consuming;
• Not effective for unbalanced datasets;
• The choice of seeds may negatively influence the results;
• Labels may be wrong;
• Using only one label is limited, and using multiple labels is not trivial.

Constraints
Advantages:
• Constraints capture application semantics;
• Allow the user to control the mining process;
• Reduce the number of results;
• Increase the efficiency of the algorithms;
• Improve the interpretability of results.
Disadvantages:
• Restricting too much leads to a simple hypothesis testing;
• Constraining too little leads to an explosion of results and less efficiency;
• More complex constraints are not trivially incorporated into the algorithms.

Graph-based Models
Advantages:
• More expressive power and easy to extend;
• Formally represent the experts' view of the domain;
• In general, are more intuitive representations of concepts and relations;
• Results more interesting according to the model;
• Independent from mining methods (as opposed to constraints).
Disadvantages:
• More computational complexity;
• Need for a graphical notation that can be used by mining methods;
• Algorithms must deal with multiple relations and mappings for the same concepts;
• There may be multiple models for the same domain, and the choice may influence the results;
• The more complex the models, the more difficult to understand them.
if its support is no less than a predefined minimum support threshold, σ ∈ [0, 1]: sup(X) ≥ σ ×N , with
N the total number of transactions in data.
Definition 9. A constraint C is a predicate on the powerset of I [PHL01], i.e. C : 2^I → {true, false}.
An itemset X satisfies a constraint C if C(X) = true.
A pattern corresponds to a frequent itemset that satisfies the constraint C, i.e. sup(X) ≥ σ × N ∧
C(X) = true. Given σ and C, the problem of constrained frequent pattern mining is to find all
patterns in a dataset that satisfy the imposed constraint.
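The problem just defined can be illustrated with a naive generate-and-test sketch in Python (the dataset and the item constraint below are made up, and practical algorithms push the constraint into the search rather than filtering at the end):

```python
from itertools import combinations

# Naive sketch of constrained frequent pattern mining: enumerate all
# candidate itemsets, keep those with sup(X) >= sigma * N that also
# satisfy a user-supplied constraint C. Data and constraint illustrative.

def constrained_patterns(transactions, sigma, constraint):
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    patterns = []
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            x = set(cand)
            support = sum(1 for t in transactions if x <= t)
            if support >= sigma * n and constraint(x):
                patterns.append((frozenset(x), support))
    return patterns

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
# item constraint: every pattern must contain item "a"
found = constrained_patterns(data, sigma=0.5, constraint=lambda x: "a" in x)
```

Here {b, c} is frequent (support 2 out of 4) but is filtered out because it violates the constraint.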
4.4 A new Framework for Constrained Pattern Mining
In this section we gather the different constraints proposed in the literature and analyze them in terms
of their semantics and properties. We also examine closely the existing strategies for their incorporation
into pattern discovery, and how these constraints and strategies are adapted to different types of data
sources. In this sense, we propose the framework for constrained pattern mining presented in Fig. 4.1. This
framework is a classification scheme to organize and analyze constrained pattern mining algorithms, based
on three different perspectives that influence the choice of the strategies to use: constraint categories,
constraint properties and data sources.
Figure 4.1: A framework for constrained pattern mining.
Categories: According to the semantics of the constraints, we can divide them into a set of different
categories. These categories are defined based on what is being constrained. As an example, we
may apply constraints over the items, the values of the items, the relations among items, etc.
Properties: Apart from the categories, constraints can be categorized by a set of properties, according
to their behavior when adding items to or removing items from an itemset. For example, there are
constraints that, once violated by one itemset, are always violated by any of its supersets (such
constraints are called anti-monotonic). These behaviors, or properties, allow us to define and apply more
generic and efficient strategies in order to introduce these constraints into the discovery process.
Data Sources: The nature of the data sources may also influence the constraints and the strategies that
may be used. Data can be tabular or multi-relational, and the source can be dynamic or static.
The challenges introduced by these more complex types of data require, for example, the definition
of new types of constraints, or/and the nullity of some assumptions, like the persistency of data.
Associated with each constraint are also the strategies used to incorporate it. These strategies
depend on the category and properties of the constraints, but also on the data source. There are many
ad hoc strategies, designed for some specific constraints, and some more generic approaches, designed for
constraints following a specific property. However, there is no algorithm able to efficiently incorporate all
types of constraints, and the great majority is designed for tabular and static data sources only.
In the next sections we describe the framework in more detail.
4.5 Constraint Categories
According to the semantics and form of constraints, they can be divided into the following categories1. Let
P be a pattern, and P.attr be the value of all elements of the pattern P for attribute attr (e.g. P.price
corresponds to the price of all products in P ):
1. Content constraints: These constraints correspond to filters over the content of the discovered
patterns. They are conditions over the value of the items that would appear (or not) in the resulting
patterns. They try to capture the semantics of the application and introduce it into the mining
process.
(a) Item constraints: They express conditions on the presence or absence of some items in the final
patterns [SVA97]. These were the first constraints proposed beyond the minimum support.
For example, a school teacher may be interested in patterns relating his discipline with others:
maths ∈ P . Thus, these constraints allow for the discovery of patterns that relate some specific
known items with others unknown. From another perspective, a school teacher may also be
interested in patterns containing only the students of his discipline, instead of all students:
P ⊆ {s1, s2, ..., sn}, with si the students in question. So, they also allow for the discovery of
unknown frequent relations between the known elements.
(b) Value constraints: These constraints assume that a value is associated with each item, and
limit this value for every element of a pattern [NLHP98].
For example, a market customer may only be interested in products whose price is less than
a specific value. In this manner, the constraint P.price ≤ e 100 will only return patterns of
products with price not exceeding e 100.
Another interesting application of value constraints is weighted pattern mining, where items
have a weight that shows their importance. We can establish a weight constraint P.weight ≥ w
to indicate to the algorithm that we are only interested in itemsets with a weight higher than
w [YL05].
(c) Aggregate constraints: These constraints also assume that a value is associated with each item,
and that several aggregate functions (e.g. sum, average, max, min) can be used over these
values [NLHP98]. An aggregate constraint limits the value of aggregate functions over the set
of items in the patterns.
For example, a marketing analyst may be interested in products for undergraduate students,
and therefore the maximum age in the target audience of products in each pattern should be
18 years (max(P.age) ≤ 18). Or he can be interested in sets of products with an average price
no higher than a given value (avg(P.price) ≤ v).
Formally, aggregate functions can be divided into three categories [HKP11]: distributive,
algebraic and holistic. Distributive functions can be computed in a distributed manner, i.e.
applying them to each partition and then applying them to those partition results is the
same as applying them to all data without partitioning (e.g. min, max, count and sum).
Algebraic functions can be obtained by applying some algebraic operator to two or more
results from distributive functions (e.g. avg can be computed as sum/count; other examples
include min_N, max_N, variation and standard deviation). Due to these properties, distributive
aggregate constraints are usually easier to push into PM algorithms [NLHP98]. Algebraic
aggregate constraints need more attention, because we can only confirm that they are satisfied
after computing the sub-functions over all elements in the itemsets. Even so, some efficient
techniques were proposed to deal with such constraints [PHL01, PH02, WJY+05, ZCD07].
Finally, aggregate functions can also be holistic, meaning that there is no algebraic function
that characterizes their computation (e.g. mode and median). Holistic aggregate constraints
are more difficult to push, i.e. it is difficult to create a generalized strategy to push them.
1 In this work we adopt and extend the notation presented by Ng et al. [NLHP98].
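The three categories of aggregate functions can be illustrated with a small sketch (the item prices below are made up):

```python
# Illustrative checks of aggregate constraints over item values.
# Distributive functions (sum, min, max, count) can be maintained
# incrementally while an itemset grows; algebraic ones (avg) are derived
# from distributive parts (sum / count); holistic ones (median) need
# all the values at once. Prices are made up.

price = {"a": 10.0, "b": 80.0, "c": 40.0}

def satisfies_sum(itemset, limit):
    # distributive: sum(P.price) <= limit
    return sum(price[i] for i in itemset) <= limit

def satisfies_avg(itemset, limit):
    # algebraic: avg(P.price) <= limit, computed as sum / count
    values = [price[i] for i in itemset]
    return sum(values) / len(values) <= limit

def satisfies_median(itemset, limit):
    # holistic: median(P.price) <= limit, needs all the values
    values = sorted(price[i] for i in itemset)
    mid = len(values) // 2
    median = values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2
    return median <= limit
```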
2. Structural constraints: These constraints define conditions on the content and on the structure
of data [Ant09a].
(a) Length constraints: They specify a limit on the length of the patterns, i.e. on the maximum
or minimum number of items in each pattern.
For example, a sales analyst may be interested in patterns with at most 5 products (patterns
with more items usually have lower support, and are not significant): |P | ≤ 5. This accelerates
the mining process and also limits the number of results.
(b) Sequence constraints: The most studied structural constraints have been represented as regular
expressions.
Formal languages, such as regular and context free languages, provide a simple, natural syntax
for the specification of sequences, and have sufficient expressive power for specifying a wide
range of constraints [GRS99, AO02, PHW07]. Enforcing regular expressions (RE) into the
mining process minimizes the computational cost, by focusing only on sequences that can
potentially be in the final answer set. RE constraints are specified as RE over the set of items
using the established set of regular expression operators (like disjunctions). They specify the
possible (or the most interesting) combinations of items, and the order they should have. A
sequential pattern satisfies the constraint if it is accepted by the equivalent deterministic finite
automaton or push-down automaton.
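A minimal sketch of this acceptance test, with a hand-built, purely illustrative DFA for the regular expression a (b | c)* d:

```python
# Checking a sequence against an RE constraint by running it through an
# equivalent deterministic finite automaton. The DFA below is hand-built
# and illustrative; it accepts sequences matching a (b | c)* d.

dfa = {
    ("start", "a"): "middle",
    ("middle", "b"): "middle",
    ("middle", "c"): "middle",
    ("middle", "d"): "accept",
}
accepting = {"accept"}

def satisfies_re(sequence):
    state = "start"
    for symbol in sequence:
        state = dfa.get((state, symbol))
        if state is None:       # no transition: constraint violated
            return False
    return state in accepting
```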
(c) Network constraints: One promising type of constraints are network constraints, which are
defined based on the characteristics of domain knowledge in the form of a network (a graph-
based representation like taxonomies and ontologies). These networks model the concepts
existing in the domain, as well as the (hierarchical and possibly non-hierarchical) relations
between these concepts. Each item in the database can be mapped to the corresponding
concept in the network. This means that we can restrict both the concepts associated with
each item, as well as the relations between items in an itemset [Ant08, Ant09b].
i. Conceptual constraints: They express conditions on the presence or absence of some con-
cepts in the patterns. One concept is said to be present in an itemset if it contains some
item that is mapped to that concept in the network. These constraints are like item constraints,
but instead of specific items, we are looking for the specific concepts to which
they are mapped. In this sense, we can specify, for example, that one or one set of con-
cepts must (or cannot) be present in patterns, or restrict the possible concepts to one
specific accepted (or unaccepted) set. Most content constraints (and others, like sequence
constraints) can also be applied to concepts instead of items.
ii. Taxonomical constraints: These constraints establish restrictions based on the family ties
among concepts, defined by some taxonomy [Ant08]. We can require, for example, that
the concepts in a pattern belong to the family of a specific concept, or that they belong to the
same family (same family constraints). If there are multiple hierarchies, we may want to
find patterns whose concepts belong not to the same family, but to a closer family
(close family constraints). We can also define constraints to require that the concepts in
patterns must belong to some or to the same hierarchical level (level constraints).
iii. Relational constraints: If the network also models non-taxonomical relations, we can re-
strict the type and number of relations between items in each pattern. Two items are
related if the concepts for which they are mapped are related in the network. The sim-
plest relational constraint is to limit the presence or absence of some relations in patterns.
But we may also create constraints based on the connectivity between items. For example,
(1) all items must be related to at least one other item (weakly connected), or to all others
(strongly connected); and (2) there must be a chain of relations between items (softly
connected).
iv. Distance constraints: Distance constraints limit the number of indirect relations that
connect two concepts (and therefore two items) [ME09b]. For concepts related in more
than one way, the distance is the smallest one, i.e. the lowest number of edges between
two concepts. These constraints allow us to define to what extent the user considers
two items related, and therefore how important that relation is. As an example, we can
consider that relations with more than three indirections are not important, and therefore
we can impose a maximum distance between concepts. Distance constraints also allow us
to guarantee that the items in each pattern share the same context, by being all related.
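A possible sketch of a maximum distance constraint, computing distances by breadth-first search over a small, illustrative concept network:

```python
from collections import deque

# Distance between two concepts = fewest edges between them in the
# (undirected view of the) domain network. Graph is illustrative.

network = {
    "milk": {"dairy"},
    "dairy": {"milk", "cheese", "food"},
    "cheese": {"dairy"},
    "food": {"dairy", "fruit"},
    "fruit": {"food"},
}

def distance(a, b):
    """Breadth-first search for the shortest path between two concepts."""
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in network[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return float("inf")

def satisfies_max_distance(itemset, limit):
    # every pair of concepts must be within the given distance
    return all(distance(x, y) <= limit
               for x in itemset for y in itemset if x < y)
```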
(d) Temporal constraints: These constraints restrict the resulting patterns based on the temporal
dimension. They allow us to find temporal and sequential patterns, analyze their evolution
over time, limit the duration and gap between events, etc. [SA96, Zak00b, PHW02, AO03].
Temporal constraints are usually defined in databases where each transaction has a timestamp,
and each pattern is a frequent ordered sequence of time stamped itemsets.
i. Duration constraints: They limit the time between the oldest and newest event in the
pattern, i.e. they indicate that the timestamp difference between the first and the last
transactions in the pattern must be longer or shorter than a given period.
For example, for short-term pattern analysis, we may impose a limit of at most 3 months,
and for long-term analysis, we may say that we are interested in patterns where the
duration is at least 1 year.
ii. Gap constraints: Gap constraints define the maximum or minimum time interval between
consecutive events in each pattern, i.e. the timestamp difference between every two adja-
cent transactions must be longer or shorter than a given gap value.
For example, in a medical domain, doctors may specify that the maximum gap between
two exams must be 6 months to obtain relevant patterns, so that they help on a correct
diagnosis or treatment.
iii. Periodical constraints: These constraints define a periodicity in which patterns should
hold. This concept was first introduced by Ozden [ORS98] for association rules, in which
the time dimension is divided into equally spaced user-defined time intervals, and a rule
is said to be cyclic if it holds for a fixed periodicity along the whole length of the sequence
of time intervals. This allows for the discovery of seasonal patterns.
For example, in an educational domain, teachers may be interested only in patterns that
occur every semester during exams (low number of students in classes, or high affluence
to office hours).
When mining temporal databases, one can also consider other types of constraints, like lifespan
and growth constraints. Lifespan constraints impose a limit on the lifetime of items in patterns,
i.e. they define a maximum or minimum time interval between the first and the last appearance
of each item in the database. And growth constraints were proposed to capture emerging
patterns and their evolution over time [DL99]. A pattern is emergent if its support increased
more than a given threshold in the most recent time interval. Thus, a growth constraint
defines a limit on the growth rate of patterns (i.e. the ratio of the support of the pattern in
the most recent period over its support in the previous time period). Convergent and divergent
constraints look for patterns whose period shrinks or grows along time [BA14].
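Duration and gap constraints can be sketched as simple checks over the ordered timestamps of a candidate pattern (timestamps below are illustrative, in days):

```python
# Checking duration and gap constraints over a time-stamped sequence of
# itemsets; only the (sorted) timestamps matter here. Data is made up.

def satisfies_duration(timestamps, max_duration):
    # time between the oldest and the newest event
    return timestamps[-1] - timestamps[0] <= max_duration

def satisfies_gap(timestamps, max_gap):
    # time between every two adjacent events
    return all(b - a <= max_gap for a, b in zip(timestamps, timestamps[1:]))

visits = [0, 50, 120]   # days of three transactions of one patient
```

With `visits`, a 180-day duration limit holds, but a 60-day maximum gap is violated by the 70-day interval between the last two events.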
(e) Other : Other structural constraints are being proposed in domains like graph pattern mining
[ZYHY07] (density ratio, density, diameter, edge and vertex connectivity), defined according
to the number of edges and vertices on the graphs.
3. Interestingness measures: Interestingness measures are constraints that impose quantitative
conditions over the set of items in the pattern or rule. They rank the results by their usefulness
and utility, according to some user-chosen function. Usually, only the results that surpass a user
defined threshold are considered interesting, and therefore only those are presented to the final user.
The choice of the best interestingness measure, as well as the choice of the threshold that separates
interesting from uninteresting patterns, are two non-trivial problems of pattern mining that may
have a great impact on the quality of the results.
The best known interestingness measure is the minimum support, which has been used since the
first proposal of pattern mining [AS94]. Establishing a minimum support threshold allows us to draw
a limit on the support beneath which we consider itemsets infrequent, and therefore not interesting
information that can be discarded. It gives pattern mining several important advantages [Bay05],
since it preserves the discovery of unknown and important patterns and improves the efficiency of
the algorithms during mining. However, it suffers from some limitations that are more evident
when we start dealing with larger and denser datasets. The main limitation is that the results
may be redundant and numerous; they are also not user-oriented, and thus may not
correspond to user expectations.
Other interestingness measures have appeared trying to improve the quality of the results that are
returned to the user. Most of them are still not user oriented, but provide a good way of reducing
the number of results. Examples are rule based measures, such as confidence [AS94], correlation
(including lift, cosine, χ2 and all confidence) [AS94, BMS97, HCXY07] and the improvement of
a rule [Bay05, BA99]. They measure the interestingness of association rules, which are generated
based on the patterns found by pattern mining.
Most existing interestingness measures, with the exception of the minimum support, are used only
to evaluate the resulting patterns. However, Bayardo [Bay05] showed that some of these measures
can be rewritten in a form that is composed of elements of other forms of constraints, so that they
can be used during the mining process.
Some studies have also been conducted to find interesting and unexpected patterns, based on
what is already known. Padmanabhan and Tuzhilin [PT98] use probability-based belief to describe
user confidence in unexpected rules. Wang and Lakshmanan [WJL03] are able to capture the
unexpectedness and strength of a rule. Jaroszewicz and Simovici [JS04] define the interestingness
of an itemset as the absolute difference between its support in the data and its expected support.
The user defines a minimum threshold, and the reasoning behind it is that, if the difference is small,
the itemset is uninteresting, since it is already known [JS05]. More recently, Cao et al. [CLZ07]
introduced knowledge actionability to measure the ability of a pattern to be converted to a concrete
action in the real world.
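As an illustration, two of the rule-based measures mentioned above, confidence and lift, can be computed directly from itemset supports (the counts below are made up):

```python
# Rule-based interestingness measures for a rule A -> B, computed from
# absolute supports: sup(A ∪ B), sup(A), sup(B), and N transactions.
# All counts below are illustrative.

def confidence(sup_ab, sup_a):
    # fraction of transactions with A that also contain B
    return sup_ab / sup_a

def lift(sup_ab, sup_a, sup_b, n):
    # ratio of the observed joint support to the support expected
    # if A and B were independent; lift = 1 means no correlation
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))
```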
4.6 Constraint Properties
There are several different constraints, which hinders the creation of algorithms able to incorporate
them without being specific to some particular constraint. Fortunately, studies show that constraints have
some properties that allow for efficient and generic strategies to prune the search space and improve the
performance of the algorithms.
These “nice” properties [PH02] are:
1. Anti-monotonicity: A constraint is said to be anti-monotone if and only if, whenever an itemset X
violates it, so does any superset of X. Also, a disjunction or a conjunction of anti-monotonic
constraints is also an anti-monotonic constraint.
For example, assume an item constraint saying that all items must belong to an accepted set of
items V (X ⊆ V ). If an itemset X violates it, it means that it contains some item i /∈ V . All
supersets of X will have that item, and therefore all supersets will violate the constraint.
The best known and simplest example of an anti-monotone constraint is the minimum support
threshold [AS94]. As an anti-monotone constraint, if an itemset is infrequent, so are all of its supersets.
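A minimal sketch of how this property is exploited: candidates are generated only from itemsets that satisfied the anti-monotone constraint (here, the minimum support), and then tested; by anti-monotonicity, no frequent itemset is missed. The transactions below are illustrative.

```python
# Level-wise (Apriori-style) sketch exploiting anti-monotonicity:
# itemsets that violate the minimum support are never extended, so the
# search space is pruned. Transactions are made up.

def frequent_itemsets(transactions, min_sup):
    def support(x):
        return sum(1 for t in transactions if x <= t)
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    result = list(level)
    while level:
        # build size k+1 candidates only from surviving size-k itemsets,
        # then test their support
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_sup]
        result.extend(level)
    return result

transactions = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"d"}]
freq = frequent_itemsets(transactions, min_sup=2)
```

Here the infrequent items c and d are pruned at the first level, so no superset containing them is ever counted.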
2. Monotonicity: A constraint is said to be monotonic if and only if, whenever an itemset X satisfies it, so
does any superset of X [GLW00]. Conjunctions and disjunctions of monotonic constraints are still
monotonic, and monotonic constraints can be seen as the negation of anti-monotonic constraints.
Following the example above, imagine now an item constraint defining that every pattern must
contain at least the items from a set V (V ⊆ X). If an itemset violates it, i.e. does not contain
all the required items, a superset can satisfy it by introducing the missing items of V. However, if an itemset
satisfies the constraint (i.e. it contains all the required items), all supersets also satisfy it, because
they contain the same items and more.
3. Succinctness: In its essence, a constraint is succinct if it is possible to enumerate all possible
patterns, based on the powersets of the elements of the alphabet of items [NLHP98].
A simple example is the value constraint X.price ≤ €100. It is a succinct constraint because
we can select from the alphabet all items with price ≤ €100 using the selection predicate: I1 =
ρ_{price ≤ €100}(Items), and the itemsets that satisfy the constraint are exactly those in the strict
powerset of I1: 2^I1. Another example is the item constraint {a} ⊆ X. We can select all items from
the alphabet that are not a, using the predicate: I2 = ρ_{item ≠ a}(Items), and say that all itemsets
resulting from the powerset of I2 do not contain a, and cannot be patterns. It is a succinct constraint
since we can define that the itemsets that satisfy it are, exactly, 2^Items − 2^I2 (the powerset of all
items in the alphabet, except the powerset of I2).
Formally, a succinct constraint is defined as follows:
• An itemset X ⊆ Items is a succinct set if it can be expressed as ρ_p(Items), for some selection
predicate p;
• SP ⊆ 2^Items is a succinct powerset (SP) if there is a fixed number of succinct sets
I1, I2, ..., Ik ⊆ Items, such that SP can be expressed as unions and differences of the strict
powersets of I1, I2, ..., Ik;
• A constraint is succinct if the set of itemsets that satisfy it is a succinct powerset.
A succinct constraint can be considered a special case of conjunctions of anti-monotonic and mono-
tonic constraints.
Another characteristic of these constraints is that we can easily define a function that generates
the members of a satisfying itemset: the member generating function (or MGF). In our first example,
the MGF is simply {X | X ⊆ I1 ∧ X ≠ ∅} (all non-empty subsets of I1). In the second example,
the MGF is {X1 ∪ X2 | X1 = {a} ∧ X2 ⊆ I2} (the union of itemset {a} with all subsets of
I2).
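A sketch of a member generating function for the first example above (X.price ≤ €100), enumerating the satisfying itemsets directly instead of generating and testing; the alphabet and prices below are made up:

```python
from itertools import combinations

# MGF for the succinct constraint X.price <= 100: the satisfying
# itemsets are exactly the non-empty subsets of the items that
# individually satisfy the selection predicate. Prices are made up.

price = {"a": 20.0, "b": 150.0, "c": 99.0}

def nonempty_subsets(items):
    s = sorted(items)
    return [set(c) for r in range(1, len(s) + 1) for c in combinations(s, r)]

i1 = {i for i in price if price[i] <= 100}   # the succinct set I1
satisfying = nonempty_subsets(i1)            # members generated directly
```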
4. Prefix-monotonicity: A constraint is prefix-monotone2 if there is an order of items that allows
the algorithms to treat it as anti-monotonic or monotonic. By fixing an order on items, each
transaction can be seen as a sequence, and therefore we can use the notion of prefixes and suffixes,
as the first or last items in the ordered transaction, respectively.
A constraint is prefix-monotone if it is prefix anti-monotonic or prefix monotonic. Formally, a
constraint C is prefix anti-monotonic (resp. prefix monotonic) if there is an order R over the
set of items such that, assuming each itemset X = i1i2...in is ordered according to R, whenever
an itemset X violates (resp. satisfies) C, so does any itemset with X as prefix
(X′ = X ∪ {in+1} = i1i2...inin+1).
For example, an aggregate constraint like C ≡ avg(X) ≥ 20 is neither monotonic, nor anti-
monotonic, nor succinct. But, if we order the items in value-descending order (and assume only
positive values), an itemset X = i1i2...in has a higher average than any extension X′ = i1i2...inin+1.
This means that, if X violates C, so do all itemsets with X as prefix. Thus, C is prefix anti-monotonic.
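A simplified sketch of this idea, exploring only the single value-descending chain of prefixes and pruning as soon as the average drops below the threshold (item values are illustrative):

```python
# Exploiting prefix anti-monotonicity for avg(X) >= threshold: with
# items in value-descending order, extending a prefix can only lower
# the average, so a violating prefix is pruned immediately. This sketch
# is simplified to one chain of prefixes; values are made up.

value = {"a": 50.0, "b": 30.0, "c": 10.0, "d": 5.0}
order = sorted(value, key=value.get, reverse=True)  # value-descending

def prefixes_satisfying_avg(threshold):
    kept, prefix, total = [], [], 0.0
    for item in order:
        prefix = prefix + [item]
        total += value[item]
        if total / len(prefix) < threshold:
            break               # prune: no longer prefix can satisfy C
        kept.append(list(prefix))
    return kept
```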
5. Mixed Monotonicity: Leung and Sun [LS12] recently proposed the concept of mixed monotone
constraints, to define constraints that are both anti-monotonic and monotonic at the same time,
for different groups of possible values (positive and negative).
Formally, let Item denote the set of items, which is divided into two disjoint groups based on the
sign of their attribute values: ItemP , the set of items with positive value (including 0), and ItemN ,
with negative value. Then, a constraint is mixed monotone if, for any itemset X: (a) whenever
X satisfies C, all supersets of X formed by adding items from one specific group, also satisfy C
(monotonic for that group); and (b) whenever X violates C, all supersets of X formed by adding
items from another group, also violate C (anti-monotonic for the other group).
This property was proposed in particular for aggregate constraints using the sum function, where
items may contain negative numerical values. The aggregate constraint sum(X) ≥ v, for example,
2Prefix-monotone constraints were first proposed under the name of convertible constraints [PH02]. Since we can convert other constraints using several approaches (like using relaxations), we use the term prefix-monotone to designate the constraints that are convertible due to the order of items.
is monotonic for positive values (including zero), and anti-monotonic for negative values. The aggre-
gate constraint sum(X) ≤ v is, on the contrary, anti-monotonic for positive values and monotonic
for negative ones.
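A small check of this property for C ≡ sum(X) ≥ v (the value v = 10 and the itemsets are illustrative): adding items from ItemP preserves satisfaction (monotone on that group), while adding items from ItemN preserves violation (anti-monotone on the other group).

```python
# Sketch of mixed monotonicity for C = sum(X) >= v.

def C(itemset, v=10):
    return sum(itemset) >= v

satisfying = [8, 5]                      # sum = 13, satisfies C
assert C(satisfying)
assert all(C(satisfying + [p]) for p in [0, 1, 7])      # ItemP extensions

violating = [4, 2]                       # sum = 6, violates C
assert not C(violating)
assert all(not C(violating + [n]) for n in [-1, -5])    # ItemN extensions
```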
Tables 4.2 and 4.3 associate these properties with the content and structural categories, respectively.
Table 4.2: Content constraints and respective properties (∗ means it depends on the function). [Table body garbled in extraction; it marks, for each constraint and each applicable operator θ, which of the properties AM, M, Succinct, PAM, PM and Mix hold. Rows cover the content categories: Item constraints (x θ P, X θ P and P θ X, with θ ∈ {∈, ∉, ⊆, ⊈}), Value constraints (min(P) θ v and max(P) θ v, with θ ∈ {<, ≤, >, ≥}), Aggregate constraints (sum(P) θ v over positive or negative values, sum(P) θ v over positive and negative values, which is only Mix, and avg(P) θ v), Algebraic constraints (f(P) θ v, for prefix decreasing or prefix increasing f), and the Holistic constraint median(P) θ v.]
Table 4.3: Structural constraints and respective properties. [Table body garbled in extraction; it marks, for each constraint and each applicable operator θ, which of the properties AM, M, Succinct, PAM, PM and Mix hold. Rows cover the structural categories: Length (|P| θ v), Sequence (regular expressions, with no property), Conceptual – Taxonomical (like Item constraints) and Relational (same-family, close-family and level(P) θ l), Network – Relational (weakly, softly or strongly-connected, like Item constraints) and Distance (distance(P) θ v), and Temporal – Gap (gap(P) θ v) and Duration (duration(P) θ v).]
4.7 Data Sources
Despite the advances in constrained pattern mining, the great majority of existing work is only concerned
with tabular data. However, the rapid growth of data, in both quantity and variety of data
structures, has brought new requirements to data mining techniques.
On the one hand, in many real-world applications data appear in the form of continuous data streams, as
opposed to traditional static datasets. In this sense, data sources can be:
Static: When we are in the presence of static data sources, we can make some assumptions over these
data that ease the definition and incorporation of constraints into the mining algorithms. These
assumptions are: (1) all data are available from the beginning, and therefore we can know in
advance, for example, what are all possible items in the dataset (the alphabet); (2) no new data
will appear, and therefore decisions are generally taken based on all data and are persistent. For
example, after reading the available data, infrequent items are effectively infrequent and can be
eliminated. If these items could appear later, they could become frequent, which would invalidate
the former decision of their deletion; (3) since all data are available, we can usually make several
passes over data; and (4) there are typically fewer memory and time limitations.
Continuous: Data sources are continuous, or data streams, if they are continuously being generated
and collected [MM02]. The nature of streaming data makes the mining process different
from traditional data mining in several aspects, as discussed in Section 3.2: (1) each element should
be examined at most once and as fast as possible; (2) memory usage should be limited, even
though new data elements are continuously arriving; and (3) the results generated should always be
available and up to date. This means that only the information strictly necessary to avoid
losing patterns should be kept [LLH11], and the rest must be deleted. This introduces errors in
frequency counting, and these errors should be kept as small as possible. Moreover, none of the
assumptions valid for static datasets can be made: data are not all available a
priori, and conclusions may not be persistent (items reported as infrequent may become frequent
later).
On the other hand, this new era has made it necessary to create new and more efficient ways to store and
analyze these data. In fact, the data storage paradigm has changed, from operational databases to data
repositories that make it easier to analyze data and to find useful information.
Despite this change, and the fact that most real-world applications involve multiple tables, and
eventually multiple data sources, existing algorithms for constrained pattern mining are only able to deal
with a single data table. Conversely, existing algorithms for mining multiple tables (described
in Chapter 2) are not able to deal with constraints.
As noted in Section 2.3, dealing with multiple tables introduces a set of new challenges to pattern
mining algorithms, and even more to constrained mining, due to the nature of the data. Multi-relational
models contain not only transactional data (the occurring events or transactions) but also non-transactional
data (the characteristics of entities) [SA13c]; thus, when mining these models, we are mining both types
of data. The problem is that existing constraints were proposed for transactional data (the goal is to
constrain the co-occurrences of entities based on their characteristics, not to constrain the co-occurrences
of the characteristics themselves), and this requires the adaptation of both the constraints and the algorithms.
Despite these difficulties, mining these multi-relational models is also an opportunity for constrained
mining. The existing relations between tables may lead to the definition of new constraints based on the
structure of the models, so that we can guide the algorithms through the relations that are most
interesting from the user's point of view.
New and more complex data types have also appeared and become popular, such
as social networks and other graph-based models. Research on data mining over these data sources
is increasing, but there is still much work to do regarding the incorporation of constraints into the mining
of these sources.
4.8 Constrained Pattern Mining Algorithms
Enforcing these constraints in pattern mining is not trivial, and depends heavily on the constraints
in question.
Performing an exhaustive search is not a viable solution, mostly due to the size of the search space. A
naive approach starts by running an existing traditional pattern mining algorithm, and only then tests
the constraints and filters out the patterns that do not satisfy them. However, most of the time, the itemsets
that satisfy some constraint are far fewer than the ones resulting from a traditional pattern mining run.
Therefore, this first step is unnecessarily time-consuming in a constrained environment.
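The naive generate-then-filter approach can be sketched as follows (the data, the `mine_frequent` helper and the constraint "a" ∈ X are all illustrative, not from the thesis); the wasteful part is that everything is mined before the constraint is even looked at.

```python
# Toy generate-then-filter: mine all frequent itemsets exhaustively,
# then post-filter with the constraint "a" in X.
from itertools import combinations

def mine_frequent(transactions, minsup):
    """Exhaustive toy miner: all itemsets with support >= minsup."""
    items = sorted({i for t in transactions for i in t})
    result = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = set(cand)
            if sum(s <= t for t in transactions) >= minsup:
                result.append(s)
    return result

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
frequent = mine_frequent(transactions, minsup=2)
constrained = [s for s in frequent if "a" in s]   # the wasteful second step
```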
Several algorithms have been proposed in the literature for the integration of constraints into pattern
mining, some designed for a particular type of constraint, others more general, designed for all
constraints following some “nice” property. Nearly all algorithms were proposed for single transactional
tables, although there are some proposals able to deal with data streams [LK06].
4.8.1 Properties vs. Algorithms
1. Anti-monotonicity: Apriori-like [AS94], pattern-growth [HPY00] and vertical [Zak00b]
methods all use the anti-monotonicity of minimum support to stop exploring itemsets that are not
frequent. Their idea is to start from frequent length-1 itemsets and iteratively find longer frequent
itemsets. Apriori-like methods iteratively generate all candidates (supersets) of current frequent
itemsets, and test them for frequency. Infrequent itemsets are discarded and therefore not used in
the next candidate generation step. Pattern-growth methods recursively grow frequent smaller pat-
terns to longer ones, based on the co-occurrences of items in the database, with no need to generate
all candidates. Infrequent itemsets are also discarded and are not grown. Vertical algorithms first
transform the database into a vertical data format, in which, instead of having a set of items per
transaction id, they have a set of transaction ids per item. The number of ids per item corresponds
to the support of that item, so these methods do not need to scan the database to count
the support of items, nor that of larger itemsets. The strategy is similar to pattern-growth algorithms,
and larger itemsets are formed by intersecting the sets of ids of the corresponding smaller frequent
itemsets.
In this manner, it is fairly intuitive to push other anti-monotone constraints into those approaches:
we can only use the itemsets that satisfy the constraint to generate/grow longer itemsets, i.e. we
can discard all itemsets that do not satisfy the constraint, because their supersets will also violate
it [PH02]. This strategy is the basis of existing algorithms, when in the presence of anti-monotone
constraints [NLHP98, PH00, PH02, BJ05].
Anti-monotonicity, if used actively, can drastically reduce the search space. It is the strongest
property, being the one that allows the algorithms to prune the most, with the least effort, minimizing
the computational cost while maximizing the efficacy of the results. However, it
is not possible to ensure the efficiency of pushing this type of constraint, since it depends on its
selectivity [Bou04], i.e. the rate of itemsets that can be discarded: the less selective a constraint is,
the less efficient the pruning.
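The pruning strategy can be sketched with a toy level-wise search (the prices, transactions and the anti-monotone constraint max(X.price) ≤ 100 are illustrative assumptions of ours): itemsets that violate the constraint are discarded and never extended, exactly like infrequent ones.

```python
# Level-wise search pushing an anti-monotone constraint: candidates that
# are infrequent OR violate the constraint are pruned and never extended.
price = {"a": 20, "b": 150, "c": 60}
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
minsup = 2

def frequent(s):
    return sum(s <= t for t in transactions) >= minsup

def satisfies(s):                        # anti-monotone: max price bound
    return max(price[i] for i in s) <= 100

level = [s for s in (frozenset([i]) for i in price)
         if frequent(s) and satisfies(s)]
answers = list(level)
while level:
    candidates = {a | b for a in level for b in level
                  if len(a | b) == len(a) + 1}
    level = [s for s in candidates if frequent(s) and satisfies(s)]
    answers.extend(level)
```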
2. Monotonicity: In the case of a monotonic constraint, we cannot discard itemsets that violate
it, because its supersets can satisfy it. However, when we find an itemset that satisfies it, we can
automatically generate all possible supersets of that itemset and return those that are frequent,
without further testing for the constraint. Thus, monotonic constraints can also be used to improve
the efficiency of pattern mining, by avoiding multiple unnecessary tests.
The basic strategy is to find the frequent k-itemsets and, for those that satisfy the constraint, skip
the constraint test when generating/growing frequent (k+1)-itemsets [PHL01, BJ05].
It has been shown that the strategy for anti-monotonic constraints is more powerful, since it can
eliminate many more itemsets early than the monotonic strategy [GRS99]. But again, it depends on
the constraint selectivity.
3. Succinctness: A succinct constraint is, at the same time, succinct and anti-monotonic (e.g.
X.price ≤ €100) or succinct and monotonic (e.g. {a} ⊆ X). The way to push them depends
on that:
If the constraint is both succinct and anti-monotonic, we can prune from the beginning the items
that do not satisfy it (i.e. use only the elements of the respective member generating function –
MGF). In our first example, we can discard all items with price higher than €100, because no such
item will satisfy the constraint.
If the constraint is succinct and monotonic, we cannot eliminate items, but we know, from the
MGF, which items and combinations satisfy the constraint. Therefore we can start with the possible
values of the first member, and from there generate candidates by joining these values with the next
member, one by one. In our second example, the corresponding MGF is {X1 ∪ X2 | X1 = {a} & X2 ⊆ I2},
and since the first member must be the element a (it satisfies the constraint by itself), we just
have to join it with the other values from the second member X2 to form the other possible patterns,
with no constraint check.
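Member generation for this second example can be sketched as follows (the alphabet I2 is an illustrative assumption): every candidate is assembled directly from the MGF, so no constraint check is ever needed.

```python
# Sketch of member generation for the succinct monotone constraint
# {a} ⊆ X: candidates come straight from the MGF
# {X1 ∪ X2 | X1 = {a} & X2 ⊆ I2}.
from itertools import chain, combinations

I2 = ["b", "c", "d"]

def powerset(xs):
    return chain.from_iterable(combinations(xs, k)
                               for k in range(len(xs) + 1))

candidates = [frozenset({"a"}) | frozenset(x2) for x2 in powerset(I2)]
assert all("a" in c for c in candidates)   # satisfied by construction
```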
Succinct constraints were first proposed by Ng et al. [NLHP98], as well as an apriori-based al-
gorithm, called CAP (Constrained APriori), implementing the strategies above. Later on, Leung
et al. [LLN02] proposed FPS (FP-tree based mining of Succinct constraints), which uses the same
strategy but with a pattern-growth approach. These strategies are the basis for pushing succinct
constraints [BJ05].
4. Prefix-monotonicity: These constraints may seem straightforward to push into pattern mining
algorithms, since they can be treated as anti-monotonic or monotonic, just by imposing the correct
order of items in all itemsets. However, there is one main difference: one cannot discard all itemsets
that violate the constraint, because an itemset may violate it as a prefix, but still be the suffix of a valid
prefix. For example, the itemset X = {20, 10} does not satisfy the constraint C ≡ avg(X) ≥ 20.
However, the itemset X ′ = {30, 20, 10}, with X as a suffix, satisfies it.
These constraints were proposed by Pei et al. [PHL01], as well as a pattern-growth algorithm, FIC
(Frequent Itemset mining with Convertible constraints)3, with a strategy similar to the algorithm
PrefixSpan [PHMA+01] for sequential pattern mining. In the presence of a prefix anti-monotonic
constraint (FICA), the idea is to keep all frequent itemsets, but only grow the itemsets that satisfy
the constraint (valid prefixes). Apriori algorithms can also adopt this strategy, by keeping
all frequent itemsets (even if they violate the constraint) and only generating candidates with valid
prefixes.
In the presence of prefix monotonic constraints (FICM), all frequent itemsets must be kept too,
but, as soon as some frequent itemset satisfies the constraint (a valid prefix), algorithms do not need
to test any itemset with it as a prefix; they just grow them (i.e. generate all supersets with that prefix)
and return them [PHL01], after confirming their frequency.
3A first draft of this algorithm was proposed in [PH00], under the name CFG (Constrained Frequent pattern Growth).
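The prefix anti-monotonic growth strategy can be sketched as follows (the values and the avg(X) ≥ 20 constraint are illustrative, and support counting is omitted for brevity): every itemset reached is recorded, but only valid prefixes are grown further.

```python
# Sketch of the FICA idea: items in value-descending order; every itemset
# reached is kept, but invalid prefixes (avg < 20) are never extended.
items = sorted([30, 25, 10, 5], reverse=True)   # value-descending order

def grow(prefix, rest, out):
    out.append(prefix)                           # kept, even if it violates C
    if sum(prefix) / len(prefix) < 20:
        return                                   # invalid prefix: never grown
    for j, item in enumerate(rest):
        grow(prefix + [item], rest[j + 1:], out)

explored = []
for j, item in enumerate(items):
    grow([item], items[j + 1:], explored)
```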
An important contribution of this property is the fact that regular expressions in sequential pattern
mining are prefix-monotone constraints. This means that they can be treated with a very similar
strategy. Pei et al. [PHW02] took advantage of this and proposed Prefix-Growth, a pattern-growth
algorithm that recursively grows longer sequences from smaller ones, but only projects sequences
that form a valid prefix. The same authors also presented an overview of constraint-based sequential
pattern mining [PHW07], where they state that Prefix-Growth achieves better performance than
other ad hoc algorithms for regular expressions [SA96, GRS99, Zak00b], and is able to push more
constraints.
5. Mixed Monotonicity: Leung and Sun [LS12] also proposed the algorithm FPM (Frequent Pattern
mining for Mixed monotone constraints). FPM is a pattern-growth algorithm that is able to exploit
the properties of prefix-trees and to include mixed monotone constraints in a quite simple way.
The idea is to first divide the items into positive and negative sets (ItemP and ItemN ), and then
order the items in ItemP in ascending order of value, and those in ItemN in descending order. The mining
process proceeds iteratively, starting with the monotonic group and only then piecing together the
anti-monotonic group (in the case of sum(P ) ≥ v, it starts with ItemP and then ItemN ). Thus,
while mining the monotonic group, if an itemset satisfies the constraint, no checking is needed
for the supersets composed of items of that group. When the processing of this group finishes,
the algorithm adds items from the anti-monotonic group, one by one, and if the resulting itemsets
violate the constraint, it stops exploring them.
This strategy can be applied in a wide range of domains, including financial markets and air tem-
perature, to correctly and efficiently find patterns where constraints involve manipulating negative
values.
Table 4.4 presents a summary of the algorithms that implement the strategies above, taking
advantage of the properties of the constraints.
Table 4.4: Algorithms designed to incorporate constraints that follow specific properties. Note that algorithms for prefix-monotone or for conjunctions of constraints are also able to deal with simple AM or M constraints.

Properties | Algorithms
Anti-Monotonicity and Monotonicity | CAP [NLHP98]
Succinctness | CAP [NLHP98], FPS [LLN02]
Prefix-Monotonicity | CFG [PH00], FIC [PHL01], Prefix-Growth [PH02]
Mixed-Monotonicity | FPM [LS12]
Conjunctions of Anti-monotone and Monotone Constraints | G [BJ00], BMS+ and BMS* [GLW00], Molfea [RK01], DualMiner [BGKW03], ExAnte [BGMP05], MUSIC [SC05], [RJLM10]
There are some constraints that do not have nice properties for pushing (i.e. they are neither
anti-monotonic, monotonic, succinct, prefix-monotone nor mixed monotone), for example, combinations of
monotonic and anti-monotonic constraints and most of the existing interestingness measures [Bay05].
These constraints are not easily pushed into the pattern mining process, and an exhaustive search is not
an efficient solution, since the number of frequent itemsets can still be much higher than the number of
those that satisfy the constraint. Fortunately, some strategies have been proposed to deal with such
constraints, trying to take advantage of the benefits of constraint properties.
One widely used approach is to introduce constraint relaxations (weaker constraints) [NLHP98,
GRS99, AO05] that allow the algorithms to prune some of the search space and therefore make the
discovery more efficient. These relaxations depend on the constraint, but there has been a major effort
to find relaxations that have nice properties. The idea is thus to run a more efficient algorithm over the
data using the relaxation, and then to perform an exhaustive search on the results (instead of on all data).
Since relaxations are weaker than the original constraint (though stronger than just using frequency
pruning), the results must always be tested against the constraint so that only valid itemsets are returned.
Another approach is to use more than one strategy (one after the other or simultaneously). However,
as highlighted by Boulicaut and Jeudy [BJ00], if we are dealing with a conjunction of monotonic and
anti-monotonic constraints, we face a tradeoff between anti-monotonic and monotonic pruning. This may
happen because, when a monotonic constraint is pushed, it might save tests on monotonic constraints.
But, the results of those tests could have led to more effective anti-monotonic pruning [SVA97, GRS99].
As an example, pushing the monotonic constraint length(P ) ≥ 10 would avoid the generation of itemsets
of size less than 10. However, there would then be many candidates of size 10 or more, and all of
them would have to be tested against the anti-monotonic constraint. If the smaller itemsets had been tested
against the anti-monotonic constraint, many itemsets of size higher than 10 might already have been pruned,
and therefore not tested.
The identification of a good strategy for pushing these constraints requires a priori knowledge of
the constraint selectivity, which is generally not available [Bou04]. Boulicaut and Jeudy [BJ00]
also proposed a strategy (and the G algorithm) that may help dealing with these conjunctions, by
choosing the order of constraint pushing based on their selectivity and evaluation cost. With this in
mind, Bonchi et al. [BGMP03] proposed an adaptive strategy, ACP (Adaptive Constraint Pushing),
that is able to dynamically give more importance to anti-monotonic or monotonic pruning to
maximize efficiency, depending on the ratio of itemsets found infrequent. The same authors also proposed
ExAnte [BGMP05], a pre-processing algorithm that reduces the data by repeatedly eliminating
all itemsets that violate the monotone constraints and then all that violate the frequency or the anti-
monotone constraints. It can be followed by any efficient traditional pattern mining algorithm, but it
requires several scans over the data.
Other algorithms were proposed based on version spaces and on border representations. Essentially,
it has been realized that, for example, the space of solutions of a monotonic constraint is completely
characterized by its set (or border) of maximally specific elements. Likewise, the space of solutions of
an anti-monotonic constraint is completely characterized by its set (or border) of maximally general
elements. The idea is that, given a conjunction of an anti-monotonic and a monotonic constraint, it is
possible to start a level-wise search from the minimal itemsets that satisfy the monotonic constraint,
until reaching the maximal itemsets satisfying the anti-monotonic constraint [Bou04]. These properties
have been exploited by level-wise algorithms [MT97] to mine conjunctions, as in G [BJ00], BMS+
and BMS* [GLW00], Molfea [RK01], MUSIC [SC05] and DualMiner [BGKW03], and by the algorithm
proposed by De Raedt et al. [RJLM10] to mine arbitrary expressions over anti-monotonic and monotonic
constraints.
4.8.2 Categories vs. Algorithms
Besides the algorithms described above, there are some algorithms designed specifically for some particular
constraint category. Table 4.5 summarizes these algorithms.
Table 4.5: Algorithms designed to incorporate content and structural constraint categories.

Categories | Algorithms
Content – Item and Value | MultipleJoins, Reorder and Direct [SVA97]; WFIM (weights) [YL05]
Content – Aggregate | DnA and BP-cubing [WJY+05, ZCD07]
Structural – Sequence (and length) | SPIRIT [GRS99]; [AO02]; Sim [CMB02]; Re-Hackle [ALB03]; ε-accepts [AO04]
Structural – Network | Onto4AR framework [Ant08]; D2Apriori [Ant09b]; SemAware [ME09b]
Structural – Temporal (gap and duration) | GSP [SA96]; C-SPADE [Zak00b]; Gen-PrefixSpan [AO03]; (episodes) [MTIV97], MBD-LLBorder [DL99]; (cycles) sequential and interleaved [ORS98]
Srikant and Agrawal [SA95] were the first to introduce item constraints, the first constraints other than
minimum support. They proposed three apriori-based algorithms – MultipleJoins, Reorder and Direct
– that are able to deal with boolean combinations of these constraints, i.e. C = D1 ∧ D2 ∧ ... ∧ Dm,
where each Di = Ci1 ∨ Ci2 ∨ ... ∨ Cin , and each Cij is an item constraint of the form i ∈ S or i ∉ S.
Despite being composed of anti-monotonic and monotonic constraints, these combinations are themselves
neither anti-monotonic nor monotonic. Nevertheless, each individual Cij is a simple item constraint. The
first two algorithms proposed
(MultipleJoins and Reorder) use an anti-monotonic relaxation of the constraint, by finding first an itemset
S such that all patterns (i.e. valid itemsets) must have some item from S. Candidate generation is
optimized by joining only itemsets whose prefixes or suffixes contain items from S. Reorder is an
optimization of MultipleJoins that reorders itemsets so that items from S appear first. Direct
pushes the complete constraint, at the cost of a more complex candidate generation phase [BJ05]. The
idea of the algorithm is to explore the smaller constraints comprising the main constraint and join the
frequent itemsets that satisfy each one separately.
An alternative ad-hoc strategy for mining aggregate constraints was proposed with BP-Cubing
(Bound Prune Cubing) [WJY+05], an algorithm of the family of DnA (Divide and Approximate) al-
gorithms [ZCD07]. These follow a divide-and-approximate approach that first divides the search space
into subspaces (group-by partitions) and then seeks individual constraint approximations in each sub-
space to achieve the best results. They also propose the integration of more aggregate functions, like sum
of squares, positive sum, negative sum and variance.
Some algorithms were proposed to deal with sequences that accept gap and duration constraints. The
first was GSP [SA96] (Generalized Sequential Patterns), an apriori-based algorithm that organizes the
data according to given time windows, and generates (k + 1)-sequences by joining two k-sequences
whenever the (k − 1)-suffix of one equals the (k − 1)-prefix of the other. The authors also proposed the
integration of taxonomies, by extending transactions with the ancestors of items. C-SPADE [Zak00b]
is an extension that outperforms GSP and allows other constraints, like length, minimum gap and item
constraints. Gen-PrefixSpan [AO03] is a pattern-growth algorithm with the same goal as GSP, but without
the candidate generation bottleneck. Over time, other algorithms for temporal constraints were proposed,
for example, to find episodes (sets of events that must occur close to each other) [MTIV97, DL99] and
cycles [ORS98].
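The GSP-style join above can be sketched on flat tuples (real GSP sequences are lists of itemsets; this simplification is ours): two k-sequences join when dropping the first element of one yields the same (k−1)-sequence as dropping the last element of the other.

```python
# Sketch of the GSP candidate join on flat tuples.
def gsp_join(s1, s2):
    if s1[1:] == s2[:-1]:
        return s1 + (s2[-1],)
    return None

assert gsp_join(("a", "b"), ("b", "c")) == ("a", "b", "c")
assert gsp_join(("a", "b"), ("c", "d")) is None
```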
The first algorithms for mining with regular expression (RE) constraints did not make use of the
prefix-monotone property, and hence they had to create some relaxations, in order to achieve a balance
between the efficiency of the algorithms and the effective push of the constraints. An example is the family
of the apriori-like SPIRIT (Sequential Pattern mIning with Regular expressIons consTraints) algorithms
[GRS99]. SPIRIT(N) only requires that all elements in the pattern appear in the RE, which is a simple
anti-monotonic item constraint relaxation. SPIRIT(L) only generates sequences that are legal w.r.t. some
state of the RE automaton, i.e. the corresponding transitions must be possible in the automaton. For SPIRIT(V),
possible patterns must be valid suffixes, i.e. sequences of transitions that lead to a final state. Finally, SPIRIT(R)
enforces the complete constraint, and only generates valid sequences. The first relaxations are easier
to push into the algorithms; however, weaker constraints prune fewer possible patterns than stronger
ones. Therefore, all versions except R must test all results against the original constraint before returning
them.
Antunes and Oliveira [AO02] followed the SPIRIT ideas and adapted them to deal with context free grammars
(CFG), through the use of pushdown automata. These grammars are more powerful than REs, since
they can express the same and more languages. The authors also show that the increase in the expressive
power of the language used for specifying constraints does not impair the performance of the algorithms.
The same authors also proposed the pattern-growth algorithm ε-accepts [AO04], to find sequences that
approximately conform to a CFG, by allowing some insertions, deletions or replacements in the middle of
the sequences. Capelle et al. [CMB02] proposed an apriori-like algorithm with a similar goal. They
assume a reference sequence, given by the user, and calculate the similarity of the discovered sequences
with the reference (i.e. the number of differences). The sequences that surpass a similarity threshold
are returned.
Some adaptive strategies have also appeared for pushing REs. The algorithm RE-Hackle [ALB03]
represents the RE in a tree structure called a Hackle-tree, containing one node per operator (disjunction,
concatenation, '∗' operator) and one path per combination of these operators in the RE. This tree is
scanned at each candidate generation step, and an extraction function (depending on the operator) is
used in each node to extract the valid candidates. From these candidates, the frequent ones are used for the
next generation.
Antunes [Ant07, Ant08, Ant09b] was the first to propose the introduction of ontologies in DM as a
constrained mining problem, defining a set of ontology constraints, along with the framework Onto4AR
(Ontologies for Association Rules). The goal of this framework is to work with any ontology, and therefore
in any domain, allowing users to choose the ontology constraints to incorporate in the mining
process. The framework allows not only for the introduction of some of the network constraints described
above, but also of weak and strong compositions of those constraints. The same author also proposed the
algorithms D2Apriori (Domain Driven Apriori) and D2FP-Growth, which first acquire domain knowledge
from the knowledge base within the ontology, and then instantiate the constraints, read the data creating
its representation, and finally identify frequent constrained patterns.
Mabroukeh and Ezeife [ME09b] proposed SemAware, a generic framework for sequential pattern min-
ing that integrates semantic information, in the form of ontologies, into all phases of the web usage mining
process. It defines an apriori-based algorithm that prunes candidate generation and the search space ac-
cording to the semantic distance between objects and a maximum distance threshold, which can be user
specified or automatically calculated from the minimum support and the number of edges in the ontology.
Before the mining process, a matrix with the topological distance between concepts in the ontology is
built. During the mining process, sequences and candidates exceeding the allowed maximum
distance are pruned, with no need for support counting (anti-monotonicity property). The same authors
have extended this work to take into account weights in the relations. By combining the distance matrix
with a weight matrix, they define a weighted distance constraint, corresponding to a weighted sum of
the distances between two concepts times the weights in that path. In [ME09a], the distance matrix
is combined with the transition probability matrix of a Markov model, so that the same algorithm
(SemAware) is able to guide a Markov process.
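The distance-based pruning can be sketched as follows (the concept names, distance values and helper names are illustrative, not from SemAware): candidates whose consecutive concepts are farther apart in the ontology than the allowed maximum are pruned without any support counting.

```python
# Sketch of SemAware-style semantic pruning with a precomputed
# (symmetric) topological distance matrix.
dist = {("login", "search"): 1, ("search", "checkout"): 2,
        ("login", "checkout"): 3}

def distance(a, b):
    return dist.get((a, b)) or dist.get((b, a))

def pruned(candidate, max_dist):
    return any(distance(a, b) > max_dist
               for a, b in zip(candidate, candidate[1:]))

assert pruned(["login", "checkout"], max_dist=2) is True
assert pruned(["login", "search", "checkout"], max_dist=2) is False
```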
4.8.3 Data Sources vs. Algorithms
As noted before, the great majority of algorithms proposed for constrained pattern mining were designed
for mining one single and static data table.
Leung et al. [LK06] were the first to propose the integration of data streams with constrained mining,
with two algorithms, ApproxCFPS and ExactCFPS (Approximated and Exact Constrained Frequent
Patterns for Streams). These algorithms are able to push succinct constraints deep into the algorithm
FP-Streaming [GHP+03], and to find all approximate or exact patterns, respectively, in data streams.
The ideas are simple: for succinct anti-monotonic constraints, remove all single items that violate the
constraint before processing each transaction; for succinct monotonic constraints, divide the items of
each batch of transactions into mandatory and optional items, and order the transactions so that
mandatory items appear first, so that only itemsets that start with the mandatory items need to be
mined. While the algorithms efficiently push constraints into data stream mining,
they are only able to handle constraints that are succinct.
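The two pre-processing ideas can be sketched as follows (the helper names and data are ours, not from the ApproxCFPS/ExactCFPS papers):

```python
# Sketch of the two succinct-constraint pre-processing steps for streams.

def drop_violating(transaction, violates):
    """Succinct anti-monotone: remove violating items before processing."""
    return {i for i in transaction if not violates(i)}

def mandatory_first(transaction, mandatory):
    """Succinct monotone: order items so mandatory ones come first."""
    return sorted(transaction, key=lambda i: (i not in mandatory, i))

t = {"a", "b", "c", "d"}
assert drop_violating(t, lambda i: i == "b") == {"a", "c", "d"}
assert mandatory_first(t, mandatory={"c"}) == ["c", "a", "b", "d"]
```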
To the best of our knowledge, there is no algorithm designed for pushing constraints into the mining
of multiple tables. This incorporation is not straightforward, since we are usually in the presence of
multiple types of entities and events, with different characteristics, and the support of items depends
on the inter-relations between them. As noted in Section 2.3, one of the common ways to deal with
multiple tables is to join all of them into a single table and apply a single-table pattern mining
algorithm. However, even if this pre-processing step can be computed, the denormalized table usually
contains both transactional and non-transactional data, and the usual goal is to mine and find the common
characteristics of the transacted entities, as opposed to the goal of constrained pattern mining, which is to
find the common entities transacted together (only considering the entities whose characteristics satisfy
the constraints). In this sense, since most existing constraints were proposed for transactional data,
applying them to a multi-relational domain requires the adaptation of the algorithms.
4.9 Discussion and Open Issues
The use of domain knowledge in data mining has been recognized as one of the 10 most important
challenges in DM [YW06], not only because this domain knowledge represents the semantics of the domain
and user expectations, but also because, by introducing it into the discovery process, it is possible to
guide the algorithms through the discovery of more interesting and focused results. It is, therefore, one
promising approach to minimize two drawbacks of pattern mining: the large number of results, and their
lack of focus on user expectations.
This area of data mining guided by domain knowledge has been evolving, and several representations
have been proposed and analyzed, including human interactions, annotations, constraints, taxonomies,
ontologies, and other forms. Each successive form of representation allows the formalization of more
complex knowledge, and therefore each has its advantages and disadvantages and can be used in
different ways to guide the mining process. It is important to note, though, that the more complex the
model, the more difficult it is to understand, to deal with, and to incorporate in the mining process.
The most explored form of domain knowledge is domain constraints. Essentially, they are filters on
the data or results that capture application semantics and user needs in an intuitive manner. Several
types of constraints have been proposed and, depending on their properties, some general and several
ad-hoc strategies have appeared (see Tables 4.4 and 4.5), which have already been extended and applied
to a variety of problems and domains.
The use of constraints has been increasingly associated with other areas. One interesting example
is the algorithm U-FPS, proposed by [LB09] to deal with constrained frequent pattern mining from
uncertain data. It is able to represent user beliefs about the presence or absence of items in data, and
also to push constraints deep into the discovery process – succinct [LB09] or (prefix-)monotone [LHB10]
constraints. More recently, some authors established a correspondence between pattern mining and constraint
programming (SAT solvers) [RGN08, NJG11]. The advantage of these approaches is that the definition of
constraints is independent of the SAT solver, i.e. we can try several SAT solvers for the same constraint
specifications, and we can use the same SAT solver to solve different constraints. The major problem
stems from the non-trivial specification and mapping of the domain and constraints (as we already know
them) to a language that can be used by the solver. Fortunately, this problem is being addressed by
techniques like the one proposed by [NJG11].
Despite all the advantages and opportunities in the use of constraints, it is important not to forget their
tradeoff: pushing constraints that are too restrictive may reduce the discovery process to a simple
hypothesis-testing approach. Some approaches already take this tradeoff into account and propose
solutions, such as the use of constraint relaxations.
Despite the great advances in the use of domain knowledge in pattern mining, it is clear that
there are several research paths to follow. One open discussion is when to push the domain knowledge.
Pushing it as a pre-processing step reduces the data to analyze, but may eliminate important
data; pushing it as a post-processing step is discovery preserving, but requires all data to be analyzed; and
incorporating domain knowledge during the actual discovery process allows us to gradually reduce the
search space and guide the mining only through promising paths, thus avoiding processing all data and,
if used wisely, avoiding the elimination of potentially interesting data. Therefore, there is a need to develop
and extend algorithms that are able to push constraints during the pattern mining process.
Also, apart from those using constraints, most of the existing algorithms for other forms of knowledge
representation do not allow the discovery of more complex patterns, such as sequential and temporal patterns.
Furthermore, there are many ad-hoc approaches and few general strategies. This generally hinders
their application to different domains, and their actual use for decision support. The need for an
integration theory is undeniable.
Finally, existing algorithms are mainly designed for transactional and single-table databases. There
is a need to create new constraints and adapt these strategies (or create new ones) to deal with more
complex and demanding data, such as other forms of structured data, like multi-relational models,
graphs and XML files, whose inherent structure can be exploited.
In Chapter 5, we address and discuss in more detail these last two open issues.
Chapter 5
Pushing Constraints into Pattern
Mining
The previous chapters of this dissertation discuss two important lines of research for pattern mining. On
one hand, the mining of large, growing and multi-relational databases is increasingly important in this
era of big data, where all kinds of data are continuously being generated and made available. On the
other hand, constrained mining is very important for the pattern mining task, since it can significantly
improve the results and applicability of these techniques, by decreasing the number of patterns returned
and by focusing the discovery process on areas where it is more likely to find interesting information.
There is therefore a need to integrate these two areas and, despite the great advances in each separate
area, much work remains to be done to incorporate constraints into the mining of more complex and
dynamic databases.
In this sense, we first propose two efficient and general algorithms for pushing constraints with any
property. Both algorithms incorporate any constraint as a post-processing step into a pattern-tree
(the same structure used by our multi-dimensional algorithm StarFP-Stream). The first algorithm is
called Constraint pushing into a Pattern-Tree (CoPT ) [SA13a], and the second CoPT4Streams [SA13b].
They are designed for single table (and static) datasets and for single table data streams, respectively. By
using the pattern-tree structure, both algorithms are able to optimize the incorporation of any constraint,
avoiding unnecessary tests and eliminating invalid patterns earlier, according to the properties of the
constraints. Experiments show that the algorithms are efficient and effective, for all constraint properties,
and even for constraints with small selectivity.
Afterwards, we analyze in detail the incorporation of constraints in a multi-dimensional domain, and
propose a set of constraints that can be applied when mining a star schema (Star Constraints): entity
type, entity, attribute and measure constraints. Based on the strategies proposed for the algorithms
CoPT and CoPT4Streams, as well as on the related work described in Chapter 4, we also propose a set
of approaches for pushing the above constraints in pattern mining over star schemas.
To the best of our knowledge, there is no work on the incorporation of constraints in multi-relational
mining, and therefore this work is an important first step towards this integration.
In this chapter we first present in detail the two algorithms for pushing constraints into a pattern-tree
(Sections 5.1 and 5.2). Then, in Section 5.3, we analyze the difficulties of pushing constraints in the
multi-relational domain, define the star constraints, and propose a set of strategies for incorporating
those constraints in the discovery of patterns over star schemas. Finally, Section 5.5 discusses and
concludes the chapter.
5.1 Pushing Constraints into a Static Pattern-Tree
As described in Section 4.3, the problem of constrained pattern mining is to find all frequent itemsets
that satisfy some constraint.
In this section we propose a set of strategies to push constraints with nice properties into pattern
mining, through the use of a pattern-tree structure. These are post-processing strategies that, combined
with the properties of the pattern-tree, make it possible to efficiently filter the results according to any
constraint.
We also propose an algorithm, called CoPT (Constraint Pushing into a Pattern-Tree), that implements
these strategies and is able to incorporate any of those constraints efficiently, therefore returning fewer
and more interesting results. Since it is a post-processing algorithm, any traditional pattern mining
algorithm can first be used to search for frequent itemsets, and its results, kept in a pattern-tree, can be
processed directly by CoPT.
5.1.1 Pattern-Tree
A pattern-tree, as first described for StarFP-Stream in Section 3.3, is a compact prefix tree structure
that holds information about patterns.
At its core, each node contains an item and a support, and edges link items that occur together,
forming the itemsets. Therefore, each node in the pattern-tree corresponds to an itemset, composed of
the items from the root to this node, with the support attached to it. Note that each node may contain
other fields, if needed, such as the error when mining data streams. In this sense, the children of a given
node (its subtree) correspond to the supersets of the respective itemset.
As a prefix tree, itemsets that share the same prefix also share the nodes corresponding to that
prefix. Since frequent items are often shared among many patterns, the tree is usually much smaller
than storing the itemsets in a list or a table, and the search for an itemset is usually much faster.
Note that if (a, b, c) : 5 is a frequent itemset, then a, b, c, (a, b), (a, c) and (b, c) are also frequent,
with support greater than or equal to 5, and therefore they are also in the pattern-tree. This means
that, for each itemset in the tree, all elements of its strict powerset are also in the tree. This may seem
undesirable or redundant at first glance, but it is an important property that facilitates the pruning of
the tree while searching for constraint satisfaction.
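To make the structure concrete, the following is a minimal Java sketch of a pattern-tree (illustrative only; class and method names are not the thesis implementation). Each node stores an item and a support, and inserting an already-ordered itemset reuses shared prefixes:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal pattern-tree sketch: each node holds an item and a support;
// the path from the root to a node spells out an itemset.
class PatternTree {
    static class Node {
        int item;
        int support;
        List<Node> children = new ArrayList<>();
        Node(int item, int support) { this.item = item; this.support = support; }
    }

    final Node root = new Node(-1, 0); // dummy root

    // Insert an already-ordered itemset with its support, sharing prefixes.
    void insert(int[] itemset, int support) {
        Node cur = root;
        for (int item : itemset) {
            Node next = null;
            for (Node c : cur.children)
                if (c.item == item) { next = c; break; }
            if (next == null) {
                next = new Node(item, support);
                cur.children.add(next);
            }
            cur = next;
        }
        cur.support = support; // support of the full itemset
    }

    // Total number of nodes (excluding the dummy root).
    int size() { return size(root) - 1; }
    private int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }
}
```

For example, inserting (a), (a, b) and (a, b, c) creates only three nodes, because the common prefixes are shared.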
5.1.2 Constraint Pushing Strategies
In order to push constraints into a pattern-tree, we define a set of strategies that can be used, based on
constraint properties. A naive approach is to perform a simple depth-first search (DFS) to traverse the
tree and test all nodes for all types of constraints (note that, when we test a node for a constraint, we
mean that we test the itemset corresponding to that node). However, not all nodes need to be tested.
For example, if the itemset of a node violates an anti-monotonic constraint, no superset will satisfy it,
and therefore there is no need to test the children of that node, nor to keep them in the tree. Hence,
we can take advantage of constraint properties and perform a constrained DFS, stopping the search at
some points and avoiding unnecessary tests.
Another possible approach is to push the constraint right before inserting each itemset in the pattern-
tree. However, while this may be better in terms of memory, because the pattern-tree would be smaller,
it means that we have to test every itemset. By scanning the tree instead, we may skip the constraint
checking of many itemsets.
Furthermore, constraints can be used, not only to filter the results, but also to prune the pattern-tree
and remove invalid itemsets for future accesses.
Next, we describe the strategies for pushing constraints satisfying each property.
Anti-Monotonicity (AM):
Pushing an AM constraint (CAM ) is straightforward. While performing a DFS, if the node:
(a) Satisfies CAM : keep it in the tree and return it as a pattern;
(b) Violates CAM : there is no need to search its subtree because all supersets also violate the constraint.
Therefore we can prune the tree and remove this node, as well as all of its children.
Monotonicity (M):
To incorporate a monotonic constraint (CM ), we cannot remove nodes that violate it, because the
supersets of this node (its children) may satisfy it. So, while traversing the tree, if the node:
(a) Satisfies CM : keep it in the tree and return it as a pattern. Do the same for each node in its subtree,
without testing for the constraint; (Note that if we are just pruning the tree, not yet returning the
patterns, we do not even need to scan the subtree, because all supersets satisfy the constraint, and
there is nothing to remove.)
(b) Violates CM : If it is a leaf node (has no supersets), we can remove it, as well as all parents that
become a leaf because of this elimination. If it is not a leaf, continue the search to its children,
since they can satisfy the constraint.
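The two strategies above can be sketched as recursive prunings over a prefix tree of patterns. The Java sketch below is illustrative (the node type and names are assumptions, and `Predicate` stands for the constraint check on an itemset); it is not the thesis implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the AM and M pruning strategies over a prefix tree of patterns.
class ConstrainedDfs {
    static class Node {
        int item;
        List<Node> children = new ArrayList<>();
        Node(int item) { this.item = item; }
    }

    // Anti-monotonic: if an itemset violates C, no superset can satisfy it,
    // so the whole subtree is pruned. Returns true if the node must be removed.
    static boolean pushAM(Node n, List<Integer> prefix, Predicate<List<Integer>> c,
                          List<List<Integer>> out) {
        List<Integer> cur = new ArrayList<>(prefix);
        cur.add(n.item);
        if (!c.test(cur)) return true;           // prune node and subtree
        out.add(cur);                            // itemset satisfies C
        n.children.removeIf(ch -> pushAM(ch, cur, c, out));
        return false;
    }

    // Monotonic: a violating node may still have satisfying supersets, so we
    // keep descending; once an itemset satisfies C, all supersets do too.
    static boolean pushM(Node n, List<Integer> prefix, Predicate<List<Integer>> c,
                         List<List<Integer>> out) {
        List<Integer> cur = new ArrayList<>(prefix);
        cur.add(n.item);
        if (c.test(cur)) {
            collectAll(n, cur, out);             // subtree needs no more checks
            return false;
        }
        n.children.removeIf(ch -> pushM(ch, cur, c, out));
        return n.children.isEmpty();             // violating leaf: remove it
    }

    // Return a node and all of its descendants as patterns, without testing C.
    static void collectAll(Node n, List<Integer> cur, List<List<Integer>> out) {
        out.add(cur);
        for (Node ch : n.children) {
            List<Integer> next = new ArrayList<>(cur);
            next.add(ch.item);
            collectAll(ch, next, out);
        }
    }
}
```

Note how, in `pushM`, returning true for an emptied node makes the removal of parents that become leaves cascade naturally up the recursion.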
Succinctness (S):
In the presence of a succinct constraint, we can apply the strategies for CAM or CM , depending on
whether it is succinct anti-monotonic (CSAM ) or succinct monotonic (CSM ), respectively. However,
the succinctness of a constraint allows us to know, from the outset, which items satisfy the constraint
and which do not. We can therefore exploit this property to obtain a more efficient search.
With this in mind, we first divide the items into two groups: items that satisfy, or are necessary for
the satisfaction of, the constraint, Is; and items that violate, or are not required for the satisfaction of,
the constraint, Iv. Then, before inserting itemsets into the pattern-tree, we order them according to
those groups.
CSAM : With a SAM constraint, single items that violate it can be discarded. If we order the items
in itemsets so that Iv appears before Is (Iv closer to the root and Is closer to the leaves), when
applying the CAM strategy we only need to check the first level of the pattern-tree. If a node
violates the constraint, remove it and its subtree; if it satisfies the constraint, all of its children will
also satisfy it, because they belong to Is, so we can return all of them as patterns without testing
the constraint.
CSM : In the case of a SM constraint, Is contains the mandatory items and Iv the optional items. If an
itemset with items from Is satisfies the constraint, all of its supersets formed by adding items from
Is or Iv also satisfy it. Itemsets with items only from Iv violate the constraint. In this sense, if
we order itemsets so that items from Is appear before items from Iv, when applying the CM
strategy we only need to do it until the first node from Iv. This is because, if we arrive at such a
node and still need to test the constraint, it means the constraint has not been satisfied by items
from Is, and the next items cannot satisfy it either, because they are optional; therefore we do not
need to test anything further.
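The ordering step for succinct constraints can be illustrated as follows. This is a sketch under the assumption that the constraint already tells us which items belong to the group that must come first (Iv for SAM, Is for SM); names are illustrative:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Set;

// Sketch of the item ordering used for succinct constraints: items in the
// given 'first' group are placed before the remaining items (Iv first for
// SAM constraints, Is first for SM constraints).
class SuccinctOrder {
    // Sorts an itemset so that items in 'first' precede the remaining items;
    // ties are broken by the item value (any fixed total order would do).
    static void order(Integer[] itemset, Set<Integer> first) {
        Arrays.sort(itemset, Comparator
                .comparing((Integer i) -> first.contains(i) ? 0 : 1)
                .thenComparing(i -> i));
    }
}
```

For instance, for a SM constraint whose mandatory items are Is = {2, 5}, ordering the itemset (1, 2, 3, 5) yields (2, 5, 1, 3), so the CM strategy only needs to test nodes until the first optional item.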
Prefix-Monotonicity (P ):
Since prefix-monotone constraints can only be treated as AM (CPAM ) or M (CPM ) constraints if the
items follow a particular order, we just need to sort the itemsets according to that order before inserting
them in the pattern-tree, and then apply the CAM or CM strategy, respectively. Otherwise, we have to
traverse the whole tree and check all nodes for the constraint.
Mixed-Monotonicity (Mix):
Mixed-monotone constraints (CMix) are both AM and M , for different groups of values. In this case,
we just have to divide the items into those groups, IAM and IM , and put IM before IAM in the tree,
i.e. sort itemsets so that items from the IM group appear above items from IAM . The idea is to start
with the CM strategy until a node satisfies the constraint or a node from IAM appears. From that node
on, we can apply the CAM strategy and prune invalid nodes from its subtree. So, for each node, start
with the monotone strategy:
1. Monotone strategy: If the itemset:
(a) Satisfies CMix: Keep it in the tree and return it as a pattern. We can now change to the
anti-monotone strategy and proceed;
(b) Violates CMix: If it is a leaf, remove it, as well as all parents that become a leaf. If it is a
node from IAM , remove it, and all its sub-tree. Otherwise, continue to its children.
2. Anti-monotone strategy: If the itemset satisfies the constraint, keep it in the tree and return it as a
pattern. If it violates the constraint, prune the tree from this node removing it and all its children.
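As a concrete example of such a partition, consider a constraint of the form sum(S) ≥ v over items carrying values of either sign: adding a non-negative item can never break satisfaction (the monotone direction, IM ), while adding a negative item can never repair a violation (the anti-monotone direction, IAM ). The sketch below is illustrative (class and method names are assumptions, not the thesis implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch for a mixed-monotone constraint sum(S) >= v over items with values
// of either sign: non-negative values form the monotone group (IM), placed
// above the anti-monotone group (IAM) of negative values in the tree.
class MixedPartition {
    static void partition(int[] values, List<Integer> iM, List<Integer> iAM) {
        for (int v : values) {
            if (v >= 0) iM.add(v);  // monotone group: goes above IAM
            else iAM.add(v);        // anti-monotone group
        }
    }

    // The constraint check itself: does the itemset's sum reach v?
    static boolean satisfiesSumGeq(List<Integer> itemset, int v) {
        int sum = 0;
        for (int i : itemset) sum += i;
        return sum >= v;
    }
}
```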
Combinations of constraints:
Most of the time, combinations of constraints that individually have these nice properties do not
themselves have nice properties. Pushing constraints with no nice properties means that the whole tree
needs to be traversed, and all nodes must be tested. Nevertheless, there are three important aspects.
First, disjunctions or conjunctions of anti-monotonic (resp. monotonic) constraints are also anti-
monotonic (resp. monotonic) constraints. Therefore, we can push them all at the same time (as one
single constraint), with the exact CAM (resp. CM ) strategy.
Second, for other properties, we need to sort the items according to the order that allows us to take
the most advantage of the property. When we have more than one constraint that needs an order, if the
orders are compatible (i.e. neither changes the order of the items of the other), it is possible to apply
the respective strategies at the same time. However, if the orders are not compatible, we cannot apply
any of the strategies above.
Finally, since we prune the pattern-tree to remove itemsets that violate a constraint, the pattern-tree
is generally smaller after applying some strategy. In this sense, we can still efficiently push several
constraints, one after another, each over the pattern-tree resulting from pushing the previous constraint.
And we can do it in an efficient order, by pushing AM constraints first, and then M ones. Constraints
with no nice properties (or with incompatible orders) can be pushed at the end, over the smallest
pattern-trees.
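The first aspect can be made concrete very simply: two AM predicates combined by conjunction (or disjunction) yield a predicate that is itself AM and can be pushed as one constraint. A small illustrative sketch:

```java
import java.util.List;
import java.util.function.Predicate;

// Conjunctions (or disjunctions) of anti-monotonic constraints are themselves
// anti-monotonic, so they can be pushed together as one single predicate.
class CombinedConstraints {
    static Predicate<List<Integer>> and(Predicate<List<Integer>> c1,
                                        Predicate<List<Integer>> c2) {
        return s -> c1.test(s) && c2.test(s);
    }
}
```

For example, combining the AM constraints sum(S) ≤ 10 and |S| ≤ 2 gives one AM predicate, pushed with the CAM strategy exactly as a single constraint would be.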
5.1.3 Algorithm CoPT
Since the strategies presented above share many similarities, they can be combined into one single
generic strategy. We therefore propose the algorithm CoPT (Constraint Pushing into a Pattern-Tree),
which is able to efficiently and effectively push any constraint into a pattern-tree.
Algorithm 3 CoPT Pseudocode
Input: Support σ, Dataset D, Constraint C
Output: All frequent itemsets that satisfy C

if C has order then
    order ← best order for C
p-tree ← empty tree with order order
run a pattern mining algorithm with σ and D, and insert results into the p-tree
L ← pushConstraint(p-tree, C)
return L

Patterns ← pushConstraint(Pattern-Tree p-tree, Constraint C)
    L ← ∅
    for all Node N, children of the root of p-tree do
        remove? ← push(N, C, {}, L)
        if remove? is true then
            remove N from root
    return L

boolean ← push(Node N, Constraint C, Itemset itset, Patterns L)
    isPattern? ← true, current ← itset ∪ N.item : N.support
    if Constraint is not null then
        if C is Succinct and N.item ∈ C.Iv then
            return true    // remove this node
        if current satisfies C then
            if C is Monotonic or C is Succinct then
                if C is Mixed then
                    change C to AM for next children
                else
                    C ← null    // no need to test any children
        else
            if C is Anti-monotonic then
                return true
            isPattern? ← false
    if isPattern? is true then
        L ← L ∪ current
    for all Node T, children of N do
        remove? ← push(T, C, current, L)
        if remove? is true then
            remove T from N
    if isPattern? is false and N is leaf then
        return true
    return false
The pseudo-code of the algorithm is presented in Algorithm 3.
Essentially, to push a constraint, CoPT first checks the item order required by that constraint and
creates an empty pattern-tree with it (if there is no such order, items are put in the pattern-tree in
support-descending order, which is known to improve the compactness of the tree [HPY00]). Then a
traditional pattern mining algorithm can run over the dataset to get the frequent itemsets. While it
runs, its results are inserted in the pattern-tree (note that the mining algorithm does not need any
change; only the pattern-tree knows how to sort and insert the itemsets). After that, we can push the
constraint into the pattern-tree.
In function push, for each node, current corresponds to the itemset composed of the items from the
root to this node and, until proven otherwise, it is a pattern. If there is no constraint to check (e.g. a
CM already satisfied), the itemset is added as a pattern and the same is done for all of its children.
Otherwise, (1) if the constraint C is succinct (SAM or SM) and the node violates it, the node can be
removed; (2) if current satisfies C: (a) C is mixed and we can change the strategy to AM ; (b) C is
monotonic and no child needs testing; or (c) C is succinct AM , and only the first level of the tree needs
testing; (3) if current violates C, it is not a pattern, and if C is AM we can prune the tree from this
node. After checking the constraints, if the node was not pruned, we can test its children. Finally, after
pushing C into the children, if the node is not a pattern and is a leaf, we can remove it.
5.1.4 Performance Evaluation
The goal of these experiments is to analyze the behavior of our algorithm in the presence of all types
of constraints, and to show that CoPT is able to effectively and efficiently push them into a pattern-tree,
taking advantage of their properties.
In these experiments we use a transaction database automatically generated by the program developed
at the IBM Almaden Research Center [AS94]. The dataset has 10k transactions, with an average of 25
items per transaction and a domain of 1000 items (with values from zero to 1000). In addition, in order
to test the mixed-monotone constraint, we consider an equivalent dataset with negative values, obtained
by making the values vary from −500 to 500.
We analyze the time needed to push the constraints on these datasets, as well as the size of the pruned
pattern-tree and the number of constraint checks the algorithm needs to make. Since the behavior of the
algorithm can depend on the selectivity of the constraints, we also use it in our experiments. Selectivity
is defined as the ratio of frequent itemsets that violate the constraint over the total number of frequent
itemsets, i.e. how much we can filter. We therefore test CoPT with several constraints with different
selectivities, varying from 10% to 90%. We also tested several minimum supports; since the results are
consistent, we present those for a support of 0.5%, and the results presented correspond to the average
of several runs with different constraints of equivalent selectivity. Also, as a baseline for comparison,
we test our algorithm against a version that checks all nodes for the constraints (i.e. that does not take
constraint properties into account), named CoPT+.
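The selectivity measure defined above can be computed directly over the set of frequent itemsets. A small sketch (names illustrative):

```java
import java.util.List;
import java.util.function.Predicate;

// Selectivity = (# frequent itemsets that violate the constraint)
//             / (total # frequent itemsets), i.e. how much we can filter.
class Selectivity {
    static double of(List<List<Integer>> frequent, Predicate<List<Integer>> c) {
        long violating = frequent.stream().filter(s -> !c.test(s)).count();
        return (double) violating / frequent.size();
    }
}
```

For example, if one of every three frequent itemsets violates the constraint, the selectivity is about 33%.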
The traditional pattern mining algorithm used was FP-Growth [HPY00], since it is an efficient
algorithm that does not suffer from the candidate generation problem. The computer used to run the
experiments was an Intel Core i7 CPU at 2GHz (Quad Core), with 8GB of RAM, running Mac OS X
Server 10.7.5, and the algorithms were implemented in Java (JVM version 1.6.0 37).
Experimental Results
As the core of our algorithm, the pattern-tree plays an important role in these experiments. Independently
of the constraint, in most cases the pattern-tree after pushing the constraint is smaller than the
original one, because it does not contain leaves that violate it. As the selectivity increases, more itemsets
violate the constraint, and therefore more can be discarded from the tree. In the case of an AM
constraint (AM , SAM or PAM), the number of nodes in the final pattern-tree corresponds to the
number of frequent itemsets that satisfy the constraint (the number of patterns). In the case of M
constraints (M , SM , PM and Mix), this might not be true, since nodes that violate the constraint have
to be kept if some superset satisfies it.
In fact, the time needed by the traditional unconstrained pattern mining algorithm corresponds to
the bulk of the total time: about 5 hours for these settings. Once the patterns are in a pattern-tree,
and due to its compact nature, it is fast (compared to pattern mining) to look for patterns that satisfy
some constraint, even for constraints with no nice properties (CoPT+) and with low selectivity. Figures
5.1, 5.3 and 5.5 show the time needed for pushing AM , M and Mix constraints into a pattern-tree,
respectively. We can see there that pushing constraints taking their properties into account (CoPT )
takes less time than testing all nodes (CoPT+), for every constraint property. For all AM and succinct
constraints, as the selectivity increases, the time needed to prune the tree decreases, since more violating
itemsets can be eliminated earlier. On the contrary, for M and SM constraints the time needed tends
to increase, because they take longer to find itemsets that satisfy the constraint (so that they can stop
checking it). The time is therefore related to the number of constraint checks.
These constraint checks are also an important part of the algorithm since, in theory, taking advantage
of constraint properties results in fewer tests. Figures 5.2, 5.4 and 5.6 show interesting results about
that. For AM constraints (AM and PAM), the number of tests decreases as the selectivity increases,
because the number of itemsets that violate the constraint, and can therefore be discarded, increases.
For M constraints (both M and PM) the trend is reversed. This happens because the M strategy only
stops checking when itemsets satisfy the constraint: if more itemsets violate it (higher selectivity), more
itemsets need to be tested. Using the succinctness of constraints brings the highest improvements, both
in time needed and in constraint checks avoided. The number of tests for succinct constraints does not
depend on the selectivity, because only the nodes of the first level of the tree need to be tested (in this
case, about 800 nodes). Note that the tree has more than 300 thousand nodes, and only 800 need to be
checked. Finally, Mix constraints show a “mix” of the behaviors of M and AM constraints. As the
selectivity increases, more itemsets belonging to both groups of values violate the constraint; the more
violating itemsets from IAM , the more can be pruned, but the more violating itemsets from IM , the
more constraint checks
are required. Hence, there is a tradeoff between both strategies.
[Charts omitted. Figure 5.1: Time with AM. Figure 5.2: Checks with AM. Figure 5.3: Time with M.
Figure 5.4: Checks with M. Figure 5.5: Time with Mixed. Figure 5.6: Checks with Mixed. Each figure
plots time (ms), number of nodes (thousands) or number of constraint checks (thousands) against
selectivity (0%–100%), for CoPT+ and the respective CoPT variants.]
5.1.5 Discussion and Conclusions
In this section, we propose a new set of post-processing strategies for pushing constraints into pattern
mining, through the use of the efficient pattern-tree structure. These strategies take advantage of
constraint properties, so that we can filter earlier the frequent itemsets that satisfy each constraint and
avoid unnecessary tests. We also propose a general algorithm, named CoPT , that combines the defined
strategies and is able to push any constraint into a pattern-tree, while still taking advantage of its
properties.
Experimental results show that the algorithm is effective and efficient. It needs only a small amount
of time to push the constraint and prune the pattern-tree (when compared to the time needed by the
pattern mining algorithm), even for constraints with small selectivity, and it checks far fewer nodes and
needs less time than an approach that does not take constraint properties into account.
Despite the benefits of CoPT , it is a post-processing approach. This means that some traditional
pattern mining algorithm must run first to discover all frequent itemsets. This usually takes a long
time and results in a large number of frequent itemsets that then need to be evaluated again. A path for
improvement is to create a more balanced approach and use the strategies proposed here to filter itemsets
during the actual discovery process.
An important contribution of CoPT is the fact that it uses the same pattern-tree structure that is
used by our algorithm StarFP-Stream. However, it makes some assumptions that are not valid when we
move to a streaming environment, such as that all data are available from the beginning. In the streaming
case, new data are continuously arriving, which means that we do not know the alphabet and order of
items a priori; furthermore, we need to analyze to what extent we can remove itemsets from the tree,
given that the same itemsets can appear later on.
5.2 Pushing Constraints into a Dynamic Pattern-Tree
In this section we adapt and discuss the set of strategies proposed above for pushing constraints into
stream pattern mining, through the use of the pattern-tree structure. The problem of constrained pattern
mining over data streams is to find all approximate patterns (with estimated support higher than the
threshold) that satisfy some constraint.
We also propose a generic algorithm, called CoPT4Streams (Constraint Pushing into a Pattern-Tree
for Streams), that combines and implements these strategies and is able to dynamically discover all
patterns that satisfy any user-defined constraint. CoPT4Streams pushes constraints into the pattern-tree
structure at each batch boundary in an efficient way, by taking advantage of the properties of constraints,
and filters the patterns and potential patterns in that tree, resulting in a much smaller summary, and
therefore in less memory and time needed.
Since it is an algorithm that is applied to the pattern-tree, any data streaming algorithm can be used
along with our CoPT4Streams, provided that it uses a pattern-tree as its summary data structure.
5.2.1 Pattern-Tree
As described above for CoPT, a pattern-tree is a compact prefix tree structure that holds information
about patterns. In the streaming environment, this tree also contains information about the error associated
with each pattern.
In this context, each node of a pattern-tree contains an item, an approximate support and a maximum
error, and edges link items that occur together, forming the patterns. Therefore, each node in a pattern-
tree corresponds to an approximate pattern, composed of the items from the root to this node, and the
estimated support and error attached to this node.
5.2.2 Constraint Pushing Strategies
As we are integrating data streams and constraints, some questions arise. Note that the pattern-tree
must be updated in every batch, to renew the current approximate frequent itemsets, and therefore the
order in which the items of patterns are inserted in the tree must remain the same across batches.
1. Data are not available a priori, and so we do not know all possible items at the beginning. In the
cases where the order of items matters (e.g. for prefix-monotone constraints), new items that should
be placed between already known items may appear. Is it possible to efficiently take advantage of
constraint properties, even when the order of items changes?
2. In a static application, invalid itemsets could be removed from the tree, since they do not satisfy
the constraint (for both AM and M constraints). In a data stream, these itemsets could reappear in
following batches, and valid supersets of current invalid itemsets could also appear later (in the case
of M constraints). Can we, at some batch, remove itemsets in the tree that violate the constraint?
Given these differences, the main question is:
• Can we use the same strategies as in the algorithm CoPT?
The answer is yes to both questions above, with small adaptations, essentially because for a pattern
to appear in the pattern-tree (i.e. to be approximately frequent), all of its subsets must appear too. We
will delve into these questions further ahead.
We assume that constraints have fixed parameters (for example, min(X) < v, in which X is an itemset
and v is a fixed threshold), i.e. parameters do not depend on the number of transactions seen so far, and
do not change across different batches (e.g. we do not consider constraints like min(X) < min(all items
seen so far)). This makes the satisfaction of constraints permanent, meaning that, if an itemset satisfies
(resp. violates) a constraint in some batch, it always satisfies (resp. violates) the same constraint in any
later batch.
Anti-Monotonicity:
For pushing an AM constraint (CAM ) we can use the same strategy as used for mining static data tables
(as in CoPT ). The only difference is that we do not have to return any pattern as a result, since we are
pushing constraints at the end of a batch, to filter the pattern-tree for the next batch.
The reason we can apply the same strategy is that, for AM constraints, itemsets that violate the
constraint can be removed, because they will never satisfy the constraint. Even if they reappear in later
batches because they are frequent, they will be removed again, since they violate the constraint (answer
to question 2).
Recalling the strategy:
While performing a DFS, if the node satisfies CAM , keep it in the tree and proceed to its children; if
it violates CAM , there is no need to search its subtree because all supersets also violate the constraint.
Therefore, prune the tree and remove this node, as well as all of its children.
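This strategy can be sketched over a minimal pattern-tree. The node layout and the particular constraint max(X) ≤ v are assumptions of this sketch (any AM constraint works the same way); since every ancestor of a visited node has already passed the test, it suffices to check the node's own item.

```java
import java.util.*;

// A minimal sketch of the CAM strategy, under two assumptions: a hypothetical
// pattern-tree of integer items, and the AM constraint max(X) <= v (if an
// itemset violates it, so does every superset).
class AmPrune {
    static class Node {
        final int item;
        final List<Node> children = new ArrayList<>();
        Node(int item) { this.item = item; }
        Node child(int it) { Node c = new Node(it); children.add(c); return c; }
    }

    // DFS: a child whose item exceeds v turns every itemset through it into a
    // violator, so the child and its whole subtree are removed without further
    // tests; surviving children are visited recursively.
    static void prune(Node node, int v) {
        node.children.removeIf(c -> c.item > v);
        for (Node c : node.children) prune(c, v);
    }

    static int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }

    static int demo() {
        Node root = new Node(-1);      // virtual root, holds no item
        Node a = root.child(1);
        a.child(2).child(5);           // {1,2,5} violates max(X) <= 4
        a.child(3);
        root.child(6).child(2);        // {6} violates: subtree {6,2} goes too
        prune(root, 4);
        return size(root) - 1;         // nodes kept besides the root
    }

    public static void main(String[] args) {
        System.out.println(demo());    // 3 nodes survive: {1}, {1,2}, {1,3}
    }
}
```

Note that the subtree below node 6 is never tested: the AM property makes a single check sufficient to discard it.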
Monotonicity:
To incorporate a monotone constraint (CM ), we can also adopt a strategy similar to the one proposed for
CoPT. Since we do not have to return results, when we find a satisfying itemset, we do not need to traverse
its supersets to return them. In this sense, we save not only time on constraint checks, but also time on
traversing the tree.
Answering question 2 again: for M constraints, all itemsets with no supersets in the tree (leaves)
that violate the constraint can be removed, because they will never satisfy the constraint. Note that,
if some valid superset appears in later batches, it means that both that itemset and the superset are
frequent, and therefore both will appear in the tree, in the same branch. However, only the superset will
be returned as a pattern, because it is the only valid one. In summary, there is no need to keep an invalid
itemset in the tree while it has no valid supersets.
Recalling the strategy:
If a node satisfies CM , keep it in the tree and do not scan its subtree, because all supersets
satisfy the constraint and there is nothing to remove; if it violates CM and is a leaf node (has
no supersets), remove it, as well as all ancestors that become leaves because of this elimination.
If it violates the constraint but is not a leaf, continue the search into its children (we cannot
remove it, because the supersets of this node, its children, may satisfy the constraint).
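A matching sketch of the CM strategy, assuming nonnegative integer items and the monotone constraint sum(X) ≥ v (once an itemset satisfies it, every superset does): satisfying nodes are kept without scanning their subtrees, violating leaves are removed, and ancestors that become empty as a result fall with them.

```java
import java.util.*;

// Sketch of the CM strategy over a hypothetical pattern-tree, with the
// monotone constraint sum(X) >= v (items assumed nonnegative).
class MPrune {
    static class Node {
        final int item;
        final List<Node> children = new ArrayList<>();
        Node(int item) { this.item = item; }
        Node child(int it) { Node c = new Node(it); children.add(c); return c; }
    }

    // Returns false when this node must be removed. A node whose path-sum
    // already reaches v is kept without scanning its subtree (all supersets
    // satisfy too); a violating leaf, or a violating node whose whole subtree
    // was removed, is dropped.
    static boolean keep(Node node, int pathSum, int v) {
        int sum = pathSum + node.item;
        if (sum >= v) return true;                      // satisfies: stop scanning here
        node.children.removeIf(c -> !keep(c, sum, v));  // violating inner node: recurse
        return !node.children.isEmpty();                // leaf after pruning => remove
    }

    static int size(Node n) {
        int s = 1;
        for (Node c : n.children) s += size(c);
        return s;
    }

    static int demo() {
        Node root = new Node(0);        // virtual root contributes nothing
        Node a = root.child(2);
        a.child(5);                     // {2,5}: sum 7 >= 6, kept untested below
        a.child(1);                     // {2,1}: sum 3, violating leaf, removed
        root.child(3);                  // {3}: violating leaf, removed
        root.children.removeIf(c -> !keep(c, 0, 6));
        return size(root) - 1;          // nodes kept besides the root
    }

    public static void main(String[] args) {
        System.out.println(demo());     // 2 nodes survive: {2} and {2,5}
    }
}
```

The invalid itemset {2} stays in the tree only because it still has the valid superset {2,5} below it, exactly as the strategy prescribes.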
Succinctness:
Recall that a succinct constraint allows us to know, by looking at single items, which of them satisfy (Is)
or violate (Iv) the constraint. With this in mind, before inserting itemsets into the pattern-tree, we
can order their contents according to those two groups of items.
In this sense, succinct constraints relate to question 1, since they need the items to be sorted. However,
in this streaming environment, we do not know the overall order a priori, and therefore new items from
the first group may appear and need to be placed before all already known items from the second group
(e.g. for a CSAM , at some batch, Iv = {a} and Is = {b}; if an item c appears and belongs to Iv (so that
Iv = {a, c} and Is = {b}), and itemset abc occurs, the order of items to be inserted in the pattern-tree
should be acb).
This poses no problem, as long as the relative order of existing items does not change, because if an
itemset with new items appears in the tree, all of its subsets will also appear, and all subsets that do
not include these new items keep the same order (using the example above, the itemsets a, b, c, ab, ac,
cb and acb must all be in the tree, and, as can be noted, the itemsets without item c maintain the order,
such as ab in this case).
Therefore, we can follow the CoPT strategy proposed for static datasets, but without the need for
returning itemsets.
Recalling the strategy:
CSAM : With a SAM constraint, order items in itemsets so that Iv appears before Is. At each
batch boundary, apply the CAM strategy only on the first level of the pattern-tree. If the node
violates the constraint, remove it and its sub-tree; if the node satisfies the constraint, all of its
children will also satisfy it, because they belong to Is, so skip testing for the constraint.
CSM : In the case of a SM constraint, order itemsets so that items from Is (mandatory) appear
before items from Iv (optional). When applying the CM strategy after each batch, only apply it
until the first node from Is that satisfies the constraint is found (since then all supersets satisfy
it); otherwise, apply it until the first node from Iv, because if we arrive at a node from this
group and still need to test the constraint, it means the constraint has not been satisfied by the
items from Is, and the following items will not satisfy it either, because they are optional. In
this case, do not test this node or any of its children, and remove them from the tree.
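The incremental ordering discussed above, where new items join Iv or Is without disturbing the relative order of already known items, can be sketched as follows (class and method names are illustrative, not the thesis's implementation). It reproduces the CSAM example: Iv = {a}, Is = {b}, a new item c joining Iv, and itemset abc being ordered as acb.

```java
import java.util.*;

// Sketch of the item ordering for succinct constraints in a stream: items are
// appended to their group's order as they first appear, and every itemset is
// ordered by concatenating the Iv order with the Is order.
class SuccinctOrder {
    private final List<String> ivOrder = new ArrayList<>(); // items violating the constraint alone
    private final List<String> isOrder = new ArrayList<>(); // items satisfying it alone

    // Register a newly seen item in its group, after the items already known.
    void register(String item, boolean satisfiesAlone) {
        (satisfiesAlone ? isOrder : ivOrder).add(item);
    }

    // Order an itemset: all items from Iv first (for CSAM), then items from Is.
    List<String> order(Set<String> itemset) {
        List<String> out = new ArrayList<>();
        for (String it : ivOrder) if (itemset.contains(it)) out.add(it);
        for (String it : isOrder) if (itemset.contains(it)) out.add(it);
        return out;
    }

    static String demo() {
        SuccinctOrder o = new SuccinctOrder();
        o.register("a", false);  // Iv = {a}
        o.register("b", true);   // Is = {b}
        o.register("c", false);  // new item joins Iv; relative order of a, b unchanged
        return String.join("", o.order(new HashSet<>(Arrays.asList("a", "b", "c"))));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints acb
    }
}
```

Because c is appended after a inside its group, itemsets without c (such as ab) keep their previous order, which is the stability property the text relies on.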
Prefix-Monotonicity:
Prefix-monotone constraints can only be treated as AM (CPAM ) or M (CPM ) constraints if items are
ordered in a particular way. Answering question 1: as for succinct constraints, this order is not a
problem, as long as the relative order of existing items does not change.
Therefore, we just need to follow the same approach as CoPT.
Sort the itemsets according to the correct order before inserting them in the pattern-tree, and
apply the CAM or CM strategy, respectively.
Mixed-Monotonicity:
As for mixed-monotone constraints (CMix), the answer to questions 1 and 2 follow the same reasoning
as explained above for prefix-monotone and succinct constraints.
In this sense, we can also apply CoPT.
Divide the items, as they appear, into two groups: anti-monotone IAM and monotone IM , and
put IM before IAM in the tree. Then start with the CM strategy, until a node satisfies it, or a
node from IAM appears. From that node on, apply the CAM strategy to all of its supersets
(children) and prune invalid nodes from its sub-tree.
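As a concrete case, consider the mixed constraint sum(X) ≥ v over items whose values may be negative (as in the dataset variant used later in the experiments): positive items can only raise the sum (the monotone group IM), while negative items can only lower it (the anti-monotone group IAM). The required ordering can be sketched as follows, with illustrative names:

```java
import java.util.*;

// Sketch of the item partition for a mixed constraint such as sum(X) >= v
// over possibly negative integer items: nonnegative items form the monotone
// group IM and are placed before the anti-monotone group IAM of negative items.
class MixedOrder {
    static List<Integer> order(List<Integer> itemset) {
        List<Integer> im = new ArrayList<>();   // IM: items that can only help the sum
        List<Integer> iam = new ArrayList<>();  // IAM: items that can only hurt it
        for (int it : itemset) (it >= 0 ? im : iam).add(it);
        List<Integer> out = new ArrayList<>(im);
        out.addAll(iam);                        // IM before IAM in the tree
        return out;
    }

    static String demo() {
        return order(Arrays.asList(-3, 5, 2, -1)).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [5, 2, -3, -1]
    }
}
```

With this order, a branch starts in IM (where the CM strategy applies) and switches permanently to IAM (where the CAM strategy applies), which is exactly what the mixed strategy needs.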
5.2.3 Algorithm CoPT4Streams
Based on the discussion above, we propose an extension of CoPT, called CoPT4Streams (Constraint
Pushing into a Pattern-Tree for Streams), that is able to efficiently and effectively push any constraint
into a pattern-tree, when mining data streams.
The idea is to run CoPT4Streams over the pattern-tree resulting from the mining of each batch, and
to use the resulting smaller tree to mine the next batches. By doing this, the algorithm is able to filter
what is really interesting for the users, and to keep smaller summary structures, which results in
improvements in both memory and time needed, as well as in the number of patterns returned.
Since constraint satisfaction is permanent, we can perform an extra optimization (besides using con-
straint properties) and compute the satisfaction of each node only once, e.g. by keeping a flag in each
node indicating whether it satisfies or violates the constraint. Thus, we can avoid re-checking the
constraint for nodes that remain in the tree from one batch to another (nodes closer to the root).
Essentially, to push a constraint, CoPT4Streams works as follows. For each batch, each approximate
pattern discovered by the streaming algorithm is ordered according to the order of items for that
constraint, if one exists, and inserted in the tree (if there is no such order, items are put in the
pattern-tree in support-descending order [HPY00]).
At each batch boundary, we can push the constraint C into the pattern-tree by scanning the tree
according to the constraint property. For each node, if the node is new in the tree (i.e. it was never
checked for the constraint), then, in the case of succinct or mixed constraints, we can first see whether
the item in the node belongs to the second group of items. If so, the node can be discarded (the
constraint was not satisfied by the first group of items), along with its children. Otherwise, and for the
other types of constraints, we check the constraint (and store the result in the satisfaction flag of the
node).
When we know the result of the constraint checking:
1. If the itemset corresponding to this node satisfies C:
(a) C is mixed and we can change the strategy to AM ;
(b) C is monotonic and no child needs testing; or
(c) C is succinct AM , and also no child needs testing (only the first level of the tree is checked).
2. If the itemset violates C, it is not a pattern, and, if C is AM (including SAM and PAM), we can
prune the tree from here.
After checking the constraint, if the node was not pruned, we can test its children. Finally, after
pushing C into the children, if the node is not a pattern and is a leaf, we can remove it. Note that this
final node pruning is performed for every constraint, even for constraints with no “nice” properties; in
this latter case, however, all nodes need to be tested.
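The per-node satisfaction flag can be sketched as follows (the concrete node layout is an assumption of the sketch): because satisfaction is permanent across batches, a node that survives from one batch to the next never pays for a second constraint evaluation.

```java
// Sketch of the cached satisfaction flag kept in each pattern-tree node.
// Since constraint parameters are fixed, the verdict computed in the first
// batch where the node appears remains valid in every later batch.
class FlaggedNode {
    enum Flag { UNKNOWN, SATISFIES, VIOLATES }

    private Flag flag = Flag.UNKNOWN;
    int checksPerformed = 0;

    // 'evaluate' stands for the (possibly expensive) constraint evaluation on
    // the itemset ending at this node; it runs at most once per node.
    boolean check(boolean evaluate) {
        if (flag == Flag.UNKNOWN) {            // first batch this node is seen in
            checksPerformed++;
            flag = evaluate ? Flag.SATISFIES : Flag.VIOLATES;
        }
        return flag == Flag.SATISFIES;         // later batches reuse the cached verdict
    }

    static int demo() {
        FlaggedNode n = new FlaggedNode();
        for (int batch = 0; batch < 100; batch++) n.check(true);
        return n.checksPerformed;              // the constraint was evaluated only once
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1
    }
}
```

Nodes close to the root tend to persist across many batches, so this is where the saving is largest.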
5.2.4 Performance Evaluation
The goal of these experiments is to analyze the behavior of our algorithm in the presence of a data stream
and all types of constraints, and to show that CoPT4Streams is able to effectively and efficiently push
them into a pattern-tree at each batch, taking advantage of their properties.
In these experiments, similar to the experiments with CoPT, we use a database automatically gen-
erated by the program developed at IBM Almaden Research Center [AS94]. The dataset has 100k
transactions, with an average of 10 items per transaction and a domain of 1000 items (with values from
zero to 1000). In addition, in order to test the mixed-monotone constraint, we consider an equivalent
dataset but with negative values (by making values vary from −500 to 500).
Recall that the higher the selectivity, the more we can filter and the fewer patterns are returned;
conversely, the lower the selectivity, the more patterns need to be kept and returned (and the closer we
get to the problems of unconstrained techniques). Therefore, we test CoPT4Streams with several
constraints with different selectivities, varying from 10% to 90%.
We also tested several minimum supports and errors, and since the results are consistent, we present the
results for a support of 0.1% and an error of 0.01% (a common way to define the error is ε = 0.1σ).
The results presented correspond to the average of several runs with different constraints of equivalent
selectivity. Also, as a baseline, we test our algorithm against CoPT4Streams+, a version that checks all
nodes for the constraints (i.e. that does not take constraint properties into account).
The data stream algorithm used was SimpleFP-Stream (a simplification of FP-Streaming [HPY00],
presented and used in Chapter 3 for evaluating StarFP-Stream). It was chosen because it is an efficient
algorithm for single-table data streams that does not suffer from the candidate generation problem,
and keeps current patterns in a pattern-tree. The size of each batch is defined by |B| = 1/ε, which
corresponds to 100 batches of 1000 transactions each. The computer used to run the experiments
was an Intel Core i7 CPU at 2GHz (Quad Core), with 8GB of RAM, running Mac OS X Server 10.7.5,
and the algorithm was implemented in Java (JVM version 1.6.0_37).
By definition, data streaming techniques return more patterns than traditional algorithms for static
datasets, and the higher the error allowed, the more patterns are returned and the less accurate they
are. By incorporating constraints into data streams, we can filter not only the patterns returned, but
also the patterns that must be kept in memory, improving the performance of the algorithms in terms
of time, memory and results.
Experimental Results
We first analyze the average size of the pruned pattern-tree. When applying constraints, more itemsets
can be discarded, and therefore the pattern-tree is smaller than in an unconstrained environment. In
turn, a smaller pattern-tree in every batch may have an impact on the time needed to update the tree
and on the number of constraint checks the algorithm needs to make. Remember that the update time
is perceived as the time needed to process one batch of transactions until the complete update of the
pattern-tree (Section 3.4.2). Since the trends are the same whether a constraint is AM or M, figs. 5.7
to 5.9 show the average results in the presence of AM (an average over AM, SAM and PAM) and M
(an average over M, SM and PM) constraints. The only difference is that, in the unconstrained case, as
well as for the simple AM and M constraints, there is no need to sort the items in the patterns.
On the other hand, succinct, prefix- and mixed-monotone constraints require that items are put in the
pattern-tree sorted according to some specific order. This means that all itemsets must be sorted
beforehand, which results in a time overhead that depends on that order.

Figure 5.7: Average size of the pattern-tree, per batch, after pushing the constraint.

Figure 5.8: Average time needed per batch to update the pattern-tree.

Figure 5.9: Average number of constraint checks per batch.
As expected, as the selectivity increases, more itemsets can be removed from the tree, and therefore the
pattern-tree is smaller, as is the time needed to update it. We can also confirm in fig. 5.7 that AM
constraints allow us to prune many more itemsets than M constraints, leading to much smaller pattern-
trees. This is explained by the fact that itemsets that violate M constraints but have supersets that
satisfy them cannot be discarded from the tree. For similar reasons, AM constraints need, on average,
less time to update the pattern-tree than M constraints. Fig. 5.8 also shows that pushing AM or M
constraints into the pattern-tree results in a decrease of the update time, even when the selectivity is
low. Since CoPT4Streams+ needs to check all nodes for the constraint, it needs somewhat more time
to update the pattern-tree.
In fig. 5.9, we analyze the average number of constraint checks. We can state that pushing constraints
is always better, even with the naive approach, CoPT4Streams+, due to the smaller pattern-trees carried
from one batch to another. Nevertheless, taking constraint properties into account to avoid constraint
checks (CoPT4Streams) requires significantly fewer constraint checks. It is interesting to see that the
trends are the same for both the static (Section 5.1.4) and streaming cases. As the selectivity increases,
the number of constraint checks for AM constraints decreases, since the number of itemsets that can be
discarded increases. On the contrary, for M constraints, the number of tests increases along with the
selectivity. This happens because the M strategy only stops checking when itemsets satisfy the
constraint, and if there are more items that violate it, more itemsets need to be tested.
The behavior of mixed constraints is consistent with the trends presented above: pushing them into
the pattern-trees results in much smaller trees, and therefore fewer constraint checks and less update
time, compared with both the unconstrained and the CoPT4Streams+ algorithms. As the selectivity
increases, the number of patterns in the trees decreases, as does the time needed to process them. The
number of constraint checks tends to be constant, independently of the selectivity of the constraints.
5.2.5 Discussion and Conclusions
In this section, we analyzed a set of strategies for pushing constraints into stream pattern mining, through
the use of the efficient pattern-tree structure. These strategies take advantage of constraint properties, so
that we can filter out, as early as possible, the frequent itemsets that do not satisfy each constraint, and
avoid unnecessary tests. Doing this for each batch of transactions greatly decreases the size of the
pattern-trees that need to be maintained in this streaming environment, and therefore helps to focus the
pattern mining task and to return far fewer, but more interesting, results. We also propose a general
algorithm, named CoPT4Streams, that combines the defined strategies and is able to dynamically push
any constraint into a pattern-tree, while still taking advantage of constraint properties.
Experimental results show that the algorithm is effective and efficient. The pattern-trees maintained
are much smaller, which generally results in less time needed. It also checks far fewer nodes and needs
less time than an approach that does not take constraint properties into account.
Despite the benefits of CoPT4Streams, like CoPT it is a post-processing approach (applied after the
processing of each batch), which requires that an unconstrained algorithm run first to discover all possible
frequent patterns. This usually takes considerable time, and results in a large quantity of frequent
itemsets that need to be put in the pattern-tree and evaluated again later on. A more balanced approach
is to adapt the strategies proposed here to filter itemsets during the actual discovery process.
An important contribution of CoPT4Streams is that it is able to push constraints into the same
pattern-tree structure that is used by our algorithm StarFP-Stream, and to maintain it in the streaming
environment. However, it cannot be directly applied to our multi-dimensional domain, since there are
differences in the content of the pattern-trees. While in the traditional case we have itemsets that
correspond to transactions of some entity, in the case of a star schema we have transactions of more than
one type of entity, and we are in the presence of both transactional and non-transactional data. This
requires some adaptations and a deeper analysis of these differences.
5.3 Towards the Incorporation of Constraints into
Multi-Dimensional Mining
As seen throughout this dissertation, multi-relational pattern mining algorithms are able to directly mine
more than one table, and to find patterns that relate the characteristics of all tables. However, they are
still not able to push constraints into the discovery process. Conversely, although constrained mining
algorithms are able to incorporate constraints to deliver more interesting results, they cannot deal with
more than one table. There is therefore a need for the integration of these two areas of pattern mining.
This integration is not straightforward, since the two approaches view and treat data differently.
On the one hand, most of the existing constrained techniques are designed for mining transactional data.
On the other hand, in the case of a star schema, we are dealing with two types of data: transactional
and non-transactional. While the fact table records transactional data (the business events), dimensions
store non-transactional data (the characteristics of business entities). Since there are differences in mining
these two kinds of data tables, existing constrained algorithms cannot be directly used over star schemas,
and existing multi-relational algorithms cannot be directly used for pushing constraints.
To the best of our knowledge, there is no work that makes this integration. Hence, we discuss in this
section some naturally arising questions:
• Is it possible to integrate these two areas of pattern mining?
• What are the differences and emerging challenges?
• Can we use traditional constraints in this multi-dimensional environment?
• And finally, can existing algorithms be applied or adapted to find frequent constrained patterns in
a star schema? If so, how?
We argue that it is possible to combine these two paradigms, and we answer these questions in the
course of this section.
We first describe the differences in mining transactional and non-transactional data, and then how
these differences can be overcome. We then discuss how constraints may be interpreted in this multi-
dimensional domain, by proposing Star Constraints, and also how they can be introduced into the mining
process.
5.3.1 Transactional vs. Non-Transactional Data
Mining patterns on transactional and non-transactional data is different, but those differences stem only
from the interpretation and meaning of items and patterns. Fig. 5.10, along with Table 5.1, shows these
variations. While, in the transactional case, each item corresponds to an entity (e.g. a product), in the
non-transactional case we are mining pairs (attribute, value) (e.g. price = 30€). This means that patterns
have different interpretations in each case: sets of entities frequently transacted together, or sets of
characteristics common to a frequent number of entities, for the transactional and non-transactional
cases, respectively.
These differences are not visible in traditional pattern mining, since algorithms work with items and
itemsets, independently of their meaning. Therefore, any algorithm is able to run over both data types
and discover the existing patterns. However, the distinction matters in a constrained environment, since
we can restrict both entities and their attributes, and it is expected that every element of a pattern can
be tested against the constraint.
Let us analyze this in more detail. Assume a transactional table (e.g. Fig. 5.10, left). Each element of
a pattern is an entity, and therefore we can check the value of any attribute for every element.
Figure 5.10: A transactional data table (left), modeling the products that were bought together in the same
transaction, and the associated non-transactional data table (right), describing the characteristics of those
products.
Table 5.1: Differences in mining transactional and non-transactional data.

            Transactional                              Non-Transactional
Cell/Item   Entity                                     (Attribute = Value)
            e.g. p1                                    e.g. (Price=30€)
Row         The set of entities transacted at the      The set of characteristics of one single
            same time                                  entity
Itemset     A set of entities transacted together      A set of characteristics of some entity
            e.g. X = {p1, p4}                          e.g. X = {(Price=30€), (Color=Black)}
Pattern     A set of entities transacted together      A set of characteristics shared by a
            frequently, e.g. X ∧ sup(X) ≥ σ × N        frequent number of entities
This means, for example, that if we only want products with price lower than 20€, we can just test the
price of every product in each pattern and eliminate those patterns that contain any product with a
price above the maximum (note that this can also be done during the discovery process, in a similar way,
instead of as a post-processing step). Using the example in Fig. 5.10, a pattern like {p1, p2} is rejected,
because Price(p1) > 20€, but a pattern such as {p2, p3, p4} is accepted, since all prices satisfy the
constraint.
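This element-wise test can be sketched directly, with the prices taken from Fig. 5.10 (the class and helper names are illustrative):

```java
import java.util.*;

// Post-processing check for a transactional pattern: every product in the
// pattern must have Price < 20 EUR (prices from Fig. 5.10).
class ProductFilter {
    static final Map<String, Integer> PRICE = new HashMap<>();
    static {
        PRICE.put("p1", 30); PRICE.put("p2", 10); PRICE.put("p3", 5);
        PRICE.put("p4", 15); PRICE.put("p5", 20);
    }

    // A single entity violating the attribute constraint rejects the pattern.
    static boolean accept(List<String> pattern, int maxPrice) {
        for (String p : pattern)
            if (PRICE.get(p) >= maxPrice) return false;
        return true;
    }

    static String demo() {
        return accept(Arrays.asList("p1", "p2"), 20) + " "
             + accept(Arrays.asList("p2", "p3", "p4"), 20);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // false true: {p1,p2} rejected, {p2,p3,p4} accepted
    }
}
```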
If we assume a non-transactional data table, we cannot reason in the same way, because the elements of
patterns are attributes, not entities. Following the example above, but using the non-transactional table,
we could have frequent patterns like {(Price = 5€), (Color = Blue)} and {(Color = Black)} (blue 5€
products and black products are frequent). If we wanted to apply the same constraint, Price < 20€, we
could say that the first pattern satisfies the constraint (since it is an intersection of 5€ products with
others), but we could not guarantee that the second one resulted only from processing products with
price lower than 20€. This means that pushing constraints into non-transactional data cannot be done,
as simply as before, as a post-processing step. In this case, when restricting some attribute, the entire
rows (products) where that attribute does not satisfy the constraint should not be considered for
support, to guarantee that all attributes in patterns result only from the processing of valid entries. Note
also that we are not mining entities that were transacted at the same time, and therefore constraints
like sum(X.attribute) ≥ v (or other aggregate constraints) do not make sense when mining only non-
transactional data.
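The difference can be sketched as follows: the support of (Color = Black) is counted only over rows that pass the attribute constraint, so it cannot be inflated by invalid products (data from Fig. 5.10; the integer encoding of colors is an assumption of the sketch):

```java
// Sketch of constraint pushing over non-transactional data: rows violating the
// attribute constraint are dropped BEFORE support counting, so every surviving
// (attribute, value) pair comes only from valid entities.
class RowFilter {
    // rows of the Product dimension: {price, colorCode} with 0=Black, 1=Red, 2=Blue
    static final int[][] ROWS = {{30, 0}, {10, 1}, {5, 2}, {15, 0}, {20, 2}};

    // Support of (Color = Black) counted only over rows with Price < maxPrice.
    static int blackSupport(int maxPrice) {
        int support = 0;
        for (int[] row : ROWS)
            if (row[0] < maxPrice && row[1] == 0) support++;
        return support;
    }

    public static void main(String[] args) {
        System.out.println(blackSupport(20));                // 1: only p4 (15 EUR) counts
        System.out.println(blackSupport(Integer.MAX_VALUE)); // 2: unconstrained, p1 and p4
    }
}
```

Post-processing the pattern {(Color = Black)} could not tell these two supports apart, which is exactly why the filtering must happen before counting.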
5.3.2 Constraints in Star Schemas
In the case of a star schema, we have both data types in synergy. An example of a star schema,
corresponding to the example above, is shown in Fig. 5.11.
Figure 5.11: A star schema, containing both transactional (fact table) and non-transactional data (dimensions
Product and Customer).
In a star, we have more than one entity type (e.g. products and customers), represented by each
dimension. Each dimension describes the set of entities of that type (e.g. each product) through the
use of some attributes (e.g. price and color). The fact table relates the entities of each dimension with
each other and with a set of measures (e.g. final price and quantity) that characterize the corresponding
transaction (e.g. the sale). In this sense, we can define a set of Star Constraints, composed of a constraint
for each of these aspects: entity type, entity, attribute and measure.
Let dim(it) be a mapping function that returns the dimension to which item it belongs (e.g. p1
belongs to dimension Product, therefore dim(p1) = Product). Let also dim.attr(it) be a function that
gives the value of attribute attr (which belongs to dimension dim) associated with item it (e.g.
Product.Price(p1) returns 30€, the price of product p1).
Constraints over entity type (dimension constraints)
Since we have more than one entity type, we may be interested in the presence or absence of certain
types. So, for all patterns X, the dimension of each element should be valid:
C(X) = (∀ el ∈ X . dim(el) ∈ {D1, ..., Dj}). (or ∉)
For example, we may only want to mine products and customers, and ignore other dimensions (i.e.
dim(el) ∈ {Product, Customer}). Therefore, a pattern like {p1, c2} (customer c2 buys product p1) is
accepted, but {p1, c2, t1} (customer c2 buys product p1 in territory t1) would not be, because it contains
one element that does not belong to the accepted dimensions. Note that, if {p1, c2, t1} is frequent, its
subsets are also frequent, such as {p1, c2}, which will be returned as patterns. Therefore, by eliminating
the first itemset because it is not interesting, we will not lose interesting patterns.
Following our proposed framework, dimension constraints are succinct constraints, since we know, at
each point in time, all possible accepted patterns, based on the current alphabet. They can also be seen
as conceptual constraints, since they restrict the dimension associated with items. And finally, they are
designed for multi-dimensional datasets.
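As a minimal sketch, a dimension constraint reduces to a membership test over every element of a pattern (hypothetical Python names; `dim_of` stands in for the dim mapping function):

```python
def satisfies_dimension_constraint(pattern, dim_of, accepted):
    """C(X): every element of the pattern X must belong to an
    accepted dimension."""
    return all(dim_of[el] in accepted for el in pattern)

# Items from the running example and their dimensions.
dim_of = {"p1": "Product", "c2": "Customer", "t1": "SalesTerritory"}
accepted = {"Product", "Customer"}

print(satisfies_dimension_constraint({"p1", "c2"}, dim_of, accepted))        # True
print(satisfies_dimension_constraint({"p1", "c2", "t1"}, dim_of, accepted))  # False
```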
Constraints over entities
Additionally, we may restrict the presence of some specific entities (or instances) of one type. These
constraints can be made over each dimension or over the fact table, since both have information about
entities:
C(X) = ({en1, ..., enj} ⊆ X). (or ⊈)
For example, we may be interested only in the specific products p1 and p2. Hence we could define the
constraint {p1, p2} ⊆ X to filter out all patterns that do not contain both desired products.
To situate entity constraints in the framework, they correspond to the traditional item constraints
that have a succinct property, and they are applied to star schemas.
Constraints over attributes
Regarding the attributes of dimensions, we may want to limit the value of some attribute (of one dimen-
sion) for each entity:
C(X) = (∀ el ∈ X . dim.attr(el) ≤ v). (or ≥, =, ≠)
For example, we may only want customers under 30 years old, and therefore, all customers in patterns
must satisfy the constraint Customer.Age(el) < 30.
Note that, if the attribute is not numeric, we may similarly define the constraint as:
C(X) = (∀ el ∈ X . dim.attr(el) ∈ {v1, ..., vj}). (or ∉)
E.g. if we want products with color black or blue, i.e. Product.Color(el) ∈ {Black, Blue}.
These simple attribute constraints are value constraints, and are also succinct.
We may also want to limit the aggregate value of some attribute, for one set of entities:
C(X) = (agg(dim.attr(X)) ≤ v). (or ≥, =, ≠)
where the aggregate function ∈ {sum, avg, min, max, ...}, and dim.attr, applied to an itemset X, returns
all values for that attribute attr, for all items in X.
For example, if we are interested in patterns resulting from sales where the sum of their products’
price is less than 20 €, we could use the constraint sum(Product.Price(X)) < 20 €. A pattern such as
{p1, p2} is not accepted (the sum of the prices is 40 €), but {p2, p3} is (the sum is 15 €).
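This aggregate attribute constraint can be sketched as follows (a hypothetical Python illustration; `prices` stands in for Product.Price):

```python
# Product prices from the running example (in euros).
prices = {"p1": 30, "p2": 10, "p3": 5, "p4": 15, "p5": 20}

def satisfies_sum_constraint(pattern, values, limit):
    """C(X) = (sum(dim.attr(X)) < limit): the summed attribute values
    of all items in the pattern must stay below the limit."""
    return sum(values[it] for it in pattern) < limit

print(satisfies_sum_constraint({"p1", "p2"}, prices, 20))  # False (sum is 40)
print(satisfies_sum_constraint({"p2", "p3"}, prices, 20))  # True  (sum is 15)
```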
These constraints correspond to aggregate constraints; they also have nice properties, depending
on the aggregate function, but should only be applied to the transactional data.
Constraints over measures (fact constraints)
Measures are (mostly) numeric values that characterize the business events or transactions. Therefore,
fact constraints can only be applied to the transactional data in the fact table. And in order to incorporate
them, we need to be able to track back the transactions that originated each pattern, i.e. the transactions
that give support to a pattern (e.g. pattern {p2} occurred in orders 1 and 2, but pattern {p1, p2} occurred
only on order 1). By having these transactions, we can retrieve the correct value of a measure.
Let us denote measure(trans, it) a function that retrieves from the fact table the value for measure
measure corresponding to transaction trans and item it (following the same example, FPrice(1, p2) =
10 €, since the final price of product p2 in sales order 1 was 10 €, and similarly, FPrice(2, p2) = 9 €).
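The measure function can be sketched as a lookup keyed by transaction and item (a hypothetical in-memory fact table with the values of the example):

```python
# Hypothetical in-memory fact table: (order, product) -> measures.
FACTS = {
    (1, "p1"): {"FPrice": 25, "Qnt": 1},
    (1, "p2"): {"FPrice": 10, "Qnt": 1},
    (1, "p3"): {"FPrice": 5,  "Qnt": 3},
    (2, "p2"): {"FPrice": 9,  "Qnt": 2},
}

def measure(trans, item, name):
    """measure(trans, it): the value of measure `name` for the fact
    relating transaction `trans` and item `it`."""
    return FACTS[(trans, item)][name]

print(measure(1, "p2", "FPrice"))  # 10
print(measure(2, "p2", "FPrice"))  # 9
```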
Using fact constraints, we can limit both the value and the aggregate value of some measure. In the
first case, the value of the measure must be valid for all transactions that gave support to a pattern:
C(X) = (∀ trans, el ∈ X . measure(trans, el) ≤ v). (or ≥, =, ≠)
As an example, we may only want sales of products that were bought in sets of 2 or more, at the
same time (Qnt(trans, el) ≥ 2).
In the second case, the aggregate value of the measure in question must be valid for the set of
transactions that gives support to the pattern:
C(X) = (∀ trans ∈ X . agg(measure(trans)) ≤ v), (or ≥, =, ≠)
with measure(trans) the set of measure values for all elements of the pattern.
We may be interested only in sales of more than 4 products at the same time (sum(Qnt(trans)) ≥ 4).
Measure constraints are value or aggregate constraints, that are designed for the measure attribute
present in the fact table. They also have nice properties, with a reasoning similar to attribute constraints,
but they may only be applied to the transactional data in a star schema.
5.3.3 Pushing Star Constraints into Pattern Mining over Star Schemas
In order to push the above constraints into the discovery of frequent patterns over star schemas, we
may take different approaches: mine only one non-transactional data table (one dimension), mine only
transactional data (the fact table), or mine both data at the same time. Clearly, the last approach is
more difficult, but the one that fulfills the goal of multi-dimensional pattern mining.
We discuss below what is the difference between these approaches, namely, what patterns are expected
to be obtained, what constraints can be used, and how can they be incorporated in the search process,
as well as how existing algorithms should be adapted.
Mining one dimension
By mining one single dimension, we are mining its attributes, and therefore we are able to find common
characteristics of the respective business entity (e.g. most male customers are over 30 years old, or
blue products are usually cheaper than others). We can even go one step further, and use the fact table
to calculate the support of each entity before mining the dimension. This way, we can find the common
characteristics of the most transacted entities.
When mining one dimension, we can apply constraints over entities and constraints over the value of
this dimension’s attributes, and therefore limit the discovered characteristics to what is really interesting.
It does not make sense to apply entity type constraints, since we are mining only one dimension, neither
aggregate constraints, since there are no co-occurrences. We also cannot apply measure constraints,
because there are no transactions.
As explained in section 5.3.1, there is no algorithm designed for non-transactional data, like di-
mensions. However, and since these constraints (entity and attribute value) are mostly succinct anti-
monotonic, we may take a very simple approach and eliminate, from the beginning, all entities (rows)
that do not satisfy the constraint, and then apply any single table pattern mining algorithm. For example,
for dimension Product and for the attribute constraint Price < 20 €, we may eliminate all rows of
products with a higher price, as a pre-processing step, because no pattern with one of those products
will satisfy the constraint (here we want to know the common characteristics of cheap products).
The same can be done for entity constraints.
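This pre-processing strategy can be sketched as a simple row filter (hypothetical Python names; any single-table miner would then run on the result):

```python
def prefilter_dimension(rows, constraint):
    """Drop, up front, every entity (row) of the dimension that violates
    a succinct anti-monotonic constraint; any single table pattern
    mining algorithm can then be applied to what remains."""
    return [row for row in rows if constraint(row)]

# Product dimension rows from the running example (prices in euros).
product_rows = [
    {"Product": "p1", "Price": 30, "Color": "Black"},
    {"Product": "p2", "Price": 10, "Color": "Red"},
    {"Product": "p3", "Price": 5,  "Color": "Blue"},
]

cheap = prefilter_dimension(product_rows, lambda r: r["Price"] < 20)
print([r["Product"] for r in cheap])  # ['p2', 'p3']
```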
Mining the fact table
When mining the fact table, we are only mining transactional data, i.e. the entities transacted together.
This allows us to find the sets of entities that co-occur frequently. Existing constrained algorithms deal
with only one single entity type, and this means that all entities of a pattern belong to the same type,
and therefore all can be tested for the constraint. In the case of a fact table, entities in patterns may
belong to the same entity type (e.g. product p1 is usually bought along with product p2) or to different
types (e.g. customer c1 often buys product p1).
To apply constraints over the entity type, we may simply eliminate or ignore all entities that do not
satisfy the constraint. For example, if we only want to mine products, we may ignore other dimensions,
such as customers, and all entities in the fact table belonging to those dimensions.
Existing constrained algorithms can be used over the fact table for introducing entity and attribute
constraints. In this case, the constraint is checked for the entities that belong to the dimension of that
attribute, and the value for the attribute in question is retrieved in the corresponding dimension table.
Measure constraints are different from others, since instead of a property of an entity, they are a
property of a transaction. One hypothesis is to consider measures as entities, and use existing constrained
algorithms to discover co-occurrences of measures (like for attributes). By doing this, and considering
the quantity measure, we could find, for example, that whoever buys 4 units of the same product usually
buys 3 units of another product, in the same sale.
Another approach would be to associate to each pattern the transactions that gave it support (e.g. a
pattern {p2} occurred in the sale orders {1, 2, ...}), as is being done in areas like genome analysis [MPP07,
SV11]. By doing this, we could apply the measure constraints by retrieving the value of the measure from
the fact table, based on the transaction numbers and entities, and checking if it satisfies the constraint.
Using the example, we could find that whoever buys 4 units of product p1 usually buys 3 units of product p2. By
keeping the transactions, it would be possible to incorporate measure constraints, either during the
discovery process, or as a post-processing step, with some adaptations of existing traditional constrained
algorithms. However, this approach goes against the philosophy of data streams, in which records can
only be seen once, and therefore could not be applied for finding patterns on growing star schemas.
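The idea of associating each pattern with its supporting transactions can be sketched with tid-lists (hypothetical Python names, using the orders of the running example):

```python
from collections import defaultdict

def tid_lists(transactions):
    """Map each item to the set of transaction ids that support it.
    The supporting transactions of an itemset are then the
    intersection of its items' tid-lists."""
    tids = defaultdict(set)
    for tid, items in transactions.items():
        for it in items:
            tids[it].add(tid)
    return tids

orders = {1: {"p1", "p2", "p3"}, 2: {"p2", "p4"}, 3: {"p1", "p3", "p4", "p5"}}
tids = tid_lists(orders)

print(sorted(tids["p2"]))               # [1, 2] -> {p2} occurred in orders 1 and 2
print(sorted(tids["p1"] & tids["p2"]))  # [1]    -> {p1, p2} occurred only in order 1
```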
Mining the star
By mining both dimensions and fact table, we are able to discover how the common characteristics of
different entity types relate to each other, i.e. how the elements of both dimensions co-occur. We could
find, for example, that blue products are often bought by male customers.
The constraints described above may all be used when mining the whole star. However, here we are
not mining entities but attributes, which means that we cannot directly apply existing constrained
algorithms and strategies. Still, we can look at each type of constraint and devise a way of pushing it.
For entity type constraints, we can simply eliminate or ignore the dimensions in question. This can be
done as a post-processing step, by discarding all itemsets that contain some item (a pair attribute–value)
whose attribute belongs to a non-accepted dimension. Or it can be done beforehand, by only exploring
the entities of the fact table that belong to accepted dimensions.
For entity constraints, we should discard and not count for support all rows of the fact table that
contain invalid entities. This could be done during the first steps of the discovery process: for each fact,
only count the support for its entities if they are all valid (or else, discard the entire fact). Introducing
these constraints as a post-processing step is not straightforward. Since we process all facts, we will
probably have patterns that result from the processing of invalid entities. We would need, for example,
to keep track of the entities that support each item in each pattern, but it requires much extra processing
and memory. We could also keep track of the facts that give support to each pattern, and when testing
the constraint, check if the entities transacted in those facts are valid. If not, discard the pattern. Still,
this approach is not appropriate for star streams, and even for static star schemas, it is less efficient than
incorporating entity constraints during the mining process.
Limiting the value of some attribute is trickier, since patterns are sets of pairs (attribute, value), that
may mix different attributes of an entity, as well as attributes of different entity types. So, if we want
to limit the value of some attribute of a specific dimension, all entities of that dimension whose attribute
value does not satisfy the constraint should be ignored. This means that both the rows corresponding to
those entities in the dimension in question, as well as the rows of the fact table that contain them, should
not count for support. As an example, if we have the constraint Customer.Age(el) < 30, customer c2
violates the constraint, and therefore no attribute of c2 should count for support, as well as no order
made by this customer. We could, for example, when mining each fact, not explore entities whose
constrained attribute violates the constraint. As a post-processing step, it requires keeping the facts
that support each pattern and, when testing the constraint, checking whether the entities of those facts
satisfy it (e.g. whether all customers of those facts are under 30 years old). If we want to constrain the
aggregate value of some attribute, we can take the same post-processing approach.
Finally, measure constraints can be integrated in a way similar to that described in the previous section,
since they refer to a transaction and not to an entity.
5.3.4 Discussion
We can now answer the questions posed in the beginning of the section: Is it possible to integrate
constrained mining with star schemas? Yes. We showed here that it is possible and important to
integrate the multi-dimensional and constrained paradigms. By combining these two areas, it is possible
to improve the results of multi-dimensional techniques, not only by limiting the number of patterns, but
also by focusing these results on user needs, defined by the means of constraints.
What are the emerging challenges? Each one of these areas has its own set of challenges, and therefore
joining these paradigms results in a mix of them. The main ones are the fact that we are dealing with
more than one table at the same time, and that a star schema usually contains both transactional and
non-transactional data. This hinders the use of existing constrained algorithms for mining the whole star as
one, as well as the adaptation of multi-dimensional algorithms for pushing existing constraints.
Can we use traditional constraints in this multi-dimensional environment? Traditional constraints
are constraints over entities (or over their values for some attribute). Since we always have transactional
data in a star (the fact table), representing the transactions of entities, we can also apply these constraints.
However, we are also often in the presence of more than one entity type, and of non-transactional data,
which means that new constraints need to be defined for this environment. In this thesis we proposed
four types of constraints: entity type, entity, attribute and measure constraints.
And finally, can existing algorithms be applied or adapted to find frequent constrained patterns in a
star schema? If so, how? Since existing algorithms for constrained pattern mining are only able to deal
with one transactional table, they can be applied to the fact table, with some adaptations to deal with
different entity types. However, they cannot be used to push constraints into the whole star. Even so,
depending on the constraint, there are some adaptations that can be made. The extension of CoPT
and CoPT4Streams to incorporate constraints in multi-dimensional mining is possible, but requires some
major adaptations: it needs to be able to track which dimension each item belongs to, as well as to retrieve
the values for attributes from the corresponding dimensions. Since they push constraints as a post-
processing step, they also need to keep a record of what transactions gave support to each pattern, so
that it guarantees that they do not result from the processing of invalid entities. This extra storage
of transactions has been applied to other areas [MPP07], but it is still not efficient, resulting not only
in extra memory, but also in extra processing time for the discovery process. More importantly, this
approach cannot even be applied to streaming data, since transactions can only be seen once. Also,
as stated above, they require an unconstrained pattern mining run over the data first, storing the
patterns in a pattern-tree, which means that the patterns have to be processed twice. Nevertheless, they
take constraint properties into account, and therefore they do not need to test all patterns this second
time.
In this sense, there is a need for a new algorithm that can somehow use the strategies defined through-
out this chapter for an efficient incorporation of constraints into the mining of multiple tables. In the
next section, we propose a new algorithm, that adapts StarFP-Stream (Chapter 3) to incorporate star
constraints into the pattern mining of large and growing star schemas.
5.4 Mining Stars with Constraints
As seen above, the application of constraints into the mining of the whole star depends heavily on the
transactions that give support to patterns. Therefore, especially in a streaming environment, it is not
feasible to verify their satisfiability in a post-processing step, since algorithms would have to keep the
transactions (the facts) along with the patterns, for further access.
Furthermore, in the traditional paradigm of basket analysis, items in patterns correspond to entities,
all of the same type (e.g. all products). The bulk of the work on constrained pattern mining is in fact
along this paradigm. However, when we move to a relational domain, we start having several dimensions,
and entities of different types (e.g. customers' characteristics, sellers, products, etc.). The star schema
is one of these cases, where items in patterns are pairs (attribute, value) that belong to an entity, from
some dimension, and thus each branch of the pattern-tree (i.e. a pattern) contains a mix of items, of
different dimensions. This makes it also difficult to apply the strategies of CoPT or CoPT4Streams, that
rely on the fact that items are all entities, and of the same type.
However, instead of constraining the items in patterns, what we want is to constrain the transactions
that support patterns. That is, if we are only interested in patterns related to a set of entities (entity
constraints), we can consider only the facts where those entities were transacted, and therefore other
entities will not appear in patterns. Similarly, if we are not interested in entities with particular
characteristics (attribute constraints), we can discard all transactions with invalid entities, so that they do not
appear in (or influence) the final results.
In fact, as mentioned before, most star constraints have succinct properties, and therefore they can
be incorporated more efficiently as a pre-processing step, or in this streaming case, during the processing
of each arriving transaction.
In this sense, we propose the algorithm Domain Driven Star FP-Stream, or D2StarFP-Stream, that is
an extension of Star FP-Stream (Section 3.3) that is able to incorporate star constraints over transactions
into the mining of the whole star.
We first formally define how to apply the Star Constraints over the transactions, and then present
the proposed constrained multi-dimensional algorithm. Finally, we also present a performance evaluation
over the AdventureWorks DW, and end with some discussion and conclusions.
5.4.1 Constraining Business Facts
In a star schema, a transaction corresponds to one business fact, which may or may not contain more than
one row in the fact table (depending on whether a degenerated dimension exists). Recall that one fact is one
single row in the fact table, and one business fact is the set of facts that correspond to the same business
transaction (e.g. sale order). Also, a fact is a set of foreign keys (entities), one for each dimension.
In the presence of a dimension constraint, the only thing that is needed is to ignore entities (and
items) from invalid dimensions. For example, if we are not interested in dimension SalesTerritory, when
processing a fact, we can simply ignore or discard all entities of that dimension. This is similar to perform
a full roll-up on the SalesTerritory dimension, in OLAP operations.
In the presence of any other star constraint C (entity, attribute or measure), defined in Section 5.3.2,
one fact is valid if its entities satisfy the constraint. That is:
Valid(fact) = (∀ en ∈ fact . C(en) = true)
But when we consider a business fact with more than one fact, we may want to constrain it in
three different ways – (1) to consider the whole business fact only if all facts satisfy the constraint, (2) to
consider the whole business fact if at least one satisfies, or (3) to consider just the facts of the business fact
that satisfy the constraint and ignore invalid ones (which is equivalent to perform a slice on a particular
dimension, on OLAP operations).
Let us define, as an example, a business fact with two sales, bf = {(p1, d, c, t), (p2, d, c, t)} (customer
c bought, on day d and store t, both products p1 and p2), where product p1 has (Color = "Blue") and
product p2 has (Color = "Red"). Let us have the following attribute constraint asking for products of
color blue: C(en) = (dim(en) = Product ⇒ color(en) = "Blue"). In this case, the first fact is valid:
Valid((p1, d, c, t)) = true, since the only entity of dimension Product is p1, and it has color blue (note
that the constraint C applied to entities of other dimensions is always true, since they are not the
ones being constrained. Nevertheless, in practice, there is no need to test them, since in a fact we know
in which column the entity of each specific dimension is). On the other hand, the second fact is invalid:
Valid((p2, d, c, t)) = false, since product p2 has a color different from blue.
We define the three following validity properties, and a function named getValidFacts that can be
applied to a business fact and returns the set of individual facts that should be considered for mining,
based on the validity property of the star constraint, in each case:
All if All Valid: getValidFacts(bf, ALL VALID) = {f ∈ bf | ∀ g ∈ bf . Valid(g) = true}
If all facts satisfy the constraint, the whole business fact is valid, and all facts should be considered.
This allows us to find patterns that involve only complete valid transactions.
Following the example above, we may be interested in sales that only included blue products, and
therefore, the business fact in question is not valid, and should be discarded, since it contains one
product of an invalid color. We would find, e.g. what types of customers only buy blue products,
or if there is a specific season where customers buy blue products.
All if One Valid: getValidFacts(bf, ONE VALID) = {f ∈ bf | ∃ g ∈ bf . Valid(g) = true}
If at least one fact satisfies the constraint, the whole business fact is valid, and all facts should be
considered. This allows us to find patterns that are frequent along with the valid entities, such as
what type of other entities are transacted along with the valid ones.
Using the same example, we may want sales that include at least one blue product, and therefore,
the business fact in question is valid, and both products p1 and p2 should be considered. We could
find, for example, what other types of products are bought along with blue products.
Only Valid: getValidFacts(bf, ONLY VALID) = {f ∈ bf | Valid(f) = true}
In this case, a business fact is always valid, unless it contains no transaction with a valid entity.
However, we should only consider the valid individual facts, i.e. the individual transactions of valid
entities. This allows the finding of patterns associated with the transactions of valid entities, such
as profiles of customers that buy specific types of products, or what is common for specific types
of customers.
Following the example, we are interested in sales of blue products, and thus the sale order is valid,
since it contains one sale of a blue product, and only this transaction should be considered. We
could discover what types of customers buy blue products, in general.
Note that, if there is no degenerate dimension (i.e. no aggregations of facts), each business fact is
composed of exactly one fact, and therefore there is no need to distinguish between the validity properties
over business facts (all three have the same result). The same happens when the constraint C is over an
entity that does not change within one business fact (e.g. the customer in a sale order is always the same).
In this case, all three validity properties will also result in the same set of valid facts. Using the example
above, for a constraint C(en) = (dim(en) = Customer ⇒ age(en) < 30), asking for customers under 30
years old, we only have to test the customer of the first fact (because it is the same in every fact). In this
case, if customer c is under 30 years old, the whole business fact should be considered, and discarded
otherwise.
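The three validity properties can be sketched as a single dispatch function (a hypothetical Python illustration of getValidFacts, using the blue-products example):

```python
ALL_VALID, ONE_VALID, ONLY_VALID = "ALL_VALID", "ONE_VALID", "ONLY_VALID"

def get_valid_facts(business_fact, valid, mode):
    """Return the facts of a business fact that should be considered
    for mining, according to the chosen validity property."""
    if mode == ALL_VALID:   # all facts valid, or discard the whole business fact
        return list(business_fact) if all(valid(f) for f in business_fact) else []
    if mode == ONE_VALID:   # one valid fact is enough to keep them all
        return list(business_fact) if any(valid(f) for f in business_fact) else []
    return [f for f in business_fact if valid(f)]  # ONLY_VALID: keep only valid facts

# Example from the text: p1 is blue, p2 is red; constrain to blue products.
color = {"p1": "Blue", "p2": "Red"}
bf = [("p1", "d", "c", "t"), ("p2", "d", "c", "t")]
valid = lambda fact: color[fact[0]] == "Blue"

print(get_valid_facts(bf, valid, ALL_VALID))   # [] (one product is not blue)
print(get_valid_facts(bf, valid, ONE_VALID))   # both facts kept
print(get_valid_facts(bf, valid, ONLY_VALID))  # only the p1 fact kept
```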
5.4.2 D2Star FP-Stream
In this section we propose a new algorithm, Domain Driven Star FP-Stream, or D2StarFP-Stream, that
is an extension of Star FP-Stream (proposed in Section 3.3) that is able to incorporate star constraints
over transactions into the mining of the whole star.
The main idea is to push star constraints as new business facts arrive, and build both the DimFP-Trees
and the StarFP-Tree only with valid ones (according to one of the validity properties defined above, over
business facts). By doing this, these trees will only have the content of valid transactions, and therefore
the global mining step (combining and processing these trees) can be performed as for the unconstrained
StarFP-Stream, with no change required. The final patterns will also satisfy the constraints, with no
checking needed, because: in the ALL VALID and ONLY VALID cases, patterns will not contain
invalid entities (they were discarded), and even if they are composed only of pairs (attribute, value) from
unconstrained dimensions, it is guaranteed that these came from valid transactions; in the ONE VALID
case, invalid entities are not discarded if there is one valid entity in the set, but patterns, as a
whole, will satisfy the constraint, because the set of entities from the constrained dimension satisfies it.
The pseudocode of the algorithm is presented in Algorithm 4.
Algorithm 4 D2StarFP-Stream Pseudocode
Input: Star Stream S, error rate ε, Star constraint C
Output: Approximate frequent items with threshold σ that satisfy the constraint C (and respective validity property), whenever the user asks
1: i = 1, |B| = 1/ε, N = 0, flist and ptree are empty
2: ValidDim ← getValidDimensions(S, C)  // Dimension constraints
3: initialize one DimFP-tree for each ValidDim to empty
4: for all arriving business fact bf = (tidD1, tidD2, ..., tidDn, m1, ..., mp) do
5:   N = N + 1
6:   bf = getValidFacts(bf, C)  // Application of the validity properties and of the other star constraints
7:   for all Dimension Dj in ValidDim do
8:     T ← transaction of Dj with tidDj
9:     insert T in the DimFP-treej
10:    flist ← append new items introduced by T
11:  if all business facts of Bi arrived then
12:    super-tree ← combineDimFP-trees(DimFP-trees, Bi)
13:    FP-Growth-for-streams(super-tree, ∅, ptree, i)
14:    discard the super-tree
15:    tail-pruning(ptree.Root, i)
16:    i = i + 1, initialize n DimFP-trees to empty
We can see in line 6 that, after receiving a business fact, the first thing to do is to check the validity
the whole transaction, based on the validity property of the star constraint. This depends, as seen in
Section 5.4.1, on the validity of the individual facts, which in turn depends on the satisfaction of the star
constraint C by the corresponding entities.
In this sense, the incorporation of each star constraint is performed as follows: For pushing a dimension
constraint, the algorithm just needs to ignore the entities in facts that belong to invalid dimensions, and
only needs to build the DimFP-Trees of valid ones. This is implemented in lines 2 (we know from the
beginning which are the valid dimensions), 3 and 7.
The testing of other star constraints (entity, attribute and measure) is performed while testing the
validity of business facts and facts (function getV alidFacts, line 6). For pushing entity constraints,
the algorithm only needs to test the entities in facts, against the accepted or unwanted entities. As
for the incorporation of attribute constraints, when checking the validity of facts, we need to test, for
all entities of the dimension being constrained, what is the value for that attribute. In this sense, the
algorithm, for each entity of that dimension, goes to the corresponding dimension table and checks the
value corresponding to that entity and attribute. Finally, for measure constraints, the algorithm does
not need to test any entity, only the measure values of each fact, and check if those values satisfy the
constraint. In the case of a constraint over the aggregated value of a measure, we only have to compute
the aggregation of the respective measure, for the whole business fact (e.g. sum of all quantities), and
check if the result satisfies the constraint.
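The check of a constraint over the aggregated value of a measure, for a whole business fact, can be sketched as follows (hypothetical Python names and fact layout):

```python
def satisfies_aggregate_measure(business_fact, measure_name, minimum):
    """Aggregate measure constraint: sum the measure over all facts of
    the business fact and compare the result against the threshold."""
    return sum(fact[measure_name] for fact in business_fact) >= minimum

# Sale order 1 from the running example: quantities 1, 1 and 3.
bf = [{"Product": "p1", "Qnt": 1},
      {"Product": "p2", "Qnt": 1},
      {"Product": "p3", "Qnt": 3}]

print(satisfies_aggregate_measure(bf, "Qnt", 4))  # True (total quantity is 5)
```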
5.4.3 Experimental Results
In order to test the performance of D2StarFP-Stream, we use the same experimental setup as for the eval-
uation of its unconstrained counterpart, StarFP-Stream (Section 3.4). And the goal of these experiments
is to compare both, and analyze whether adding the constraints to StarFP-Stream minimizes the bottleneck
of the size of the pattern-tree while, at the same time, improving the memory and time needed to
process each batch.
In summary, we tested the algorithms with a sample of the AdventureWorks 2008 Data Warehouse,
with the star schema shown in Fig. 2.1, with the degenerated attribute SalesOrderNumber (AW D-Star),
in order to test the several validity properties for business facts.
We analyzed the behavior of the pattern-tree and the time and memory used by each algorithm, and
we conducted experiments varying both minimum support and maximum error thresholds. Since results
are similar, and we want to compare the constrained versus the unconstrained approach, we only present
here the results for 3% of error.
Since the behavior of the algorithms may vary with the selectivity of the constraints, as shown in
Sections 5.1.4 and 5.2.4, we test the algorithm D2StarFP-Stream with several constraints, with different
selectivities. However, since we are constraining transactions, and not patterns, we measure the selectivity
of star constraints as the ratio of entities that violate the constraint, i.e. the number of entities of the
constrained dimension that are invalid, over the total number of entities of that dimension.
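This notion of selectivity can be sketched as follows (hypothetical attribute values, not the actual AdventureWorks data):

```python
def entity_selectivity(entities, constraint):
    """Selectivity of a star constraint: the fraction of entities of the
    constrained dimension that violate the constraint."""
    invalid = sum(1 for e in entities if not constraint(e))
    return invalid / len(entities)

# Hypothetical Customer.Age values; constraint: age under 30.
ages = [27, 38, 45, 22, 31]
print(entity_selectivity(ages, lambda a: a < 30))  # 0.6 (3 of 5 customers violate)
```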
Figures 5.12 and 5.13 show the average pattern-tree size per batch and maximum memory needed,
respectively.
As expected, the size of the pattern tree decreases with the incorporation of constraints, even for
constraints with small selectivity. We can see that, for example, for 50% of selectivity, the pattern tree
is 4 times smaller than in the unconstrained case. This shows that pushing constraints minimizes the
bottleneck of StarFP-Stream, by reducing the number of patterns that must be kept in the pattern
tree. And by having smaller trees, D2StarFP-Stream needs less memory to keep them, outperforming its
unconstrained counterpart.
Figure 5.12: Average size of the pattern-tree, per batch, for 3% of error, and for entity and attribute constraints over a non degenerated dimension.
Figure 5.13: Average maximum memory needed, per batch, for 3% of error, and for entity and attribute constraints over a non degenerated dimension.
Following the same reasoning, the smaller the pattern-trees, the less time needed to process them.
Figure 5.14 shows this decrease with the increase of the selectivity of the constraints.
Figure 5.14: Average update time of the pattern-tree, per batch, for 3% of error, and for entity and attribute constraints over a non degenerated dimension.
Despite this decrease, we can see that, for constraints with small selectivity, D2StarFP-Stream needs more time to process one batch than the unconstrained algorithm. This happens because pushing constraints requires an extra step for checking the validity of business facts and the satisfaction of the star constraints. This extra validation step results in more overall time per batch with small selectivities, since in these cases few entities are invalid, so all of them need to be tested and most of them will remain in the tree. However, this extra time is compensated as more facts are discarded: for 50% of selectivity, for example, D2StarFP-Stream takes, on average, half the time to process each batch compared to the unconstrained StarFP-Stream.
5.4.4 Discussion and Conclusions
In this section we proposed a new algorithm, D2StarFP-Stream, for pushing star constraints into the discovery of patterns over a large and growing star schema. The algorithm is an extension of the unconstrained StarFP-Stream (Section 3.3) that returns fewer and more interesting results, according to the constraints. By being able to incorporate star constraints, D2StarFP-Stream eliminates invalid transactions earlier and keeps smaller pattern-trees, therefore minimizing the bottleneck of the unconstrained algorithm.
Experimental results show that the algorithm is memory efficient, and that it results not only in smaller pattern-trees, but also in less memory needed, even for constraints with small selectivity. Despite the extra time introduced by the validation step, which results in more time per batch for small selectivities, this overhead is diluted for more selective constraints, making the algorithm take less time per batch than the unconstrained one.
5.5 Conclusions and Open Issues
In this chapter, the algorithm CoPT [SA13a] was proposed for post-pushing constraints into pattern
mining. The algorithm uses a prefix tree structure to store the frequent itemsets, and then pushes
constraints deep into this tree, taking advantage of the constraint properties in question. Despite being
a post-processing algorithm, it is able to push constraints satisfying all known properties, still taking
advantage of them.
We also proposed an extension of the algorithm CoPT [SA13a] for data streams, named CoPT4Streams. The idea is to use any data streaming algorithm and to store all current patterns in a pattern-tree. Constraints are pushed at each batch boundary, resulting in a smaller summary structure at every batch, and therefore in less time and memory needed. We also show that violating itemsets can always be removed at each batch, without losing patterns. Like CoPT, this algorithm is able to efficiently incorporate constraints that follow any constraint property, still taking advantage of them.
However, both algorithms are designed for one single data table.
We also analyzed in detail the integration of multi-dimensional mining with constrained mining, and defined in this chapter a set of constraints for star schemas – the star constraints: entity type, entity, attribute and measure constraints. We also discussed and proposed a set of strategies for pushing these star constraints into multi-dimensional mining algorithms, and showed that it is possible to incorporate constraints into the mining of multiple tables.
By being post-processing algorithms, both CoPT and CoPT4Streams cannot be directly applied to
the mining of a star schema. However, they both can be applied to the mining of the fact table, with small
adaptations to deal with different entity types and retrieve the values from the corresponding dimensions.
In order to mine the whole star with constraints, we proposed the algorithm D2StarFP-Stream. It is able to push all star constraints as new business facts arrive, guaranteeing that invalid transactions are eliminated and do not contribute to support. By constraining the business transactions according to the desired validation property for events, the algorithm is also able to mine the star schema at the right business and aggregation level.
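The core idea behind this elimination step – discarding business facts whose dimension entities violate a star constraint before they can contribute to support – can be sketched as follows (a simplified illustration; the record layout and names are hypothetical, and the actual algorithm performs these checks incrementally over the stream):

```python
def filter_valid_facts(facts, dimensions, constraints):
    """Keep only the business facts whose referenced dimension entities
    satisfy every star constraint, so that invalid transactions never
    contribute to support counting."""
    valid = []
    for fact in facts:
        # A fact is valid only if every constrained dimension entity
        # it references satisfies the respective predicate.
        if all(pred(dimensions[dim][fact[dim]])
               for dim, pred in constraints.items()):
            valid.append(fact)
    return valid

# Toy star: a Patient dimension, facts referencing it by key, and an
# entity constraint keeping only female patients.
dims = {"patient": {1: {"gender": "F"}, 2: {"gender": "M"}}}
facts = [{"patient": 1, "exam": "GPT"}, {"patient": 2, "exam": "ALB"}]
only_female = {"patient": lambda e: e["gender"] == "F"}
# filter_valid_facts(facts, dims, only_female) keeps only the first fact
```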
To the best of our knowledge, this is the first approach dedicated to the integration of these two areas.
Despite being an important step, there is still room for progress, namely in finding ways to push more complex constraints, such as sequence and temporal constraints (which is possible, since we have transactions, and therefore time and sequences), as well as other more complex forms of domain knowledge, such as ontologies.
The algorithm D2StarFP-Stream can also be improved, for example, in terms of how the validation of the constraints is made. By finding more efficient ways of making these constraint checks, it is possible to minimize the overhead of this extra step. Optimization techniques, such as parallelization and integration with the database (for faster access to attribute values), could also bring benefits to D2StarFP-Stream.
Chapter 6
A Case Study in Healthcare
Huge amounts of data are continuously being generated in the healthcare system. The analysis of these data is essential, since it may help in many areas of healthcare management, such as evaluating treatment effectiveness, understanding causes and effects, anticipating future demanded resources, predicting patients' behaviors and best treatments, defining best practices, etc. [KT05, KW06]. Due to the nature of this information, the results of these analyses may make the difference, by decreasing healthcare costs and, at the same time, improving the quality of healthcare services and of patients' lives.
Healthcare data are usually massive, and too sparse and complex to be analyzed by hand with traditional methods. In the last decades, data mining has begun to address this area, providing the technology and approaches to transform huge and complex data into useful information for decision making [KT05]. Data mining (DM) [FPSM92] has been successfully applied to many different subfields of healthcare management, with results that proved very useful to all parties involved [KT05, KW06].
One of the characteristics of the data collected in the healthcare domain is their high dimensionality.
They include patient personal attributes, resource management data, medical test results, conducted
treatments, hospital and financial data, etc. Thus, healthcare organizations must capture, store and
analyze these multi-dimensional data efficiently.
Multi-Relational Data Mining, or MRDM [D03], is therefore a promising approach for analyzing healthcare data, since its goal is to discover frequent relations that involve multiple tables (or dimensions), in their original structure, i.e. without joining all the tables before mining.
In this chapter, we present a case study on the healthcare domain, showing how existing data can be
explored. The case is based on the use of the Hepatitis dataset, created by Chiba University Hospital,
containing information about 771 patients having hepatitis B or C, and more than 2 million examinations
dating from 1982 to 2001. This dataset is organized in a relational model that may help data storage, but that hinders data analysis, since data are scattered across different tables and it is not easy to inter-relate the data in a timeline.
In this work, we propose a multi-dimensional model for the Hepatitis dataset that makes efficient analysis and knowledge extraction possible. We also present some statistics, in order to better understand the distributions of the data in this domain. After modeling the dataset through a multi-dimensional model, we analyze the application of data mining to these models, and present the results of applying the MRDM algorithm StarFP-Stream [SA12a] to the proposed model.
Section 6.1 describes the Hepatitis dataset and Section 6.2 proposes a multi-dimensional model for the Hepatitis data – the Hepatitis star – in order to promote their analysis for decision making. We first show an evaluation of the performance of applying StarFP-Stream to the Hepatitis star (Section 6.3), and then we present two applications of MRDM with these healthcare data. In the first, we use our algorithm to find discriminant patterns and association rules (Section 6.5) to understand the relations between the laboratory examinations and the two types of hepatitis; in the second, we show that StarFP-Stream can be used to find inter-dimensional and aggregated patterns that are able to characterize patient exam behaviors. These, in turn, may be used as classification features to predict if a patient has hepatitis or not, which type, and even the stage of the hepatitis (Section 6.6). Finally, Section 6.7 discusses and concludes the chapter.
6.1 The Hepatitis Dataset
The Hepatitis dataset1 contains information about laboratory examinations and treatments performed on patients with hepatitis B and C, who were admitted to Chiba University Hospital in Japan. There are 771 patients, and more than 2 million examinations dating from 1982 to 2001, from about 900 different types of blood and urine exams. The dataset also contains data about the biopsies (about 695 biopsy results) and interferon treatments (about 200) performed on patients. Biopsies reveal the true existence of hepatitis and the respective fibrosis stage. However, they are invasive procedures, and therefore there is an interest in finding other indicators that allow for the detection of hepatitis in a less invasive way. Interferon treatments have also been seen and used as an effective way to treat hepatitis C, although they have tough side effects, and their efficacy is not yet proved. Hence, there is the need to understand the impact of this treatment.
[Figure: the tables of the Hepatitis relational model, centered on the patient – Patient, Interferon Therapy, Results of Biopsy, Hematological Analysis, In-Hospital Examination and Out-Hospital Examination.]
Figure 6.1: Hepatitis relational model [PRV05].
The hepatitis dataset is composed of several data tables, modeled in a relational schema centered on the patient. This model is shown in Figure 6.1. Each patient may have performed some biopsies, several hematological analyses, and in-hospital and out-hospital exams, and may have also been under interferon therapy. Each one of these aspects is stored in a different table and is independent of the others.
Despite being modular, this schema does not facilitate the analysis of these data, for several reasons: (1) the various exams – in-hospital, out-hospital and hematological analyses – are not directly related, although the same type of exam may be present in more than one table; (2) relating exams with each other, or exams with biopsies or interferon therapy, requires joining the tables for a common analysis. This process of joining the tables is time consuming and non-trivial, and the resulting table hinders the analysis, since it contains a lot of redundant data, as well as many missing values; and (3) time is not directly modeled, and therefore there is no easy way to understand the interconnection between co-occurring events (e.g. exam results during interferon therapy), nor the disease evolution. Moreover, most data are distributed irregularly, both in time and per patient, making a direct analysis unfeasible.
The work presented in [PRV05] is the first step towards the multi-dimensional analysis of the hepatitis data. The authors use a multi-relational algorithm to connect biopsies and urine exams, and to generate association rules that estimate the stage of liver fibrosis based on lab tests. However, they are only able to mine two dimensions of the relational model at a time, and therefore they cannot relate the biopsies with, for example, both the blood tests (Hematological Analysis) and the other tests (In- and Out-Hospital Examinations).
1The Hepatitis dataset was made available as part of the ECML/PKDD 2005 Discovery Challenge: http://lisp.vse.cz/challenge/CURRENT/
6.2 The Hepatitis Multi-Dimensional Model
As stated before, one of the characteristics of the data collected in the healthcare domain is their high dimensionality. In the case of the Hepatitis dataset, we have administrative data such as patients' features (gender and date of birth), the pathological classification of the disease (given by biopsy results), the duration of interferon therapy, and temporal data about the blood and urine tests performed on patients. Note that we could have more data, such as treatment and test costs, hospital data related to out-hospital exams, information about the doctors in charge of patients, etc., which would increase the dimensionality of the dataset and the complexity of the model.
One efficient way to store high-dimensional data is through the use of a multi-dimensional model – a
star schema, in particular. A star schema clearly divides the different dimensions of a domain into a set
of separated data tables, interrelated by a central table, representing the occurring events. In the case of
the Hepatitis data, we can identify several dimensions – patient, biopsy, possible exams and date – and
events correspond to patient examinations.
Figure 6.2: Hepatitis star schema.
In this sense, one of the possible star schemas that can be defined is proposed in Figure 6.2. The star schema is composed of 4 dimensions (Patient, Biopsy, Exam Type and Date) and one central fact table that corresponds to the Examination Results. Each dimension is independent and contains the respective characteristics (Patient contains patients' features, and Exam Type contains data about possible exams, like upper and lower bounds and units). By analyzing the central table, we can understand the relation between all dimensions: one patient P, with active biopsy B, performed exam E on date D. The result of this event was r (given by attribute Result in the central table), and at the moment of this examination, interferon therapy was (or was not) being administered (attribute InInterferonTherapy?).
Adding new dimensions to this star schema is straightforward. For example, we could add dimensions Hospital and Doctor just by adding the respective keys to the central table, and each event in that table would correspond to one exam E, performed on patient P, with active biopsy B, on date D, in hospital H with doctor Doc.
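As an illustration only (the field names beyond Result and InInterferonTherapy? are hypothetical, and the thesis implementation is in Java), one row of the central fact table could be represented as:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ExaminationFact:
    """One row of the central fact table of the Hepatitis star:
    one foreign key per dimension, plus the two fact attributes
    (Result and InInterferonTherapy?)."""
    patient_key: int             # Patient dimension
    biopsy_key: Optional[int]    # Biopsy dimension (None: no active biopsy)
    exam_type_key: str           # Exam Type dimension, e.g. "GPT"
    date_key: date               # Date dimension
    result: str                  # categorized result, e.g. "N", "H", "VH"
    in_interferon_therapy: bool  # exam made during interferon therapy?

# Patient 1 performed a GPT exam on 1995-03-02 with a high result,
# no active biopsy, and outside interferon therapy:
fact = ExaminationFact(1, None, "GPT", date(1995, 3, 2), "H", False)
```

Adding a dimension such as Hospital would amount to one more key field in this record.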
6.2.1 Building the Star Schema
In order to build our star schema, we had to perform a pre-processing phase to join exam data from the
different tables and improve their quality.
First, we decided to reduce data and select only the most significant exams, based on the report
carried out by [WSYT03]. These exams are GOT, GPT, ZTT, TTT, T-BIL, D-BIL, I-BIL, ALB, CHE,
T-CHO, TP, WBC, RBC, HGB, HCT, MCV and PLT. In this sense, dimension Exam Type contains the
known data about these exams (i.e. code, bounds and units). The reason for this data reduction is that
other exams are so rare that one cannot draw any conclusion based on them. Another reason is the fact
that, due to the lack of domain knowledge, we can only interpret the results of these exams (as normal
or abnormal results). Dimension Patient is equivalent to the original table in the Hepatitis dataset, and
Biopsy contains only the possible outputs of biopsies (type can be B or C, the fibrosis stage varies from
0 to 4, and respective activity from 0 to 3). Note that dimension Date contains all dates from 1982 to
2001 and is trivial to generate.
Since these exams are spread across the Hematological Analysis, In-Hospital and Out-Hospital Examination tables, each row of these tables corresponds to one event (one examination) in the central table of the star schema. Exam results were then categorized into 7 degrees: extremely, very or simply high (UH, VH, H), normal (N), and low, very or extremely low (L, VL, UL). The thresholds and categories for each of the selected exams are described in [WSYT03], and presented in Table 6.1. When a patient had more than one result for the same type of exam in one day, the results were averaged.
Table 6.1: Important exams and corresponding thresholds and categories in the Hepatitis data [WSYT03].
medical test (thresholds) | categories
GOT (40, 100, 200), GPT (40, 100, 200), ZTT (12, 24, 36), TTT (5, 10, 15) | N, H, VH, UH
T-BIL (1.2, 2.4, 3.6), D-BIL (0.3, 0.6, 0.9), I-BIL (0.9, 1.8, 2.7) | N, H, VH, UH
ALB (3.0, 3.9, 5.1, 6.0), CHE (100, 180, 430, 510) | VL, L, N, H, VH
T-CHO (90, 125, 220, 255), TP (5.5, 6.5, 8.2, 9.2) | VL, L, N, H, VH
WBC (2, 3, 4, 9), PLT (50, 100, 150, 350) | UL, VL, L, N, H
RBC (3.75, 5.0), HGB (12, 18), HCT (36, 50), MCV (84, 95) | L, N, H
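The table's threshold-to-category mapping amounts to a sorted-threshold lookup: k thresholds delimit k+1 categories. A minimal sketch for a few of the exams follows (the boundary handling is an assumption, since the table does not state whether threshold values are inclusive):

```python
import bisect

# Thresholds and categories for a few of the selected exams,
# transcribed from Table 6.1 [WSYT03]:
EXAM_CATEGORIES = {
    "GPT": ((40, 100, 200), ("N", "H", "VH", "UH")),
    "ALB": ((3.0, 3.9, 5.1, 6.0), ("VL", "L", "N", "H", "VH")),
    "PLT": ((50, 100, 150, 350), ("UL", "VL", "L", "N", "H")),
    "RBC": ((3.75, 5.0), ("L", "N", "H")),
}

def categorize(exam, value):
    """Map a raw exam result to its category: the value's position
    among the sorted thresholds selects one of the k+1 categories."""
    thresholds, categories = EXAM_CATEGORIES[exam]
    return categories[bisect.bisect_right(thresholds, value)]

# categorize("GPT", 35) -> "N"; categorize("GPT", 150) -> "VH"
# categorize("PLT", 90) -> "VL"
```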
Fibrosis is considered stable 500 days before and 500 days after a biopsy [WSYT03]. Therefore, for each examination in the central table, the corresponding active biopsy is the most recent one performed for the patient within the 500-day interval (or none).
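This active-biopsy selection can be sketched as follows (the biopsy record layout is illustrative, and taking the most recent biopsy within the window is our reading of the rule above):

```python
from datetime import date, timedelta

def active_biopsy(biopsies, exam_date, window_days=500):
    """Return the patient's most recent biopsy whose date lies within
    500 days of the examination date, or None if there is no such
    biopsy (biopsies are active 500 days before and after they are
    conducted)."""
    window = timedelta(days=window_days)
    candidates = [b for b in biopsies if abs(b["date"] - exam_date) <= window]
    return max(candidates, key=lambda b: b["date"]) if candidates else None

b1 = {"date": date(1995, 1, 1), "stage": 1}
b2 = {"date": date(1999, 1, 1), "stage": 2}
# An exam on 1995-06-01 falls within b1's window only.
# An exam on 2001-01-01 is more than 500 days after both biopsies,
# so it has no active biopsy.
```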
Finally, interferon therapy data were also integrated in the multi-dimensional star, by marking all examinations in the central table made during the administration of this therapy (using the information in the Interferon Therapy table of the relational model).
6.2.2 Understanding the Data
After building the star schema for the Hepatitis dataset as described above, the result was a central table with almost 600 thousand examinations performed, for 722 patients (the other 50 patients performed none of the most significant exams, and therefore remain in the Patient dimension but are not present in the central table).
In order to better understand the domain in question, Figure 6.3 shows the distribution of the exams per patient. We can see that there are patients with just a few exams, and other patients with more than 2500 exams. However, on average, each patient performed about 500 to 700 exams. Also, only 30% of all patients are female, but women perform, on average, more exams than men.
Figure 6.3: Number of exams per patient (female and male).
Figure 6.4: Number of exams per patient diagnosed with hepatitis B, C or still undiagnosed (Unknown).
Of these patients, 234 have not performed any biopsy, which means that they have not yet been diagnosed with any type of hepatitis. The number of examinations performed on patients with hepatitis B, C or none is shown in Figure 6.4. Note that, of all patients, only 27.5% were diagnosed with hepatitis B at some point in time, 40% with hepatitis C, and the remaining 32.5% have no biopsy. We can see in that figure that patients with hepatitis C perform many more exams than patients with hepatitis B. One possible explanation is the fact that hepatitis C has been treated with interferon therapy, and therefore more exams (and biopsies) are performed to check if the condition improves.
Also, patients with no biopsy performed far fewer exams than the others. This may indicate that they did not undertake a biopsy because doctors thought these patients were not infected with hepatitis B or C, and therefore the biopsy was not necessary.
Figure 6.5 presents the variation of the number of exams per stage of hepatitis (fibrosis). A value of 0 means that there is no fibrosis, and 4 that the stage of the fibrosis is severe. Note that only about one fifth of the total examinations (about 137 thousand) were performed while there was a valid biopsy (biopsies are active 500 days before and after they are conducted). The others may correspond to patients that never performed a biopsy, or to other patients, before, between or after the conducted biopsies.
[Figure data: B 7.0%, C 16.4%, Unknown 76.6%; Hepatitis B stages – B1 37%, B2 33%, B3 17%, B4 13%; Hepatitis C stages – C0 3%, C1 50%, C2 15%, C3 14%, C4 18%.]
Figure 6.5: Distribution of exams per stage of hepatitis (i.e. exams performed while there was a valid biopsy indicating the fibrosis stage).
As expected, there are more cases of hepatitis in their early stages than in severe ones. In the case of
hepatitis C, 50% of all performed exams correspond to patients in stage 1 of fibrosis. This means that, in
order to find correlations between exams and fibrosis stages, we are analyzing patterns that are common
to a very small percentage of data.
Figure 6.6: Number of exams per patient, at each stage of hepatitis.
The number of exams per patient at each stage of hepatitis does not vary much, as can be seen in Figure 6.6. Furthermore, it is stable for patients with hepatitis C, with the exception of stage 0 (no fibrosis). This can again be explained by the application of interferon therapy and the respective evolution checks.
6.3 Performance Evaluation
In order to measure the performance of our algorithm StarFP-Stream over real data, we replicated the experiments made over the AdventureWorks DW. The goal is to evaluate the accuracy, time and memory usage, and to show that StarFP-Stream is accurate and performs better than the join-before-mining approach.
Similarly, we assume a landmark model, and we test our multi-relational approach, StarFP-Stream, against SimpleFP-Stream (as described in Section 3.4), which denormalizes business facts as they arrive.
We tested the algorithms over the Hepatitis Star in Figure 6.2. This star has no degenerated dimen-
sions, and therefore each row in the fact table corresponds to one business fact. Table 6.2 presents a
summary of the dataset characteristics.
Since the Hepatitis Star contains ten times more facts than the AW T-star, we used lower errors to
get larger batches. Also, the frequency of each item globally is much smaller (patients perform different
exams), and hence we had to use lower supports too, to achieve similar amounts of patterns. In this
sense, experiments were conducted varying both the minimum support and maximum error thresholds:
σ ∈ {5%, 2%, 1%, 0.5%} and ε ∈ {1%, 0.5%, 0.1%, 0.05%, 0.01%}. By varying the error, we are varying
batch sizes. Table 6.3 shows the size and number of batches corresponding to each error.
Table 6.2: A summary of the Hepatitis star characteristics.
Number of facts: 580,000
Number of transactions per fact: 1
Number of attributes per dimension: [2; 4]
Number of entries per dimension: [52; 772]
Table 6.3: Batches of Hepatitis facts, corresponding to each error.
Error | |B| | N. Batches
1% | 100 | 5,800
0.1% | 1,000 | 580
0.01% | 10,000 | 58
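The relation between the error, the batch size and the number of batches can be reproduced assuming, as in lossy-counting-style algorithms, a batch size of ⌈1/ε⌉ transactions:

```python
import math

def batches_for_error(n_facts, error):
    """Batch size |B| = ceil(1/error) and the corresponding number of
    batches over a stream of n_facts business facts."""
    batch_size = math.ceil(1 / error)
    return batch_size, math.ceil(n_facts / batch_size)

# For the 580,000 Hepatitis facts (cf. Table 6.3):
# error 1%    -> |B| = 100,    5,800 batches
# error 0.1%  -> |B| = 1,000,    580 batches
# error 0.01% -> |B| = 10,000,    58 batches
```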
The computer and settings used to run the experiments were the same: an Intel Xeon E5310 1.60GHz (Quad Core), with 2GB of RAM. The operating system used was GNU/Linux amd64, and the algorithms were implemented using the Java programming language (Java Virtual Machine version 1.6.0_24).
6.3.1 Experimental Results
In these experiments we analyze the accuracy of the results, as well as the behavior of the pattern-tree
and the time and memory used by each algorithm.
In terms of accuracy, we compared the patterns returned by StarFP-Stream with the exact patterns,
given by FP-Growth (with the complete denormalized table as input). Recall that the patterns returned
by both StarFP-Stream and SimpleFP-Stream are the same (they only differ in how they manipulate the
data), thus we only present these results for our algorithm.
Figure 6.7 shows the number of patterns returned, along with the precision (the ratio of real patterns over the returned ones).
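Precision here is simply the fraction of returned patterns that also occur in the exact result. A minimal sketch, with patterns represented as sets of items:

```python
def precision(returned_patterns, exact_patterns):
    """Fraction of the returned patterns that are real, i.e. that also
    appear in the exact result (here, FP-Growth's output)."""
    if not returned_patterns:
        return 1.0
    exact = {frozenset(p) for p in exact_patterns}
    hits = sum(1 for p in returned_patterns if frozenset(p) in exact)
    return hits / len(returned_patterns)

# Two of the three returned patterns are real:
# precision([{"a"}, {"a", "b"}, {"c"}], [{"a"}, {"a", "b"}]) -> 2/3
```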
(a) Number of patterns returned
(b) Precision
Figure 6.7: Hepatitis Star (results for 1% error only appear when ε ≪ σ).
Note that there are no results for supports of 1% and 0.5% with an error of 1% (and less), because, by definition, ε ≪ σ. Using an ε ≥ σ would cause the algorithm to return all possible patterns stored in the pattern-tree; the results would explode and would not be significant.
As expected, as the minimum support decreases, more patterns are returned, since we demand fewer occurrences for an itemset to be frequent. Also, as the error increases, more patterns are returned, because more items can be eliminated, and therefore more possible patterns have to be returned. This results in lower precision, since the number of false positives increases.
Figure 6.8 presents an analysis of the size of the pattern-tree. As the error decreases, the size of the trees increases (Figure 6.8a). Also, despite being a summary structure, the tree is very large, with thousands of nodes.
(a) Average size – Hepatitis Star
[Charts: Pattern-Tree Size (thousands of nodes) per Batch, for a 0.1% error.]
(b) Size with 0.1% error – Hepatitis Star
Figure 6.8: Average (left) and detailed (right) pattern-tree size.
As in the AdventureWorks case, the pattern-tree follows the same trends. The pattern-tree of the Hepatitis Star also grows over the first batches, but then tends to stabilize (Figure 6.8b). However, the Hepatitis case shows more fluctuations around the main tendency, which may mean that its patterns are not as well defined. This behavior is common to all errors.
The pattern-tree is the most important structure, since it holds all possible patterns, and therefore it influences both the time and the memory needed.
Figure 6.9 shows the time needed to process each batch (update time). Even for an error of 0.1% (which imposes a batch size of 1000 facts), the time needed to process each batch is just a couple of seconds (as with the AW T-Star for a 3% error). We can also state that StarFP-Stream needs, on average, less time than SimpleFP-Stream (Figure 6.9a), which confirms that denormalizing before mining takes more time than mining the star schema directly.
[Charts: average update Time (s) vs. Error, for SimpleFP-Stream and StarFP-Stream.]
(a) Average time – Hepatitis Star
[Charts: update Time (s) per Batch, for SimpleFP-Stream and StarFP-Stream.]
(b) Time with 0.1% error – Hepatitis Star
Figure 6.9: Average (left) and detailed (right) update time.
Even though the time in Figure 6.9b increases slightly as new batches are processed, we can see that it tends to become constant. Since there are many items with a small frequency (due to the data characteristics), there are many infrequent itemsets that must be removed, and this increase in time may be caused simply by memory management.
The analysis of the maximum memory needed per batch is shown in Figure 6.10. It is strongly related to the pattern-tree, and therefore to the error bound.
[Charts: average maximum Memory (Mb) per batch vs. Error, for SimpleFP-Stream and StarFP-Stream.]
Figure 6.10: Average maximum memory per batch.
We can see that our algorithm needs somewhat more memory than SimpleFP-Stream, because the former needs to create a DimFP-Tree for each dimension, while the latter puts the denormalized facts into one single FP-Tree. We can also see, once again, that the memory needed increases exponentially as the error decreases, because the smaller the error, the more has to be kept in the pattern-tree. However, just as with time, the memory needed tends to stabilize and not depend on the number of batches processed so far.
6.4 Hepatitis Application Goals
For the analysis of the hepatitis star schema, we decided to address two topics of interest, suggested for
this dataset:
1. To discover the differences between patients with hepatitis B and C;
2. To evaluate whether laboratory examinations can be used to estimate the stage of liver fibrosis.
This second topic is of particular importance, since biopsies are invasive for patients, and therefore doctors try to avoid them.
By using the star in Figure 6.2 we are able to relate the exams (and other dimensions) with the type of hepatitis, as well as with the fibrosis stage. We can look for examination results that are common (frequent) along with hepatitis B and/or C, and see the differences (goal 1). Similarly, we can look for frequent exam results for each fibrosis stage (goal 2), and then use those patterns to help classify other patients with similar results.
In order to tackle these topics, we follow two approaches: (1) discovering discriminant patterns and association rules; and (2) finding inter-dimensional and aggregated patterns and using them to enrich classification, potentially improving prediction results.
By following the first approach, we can understand whether there are examination results that are characteristic of some type of hepatitis, or that are connected to particular stages of hepatitis. On the other hand, by using the second methodology, we are able not only to find interesting sets of frequent examination results, but also to use these multi-relational patterns to improve the prediction of whether a patient is infected with hepatitis, with which type, and at what stage. Thus, if the predictions of classification models improve in the presence of these patterns, we are also evaluating and demonstrating the importance of our multi-relational patterns, and therefore the importance of our algorithms.
6.5 Finding Discriminant Patterns and Association Rules
At first glance, the approach for finding the discriminant patterns may seem straightforward: apply StarFP-Stream to the hepatitis star and choose all patterns that relate some examination result with hepatitis types and fibrosis stages. However, as seen in section 6.2.2, less than a quarter of all examinations have an active biopsy associated. In particular, 16% of examinations correspond to hepatitis C, and only 7% to hepatitis B.
First, this means that, to find the hepatitis type B as a frequent item (the same is valid for hepatitis C), we have to select a very low support; and, to find some examination that is frequent along with hepatitis B, we have to set the support even lower. Furthermore, if we look at the frequency of examinations corresponding to each fibrosis stage, the supports become lower still. This leads to many uninteresting and possibly misleading patterns.
Second, if all data contributes to the support, highly frequent patterns (> 16% + 7% = 23%) are frequent because they co-occur more in data with no biopsy information. This means that they are not interesting, because they cannot discriminate any type of hepatitis (at most, they can discriminate the non-existence of hepatitis, if they are not frequent for any type of hepatitis).
In this sense, we decided to constrain the data, and applied StarFP-Stream with low supports:
1. To all examinations with hepatitis B – referred to as B;
2. To all examinations with hepatitis C – referred to as C;
3. To all examinations with no biopsy data – referred to as None.
This way, we found three sets of patterns: B, C and None. We then generated the association rules (with their respective support, confidence and lift measures) based on the discovered patterns (again, three sets of rules: B, C and None).
Next, for the analysis, we categorized the patterns and rules as discriminant or non-discriminant. A pattern is discriminant if it belongs to group B and/or C, but not to group None, i.e. if it is frequent for some type of hepatitis but not for patients that have not yet been diagnosed. Additionally, a pattern that belongs only to group None is also discriminant, since it may be a good indicator that a patient does not have hepatitis. Patterns that belong to some hepatitis group and, at the same time, to group None are non-discriminant, and thus not interesting. Discriminant patterns may be used to address goal 1, i.e. to understand the differences between hepatitis B and C.
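The categorization above reduces to a membership test over the three pattern groups. A minimal sketch (the group contents below are illustrative, not the actual result sets):

```python
# Categorize a pattern as discriminant or not, following the rules above:
# discriminant if frequent for B and/or C but not for None, or if frequent
# only for None. `groups` maps each group name to its set of frequent patterns.

def categorize(pattern, groups):
    in_b = pattern in groups["B"]
    in_c = pattern in groups["C"]
    in_none = pattern in groups["None"]
    if (in_b or in_c) and not in_none:
        kinds = [name for name, flag in (("B", in_b), ("C", in_c)) if flag]
        return "Yes (" + " and ".join(kinds) + ")"
    if in_none and not (in_b or in_c):
        return "Yes (None)"
    return "No"

# Toy groups, mirroring the structure (not the content) of Table 6.4.
groups = {"B": {"GPT_UH", "GOT_VH"}, "C": {"GOT_VH"}, "None": {"ALB_L"}}
print(categorize("GOT_VH", groups))  # Yes (B and C)
print(categorize("ALB_L", groups))   # Yes (None)
```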
Finally, we analyzed association rules that implicate some stage of fibrosis, to understand if the stage
can be estimated by examination results (goal 2).
6.5.1 Interesting Patterns
Table 6.4 presents a subset of the frequent patterns found, with information about results and fibrosis. As expected, the supports of these patterns are very low (around 1% of the group) in all three groups. In fact, in these data, we found that the patterns with higher support tend to be non-discriminant (such as normal values for most of the examinations).
Table 6.4: Some examples of the patterns found in the hepatitis dataset.

                                     Support in
    Pattern                          B     C     None   Discriminant?
    1  (Result=RBC_H)                1%                 Yes (B)
    2  (Result=GPT_UH)               1%                 Yes (B)
    3  (Result=GPT_VH)               1%    2%    1%     No
    4  (Result=GPT_H)                2%    2%    3%     No
    5  (Result=GOT_VH)               1%    1%           Yes (B and C)
    6  (Result=GOT_H)                3%    3%    3%     No
    7  (Result=HCT_H)                2%    2%    1%     No
    8  (Result=CHE_VL)               4%          1%     No *
    9  (Result=ALB_L)                             1%    Yes (None)
    10 (Result=PLT_VL)                            1%    Yes (None)
    11 (Sex=M,Result=GPT_VH)         1%    1%           Yes (B and C)
    12 (Sex=M,Result=CHE_VL)         3%                 Yes (B)
    13 (Fibrosis=1,Result=CHE_VL)    1%                 Yes (B)
    14 (Fibrosis=1,Result=GPT_H)           1%           Yes (C)
    15 (Fibrosis=1,Result=GOT_H)           1%           Yes (C)
However, by analyzing the differences between groups, we can find some possibly interesting and discriminant examinations. For example, we find that ultra high (UH) values for the GPT test only appear in the hepatitis B set (more than 1% of the time), but as the value lowers, the test stops being discriminant. Other examples are patterns 9 and 10, which may indicate that low values in the ALB and PLT tests are good markers for not having hepatitis (note that, in these data, not having information about a biopsy does not mean that a person does not have hepatitis, but it may be an indicator for finding the relations that make doctors think there is no need for a biopsy; nevertheless, this would need further analysis).
Pattern 8 is marked with an ∗ because, as can be noted, it has 4% support for hepatitis B and only 1% in group None, and is therefore considered non-discriminant. But, if we look at patterns 12 and 13, very low (VL) values for the CHE test may be an indicator of hepatitis B, meaning that the occurrences of pattern 8 in the None group may be outliers (or not yet diagnosed hepatitis B patients).
The only discriminant patterns that relate exam results and the fibrosis stage concern fibrosis stage 1 (patterns 13 to 15 in the table), because of the extremely low supports of the other fibrosis stages. In fact, high (H) values in the GPT and GOT exams are not, by themselves, discriminant of hepatitis C (patterns 4 and 6). At most, they may be able to discriminate the fibrosis stage in patients already diagnosed with hepatitis C.
6.5.2 Association Rules
Table 6.5: Some examples of the association rules found in the hepatitis dataset.

    AR                                           Conf.    Lift   Discr.?
    1  (Result=GOT_H) ⟹ (Fibrosis=1)             48.10%   0.96   No
    2  (Result=GPT_H) ⟹ (Fibrosis=1)             51.35%   1.03   No
    3  (BirthDecade=1960) ⟹ (Fibrosis=0)         19.90%   6.92   No
    4  (BirthDecade=1960) ⟹ (Fibrosis=1)         62.32%   1.25   No
    5  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=1)   51.25%   1.02   Yes (C)
    6  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=2)   15.07%   0.99   Yes (C)
    7  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=3)   13.81%   0.99   Yes (C)
    8  (BirthDecade=1930,Sex=F) ⟹ (Fibrosis=4)   17.57%   0.98   Yes (C)
Table 6.5 presents a subset of the frequent association rules found, of the form X ⇒ Fibrosis, with X any other item.
In order to address the second goal, we wanted to find all rules in which some examination result implies some fibrosis stage. Rules 1 and 2 are examples of such rules. However, their confidence is around 50%, which means that these rules are not unexpected and are probably too tied to the data in question. The lift is also too close to 1, confirming that these rules are not interesting. Indeed, both antecedents were non-discriminant (as seen in Table 6.4), as are these rules. All other rules of this form are equivalent and, furthermore, can only estimate fibrosis stage 1. This means that, with these data, no examination result can, by itself, predict the stage of fibrosis, for either type of hepatitis.
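For reference, the confidence and lift used to evaluate these rules can be computed directly from pattern supports. A minimal sketch with hypothetical counts (not the actual hepatitis supports):

```python
# Confidence and lift of an association rule X => Y, from absolute supports.

def confidence(sup_xy: int, sup_x: int) -> float:
    """P(Y|X): fraction of transactions containing X that also contain Y."""
    return sup_xy / sup_x

def lift(sup_xy: int, sup_x: int, sup_y: int, n: int) -> float:
    """Ratio of observed co-occurrence to that expected under independence.
    A lift close to 1 means X gives almost no information about Y."""
    return (sup_xy / n) / ((sup_x / n) * (sup_y / n))

# Hypothetical example: 1000 transactions; X in 200, Y in 480, both in 100.
print(confidence(100, 200))       # 0.5
print(lift(100, 200, 480, 1000))  # ≈ 1.04: X barely changes the odds of Y
```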
Rules 3 and 4 are examples of rules with a slightly higher lift. They indicate that 20% of the patients born in the 60s (i.e. who were examined when 20 to 40 years old) had hepatitis at fibrosis stage 0, and 62% of them at fibrosis stage 1. However, these rules have low confidence, and therefore we cannot conclude that there is a relation between the age of the patients and the stage of the hepatitis.
Finally, rules 5 to 8 show that there are attributes that, although discriminant, are not good predictors of the stage of fibrosis. In these examples, women born in the 30s may indicate any stage from 1 to 4, but with small confidences (with the exception of stage 1, which is explained by the fact that there are more instances of that stage) and bad lifts.
In [PRV05], the authors only generate and analyze the confidence of association rules of the form Examination Result → Fibrosis. However, besides the confidence of those rules being low (in most cases), neither the support nor the lift of those rules was analyzed. As shown here, rules of that form have a small confidence (rules 1 and 2) and also a lift too close to 1 (and a very low support), which means there are too few examples and these rules may not be significant.
These poor results mean that further analysis of these data is needed, in a different and more structured way. They also show that there are some possible tendencies but that, alone, examination results cannot predict the fibrosis stage of hepatitis patients.
6.6 Improving Prediction using Multi-Dimensional Patterns
Classification is a data mining task widely used for predicting future outcomes. As an example, in this
healthcare domain, it can be used to predict if a patient is infected with hepatitis or not, as well as to
predict the type and stage of hepatitis.
Classification algorithms create a prediction model based on the existing data (training data), for
which we know a set of features and also the outcomes, and then use this model to predict the unknown
outcomes of new data, based on their observed features.
There are several algorithms and models proposed for classification, and they have been applied to a vast number of different domains. However, despite these advances, the relations between attributes are not considered in existing approaches. In fact, in a multi-relational domain there are implicit relations in the data that are easily modeled through a relational schema, but that cannot be easily modeled in one single training table. Naturally, if we could somehow incorporate these relations into classification, prediction results would be likely to improve.
If we consider the multi-relational (MR) patterns described above, both inter-dimensional and aggregated patterns represent the relations between entities. This means that, if a record (or event) in the data satisfies a MR pattern, we can say that this record encloses the relations represented by the pattern.
In this sense, MR patterns can be seen as a compact way to model the relationships in the data, and can be used as features to enrich data for classification. The simplest way to incorporate these patterns is to pre-process the individual records to verify the satisfaction of each identified pattern, and to extend these records with one boolean attribute per pattern, corresponding to whether the respective record satisfies each pattern. In this manner, what is multi-relational by nature becomes tabular, without losing the dependencies identified before, and traditional classifiers are applicable without the need for any adaptation.
In this thesis, we claim that we can use multi-relational patterns to enrich classification data in the healthcare domain and improve prediction, as done before with sequential patterns [BA11] and frequent graphs [PA09].
We first describe the methodology in detail, and then put it into practice with several experiments, showing that running classification over these enriched data improves not only the accuracy of the predictions, but also the classification models built.
6.6.1 Methodology
The general process is illustrated in Figure 6.11, and is divided into four main steps: multi-dimensional
pattern mining, pattern filtering, data enrichment and classification.
[Diagram: Star Schemas (Individual Performances) → Multi-Dimensional Pattern Mining → Frequent Patterns → Pattern Filtering → N Best Patterns → Data Enrichment → Enriched Individual Performances → Classification → Prediction Model]
Figure 6.11: The multi-dimensional methodology for enriching classification.
The main idea is to make use of a MRDM algorithm to find inter-dimensional and aggregated patterns that are able to characterize different entities and their behaviors. These patterns may, in turn, be filtered and used as classification features to predict some outcome, depending on the different dimensions considered.
Given a star schema (or a constellation of star schemas) containing the individual performances (such
as the Hepatitis star given in Figure 6.2, recording all the examinations performed), we propose to apply
the next steps:
1. Multi-Dimensional Pattern Mining: This step consists in running an algorithm for multi-dimensional pattern mining, such as StarFP-Growth [SA11] or StarFP-Stream [SA12a], over each star schema. By doing this, we are able to find the frequent patterns – intra-dimensional, inter-dimensional and aggregated – related to each star.
For example, running a MRDM algorithm over the Hepatitis Star allows us to find frequent patient behaviors related to the results of the examinations they performed (e.g. sets of results that are frequent together, for each particular hepatitis).
2. Pattern Filtering: After finding the patterns, the next step is to filter them and choose the N best ones. We can either filter the patterns of each star separately, choosing the N best of each set, or rate the set of all patterns and choose the N best global ones.
Note that, in theory, using all patterns with at least 2 items to enrich the training data for classification should achieve the best results. However, this would eventually lead to overfitting of the models found, which would in turn lead to poor results when classifying new instances. In this sense, we should choose only those patterns that achieve a higher information gain.
First, we are only interested in the patterns that can model the multi-dimensional relations between entities; therefore, we only want inter-dimensional and aggregated patterns (i.e. patterns with items from more than one dimension and items resulting from the aggregation of facts). Then, in order to choose the N best patterns, we have to filter and rate the patterns according to some interestingness measure. We define five filters:
Support: The support of a pattern is the number of times it occurs. Therefore, the higher the support of a pattern, the more events share the characteristics it represents, and the more likely it is to cover more of the entities we want to classify.
In this sense, using a support filter, we order the patterns by descending support and choose the N patterns with the highest support.
However, the patterns with the highest support are the smallest ones (the number of times a pattern occurs is greater than or equal to the number of times its super-patterns occur), and therefore those that represent smaller relations.
Also, in this healthcare domain, if exam results are shared by a high number of patients, it may mean that they are not discriminant of the type of hepatitis or its stage;
Size: On the other hand, the largest patterns model more multi-dimensional relations than smaller ones, and hence may be more interesting for improving classification.
Using a size filter, we order the patterns by descending size and choose the N largest patterns.
The downside is that these patterns tend to have the smallest supports, covering a very small part of the data;
Closed: One characteristic of patterns is that, if one is frequent, all of its subsets are also frequent (anti-monotonicity), which means that some patterns might be redundant. Thus, if we eliminate the redundant patterns, the final set of chosen patterns is more likely to be interesting.
A pattern is closed if none of its immediate supersets has the same support (if some has, this pattern is not interesting). In this case, we are only interested in the closed patterns. Using a closed filter, we consider only the closed patterns, and choose those with the highest support;
One of the problems of the above measures is that they do not take into account how correlated the items are, or how much gain they bring over what is already known. Thereby, we define the next two filters.
Rough Independence: According to probability theory, two events A1 and A2 are independent if P(A1 ∩ A2) = P(A1)P(A2). If two events are independent, the occurrence of one does not influence the probability of the other; therefore, patterns that contain these two events are not interesting. More than two events are mutually independent if P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2)...P(An) holds for every subset of those events. Taking this into consideration, for this work we define a rough independence measure:

RInd({A1, A2, ..., An}) = P(A1 ∩ A2 ∩ ... ∩ An) / (P(A1)P(A2)...P(An))

If RInd is 1, the elements of the pattern are roughly independent, and therefore less important. The higher the value of RInd, the more dependent the elements are, and the more important the pattern.
Therefore, using the rough independence filter, patterns are ordered by decreasing value of |RInd − 1|, and only the N patterns with the highest difference are chosen.
Note that, in order to guarantee the mutual independence of a pattern, it would be necessary to measure RInd on all of its subsets (the power set of its elements).
Rough Chi-square (χ2): Chi-square is an interestingness measure that evaluates the correlation between variables [BMS97]. Generally, the more correlated the variables, the more interesting the relations. The chi-square of two variables is defined as:

χ2 = Σ(i=1..n, j=1..m) (observed_ij − expected_ij)^2 / expected_ij

in which observed_ij is the observed support of values i and j, and expected_ij is the expected probability of those values if the variables were independent.
In this work we define a rough chi-square measure to evaluate the correlation of the elements in a pattern:

Rχ2({A1, ..., An}) = (support(A1 ∩ ... ∩ An) − P(A1)...P(An))^2 / (P(A1)...P(An))

The higher the value of Rχ2, the more roughly correlated the elements of the pattern are, and therefore the more interesting it is.
In this sense, using a rough chi-square filter, patterns are ordered in decreasing order of Rχ2, and the N with the highest measure are chosen.
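The two rough measures can be sketched directly from the definitions above. A minimal sketch; the item probabilities below are illustrative, not taken from the hepatitis data:

```python
# Rough independence (RInd) and rough chi-square of a pattern, as defined
# above. `item_prob` maps each item to its observed probability P(Ai);
# `pattern_prob` is the observed P(A1 ∩ ... ∩ An).
from math import prod

def rough_independence(pattern, pattern_prob, item_prob):
    expected = prod(item_prob[a] for a in pattern)
    return pattern_prob / expected  # 1.0 => items look independent

def rough_chi_square(pattern, pattern_prob, item_prob):
    expected = prod(item_prob[a] for a in pattern)
    return (pattern_prob - expected) ** 2 / expected

def top_n_by_rind(patterns, probs, item_prob, n):
    # Rank by |RInd - 1|: the farther from independence, the better.
    return sorted(patterns,
                  key=lambda p: abs(rough_independence(p, probs[p], item_prob) - 1),
                  reverse=True)[:n]

# Toy probabilities: the first pair co-occurs exactly as often as expected
# under independence (0.25 = 0.5 * 0.5); the second co-occurs far more often.
item_prob = {"GOT_H": 0.5, "GPT_H": 0.5, "ZTT_H": 0.3}
probs = {("GOT_H", "GPT_H"): 0.25, ("GOT_H", "ZTT_H"): 0.28}
print(top_n_by_rind(list(probs), probs, item_prob, 1))  # [('GOT_H', 'ZTT_H')]
```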
3. Data Enrichment: Once we have the best patterns, we can use them as features for classification training.
The simplest way to incorporate these patterns in the classification process is to pre-process the individual records (the original training data) to verify the satisfaction of each identified pattern. This verification results in a new extended record, where the multi-dimensional patterns are represented as boolean attributes – true or false – according to whether the entity satisfies the particular pattern. In this manner, what is multi-dimensional by nature becomes tabular, without losing the dependencies identified before, and traditional classifiers are applicable without any adaptation.
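The enrichment step described above can be sketched as follows; the records and patterns are toy examples, not the actual hepatitis tables:

```python
# Extend each record with one boolean attribute per selected pattern,
# true when the record contains every item of the pattern.

def satisfies(items: set, pattern: tuple) -> bool:
    return set(pattern) <= items

def enrich(records, patterns):
    # Each record is (identifier columns, set of items it contains).
    return [ids + [satisfies(items, p) for p in patterns]
            for ids, items in records]

patterns = [("GOT_H", "GPT_H"), ("Sex=M", "MCV_H")]
records = [(["p1"], {"GOT_H", "GPT_H", "PLT_L"}),
           (["p2"], {"GOT_H", "MCV_H"})]
print(enrich(records, patterns))
# [['p1', True, False], ['p2', False, False]]
```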
4. Classification: We can then finally run classification algorithms on these enriched data and observe the results. We expect to achieve not only better predictions, but also better models (in particular, smaller ones).
6.6.2 Methodology into Practice
In order to analyze the hepatitis dataset and achieve our goals, we decided to follow the methodology
described above: (1) run multi-relational pattern mining over the Hepatitis star schema; (2) filter the
best inter-dimensional and aggregated patterns; (3) enrich the classification data (baseline) with these
patterns; and (4) run classification over both the baseline and this enriched dataset, and compare the
results (the average of the predictions, and the size of the models built).
For the first step, we decided to run the StarFP-Stream algorithm over the Examination Results star schema. So that we could understand the behavior of patients and discover frequent sets of exam results, the algorithm aggregates into one single record all the exams of the same patient while each particular biopsy is valid, i.e. each pair (patient, biopsy) of the central fact table is considered as a single event. By doing this, we are able to discover not only frequent exam results (like traditional pattern mining algorithms), but also sets of results that co-occur frequently. We can find, for example, that patients with hepatitis B frequently have high results in the GOT and GPT exams and, at the same time, low results in PLT.
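The aggregation per (patient, biopsy) pair can be sketched as follows; the fact rows are illustrative, not the actual fact table:

```python
# Merge all fact rows sharing the same (patient, biopsy) key into one
# event holding the set of exam results observed while the biopsy is valid.
from collections import defaultdict

def aggregate_events(facts):
    events = defaultdict(set)
    for patient, biopsy, result in facts:
        events[(patient, biopsy)].add(result)
    # Sort each result set so the output is deterministic.
    return {key: sorted(results) for key, results in events.items()}

facts = [("p1", "b1", "GOT_H"), ("p1", "b1", "GPT_H"), ("p1", "b2", "PLT_L")]
print(aggregate_events(facts))
# {('p1', 'b1'): ['GOT_H', 'GPT_H'], ('p1', 'b2'): ['PLT_L']}
```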
After finding the patterns, we tested our approach with all the filters proposed above, and with different numbers of selected patterns.
For this case study, since our goal is to predict the type or the stage of hepatitis based on exam results, the baseline used for classification is a table composed of the patient information (Patient dimension) and the results of the 17 most significant exams identified in section 6.2. We then defined two similar baselines, Type and Fib, and applied this methodology to both: the first to predict whether a patient is infected with some type of hepatitis, and the second to predict the stage of the hepatitis, if present. In this sense, the class of baseline Type is the type of hepatitis (B, C or None), and the class of baseline Fib is the stage of fibrosis (from 0 to 5).
Once we have the best N patterns, we extend the baseline table by adding N boolean attributes.
Each value for these attributes is true if the patient satisfies the pattern, or false otherwise.
Finally, we applied the classification algorithm C4.5 to these enriched datasets and compared the results against the same algorithm applied to the baselines. The classification results presented are the average of several 10-fold cross-validations.
We used our implementation of StarFP-Stream (described in Chapter 3), and the C4.5 implementation
available in Weka.
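The evaluation protocol (averaging accuracy over 10-fold cross-validation) can be sketched as follows; a trivial majority-class classifier stands in for C4.5, and the labels are synthetic:

```python
# Average accuracy over k-fold cross-validation. Each fold is held out once
# as the test set; a majority-class predictor is trained on the rest. This
# only illustrates the protocol, not the actual C4.5 experiments.
from collections import Counter

def kfold_accuracy(labels, k=10):
    folds = [labels[i::k] for i in range(k)]  # simple interleaved split
    accuracies = []
    for i, test in enumerate(folds):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        majority = Counter(train).most_common(1)[0][0]
        accuracies.append(sum(y == majority for y in test) / len(test))
    return sum(accuracies) / len(accuracies)

# Synthetic class distribution, loosely shaped like Type (C, B, None).
labels = ["C"] * 70 + ["B"] * 20 + ["None"] * 10
print(kfold_accuracy(labels))  # 0.7: the majority class covers 70% of cases
```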
6.6.3 Analysis of Multi-Relational Patterns
Table 6.6 presents a subset of the frequent patterns found. For simplicity, we only present the patterns
in which the exam results have abnormal values (i.e. low or high results).
The first five patterns are intra-dimensional and contain only one item. The first, for example, means that the exam named GOT is frequent, appearing 975 times in these settings with a high value (H). Also, the data contains 290 diagnoses of hepatitis C (pattern 5).
Table 6.6: Some examples of the multi-relational patterns found in the hepatitis dataset.

    Pattern                                                        Support
    1  (Result=GOT_H)                                              975
    2  (Result=WBC_L)                                              536
    3  (Result=IBBIL_H)                                            495
    4  (Result=CHE_VL)                                             491
    5  (Type=C)                                                    290
    6  (Result=ZTT_H,Result=DBBIL_H,Result=TTT_H)                  577
    7  (Result=GOT_H,Result=GPT_H,Result=ZTT_H,Result=MCV_H)       586
    8  (Result=GOT_H,Result=GPT_H,Result=DBBIL_H,Result=CHE_VL)    386
    9  (Result=GOT_H,Result=GPT_H,Result=PLT_L,Result=WBC_L)       362
    10 (Result=GOT_VH,Result=GPT_VH,Result=ZTT_H,Result=WBC_L)     355
    11 (Sex=M,Result=MCV_H)                                        579
    12 (Sex=M,Result=MCV_H,Result=HCT_H)                           469
    13 (Type=None,Fibrosis=None,Result=DBBIL_H)                    534
    14 (Type=None,Result=GOT_H,Result=GPT_H)                       508
    15 (Type=C,Result=GOT_H,Result=GPT_H,Result=ZTT_H)             236
The next five patterns are aggregated patterns, since they represent frequent sets of examination results; they are discovered because we aggregated the data in the star schema per (patient, biopsy) pair. We can see that the GOT and GPT exams frequently appear together, with very similar (and high) results. We can also observe that, e.g., more than 300 patient diagnoses have high results in both GOT and GPT and, at the same time, low results in the PLT and WBC exams.
Patterns 11 to 15 are inter-dimensional patterns, which contain items from more than one dimension.
Of these, the first two relate the Patient and Exam dimensions, and the rest relate the Biopsy
with the examinations and corresponding results. In these examples, we note that, of the 800 biopsy
diagnoses of male patients, almost 600 correspond to examinations with a high value in exam MCV.
The last three patterns relate the type of hepatitis and the stage of the fibrosis with examination results.
As an example, a high value for the D-BIL exam was frequently associated with no hepatitis cases.
The last patterns suggest that a higher value for both the GOT and GPT tests, along with ZTT,
indicates that the patient has hepatitis C, since more than 80% of the cases of hepatitis C show these
values. However, we can see in pattern 14 that these values are also associated with not having hepatitis.
This may be evidence that these values or tests are not discriminative (corroborating the results of our first
approach, presented in Section 6.5). We also have to recall that, in these data, not having information
about a biopsy does not mean that a person does not have hepatitis; it means only that the person has not
been diagnosed yet. Even so, it may help uncover the relations that lead doctors to decide that
a biopsy is not needed.
6.6.4 Enriched Classification Results
Figures 6.12a and 6.12b show the accuracy of the classification step over the baselines Type and Fib,
respectively, and over the corresponding datasets enriched with the multi-relational patterns from the
pattern mining step.
When we add the patterns that represent patient exam behaviors, we can see in the figures that the
accuracy improves in both cases, as expected. Although small, the improvements indicate that patterns
are chosen instead of specific exams, which may result in models with less overfitting, and therefore
in more accurate predictions on new instances. Also, the results show that, in general, the larger the
number N of best patterns chosen, the better the accuracy.
When analyzing the different filters, there are small fluctuations, but both the rough independence
and rough chi-square filters achieved the best results in both baselines. Choosing the patterns
[Plots omitted: accuracy (%) as a function of the N best patterns (50 to 5000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline Type. (b) Baseline Fib.
Figure 6.12: Accuracy for baselines and respective extensions with MR patterns.
with higher support is the approach that brings the smallest improvements, because those patterns are
the smallest ones, and might not be discriminative of patients with different types or stages of hepatitis.
The closed filter is similar to the support filter, while the size filter achieves intermediate results. These
tendencies occur in both baselines.
Figures 6.13a and 6.13b analyze the size of the trees created by the classifier (i.e. the size of the
models).
[Plots omitted: size of the tree as a function of the N best patterns (50 to 5000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline Type. (b) Baseline Fib.
Figure 6.13: Size of the trees for baselines and respective extensions with MR patterns.
We can see that for both baselines, also as expected, the trees resulting from classifying the enriched
datasets are smaller than the base tree (with up to 300 fewer nodes for N = 500 patterns when we
are predicting the type of hepatitis). The tendencies of the different filters are the same: both rough
independence and rough chi-square filters result in the smallest trees, which means that they choose the
patterns that bring more information gain to the models (these patterns are therefore chosen instead of
individual examination results). Again, on the contrary, the support and closed filters are the ones that
achieve the smallest improvements in the size of the models.
6.7 Discussion and Conclusions
In this chapter, we presented a case study in the healthcare domain. Using the Hepatitis dataset, we
showed how these data can be modeled and explored in a multi-dimensional model to promote decision
support. We also discussed the use of multi-relational data mining algorithms to mine this model, as well
as the use of the results to improve classification.
The performance evaluation of StarFP-Stream over the Hepatitis star schema corroborates and validates
the results obtained over the fictitious AdventureWorks DW (Section 3.4): the algorithm is accurate
and needs less time than the approach of denormalizing before mining.
Results over the Hepatitis dataset show that it is possible to mine these data and find interesting
relations between dimensions. However, due to the nature and distributions of these data, the interesting
patterns found in the first approach have very low support, and therefore further analysis was needed.
Our study of the discovered association rules concluded that the examination results present
in the hepatitis dataset, without aggregating data per patient and biopsy, cannot predict the fibrosis
stage, mainly due to the very low supports.
Results achieved in our second approach show that we can discover structured patterns from the
multi-relational model, and find frequent sets of examination results that are common to some type of
hepatitis or that lead to some fibrosis stage. Classification experiments validate our claim – by enriching
the training data with the discovered multi-relational patterns, it is possible not only to improve the
accuracy of classification, but also to create better and smaller models, meaning that multi-relational
patterns are chosen as key features instead of specific examination results.
The methodology used is simple and general: it may be applied to any healthcare data warehouse or
star schema, and also to different domains. Another benefit of this methodology is that
any algorithm can be applied for multi-relational data mining, as well as different classifiers.
This application also demonstrates the importance and applicability of multi-dimensional patterns.
As future work, and in order to overcome the difficulties of this dataset, other paths must be taken. One
of the problems stems from the lack of data and from their quality. The hepatitis dataset contains more than
30% of patients that did not undergo any biopsy (undiagnosed), and more than 75% of examinations for
which there is no information about an active biopsy. A better understanding of why these
patients have not undergone a biopsy requires domain knowledge, and may help partition the data
and improve the results. In line with the above, this dataset contains a very low number of instances for
each type and stage of hepatitis. There is a need for the integration and analysis of more data in this
domain.
The use of different approaches may also lead to better outcomes, such as infrequent pattern min-
ing [ZY07], for finding rare patterns, or sequential and temporal pattern mining, for analyzing the
evolution of the disease.
We can also try to understand the use of interferon therapy, aggregating the data per patient in
(and out of) interferon therapy, and finding the differences between the frequent patterns that occur before,
during and after the administration of that therapy (again applying the same algorithm, StarFP-Stream).
Chapter 7
A Case Study in Education
The long history of education as an institution, along with the need to record student results as proof
of their credentials, has led to huge amounts of data, requiring automatic means for exploring them. In
general, these data mainly describe the courses taken by students and their corresponding grades, but
also information about the teachers involved in each course, and when and where the educational
process happened. With the spread of the information society, the variety of records has enlarged considerably,
and nowadays records encompass all kinds of items, from the learning materials available and used, to the
answers given to specific questions. Indeed, while present throughout its history, the multi-dimensionality
of these data is even clearer nowadays.
Educational data mining provides a first opportunity for exploring these data, offering adequate
tools for predicting student performance and dropouts, but also for understanding student behaviors.
However, despite the encouraging results, few approaches have been dedicated to exploring the multi-
dimensionality of the data, and the vast majority are restricted to just one, or at most two, dimensions. To
our knowledge, there is no proposal that addresses the problem in a multi-dimensional context, for example
predicting student results in a particular course given the entire context, such as the teacher involved,
the history of the course, the learning materials, or the time and place of the occurrence.
The main reason for this lack of interest is certainly the difficulty of mining multi-dimensional data.
Indeed, the huge amounts of data made joining the different dimensions (recorded in separate
tables) infeasible until the advent of big data exploration. But even in this new era, mining these huge
tables is not straightforward, due to their nature. As explained in Section 2.3, joining the tables into one
would result in a huge table with many attributes (the combination of all attributes of all dimensions),
many repetitions and possibly many missing values. In the educational domain, this join is even
harder, since each entity, such as a student, has a different number of associated events. For example,
students can attend a different number of courses and can fail some enrollments, resulting in a
different number of enrollments per student. Teachers can also lecture different courses, and each
course can be taught by a different number of teachers. The possible combinations are usually
far too many, and therefore there is a strong need for approaches able to explore these multi-dimensional
data without having to join the different tables.
In order to deal with this, some approaches use feature selection as a preprocessing step
to reduce the number of attributes to consider in the classification step [MVCRV13]. In these techniques,
the goal is to identify the data attributes that have the greatest effect on the output variable, and to use
only those as classification features. In this work, we follow another approach: we argue that we can
use multi-dimensional patterns to enrich classification data and, with this, improve prediction results
and deal with the high dimensionality of the data in this domain. The reasoning is that these patterns
capture the existing relations between the dimensions and between the instances, and therefore, by adding
them as features to the classification data, we can transmit these dependencies. Consequently, in some sense,
classification algorithms become able to take the multiple dimensions into account.
In this chapter, we apply to educational data the same multi-dimensional methodology for enriching
classification described in the hepatitis case study (Section 6.6.1). We illustrate the interest of our
application on the prediction of student results, when students enroll in a given course taught by a particular
teacher. Experimental results on a real educational case study reveal improvements both in prediction and
in the classification model built, when compared with two baseline models.
The rest of the chapter is organized as follows. Section 7.1 presents the educational multi-dimensional
model used in this case study. The application of the methodology to the proposed educational star
schema is described in detail in Section 7.2, along with an analysis of the multi-dimensional patterns
found and of the classification results achieved. Finally, Section 7.3 discusses and concludes the case
study.
7.1 TheEducare Multi-Dimensional Model
To the best of our knowledge, a characteristic of educational data that has not been addressed
properly is their multi-dimensionality. Indeed, the educational process encompasses a set of different
entities, each characterized by a distinct set of attributes (the dimensions). Students, teachers and courses
are clear examples of such dimensions. The educational process occurs at the intersection of these dimensions,
with the materialization of its events. Examples of such events are the lessons attended by some student on
some day, for a specific course with a particular teacher, or simply the grade achieved by some student in
some course for a particular enrollment.
Multi-dimensional models, such as star schemas or constellations (i.e. sets of star schemas), are
recognized as the most usual schemas for modeling these kinds of data, and are commonly used for modeling
data warehouses [KR02]. An example of a multi-dimensional model designed for educational data is shown in
Figure 7.1.
Figure 7.1: An example of an educational data-warehouse.
In this example we have two star schemas: one modeling student enrollments in courses, here-
inafter called the Enrollments Star, and another modeling teaching quality assurance (QA) surveys, called
the Teaching QA Star. In the fact table of the Enrollments Star, each student enrollment in a course in a
particular term is recorded, along with the corresponding grade achieved, if approved. The second star
contains the grades given by students to their teachers, in anonymized surveys carried out at the end of
each term. In this sense, each tuple in the Teaching QA Star records the average grade for a specific QA
item (or question), given to some teacher when teaching some course in a specific term for a determined
lesson type (note that these surveys are anonymous, and therefore there is no information about the
students that answered them). As can be seen in the figure, the dimensions Program, Course and Term are
shared by both star schemas.
By mining these multi-dimensional data we can, among other things, discover relations between
dimensions in the context of some event (for example, the types of students that achieve better
grades in different types of courses), as well as understand dimension behaviors, e.g. the most frequent
sets of course results.
In this case study we used data from the Information Systems and Computer Engineering program,
offered at Instituto Superior Tecnico, University of Lisbon, Portugal. From the data warehouse
created, we chose the two stars in Figure 7.1: the Enrollments Star and the Teaching QA Star,
modeling student performances in their enrollments, and teacher evaluation for their lectures, respectively.
7.2 Predicting Student Grades Using Multi-Dimensional Patterns
Our main goal is to test our multi-dimensional methodology using the star schemas in Figure 7.1, for
predicting student results in courses of more advanced years (3rd to 5th), based on the frequent behaviors
found in the first two years of the program and on the performance of teachers. With these experiments,
we want to show that it is possible to take the multi-dimensionality of educational data into account, and
that enriching the data with the multi-dimensional patterns improves classification results.
Data relative to teaching quality assurance is only available from 1995 to 1999, and therefore, to
achieve correct results, we decided to find frequent behaviors only until 1998 and to uncover student results
in 1999. The goal is, therefore, to predict student results in 1999, on a subset of the 10 most representative
courses from the 3rd to 5th years of the program (let this set be called Courses3-5). Thus, we are only
interested in the students and teachers involved. There were more than 650 students enrolled in some of
those courses in 1999, and 36 teachers lecturing those classes. In total, there were 1830 enrollments in
those conditions. Student grades were also categorized as A, B, C, D, F or Failure.
In order to evaluate our proposal, we compared the classification of our enriched data against two baselines
(without patterns), described next. During the pattern filtering phase, we also varied both the number of
patterns chosen and the different filters applied, to understand the variation of the results. The classification
results presented are the average of several 10-fold cross validations.
We also tested our methodology when using the Enrollments Star and the Teaching QA Star. The
multi-dimensional pattern mining algorithm used in these experiments was StarFP-Stream [SA13a];
we used our implementation of it (described in Chapter 3), and the C4.5 implementation
available in Weka.
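The evaluation protocol can be sketched as follows; the learner is abstracted away behind `train` and `classify` (in practice we used C4.5 via Weka), and the toy majority-class learner and data values below are purely illustrative:

```python
# Sketch of the evaluation protocol: average the accuracy over several
# repetitions of 10-fold cross validation. The learner is a placeholder;
# any classifier can be plugged in behind `train` and `classify`.
import random

def ten_fold_accuracy(data, train, classify, repeats=3, seed=0):
    rng = random.Random(seed)
    accs = []
    for _ in range(repeats):
        d = list(data)
        rng.shuffle(d)
        folds = [d[i::10] for i in range(10)]          # 10 disjoint folds
        for i, test in enumerate(folds):
            training = [x for j, f in enumerate(folds) if j != i for x in f]
            model = train(training)
            hits = sum(classify(model, x) == x[-1] for x in test)
            accs.append(hits / len(test))
    return sum(accs) / len(accs)

# Toy run: each row is (features, label), with a majority-class "learner".
data = [("a", 0)] * 30 + [("b", 1)] * 10
majority = lambda rows: max({r[-1] for r in rows},
                            key=[r[-1] for r in rows].count)
print(ten_fold_accuracy(data, majority, lambda m, x: m))  # 0.75
```

Averaging over several shuffled repetitions, as above, reduces the variance introduced by any single fold assignment.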
7.2.1 Baselines
For this case study, we decided to define two different baselines, which are then enriched with our
methodology and used to test the improvements.
Since our goal is to predict student results in 1999 for those 10 courses (Courses3-5), our baselines
contain all enrollments and grades of students that were enrolled in at least one of those courses
during that year. As noted above, there were 1830 enrollments in those conditions, and therefore both
baselines contain 1830 records.
The first baseline (denoted B1) consists of a table composed of the student information, the student's
average grade over the first two years of the program, until 1998, the information about the enrolled course
(from Courses3-5) whose result we want to predict, and the information about the main teacher (conceptually,
it consists of the join of the student, course and teacher dimensions, plus the student's average grade).
In this simple baseline, the only information about former student performance is the average grade;
it is therefore expected not to achieve a very good accuracy, and to improve significantly after
being extended with the multi-dimensional patterns.
A second baseline (denoted B2) consists of a table with the student information, the student's grade in
every course of the 1st and 2nd years (let them be called Courses1-2), until 1998, the information about
the enrolled course (from Courses3-5) in 1999 whose result we want to predict, and the respective main
teacher. Students that did not enroll in some course (in Courses1-2) are marked with an "NE" (not enrolled)
value. This baseline is like B1, but instead of keeping just the average of former years, it contains the
specific grades achieved in those courses. In this way, it is able to model some student behavior, and
is therefore expected to achieve better accuracy results by itself. What we want to show is that,
even so, adding the multi-dimensional patterns may improve not only the model, but also the
accuracy of the classifier. On the one hand, it may improve the model, because it is likely to result in smaller
trees; the reason is that these patterns, especially the aggregated ones, encapsulate several courses and can be
selected instead of several specific grades. On the other hand, it can improve the prediction accuracy, because
multi-dimensional patterns condense information, keeping only what is important. Without these patterns,
classification algorithms choose specific grades to build the model, which may lead to overfitting.
7.2.2 Methodology into Practice
In order to analyze these educational data, we decided to follow the methodology described above (Sec-
tion 6.6.1): (1) run multi-relational pattern mining over both the Enrollments Star and Teaching QA
Star schemas; (2) filter the best inter-dimensional and aggregated patterns; (3) enrich the classification
data (baselines) with these patterns; and (4) run classification over both the baselines and the enriched
datasets, and compare the results (the average of the predictions, and the size of the models built).
During phase one, for multi-dimensional pattern mining, we applied our algorithm StarFP-Stream to
each of the stars.
For finding student behaviors, only the most representative courses from the 1st and 2nd years were
taken into account (23 courses), from 1990 to 1998. In these first years all courses are mandatory, so
more than 17 thousand enrollments in this period were used for pattern mining. The data in
the fact table of the Enrollments Star was aggregated per (student, term) pair, so that we could find
the frequent sets of courses attended (both passed and failed) per term.
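This aggregation amounts to a group-by over the fact table; a minimal sketch follows (the rows are illustrative, not actual records from the data warehouse):

```python
# Sketch of the aggregation step: group fact-table rows by the pair
# (student, term), so that each transaction holds the set of course
# results of one student in one term. Row values are illustrative.
from collections import defaultdict

fact_rows = [
    ("s1", "1996/1", "sub=SIBD"),
    ("s1", "1996/1", "sub=PLF"),
    ("s1", "1997/1", "sub=AN_F"),   # a failed course in another term
    ("s2", "1996/1", "sub=SIBD"),
]

transactions = defaultdict(set)
for student, term, item in fact_rows:
    transactions[(student, term)].add(item)

# Each transaction is now one input itemset for the pattern mining step.
print(sorted(transactions[("s1", "1996/1")]))  # ['sub=PLF', 'sub=SIBD']
```

Each resulting itemset is then one transaction for the frequent pattern mining step.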
For finding teacher behaviors (Teaching QA Star), only the most representative courses from the 3rd,
4th and 5th years were used (the 10 courses in Courses3-5), from 1995 to 1998, inclusive. There were
1088 survey questions answered during this period. Surveys have 10 questions, here numbered from 1
to 10, and grades range from 1 (worst) to 5 (best). While mining the data for this star schema, records
in the fact table were aggregated per survey id (QASurvey), since each survey gathers the questions
evaluating the performance of one teacher, for a specific course, during one term. In this way, we can
find, e.g., frequent sets of assessments (grades per question) given by students to their teachers (as an
example, we can find that some teachers never arrive late, and/or are always available to answer
students' doubts).
The resulting patterns were then filtered with each of the proposed filters (Section 6.6.1), and the
best N were used to enrich the baselines. In this step, patterns were added to the baseline tables as
features (columns), and records that satisfied (or did not satisfy) a pattern were marked with true (or
false) in the corresponding feature.
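For illustration, ranking the mined patterns and keeping the N best can be sketched as below; only the support and size criteria are shown (the rough-set based filters would plug into the same interface), and the example patterns merely echo the style of Table 7.1:

```python
# Sketch of the filtering step: score each (itemset, support) pattern with
# a filter and keep the N best. Only the "support" and "size" filters are
# shown; rough independence and rough chi-square would be two more key
# functions with the same shape. Pattern values are illustrative.

def top_n(patterns, n, key):
    return sorted(patterns, key=key, reverse=True)[:n]

mined = [
    (frozenset({"sub=SIBD", "sub=PLF"}), 447),
    (frozenset({"sub=SIBD", "sub=PLF", "sub=AM3", "sub=PEst", "sub=AN"}), 168),
    (frozenset({"season=2", "sub=AN"}), 126),
]

best_by_support = top_n(mined, 2, key=lambda p: p[1])
best_by_size = top_n(mined, 1, key=lambda p: (len(p[0]), p[1]))
print([s for _, s in best_by_support])   # [447, 168]
print(len(best_by_size[0][0]))           # 5 items in the largest pattern
```

Swapping the `key` function is all that is needed to switch between filters, which is why varying the filter (and N) in the experiments is inexpensive.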
Since we have two star schemas, and therefore two sets of frequent multi-relational patterns, we tested
adding the patterns of each star separately, and also adding the N/2 best patterns of each together.
By doing this, we want to test what is more relevant for predicting student grades: the behaviors of
students, the performance of teachers, or both.
The classification algorithm C4.5 was then applied to these enriched datasets, and the results are
presented below.
7.2.3 Analysis of Multi-Relational Patterns
Some examples of patterns found for the Enrollments Star can be seen in Table 7.1. We can see there
that 117 students failed course F2 and had a bad grade (the lowest, F) in AN in the same term. Also,
the 3rd pattern indicates that it is frequent to pass SIBD, PLF, AM3, PEst and AN in a single
term. The last pattern shows that it is common to fail the AN course in the second season. This last
pattern is inter-dimensional, since it relates terms and subjects, while the others are aggregated patterns.
Table 7.1: Some examples of patterns found for the Enrollments Star.
Pattern                                          Support
(sub=F2, grade=AN_F)                             117
(sub=FEX, sub=AED)                               169
(sub=SIBD, sub=PLF, sub=AM3, sub=PEst, sub=AN)   168
(sub=SIBD, sub=PLF)                              447
(season=2, sub=AN)                               126
Examples of patterns for the Teaching QA Star are presented in Table 7.2. In this table, the first
pattern indicates that it is frequent to have a grade of 5 in question 8; the second (an aggregated pattern)
says that it is common to have grade 4 in questions 3, 4, 6, 7 and 9. The last pattern, for example,
is an inter-dimensional pattern, indicating that teachers of course M usually have grade 3 in question 5.
Table 7.2: Some examples of patterns found for the Teaching QA Star.
Pattern                                                  Support
(grade=8_5)                                              29
(grade=9_4, grade=6_4, grade=7_4, grade=3_4, grade=4_4)  11
(grade=4_3, grade=5_3, grade=9_3)                        8
(grade=8_5, subject=Comp)                                7
(grade=5_3, subject=M)                                   7
7.2.4 Enriched Classification Results
Figures 7.2a and 7.2b show the accuracy of the classification step over baselines B1 and B2, respectively,
and over the corresponding datasets enriched with patterns from student behaviors (i.e. patterns of the
Enrollments Star).
As expected, since B2 has more information about the background of the student, it achieves better
accuracy than B1 (a 35% improvement). It is interesting to see that we can predict 50% of student grades
[Plots omitted: accuracy (%) as a function of the N best patterns (10 to 1000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline 1. (b) Baseline 2.
Figure 7.2: Accuracy for both baselines and respective extensions with student behaviors.
based solely on their characteristics and average grade from years 1 and 2 (B1), and 85% if we know the
grades of the courses in which they enrolled those years (B2). When we add the patterns that represent
student behaviors, we can see in the figures that the accuracy improves in both cases, as expected. In B1
the improvement is huge, of about 35%, because we are adding behavior information about students that
was not present before. In B2, it allows classification to achieve an accuracy of 90%. Although only 4%,
this improvement indicates that patterns are chosen instead of specific courses, which may result in
models with less overfitting, and therefore in more accurate predictions on new instances. Also, the results
show that, in general, the more patterns we use to enrich the training data, the better the accuracy.
When analyzing the different filters, there are small fluctuations, but both the support and closed filters
achieved the best results. Choosing the largest patterns as the best ones (the size filter) is the approach
that brings the smallest improvements, because very few students satisfy them (very small coverage). Both
the rough independence and rough chi-square filters achieve intermediate results. These tendencies occur
in both the B1 and B2 baselines.
Figures 7.3a and 7.3b analyze the size of the trees created by the classifier (i.e. the size of the model).
[Plots omitted: size of the tree as a function of the N best patterns (10 to 1000), for the filters Support, Size, Closed, R-Ind and R-Chi2, against the baseline.]
(a) Baseline 1. (b) Baseline 2.
Figure 7.3: Size of the trees for both baselines and respective extensions with student behaviors.
We can see that for B2, also as expected, the trees resulting from classifying the enriched datasets
are smaller than the base tree (with up to 300 fewer nodes for N = 250 patterns). In the B1 case, the
models of the enriched datasets are larger than the baseline, mainly because the baseline does not have
much information, and when we add patterns they are chosen for building the tree. Nevertheless, for
similar values of accuracy (85%), the tree for B1 is much smaller than the tree for B2. The tendencies
of the different filters are the same: both the support and closed filters result in smaller trees earlier in B2,
and in slightly larger trees in B1.
Results using the baselines enriched with teacher performances revealed that these patterns are not
very important for predicting student results. In these extended datasets, the accuracy is very close to
that of the baselines, and therefore the corresponding figures are not presented here.
7.3 Discussion and Conclusions
In this chapter, we presented a case study in the educational domain. Using a sample of a data warehouse
from the educare project, we discussed the use of multi-relational data mining algorithms to mine this model,
as well as the use of the results to improve classification.
Experiments on these real data show that it is possible to take into account the multi-dimensionality
of the educational data, and that by applying the multi-dimensional methodology, we are able not only
to discover frequent behaviors, but also to use those behaviors to improve the prediction of student grades.
We applied the method to more than one related star schema, which allowed us to find structured
patterns for both students and teachers, such as frequent sets of courses (and grades) for which students
were approved (or not), and frequent sets of teacher assessments.
As in the hepatitis case study, classification results show that prediction accuracy improves when
enriching datasets with the multi-dimensional patterns relating to student behaviors, and that the models
built are also smaller.
We show again in this chapter that the employed methodology is simple and general, and may be
applied to any educational data warehouse or star schema, as well as to different domains.
Chapter 8
Conclusions and Future Work
In this dissertation we have proposed a new algorithm for finding multi-dimensional patterns in large and
growing databases modeled as star schemas. The algorithm, named StarFP-Stream, combines MRDM
with data streaming techniques, and is able to mine a star schema directly, without materializing the
join between the tables. Moreover, by using a strategy similar to the one followed in mining data streams,
it can effectively mine large star schemas, as well as DWs, by dealing with their growing nature.
The pattern-tree used by the algorithm allows it to continuously store and update the current patterns
in an efficient way, keeping them up to date and accessible, anytime.
Another important contribution of StarFP-Stream is that it correctly handles degenerate
dimensions, by aggregating the rows in the fact table that correspond to the same business
event, and it is also able to find multi-dimensional patterns at other levels of aggregation (either by
some dimension or by a combination of dimensions).
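As a minimal illustration of this idea (a toy sketch, not the StarFP-Stream implementation; the tables, keys and support threshold are all hypothetical), the following Python snippet aggregates the fact rows that share the same degenerate key into one business event, and then counts multi-dimensional itemset supports without ever materializing the join:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: a fact table keyed by a degenerate dimension
# (order_id), referencing two dimensions by surrogate keys.
facts = [  # (order_id, product_key, store_key)
    (1, "p1", "s1"), (1, "p2", "s1"),
    (2, "p1", "s2"),
    (3, "p1", "s1"), (3, "p2", "s1"),
]
product_dim = {"p1": "bread", "p2": "milk"}
store_dim = {"s1": "Lisbon", "s2": "Porto"}

# Step 1: aggregate fact rows that share the same business event
# (the degenerate dimension); dimension values are fetched per key,
# so no joined table is ever built.
events = {}
for order, pk, sk in facts:
    tx = events.setdefault(order, set())
    tx.add(("product", product_dim[pk]))
    tx.add(("store", store_dim[sk]))

# Step 2: count the support of multi-dimensional itemsets per event.
support = Counter()
for tx in events.values():
    items = sorted(tx)
    for r in range(1, len(items) + 1):
        for itemset in combinations(items, r):
            support[itemset] += 1

# Keep itemsets occurring in at least 2 business events.
frequent = {s: c for s, c in support.items() if c >= 2}
```

Here, orders 1 and 3 collapse into identical transactions, so the multi-dimensional itemset {product=bread, product=milk, store=Lisbon} has support 2, while it would be miscounted if each fact row were treated as a separate transaction.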
There are only two other algorithms in the literature for relational pattern mining over data streams.
However, one is a probabilistic approach and therefore does not return all real patterns, while the other is
not able to deal with degenerate dimensions or other aggregations. In this sense, they are not directly
comparable with StarFP-Stream.
Performance analysis over several star schemas shows that our algorithm is accurate and efficient, and
that it does not depend on the number of transactions processed so far. Experiments also show that
StarFP-Stream outperforms its single-table predecessor in terms of time. Thus, we can say that our
algorithm surpasses the join-before-mining approach.
In order to tackle the incorporation of domain knowledge, we have also proposed in this work two
efficient and general algorithms for pushing constraints into a pattern-tree. CoPT and CoPT4Streams
are designed for single table (and static) datasets and for single table data streams, respectively. By
using the pattern-tree structure, both algorithms are able to optimize the incorporation of constraints.
The idea is to take advantage of constraint properties to avoid unnecessary tests and to eliminate invalid
patterns earlier, while traversing the tree.
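This pruning idea can be sketched as follows (a simplified illustration with a hypothetical price constraint, not the actual CoPT pattern-tree traversal): once an anti-monotone constraint fails for a pattern, every superset of that pattern is discarded without being tested at all.

```python
# Hypothetical item prices used by the example constraint.
prices = {"a": 5, "b": 20, "c": 3}

def max_price_ok(itemset, limit=10):
    # sum(price) <= limit is anti-monotone: adding items can only
    # increase the sum, so a violation propagates to all supersets.
    return sum(prices[i] for i in itemset) <= limit

patterns = [frozenset(s) for s in
            [{"a"}, {"b"}, {"c"}, {"a", "c"}, {"a", "b"}, {"a", "b", "c"}]]

valid, failed = [], set()
for p in sorted(patterns, key=len):      # visit smaller sets first
    if any(f <= p for f in failed):      # superset of a known failure:
        continue                         # eliminated without testing
    if max_price_ok(p):
        valid.append(p)
    else:
        failed.add(p)

# {b} fails the constraint, so {a, b} and {a, b, c} are never tested;
# valid keeps {a}, {c} and {a, c}.
```

The same property is what lets a tree traversal cut an entire subtree as soon as one node violates an anti-monotone constraint, instead of testing every pattern individually.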
In the streaming case, the advantages of this approach are even more visible, since constraints are
pushed at each batch boundary, resulting in a smaller pattern-tree for every subsequent batch, and therefore
in less time and memory needed in the overall discovery process.
Experiments show that both algorithms are effective and efficient for all constraint properties, even
for constraints with low selectivity, when compared to an approach that does not take these
properties into account.
For the integration of multi-dimensional mining with constrained mining, we first defined a set of
constraints for star schemas – the star constraints: entity type, entity, attribute and measure constraints.
These constraints capture the relations in a star schema and the aspects that can be restricted. Then,
we also proposed a set of strategies for pushing these star constraints into multi-dimensional mining
algorithms, and showed that it is possible to incorporate constraints into the mining of multiple tables.
Being post-processing algorithms, CoPT and CoPT4Streams cannot be directly applied to
the mining of a star schema. However, they can both be applied to the mining of the fact table, with small
adaptations to deal with different entity types and to retrieve the values from the respective dimensions.
We have also proposed the algorithm D2StarFP-Stream, an adaptation of StarFP-Stream for
incorporating star constraints into the mining of large and growing star schemas. By incorporating
constraints, the algorithm is able to maintain smaller summary structures, minimizing the bottleneck of
its counterpart and therefore returning fewer, but more interesting, results.
To the best of our knowledge, this is the first approach dedicated to the incorporation of constraints
into multi-dimensional pattern mining.
Experiments over real-world datasets validate our claims, and demonstrate the utility, efficacy and efficiency
of StarFP-Stream. Of particular interest are the experiments on enriching classification data with multi-
dimensional patterns. Results show that prediction accuracy improves when we add the discovered
patterns, and that they are chosen as key features instead of pre-existing data. This demonstrates
the interest and applicability of multi-dimensional patterns.
8.1 Future Work
Despite the advances made in this dissertation, it opens several opportunities for future research, from
the multi-relational, the data streaming and the constrained perspectives.
Considering a Time Sensitive Model: In many real world applications, changes in patterns and their
trends are more interesting than patterns themselves (e.g. shopping and fashion trends, Internet
bandwidth usage, resource allocation, etc.). Therefore, an important way of improvement is to
extend StarFP-Stream to a time sensitive model. We discuss this in section 3.3.8.
Finding Structured and Temporal Patterns: Since a DW is a historical repository of data, time is a dimen-
sion that is always present. Events arrive continuously in time, and are stored in the fact
table. In this sense, DWs have a sequential nature, and therefore finding other types of patterns,
such as sequences and temporal regularities (e.g. periodicities), is possible and could bring benefits
to multi-relational pattern mining.
Indeed, despite the advancements, there is still the need for creating algorithms capable of finding
structured and temporal patterns in a multi-dimensional context.
Pushing Structural Constraints: Along the line of the above, since we have time and sequences, it
also makes sense to constrain the temporal and sequential aspects of the facts in a DW. These
structural constraints allow us to, for example, limit the gap between events, perform short-term
(or long-term) analysis, specify interesting combinations or orders of items, etc.
Therefore, there is also an interest in defining these constraints in the multi-dimensional environ-
ment, and developing algorithms capable of incorporating them.
Pushing Graph-Based Domain Knowledge and Network Constraints: Graph-based represen-
tations are a valuable and more expressive source of domain knowledge, and are increasingly
available nowadays. These representations, such as ontologies, capture the conceptual structure of
the domain and model the existing concepts and relations in a more intuitive way (note that they
are models of the domain, not of the data).
One way to incorporate this knowledge is through structural network constraints. As described in
section 4.5, by mapping items to domain concepts, these constraints allow us to filter the existing
relations (both taxonomical and non-taxonomical) between items, as well as the concepts and
distances. In the presence of a graph-based domain model, these constraints can also be defined
over the star schema.
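A structural network constraint of this kind can be sketched as follows (a toy illustration under assumed data: the ontology, the item-to-concept mapping and the distance threshold are all hypothetical): items are mapped to concepts, and a pattern is kept only if every pair of its items lies within a maximum distance in the concept graph.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy ontology as an undirected concept graph.
graph = {
    "dairy": {"food"}, "bakery": {"food"},
    "food": {"dairy", "bakery"},
    "soap": {"hygiene"}, "hygiene": {"soap"},
}
item_concept = {"milk": "dairy", "bread": "bakery", "soap": "soap"}

def distance(a, b):
    # BFS shortest path between two concepts; None if disconnected.
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for n in graph.get(node, ()):
            if n not in seen:
                seen.add(n)
                queue.append((n, d + 1))
    return None

def within_distance(pattern, max_d=2):
    # Network constraint: every pair of items must map to concepts
    # at most max_d edges apart in the domain model.
    for x, y in combinations(pattern, 2):
        d = distance(item_concept[x], item_concept[y])
        if d is None or d > max_d:
            return False
    return True

patterns = [{"milk", "bread"}, {"milk", "soap"}]
filtered = [p for p in patterns if within_distance(p)]
```

With this toy model, {milk, bread} survives (dairy and bakery are two edges apart through food), while {milk, soap} is discarded because the concepts are disconnected in the domain graph.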
Work in this area is increasing [KLSP07, MGB08, Ant09b, ME09b], and results show that the
discovered patterns are more interesting when we filter them according to what we already know
based on the domain model. This means that one important step forward is to find ways of
incorporating these graph-based domain models into the mining of multiple relations.
Optimizing StarFP-Stream: Another path for improvement is to optimize our algorithm.
For example, parallelizing StarFP-Stream can significantly improve the time needed, as well as
increase the throughput of the algorithm. In this case, we can parallelize the processing of each
fact in a batch (since all it does is keep the transactions in the corresponding DimFP-Trees).
Also, at each batch boundary, while those trees are being processed and results mined, the new
batch may already be collected and facts can be processed in parallel. There have already been
some efforts in the parallelization of traditional pattern mining, in particular of the base FP-Growth
algorithm [LWZ+08], which may also serve as a basis for parallelizing the mining of the SuperFP-
Tree of our StarFP-Stream algorithm.
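The batch-level parallelization described above can be sketched as follows (an illustrative toy only, not StarFP-Stream itself: the per-fact work is replaced by a simple item count, and the names are hypothetical). Each fact of a batch is processed independently, and the partial results are merged at the batch boundary:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# A toy batch of facts: (product_key, store_key) pairs.
batch = [("p1", "s1"), ("p2", "s1"), ("p1", "s2"), ("p2", "s2")]

def process_fact(fact):
    # In StarFP-Stream this step would insert the fact's transaction
    # into the corresponding DimFP-Trees; here we just count the
    # keys it touches, to keep the sketch self-contained.
    product, store = fact
    return Counter({product: 1, store: 1})

# Facts in a batch are independent, so they can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_fact, batch))

# Merge the partial results at the batch boundary.
totals = sum(partials, Counter())
```

The same idea extends to overlapping stages: while one batch's trees are being mined, the facts of the next batch can already be collected and dispatched to the pool.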
Furthermore, the pattern-tree is usually huge, and therefore it might not fit in main memory. A
good next step would be to find a way to store and manage it on disk.
Finally, another path for improvement is to integrate StarFP-Stream with database management
systems (DBMS), in order to retrieve data from the dimensions. This might be very important,
since our algorithm assumes all dimensions are in main memory, which may not be possible in real-
world large DWs. By integrating it with the DBMS, whenever a new fact arrives StarFP-Stream
can ask the database for the corresponding transactions, saving significantly on memory needs.
Bibliography
[ACTM11] Annalisa Appice, Michelangelo Ceci, Antonio Turi, and Donato Malerba. A parallel, distributed algorithm for relational frequent pattern discovery from very large data sets. Intell. Data Anal., 15(1):69–88, January 2011.
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD 93), pages 207–216. ACM, 1993.
[ALB03] Hunor Albert-Lorincz and Jean-Francois Boulicaut. Mining frequent sequential patterns under regular expressions: A highly adaptive strategy for pushing constraints. In Proc. of the 3rd SIAM Int. Conf. on Data Mining (SDM 03), pages 316–320, San Francisco, CA, USA, 2003. Springer-Verlag.
[Ant07] Claudia Antunes. Onto4ar: a framework for mining association rules. In Workshop on Constraint-Based Mining and Learning in the Int. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDDW-CMILE 07), page 37, Warsaw, Poland, 2007. Springer.
[Ant08] Claudia Antunes. An ontology-based framework for mining patterns in the presence of background knowledge. In Proc. of Int. Conf. on Advanced Intelligence (ICAI 08), pages 163–168, Beijing, China, 2008. Post and Telecom Press.
[Ant09a] Claudia Antunes. Mining patterns in the presence of domain knowledge. In Proc. of the 11th Int. Conf. on Enterprise Information Systems (ICEIS 09), pages 188–193, Milan, Italy, 2009. Springer.
[Ant09b] Claudia Antunes. Pattern mining over star schemas in the onto4ar framework. In Proc. of the 2009 Int. Workshop on Semantic Aspects in Data Mining (SADM 09), pages 453–458, Washington, DC, USA, 2009. IEEE Computer Society.
[AO02] Claudia Antunes and Arlindo Oliveira. Inference of sequential association rules guided by context-free grammars. In Proc. of the 6th Int. Conf. on Grammatical Inference (ICGI 2002), pages 289–293, Amsterdam, 2002. Springer.
[AO03] Claudia Antunes and Arlindo Oliveira. Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In Proc. of the 3rd Int. Conf. on Machine Learning and Data Mining in Pattern Recognition (MLDM 03), pages 239–251, Leipzig, Germany, 2003. Springer-Verlag.
[AO04] Claudia Antunes and Arlindo L. Oliveira. Sequential pattern mining with approximated constraints. In Proc. of IADIS Int. Applied Computing Conf. (AC 04), pages 131–138, Lisbon, Portugal, 2004. IADIS Press.
[AO05] Claudia Antunes and Arlindo Oliveira. Constraint relaxations for discovering unknown sequential patterns. In Knowledge Discovery in Inductive Databases: 3rd Int. Workshop, KDID 2004 (Revised Selected and Invited Papers), pages 11–32, 2005.
[AS94] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB 94: Proc. of the 20th Intern. Conf. on Very Large Data Bases, pages 487–499, San Francisco, USA, 1994. Morgan Kaufmann.
[BA99] Roberto J. Bayardo and Rakesh Agrawal. Mining the most interesting rules. In Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 99), pages 145–154, San Diego, California, United States, 1999. ACM.
[BA11] Joana Barracosa and Claudia Antunes. Anticipating teachers performance. In Proc. of Int. Workshop on Knowl. Discovery on Educational Data (KDDinED@KDD). ACM, 2011.
[BA14] Antonio Barreto and Claudia Antunes. Mining compact but non-lossy convergent patterns over time series. In Proceedings of the International Work-Conference on Time Series (ITISE 14), 2014.
[Bay05] Roberto J. Bayardo. The hows, whys, and whens of constraints in itemset and rule discovery. In Proc. of the 2004 European Conf. on Constraint-Based Mining and Inductive Databases, pages 1–13, Hinterzarten, Germany, 2005. Springer-Verlag.
[BBM02] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In Proc. of the Nineteenth Int. Conf. on Machine Learning (ICML 02), pages 27–34, Sydney, Australia, 2002. Morgan Kaufmann Publishers Inc.
[BBR00] Jean-Francois Boulicaut, Artur Bykowski, and Christophe Rigotti. Approximation of frequency queries by means of free-sets. In Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD 00), pages 75–85, London, UK, 2000. Springer-Verlag.
[BGKW03] Cristian Bucila, Johannes Gehrke, Daniel Kifer, and Walker M. White. Dualminer: A dual-pruning algorithm for itemsets with constraints. Data Min. Knowl. Discov., 7(3):241–272, 2003.
[BGMP03] Francesco Bonchi, Fosca Giannotti, Alessio Mazzanti, and Dino Pedreschi. Adaptive constraint pushing in frequent pattern mining. In Proc. of the 7th Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD 03), pages 47–58, Cavtat-Dubrovnik, Croatia, 2003. Springer Berlin Heidelberg.
[BGMP05] Francesco Bonchi, Fosca Giannotti, Alessio Mazzanti, and Dino Pedreschi. Exante: A preprocessing method for frequent-pattern mining. IEEE Intelligent Systems, 20(3):25–31, 2005.
[BJ00] Jean-Francois Boulicaut and Baptiste Jeudy. Using constraints for itemset mining: Should we prune or not? In Actes des 16emes Journees Bases de Donnees Avancees (BDA 00), Blois, France, 2000.
[BJ05] Jean-Francois Boulicaut and Baptiste Jeudy. Constraint-based data mining. In The Data Mining and Knowledge Discovery Handbook, pages 399–416. Springer, 2005.
[BMS97] Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market baskets: generalizing association rules to correlations. SIGMOD Rec., 26(2):265–276, 1997.
[Bou04] Jean-Francois Boulicaut. Inductive databases and multiple uses of frequent itemsets: The cinq approach. In Database Support for Data Mining Applications, pages 1–23, Berlin, Germany, 2004. Springer.
[CJB99] B. Chandrasekaran, John R. Josephson, and V. Richard Benjamins. What are ontologies, and why do we need them? IEEE Intelligent Systems, 14(1):20–26, 1999.
[CJS00] Viviane Crestana-Jensen and Nandit Soparkar. Frequent itemset counting across multiple tables. In PADKK 00: Proc. of the 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Current Issues and New Applications, pages 49–61, London, 2000. Springer.
[CLZ07] Longbing Cao, Dan Luo, and Chengqi Zhang. Knowledge actionability: satisfying technical and business interestingness. Int. J. Bus. Intell. Data Min., 2(4):496–514, December 2007.
[CMB02] Matthieu Capelle, Cyrille Masson, and Jean-Francois Boulicaut. Mining frequent sequential patterns under a similarity constraint. In Proc. of the Third Intern. Conf. on Intelligent Data Engineering and Automated Learning (IDEAL 02), pages 1–6, London, UK, 2002. Springer-Verlag.
[CS01] Laurentiu Cristofor and Dan Simovici. Mining association rules in entity-relationship modeled databases. Technical report, 2001.
[CYZZ10a] Longbing Cao, P. Yu, C. Zhang, and H. Zhang. Data Mining for Business Applications. Springer, 2010.
[CYZZ10b] Longbing Cao, P. Yu, C. Zhang, and Y. Zhao. Domain driven data mining. Springer, 2010.
[CZ06] Longbing Cao and Chengqi Zhang. Domain-driven data mining: A practical methodology. Int. Journal of Data Warehousing and Mining (IJDWM), 2(4):49–65, 2006.
[CZZ+07] Longbing Cao, Chengqi Zhang, Yanchang Zhao, Philip S. Yu, and Graham Williams. Dddm2007: Domain driven data mining. SIGKDD Explor. Newsl., 9(2):84–86, 2007.
[DKP+06a] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla. Unifying logical and statistical ai. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI 06), pages 2–7. AAAI Press, 2006.
[DKP+06b] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla. Unifying logical and statistical ai. In Proc. of the 21st Int. Conf. on Artificial Intelligence - Volume 1 (AAAI 06), pages 2–7, Boston, Massachusetts, 2006. AAAI Press.
[DL99] Guozhu Dong and Jinyan Li. Efficient mining of emerging patterns: discovering trends and differences. In Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 99), pages 43–52, San Diego, California, United States, 1999. ACM.
[Dom03] Pedro Domingos. Prospects and challenges for multi-relational data mining. SIGKDD Explor. Newsl., 5(1):80–83, 2003.
[Dom07] Pedro Domingos. Toward knowledge-rich data mining. Data Min. Knowl. Discov., 15(1):21–28, 2007.
[DP08] C. Diamantini and D. Potena. Semantic annotation and services for kdd tools sharing and reuse. In Proc. of the 2008 IEEE Int. Conf. on Data Mining Workshops (ICDMW 08), pages 761–770, Pisa, Italy, 2008. IEEE.
[DR97] L. Dehaspe and L. De Raedt. Mining association rules in multiple relations. In ILP 97: Proc. of the 7th Intern. Workshop on Inductive Logic Programming, pages 125–132, London, UK, 1997. Springer.
[D96] Saso Dzeroski. Inductive logic programming and knowledge discovery in databases. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 117–152. MIT Press, 1996.
[D03] Saso Dzeroski. Multi-relational data mining: an introduction. SIGKDD Explor. Newsl., 5(1):1–16, 2003.
[EC07] Gonenc Ercan and Ilyas Cicekli. Using lexical chains for keyword extraction. Inf. Process. Manage., 43(6):1705–1714, 2007.
[FCAM09] Fabio Fumarola, Anna Ciampi, Annalisa Appice, and Donato Malerba. A sliding window algorithm for relational frequent patterns mining from data streams. In Proc. of the 12th Intern. Conf. on Discovery Science, pages 385–392. Springer, 2009.
[FPSM92] William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. Knowledge discovery in databases: an overview. AI Mag., 13(3):57–70, 1992.
[FPSS96] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37–54, 1996.
[FSC05] Nuno Fonseca, Fernando Silva, and Rui Camacho. Strategies to parallelize ilp systems. In Proc. of the 15th Int. Conf. on Inductive Logic Programming (ILP 05), pages 136–153, Berlin, Heidelberg, 2005. Springer-Verlag.
[GB00] Bart Goethals and Jan Van den Bussche. On supporting interactive association rule mining. In Proc. of the 2nd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 00), pages 307–316, London, UK, 2000. Springer-Verlag.
[GHP+03] Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, and Philip S. Yu. Mining frequent patterns in data streams at multiple time granularities: Next generation data mining. AAAI/MIT, 2003.
[GLW00] G. Grahne, L. V. S. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. In Proc. of the 16th Int. Conf. on Data Engineering, pages 512–521, 2000.
[GMV11] Bart Goethals, Sandy Moens, and Jilles Vreeken. Mime: a framework for interactive visual pattern mining. In Proc. of the 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 11), pages 757–760, San Diego, California, USA, 2011. ACM.
[GRS99] Minos N. Garofalakis, Rajeev Rastogi, and Kyuseok Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proc. of the 25th Int. Conf. on Very Large Data Bases (VLDB 99), pages 223–234, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[GSD07] Warwick Graco, Tatiana Semenova, and Eugene Dubossarsky. Toward knowledge-driven data mining. In Proc. of the 2007 Int. Workshop on Domain Driven Data Mining (DDDM 07), pages 49–54, San Jose, California, 2007. ACM.
[HCXY07] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov., 15(1):55–86, August 2007.
[HF95] Jiawei Han and Yongjian Fu. Discovery of multiple-level association rules from large databases. In Proc. of the 21th Int. Conf. on Very Large Data Bases (VLDB 95), pages 420–431, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[HG02] Jochen Hipp and Ulrich Guntzer. Is pushing constraints deeply into the mining algorithms really what we want?: an alternative approach for association rule mining. SIGKDD Explor. Newsl., 4(1):50–55, 2002.
[HKP11] Jiawei Han, M. Kamber, and Jian Pei. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems. Elsevier Science, 2011.
[HPY00] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD 00: Proc. of the 2000 ACM SIGMOD, pages 1–12, New York, NY, USA, 2000. ACM.
[HPYM04] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53–87, 2004.
[HYXW09] Wei Hou, Bingru Yang, Yonghong Xie, and Chensheng Wu. Mining multi-relational frequent patterns in data streams. In BIFE 09: Proc. of the Second Intern. Conf. on Business Intelligence and Financial Engineering, pages 205–209, 2009.
[Inm96] W. H. Inmon. Building the data warehouse (2nd ed.). John Wiley & Sons, Inc., New York, NY, USA, 1996.
[JLL07] Joanna Jozefowska, Agnieszka Lawrynowicz, and Tomasz Lukaszewski. A study of the semintec approach to frequent pattern mining. In Bettina Berendt, Dunja Mladenic, Marco de Gemmis, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtech Svatek, and Filip Zelezny, editors, Knowledge Discovery Enhanced with Semantic and Social Information, volume 220 of Studies in Computational Intelligence, pages 37–51. Springer, 2007.
[JLL10] Joanna Jozefowska, Agnieszka Lawrynowicz, and Tomasz Lukaszewski. The role of semantics in mining frequent patterns from knowledge bases in description logics with rules. Theory Pract. Log. Program., 10(3):251–289, 2010.
[JN07] Finn Jensen and Thomas Nielsen. Bayesian Networks and Decision Graphs. Springer Publishing Company, Incorporated, 2nd edition, 2007.
[JS04] Szymon Jaroszewicz and Dan A. Simovici. Interestingness of frequent itemsets using bayesian networks as background knowledge. In Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 04), pages 178–186, Seattle, WA, USA, 2004. ACM.
[JS05] Szymon Jaroszewicz and Tobias Scheffer. Fast discovery of unexpected patterns in data, relative to a bayesian network. In Proc. of the 11th ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining (KDD 05), pages 118–127, Chicago, Illinois, USA, 2005. ACM.
[Kan05] Juveria Kanodia. Structural advances for pattern discovery in multi-relational databases. Master's thesis, Rochester Institute of Technology, Rochester, NY, 2005.
[KLSP07] Yen-Ting Kuo, Andrew Lonie, Liz Sonenberg, and Kathy Paizis. Domain ontology driven data mining: a medical case study. In Proc. of the 2007 Int. Workshop on Domain Driven Data Mining (DDDM 07), pages 11–17, San Jose, California, 2007. ACM.
[KR02] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit - The Complete Guide to Dimensional Modeling. John Wiley & Sons, Inc., New York, USA, 2nd edition, 2002.
[KSR+07] Stanley Kok, M. Sumner, Matthew Richardson, Parag Singla, Hoifung Poon, D. Lowd, and Pedro Domingos. The alchemy system for statistical relational ai. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2007. http://alchemy.cs.washington.edu.
[KT05] Hian Koh and Gerald Tan. Data mining applications in healthcare. Journal of Healthcare Information Management, 19(2):64–71, 2005.
[KW06] Harleen Kaur and Siri Wasan. Empirical study on applications of data mining techniques in healthcare. Journal of Computer Science, 2(2):194–200, 2006.
[LB09] Carson Kai-Sang Leung and Dale A. Brajczuk. Efficient algorithms for mining constrained frequent patterns from uncertain data. In Proc. of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data (U 09), pages 9–18, Paris, France, 2009. ACM.
[LE09] Francesca Lisi and Floriana Esposito. On ontologies as prior conceptual knowledge in inductive logic programming. In Bettina Berendt, Dunja Mladenic, Marco de Gemmis, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtech Svatek, and Filip Zelezny, editors, Knowledge Discovery Enhanced with Semantic and Social Information, volume 220 of Studies in Computational Intelligence, pages 3–17. Springer, 2009.
[LHB10] Carson Kai-Sang Leung, Boyu Hao, and Dale Brajczuk. Mining uncertain data for frequent itemsets that satisfy aggregate constraints. In Proc. of the 2010 ACM Symposium on Applied Computing (SAC 10), pages 1034–1038, Sierre, Switzerland, 2010. ACM.
[LHM98] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proc. of the 1998 Intern. Conf. on Knowledge Discovery and Data Mining (KDD 98), pages 80–86, New York, NY, USA, 1998. AAAI Press.
[Lis05] Francesca Lisi. Principles of inductive reasoning on the semantic web: a framework for learning in al-log. In Proc. of the 3rd Int. Conf. on Principles and Practice of Semantic Web Reasoning (PPSWR 05), pages 118–132, Berlin, Germany, 2005. Springer-Verlag.
[Liu10] Haishan Liu. Towards semantic data mining. In Proc. of the 9th Int. Semantic Web Conf. (ISWC 10), 2010.
[LK06] Carson Kai-Sang Leung and Quamrul Khan. Efficient mining of constrained frequent patterns from streams. In Proc. of the 10th Int. Database Engineering and Applications Symposium (IDEAS 06), volume 0, pages 61–68, Delhi, India, 2006. IEEE Computer Society.
[LLH11] Hongyan Liu, Yuan Lin, and Jiawei Han. Methods for mining frequent items in data streams: an overview. Knowl. Inf. Syst., 26(1):1–30, 2011.
[LLN02] Carson Kai-Sang Leung, Laks Lakshmanan, and Raymond Ng. Exploiting succinct constraints using fp-trees. SIGKDD Explor. Newsl., 4(1):40–49, 2002.
[LM03] Francesca Lisi and Donato Malerba. Bridging the gap between horn clausal logic and description logics in inductive learning. In Proc. of Advances in Artificial Intelligence, 8th Congress of the Italian Association for Artificial Intelligence (AI*IA 03), pages 53–64, Pisa, Italy, 2003. Springer.
[LM04] Francesca Lisi and Donato Malerba. Inducing multi-level association rules from multiple relations. Machine Learning, 55(2):175–210, 2004.
[LR98] Alon Levy and Marie-Christine Rousset. Combining horn rules and description logics in carin. Artif. Intell., 104(1-2):165–209, 1998.
[LS12] Carson Kai-Sang Leung and Lijing Sun. A new class of constraints for constrained frequent pattern mining. In Proc. of the 27th Annual ACM Symposium on Applied Computing (SAC 12), pages 199–204, Trento, Italy, 2012. ACM.
[LSW97] Brian Lent, Arun Swami, and Jennifer Widom. Clustering association rules. In Proc. of the 13th Intern. Conf. on Data Engineering (ICDE 97), pages 220–231, Birmingham, U.K., 1997. IEEE Computer Society.
[LVS+11] Nada Lavrac, Anze Vavpetic, Larisa N. Soldatova, Igor Trajkovski, and Petra Kralj Novak. Using ontologies in semantic data mining with segs and g-segs. In Proc. of the 14th Int. Conf. on Discovery Science (DS 11), pages 165–178, Finland, 2011.
[LWZ+08] Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang. Pfp: Parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys 08), pages 107–114, New York, NY, USA, 2008. ACM.
[ME09a] Nizar Mabroukeh and Christie Ezeife. Semantic-rich markov models for web prefetching. In Proc. of the IEEE Int. Conf. on Data Mining Workshops (ICDMW 09), pages 465–470, Miami, Florida, USA, 2009.
[ME09b] Nizar Mabroukeh and Christie Ezeife. Using domain ontology for semantic web usage mining and next page prediction. In Proc. of the 18th ACM Conf. on Information and Knowledge Management (CIKM 09), pages 1677–1680, Hong Kong, China, 2009. ACM.
[MEL01] Donato Malerba, Floriana Esposito, and Francesca A. Lisi. A logical framework for frequent pattern discovery in spatial data. In FLAIRS Conference, pages 557–561, Florida, USA, 2001. AAAI Press.
[MGB08] Claudia Marinica, Fabrice Guillet, and Henri Briand. Post-processing of discovered association rules using ontologies. In Proc. of the 2008 Int. Workshop on Domain Driven Data Mining (DDDM 08), pages 126–133, Pisa, Italy, 2008. IEEE Computer Society.
[MM02] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In VLDB 02: Proc. of the 28th Intern. Conf. on Very Large Data Bases, pages 346–357, Hong Kong, China, 2002. Morgan Kaufman.
[MPP07] Ricardo Martinez, Claude Pasquier, and Nicolas Pasquier. Genminer: Mining informative association rules from genomic data. In Proc. of the IEEE Intern. Conf. on Bioinformatics and Biomedicine (BIBM 2007), pages 15–22. IEEE Computer Society, 2007.
[MT97] Heikki Mannila and Hannu Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov., 1(3):241–258, 1997.
[MTIV97] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3):259–289, 1997.
[MVCRV13] Carlos Marquez-Vera, Alberto Cano, Cristobal Romero, and Sebastian Ventura. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl. Intell., 38(3):315–330, 2013.
[NCW97] Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming. Springer-Verlag, Secaucus, NJ, USA, 1997.
[NDD99] Biswadeep Nag, Prasad M. Deshpande, and David J. DeWitt. Using a knowledge cache for interactive discovery of association rules. In Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 99), pages 244–253, San Diego, California, United States, 1999. ACM.
[NFW02] Eric Ka Ka Ng, Ada Wai-Chee Fu, and Ke Wang. Mining association rules from stars. In ICDM 02: Proc. of the 2002 IEEE International Conf. on Data Mining, pages 322–329, Japan, 2002. IEEE.
[NJG11] Siegfried Nijssen, Aida Jimenez, and Tias Guns. Constraint-based pattern mining in multi-relational databases. In ICDM Workshops, pages 1120–1127, Vancouver, BC, Canada, 2011. IEEE Computer Society.
[NK01] Siegfried Nijssen and Joost N. Kok. Faster association rules for multiple relations. In IJCAI 01: Proc. of the 17th Intern. Joint Conf. on Artificial Intelligence, volume 2, pages 891–896, San Francisco, CA, USA, 2001. Morgan Kaufmann.
[NLHP98] Raymond Ng, Laks Lakshmanan, Jiawei Han, and Alex Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of Data, pages 13–24, Seattle, Washington, United States, 1998. ACM.
[NVTL09] Petra Novak, Anze Vavpetic, Igor Trajkovski, and Nada Lavrac. Towards semantic data mining with g-segs. In Proc. of the 11th Int. Multiconference Information Society (IS 09), 2009.
[ORS98] Banu Ozden, Sridhar Ramaswamy, and Abraham Silberschatz. Cyclic association rules. In Proc. of the 14th Int. Conf. on Data Engineering (ICDE 98), pages 412–421, Washington, DC, USA, 1998. IEEE Computer Society.
[PA09] Miguel Pironet and Claudia Antunes. Classification for fraud detection with social network analysis. Technical report, Instituto Superior Tecnico, Universidade de Lisboa, Portugal, 2009.
[PBTL99] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Efficient mining of association rules using closed itemset lattices. Inf. Syst., 24(1):25–46, 1999.
[PDS08] Pance Panov, Saso Dzeroski, and Larisa Soldatova. Ontodm: An ontology of data mining. In Proc. of the 2008 IEEE Int. Conf. on Data Mining Workshops (ICDMW 08), pages 752–760, Washington, DC, USA, 2008. IEEE Computer Society.
[Pei02] Jian Pei. Pattern-growth methods for frequent pattern mining. PhD thesis, Simon Fraser University, Burnaby, BC, Canada, 2002. Adviser: Jiawei Han.
[PH00] Jian Pei and Jiawei Han. Can we push more constraints into frequent pattern mining? In Proc. of the Sixth ACM SIGKDD Intern. Conf. on Knowledge Discovery and Data Mining (KDD 00), pages 350–354, Boston, Massachusetts, USA, 2000. ACM.
[PH02] Jian Pei and Jiawei Han. Constrained frequent pattern mining: a pattern-growth view. SIGKDD Explor. Newsl., 4(1):31–39, 2002.
[PHL01] Jian Pei, Jiawei Han, and Laks V. S. Lakshmanan. Mining frequent itemsets with convertible constraints. In Proc. of the 17th Int. Conf. on Data Engineering (ICDE 01), pages 433–442, Washington, DC, USA, 2001. IEEE Computer Society.
[PHMA+01] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Meichun Hsu. Prefixspan: Mining sequential patterns by prefix-projected growth. In Proc. of the 17th Int. Conf. on Data Engineering (ICDE 01), pages 215–224, Washington, DC, USA, 2001. IEEE Computer Society.
[PHW02] Jian Pei, Jiawei Han, and Wei Wang. Mining sequential patterns with constraints in large databases. In Proc. of the 2002 ACM Int. Conf. on Information and Knowledge Management (CIKM 02), pages 18–25, McLean, VA, USA, 2002.
[PHW07] Jian Pei, Jiawei Han, and Wei Wang. Constraint-based sequential pattern mining: the pattern-growth methods. J. Intell. Inf. Syst., 28(2):133–160, 2007.
[PRV05] Luciene Pizzi, Marcela Ribeiro, and Marina Vieira. Analysis of hepatitis dataset using multirelational association rules. In ECML/PKDD 2005 Discovery Challenge, Porto, Portugal, 2005.
[PT98] Balaji Padmanabhan and Alexander Tuzhilin. A belief-driven method for discovering unexpected patterns. In Proc. of the 4th Int. Conf. on Knowledge discovery in data mining (KDD 98), pages 94–100. AAAI Press, 1998.
[RGN08] Luc De Raedt, Tias Guns, and Siegfried Nijssen. Constraint programming for itemset mining. In Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 08), pages 204–212, New York, NY, USA, 2008. ACM.
[RJLM10] Luc De Raedt, Manfred Jaeger, Sau Lee, and Heikki Mannila. A theory of inductive query answering. In Sašo Džeroski, Bart Goethals, and Panče Panov, editors, Inductive Databases and Constraint-Based Data Mining, pages 79–103. Springer New York, 2010.
[RK01] Luc De Raedt and Stefan Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence - Volume 2 (IJCAI 01), pages 853–859, Seattle, WA, USA, 2001. Morgan Kaufmann Publishers Inc.
[RR04] Luc De Raedt and Jan Ramon. Condensed representations for inductive logic programming. In Proc. of the 9th Int. Conf. on Principles of Knowledge Representation and Reasoning, pages 438–446. AAAI Press, 2004.
[RS98] Rajeev Rastogi and Kyuseok Shim. Mining optimized association rules with categorical and numeric attributes. In ICDE, pages 503–512, 1998.
[RV00] Céline Rouveirol and Véronique Ventos. Towards learning in CARIN-ALN. In Proc. of the 10th Int. Conf. on Inductive Logic Programming (ILP 00), pages 191–208, London, UK, 2000. Springer-Verlag.
[RV04] Marcela Xavier Ribeiro and Marina Teresa Pires Vieira. A new approach for mining association rules in data warehouses. In FQAS, pages 98–110, 2004.
[SA95] Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized association rules. In Proc. of the 21st Int. Conf. on Very Large Data Bases (VLDB 95), pages 407–419, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.
[SA96] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology: Advances in Database Technology (EDBT 96), pages 3–17, London, UK, 1996. Springer-Verlag.
[SA10] Andreia Silva and Cláudia Antunes. Pattern mining on stars with FP-growth. In MDAI 2010: Proc. of the 7th International Conference on Modeling Decisions for Artificial Intelligence, pages 175–186, Perpignan, France, 2010. Springer.
[SA11] Andreia Silva and Cláudia Antunes. Mining stars with FP-growth: a case study on bibliographic data. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 19(Supplement-1):65–91, 2011.
[SA12a] Andreia Silva and Cláudia Antunes. Finding patterns in large star schemas at the right aggregation level. In MDAI 2012: Proc. of the 9th International Conference on Modeling Decisions for Artificial Intelligence, pages 329–340. Springer, 2012.
[SA12b] Andreia Silva and Cláudia Antunes. Mining patterns from large star schemas based on streaming algorithms. In Roger Lee, editor, Computer and Information Science 2012: Studies in Computational Intelligence, volume 429, pages 139–150. Springer, 2012.
[SA12c] Andreia Silva and Cláudia Antunes. Semi-supervised clustering: A case study. In Proc. of the 8th Int. Conf. on Machine Learning and Data Mining in Pattern Recognition (MLDM 12), pages 252–263, Berlin, Germany, 2012. Springer.
[SA13a] Andreia Silva and Cláudia Antunes. Pushing constraints into a pattern tree. In Proc. of the 10th Int. Conf. on Modeling Decisions for Artificial Intelligence (MDAI 13), Barcelona, Spain, November 2013. Springer.
[SA13b] Andreia Silva and Cláudia Antunes. Pushing constraints into data streams. In 2nd Int. Workshop on Big Data, Streams and Heterogeneous Source Mining (BigMine 13), pages 79–86. ACM, August 2013.
[SA13c] Andreia Silva and Cláudia Antunes. Towards the integration of constrained mining with star schemas. In Proc. of the 13th IEEE Int. Conf. on Data Mining Workshops (ICDMW 13), pages 413–420. IEEE Computer Society, December 2013.
[SA14a] Andreia Silva and Cláudia Antunes. Finding multi-dimensional patterns in healthcare. In MLDM 14: Proc. of the 10th Int. Conf. on Machine Learning and Data Mining, St. Petersburg, Russia, 2014. Springer.
[SA14b] Andreia Silva and Cláudia Antunes. Mining multi-dimensional patterns for student modeling. In EDM 14: Proc. of the 7th Int. Conf. on Educational Data Mining, London, UK, 2014.
[SA14c] Andreia Silva and Cláudia Antunes. Multi-dimensional pattern mining: A case study in healthcare. In ICEIS 14: Proc. of the 16th Int. Conf. on Enterprise Inf. Systems, Lisbon, Portugal, 2014. Morgan Kaufmann.
[SA14d] Andreia Silva and Cláudia Antunes. Multi-relational pattern mining over data streams. Under review for publication in the International Journal of Data Mining and Knowledge Discovery, 2014.
[SC05] Arnaud Soulet and Bruno Crémilleux. An efficient framework for mining flexible constraints. In TuBao Ho, David Cheung, and Huan Liu, editors, Advances in Knowledge Discovery and Data Mining, volume 3518 of Lecture Notes in Computer Science, pages 661–671. Springer Berlin Heidelberg, 2005.
[Set10] Burr Settles. Active learning literature survey. Computer sciences technical report, University of Wisconsin-Madison, 2009 (updated in 2010).
[SHB06] Gerd Stumme, Andreas Hotho, and Bettina Berendt. Semantic web mining: State of the art and future directions. Web Semantics: Science, Services and Agents on the World Wide Web, 4(2):124–143, 2006.
[Sri96] Ramakrishnan Srikant. Fast algorithms for mining association rules and sequential patterns. PhD thesis, The University of Wisconsin, Madison, 1996. Supervisor: Jeffrey F. Naughton.
[SV11] Akdes Serin and Martin Vingron. DeBi: Discovering differentially expressed biclusters using a frequent itemset approach. Algorithms for Molecular Biology, 6:18, 2011.
[SVA97] Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal. Mining association rules with item constraints. In Proc. of the 3rd ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 97), pages 67–73, California, USA, 1997. AAAI Press.
[TLT08] Igor Trajkovski, Nada Lavrač, and Jakub Tolar. SEGS: Search for enriched gene sets in microarray data. J. of Biomedical Informatics, 41(4):588–601, 2008.
[WJL03] Ke Wang, Yuelong Jiang, and Laks V. S. Lakshmanan. Mining unexpected rules by pushing user dynamics. In Proc. of the 9th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 03), pages 246–255, Washington, D.C., 2003. ACM.
[WJY+05] Ke Wang, Yuelong Jiang, Jeffrey Xu Yu, Guozhu Dong, and Jiawei Han. Divide-and-approximate: A novel constraint push strategy for iceberg cube mining. IEEE Trans. on Knowl. and Data Eng., 17(3):354–368, 2005.
[WSYT03] Takeshi Watanabe, Einoshin Suzuki, Hideto Yokoi, and Katsuhiko Takabayashi. Application of PrototypeLines to chronic hepatitis data. In ECML/PKDD 2003 Discovery Challenge, Cavtat, Croatia, 2003.
[XSMH06] Dong Xin, Xuehua Shen, Qiaozhu Mei, and Jiawei Han. Discovering interesting patterns through user's interactive feedback. In Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 06), pages 773–778, Philadelphia, PA, USA, 2006. ACM.
[XX06] Li-Jun Xu and Kang-Lin Xie. A novel algorithm for frequent itemset mining in data warehouses. Journal of Zhejiang University - Science A, 7(2):216–224, 2006.
[YL05] Unil Yun and John J. Leggett. WFIM: Weighted frequent itemset mining with a weight range and a minimum weight. In SDM, 2005.
[YW06] Qiang Yang and Xindong Wu. 10 challenging problems in data mining research. Int. Journal of Inf. Technology and Decision Making, 5(4):597–604, 2006.
[Zak00a] Mohammed Zaki. Generating non-redundant association rules. In Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining (KDD 00), pages 34–43, New York, NY, USA, 2000. ACM.
[Zak00b] Mohammed Zaki. Sequence mining in categorical domains: incorporating constraints. In Proc. of the 9th Int. Conf. on Information and knowledge management (CIKM 00), pages 422–429, McLean, Virginia, United States, 2000. ACM.
[ZCD07] Xiuzhen Zhang, Pauline Lienhua Chou, and Guozhu Dong. Efficient computation of iceberg cubes by bounding aggregate functions. IEEE Trans. Knowl. Data Eng., 19(7):903–918, 2007.
[ZO98] M. J. Zaki and M. Ogihara. Theoretical foundations of association rules. In Workshop on research issues in Data Mining and Knowledge Discovery (DMKD 98), pages 1–8. ACM Press, 1998.
[ZY07] Ling Zhou and Stephen Yau. Efficient association rule mining among both frequent and infrequent items. Computers and Mathematics with Applications, 54(6):737–749, 2007.
[ZYHY07] Feida Zhu, Xifeng Yan, Jiawei Han, and Philip S. Yu. gPrune: a constraint pushing framework for graph pattern mining. In Proc. of the 11th Pacific-Asia Conf. on Advances in knowledge discovery and data mining (PAKDD 07), pages 388–400, Nanjing, China, 2007. Springer-Verlag.