
INTEGRATING DATA CUBE COMPUTATION AND
EMERGING PATTERN MINING FOR
MULTIDIMENSIONAL DATA ANALYSIS

by

Wei Lu

a Report submitted in partial fulfillment
of the requirements for the SFU-ZU dual degree of
Bachelor of Science
in the School of Computing Science
Simon Fraser University
and
the College of Computer Science and Technology
Zhejiang University

© Wei Lu 2010
SIMON FRASER UNIVERSITY AND ZHEJIANG UNIVERSITY
April 2010

All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.

APPROVAL

Name: Wei Lu

Degree: Bachelor of Science

Title of Report: Integrating Data Cube Computation and Emerging Pattern Mining for Multidimensional Data Analysis

Examining Committee:

Dr. Jian Pei
Associate Professor, Computing Science
Simon Fraser University
Supervisor

Dr. Qianping Gu
Professor, Computing Science
Simon Fraser University
Supervisor

Dr. Ramesh Krishnamurti
Professor, Computing Science
Simon Fraser University
SFU Examiner

Date Approved:

Abstract

Online analytical processing (OLAP) in multidimensional text databases has recently become an effective tool for analyzing text-rich data such as web documents. In this capstone project, we follow the trend of using OLAP and the data cube to analyze web documents, but address a new problem from the data mining perspective. In particular, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP-style text data and web document analysis. To this end, we propose to integrate the data cube with an important kind of contrast pattern, the emerging pattern, to build a new data model for solving the document analysis problem.

Specifically, this novel data model is implemented on top of the traditional data cube by seamlessly integrating the bottom-up cubing (BUC) algorithm with two different emerging pattern mining algorithms, the Border-Differential and the DPMiner. The processes of cube construction and emerging pattern mining are merged together and carried out simultaneously; patterns are stored into the cube as cell measures. Moreover, we study and compare the performance of the two integrations by conducting experiments on datasets derived from the Frequent Itemset Mining Implementations Repository (FIMI). Finally, we suggest improvements and optimizations that can be done in future work.

To my family

“For those who believe, no proof is necessary; for those who don’t believe, no proof is possible.”

— Stuart Chase, Writer and Economist, 1888

Acknowledgments

First of all, I would like to express my deepest appreciation to Dr. Jian Pei for his support and guidance during my studies at Simon Fraser University. In the various courses I took with him, and particularly in this capstone project, Dr. Pei showed me his broad knowledge and deep insight in the area of data management and mining, as well as his great personality and patience with a research beginner like me. He generously provided me with a great deal of help and advice on the project and on other matters (especially my graduate school applications). This work would not have been possible without his supervision.

I would love to thank Dr. Qianping Gu and Dr. Ramesh Krishnamurti for reviewing my report and directing the capstone projects for this amazing dual degree program. My gratitude also goes to Dr. Ze-Nian Li, Dr. Stella Atkins, Dr. Greg Mori and Dr. Ted Kirkpatrick for the wonderful classes I took with them at SFU and their good advice for my studies and career development. Thanks also to Dr. Guozhu Dong at Wright State University and Dr. Guimei Liu at the National University of Singapore for making useful resources available for my work. I would also like to thank Mr. Thusjanthan Kubendranathan at SFU for his time and help in our discussions about this project.

Deepest gratitude to my family and friends, who make my life enjoyable. In particular, I am greatly indebted to my beloved parents for their unconditional support and encouragement. Their love accompanies me wherever I go. This work is dedicated to them, and I hope they are proud of me, as I am always proud of them.

Contents

Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Overview of Text Data Mining
  1.2 Related Work on Multidimensional Text Data Analysis
  1.3 Contrast Pattern Based Document Analysis
  1.4 Structure of the Report

2 Literature Review
  2.1 Data Cubes and Online Analytical Processing
    2.1.1 An Example of the Data Cube
    2.1.2 Data Cubing Algorithms
    2.1.3 BUC: Bottom-Up Computation for Data Cubing
  2.2 Frequent Pattern Mining
  2.3 Emerging Pattern Mining
    2.3.1 The Border-Differential Algorithm
    2.3.2 The DPMiner Algorithm
  2.4 Summary

3 Motivation
  3.1 Motivation for Mining Contrast Patterns
  3.2 Motivation for Utilizing the Data Cube
  3.3 Summary

4 Our Methodology
  4.1 Problem Formulation
    4.1.1 Normalizing Data Schema for Text Databases
    4.1.2 Problem Modeling with Normalized Data Schema
  4.2 Processing Framework
    4.2.1 Integrating Data
    4.2.2 Integrating Algorithms
  4.3 Implementations
    4.3.1 BUC with DPMiner
    4.3.2 BUC with Border-Differential and PADS
  4.4 Summary

5 Experimental Results and Performance Study
  5.1 The Test Dataset
  5.2 Comparative Performance Study and Analysis
    5.2.1 Evaluating the BUC Implementation
    5.2.2 Comparing Border-Differential with DPMiner
  5.3 Summary

6 Conclusions
  6.1 Summary of the Project
  6.2 Limitations and Future Work

Bibliography

Index

List of Tables

2.1 A base table storing sales data [15].
2.2 Aggregates computed by group-by Branch.
2.3 The full data cube based on Table 2.1.
2.4 A sample transaction database [21].
3.1 A multidimensional text database concerning Olympic news.
4.1 A normalized dataset derived from the Olympic news database.
4.2 A normalized dataset reproduced from Table 4.1.
5.1 Sizes of synthetic datasets for experiments.
5.2 The complete experimental results.

List of Figures

2.1 BUC algorithm [5, 27].
2.2 An example of an FP-tree based on Table 2.4 [21].
5.1 Running time and cube size of our BUC implementation.
5.2 Comparing running time of the two integration algorithms.

Chapter 1

Introduction

1.1 Overview of Text Data Mining

Analysis of documents in text databases and on the World Wide Web has been attracting researchers from various areas, such as data mining, machine learning, information retrieval, database systems, and natural language processing.

In general, studies in different areas have different emphases. Traditional information retrieval techniques (e.g., the inverted index and the vector-space model) have proven efficient and effective in searching relevant documents to answer unstructured keyword-based queries. Machine learning approaches are also widely used in text mining, providing effective solutions to various problems. For example, the Naive Bayes model and Support Vector Machines (SVMs) are used in document classification; k-means and the Expectation-Maximization (EM) algorithm are used in document clustering. The textbook by Manning et al. [19] covers the topics summarized above and much more in both traditional information retrieval and machine learning based document analysis.

On the other hand, data warehousing and data mining also play important roles in analyzing documents, especially those stored in a special kind of database called the multidimensional text database (one with both relational dimensions and text fields). While information retrieval mainly addresses searching for documents, and for information within documents, according to users' information needs, the goal of text mining differs in the following sense: it focuses on finding and extracting useful patterns and hidden knowledge from the information in documents and/or text databases, so as to improve decision making based on the text information.

Currently, many real-life business, administration and scientific databases are multidimensional text databases, containing both structured attributes and unstructured text attributes. An example of such a database can be found in Table 3.1. Since data warehousing and online analytical processing (OLAP) have proven their great usefulness in managing and mining multidimensional data of varied granularities [11], they have recently become important tools in analyzing such text databases [6, 17, 24, 26].

1.2 Related Work on Multidimensional Text Data Analysis

A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making” [13]. Online analytical processing (OLAP), which is dominated by “stylized queries that involve group-by and aggregate operators” [27], is a powerful tool in data warehousing.

Being a multidimensional data model with various features, the data cube [10] has become an essential OLAP facility in data warehousing. Conceptually, the data cube is an extended database with aggregates at multiple levels and in multiple dimensions [15]. It generalizes the group-by operator by precomputing and storing group-bys with regard to all possible combinations of dimensions. Data cubes are widely used in data warehousing for analyzing multidimensional data.

Applying OLAP techniques, especially data cubes, to analyzing documents in multidimensional text databases has made significant advances. Important information retrieval measures, i.e., term frequencies and inverted indexes, have been integrated into the traditional data cube, leading to the text cube [17]. It explores both the dimension hierarchy and the term hierarchy in the text data, and is able to answer OLAP queries by navigating to a specific cell via roll-up and drill-down operations. More recently, the work in [6] proposes a query answering technique called TopCells to address top-k query answering in the text cube. Given a keyword query, TopCells is able to find the top-k ranked cells containing aggregated documents that are most relevant to the query.

Another OLAP-based model dealing with multidimensional text data is the topic cube [26]. The topic cube combines OLAP with probabilistic topic modeling. It explores the topic hierarchy of documents and stores probability-based measures learned through a probabilistic model. Moreover, text cubes and topic cubes have been applied to information network analysis: they are combined into an information-network-enhanced text cube called iNextCube [24].

Most previous works emphasize data warehousing more than data mining. They mainly deal with problems such as how to explore and establish dimensional hierarchies within the text data, and how to efficiently answer OLAP queries using cubes built on text data.

1.3 Contrast Pattern Based Document Analysis

We follow the trend of using data cubes to analyze documents in multidimensional text databases. But as the previous works are more data warehousing oriented, we intend to address a more data mining oriented problem called contrast pattern based document analysis.

More specifically, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP-style document analysis (like the work in [6, 17]). This application is promising and has real-life demand. For example, from a large collection of documents containing information and reviews of laptop computers of various brands, a user interested in comparing Dell and Sony laptops might wish to find text information describing Dell's special features that do not characterize Sony. Such features contrast the two brands effectively, and would probably make the user's decision easier.

To achieve this goal, we propose to integrate frequent pattern mining, especially emerging pattern mining, with data cubing in an efficient and effective way. Frequent pattern mining [2] aims to find itemsets, that is, sets of items that frequently occur together in a dataset. Intuitively, patterns that can contrast different classes of data must be frequent in one class but comparatively infrequent in the others.

There is one important class of contrast patterns, called emerging patterns [7], defined as itemsets whose supports increase significantly from dataset D1 to dataset D2. That is, those patterns are frequent in D2 but infrequent in D1. Because of the sharp change of their supports between datasets, such patterns meet our need of showing contrasts between different classes of web documents.

Our Contributions

To tackle the contrast pattern based document analysis problem, we propose a novel data model by integrating efficient emerging pattern mining algorithms (e.g., the Border-Differential [7] and the state-of-the-art DPMiner [16]) with the traditional data cube. This integrated model is novel, but also preserves the features of traditional data cubes:

1. It is based on the data cube, and is constructed through a classical data cubing algorithm called BUC (Bottom-Up Computation for data cubing) [5].

2. It contains multidimensional text data and multiple-granularity aggregates of such data, in order to support fast OLAP operations (such as roll-up and drill-down) and query answering.

3. Each cell in the cube contains the set of documents in the multidimensional text database whose dimension attributes match the cell.

4. The measure of each cell is the set of emerging patterns whose supports rise rapidly from the documents not aggregated in the cell to the documents aggregated in the cell.

In this capstone project, we implement this integrated data model by incorporating emerging pattern mining seamlessly into the data cubing process. We choose BUC as our cubing algorithm to build the cube on the structured dimensions. While aggregating documents and materializing cells, we simultaneously mine emerging patterns in the documents aggregated in each particular cell, and store such patterns as the measure of that cell. Two widely used emerging pattern mining algorithms, the Border-Differential and the DPMiner, are integrated with BUC cubing so as to compare their performance.

We tested the two integrations on synthetic datasets to evaluate their performance on different sizes of input data. The datasets are derived from the Frequent Itemset Mining Implementations Repository (FIMI) [9]. Experimental results show that the state-of-the-art emerging pattern mining algorithm, the DPMiner, is a better choice than the Border-Differential.

Our cube-based model shares similarity with the text cube [17] and the topic cube [26] at the level of data structure, since all three cubes are built on multidimensional text data. This structural similarity allows the OLAP query answering techniques developed in [6, 17, 24, 26] to be directly applied to our cube. In that sense, point queries (seeking a cell), sub-cube queries (seeking an entire group-by) and top-k queries (seeking the k most relevant cells) can all be answered in contrast pattern based document analysis using our model.

Major Differences from Existing Works

This cube-based data model with emerging patterns as cell measures differs from all previous related work. It is unlike traditional data cubes using simple aggregate functions as cell measures, which are only adequate for relational databases. Our approach also differs from the text cube, which uses term frequencies and inverted indexes as cell measures, and the topic cube, which uses probabilistic measures.

Most importantly, to the best of our knowledge, our data model is novel in comparison to previous applications of emerging patterns in OLAP. Specifically, a previous work [20] used the Border-Differential algorithm to perform cube comparisons and capture trend changes between two precomputed data cubes. However, that work is of limited use and cannot be applied to multidimensional text data analysis.

First, their approach works on datasets different in kind from ours: it only handles traditional data cubes built upon relational databases with categorical dimension attributes, while ours is designed for multidimensional text databases. Second, their approach finds cells whose supports grow significantly from one cube to another, while ours determines emerging patterns for every single cell in the cube. Last but not least, their approach performs the Border-Differential algorithm after the two data cubes have been completely built, whereas our approach introduces a seamless integration: data cubing and emerging pattern mining are carried out simultaneously.

1.4 Structure of the Report

The rest of this capstone project report is organized as follows: Chapter 2 conducts a literature review of previous work and background knowledge that lays the foundation for this project. Chapter 3 motivates the contrast pattern based document analysis problem. Chapter 4 describes our methodology for tackling the problem; it formulates the problem and proposes algorithms for constructing the integrated data model. Chapter 5 reports experimental results and studies the performance of our algorithms. Lastly, Chapter 6 concludes this capstone project and suggests improvements and optimizations that can be done in future work.

Chapter 2

Literature Review

This chapter reviews three categories of previous research related to this capstone project: data cubes and OLAP, frequent pattern mining, and emerging pattern mining.

In Section 2.1 we discuss the fundamentals of data warehousing, online analytical processing (OLAP), and data cubing. We highlight BUC [5], a bottom-up approach to data cubing. Section 2.2 introduces frequent pattern mining and an important mining algorithm called FP-Growth [12]. Section 2.3 reviews the emerging pattern mining algorithms (Border-Differential [7] and DPMiner [16]) that are particularly useful to our work.

2.1 Data Cubes and Online Analytical Processing

A data warehouse is “a subject oriented, integrated, time-varying, non-volatile collection of data in support of management's decision-making process” [13]. A powerful tool for exploiting data warehouses is the so-called online analytical processing (OLAP). Typically, OLAP systems are dominated by “stylized queries involving many group-by and aggregation operations” [27].

The data cube was introduced in [10] to facilitate answering OLAP queries on multidimensional data stored in data warehouses. A data cube can be viewed as “an extended multi-level and multidimensional database with various multiple granularity aggregates” [15]. The term data cubing refers to the process of constructing a data cube based on a relational database table, often referred to as the base table. In a cubing process, cells with non-empty aggregates are materialized. Given a base table, we precompute group-bys and the corresponding aggregate values with respect to all possible combinations of dimensions in the table. Each group-by corresponds to a set of cells, and the aggregate value for a group-by is stored as the measure of its cell. Cell measures provide a good and concise summary of the information aggregated in the cube.

In light of the above, the data cube is a powerful data model allowing fast retrieval and analysis of multidimensional data for decision making based on data warehouses. It generalizes the group-by operator in SQL (Structured Query Language), and enables data analysts to avoid long and complicated SQL queries when searching for unusual data patterns in multidimensional databases [10].

2.1.1 An Example of the Data Cube

Example (Data Cube): Table 2.1 is a sample base table in a marketing management data warehouse [15]. It shows data organized under the schema (Branch, Product, Season, Sales).

Branch  Product  Season  Sales
B1      P1       spring  6
B1      P2       spring  12
B2      P1       fall    9

Table 2.1: A base table storing sales data [15].

To build a data cube upon this table, group-bys are computed on the three dimensions Branch, Product and Season, and aggregate values of Sales become the cell measures. In this example, we choose Average(Sales) as the aggregate function. Since most intermediate steps of a data cubing process are essentially computing group-bys and aggregate values to form cells, we illustrate the two cells computed by “group-by Branch” in Table 2.2.

Cell No.  Branch  Product  Season  AVG(Sales)
1         B1      ∗        ∗       9
2         B2      ∗        ∗       9

Table 2.2: Aggregates computed by group-by Branch.

In the same manner, the full data cube contains all possible group-bys on Branch, Product and Season; it is shown in Table 2.3. Note that cells 1, 2 and 3 are derived from the least aggregated group-by, group-by (Branch, Product, Season); such cells are called base cells. On the other hand, cell 18 (∗, ∗, ∗) is the apex cuboid, aggregating all tuples in the base table.

Cell No.  Branch  Product  Season  AVG(Sales)
1         B1      P1       spring  6
2         B1      P2       spring  12
3         B2      P1       fall    9
4         B1      P1       ∗       6
5         B1      P2       ∗       12
6         B1      ∗        spring  9
7         B2      P1       ∗       9
8         B2      ∗        fall    9
9         ∗       P1       spring  6
10        ∗       P1       fall    9
11        ∗       P2       spring  12
12        ∗       ∗        spring  9
13        ∗       ∗        fall    9
14        B1      ∗        ∗       9
15        B2      ∗        ∗       9
16        ∗       P1       ∗       7.5
17        ∗       P2       ∗       12
18        ∗       ∗        ∗       9

Table 2.3: The full data cube based on Table 2.1.
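For intuition, the full cube of Table 2.3 can be reproduced with a small brute-force sketch (our own illustration, not a cubing algorithm from the literature; names are ours):

from itertools import combinations
from collections import defaultdict

# Base table from Table 2.1: (Branch, Product, Season) -> Sales.
rows = [
    ("B1", "P1", "spring", 6),
    ("B1", "P2", "spring", 12),
    ("B2", "P1", "fall", 9),
]
dims = ("Branch", "Product", "Season")

cube = defaultdict(list)
for *attrs, sales in rows:
    # Each subset of dimensions yields one group-by; '*' marks aggregated dims.
    for keep in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), keep):
            cell = tuple(a if i in subset else "*" for i, a in enumerate(attrs))
            cube[cell].append(sales)

for cell, values in sorted(cube.items()):
    print(cell, "AVG(Sales) =", sum(values) / len(values))   # 18 cells, as in Table 2.3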

2.1.2 Data Cubing Algorithms

Efficient and scalable data cubing is challenging. When a base table has a large number of dimensions and each dimension has high cardinality, time and space complexity grow exponentially.

In general, there are three approaches to cubing in terms of the order in which cells are materialized: top-down, bottom-up, and a mix of both. A top-down approach (e.g., Multiway Array Aggregation [28]) constructs the cube from the least aggregated base cells towards the most aggregated apex cuboid. On the contrary, a bottom-up approach such as BUC [5] computes cells in the opposite order. Other methods, such as Star-Cubing [23], combine the top-down and bottom-up mechanisms to carry out the cubing process.

On fast computation of multidimensional aggregates, [11] summarizes the following optimization principles: (1) sorting or hashing dimension attributes to cluster related tuples that are likely to be aggregated together in certain group-bys; (2) computing higher-level aggregates from previously computed lower-level aggregates, and caching intermediate results in memory to reduce expensive I/O operations; (3) computing a group-by from the smallest previously computed group-by; (4) mapping dimension attributes in various formats to integers ranging between zero and the cardinality of the dimension. Many other heuristics have also been proposed to improve the efficiency of data cubing [1, 5, 11].

2.1.3 BUC: Bottom-Up Computation for Data Cubing

BUC [5] constructs the data cube bottom-up, from the most aggregated apex cuboid to group-bys on a single dimension, then on pairs of dimensions, and so on. It also uses many of the optimization techniques introduced in the previous section. Figure 2.1 illustrates the processing tree and the partition method used by BUC on a 4-dimensional base table. Subfigure (b) shows the recursive nature of BUC: after sorting and partitioning the data on dimension A, we deal with the partition (a1, ∗, ∗, ∗) first and recursively partition it on dimension B to proceed to its parent cell (a1, b1, ∗, ∗), then the ancestor (a1, b1, c1, ∗), and so on. After dealing with partition a1, BUC continues to process partitions a2, a3 and a4 in the same manner until all cells are materialized.

Figure 2.1: BUC algorithm [5, 27].

The depth-first search process for building our integrated data model (covered in Chapter 4) follows the basic framework of BUC.
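A minimal, runnable sketch of BUC's recursive partitioning may make the processing order concrete. This is our own simplification (names are ours, and the measure is a plain tuple count rather than the pattern measures used later in this report):

def buc(tuples, dim, cell, out):
    # Emit the current cell; here the measure is simply COUNT(*).
    out.append((tuple(cell), len(tuples)))
    for d in range(dim, len(cell)):
        # Partition on dimension d (BUC uses a linear counting sort here).
        parts = {}
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for value, part in sorted(parts.items()):
            cell[d] = value
            buc(part, d + 1, cell, out)   # depth-first, as in Figure 2.1(b)
            cell[d] = "*"

rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
cells = []
buc(rows, 0, ["*", "*"], cells)
print(cells)  # apex (*,*) first, then (a1,*), (a1,b1), (a1,b2), (a2,*), ...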

2.2 Frequent Pattern Mining

Frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [2]. The supports of frequent patterns must exceed a pre-defined minimal support threshold.

Frequent pattern mining has been studied extensively over the past two decades. It lays the foundation for many data mining tasks such as association rule mining [3] and emerging pattern mining. Although its definition is concise, the mining algorithms are not trivial. Two notable algorithms are Apriori [3] and FP-Growth [12]. FP-Growth is the more important to our work, as efficient emerging pattern mining algorithms such as [4, 16] use the FP-tree proposed in FP-Growth as their data structure.

FP-Growth addresses the limitations of the breadth-first-search-based Apriori, such as multiple database scans and large amounts of candidate generation and support counting. It is a depth-first search algorithm. The first scan of the database finds all frequent items, ranks them in frequency-descending order, and puts them into a header table. Then it compresses the database into a prefix tree called the FP-tree. The complete set of frequent patterns can be mined by recursively constructing projected databases and the FP-trees based on them. For example, given the transaction database in Table 2.4 [21], we can build an FP-tree accordingly (shown in Figure 2.2).

TID  Items                     (Ordered) Frequent Items
100  f, a, c, d, g, i, m, p    f, c, a, m, p
200  a, b, c, f, l, m, o       f, c, a, b, m
300  b, f, h, j, o             f, b
400  b, c, k, s, p             c, b, p
500  a, f, c, e, l, p, m, n    f, c, a, m, p

Table 2.4: A sample transaction database [21].

Figure 2.2: An example of an FP-tree based on Table 2.4 [21].

Next, we define three special types of frequent patterns, as they are closely related to emerging pattern mining: maximal frequent patterns (max-patterns for short), closed frequent patterns, and frequent generators.

Definition (Max-Pattern): An itemset X is a maximal frequent pattern, or max-pattern, in dataset D if X is frequent in D, and every proper super-itemset Y with X ⊂ Y is infrequent in D [11].

Definition (Closed Pattern and Generator): An itemset X is closed in dataset D if there exists no proper super-itemset Y such that X ⊂ Y and support(X) = support(Y) in D. X is a closed frequent pattern in D if it is both closed and frequent in D [11]. An itemset Z is a generator in D if there exists no proper sub-itemset Z′ such that Z′ ⊂ Z and support(Z′) = support(Z) [18].

The state-of-the-art max-pattern mining algorithm is the Pattern-Aware Dynamic Search (PADS) [25]. The DPMiner, the state-of-the-art emerging pattern mining algorithm, is also the most powerful algorithm for mining closed frequent patterns and frequent generators.
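To make the three definitions concrete, the following brute-force check (our own illustration; exponential and suitable only for toy data) tests every frequent itemset of a small database against each definition:

from itertools import combinations

# Toy database (ordered frequent items of Table 2.4), min_sup = 3.
db = [{"f", "c", "a", "m", "p"}, {"f", "c", "a", "b", "m"},
      {"f", "b"}, {"c", "b", "p"}, {"f", "c", "a", "m", "p"}]
min_sup = 3

def support(s):
    return sum(1 for t in db if s <= t)

items = sorted(set().union(*db))
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_sup]

for x in frequent:
    is_max = not any(x < y for y in frequent)                       # no frequent proper superset
    is_closed = all(support(y) < support(x) for y in frequent if x < y)
    is_gen = all(support(z) > support(x) for z in frequent if z < x)
    print(sorted(x), support(x), is_max, is_closed, is_gen)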

2.3 Emerging Pattern Mining

Emerging patterns [7] are patterns whose supports increase significantly from one class of data to another. Mathematical details can be found in Section 4.1 (Problem Formulation) of this report and in [4, 7, 8, 16]. The original work on emerging patterns [7] gives an algorithm called the Border-Differential for mining such patterns; it uses borders to succinctly represent patterns and mines them by manipulating the borders only. The work in [4] used the FP-tree introduced in [12] for emerging pattern mining. Following that, the work in [16] improves the FP-tree-based algorithm by simultaneously generating closed frequent patterns and frequent generators to form emerging patterns. This algorithm is called the DPMiner and is considered the state-of-the-art for emerging pattern mining.

2.3.1 The Border-Differential Algorithm

The Border-Differential uses borders to represent patterns. It involves mining max-patterns and manipulating the borders initiated by those patterns to derive a border representation of the emerging patterns.

A border is an ordered pair 〈L, R〉, where L and R are the left and right bounds of the border, respectively. Both L and R are collections of itemsets, but are much smaller in size than the set of patterns they represent. The patterns represented by 〈L, R〉 are the interval of 〈L, R〉, defined as [L, R] = {Y | ∃X ∈ L, ∃Z ∈ R such that X ⊆ Y ⊆ Z}. For example, the collection [L, R] = {{1}, {1, 2}, {1, 3}, {1, 2, 3}, {2, 3}, {2, 3, 4}} has the border with L = {{1}, {2, 3}} and R = {{1, 2, 3}, {2, 3, 4}}; itemsets other than those in L and R (e.g., {1, 3}) lie in the interval of 〈L, R〉.

Given a pair of borders 〈{∅}, R1〉 and 〈{∅}, R2〉 whose left bounds are initially empty, the differential border 〈L1, R1〉 is derived to satisfy [L1, R1] = [{∅}, R1] − [{∅}, R2]. This operation is the so-called Border-Differential.

Furthermore, given two datasets D1 and D2, to determine emerging patterns using the Border-Differential operation, we first find the max-patterns U1 of D1 and U2 of D2 using PADS, and initiate the two borders 〈{∅}, U1〉 and 〈{∅}, U2〉. Then we take the differential between those two borders. Let U1 = {X1, X2, ..., Xn} and U2 = {Y1, Y2, ..., Ym}, where the Xi and Yj are itemsets. The left bound of the differential border is computed as

    L1 = ⋃_{i=1..n} ( PowerSet(Xi) − ⋃_{j=1..m} PowerSet(Yj) )

while the right bound U1 remains the same. Lastly, we form the border 〈L1, U1〉, and the interval [L1, U1] of 〈L1, U1〉 is the set of emerging patterns in D1.

As datasets grow, the Border-Differential becomes problematic because it involves set enumeration, which results in exponential computational cost. The work in [8], a more recent version of [7], proposed several optimization techniques to improve the efficiency of the Border-Differential. In fact, however, the complexity of finding emerging patterns is MAX SNP-hard, which means that polynomial time approximation schemes do not exist unless P = NP [22].
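For intuition, here is a naive, expansion-based rendering of the Border-Differential (our own sketch: it enumerates powersets explicitly, exactly the cost the real border-based algorithm is designed to avoid, and it keeps only the minimal itemsets of the difference as the left bound, the usual way borders are kept compact):

from itertools import chain, combinations

def powerset(itemset):
    # All subsets of an itemset (exponential; fine only for tiny borders).
    s = sorted(itemset)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, k) for k in range(len(s) + 1))}

def border_differential(U1, U2):
    # Naive [{}, U1] - [{}, U2]: enumerate both intervals, keep the minimal
    # itemsets of the difference as the left bound; the right bound stays U1.
    covered = set().union(*(powerset(y) for y in U2))
    diff = set()
    for x in U1:
        diff |= powerset(x) - covered
    left = [s for s in diff if not any(t < s for t in diff)]   # minimal elements
    return sorted(map(set, left), key=sorted), [set(x) for x in U1]

U1 = [{1, 2, 3}, {2, 3, 4}]   # max-patterns of D1
U2 = [{1, 2}, {3, 4}]         # max-patterns of D2
L, R = border_differential(U1, U2)
print("left bound:", L)        # [{1, 3}, {2, 3}, {2, 4}] under these assumptions
print("right bound:", R)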

2.3.2 The DPMiner Algorithm

The work in [4] used the FP-tree and pattern-growth methods to mine emerging patterns, but it still needs to call the Border-Differential to find them. The DPMiner (short for Discriminative Pattern Miner) [16] also uses the FP-tree but mines emerging patterns in a different way: it finds closed frequent patterns and frequent generators simultaneously to form equivalence classes of such patterns, and then determines emerging patterns as “non-redundant δ-discriminative equivalence classes” [16].

An equivalence class EC is “a set of itemsets that always occur together in some transactions of dataset D” [16]. It can be uniquely represented by its set of frequent generators G and its closed pattern C, in the form EC = [G, C].

Suppose D can be divided into several classes, denoted D = D1 ∪ D2 ∪ ... ∪ Dn. Let δ be a small integer (usually 1 or 2) and θ be a minimal support threshold. An equivalence class EC is a δ-discriminative equivalence class provided that the support of its closed pattern C is greater than θ in D1 but smaller than δ in D − D1 = D2 ∪ ... ∪ Dn. Furthermore, EC is a non-redundant δ-discriminative equivalence class if and only if (1) it is δ-discriminative, and (2) there exists no other δ-discriminative equivalence class EC′ whose closed pattern C′ satisfies C′ ⊂ C. The closed frequent patterns of the non-redundant δ-discriminative equivalence classes are emerging patterns in D1.

Data Structures and Computational Steps of the DPMiner

The high efficiency of the DPMiner is mainly attributed to its revised FP-tree structure. Unlike traditional FP-trees, it does not store items that appear in every transaction and hence have full support in D; such items are removed because they cannot form generators. This modification results in a much smaller FP-tree than the original.

The computational framework of the DPMiner consists of the following five steps:

(1) Given k classes of data D1, D2, ..., Dk as input, take their union D = D1 ∪ D2 ∪ ... ∪ Dk. Also specify a minimal support threshold θ and a maximal threshold δ (thus, patterns with supports above θ in Di but below δ in D − Di are candidate emerging patterns in Di).

(2) Construct an FP-tree based on D and run a depth-first search on the tree to find frequent generators and closed patterns simultaneously. Each search path terminates as soon as a δ-discriminative equivalence class is reached.

(3) Determine the class label distribution of every closed pattern, i.e., find in which class a closed pattern has the highest support. This step is necessary because patterns are not mined separately for each Di (1 ≤ i ≤ k), but rather on the entire D.

(4) Pair up generators and closed frequent patterns to form δ-discriminative equivalence classes.

(5) Output the non-redundant δ-discriminative equivalence classes as emerging patterns. If a pattern is labeled with class i (1 ≤ i ≤ k), then it is an emerging pattern in Di.
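As a small illustration of the thresholds in step (1), the following sketch (function and variable names are ours) tests whether a pattern's per-class supports make it δ-discriminative for a target class:

def is_delta_discriminative(supports, target, theta, delta):
    # Frequent in the target class, nearly absent in the union of the others.
    background = sum(s for c, s in supports.items() if c != target)
    return supports.get(target, 0) > theta and background < delta

supports = {"D1": 5, "D2": 0, "D3": 1}   # e.g., counted during the FP-tree scan
print(is_delta_discriminative(supports, "D1", theta=3, delta=2))   # True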

2.4 Summary

In this chapter, we discussed previous research on data cubing, frequent pattern mining and emerging pattern mining, all of which are essential to our project. The algorithms most closely related to our work (Bottom-Up Cubing, the Border-Differential and the DPMiner) have been described in detail.

Chapter 3

Motivation

In this chapter, we motivate the problem of contrast pattern based document analysis. We explain why contrast patterns (in particular, emerging patterns) are useful, and why data cubes should be used in analyzing documents in multidimensional text databases.

3.1 Motivation for Mining Contrast Patterns

This section answers the following two questions: (1) Why do we need to mine and use contrast patterns to analyze web documents? (2) How useful are those patterns; in other words, can they make a significant contribution to a good text mining application? We answer these questions by introducing motivating scenarios from real life.

Example (Contrast Patterns in Documents): Since the Calgary 1988 Olympic Winter Games, Canada had not hosted the Olympic Games for 22 years. Therefore, people may want to know the most attractive and discriminative features of the Vancouver 2010 Winter Olympics compared to all previous Olympic Games. Indeed, there are exciting and touching stories in almost all Olympics, and Vancouver certainly has its unique moments. For example, the Canadian figure skater Joannie Rochette won a bronze medal under the keenly felt pain of losing her mother a day before her event started.

Suppose a user searches the web and Google returns her a collection of documents on the Olympics, consisting of many online sports news articles and commentaries. There may be too much information for her to read through to find unique stories about Vancouver 2010. Although there is no doubt that Joannie Rochette's accomplishment occurs frequently in articles related to Vancouver 2010, a user previously unaware of Rochette may not be able to learn about her quickly from the search results.

Similar situations may also arise when users compare products online by searching and reading reviews by previous buyers. Here is the example we saw in Section 1.3: suppose a user is comparing Dell's laptop computers with Sony's. She probably wants to know the special features of Dell's laptops that Sony's do not have. For example, many reviewers would speak in favor of Dell by commenting on its “high performance-price ratio” but would not do so for Sony if that is not the case. Then “high performance-price ratio” is a pattern contrasting Dell laptops with Sony laptops.

Letting users manually determine such contrast patterns is not feasible. Therefore, given a collection of documents, ideally pre-classified and stored in a multidimensional text database, we need to develop efficient data models and corresponding algorithms to determine contrast patterns in documents of different classes.

As mentioned in Section 1.3, we choose the emerging pattern [7] since it is a representative class of contrast patterns widely used in data mining, and there are good algorithms [4, 7, 16] available for mining such patterns efficiently. Moreover, emerging patterns can contribute to other problems in text mining. A novel document classifier could be constructed based on those patterns, as they have been claimed useful in building accurate classifiers [8]. Also, since emerging patterns are able to capture the discriminative features of a class of data, they may be helpful in extracting keywords to summarize a given text.

3.2 Motivation for Utilizing the Data Cube

In many real-life database applications, documents and the text data within them are stored in multidimensional text databases [24]. These databases are distinct from the traditional data sources we deal with, including relational databases, transaction databases, and text corpora. Formally, a multidimensional text database is a relational database with text fields. A sample text database is shown in Table 3.1. The first three dimensions (Event, Time, and Publisher) are standard dimensions, just like those in relational databases. The last column is the text dimension, containing documents made up of text terms.

Event           Time       Publisher        ...  Text Data: Documents
Ice hockey      2010/2/20  Vancouver Sun    ...  d1 = {t1, t2, t3, t4}
Ice hockey      2010/2/23  Globe and Mail   ...  d2 = {t2, t3, t7, t8}
Ice hockey      2010/2/20  Vancouver Sun    ...  d3 = {t1, t2, t3, t6}
Figure skating  2010/2/20  Globe and Mail   ...  d4 = {t2, t4, t6, t7}
Figure skating  2010/2/20  Vancouver Sun    ...  d5 = {t1, t3, t5, t7}
Curling         2010/2/23  New York Times   ...  d6 = {t2, t5, t7, t9}
Curling         2010/2/28  Globe and Mail   ...  d7 = {t3, t6, t8, t9}
...             ...        ...              ...  ...

Table 3.1: A multidimensional text database concerning Olympic news.

Text databases provide structured attributes for documents, and users' information needs vary in ways that can be modeled hierarchically. This makes OLAP and data cubes applicable. For instance (using Table 3.1), if a user wants to read news on the ice hockey games reported by the Vancouver Sun on February 20, 2010, then the two documents d1 and d3 matching the query {Event = Ice hockey, Time = 2010/2/20, Publisher = Vancouver Sun} will be returned to her. If another user wants to skim all Olympic news reported by the Vancouver Sun on that day, we roll up to the query {Event = ∗, Time = 2010/2/20, Publisher = Vancouver Sun} and return documents d1, d3 and d5. The opposite operation of roll-up is called drill-down; in fact, roll-up and drill-down are two OLAP operations of great importance [11]. Therefore, to meet different levels of information need, it is natural to apply the data cube to model and extend such a text database. This is exactly what the previous work in [17, 24, 26] did.

3.3 Summary

In light of the above, this chapter has shown that contrast patterns are useful in analyzing large-scale text data and are able to give concise information about the data. Also, the nature of multidimensional text databases makes OLAP, and its most essential tool, the data cube, particularly suitable for modeling and analyzing the text data in documents.

Chapter 4

Our Methodology

In this chapter, we describe our methodology for tackling contrast pattern based document analysis by building a novel integrated data model through BUC data cubing [5] and two emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16].

Section 4.1 formulates the problem we address in this work. Section 4.2 describes the processing framework and our algorithms, at both the data integration level and the algorithm integration level. Section 4.3 discusses issues related to implementation.

4.1 Problem Formulation

4.1.1 Normalizing Data Schema for Text Databases

Suppose a collection of web documents is stored in a multidimensional text database. The text data in the documents are collected under a schema containing a set of standard non-text dimensions {SD1, SD2, ..., SDn} and a set of text dimensions (terms) {TD1, TD2, ..., TDm}, where m is the number of distinct text terms in the collection. For simplicity, text terms can be mapped to items, so documents can be mapped to transactions, i.e., itemsets (sets of items that appear together). This mapping is similar to the bag-of-words model, which represents text data as an unordered collection of words, disregarding word order and count. In that sense, a multidimensional text database can be mapped to a relational base table combined with a transaction database.

Under the above mapping mechanism, each tuple in a text database corresponds to a certain document, in the form 〈S, T〉, where S is the set of standard dimension attributes and T is a transaction. The dimension attributes can be learned through a classifier or labeled manually. Words in the document are tokenized, and each distinct token is treated as an item in the transaction. For example, the tuple corresponding to the first row in Table 3.1 is 〈Ice hockey, 2010/2/20, Vancouver Sun, ..., d1 = {t1, t2, t3, t4}〉, with d1 = {t1, t2, t3, t4} being the transaction.

Furthermore, we normalize text database tuples to derive a simplified data schema. We map standard dimensions to letters, e.g., Event to A, Time to B and Publisher to C, to make them uniform. Likewise, dimension attributes are mapped to items in the same manner: Ice hockey is mapped to a1, Figure skating to a2, and so on. Table 4.1 shows a normalized dataset derived from the Olympic news database (Table 3.1).

A   B   C   ...  Transactions
a1  b1  c1  ...  d1 = {t1, t2, t3, t4}
a1  b2  c2  ...  d2 = {t2, t3, t7, t8}
a1  b1  c1  ...  d3 = {t1, t2, t3, t6}
a2  b1  c2  ...  d4 = {t2, t4, t6, t7}
a2  b1  c1  ...  d5 = {t1, t3, t5, t7}
a3  b2  c3  ...  d6 = {t2, t5, t7, t9}
a3  b3  c2  ...  d7 = {t3, t6, t8, t9}
... ... ... ...  ...

Table 4.1: A normalized dataset derived from the Olympic news database.
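A minimal sketch of this normalization is shown below (our own illustration; the input rows, the letter scheme a1, b1, c1, ..., and the whitespace tokenizer are simplifying assumptions):

rows = [
    ("Ice hockey", "2010/2/20", "Vancouver Sun", "canada wins gold in overtime"),
    ("Figure skating", "2010/2/20", "Globe and Mail", "rochette wins bronze"),
]

value_maps = [dict(), dict(), dict()]   # one attribute-to-item map per dimension

def normalize(row):
    dims = []
    for d, value in enumerate(row[:-1]):
        m = value_maps[d]
        if value not in m:
            m[value] = f"{chr(ord('a') + d)}{len(m) + 1}"   # e.g., Event values -> a1, a2, ...
        dims.append(m[value])
    transaction = sorted(set(row[-1].split()))              # bag-of-words items
    return dims, transaction

for row in rows:
    print(normalize(row))
# (['a1', 'b1', 'c1'], ['canada', 'gold', 'in', 'overtime', 'wins'])
# (['a2', 'b1', 'c2'], ['bronze', 'rochette', 'wins'])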

4.1.2 Problem Modeling with Normalized Data Schema

Given a normalized dataset as a base table, we build our integrated cube-based data model by computing a full data cube grouped by all standard dimensions (e.g., {A, B, C} in the table above). In the data cubing process, every subset of {A, B, C} is gone through to form a group-by corresponding to a set of cells. The emerging patterns in each cell are mined simultaneously and stored as cell measures.

When materializing a cell, we aggregate the tuples whose dimension attributes match this particular cell. The transactions of the matched tuples form the target class (or positive class), denoted TC. We also virtually aggregate all unmatched tuples and extract their transactions to form the background class (or negative class), denoted BC. The membership of TC and BC varies from cell to cell; both classes are dynamically computed for each cell.

A transaction T is the full itemset in a tuple. A pattern X is a sub-itemset of T with a non-zero support (i.e., the number of times X appears) in the given dataset. Let θ be the minimal support threshold for TC and δ be the maximal support threshold for BC. Pattern X is an emerging pattern in TC if and only if support(X, TC) ≥ θ and support(X, BC) ≤ δ. In other words, the support of X grows significantly from BC to TC, exceeding a minimal growth rate threshold ρ = θ/δ. Mathematically, growth rate(X) = support(X, TC) / support(X, BC) ≥ ρ. Note that δ can be 0, in which case ρ = θ/δ = ∞; if growth rate(X) = ∞, X is a jumping emerging pattern [7], which does not appear in BC at all.

Given predefined support thresholds θ and δ, for each cell in this cube-based model we mine all patterns whose supports are above θ in the target class TC and below δ in the background class BC. Such patterns automatically exceed the minimal growth rate threshold ρ and become the measure of the cell. Upon obtaining all cells and their corresponding emerging patterns, the model building process is complete. The entire process is based on data cubing, and requires a seamless integration of cubing and emerging pattern mining.

Example: Consider a simple example on the base table in Table 4.1. Let θ = 2 and δ = 1. Suppose at a certain stage we are carrying out the group-by operation on dimension A. We get three cells: (a1, ∗, ∗), (a2, ∗, ∗) and (a3, ∗, ∗). For cell (a1, ∗, ∗), aggregating the first three tuples in Table 4.1, TC = {d1, d2, d3} and BC = {d4, d5, d6, d7}. Now consider the pattern X = {t1, t2, t3}. It appears twice in TC (in d1 and d3) but zero times in BC, so support(X, TC) ≥ θ and support(X, BC) < δ. In that sense, X = {t1, t2, t3} is a (jumping) emerging pattern in TC and hence a measure of cell (a1, ∗, ∗).
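The following runnable sketch (our own brute-force illustration, not the integrated algorithm of Section 4.2) computes the emerging patterns of one cell directly from this definition, using the Table 4.1 data with θ = 2 and δ = 1:

from itertools import combinations

tuples = [
    (("a1", "b1", "c1"), {"t1", "t2", "t3", "t4"}),
    (("a1", "b2", "c2"), {"t2", "t3", "t7", "t8"}),
    (("a1", "b1", "c1"), {"t1", "t2", "t3", "t6"}),
    (("a2", "b1", "c2"), {"t2", "t4", "t6", "t7"}),
    (("a2", "b1", "c1"), {"t1", "t3", "t5", "t7"}),
    (("a3", "b2", "c3"), {"t2", "t5", "t7", "t9"}),
    (("a3", "b3", "c2"), {"t3", "t6", "t8", "t9"}),
]

def emerging_patterns(cell, theta=2, delta=1):
    matches = lambda dims: all(c == "*" or c == d for c, d in zip(cell, dims))
    tc = [t for dims, t in tuples if matches(dims)]       # target class
    bc = [t for dims, t in tuples if not matches(dims)]   # background class
    sup = lambda x, cls: sum(1 for t in cls if x <= t)
    candidates = {frozenset(c) for t in tc for k in range(1, len(t) + 1)
                  for c in combinations(sorted(t), k)}
    return [set(x) for x in candidates
            if sup(x, tc) >= theta and sup(x, bc) <= delta]

print(emerging_patterns(("a1", "*", "*")))   # includes {t1, t2, t3}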

4.2 Processing Framework

To recapitulate, Chapter 1 introduced contrast pattern based document analysis in multidimensional text databases. We follow the idea of using data cubes and OLAP to analyze multidimensional text data, and propose to merge the BUC data cubing process with two different emerging pattern mining algorithms (the Border-Differential and the DPMiner) to build an integrated, cube-based data model. This model is designed to support contrast pattern based document analysis.

In this section, following the problem formulation in Section 4.1, we propose our algorithm for integrating emerging pattern mining into data cubing. The entire processing framework includes both data integration and algorithm integration.

4.2.1 Integrating Data

To begin with, we reproduce Table 4.1 (with slight revisions) as Table 4.2 to make the following discussion clear. It shows the standard, idealized data format that simplifies a multidimensional text database. The data used in our testing strictly follow this format: each row in a dataset D is a tuple of the form 〈S, T〉, where S is the set of dimension attributes and T is a transaction.

Tuple No.  A   B   C   F   Transactions
1          a1  b1  c1  f1  d1 = {t1, t2, t3, t4}
2          a1  b2  c2  f1  d2 = {t2, t3, t7, t8}
3          a1  b1  c2  f2  d3 = {t1, t2, t3, t6}
4          a2  b1  c2  f2  d4 = {t2, t4, t6, t7}
5          a2  b1  c1  f1  d5 = {t1, t3, t5, t7}
6          a3  b2  c3  f3  d6 = {t2, t5, t7, t9}
7          a3  b3  c2  f3  d7 = {t3, t6, t8, t9}
8          a4  b2  c3  f1  d8 = {t6, t8, t11, t12}

Table 4.2: A normalized dataset reproduced from Table 4.1.

The integration of data is indispensable because of the nature of the multidimensional

text mining problem. In addition, data cubing and emerging patten mining algorithms

work with data from heterogeneous sources originally. Data cubing mainly deals with rela-

tional base tables in data warehouses, while emerging pattern mining concerns transaction

databases (see an example in Table 2.4). Therefore, we should unify heterogeneous data

first and then develop algorithms for a seamless integration. Thus, we model the text

database and its normalized schema (Table 4.2) by appending transaction database tuples

to relational base table tuples.

Moreover, for the integrated data, we also apply one of the optimization techniques discussed in Section 2.1.2: mapping dimension attributes of various formats to integers between zero and the cardinality of the attribute minus one [11]. For example, in Table 4.2, dimension A has cardinality |A| = 4, so in implementation and testing we map a1 to 0, a2 to 1, a3 to 2, and a4 to 3. Similarly, items in transactions are mapped to integers ranging from one to the total number of items in the dataset. For instance, if all items in a dataset are labeled from t1 to t100, we represent them by the integers 1 to 100.

This kind of mapping facilitates sorting and hashing in data cubing. Particularly for BUC,

such mapping allows the use of the linear counting sort algorithm to reorder input tuples efficiently.
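As an illustration only (the class and function names below are ours, not those of our actual program), the following C++ sketch shows one way to build such an integer encoding and the linear-time counting sort that it enables.

#include <string>
#include <unordered_map>
#include <vector>

// Map each distinct attribute value (e.g., "a1") to a dense code in
// [0, cardinality), so BUC can partition tuples with a counting sort.
class Dictionary {
    std::unordered_map<std::string, int> codes_;
public:
    int encode(const std::string& value) {
        // Existing values keep their old code; new values get the next one.
        auto result = codes_.emplace(value, static_cast<int>(codes_.size()));
        return result.first->second;
    }
    int cardinality() const { return static_cast<int>(codes_.size()); }
};

// Counting sort of tuple indices by one dimension's codes: O(n + C).
std::vector<int> countingSortByDim(const std::vector<int>& dimCodes, int cardinality) {
    std::vector<int> start(cardinality + 1, 0);
    for (int code : dimCodes) ++start[code + 1];          // histogram
    for (int c = 1; c <= cardinality; ++c) start[c] += start[c - 1];
    std::vector<int> order(dimCodes.size());
    for (int i = 0; i < static_cast<int>(dimCodes.size()); ++i)
        order[start[dimCodes[i]]++] = i;                  // stable placement
    return order;   // tuple indices grouped by the dimension's value
}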

4.2.2 Integrating Algorithms

Our algorithm integrates data cubing and emerging pattern mining seamlessly. It carries out

a depth-first search (DFS) to build data cubes and mine emerging patterns as cell measures

simultaneously. The algorithm is designed to work on any valid integrated dataset like Table 4.2 (every tuple must have non-empty dimension attributes and a non-empty transaction). We

outline the algorithm in the following pseudo-code (adapted from [5]).

Algorithm

Procedure BottomUpCubeWithDPMiner(data, dim, theta, delta)

Inputs:

data: the dataset upon which we build our integrated model.

dim: number of standard dimensions in input data.

theta: the minimal support threshold of candidate emerging patterns

in the target class.

delta: the maximal support threshold of candidate emerging patterns

in the background class.

Outputs:

cells with their measures (patterns)

Method:

1: aggregate(data);

2: if (data.count == 1) then

3: writeAncestors(data, dim);

4: return;

5: endif

6: for each dimension d (from 0 to (dim - 1)) do

7: C := cardinality(d);

8: newData := partition(data, d); // counting sort.

9: for each partition i (from 0 to (C - 1)) do

10: cell := createEmptyCell();

11: posData := newData.gatherPositiveTransactions(i);

12: negData := newData.gatherNegativeTransactions(i);

13: isDuplicate := determineCoverage(posData, negData);

14: if (!isDuplicate) then


15: cell.measure := DPMiner(posData, negData, theta, delta);

16: writeOutput(cell);

17: subData := newData.getPartition(i);

18: BottomUpCubeWithDPMiner(subData, d+1, theta, delta);

19: endif

20: endfor

21: endfor

For integrating BUC with another emerging pattern algorithm Border-Differential, re-

place line 15 with the following pseudo-code:

15.1: posMaxPat := PADS(posData, theta);

15.2: negMaxPat := PADS(negData, theta);

15.3: cell.measure := BorderDifferential(posMaxPat, negMaxPat);

The Execution Flow

To illustrate the execution flow of our integrated algorithm, suppose it is given an input dataset D like Table 4.2, with four dimensions, namely A, B, C, and F. To begin with, the

algorithm aggregates D (line 1). Then it determines the cardinality of the first dimension

A (line 7) and partitions the aggregated D on A (line 8), which creates four partitions

(a1, ∗, ∗, ∗), (a2, ∗, ∗, ∗), (a3, ∗, ∗, ∗) and (a4, ∗, ∗, ∗). Each partition is sorted in linear time using the counting sort algorithm.

Then the algorithm iterates through these partitions to construct cells and mine patterns

(line 9). It starts with cell (a1, ∗, ∗, ∗), gathering the transactions of tuples with a1 on A as the target class (line 11) and the remaining ones as the background class (line 12). Both

classes will then be passed on to the DPMiner procedure to find emerging patterns in the

target class (line 15), provided that this cell’s target class is not identical to that of its

descendant cells that have been processed (line 13, more on this later).

Then, BUC is called recursively on the current partition to materialize cells, mine pat-

terns and output them. The algorithm further sorts and partitions (a1, ∗, ∗, ∗) to proceed

to its parent (a1, b1, ∗, ∗). As it continues to execute, it recurses further on ancestor cells

(a1, b1, c1, ∗) and (a1, b1, c1, f1). Upon reaching the base cells, the algorithm backtracks to

the nearest descendant cell (a1, b1, c2, ∗). The complete processing order follows Figure 2.1.


Optimizations

The duplicate checking function in line 13 is an optimization that avoids producing cells with identical aggregated tuples and patterns. For example, the cell (a2, b1, ∗, ∗) ag-

gregates tuples 4 and 5 in Table 4.2. Since we have already computed its descendant cell

(a2, ∗, ∗, ∗), which also covers exactly the same two tuples, these two cells will have exactly

the same target class and background class. Therefore, processing cells like (a2, b1, ∗, ∗)

leads to unnecessary duplicate work and should be avoided. The duplicate checking function handles exactly this kind of situation; a minimal sketch of the test follows.
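The sketch below stands in for the determineCoverage call in the pseudo-code and is ours for illustration: because a partition can only contain tuples drawn from its input, equal sizes already imply equal tuple sets, so comparing sizes suffices.

#include <cstddef>

// Duplicate check: if partitioning left every input tuple in one bucket,
// the new cell aggregates exactly the tuples of the cell just processed,
// so its target class, background class and patterns would be identical.
bool coversSameTuples(std::size_t partitionSize, std::size_t inputSize) {
    return partitionSize == inputSize;   // duplicate cell: skip re-mining
}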

The above duplicate checking function generalizes the original BUC optimization called

writeAncestors (line 3 in the pseudo code). Our algorithm also includes writeAncestors

with slight modifications, as a special case of the duplicate checking. Consider that when the algorithm proceeds to (a4, ∗, ∗, ∗), the partition contains only one tuple. In the same sense

as we have discussed above, the ancestor cells (a4, b2, ∗, ∗), (a4, b2, c3, ∗), and (a4, b2, c3, f1)

all contain exactly the same tuple and hence will have identical patterns. These four cells

actually form an equivalent class. We choose to output the lower bound (a4, ∗, ∗, ∗) together

with the upper bound (a4, b2, c3, f1) and skip all intermediate cells in this equivalent class.

Both optimization techniques shorten the running time of our program and reduce the number of cells to output. Experiments conducted in [5] found that in real-life data warehouses, about 20% of the aggregates contain only one tuple. Empirically, therefore, such optimizations are worthwhile.

4.3 Implementations

For this capstone project, we implemented BUC and the Border-Differential in C++. We

also made use of the original DPMiner package from [16] for emerging pattern mining and

the PADS package from [25] for max-pattern mining needed by the Border-Differential.

To ensure smooth data flow in the integration, both the DPMiner and PADS packages were modified to meet our specific needs. The original packages read their input from files whose format differs from that of our test datasets (like Table 4.2), so we changed them to receive input directly from BUC on the fly. The modifications primarily concern the data structures holding transactions and the corresponding functions that manipulate items in transactions.


4.3.1 BUC with DPMiner

To integrate BUC with the DPMiner, for each cell we label the tuples whose dimension attributes match the current BUC partition as class 1 (the target class) and the remaining tuples as class 0 (the background class). Their transactions are then passed in two arrays to the DPMiner procedure, which carries out the pattern mining task. It mines frequent generators and closed patterns for both classes by executing the computational steps described in Section 2.3.2. After mining, the most general closed patterns, i.e., the shortest ones within their equivalent classes, are determined to be the so-called non-redundant δ-discriminative patterns. Such patterns are returned to the cell and stored as its measure. Lastly, one file per cell is written to disk, named by a string containing the dimension attributes of that cell.
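The class-labeling step can be sketched in C++ as follows; the Tuple structure and the function name are illustrative placeholders, not the interface of the modified DPMiner package.

#include <vector>

struct Tuple {
    std::vector<int> dims;    // encoded dimension attributes
    std::vector<int> items;   // the attached transaction
};

// Tuples inside the current BUC partition form class 1 (target); all
// remaining tuples form class 0 (background). The two transaction
// arrays are what the mining procedure receives.
void gatherClasses(const std::vector<Tuple>& data, int dim, int value,
                   std::vector<std::vector<int>>& target,
                   std::vector<std::vector<int>>& background) {
    for (const Tuple& t : data) {
        if (t.dims[dim] == value)
            target.push_back(t.items);
        else
            background.push_back(t.items);
    }
}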

4.3.2 BUC with Border-Differential and PADS

For the integration of BUC and the Border-Differential, the target and background classes of transactions are collected for each cell in the same manner as above. Unlike the DPMiner, the Border-Differential algorithm cannot itself determine the candidate emerging patterns (i.e., the max-patterns of both classes). Instead, our algorithm first employs PADS [25] to mine the max-patterns of both classes and then passes them on to the Border-Differential procedure.

The Border-Differential procedure then computes the differential between the two borders initiated by the max-patterns. As there can be more than one max-pattern for either the target or the background class, we may get multiple borders (each corresponding to a max-pattern) for a single cell. Finally, one file per cell is written to disk, named by a string containing the dimension attributes of that cell.
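To make the differential step concrete, here is a deliberately naive C++ sketch for one target max-pattern: it enumerates the sub-patterns of that max-pattern, discards any sub-pattern contained in some background max-pattern, and keeps the minimal survivors, which are the minimal jumping emerging patterns delimited by this border pair. All names are ours, and the exponential enumeration is only viable for short max-patterns; the actual Border-Differential [7, 8] obtains the same result through border manipulation without enumeration.

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

using Itemset = std::set<int>;

static bool isSubset(const Itemset& a, const Itemset& b) {
    return std::includes(b.begin(), b.end(), a.begin(), a.end());
}

// Naive border differential for one target max-pattern (assumed short).
std::vector<Itemset> naiveBorderDiff(const Itemset& targetMax,
                                     const std::vector<Itemset>& backgroundMax) {
    std::vector<int> items(targetMax.begin(), targetMax.end());
    std::vector<Itemset> survivors;
    for (unsigned mask = 1; mask < (1u << items.size()); ++mask) {
        Itemset x;                                     // one sub-pattern
        for (std::size_t i = 0; i < items.size(); ++i)
            if (mask & (1u << i)) x.insert(items[i]);
        bool covered = std::any_of(backgroundMax.begin(), backgroundMax.end(),
                                   [&](const Itemset& m) { return isSubset(x, m); });
        if (!covered) survivors.push_back(x);          // absent from background
    }
    std::vector<Itemset> minimal;                      // keep minimal survivors
    for (const Itemset& x : survivors) {
        bool reducible = std::any_of(survivors.begin(), survivors.end(),
                                     [&](const Itemset& y) { return y != x && isSubset(y, x); });
        if (!reducible) minimal.push_back(x);
    }
    return minimal;
}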

4.4 Summary

In this chapter, we formulated the data model construction problem in the first section. Then

we described our processing framework, i.e., the integration of data cubing and emerging pattern mining at both the data level and the algorithm level. Both levels of integration are important. Lastly, we addressed some issues arising in the actual implementation.


Chapter 5

Experimental Results and Performance Study

In this chapter, we present a comparative empirical evaluation of the algorithms developed and implemented in this capstone project. The experiments are run on a machine with an Intel Core 2 Duo CPU, 3.0 GB of main memory, and the Ubuntu 9.04 Linux operating system. The

machine is physically located in the Computing Science Instructional Labs (CSIL) at Simon

Fraser University. Our programs are implemented in C++.

5.1 The Test Dataset

To the best of our knowledge, there are no widely accepted datasets available which follow

the data schema specified in Section 4.1. This is mainly because data cubing and emerging pattern mining have previously worked separately, on entirely heterogeneous data sources.

We generated five relational base tables containing 100, 1,000, 2,500, 4,000 and 8,124

tuples respectively. All base tables have four standard dimensions with a cardinality of four.

In comparison, the experiments for the text cube in [17] use a test dataset with fewer tuples (2,013) but more dimensions (14). We did intend to test our programs on datasets with eight or more dimensions, but that easily produced tens of thousands of output files and exhausted our disk quota on the Ubuntu system.

For transactional data, we adopted a dataset named mushroom.dat from the Frequent

Itemset Mining Implementations Repository (FIMI) [9]. It contains 8,124 transactions with

an average length of 30 items. The total number of items in mushroom.dat is 113.

We synthesized our normalized datasets by appending rows in mushroom.dat to tuples

in each of the five base tables. The synthesis was not conducted randomly. Instead, we simulate a real multidimensional text database in which tuples sharing more dimension attributes tend to have more similar transactions. Therefore, in the data integration process, we first clustered similar transactions into groups and then assigned transactions within a group to tuples having overlapping dimension attributes. Conversely, tuples sharing few dimension attributes were appended with dissimilar transactions drawn from different clusters. Table 5.1 shows the sizes of the five synthetic datasets for our experiments.

Num. of Tuples    Size (KB)
100               6.9
1,000             69.2
2,500             174.1
4,000             278.5
8,124             565.7

Table 5.1: Sizes of synthetic datasets for experiments.

5.2 Comparative Performance Study and Analysis

We test three implementations (BUC alone, BUC with the DPMiner, and BUC with the Border-Differential) on all five synthetic datasets described above. Each test case was run ten times, and the average running time is reported.

5.2.1 Evaluating the BUC Implementation

Figure 5.1 presents the test results of our BUC implementation. In a pure BUC cubing test,

transaction items are still included in the input data, but the program does not process them, as they are irrelevant to pure data cubing. Including transaction data certainly adds computational overhead, since during execution data is moved around, both in memory and on disk, as entire tuples. However, the test results show that

our BUC implementation is still robust under such conditions.

As can be seen from Figure 5.1, the running time of BUC is impressive: it grows only linearly as the size of the input data increases. This desirable behavior is mainly attributed to the

(fixed range of) integer representation of data, which makes it possible to use the linear

counting sort algorithm. It has been shown that partitioning and sorting the data are the most time-consuming steps in a cubing process [5].

The implementation also achieves a great compression ratio in cube size when the data size is relatively small. However, as the number of tuples in the synthetic dataset grows, cells containing identical aggregates become rare, so optimization heuristics such as writeAncestors become of little use.

(a) Running time of BUC, in seconds, versus the number of tuples: (100, 0.11), (1,000, 0.98), (2,500, 2.21), (4,000, 3.30), (8,124, 6.56).
(b) Number of cells created through BUC: (100, 211), (1,000, 613), (2,500, 624), (4,000, 624), (8,124, 624).

Figure 5.1: Running time and cube size of our BUC implementation.

5.2.2 Comparing Border-Differential with DPMiner

We compared the two integration algorithms, (1) BUC with the DPMiner and (2) BUC with the Border-Differential, with respect to both running time and the size of the output cubes. The comparison results on running time are illustrated in Figure 5.2, and the complete experimental results are summarized in Table 5.2.

For the test cases with 100-tuple input, the minimal support threshold θ in target classes is set to 3; for larger test data (1,000 tuples and more), 3 is no longer a

reasonable value, so we take the square root of the number of tuples as the minimal support

threshold. For example, the threshold for 2,500 tuples is 50. The maximal support threshold

δ is set to 1 for all test cases.
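For concreteness, a sketch of this rule follows; the function name is ours, and, as Table 5.2 shows, the experiments rounded the square root to convenient values such as 30 for 1,000 tuples and 64 for 4,000 tuples.

#include <cmath>

// theta = 3 for the smallest dataset, roughly sqrt(n) otherwise;
// delta is fixed at 1 for all test cases.
int minSupportThreshold(int numTuples) {
    if (numTuples <= 100) return 3;
    return static_cast<int>(std::lround(std::sqrt(numTuples)));
}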

When compared on running time (the third column in Table 5.2), the first algorithm

is faster than the second on every dataset except the 1,000-tuple one (where it is less than 1% slower). On the 2,500- and 4,000-tuple datasets, the Border-Differential takes 1.3 and 1.7 times as long as the DPMiner, respectively. When tested on the 8,124-tuple dataset, the Border-Differential failed to complete within 120 seconds; we exclude this case, as it does not affect the comparison. The running time of the algorithm integrating the DPMiner is close to linear, while the MAX SNP-hard [22] Border-Differential proved much slower in practice.

(a) Running time of BUC + DPMiner, in seconds, versus the number of tuples: (100, 3.8), (1,000, 18.6), (2,500, 30.5), (4,000, 43.6), (8,124, 66.9).
(b) Running time of BUC + Border-Differential: (100, 9.6), (1,000, 17.5), (2,500, 41.1), (4,000, 73.6).

Figure 5.2: Comparing running time of the two integration algorithms.

When compared on cube size (the fifth column in Table 5.2), the Border-Differential appears to perform better than the DPMiner, but this is actually not the case, for two reasons. First, the Border-Differential does not generate the actual patterns, only their border description. In contrast, the DPMiner generates the full representations of patterns (i.e., the actual items). The border representation is more succinct but much less comprehensible, and deriving the actual patterns from borders adds computational cost. For contrast pattern based document analysis, users want to see actual text terms, so in that sense the DPMiner is the preferable choice.

Second, in terms of the output cubes, the Border-Differential generates more empty measures than the DPMiner. That is mainly because the max-pattern mining procedure does not return enough max-patterns, or because the borders formed by the max-patterns differ too much from each other to derive a valid differential border. Lowering the minimal support threshold θ can help, as indicated by the first row of Table 5.2: when θ = 3 (a very small threshold compared to the others), the Border-Differential produced cubes of almost the same size as the DPMiner did.


Tuples   Threshold   Time (sec)       Cells   Avg. Cell Size (KB)
100      3           3.8 vs. 9.6      210     7.2 vs. 7.1
1,000    30          18.6 vs. 17.5    613     8.2 vs. 2.4
2,500    50          30.5 vs. 41.1    624     13.4 vs. 1.8
4,000    64          43.6 vs. 73.6    624     18.8 vs. 3.5
8,124    90          66.9 vs. N/A     624     20.6 vs. N/A

Table 5.2: The complete experimental results; in each "vs." pair, the left value is BUC + DPMiner and the right is BUC + Border-Differential.

5.3 Summary

In light of the above, the feasibility of integrating data cubing (BUC) with emerging pattern mining has been justified by a series of comparative experiments. The integration of BUC with the DPMiner is efficient and robust on input data of reasonably large size. Moreover, despite producing a larger cube, the DPMiner finds more emerging patterns even with a large support threshold and presents them in an intelligible way. Thus, the DPMiner is a better choice than the Border-Differential for building our integrated data model for contrast pattern based document analysis.

On the other hand, the results would be more convincing if an ideal dataset, collected directly and entirely from real-life multidimensional text databases, could be used in our performance study.


Chapter 6

Conclusions

6.1 Summary of The Project

It has been shown that OLAP techniques and data cubes are widely applicable to the

analysis of documents stored in multidimensional text databases [6, 17, 24, 26]. However, no

previous work has been done to address a data mining problem related to multidimensional

text data. We proposed an OLAP-style contrast pattern based document analysis in this work and adopted the emerging pattern [7], an important class of contrast patterns, to study this problem.

In this capstone project, we developed algorithms for a novel data-cube-based model

to address the contrast pattern based document analysis. We implemented this model by

integrating the data cubing algorithm BUC [5] with two emerging pattern mining algorithms,

the Border-Differential [7] and the DPMiner [16]. Our empirical evaluations showed that

the DPMiner is preferable to the Border-Differential, for its seamless, effective, efficient and

robust integration with BUC. OLAP query answering techniques (point queries, sub-cube

queries and top-k queries) developed in [6, 17, 24, 26] can be applied directly to analyze

documents, thanks to the structural similarity between these cube-based data models.

6.2 Limitations and Future Work

This work could be further extended to fully address the non-trivial document analysis problem. One limitation of our model construction is that, despite the two optimizations used in the algorithm, the cube size is still not as small as it could be. Meanwhile,

it is costly to invoke the pattern mining procedure once for every single cell. Therefore, we

propose the following ideas for future improvements.

First, it is possible to compress the data cube by introducing more optimization heuristics, such as incremental pattern mining: when materializing an ancestor cell, it is sometimes unnecessary to gather the target class and the background class and mine patterns from scratch. Instead, we can take the patterns of its descendant cells and test their support against the ancestor's target class to see whether they still exceed the support threshold (see the sketch below). Besides, instead of full data cubes, we can construct iceberg cubes, partially materialized cubes that exclude cells aggregating fewer documents than a threshold.
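A purely illustrative sketch of this incremental idea, reusing the type aliases and support test sketched in Chapter 4; none of this is implemented in our current program.

#include <set>
#include <vector>

using Transaction = std::set<int>;
using Pattern = std::vector<int>;

// Support count, as sketched in Chapter 4.
int support(const Pattern& p, const std::vector<Transaction>& docs) {
    int n = 0;
    for (const Transaction& d : docs) {
        bool all = true;
        for (int item : p)
            if (d.count(item) == 0) { all = false; break; }
        if (all) ++n;
    }
    return n;
}

// Re-test a descendant cell's patterns against the ancestor's target class
// and keep those still meeting the support threshold, avoiding a full
// mining pass for patterns that carry over.
std::vector<Pattern> reusePatterns(const std::vector<Pattern>& descendantPatterns,
                                   const std::vector<Transaction>& ancestorTarget,
                                   int theta) {
    std::vector<Pattern> kept;
    for (const Pattern& p : descendantPatterns)
        if (support(p, ancestorTarget) >= theta)
            kept.push_back(p);
    return kept;
}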

Second, we can explore other data cubing techniques, such as the Star-Cubing [23], and

use them to build the cube in our model construction process. BUC is considered the most

efficient cubing algorithm for computing iceberg cubes, but it would still be interesting to

see whether other algorithms might be more efficient.

Last but not least, it is also possible to employ advanced cubing techniques to achieve

a higher level of summarization on the text data aggregated in the cube. Such techniques

include the Quotient Cube [14] and the QC-tree [15], which build on BUC to compress and summarize cells. Our duplicate checking idea described in Chapter 4 is similar in spirit, but achieves less compression than the Quotient Cube.


Bibliography

[1] Sameet Agarwal et al. On the Computation of Multidimensional Aggregates. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, pages 506–521, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules Between Sets of Items in Large Databases. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, New York, NY, USA, 1993. ACM.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

[4] James Bailey, Thomas Manoukian, and Kotagiri Ramamohanarao. Fast Algorithms for Mining Emerging Patterns. In PKDD '02: Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, pages 39–50, London, UK, 2002. Springer-Verlag.

[5] Kevin Beyer and Raghu Ramakrishnan. Bottom-up Computation of Sparse and Iceberg CUBE. In SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 359–370, New York, NY, USA, 1999. ACM.

[6] Bolin Ding et al. TopCells: Keyword-based Search of Top-k Aggregated Documents in Text Cube. In ICDE '10: Proceedings of the 26th International Conference on Data Engineering, Long Beach, CA, USA, 2010. IEEE.

[7] Guozhu Dong and Jinyan Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52, New York, NY, USA, 1999. ACM.

[8] Guozhu Dong and Jinyan Li. Mining Border Descriptions of Emerging Patterns from Dataset Pairs. Knowledge and Information Systems, 8(2):178–202, 2005.


[9] Bart Goethals et al. Frequent Itemset Mining Implementations Repository. Website, 2003. http://fimi.cs.helsinki.fi/data/.

[10] Jim Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-tab, and Sub-totals. Data Mining and Knowledge Discovery, 1(1):29–53, 1997.

[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006.

[12] Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns Without Candidate Generation. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12, New York, NY, USA, 2000. ACM.

[13] William Inmon. What Is A Data Warehouse, 1995.

[14] Laks V. S. Lakshmanan, Jian Pei, and Jiawei Han. Quotient Cube: How to Summarize the Semantics of A Data Cube. In VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, pages 778–789. VLDB Endowment, 2002.

[15] Laks V. S. Lakshmanan, Jian Pei, and Yan Zhao. QC-Trees: An Efficient Summary Structure for Semantic OLAP. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 64–75, New York, NY, USA, 2003. ACM.

[16] Jinyan Li, Guimei Liu, and Limsoon Wong. Mining Statistically Important Equivalence Classes and Delta-discriminative Emerging Patterns. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 430–439, New York, NY, USA, 2007. ACM.

[17] Cindy Xide Lin et al. Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. In ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 905–910, Washington, DC, USA, 2008. IEEE Computer Society.

[18] Guimei Liu, Jinyan Li, and Limsoon Wong. A New Concise Representation of Frequent Itemsets Using Generators and A Positive Border. Knowledge and Information Systems, 17(1):35–56, 2008.

[19] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[20] Sebastien Nedjar, Alain Casali, Rosine Cicchetti, and Lotfi Lakhal. Emerging Cubes for Trends Analysis in OLAP Databases. Lecture Notes in Computer Science, 4654:135–144, 2007.

[21] Jian Pei. Pattern-Growth Methods for Frequent Pattern Mining. PhD thesis, Simon Fraser University, 2002.


[22] Lusheng Wang, Hao Zhao, Guozhu Dong, and Jianping Li. On the Complexity of Finding Emerging Patterns. Theoretical Computer Science, 335(1):15–27, 2005.

[23] Dong Xin et al. Star-Cubing: Computing Iceberg Cubes by Top-down and Bottom-up Integration. In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 476–487. VLDB Endowment, 2003.

[24] Yintao Yu et al. iNextCube: Information Network-Enhanced Text Cube. Proceedings of the VLDB Endowment, 2(2):1622–1625, 2009.

[25] Xinghuo Zeng, Jian Pei, et al. PADS: A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining. Knowledge and Information Systems, 20(3):375–391, 2009.

[26] Duo Zhang et al. Topic Modeling for OLAP on Multidimensional Text Databases: Topic Cube and Its Applications. Statistical Analysis and Data Mining, 2(5-6):378–395, 2009.

[27] Yan Zhao. Quotient Cube and QC-Tree: Efficient Summarizations for Semantic OLAP. Master's thesis, The University of British Columbia, 2003.

[28] Yihong Zhao, Prasad M. Deshpande, and Jeffrey F. Naughton. An Array-based Algorithm for Simultaneous Multidimensional Aggregates. In SIGMOD '97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 159–170, New York, NY, USA, 1997. ACM.
