mining data streams with concept drift · before describing and evaluating diﬀerent approaches to...

Poznan University of Technology

Faculty of Computing Science and Management

Institute of Computing Science

Master’s thesis

MINING DATA STREAMS WITH CONCEPT DRIFT

Dariusz Brzeziński

Supervisor

Jerzy Stefanowski, PhD Dr Habil.

Poznań, 2010

Tutaj przychodzi karta pracy dyplomowej;

oryginał wstawiamy do wersji dla archiwum PP, w pozostałych kopiach wstawiamy ksero.

Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Mining data streams 3

2.1 Data streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Concept drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Data stream mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3.1 Online learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3.2 Forgetting mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3.3 Taxonomy of methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.1 Monitoring systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.2 Personal assistance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4.3 Decision support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4.4 Artificial intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Single classifier approaches 13

3.1 Traditional learners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Windowing techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.1 Weighted windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2.2 FISH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2.3 ADWIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.3 Drift detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3.1 DDM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3.2 EDDM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.4 Hoeffding trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4 Ensemble approaches 23

4.1 Ensemble strategies for changing environments . . . . . . . . . . . . . . . . . . . . 23

4.2 Streaming Ensemble Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Accuracy Weighted Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.4 Hoeffding option trees and ASHT Bagging . . . . . . . . . . . . . . . . . . . . . . . 28

4.5 Accuracy Diversified Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5 MOA framework 33

5.1 Stream generation and management . . . . . . . . . . . . . . . . . . . . . . . . . . 33

I

II Contents

5.2 Classification methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Evaluation procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3.1 Holdout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3.2 Interleaved Test-Then-Train . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3.3 Data Chunks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6 Experimental evaluation 37

6.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6.2 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6.3 Experimental environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6.4.1 Time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6.4.2 Memory usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6.4.3 Classification accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.5 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

7 Conclusions 47

A Implementation details 49

A.1 MOA Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

A.2 Attribute filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

A.3 Data chunk evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

A.4 Accuracy Weighted Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

A.5 Accuracy Diversified Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

B Additional figures 53

Bibliography 71

Streszczenie 77

Chapter 1

Introduction

1.1 Motivation

In today’s information society, computer users are used to gathering and sharing data anytime and

anywhere. This concerns applications such as social networks, banking, telecommunication, health

care, research, and entertainment, among others. As a result, a huge amount of data related to

all human activity is gathered for storage and processing purposes. These data sets may contain

interesting and useful knowledge represented by hidden patterns, but due to the volume of the

gathered data it is impossible to manually extract that knowledge. That is why data mining and

knowledge discovery methods have been proposed to automatically acquire interesting, non-trivial,

previously unknown and ultimately understandable patterns from very large data sets [26, 14].

Typical data mining tasks include association mining, classification, and clustering, which all have

been perfected for over two decades.

A recent report [35] estimated that the digital universe in 2007 was 281 billion gigabytes large

and it is forecast that it will reach 5 times that size until 2011. The same report states that by

2011 half of the produced data will not have a permanent home. This is partially due to a new

class of emerging applications - applications in which data is generated at very high rates in the

form of transient data streams. Data streams can be viewed as a sequence of relational tuples (e.g.,

call records, web page visits, sensor readings) that arrive continuously at time-varying, possibly

unbound streams. Due to their speed and size it is impossible to store them permanently [45].

Data stream application domains include network monitoring, security, telecommunication data

management, web applications, and sensor networks. The introduction of this new class of applica-

tions has opened an interesting line of research problems including novel approaches to knowledge

discovery called data stream mining.

Current research in data mining is mainly devoted to static environments, where patterns

hidden in data are fixed and each data tuple can be accessed more than once. The most popular

data mining task is classification, defined as generalizing a known structure to apply it to new

data [26]. Traditional classification techniques give great results in static environments however,

they fail to successfully process data streams because of two factors: their overwhelming volume

and their distinctive feature - concept drift. Concept drift is a term used to describe changes

in the learned structure that occur over time. These changes mainly involve substitutions of

one classification task with another, but also include steady trends and minor fluctuations of the

underlying probability distributions [54]. For most traditional classifiers the occurrence of concept

drift leads to a drastic drop in classification accuracy. That is why recently, new classification

algorithms dedicated to data streams have been proposed.

1

2 Introduction

The recognition of concept drift in data streams has led to sliding-window approaches that

model a forgetting process, which allows to limit the number of processed data and to react to

changes. Different approaches to mining data streams with concept drift include instance selection

methods, drift detection, ensemble classifiers, option trees and using Hoeffding boundaries to

estimate classifier performance.

Recently, a framework called Massive Online Analysis (MOA) for implementing algorithms

and running experiments on evolving data streams has been developed [12, 11]. It includes a

collection of offline and online data stream mining methods as well as tools for their evaluation.

MOA is a new environment that can facilitate and consequently accelerate the development of new

time-evolving stream classifiers.

The aim of this thesis is to review and compare single classifier and ensemble approaches to data

stream mining. We test time and memory costs, as well as classification accuracy, of representative

algorithms from both approaches. The experimental comparison of one of the algorithms, called

Accuracy Weighted Ensemble, with other selected classifiers has, to our knowledge, not been pre-

viously done. Additionally, we propose and evaluate a new algorithm called Accuracy Diversified

Ensemble, which selects, weights, and updates ensemble members according to the current stream

distribution. For our experiments we use the Massive Online Analysis environment and extend

it by attribute filtering and data chunk evaluation procedures. We also verify the framework’s

capability to become the first commonly used software environment for research on learning from

evolving data streams.

1.2 Thesis structure

The structure of the thesis is as follows. Chapter 2 presents the basics of data stream mining.

In particular, definitions of data streams, concept drift as well as types of stream learners and

their applications are shown. Chapter 3 gives a deeper insight into single classifier approaches to

data stream mining, presenting windowing techniques and Hoeffding trees. Ensemble approaches to

classification in data streams, including the Streaming Ensemble Algorithm, Hoeffding Option Tree,

and our Accuracy Diversified Ensemble, are presented in Chapter 4. The Massive Online Analysis

framework, which was used for evaluation purposes in this thesis, is presented in Chapter 5.

Chapter 6 describes experimental results and compares single classifier and ensemble algorithms

for mining concept-drifting data streams. Finally, Chapter 7 concludes the thesis with a discussion

on the completed work and possible lines of further investigations.

1.3 Acknowledgments

The author would like to thank all the people who contributed to this study. He is grateful to his

supervisor, prof. Jerzy Stefanowski, for his inspiration, motivation, and the care with which he

reviewed this work. The author is also greatly indebted to many other teachers of the Institute

of Computing Science of Poznań University of Technology who got him interested in machine

learning and data mining. Finally, the author wishes to thank his family: his parents, for their

love, unconditional support and encouragement to pursue his interests, and his sister, for sharing

her experience of dissertation writing and giving invaluable advice.

Chapter 2

Mining data streams

Before describing and evaluating different approaches to mining streams with concept drift, we

present the basics of data streams. First, in Section 2.1 we focus on the main characteristics of the

data stream model and how it differs from traditional data. Next, in Section 2.2, we define and

categorize concept drift and its causes. Section 2.3 discusses the differences between data stream

mining and classic data mining. It also presents a taxonomy of adaptive classification techniques.

Finally, Section 2.4 describes the main applications of data stream mining techniques. As this

thesis concentrates on classification techniques, we will use the term data stream learning as a

synonym for data stream mining.

2.1 Data streams

A data stream is an ordered sequence of instances that arrive at a rate that does not permit to

permanently store them in memory. Data streams are potentially unbounded in size making them

impossible to process by most data mining approaches.

The main characteristics of the data stream model imply the following constraints [5]:

1. It is impossible to store all the data from the data stream. Only small summaries of data

streams can be computed and stored, and the rest of the information is thrown away.

2. The arrival speed of data stream tuples forces each particular element to be processed essen-

tially in real time, and then discarded.

3. The distribution generating the items can change over time. Thus, data from the past may

become irrelevant or even harmful for the current summary.

Constraint 1 limits the amount of memory that algorithms operating on data streams can use,

while constraint 2 limits the time in which an item can be processed. The first two constraints led

to the development of data stream summarization techniques. Constraint 3, is more important in

some applications than in others. Many of the first data stream mining approaches ignored this

characteristic and formed the group of static data stream learning algorithms. Other researches

considered constraint 3 as a key feature and devoted their work to evolving data stream learning.

Both of these approaches to stream data mining will be discussed in Section 2.3.

Most of data stream analysis, querying, classification, and clustering applications require some

sort of summarization techniques to satisfy the earlier mentioned constraints. Summarization

techniques are used for producing approximate answers from large data sets usually by means of

data reduction and synopsis construction. This can be done by selecting only a subset of incoming

data or by using sketching, load shedding, and aggregation techniques. In the next paragraphs,

we present the basic methods used to reduce data stream size and speed for analysis purposes.

3

4 Mining data streams

Sampling. Random sampling is probably the first developed and most common technique used to

decrease data size whilst still capturing its essential characteristics. It is perhaps the easiest form of

summarization in a data stream and other synopses can be built from a sample itself [2]. To obtain

an unbiased sample of data we need to know the data set’s size. Because in the data stream model

the length of the stream is unknown and at times even unbounded, the sampling strategy needs

to be modified. The simplest approach, involving sampling instances at periodic time intervals,

provides a good way to “slow down” a data stream, but can involve massive information loss in

streams with fluctuating data rates. A better solution is the reservoir sampling algorithm proposed

by Vitter [77], which analyzes the data set in one pass and removes instances from a sample with

a given probability rather than periodically selecting them. Chaudhuri, Motwani and Narasayya

extended the reservoir algorithm to the case of weighted sampling [18]. Other approaches include

sampling for decision tree classification and k-means clustering proposed by Domingos et al. [21],

sampling for clustering data streams [39], and sampling in the sliding window model studied by

Babcock, Datar, and Motwani [2].

Sketching. Sketching involves building a statistical summary of a data stream using a small

amount of memory. It was introduced by Alon, Matias, and Szegedy [1] and consists of frequency

moments, which can be defined as follows: “Let S = (x1, ..., xN ) be a sequence of elements where

each xi belongs to the domain D = {1, ...d}. Let the multiplicity mi = |{j : xj = i}| denote the

number of occurrences of value i in the sequence S . For k ≥ 0, the kth frequency moment Fk of

S is defined as Fk =∑di=1m

ki ; further, we define F ∗∞ = maximi” [2]. The frequency moments

capture the statistics of the data streams distribution in linear space. F0 is the number of distinct

values in sequence S , F1 is the length of the sequence, F2 is the self-join size, and F ∗∞ is the most

frequent item’s multiplicity. Several approaches to estimating different frequency moments have

been proposed [2]. Sketching techniques are very convenient for summarization in distributed data

stream systems where computation is performed over multiple streams. The biggest shortcoming

of using frequency moments is their accuracy [45].

Histograms. Histograms are summary structures capable of aggregating the distribution of

values in a dataset. They are used in tasks such as query size estimation, approximate query an-

swering, and data mining. The most common types of histograms for data streams are: V-optimal

histograms, equal-width histograms, end-biased histograms. V-optimal histograms approximate

the distribution of a set of values v1, ..., v2 by a piecewise constant function v(i), so as to minimize

the sum of squared error∑i(vi − v(i))2. Ways of computing V-optimal histograms have been

discussed in [47, 40, 36]. Equal-width histograms aggregate data distributions to instance counts

for equally wide data ranges. They partition the domain into buckets such that the number of

vi values falling into each bucket is uniform across all buckets. Equal-width histograms are less

difficult to compute and update than V-optimal histograms, but are also less accurate [37]. End-

biased histograms contain exact counts of items that occur with frequency above a threshold, but

approximate the other counts by a uniform distribution. End-biased histograms are often used

by iceberg queries - queries to find aggregate values above a specified threshold. Such queries are

common in information retrieval, data mining, and data warehousing [2, 25].

Wavelets. Wavelets are used as a technique for approximating data with a given probability.

Wavelet coefficients are projections of a given signal (set of data values) onto an orthogonal set

of basis vectors. There are many types of basis vectors, but due to their ease of computation,

the most commonly used are Haar wavelets. They have the desirable property that the signal

2.2. Concept drift 5

reconstructed from the top few wavelet coefficients best approximates the original signal in terms

of the L2 norm. Research in computing the top wavelet coefficients in the data stream model is

discussed in [36, 62].

Sliding Window. Sliding windows provide a way of limiting the analyzed data stream tuples

to the most recent instances. This technique is deterministic, as it does not involve any random

selections and prevents stale data from influencing statistics. It is used to approximate data

stream query answers, maintain recent statistics [20], V-optimal histograms [38], compute iceberg

queries [60], among other applications [45]. We will give a deeper insight into using sliding windows

in data stream mining in Section 3.2.

2.2 Concept drift

As mentioned in the previous section, the distribution generating the items of a data stream can

change over time. These changes, depending on the research area, are referred to as temporal evo-

lution, covariate shift, non stationarity, or concept drift. Concept drift is an unforseen substitution

of one data source S1 (with an underlying probability distribution ΠS1), with another source S2

(with distribution ΠS2). The most popular example to present the problem of concept drift is

that of detecting and filtering out spam e-mail. The distinction between unwanted and legitimate

e-mails is user-specific and evolves with time. As concept drift is assumed to be unpredictable,

periodic seasonality is usually not considered as a concept drift problem. As an exception, if sea-

sonality is not known with certainty, it might be regarded as a concept drift problem. The core

assumption, when dealing with the concept drift problem, is uncertainty about the future - we

assume that the source of the target instance is not known with certainty. It can be assumed,

estimated, or predicted, but there is no certainty [85].

Kelly et al. [50] presented three ways in which concept drift may occur:

• prior probabilities of classes, P (c1), ..., P (ck) may change over time,

• class-conditional probability distributions, P (X|ci), i = 1, ..., k might change,

• posterior probabilities P (ci|X), i = 1, ..., k might change.

It is worth noting that the distributions P (X|ci) might change in such a way that the class

membership is not affected (e.g. symmetric movement to opposite directions). This is one of the

reasons why this type of changed is often referred to as virtual drift and change in P (ci|X) is

referred to as real drift. From a practical point of view, the distinction between virtual and real

drifts is of little importance, and we will not make that distinction in this thesis.

Figure 2.1 shows six basic types of changes that may occur in a single variable along time.

The first plot (Sudden) shows abrupt changes that instantly and irreversibly change the variables

class assignment. Real life examples of such changes include season change in sales. The next

two plots (Incremental and Gradual) illustrate changes that happen slowly over time. Incremental

drift occurs when variables slowly change their values over time, and gradual drift occurs when

the change involves the class distribution of variables. Some researchers do not distinguish these

two types of drift and use the terms gradual and incremental as synonyms. A typical example of

incremental drift is price growth due to inflation, whilst gradual changes are exemplified by slowly

changing definitions of spam or user-interesting news feeds. The left-bottom plot (Recurring)

represents changes that are only temporary and are reverted after some time. This type of change

is regarded by some researchers as local drift [75]. It happens when several data generating sources

are expected to switch over time and reappear at irregular time intervals. This drift is not certainly


Time

c1

c2

Class

Time

c1

c2

Class

Time

c1

c2

Class

Time

c1

c2

Class

Time

c1

c2

Class

Time

c1

c2

Class

Sudden Incremental Gradual

Recurring Blip Noise

Figure 2.1: Types of changes in streaming data. Apart from Noise and Blip, all thepresented changes are treated as concept drift and require model adaptation.

periodic, it is not clear when the source might reappear, that is the main difference from the

seasonality concept used in statistics [85]. The fifth plot (Blip) represents a “rare event”, which

could be regarded as an outlier in a static distribution. In streaming data, detected blips should

be ignored as the change they represent is random. Examples of blips include anomalies in landfill

gas emission, fraudulent card transactions and network intrusion [55]. The last plot in Figure 2.1

represents random changes, which should be filtered out. Noise should not be considered as

concept drift as it is an insignificant fluctuation that is not connected with any change in the

source distribution.

It is important to note that the presented types of drift are not exhaustive and that in real

life situations concept drift is a complex combination of many types of drift. If a data stream of

length t has just two data generating sources S1 and S2, the number of possible change patterns

is 2t. Since data streams are possibly unbounded, the number of source distribution changes can

be infinite. Nevertheless, it is important to identify structural types of drift, since the assumption

about the change types is absolutely needed for designing adaptivity strategies.

2.3 Data stream mining

After discussing the characteristics of data streams and drifting concepts we see that data stream

learning must differ from traditional data mining techniques. Table 2.1 presents a comparison of

traditional and stream data mining environments.

Table 2.1: Traditional and stream data mining comparison [29].

Traditional StreamNo. of passes Multiple Single

Processing time Unlimited RestrictedMemory usage Unlimited RestrictedType of result Accurate Approximate

Concept Static EvolvingDistributed No Yes

2.3. Data stream mining 7

Data stream classifiers must be capable of learning data sequentially rather than in batch mode

and they must react to the changing environment. The first of these two requirements is fulfilled

by using online learners, the second by implementing a forgetting method.

2.3.1 Online learning

Online learning, also termed incremental learning, focuses on processing incoming examples se-

quentially in such a way that the trained classifier is as accurate, or nearly as accurate, as a

classifier trained on the whole data set at once. During classifier training, the true label for each

example is known immediately or recovered at some stage later. Knowing the true label of an

example, the classifier is updated, minimally if possible, to accommodate the new training point.

Traditionally, online learners are assumed to work, like most data mining algorithms, in static

environments. No forgetting of the learned knowledge is envisaged.

A well designed online classifier should have the following qualities [27, 73, 54, 11]:

1. Incremental: The algorithm should read blocks of data at a time, rather than require all

of it at the beginning.

2. Single pass: The algorithm should make only one pass through the data.

3. Limited time and memory: Each example should be processed in a (small) constant

time regardless of the number of examples processed in the past and should require an

approximately constant amount of memory.

4. Any-time learning: If stopped at time t, before its conclusion, the algorithm should provide

the best possible answer. Ideally, the trained classifier should be equivalent to a classifier

trained on the batch data up to time t.

A learner with the above qualities can accurately classify large streams of data without the need

of rebuilding the classifier from scratch after millions of examples. Such learners can be constructed

by scaling up traditional machine learning algorithms. This is done either by wrapping a traditional

learner to maximize the reuse of existing schemes, or by creating new methods tailored to the data

stream setting. Examples of both approaches will be discussed in detail in Chapters 3 and 4. As

mentioned earlier, online learners do not have any forgetting mechanism by default and need to

be equipped with one to be suitable for data mining streams with concept drift.

2.3.2 Forgetting mechanisms

A data stream classifier should be able to react to the changing concept by forgetting outdated

data, while learning new class descriptions. The main problem is how to select the data range to

remember. The simplest solution is forgetting training objects at a constant rate and using only

a window of the latest examples to train the classifier. This approach, based on fixed parameters,

is caught in a tradeoff between stability and flexibility. If the window is small, the system will

be very responsive and will react quickly to changes, but the accuracy of the classifier might be

low due to insufficient training data in the window. Alternatively, a large window may lead to a

sluggish, but stable and well trained classifier [54].

A different approach involves ageing at a variable rate. This may mean changing window size

when a concept drift is detected or using decay functions to differentiate the impact of data points.

This approach tries to dynamically modify the window parameters to find the best balance between

accuracy and flexibility. It is usually better suited for environments with sudden drift, where the

change is easier to detect.


Another approach to forgetting outdated and potentially harmful data is selecting examples

according to their class distribution. This is especially useful in situations where older data points

may be more relevant than more recent points. This approach, considered in adaptive nearest

neighbors models, uses weights that decay with time, but can be modified depending on the

neighborhood of the most recent examples.

There is no best generic solution when it comes to implementing a forgetting mechanism for

a data stream learner. Depending on the environment, the expected types of drift, different

approaches may work better. For rather static data streams with gradual concept drift, static

windows should give the best accuracy. Sudden changes suggest the use of system with data

ageing at a variable rate, while recurring context should be best handled by density-based forgetting

systems. Examples of windowing techniques are discussed in more detail in Section 3.2.

2.3.3 Taxonomy of methods

Based on “how” and “when” the algorithms learn and forget, Zliobaite [84] proposed a taxonomy

of the main groups of adaptive classifiers. The taxonomy is graphically presented in Figure 2.2.

Dynamic

integration

Adaptive ensembles

SVM

Adaptive decision trees and

forests

Training set formation

Model manipulation, parametrization

Evolving

Trigger based

Instance selection

Instance weighting

Traning windows

Change

detection

based

algorithms

How

When

Figure 2.2: A taxonomy of adaptive supervised learning techniques [84].

The “when” dimension ranges from gradually evolving to trigger based learners. Trigger based

means that there is a signal (usually a drift detector) which indicates a need for model update.

Such methods work well in data streams with sudden drift as the signal can be used to rebuild the

whole model instead of just updating it. On the other hand, evolving methods update the model

gradually and usually react to changes without any drift detection mechanism. By manipulating

ensemble weights or substituting models they try to adapt to the changing environment without

rebuilding the whole model.

The “how” dimension groups learners based on how they adapt. The adaptation mechanisms

mainly involve example selection or parametrization of the base learner. Basic methods of adjusting

models to changing concepts were presented in Section 2.3.2. Detailed descriptions of approaches

presented in Figure 2.2 will be discussed with algorithm pseudo-codes in Chapters 3 and 4.

2.4. Applications 9

2.4 Applications

In this section we discuss some of the main data stream application domains in which concept

drift plays an important role. Along with the characteristics of each domain we present real life

problems and discuss the sources of drift in the context of these problems. A more thorough

discussion on concept drift application domains can be found in [85].

2.4.1 Monitoring systems

Monitoring systems are characterized by large data streams which need to be analyzed in real

time. Classes in these systems are usually highly imbalanced, which makes the task even more

complicated. The typical task of a monitoring system is to distinguish unwanted situations from

“normal behavior”. This includes recognizing adversary actions and alerting before critical system

states.

Network security. The detection of unwanted computer access, also called intrusion detection,

is one of the typical monitoring problems. Intrusion detection systems filter incoming network

traffic in search of suspicious behavior. The source of concept drift in this application is mainly

connected with the attacker. Adversary actions taken by the intruder evolve with time, to surpass

the also evolving security systems. The technological progress is another source of concept drift, as

modern systems providing more functionalities often provide more possibilities of attack [56, 61, 68].

Telecommunications. Intrusion and fraud detectors are also an important part of telecommu-

nication systems. The goal is to prevent fraud and stop adversaries from accessing private data.

Once again, the source of concept drift is the change in adversary behavior as well as change in

the behavior of legitimate users [63, 43].

Finance. Streams of financial transactions can be monitored to alert for possible credit card

and Internet banking frauds. Stock markets also use data stream mining techniques to prevent

insider trading. In both cases the data labeling might be imprecise due to unnoticed frauds and

misinterpreted legitimate transactions. Like in the previous examples, the source of concept drift is

the evolving user behavior. This is especially challenging with insider trading where the adversary

possesses non-public information about a company and tries to distribute his transactions in a

non-trivial way [22].

Transportation. Data stream mining techniques can be employed to monitor and forecast traffic

states and public transportation travel time. This information can be useful for scheduling and

traffic planning as well as dynamic reacting to traffic jams. Traffic patterns can change seasonally

as well as permanently, thus the systems have to be able to handle concept drift. Human driver

factors can also be significant to concept drift [64].

Industrial monitoring. There are several emerging applications in the area of sensor monitoring

where a large numbers of sensors are distributed in the physical world and generate streams of

data that need to be combined, monitored, and analyzed [2]. Such systems are used to control the

work of machine operators and to detect system faults. In the first case, human factors are the

main source of concept drift, while in the second, the change of the systems context [29, 76].


2.4.2 Personal assistance

Data stream mining techniques are also used to personalize and organize the flow of information.

This can include individual assistance for personal use, customer profiling, and recommendation

systems. The costs of mistakes are relatively low compared to other applications, and the class

labels are mostly “soft”. A mistake made by a recommendation system is surely less important

than a mistake made by an intrusion detector. Moreover, a recommendation systems user himself

does not know for sure, which of the two given movies he likes more. Personal assistance systems

do not have to react in real time, but are usually affected by more than one source of drift.

News feeds. Most individual assistance applications are related to textual data. They aim at

classifying news feeds and categorizing articles. Drifting user interests can be a cause of reoccurring

contexts in such systems. Also article topics and nomenclature may drift, causing the distribution

of articles to change independently from the users interests. This is a typical example of virtual

drift. There are also applications addressing web personalization and dynamics, which is again

subject to drifting user interests. Here, mostly data logs are mined to profile user interests without

his involvement [49].

Spam filtering. Spam filtering is a more complicated type of information filtering. In contrast

to most personal assistance systems, it is open to adversary actions (spamming). Adversaries

are adaptive and change spam content rapidly to overcome filters. The amount and types of

illegitimate mail are subject to seasonality, but also drift irregularly over time. The definition of

spam may also differ between users [49].

Recommendation systems. An application that is not strictly connected with data streams,

but also involves concept drift is customer profiling and assistance in recommendation systems.

One of the challenges faced by recommender systems is the sparsity of data. Most users rate/buy

only a few products, but the recommendation task needs to be performed on the whole data

set. The publicity of recommender systems research has increased rapidly with the NetFlix movie

recommendation competition. The winners used temporal aspect as one of the keys to the problem.

They noted three sources of drift in the competition task: the change of movie popularity over

time, the drift of users’ rating scale, and changes in user preferences [4].

Economics. Macroeconomic forecasts and financial time series are also subjects to data stream

mining. The data in those applications is drifting primary due to a large number of factors that

are not included in the model. The publicly known information about companies can form only

a small part of attributes needed to properly model financial forecasts as a stationary problem.

That is why the main source of drift is a hidden context.

2.4.3 Decision support

Decision support with concept drift includes diagnostics and evaluation of creditworthiness. The

true answer whether a decision is correct is usually delayed in these systems. Decision support

and diagnostic applications typically involve limited amount of data and are not required to be

made in real time. The cost of mistakes in these systems is large, thus the main challenge is high

accuracy.

2.4. Applications 11

Finance. Bankruptcy prediction or individual credit scoring is typically considered to be a sta-

tionary problem. However, just like in the earlier mentioned financial applications, there is drift

due to hidden context. The decisions that need to be made by the system are based on fragmen-

tary information. The need for different models for bankruptcy prediction under different economic

conditions was acknowledged, but the need for models to be able to deal with non stationarity has

been rarely researched [85].

Biomedical applications. Biomedical applications present an interesting field of concept drift

research due to the adaptive nature of microorganisms. As microorganism mutate, their resistance

to antibiotics changes. Patients treated with antibiotics when it is not necessary, can become

“immune” to their action when really needed. Other medical applications include changes in disease

progression, discovering emerging resistance and monitoring nonsomnical infections. Concept drift

also occurs in biometric authentication. The classification drift in this application is usually

caused by hidden context such as new light sources, image background, and rotation as well as

physiological factors, for example growing beard. The adaptivity of the algorithms should be used

with caution, due to potential adversary behavior [75].

2.4.4 Artificial intelligence

Learning in dynamic environments is a branch of machine learning and AI where concept drift

plays an important role. Classification algorithms learn how to interact with the environment

to achieve a given goal and since the environment is changing, the learners need to be adaptive

to succeed in completing their task. Ubiquitous Knowledge Discovery (UKD), which deals with

mobile distributed systems such as navigation systems, vehicle monitoring, household management

systems, is also prone to concept drift.

Navigation systems. The winners of the 2005 DARPA Grand Challenge used online learning

for road image classification. The main sources of concept drift were the changing road conditions.

The designing of a soccer player robot brings similar challenges and sources of drift [74, 57].

Smart homes and virtual reality. Intelligent household appliances need to be adaptive to

changing environment and user needs. Also virtual reality needs mechanisms to take concept drift

into account. In computer games and flight simulators, the virtual reality should adapt to the

skills of different users and prevent adversary actions like cheating [17].

Chapter 3

Single classifier approaches

In this chapter we discuss the most popular single classifiers used to classify streaming data.

In Section 3.1 we present learners proposed for stationary classification tasks that can also be

employed to classify data streams. In Section 3.2 we discuss the use of windows to model the

forgetting process, necessary to react to concept drift. Drift detectors, wrapper methods allowing

to rebuild classifiers only when necessary, are presented in Section 3.3. Finally, in Section 3.4

we discuss Very Fast Decision Trees (VFDT), an anytime system that builds decision trees using

constant time and memory per example.

3.1 Traditional learners

Some of the popular classifiers proposed for stationary data mining fulfill both of the stream mining

requirements - have the qualities of an online learner and a forgetting mechanism. Some methods

that are only able to process data sequentially, but do not adapt, can be easily modified to react

to change. In the following paragraphs we present four learners that fall into these groups: neural

networks, Naive Bayes, nearest neighbor methods, and decision rules.

Neural networks. In traditional data mining applications, neural networks are trained using

the epoch protocol. The entire set of examples is sequentially passed through the network a

previously defined number of times (epochs) and updates neuron weights, usually according to the

backpropagation algorithm. Presenting the same data multiple times allows the learner to better

adjust to the presented concept and provide better classification accuracy.

By abandoning the epoch protocol, and presenting examples in a single pass, neural networks

can easily work in data stream environments. Each example is seen only once and usually constant

time is required to update neuron weights. Most networks are fixed, meaning they do not alter

their number of neurons or architecture, thus the amount of memory necessary to use the learner

is also constant. Forgetting is a natural consequence of abandoning the epoch protocol. When not

presenting the same examples multiple times, the network will change according to the incoming

examples, thus reacting to concept drift. The rate of this reaction can be adjusted by the learning

rate of the backpropagation algorithm. A real world application using neural networks for data

stream mining is given by Gama and Rodrigues [33].

Naive Bayes. This model is based on the Bayes’ theorem and computes class-conditional prob-

abilities for each new example. Bayesian methods learn incrementally by nature and require

constant memory. Naive Bayes is a lossless classifier, meaning it “produces a classifier function-

ally equivalent to the corresponding classifier trained on the batch data” [54]. To add a forgetting

13

14 Single classifier approaches

mechanism usually sliding windows are employed to “unlearn” the oldest examples. A single Naive

Bayes model will generally not be as accurate as more complex models. Bayesian networks, which

give better results, are also suited to the data stream setting, it is only necessary to dynamically

learn their structure [13].

Nearest neighbor. Nearest neighbor classifiers, also called instance-based learners or lazy learn-

ers, provide an accurate way of learning data incrementally. Each processed example is stored and

serves as a reference for new data points. Classification is based on the labels of the nearest

historical examples. In this, lossless, version of the nearest neighbor algorithm called IB1, the

reference set grows with each example increasing memory requirements and classification time. A

more recent method from this family called IB3, limits the number of stored historical data points

only to the most “usefull” for the classification process. Apart from reducing time and memory

requirements, the size limitation of the reference set provides a forgetting mechanism as it removes

outdated examples from the model.

Decision rules. Rule-based models can also be adjusted to data stream environments. Decision

rule classifiers consist of rules - disjoint components of the model that can be evaluated in isolation

and removed from the model without major disruption. However, rules may be computationally

expensive to maintain as many rules can affect a decision for a single example. These observations

served as base for developing complex data stream mining systems like SCALLOP [28], FACIL [31]

and FLORA [80]. These systems learn rules incrementally and employ dynamic windows to provide

a forgetting mechanism.

3.2 Windowing techniques

The most popular approach to dealing with time changing data involves the use of sliding windows.

Windows provide a way of limiting the amount of examples introduced to the learner, thus elimi-

nating those data points that come from an old concept. The procedure of using sliding windows

for mining data stream is presented in Algorithm 3.1. Because in this work we discuss algorithms

that have the property of any-time learning and should be able to provide the best answer after

each example, the pseudo-codes do not contain explicit return statements. We assume that the

output classifier is available at any moment of the processing of the input stream.

Algorithm 3.1 The basic windowing algorithm

Input: S: a data stream of examplesW : window of examples

Output: C: a classifier built on the data in window W

1: initialize window W ;2: for all examples xi ∈ S do3: W ←W ∪ {xi};4: if necessary remove outdated examples from W ;5: rebuild/update C using W ;

The basic windowing algorithm is straightforward. Each example updates the window and later

the classifier is updated by that window. The key part of this algorithm lies in the definition of the

window - in the way it models the forgetting process. In the simplest approach sliding windows are

of fixed size and include only the most recent examples from the data stream. With each new data

point the oldest example that does not fit in the window is thrown away. When using windows

3.2. Windowing techniques 15

of fixed size, the user is caught in a tradeoff. If he chooses a small window size the classifier will

react quickly to changes, but may loose on accuracy in periods of stability, choosing a large size

will result in increasing accuracy in periods of stability, but will fail to adapt to rapidly changing

concepts. That is why dynamic ways of modeling the forgetting process have been proposed.

3.2.1 Weighted windows

A simple way of making the forgetting process more dynamic is providing the window with a

decay function that assigns a weight to each example. Older examples receive smaller weights and

are treated as less important by the base classifier. Cohen and Strauss [19] analyzed the use of

different decay functions for calculating data stream aggregates. Equations 3.1 through 3.3 present

the proposed functions.

wexp(t) = e−λt, λ > 0 (3.1)

wpoly(t) =1

tα, α > 0 (3.2)

wchord(t) = 1− t

|W |(3.3)

Equation 3.1 presents an exponential decay function, 3.2 a polynomial function, and 3.3 a

chordal function. For each of the functions t represents the age of an example. A new example

will have t = 0 whilst the last example that fits chronologically in a window will have t = |W | − 1.

The use of decay functions allows to gradually weight the examples offering a compromise between

large and small fixed windows. Algorithm 3.2 presents the process of obtaining a window with

decaying weights.

Algorithm 3.2 Weighted windows

Input: S: a data stream of examplesk: size of windoww(·): weight function

Output: W : a window of examples

1: for all examples xi ∈ S do2: if |W | = k then3: remove the oldest example from W ;4: W ←W ∪ {xi};5: for all examples xj ∈W do6: calculate example’s weight w(xj);

3.2.2 FISH

Zliobaite [85] proposed a family of algorithms called FISH, that use time and space similarities

between examples as a way of dynamically creating a window. To explain her approach, let us

consider an illustrative example, which we present in Figure 3.1. A binary classification problem

is represented by black and white dots. The data generating sources change with time, gradually

rotating the optimal classification hyperplane. For a given fixed in space area, depicted with a

red circle, the correct class changes as the optimal boundary rotates. The examples shows that

similarity in an evolving environment depends on both time and space.


optimum boundary:

Figure 3.1: Rotating hyperplane example [82]: (left) initial source S1, (center)source S2 after 45◦ rotation, (right) source S3 after 90◦ rotation. Black and whitedots represent the two classes.

The author proposed the selection of training examples based on a distance measure Dij defined

as follows:

Dij = a1d(s)ij + a2d

(t)ij (3.4)

where d(s) indicates distance in attribute space, d(t) indicates distance in time, and a1, a2 are

the weight coefficients. In order to manage the balance between the time and space distances,

d(s) and d(t) need to be normalized. For two examples xi, xj the author proposes Eucledian

distance (d(s)ij =√∑ p

k=1|xki − xkj |2) for distance in space and the number of dividing examples

(d(t)ij = |i− j|) for distance in time. It is worth noticing that if a2 = 0 then the measure Dij turns

into instance selection, and if a1 = 0 then we have a simple window with linearly time decaying

weights. Having discussed the proposed distance measure we present FISH3 in Algorithm 3.3.

Algorithm 3.3 FISH3 [85]

Input: S: a data stream of examplesk: neighborhood sizewindowStep: optimal window size search stepproportionStep: optimal time/space proportion search stepb: backward search size

Output: W : window of selected examples

1: for all examples xi ∈ S do2: for α = 0; α ≤ 1; α← α+ proportionStep do3: a1 ← α;4: a2 ← 1− α;5: for all remembered historical examples xj ∈ {xi−b, ..., xi−1} do6: calculate distance Dij using (3.4);7: sort the distances from minimum to maximum;8: for s = k; s ≤ b; s← s+ windowStep do9: select s instances having the smallest distance D;

10: using cross-validation build a classifier Cs using the instances indexed {i1, ...., is} andtest it on the k nearest neighbors indexed {i1, ..., ik};

11: record the acquired testing error es;12: find the minimum error classifier Cl, where l = arg min

l=k,...,b(el);

13: W ← instances indexed i1, ..., il;

For each new example xi in the data stream, FISH3 evaluates different time/space proportions

and window sizes. For each tested time/space proportion it calculates the similarities between tar-

get observation xi and the past b instances, and sorts those distances from minimum to maximum.

3.2. Windowing techniques 17

Next, the closest k instances to the target observation are selected as a validation set. This set is

used to evaluate different window sizes from k to b. FISH3 selects the training size l, which has

given the best accuracy on the validation set. For window testing, leave-one-out cross validation

is employed to reduce the risk of overfitting. Without cross validation the training set of size k

is likely to give the best accuracy, because in that case the training set is equal to the validation

set. The algorithm returns a window of l selected training examples that can be used to learn any

base classifier.

FISH3 allows to dynamically establish the size of the training window and the proportion

between time and space weights. The algorithm’s previous version FISH2 [82] takes the time/space

proportion as a parameter, while the first algorithm from the family, FISH1 [83], uses a fixed

window of the nearest instances. To implement a variable sample size the FISH2 and FISH3

incorporate the principles from two windowing methods proposed by Klinkenberg et al. [52] and

Tsymbal et al. [75].

FISH3 is an algorithm that needs to iterate through many window sizes and time/space pro-

portions each time performing leave-one-out cross validation. This is a costly process and may be

unfeasible for rapid data streams. That is why the definition of parameters k, b, proportionStep,

windowStep is very important. It is also worth noticing that although the algorithm can be used

with any base classifier, due to the way it selects instances it will work best with nearest neighbor

type methods.

Experiments performed on 6 small and medium size data sets for 4 types of base classifiers

(decision tree, Nearest Neighbor, Nearest Mean, and Parzen Window) showed that integration

of similarity in time and feature space when selecting a training set improve generalization per-

formance [85]. Additionally, FISH2 has been compared with, and outperformed, the windowing

methods of Klinkenberg and Tsymbal. The FISH family of algorithms should be regarded as a

generic extension to other classification techniques that can be employed when dynamic widowing

is necessary.

3.2.3 ADWIN

Bifet [6, 7] proposed an adapting sliding window algorithm called ADWIN suitable for data streams

with sudden drift. The algorithm keeps a sliding window W with the most recently read examples.

The main idea of ADWIN is as follows: whenever two “large enough” subwindows of W exhibit

“distinct enough” averages, one can conclude that the corresponding expected values are different,

and the older portion of the window is dropped. This involves answering a statistical hypothesis:

“Has the average µt remained constant in W with confidence δ”? The pseudo-code of ADWIN is

listed in Algorithm 3.4.

Algorithm 3.4 Adaptive windowing algorithm [7]

Input: S: a data stream of examplesδ: confidence level

Output: W : a window of examples

1: initialize window W ;2: for all xi ∈ S do3: W ←W ∪ {xi};4: repeat5: drop the oldest element from W ;6: until | ˆµW0 − ˆµW1 | < εcut holds for every split of W into W = W0 ·W1;


The key part of the algorithm lies in the definition of εcut and the test it is used for. The

authors state that different statistical tests can be used for this purpose, but propose only one

specific implementation. Let n denote the size of W , and n0 and n1 the sizes of W0 and W1

consequently, so that n = n0 + n1. Let ˆµW0and ˆµW1

be the averages of the values in W0 and W1,

and µW0and µW1

their expected values. The value of εcut is proposed as follows:

εcut =

√1

2m· 4

δ′, (3.5)

where

m =1

1/n0 + 1/n1, and δ′ =

δ

n.

The statistical test in line 6 of the pseudo-code checks if the observed average in both subwin-

dows differs by more than threshold εcut. The threshold is calculated using the Hoeffding bound,

thus gives formal guarantees of the base classifiers performance. The phrase “holds for every split

of W into W = W0 ·W1” means that we need to check all pairs of subwindows W0 and W1 created

by splitting W into two. The verification of all subwindows is very costly due to the number of

possible split points. That is why the authors proposed an improvement to the algorithm that

allows to find a good cut point quickly [7]. The originally proposed ADWIN algorithms are also

lossless learners, thus the window size W can grow infinitely if no drift occurs. This can be easily

improved by adding a parameter that would limit the windows maximal size. In its original form,

proposed by Bifet, ADWIN works only for 1-dimensional data, e.g., the running error. For this

method to be used for n-dimensional raw data, a separate window should be maintained for each

dimension. Such a modified model, although costly, reflects the fact that the importance of each

feature may change at different pace.

3.3 Drift detectors

After windowing techniques, another group of algorithms allowing to adapt almost any learner to

evolving data streams are drift detectors. Their task is to detect concept drift and alarm the base

learner that its model should be rebuilt or updated. This is usually done by a statistical test that

verifies if the running error or class distribution remain constant over time.

For numeric sequences the first proposed tests where the Cumulated Sum (CUSUM) [67] and

the Geometric Moving Average (GMA) [71]. The CUSUM test raises an alarm if the mean of

the input data is significantly different from zero, while GMA checks if the weighted average of

examples in a window is higher than a given threshold. For populations more complex than

numeric sequences statistical tests like the Kolmogorov-Smirnov test have been proposed. Below,

we discuss two recently proposed tests designed for drifting data streams.

3.3.1 DDM

Gama et al. [30] based their Drift Detection Method (DDM) on the fact, that in each iteration an

online classifier predicts the decision class of an example. That prediction can be either true or

false, thus for a set of examples the error is a random variable from Bernoulli trials. That is why

the authors model the number of classification errors with a Binomial distribution. Let us denote

pi as the probability of a false prediction and si as its standard deviation calculated as given by

Equation 3.6.

si =

√pi(1− pi)

i(3.6)

3.3. Drift detectors 19

The authors use the fact, that for a sufficiently large number of examples (n > 30), the Binomial

distribution is closely approximated by a Normal distribution with the same mean and variance.

For each example in the data stream the error rate is tracked updating two registers: pmin and

smin. These values are used to calculate a warning level condition presented in Equation 3.7 and an

alarm level condition presented in Equation 3.8. Each time a warning level is reached, examples are

remembered in a separate window. If afterwards the error rate falls below the warning threshold,

the warning is treated as a false alarm and the separate window is dropped. However, if the alarm

level is reached, the previously taught base learner is dropped and a new one is created, but only

from the examples stored in the separate “warning” window.

pi + si ≥ pmin + α · smin (3.7)

pi + si ≥ pmin + β · smin (3.8)

The values α and β in the above conditions decide about the confidence levels at which the

warning and alarm signals are triggered. The authors proposed α = 2 and β = 3, giving approxi-

mately 95% confidence of warning and 99% confidence of drift. Algorithm 3.5 shows the steps of

the Drift Detection Method.

Algorithm 3.5 The Drift Detection Method [30]

Input: S: a data stream of examplesC: classifier

Output: W : a window with examples selected to train classifier C

1: Initialize(i, pi, si, psmin, pmin, smin);2: newDrift← false;3: W ← ∅;4: W ′ ← ∅;5: for all examples xi ∈ S do6: if prediction C(xi) is incorrect then7: pi ← pi + (1.0− pi)/i;8: else9: pi ← pi − (pi)/i;

10: compute si using (3.6);11: i← i+ 1;12: if i > 30 (approximated normal distribution) then13: if pi + si ≤ psmin then14: pmin ← pi;15: smin ← si;16: psmin ← pi + si;17: if drift detected (3.8) then18: Initialize(i, pi, si, psmin, pmin, smin);19: W ←W ′;20: W ′ ← ∅;21: else if warning level reached (3.7) then22: if newDrift = true then23: W ′ ← ∅;24: newDrift← false25: W ′ ←W ′ ∪ {xi}26: else27: newDrift← true;28: W ←W ∪ {xi};


Algorithm 3.6 DDM: Initialize()

Input:i, pi, si, psmin, pmin, smin: window statisticsOutput: initialized statistics’ values

1: i← 1;2: pi ← 1;3: si ← 0;4: psmin ←∞;5: pmin ←∞;6: smin ←∞;

DDM works best on data streams with sudden drift as gradually changing concepts can pass

without triggering the alarm level. When no changes are detected, DDM works like a lossless

learner constantly enlarging the window size which can lead to the memory limit being exceeded.

3.3.2 EDDM

Baena-Garcia et al. [3] proposed a modification of DDM called EDDM. The authors use the same

warning-alarm mechanism that was proposed by Gama, but instead of using the classifier’s error

rate, they propose the distance error rate. They denote p′i as the average distance between two

consecutive errors and s′i as its standard deviation. Using these values the new warning and alarm

conditions are given by Equation 3.9 and 3.10.

p′i + 2 · s′ip′max + 2 · s′max

< α (3.9)

p′i + 3 · s′ip′max + 3 · s′max

< β (3.10)

EDDM works better than DDM for slow gradual drift, but is more sensitive to noise. Another

drawback of this method is that it considers the thresholds and searches for concept drift when a

minimum of 30 errors have occurred. This is necessary to approximate the Binomial distribution

by a Normal distribution, but can take a large amount of examples to happen.

3.4 Hoeffding trees

Decision trees were the first learners to be adapted to data stream mining by using the Hoeffding

bound. The Hoeffding bound states that with probability 1−δ, the true mean of a random variable

of range R will not differ from the estimated mean after n independent observations by more than:

ε =

√R2ln(1/δ)

2n. (3.11)

Using this bound, Domingos and Hulten [21] proposed a classifier called Very Fast Decision

Tree which we present in Algorithm 3.7.

The algorithm induces a decision tree from a data stream incrementally, without the need for

storing examples after they have been used to update the tree. It works similarly to the classic

tree induction algorithm [69, 16, 70] and differs almost only in the selection of the split attribute.

Instead of selecting the best attribute (in terms of split evaluation function G(·)) after viewing all

the examples, it uses the Hoeffding bound to calculate the number of examples necessary to select

the right split-node with probability 1− δ.

3.4. Hoeffding trees 21

Algorithm 3.7 The Hoeffding tree algorithm [21]

Input: S: a data stream of examplesX : a set of discrete attributesG(·): a split evaluation functionδ: split confidence

Output: HT : a Hoeffding decision tree

1: HT ← a tree with a single leaf l1 (the root);2: X1 ← X ∪ {X0};3: G1(X0)← G obtained by predicting the most frequent class in S;4: for all classes yk ∈ Y do5: for all values xij of each attribute Xi ∈ X do6: nijk(l1)← 0;7: for all examples (x, y) ∈ S do8: Sort (x, y) into a leaf l using HT ;9: for all attribute values xij ∈ x such that Xi ∈ Xl do

10: nijk(l)← nijk(l) + 1;11: label l with the majority class among the examples seen so far at l;12: if the examples seen so far at l are not all of the same class then13: compute Gl(Xi) for each Xi ∈ Xl − {X0} using the counts nijk(l);14: Xa ← the attribute with the highest Gl;15: Xb ← the attribute with the second-highest Gl;16: compute Hoeffding bound ε using (3.11);17: if Gl(Xa)−Gl(Xb) > ε and Xa 6= X0 then18: replace l by an internal node that splits on Xa;19: for all branches of the split do20: add a new leaf lm;21: Xm ← X − {Xa};22: Gm(X0)← the G obtained by predicting the most frequent class at lm;23: for all classes yk ∈ Y and each value xij of each attribute Xi ∈ Xm − {X0} do24: nijk(lm)← 0;

Many enhancements to the basic VFDT algorithm have been proposed. Domingos and Hul-

ten [21] introduced a method of limiting memory usage. They proposed to eliminate the statistics

held by the “least promising” leafs. The least promising nodes are defined to be the ones with the

lowest values of plel, where pl is the probability that examples will reach a particular leaf l, and

el is the observed rate of error at l. To reduce memory usage even more, they also suggested the

removal of statistics of the poorest attributes in each leaf.

The Hoeffding bound holds true for any type of distribution. A disadvantage of being so

general is that it is more conservative than a distribution-dependent bound and thus requires

more examples than really necessary. Jin and Agrawal [48] proposed the use of an alternative

bound which requires less examples for each split node. They also proposed a way of dealing

with numerical attributes, which VFDT originally does not support, called Numerical Interleave

Pruning (NIP). NIP creates data structures similar to histograms for numerical attributes with

many distinct values. With time, the number of bins in such histograms can be pruned allowing

the memory usage to remain constant.

A different approach to dealing with numerical attributes was proposed by Gama et al. [32].

They use binary trees as a way of dynamically discretizing numerical values. The same paper

also investigates the use of an additional classifier at leaf nodes, namely Naive Bayes. Other

performance enhancements [44, 11, 32] to Hoeffding trees include the use of grace periods, tie-

breaking, and skewed split prevention. Because it is costly to compute the split evaluation function

for each example, it is sensible to wait for more examples before re-evaluating a split node. Still,


after each example leaf statistics are updated, but the split nodes are evaluated after a larger

number of examples dictated by a grace period parameter. Tie breaking involves adding a new

parameter τ , which is used in an additional condition ε < τ in line 17 of the presented VFDT

pseudo-code. This condition prevents the algorithm form waiting too long before choosing one of

two, almost identically useful split attributes. To prevent skewed splits, Gama proposed a rule

stating that “a split is only allowed if there are at least two branches where more than pmin of the

total proportion of examples are estimated to follow the branch” [11].

The originally proposed VFDT algorithm was designed for static data streams and provided no

forgetting mechanism. The problem of classifying time changing data streams with Hoeffding trees

was first tackled by Hulten et al. [44] in the paper “Mining Time-Changing Data Streams”. The

authors proposed a new algorithm called CVFDT, which used a fixed-size window to determine

which nodes are aging and may need updating. For fragments of the Hoeffding tree that become

old and inaccurate, alternative subtrees are grown that later replace the outdated nodes. It is

worth noting, that the whole process does not require model retraining. Outdated examples are

forgot by updating node statistics and necessary model changes are performed on subtrees rather

than the whole classifier.

Different approaches to adding a forgetting mechanism to the Hoeffding Tree include using

an Exponential Weight Moving Average (EWMA) or ADWIN as drift detectors [5]. The latter,

gives performance guarantees concerning the obtained error rate and both mentioned methods are

more accurate and less memory consuming than CVFDT. The price the EWMA and ADWIN tree

extensions pay is the average time necessary to process a single example.

Hoeffding trees represent the current state-of-the-art in single classifier mining of data streams.

They fulfill all the requirements of an online learner presented in Section 2.3 and provide good

interpretability. Their performance has been compared with traditional decision trees, Naive Bayes,

kNN, and ensemble methods [21, 10, 32, 44, 48]. They proved to be much faster and less memory

consuming while handling extremely large datasets. The compared ensemble methods require

much more time and memory, and the accuracy boost they offer was usually marginal compared

to the used resources.

Chapter 4

Ensemble approaches

Classifier ensembles are a common way of boosting classification accuracy. Due to their modularity,

they also provide a natural way of adapting to change by modifying ensemble members. In this

chapter we discuss the use of ensemble classifiers to mine evolving data streams. In Section 4.1

main types of ensemble modification techniques are shown. Following sections describe three

specific adaptive ensemble algorithms. Section 4.2 discusses the Streaming Ensemble Algorithm,

Section 4.3 Accuracy Weighted Ensembles, and Section 4.4 Hoeffding Option Trees. Finally, in

Section 4.5 we propose a new data stream classifier called Accuracy Diversified Ensemble.

4.1 Ensemble strategies for changing environments

Ensemble algorithms are sets of single classifiers (components) whose decisions are aggregated by

a voting rule. The combined decision of many single classifiers is usually more accurate than that

given by a single component. Studies show that to obtain this accuracy boost, it is necessary to

diversify ensemble members from each other. Components can differ from each other by the data

they have been trained on, the attributes they use, or the base learner they have been created

from. For a new example, class predictions are usually established by member voting. A generic

ensemble training scheme is presented in Algorithm 4.1.

Algorithm 4.1 Generic ensemble training algorithm [51]

Input: S: a set of examplesk: number of classifiers in ensemble

Output: E : an ensemble of classifiers

1: E ← k classifiers;2: for all classifiers Ci in ensemble E do3: assign a weight to each example in S to create weight distribution Dm;4: build/update Ci with S modified by weight distribution Dm;

Ensemble training is a costly process. It requires at least k times more processing than the

training of a single classifier, plus member example selection and weight assignment usually make

the process even longer. In massive data streams, single classifier models can perform better

because there might not be time for running and updating an ensemble. On the other hand, if

time is not of primary importance, but very high accuracy is required, an ensemble would be the

natural solution.

Kuncheva [54] proposes to group ensemble strategies for changing environments as follows:

23

24 Ensemble approaches

• Dynamic combiners (horse racing) - individual classifiers (experts) are trained in advance

and the forgetting process is modeled by changing the expert combination rule.

• Updated training data - the experts in the ensemble are created incrementally by incoming

examples. The combination rule may or may not change in the process.

• Updating the ensemble members - ensemble members are update online or retrained with

blocks of data.

• Structural changes of the ensemble - periodically or when change is detected, ensemble mem-

bers are reevaluated and the worst classifiers are updated or replaced with a classifier trained

on the most recent examples.

• Adding new features - as the importance of features evolves with time, the attributes used

by team members are changed without redesigning the ensemble structure.

We will discuss horse racing, updating members, and structural changes in more detail.

Horse racing. The horse racing approach owes its name to an example that is used to explain

this method. In a series of horse races, a person (ensemble classifier) wants to predict the outcome

of each race (example). The person has k experts (ensemble members) he can trust or ignore. For

each race the person remembers an expert’s decision and updates his trustworthiness. With each

new race the set of experts the person listens to is different as he only chooses the most trustworthy

advisors.

The most famous representatives of this group include the Weighted Majority algorithm [59],

Hedge, Winnow [58], and Mixture of experts [46]. The horse racing approach may not be always

appropriate for evolving data streams. This is because, in this approach, the individual classifiers

are not retrained at any stage. Thus, after some time, the ensemble may be left without any

experts adequate to the current concept because the available members have been trained on

outdated data. In batch learning the experts are trained on the whole (finite) set of examples, so

there is no danger of lack of expertise.

Updated training data for online ensembles. In this approach the task is to differentiate

incrementally built ensemble members. This may involve sampling [65], filtering training examples

for consecutive members so they specialize in different concept cases [15], and using data chunks

to train the experts [34]. The proposed methods following this approach are usually variants

of the corresponding batch methods for stationary environments. When faced with changing

environments, we have to introduce a forgetting mechanism. One possible solution would be to

use a window of past examples, preferably of variable size.

A different representative of this approach is the online bagging algorithm proposed by Oza [66].

In this method the experts are incremental learners that combine their decision using a simple

majority vote. The sampling, crucial to batch bagging, is performed incrementally by presenting

each example to a component k times, where k is defined by the Poison distribution. Depending

on whether the base classifier is lossless or not, this method might need a separate forgetting

mechanism.

Changing the ensemble structure. Changing the ensemble structure usually involves remov-

ing a classifier to replace it with a newer one. The problem lies in the way the dropped experts

should be chosen. The simplest strategy removes the oldest classifier in the ensemble and trains

a new classifier to take its place. Wang et al. [78] propose a more sophisticated method which

4.2. Streaming Ensemble Algorithm 25

evaluates all classifiers using the most recent chunk of data as the testing set. Street and Kim [73],

on the other hand, consider a “quality score” for replacing a classifier based on its merit to the en-

semble, not only on the basis of its individual accuracy. Both listed examples of ensemble structure

changing algorithms will be discussed in detail in Sections 4.2 and 4.3.

Structure changing can be costly due to the process of selecting the weakest ensemble com-

ponents. Nevertheless, a good selection of learners boosts classification accuracy, and sometimes

even offers mathematical assurances of the new ensemble’s performance.

4.2 Streaming Ensemble Algorithm

Street and Kim [73] proposed an ensemble method called Streaming Ensemble Algorithm (SEA)

that changes its structure to react to changes. They propose a heuristic replacement strategy of

the “weakest” expert based on two factors: accuracy and diversity. Accuracy is important because,

as the authors suggest, an ensemble should correctly classify the most recent examples to adapt to

drift. On the other hand, diversity is the source of success of such ensemble methods like bagging

or boosting in static environments. The pseudo-code for SEA is listed in Algorithm 4.2.

Algorithm 4.2 The Streaming Ensemble Algorithm [73]

Input: S: a data stream of labeled examplesd: size of data chunk xiQ(·): a classifier quality measure

Output: E : an ensemble of classifiers

1: for all data chunks xi ∈ S do2: build classifier Ci using xi;3: evaluate classifier Ci−1 on xi;4: evaluate all classifiers Ej in ensemble E on xi;5: if E not full then6: E ← E ∪ {Ci−1};7: else if ∃j : Q(Ci−1) > Q(Ej) then8: replace member Ej with Ci−1;

The algorithm processes the incoming stream in data chunks. The size of those chunks is an

important parameter because it is responsible for the trade-off between accuracy and flexibility

discussed in Section 3.2. Each data chunk is used to train a new classifier, which is later compared

with ensemble members. If any ensemble member is “weaker” than the candidate classifier it is

dropped and the new classifier takes its place. To evaluate the classifiers Street and Kim propose

using the classification accuracy obtained on the most recent data chunk. They assign weights to

components according to their accuracy and additionally diversify the candidate classifiers weight

as follows:

• if both Ci−1 and E are correct, then the weight of Ci−1 is increased by 1− |P1 − P2|;

• if Ci−1 is correct and E in incorrect, then the weight of Ci−1 is increased by 1−|P1−Pcorrect|;

• both Ci−1 is incorrect, then the weight of Ci is decreased by 1− |PC − PCi−1|,

where P1 and P2 denote the two highest percentages of votes gained by the decision classes,

Pcorrect the percentage of votes of the correct decision class, and PCi−1the percentage of votes

gained by the class predicted by the new candidate classifier.

In the paper introducing SEA, the authors used C4.5 decision trees as base classifiers and

compared the ensemble’s accuracy with single pruned and unpruned decision trees. SEA performed


almost as well as a pruned tree on static data sets and much better on data sets with concept

drift. The authors also performed a series of experiments varying the number of operational

parameters. They showed that SEA performed best when no more than 25 components were used,

base classifiers were unpruned, and simple majority voting was used to combine member decisions.

4.3 Accuracy Weighted Ensemble

A similar way of restructuring an ensemble was proposed by Wang et al. [78]. In their algorithm,

called Accuracy Weighted Ensemble (AWE), they train a new classifier C ′ on each incoming

data chunk and use that chunk to evaluate all the existing ensemble members to select the best

component classifiers. Wang et al. stated and proved that for an ensemble Ek built from the k most

recent data chunks and a single classifier Gk also built from k most recent chunks, the following

theorem stands:

Theorem 4.3.1 Ek produces a smaller classification error than Gk, if classifiers in Ek are weighted

by their expected classification accuracy on the test data.

The proposed weighting provides a forgetting mechanism that is capable of handling reoccurring

concept equally well to sudden drifts and periods of stability. To explain their solution, the authors

of the algorithm discuss an illustrative example that presents the importance of accurate weighting

of ensemble members.

Let us assume a stream of 2-dimensional data partitioned into sequential chunks based on their

arrival time. Let xi be the data that came in between time ti and ti+1. Figure 4.1 shows the

distribution of the data and the optimum decision boundary during each time interval. Because

the distributions in the data chunks differ, there is a problem in determining the chunks that

should remain influential to accurately classify incoming data.

optimum boundary:

overfitting:positive:

negative:

x0 x1 x2

Figure 4.1: Data distribution.

Figure 4.2 shows the possible chunk sets that can be selected. The best set consists of chunks

x0 and x2, which have similar class distribution. This shows that decisions based on example class

distribution are bound to be better than those based solely on data arrival time. Historical data

whose class distributions are similar to that of current data can reduce the variance of the current

model and increase classification accuracy.

The similarity of distributions in data chunks largely depends on the size of the chunks. Bigger

chunks will build more accurate classifiers, but can contain more than one change. On the other

hand, smaller chunks are better at separating changes, but usually lead to poorer classifiers. The

definition of chunk sizes is crucial to the performance of this algorithm.

4.3. Accuracy Weighted Ensemble 27

optimum boundary:

S2 S1 S0+ S2 S1+ S2 S0++x2 x1 x0+ x2 x1+ x2 x0++

Figure 4.2: Training set selection.

According to Theorem 4.3.1, to properly weight the members of an ensemble we need to know

the actual function being learned, which is unavailable. That is why the authors propose to derive

weights by estimating the error rate on the most recent data chunk xi, as shown in Equations

4.1-4.3.

MSEi =1

|xi|∑

(x,c)∈xi

(1− f ic(x))2, (4.1)

MSEr =∑c

p(c)(1− p(c))2, (4.2)

wi = MSEr −MSEi, (4.3)

Function f ic(x) denotes the probability given by classifier Ci that x is an instance of class c.

This is an interesting feature of the weight calculation, as most weighting functions use only the

components prediction rather than the probability of all possible classes. It is also important

to note, that for the candidate classifier denoted as C ′, the error rate is calculated using cross

validation on the current chunk to avoid overfitting. Previous ensemble members are evaluated on

all the examples of the most recent data chunk. The value of MSEr is the mean square error of

a randomly predicting classifier and it is used to zero the weights of models that do not contain

any useful knowledge about the data.

The complete pseudo-code for the Accuracy Weighted ensemble is listed in Algorithm 4.3.

Algorithm 4.3 Accuracy Weighted Ensemble [78]

Input: S: a data stream of examplesd: size of data chunk xik: the total number of classifiersC: a set of previously trained classifiers (optional)

Output: E : a set of k classifiers with updated weights

1: for all data chunks xi ∈ S do2: train classifier C ′ on xi;3: compute error rate of C ′ via cross validation on S;4: derive weight w′ for C ′ using (4.3);5: for all classifiers Ci ∈ C do6: apply Ci on xi to derive MSEi;7: compute wi based on (4.3);8: E ← k of the top weighted classifiers in C ∪ {C ′}9: C ← C ∪ {C ′}

For the first k data chunks the algorithm outputs a set of all available classifiers, but when

processing further chunks it selects only the k best components to form an ensemble. Wang et al.


discussed that for a large data stream it is impossible to remember all the experts created during

the ensembles lifetime and the selection cannot be performed on an unbounded set of classifiers.

That is why for dynamic data streams, it is necessary to introduce a new parameter that would

limit the number of classifiers available for selection.

The AWE algorithm works well on data streams with reoccurring concepts as well as different

types of drift. As with SEA it is crucial to properly define the data chunk size as it determines the

ensembles flexibility. It is also worth noticing, that AWE will improve its performance gradually

over time and is best suited for large data streams.

4.4 Hoeffding option trees and ASHT Bagging

Recently, two interesting ensemble methods which use Hoeffding trees have been proposed.

Kirkby [51] proposed an Option Tree similar to that of Kohavi and Kunz [53] that allows each

training example to update a set of option nodes rather than just a single leaf. Option nodes work

like standard decision tree nodes with the difference that they can split the decision paths into

several subtrees. Making a decision with an option tree involves combining the predictions of all

applicable leaves into a single result.

Hoeffding Option Trees (HOT) provide a compact structure that works like a set of weighted

classifiers, and just like regular Hoeffding Trees, they are built in an incremental fashion. The

pseudo-code for the Hoeffding Option Tree is listed in Algorithm 4.4.

The algorithm works similarly to the Hoeffding Tree listed in Algorithm 3.7. The differences

show from line 18 where a new option is created. Like in most ensemble approaches, there is a limit

to the number of ensemble members denoted as k. If this limit has not been exceeded for a given

leaf, a new option path can be trained. Option creation is similar to adding a leaf to a Hoeffding

Tree with one minor difference concerning the split condition. For the initial split (line 13) the

decision process searches for the best attribute overall, but for subsequent splits (line 23) the search

is for attributes that are superior to existing splits. It is very unlikely that any other attribute

could compete so well with the best attribute already chosen that it could beat it by the same

initial margin (the Hoeffding bound practically insures that). For this reason, a new parameter δ′,

which should be much “looser”, is used for the secondary split.

For evolving data streams the author proposes to use a windowing technique that stores an

estimation of the current error at each leaf [11]. Hoeffding Option Trees offer a good compromise

between accurate, but time and memory expensive, traditional ensemble methods and fast, but

less accurate, single classifiers.

A different ensemble method designed strictly for Hoeffding trees was proposed by Bifet et

al. [10, 9]. Adaptive-Size Hoeffding Tree Bagging (ASHT Bagging) diversifies ensemble members

by using trees of different sizes. As the authors state: “The intuition behind this method is as

follows: smaller trees adapt more quickly to changes, and larger trees do better during periods

with no or little change, simply because they were built on more data” [10].

Apart from diversifying ensemble members, ASHT Bagging provides a forgetting mechanism.

Each tree in the ensemble has a maximum size s. After a node splits, if the number of split nodes

of the ASHT tree is higher than the maximum value, the tree needs to reduce its size. This can

be done by either deleting the oldest node along with its children, or by deleting all the tree nodes

and restarting its growth. The authors propose the maximal size of the n-th member to be twice

the maximal size of the (n − 1)-th tree. The suggested size of the first tree is 2. The weighting

of a component classifier in the ensemble is proportional to the inverse of the square of its error,

monitored by a exponential weighted moving average window.

4.5. Accuracy Diversified Ensemble 29

Algorithm 4.4 Hoeffding option tree [51]

Input: S: a data stream of examplesGl(·): a split evaluation functionδ: split confidenceδ′: confidence for additional splitsk: maximum number of options that should be reachable by any single example

Output: HOT : a Hoeffding option tree

1: HOT ← a tree with a single leaf l1 (the root);2: for all examples xi ∈ S do3: Sort xi into a leaves/option L using HOT ;4: for all option nodes l of the set L do5: update sufficient statistics in l;6: nl ← the number of examples seen at l;7: if nlmodnmin = 0 and examples seen at l not all of same class then8: if l has no children then9: compute Gl() for each attribute of xi;

10: Xa ← the attribute with the highest Gl;11: Xb ← the attribute with the second-highest Gl;12: compute Hoeffding bound ε using (3.11);13: if Xa 6= X∅ and (Gl(Xa)−Gl(Xb) > ε or ε < τ) then14: add a node below l that splits on Xa;15: for all branches of the split do16: add a new option leaf with initialized sufficient statistics;17: else18: if optionCountl < k then19: compute Gl() for existing splits and (non-used) attributes;20: s← existing child split with highest Gl21: Xs ← (non-used) attribute with highest Gl22: compute Hoeffding bound (3.11) using δ′ instead of δ;23: if Gl(Xs)−Gl(s) > ε then24: add an additional child option to l that splits on Xs;25: for all branches of the split do26: add a new option leaf with initialized sufficient statistics;27: else28: remove attribute statistics stored at l;

When compared with Hoeffding Option Trees, ASHT Bagging proves to be more accurate on

most data sets, but is extremely time and memory expensive [10]. For data intensive streams

Option Trees or single classifier methods are a better choice. On the other hand, if both runtime

and memory consumption are less of a concern, then variants of bagging usually produce excellent

accuracies.

4.5 Accuracy Diversified Ensemble

In Section 4.3 we discussed the construction of the Accuracy Weighted Ensemble. Based on its

weighting mechanism, we propose a new algorithm called Accuracy Diversified Ensemble (ADE),

which not only selects but also updates components according to the current distribution.

Wang et al. designed AWE to use traditional batch classifiers as base learners. Because of

this, they have to create ensemble members from single chunks and later only adjust component

weights according to the current distribution. This makes the data chunk size a crucial parameter

for AWE’s performance. We propose to use online learners as components, so as to update existing


members rather than just adjusting their weights. This modification allows to decrease the data

chunk size without the risk of creating less accurate classifiers.

Another drawback of AWE is its weight function. Because the algorithm is designed to perform

well on cost-sensitive data, the MSEr threshold in Equation 4.3 cuts-off “risky” classifiers. In

rapidly changing environments (like the Electricity data set presented in Section 6.2) this threshold

can “mute” all ensemble members causing no class to be predicted. To avoid this, in ADE we

propose a simpler weight function presented in Equation 4.4.

wi =1

(MSEi + ε)(4.4)

MSEi is calculated just like in Equation 4.1 and the ε component is a very small constant

value, which allows weight calculation in rare situations when MSEi = 0. Our slightly modified

weight function prevents the unwanted “muting” in sudden concept drift situations.

As mentioned earlier we want to update base learners according to the current distribution.

To introduce diversity, we do this only to selected classifiers. First of all, we only consider current

ensemble members - the k top weighted classifiers. Other stored classifiers are regarded as not

accurate enough to be corresponding to the current distribution. Additionally, we use MSEr as a

threshold, similarly to the way it was used in Equation 4.3 (line 12 of ADE psuedo-code). “Risky”

classifiers can enter the ensemble, but will not be updated.

The diversity introduced by updating only selected components could be marginal in periods

of stability. When no concept drift occurs, the classifiers trained on more examples are more

accurate. The most accurate classifiers are added to the ensemble and updated with new exam-

ples. Eventually, after many data chunks of stability, the ensemble can consist of almost identical

members, as they were all trained mostly on the same examples. That is why we employ online

bagging, as described in Section 4.1, for updating ensemble members. This way updating examples

are incrementally sampled reducing the risk of creating an ensemble of identical components.

The full pseudo-code of the Accuracy Diversified Ensemble is listed in Algorithm 4.5. The key

modifications, apart from base learner and weight changes, start in line 11.

Algorithm 4.5 Accuracy Diversified Ensemble

Input: S: a data stream of examplesd: size of data chunk xik: the total ensemble members

Output: E : a set of k online classifiers with updated weights

1: C ← ∅2: for all data chunks xi ∈ S do3: train classifier C ′ on xi;4: compute error rate of C ′ via cross validation on S;5: derive weight w′ for C ′ using (4.4);6: for all classifiers Ci ∈ C do7: apply Ci on xi to derive MSEi;8: compute weight wi based on (4.4);9: E ← k of the top weighted classifiers in C ∪ {C ′}

10: C ← C ∪ {C ′}11: for all classifiers Ce ∈ E do12: if we > 1

MSErand Ce 6= C ′ then

13: update classifier Ce with xi using Oza online bagging;

Compared to existing ensemble methods the Accuracy Diversified Ensemble provides a new

learning strategy. ADE differs from AWE in weight definition, the use of online base classifiers,

4.5. Accuracy Diversified Ensemble 31

bagging, and updating components with incoming examples. Ensemble members are weighted,

can be removed, and are not always updated, unlike in the online bagging approach proposed by

Oza. Compared to ASHT and HOT, we do not limit base classifier size, do not use any windows,

and update members only if they are accurate enough according to the current distribution.

The main concept of our approach is that only components closely related to the current distri-

bution should be updated, and when done so, they need to be additionally diversified.

In Chapter 6 we compare accuracy, time, and memory performance of the Accuracy Diversified

Ensemble with four classifiers: a windowed decision tree, a Hoeffding Tree with a drift detector,

the Accuracy Weighted Ensemble, and the Hoeffding Option Tree. We check if time and memory

requirements remain constant after changing AWE’s batch members to incremental learners. We

also verify if bagging is really a good way of boosting accuracy. To do this, we perform experiments

on two versions of ADE, one with and one without bagging, and compare average results.

Chapter 5

MOA framework

Massive Online Analysis (MOA) is a software environment for implementing algorithms and run-

ning experiments for online learning [8, 12, 11]. It is implemented in Java and contains a collection

of data stream generators, online learning algorithms, and evaluation procedures.

In this chapter, we discuss the components of MOA and our contributions to the framework.

Section 5.1 describes stream generation, drift and noise addition, as well as discusses the perfor-

mance of an attribute selection filter we implemented. Section 5.2 lists the available classification

methods and describes the process of implementing a custom classifier. Finally, in Section 5.3, we

present the predefined evaluation methods in MOA and our proposition to data stream evaluation

- Data Chunks. Details concerning the code and execution of the implemented features can be

found in Appendix A.

5.1 Stream generation and management

Work in MOA is divided into tasks. Main tasks in MOA include classifier training, learner eval-

uation, stream file generation, and stream speed measurement. Tasks can be executed from a

graphical user interface (GUI) as well as from the command line. The main application window

of the GUI is presented in Figure 5.1. The user interface allows to run many tasks concurrently,

controlling their progress and presenting partial results.

Figure 5.1: MOA main window.

33

34 MOA framework

MOA is capable of reading ARFF files, which are commonly used in machine learning thanks to

the popularity of the WEKA project [79, 41]. It also allows to create data streams on the fly using

generators, by joining several streams, or by filtering streams. The data stream generators available

in MOA are: Random Trees [21], SEA [73], STAGGER [72], Rotating Hyperplane [78, 23, 24],

Random RBF, LED [32], Waveform [32], and Function [48].

The framework also provides an interesting feature that allows to add concept drift to stationary

data streams. MOA uses the sigmoid function to model a concept drift event as a weighted

combination of two pure distributions that characterize the target concepts before and after the

drift. The user can define the concept before and after the drift, the moment of the drift, and its

width [12].

In its data stream processing workflow, MOA has a filtering stage. In its current release, the

framework only allows to add noise to the stream. That is why we decided to implement a new

filtering feature. Some real data sets for evaluating online learners like Spam Corpus or Donation

(described in detail in Section 6.2) have too many attributes. We implemented a static attribute

selection filter that allows to select the features that will be passed to the learner. It would be easy

to implement dynamic attribute selection by performing batch attribute selection on a sample

of the stream. Unfortunately, the process of filtering each example in MOA can be very time

expensive. The cost of filtering attributes is so high, because it is done after the instance has

been created. This means that to cut an example with 20,000 attributes to 15,000 attributes, first

20,000 attributes are read and used to create an instance, and later all those attributes are read

and filtered to create a new instance with 15,000 attributes. If the framework introduced a new

stage, prior to example creation, the attribute filtering process could be much more efficient.

5.2 Classification methods

The main classification methods implemented in MOA include: Naive Bayes, Hoeffding Tree,

Hoeffding Option Tree, Bagging, Boosting, Bagging using ADWIN, Bagging using Adaptive-Size

Hoeffding Trees, and the Weighted Majority Algorithm. The framework also allows to use a

classifier from WEKA and combine any learner with a drift detector. The configuration window

for an example classifier is presented in Figure 5.2.

Figure 5.2: Classifier settings window.

5.3. Evaluation procedures 35

MOA is designed to work in modular fashion allowing users to implement their own tasks

and add them to the framework without much effort. Creating a new classifier requires only

to extend an abstract class called moa.classifiers.AbstractClassifier and implement the

desired algorithm. The setting windows in the GUI are created dynamically. Using this feature

of MOA, we implemented the AWE and ADE algorithms described in Sections 4.3 and 4.5. The

newly implemented classifiers are compared with other algorithms in Chapter 6.

5.3 Evaluation procedures

Currently MOA comes with two evaluation methods called Holdout and Interleaved Test-Then-

Train. We propose a third method that evaluates classifiers using chunks of data. All three

methods are described in the following subsections.

5.3.1 Holdout

In batch learning the most common classifier evaluation method is cross-validation. Unfortunately,

cross-validation becomes costly for large data sets and in these cases it is often accepted to measure

performance on a single holdout set. In batch learning this is done by dividing examples between

train and test sets before classifier training. In “Data Stream mining: A Practical Approach” [11],

the authors argue that data stream learning can be viewed as a large-scale case of batch learning,

and that the holdout procedure is appropriate for these environments.

The MOA implementation of holdout evaluates the model periodically, at a constant user-

defined interval, e.g., after every one million training examples. Testing the model too often is

undesired, as it may drastically slowdown the evaluation process without providing any significant

information about the tested classifier’s performance.

For data streams without concept drift, a static holdout set should accurately evaluate a clas-

sifier. The only constraints concerning the test set are that it should be independent from the

training set and sufficiently large, relatively to the target concepts complexity. For evolving data

streams, it is necessary to dynamically populate the testing set with previously unused data. This

can be done by periodically using a set of examples for testing before training.

Currently MOA provides only holdout evaluation with a static testing set. Although it is not

stated anywhere in the documentation [12], the held out set is created by taking the n first examples

from the stream, where n is the size of the testing set. In our opinion, this implementation cannot

be used to evaluate drifting data streams, as it tests the accuracy for only the first occurring

concepts.

5.3.2 Interleaved Test-Then-Train

An alternative approach to evaluating data stream algorithms involves testing the model with each

incoming example and later using that example for training. This technique does not need separate

memory for a test set and makes maximum use of the available data. When interleaving testing

with training classifier performance can be examined with the most detailed possible resolution -

for each example. That last mentioned property can be a problem, as storing statistics for each

example in large data streams could be time and memory inefficient. For this reason, MOA allows

to reduce the storage requirements of the results by recording statistics only at periodic intervals

defined by the user.

The Test-Then-Train approach can be used equally well for static and evolving data streams.

The disadvantage of this technique, compared to using a held out set, is that it is practically

36 MOA framework

impossible to correctly measure training and testing times. This method also gives obscured

accuracy results, because the classifiers will make more mistakes at the beginning of the learning

process.

In their report about data stream mining [11], Kirkby and Bifet compared average accuracy

plots for both described evaluation techniques and opted for holdout. We believe that the holdout

procedure, at least in the way it is implemented in MOA, is inappropriate for evolving data streams.

We propose a compromise between testing after each example and creating a static held out set -

data chunks.

5.3.3 Data Chunks

Evaluating with data chunks works similarly to the Test-Then-Train method with the difference

that it uses data chunks instead of single examples. The procedure reads incoming examples

without processing them, until they form a data chunk of size s. Each new data chunk is first

used to test the existing model, then it updates the model, and finally it is disposed to preserve

memory. Figure 5.3 gives an illustration of the data chunk evaluation method.

x0

. . .x1 x2 xi

Test model on x

Update model with x Train model with x

Test model on x

Update model with x

Test model on x

Update model with x0

1 i

1 2

2

i

Figure 5.3: Data chunk evaluation method.

This approach allows to measure training and testing times and reduces the accuracy obscuring

effect. It is suitable for static and evolving streams and provides a natural method of reducing

result storage requirements. In this approach the estimate of accuracy is still more pessimistic

than the one calculate via a static holdout procedure, but the ability to evaluate classifiers on

drifting data streams makes it much better choice.

Chapter 6

Experimental evaluation

This chapter discusses the experiments carried out to compare selected data stream algorithms

from Chapters 3 and 4. Section 6.1 lists the algorithms chosen for the comparative experiments

along with their parameter settings. Section 6.2 gives insight into the main characteristics of each

data set. Section 6.3 briefly describes the experimental environment. Finally, in Sections 6.4 and

6.5 we present and discuss experimental results.

6.1 Algorithms

One of the main goals of this thesis is to compare selected single classifier and ensemble methods in

data stream environments. For our experiments, we chose two single classifier and four ensemble

approaches, including two versions of our proposed algorithm. From single classifier methods, we

chose a Hoeffding Tree with a DDM drift detector (HT+DDM) and a tree with a static window

(Tree+Win). From ensemble methods, we chose the Accuracy Weighted Ensemble (AWE), the

Hoeffding Option Tree (HOT), and the Accuracy Diversified Ensemble with (ADEBag) and without

(ADE) bagging.

To make the comparison more significant, we tried to set the same parameter values for all

the algorithms. For ensemble methods we set the number of component classifiers to 15: AWE,

ADE, and ADEBag have 15 trees, HOT has 15 options. We also set the static window size to

15 × chunkSize to make the number of examples seen by the windowed classifier similar to that

seen by AWE and ADE. The parameters of the Hoeffding Tree with the drift detector are the same

as those of the option tree, and the base classifiers (also Hoeffding Trees) of the ensembles. All

these trees have Naive Bayes leaves, split confidence δ = 0.01, and grace period γr = 100, γa = 200

for real and artificial data sets consequently. The secondary split confidence for HOT is set to

δ′ = 0.5. The windowed tree is not a Hoeffding tree, but a traditional decision tree with Naive

Bayes leaves.

All the algorithms are evaluated using the data chunk evaluation method described in Sec-

tion 5.3.3. The data chunk size is equal dr = 500 and da = 1000, for real and artificial data

sets consequently. AWE uses each data chunk to build a new classifier and test those previously

created, while ADE does the same with halves of the chunks. ADE can use halves of the chunks

because it can incrementally update components, while a component of AWE is built on a chunk

and is never updated. Data chunk sizes also defined the result sampling frequency. All the result

plots consist of sampling points joined by lines, where a sampling point represents an average value

(accuracy, memory or time) of 500 or 1000 examples.

37

38 Experimental evaluation

6.2 Data sets

For the purposes of research into data stream classification there is a shortage of suitable and

publicly available real-world benchmark data sets. Most of the common benchmarks for machine

learning algorithms are not suitable for evaluating data stream classification. They contain too

few examples and do not contain any type of concept drift.

To demonstrate their systems, several researchers have used private real world data that cannot

be reproduced by others. Examples of this include the web trace from the University of Washington

used to evaluate VFDT [21], and the credit card fraud data used by Wang et al. [78, 23, 24].

That is why, it has become a common practice by researchers to publish results based also

on synthetic data sets. The authors of data stream mining algorithms have constructed unique

data generation schemes for the purpose of evaluating their models. Some of the more popular

generators are: Waveform, RBF, SEA, LED, Hyperplane and Random generated trees.

In our experiments we use four real and four synthetic data sets with concept drift, all of which

are publicly available. A short description of each data set is given below.

Electricity market data (Elec). Electricity is a data set first described by Harries [42]. It

consists of energy prices from the electricity market in the Australian state of New South Wales.

These prices were affected by market demand, supply, season, weather and time of day and evolve

seasonally while also showing sensitivity to short-term events. From the original data set we

selected only a time period with no missing values comprised of 27,552 instances described by 7

features. Decision class values “up” and “down” indicate the change of the price and are moderately

balanced (the default accuracy is 57.60%).

Ozone level detection (Ozone). Ozone is a streaming problem concerning local ozone peak

prediction, that is based on eight hours measurement [81, 82]. The data set consists of 2,534 entries

and is highly unbalanced (2% or 5% positives depending on the criteria of “ozone days”). The

true model behind the data is stochastic as a function of measurable factors and evolves gradually

over time. Another difficulty in mining this data set, is that many of the 72 features collected for

each instance are irrelevant.

Spam Assassin corpus (Spam). Spam is a series of entries of real-world spam and legiti-

mate emails chronologically ordered according to their date and time of arrival [49]. The data

set consists of 9,324 instances and initially 40,000 features selected from the Spam Assassin

(http://spamassassin.apache.org/) data collection. The data is unbalanced (20% of spam, 80%

of legitimate emails) and represents gradual concept drift.

Donation data (Don). Donation is a data set used for The Second International Knowledge

Discovery and Data Mining Tools Competition. It was also used by Wei Fan [23] to evaluate his

Systematic data selection technique. The data represents a regression problem where the goal is

to estimate the return from a direct mailing in order to maximize donation profits. In this data

set there are almost 200,000 instances described by 479 features. The data contains examples of

sudden concept drift.

LED Generator (Led). LED is popular artificial data set that originates from the CART

book [16]. It consists of a stream of 24 binary attributes, 17 of which are irrelevant, that define

the digit displayed on a seven-segment LED display. The data set is known to have an optimal

6.3. Experimental environment 39

Bayes classification rate of 74%. We use this generator to acquire 1,000,000 examples with sudden

and gradual concept drift.

Waveform (Wave). Waveform is an artificial data set used by Gama et al. [32]. It consists of

a stream with three decision classes where the instances are described 40 attributes. It is known

that the optimal Bayes error for this data set is 14%. This generator is used to acquire 1,000,000

instances with gradual concept drift.

Hyperplane (Hyp). Hyperplane is a popular data set generator used in many experiments [78,

23, 24, 82]. It is mainly used to generate streams with gradual concept drift by rotating the

decision boundary for each concept. We set the generator to create 5,000,000 instances described

by 10 features and 2 decision class values. We also add 10% of noise to the concepts to randomly

differentiate the instances.

SEA generator (Sea). SEA was proposed by Street and Kim [73]. The data set has two

decision classes with sudden concept drift and is created by generating 60,000 random points in a

three-dimensional feature space. All three features have values between 0 and 10. The generator

then divides those points into four blocks with different concepts and assigns classes according

to a linear function. For the experiments in this thesis, we generate a data stream of 20,000,000

instances with 10% of noise.

6.3 Experimental environment

All the tested algorithms were implemented in Java as part of the MOA framework. We imple-

mented the AWE and ADE algorithms, and the Data Chunk evaluation procedure. All the other

algorithms were already a part of MOA. The experiments took place on a machine equipped with

an Intel Pentium Core 2 Duo P9300 @ 2.26 GHz processor and 3.00 GB of RAM. Each algorithm

was tested on 8 data sets, described in the previous section, using the Data Chunk evaluation

procedure.

6.4 Results

According to the main characteristics of data streams described in Chapter 2, we divide the

results into three groups. First, in Section 6.4.1, we analyze the algorithms’ time performance.

We compare train and test times, and verify if the constant processing time requirement is met.

In Section 6.4.2, we discuss the algorithms’ memory usage over time. Finally, in Section 6.4.3 we

compare the classification accuracies achieved by all the algorithms.

The most common way of displaying results in data stream mining papers is a graphical plot,

typically with the number of training examples on the x-axis [11]. During our experiments, we

generated 4 plots for each data set: train time, test time, memory usage, and accuracy. The most

interesting figures will be discussed in the following sections along with tabular summaries. All

the generated plots are given in Appendix B.

6.4.1 Time analysis

Tables 6.1 and 6.2 present average chunk train and test times for each data set. We can see that

single classifiers process data streams much faster than ensemble methods. This is quite natural

as simpler learners usually require less time for training and testing.


Table 6.1: Average test and train times in ms for data chunks in real data sets.

Elec Ozone Spam DonTrain Test Train Test Train Test Train Test

HOT 101.11 6.64 62.40 1.00 13160.98 1430.62 2168.54 17.19AWE 56.91 13.29 179.40 27.30 - - 3290.93 1074.98HT+DDM 4.04 0.58 19.50 7.80 7100.80 2921.81 53.17 5.55Tree+Win 20.51 1.44 15.60 7.80 - - 1.55 17.02ADE 75.11 15.31 241.80 54.60 - - 3292.64 1086.82ADEBag 72.22 15.31 237.90 54.60 - - 3351.61 1090.17

Table 6.2: Average test and train times in ms for data chunks in artificial data sets.

Led Wave Hyp SeaTrain Test Train Test Train Test Train Test

HOT 563.68 9.93 2573.86 51.27 5170.84 14.18 4876.41 6.44AWE 751.40 181.85 558.21 159.19 41.13 4.96 29.44 5.61HT+DDM 22.44 12.07 140.60 8.07 1998.79 7.95 2125.06 2.40Tree+Win 23.08 200.19 25.44 18.41 28.24 8.45 25.63 2.66ADE 803.93 185.39 636.59 162.06 240.22 49.39 90.05 18.84ADEBag 798.01 184.64 646.62 164.53 251.89 53.05 95.68 19.67

An interesting observation, that concerns not only time experiments, is that due to the

large number of attributes in the Spam data set (20,000 attributes) AWE, ADE, ADEBag, and

Tree+Win were unable to process that stream. To explain this, let us notice that the Naive Bayes

tree that was windowed is not an incremental learner and builds complex models more quickly

than Hoeffding trees. That is why it used more memory than the Java heap size limit. Similarly,

ensemble methods build 15 models instead of 1 and also fail to process the Spam data set with

the available memory.

0 s

50 ms

100 ms

150 ms

200 ms

250 ms

300 ms

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure 6.1: Chunk test time on the Waveform data set. Constant testing time forall the algorithms and a visible example of concept drift.

6.4. Results 41

0 s

1 s

2 s

3 s

4 s

5 s

6 s

7 s

8 s

9 s

10 s

0.0 0.5 M 1.0 M 1.5 M 2.0 M 2.5 M 3.0 M 3.5 M 4.0 M 4.5 M 5.0 M

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure 6.2: Chunk train time on the Waveform data set. An example of lineartraining time growth for HOT and HT+DDM.

Looking at test time plots, like the one presented in Figure 6.1, we can notice that testing times

remain close to constant through out the whole processing of the data stream. This observation

is true for all data sets (see plots B.9-B.16). On the other hand, classifier training is not constant

for all algorithms (see plots B.1-B.8). HOT shows clear linear growth of training time when no

sudden drift occurs. HT+DDM keeps low training time for smaller data sets, but also requires

linearly more time for large data sets. An example of the linear growth of training time for HOT

and HT+DDM can be seen in Figure 6.2.

The growth of training time for HOT and HT+DDM is due to the fact that we did not restrict

maximum model memory. We did so to test the algorithms in an environment optimal for their

performance. Hoeffding trees, like most decision trees, become more complex as they see more

examples. In periods of stability HOT and HT+DDM will successively grow bigger trees, thus

consuming more memory. The windowed tree, AWE, and ADE are built from a limited number of

data chunks and have an architectural memory limit. Theoretically ADE and ADEBag components

could become more complex in periods of stability, but as these experiments show this is practically

impossible.

Single classifiers seem to be the best choice when processing time is of crucial importance. For

large data sets, AWE, ADE and ADEBag come close second, as HOT and HT+DDM clearly lose

on this criterion, gradually requiring more and more processing time.

6.4.2 Memory usage

According to Table 6.3, HT+DDM used the least memory for small data sets, but consumed more

resources than AWE, ADE, and ADEBag for the three largest data streams. Sadly, due to bad

wrapping between MOA and WEKA objects, the framework outputted incorrect model size values

for the windowed tree. We believe, that its size should remain close to constant after a window of

examples. As this approach builds only one classifier, it is very probable that the windowed tree

would be the most memory effective approach for large data streams.


Table 6.3: Average trained model size for all data sets measured in MB.

Elec Ozone Spam Don Led Wave Hyp SeaHOT 0.41 0.08 29.78 13.97 2.76 12.27 18.49 24.45AWE 0.23 0.28 - 5.52 0.58 0.33 0.17 0.18HT+DDM 0.02 0.03 23.69 0.66 0.06 1.17 14.05 21.54Tree+Win - - - - - - - -ADE 0.36 0.36 - 5.64 0.88 0.92 0.86 0.46ADEBag 0.30 0.36 - 5.64 0.95 1.00 0.88 0.48

An interesting fact can be noticed when analyzing Figures 6.3 and 6.4. For small data sets (Elec,

Ozone, Spam, Don, Wave) the Hoeffding Option Tree clearly requires linearly more memory with

each processed data chunk. Interestingly, for the largest data sets (Hyp, Sea) HOT reaches a point

where the model’s size remains constant. We did not limit memory usage for these experiments,

so this must mean that the model’s ability to evolve has reached its limit. After expanding all 15

options for each node the tree might not be able grow any more.

0 B

2 MB

4 MB

6 MB

8 MB

10 MB

12 MB

14 MB

16 MB

18 MB

20 MB

0 20 k 40 k 60 k 80 k 100 k 120 k 140 k 160 k 180 k 200 k

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure 6.3: Memory usage on the Donation data set. Linear growth of HOT andrelatively large memory usage of ensemble methods compared to HT+DDM.

The analysis of all memory plots (Figures B.17-B.24) shows that memory requirements are

similar to training time requirements. HT+DDM and HOT need much more memory for larger

data sets than Tree+Win, AWE, ADE, and ADEBag, which processed the data streams using

constant memory.

6.4.3 Classification accuracy

Table 6.4 presents average classification accuracies obtained by the tested algorithms on all the

data sets. Average accuracy is a good measure for evaluating overall performance, but in evolving

environments, the classifiers reaction to change is of crucial importance. That is why we analyze

more thoroughly two plots presenting the reaction of algorithms to gradual and sudden concept

drift.

6.4. Results 43

0 B

5 MB

10 MB

15 MB

20 MB

25 MB

30 MB

35 MB

40 MB

45 MB

50 MB

0 2 M 4 M 6 M 8 M 10 M 12 M 14 M 16 M 18 M 20 M

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure 6.4: Memory usage on the SEA data set. An example of linear memorygrowth for HOT and HT+DDM, and a case of HOT reaching its option limit.

Table 6.4: Average accuracy for all data sets in percent.

Elec Ozone Spam Don Led Wave Hyp SeaHOT 74.37 91.60 75.25 94.35 70.68 82.57 85.07 89.81AWE 71.22 67.59 - 94.35 71.16 79.63 70.38 78.52HT+DDM 70.04 84.29 67.03 94.35 69.93 81.27 84.40 89.54Tree+Win 66.32 91.60 - 89.18 67.76 58.60 72.27 83.02ADE 74.92 76.56 - 94.34 71.41 82.26 84.72 88.61ADEBag 74.25 76.56 - 94.34 71.42 82.50 84.98 88.75

Figure 6.5 shows the accuracy plot for the Waveform data set, where gradual drift occurs

around the 300,000th example. Most of the tested classifiers react to the change with a short drop

in accuracy, which is later corrected after adjusting to the new concept. The only approach that

fails to successfully cope with gradual change is the windowed tree. Because its forgetting process

is static, it has no chance of unlearning outdated examples quickly, and since the drift is spread in

time, its decrease in accuracy is longterm.

More complex concept drift was introduced in the generation of the LED data set. We joined

two gradually evolving LED data sets with a sudden change. After half million examples we

replaced one data source with another. The algorithms’ reactions to this type of change are

presented Figure 6.6.

We see that all classifiers become less accurate after the sudden drift. Once again the windowed

tree suffers the most, but the accuracy drop is not as drastic as it was for the Waveform data set.

For this, more complex, concept drift other algorithms also have problems with adjusting to change.

ADE and ADEBag seem to cope best with this situation. HOT, which performed well before the

drift, falls down even below the level of AWE.

In periods of stability, HOT and HT+DDM grow accurate but complex structures, which are

later difficult to rebuild. AWE, ADE, and ADEBag are modular, allow quick substitution of

components, and therefore quick reaction to sudden drifts.


50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure 6.5: Accuracy on the Waveform data set.

50 %

55 %

60 %

65 %

70 %

75 %

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure 6.6: Accuracy on the LED data set.

It is hard to select one most accurate classifier from the compared algorithms. For large data

streams with little and mostly gradual concept drift, HOT gives best results with ADEBag and

HT+DDM being close second. For the smallest data set, the windowed tree works best. For the

most rapidly changing streams (Elec, LED), the ADE approaches proved to be the most flexible.

We clearly see that depending on time and memory requirements, as well as predicted types of

drift, different approaches work better.

6.5. Remarks 45

6.5 Remarks

Recently, Bifet et al. [10] showed a comparison of many online learning algorithms, including HOT

and HT+DDM. Their results, concerning these two classifiers, our similar to ours: HT+DDM is

faster and much more memory efficient than HOT. The only difference is that on the majority

of their data sets HT+DDM classified better, while our experiments proved HOT to be more

accurate. This difference may be dependent on the data streams they used for evaluation. During

their tests, Bifet et al. generated two streams of 10 million and two of 1 million data points,

whereas we tested a stream up to 20 million examples.

Our experiments, like most data stream classifier evaluations [10, 44, 30, 78], measure memory

requirements of algorithms rather than examining their performance in environments with insuffi-

cient RAM. Additional tests need to be performed in order to see if HOT is equally accurate when

it has only the memory used by ADE or HT+DDM.

The comparison of AWE with HOT and HT+DDM was, to our knowledge, never done before.

Although AWE is more flexible when handling sudden drift presented in LED, HOT and HT+DDM

outperform the weighted ensemble on all the other data sets. AWE was previously compared with

a decision tree built on a window of examples [78] with results similar to ours. In most cases,

weighting chunks of examples according to distribution improves classifier accuracy.

The Accuracy Diversified Ensemble is our methodological contribution and therefore was not

previously compared with any other algorithm. ADE was more accurate than AWE on all but

one data set whilst still requiring constant processing time and memory. We find it to be a

promising compromise between accurate but time and memory costly HOT and HT+DDM, and the

slightly lighter but much less accurate AWE and simple windowing technique. Further experiments,

especially simulating limited memory environments, need to be performed to fully confirm our

approach’s usefulness.

Discussing the design of the Accuracy Diversified Ensemble, we were not sure if bagging was

really the best choice to additionally diversify ensemble members. We decided to test our approach

in two versions - with and without online sampling. Table 6.5 compares the average performance

of both methods on all data sets.

Table 6.5: Average accuracy, time, and memory performance for ADE and ADEBag.

Chunk train Chunk test Memory AccuracyADE - - - +0.11%ADEBag +3.07% +5.59% +4.90% -

We see that the ADE version with bagging requires more time and memory. This is a small

overhead caused by the sampling process. What is surprising, bagging does not improve accuracy

on an average. ADEBag had better accuracy on 4 data streams, was equally good three times, and

lost only on the Electricity data set. The fact that ADE was much better on that last mentioned

data set made the result in Table 6.5 unfavorable for the version with bagging.

For the majority of data sets bagging did not change or improved the accuracy of the ADE very

slightly. We believe that adding more diversity ADE components could make the classifier more

accurate, but different methods than bagging (e.g. the use of different base classifiers or boosting)

have to be explored.

Chapter 7

Conclusions

In this thesis we addressed the problem of mining time evolving data streams. We defined the main

characteristics of data streams and discussed different types of changes that occur in streaming

data. During our discussion we focused on non-random class definition changes called concept

drift. We reviewed existing single classifier and ensemble approaches to mining data streams with

concept drift. Our analysis led to the development of a new algorithm called Accuracy Diversified

Ensemble, which is based on our critique of the earlier developed Accuracy Weighted Ensemble.

Moreover, one of the aims of this thesis was the evaluation of the Massive Online Analysis

framework as a software environment for research on learning from evolving data streams. MOA is

a relatively young project and still needs some work to become a reliable data stream development

tool. MOA’s modular structure and dynamic graphical interface creation similar to WEKA are

good decisions. But even with its close relation to WEKA, MOA sometimes has trouble with

communicating with its relative. Not being able to correctly determine model size when using a

WEKA classifier is one of the cases. A different limitation is the inability of a MOA wrapper in

WEKA to handle class attributes other than nominal.

The authors of MOA wrote a manual [12] and a technical report [11], which document the

frameworks usage and theoretical foundations. On the other hand, MOA’s source code is not

documented. This makes code re-usage more difficult and does not facilitate the frameworks

future development. A different technical remark concerns the result generation in MOA. When

evaluating a classifier in MOA, results appear incrementally as the stream is being processed. If

at one point of the data stream the classifier or framework crashes, all results are lost. This is

an implementation decision, as results acquired prior to the crash could be easily saved. In our

opinion, outputting partial results with additional error information would be a better solution.

MOA has had only one stable release to date. Nevertheless, it contains 8 of the most popular

data stream generators, 11 classifiers, 3 evaluation methods and many other functions implemented

in a single package. MOA is surely the most comprehensive benchmark environment dedicated

solely to data stream mining. Despite its drawbacks, there is a chance that with the effort of its

authors it will become a commonly known framework for mining concept drifting streams, like

WEKA is for traditional data mining.

We extended the MOA framework by implementing our Accuracy Diversified Ensemble, the

Accuracy Weighted Ensemble, and a data chunk evaluation method to experimentally compare

selected single and ensemble data stream classifiers. We found that the classifier proposed by us

requires constant time and memory, and gives comparably good or better accuracy than more

resource expensive methods. Additionally, during the evaluation of our algorithm, we verified the

47

48 Conclusions

usefulness of bagging as a way of additionally diversifying our ensemble. We found that bagging

provided time and memory overhead without improving accuracy.

Our experimental results can be partially compared with previous publications. Findings con-

cerning the windowed classifier, Hoeffding Option Tree and Hoeffding Tree with a drift detector,

are similar to those stated by Bifet et al. [10]. The main difference in results is that in our setting

the Hoeffding Option Tree was practically always more accurate than the single Hoeffding Tree

with a drift detector, while the tests of Bifet et al. showed a slight domination of the single classi-

fier. Our comparison of the Accuracy Weighted Ensemble and the Accuracy Diversified Ensemble

with different algorithms provides new, previously unpublished results.

The analysis of algorithms reviewed in this thesis shows that mining of data streams with

concept drift is growing into a new branch of knowledge discovery, with its own unique research

problems. The need for online processing with time and memory constraints forces researchers to

focus on resource usage while designing accurate classifiers. Additionally, concept drift introduces

the requirement for a forgetting mechanism that dynamically removes outdated data. The MOA

framework proposes a way of unifying the implementation and evaluation of algorithms that tackle

these problems. This shows that data stream mining is becoming a mature field of study aiming

to meet the challenges of the data stream phenomenon in real life applications.

As future work, we plan to carry out additional experiments to analyze the relationship between

data chunk size and the performance of bagging in the Accuracy Diversified Ensemble. Further-

more, since bagging did not prove to provide additional accuracy to our method, we also plan to

explore different ways of diversifying ensemble members like the use of heterogeneous base learners

or boosting. Finally, in future experiments we plan to compare a larger number of algorithms to

take a broader look at the performance of the most recent stream mining techniques.

Appendix A

Implementation details

In this appendix we describe the installation process and the most important implementation de-

tails of the software attached to this thesis. Section A.1 lists software requirements and presents

the installation process. Section A.2 discusses the implementation of the Attribute Filter, Section

A.3 the implementation of the Data Chunk Evaluation procedure, Section A.4 the implementa-

tion of the Accuracy Weighted Ensemble algorithm. Finally, in Section A.5 we shortly lists the

parameters of our algorithm - the Accuracy Diversified Ensemble.

A.1 MOA Installation

MOA is written in Java and requires at least JRE 5 for running and Java 5 SDK for development.

Due to Java’s portability, MOA can be run on Windows, Mac and Unix/Linux systems. The

framework requires the following files:

• moa.jar (http://sourceforge.net/projects/moa-datastream/)

• weka.jar (http://sourceforge.net/projects/weka/)

• sizeofag.jar (http://www.jroller.com/resources/m/maxim/sizeofag.jar)

An extended version of MOA is provided with this thesis. To run a MOA task from the command

line on Windows, use the following pattern:

java.exe -cp <moaFolder>\moa.jar -javaagent:<sizeofagFolder>\sizeofag.jarmoa.DoTask <taskName> <taskParameters>

For example:

java.exe -cp "C:\moa.jar" -javaagent:"C:\sizeofag.jar" moa.DoTaskEvaluateInterleavedChunks -l AccuracyWeightedEnsemble -i 1000000

-s generators.WaveformGenerator

To run the graphical interface:

java -cp <moaFolder>\moa.jar -javaagent:<sizeofagFolder>\sizeofag.jarmoa.gui.TaskLauncher

The MOA distribution comes with two run files (moa.sh and moa.bat), which ease the execution

of the graphical interface. More details concerning the use of the graphical interface and command

line can be found in the Masive Online Analysis Manual [12].

49

http://sourceforge.net/projects/moa-datastream/

http://sourceforge.net/projects/weka/

http://www.jroller.com/resources/m/maxim/sizeofag.jar

50 Implementation details

A.2 Attribute filtering

The Attribute Filter allows to select the attributes of each example that will be passed to the

learner. For streams with many features the filtering process can significantly slow down stream

processing, but allows to decrease model size. The attributes to be removed are specified by indexes

of the unwanted attributes. The user can list single attributes separated by commas (1,4,12) or

define ranges of attributes (5-8,13-41).

The filter is implemented in the file RemoveAttributesFilter.java placed in the moa.

streams.filters package of the MOA framework. RemoveAttributesFilter extends the

AbstractStreamFilter class and can be only used with the Filtered-Stream method.

Parameters:

• -a: Indexes of attributes to be removed (-1 = no filtering)

Example usage:

java.exe -cp moa.jar -javaagent:sizeofag.jar moa.DoTask LearnModel

-s (FilteredStream -s (ArffFileStream -f (ozone.arff))

-f (RemoveAttributesFilter -a 1))

A.3 Data chunk evaluation

The Data Chunk Evaluation procedure evaluates a classifier on a stream by testing then training

with consecutive data chunks. In the implementation, accuracy, time, and memory are updated

with each example in the data chunk and later averaged. Similarly, in the training phase, the

classifier is trained incrementally (when possible), but the created model is available for further

processing after the whole data chunk rather than after each example. Thus, the sampling interval

should be equal or greater than the data chunk size.

The evaluation method is implemented in the file EvaluateInterleavedChunks.java placed

in the moa.tasks package of the MOA framework. EvaluateInterleavedChunks extends the

MainTask class and is available directly from the task selection combo box in the graphical inter-

face.

Parameters:

• -l : Classifier to train

• -s : Stream to learn from

• -e : Classification performance evaluation method

• -i : Maximum number of instances to test/train on (-1 = no limit)

• -c : Number of instances in a data chunk

• -t : Maximum number of seconds to test/train for (-1 = no limit)

• -f : How many instances between samples of the learning performance

• -b : Maximum byte size of model (-1 = no limit)

• -q : How many instances between memory bound checks

• -d : File to append intermediate csv results to

A.4. Accuracy Weighted Ensemble 51

• -O : File to save the final result of the task to

Example usage:

java.exe -cp moa.jar -javaagent:sizeofag.jar moa.DoTask

EvaluateInterleavedChunks -l HoeffdingTreeNB -s generators.WaveformGenerator

-i 1000000 -c 1000

A.4 Accuracy Weighted Ensemble

The Accuracy Weighted Ensemble was implemented according to the pseudo-code listed in Algo-

rithm 4.3. Instance based pruning and other enhancements for cost-sensitive applications [78] were

not implemented. The number of folds parameter is used only for candidate classifiers built from

the most recent data chunk. Previously built classifiers are tested on the entire, most recent data

chunk. The train and test sets for cross-validation are created by the trainCV() and testCV()

methods implemented in WEKA.

The classifier is implemented in the file AccuracyWeightedEnsemble.java placed in the

moa.classifiers package of the MOA framework. AccuracyWeightedEnsemble extends the

AbstractClassifier class and is available from the learner selection dialog in the graphical in-

terface.

Parameters:

• -l : Member classifier type

• -n : Maximum number of classifier in an ensemble

• -r : Maximum number of classifiers to store and choose from when creating an ensemble

• -c : Chunk size used for member creation and evaluation

• -f : Number of cross-validation folds for candidate classifier testing

Example usage:


EvaluateInterleavedChunks -l (AccuracyWeightedEnsemble -n 20 -c 1000)

-s generators.WaveformGenerator -i 1000000 -c 1000

A.5 Accuracy Diversified Ensemble

The Accuracy Diversified Ensemble was implemented according to the pseudo-code listed in Algo-

rithm 4.5 with an option that determines whether or not to perform bagging.

The classifier is implemented in the file AccuracyDiversifiedEnsemble.java placed in the

moa.classifiers package of the MOA framework. AccuracyDiversifiedEnsemble extends the

AccuracyWeightedEnsemble class and shares all its parameters.

Parameters:

• Same parameters as AccuracyWeightedEnsemble

• -b : If set, no bagging is performed

• -a : If set, a bagged example is always presented to the classifier (always add 1 to the sampled

Poison distribution)

52 Implementation details

Example usage:


EvaluateInterleavedChunks -l (AccuracyDiversifiedEnsemble -a -n 20 -c 1000)

-s generators.WaveformGenerator -i 1000000 -c 1000

Appendix B

Additional figures

0 s

100 ms

200 ms

300 ms

400 ms

500 ms

600 ms

0 5 k 10 k 15 k 20 k 25 k 30 k

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.1: Chunk train time on the Electricity data set.

53

54 Additional figures

0 s

50 ms

100 ms

150 ms

200 ms

250 ms

300 ms

350 ms

500 1000 1500 2000 2500

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.2: Chunk train time on the Ozone data set.

0 s

5 s

10 s

15 s

20 s

25 s

30 s

0 1 k 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k

Tra

in ti

me

Processed instances

HOTHT + DDM

Figure B.3: Chunk train time on the Spam data set.

Additional figures 55

0 s

500 ms

1000 ms

1500 ms

2000 ms

2500 ms

3000 ms

3500 ms

4000 ms

0 20 k 40 k 60 k 80 k 100 k 120 k 140 k 160 k 180 k 200 k

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.4: Chunk train time on the Donation data set.

0 s

200 ms

400 ms

600 ms

800 ms

1000 ms

1200 ms

1400 ms

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.5: Chunk train time on the LED data set.


0 s

1 s

2 s

3 s

4 s

5 s

6 s

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.6: Chunk train time on the Waveform data set.

0 s

1 s

2 s

3 s

4 s

5 s

6 s

7 s

8 s

9 s

10 s

0.0 0.5 M 1.0 M 1.5 M 2.0 M 2.5 M 3.0 M 3.5 M 4.0 M 4.5 M 5.0 M

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.7: Chunk train time on the Hyperplane data set.


0 s

1 s

2 s

3 s

4 s

5 s

6 s

7 s

8 s

9 s

0 2 M 4 M 6 M 8 M 10 M 12 M 14 M 16 M 18 M 20 M

Tra

in ti

me

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.8: Chunk train time on the SEA data set.

0 s

5 ms

10 ms

15 ms

20 ms

25 ms

30 ms

35 ms

0 5 k 10 k 15 k 20 k 25 k 30 k

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.9: Chunk test time on the Electricity data set.


0 s

10 ms

20 ms

30 ms

40 ms

50 ms

60 ms

70 ms

80 ms

90 ms

100 ms

500 1000 1500 2000 2500

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.10: Chunk test time on the Ozone data set.

0 s

500 ms

1000 ms

1500 ms

2000 ms

2500 ms

3000 ms

3500 ms

4000 ms

4500 ms

0 1 k 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k

Tes

t tim

e

Processed instances

HOTHT + DDM

Figure B.11: Chunk test time on the Spam data set.


0 s

200 ms

400 ms

600 ms

800 ms

1000 ms

1200 ms

1400 ms

1600 ms

0 20 k 40 k 60 k 80 k 100 k 120 k 140 k 160 k 180 k 200 k

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.12: Chunk test time on the Donation data set.

0 s

50 ms

100 ms

150 ms

200 ms

250 ms

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.13: Chunk test time on the LED data set.


0 s

50 ms

100 ms

150 ms

200 ms

250 ms

300 ms

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.14: Chunk test time on the Waveform data set.

0 s

10 ms

20 ms

30 ms

40 ms

50 ms

60 ms

70 ms

0.0 0.5 M 1.0 M 1.5 M 2.0 M 2.5 M 3.0 M 3.5 M 4.0 M 4.5 M 5.0 M

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.15: Chunk test time on the Hyperplane data set.


0 s

5 ms

10 ms

15 ms

20 ms

25 ms

30 ms

0 2 M 4 M 6 M 8 M 10 M 12 M 14 M 16 M 18 M 20 M

Tes

t tim

e

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.16: Chunk test time on the SEA data set.

0 B

100 kB

200 kB

300 kB

400 kB

500 kB

600 kB

700 kB

0 5 k 10 k 15 k 20 k 25 k 30 k

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.17: Memory usage on the Electricity data set.


0 B

100 kB

200 kB

300 kB

400 kB

500 kB

600 kB

500 1000 1500 2000 2500

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.18: Memory usage on the Ozone data set.

5 MB

10 MB

15 MB

20 MB

25 MB

30 MB

35 MB

40 MB

45 MB

0 1 k 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k

Mem

ory

Processed instances

HOTHT + DDM

Figure B.19: Memory usage on the Spam data set.


0 B

2 MB

4 MB

6 MB

8 MB

10 MB

12 MB

14 MB

16 MB

18 MB

20 MB

0 20 k 40 k 60 k 80 k 100 k 120 k 140 k 160 k 180 k 200 k

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.20: Memory usage on the Donation data set.

0 B

1 MB

2 MB

3 MB

4 MB

5 MB

6 MB

7 MB

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.21: Memory usage on the LED data set.


0 B

5 MB

10 MB

15 MB

20 MB

25 MB

30 MB

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.22: Memory usage on the Waveform data set.

0 B

5 MB

10 MB

15 MB

20 MB

25 MB

30 MB

35 MB

0.0 0.5 M 1.0 M 1.5 M 2.0 M 2.5 M 3.0 M 3.5 M 4.0 M 4.5 M 5.0 M

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.23: Memory usage on the Hyperplane data set.


0 B

5 MB

10 MB

15 MB

20 MB

25 MB

30 MB

35 MB

40 MB

45 MB

50 MB

0 2 M 4 M 6 M 8 M 10 M 12 M 14 M 16 M 18 M 20 M

Mem

ory

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.24: Memory usage on the SEA data set.

50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

90 %

95 %

0 5 k 10 k 15 k 20 k 25 k 30 k

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.25: Accuracy on the Electricity data set.


50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

90 %

95 %

500 1000 1500 2000 2500

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.26: Accuracy on the Ozone data set.

50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

0 1 k 2 k 3 k 4 k 5 k 6 k 7 k 8 k 9 k

Acc

urac

y

Processed instances

HOTHT + DDM

Figure B.27: Accuracy on the Spam data set.


50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

90 %

95 %

0 20 k 40 k 60 k 80 k 100 k 120 k 140 k 160 k 180 k 200 k

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.28: Accuracy on the Donation data set.

50 %

55 %

60 %

65 %

70 %

75 %

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.29: Accuracy on the LED data set.


50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

0 100 k 200 k 300 k 400 k 500 k 600 k 700 k 800 k 900 k 1 M

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.30: Accuracy on the Waveform data set.

50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

90 %

0.0 0.5 M 1.0 M 1.5 M 2.0 M 2.5 M 3.0 M 3.5 M 4.0 M 4.5 M 5.0 M

Acc

urac

y

Processed instances

HOTAWE

HT + DDMTree + Win

ADEADEBag

Figure B.31: Accuracy on the Hyperplane data set.


50 %

55 %

60 %

65 %

70 %

75 %

80 %

85 %

90 %

0 2 M 4 M 6 M 8 M 10 M 12 M 14 M 16 M 18 M 20 M

Acc

urac

y

Processed instances

HOTAWE

HT + DDMHT + Window

ADEADEBag

Figure B.32: Accuracy on the SEA data set.

Bibliography

[1] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the

frequency moments. J. Comput. Syst. Sci., 58(1):137–147, 1999.

[2] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models

and issues in data stream systems. In Lucian Popa, editor, PODS, pages 1–16. ACM, 2002.

[3] Manuel Baena-Garcia, Jose del Campo-Avila, Raul Fidalgo, Albert Bifet, Ricard Gavalda, and

Rafael Morales-Bueno. Early drift detection method. In In Fourth International Workshop

on Knowledge Discovery from Data Streams, 2006.

[4] Robert M. Bell, Yehuda Koren, and Chris Volinsky. The bellkor solution to the netflix prize,

2008. http://www.research.att.com/~volinsky/netflix/.

[5] Albert Bifet. Adaptive learning and mining for data streams and frequent patterns. PhD

thesis, Universitat Politecnica de Catalunya, 2009.

[6] Albert Bifet and Ricard Gavalda. Kalman filters and adaptive windows for learning in data

streams. In Ljupco Todorovski, Nada Lavrac, and Klaus P. Jantke, editors, Discovery Science,

volume 4265 of Lecture Notes in Computer Science, pages 29–40. Springer, 2006.

[7] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive windowing.

In SDM. SIAM, 2007.

[8] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. Moa: Massive online

analysis. Journal of Machine Learning Research, 11:1601–1604, 2010.

[9] Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, and Ricard Gavalda. Improving adaptive

bagging methods for evolving data streams. In Zhi-Hua Zhou and Takashi Washio, editors,

ACML, volume 5828 of Lecture Notes in Computer Science, pages 23–37. Springer, 2009.

[10] Albert Bifet, Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavalda.

New ensemble methods for evolving data streams. In John F. Elder IV, Francoise Fogelman-

Soulie, Peter A. Flach, and Mohammed Javeed Zaki, editors, KDD, pages 139–148. ACM,

2009.

[11] Albert Bifet and Richard Kirkby. Data stream mining: a practical approach. Technical report,

The University of Waikato, August 2009.

[12] Albert Bifet and Richard Kirkby. Massive Online Analysis, August 2009.

[13] Remco R. Bouckaert. Voting massive collections of bayesian network classifiers for data

streams. In Abdul Sattar and Byeong Ho Kang, editors, Australian Conference on Artificial

Intelligence, volume 4304 of Lecture Notes in Computer Science, pages 243–252. Springer,

2006.

71

http://www.research.att.com/~volinsky/netflix/

72 Bibliography

[14] Max Bramer. Principles of Data Mining. Springer, 2007.

[15] Leo Breiman. Pasting small votes for classification in large databases and on-line. Machine

Learning, 36(1-2):85–103, 1999.

[16] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression

Trees. Wadsworth, 1984.

[17] Darryl Charles, Aphra Kerr, Moira McAlister, Michael McNeill, Julian Kucklich, Michaela M.

Black, Adrian Moore, and Karl Stringer. Player-centred game design: Adaptive digital games.

In DIGRA Conf., 2005.

[18] Surajit Chaudhuri, Rajeev Motwani, and Vivek R. Narasayya. On random sampling over

joins. In Alex Delis, Christos Faloutsos, and Shahram Ghandeharizadeh, editors, SIGMOD

Conference, pages 263–274. ACM Press, 1999.

[19] Edith Cohen and Martin J. Strauss. Maintaining time-decaying stream aggregates. J. Algo-

rithms, 59(1):19–36, 2006.

[20] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statis-

tics over sliding windows (extended abstract). In SODA, pages 635–644, 2002.

[21] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In KDD, pages 71–80,

2000.

[22] Steve Donoho. Early detection of insider trading in option markets. In Won Kim, Ron Kohavi,

Johannes Gehrke, and William DuMouchel, editors, KDD, pages 420–429. ACM, 2004.

[23] Wei Fan. Systematic data selection to mine concept-drifting data streams. In Won Kim, Ron

Kohavi, Johannes Gehrke, and William DuMouchel, editors, KDD, pages 128–137. ACM,

2004.

[24] Wei Fan, Yi an Huang, Haixun Wang, and Philip S. Yu. Active mining of data streams. In

Michael W. Berry, Umeshwar Dayal, Chandrika Kamath, and David B. Skillicorn, editors,

SDM. SIAM, 2004.

[25] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D.

Ullman. Computing iceberg queries efficiently. In Ashish Gupta, Oded Shmueli, and Jennifer

Widom, editors, VLDB, pages 299–310. Morgan Kaufmann, 1998.

[26] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to

knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining,

pages 1–34. American Association for Artificial Intelligence, 1996.

[27] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy,

editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[28] Francisco J. Ferrer-Troyano, Jesus S. Aguilar-Ruiz, and Jose Cristóbal Riquelme Santos. Dis-

covering decision rules from numerical data streams. In Hisham Haddad, Andrea Omicini,

Roger L. Wainwright, and Lorie M. Liebrock, editors, SAC, pages 649–653. ACM, 2004.

[29] Mohamed Medhat Gaber and Joao Gama. State of the art in data streams mining. ECML,

2007.

73

[30] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In SBIA

Brazilian Symposium on Artificial Intelligence, page 286–295, 2004.

[31] Joao Gama and Mohamed Medhat Gaber, editors. Learning from Data Streams: Processing

techniques in Sensor Networks. Springer, 2007.

[32] Joao Gama, Ricardo Rocha, and Pedro Medas. Accurate decision trees for mining high-speed

data streams. In Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos,

editors, KDD, pages 523–528. ACM, 2003.

[33] Joao Gama and Pedro Pereira Rodrigues. Stream-based electricity load forecast. In Joost N.

Kok, Jacek Koronacki, Ramon López de Mantaras, Stan Matwin, Dunja Mladenic, and An-

drzej Skowron, editors, PKDD, volume 4702 of Lecture Notes in Computer Science, pages

446–453. Springer, 2007.

[34] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. Mining data streams under

block evolution. SIGKDD Explorations, 3(2):1–10, 2002.

[35] John F. Gantz, David Reinsel, Christopeher Chute, Wolfgang Schlichting, Stephen Minton,

Anna Toncheva, and Alex Manfrediz. The expanding digital universe: An updated forecast

of worldwide information growth through 2011. Technical report, IDC Information and Data,

2008.

[36] Anna C. Gilbert, Sudipto Guha, Piotr Indyk, Yannis Kotidis, S. Muthukrishnan, and Martin

Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC,

pages 389–398, 2002.

[37] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile sum-

maries. In SIGMOD Conference, pages 58–66, 2001.

[38] Sudipto Guha and Nick Koudas. Approximating a data stream for querying and estimation:

Algorithms and performance evaluation. In ICDE, pages 567–. IEEE Computer Society, 2002.

[39] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering data

streams. In FOCS, pages 359–366, 2000.

[40] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Cure: An efficient clustering algorithm

for large databases. Inf. Syst., 26(1):35–58, 2001.

[41] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H.

Witten. The weka data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18,

2009.

[42] Michael Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The

University of South Wales, 1999.

[43] Constantinos S. Hilas. Designing an expert system for fraud detection in private telecommu-

nications networks. Expert Syst. Appl., 36(9):11559–11569, 2009.

[44] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In

KDD, pages 97–106, 2001.

[45] Elena Ikonomovska, Suzana Loskovska, and Dejan Gjorgjevik. A survey of stream data mining,

2005.

74 Bibliography

[46] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local

experts. Neural Computation, 3:79–87, 1991.

[47] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, and

Torsten Suel. Optimal histograms with quality guarantees. In Ashish Gupta, Oded Shmueli,

and Jennifer Widom, editors, VLDB, pages 275–286. Morgan Kaufmann, 1998.

[48] Ruoming Jin and Gagan Agrawal. Efficient decision tree construction on streaming data. In

Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos, editors, KDD, pages

571–576. ACM, 2003.

[49] Ioannis Katakis, Grigorios Tsoumakas, Evangelos Banos, Nick Bassiliades, and Ioannis P.

Vlahavas. An adaptive personalized news dissemination system. J. Intell. Inf. Syst., 32(2):191–

212, 2009.

[50] Mark G. Kelly, David J. Hand, and Niall M. Adams. The impact of changing populations on

classifier performance. In KDD, pages 367–371, 1999.

[51] Richard Kirkby. Improving Hoeffding Trees. PhD thesis, Department of Computer Science,

University of Waikato, 2007.

[52] Ralf Klinkenberg and Thorsten Joachims. Detecting concept drift with support vector ma-

chines. In Pat Langley, editor, ICML, pages 487–494. Morgan Kaufmann, 2000.

[53] Ron Kohavi and Clayton Kunz. Option decision trees with majority votes. In Douglas H.

Fisher, editor, ICML, pages 161–169. Morgan Kaufmann, 1997.

[54] Ludmila I. Kuncheva. Classifier ensembles for changing environments. In Fabio Roli, Josef

Kittler, and Terry Windeatt, editors, Multiple Classifier Systems, volume 3077 of Lecture

Notes in Computer Science, pages 1–15. Springer, 2004.

[55] Ludmila I. Kuncheva. Classifier ensembles for detecting concept change in streaming data:

Overview and perspectives. In 2nd Workshop SUEMA 2008 (ECAI 2008), pages 5–10, 2008.

[56] Terran Lane and Carla E. Brodley. Temporal sequence learning and data reduction for

anomaly detection. ACM Trans. Inf. Syst. Secur., 2(3):295–331, 1999.

[57] Andreas D. Lattner, Andrea Miene, Ubbo Visser, and Otthein Herzog. Sequential pattern

mining for situation and behavior prediction in simulated robotic soccer. In Ansgar Bredenfeld,

Adam Jacoff, Itsuki Noda, and Yasutake Takahashi, editors, RoboCup, volume 4020 of Lecture

Notes in Computer Science, pages 118–129. Springer, 2005.

[58] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold

algorithm. Machine Learning, 2(4):285–318, 1987.

[59] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput.,

108(2):212–261, 1994.

[60] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data

streams. In VLDB, pages 346–357. Morgan Kaufmann, 2002.

[61] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham.

A multi-partition multi-chunk ensemble technique to classify concept-drifting data streams.

In Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone, and Tu Bao Ho, editors,

PAKDD, volume 5476 of Lecture Notes in Computer Science, pages 363–375. Springer, 2009.

75

[62] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Dynamic maintenance of wavelet-based

histograms. In Amr El Abbadi, Michael L. Brodie, Sharma Chakravarthy, Umeshwar Dayal,

Nabil Kamel, Gunter Schlageter, and Kyu-Young Whang, editors, VLDB, pages 101–110.

Morgan Kaufmann, 2000.

[63] Oleksiy Mazhelis and Seppo Puuronen. Comparing classifier combining techniques for mobile-

masquerader detection. In ARES, pages 465–472. IEEE Computer Society, 2007.

[64] Joao Mendes-Moreira, Carlos Soares, Alıpio Mario Jorge, and Jorge Freire de Sousa. The

effect of varying parameters and focusing on bus travel time prediction. In Thanaruk Theera-

munkong, Boonserm Kijsirikul, Nick Cercone, and Tu Bao Ho, editors, PAKDD, volume 5476

of Lecture Notes in Computer Science, pages 689–696. Springer, 2009.

[65] Nikunj C. Oza. Online ensemble learning. In AAAI/IAAI, page 1109. AAAI Press / The

MIT Press, 2000.

[66] Nikunj C. Oza. Online Ensemble Learning. PhD thesis, The University of California, Berkeley,

CA, Sep 2001.

[67] E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.

[68] Animesh Patcha and Jung-Min Park. An overview of anomaly detection techniques: Existing

solutions and latest technological trends. Computer Networks, 51(12):3448–3470, 2007.

[69] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[70] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[71] S. W. Roberts. Control chart tests based on geometric moving averages. Technometrics,

42(1):97–101, 2000.

[72] Jeffrey C. Schlimmer and Richard H. Granger. Incremental learning from noisy data. Machine

Learning, 1(3):317–354, 1986.

[73] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale

classification. In KDD, pages 377–382, 2001.

[74] Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James

Diebel, Philip Fong, John Gale, Morgan Halpenny, Kenny Lau, Celia Oakley, Mark Palatucci,

Vaughan Pratt, Pascal Stang, Sven Strohb, Cedric Dupont, Lars erik Jendrossek, Christian

Koelen, Charles Markey, Carlo Rummel, Joe Van Niekerk, Eric Jensen, Gary Bradski, Bob

Davies, Scott Ettinger, Adrian Kaehler, Ara Nefian, and Pamela Mahoney. The robot that

won the darpa grand challenge. Journal of Field Robotics, 23:661–692, 2006.

[75] Alexey Tsymbal, Mykola Pechenizkiy, Padraig Cunningham, and Seppo Puuronen. Dynamic

integration of classifiers for handling concept drift. Information Fusion, 9(1):56–68, 2008.

[76] Ranga Raju Vatsavai, Olufemi A. Omitaomu, Joao Gama, Nitesh V. Chawla, Mohamed Med-

hat Gaber, and Auroop R. Ganguly. Knowledge discovery from sensor data (sensorkdd).

SIGKDD Explorations, 10(2):68–73, 2008.

[77] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–

57, 1985.

76 Bibliography

[78] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams

using ensemble classifiers. In Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos

Faloutsos, editors, KDD, pages 226–235. ACM, 2003.

[79] Weka Machine Learning Project. Weka. URL http://www.cs.waikato.ac.nz/˜ml/weka.

[80] Gerhard Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts.

In Machine Learning, pages 69–101, 1996.

[81] Kun Zhang, Wei Fan, Xiaojing Yuan, Ian Davidson, and Xiangshang Li. Forecasting skewed

biased stochastic ozone days: Analyses and solutions. In ICDM, pages 753–764. IEEE, IEEE

Computer Society, 2006.

[82] Indre Zliobaite. Combining time and space similarity for small size learning under concept

drift. In Jan Rauch, Zbigniew W. Ras, Petr Berka, and Tapio Elomaa, editors, ISMIS, volume

5722 of Lecture Notes in Computer Science, pages 412–421. Springer, 2009.

[83] Indre Zliobaite. Instance selection method (fish) for classifier training under concept drift.

Technical report, Vilnius University, Faculty of Mathematics and Informatic, 2009.

[84] Indre Zliobaite. Learning under concept drift: an overview. Technical report, Vilnius Univer-

sity, Faculty of Mathematics and Informatic, 2009.

[85] Indre Zliobaite. Adaptive training set formation. PhD thesis, Vilnius University, 2010.

Streszczenie

W dobie społeczeństwa informacyjnego, użytkownicy komputerów przyzwyczajeni są do groma-

dzenia i współdzielenia danych niemal w dowolnym miejscu i czasie. Portale społecznościowe, ban-

kowość elektroniczna, usługi telekomunikacyjne, czy współdzielenie filmów oraz muzyki to tylko

niektóre ze zjawisk powodujących gwałtowny wzrost liczby przechowywanych i przetwarzanych

danych. Raport sprzed dwóch lat [35] szacował, że rozmiar świata danych elektronicznych wyniósł

w 2007 roku 281 miliardów gigabajtów i że rozmiar ten wzrośnie pięciokrotnie do roku 2011. Ten

sam raport zakłada, że do końca 2010 roku połowa z wytwarzanych danych nie będzie trwale zapi-

sywana na żadnych nośnikach. Powodem tego są po części nowe rodzaje zastosowań informatyki,

w których przetwarzane informacje przyjmują postać strumieni danych.

Strumień danych może być postrzegany jako sekwencja elementów (np. billingów rozmów te-

lefonicznych, odwiedzin strony internetowej, odczytów z czujników), które napływają w sposób

ciągły w zmiennych interwałach czasu. Strumienie, jak inne duże wolumeny danych, mogą być

przedmiotem eksploracji danych i odkrywania wiedzy, czyli procesu poszukiwania nowych, nie-

trywialnych i potencjalnie użytecznych wzorców z danych [26, 14]. Eksploracja strumieni danych

przedstawia nowe wyzwania w stosunku do tradycyjnie pojętej eksploracji danych. W rozważanym

problemie rozmiar przetwarzanych danych jest uznawany za zbyt duży do trwałego przechowywa-

nia, a tempo napływania nowych elementów nie jest znane. Stąd potrzeba rozwijania nowych,

dedykowanych podejść do eksploracji strumieni.

Niniejsza praca podejmuje tematykę eksploracji strumieni danych ze zmienną definicją klas.

Jako jeden z podstawowych celów przyjęto dokonanie przeglądu i eksperymentalnego porówna-

nia istniejących metod klasyfikacji strumieni danych. Zaproponowano również nowy algorytm

oparty na krytyce istniejącego rozwiązania i skonfrontowano jego trafność, wymagania czasowe

oraz pamięciowe z innymi metodami. Jako dodatkowe cele przyjęto implementacyjne rozszerzenie

i wykorzystanie do eksperymentów środowiska MOA, które jest nowym projektem i nie zostało

jeszcze szerzej opisane przez osoby inne niż przez autorów. Postanowiono zbadać przydatność

MOA do tworzenia i porównywania nowych rozwiązań z dziedziny eksploracji strumieni danych.

Ze względu na charakterystykę strumieni danych, strumieniowe algorytmy eksploracji muszą

przetwarzać pojawiające się elementy przyrostowo, przetwarzając każdy przykład tylko raz, przy

ograniczonym czasie, ograniczonej pamięci, dostosowując się równocześnie do zmian w źródle stru-

mienia. Źródło strumienia jest tutaj rozumiane jako definicja klas decyzyjnych pozwalająca ge-

nerować nowe przykłady. Typowe zmiany w źródle strumienia to zaniknięcie klasy, pojawienie

się nowej, stopniowa zmiana definicji, gwałtowna zmiana definicji, oraz czasowo zanikająca klasa.

O ile ograniczenia czasowe i pamięciowe bywały już wcześniej implementowane w propozycjach

algorytmów eksploracji danych, o tyle zmiany definicji klas w czasie są cechą charakterystyczną

strumieni i nie były rozważane w tradycyjnych metodach eksploracji.

Przegląd najpopularniejszych metod eksploracji strumieni danych można podzielić na dwie

grupy: pojedyncze klasyfikatory oraz klasyfikatory złożone. Grupę pojedynczych klasyfikatorów

77

78 Streszczenie

otwierają algorytmy znane z tradycyjnej eksploracji danych jak metoda najbliższych sąsiadów,

sztuczne sieci neuronowe [33], metody bayesowskie [13], czy reguły decyzyjne [28, 31, 80]. Wszyst-

kie wymienione metody, po drobnych modyfikacjach, mogą być wykorzystane do klasyfikowania

strumieni. Tradycyjne metody eksploracji są rozwijane od wielu lat i istnieje wiele gotowych im-

plementacji, a nawet całych środowisk do ich testowania. Wadą tych podejść jest to, że dają one

z reguły gorsze rezultaty niż metody wyspecjalizowane w przetwarzaniu strumieni [5].

Do specjalistycznych metod klasyfikacji strumieni należy między innymi technika okien prze-

suwnych pozwalająca ograniczyć pamięć algorytmów do najnowszych przykładów. Dwa najciekaw-

sze algorytmy z tej grupy to FISH [83, 82, 85] i ADWIN [6, 7]. Innym sposobem na „zapomina-

nie”przestarzałych przykładów w strumieniach danych jest przebudowa klasyfikatora po wykryciu

zmiany. Do tego celu wykorzystywane są detektory zmian jak DDM [30] czy EDDM [3], które ob-

serwując rozkład błędów popełnianych przez klasyfikator starają się określić moment wystąpienia

zmiany w źródle strumienia.

Najbardziej znanym pojedynczym klasyfikatorem zaprojektowanym do przetwarzania strumieni

danych jest algorytm VFDT (Very Fast Decision Tree) [21]. Klasyfikator ten jest modyfikacją

drzewa decyzyjnego, która pozwala przyrostowo budować klasyfikator gwarantując równocześnie

trafność klasyfikacji bliską drzewu budowanemu wsadowo, na całym zbiorze danych. Gwarancja

trafności jest zapewniana przez wykorzystanie granicy Hoefffdinga jako mechanizmu określającego

liczbę przykładów potrzebnych do stworzenia najlepszego, z zadanym przez użytkownika prawdo-

podobieństwem, rozgałęzienia. Z racji wykorzystania owego mechanizmu, algorytm VFDT jak i

wszystkie jego modyfikacje są nazywane drzewami Hoeffdinga.

Klasyfikatory złożone to zbiory pojedynczych klasyfikatorów, zwanych klasyfikatorami bazo-

wymi, które wspólnie (z reguły przez głosowanie) przewidują klasę decyzyjną. Modularna budowa

klasyfikatorów złożonych sprawia, że mają one wiele cech przydatnych w przetwarzaniu strumieni

danych. Ważenie głosów klasyfikatorów bazowych pozwala dynamicznie reagować na zmiany. Bu-

dowanie nowych klasyfikatorów bazowych z nadchodzących przykładów pozwala stopniowo po-

lepszać działanie klasyfikatora bez przebudowywania wcześniej nauczonych fragmentów. Główną

wadą klasyfikatorów złożonych jest ich czas działania. Przetwarzają one dane z reguły o wiele

dłużej niż pojedyncze klasyfikatory, lecz oferują często znacznie lepszą trafność klasyfikacji.

Do jednych z pierwszych zaproponowanych strumieniowych klasyfikatorów złożonych należą

algorytmy Streaming Ensemble Algorithm (SEA) [73] i Accuracy Weighted Ensemble (AWE) [78].

Obie metody przetwarzają strumień paczkami (ang. data chunks), z których budują nowe klasyfi-

katory bazowe. Następnie, jeśli to korzystne, nowo-zbudowany klasyfikator bazowy zastępuje inny,

„słabszy”klasyfikator bazowy. O „sile”klasyfikatora decyduje nadana mu waga, która jest inaczej

obliczana dla obu algorytmów. SEA zlicza błędy popełniane przez każdy klasyfikator bazowy, a

następnie sprawdza zgodność głosów między nimi by określić wagi. Sprawdzając zgodność pre-

dykcji elementów składowych, SEA premiuje klasyfikatory bazowe, które specjalizują się w innych

przykładach niż pozostali głosujący. Algorytm AWE jako wagę przypisuje oszacowany na naj-

nowszej paczce przykładów błąd średniokwadratowy danego klasyfikatora bazowego. Jeśli element

składowy A ma najmniejszy błąd średniokwadratowy na ostatniej paczce, to AWE zakłada, że nad-

chodzące przykłady będą podobne do tych z ostatniej paczki i nadaje najwyższą wagę elementowi

składowemu A.

Zupełnie innym typem klasyfikatora złożonego jest niedawno zaproponowany algorytm Ho-

effding Option Tree (HOT) [51]. Algorytm jest zainspirowany wcześniejszą pracą Kunza oraz

Kohaviego [53] i modyfikuje ich pomysły, by dopasować klasyfikator do świata strumieni danych.

HOT można sobie wyobrazić jako skompresowaną postać kliku drzew Hoeffdinga połączonych jed-

nym korzeniem. Jest to uproszczony opis, ale przedstawia główne zalety tego podejścia: gwarancję

Streszczenie 79

trafności zapewnianą prze granicę Hoeffdinga, złożenie decyzji kilku klasyfikatorów bazowych i

małą zajętość pamięciową. Innym klasyfikatorem wykorzystującym drzewa Hoeffdinga jest ASHT

Bagging [10, 9]. W tym algorytmie drzewa Hoeffdinga są składowymi klasyfikatora złożonego, a

wagi są im nadawane zgodnie z liczbą popełnianych błędów. Cechą szczególną tej metody jest

wprowadzenie ograniczeń na kolejne drzewa składowe - każde drzewo ma określony maksymalny

rozmiar, dwa razy większy od poprzedniego drzewa. Po przekroczeniu swojego rozmiaru drzewo

jest budowane od początku z nowych przykładów. Różne rozmiary drzew działają jak różne ro-

dzaje pamięci. Drzewa większe pamiętają lepiej starsze przykłady, a drzewa małe specjalizują się

w tych najnowszych.

Analiza wymienionych algorytmów doprowadziła do zaproponowania w tej pracy nowego kla-

syfikatora złożonego o nazwie Accuracy Diversified Ensemble (ADE). ADE opiera się głównie na

krytyce algorytmu AWE - wprowadza do niego nowe elementy zachowując ideę ważenia elementów

składowych według błędu średniokwadratowego. Klasyfikatorami bazowymi w ADE są drzewa Ho-

effdinga, a nie jak w AWE zwykłe drzewa decyzyjne lub inne tradycyjne klasyfikatory. Zmiana ta

pozwala na douczanie wcześniej stworzonych składowych, a tym samym na zmniejszenie rozmiaru

paczek przykładów bez niebezpieczeństwa budowania mniej trafnych drzew. Klasyfikatory bazowe

są tak jak w oryginalnym algorytmie ważone według błędu średniokwadratowego, lecz sama funk-

cja obliczająca wagę została zmodyfikowana. Zniesiono próg błędu pozwalający przypisać wagę,

by uniknąć zaobserwowanego w eksperymentach dla AWE zjawiska zerowania wszystkich wag przy

nagłych zmianach w strumieniu. By dodatkowo premiować składowe trafnie klasyfikujące przy-

kłady z ostatniej paczki, zaproponowano douczać klasyfikatory bazowe zbudowane na poprzednich

paczkach, tylko jeśli ich obecna trafność przekracza próg wyznaczany na podstawie rozkładu przy-

kładów w ostatniej paczce.

Proponowany algorytm może za pomocą jednej paczki douczyć więcej niż jeden klasyfikator

bazowy. By uniknąć teoretycznego niebezpieczeństwa upodobniania się drzew składowych, po-

stanowiono wykorzystać zaproponowany przez Ozę algorytm Online Bagging [66] do dodatkowej

dywersyfikacji klasyfikatorów bazowych. By się upewnić, że takie dynamiczne urozmaicanie skła-

dowych jest konieczne, w eksperymentach porównano dwie wersje ADE - z i bez baggingu.

Jednym z celów tej pracy było implementacyjne rozszerzenie, wykorzystanie do testów i ocena

środowiska MOA. MOA (ang. Massive Online Analysis) to środowisko do implementowania i

przeprowadzania eksperymentów na algorytmach klasyfikujących strumienie danych [8, 12, 11].

Zaimplementowane jest w języku Java i zawiera zbiór generatorów strumieni, algorytmów eksplo-

racji danych i metod oceny algorytmów. Praca w MOA podzielona jest na zadania (ang. task).

Do podstawowych zadań oferowanych przez środowisko należą: uczenie klasyfikatora, ocena kla-

syfikatora, generowanie strumienia do pliku, pomiar prędkości strumienia. Wszystkie zadania

mogą być wykonywane z poziomu interfejsu graficznego lub konsoli. Ciekawą opcją jest możliwość

wykonywania kilku zadań równolegle z poziomu interfejsu graficznego.

Strumienie danych w MOA mogą być generowane, odczytywane z plików ARFF, łączone i

filtrowane. Ważną opcją jest również możliwość wprowadzania dynamicznych zmian definicji klas w

czasie. MOA zawiera najpopularniejsze generatory strumieni, w tym: Random Trees [21], SEA [73],

STAGGER [72], Rotating Hyperplane [78, 23, 24], Random RBF, LED [32] i Waveform [32].

W ramach rozszerzania środowiska zaimplementowano w tej pracy metodę usuwania wybranych

atrybutów ze strumienia. Zauważono, że etap filtrowania, oryginalnie zaimplementowany w MOA,

znacznie spowalnia przetwarzanie strumieni i wymaga optymalizacji.

W MOA zaimplementowano szereg popularnych klasyfikatorów strumieniowych, w tym: Na-

iwny Bayes, drzewa Hoeffdinga, HOT, Bagging Ozy, Boosting, okno przesuwne ADWIN, ASHT

Bagging i Weighted Majority Algorithm. Środowisko pozwala również na wykorzystanie klasyfika-

80 Streszczenie

tora z platformy WEKA w połączeniu z detektorem zmian (DDM lub EDDM). W ramach tej pracy

zaimplementowano i dodano do MOA algorytmy AWE i ADE. Podobnie jak w środowisku WEKA,

napisanie własnego klasyfikatora w MOA wymaga jedynie zaimplementowania klasy dziedziczącej

z klasy AbstractClassifier, a interfejs graficzny dla takiej klasy jest tworzony dynamicznie.

Ostatnim rozszerzeniem jakie w tej pracy wprowadzono do MOA jest nowa metoda oceny kla-

syfikatorów. Oryginalnie w MOA istnieją dwie metody szacowania skuteczności klasyfikatorów:

Holdout i Interleaved Test-Then-Train. Pierwsza z nich wykorzystuje n początkowych przykładów

jako testowe i co pewien interwał czasowy sprawdza trafność klasyfikatora uczonego na pozostałych

elementach strumienia. Druga metoda wykorzystuje każdy nowy przykład najpierw do testowania,

a następnie do douczania klasyfikatora. Wadą metody Holdout jest statyczność zbioru testującego,

która może skutkować złą oceną dla zmieniających się strumieni. Z kolei metoda Interleaved Test-

Then-Train wykonuje testy bardzo często zaniżając trafność klasyfikacji i uniemożliwiając osobny

pomiar czasu dla uczenia i testowania. Postanowiono zaimplementować metodę oceny będącą

kompromisem pomiędzy tymi wcześniej wymienionymi - metodę oceny paczkami. Metoda oceny

paczkami (ang. Data Chunk Evaluation) grupuje przykłady w paczki, które najpierw wykorzy-

stywane są do testowania a później do uczenia. Tworzenie paczek pozwala obliczać czas uczenia

i testowania grup przykładów oraz niweluje problem zaniżania trafności klasyfikacji występujący

w metodzie Interleaved Test-Then-Train. Metoda ta może również być wykorzystywana do oceny

działania klasyfikatorów na strumieniach zmiennych w czasie, gdyż zbiór nie jest statyczny tak jak

w metodzie Holdout. Wszystkie testy wykonane w ramach tej pracy wykorzystywały do porówny-

wania klasyfikatorów metodę oceny paczkami.

Do eksperymentalnego porównania algorytmów wybrano dwa klasyfikatory pojedyncze (drzewo

Hoeffdinga z detektorem DDM i drzewo decyzyjne z oknem przesuwnym), dwa klasyfikatory zło-

żone (AWE i HOT) i dwie wersje zaproponowanego w tej pracy algorytmu ADE (z i bez baggingu).

Działanie każdego z algorytmów przetestowano na czterech rzeczywistych i czterech sztucznie wy-

generowanych zbiorach danych. Ogólnodostępne rzeczywiste zbiory danych są małe w porównaniu

ze strumieniami narzucającymi wymagania pamięciowe i czasowe. Dlatego właśnie wykorzystano

cztery generatory do stworzenia sztucznych strumieni mających po 1, 5 i 20 milionów przykładów.

Dla wszystkich zbiorów porównywano czas uczenia, testowania, zajętość pamięciową i trafność

klasyfikacji. Wszystkie wyniki zaprezentowano w postaci przebiegów czasowych w dodatku B.

Wyniki pokazują, że pojedyncze klasyfikatory przetwarzają strumienie szybciej niż złożone kla-

syfikatory. Jest to spodziewany wynik, gdyż prostsze klasyfikatory z reguły działają szybciej niż

bardziej złożone. Dla pięciu najmniejszych zbiorów zużycie pamięci było najmniejsze dla drzewa

Hoeffdinga, a dla większych strumieni najlepiej spisywały się algorytmy AWE i ADE. Klasyfikatory

HOT i drzewo z detektorem zmian wykazywały wprost proporcjonalny do liczby przetworzonych

przykładów wzrost wymagań pamięciowych. Przykład różnych wyników dla różnych rozmiarów

strumieni pokazuje jak ważny jest zakres eksperymentów na algorytmach strumieniowych i jak po-

trzebne są sztuczne generatory danych. Różnice w trafności klasyfikacji były stosunkowo niewielkie

pomiędzy algorytmami HOT, ADE i drzewem Hoeffdinga. Klasyfikatory AWE i drzewo z oknem

przesuwnym wypadły o wiele gorzej. Widać, że zaproponowany algorytm ADE przy niewielkim

wzroście czasu przetwarzania i zajętości pamięciowej znacznie poprawił swoją trafność klasyfika-

cji w stosunku do pierwowzoru. Ponadto, ADE ma stałe wymagania pamięciowe i czasowe, w

przeciwieństwie do HOT i drzewa Hoeffdinga, których wymagania potrafiły rosnąć z czasem.

Jak wspomniano wcześniej, podczas testów porównaliśmy dwie wersje algorytmu ADE. Okazało

się, że dodanie baggingu do procesu douczania klasyfikatorów bazowych nie podniosło znacząco

trafności klasyfikacji. Ponadto, bagging wniósł dodatkowe, choć również niewielkie, narzuty cza-

sowe i pamięciowe. Wydaje się, że ADE już bez baggingu radzi sobie dobrze z dywersyfikacją

Streszczenie 81

klasyfikatorów bazowych i jeśli można osiągnąć jakieś korzyści z różnicowania drzew składowych

w algorytmie to wydaje się, że nie przez bagging.

Wyniki eksperymentów przeprowadzonych w ramach tej pracy można częściowo porównać z

publikacjami innych autorów. Porównanie wyników czasowych, pamięciowych i trafności klasyfi-

kacji dla algorytmów HOT, drzewa Hoeffdinga i metody przesuwnych okien wypada podobnie jak

w artykule napisanym przez Bifeta et al. [10]. Główna różnica polega na tym, że w testach w

tej pracy HOT uzyskiwał zwykle lepszą trafność klasyfikacji niż pojedyncze drzewo Hoeffdniga z

detektorem zmian podczas, gdy u Bifeta było odwrotnie. Porównanie AWE i ADE z pozostałymi

algorytmami w tej pracy nie było nigdy wcześniej przeprowadzane i przedstawia nowe, wcześniej

niepublikowane rezultaty.

Przegląd algorytmów dokonany w tej pracy pokazuje, że eksploracja strumieni danych ze

zmienną definicją klas kształtuje się jako nowa gałąź odkrywania wiedzy ze swoimi własnymi

problemami badawczymi. Ograniczenia czasowe i pamięciowe nałożone na algorytmy sprawiają,

że trafność klasyfikacji nie może być traktowana jako najważniejsze kryterium oceny klasyfikatora.

Ponadto, zmienność problemu decyzyjnego w czasie wymaga projektowania mechanizmów zapo-

minania zwykle niewystępujących w tradycyjnych metodach eksploracji danych. Środowisko MOA

jest propozycją zunifikowania sposobu implementacji i testowania algorytmów, które mierzą się z

tymi problemami. To kolejny znak na to, że eksploracja strumieni danych staje się coraz dojrzalszą

dziedziną informatyki dążącą do sprostania nadchodzącym wyzwaniom świata rzeczywistego.

W ramach dalszych badań, planowane jest przeprowadzenie dodatkowych eksperymentów

sprawdzjących czy rozmiar paczki przykładów wpływa na wyniki osiągane przez ADE z baggin-

giem. Ponadto, planowane jest zbadanie wpływu na trafność klasyfikacji ADE innych niż bagging

metod dywersyfikacji składowych jak na przykład używanie różnych typów klasyfikatorów bazo-

wych czy boosting. Dodatkowe eksperymenty będą okazją do poszerzenia testu o inne algorytmy

eksploracji i dokonania tym samym pełniejszego przeglądu dostępnych metod.

c© 2010 Dariusz Brzeziński

Poznan University of TechnologyFaculty of Computing Science and ManagementInstitute of Computing Science

Typeset using LATEX in Computer Modern.

BibTEX:

@mastersthesis{BrzezMs2010,author = {Dariusz Brzezi{\’n}ski},title = {Mining data streams with concept drift},school = {Poznan University of Technology},address = {Pozna{\’n}, Poland},year = {2010}

}

mining data streams with concept drift · before describing and evaluating diﬀerent approaches to...

Documents