evaluating classification algorithms applied to data streams esteban donato

Evaluating classification algorithms applied to

data streamsAuthor: Ing. Esteban D. DonatoAdvisor: Dr. Fazel FamiliCo-Advisor: Dra. Ana S. Haedo

Dec-2009

Maestría en Explotación de Datos y Descubrimiento del Conocimiento

1

Introduction

Majority of companies and organizations collect and maintain gigantic databases that grow to millions of registers per day.

Current algorithms for mining complex models from data cannot mine even a fraction of these data in useful time.

Concept drift: occurs when the underlying data distribution changes over time.

2

Objective

To perform a benchmarking analysis between a number of known algorithms applied to data streams.

The algorithms chosen for this study are: UFFT, CVFDT and VFDTc.

The analysis will be focused on some aspects that all the algorithms applied to data streams have to deal with.

3

Related work A data stream is a sequence of data items x1,…,xi,…,xn. Those

items are read one at a time in increasing order of the indices.

Off-line learning: Assumes that the dataset resides in a static database and that has been generated from a static distribution. Also, they assume that all the data is available before the training and that all the examples can fit into the memory.

Incremental learning: The items are time-ordered and the distribution that generates them varies over time. Systems evolve and change a concept definition as new observations are processed.

4

Related work (Cont.): Data Streams Mining

A sub area of incremental learning. Accumulates faster than it can be mined. It must require small constant time per record. It must use only a fixed amount of main memory. It must be able to build a model using at most one scan of the

data. It must make a usable model available at any point in time. Ideally, it should produce a model that is equivalent to the one that

would be obtained by the corresponding ordinary database mining algorithm.

The model should be up-to-date at any time. Types of Algorithms: Set of rules, Induction trees and Ensembles

methods. 5

Related work (Cont.): Very Fast Decision Tree (VFDT)

Requires each example to be read only once. Requires a small constant time to process it.

Building process: given a stream of examples, the first ones will be used to choose the root and the following examples will be passed down to the corresponding leaves.

To detect how many examples are needed at each node, The Hoeffding bound is used.

The Hoeffding bound: with probability 1 - φ, the true mean of the variable is at least r - e, where:

Let ∆G = G(Xa) - G(Xb) >= 0, if ∆G> e then ∆G >= ∆G- e > 0 with probability 1 – φ

Other features: Pre-pruning, different evaluation measure, Ties, Memory, Poor attributes, Initialization, Rescans.

Drawbacks: It does not detect Concept Drift.

ne R

2

)/1ln(2

6

Related work (Cont.): Concept Drift

Change in the target concept

Depends on some hidden attributes, not given explicitly in the form of predictive features,

Examples: Weather prediction, customers’ buying, etc.

Concept drift handling system should be able to: Quickly adapt to concept drift Be robust to noise and distinguish it from concept drift. Recognize and treat recurring contexts.

Types: sudden, gradual, frequent and virtual concept drift.

7

Conclusion of literature review Data stream is a sequence of time-ordered items, arriving faster than the time

needed to be mined.

Some changes in the underlying data distribution may occur requiring the algorithms to detect and adapt to these changes.

The main challenge in incremental learning is how to detect and adapt to a concept drift.

To deal with the problem of data arriving fast, the algorithms must require a small constant processing time per record.

One of the first algorithms developed was VFDT, using the Hoeffding bound

In concept drift, a difficult problem is to distinguish between a true concept drift and noise.

8

Algorithm: VFDTcVery Fast Decision Tree for Continuous attributes

Extension of VFDT in three directions: continuous data, functional leaves, and concept drift.

For a continuous attribute the split-test is a condition of the form attri <= cut_point.

Use of Information gain to detect the cut_point.

Functional tree leaves: An innovative aspect of this algorithm is its ability to use the naive Bayes classifiers at tree leaves

A leaf must see nmin examples before computing the evaluation function

Concept Drift is based on the assumption that whatever is the cause of the drift, the decision surface moves. It supports two methods:

Drift Detection based on Error Estimates (EE/EBP) Drift Detection based on Affinity Coefficient (AC)

Reacting to Drift: method pushes up all the information of the descending leaves to node This is a forgetting mechanism.

9

Algorithm: UFFT

Ultra Fast Forest Tree Generates a forest of binary trees Processes each example in constant time It uses analytical techniques to choose the splitting criteria, and the

information gain to estimate the merit of each possible splitting-test It maintains a short term memory for initializing the leaves To expand a leaf node: information gain positive and statistical

support Functional leaves Concept drift detection: error rate is calculated at each node (Naive-

Bayes ). Error follows a binomial distribution. Two confident interval levels: warning drift

10

Algorithm: CVFDTConcept-adapting Very Fast Decision Tree

Extension of VFDT with support to concept drifts

It works by keeping its model consistent with a sliding window of examples. Updates just statistics.

It uses information gain for selecting the best attribute.

Grows an alternative subtree with the new best attribute at its root.

Periodically scans HT and all alternate trees looking for internal nodes whose performing better than the actual nodes.

11

Performance measures Capacity to detect and respond to concept drift Capacity to detect and respond to virtual concept drift Capacity to detect and respond to recurring concept drift Capacity to adapt to sudden concept drift Capacity to adapt to gradual concept drift Capacity to adapt to frequent concept drift Accuracy of the classification task Capacity to deal with outliers Capacity to deal with noisy data Speed (Time to take to process an item in the stream)

12

Data sets generated Data sets based on a moving hyperplane d-dimensional space [0; 1]d, is denoted by MOA (Massive Online Analysis) tool

http://sourceforge.net/projects/moa-datastream/ Released under GNU. Free and open source

Current configurabe attributes: instanceRandomSeed numClasses numAtts numDriftAtts magChange noisePercentage sigmaPercentage

New configurable attributes: driftFreq driftTran outlierPercentage distributionPercentage

d

1i 0ii w xw

13

Data sets generated

0

0.2

0.4

0.6

0.8

1

1.2

0 0.2 0.4 0.6 0.8 1 1.2

positive

negative

0

0.2

0.4

0.6

0.8

1

1.2

0 0.2 0.4 0.6 0.8 1 1.2

positive

negative

0

0.5

1

1.5

2

2.5

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8

positive

negative

0

0.2

0.4

0.6

0.8

1

1.2

0 0.2 0.4 0.6 0.8 1 1.2

positive

negative

14

Dataset with no concept drift, outlier of noise Dataset with 10% of noisy data

Dataset with 1% of outliers Dataset with 3 concept drift

ResultsCapacity to detect and respond to concept drift

15

ResultsCapacity to detect and respond to virtual concept drift

16

ResultsCapacity to detect and respond to recurring concept drift

17

ResultsCapacity to adapt to sudden concept drift

18

ResultsCapacity to adapt to gradual concept drift

19

ResultsCapacity to adapt to frequent concept drift

20

ResultsAccuracy of the classification task

Predicted Predicted Class 1 Class 2Actual Class 1 44.5% (887) 5.5% (109)Actual Class 2 5% (101) 45% (903)

Predicted Predicted Class 1 Class 2Actual Class 1 39% (777) 11% (219)Actual Class 2 9% (173) 41% (831)

Predicted Predicted Class 1 Class 2Actual Class 1 46% (928) 3.5% (68)Actual Class 2 2.5% (48) 48% (956)

Predicted Predicted Class 1 Class 2Actual Class 1 34.5% (685) 15.5% (311)Actual Class 2 15.5% (312) 34.5% (692)

Accuracy (AC)

True positive (TP)

False Positive (FP)

True Negative (TN)

False Negative (FN)

Precision (P)

VFDTc (CA) 0.89 0.89 0.10 0.90 0.11 0.90VFDTc (EBP) 0.80 0.78 0.17 0.83 0.22 0.82UFFT 0.94 0.93 0.05 0.95 0.07 0.95CVFDT 0.69 0.69 0.31 0.69 0.31 0.69

VFDTc (CA) VFDTc (EBP)

UFFT CVFDT

measures derived from the confusion matrix

21

ResultsDealing with outliers

22

ResultsDealing with noisy data

23

ResultsSpeed (Time to take to process an item in the stream)

24

Conclusions & future work Given that the data can be generated very fast, that give us a new and

challenging way of developing Data Mining algorithms. We have to develop them having in mind that the training phase can never

end The changes in the data distribution are another challenging scenario that

data stream mining has to deal with. VFDT was one of the first data stream mining algorithms developed. It

implemented the Hoeffding bound We generated different datasets using the moving hyperplane algorithm UFFT for short term predictions CVFDT for long term solutions No impact on virtual concept drift or recurring concept drift

25

Conclusions & future work VFDTc (CA) is not suitable for gradual or sudden concept drift VFDTc (CA) or UFFT are not suitable for frequent concept drift VFDTc (EBP) and CVFDT for data streams with outliers CVFDT for data streams with noisy points CVFDT and UFFT fastest algorithms

Future Work

Clustering algorithms applied to data streams Classification algorithms applied to data streams of unstructured datasets

(text, images, etc)

26

Questions?

27

E-mail: [email protected]

Twitter: @eddonato

evaluating classification algorithms applied to data streams esteban donato

Documents