a short course in data stream mining
TRANSCRIPT
A Short Course in Data Stream Mining
Albert Bifet
Huawei Noah's Ark Lab, Shenzhen, 23 January 2015
Real time analytics
Real time analytics
Introduction: Data Streams

Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived

Example (Puzzle): Finding Missing Numbers
Let π be a permutation of {1, . . . , n}.
Let π−1 be π with one element missing.
π−1[i] arrives in increasing order
Task: Determine the missing number
One solution: use an n-bit vector to memorize all the numbers (O(n) space)
Data Streams: O(log n) space.
Store n(n+1)/2 − ∑_{j≤i} π−1[j]; once the stream ends, this running difference is the missing number.
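A minimal Python sketch of this streaming solution (names are illustrative): only the running difference is kept, which fits in O(log n) bits.

def find_missing(stream, n):
    # n(n+1)/2 is the sum of 1..n; subtract each value as it arrives,
    # so only one integer of O(log n) bits is ever stored
    missing = n * (n + 1) // 2
    for x in stream:
        missing -= x
    return missing

# Example: a permutation of 1..6 with 4 missing
print(find_missing([1, 2, 3, 5, 6], 6))   # prints 4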
Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived

Tools:
approximation
randomization, sampling
sketching
Data Streams
Approximation algorithms
Small error rate with high probability
An algorithm (ε,δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
Data Streams Approximation Algorithms

Example bit stream (a sliding window moves over the most recent bits):
1011000111 1010101

Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where
N is the length of the sliding window
ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002
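A rough Python sketch of the exponential-histogram idea behind this result: bucket sizes are powers of two, only a few buckets of each size are kept, and the number of 1s in the window is answered approximately. Class and parameter names are illustrative, and this simplified merge loop is not the paper's exact algorithm.

class BitWindowCounter:
    def __init__(self, window_size, max_per_size=2):
        self.N = window_size
        self.k = max_per_size
        self.buckets = []      # (timestamp of most recent 1, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # discard buckets whose most recent 1 has fallen out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            self._merge()

    def _merge(self):
        size = 1
        while sum(1 for _, s in self.buckets if s == size) > self.k:
            # merge the two oldest buckets of this size into one of double size
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            a, b = idx[-2], idx[-1]
            newer_ts = self.buckets[a][0]
            del self.buckets[b]
            self.buckets[a] = (newer_ts, size * 2)
            size *= 2

    def count_ones(self):
        # every bucket counted fully except the oldest, counted at half its size
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2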
Classification
Classification
Definition
Given n_C different classes, a classifier algorithm builds a model that predicts for every unlabelled instance I the class C to which it belongs with accuracy.

Example
A spam filter

Example
Twitter sentiment analysis: analyze tweets with positive or negative feelings
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Classification
Data set that describes e-mail features for deciding if it is spam.
Example
Contains "Money"   Domain type   Has attach.   Time received   spam
yes                com           yes           night           yes
yes                edu           no            night           yes
no                 com           yes           night           yes
no                 edu           no            day             no
no                 com           no            day             no
yes                cat           no            day             yes

Assume we have to classify the following new instance:
Contains "Money"   Domain type   Has attach.   Time received   spam
yes                edu           yes           day             ?
Bayes Classifiers
Naïve Bayes
Based on Bayes' Theorem:

    P(c|d) = P(c) P(d|c) / P(d)

    posterior = prior × likelihood / evidence

Estimates the probability of observing attribute a and the prior probability P(c)
Probability of class c given an instance d:

    P(c|d) = P(c) ∏_{a∈d} P(a|c) / P(d)
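A small Python sketch of an incremental naive Bayes over nominal attributes in the spirit of this formula: only counts are stored, so each example is inspected once and then discarded. Class and method names are illustrative, and add-one smoothing is used to avoid zero probabilities.

import math
from collections import defaultdict

class StreamingNaiveBayes:
    def __init__(self):
        self.n = 0
        self.class_counts = defaultdict(int)    # count of each class c
        self.attr_counts = defaultdict(int)     # count of (attribute index, value, class)

    def learn_one(self, x, y):
        # x: list of nominal attribute values, y: class label
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # log P(c) + sum over attributes of log P(a | c), add-one smoothed
            score = math.log(nc / self.n)
            for i, v in enumerate(x):
                score += math.log((self.attr_counts[(i, v, c)] + 1) / (nc + 1))
            if score > best_score:
                best, best_score = c, score
        return best

# Example with two rows of the spam table above
nb = StreamingNaiveBayes()
nb.learn_one(["yes", "com", "yes", "night"], "spam")
nb.learn_one(["no", "edu", "no", "day"], "no spam")
print(nb.predict_one(["yes", "edu", "yes", "day"]))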
Bayes Classifiers
Multinomial Naïve Bayes
Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior probability P(c)
Probability of class c given a test document d:

    P(c|d) = P(c) ∏_{w∈d} P(w|c)^{n_wd} / P(d)

where n_wd is the number of times w occurs in d
Perceptron
[Diagram: a single perceptron with inputs Attribute 1..5, weights w1..w5, and output h_w(x_i)]

Data stream: ⟨x_i, y_i⟩
Classical perceptron: h_w(x_i) = sgn(w^T x_i)
Minimize mean-square error: J(w) = (1/2) ∑_i (y_i − h_w(x_i))²
Perceptron

[Same perceptron diagram as above]

We use the sigmoid function h_w = σ(w^T x), where
σ(x) = 1/(1 + e^{−x})
σ′(x) = σ(x)(1 − σ(x))
Perceptron
Minimize mean-square error: J(w) = (1/2) ∑_i (y_i − h_w(x_i))²
Stochastic gradient descent: w = w − η ∇J_{x_i}
Gradient of the error function:
    ∇J = −∑_i (y_i − h_w(x_i)) ∇h_w(x_i)
    ∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i))
Weight update rule:
    w = w + η ∑_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i
Perceptron
PERCEPTRON LEARNING(Stream, η)
    for each class
        do PERCEPTRON LEARNING(Stream, class, η)

PERCEPTRON LEARNING(Stream, class, η)
    ▷ Let w_0 and w be randomly initialized
    for each example (x, y) in Stream
        do if class = y
               then δ = (1 − h_w(x)) · h_w(x) · (1 − h_w(x))
               else δ = (0 − h_w(x)) · h_w(x) · (1 − h_w(x))
           w = w + η · δ · x

PERCEPTRON PREDICTION(x)
    return arg max_class h_{w_class}(x)
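A compact Python sketch of the update above for a single class; one such perceptron is kept per class and prediction takes the arg max of their outputs. Feature vectors are assumed numeric and names are illustrative.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlinePerceptron:
    def __init__(self, n_features, eta=0.1):
        self.w = [random.uniform(-0.05, 0.05) for _ in range(n_features)]
        self.eta = eta

    def output(self, x):
        # h_w(x) = sigma(w^T x)
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def learn_one(self, x, target):
        # target is 1 if the example belongs to this perceptron's class, else 0
        h = self.output(x)
        delta = (target - h) * h * (1.0 - h)
        self.w = [wi + self.eta * delta * xi for wi, xi in zip(self.w, x)]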
Classification
Assume we have to classify the following new instance:
Contains "Money"   Domain type   Has attach.   Time received   spam
yes                edu           yes           day             ?

[Decision tree: the root splits on Time; Night leads to YES; Day leads to a split on Contains "Money": Yes leads to YES, No leads to NO]
Decision Trees
Basic induction strategy:
A ← the "best" decision attribute for the next node
Assign A as the decision attribute for the node
For each value of A, create a new descendant of the node
Sort training examples to leaf nodes
If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
Hoeffding Trees

Hoeffding Tree: VFDT
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000

With high probability, constructs a model identical to the one a traditional (greedy) method would learn
With theoretical guarantees on the error rate

[Decision tree example as above: split on Time; Night leads to YES; Day leads to a split on Contains "Money"]
Hoeffding Bound Inequality

Bounds the probability that a random variable deviates from its expected value.

Hoeffding Bound Inequality
Let X = ∑_i X_i where X_1, . . . , X_n are independent and identically distributed in [0,1]. Then

1 Chernoff: for each ε < 1,
      Pr[X > (1 + ε) E[X]] ≤ exp(−ε² E[X] / 3)

2 Hoeffding: for each t > 0,
      Pr[X > E[X] + t] ≤ exp(−2t² / n)

3 Bernstein: let σ² = ∑_i σ_i² be the variance of X. If X_i − E[X_i] ≤ b for each i ∈ [n], then for each t > 0,
      Pr[X > E[X] + t] ≤ exp(− t² / (2σ² + (2/3) b t))
Hoeffding Tree or VFDT

HT(Stream, δ)
    ▷ Let HT be a tree with a single leaf (root)
    ▷ Init counts n_ijk at root
    for each example (x, y) in Stream
        do HTGROW((x, y), HT, δ)

HTGROW((x, y), HT, δ)
    ▷ Sort (x, y) to leaf l using HT
    ▷ Update counts n_ijk at leaf l
    if examples seen so far at l are not all of the same class
        then ▷ Compute G for each attribute
             if G(Best Attr.) − G(2nd best) > √(R² ln(1/δ) / (2n))
                 then ▷ Split leaf on best attribute
                      for each branch
                          do ▷ Start new leaf and initialize counts
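The core of the split decision is the Hoeffding bound test; a minimal Python sketch, where R is the range of the split criterion (for example log2 of the number of classes when G is information gain) and tau follows the tie rule given below. Function and parameter names are illustrative.

import math

def hoeffding_bound(R, delta, n):
    # with probability 1 - delta, the observed mean of n i.i.d. values with
    # range R is within this epsilon of the true mean
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

def should_split(g_best, g_second_best, R, delta, n, tau=0.05):
    eps = hoeffding_bound(R, delta, n)
    # split when the best attribute is clearly better than the runner-up,
    # or when epsilon is already smaller than tau (the attributes are tied)
    return (g_best - g_second_best > eps) or (eps < tau)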
Hoeffding Trees
HT features
With high probability, constructs a model identical to the one a traditional (greedy) method would learn
Ties: when two attributes have similar G, split if

    G(Best Attr.) − G(2nd best) < √(R² ln(1/δ) / (2n)) < τ

Compute G every n_min instances
Memory: deactivate least promising nodes with lower p_l × e_l
    p_l is the probability to reach leaf l
    e_l is the error in the node
Hoeffding Naive Bayes Tree
Hoeffding Tree: Majority Class learner at leaves

Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer. Stress-testing Hoeffding trees, 2005.
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method
Bagging
Example
Dataset of 4 instances: A, B, C, D
Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, with a bootstrap sample created by drawing random samples with replacement.
Bagging
Example
Dataset of 4 instances: A, B, C, D
Classifier 1: A, B, B, C
Classifier 2: A, B, D, D
Classifier 3: A, B, B, C
Classifier 4: B, B, B, C
Classifier 5: A, C, C, D

Bagging builds a set of M base models, with a bootstrap sample created by drawing random samples with replacement.
Bagging
Example
Dataset of 4 instances: A, B, C, D
Classifier 1: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D: A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C: A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D: A(1) B(0) C(2) D(1)

Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
Bagging
Figure: Poisson(1) distribution.

Each base model's training set contains each of the original training examples K times; as the dataset grows, P(K = k) tends to a Poisson(1) distribution.
Oza and Russell's Online Bagging for M models

Initialize base models h_m for all m ∈ {1, 2, ..., M}
for all training examples do
    for m = 1, 2, ..., M do
        Set w = Poisson(1)
        Update h_m with the current example with weight w

anytime output:
return hypothesis: h_fin(x) = arg max_{y∈Y} ∑_{t=1}^{T} I(h_t(x) = y)
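A minimal Python sketch of this loop; base models are assumed to expose learn_one/predict_one (for instance the streaming naive Bayes or perceptron sketches above), and the Poisson(1) draw uses Knuth's method so no external library is needed. Names are illustrative.

import math
import random
from collections import Counter

def poisson(lam=1.0):
    # Knuth's method, adequate for small lambda
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

class OnlineBagging:
    def __init__(self, make_base_model, M=10):
        self.models = [make_base_model() for _ in range(M)]

    def learn_one(self, x, y):
        for m in self.models:
            w = poisson(1.0)
            for _ in range(w):          # weight w is simulated by repeating the example
                m.learn_one(x, y)

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.models)
        return votes.most_common(1)[0][0]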
Evolving Stream Classification
Data Mining Algorithms with Concept Drift

No Concept Drift
[Diagram: input arrives at a DM Algorithm that keeps simple counters (Counter1 .. Counter5) and produces the output]

Concept Drift
[Diagram: input arrives at a DM Algorithm holding a static model, with a change detector monitoring the stream and signalling when the model must be revised]
Data Mining Algorithms with Concept Drift

No Concept Drift
[Diagram: input arrives at a DM Algorithm that keeps simple counters and produces the output]

Concept Drift
[Diagram: the same DM Algorithm, but the counters are replaced by change-aware estimators (Estimator1 .. Estimator5)]
Optimal Change Detector and Predictor

High accuracy
Low false positive and false negative rates
Theoretical guarantees
Fast detection of change
Low computational cost: minimum space and time needed
No parameters needed
Algorithm ADaptive Sliding WINdow (ADWIN)

Example (ADWIN examines every split of W into W0 · W1, sliding the split point over the window):

W = 101010110111111    W0 = 1            W1 = 01010110111111
W = 101010110111111    W0 = 101010       W1 = 110111111
W = 101010110111111    W0 = 101010110    W1 = 111111    |µ̂_W0 − µ̂_W1| ≥ ε_c : CHANGE DETECTED!
W = 01010110111111     Drop elements from the tail of W

ADWIN: ADAPTIVE WINDOWING ALGORITHM

Initialize window W
for each t > 0
    do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
       repeat drop elements from the tail of W
       until |µ̂_W0 − µ̂_W1| < ε_c holds for every split of W into W = W0 · W1
       output µ̂_W
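A simplified Python sketch of this loop for a stream of values in [0, 1]. It stores the window explicitly and checks every split in O(|W|) time, whereas the real ADWIN uses exponential-histogram buckets to reach the O(log W) costs quoted below; the cut threshold is a Hoeffding-style bound with a harmonic sample size. Names and the exact constant in the threshold are illustrative.

import math

class SimpleAdwin:
    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = []                 # most recent element last

    def _cut_threshold(self, n0, n1):
        # epsilon_c from a Hoeffding-style bound with harmonic sample size m
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def add(self, x):
        self.window.append(x)
        change = False
        shrinking = True
        while shrinking and len(self.window) >= 2:
            shrinking = False
            n, total = len(self.window), sum(self.window)
            s0 = 0.0
            for i in range(1, n):        # split W into W0 = window[:i], W1 = window[i:]
                s0 += self.window[i - 1]
                mu0, mu1 = s0 / i, (total - s0) / (n - i)
                if abs(mu0 - mu1) >= self._cut_threshold(i, n - i):
                    self.window.pop(0)   # drop the oldest element and re-check
                    change, shrinking = True, True
                    break
        return change

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0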
Algorithm ADaptive Sliding WINdow

Theorem
At every time step we have:
1 (False positive rate bound). If µ_t remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2 (False negative rate bound). Suppose that for some partition of W into two parts W0 W1 (where W1 contains the most recent items) we have |µ_W0 − µ_W1| > 2ε_c. Then with probability 1 − δ, ADWIN shrinks W to W1, or shorter.
ADWIN tunes itself to the data stream at hand, with no need for theuser to hardwire or precompute parameters.
Algorithm ADaptive Sliding WINdow

ADWIN, using a Data Stream Sliding Window Model,
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
the processing time per example is O(log W) (amortized and worst-case)
Sliding Window Model (exponential histogram of buckets)

Buckets:    1010101   101   11   1   1
Content:          4     2    2   1   1
Capacity:         7     3    2   1   1
VFDT / CVFDT

Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. 2001

It keeps its model consistent with a sliding window of examples
Constructs "alternative branches" as preparation for changes
If the alternative branch becomes more accurate, a switch of tree branches occurs

[Decision tree example as above: split on Time; Night leads to YES; Day leads to a split on Contains "Money"]
Decision Trees: CVFDT

[Same decision tree example: split on Time; Night leads to YES; Day leads to a split on Contains "Money"]
No theoretical guarantees on the error rate of CVFDT

CVFDT parameters:
1 W: the example window size
2 T0: number of examples used to check at each node if the splitting attribute is still the best
3 T1: number of examples used to build the alternate tree
4 T2: number of examples used to test the accuracy of the alternate tree
Decision Trees: Hoeffding Adaptive Tree

Hoeffding Adaptive Tree:
replace frequency statistics counters by estimators
no need of a window to store examples, because the needed statistics are maintained by the estimators
change the way of checking the substitution of alternate subtrees, using a change detector with theoretical guarantees

Advantages over CVFDT:
1 Theoretical guarantees
2 No parameters
ADWIN Bagging (KDD’09)
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems):
    on the ratio of false positives and negatives
    on the relation between the size of the current window and the change rate

ADWIN Bagging
When a change is detected, the worst classifier is removed and a new classifier is added.
Leveraging Bagging for Evolving Data Streams (ECML-PKDD'10)
Randomization as a powerful tool to increase accuracy and diversity
There are three ways of using randomization:
Manipulating the input data
Manipulating the classifier algorithms
Manipulating the output targets
Leveraging Bagging for Evolving Data Streams

Leveraging Bagging
Using Poisson(λ)

Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes

Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = e_T/(1 − e_T)

Empirical evaluation
                         Accuracy   RAM-Hours
Hoeffding Tree           74.03%     0.01
Online Bagging           77.15%     2.98
ADWIN Bagging            79.24%     1.48
Leveraging Bagging       85.54%     20.17
Leveraging Bagging MC    85.37%     22.04
Leveraging Bagging ME    80.77%     0.87
Leveraging Bagging

Leveraging Bagging
Using Poisson(λ)

Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes

Leveraging Bagging ME
Using weight 1 if misclassified, otherwise e_T/(1 − e_T)
Clustering
Clustering
Definition
Clustering is the partitioning of a set of instances into previously unknown groups according to some common relations or affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering
Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(I)
a clustering algorithm computes an assignment of a cluster to each instance
    f : I → {1, . . . , K}
that minimizes the objective function cost(I)
Clustering

Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(C, I)
a clustering algorithm computes a set C of instances with |C| = K that minimizes the objective function

    cost(C, I) = ∑_{x∈I} d²(x, C)

where
    d(x, c): distance function between x and c
    d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C
k-means
1. Choose k initial centers C = {c_1, . . . , c_k}
2. while stopping criterion has not been met
       For i = 1, . . . , N
           find the closest center c_k ∈ C to instance p_i
           assign instance p_i to cluster C_k
       For k = 1, . . . , K
           set c_k to be the center of mass of all points in C_k
k-means++
1. Choose an initial center c_1
   For k = 2, . . . , K
       select c_k = p ∈ I with probability d²(p, C)/cost(C, I)
2. while stopping criterion has not been met
       For i = 1, . . . , N
           find the closest center c_k ∈ C to instance p_i
           assign instance p_i to cluster C_k
       For k = 1, . . . , K
           set c_k to be the center of mass of all points in C_k
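A small Python sketch of the k-means++ seeding step above; after the k centers are chosen, the usual k-means loop runs unchanged. Points are plain numeric tuples and names are illustrative.

import random

def d2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeds(points, k):
    # first center uniformly at random, then each new center with probability
    # proportional to the squared distance to the nearest center chosen so far
    centers = [random.choice(points)]
    while len(centers) < k:
        dists = [min(d2(p, c) for c in centers) for p in points]
        r = random.uniform(0, sum(dists))
        acc = 0.0
        for p, d in zip(points, dists):
            acc += d
            if acc >= r:
                centers.append(p)
                break
    return centers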
Performance Measures
Internal Measures
    Sum of squared distances
    Dunn index D = d_min / d_max
    C-Index C = (S − S_min) / (S_max − S_min)

External Measures
    Rand Measure
    F Measure
    Jaccard
    Purity
BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Clustering Features CF = (N, LS, SS)
    N: number of data points
    LS: linear sum of the N data points
    SS: square sum of the N data points
Properties:
    Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
    Easy to compute: average inter-cluster distance and average intra-cluster distance

Uses a CF tree
Height-balanced tree with two parameters
    B: branching factor
    T: radius leaf threshold
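A small Python sketch of a clustering feature and its additivity; note that the centroid and the radius can be computed from (N, LS, SS) alone, which is why the raw points never need to be revisited. Names are illustrative.

class ClusteringFeature:
    # CF = (N, LS, SS) over d-dimensional points
    def __init__(self, d):
        self.N = 0
        self.LS = [0.0] * d
        self.SS = [0.0] * d

    def add(self, x):
        self.N += 1
        for i, v in enumerate(x):
            self.LS[i] += v
            self.SS[i] += v * v

    def merge(self, other):
        # additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.N += other.N
        self.LS = [a + b for a, b in zip(self.LS, other.LS)]
        self.SS = [a + b for a, b in zip(self.SS, other.SS)]

    def centroid(self):
        return [s / self.N for s in self.LS]

    def radius(self):
        # root mean squared distance of the points to the centroid
        c = self.centroid()
        var = sum(ss / self.N - ci * ci for ss, ci in zip(self.SS, c))
        return max(var, 0.0) ** 0.5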
BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF tree
Phase 2: Condense into a desirable range by building a smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and off-line, as it requires more passes)
Clu-Stream
Clu-Stream
Uses micro-clusters to store statistics on-line

Clustering Features CF = (N, LS, SS, LT, ST)
    N: number of data points
    LS: linear sum of the N data points
    SS: square sum of the N data points
    LT: linear sum of the time stamps
    ST: square sum of the time stamps

Uses a pyramidal time frame
Clu-Stream
On-line Phase
For each new point that arrives:
    either the point is absorbed by a micro-cluster,
    or the point starts a new micro-cluster of its own; to keep the number of micro-clusters bounded, either
        delete the oldest micro-cluster, or
        merge two of the oldest micro-clusters

Off-line Phase
Apply k-means using the micro-clusters as points
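A rough Python sketch of the on-line phase: the point is absorbed by the closest micro-cluster if it falls within a boundary, otherwise a new micro-cluster is started and, if the budget is exceeded, the stalest one is deleted (CluStream may instead merge two old micro-clusters, and derives the boundary from the micro-cluster's own deviation rather than a fixed threshold). Names are illustrative.

import math

class MicroCluster:
    # CF = (N, LS, SS, LT, ST) as defined above
    def __init__(self, x, t):
        self.N = 1
        self.LS = list(x)
        self.SS = [v * v for v in x]
        self.LT, self.ST = t, t * t

    def add(self, x, t):
        self.N += 1
        for i, v in enumerate(x):
            self.LS[i] += v
            self.SS[i] += v * v
        self.LT += t
        self.ST += t * t

    def centroid(self):
        return [s / self.N for s in self.LS]

def online_step(microclusters, x, t, max_clusters, boundary):
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    if microclusters:
        closest = min(microclusters, key=lambda m: dist(m.centroid(), x))
        if dist(closest.centroid(), x) <= boundary:
            closest.add(x, t)            # absorb the point
            return
    if len(microclusters) >= max_clusters:
        stalest = min(microclusters, key=lambda m: m.LT / m.N)   # oldest mean timestamp
        microclusters.remove(stalest)    # delete the stalest micro-cluster
    microclusters.append(MicroCluster(x, t))   # start a new micro-cluster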
StreamKM++: Coresets
Coreset of a set P with respect to some problem
A small subset that approximates the original set P.
Solving the problem for the coreset provides an approximate solution for the problem on P.

(k,ε)-coreset
A (k,ε)-coreset S of P is a subset of P such that for each C of size k

    (1 − ε) cost(P, C) ≤ cost_w(S, C) ≤ (1 + ε) cost(P, C)
StreamKM++: Coresets
Coreset Tree
Choose a leaf node l at random
Choose a new sample point, denoted q_{t+1}, from P_l according to d²
Based on q_l and q_{t+1}, split P_l into two subclusters and create two child nodes

StreamKM++
Maintain L = ⌈log₂(n/m) + 2⌉ buckets B0, B1, . . . , B_{L−1}
Frequent Pattern Mining
Frequent Patterns
Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.

Definition
Support(t): number of patterns in D that are superpatterns of t.

Definition
Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns in D.
Pattern Mining
Dataset Example

Document   Patterns
d1         abce
d2         cde
d3         abce
d4         acde
d5         abcde
d6         bcd
Itemset Mining

Supporting documents     Frequent
d1,d2,d3,d4,d5,d6        c
d1,d2,d3,d4,d5           e, ce
d1,d3,d4,d5              a, ac, ae, ace
d1,d3,d5,d6              b, bc
d2,d4,d5,d6              d, cd
d1,d3,d5                 ab, abc, abe, be, bce, abce
d2,d4,d5                 de, cde

minimal support = 3

Support   Frequent                        Gen      Closed   Max
6         c                               c        c
5         e, ce                           e        ce
4         a, ac, ae, ace                  a        ace
4         b, bc                           b        bc
4         d, cd                           d        cd
3         ab, abc, abe, be, bce, abce     ab, be   abce     abce
3         de, cde                         de       cde      cde

Each generator maps to its closure, e.g. e → ce and a → ace.
Closed Patterns
Usually, there are too many frequent patterns. We can compute a smaller set, while keeping the same information.

Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).
Closed Patterns
A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).

Definition
A frequent pattern t is closed if none of its proper superpatterns has the same support as it has.

Frequent subpatterns and their supports can be generated from the closed patterns.
Maximal Patterns
Definition
A frequent pattern t is maximal if none of its proper superpatterns is frequent.

Frequent subpatterns can be generated from the maximal patterns, but not with their support.
All maximal patterns are closed, but not all closed patterns are maximal.
Non-streaming frequent itemset miners

Representation:
Horizontal layout
    T1: a, b, c
    T2: b, c, e
    T3: b, d, e
Vertical layout
    a: 1 0 0
    b: 1 1 1
    c: 1 1 0

Search:
Breadth-first (levelwise): Apriori
Depth-first: Eclat, FP-Growth
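A compact Python sketch of depth-first mining in the Eclat style on the vertical layout: the support of an itemset is the size of the intersection of its items' tid-sets. The toy data is the d1..d6 example from the itemset mining slides; names are illustrative.

def eclat(tidsets, min_sup, prefix=(), out=None):
    # tidsets: item -> set of transaction ids containing it (vertical layout),
    # already conditioned on the current prefix
    if out is None:
        out = {}
    items = sorted(tidsets)
    for i, a in enumerate(items):
        ta = tidsets[a]
        if len(ta) < min_sup:
            continue
        itemset = prefix + (a,)
        out[itemset] = len(ta)
        # conditional vertical database via tid-set intersection
        suffix = {b: ta & tidsets[b] for b in items[i + 1:]}
        eclat(suffix, min_sup, itemset, out)
    return out

# d1..d6 from the itemset mining example, encoded as transaction ids 1..6
vertical = {"a": {1, 3, 4, 5}, "b": {1, 3, 5, 6}, "c": {1, 2, 3, 4, 5, 6},
            "d": {2, 4, 5, 6}, "e": {1, 2, 3, 4, 5}}
print(eclat(vertical, min_sup=3))   # includes e.g. ('a', 'b', 'c', 'e'): 3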
Mining Patterns over Data Streams
Requirements: fast, use a small amount of memory, and adaptive

Type:
    Exact
    Approximate
Per batch, per transaction
Incremental, Sliding Window, Adaptive
Frequent, Closed, Maximal patterns
Moment
Computes closed frequent itemsets in a sliding window
Uses a Closed Enumeration Tree
Uses 4 types of nodes:
    Closed Nodes
    Intermediate Nodes
    Unpromising Gateway Nodes
    Infrequent Gateway Nodes
Adding transactions: closed itemsets remain closed
Removing transactions: infrequent itemsets remain infrequent
FP-Stream
Mining Frequent Itemsets at Multiple Time Granularities
Based on FP-Growth
Maintains
    a pattern tree
    a tilted-time window
Allows answering time-sensitive queries
Keeps more detailed information on recent data
Drawback: time and memory complexity
Tree and Graph Mining: Dealing with time changes

Keep a window on recent stream elements
    Actually, just its lattice of closed sets!
Keep track of the number of closed patterns in the lattice, N
Use some change detector on N
When change is detected:
    Drop the stale part of the window
    Update the lattice to reflect this deletion, using the deletion rule
Alternatively, use a sliding window of some fixed size
Summary
A Short Course in Data Stream Mining
Short Course Summary
1 Introduction
2 Classification
3 Ensemble Methods
4 Clustering
5 Frequent Pattern Mining

Open Source Software
1 MOA: http://moa.cms.waikato.ac.nz/
2 SAMOA: http://samoa-project.net/