A Short Course in Data Stream Mining
Albert Bifet
Huawei Noah’s Ark Lab
Shenzhen, 23 January 2015


TRANSCRIPT

Page 1: A Short Course in Data Stream Mining

A Short Course in Data Stream Mining

Albert Bifet

Shenzhen, 23 January 2015
Huawei Noah’s Ark Lab

Page 2: A Short Course in Data Stream Mining

Real-time analytics

Page 4: A Short Course in Data Stream Mining

Introduction: Data Streams

Data Streams

Sequence is potentially infinite

High amount of data: sublinear space

High speed of arrival: sublinear time per example

Once an element from a data stream has been processed, it is discarded or archived

Example
Puzzle: Finding Missing Numbers

Let π be a permutation of {1, . . . , n}. Let π−1 be π with one element missing.

π−1[i] arrives in increasing order

Task: Determine the missing number

Page 5: A Short Course in Data Stream Mining

Naive solution: use an n-bit vector to memorize all the numbers seen (O(n) space).

Page 6: A Short Course in Data Stream Mining

Data stream solution: O(log(n)) space suffices.

Page 7: A Short Course in Data Stream Mining

Data stream solution: O(log(n)) space. Store the running difference

n(n+1)/2 − ∑_{j≤i} π−1[j]

which, once the whole stream has arrived, equals the missing number.
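A minimal Python sketch of this streaming solution (the function name and the toy stream below are illustrative, not from the slides):

def find_missing(stream, n):
    # Keep only n(n+1)/2 minus the running sum of the elements seen so far;
    # this single counter needs O(log n) bits.
    remainder = n * (n + 1) // 2
    for x in stream:
        remainder -= x
    return remainder

# Example: permutation of {1,...,5} with 4 missing
print(find_missing([2, 5, 1, 3], 5))   # prints 4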

Page 8: A Short Course in Data Stream Mining

Data Streams

Data Streams

Sequence is potentially infinite

High amount of data: sublinear space

High speed of arrival: sublinear time per example

Once an element from a data stream has been processed, it is discarded or archived

Tools:
approximation
randomization, sampling
sketching

Page 9: A Short Course in Data Stream Mining

Data Streams

Data Streams

Sequence is potentially infinite

High amount of data: sublinear space

High speed of arrival: sublinear time per example

Once an element from a data stream has been processed, it is discarded or archived

Approximation algorithms
Small error rate with high probability

An algorithm (ε,δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
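As an illustration of the definition, a small Python sketch that empirically estimates the failure rate (the role of δ) of a simple sampling estimator; the stream, sample size and ε are arbitrary choices made for this sketch:

import random

random.seed(0)
# A long 0/1 stream whose true fraction of 1's is F
stream = [1 if random.random() < 0.3 else 0 for _ in range(100000)]
F = sum(stream) / len(stream)

# Estimate how often a sampling estimator F~ misses F by more than eps*F
eps, trials, failures = 0.1, 200, 0
for _ in range(trials):
    sample = random.choices(stream, k=2000)
    F_hat = sum(sample) / len(sample)
    if abs(F_hat - F) > eps * F:
        failures += 1
print(F, failures / trials)   # the observed failure rate estimates delta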

Page 10: A Short Course in Data Stream Mining

Data Streams Approximation Algorithms

1011000111 1010101

Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where

N is the length of the sliding window

ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002

Page 16: A Short Course in Data Stream Mining

Classification

Page 17: A Short Course in Data Stream Mining

Classification

Definition
Given nC different classes, a classifier algorithm builds a model that predicts, for every unlabelled instance I, the class C to which it belongs with accuracy.

Example
A spam filter

Example
Twitter sentiment analysis: analyze tweets with positive or negative feelings

Page 18: A Short Course in Data Stream Mining

Data stream classification cycle

1 Process an example at a time, and inspect it only once (at most)

2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point

Page 19: A Short Course in Data Stream Mining

Classification

Data set that describes e-mail features for deciding if it is spam.

Example

Contains   Domain   Has       Time
“Money”    type     attach.   received   spam
yes        com      yes       night      yes
yes        edu      no        night      yes
no         com      yes       night      yes
no         edu      no        day        no
no         com      no        day        no
yes        cat      no        day        yes

Assume we have to classify the following new instance:

Contains   Domain   Has       Time
“Money”    type     attach.   received   spam
yes        edu      yes       day        ?

Page 20: A Short Course in Data Stream Mining

Bayes Classifiers

Naïve Bayes
Based on Bayes’ Theorem:

P(c|d) = P(c) P(d|c) / P(d)

posterior = prior × likelihood / evidence

Estimates the probability of observing attribute a and the prior probability P(c)

Probability of class c given an instance d:

P(c|d) = P(c) ∏_{a∈d} P(a|c) / P(d)
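A minimal sketch of an incremental Naïve Bayes for categorical attributes, trained one example at a time on the spam table from the earlier slide; the class interface and the Laplace smoothing used here are illustrative choices, not part of the slides:

from collections import defaultdict

class NaiveBayes:
    def __init__(self):
        self.n = 0
        self.class_count = defaultdict(int)      # counts per class
        self.attr_count = defaultdict(int)       # counts per (class, attribute index, value)

    def update(self, x, y):                      # one example at a time, seen only once
        self.n += 1
        self.class_count[y] += 1
        for i, v in enumerate(x):
            self.attr_count[(y, i, v)] += 1

    def predict(self, x):
        def score(c):
            p = self.class_count[c] / self.n     # prior P(c)
            for i, v in enumerate(x):            # likelihoods P(a|c), Laplace-smoothed
                p *= (self.attr_count[(c, i, v)] + 1) / (self.class_count[c] + 2)
            return p
        return max(self.class_count, key=score)

nb = NaiveBayes()
rows = [("yes", "com", "yes", "night"), ("yes", "edu", "no", "night"),
        ("no", "com", "yes", "night"), ("no", "edu", "no", "day"),
        ("no", "com", "no", "day"), ("yes", "cat", "no", "day")]
labels = ["yes", "yes", "yes", "no", "no", "yes"]
for x, y in zip(rows, labels):
    nb.update(x, y)
print(nb.predict(("yes", "edu", "yes", "day")))   # "yes" (spam)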

Page 21: A Short Course in Data Stream Mining

Bayes Classifiers

Multinomial Naïve Bayes
Considers a document as a bag of words.

Estimates the probability of observing word w and the prior probability P(c)

Probability of class c given a test document d:

P(c|d) = P(c) ∏_{w∈d} P(w|c)^{n_wd} / P(d)

where n_wd is the number of times word w occurs in d

Page 22: A Short Course in Data Stream Mining

Perceptron

[Figure: a perceptron with inputs Attribute 1–5, weights w1–w5 and output h_w(x_i)]

Data stream: ⟨x_i, y_i⟩

Classical perceptron: h_w(x_i) = sgn(w^T x_i)

Minimize mean-square error: J(w) = ½ ∑ (y_i − h_w(x_i))²

Page 23: A Short Course in Data Stream Mining

Perceptron

[Figure: the same perceptron diagram with inputs Attribute 1–5, weights w1–w5 and output h_w(x_i)]

We use the sigmoid function h_w = σ(w^T x), where

σ(x) = 1/(1 + e^(−x))

σ′(x) = σ(x)(1 − σ(x))

Page 24: A Short Course in Data Stream Mining

Perceptron

Minimize mean-square error: J(w) = ½ ∑ (y_i − h_w(x_i))²

Stochastic Gradient Descent: w = w − η ∇J_{x_i}

Gradient of the error function:

∇J = −∑_i (y_i − h_w(x_i)) ∇h_w(x_i)

∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i))

Weight update rule:

w = w + η ∑_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i

Page 25: A Short Course in Data Stream Mining

Perceptron

PERCEPTRON LEARNING(Stream,η)

1 for each class
2   do PERCEPTRON LEARNING(Stream, class, η)

PERCEPTRON LEARNING(Stream, class, η)

1 ▷ Let w0 and w be randomly initialized
2 for each example (x, y) in Stream
3   do if class = y
4        then δ = (1 − h_w(x)) · h_w(x) · (1 − h_w(x))
5        else δ = (0 − h_w(x)) · h_w(x) · (1 − h_w(x))
6      w = w + η · δ · x

PERCEPTRON PREDICTION(x)

1 return arg max_class h_{w_class}(x)
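A minimal Python sketch of the sigmoid-perceptron online update above for a single binary class; the learning rate, the initialization and the synthetic stream are illustrative choices:

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_learning(stream, n_features, eta=0.1):
    # Online stochastic gradient descent on the mean-square error
    w = [random.uniform(-0.05, 0.05) for _ in range(n_features)]
    for x, y in stream:                          # y is 0 or 1
        h = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        delta = (y - h) * h * (1 - h)            # (y - h_w(x)) * h_w(x) * (1 - h_w(x))
        w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w

# Illustrative stream: the label is 1 exactly when the first feature exceeds the second
random.seed(1)
stream = [((x1, x2), 1 if x1 > x2 else 0)
          for x1, x2 in ((random.random(), random.random()) for _ in range(5000))]
print(perceptron_learning(stream, 2))            # typically one positive and one negative weight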

Page 27: A Short Course in Data Stream Mining

Classification

Assume we have to classify the following new instance:

Contains   Domain   Has       Time
“Money”    type     attach.   received   spam
yes        edu      yes       day        ?

[Figure: decision tree. Root “Time”: Night → YES; Day → test “Contains ‘Money’”: Yes → YES, No → NO. The new instance is classified YES (spam).]

Page 28: A Short Course in Data Stream Mining

Decision Trees

Basic induction strategy:

A← the “best” decision attribute for next node

Assign A as decision attribute for node

For each value of A, create new descendant of node

Sort training examples to leaf nodes

If training examples are perfectly classified, then STOP, else iterate over new leaf nodes

Page 29: A Short Course in Data Stream Mining

Hoeffding Trees

Hoeffding Tree: VFDT

Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000

With high probability, constructs a model identical to the one a traditional (greedy) method would learn

With theoretical guarantees on the error rate

[Figure: decision tree. Root “Time”: Night → YES; Day → test “Contains ‘Money’”: Yes → YES, No → NO]

Page 30: A Short Course in Data Stream Mining

Hoeffding Bound Inequality

Bounds the probability that a random variable deviates from its expected value.

Page 31: A Short Course in Data Stream Mining

Hoeffding Bound Inequality
Let X = ∑_i X_i where X_1, . . . , X_n are independent and identically distributed in [0,1]. Then:

1 Chernoff: for each ε < 1,
  Pr[X > (1 + ε) E[X]] ≤ exp(−ε² E[X] / 3)

2 Hoeffding: for each t > 0,
  Pr[X > E[X] + t] ≤ exp(−2t²/n)

3 Bernstein: let σ² = ∑_i σ_i² be the variance of X. If X_i − E[X_i] ≤ b for each i ∈ [n], then for each t > 0,
  Pr[X > E[X] + t] ≤ exp(−t² / (2σ² + (2/3) b t))

Page 32: A Short Course in Data Stream Mining

Hoeffding Tree or VFDT

HT(Stream, δ)

1 ▷ Let HT be a tree with a single leaf (root)
2 ▷ Init counts n_ijk at root
3 for each example (x, y) in Stream
4   do HTGROW((x, y), HT, δ)

HTGROW((x, y), HT, δ)

1 ▷ Sort (x, y) to leaf l using HT
2 ▷ Update counts n_ijk at leaf l
3 if examples seen so far at l are not all of the same class
4   then ▷ Compute G for each attribute
5     if G(Best Attr.) − G(2nd best) > sqrt(R² ln(1/δ) / (2n))
6       then ▷ Split leaf on best attribute
7         for each branch
8           do ▷ Start new leaf and initialize counts
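A small Python helper for the split test in line 5, assuming R is the range of G (log2 of the number of classes when G is information gain) and n is the number of examples seen at the leaf; the tie-breaking threshold τ (introduced on a later slide) and its default value here are illustrative:

import math

def hoeffding_bound(R, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n, tau=0.05):
    eps = hoeffding_bound(R, delta, n)
    # Split when the best attribute is clearly better than the runner-up,
    # or break the tie once the bound itself falls below tau
    return (g_best - g_second) > eps or eps < tau

print(hoeffding_bound(R=1.0, delta=1e-7, n=1000))   # about 0.09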

Page 34: A Short Course in Data Stream Mining

Hoeffding Trees

HT features
With high probability, constructs a model identical to the one a traditional (greedy) method would learn

Ties: when two attributes have similar G, split if

G(Best Attr.) − G(2nd best) < sqrt(R² ln(1/δ) / (2n)) < τ

Compute G every n_min instances

Memory: deactivate least promising nodes with lower p_l × e_l

p_l is the probability to reach leaf l
e_l is the error in the node

Page 35: A Short Course in Data Stream Mining

Hoeffding Naive Bayes Tree

Hoeffding Tree
Majority Class learner at leaves

Hoeffding Naive Bayes Tree

G. Holmes, R. Kirkby, and B. Pfahringer. Stress-testing Hoeffding trees, 2005.

monitors accuracy of a Majority Class learner

monitors accuracy of a Naive Bayes learner

predicts using the most accurate method

Page 36: A Short Course in Data Stream Mining

Bagging

Example
Dataset of 4 instances: A, B, C, D

Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, with a bootstrap sample created by drawing random samples with replacement.

Page 37: A Short Course in Data Stream Mining

Bagging

Example
Dataset of 4 instances: A, B, C, D

Classifier 1: A, B, B, C
Classifier 2: A, B, D, D
Classifier 3: A, B, B, C
Classifier 4: B, B, B, C
Classifier 5: A, C, C, D

Bagging builds a set of M base models, with a bootstrap sample created by drawing random samples with replacement.

Page 38: A Short Course in Data Stream Mining

Bagging

Example
Dataset of 4 instances: A, B, C, D

Classifier 1: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D: A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C: A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D: A(1) B(0) C(2) D(1)

Each base model’s training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.

Page 39: A Short Course in Data Stream Mining

Bagging

[Figure: Poisson(1) Distribution]

Each base model’s training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution; for large datasets this binomial tends to a Poisson(1) distribution.

Page 40: A Short Course in Data Stream Mining

Oza and Russell’s Online Bagging for M models

1: Initialize base models h_m for all m ∈ {1, 2, ..., M}
2: for all training examples do
3:   for m = 1, 2, ..., M do
4:     Set w = Poisson(1)
5:     Update h_m with the current example with weight w
6: anytime output:
7: return hypothesis: h_fin(x) = arg max_{y∈Y} ∑_{t=1}^{T} I(h_t(x) = y)
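A minimal Python sketch of this online bagging scheme, assuming base learners that expose incremental update and predict methods; the stand-in majority-class learner and the Poisson(1) sampler are illustrative:

import math
import random
from collections import defaultdict

def poisson1():
    # Sample K ~ Poisson(1) with Knuth's inversion method
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

class MajorityClass:
    # Stand-in incremental base learner: predicts the class with the largest total weight
    def __init__(self):
        self.counts = defaultdict(float)
    def update(self, x, y, weight):
        self.counts[y] += weight
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

class OnlineBagging:
    def __init__(self, M, base=MajorityClass):
        self.models = [base() for _ in range(M)]
    def update(self, x, y):
        for h in self.models:                    # each model sees the example Poisson(1) times
            w = poisson1()
            if w > 0:
                h.update(x, y, w)
    def predict(self, x):                        # anytime output: majority vote
        votes = defaultdict(int)
        for h in self.models:
            votes[h.predict(x)] += 1
        return max(votes, key=votes.get)

ob = OnlineBagging(M=10)
for x, y in [((0,), "a"), ((1,), "b"), ((1,), "b")]:
    ob.update(x, y)
print(ob.predict((1,)))                          # most likely "b"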

Page 41: A Short Course in Data Stream Mining

Evolving Stream Classification

Page 42: A Short Course in Data Stream Mining

Data Mining Algorithms with Concept Drift

No Concept Drift
[Figure: input → DM Algorithm (maintaining Counter1 ... Counter5) → output]

Concept Drift
[Figure: input → DM Algorithm with a Static Model → output, with a Change Detector monitoring the output]

Page 43: A Short Course in Data Stream Mining

Data Mining Algorithms with Concept Drift

No Concept Drift
[Figure: input → DM Algorithm (maintaining Counter1 ... Counter5) → output]

Concept Drift
[Figure: input → DM Algorithm → output, with the counters replaced by Estimator1 ... Estimator5]

Page 44: A Short Course in Data Stream Mining

Optimal Change Detector and Predictor

High accuracy

Low false positives and false negatives ratios

Theoretical guarantees

Fast detection of change

Low computational cost: minimum space and time needed

No parameters needed

Page 45: A Short Course in Data Stream Mining

Algorithm ADaptive Sliding WINdow (ADWIN)

Example
W = 101010110111111
W0 = 1

ADWIN: ADAPTIVE WINDOWING ALGORITHM

1 Initialize Window W
2 for each t > 0
3   do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
4      repeat Drop elements from the tail of W
5      until |μ̂_W0 − μ̂_W1| ≥ ε_c holds
6        for every split of W into W = W0 · W1
7      Output μ̂_W

Pages 46-56: A Short Course in Data Stream Mining

Algorithm ADaptive Sliding WINdow

Example (continued)
The slides step through every split W = W0 · W1 of W = 101010110111111:
W0 = 1, 10, 101, 1010, 10101, 101010, 1010101, 10101011, ... with W1 the remainder of the window.
When W0 = 101010110 and W1 = 111111, the test |μ̂_W0 − μ̂_W1| ≥ ε_c holds: CHANGE DETECTED.
Elements are then dropped from the tail of W, leaving W = 01010110111111.

Page 57: A Short Course in Data Stream Mining

Algorithm ADaptive Sliding WINdow

Theorem
At every time step we have:

1 (False positive rate bound). If μ_t remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.

2 (False negative rate bound). Suppose that for some partition of W into two parts W0 W1 (where W1 contains the most recent items) we have |μ_W0 − μ_W1| > 2ε_c. Then with probability 1 − δ ADWIN shrinks W to W1, or shorter.

ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.
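A deliberately naive Python sketch of ADWIN's behaviour: it stores the whole window and re-checks every split in O(|W|) time per element, and it uses a simplified Hoeffding-style cut threshold, so it only illustrates the idea, not the efficient bucket-based implementation of the next slide:

import math

def eps_cut(n0, n1, delta):
    # Simplified Hoeffding-style threshold; the exact expression in the ADWIN paper differs slightly
    m = 1.0 / (1.0 / n0 + 1.0 / n1)             # harmonic mean of the two sub-window sizes
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

class NaiveADWIN:
    def __init__(self, delta=0.01):
        self.window = []
        self.delta = delta

    def add(self, x):
        self.window.append(x)                    # newest element at the end
        changed = True
        while changed and len(self.window) > 1:
            changed = False
            n, total = len(self.window), sum(self.window)
            s0 = 0.0
            for i in range(1, n):                # every split W = W0 . W1
                s0 += self.window[i - 1]
                mu0, mu1 = s0 / i, (total - s0) / (n - i)
                if abs(mu0 - mu1) >= eps_cut(i, n - i, self.delta):
                    self.window = self.window[i:]    # change detected: drop the older part W0
                    changed = True
                    break
        return sum(self.window) / len(self.window)   # output the window mean

adwin = NaiveADWIN(delta=0.01)
for x in [0] * 100 + [1] * 100:                  # an abrupt change halfway through
    adwin.add(x)
print(len(adwin.window))                         # about 100: only the data after the change remains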

Page 58: A Short Course in Data Stream Mining

Algorithm ADaptive Sliding WINdow

ADWIN, using a Data Stream Sliding Window Model,

can provide the exact counts of 1’s in O(1) time per point

tries O(log W) cutpoints

uses O((1/ε) log W) memory words

the processing time per example is O(log W) (amortized and worst-case)

Sliding Window Model

Buckets:    1010101   101   11   1   1
Content:    4         2     2    1   1
Capacity:   7         3     2    1   1

Page 59: A Short Course in Data Stream Mining

VFDT / CVFDT

Concept-adapting Very Fast Decision Trees: CVFDT

G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. 2001

It keeps its model consistent with a sliding window of examples

Constructs “alternative branches” as preparation for changes

If an alternative branch becomes more accurate, a switch of tree branches occurs

[Figure: decision tree. Root “Time”: Night → YES; Day → test “Contains ‘Money’”: Yes → YES, No → NO]

Page 60: A Short Course in Data Stream Mining

Decision Trees: CVFDT

[Figure: the decision tree from the spam example. Root “Time”: Night → YES; Day → test “Contains ‘Money’”: Yes → YES, No → NO]

No theoretical guarantees on the error rate of CVFDT

CVFDT parameters:
1 W: the example window size
2 T0: number of examples used to check at each node if the splitting attribute is still the best
3 T1: number of examples used to build the alternate tree
4 T2: number of examples used to test the accuracy of the alternate tree

Page 61: A Short Course in Data Stream Mining

Decision Trees: Hoeffding Adaptive Tree

Hoeffding Adaptive Tree:

replaces frequency statistics counters by estimators

no window of examples needs to be stored, since the required statistics are maintained by the estimators

changes the way alternate subtrees are checked for substitution, using a change detector with theoretical guarantees

Advantages over CVFDT:
1 Theoretical guarantees
2 No parameters

Page 62: A Short Course in Data Stream Mining

ADWIN Bagging (KDD’09)

ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)

on the ratio of false positives and negatives

on the relation between the size of the current window and the change rates

ADWIN Bagging
When a change is detected, the worst classifier is removed and a new classifier is added.

Page 63: A Short Course in Data Stream Mining

Leveraging Bagging for Evolving Data Streams (ECML-PKDD’10)

Randomization as a powerful tool to increase accuracy and diversity

There are three ways of using randomization:

Manipulating the input data

Manipulating the classifier algorithms

Manipulating the output targets

Page 64: A Short Course in Data Stream Mining

Leveraging Bagging for Evolving Data Streams

Leveraging Bagging
Using Poisson(λ)

Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes

Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = e_T/(1 − e_T)

Page 65: A Short Course in Data Stream Mining

Empirical evaluation

                         Accuracy   RAM-Hours
Hoeffding Tree           74.03%     0.01
Online Bagging           77.15%     2.98
ADWIN Bagging            79.24%     1.48
Leveraging Bagging       85.54%     20.17
Leveraging Bagging MC    85.37%     22.04
Leveraging Bagging ME    80.77%     0.87

Leveraging Bagging
Using Poisson(λ)

Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes

Leveraging Bagging ME
Using weight 1 if misclassified, otherwise e_T/(1 − e_T)

Page 66: A Short Course in Data Stream Mining

Clustering

Page 67: A Short Course in Data Stream Mining

Clustering

Definition
Clustering is the assignment of a set of instances into previously unknown groups according to some common relations or affinities.

Example
Market segmentation of customers

Example
Social network communities

Page 68: A Short Course in Data Stream Mining

Clustering

Definition
Given

a set of instances I

a number of clusters K

an objective function cost(I)

a clustering algorithm computes an assignment of a cluster to each instance

f : I → {1, . . . , K}

that minimizes the objective function cost(I)

Page 69: A Short Course in Data Stream Mining

Clustering

Definition
Given

a set of instances I

a number of clusters K

an objective function cost(C, I)

a clustering algorithm computes a set C of instances with |C| = K that minimizes the objective function

cost(C, I) = ∑_{x∈I} d²(x, C)

where

d(x, c): distance function between x and c

d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C

Page 70: A Short Course in Data Stream Mining

k-means

1. Choose k initial centers C = {c1, . . . , ck}
2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k

Page 71: A Short Course in Data Stream Mining

k-means++

1. Choose an initial center c1
   For k = 2, . . . , K
     select c_k = p ∈ I with probability d²(p, C)/cost(C, I)

2. while stopping criterion has not been met
     For i = 1, . . . , N
       find the closest center c_k ∈ C to instance p_i
       assign instance p_i to cluster C_k
     For k = 1, . . . , K
       set c_k to be the center of mass of all points in C_k
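A minimal Python sketch of the k-means++ seeding step (the Lloyd iterations of the previous slide are omitted; the point data is illustrative):

import random

def d2(p, c):
    # squared Euclidean distance between two points
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeanspp_seeding(points, K):
    centers = [random.choice(points)]             # 1. first center chosen at random
    for _ in range(1, K):
        dists = [min(d2(p, c) for c in centers) for p in points]   # d^2(p, C)
        total = sum(dists)                        # cost(C, I)
        r = random.uniform(0, total)
        acc = 0.0
        for p, d in zip(points, dists):           # pick p with probability d^2(p, C) / cost(C, I)
            acc += d
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])            # numerical edge-case fallback
    return centers

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(kmeanspp_seeding(points, 3))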

Page 72: A Short Course in Data Stream Mining

Performance Measures

Internal Measures
Sum of squared distances

Dunn index D = d_min / d_max

C-Index C = (S − S_min)/(S_max − S_min)

External Measures
Rand Measure

F Measure

Jaccard

Purity

Page 73: A Short Course in Data Stream Mining

BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Clustering Features CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points
SS: square sum of the N data points

Properties:
Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
Easy to compute: average inter-cluster distance and average intra-cluster distance

Uses a CF tree
Height-balanced tree with two parameters
B: branching factor
T: radius leaf threshold
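A small Python sketch of a clustering feature and its additivity property (only the CF vector itself, not the CF tree); the class layout and the radius formula used here are illustrative:

class CF:
    # Clustering Feature (N, LS, SS) for d-dimensional points
    def __init__(self, dim):
        self.N = 0
        self.LS = [0.0] * dim               # linear sum of the points
        self.SS = 0.0                       # sum of squared norms of the points

    def add(self, x):
        self.N += 1
        self.LS = [a + b for a, b in zip(self.LS, x)]
        self.SS += sum(v * v for v in x)

    def merge(self, other):
        # Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.N += other.N
        self.LS = [a + b for a, b in zip(self.LS, other.LS)]
        self.SS += other.SS

    def centroid(self):
        return [v / self.N for v in self.LS]

    def radius(self):
        # average distance of the points from the centroid, computable from (N, LS, SS) alone
        c2 = sum(v * v for v in self.centroid())
        return max(self.SS / self.N - c2, 0.0) ** 0.5

cf1, cf2 = CF(2), CF(2)
for p in [(0, 0), (1, 1)]:
    cf1.add(p)
for p in [(2, 2), (3, 3)]:
    cf2.add(p)
cf1.merge(cf2)
print(cf1.N, cf1.centroid(), round(cf1.radius(), 3))   # 4 [1.5, 1.5] 1.581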

Page 74: A Short Course in Data Stream Mining

BIRCH

BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF tree

Phase 2: Condense into a desirable range by building a smaller CF tree (optional)

Phase 3: Global clustering

Phase 4: Cluster refining (optional and off-line, as it requires more passes)

Page 75: A Short Course in Data Stream Mining

Clu-Stream

Clu-Stream
Uses micro-clusters to store statistics on-line

Clustering Features CF = (N, LS, SS, LT, ST)
N: number of data points
LS: linear sum of the N data points
SS: square sum of the N data points
LT: linear sum of the time stamps
ST: square sum of the time stamps

Uses a pyramidal time frame

Page 76: A Short Course in Data Stream Mining

Clu-Stream

On-line Phase
For each new point that arrives, either:

the point is absorbed by a micro-cluster, or

the point starts a new micro-cluster of its own; to keep the number of micro-clusters bounded, either
  delete the oldest micro-cluster, or
  merge two of the oldest micro-clusters

Off-line Phase
Apply k-means using micro-clusters as points

Page 77: A Short Course in Data Stream Mining

StreamKM++: Coresets

Coreset of a set P with respect to some problem
A small subset that approximates the original set P.

Solving the problem for the coreset provides an approximate solution for the problem on P.

(k,ε)-coreset
A (k,ε)-coreset S of P is a subset of P such that for each C of size k

(1 − ε) cost(P, C) ≤ cost_w(S, C) ≤ (1 + ε) cost(P, C)

Page 78: A Short Course in Data Stream Mining

StreamKM++: Coresets

Coreset Tree
Choose a leaf node l at random

Choose a new sample point, denoted q_{t+1}, from P_l according to d²

Based on q_l and q_{t+1}, split P_l into two subclusters and create two child nodes

StreamKM++
Maintain L = ⌈log₂(n/m) + 2⌉ buckets B0, B1, . . . , B_{L−1}

Page 79: A Short Course in Data Stream Mining

Frequent Pattern Mining

Page 80: A Short Course in Data Stream Mining

Frequent Patterns

Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.

Definition
Support(t): number of patterns in D that are superpatterns of t.

Definition
Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns in D.

Page 84: A Short Course in Data Stream Mining

Pattern Mining

Dataset Example

Document   Pattern
d1         a b c e
d2         c d e
d3         a b c e
d4         a c d e
d5         a b c d e
d6         b c d

Page 85: A Short Course in Data Stream Mining

Itemset Mining

d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd

Support               Frequent
d1,d2,d3,d4,d5,d6     c
d1,d2,d3,d4,d5        e, ce
d1,d3,d4,d5           a, ac, ae, ace
d1,d3,d5,d6           b, bc
d2,d4,d5,d6           d, cd
d1,d3,d5              ab, abc, abe, be, bce, abce
d2,d4,d5              de, cde

minimal support = 3

Page 86: A Short Course in Data Stream Mining

Itemset Mining

d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd

Support   Frequent
6         c
5         e, ce
4         a, ac, ae, ace
4         b, bc
4         d, cd
3         ab, abc, abe, be, bce, abce
3         de, cde

Page 87: A Short Course in Data Stream Mining

Itemset Mining

d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd

Support   Frequent                        Gen      Closed
6         c                               c        c
5         e, ce                           e        ce
4         a, ac, ae, ace                  a        ace
4         b, bc                           b        bc
4         d, cd                           d        cd
3         ab, abc, abe, be, bce, abce     ab, be   abce
3         de, cde                         de       cde

Page 88: A Short Course in Data Stream Mining

Itemset Mining

d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd

Support   Frequent                        Gen      Closed   Max
6         c                               c        c
5         e, ce                           e        ce
4         a, ac, ae, ace                  a        ace
4         b, bc                           b        bc
4         d, cd                           d        cd
3         ab, abc, abe, be, bce, abce     ab, be   abce     abce
3         de, cde                         de       cde      cde

Pages 89-94: A Short Course in Data Stream Mining

Itemset Mining

The same table is repeated, highlighting how each generator maps to its closed itemset, e.g. e → ce and a → ace.
Page 95: A Short Course in Data Stream Mining

Closed Patterns

Usually, there are too many frequent patterns. We can compute a smaller set, while keeping the same information.

Example

A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).

Page 96: A Short Course in Data Stream Mining

Closed Patterns

A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).

Definition
A frequent pattern t is closed if none of its proper superpatterns has the same support as it has.

Frequent subpatterns and their supports can be generated from closed patterns.

Page 97: A Short Course in Data Stream Mining

Maximal Patterns

Definition
A frequent pattern t is maximal if none of its proper superpatterns is frequent.

Frequent subpatterns can be generated from maximal patterns, but not with their support.

All maximal patterns are closed, but not all closed patterns are maximal.
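A brute-force Python sketch that recovers the frequent, closed and maximal itemsets of the toy dataset from the earlier slides; it enumerates all subsets, so it is for illustration only (real miners such as Apriori, Eclat or Moment avoid this):

from itertools import combinations

docs = {"d1": set("abce"), "d2": set("cde"), "d3": set("abce"),
        "d4": set("acde"), "d5": set("abcde"), "d6": set("bcd")}
min_sup = 3
items = sorted(set().union(*docs.values()))

def support(itemset):
    return sum(1 for d in docs.values() if itemset <= d)

frequent = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = support(set(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# Closed: no proper superpattern with the same support; maximal: no frequent proper superpattern
closed = [t for t in frequent if not any(t < u and frequent[u] == frequent[t] for u in frequent)]
maximal = [t for t in frequent if not any(t < u for u in frequent)]

print(len(frequent))                                   # 19 frequent itemsets
print(sorted("".join(sorted(t)) for t in closed))      # ['abce', 'ace', 'bc', 'c', 'cd', 'cde', 'ce']
print(sorted("".join(sorted(t)) for t in maximal))     # ['abce', 'cde']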

Page 98: A Short Course in Data Stream Mining

Non-streaming frequent itemset miners

Representation:

Horizontal layout
T1: a, b, c
T2: b, c, e
T3: b, d, e

Vertical layout
a: 1 0 0
b: 1 1 1
c: 1 1 0

Search:
Breadth-first (levelwise): Apriori

Depth-first: Eclat, FP-Growth

Page 99: A Short Course in Data Stream Mining

Mining Patterns over Data Streams

Requirements: fast, small memory footprint, and adaptive

Type:
Exact
Approximate

Per batch, per transaction

Incremental, Sliding Window, Adaptive

Frequent, Closed, Maximal patterns

Page 100: A Short Course in Data Stream Mining

Moment

Computes closed frequent itemsets in a sliding window

Uses a Closed Enumeration Tree
Uses 4 types of nodes:
Closed Nodes
Intermediate Nodes
Unpromising Gateway Nodes
Infrequent Gateway Nodes

Adding transactions: closed itemsets remain closed

Removing transactions: infrequent itemsets remain infrequent

Page 101: A Short Course in Data Stream Mining

FP-Stream

Mining Frequent Itemsets at Multiple Time Granularities

Based on FP-Growth
Maintains
a pattern tree
a tilted-time window

Allows answering time-sensitive queries

Keeps more detailed information for recent data

Drawback: time and memory complexity

Page 102: A Short Course in Data Stream Mining

Tree and Graph Mining: Dealing with time changes

Keep a window on recent stream elements
Actually, just its lattice of closed sets!

Keep track of the number of closed patterns in the lattice, N

Use some change detector on N
When change is detected:
Drop the stale part of the window
Update the lattice to reflect this deletion, using the deletion rule

Alternatively, use a sliding window of some fixed size

Page 103: A Short Course in Data Stream Mining

Summary

Page 104: A Short Course in Data Stream Mining

A Short Course in Data Stream Mining

Short Course Summary

1 Introduction
2 Classification
3 Ensemble Methods
4 Clustering
5 Frequent Pattern Mining

Open Source Software
1 MOA: http://moa.cms.waikato.ac.nz/
2 SAMOA: http://samoa-project.net/