a short course in data stream mining
TRANSCRIPT
A Short Course in Data Stream Mining
Albert Bifet
Huawei Noah's Ark Lab, Shenzhen, 23 January 2015
Real time analytics
Real time analytics
Introduction: Data Streams

Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived

Example (Puzzle): Finding Missing Numbers
Let π be a permutation of {1, . . . , n}.
Let π−1 be π with one element missing.
π−1[i] arrives in increasing order
Task: Determine the missing number
One solution: use an n-bit vector to memorize all the numbers (O(n) space)
Data Streams: O(log n) space.
Store n(n+1)/2 − ∑_{j≤i} π−1[j]; once the stream ends, this running difference is the missing number.
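A minimal Python sketch of this streaming solution (names are illustrative): only the running difference is kept, which fits in O(log n) bits.

def find_missing(stream, n):
    # n(n+1)/2 is the sum of 1..n; subtract each value as it arrives,
    # so only one integer of O(log n) bits is ever stored
    missing = n * (n + 1) // 2
    for x in stream:
        missing -= x
    return missing

# Example: a permutation of 1..6 with 4 missing
print(find_missing([1, 2, 3, 5, 6], 6))   # prints 4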
Data Streams
Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived

Tools:
approximation
randomization, sampling
sketching
Data Streams
Approximation algorithms
Small error rate with high probability
An algorithm (ε,δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
Data Streams Approximation Algorithms

Example bit stream (a sliding window moves over the most recent bits):
1011000111 1010101

Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where
N is the length of the sliding window
ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002
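A rough Python sketch of the exponential-histogram idea behind this result: bucket sizes are powers of two, only a few buckets of each size are kept, and the number of 1s in the window is answered approximately. Class and parameter names are illustrative, and this simplified merge loop is not the paper's exact algorithm.

class BitWindowCounter:
    def __init__(self, window_size, max_per_size=2):
        self.N = window_size
        self.k = max_per_size
        self.buckets = []      # (timestamp of most recent 1, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # discard buckets whose most recent 1 has fallen out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            self._merge()

    def _merge(self):
        size = 1
        while sum(1 for _, s in self.buckets if s == size) > self.k:
            # merge the two oldest buckets of this size into one of double size
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            a, b = idx[-2], idx[-1]
            newer_ts = self.buckets[a][0]
            del self.buckets[b]
            self.buckets[a] = (newer_ts, size * 2)
            size *= 2

    def count_ones(self):
        # every bucket counted fully except the oldest, counted at half its size
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2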
Classification
Classification
Definition
Given n_C different classes, a classifier algorithm builds a model that predicts for every unlabelled instance I the class C to which it belongs with accuracy.

Example
A spam filter

Example
Twitter sentiment analysis: analyze tweets with positive or negative feelings
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Classification
Data set that describes e-mail features for deciding if it is spam.
Example
Contains "Money"   Domain type   Has attach.   Time received   spam
yes                com           yes           night           yes
yes                edu           no            night           yes
no                 com           yes           night           yes
no                 edu           no            day             no
no                 com           no            day             no
yes                cat           no            day             yes

Assume we have to classify the following new instance:
Contains "Money"   Domain type   Has attach.   Time received   spam
yes                edu           yes           day             ?
Bayes Classifiers
Naïve Bayes
Based on Bayes' Theorem:

    P(c|d) = P(c) P(d|c) / P(d)

    posterior = prior × likelihood / evidence

Estimates the probability of observing attribute a and the prior probability P(c)
Probability of class c given an instance d:

    P(c|d) = P(c) ∏_{a∈d} P(a|c) / P(d)
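A small Python sketch of an incremental naive Bayes over nominal attributes in the spirit of this formula: only counts are stored, so each example is inspected once and then discarded. Class and method names are illustrative, and add-one smoothing is used to avoid zero probabilities.

import math
from collections import defaultdict

class StreamingNaiveBayes:
    def __init__(self):
        self.n = 0
        self.class_counts = defaultdict(int)    # count of each class c
        self.attr_counts = defaultdict(int)     # count of (attribute index, value, class)

    def learn_one(self, x, y):
        # x: list of nominal attribute values, y: class label
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict_one(self, x):
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # log P(c) + sum over attributes of log P(a | c), add-one smoothed
            score = math.log(nc / self.n)
            for i, v in enumerate(x):
                score += math.log((self.attr_counts[(i, v, c)] + 1) / (nc + 1))
            if score > best_score:
                best, best_score = c, score
        return best

# Example with two rows of the spam table above
nb = StreamingNaiveBayes()
nb.learn_one(["yes", "com", "yes", "night"], "spam")
nb.learn_one(["no", "edu", "no", "day"], "no spam")
print(nb.predict_one(["yes", "edu", "yes", "day"]))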
Bayes Classifiers
Multinomial Naïve Bayes
Considers a document as a bag-of-words.
Estimates the probability of observing word w and the prior probability P(c)
Probability of class c given a test document d:

    P(c|d) = P(c) ∏_{w∈d} P(w|c)^{n_wd} / P(d)

where n_wd is the number of times w occurs in d
Perceptron
[Diagram: a single perceptron with inputs Attribute 1..5, weights w1..w5, and output h_w(x_i)]

Data stream: ⟨x_i, y_i⟩
Classical perceptron: h_w(x_i) = sgn(w^T x_i)
Minimize mean-square error: J(w) = (1/2) ∑_i (y_i − h_w(x_i))²
Perceptron

[Same perceptron diagram as above]

We use the sigmoid function h_w = σ(w^T x), where
σ(x) = 1/(1 + e^{−x})
σ′(x) = σ(x)(1 − σ(x))
Perceptron
Minimize mean-square error: J(w) = (1/2) ∑_i (y_i − h_w(x_i))²
Stochastic gradient descent: w = w − η ∇J_{x_i}
Gradient of the error function:
    ∇J = −∑_i (y_i − h_w(x_i)) ∇h_w(x_i)
    ∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i))
Weight update rule:
    w = w + η ∑_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i
Perceptron
PERCEPTRON LEARNING(Stream, η)
    for each class
        do PERCEPTRON LEARNING(Stream, class, η)

PERCEPTRON LEARNING(Stream, class, η)
    ▷ Let w_0 and w be randomly initialized
    for each example (x, y) in Stream
        do if class = y
               then δ = (1 − h_w(x)) · h_w(x) · (1 − h_w(x))
               else δ = (0 − h_w(x)) · h_w(x) · (1 − h_w(x))
           w = w + η · δ · x

PERCEPTRON PREDICTION(x)
    return arg max_class h_{w_class}(x)
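A compact Python sketch of the update above for a single class; one such perceptron is kept per class and prediction takes the arg max of their outputs. Feature vectors are assumed numeric and names are illustrative.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlinePerceptron:
    def __init__(self, n_features, eta=0.1):
        self.w = [random.uniform(-0.05, 0.05) for _ in range(n_features)]
        self.eta = eta

    def output(self, x):
        # h_w(x) = sigma(w^T x)
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def learn_one(self, x, target):
        # target is 1 if the example belongs to this perceptron's class, else 0
        h = self.output(x)
        delta = (target - h) * h * (1.0 - h)
        self.w = [wi + self.eta * delta * xi for wi, xi in zip(self.w, x)]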
Classification
Assume we have to classify the following new instance:
Contains "Money"   Domain type   Has attach.   Time received   spam
yes                edu           yes           day             ?

[Decision tree: the root splits on Time; Night leads to YES; Day leads to a split on Contains "Money": Yes leads to YES, No leads to NO]
Decision Trees
Basic induction strategy:
A ← the "best" decision attribute for the next node
Assign A as the decision attribute for the node
For each value of A, create a new descendant of the node
Sort training examples to leaf nodes
If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
Hoeffding Trees

Hoeffding Tree: VFDT
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000

With high probability, constructs a model identical to the one a traditional (greedy) method would learn
With theoretical guarantees on the error rate

[Decision tree example as above: split on Time; Night leads to YES; Day leads to a split on Contains "Money"]
Hoeffding Bound Inequality

Bounds the probability that a random variable deviates from its expected value.

Hoeffding Bound Inequality
Let X = ∑_i X_i where X_1, . . . , X_n are independent and identically distributed in [0,1]. Then

1 Chernoff: for each ε < 1,
      Pr[X > (1 + ε) E[X]] ≤ exp(−ε² E[X] / 3)

2 Hoeffding: for each t > 0,
      Pr[X > E[X] + t] ≤ exp(−2t² / n)

3 Bernstein: let σ² = ∑_i σ_i² be the variance of X. If X_i − E[X_i] ≤ b for each i ∈ [n], then for each t > 0,
      Pr[X > E[X] + t] ≤ exp(− t² / (2σ² + (2/3) b t))
Hoeffding Tree or VFDT

HT(Stream, δ)
    ▷ Let HT be a tree with a single leaf (root)
    ▷ Init counts n_ijk at root
    for each example (x, y) in Stream
        do HTGROW((x, y), HT, δ)

HTGROW((x, y), HT, δ)
    ▷ Sort (x, y) to leaf l using HT
    ▷ Update counts n_ijk at leaf l
    if examples seen so far at l are not all of the same class
        then ▷ Compute G for each attribute
             if G(Best Attr.) − G(2nd best) > √(R² ln(1/δ) / (2n))
                 then ▷ Split leaf on best attribute
                      for each branch
                          do ▷ Start new leaf and initialize counts
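The core of the split decision is the Hoeffding bound test; a minimal Python sketch, where R is the range of the split criterion (for example log2 of the number of classes when G is information gain) and tau follows the tie rule given below. Function and parameter names are illustrative.

import math

def hoeffding_bound(R, delta, n):
    # with probability 1 - delta, the observed mean of n i.i.d. values with
    # range R is within this epsilon of the true mean
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

def should_split(g_best, g_second_best, R, delta, n, tau=0.05):
    eps = hoeffding_bound(R, delta, n)
    # split when the best attribute is clearly better than the runner-up,
    # or when epsilon is already smaller than tau (the attributes are tied)
    return (g_best - g_second_best > eps) or (eps < tau)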
Hoeffding Trees
HT features
With high probability, constructs a model identical to the one a traditional (greedy) method would learn
Ties: when two attributes have similar G, split if

    G(Best Attr.) − G(2nd best) < √(R² ln(1/δ) / (2n)) < τ

Compute G every n_min instances
Memory: deactivate least promising nodes with lower p_l × e_l
    p_l is the probability to reach leaf l
    e_l is the error in the node
Hoeffding Naive Bayes Tree
Hoeffding Tree: Majority Class learner at leaves

Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer. Stress-testing Hoeffding trees, 2005.
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method
Bagging
Example
Dataset of 4 instances: A, B, C, D
Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, with a bootstrap sample created by drawing random samples with replacement.
Bagging
Example
Dataset of 4 instances: A, B, C, D
Classifier 1: A, B, B, C
Classifier 2: A, B, D, D
Classifier 3: A, B, B, C
Classifier 4: B, B, B, C
Classifier 5: A, C, C, D

Bagging builds a set of M base models, with a bootstrap sample created by drawing random samples with replacement.
Bagging
Example
Dataset of 4 instances: A, B, C, D
Classifier 1: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D: A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C: A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D: A(1) B(0) C(2) D(1)

Each base model's training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
Bagging
Figure: Poisson(1) distribution.

Each base model's training set contains each of the original training examples K times; as the dataset grows, P(K = k) tends to a Poisson(1) distribution.
Oza and Russell's Online Bagging for M models

Initialize base models h_m for all m ∈ {1, 2, ..., M}
for all training examples do
    for m = 1, 2, ..., M do
        Set w = Poisson(1)
        Update h_m with the current example with weight w

anytime output:
return hypothesis: h_fin(x) = arg max_{y∈Y} ∑_{t=1}^{T} I(h_t(x) = y)
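A minimal Python sketch of this loop; base models are assumed to expose learn_one/predict_one (for instance the streaming naive Bayes or perceptron sketches above), and the Poisson(1) draw uses Knuth's method so no external library is needed. Names are illustrative.

import math
import random
from collections import Counter

def poisson(lam=1.0):
    # Knuth's method, adequate for small lambda
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

class OnlineBagging:
    def __init__(self, make_base_model, M=10):
        self.models = [make_base_model() for _ in range(M)]

    def learn_one(self, x, y):
        for m in self.models:
            w = poisson(1.0)
            for _ in range(w):          # weight w is simulated by repeating the example
                m.learn_one(x, y)

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.models)
        return votes.most_common(1)[0][0]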
Evolving Stream Classification
Data Mining Algorithms with Concept Drift

No Concept Drift
[Diagram: input arrives at a DM Algorithm that keeps simple counters (Counter1 .. Counter5) and produces the output]

Concept Drift
[Diagram: input arrives at a DM Algorithm holding a static model, with a change detector monitoring the stream and signalling when the model must be revised]
Data Mining Algorithms with Concept Drift

No Concept Drift
[Diagram: input arrives at a DM Algorithm that keeps simple counters and produces the output]

Concept Drift
[Diagram: the same DM Algorithm, but the counters are replaced by change-aware estimators (Estimator1 .. Estimator5)]
Optimal Change Detector and Predictor

High accuracy
Low false positive and false negative rates
Theoretical guarantees
Fast detection of change
Low computational cost: minimum space and time needed
No parameters needed
Algorithm ADaptive Sliding WINdow (ADWIN)

Example (ADWIN examines every split of W into W0 · W1, sliding the split point over the window):

W = 101010110111111    W0 = 1            W1 = 01010110111111
W = 101010110111111    W0 = 101010       W1 = 110111111
W = 101010110111111    W0 = 101010110    W1 = 111111    |µ̂_W0 − µ̂_W1| ≥ ε_c : CHANGE DETECTED!
W = 01010110111111     Drop elements from the tail of W

ADWIN: ADAPTIVE WINDOWING ALGORITHM

Initialize window W
for each t > 0
    do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
       repeat drop elements from the tail of W
       until |µ̂_W0 − µ̂_W1| < ε_c holds for every split of W into W = W0 · W1
       output µ̂_W
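A simplified Python sketch of this loop for a stream of values in [0, 1]. It stores the window explicitly and checks every split in O(|W|) time, whereas the real ADWIN uses exponential-histogram buckets to reach the O(log W) costs quoted below; the cut threshold is a Hoeffding-style bound with a harmonic sample size. Names and the exact constant in the threshold are illustrative.

import math

class SimpleAdwin:
    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = []                 # most recent element last

    def _cut_threshold(self, n0, n1):
        # epsilon_c from a Hoeffding-style bound with harmonic sample size m
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 / self.delta))

    def add(self, x):
        self.window.append(x)
        change = False
        shrinking = True
        while shrinking and len(self.window) >= 2:
            shrinking = False
            n, total = len(self.window), sum(self.window)
            s0 = 0.0
            for i in range(1, n):        # split W into W0 = window[:i], W1 = window[i:]
                s0 += self.window[i - 1]
                mu0, mu1 = s0 / i, (total - s0) / (n - i)
                if abs(mu0 - mu1) >= self._cut_threshold(i, n - i):
                    self.window.pop(0)   # drop the oldest element and re-check
                    change, shrinking = True, True
                    break
        return change

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0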
Algorithm ADaptive Sliding WINdow

Theorem
At every time step we have:
1 (False positive rate bound). If µ_t remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2 (False negative rate bound). Suppose that for some partition of W into two parts W0 W1 (where W1 contains the most recent items) we have |µ_W0 − µ_W1| > 2ε_c. Then with probability 1 − δ, ADWIN shrinks W to W1, or shorter.
ADWIN tunes itself to the data stream at hand, with no need for theuser to hardwire or precompute parameters.
Algorithm ADaptive Sliding WINdow

ADWIN, using a Data Stream Sliding Window Model,
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
the processing time per example is O(log W) (amortized and worst-case)
Sliding Window Model (exponential histogram of buckets)

Buckets:    1010101   101   11   1   1
Content:          4     2    2   1   1
Capacity:         7     3    2   1   1
VFDT / CVFDT

Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. 2001

It keeps its model consistent with a sliding window of examples
Constructs "alternative branches" as preparation for changes
If the alternative branch becomes more accurate, a switch of tree branches occurs

[Decision tree example as above: split on Time; Night leads to YES; Day leads to a split on Contains "Money"]
Decision Trees: CVFDT

[Same decision tree example: split on Time; Night leads to YES; Day leads to a split on Contains "Money"]
No theoretical guarantees on the error rate of CVFDT

CVFDT parameters:
1 W: the example window size
2 T0: number of examples used to check at each node if the splitting attribute is still the best
3 T1: number of examples used to build the alternate tree
4 T2: number of examples used to test the accuracy of the alternate tree
Decision Trees: Hoeffding Adaptive Tree

Hoeffding Adaptive Tree:
replace frequency statistics counters by estimators
no need of a window to store examples, because the needed statistics are maintained by the estimators
change the way of checking the substitution of alternate subtrees, using a change detector with theoretical guarantees

Advantages over CVFDT:
1 Theoretical guarantees
2 No parameters
ADWIN Bagging (KDD’09)
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems):
    on the ratio of false positives and negatives
    on the relation between the size of the current window and the change rate

ADWIN Bagging
When a change is detected, the worst classifier is removed and a new classifier is added.
Leveraging Bagging for Evolving Data Streams (ECML-PKDD'10)
Randomization as a powerful tool to increase accuracy and diversity
There are three ways of using randomization:
Manipulating the input data
Manipulating the classifier algorithms
Manipulating the output targets
Leveraging Bagging for Evolving Data Streams

Leveraging Bagging
Using Poisson(λ)

Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes

Fast Leveraging Bagging ME
if an instance is misclassified: weight = 1
if not: weight = e_T/(1 − e_T)

Empirical evaluation
                         Accuracy   RAM-Hours
Hoeffding Tree           74.03%     0.01
Online Bagging           77.15%     2.98
ADWIN Bagging            79.24%     1.48
Leveraging Bagging       85.54%     20.17
Leveraging Bagging MC    85.37%     22.04
Leveraging Bagging ME    80.77%     0.87
Leveraging Bagging

Leveraging Bagging
Using Poisson(λ)

Leveraging Bagging MC
Using Poisson(λ) and Random Output Codes

Leveraging Bagging ME
Using weight 1 if misclassified, otherwise e_T/(1 − e_T)
Clustering
Clustering
Definition
Clustering is the partitioning of a set of instances into previously unknown groups according to some common relations or affinities.

Example
Market segmentation of customers

Example
Social network communities
Clustering
Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(I)
a clustering algorithm computes an assignment of a cluster to each instance
    f : I → {1, . . . , K}
that minimizes the objective function cost(I)
Clustering

Definition
Given
    a set of instances I
    a number of clusters K
    an objective function cost(C, I)
a clustering algorithm computes a set C of instances with |C| = K that minimizes the objective function

    cost(C, I) = ∑_{x∈I} d²(x, C)

where
    d(x, c): distance function between x and c
    d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C
k-means
1. Choose k initial centers C = {c_1, . . . , c_k}
2. while stopping criterion has not been met
       For i = 1, . . . , N
           find the closest center c_k ∈ C to instance p_i
           assign instance p_i to cluster C_k
       For k = 1, . . . , K
           set c_k to be the center of mass of all points in C_k
k-means++
1. Choose an initial center c_1
   For k = 2, . . . , K
       select c_k = p ∈ I with probability d²(p, C)/cost(C, I)
2. while stopping criterion has not been met
       For i = 1, . . . , N
           find the closest center c_k ∈ C to instance p_i
           assign instance p_i to cluster C_k
       For k = 1, . . . , K
           set c_k to be the center of mass of all points in C_k
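A small Python sketch of the k-means++ seeding step above; after the k centers are chosen, the usual k-means loop runs unchanged. Points are plain numeric tuples and names are illustrative.

import random

def d2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_seeds(points, k):
    # first center uniformly at random, then each new center with probability
    # proportional to the squared distance to the nearest center chosen so far
    centers = [random.choice(points)]
    while len(centers) < k:
        dists = [min(d2(p, c) for c in centers) for p in points]
        r = random.uniform(0, sum(dists))
        acc = 0.0
        for p, d in zip(points, dists):
            acc += d
            if acc >= r:
                centers.append(p)
                break
    return centers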
Performance Measures
Internal Measures
    Sum of squared distances
    Dunn index D = d_min / d_max
    C-Index C = (S − S_min) / (S_max − S_min)

External Measures
    Rand Measure
    F Measure
    Jaccard
    Purity
BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Clustering Features CF = (N, LS, SS)
    N: number of data points
    LS: linear sum of the N data points
    SS: square sum of the N data points
Properties:
    Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
    Easy to compute: average inter-cluster distance and average intra-cluster distance

Uses a CF tree
Height-balanced tree with two parameters
    B: branching factor
    T: radius leaf threshold
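A small Python sketch of a clustering feature and its additivity; note that the centroid and the radius can be computed from (N, LS, SS) alone, which is why the raw points never need to be revisited. Names are illustrative.

class ClusteringFeature:
    # CF = (N, LS, SS) over d-dimensional points
    def __init__(self, d):
        self.N = 0
        self.LS = [0.0] * d
        self.SS = [0.0] * d

    def add(self, x):
        self.N += 1
        for i, v in enumerate(x):
            self.LS[i] += v
            self.SS[i] += v * v

    def merge(self, other):
        # additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.N += other.N
        self.LS = [a + b for a, b in zip(self.LS, other.LS)]
        self.SS = [a + b for a, b in zip(self.SS, other.SS)]

    def centroid(self):
        return [s / self.N for s in self.LS]

    def radius(self):
        # root mean squared distance of the points to the centroid
        c = self.centroid()
        var = sum(ss / self.N - ci * ci for ss, ci in zip(self.SS, c))
        return max(var, 0.0) ** 0.5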
BIRCH
BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES

Phase 1: Scan all data and build an initial in-memory CF tree
Phase 2: Condense into a desirable range by building a smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and off-line, as it requires more passes)
Clu-Stream
Clu-Stream
Uses micro-clusters to store statistics on-line

Clustering Features CF = (N, LS, SS, LT, ST)
    N: number of data points
    LS: linear sum of the N data points
    SS: square sum of the N data points
    LT: linear sum of the time stamps
    ST: square sum of the time stamps

Uses a pyramidal time frame
Clu-Stream
On-line Phase
For each new point that arrives:
    either the point is absorbed by a micro-cluster,
    or the point starts a new micro-cluster of its own; to keep the number of micro-clusters bounded, either
        delete the oldest micro-cluster, or
        merge two of the oldest micro-clusters

Off-line Phase
Apply k-means using the micro-clusters as points
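A rough Python sketch of the on-line phase: the point is absorbed by the closest micro-cluster if it falls within a boundary, otherwise a new micro-cluster is started and, if the budget is exceeded, the stalest one is deleted (CluStream may instead merge two old micro-clusters, and derives the boundary from the micro-cluster's own deviation rather than a fixed threshold). Names are illustrative.

import math

class MicroCluster:
    # CF = (N, LS, SS, LT, ST) as defined above
    def __init__(self, x, t):
        self.N = 1
        self.LS = list(x)
        self.SS = [v * v for v in x]
        self.LT, self.ST = t, t * t

    def add(self, x, t):
        self.N += 1
        for i, v in enumerate(x):
            self.LS[i] += v
            self.SS[i] += v * v
        self.LT += t
        self.ST += t * t

    def centroid(self):
        return [s / self.N for s in self.LS]

def online_step(microclusters, x, t, max_clusters, boundary):
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    if microclusters:
        closest = min(microclusters, key=lambda m: dist(m.centroid(), x))
        if dist(closest.centroid(), x) <= boundary:
            closest.add(x, t)            # absorb the point
            return
    if len(microclusters) >= max_clusters:
        stalest = min(microclusters, key=lambda m: m.LT / m.N)   # oldest mean timestamp
        microclusters.remove(stalest)    # delete the stalest micro-cluster
    microclusters.append(MicroCluster(x, t))   # start a new micro-cluster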
StreamKM++: Coresets
Coreset of a set P with respect to some problem
A small subset that approximates the original set P.
Solving the problem for the coreset provides an approximate solution for the problem on P.

(k,ε)-coreset
A (k,ε)-coreset S of P is a subset of P such that for each C of size k

    (1 − ε) cost(P, C) ≤ cost_w(S, C) ≤ (1 + ε) cost(P, C)
StreamKM++: Coresets
Coreset Tree
Choose a leaf node l at random
Choose a new sample point, denoted q_{t+1}, from P_l according to d²
Based on q_l and q_{t+1}, split P_l into two subclusters and create two child nodes

StreamKM++
Maintain L = ⌈log₂(n/m) + 2⌉ buckets B0, B1, . . . , B_{L−1}
Frequent Pattern Mining
Frequent Patterns
Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.

Definition
Support(t): number of patterns in D that are superpatterns of t.

Definition
Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns in D.
Pattern Mining
Dataset Example

Document   Patterns
d1         abce
d2         cde
d3         abce
d4         acde
d5         abcde
d6         bcd
Itemset Mining

Supporting documents     Frequent
d1,d2,d3,d4,d5,d6        c
d1,d2,d3,d4,d5           e, ce
d1,d3,d4,d5              a, ac, ae, ace
d1,d3,d5,d6              b, bc
d2,d4,d5,d6              d, cd
d1,d3,d5                 ab, abc, abe, be, bce, abce
d2,d4,d5                 de, cde

minimal support = 3

Support   Frequent                        Gen      Closed   Max
6         c                               c        c
5         e, ce                           e        ce
4         a, ac, ae, ace                  a        ace
4         b, bc                           b        bc
4         d, cd                           d        cd
3         ab, abc, abe, be, bce, abce     ab, be   abce     abce
3         de, cde                         de       cde      cde

Each generator maps to its closure, e.g. e → ce and a → ace.
Closed Patterns
Usually, there are too many frequent patterns. We can compute a smaller set, while keeping the same information.

Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).
Closed Patterns
A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).

Definition
A frequent pattern t is closed if none of its proper superpatterns has the same support as it has.

Frequent subpatterns and their supports can be generated from the closed patterns.
Maximal Patterns
Definition
A frequent pattern t is maximal if none of its proper superpatterns is frequent.

Frequent subpatterns can be generated from the maximal patterns, but not with their support.
All maximal patterns are closed, but not all closed patterns are maximal.
Non-streaming frequent itemset miners

Representation:
Horizontal layout
    T1: a, b, c
    T2: b, c, e
    T3: b, d, e
Vertical layout
    a: 1 0 0
    b: 1 1 1
    c: 1 1 0

Search:
Breadth-first (levelwise): Apriori
Depth-first: Eclat, FP-Growth
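A compact Python sketch of depth-first mining in the Eclat style on the vertical layout: the support of an itemset is the size of the intersection of its items' tid-sets. The toy data is the d1..d6 example from the itemset mining slides; names are illustrative.

def eclat(tidsets, min_sup, prefix=(), out=None):
    # tidsets: item -> set of transaction ids containing it (vertical layout),
    # already conditioned on the current prefix
    if out is None:
        out = {}
    items = sorted(tidsets)
    for i, a in enumerate(items):
        ta = tidsets[a]
        if len(ta) < min_sup:
            continue
        itemset = prefix + (a,)
        out[itemset] = len(ta)
        # conditional vertical database via tid-set intersection
        suffix = {b: ta & tidsets[b] for b in items[i + 1:]}
        eclat(suffix, min_sup, itemset, out)
    return out

# d1..d6 from the itemset mining example, encoded as transaction ids 1..6
vertical = {"a": {1, 3, 4, 5}, "b": {1, 3, 5, 6}, "c": {1, 2, 3, 4, 5, 6},
            "d": {2, 4, 5, 6}, "e": {1, 2, 3, 4, 5}}
print(eclat(vertical, min_sup=3))   # includes e.g. ('a', 'b', 'c', 'e'): 3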
Mining Patterns over Data Streams
Requirements: fast, use a small amount of memory, and adaptive

Type:
    Exact
    Approximate
Per batch, per transaction
Incremental, Sliding Window, Adaptive
Frequent, Closed, Maximal patterns
Moment
Computes closed frequent itemsets in a sliding window
Uses a Closed Enumeration Tree
Uses 4 types of nodes:
    Closed Nodes
    Intermediate Nodes
    Unpromising Gateway Nodes
    Infrequent Gateway Nodes
Adding transactions: closed itemsets remain closed
Removing transactions: infrequent itemsets remain infrequent
FP-Stream
Mining Frequent Itemsets at Multiple Time Granularities
Based on FP-Growth
Maintains
    a pattern tree
    a tilted-time window
Allows answering time-sensitive queries
Keeps more detailed information on recent data
Drawback: time and memory complexity
Tree and Graph Mining: Dealing with time changes

Keep a window on recent stream elements
    Actually, just its lattice of closed sets!
Keep track of the number of closed patterns in the lattice, N
Use some change detector on N
When change is detected:
    Drop the stale part of the window
    Update the lattice to reflect this deletion, using the deletion rule
Alternatively, use a sliding window of some fixed size
Summary
A Short Course in Data Stream Mining
Short Course Summary
1 Introduction
2 Classification
3 Ensemble Methods
4 Clustering
5 Frequent Pattern Mining

Open Source Software
1 MOA: http://moa.cms.waikato.ac.nz/
2 SAMOA: http://samoa-project.net/