Introduction to Big Data Science
data science
Data Science is an interdisciplinary field focused on extracting knowledge or insights from large volumes of data.
data scientist
Figure 1: http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined/
data science
Figure 2: Drew Conway’s Venn diagram
classification
Definition
Given n_C different classes, a classifier algorithm builds a model that predicts, for every unlabelled instance I, the class C to which it belongs, with some accuracy.

Example
A spam filter

Example
Twitter sentiment analysis: analyze tweets as carrying positive or negative feelings
Data set that describes e-mail features for deciding if it is spam.

Example

| Contains “Money” | Domain type | Has attach. | Time received | spam |
|---|---|---|---|---|
| yes | com | yes | night | yes |
| yes | edu | no | night | yes |
| no | com | yes | night | yes |
| no | edu | no | day | no |
| no | com | no | day | no |
| yes | cat | no | day | yes |

Assume we have to classify the following new instance:

| Contains “Money” | Domain type | Has attach. | Time received | spam |
|---|---|---|---|---|
| yes | edu | yes | day | ? |
k-nearest neighbours
k-NN Classifier
• Training: store all instances in memory
• Prediction:
  • Find the k nearest instances
  • Output the majority class of these k instances
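The two steps above can be sketched in a few lines of Python (our own minimal illustration with an invented toy data set, not from the slides):

```python
from collections import Counter
import math

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote of its k nearest training instances."""
    # "Training" is just storing (train, labels); prediction sorts by distance.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    # Majority class among the k closest instances.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups.
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = ["no", "no", "no", "yes", "yes", "yes"]
print(knn_predict(X, y, (4.5, 5.2), k=3))   # -> yes
```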
bayes classifiers
Naïve Bayes
• Based on Bayes’ theorem:

P(c|d) = P(c) P(d|c) / P(d)

posterior = prior × likelihood / evidence

• Estimates the probability P(a|c) of observing attribute a in class c, and the prior probability P(c)
• Probability of class c given an instance d:

P(c|d) = P(c) ∏_{a∈d} P(a|c) / P(d)
Multinomial Naïve Bayes
• Considers a document as a bag of words
• Estimates the probability P(w|c) of observing word w in class c, and the prior probability P(c)
• Probability of class c given a test document d:

P(c|d) = P(c) ∏_{w∈d} P(w|c)^{n_wd} / P(d)

where n_wd is the number of occurrences of w in d
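The formula above can be turned into a small runnable sketch (our own toy corpus; the Laplace add-one smoothing is an added assumption the slide does not mention, used to avoid zero probabilities):

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs):
    """docs: list of (word_list, class). Returns log-priors and log-likelihoods,
    with Laplace (add-one) smoothing."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)          # class -> word -> count
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik

def predict_mnb(model, words):
    """argmax_c log P(c) + sum over words in d of n_wd * log P(w|c).
    Unseen words contribute the same to every class and are skipped."""
    log_prior, log_lik = model
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_lik[c].get(w, 0.0) for w in words))

docs = [("win money now".split(), "spam"),
        ("money prize win".split(), "spam"),
        ("meeting at noon".split(), "ham"),
        ("project meeting notes".split(), "ham")]
model = train_mnb(docs)
print(predict_mnb(model, "win big money".split()))   # -> spam
```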
perceptron
[Figure: a single-layer perceptron with inputs Attribute 1–5, weights w1–w5, and output h_w(x_i)]

• Data stream: ⟨x_i, y_i⟩
• Classical perceptron: h_w(x_i) = sgn(w^T x_i)
• Minimize the mean-square error: J(w) = 1/2 ∑_i (y_i − h_w(x_i))²
• We use the sigmoid function h_w = σ(w^T x), where

σ(x) = 1/(1 + e^(−x))
σ′(x) = σ(x)(1 − σ(x))
• Minimize the mean-square error: J(w) = 1/2 ∑_i (y_i − h_w(x_i))²
• Stochastic gradient descent: w ← w − η ∇J, evaluated on a single instance x_i at a time
• Gradient of the error function:

∇J = −∑_i (y_i − h_w(x_i)) ∇h_w(x_i)

∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i)) x_i

• Weight update rule:

w ← w + η ∑_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i
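The update rule can be exercised on a toy problem (our own illustration; the OR data set, learning rate, and iteration budget are arbitrary choices, and the bias is folded in as a constant input):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_perceptron(data, eta=0.5, epochs=20000, seed=0):
    """Stochastic gradient descent for the sigmoid unit:
    w <- w + eta * (y - h) * h * (1 - h) * x, one random instance per step."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        x, y = rng.choice(data)
        h = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        g = eta * (y - h) * h * (1 - h)
        w = [wi + g * xi for wi, xi in zip(w, x)]
    return w

# Learn logical OR; the last component of each x is a constant bias input.
data = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
w = train_perceptron(data)
preds = [round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)))) for x, _ in data]
print(preds)   # -> [0, 1, 1, 1]
```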
restricted boltzmann machines (rbms)
[Figure: an RBM with a layer of hidden units z1–z4 fully connected to a layer of visible units x1–x5]
• Energy-based models, where

P(x, z) ∝ e^(−E(x, z))

• Manipulate a weight matrix W to find low-energy states and thus generate high-probability configurations (x, z), where

E(x, z) = −x^T W z

(bias terms omitted)

• RBMs can be stacked on top of each other to form so-called Deep Belief Networks (DBNs)
classification

Recall the spam data set from before. Assume we have to classify the following new instance:

| Contains “Money” | Domain type | Has attach. | Time received | spam |
|---|---|---|---|---|
| yes | edu | yes | day | ? |
The following decision tree classifies it:

• Time = Night → YES
• Time = Day → Contains “Money”?
  • Yes → YES
  • No → NO

The new instance (Time = day, Contains “Money” = yes) is classified as YES: spam.
decision trees
Basic induction strategy:
• A ← the “best” decision attribute for the next node
• Assign A as the decision attribute for the node
• For each value of A, create a new descendant of the node
• Sort the training examples to the leaf nodes
• If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
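One common way to pick the “best” attribute is information gain, the ID3 heuristic (an assumption on our part, since the slide leaves “best” open). On the spam data set from earlier:

```python
import math
from collections import Counter

# The spam data set from the slides: (Money, Domain, Attach, Time, spam?).
ROWS = [("yes", "com", "yes", "night", "yes"),
        ("yes", "edu", "no",  "night", "yes"),
        ("no",  "com", "yes", "night", "yes"),
        ("no",  "edu", "no",  "day",   "no"),
        ("no",  "com", "no",  "day",   "no"),
        ("yes", "cat", "no",  "day",   "yes")]
ATTRS = ["Money", "Domain", "Attach", "Time"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, a):
    """Entropy of the labels minus the weighted entropy after splitting on a."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for v in {r[a] for r in rows}:
        subset = [r[-1] for r in rows if r[a] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for a in range(len(ATTRS)):
    print(ATTRS[a], round(info_gain(ROWS, a), 3))
```

On this data set “Money” and “Time” tie for the highest gain, so the tree shown earlier, which splits on Time at the root, is one of two equally good choices.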
bagging
Example
Dataset of 4 instances: A, B, C, D

Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C

Bagging builds a set of M base models, each trained on a bootstrap sample created by drawing random samples with replacement.
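The sampling step can be sketched as follows (illustrative Python; training the base classifiers themselves is omitted):

```python
import random

def bootstrap_samples(instances, m, seed=1):
    """Draw m bootstrap samples (random draws with replacement),
    one per base model, as in the bagging example above."""
    rng = random.Random(seed)
    return [[rng.choice(instances) for _ in instances] for _ in range(m)]

samples = bootstrap_samples(["A", "B", "C", "D"], m=5)
for i, s in enumerate(samples, 1):
    print(f"Classifier {i}: {', '.join(s)}")
```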
random forests
• Bagging
• Random trees: trees that in each node use only a random subset of the attributes

Random Forests is one of the most popular methods in machine learning.
boosting
“The Strength of Weak Learnability”, Schapire ’90

A boosting algorithm transforms a weak learner into a strong one.
A formal description of boosting (Schapire)

• Given a training set (x1, y1), …, (xm, ym)
• yi ∈ {−1, +1} is the correct label of instance xi ∈ X
• For t = 1, …, T:
  • construct a distribution Dt
  • find a weak classifier ht : X → {−1, +1} with small error εt = Pr_{Dt}[ht(xi) ≠ yi] on Dt
• Output the final classifier
AdaBoost
1: Initialize D1(i) = 1/m for all i ∈ {1, 2, …, m}
2: for t = 1, 2, …, T do
3:   Call WeakLearn, providing it with distribution Dt
4:   Get back hypothesis ht : X → Y
5:   Calculate the error of ht: εt = ∑_{i : ht(xi) ≠ yi} Dt(i)
6:   Update the distribution:
     Dt+1(i) = (Dt(i)/Zt) × εt/(1 − εt) if ht(xi) = yi, and Dt+1(i) = Dt(i)/Zt otherwise,
     where Zt is a normalization constant (chosen so that Dt+1 is a probability distribution)
7: return hfin(x) = argmax_{y ∈ Y} ∑_{t : ht(x) = y} −log(εt/(1 − εt))
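Step 6 can be checked in isolation with a small sketch. A known consequence of this update is that, afterwards, the misclassified instances carry exactly half of the total weight:

```python
def update_distribution(D, correct, eps):
    """Step 6 of AdaBoost: multiply the weight of correctly classified
    instances by eps/(1-eps), then renormalize so the weights sum to 1."""
    D = [d * (eps / (1 - eps)) if ok else d for d, ok in zip(D, correct)]
    Z = sum(D)
    return [d / Z for d in D]

# 5 instances, uniform D1; the weak hypothesis misclassifies instance 4 only,
# so eps = 0.2 and that instance ends up holding half of the total weight.
D2 = update_distribution([0.2] * 5, [True, True, True, False, True], eps=0.2)
print([round(d, 3) for d in D2])   # -> [0.125, 0.125, 0.125, 0.5, 0.125]
```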
An equivalent form of the update multiplies the weight of correctly classified instances by εt and the rest by 1 − εt:

Dt+1(i) = (Dt(i)/Zt) × εt if ht(xi) = yi, and Dt+1(i) = (Dt(i)/Zt) × (1 − εt) otherwise,

with Zt again chosen so that Dt+1 is a probability distribution.
stacking
Use a classifier to combine the predictions of base classifiers.

Example
• Use a perceptron to do stacking
• Use decision trees as base classifiers
clustering
Definition
Clustering is the partitioning of a set of instances into previously unknown groups according to some common relations or affinities.

Example
Market segmentation of customers

Example
Social network communities
Definition
Given
• a set of instances I
• a number of clusters K
• an objective function cost(I)

a clustering algorithm computes an assignment of a cluster to each instance

f : I → {1, …, K}

that minimizes the objective function cost(I).
Definition
Given
• a set of instances I
• a number of clusters K
• an objective function cost(C, I)

a clustering algorithm computes a set C of instances with |C| = K that minimizes the objective function

cost(C, I) = ∑_{x∈I} d²(x, C)

where
• d(x, c): distance function between x and c
• d²(x, C) = min_{c∈C} d²(x, c): distance from x to the nearest point in C
k-means
1. Choose K initial centers C = {c1, …, cK}
2. While the stopping criterion has not been met:
   • For i = 1, …, N:
     • find the closest center ck ∈ C to instance pi
     • assign instance pi to cluster Ck
   • For k = 1, …, K:
     • set ck to be the center of mass of all points in Ck
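Lloyd's iteration described above can be sketched as follows (our own illustration; the random initialization, fixed iteration budget, and toy data are arbitrary choices):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20, seed=0):
    """Assign each point to its closest center, then move each center
    to the mean of its cluster; repeat for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # step 1: k initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            j = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):     # update step (skip empty clusters)
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))   # the two cluster means
```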
k-means++
1. Choose an initial center c1 at random; for k = 2, …, K, select ck = p ∈ I with probability d²(p, C)/cost(C, I)
2. While the stopping criterion has not been met:
   • For i = 1, …, N:
     • find the closest center ck ∈ C to instance pi
     • assign instance pi to cluster Ck
   • For k = 1, …, K:
     • set ck to be the center of mass of all points in Ck
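Only the seeding step differs from plain k-means; it can be sketched as (illustrative Python, with the toy data invented for this example):

```python
import random

def kmeanspp_seed(points, k, seed=0):
    """k-means++ seeding: each next center is sampled with probability
    proportional to d^2(p, C), the squared distance to the nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]            # first center chosen at random
    while len(centers) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        cost = sum(d2)                        # cost(C, I) = sum of d^2(p, C)
        r = rng.random() * cost               # sample proportionally to d2
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeanspp_seed(pts, 2))
```

Points far from every chosen center get a large d² and are therefore likely picks, which is why the seeds tend to spread across the clusters.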
performance measures
Internal Measures
• Sum of squared distances
• Dunn index: D = dmin/dmax
• C-index: C = (S − Smin)/(Smax − Smin)
External Measures
• Rand measure
• F measure
• Jaccard
• Purity
density based methods
DBSCAN
• ε-neighborhood(p): the set of points at distance at most ε from p
• Core object: an object whose ε-neighborhood has an overall weight of at least µ
• A point p is directly density-reachable from q if
  • p is in ε-neighborhood(q), and
  • q is a core object
• A point p is density-reachable from q if
  • there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• Points p and q are density-connected if
  • there is a point o such that both p and q are density-reachable from o
• A cluster C of points satisfies:
  • if p ∈ C and q is density-reachable from p, then q ∈ C
  • all points p, q ∈ C are density-connected
• A cluster is uniquely determined by any of its core points
• A cluster can be obtained by
  • choosing an arbitrary core point as a seed
  • retrieving all points that are density-reachable from the seed
dbscan
Figure 3: DBSCAN Point Example with µ=3
DBSCAN
• Select an arbitrary point p
• Retrieve all points density-reachable from p
• If p is a core point, a cluster is formed
• If p is a border point:
  • no points are density-reachable from p
  • DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
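The procedure above can be sketched as follows (our own minimal illustration with invented toy data; ε and µ follow the notation of the slides):

```python
def dbscan(points, eps, mu):
    """Grow a cluster from each unvisited core point by collecting
    everything density-reachable from it; unlabelled points are noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = {}                    # point index -> cluster id
    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        if len(neighbours(i)) < mu:
            continue               # not a core point (may still join a cluster later)
        cid += 1
        stack = [i]
        while stack:               # retrieve all density-reachable points
            j = stack.pop()
            if j in labels:
                continue
            labels[j] = cid
            nj = neighbours(j)
            if len(nj) >= mu:      # only core points extend the cluster
                stack.extend(nj)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(pts, eps=1.5, mu=3)
print(labels)   # two clusters; the isolated point (5, 5) stays unlabelled (noise)
```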
frequent pattern mining
frequent patterns
Suppose D is a dataset of patterns, t ∈ D, and min_sup is a constant.

Definition
Support(t): the number of patterns in D that are superpatterns of t.

Definition
Pattern t is frequent if Support(t) ≥ min_sup.

Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of the patterns in D.
pattern mining
Dataset Example

| Document | Patterns |
|---|---|
| d1 | abce |
| d2 | cde |
| d3 | abce |
| d4 | acde |
| d5 | abcde |
| d6 | bcd |
itemset mining
| Support | Frequent |
|---|---|
| d1, d2, d3, d4, d5, d6 | c |
| d1, d2, d3, d4, d5 | e, ce |
| d1, d3, d4, d5 | a, ac, ae, ace |
| d1, d3, d5, d6 | b, bc |
| d2, d4, d5, d6 | d, cd |
| d1, d3, d5 | ab, abc, abe, be, bce, abce |
| d2, d4, d5 | de, cde |

minimal support = 3
| Support | Frequent |
|---|---|
| 6 | c |
| 5 | e, ce |
| 4 | a, ac, ae, ace |
| 4 | b, bc |
| 4 | d, cd |
| 3 | ab, abc, abe, be, bce, abce |
| 3 | de, cde |
| Support | Frequent | Gen | Closed |
|---|---|---|---|
| 6 | c | c | c |
| 5 | e, ce | e | ce |
| 4 | a, ac, ae, ace | a | ace |
| 4 | b, bc | b | bc |
| 4 | d, cd | d | cd |
| 3 | ab, abc, abe, be, bce, abce | ab, be | abce |
| 3 | de, cde | de | cde |
Adding the maximal patterns:

| Support | Frequent | Gen | Closed | Max |
|---|---|---|---|---|
| 6 | c | c | c | |
| 5 | e, ce | e | ce | |
| 4 | a, ac, ae, ace | a | ace | |
| 4 | b, bc | b | bc | |
| 4 | d, cd | d | cd | |
| 3 | ab, abc, abe, be, bce, abce | ab, be | abce | abce |
| 3 | de, cde | de | cde | cde |
Each generator pattern in the Gen column maps to the closed pattern with the same support, e.g. e → ce and a → ace.
closed patterns
Usually, there are too many frequent patterns. We can compute a smaller set while keeping the same information.

Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79).
A priori property
If t′ is a subpattern of t, then Support(t′) ≥ Support(t).

Definition
A frequent pattern t is closed if none of its proper superpatterns has the same support as it has.

Frequent subpatterns and their supports can be generated from the closed patterns.
maximal patterns
Definition
A frequent pattern t is maximal if none of its proper superpatterns is frequent.

Frequent subpatterns can be generated from the maximal patterns, but without their supports.

All maximal patterns are closed, but not all closed patterns are maximal.
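Both definitions can be checked mechanically. A sketch over a few itemsets of the running example (supports taken from the tables above):

```python
def closed_and_maximal(frequent):
    """Derive the closed and maximal patterns from a dict {itemset: support},
    directly from the definitions above."""
    closed, maximal = set(), set()
    for s, sup in frequent.items():
        supersets = [t for t in frequent if s < t]   # proper superpatterns
        if not any(frequent[t] == sup for t in supersets):
            closed.add(s)          # no proper superpattern with the same support
        if not supersets:
            maximal.add(s)         # no frequent proper superpattern
    return closed, maximal

# Supports from the running example: c(6); e, ce(5); a, ac, ae, ace(4).
freq = {frozenset("c"): 6, frozenset("e"): 5, frozenset("ce"): 5,
        frozenset("a"): 4, frozenset("ac"): 4, frozenset("ae"): 4,
        frozenset("ace"): 4}
closed, maximal = closed_and_maximal(freq)
print(sorted("".join(sorted(s)) for s in closed))    # -> ['ace', 'c', 'ce']
print(sorted("".join(sorted(s)) for s in maximal))   # -> ['ace']
```

Note that the maximal set is contained in the closed set, as stated above.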
non streaming frequent itemset miners
Representation:
• Horizontal layout
  T1: a, b, c
  T2: b, c, e
  T3: b, d, e
• Vertical layout
  a: 1 0 0
  b: 1 1 1
  c: 1 1 0
Search:

• Breadth-first (levelwise): Apriori
• Depth-first: Eclat, FP-Growth
the apriori algorithm
Apriori Algorithm
1 Initialize the item set size k = 1
2 Start with single element sets
3 Prune the non-frequent ones
4 while there are frequent item sets
5   do create candidates with one item more
6      Prune the non-frequent ones
7      Increment the item set size k = k + 1
8 Output: the frequent item sets
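The levelwise loop can be sketched and run against the document data set from the earlier tables (our own minimal Python illustration of the pseudocode):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Count the candidates of each level, prune the infrequent ones,
    then join the survivors into candidates with one item more."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]      # k = 1: single element sets
    while level:
        level = [s for s in level if support(s) >= min_sup]   # prune
        for s in level:
            frequent[s] = support(s)
        # Join step: every frequent (k+1)-set is a union of two frequent k-sets.
        level = sorted({a | b for a, b in combinations(level, 2)
                        if len(a | b) == len(a) + 1}, key=sorted)
    return frequent

# The pattern data set d1..d6 from the slides, with minimal support 3.
D = [set("abce"), set("cde"), set("abce"), set("acde"), set("abcde"), set("bcd")]
freq = apriori(D, 3)
print(len(freq), freq[frozenset("ce")], freq[frozenset("abce")])   # -> 19 5 3
```

The 19 frequent itemsets and their supports match the Support/Frequent table above.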
the eclat algorithm
Depth-First Search
• Divide-and-conquer scheme: the problem is split into smaller subproblems, which are then processed recursively
  • conditional database for the prefix a: transactions that contain a
  • conditional database for item sets without a: transactions that do not contain a
• Vertical representation
• Support counting is done by intersecting lists of transaction identifiers
the fp-growth algorithm
Depth-First Search
• Divide-and-conquer scheme: the problem is split into smaller subproblems, which are then processed recursively
  • conditional database for the prefix a: transactions that contain a
  • conditional database for item sets without a: transactions that do not contain a
• Vertical and horizontal representation: the FP-tree
  • a prefix tree with links between nodes that correspond to the same item
• Support counting is done using the FP-tree
mining graph data
Problem
Given a data set D of graphs, find frequent graphs.

[Table: three example transaction graphs, each a C–C–S–N chain with different substituents: O and O (graph 1), O and C (graph 2), an extra N (graph 3)]
the gspan algorithm
gSpan(g, D, min_sup, S)
Input: A graph g, a graph dataset D, min_sup.
Output: The frequent graph set S.

1  if g ≠ min(g)
2    then return S
3  insert g into S
4  update support counter structure
5  C ← ∅
6  for each g′ that can be right-most extended from g in one step
7    do if support(g′) ≥ min_sup
8      then insert g′ into C
9  for each g′ in C
10   do S ← gSpan(g′, D, min_sup, S)
11 return S