Semester 2, 2017
Lecturer: Andrey Kan
Lecture 13. Clustering. Gaussian Mixture Model
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
• Unsupervised learning
  – Diversity of problems
  – Pipelines
• Clustering
  – Problem formulation
  – Algorithms
  – Choosing the number of clusters
• Gaussian mixture model (GMM)
  – A probabilistic approach to clustering
  – GMM clustering as an optimisation problem
Unsupervised Learning
A large branch of ML concerned with learning the structure of the data in the absence of labels
Previously: Supervised learning
• Supervised learning: the overarching aim is making predictions from data
• We studied methods such as random forest, ANN and SVM in the context of this aim
• We had instances $x_i \in \mathbb{R}^m$, $i = 1, \dots, n$, and corresponding labels $y_i$ as inputs, and the aim was to predict labels for new instances
• Can be viewed as a function approximation problem, but with a big caveat
• The ability to generalise is critical
Now: Unsupervised learning
• Next few lectures: unsupervised learning methods
• In unsupervised learning, there is no dedicated variable called a "label"
• Instead, we just have a set of points $x_i \in \mathbb{R}^m$, $i = 1, \dots, n$
  – And often the data cannot be reduced to points in $\mathbb{R}^m$ (e.g., the data is a set of variable-length sequences)
• The aim of unsupervised learning is to explore the structure (patterns, regularities) of the data
• The aim of "exploring the structure" is vague
Applications of unsupervised learning
• A diversity of tasks falls into the unsupervised learning category
• This subject covers some common applications of unsupervised learning:
  – Clustering (this week)
  – Dimensionality reduction (next week)
  – Learning parameters of probabilistic models (after break)
• A few other applications not covered in this course:
  – Market basket analysis. E.g., use supermarket transaction logs to find items that are frequently purchased together
  – Outlier detection. E.g., find potentially fraudulent credit card transactions
Data analysis pipelines
• Clustering and dimensionality reduction are listed as separate problems
• In many cases, these can be viewed as distinct unsupervised learning tasks
• However, different methods can be combined into powerful data analysis pipelines
• E.g., we will use these pipelines for data points where patterns are obscure, or for datasets that are not points in $\mathbb{R}^m$
• Before getting to this stage, we first consider a classical clustering problem
art: adapted from Clker-Free-Vector-Images at pixabay.com (CC0)
Clustering
A fundamental task in Machine Learning. Thousands of algorithms, yet no definitive universal solution
Introduction to clustering
• Clustering is automatic grouping of objects such that the objects within each group (cluster) are more similar to each other than to objects from different groups
  – An extremely vague definition that reflects the variety of real-world problems that require clustering
• The key to this definition is defining what "similar" means
[Figure: the same data shown with two alternative clusterings, "Clustering 1" and "Clustering 2"]
Measuring (dis)similarity
• Consider a "classical" setting where the data is a set of points $x_i \in \mathbb{R}^m$, $i = 1, \dots, n$
• Instead of "similarity", it is sometimes more convenient to use the opposite concept of "dissimilarity" or "difference"
• A natural choice for formalising the "difference" between a pair of points is the Euclidean distance

$$d_{ij} \equiv \|x_i - x_j\| = \sqrt{\sum_{l=1}^{m} \left( x_i(l) - x_j(l) \right)^2}$$

• Here $x_i$ is a vector and $x_i(l)$ denotes the $l$-th element of that vector
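For concreteness, here is a minimal NumPy sketch of this pairwise distance (not from the slides; the function name is illustrative):

```python
import numpy as np

def pairwise_distances(X):
    """Pairwise Euclidean distances d_ij between rows x_i of X (shape n x m)."""
    diffs = X[:, None, :] - X[None, :, :]      # all pairwise differences, shape (n, n, m)
    return np.sqrt((diffs ** 2).sum(axis=-1))  # d_ij = sqrt(sum_l (x_i(l) - x_j(l))^2)

X = np.array([[0.0, 0.0], [3.0, 4.0]])
print(pairwise_distances(X))  # off-diagonal entries are 5.0
```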
Refresher on K-means clustering
Figure: Bishop, Section 9.1. Data: Old Faithful Geyser data (waiting time between eruptions and the duration of eruptions)
• Requires specifying the number of clusters in advance
• Measures "dissimilarity" using Euclidean distance
• Finds "spherical" clusters
• An iterative optimisation procedure
K-means as iterative optimisation
1. Initialisation: choose $k$ cluster centroids randomly
2. Update:
   a) Assign points to the nearest centroid
   b) Compute centroids under the current assignment
3. Termination: if no change then stop
4. Go to Step 2
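A minimal NumPy sketch of these four steps (the initialisation at $k$ random data points and the convergence test on assignments are implementation choices, not prescribed by the slide; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
    # 1. Initialisation: choose k cluster centroids randomly (here: k random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    for _ in range(max_iter):
        # 2a. Assign points to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_assign = d.argmin(axis=1)
        # 3. Termination: if no change then stop
        if assign is not None and np.array_equal(assign, new_assign):
            break
        assign = new_assign
        # 2b. Compute centroids under the current assignment
        centroids = np.array([X[assign == c].mean(axis=0) for c in range(k)])
    return assign, centroids
```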
Clustering algorithms
• There are thousands (!) of clustering algorithms
  – Jain A.K. et al., Data clustering: 50 years beyond K-means, 2010, Pattern Recognition Letters
• K-means is still one of the most popular algorithms (perhaps the most popular)
  – Wu X. et al., Top 10 algorithms in data mining, 2008, Knowledge and Information Systems
• Many popular algorithms do not require specifying $k$
  – Hierarchical clustering (e.g., see Hastie et al., The Elements of Statistical Learning)
  – DBSCAN (Ester M. et al., A density-based algorithm for discovering clusters in large spatial databases with noise, 1996, KDD conference)
  – Affinity propagation (Frey B.J. and Dueck D., Clustering by passing messages between data points, 2007, Science)
Consensus clustering (1/3)
• Unsupervised learning counterpart of bagging
  – Monti S. et al., Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, 2003, Machine Learning
• Consider a dataset $D = \{x_1, \dots, x_n\}$ and some pre-defined number of iterations $T$
• The algorithm creates sampled versions $D_1, \dots, D_T$ of the original data
  – In principle, one can use bootstrapping (sampling with replacement), resulting in $|D_t| = n$
  – For consensus clustering, the authors consider subsampling (without replacement), resulting in $|D_t| < n$
Consensus clustering (2/3)
• $D_1, \dots, D_T$ are sampled versions of the original data
• Define an $n \times n$ indicator matrix $I_t$, such that $I_t(i, j) = 1$ if $x_i, x_j \in D_t$ and $I_t(i, j) = 0$ otherwise
• Apply clustering independently on each $D_t$
• Define an $n \times n$ association matrix $A_t$, such that $A_t(i, j) = 1$ if $x_i, x_j$ are in the same cluster, and $A_t(i, j) = 0$ otherwise
  – If $x_i \notin D_t$ or $x_j \notin D_t$ then $A_t(i, j) = 0$
• The consensus matrix $M$ is defined as

$$M(i, j) \equiv \frac{\sum_{t=1}^{T} A_t(i, j)}{\sum_{t=1}^{T} I_t(i, j)}$$

  – The proportion of clustering runs in which two items cluster together
Consensus clustering (3/3)
• Algorithm (a sketch follows below)
  1. Choose a clustering algorithm
  2. For $t = 1, \dots, T$
     a) Sample $D_t$ from $D$ (this can also be done by sampling features rather than data points)
     b) Apply clustering on $D_t$
     c) Save connectivity matrix $A_t$ and indicator matrix $I_t$
  3. Construct consensus matrix $M$
  4. Cluster $D$ using $M$ as a similarity matrix (e.g., apply hierarchical clustering)
• Compare with random forest generation
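A sketch of the algorithm above, using k-means as the base algorithm and 80% subsampling without replacement (both are illustrative choices; scikit-learn and SciPy are assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_cluster(X, k, T=50, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    A = np.zeros((n, n))  # running sum of association matrices A_t
    I = np.zeros((n, n))  # running sum of indicator matrices I_t
    for t in range(T):
        idx = rng.choice(n, size=int(frac * n), replace=False)        # sample D_t from D
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])  # cluster D_t
        I[np.ix_(idx, idx)] += 1
        A[np.ix_(idx, idx)] += labels[:, None] == labels[None, :]
    M = A / np.maximum(I, 1)  # consensus matrix M(i, j)
    # 4. Cluster D using M as a similarity matrix (here: hierarchical clustering
    # on the dissimilarity 1 - M, passed in condensed upper-triangular form)
    Z = linkage(1.0 - M[np.triu_indices(n, 1)], method='average')
    return fcluster(Z, t=k, criterion='maxclust')
```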
Choosing the number of clusters
• Many clustering algorithms do not require $k$, but require specifying some other parameters that influence the resulting number of clusters
• Suppose that we are using an algorithm that does require $k$
• The number of clusters can be known from context
  – E.g., clustering genetic profiles from a group of cells that is known to contain a certain number of cell types
• Visualising the data (e.g., using dimensionality reduction, next week) can help to estimate the number of clusters
• Another strategy is to try a few plausible values and see whether it makes a difference to the analysis
Number of clusters and overfitting
• In the context of k-means it might be appealing to treat $k$ as a parameter to be optimised
• This does not work, because a larger $k$ results in a more flexible model that will always fit the data better
  – This is analogous to overfitting in supervised learning
• Instead, approaches that can work are:
  – Cross-validation-like strategies for determining $k$
  – Trying several possible $k$ and looking at the trend
  – Information-theoretic results
• An example of the second approach is shown on the next slide
• Examples of the third approach are given in the green slides after that
Kink method and gap statistic
• Manual inspection of the minimised within-cluster variation as a function of $k$
• The real number of clusters is indicated by the kink in this curve
• The gap statistic (Hastie et al. book), developed for k-means clustering, is an automated version of the kink method
• This method compares the within-cluster variation at each $k$ with its expectation under a null reference distribution of the data
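A sketch of the manual inspection: plot the minimised within-cluster variation (k-means "inertia" in scikit-learn) against $k$ and look for the kink (the synthetic dataset is only for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
ks = range(1, 10)
wcv = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, wcv, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster variation')
plt.show()  # look for the kink; here it should appear around k = 4
```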
Akaike information criterion
• This and the next method are only applicable to parametric probabilistic models $p(X|\theta)$. Let $\hat{\theta}$ be the MLE for this model, and let $L^* \equiv p(X|\hat{\theta})$
• The Akaike information criterion (AIC) is defined as

$$AIC = 2p - 2 \ln L^*$$

• Here $p$ is the number of free parameters
• The equation is simple, but the derivation is very complicated. AIC is one of the fundamental results in statistics
• AIC estimates the divergence between the true unknown model (e.g., a GMM with the true number of clusters) and the current model
Akaike information criterion
• Method: consider several different $k$ and their corresponding GMMs. Find MLE parameters for each model
• Compute AIC for each model and choose the $k$ that results in the smallest AIC
• The smallest AIC is preferable because AIC is an estimator of the divergence between the true and current models
• The AIC estimator was shown to be biased for finite sample sizes, so a correction has been proposed, which should be used in practice instead of AIC ($n$ is the number of data points)

$$AIC_c = 2p - 2 \ln L^* + \frac{2p(p + 1)}{n - p - 1}$$
Bayesian information criterion
• AIC was derived from a frequentist standpoint. The Bayesian information criterion (BIC) represents the Bayesian approach to model selection
• Bayesian model selection is based on the marginal likelihood $p(data \mid model)$
• BIC is an approximate computation of the marginal likelihood
• BIC is defined as

$$BIC = p \ln n - 2 \ln L^*$$

• One should choose the model with the smallest BIC
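A sketch of model selection with these criteria: fit a GMM for several $k$ and choose the smallest criterion value. scikit-learn's GaussianMixture exposes aic() and bic(); note its aic() is the uncorrected version, so the AICc correction above would have to be applied by hand:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gmm.aic(X), gmm.bic(X))  # choose the k with the smallest criterion
```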
Gaussian Mixture Model
A probabilistic view of clustering
Why GMM clustering
• The k-means algorithm is one of the most popular clustering algorithms; GMM clustering is a generalisation of k-means
• Empirically, it works well in many cases
  – Moreover, it can be used in a manifold learning pipeline (coming soon)
• Reasonably simple and mathematically tractable
• An example of a probabilistic approach
• An example application of the Expectation Maximisation (EM) algorithm
  – The EM algorithm is a generic technique, not only for GMM clustering
Clustering: Probabilistic interpretation
Clustering can be viewed as identification of the components of a probability density function that generated the data
[Figure: a two-component density with components labelled "Cluster 1" and "Cluster 2"]
Identifying cluster centroids can be viewed as finding the modes of the distributions
Modelling uncertainty in data clustering
• K-means clustering assigns each point to exactly one cluster
  – In other words, the result of such a clustering is a partitioning into $k$ subsets
• Similar to k-means, a probabilistic mixture model requires the user to choose the number of clusters in advance
• Unlike k-means, the probabilistic model gives us the power to express uncertainty about the origin of each point
  – Each point originates from cluster $c$ with probability $w_c$, $c = 1, \dots, k$
• That is, each point still originates from one particular cluster (aka component), but we are not sure from which one
• Next, for each individual component, the normal distribution is a common modelling choice
Normal (aka Gaussian) distribution
• Recall that a 1D Gaussian is

$$\mathcal{N}(x \mid \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

• And an $m$-dimensional Gaussian is

$$\mathcal{N}(x \mid \mu, \Sigma) \equiv (2\pi)^{-m/2} (\det \Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu)\right)$$

  – $\Sigma$ is a symmetric $m \times m$ matrix that is assumed to be positive definite
  – $\det \Sigma$ denotes the matrix determinant
Figure: Bishop
Gaussian mixture model (GMM)
• Gaussian mixture distribution (for one data point):

$$p(x) \equiv \sum_{c=1}^{k} w_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c)$$

• Here $w_c \ge 0$ and $\sum_{c=1}^{k} w_c = 1$
• That is, $w_1, \dots, w_k$ is a probability distribution over components
• Parameters of the model are $w_c, \mu_c, \Sigma_c$, $c = 1, \dots, k$
[Figure: Bishop. 1D example; mixture and individual component densities are re-scaled for visualisation purposes]
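A sketch evaluating this mixture density at a point with SciPy (the weights, means, and covariances here are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# a 2-component GMM in 2D: weights w_c, means mu_c, covariances Sigma_c
w = np.array([0.6, 0.4])
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def gmm_pdf(x):
    # p(x) = sum_c w_c N(x | mu_c, Sigma_c)
    return sum(w[c] * multivariate_normal.pdf(x, mean=mu[c], cov=Sigma[c])
               for c in range(len(w)))

print(gmm_pdf(np.array([1.0, 1.0])))
```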
Checkpoint
• Consider a GMM with five components for 3D data. How many independent parameters does this model have?
  a) 6 × 5 + 3 × 5 + 4
  b) 6 × 5 + 3 × 5 + 5
  c) 9 × 5 + 3 × 5 + 5
art: OpenClipartVectors at pixabay.com (CC0)
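As a counting aid (my own working, not part of the slide): a symmetric $m \times m$ covariance matrix has $m(m+1)/2$ free parameters, each mean contributes $m$, and the weights contribute $k - 1$ because they must sum to one. For $m = 3$, $k = 5$:

$$k \cdot \frac{m(m+1)}{2} + k \cdot m + (k - 1) = 6 \times 5 + 3 \times 5 + 4$$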
Clustering as an optimisation problem
• Given a set of data points, we assume that the data points are generated by a GMM
  – Each point in our dataset originates from the $c$-th normal distribution component with probability $w_c$
• Clustering now amounts to finding the parameters of the GMM that "best explain" the observed data
• But what does "best explain" mean?
• We are going to call upon another old friend: the MLE principle tells us to use the parameter values that maximise $p(x_1, \dots, x_n)$
Fitting a GMM model to data
• Assuming that data points are independent, our aim is to find $w_c, \mu_c, \Sigma_c$, $c = 1, \dots, k$, that maximise

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} \sum_{c=1}^{k} w_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)$$

• This is actually an ill-posed problem
  – Singularities (points at which the likelihood is not defined)
  – Non-uniqueness
• Theoretical cure – Bayesian approach
• Practical cure – heuristically avoid singularities
[Figure: Bishop, Fig. 9.7]
Fitting a GMM model to data
• Yet again, we are facing an optimisation problem
• Assuming that data points are independent, our aim is to find $w_c, \mu_c, \Sigma_c$, $c = 1, \dots, k$, that maximise

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} \sum_{c=1}^{k} w_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)$$

• Let's see if this can be solved analytically
• Taking the derivative of this expression is pretty awkward; try the usual log trick
Attempting the log trick for GMM
• Our aim is to find $w_c, \mu_c, \Sigma_c$, $c = 1, \dots, k$, that maximise

$$\log p(x_1, \dots, x_n) = \sum_{i=1}^{n} \log \sum_{c=1}^{k} w_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)$$

• The log cannot be pushed inside the sum, so the derivative of the log likelihood is still going to have a cumbersome form
• We should use an iterative procedure
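Even though the log cannot be pushed inside the inner sum, the log likelihood itself is straightforward to evaluate numerically. A sketch using the standard log-sum-exp trick for stability (SciPy assumed):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, w, mus, Sigmas):
    """log p(x_1, ..., x_n) = sum_i log sum_c w_c N(x_i | mu_c, Sigma_c)."""
    # log of each weighted component density, shape (n, k)
    log_terms = np.column_stack([
        np.log(w[c]) + multivariate_normal.logpdf(X, mean=mus[c], cov=Sigmas[c])
        for c in range(len(w))
    ])
    return logsumexp(log_terms, axis=1).sum()
```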
Fitting a GMM using iterative optimisation
• So there is little prospect of an analytical solution
• At this point, we could use the gradient descent algorithm
• But it still requires taking partial derivatives
• Another problem with using gradient descent is the complicated constraints on the parameters
• We aim to find $w_c, \mu_c, \Sigma_c$, $c = 1, \dots, k$, where each $\Sigma_c$ is symmetric and positive definite, and where the $w_c$ add up to one
Introduction to Expectation Maximisation
• The Expectation Maximisation (EM) algorithm is a common way to find the parameters of a GMM (see the sketch below)
• EM is a generic algorithm for finding the MLE of the parameters of a probabilistic model
• Broadly speaking, as "input" EM requires
  – A probabilistic model that can be specified by a fixed number of parameters
  – Data
• EM is widely used outside clustering and GMMs (another application of EM coming soon)
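As a preview of how EM sidesteps both the awkward derivatives and the parameter constraints, here is a minimal EM-for-GMM sketch following the standard responsibility-based updates (see, e.g., Bishop, Section 9.2); the random initialisation and the small ridge added to each covariance (a heuristic against the singularities mentioned earlier) are my own choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, ridge=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)                        # mixing weights, sum to one
    mus = X[rng.choice(n, size=k, replace=False)]  # initialise means at random points
    Sigmas = np.array([np.cov(X.T) + ridge * np.eye(d)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities r_ic, proportional to w_c N(x_i | mu_c, Sigma_c)
        r = np.column_stack([w[c] * multivariate_normal.pdf(X, mus[c], Sigmas[c])
                             for c in range(k)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters; the constraints hold by construction
        Nc = r.sum(axis=0)
        w = Nc / n
        mus = (r.T @ X) / Nc[:, None]
        for c in range(k):
            diff = X - mus[c]
            Sigmas[c] = (r[:, c, None] * diff).T @ diff / Nc[c] + ridge * np.eye(d)
    return w, mus, Sigmas
```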
MLE vs EM
• MLE is a frequentist principle that suggests that, given a dataset, the "best" parameters to use are the ones that maximise the probability of the data
  – MLE is a way to formally pose the problem
• EM is an algorithm
  – EM is a way to solve the problem posed by MLE
• The MLE can be found by other methods, such as gradient descent (but gradient descent is not always the most convenient method)
This lecture
• Unsupervised learning
  – Diversity of problems
  – Pipelines
• Clustering
  – Problem formulation
  – Algorithms
  – Choosing the number of clusters
• Gaussian mixture model (GMM)
  – A probabilistic approach to clustering
  – GMM clustering as an optimisation problem