
Semester 2, 2017. Lecturer: Andrey Kan

Lecture 13. Clustering. Gaussian Mixture Model

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne


This lecture

• Unsupervised learning
  ∗ Diversity of problems
  ∗ Pipelines
• Clustering
  ∗ Problem formulation
  ∗ Algorithms
  ∗ Choosing the number of clusters
• Gaussian mixture model (GMM)
  ∗ A probabilistic approach to clustering
  ∗ GMM clustering as an optimisation problem


Unsupervised Learning

A large branch of ML concerned with learning the structure of data in the absence of labels


Previously: Supervised learning

• Supervised learning: the overarching aim is making predictions from data
• We studied methods such as random forests, ANNs and SVMs in the context of this aim
• We had instances x_i ∈ R^m, i = 1, …, n and corresponding labels y_i as inputs, and the aim was to predict labels for new instances
• Can be viewed as a function approximation problem, but with a big caveat
• The ability to generalise is critical


Now: Unsupervised learning

• Next few lectures: unsupervised learning methods
• In unsupervised learning, there is no dedicated variable called a "label"
• Instead, we just have a set of points x_i ∈ R^m, i = 1, …, n
  ∗ And often data cannot be reduced to points in R^m (e.g., data is a set of variable-length sequences)
• The aim of unsupervised learning is to explore the structure (patterns, regularities) of the data
• The aim of "exploring the structure" is vague


Applications of unsupervised learning

• A diverse range of tasks falls into the unsupervised learning category
• This subject covers some common applications of unsupervised learning:
  ∗ Clustering (this week)
  ∗ Dimensionality reduction (next week)
  ∗ Learning parameters of probabilistic models (after the break)
• A few other applications not covered in this course:
  ∗ Market basket analysis. E.g., use supermarket transaction logs to find items that are frequently purchased together
  ∗ Outlier detection. E.g., find potentially fraudulent credit card transactions


Data analysis pipelines

• Clustering and dimensionality reduction are listed as separate problems
• In many cases, these can be viewed as distinct unsupervised learning tasks
• However, different methods can be combined into powerful data analysis pipelines
• E.g., we will use these pipelines for data points where patterns are obscure, or for datasets that are not points in R^m
• Before getting to this stage, we first consider a classical clustering problem

art: adapted from Clker-Free-Vector-Images at pixabay.com (CC0)


Clustering

A fundamental task in Machine Learning. Thousands of algorithms, yet no definitive universal solution


Introduction to clustering

• Clustering is the automatic grouping of objects such that the objects within each group (cluster) are more similar to each other than to objects from different groups
  ∗ An extremely vague definition that reflects the variety of real-world problems that require clustering
• The key to this definition is defining what "similar" means

Figure: an example dataset ("Data") grouped in two alternative ways ("Clustering 1" and "Clustering 2")

Measuring (dis)similarity

• Consider a "classical" setting where the data is a set of points x_i ∈ R^m, i = 1, …, n
• Instead of "similarity", it is sometimes more convenient to use an opposite concept of "dissimilarity" or "difference"
• A natural choice for formalising the "difference" between a pair of points is Euclidean distance

$$d_{ij} \equiv \lVert \boldsymbol{x}_i - \boldsymbol{x}_j \rVert = \sqrt{\sum_{l=1}^{m} \left( x_{il} - x_{jl} \right)^2}$$

• Here x_i is a vector and x_{il} denotes the l-th element of that vector
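As a quick illustration (not part of the original slides), here is a minimal NumPy sketch that computes the matrix of pairwise Euclidean distances d_ij for a toy dataset; the array X and the function name pairwise_distances are made up for this example:

```python
import numpy as np

def pairwise_distances(X):
    """Return the n x n matrix of Euclidean distances d_ij = ||x_i - x_j||."""
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i . x_j
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives from round-off

# Toy example: n = 4 points in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
print(pairwise_distances(X))  # e.g., the distance between the first and last point is 5.0
```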


Refresher on K-means clustering

Figure: Bishop, Section 9.1

Data: Old Faithful geyser data (waiting time between eruptions and the duration of eruptions)

• Requires specifying the number of clusters in advance
• Measures "dissimilarity" using Euclidean distance
• Finds "spherical" clusters
• An iterative optimisation procedure


K-means as iterative optimisation

1. Initialisation: choose k cluster centroids randomly

2. Update:
   a) Assign points to the nearest centroid
   b) Compute centroids under the current assignment

3. Termination: if no change then stop

4. Go to Step 2
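A minimal NumPy sketch of these steps (my illustration, not the course's reference code; it does not handle the corner case of empty clusters):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, m) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialisation: choose k cluster centroids randomly (here: k random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2a. Assign points to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 2b. Compute centroids under the current assignment
        centroids = np.array([X[new_assignments == c].mean(axis=0) for c in range(k)])
        # 3. Termination: stop if the assignment did not change; 4. otherwise repeat
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, centroids
```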


Clustering algorithms

• There are thousands (!) of clustering algorithms
  ∗ Jain A.K. et al., Data clustering: 50 years beyond K-means, 2010, Pattern Recognition Letters
• K-means is still one of the most popular algorithms (perhaps the most popular)
  ∗ Wu X. et al., Top 10 algorithms in data mining, 2008, Knowledge and Information Systems
• Many popular algorithms do not require specifying k
  ∗ Hierarchical clustering (e.g., see Hastie et al., The Elements of Statistical Learning)
  ∗ DBSCAN (Ester M. et al., A density-based algorithm for discovering clusters in large spatial databases with noise, 1996, KDD conference)
  ∗ Affinity propagation (Frey B.J. and Dueck D., Clustering by passing messages between data points, 2007, Science)


Consensus clustering (1/3)

• Unsupervised learning counterpart of bagging
  ∗ Monti S. et al., Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, 2003, Machine Learning
• Consider a dataset D = {x_1, …, x_n} and some pre-defined number of iterations T
• The algorithm creates sampled versions D_1, …, D_T of the original data
  ∗ In principle, one can use bootstrapping (sampling with replacement), resulting in |D_t| = n
  ∗ For consensus clustering, the authors consider subsampling (without replacement), resulting in |D_t| < n


Consensus clustering (2/3)

• D_1, …, D_T are sampled versions of the original data
• Define an n × n indicator matrix I_t, such that I_t(i, j) = 1 if x_i, x_j ∈ D_t and I_t(i, j) = 0 otherwise
• Apply clustering independently on each D_t
• Define an n × n association matrix M_t, such that M_t(i, j) = 1 if x_i, x_j are in the same cluster, and M_t(i, j) = 0 otherwise
  ∗ If x_i ∉ D_t or x_j ∉ D_t then M_t(i, j) = 0
• The consensus matrix ℳ is defined as

$$\mathcal{M}(i, j) \equiv \frac{\sum_{t=1}^{T} M_t(i, j)}{\sum_{t=1}^{T} I_t(i, j)}$$

  ∗ Proportion of clustering runs in which two items cluster together


Consensus clustering (3/3)

• Algorithm
  1. Choose a clustering algorithm
  2. For t = 1 … T
     a) Sample D_t from D (this can also be done by sampling features rather than data points)
     b) Apply clustering on D_t
     c) Save connectivity matrix M_t and indicator matrix I_t
  3. Construct the consensus matrix ℳ
  4. Cluster D using ℳ as a similarity matrix (e.g., apply hierarchical clustering)

Compare with random forest generation
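A rough sketch of this procedure, using subsampling with k-means as the base clusterer and hierarchical clustering at the end. This is my own illustration rather than the paper's reference implementation; it assumes scikit-learn, and note that older scikit-learn releases use affinity= instead of metric= in AgglomerativeClustering:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_clustering(X, k, T=50, subsample=0.8, seed=0):
    """Consensus clustering sketch: subsample, cluster, aggregate co-clustering counts."""
    rng = np.random.default_rng(seed)
    n = len(X)
    M_sum = np.zeros((n, n))  # runs in which i and j were clustered together
    I_sum = np.zeros((n, n))  # runs in which i and j were both sampled
    for _ in range(T):
        # 2a) Sample D_t from D without replacement, so |D_t| < n
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        # 2b) Apply the base clustering algorithm on D_t (k-means here)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        # 2c) Accumulate indicator matrix I_t and connectivity matrix M_t
        I_sum[np.ix_(idx, idx)] += 1
        M_sum[np.ix_(idx, idx)] += (labels[:, None] == labels[None, :])
    # 3) Consensus matrix: proportion of runs in which i and j cluster together
    consensus = M_sum / np.maximum(I_sum, 1)
    # 4) Cluster D using the consensus matrix as a similarity (1 - consensus as a distance)
    final = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                    linkage="average").fit_predict(1.0 - consensus)
    return final, consensus
```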


Choosing the number of clusters

• Many clustering algorithms do not require k, but require specifying some other parameters that influence the resulting number of clusters
• Suppose that we are using an algorithm that does require k
• The number of clusters can be known from context
  ∗ E.g., clustering genetic profiles from a group of cells that is known to contain a certain number of cell types
• Visualising the data (e.g., using dimensionality reduction, next week) can help to estimate the number of clusters
• Another strategy is to try a few plausible values and see whether they make a difference to the analysis


Number of clusters and overfitting

• In the context of k-means, it might be appealing to treat k as a parameter to be optimised
• This does not work, because a larger k results in a more flexible model that will always fit the data better
  ∗ This is analogous to overfitting in supervised learning
• Instead, approaches that can work are:
  ∗ Cross-validation-like strategies for determining k
  ∗ Trying several plausible values of k and looking at the trend
  ∗ Information-theoretic results
• An example of the second approach is shown on the next slide
• Examples of the third approach are given in the green slides after that


Kink method and gap statistic

• Manual inspection of the minimised within-cluster variation as a function of k
• The real number of clusters is indicated by the kink in this curve
• The gap statistic (see the Hastie et al. book), developed for k-means clustering, is an automated version of the kink method
• For each k, the method measures the gap between the observed within-cluster variation and its expectation under a reference (null) distribution, and analyses the distribution of these gaps
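A hedged sketch of the kink ("elbow") inspection using scikit-learn's KMeans, whose inertia_ attribute is the minimised within-cluster sum of squared distances; the synthetic dataset is only an illustrative stand-in for real data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (an illustrative stand-in for real data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the minimised within-cluster sum of squared distances for this k
    print(f"k={k:2d}  within-cluster variation={km.inertia_:10.1f}")
# The curve typically drops sharply up to the true k and then flattens (the kink).
```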


Akaike information criterion

• This and the next method are only applicable to parametric probabilistic models p(X|θ). Let θ̂ be the MLE for this model, and let L* ≡ p(X|θ̂)
• The Akaike information criterion (AIC) is defined as

$$AIC = 2 p_{\text{free}} - 2 \ln L^*$$

• Here p_free is the number of free parameters
• The equation is simple, but the derivation is very complicated; AIC is one of the fundamental results in statistics
• AIC estimates the divergence between the true unknown model (e.g., a GMM with the true number of clusters) and the current model


Akaike information criterion (continued)

• Method: consider several different k and their corresponding GMMs; find the MLE parameters for each model
• Compute the AIC for each model and choose the k that results in the smallest AIC
• The smallest AIC is preferable because AIC is an estimator of the divergence between the true and current models
• The AIC estimator was shown to be biased for finite sample sizes, so a corrected version (AICc) has been proposed, which should be used in practice instead of AIC (here n is the number of data points)

$$AICc = 2 p_{\text{free}} - 2 \ln L^* + \frac{2 p_{\text{free}} \left( p_{\text{free}} + 1 \right)}{n - p_{\text{free}} - 1}$$

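For concreteness, a small helper (my own sketch, not from the slides) computing AIC and the corrected AICc from a fitted model's log-likelihood ln L*, number of free parameters p_free, and sample size n:

```python
def aic(log_lik, p_free):
    """Akaike information criterion: AIC = 2 p_free - 2 ln L*."""
    return 2 * p_free - 2 * log_lik

def aicc(log_lik, p_free, n):
    """Finite-sample corrected AIC (requires n > p_free + 1)."""
    return aic(log_lik, p_free) + 2 * p_free * (p_free + 1) / (n - p_free - 1)
```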


Bayesian information criterion

• AIC was derived from a frequentist standpoint. The Bayesian information criterion (BIC) represents the Bayesian approach to model selection
• Bayesian model selection is based on the marginal likelihood p(data|model)
• BIC is an approximate computation of the marginal likelihood
• BIC is defined as

$$BIC = p_{\text{free}} \ln n - 2 \ln L^*$$

• One should choose a model with the smallest BIC
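In practice these criteria can be computed directly; for example, scikit-learn's GaussianMixture exposes aic() and bic() methods. A sketch on synthetic data (illustrative only, not part of the lecture):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2D data with 4 true components (illustrative)
X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    # Smaller AIC / BIC is preferable
    print(f"k={k}  AIC={gmm.aic(X):9.1f}  BIC={gmm.bic(X):9.1f}")
```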


Gaussian Mixture Model

A probabilistic view of clustering


Why GMM clustering

• The k-means algorithm is one of the most popular clustering algorithms; GMM clustering is a generalisation of k-means
• Empirically, it works well in many cases
  ∗ Moreover, it can be used in a manifold learning pipeline (coming soon)
• Reasonably simple and mathematically tractable
• An example of a probabilistic approach
• An example application of the Expectation Maximisation (EM) algorithm
  ∗ The EM algorithm is a generic technique, not only for GMM clustering


Clustering: Probabilistic interpretation


Clustering can be viewed as identification of components of a probability density function that generated the data


Identifying cluster centroids can be viewed as finding modes of distributions


Modelling uncertainty in data clustering

• K-means clustering assigns each point to exactly one cluster
  ∗ In other words, the result of such a clustering is a partitioning into k subsets
• Similar to k-means, a probabilistic mixture model requires the user to choose the number of clusters in advance
• Unlike k-means, the probabilistic model gives us the power to express uncertainty about the origin of each point
  ∗ Each point originates from cluster c with probability w_c, c = 1, …, k
• That is, each point still originates from one particular cluster (aka component), but we are not sure from which one
• Next, for each individual component, the normal distribution is a common modelling choice


Normal (aka Gaussian) distribution

• Recall that a 1D Gaussian is

$$\mathcal{N}(x \mid \mu, \sigma) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

• And an m-dimensional Gaussian is

$$\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \equiv (2\pi)^{-\frac{m}{2}} (\det \boldsymbol{\Sigma})^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right)$$

  ∗ Σ is a symmetric m × m matrix that is assumed to be positive definite
  ∗ det Σ denotes the matrix determinant

Figure: Bishop
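To connect the formula to code, here is a small check (my own sketch) that evaluates the m-dimensional density directly and compares it against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) evaluated via the formula above."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x - mu)' Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (-m / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # symmetric, positive definite
x = np.array([0.5, 0.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```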


Gaussian mixture model (GMM)

• Gaussian mixture distribution (for one data point):

$$p(\boldsymbol{x}) \equiv \sum_{c=1}^{k} w_c \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$$

• Here w_c ≥ 0 and Σ_{c=1}^k w_c = 1
• That is, w_1, …, w_k is a probability distribution over components
• Parameters of the model are w_c, μ_c, Σ_c, c = 1, …, k

Figure: Bishop. 1D example; mixture and individual component densities are re-scaled for visualisation purposes
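A minimal sketch that evaluates such a 1D mixture density p(x) = Σ_c w_c N(x | μ_c, σ_c) for some illustrative parameter values (my example, not the parameters used in the figure):

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters of a 1D GMM with k = 3 components
w     = np.array([0.5, 0.3, 0.2])   # mixing weights: non-negative, sum to 1
mu    = np.array([-2.0, 1.0, 4.0])  # component means
sigma = np.array([0.8, 1.2, 0.5])   # component standard deviations

def gmm_density(x):
    """p(x) = sum_c w_c N(x | mu_c, sigma_c)."""
    return sum(w_c * norm.pdf(x, loc=m_c, scale=s_c)
               for w_c, m_c, s_c in zip(w, mu, sigma))

xs = np.linspace(-5, 7, 5)
print(gmm_density(xs))  # mixture density evaluated at a few points
```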


Checkpoint

• Consider a GMM with five components for 3D data. How many independent parameters does this model have?

  a) 6 × 5 + 3 × 5 + 4

  b) 6 × 5 + 3 × 5 + 5

  c) 9 × 5 + 3 × 5 + 5


art: OpenClipartVectors at pixabay.com (CC0)


Clustering as an optimisation problem

• Given a set of data points, we assume that the data points are generated by a GMM
  ∗ Each point in our dataset originates from the c-th normal distribution component with probability w_c
• Clustering now amounts to finding the parameters of the GMM that "best explain" the observed data
• But what does "best explain" mean?
• We are going to call upon another old friend: the MLE principle tells us to use the parameter values that maximise p(x_1, …, x_n)


Fitting a GMM to data

• Assuming that the data points are independent, our aim is to find w_c, μ_c, Σ_c, c = 1, …, k that maximise

$$p(\boldsymbol{x}_1, \dots, \boldsymbol{x}_n) = \prod_{i=1}^{n} \sum_{c=1}^{k} w_c \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$$

• This is actually an ill-posed problem
  ∗ Singularities (points at which the likelihood is not defined)
  ∗ Non-uniqueness

Bishop, Fig. 9.7

• Theoretical cure: Bayesian approach
• Practical cure: heuristically avoid singularities


Fitting a GMM to data

• Yet again, we are facing an optimisation problem
• Assuming that the data points are independent, our aim is to find w_c, μ_c, Σ_c, c = 1, …, k that maximise

$$p(\boldsymbol{x}_1, \dots, \boldsymbol{x}_n) = \prod_{i=1}^{n} \sum_{c=1}^{k} w_c \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$$

• Let's see if this can be solved analytically
• Taking the derivative of this expression is pretty awkward; try the usual log trick


Attempting the log trick for GMM

• Our aim is to find w_c, μ_c, Σ_c, c = 1, …, k that maximise

$$\log p(\boldsymbol{x}_1, \dots, \boldsymbol{x}_n) = \sum_{i=1}^{n} \log \sum_{c=1}^{k} w_c \, \mathcal{N}(\boldsymbol{x}_i \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$$

• The log cannot be pushed inside the sum, so the derivative of the log-likelihood is still going to have a cumbersome form
• We should use an iterative procedure
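For reference, the log-likelihood above can still be evaluated numerically for any given parameter values, e.g. using the log-sum-exp trick for stability (my sketch; the parameter names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, w, mus, Sigmas):
    """log p(x_1,...,x_n) = sum_i log sum_c w_c N(x_i | mu_c, Sigma_c)."""
    # log of each weighted component density, shape (n, k)
    log_terms = np.column_stack([
        np.log(w_c) + multivariate_normal(mean=mu_c, cov=S_c).logpdf(X)
        for w_c, mu_c, S_c in zip(w, mus, Sigmas)
    ])
    # logsumexp over components gives log p(x_i); summing over i gives the objective
    return logsumexp(log_terms, axis=1).sum()
```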


Fitting a GMM using iterative optimisation

• So there is little prospect of an analytical solution
• At this point, we could use the gradient descent algorithm
• But it still requires taking partial derivatives
• Another problem with using gradient descent is the complicated constraints on the parameters
• We aim to find w_c, μ_c, Σ_c, c = 1, …, k, where each Σ_c is symmetric and positive definite, and where the w_c add up to one


Introduction to Expectation Maximisation

• The Expectation Maximisation (EM) algorithm is a common way to find the parameters of a GMM
• EM is a generic algorithm for finding the MLE of the parameters of a probabilistic model
• Broadly speaking, as "input" EM requires
  ∗ A probabilistic model that can be specified by a fixed number of parameters
  ∗ Data
• EM is widely used outside clustering and GMMs (another application of EM is coming soon)
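To make the idea concrete, here is a compact and deliberately simplified sketch of EM for a GMM: the E-step computes responsibilities, the M-step re-estimates weights, means and covariances. This anticipates the next lecture and is my own illustration rather than the official course code; it omits practical safeguards beyond a small ridge term added to the covariances (so it can still fail on degenerate data).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Very small EM sketch for a GMM; no full protection against singular covariances."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.full(k, 1.0 / k)                          # mixing weights
    mus = X[rng.choice(n, size=k, replace=False)]    # initial means: k random data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(m) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibility r_ic is proportional to w_c N(x_i | mu_c, Sigma_c)
        dens = np.column_stack([
            w[c] * multivariate_normal(mean=mus[c], cov=Sigmas[c]).pdf(X)
            for c in range(k)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update parameters given the responsibilities
        Nc = r.sum(axis=0)
        w = Nc / n
        mus = (r.T @ X) / Nc[:, None]
        for c in range(k):
            diff = X - mus[c]
            Sigmas[c] = (r[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(m)
    return w, mus, Sigmas, r

# Example usage on synthetic data (hypothetical):
# X, _ = sklearn.datasets.make_blobs(n_samples=500, centers=3, random_state=0)
# w, mus, Sigmas, r = em_gmm(X, k=3)
```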


MLE vs EM

• MLE is a frequentist principle that suggests that, given a dataset, the "best" parameters to use are the ones that maximise the probability of the data
  ∗ MLE is a way to formally pose the problem
• EM is an algorithm
  ∗ EM is a way to solve the problem posed by MLE
• The MLE can also be found by other methods, such as gradient descent (but gradient descent is not always the most convenient method)


This lecture

• Unsupervised learning
  ∗ Diversity of problems
  ∗ Pipelines
• Clustering
  ∗ Problem formulation
  ∗ Algorithms
  ∗ Choosing the number of clusters
• Gaussian mixture model (GMM)
  ∗ A probabilistic approach to clustering
  ∗ GMM clustering as an optimisation problem
