CS260 Machine Learning Algorithms: Lecture Transcript
Clustering
Professor Ameet Talwalkar
Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, 2017 1 / 26
Outline
1 Administration
2 Review of last lecture
3 Clustering
Schedule and Final
Upcoming Schedule
Today: Last day of class
Next Monday (3/13): No class – I will hold office hours in my office (BH 4531F)
Next Wednesday (3/15): Final Exam, HW6 due
Final Exam
Cumulative but with more emphasis on new material
8 short questions (recall that the midterm had 6 short and 3 long questions)
Should be much shorter than midterm, but of equal difficulty
Focus on major concepts (e.g., MLE, primal and dual formulations of SVM, gradient descent, etc.)
Neural Networks – Basic Idea
Learning nonlinear basis functions and classifiers
Hidden layers are nonlinear mappings from input features to a new representation
Output layers use the new representations for classification and regression
Learning parameters
Backpropagation = efficient algorithm for (stochastic) gradient descent
Can write down explicit updates via chain rule of calculus
Single Node

Sigmoid (logistic) activation function: g(z) = 1 / (1 + e^(−z))

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

where x = [x_0, x_1, x_2, x_3]ᵀ, θ = [θ_0, θ_1, θ_2, θ_3]ᵀ, and x_0 = 1 is the "bias unit".

Based on slide by Andrew Ng
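For concreteness, a minimal NumPy sketch of this single node (the feature values and weights below are made-up for illustration):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def single_node(x, theta):
    # h_theta(x) = g(theta^T x); x[0] must be the bias unit x_0 = 1
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 2.0])      # x_0 = 1 is the bias unit
theta = np.array([0.1, -0.4, 0.8, 0.3])  # theta_0 is the bias weight
print(single_node(x, theta))             # output in (0, 1)
```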
Neural Network

Layer 1 (Input Layer), Layer 2 (Hidden Layer), Layer 3 (Output Layer)
bias units: x_0 = 1 and a^(2)_0

Slide by Andrew Ng

Feed-Forward Steps:

a^(2)_1 = g(Θ^(1)_10 x_0 + Θ^(1)_11 x_1 + Θ^(1)_12 x_2 + Θ^(1)_13 x_3) = g(z^(2)_1)
a^(2)_2 = g(Θ^(1)_20 x_0 + Θ^(1)_21 x_1 + Θ^(1)_22 x_2 + Θ^(1)_23 x_3) = g(z^(2)_2)
a^(2)_3 = g(Θ^(1)_30 x_0 + Θ^(1)_31 x_1 + Θ^(1)_32 x_2 + Θ^(1)_33 x_3) = g(z^(2)_3)
h_Θ(x) = g(Θ^(2)_10 a^(2)_0 + Θ^(2)_11 a^(2)_1 + Θ^(2)_12 a^(2)_2 + Θ^(2)_13 a^(2)_3) = g(z^(3)_1)

Vectorization:

z^(2) = Θ^(1) x
a^(2) = g(z^(2)), then add a^(2)_0 = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))

Based on slide by Andrew Ng
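The vectorized steps translate directly into code. A sketch of the forward pass for this 3-input, 3-hidden-unit, 1-output network (the random weights here are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))            # input with bias x_0 = 1
    z2 = Theta1 @ a1                           # z(2) = Theta(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2) = g(z(2)), add a_0^(2) = 1
    z3 = Theta2 @ a2                           # z(3) = Theta(2) a(2)
    return sigmoid(z3)                         # h_Theta(x) = a(3) = g(z(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))  # 3 hidden units x (3 inputs + bias)
Theta2 = rng.normal(size=(1, 4))  # 1 output x (3 hidden units + bias)
print(forward(np.array([0.5, -1.2, 2.0]), Theta1, Theta2))
```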
Learning in NN: Backpropagation

Similar to the perceptron learning algorithm, we cycle through our examples:
If the output of the network is correct, no changes are made
If there is an error, weights are adjusted to reduce the error

We are just performing (stochastic) gradient descent!

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par
Optimizing the Neural Network

J(Θ) = −(1/n) [ ∑_{i=1}^n ∑_{k=1}^K y_ik log(h_Θ(x_i))_k + (1 − y_ik) log(1 − (h_Θ(x_i))_k) ] + (λ/2n) ∑_{l=1}^{L−1} ∑_{i=1}^{s_l} ∑_{j=1}^{s_{l+1}} (Θ^(l)_ji)²

Unlike before, J(Θ) is not convex, so gradient descent on a neural net yields a local optimum.

Solve via:

∂/∂Θ^(l)_ij J(Θ) = a^(l)_j δ^(l+1)_i  (ignoring λ; i.e., if λ = 0)

Based on slide by Andrew Ng
Forward Propagation

Given one labeled training instance (x, y):

a^(1) = x
z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2)) [add a^(2)_0]
z^(3) = Θ^(2) a^(2)
a^(3) = g(z^(3)) [add a^(3)_0]
z^(4) = Θ^(3) a^(3)
a^(4) = h_Θ(x) = g(z^(4))

Based on slide by Andrew Ng
Backpropagation: Gradient Computation

Let δ^(l)_j = "error" of node j in layer l (number of layers L = 4).

Backpropagation:
δ^(4) = a^(4) − y
δ^(3) = (Θ^(3))ᵀ δ^(4) .* g′(z^(3))
δ^(2) = (Θ^(2))ᵀ δ^(3) .* g′(z^(2))
(No δ^(1))

where .* denotes the element-wise product, and

g′(z^(3)) = a^(3) .* (1 − a^(3))
g′(z^(2)) = a^(2) .* (1 − a^(2))

Based on slide by Andrew Ng
Training a Neural Network via Gradient Descent with Backprop

Given: training set {(x_1, y_1), . . . , (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop  // each iteration is called an epoch
  Set Δ^(l)_ij = 0 ∀ l, i, j  (used to accumulate the gradient)
  For each training instance (x_i, y_i):
    Set a^(1) = x_i
    Compute {a^(2), . . . , a^(L)} via forward propagation
    Compute δ^(L) = a^(L) − y_i
    Compute errors {δ^(L−1), . . . , δ^(2)}
    Accumulate gradients: Δ^(l)_ij = Δ^(l)_ij + a^(l)_j δ^(l+1)_i
  Compute the average regularized gradient:
    D^(l)_ij = (1/n) Δ^(l)_ij + λ Θ^(l)_ij   if j ≠ 0
    D^(l)_ij = (1/n) Δ^(l)_ij                otherwise
  Update weights via gradient step: Θ^(l)_ij = Θ^(l)_ij − α D^(l)_ij
Until weights converge or max #epochs is reached

Based on slide by Andrew Ng
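As a sketch of how these pieces fit together in code, here is one epoch of this loop for the two-layer network from the forward-propagation snippet above (the learning rate, λ, and array shapes are assumptions, and a cross-entropy loss is assumed so that the output error is a^(3) − y):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, Y, Theta1, Theta2, alpha=0.1, lam=0.01):
    """One epoch of gradient descent with backpropagation for a
    2-layer sigmoid network; bias entries sit in column/index 0."""
    n = X.shape[0]
    Delta1 = np.zeros_like(Theta1)  # gradient accumulators
    Delta2 = np.zeros_like(Theta2)
    for x, y in zip(X, Y):
        # forward propagation
        a1 = np.concatenate(([1.0], x))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # backward pass: delta(3) = a(3) - y, then delta(2)
        d3 = a3 - y
        d2 = (Theta2.T @ d3)[1:] * (a2[1:] * (1 - a2[1:]))  # no delta for bias
        # accumulate: Delta(l) += delta(l+1) a(l)^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # average regularized gradient (no regularization on bias column j = 0)
    D1 = Delta1 / n; D1[:, 1:] += lam * Theta1[:, 1:]
    D2 = Delta2 / n; D2[:, 1:] += lam * Theta2[:, 1:]
    # gradient step
    return Theta1 - alpha * D1, Theta2 - alpha * D2
```

Calling backprop_epoch repeatedly until the weights stop changing implements the outer loop of the pseudocode.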
Summary of the course so far

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x's true value y (i.e., regression or classification)

Linear vs. nonlinear features
1 Linear: h(x) depends on wᵀx
2 Nonlinear: h(x) depends on wᵀφ(x), where φ is either explicit or depends on a kernel function k(x_m, x_n) = φ(x_m)ᵀφ(x_n)

Loss function
1 Squared loss: least squares for regression (minimizing the residual sum of squares)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines

Principles of estimation
1 Point estimate: maximum likelihood, regularized likelihood
cont’d
Optimization
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations

Learning theory
1 Difference between training error and generalization error
2 Overfitting, bias and variance tradeoff
3 Regularization: various regularized models
Supervised versus Unsupervised Learning
Supervised: learning from labeled observations
Labels 'teach' the algorithm to learn a mapping from observations to labels
Classification, regression

Unsupervised: learning from unlabeled observations
Learning algorithm must find latent structure from features alone
Can be a goal in itself (discover hidden patterns, exploratory analysis)
Can be a means to an end (preprocessing for a supervised task)
Clustering (today)
Outline

1 Administration
2 Review of last lecture
3 Clustering: K-means, Gaussian mixture models
Clustering

Setup: Given D = {x_n}_{n=1}^N and K, we want to output:

{µ_k}_{k=1}^K: prototypes of clusters
A(x_n) ∈ {1, 2, . . . , K}: the cluster membership

Toy Example: Cluster data into two clusters.

[Figure: panel (a) shows the unlabeled data; panel (i) shows the resulting two clusters.]

Applications

Identify communities within social networks
Find topics in news stories
Group similar sequences into gene families
K-means example

[Figure: nine panels (a) through (i) tracing K-means on the toy data: starting from initial prototypes, the assignment and mean-update steps alternate until the two clusters stabilize.]
K-means clustering

Intuition: Data points assigned to cluster k should be near prototype µ_k

Distortion measure (clustering objective function, cost function):

J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖₂²

where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k
Algorithm

Minimize distortion: alternating optimization between {r_nk} and {µ_k}

Step 0 Initialize {µ_k} to some values
Step 1 Fix {µ_k} and minimize over {r_nk}, to get this assignment:
  r_nk = 1 if k = argmin_j ‖x_n − µ_j‖₂², and 0 otherwise
Step 2 Fix {r_nk} and minimize over {µ_k} to get this update:
  µ_k = (∑_n r_nk x_n) / (∑_n r_nk)
Step 3 Return to Step 1 unless the stopping criterion is met
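A compact NumPy sketch of Steps 0 through 3 (the random-data-point initialization and the prototype-change stopping test are assumptions; the slide leaves both open):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: initialize prototypes to K distinct data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each x_n to the nearest prototype
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K
        A = dists.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points
        # (assumes no cluster goes empty; a robust version would reseed it)
        new_mu = np.array([X[A == k].mean(axis=0) for k in range(K)])
        # Step 3: stop once the prototypes no longer change
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[A]) ** 2).sum()  # distortion at the final assignment
    return mu, A, J
```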
Remarks
Prototype µ_k is the mean of points assigned to cluster k, hence 'K-means'
The procedure reduces J in both Step 1 and Step 2, and thus makes improvements on each iteration
No guarantee we find the global solution
Quality of local optimum depends on initial values at Step 0
k-means++ is a principled approximation algorithm
Probabilistic interpretation of clustering?
How can we model p(x) to reflect our intuition that points stay close to their cluster centers?

[Figure: scatter plot of points in the unit square forming 3 clusters.]

Points seem to form 3 clusters
We cannot model p(x) with simple, known distributions
E.g., the data is not a Gaussian, b/c we have 3 distinct concentrated regions
Gaussian mixture models: intuition
[Figure: the same data, with the 3 regions distinguished.]

Model each region with a distinct distribution
Can use Gaussians, giving Gaussian mixture models (GMMs)
We don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components!
Must learn from unlabeled data D = {x_n}_{n=1}^N
Gaussian mixture models: formal definition

A GMM has the following density function for x:

p(x) = ∑_{k=1}^K ω_k N(x|µ_k, Σ_k)

K: number of Gaussians; they are called mixture components
µ_k and Σ_k: mean and covariance matrix of the k-th component
ω_k: mixture weights (or priors); they represent how much each component contributes to the final distribution, and they satisfy 2 properties:

∀ k, ω_k > 0, and ∑_k ω_k = 1

These properties ensure p(x) is in fact a probability density function.
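A sketch of evaluating this density with scipy.stats.multivariate_normal (the two components below are made-up parameters for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# A made-up 2-component GMM in 2 dimensions
omega = [0.3, 0.7]                        # weights: positive, sum to 1
mu = [np.zeros(2), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

def gmm_density(x):
    # p(x) = sum_k omega_k N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(omega, mu, Sigma))

print(gmm_density(np.array([1.0, 1.0])))
```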
GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

p(x, z) = p(z) p(x|z)

where z is a discrete random variable taking values between 1 and K. Denote

ω_k = p(z = k)

Now, assume the conditional distributions are Gaussian:

p(x|z = k) = N(x|µ_k, Σ_k)

Then the marginal distribution of x is

p(x) = ∑_{k=1}^K ω_k N(x|µ_k, Σ_k)

Namely, the Gaussian mixture model.
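This joint view also gives a recipe for sampling from a GMM: draw z from its prior, then draw x from the corresponding Gaussian. A minimal sketch (with omega, mu, Sigma as in the previous snippet):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, omega, mu, Sigma):
    # z ~ Categorical(omega), then x | z = k ~ N(mu_k, Sigma_k)
    zs = rng.choice(len(omega), size=n, p=omega)
    xs = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in zs])
    return xs, zs
```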
GMMs: example

[Figure (a): the toy data with points colored red, blue, or green by component.]

The conditional distributions of x given z (where z represents color) are

p(x|z = red) = N(x|µ_1, Σ_1)
p(x|z = blue) = N(x|µ_2, Σ_2)
p(x|z = green) = N(x|µ_3, Σ_3)

[Figure (b): the same data without color labels.]

The marginal distribution is thus

p(x) = p(red) N(x|µ_1, Σ_1) + p(blue) N(x|µ_2, Σ_2) + p(green) N(x|µ_3, Σ_3)
Parameter estimation for Gaussian mixture models

The parameters in GMMs are θ = {ω_k, µ_k, Σ_k}_{k=1}^K

Let's first consider the simple/unrealistic case where we have the labels z:

Define D′ = {x_n, z_n}_{n=1}^N
D′ is the complete data; D is the incomplete data

How can we learn our parameters? Given D′, the maximum likelihood estimate of θ is given by

θ = argmax_θ log p(D′) = argmax_θ ∑_n log p(x_n, z_n)
Parameter estimation for GMMs: complete data

The complete likelihood is decomposable:

∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n|z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n|z_n)

where we have grouped the data by its values of z_n.

Let γ_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k:

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log p(z = k) p(x_n|z = k)
                    = ∑_k ∑_n γ_nk [log ω_k + log N(x_n|µ_k, Σ_k)]
Parameter estimation for GMMs: complete data

From our previous discussion, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk [log ω_k + log N(x_n|µ_k, Σ_k)]

Regrouping, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log ω_k + ∑_k { ∑_n γ_nk log N(x_n|µ_k, Σ_k) }

The term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:

ω_k = ∑_n γ_nk / (∑_k ∑_n γ_nk)
µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n
Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)ᵀ

What's the intuition?
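These estimates are just counts, weighted means, and weighted covariances, which a few lines of NumPy make concrete (a sketch; the array shapes are assumptions):

```python
import numpy as np

def mle_complete(X, z, K):
    """MLE of GMM parameters given complete data.
    X: (N, d) data; z: (N,) integer labels in {0, ..., K-1}."""
    N, d = X.shape
    gamma = np.eye(K)[z]              # gamma_nk = 1 iff z_n = k
    Nk = gamma.sum(axis=0)            # number of points in each component
    omega = Nk / N                    # fraction of points with z_n = k
    mu = (gamma.T @ X) / Nk[:, None]  # per-component means
    Sigma = np.stack([
        ((X - mu[k]).T * gamma[:, k]) @ (X - mu[k]) / Nk[k]
        for k in range(K)
    ])                                # per-component covariances
    return omega, mu, Sigma
```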
Intuition
Since γ_nk is binary, the previous solution is nothing but:

ω_k: fraction of total data points whose z_n is k (note that ∑_k ∑_n γ_nk = N)
µ_k: mean of all data points whose z_n is k
Σ_k: covariance of all data points whose z_n is k

This intuition will help us develop an algorithm for estimating θ when we do not know z_n (incomplete data)
Parameter estimation for GMMs: incomplete data
When z_n is not given, we can guess it via the posterior probability

p(z_n = k|x_n) = p(x_n|z_n = k) p(z_n = k) / p(x_n) = p(x_n|z_n = k) p(z_n = k) / ∑_{k′=1}^K p(x_n|z_n = k′) p(z_n = k′)

To compute the posterior probability, we need to know the parameters θ!

Let's pretend we know the value of the parameters, so we can compute the posterior probability. How is that going to help us?
Estimation with soft γ_nk

We define γ_nk = p(z_n = k|x_n)

Recall that γ_nk was previously binary
Now it's a "soft" assignment of x_n to the k-th component
Each x_n is assigned to a component fractionally, according to p(z_n = k|x_n)

We now get the same expressions for the MLE as before!

ω_k = ∑_n γ_nk / (∑_k ∑_n γ_nk)
µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n
Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)ᵀ

But remember, we're 'cheating' by using θ to compute γ_nk!
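The soft γ_nk are exactly the posterior probabilities from the previous slide. A sketch of computing them for all points at once (with the parameters assumed known, as the slide pretends):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omega, mu, Sigma):
    """gamma_nk = p(z_n = k | x_n) under the current parameters theta."""
    K = len(omega)
    # unnormalized posterior: omega_k * N(x_n | mu_k, Sigma_k)
    gamma = np.column_stack([
        omega[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])                                               # shape (N, K)
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over k
```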
Iterative procedure

Alternate between estimating γ_nk and computing the parameters:

Step 0: initialize θ with some values (random or otherwise)
Step 1: compute γ_nk using the current θ
Step 2: update θ using the just-computed γ_nk
Step 3: go back to Step 1

This is an example of the EM algorithm, a powerful procedure for model estimation with hidden/latent variables.
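A sketch of the full loop, using the posterior computation above as the E-step and the weighted MLE formulas as the M-step (the initialization, fixed iteration count, and small covariance ridge are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: crude initialization (uniform weights, random means, shared covariance)
    omega = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T)] * K) + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        # Step 1 (E-step): gamma_nk = p(z_n = k | x_n) under the current theta
        gamma = np.column_stack([
            omega[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2 (M-step): plug gamma into the weighted MLE formulas
        Nk = gamma.sum(axis=0)
        omega = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        Sigma = np.stack([
            ((X - mu[k]).T * gamma[:, k]) @ (X - mu[k]) / Nk[k]
            for k in range(K)
        ]) + 1e-6 * np.eye(d)  # ridge keeps covariances well-conditioned
        # Step 3: repeat
    return omega, mu, Sigma, gamma
```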
Connection with K-means?

GMMs provide a probabilistic interpretation for K-means
K-means is a 'hard' GMM, or GMMs are a 'soft' K-means
The posterior γ_nk provides a probabilistic assignment of x_n to cluster k