CS260 Machine Learning Algorithms: Lecture Transcript
Clustering
Professor Ameet Talwalkar
Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, 2017 1 / 26
Outline
1 Administration
2 Review of last lecture
3 Clustering
Schedule and Final
Upcoming Schedule
Today: Last day of class
Next Monday (3/13): No class – I will hold office hours in my office (BH 4531F)
Next Wednesday (3/15): Final Exam, HW6 due
Final Exam
Cumulative but with more emphasis on new material
8 short questions (recall that the midterm had 6 short and 3 long questions)
Should be much shorter than midterm, but of equal difficulty
Focus on major concepts (e.g., MLE, primal and dual formulations of SVM, gradient descent, etc.)
Neural Networks – Basic Idea
Learning nonlinear basis functions and classifiers
Hidden layers are nonlinear mappings from input features to a new representation
Output layers use the new representations for classification and regression
Learning parameters
Backpropagation = efficient algorithm for (stochastic) gradient descent
Can write down explicit updates via chain rule of calculus
Single Node

Sigmoid (logistic) activation function: g(z) = 1 / (1 + e^(−z))

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

where x = [x_0, x_1, x_2, x_3]ᵀ, θ = [θ_0, θ_1, θ_2, θ_3]ᵀ, and x_0 = 1 is the "bias unit".

Based on slide by Andrew Ng
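For concreteness, a minimal NumPy sketch of this single node (the feature values and weights below are made-up for illustration):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def single_node(x, theta):
    # h_theta(x) = g(theta^T x); x[0] must be the bias unit x_0 = 1
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 2.0])      # x_0 = 1 is the bias unit
theta = np.array([0.1, -0.4, 0.8, 0.3])  # theta_0 is the bias weight
print(single_node(x, theta))             # output in (0, 1)
```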
Neural Network

Layer 1 (Input Layer), Layer 2 (Hidden Layer), Layer 3 (Output Layer)
bias units: x_0 = 1 and a^(2)_0

Slide by Andrew Ng

Feed-Forward Steps:

a^(2)_1 = g(Θ^(1)_10 x_0 + Θ^(1)_11 x_1 + Θ^(1)_12 x_2 + Θ^(1)_13 x_3) = g(z^(2)_1)
a^(2)_2 = g(Θ^(1)_20 x_0 + Θ^(1)_21 x_1 + Θ^(1)_22 x_2 + Θ^(1)_23 x_3) = g(z^(2)_2)
a^(2)_3 = g(Θ^(1)_30 x_0 + Θ^(1)_31 x_1 + Θ^(1)_32 x_2 + Θ^(1)_33 x_3) = g(z^(2)_3)
h_Θ(x) = g(Θ^(2)_10 a^(2)_0 + Θ^(2)_11 a^(2)_1 + Θ^(2)_12 a^(2)_2 + Θ^(2)_13 a^(2)_3) = g(z^(3)_1)

Vectorization:

z^(2) = Θ^(1) x
a^(2) = g(z^(2)), then add a^(2)_0 = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))

Based on slide by Andrew Ng
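The vectorized steps translate directly into code. A sketch of the forward pass for this 3-input, 3-hidden-unit, 1-output network (the random weights here are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))            # input with bias x_0 = 1
    z2 = Theta1 @ a1                           # z(2) = Theta(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2) = g(z(2)), add a_0^(2) = 1
    z3 = Theta2 @ a2                           # z(3) = Theta(2) a(2)
    return sigmoid(z3)                         # h_Theta(x) = a(3) = g(z(3))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))  # 3 hidden units x (3 inputs + bias)
Theta2 = rng.normal(size=(1, 4))  # 1 output x (3 hidden units + bias)
print(forward(np.array([0.5, -1.2, 2.0]), Theta1, Theta2))
```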
Learning in NN: Backpropagation

Similar to the perceptron learning algorithm, we cycle through our examples:
If the output of the network is correct, no changes are made
If there is an error, weights are adjusted to reduce the error

We are just performing (stochastic) gradient descent!

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par
Optimizing the Neural Network

J(Θ) = −(1/n) [ ∑_{i=1}^n ∑_{k=1}^K y_ik log(h_Θ(x_i))_k + (1 − y_ik) log(1 − (h_Θ(x_i))_k) ] + (λ/2n) ∑_{l=1}^{L−1} ∑_{i=1}^{s_l} ∑_{j=1}^{s_{l+1}} (Θ^(l)_ji)²

Unlike before, J(Θ) is not convex, so gradient descent on a neural net yields a local optimum.

Solve via:

∂/∂Θ^(l)_ij J(Θ) = a^(l)_j δ^(l+1)_i  (ignoring λ; i.e., if λ = 0)

Based on slide by Andrew Ng
Forward Propagation

Given one labeled training instance (x, y):

a^(1) = x
z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2)) [add a^(2)_0]
z^(3) = Θ^(2) a^(2)
a^(3) = g(z^(3)) [add a^(3)_0]
z^(4) = Θ^(3) a^(3)
a^(4) = h_Θ(x) = g(z^(4))

Based on slide by Andrew Ng
Backpropagation: Gradient Computation

Let δ^(l)_j = "error" of node j in layer l (number of layers L = 4).

Backpropagation:
δ^(4) = a^(4) − y
δ^(3) = (Θ^(3))ᵀ δ^(4) .* g′(z^(3))
δ^(2) = (Θ^(2))ᵀ δ^(3) .* g′(z^(2))
(No δ^(1))

where .* denotes the element-wise product, and

g′(z^(3)) = a^(3) .* (1 − a^(3))
g′(z^(2)) = a^(2) .* (1 − a^(2))

Based on slide by Andrew Ng
Training a Neural Network via Gradient Descent with Backprop

Given: training set {(x_1, y_1), . . . , (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop  // each iteration is called an epoch
  Set Δ^(l)_ij = 0 ∀ l, i, j  (used to accumulate the gradient)
  For each training instance (x_i, y_i):
    Set a^(1) = x_i
    Compute {a^(2), . . . , a^(L)} via forward propagation
    Compute δ^(L) = a^(L) − y_i
    Compute errors {δ^(L−1), . . . , δ^(2)}
    Accumulate gradients: Δ^(l)_ij = Δ^(l)_ij + a^(l)_j δ^(l+1)_i
  Compute the average regularized gradient:
    D^(l)_ij = (1/n) Δ^(l)_ij + λ Θ^(l)_ij   if j ≠ 0
    D^(l)_ij = (1/n) Δ^(l)_ij                otherwise
  Update weights via gradient step: Θ^(l)_ij = Θ^(l)_ij − α D^(l)_ij
Until weights converge or max #epochs is reached

Based on slide by Andrew Ng
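As a sketch of how these pieces fit together in code, here is one epoch of this loop for the two-layer network from the forward-propagation snippet above (the learning rate, λ, and array shapes are assumptions, and a cross-entropy loss is assumed so that the output error is a^(3) − y):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, Y, Theta1, Theta2, alpha=0.1, lam=0.01):
    """One epoch of gradient descent with backpropagation for a
    2-layer sigmoid network; bias entries sit in column/index 0."""
    n = X.shape[0]
    Delta1 = np.zeros_like(Theta1)  # gradient accumulators
    Delta2 = np.zeros_like(Theta2)
    for x, y in zip(X, Y):
        # forward propagation
        a1 = np.concatenate(([1.0], x))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # backward pass: delta(3) = a(3) - y, then delta(2)
        d3 = a3 - y
        d2 = (Theta2.T @ d3)[1:] * (a2[1:] * (1 - a2[1:]))  # no delta for bias
        # accumulate: Delta(l) += delta(l+1) a(l)^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # average regularized gradient (no regularization on bias column j = 0)
    D1 = Delta1 / n; D1[:, 1:] += lam * Theta1[:, 1:]
    D2 = Delta2 / n; D2[:, 1:] += lam * Theta2[:, 1:]
    # gradient step
    return Theta1 - alpha * D1, Theta2 - alpha * D2
```

Calling backprop_epoch repeatedly until the weights stop changing implements the outer loop of the pseudocode.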
Summary of the course so far

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x's true value y (i.e., regression or classification)

Linear vs. nonlinear features
1 Linear: h(x) depends on wᵀx
2 Nonlinear: h(x) depends on wᵀφ(x), where φ is either explicit or depends on a kernel function k(x_m, x_n) = φ(x_m)ᵀφ(x_n)

Loss function
1 Squared loss: least squares for regression (minimizing the residual sum of squares)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines

Principles of estimation
1 Point estimate: maximum likelihood, regularized likelihood
cont’d
Optimization
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations

Learning theory
1 Difference between training error and generalization error
2 Overfitting, bias and variance tradeoff
3 Regularization: various regularized models
Supervised versus Unsupervised Learning
Supervised: learning from labeled observations
Labels 'teach' the algorithm to learn a mapping from observations to labels
Classification, regression

Unsupervised: learning from unlabeled observations
Learning algorithm must find latent structure from features alone
Can be a goal in itself (discover hidden patterns, exploratory analysis)
Can be a means to an end (preprocessing for a supervised task)
Clustering (today)
Outline

1 Administration
2 Review of last lecture
3 Clustering: K-means, Gaussian mixture models
Clustering

Setup: Given D = {x_n}_{n=1}^N and K, we want to output:

{µ_k}_{k=1}^K: prototypes of clusters
A(x_n) ∈ {1, 2, . . . , K}: the cluster membership

Toy Example: Cluster data into two clusters.

[Figure: panel (a) shows the unlabeled data; panel (i) shows the resulting two clusters.]

Applications

Identify communities within social networks
Find topics in news stories
Group similar sequences into gene families
K-means example

[Figure: nine panels (a) through (i) tracing K-means on the toy data: starting from initial prototypes, the assignment and mean-update steps alternate until the two clusters stabilize.]
K-means clustering

Intuition: Data points assigned to cluster k should be near prototype µ_k

Distortion measure (clustering objective function, cost function):

J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖₂²

where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k
Algorithm

Minimize distortion: alternating optimization between {r_nk} and {µ_k}

Step 0 Initialize {µ_k} to some values
Step 1 Fix {µ_k} and minimize over {r_nk}, to get this assignment:
  r_nk = 1 if k = argmin_j ‖x_n − µ_j‖₂², and 0 otherwise
Step 2 Fix {r_nk} and minimize over {µ_k} to get this update:
  µ_k = (∑_n r_nk x_n) / (∑_n r_nk)
Step 3 Return to Step 1 unless the stopping criterion is met
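A compact NumPy sketch of Steps 0 through 3 (the random-data-point initialization and the prototype-change stopping test are assumptions; the slide leaves both open):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: initialize prototypes to K distinct data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each x_n to the nearest prototype
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K
        A = dists.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points
        # (assumes no cluster goes empty; a robust version would reseed it)
        new_mu = np.array([X[A == k].mean(axis=0) for k in range(K)])
        # Step 3: stop once the prototypes no longer change
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[A]) ** 2).sum()  # distortion at the final assignment
    return mu, A, J
```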
Remarks
Prototype µ_k is the mean of points assigned to cluster k, hence 'K-means'
The procedure reduces J in both Step 1 and Step 2, and thus makes improvements on each iteration
No guarantee we find the global solution
Quality of local optimum depends on initial values at Step 0
k-means++ is a principled approximation algorithm
Probabilistic interpretation of clustering?
How can we model p(x) to reflect our intuition that points stay close to their cluster centers?

[Figure: scatter plot of points in the unit square forming 3 clusters.]

Points seem to form 3 clusters
We cannot model p(x) with simple, known distributions
E.g., the data is not a Gaussian, b/c we have 3 distinct concentrated regions
Gaussian mixture models: intuition
[Figure: the same data, with the 3 regions distinguished.]

Model each region with a distinct distribution
Can use Gaussians, giving Gaussian mixture models (GMMs)
We don't know the cluster assignments (labels), the parameters of the Gaussians, or the mixture components!
Must learn from unlabeled data D = {x_n}_{n=1}^N
Gaussian mixture models: formal definition

A GMM has the following density function for x:

p(x) = ∑_{k=1}^K ω_k N(x|µ_k, Σ_k)

K: number of Gaussians; they are called mixture components
µ_k and Σ_k: mean and covariance matrix of the k-th component
ω_k: mixture weights (or priors); they represent how much each component contributes to the final distribution, and they satisfy 2 properties:

∀ k, ω_k > 0, and ∑_k ω_k = 1

These properties ensure p(x) is in fact a probability density function.
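A sketch of evaluating this density with scipy.stats.multivariate_normal (the two components below are made-up parameters for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# A made-up 2-component GMM in 2 dimensions
omega = [0.3, 0.7]                        # weights: positive, sum to 1
mu = [np.zeros(2), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

def gmm_density(x):
    # p(x) = sum_k omega_k N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(omega, mu, Sigma))

print(gmm_density(np.array([1.0, 1.0])))
```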
GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

p(x, z) = p(z) p(x|z)

where z is a discrete random variable taking values between 1 and K. Denote

ω_k = p(z = k)

Now, assume the conditional distributions are Gaussian:

p(x|z = k) = N(x|µ_k, Σ_k)

Then the marginal distribution of x is

p(x) = ∑_{k=1}^K ω_k N(x|µ_k, Σ_k)

Namely, the Gaussian mixture model.
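This joint view also gives a recipe for sampling from a GMM: draw z from its prior, then draw x from the corresponding Gaussian. A minimal sketch (with omega, mu, Sigma as in the previous snippet):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, omega, mu, Sigma):
    # z ~ Categorical(omega), then x | z = k ~ N(mu_k, Sigma_k)
    zs = rng.choice(len(omega), size=n, p=omega)
    xs = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in zs])
    return xs, zs
```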
GMMs: example

[Figure (a): the toy data with points colored red, blue, or green by component.]

The conditional distributions of x given z (where z represents color) are

p(x|z = red) = N(x|µ_1, Σ_1)
p(x|z = blue) = N(x|µ_2, Σ_2)
p(x|z = green) = N(x|µ_3, Σ_3)

[Figure (b): the same data without color labels.]

The marginal distribution is thus

p(x) = p(red) N(x|µ_1, Σ_1) + p(blue) N(x|µ_2, Σ_2) + p(green) N(x|µ_3, Σ_3)
Parameter estimation for Gaussian mixture models

The parameters in GMMs are θ = {ω_k, µ_k, Σ_k}_{k=1}^K

Let's first consider the simple/unrealistic case where we have the labels z:

Define D′ = {x_n, z_n}_{n=1}^N
D′ is the complete data; D is the incomplete data

How can we learn our parameters? Given D′, the maximum likelihood estimate of θ is given by

θ = argmax_θ log p(D′) = argmax_θ ∑_n log p(x_n, z_n)
Parameter estimation for GMMs: complete data

The complete likelihood is decomposable:

∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n|z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n|z_n)

where we have grouped the data by its values of z_n.

Let γ_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k:

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log p(z = k) p(x_n|z = k)
                    = ∑_k ∑_n γ_nk [log ω_k + log N(x_n|µ_k, Σ_k)]
Parameter estimation for GMMs: complete data

From our previous discussion, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk [log ω_k + log N(x_n|µ_k, Σ_k)]

Regrouping, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log ω_k + ∑_k { ∑_n γ_nk log N(x_n|µ_k, Σ_k) }

The term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as an exercise) that the MLE is:

ω_k = ∑_n γ_nk / (∑_k ∑_n γ_nk)
µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n
Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)ᵀ

What's the intuition?
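These estimates are just counts, weighted means, and weighted covariances, which a few lines of NumPy make concrete (a sketch; the array shapes are assumptions):

```python
import numpy as np

def mle_complete(X, z, K):
    """MLE of GMM parameters given complete data.
    X: (N, d) data; z: (N,) integer labels in {0, ..., K-1}."""
    N, d = X.shape
    gamma = np.eye(K)[z]              # gamma_nk = 1 iff z_n = k
    Nk = gamma.sum(axis=0)            # number of points in each component
    omega = Nk / N                    # fraction of points with z_n = k
    mu = (gamma.T @ X) / Nk[:, None]  # per-component means
    Sigma = np.stack([
        ((X - mu[k]).T * gamma[:, k]) @ (X - mu[k]) / Nk[k]
        for k in range(K)
    ])                                # per-component covariances
    return omega, mu, Sigma
```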
Intuition
Since γ_nk is binary, the previous solution is nothing but:

ω_k: fraction of total data points whose z_n is k (note that ∑_k ∑_n γ_nk = N)
µ_k: mean of all data points whose z_n is k
Σ_k: covariance of all data points whose z_n is k

This intuition will help us develop an algorithm for estimating θ when we do not know z_n (incomplete data)
Parameter estimation for GMMs: incomplete data
When z_n is not given, we can guess it via the posterior probability

p(z_n = k|x_n) = p(x_n|z_n = k) p(z_n = k) / p(x_n) = p(x_n|z_n = k) p(z_n = k) / ∑_{k′=1}^K p(x_n|z_n = k′) p(z_n = k′)

To compute the posterior probability, we need to know the parameters θ!

Let's pretend we know the value of the parameters, so we can compute the posterior probability. How is that going to help us?
Estimation with soft γ_nk

We define γ_nk = p(z_n = k|x_n)

Recall that γ_nk was previously binary
Now it's a "soft" assignment of x_n to the k-th component
Each x_n is assigned to a component fractionally, according to p(z_n = k|x_n)

We now get the same expressions for the MLE as before!

ω_k = ∑_n γ_nk / (∑_k ∑_n γ_nk)
µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n
Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)ᵀ

But remember, we're 'cheating' by using θ to compute γ_nk!
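The soft γ_nk are exactly the posterior probabilities from the previous slide. A sketch of computing them for all points at once (with the parameters assumed known, as the slide pretends):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omega, mu, Sigma):
    """gamma_nk = p(z_n = k | x_n) under the current parameters theta."""
    K = len(omega)
    # unnormalized posterior: omega_k * N(x_n | mu_k, Sigma_k)
    gamma = np.column_stack([
        omega[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
        for k in range(K)
    ])                                               # shape (N, K)
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over k
```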
Iterative procedure

Alternate between estimating γ_nk and computing the parameters:

Step 0: initialize θ with some values (random or otherwise)
Step 1: compute γ_nk using the current θ
Step 2: update θ using the just-computed γ_nk
Step 3: go back to Step 1

This is an example of the EM algorithm, a powerful procedure for model estimation with hidden/latent variables.
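A sketch of the full loop, using the posterior computation above as the E-step and the weighted MLE formulas as the M-step (the initialization, fixed iteration count, and small covariance ridge are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: crude initialization (uniform weights, random means, shared covariance)
    omega = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T)] * K) + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        # Step 1 (E-step): gamma_nk = p(z_n = k | x_n) under the current theta
        gamma = np.column_stack([
            omega[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2 (M-step): plug gamma into the weighted MLE formulas
        Nk = gamma.sum(axis=0)
        omega = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        Sigma = np.stack([
            ((X - mu[k]).T * gamma[:, k]) @ (X - mu[k]) / Nk[k]
            for k in range(K)
        ]) + 1e-6 * np.eye(d)  # ridge keeps covariances well-conditioned
        # Step 3: repeat
    return omega, mu, Sigma, gamma
```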
Connection with K-means?

GMMs provide a probabilistic interpretation for K-means
K-means is a 'hard' GMM, or GMMs are a 'soft' K-means
The posterior γ_nk provides a probabilistic assignment of x_n to cluster k