
Page 1

EM + PCA

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 17
March 22, 2017

Machine Learning Department
School of Computer Science
Carnegie Mellon University

EM, GMM Readings:
Murphy 11.4.1, 11.4.2, 11.4.4
Bishop 9
HTF 8.5 - 8.5.3
Mitchell 6.12 - 6.12.2

PCA Readings:
Murphy 12
Bishop 12
HTF 14.5
Mitchell --

1

Page 2

Reminders

• Homework 5: Readings / Application of ML
  – Release: Wed, Mar. 08
  – Due: Wed, Mar. 22 at 11:59pm

2

Page 3

EM AND GMMS

3

Page 4

Expectation-Maximization Outline

• Background
  – Multivariate Gaussian Distribution
  – Marginal Probabilities
• Building up to GMMs
  – Distinction #1: Model vs. Objective Function
  – Gaussian Naïve Bayes (GNB)
  – Gaussian Discriminant Analysis
  – Gaussian Mixture Model (GMM)
• Expectation-Maximization
  – Distinction #2: Complete Data Likelihood vs. Marginal Data Likelihood
  – Distinction #3: Latent Variables vs. Parameters
  – Objective Functions for EM
  – EM Algorithm
  – EM for GMM vs. K-Means
• Properties of EM
  – Nonconvexity / Local Optimization
  – Example: Grammar Induction
  – Variants of EM

4

This Lecture

Last Lecture

Page 5

Background

Whiteboard:
– Multivariate Gaussian Distribution
– Marginal Probabilities

6

Page 6

GAUSSIAN MIXTURE MODEL (GMM)

7

Page 7

Building up to GMMs

Whiteboard:
– Distinction #1: Model vs. Objective Function
– Gaussian Naïve Bayes (GNB)
– Gaussian Discriminant Analysis
– Gaussian Mixture Model (GMM)

8

Page 8

Gaussian Discriminant Analysis

Data: $\mathcal{D} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^N$ where $\mathbf{x}^{(i)} \in \mathbb{R}^M$ and $z^{(i)} \in \{1, \ldots, K\}$

Generative Story:
$z \sim \text{Categorical}(\phi)$
$\mathbf{x} \sim \text{Gaussian}(\mu_z, \Sigma_z)$

Model (Joint): $p(\mathbf{x}, z; \phi, \mu, \Sigma) = p(\mathbf{x} \mid z; \mu, \Sigma)\, p(z; \phi)$

Log-likelihood:
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^N \log p(\mathbf{x}^{(i)}, z^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^N \left[ \log p(\mathbf{x}^{(i)} \mid z^{(i)}; \mu, \Sigma) + \log p(z^{(i)}; \phi) \right]$$

9

Page 9

Gaussian Discriminant Analysis

Data: $\mathcal{D} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^N$ where $\mathbf{x}^{(i)} \in \mathbb{R}^M$ and $z^{(i)} \in \{1, \ldots, K\}$

Log-likelihood:
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^N \left[ \log p(\mathbf{x}^{(i)} \mid z^{(i)}; \mu, \Sigma) + \log p(z^{(i)}; \phi) \right]$$

Maximum Likelihood Estimates: Take the derivative of the Lagrangian, set it equal to zero, and solve.

$$\phi_k = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(z^{(i)} = k), \quad \forall k$$

$$\mu_k = \frac{\sum_{i=1}^N \mathbb{I}(z^{(i)} = k)\, \mathbf{x}^{(i)}}{\sum_{i=1}^N \mathbb{I}(z^{(i)} = k)}, \quad \forall k$$

$$\Sigma_k = \frac{\sum_{i=1}^N \mathbb{I}(z^{(i)} = k)\, (\mathbf{x}^{(i)} - \mu_k)(\mathbf{x}^{(i)} - \mu_k)^T}{\sum_{i=1}^N \mathbb{I}(z^{(i)} = k)}, \quad \forall k$$

Implementation: Just counting

10
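The "just counting" remark can be made concrete with a short NumPy sketch (illustrative, not from the lecture; the function name gda_mle and the array layout are my own choices, and labels are taken as 0, ..., K-1 rather than 1, ..., K):

```python
import numpy as np

def gda_mle(X, z, K):
    """Closed-form MLE for Gaussian Discriminant Analysis.

    X: (N, M) array of features; z: (N,) array of labels in {0, ..., K-1}.
    Returns mixing proportions phi, per-class means mu, per-class covariances Sigma.
    """
    N, M = X.shape
    phi = np.zeros(K)
    mu = np.zeros((K, M))
    Sigma = np.zeros((K, M, M))
    for k in range(K):
        Xk = X[z == k]                      # the examples with z^(i) = k
        phi[k] = len(Xk) / N                # phi_k = (1/N) * count of class k
        mu[k] = Xk.mean(axis=0)             # mu_k = class-conditional sample mean
        diff = Xk - mu[k]
        Sigma[k] = diff.T @ diff / len(Xk)  # Sigma_k = class-conditional sample covariance
    return phi, mu, Sigma
```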

Page 10

Gaussian Mixture Model

Data: Assume we are given data $\mathcal{D}$ consisting of $N$ fully unsupervised examples in $M$ dimensions:
$\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^N$ where $\mathbf{x}^{(i)} \in \mathbb{R}^M$

Generative Story:
$z \sim \text{Categorical}(\phi)$
$\mathbf{x} \sim \text{Gaussian}(\mu_z, \Sigma_z)$

Model:
Joint: $p(\mathbf{x}, z; \phi, \mu, \Sigma) = p(\mathbf{x} \mid z; \mu, \Sigma)\, p(z; \phi)$
Marginal: $p(\mathbf{x}; \phi, \mu, \Sigma) = \sum_{z=1}^K p(\mathbf{x} \mid z; \mu, \Sigma)\, p(z; \phi)$

(Marginal) Log-likelihood:
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^N \log p(\mathbf{x}^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^N \log \sum_{z=1}^K p(\mathbf{x}^{(i)} \mid z; \mu, \Sigma)\, p(z; \phi)$$

11
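As a hedged illustration (not part of the slides), the marginal log-likelihood above can be evaluated directly; the inner sum over z is done in log space with a log-sum-exp for numerical stability, and SciPy supplies the Gaussian densities:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, phi, mu, Sigma):
    """Marginal log-likelihood: sum_i log sum_z phi_z N(x^(i) | mu_z, Sigma_z).

    X: (N, M); phi: (K,); mu: (K, M); Sigma: (K, M, M).
    """
    K = phi.shape[0]
    # log p(x^(i) | z; mu, Sigma) + log p(z; phi), one column per component z
    log_joint = np.column_stack([
        multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k]) + np.log(phi[k])
        for k in range(K)
    ])
    # log of the marginal p(x^(i); phi, mu, Sigma) via log-sum-exp over z
    return logsumexp(log_joint, axis=1).sum()
```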

Page 11

Mixture Model

Data: Assume we are given data $\mathcal{D}$ consisting of $N$ fully unsupervised examples in $M$ dimensions:
$\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^N$ where $\mathbf{x}^{(i)} \in \mathbb{R}^M$

Generative Story:
$z \sim \text{Multinomial}(\phi)$
$\mathbf{x} \sim p_\theta(\cdot \mid z)$
(in the Gaussian case: $z \sim \text{Categorical}(\phi)$, $\mathbf{x} \sim \text{Gaussian}(\mu_z, \Sigma_z)$)

Model:
Joint: $p_{\phi,\theta}(\mathbf{x}, z) = p_\theta(\mathbf{x} \mid z)\, p_\phi(z)$
Marginal: $p_{\phi,\theta}(\mathbf{x}) = \sum_{z=1}^K p_\theta(\mathbf{x} \mid z)\, p_\phi(z)$

(Marginal) Log-likelihood:
$$\ell(\phi, \theta) = \sum_{i=1}^N \log p_{\phi,\theta}(\mathbf{x}^{(i)}) = \sum_{i=1}^N \log \sum_{z=1}^K p_\theta(\mathbf{x}^{(i)} \mid z)\, p_\phi(z)$$

12

Page 12

Mixture Model (same data, model, and marginal log-likelihood as the previous slide), with two annotations on the generative story:

$\mathbf{x} \sim p_\theta(\cdot \mid z)$: This could be any arbitrary distribution parameterized by θ.

Today we're thinking about the case where it is a Multivariate Gaussian: $z \sim \text{Categorical}(\phi)$, $\mathbf{x} \sim \text{Gaussian}(\mu_z, \Sigma_z)$.

13
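A minimal sketch of the generative story in the Gaussian case (illustrative parameter values and function name, not from the slides): draw z from a Categorical distribution, then draw x from the Gaussian selected by z.

```python
import numpy as np

def sample_gmm(N, phi, mu, Sigma, seed=0):
    """Sample N points via z ~ Categorical(phi), x ~ Gaussian(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phi), size=N, p=phi)   # latent component index per example
    X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return X, z

# Example: two well-separated 2-D components
phi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma = np.array([np.eye(2), np.eye(2)])
X, z = sample_gmm(500, phi, mu, Sigma)
```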

Page 13

Learning a Mixture Model

(Graphical model on the slide: $Z$ with children $X_1, X_2, \ldots, X_M$, drawn once for each case below.)

Supervised Learning: The parameters decouple!

$\mathcal{D} = \{(\mathbf{x}^{(i)}, z^{(i)})\}_{i=1}^N$

$$\theta^*, \phi^* = \operatorname*{argmax}_{\theta, \phi} \sum_{i=1}^N \log p_\theta(\mathbf{x}^{(i)} \mid z^{(i)})\, p_\phi(z^{(i)})$$

$$\theta^* = \operatorname*{argmax}_{\theta} \sum_{i=1}^N \log p_\theta(\mathbf{x}^{(i)} \mid z^{(i)}), \qquad \phi^* = \operatorname*{argmax}_{\phi} \sum_{i=1}^N \log p_\phi(z^{(i)})$$

Unsupervised Learning: Parameters are coupled by marginalization.

$\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^N$

$$\theta^*, \phi^* = \operatorname*{argmax}_{\theta, \phi} \sum_{i=1}^N \log \sum_{z=1}^K p_\theta(\mathbf{x}^{(i)} \mid z)\, p_\phi(z)$$

14
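To spell out the decoupling claim (a one-line derivation, not on the slide): in the supervised objective the log of the product splits into a term that depends only on θ and a term that depends only on φ, so each can be maximized separately; in the unsupervised objective the log sits outside the sum over z, and no such split exists.

$$
\sum_{i=1}^N \log p_\theta(\mathbf{x}^{(i)} \mid z^{(i)})\, p_\phi(z^{(i)})
= \underbrace{\sum_{i=1}^N \log p_\theta(\mathbf{x}^{(i)} \mid z^{(i)})}_{\text{depends only on } \theta}
+ \underbrace{\sum_{i=1}^N \log p_\phi(z^{(i)})}_{\text{depends only on } \phi},
\qquad \text{but} \qquad
\log \sum_{z=1}^K p_\theta(\mathbf{x}^{(i)} \mid z)\, p_\phi(z) \ \text{does not factor.}
$$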

Page 14

Learning a Mixture Model (same supervised and unsupervised objectives as the previous slide), with three remarks on the unsupervised case:

Training certainly isn't as simple as the supervised case.

In many cases, we could still use some black-box optimization method (e.g. Newton-Raphson) to solve this coupled optimization problem.

This lecture is about a more problem-specific method: EM.

15
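To illustrate the black-box route mentioned on this slide (a toy sketch, not from the lecture): hand the negative marginal log-likelihood of a 1-D, two-component mixture with known variances and mixing weights to a generic optimizer. BFGS, a quasi-Newton method, stands in here for the slide's Newton-Raphson; all names and values below are my own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
log_phi = np.log([0.5, 0.5])   # fixed mixing weights
sigma = 1.0                    # fixed, shared standard deviation

def neg_marginal_ll(mu):
    # -(1/N) * sum_i log sum_z phi_z N(x^(i) | mu_z, sigma^2)
    log_joint = norm.logpdf(X[:, None], loc=mu[None, :], scale=sigma) + log_phi
    return -logsumexp(log_joint, axis=1).mean()

result = minimize(neg_marginal_ll, x0=np.array([0.0, 1.0]), method="BFGS")
print(result.x)  # recovered component means, roughly [-2, 3] up to a label swap
```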

Page 15

EXPECTATION MAXIMIZATION

16

Page 16

Expectation-Maximization

Whiteboard:
– Distinction #2: Complete Data Likelihood vs. Marginal Data Likelihood
– Distinction #3: Latent Variables vs. Parameters
– Objective Functions for EM
– EM Algorithm
– EM for GMM vs. K-Means

17

Page 17

EXAMPLE: K-MEANS VS GMM

24

Pages 18-25

Example: K-Means

(Slides 25-32: a sequence of figure-only slides, each titled "Example: K-Means"; the images are not captured in this transcript.)

Pages 26-47

Example: GMM

(Slides 33-54: a sequence of figure-only slides, each titled "Example: GMM"; the images are not captured in this transcript.)

Page 48

K-Means vs. GMM

Convergence: K-Means tends to converge much faster than EM for GMM.

Speed: Each iteration of K-Means is computationally less intensive than each iteration of EM for GMM.

Initialization: To initialize EM for GMM, we typically first run K-Means and use the resulting cluster centers as the means of the Gaussian components.

Output: A GMM gives a probability distribution over the cluster assignment for each point, whereas K-Means gives a single hard assignment.

55
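A compact sketch of the contrast (my own illustration, not the course's reference implementation): one EM iteration for a GMM computes soft responsibilities in the E-step and re-estimates phi, mu, Sigma from those soft counts in the M-step, while one K-Means iteration replaces the soft assignment with a hard argmin over distances.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm_step(X, phi, mu, Sigma):
    """One EM iteration for a GMM (soft assignments)."""
    N, M = X.shape
    K = phi.shape[0]
    # E-step: responsibilities r[i, k] = p(z = k | x^(i); current parameters)
    log_joint = np.column_stack([
        multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k]) + np.log(phi[k])
        for k in range(K)
    ])
    r = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
    # M-step: the supervised "counting" estimates, with soft counts in place of hard ones
    Nk = r.sum(axis=0)                     # effective number of points per component
    phi = Nk / N
    mu = (r.T @ X) / Nk[:, None]
    Sigma = np.stack([
        ((X - mu[k]) * r[:, [k]]).T @ (X - mu[k]) / Nk[k] + 1e-6 * np.eye(M)
        for k in range(K)                  # small ridge keeps each covariance invertible
    ])
    return phi, mu, Sigma

def kmeans_step(X, centers):
    """One K-Means iteration (hard assignments); assumes no cluster goes empty."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    z = dists.argmin(axis=1)               # hard cluster assignment per point
    return np.stack([X[z == k].mean(axis=0) for k in range(len(centers))]), z
```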

Page 49

PROPERTIES OF EM

56

Page 50

Properties of EM

• EM is trying to optimize a nonconvex function
• But EM is a local optimization algorithm
• Typical solution: Random Restarts
  – Just like K-Means, we run the algorithm many times
  – Each time, initialize the parameters randomly
  – Pick the parameters that give the highest likelihood

57
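A minimal sketch of the random-restart recipe (assumed helpers: em_gmm_step and gmm_log_likelihood as sketched after the earlier slides; the initialization scheme here is one common choice, not prescribed by the slide):

```python
import numpy as np

def em_with_restarts(X, K, n_restarts=10, n_iters=100, seed=0):
    """Run EM from several random initializations; keep the highest-likelihood fit."""
    rng = np.random.default_rng(seed)
    best_ll, best_params = -np.inf, None
    for _ in range(n_restarts):
        # random initialization: uniform weights, means at random data points
        phi = np.full(K, 1.0 / K)
        mu = X[rng.choice(len(X), size=K, replace=False)]
        Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(X.shape[1])] * K)
        for _ in range(n_iters):
            phi, mu, Sigma = em_gmm_step(X, phi, mu, Sigma)      # assumed helper (see above)
        ll = gmm_log_likelihood(X, phi, mu, Sigma)               # assumed helper (see above)
        if ll > best_ll:
            best_ll, best_params = ll, (phi, mu, Sigma)
    return best_params, best_ll
```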

Page 51

Example: Grammar Induction

Question: Can maximizing (unsupervised) marginal likelihood produce useful results?

Answer: Let's look at an example...
• Babies learn the syntax of their native language (e.g. English) just by hearing many sentences
• Can a computer similarly learn the syntax of a human language just by looking at lots of example sentences?
  – This is the problem of Grammar Induction!
  – It's an unsupervised learning problem
  – We try to recover the syntactic structure for each sentence without any supervision

58

Page 52

Example: Grammar Induction

(Figure: several alternative parse trees for the ambiguous sentences "Alice saw Bob on a hill with a telescope" and "time flies like an arrow"; the tree diagrams are not captured in this transcript.)

No semantic interpretation

59

Page 53

Example: Grammar Induction

Training Data: Sentences only, without parses
Sample 1, $\mathbf{x}^{(1)}$: time flies like an arrow
Sample 2, $\mathbf{x}^{(2)}$: real flies like soup
Sample 3, $\mathbf{x}^{(3)}$: flies fly with their wings
Sample 4, $\mathbf{x}^{(4)}$: with time you will see

Test Data: Sentences with parses, so we can evaluate accuracy

60

Page 54

Example: Grammar Induction

(Figure from Gimpel & Smith (NAACL 2012) slides: a scatter plot of attachment accuracy (%) against log-likelihood (per sentence) for the Dependency Model with Valence (Klein & Manning, 2004); Pearson's r = 0.63, a strong correlation.)

Q: Does likelihood correlate with accuracy on a task we care about?

A: Yes, but there is still a wide range of accuracies for a particular likelihood value.

61

Page 55

Variants of EM

• Generalized EM: Replace the M-Step by a single gradient step that improves the likelihood
• Monte Carlo EM: Approximate the E-Step by sampling
• Sparse EM: Keep an "active list" of points (updated occasionally) from which we estimate the expected counts in the E-Step
• Incremental EM / Stepwise EM: If standard EM is described as a batch algorithm, these are the online equivalent
• etc.

63

Page 56

A Report Card for EM

• Some good things about EM:
  – no learning rate (step-size) parameter
  – automatically enforces parameter constraints
  – very fast for low dimensions
  – each iteration guaranteed to improve the likelihood
• Some bad things about EM:
  – can get stuck in local optima
  – can be slower than conjugate gradient (especially near convergence)
  – requires an expensive inference step
  – is a maximum likelihood/MAP method

64

© Eric Xing @ CMU, 2006-2011

Page 57

DIMENSIONALITY REDUCTION

65

Page 58

PCA Outline

• Dimensionality Reduction
  – High-dimensional data
  – Learning (low-dimensional) representations
• Principal Component Analysis (PCA)
  – Examples: 2D and 3D
  – Data for PCA
  – PCA Definition
  – Objective functions for PCA
  – PCA, Eigenvectors, and Eigenvalues
  – Algorithms for finding Eigenvectors / Eigenvalues
• PCA Examples
  – Face Recognition
  – Image Compression

66

Page 59

Big & High-Dimensional Data

• High Dimensions = Lots of Features

Document classification:
features per document = thousands of words/unigrams, millions of bigrams, contextual information

Surveys (Netflix):
480,189 users x 17,770 movies

Slide from Nina Balcan

Page 60

Big & High-Dimensional Data

• High Dimensions = Lots of Features

MEG Brain Imaging:
120 locations x 500 time points x 20 objects

Or any high-dimensional image data

Slide from Nina Balcan

Page 61

• Big & High-Dimensional Data.
• Useful to learn lower-dimensional representations of the data.

Slide from Nina Balcan

Page 62

Learning Representations

PCA, Kernel PCA, ICA: Powerful unsupervised learning techniques for extracting hidden (potentially lower-dimensional) structure from high-dimensional datasets.

Useful for:
• Visualization
• Further processing by machine learning algorithms
• More efficient use of resources (e.g., time, memory, communication)
• Statistical: fewer dimensions → better generalization
• Noise removal (improving data quality)

Slide from Nina Balcan

Page 63

PRINCIPAL COMPONENT ANALYSIS (PCA)

73

Page 64

PCA Outline

(Same outline as Page 58.)

74

Page 65

Principal Component Analysis (PCA)

In the case where the data lies on or near a low d-dimensional linear subspace, the axes of this subspace are an effective representation of the data.

Identifying these axes is known as Principal Components Analysis, and they can be obtained using classic matrix computation tools (Eigen Decomposition or Singular Value Decomposition).

Slide from Nina Balcan
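As a hedged sketch of the "classic matrix computation tools" route (my own minimal implementation, not from the slides): center the data, eigendecompose the sample covariance, and keep the top d eigenvectors as the principal axes. An SVD of the centered data matrix gives the same axes.

```python
import numpy as np

def pca(X, d):
    """Return the top-d principal axes and the projected (reduced) data.

    X: (N, M) data matrix; d: target dimensionality.
    """
    X_centered = X - X.mean(axis=0)                 # PCA operates on centered data
    cov = X_centered.T @ X_centered / (len(X) - 1)  # (M, M) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the covariance is symmetric
    order = np.argsort(eigvals)[::-1]               # sort eigenvalues descending
    axes = eigvecs[:, order[:d]]                    # principal axes as columns, (M, d)
    return axes, X_centered @ axes                  # projections, (N, d)
```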

Page 66

2D Gaussian dataset

Slide from Barnabas Poczos

Page 67

1st PCA axis

Slide from Barnabas Poczos

Page 68

2nd PCA axis

Slide from Barnabas Poczos
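To mirror the three figures above (a toy sketch with made-up parameters, not the original plots): sample a correlated 2D Gaussian and recover its 1st and 2nd PCA axes with the pca helper sketched after Page 65.

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated 2-D Gaussian dataset, in the spirit of the figure
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.5], [1.5, 1.0]], size=500)

axes, X_proj = pca(X, d=2)                 # pca() as sketched after Page 65
first_axis, second_axis = axes[:, 0], axes[:, 1]
print(first_axis, second_axis)             # orthogonal unit vectors
print(X_proj.var(axis=0, ddof=1))          # variance captured along each axis (1st >= 2nd)
```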

Page 69

Principal Component Analysis (PCA)

Whiteboard:
– Data for PCA
– PCA Definition

79