CS 6347 Lecture 18 Expectation Maximization


Page 1: CS 6347 Lecture 18 - University of Texas at Dallas

CS 6347

Lecture 18

Expectation Maximization

Page 2: CS 6347 Lecture 18 - University of Texas at Dallas

Unobserved Variables

• Latent or hidden variables in the model are never observed

– We may or may not be interested in their values, but their existence is crucial to the model

• Some observations in a particular sample may be missing

– Missing information on surveys or medical records (quite common)

– We may need to model how the variables are missing

Page 3: CS 6347 Lecture 18 - University of Texas at Dallas

Hidden Markov Models

𝑝 π‘₯1, … , π‘₯𝑇 , 𝑦1, … , 𝑦𝑇 = 𝑝 𝑦1 𝑝 π‘₯1 𝑦1

𝑑

𝑝 𝑦𝑑 π‘¦π‘‘βˆ’1 𝑝(π‘₯𝑑|𝑦𝑑)

• The $X$'s are observed variables; the $Y$'s are latent

• Example: the $X$ variables correspond to the sizes of tree growth rings in each year, and the $Y$ variables correspond to that year's average temperature

π‘Œ1 π‘Œ2 π‘Œπ‘‡βˆ’1 π‘Œπ‘‡...

𝑋1 𝑋2 π‘‹π‘‡βˆ’1 𝑋𝑇...
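As a quick illustration of this factorization (not part of the original slides), here is a minimal Python sketch that evaluates the HMM joint probability for one state/observation sequence; the initial, transition, and emission tables are hypothetical toy numbers.

```python
import numpy as np

# Hypothetical toy HMM: 2 hidden states, 2 observation symbols.
pi = np.array([0.6, 0.4])                 # p(y_1)
A = np.array([[0.7, 0.3],                 # p(y_t | y_{t-1})
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],                 # p(x_t | y_t)
              [0.3, 0.7]])

def hmm_joint(xs, ys):
    """p(x_1..T, y_1..T) = p(y_1) p(x_1|y_1) * prod_{t>=2} p(y_t|y_{t-1}) p(x_t|y_t)."""
    prob = pi[ys[0]] * B[ys[0], xs[0]]
    for t in range(1, len(xs)):
        prob *= A[ys[t - 1], ys[t]] * B[ys[t], xs[t]]
    return prob

print(hmm_joint(xs=[0, 1, 1], ys=[0, 0, 1]))  # joint probability of one short sequence
```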

Page 4: CS 6347 Lecture 18 - University of Texas at Dallas

Missing Data

• Data can be missing from the model in many different ways

– Missing completely at random: the probability that a data item is missing is independent of the observed data and the other missing data

– Missing at random: the probability that a data item is missing can depend on the observed data (but not on the missing data)

– Missing not at random: the probability that a data item is missing can depend on the observed data and the other missing data

Page 5: CS 6347 Lecture 18 - University of Texas at Dallas

Handling Missing Data

• Discard all incomplete observations

– Can introduce bias

• Imputation: actual values are substituted for missing values so that all of the data is fully observed

– E.g., find the most probable assignments for the missing data and substitute them in (not possible if the model is unknown)

– Use the sample mean/mode (see the sketch after this list)

• Explicitly model the missing data

– For example, could expand the state space

– The most sensible solution, but may be non-trivial if we don't know how/why the data is missing
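As a concrete (and deliberately crude) example of the mean-imputation option above, here is a minimal sketch; the helper name `impute_mean` and the toy matrix are hypothetical, not from the lecture.

```python
import numpy as np

def impute_mean(X):
    """Replace each NaN entry with the observed mean of its column.

    A simple baseline that ignores the model; it can distort the joint
    distribution, which is why the slides favor explicitly modelling the
    missing data instead.
    """
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)          # means over observed entries only
    rows, cols = np.nonzero(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

data = [[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]]
print(impute_mean(data))
```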


Page 6: CS 6347 Lecture 18 - University of Texas at Dallas

Modelling Missing Data

• Add an additional binary variable $m_i$ to the model for each possible observed variable $x_i$ that indicates whether or not that variable is observed

$$p(x_{obs}, x_{mis}, m) = p(m \mid x_{obs}, x_{mis})\, p(x_{obs}, x_{mis})$$

Page 7: CS 6347 Lecture 18 - University of Texas at Dallas

Modelling Missing Data

• Add an additional binary variable $m_i$ to the model for each possible observed variable $x_i$ that indicates whether or not that variable is observed

$$p(x_{obs}, x_{mis}, m) = p(m \mid x_{obs}, x_{mis})\, p(x_{obs}, x_{mis})$$

Explicit model of the missing data (missing not at random)

Page 8: CS 6347 Lecture 18 - University of Texas at Dallas

Modelling Missing Data

• Add an additional binary variable $m_i$ to the model for each possible observed variable $x_i$ that indicates whether or not that variable is observed

$$p(x_{obs}, x_{mis}, m) = p(m \mid x_{obs})\, p(x_{obs}, x_{mis})$$

Missing at random

Page 9: CS 6347 Lecture 18 - University of Texas at Dallas

Modelling Missing Data

• Add an additional binary variable $m_i$ to the model for each possible observed variable $x_i$ that indicates whether or not that variable is observed

$$p(x_{obs}, x_{mis}, m) = p(m)\, p(x_{obs}, x_{mis})$$

Missing completely at random

Page 10: CS 6347 Lecture 18 - University of Texas at Dallas

Modelling Missing Data

• Add an additional binary variable $m_i$ to the model for each possible observed variable $x_i$ that indicates whether or not that variable is observed

$$p(x_{obs}, x_{mis}, m) = p(m)\, p(x_{obs}, x_{mis})$$

Missing completely at random
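To make the three mechanisms concrete, the following hedged sketch (toy logistic missingness models, not from the slides) draws the missingness indicator $m$ for a single variable $x_2$ under each assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)   # toy fully observed data

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# MCAR: p(m) ignores the data entirely.
m_mcar = rng.random(1000) < 0.2

# MAR: p(m | x_obs) may depend only on the always-observed x1.
m_mar = rng.random(1000) < sigmoid(2.0 * x1)

# MNAR: p(m | x_obs, x_mis) may also depend on the value x2 that goes missing.
m_mnar = rng.random(1000) < sigmoid(2.0 * x2)

x2_mar = np.where(m_mar, np.nan, x2)      # the data set actually handed to the learner
print(np.isnan(x2_mar).mean())            # overall missing rate under MAR
```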

How can you model latent variables in this framework?

Page 11: CS 6347 Lecture 18 - University of Texas at Dallas

Learning with Missing Data

• In order to design learning algorithms for models with missing data, we will make two assumptions

– The data is missing at random

– The model parameters corresponding to the missing data ($\delta$) are separate from the model parameters of the observed data ($\theta$)

• That is,

$$p(x_{obs}, m \mid \theta, \delta) = p(m \mid x_{obs}, \delta)\, p(x_{obs} \mid \theta)$$

Page 12: CS 6347 Lecture 18 - University of Texas at Dallas

Learning with Missing Data

𝑝 π‘₯π‘œπ‘π‘  , π‘š|πœƒ, 𝛿 = 𝑝 π‘š π‘₯π‘œπ‘π‘  , 𝛿 𝑝 π‘₯π‘œπ‘π‘  πœƒ

β€’ Under the previous assumptions, the log-likelihood of samples

π‘₯1, π‘š1 , … , (π‘₯𝐾 , π‘šπΎ) is equal to

𝑙 πœƒ, 𝛿 =

π‘˜=1

𝐾

log 𝑝(π‘šπ‘˜|π‘₯π‘œπ‘π‘ π‘˜ , 𝛿) +

π‘˜=1

𝐾

log

π‘₯π‘šπ‘–π‘ π‘˜

𝑝(π‘₯π‘œπ‘π‘ π‘˜π‘˜ , π‘₯π‘šπ‘–π‘ π‘˜|πœƒ)

12

Page 13: CS 6347 Lecture 18 - University of Texas at Dallas

Learning with Missing Data

𝑝 π‘₯π‘œπ‘π‘  , π‘š|πœƒ, 𝛿 = 𝑝 π‘š π‘₯π‘œπ‘π‘  , 𝛿 𝑝 π‘₯π‘œπ‘π‘  πœƒ

β€’ Under the previous assumptions, the log-likelihood of samples

π‘₯1, π‘š1 , … , (π‘₯𝐾 , π‘šπΎ) is equal to

𝑙 πœƒ, 𝛿 =

π‘˜=1

𝐾

log 𝑝(π‘šπ‘˜|π‘₯π‘œπ‘π‘ π‘˜ , 𝛿) +

π‘˜=1

𝐾

log

π‘₯π‘šπ‘–π‘ π‘˜

𝑝(π‘₯π‘œπ‘π‘ π‘˜π‘˜ , π‘₯π‘šπ‘–π‘ π‘˜|πœƒ)

13

Separable in πœƒ and 𝛿, so if we don’t care about 𝛿, then we

only have to maximize the second term over πœƒ

Page 14: CS 6347 Lecture 18 - University of Texas at Dallas

Learning with Missing Data

𝑙 πœƒ =

π‘˜=1

𝐾

log

π‘₯π‘šπ‘–π‘ π‘˜

𝑝(π‘₯π‘œπ‘π‘ π‘˜π‘˜ , π‘₯π‘šπ‘–π‘ π‘˜|πœƒ)

β€’ This is NOT a concave function of πœƒ

– In the worst case, could have a different local maximum for each

possible value of the missing data

– No longer have a closed form solution, even in the case of

Bayesian networks

14

Page 15: CS 6347 Lecture 18 - University of Texas at Dallas

Expectation Maximization

• The expectation-maximization (EM) algorithm is a method to find a local maximum or a saddle point of the log-likelihood with missing data

• Basic idea:

$$l(\theta) = \sum_{k=1}^{K} \log \sum_{x_{mis}^k} p(x_{obs}^k, x_{mis}^k \mid \theta)$$

$$= \sum_{k=1}^{K} \log \sum_{x_{mis}^k} q_k(x_{mis}^k) \cdot \frac{p(x_{obs}^k, x_{mis}^k \mid \theta)}{q_k(x_{mis}^k)}$$

$$\ge \sum_{k=1}^{K} \sum_{x_{mis}^k} q_k(x_{mis}^k) \log \frac{p(x_{obs}^k, x_{mis}^k \mid \theta)}{q_k(x_{mis}^k)}$$

where the last step is Jensen's inequality, which holds for any choice of distributions $q_k$ over $x_{mis}^k$

Page 16: CS 6347 Lecture 18 - University of Texas at Dallas

Expectation Maximization

𝐹 π‘ž, πœƒ ≑

π‘˜=1

𝐾

π‘₯π‘šπ‘–π‘ π‘˜

π‘žπ‘˜ π‘₯π‘šπ‘–π‘ π‘˜ log𝑝 π‘₯π‘œπ‘π‘ π‘˜π‘˜ , π‘₯π‘šπ‘–π‘ π‘˜ πœƒ

π‘žπ‘˜ π‘₯π‘šπ‘–π‘ π‘˜

• Maximizing $F$ is equivalent to maximizing the log-likelihood

• Could maximize it using coordinate ascent

π‘žπ‘‘+1 = arg maxπ‘ž1,…,π‘žπΎπΉ(π‘ž, πœƒπ‘‘)

πœƒπ‘‘+1 = argmaxπœƒπΉ(π‘žπ‘‘+1, πœƒ)


Page 17: CS 6347 Lecture 18 - University of Texas at Dallas

Expectation Maximization

π‘₯π‘šπ‘–π‘ π‘˜

π‘žπ‘˜ π‘₯π‘šπ‘–π‘ π‘˜ log𝑝 π‘₯π‘œπ‘π‘ π‘˜π‘˜ , π‘₯π‘šπ‘–π‘ π‘˜ πœƒ

π‘žπ‘˜ π‘₯π‘šπ‘–π‘ π‘˜

β€’ This is just βˆ’π‘‘ π‘žπ‘˜||𝑝 π‘₯π‘œπ‘π‘ π‘˜π‘˜ ,β‹… πœƒ

β€’ Maximized when π‘žπ‘˜ π‘₯π‘šπ‘–π‘ π‘˜ = 𝑝(π‘₯π‘šπ‘–π‘ π‘˜|π‘₯π‘œπ‘π‘ π‘˜π‘˜ , πœƒ)

• Can reformulate the EM algorithm as

$$\theta^{t+1} = \arg\max_{\theta} \sum_{k=1}^{K} \sum_{x_{mis}^k} p(x_{mis}^k \mid x_{obs}^k, \theta^t) \log p(x_{obs}^k, x_{mis}^k \mid \theta)$$

Page 18: CS 6347 Lecture 18 - University of Texas at Dallas

An Example: Bayesian Networks

• Recall that MLE for Bayesian networks without latent variables yielded

$$\theta_{x_i \mid x_{parents(i)}} = \frac{N_{x_i,\, x_{parents(i)}}}{\sum_{x_i'} N_{x_i',\, x_{parents(i)}}}$$

• Let's suppose that we are given observations from a Bayesian network in which one of the variables is hidden

– What do the iterations of the EM algorithm look like?

(on board)
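Since the worked example is done on the board, here is one hedged sketch of what those iterations can look like in the simplest case: a network $Z \to X_1$, $Z \to X_2$ with all variables binary and $Z$ never observed. The data, initialization, and variable names are hypothetical; the E-step computes $p(z \mid x_1^k, x_2^k, \theta^t)$ and the M-step reuses the counting formula above with posterior-weighted (expected) counts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground truth for Z -> X1, Z -> X2 (all binary); only X1, X2 are observed.
true_pz, true_px1, true_px2 = 0.3, np.array([0.2, 0.9]), np.array([0.7, 0.1])
z = (rng.random(1000) < true_pz).astype(int)
x1 = (rng.random(1000) < true_px1[z]).astype(int)
x2 = (rng.random(1000) < true_px2[z]).astype(int)

def bern(p, x):
    """p(X = x) for X ~ Bernoulli(p)."""
    return np.where(x == 1, p, 1.0 - p)

# Arbitrary initialization of theta = (p(Z=1), p(X1=1|Z), p(X2=1|Z)).
pz, px1, px2 = 0.5, np.array([0.3, 0.6]), np.array([0.6, 0.3])

for _ in range(100):
    # E-step: posterior responsibility p(Z = 1 | x1, x2, theta^t) by Bayes' rule.
    lik1 = pz * bern(px1[1], x1) * bern(px2[1], x2)
    lik0 = (1.0 - pz) * bern(px1[0], x1) * bern(px2[0], x2)
    resp = lik1 / (lik1 + lik0)

    # M-step: the counting MLE for each CPT, with the counts N_{x_i, x_parents(i)}
    # replaced by posterior-weighted (expected) counts.
    pz = resp.mean()
    w = np.array([np.sum(1.0 - resp), np.sum(resp)])        # expected counts of Z = 0, 1
    px1 = np.array([np.sum((1.0 - resp) * x1), np.sum(resp * x1)]) / w
    px2 = np.array([np.sum((1.0 - resp) * x2), np.sum(resp * x2)]) / w

print(pz, px1, px2)   # a local maximum of the marginal likelihood; labels may be swapped
```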
