Lecture 3 – Oct 5, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022
Maximum Likelihood Estimation & Expectation Maximization
Outline
Probabilistic models in biology
Model selection problem
Mathematical foundations
Bayesian networks
Learning from data: maximum likelihood estimation, maximum a posteriori (MAP), expectation maximization (EM)
Parameter Estimation
Assumptions:
Fixed network structure
Fully observed instances of the network variables: D = {d[1], …, d[M]}, where each instance assigns a value to every variable, e.g., {i0, d1, g1, l0, s0} in the student network
Goal: maximum likelihood estimation (MLE) of the "parameters" of the Bayesian network
(figure from Koller & Friedman)
The Thumbtack example
Parameter learning for a single variable.
Variable X: the outcome of a thumbtack toss; Val(X) = {head, tail}
Data: a set of thumbtack tosses x[1], …, x[M]
Maximum likelihood estimation
Say that P(X=head) = Θ and P(X=tail) = 1−Θ. Then
P(HHTTHHH…<Mh heads, Mt tails>; Θ) = Θ^Mh (1−Θ)^Mt
Definition: the likelihood function L(Θ : D) = P(D; Θ)
Maximum likelihood estimation (MLE): given data D = HHTTHHH…<Mh heads, Mt tails>, find the Θ that maximizes the likelihood function L(Θ : D).
Likelihood function
[Figure: plot of L(Θ : D) = Θ^Mh (1−Θ)^Mt as a function of Θ]
MLE for the Thumbtack problem
Given data D = HHTTHHH…<Mh heads, Mt tails>, the MLE solution is Θ* = Mh / (Mh + Mt).
Proof: maximize the log-likelihood, as worked out below.
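Since log is monotone, maximizing L(Θ : D) = Θ^Mh (1−Θ)^Mt is equivalent to maximizing its log:

$$\log L(\Theta : D) = M_h \log \Theta + M_t \log(1-\Theta)$$

$$\frac{d}{d\Theta}\log L(\Theta : D) = \frac{M_h}{\Theta} - \frac{M_t}{1-\Theta} = 0 \;\Longrightarrow\; M_h(1-\Theta) = M_t\,\Theta \;\Longrightarrow\; \Theta^* = \frac{M_h}{M_h+M_t}$$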
Continuous space
Assuming sample x1, x2, …, xn is from a parametric distribution f(x | Θ), estimate Θ.
Say that the n samples are from a normal distribution with mean μ and variance σ2.
Continuous space (cont.)
Let Θ1 = μ, Θ2 = σ2. Then

$$L(\Theta_1, \Theta_2 : x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\Theta_2}} \exp\!\left(-\frac{(x_i-\Theta_1)^2}{2\Theta_2}\right)$$

$$\log L(\Theta_1, \Theta_2 : x_1, x_2, \ldots, x_n) = -\frac{n}{2}\log(2\pi\Theta_2) - \sum_{i=1}^{n}\frac{(x_i-\Theta_1)^2}{2\Theta_2}$$

Setting the partial derivatives to zero,

$$\frac{\partial}{\partial\Theta_1}\log L = 0 \;\Longrightarrow\; \hat{\Theta}_1 = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \frac{\partial}{\partial\Theta_2}\log L = 0 \;\Longrightarrow\; \hat{\Theta}_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\Theta}_1)^2$$
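As a quick check of these closed-form estimates, a minimal numpy sketch (the function name and sample data are ours, for illustration):

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form MLE for a normal sample: mean and (biased) variance."""
    x = np.asarray(x, dtype=float)
    theta1 = x.mean()                    # MLE of the mean mu
    theta2 = ((x - theta1) ** 2).mean()  # MLE of sigma^2 (divides by n, not n-1)
    return theta1, theta2

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=3.0, size=1000)
print(gaussian_mle(sample))  # roughly (2.0, 9.0)
```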
Any drawback? Is it biased?
Yes. As an extreme case, when n = 1, Θ̂2 = 0: the MLE systematically underestimates Θ2.
Why? A bit harder to see, but think about n = 2. Then Θ̂1 lies exactly between the two sample points, the position that exactly minimizes the expression for Θ̂2. Any other choices of Θ1, Θ2 would make the likelihood of the observed data slightly lower. But it is actually quite unlikely that two sample points would fall exactly equidistant from, and on opposite sides of, the true mean, so the MLE systematically underestimates Θ2. (Indeed, E[Θ̂2] = ((n−1)/n) Θ2, which is why the familiar unbiased sample variance divides by n−1 instead of n.)
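A small simulation (our illustration, not from the lecture) makes the bias visible:

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0
n = 2
# Draw many size-n samples and average the two variance estimators.
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))
print(samples.var(axis=1).mean())          # MLE (divides by n): about 2.0 = ((n-1)/n) * 4.0
print(samples.var(axis=1, ddof=1).mean())  # unbiased (divides by n-1): about 4.0
```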
Maximum a posteriori
Incorporating priors. How?
MLE vs. MAP estimation: MLE maximizes the likelihood P(D; Θ); MAP maximizes the posterior P(Θ | D) ∝ P(D | Θ) P(Θ), so the prior P(Θ) pulls the estimate toward a priori plausible values.
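For the thumbtack example, one standard choice (our illustration; the slide does not fix a particular prior) is a Beta(α, β) prior:

$$P(\Theta) \propto \Theta^{\alpha-1}(1-\Theta)^{\beta-1}, \qquad \Theta_{MAP} = \arg\max_{\Theta}\; \Theta^{M_h+\alpha-1}(1-\Theta)^{M_t+\beta-1} = \frac{M_h+\alpha-1}{M_h+M_t+\alpha+\beta-2}$$

With α = β = 1 (a uniform prior) this reduces to the MLE; larger α, β act like pseudo-counts of imagined prior tosses.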
MLE for general problems
Learning problem setting:
A set of random variables X from an unknown distribution P*
Training data D = M instances of X: {d[1], …, d[M]}
A parametric model P(X; Θ) (a 'legal' distribution)
Define the likelihood function: L(Θ : D) = ∏m P(d[m]; Θ)
Maximum likelihood estimation: choose parameters Θ* that satisfy Θ* = argmaxΘ L(Θ : D)
MLE for Bayesian networks
Structure G (figure): a network over x1, x2, x3, x4 with parents pa(x3) = {x1, x2} and pa(x4) = {x1, x3}.
Joint distribution: PG = P(x1, x2, x3, x4) = P(x1) P(x2) P(x3 | x1, x2) P(x4 | x1, x3). More generally, PG = ∏i P(xi | pai).
Parameters θ: Θx1, Θx2, Θx3|x1,x2, Θx4|x1,x3 (more generally, Θxi|pai).
Given D: x[1], …, x[m], …, x[M], where x[m] = (x1[m], x2[m], x3[m], x4[m]), estimate θ.
Likelihood decomposition: L(θ : D) = ∏m PG(x[m]; θ) = ∏i ∏m P(xi[m] | pai[m]; θ), i.e., a product of local likelihoods, one per variable.
The local likelihood function for Xi is Li(θxi|pai : D) = ∏m P(xi[m] | pai[m]; θxi|pai).
Bayesian network with table CPDs
The Thumbtack example vs. the Student example (figure: a single node X vs. the network Intelligence, Difficulty → Grade):
Joint distribution: P(X) vs. P(I, D, G) = P(I) P(D) P(G | I, D)
Parameters: θ vs. θI, θD, θG|I,D
Data: D: {H, …, x[m], …, T} vs. D: {(i[1], d[1], g[1]), …, (i[m], d[m], g[m]), …}
Likelihood function: L(θ : D) = P(D; θ) = θ^Mh (1−θ)^Mt vs. L(θ : D) = P(D; θ), which decomposes by variable as above
MLE solution: θ̂ = Mh / (Mh + Mt) vs. relative counts per CPD entry, e.g. θ̂g|i,d = M[i, d, g] / M[i, d]
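A minimal sketch (variable encodings are our own, for illustration) of MLE by counting for the student example's table CPDs:

```python
from collections import Counter

# Each observation is a dict like {"I": "i1", "D": "d0", "G": "g2"}.
def mle_table_cpds(data):
    """MLE for P(I), P(D), P(G|I,D) by relative-frequency counting."""
    n = len(data)
    count_i = Counter(d["I"] for d in data)
    count_d = Counter(d["D"] for d in data)
    count_idg = Counter((d["I"], d["D"], d["G"]) for d in data)
    count_id = Counter((d["I"], d["D"]) for d in data)

    p_i = {i: c / n for i, c in count_i.items()}
    p_d = {d_: c / n for d_, c in count_d.items()}
    # theta_g|i,d = M[i,d,g] / M[i,d]
    p_g_given_id = {(i, d_, g): c / count_id[(i, d_)]
                    for (i, d_, g), c in count_idg.items()}
    return p_i, p_d, p_g_given_id

data = [{"I": "i1", "D": "d0", "G": "g1"},
        {"I": "i0", "D": "d0", "G": "g2"},
        {"I": "i1", "D": "d0", "G": "g1"}]
print(mle_table_cpds(data))
```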
Maximum Likelihood Estimation: Review
Find parameter estimates that make the observed data most likely.
A general approach, applicable whenever a tractable likelihood function exists.
Can use all available information.
Example – Gene Expression
'Coding' regions: instructions for making the proteins.
'Regulatory' regions (regulons): instructions for when and where to make them.
What turns genes on (producing a protein) and off?
When is a gene turned on or off?
Where (in which cells) is a gene turned on?
How many copies of the gene product are produced?
Regulatory regions contain 'binding sites' (6–20 bp). Binding sites attract a special class of proteins known as 'transcription factors'. Bound transcription factors can initiate transcription (making RNA). Proteins that inhibit transcription can also be bound to their binding sites.
Regulation of Genes
[Four-panel figure (source: M. Tompa, U. of Washington): a gene on DNA with a nearby regulatory element (binding site, AC..TCG..A); a transcription factor (protein) binds the regulatory element; RNA polymerase (protein) is recruited and transcribes the gene; a new protein is produced.]
The Gene regulation example
What determines the expression level of a gene? What are the observed and hidden variables?
e.G and the e.TF's are observed; Process.G is a hidden variable we want to infer!
[Figure: a network relating Process.G (the biological process the gene is involved in, e.g. = p1, p2, p3) and e.TF1, e.TF2, e.TF3, e.TF4, …, e.TFN (the expression levels of the transcription factors) to e.G (the expression level of the gene).]
The Gene regulation example (cont.)
What determines the expression level of a gene? What are the observed and hidden variables?
e.G and the e.TF's are observed; Process.G is a hidden variable we want to infer! How about the BS.G's? How deterministic is the sequence of a binding site? How much do we know?
[Figure: the same network extended with BS1.G, …, BSN.G (e.g. = Yes), where BSi.G indicates whether the gene has TFi's binding site.]
Not all data are perfect
Most MLE problems are simple to solve with complete data.
In practice, however, the available data are often 'incomplete' in some way.
Outline
Learning from data: maximum likelihood estimation (MLE), maximum a posteriori (MAP), expectation-maximization (EM) algorithm
Continuous space revisited
Assume sample x1, x2, …, xn is from a mixture of parametric distributions.
[Figure: observations plotted along the x axis, with x1, x2, …, xm clustered in one region and xm+1, …, xn in another.]
A real example: CpG content of human gene promoters
"A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters," Saxonov, Berg, and Brutlag, PNAS 2006;103:1412–1417.
[Figure: distribution of GC frequency across promoters, showing two distinct classes.]
Mixture of Gaussians
Parameters θ: means μ1, μ2; variances σ1², σ2²; mixing parameters τ1, τ2 = 1 − τ1.
P.D.F.: f(x) = τ1 N(x; μ1, σ1²) + τ2 N(x; μ2, σ2²)
Likelihood:

$$L(\tau_1, \tau_2, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2 : x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \left[ \tau_1\, N(x_i; \mu_1, \sigma_1^2) + \tau_2\, N(x_i; \mu_2, \sigma_2^2) \right]$$
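A concrete rendering of this density and log-likelihood (a sketch; function and parameter names follow the notation above):

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, mu1, var1, mu2, var2, tau1):
    """Two-component Gaussian mixture density."""
    tau2 = 1.0 - tau1
    return (tau1 * norm.pdf(x, mu1, np.sqrt(var1))
            + tau2 * norm.pdf(x, mu2, np.sqrt(var2)))

def log_likelihood(xs, *params):
    """Sum of log densities over the sample."""
    return np.sum(np.log(mixture_pdf(np.asarray(xs), *params)))

print(log_likelihood([0.1, 0.2, 3.9, 4.1], 0.0, 1.0, 4.0, 1.0, 0.5))
```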
Apply MLE?
No closed-form solution is known for finding the θ maximizing L(τ1, τ2, μ1, μ2, σ1², σ2² : x1, …, xn).
However, what if we knew the hidden data, i.e., which component each xi was drawn from?
EM as Chicken vs. Egg
Let zij indicate whether point xi came from component j.
IF the zij were known, we could estimate the parameters θ; e.g., only points in cluster 2 influence μ2, σ2.
IF the parameters θ were known, we could estimate the zij; e.g., if |xi − μ1|/σ1 << |xi − μ2|/σ2, then zi1 >> zi2.
BUT we know neither; (optimistically) iterate:
E-step: calculate the expected zij, given the parameters.
M-step: do 'MLE' for the parameters (μ, σ), given the E(zij).
Overall, a clever 'hill-climbing' strategy.
Convergence provable? YES.
"Classification EM"
If zij < 0.5, pretend it's 0; if zij > 0.5, pretend it's 1; i.e., classify each point as belonging to component 0 or 1.
Now recalculate θ, assuming that partition.
Then recalculate the zij, assuming that θ.
Then recalculate θ, assuming the new zij, etc., etc.
(This hard-assignment variant is closely related to k-means clustering.)
Full EM
The xi's are known; Θ is unknown. The goal is to find the MLE Θ of
L(Θ : x1, …, xn) (the hidden-data likelihood).
This would be easy if the zij's were known, i.e., consider
L(Θ : x1, …, xn, z11, z12, …, zn2) (the complete-data likelihood).
But the zij's are not known. Instead, maximize the expected likelihood of the observed data,
E[ L(Θ : x1, …, xn, z11, z12, …, zn2) ],
where the expectation is over the distribution of the hidden data (the zij's).
The E-step
Find E(zij), i.e., P(zij = 1). Assume θ is known and fixed.
Let A: the event that xi was drawn from f1
B: the event that xi was drawn from f2
D: the observed data xi
Then the expected value of zi1 is P(A|D), and by Bayes' rule

$$P(A|D) = \frac{P(D|A)\,P(A)}{P(D|A)\,P(A) + P(D|B)\,P(B)} = \frac{\tau_1 f_1(x_i)}{\tau_1 f_1(x_i) + \tau_2 f_2(x_i)}$$
Complete data likelihood
Recall: P(xi | zi) = f1(xi; θ) if zi1 = 1, and f2(xi; θ) if zi2 = 1.
So, correspondingly, each point contributes f1 or f2 to the complete-data likelihood according to its hidden label, weighted by that label's prior τ1 or τ2.
Formulas with "if"s are messy; can we blend more smoothly? Yes: since zij ∈ {0, 1},

$$P(x_i, z_i; \theta) = \prod_j \left[\tau_j f_j(x_i; \theta)\right]^{z_{ij}}$$

where the exponent zij picks out exactly the term for the component that generated xi.
M-step
Find the θ maximizing E[ log(Likelihood) ]. For the Gaussian mixture this gives closed-form weighted-average updates (see the sketch below):

$$\mu_j = \frac{\sum_i E[z_{ij}]\, x_i}{\sum_i E[z_{ij}]}, \qquad \sigma_j^2 = \frac{\sum_i E[z_{ij}]\,(x_i-\mu_j)^2}{\sum_i E[z_{ij}]}, \qquad \tau_j = \frac{1}{n}\sum_i E[z_{ij}]$$
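Putting the E- and M-steps together, a minimal self-contained sketch of EM for a two-component Gaussian mixture (our illustration of the procedure above; the initialization and fixed iteration count are simplistic assumptions):

```python
import numpy as np
from scipy.stats import norm

def em_gmm2(x, iters=100):
    """EM for a 2-component 1-D Gaussian mixture; returns (mu, var, tau)."""
    x = np.asarray(x, dtype=float)
    # Crude initialization: extreme points as means, equal mixing.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    tau = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: E[z_ij] = P(x_i came from component j | x_i, theta)
        dens = tau * norm.pdf(x[:, None], mu, np.sqrt(var))   # shape (n, 2)
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE given the expected assignments
        nj = z.sum(axis=0)
        mu = (z * x[:, None]).sum(axis=0) / nj
        var = (z * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        tau = nj / len(x)
    return mu, var, tau

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 700)])
print(em_gmm2(x))  # roughly mu=(0, 5), var=(1, 1), tau=(0.3, 0.7)
```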
EM summary
Fundamentally an MLE problem.
Useful when the analysis would be more tractable if the 0/1 hidden data z were known.
Iterate:
E-step: estimate E(z) for each z, given θ.
M-step: estimate the θ maximizing E[log likelihood] given E(z), where "E(log L)" is taken with respect to the random z ~ E(z) = p(z = 1).
EM Issues
EM is guaranteed to increase the likelihood with every E-M iteration, hence will converge.
But it may converge to a local, not global, maximum.
This issue is (probably) intrinsic, since EM is often applied to NP-hard problems (including clustering, above, and motif discovery, soon).
Nevertheless, EM is widely used and often effective.
Acknowledgement
Profs. Daphne Koller & Nir Friedman, "Probabilistic Graphical Models"
Prof. Larry Ruzzo, CSE 527, Autumn 2009
Prof. Andrew Ng, ML lecture notes