
Probabilistic Graphical Models: Parameter Estimation

Tomer Galanti

December 14, 2015


Overview

1. Motivation

2. Maximum likelihood estimation

3. Bayesian parameter estimation

4. Generalization analysis (Bonus?)



What did we have so far?

1. Representation: how do we model the problem? (directed/undirected)

2. Inference: given a model and partially observed data, how can we recover the rest of the data? (variable elimination, MCMC, Gibbs sampling, etc.)



Motivation

A complete discussion of graphical models includes:

Representation: the dependency structure between the variables.

Inference: the ability to answer queries given observations within the model.

Parameters: specifying the actual probabilities.



Why is estimating the parameters of a graphical model important to us?

This problem arises very often in practice, since numerical parameters are harder to elicit from human experts than the structure is.



1. Consider a Bayesian network.

2. The network's structure is fixed.

3. We assume a data set $D = \{\zeta[1], \ldots, \zeta[M]\}$ of fully observed network variables.

4. How do we estimate the parameters of the network?



Solutions

We will take two different approaches:

1. One is based on maximum likelihood estimation.

2. The other is based on a Bayesian perspective.




Maximum likelihood: estimating a normal distribution

Before we give a formal definition of this method, let's start with a few simple examples.

Assume the points $D = \{x_1, \ldots, x_M\}$ are i.i.d. samples from a 1D Gaussian distribution $\mathcal{N}(\mu, 1)$.

How do we select the $\mu$ that best fits the data?



A reasonable approach: select the $\mu$ that maximizes the probability of sampling $D$ from $\mathcal{N}(\mu, 1)$.

Formally, solve the following program:

$$\mu^* = \arg\max_\mu P[D; \mu]$$



By i.i.d.-ness: $P[D; \mu] = \prod_{i=1}^M P[x_i; \mu]$.

In addition, $x_i \sim \mathcal{N}(\mu, 1)$ and therefore,

$$\mu^* = \arg\max_\mu \prod_{i=1}^M \frac{1}{\sqrt{2\pi}} \exp\left(-(x_i - \mu)^2/2\right)$$

Taking the log of the inner argument yields the same solution, since log is a strictly increasing function:

$$\mu^* = \arg\max_\mu \log\left[\prod_{i=1}^M \frac{1}{\sqrt{2\pi}} \exp\left(-(x_i - \mu)^2/2\right)\right]$$



Equivalently,

$$\mu^* = \arg\max_\mu \sum_{i=1}^M -(x_i - \mu)^2/2$$

Differentiate and hope for the best:

$$\sum_{i=1}^M (x_i - \mu) = 0 \implies \mu^* = \frac{1}{M} \sum_{i=1}^M x_i$$

The second derivative is $-M < 0$, and therefore this is indeed the maximum of this function.
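As a quick sanity check, here is a minimal Python sketch (not from the original slides; the data set and seed are made up for illustration) that maximizes the log-likelihood numerically over a grid and compares the result to the closed-form sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.5, scale=1.0, size=1000)  # hypothetical i.i.d. N(2.5, 1) data

def log_likelihood(mu, xs):
    # Sum of log N(x; mu, 1) densities, dropping the additive constant.
    return -0.5 * np.sum((xs - mu) ** 2)

grid = np.linspace(0.0, 5.0, 10001)
mu_grid = grid[np.argmax([log_likelihood(mu, D) for mu in grid])]

print(mu_grid, D.mean())  # both close to 2.5: the grid argmax matches the sample mean
```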



Maximum likelihood: estimating a multinomial distribution

What about coin flips? This will be our running example.

Provided with a data set $D = \{x_1, \ldots, x_M\}$ of i.i.d. samples $x_i \sim \mathrm{Bernoulli}(\theta)$, we want to estimate $\theta$.

Here head $= 1$ and tail $= 0$.



Again we maximize the probability of sampling the data:

$$P[D; \theta] = \theta^{M[head]} \cdot (1 - \theta)^{M[tail]}$$

where $M[head]$ is the number of heads in $D$ and $M[tail]$ is the number of tails in $D$.



Taking the log, we have:

$$M[head] \log\theta + M[tail] \log(1 - \theta)$$

which is maximized by:

$$\hat\theta = \frac{M[head]}{M[head] + M[tail]}$$

Reasonable, right?
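A minimal sketch of this estimator (with assumed synthetic data, not from the slides): the MLE is just the empirical frequency of heads.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.3
D = rng.random(1000) < theta_true        # boolean samples: True = head

M_head = int(D.sum())
M_tail = len(D) - M_head
theta_hat = M_head / (M_head + M_tail)   # the closed-form MLE above

print(theta_hat)                         # close to 0.3 for large M
```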



Maximum likelihood

This is just a special case of a much more general concept.

Definition (Maximum likelihood)

For a data set $D = \{\zeta[1], \ldots, \zeta[M]\}$ and a family of distributions $P[\cdot; \theta]$, the likelihood of $D$ for a given choice of the parameters $\theta$ is

$$L(\theta : D) = \prod_m P[\zeta[m]; \theta]$$

The maximum likelihood estimator (MLE) returns the $\theta$ that maximizes this quantity.

In many cases, we apply the "taking log" trick to simplify the problem.



Maximum likelihood: Bayesian networks

So far, we designed a rule for selecting the parameters of a statistical model.

It is very interesting to see how it works out in the case of Bayesian networks!



Let's start with the simplest non-trivial network.

Consider a graph of two Boolean variables $X \to Y$.

In this case, the parameterization $\theta$ consists of 6 parameters:

$\theta_{x^0} := P[X = 0]$ and $\theta_{x^1} := P[X = 1]$.

$\theta_{y^0|x^0} := P[Y = 0 \mid X = 0]$ and $\theta_{y^1|x^0} := P[Y = 1 \mid X = 0]$.

$\theta_{y^0|x^1} := P[Y = 0 \mid X = 1]$ and $\theta_{y^1|x^1} := P[Y = 1 \mid X = 1]$.



Given a data set $D = \{(x[m], y[m])\}_{m=1}^M$, the likelihood becomes:

$$\begin{aligned}
L(\theta : D) &= \prod_m P[(x[m], y[m]); \theta] \\
&= \prod_m P[x[m]; \theta] \cdot P[y[m] \mid x[m]; \theta] \\
&= \prod_m P[x[m]; \theta] \prod_m P[y[m] \mid x[m]; \theta] \\
&= \theta_{x^0}^{M[0]} \cdot \theta_{x^1}^{M[1]} \prod_m P[y[m] \mid x[m]; \theta]
\end{aligned}$$

where $M[x]$ counts the number of samples such that $x[m] = x$.



It is left to represent the other product:

$$\begin{aligned}
\prod_m P[y[m] \mid x[m]; \theta] &= \prod_m P[y[m] \mid x[m]; \theta_{Y|X}] \\
&= \prod_{m : x[m]=0} P[y[m] \mid x[m]; \theta_{Y|X}] \cdot \prod_{m : x[m]=1} P[y[m] \mid x[m]; \theta_{Y|X}] \\
&= \prod_{m : x[m]=0} P[y[m] \mid x[m]; \theta_{Y|X=0}] \cdot \prod_{m : x[m]=1} P[y[m] \mid x[m]; \theta_{Y|X=1}] \\
&= \theta_{y^0|x^0}^{M[0,0]} \cdot \theta_{y^1|x^0}^{M[0,1]} \cdot \theta_{y^0|x^1}^{M[1,0]} \cdot \theta_{y^1|x^1}^{M[1,1]}
\end{aligned}$$

where $M[x, y]$ counts the number of samples with $(x[m], y[m]) = (x, y)$.



Finally, the likelihood decomposes very nicely:

$$L(\theta : D) = \theta_{x^0}^{M[0]} \cdot \theta_{x^1}^{M[1]} \cdot \theta_{y^0|x^0}^{M[0,0]} \cdot \theta_{y^1|x^0}^{M[0,1]} \cdot \theta_{y^0|x^1}^{M[1,0]} \cdot \theta_{y^1|x^1}^{M[1,1]}$$

We have three sets of separable terms: $\theta_{x^0}^{M[0]} \cdot \theta_{x^1}^{M[1]}$, $\theta_{y^0|x^0}^{M[0,0]} \cdot \theta_{y^1|x^0}^{M[0,1]}$, and $\theta_{y^0|x^1}^{M[1,0]} \cdot \theta_{y^1|x^1}^{M[1,1]}$.

Therefore, we can maximize each one separately.



By the same analysis used for the coin flip example, we arrive at the following conclusions (by log-likelihood maximization):

1. $\hat\theta_{x^0} = \frac{M[0]}{M[0] + M[1]}$ and $\hat\theta_{x^1} = 1 - \hat\theta_{x^0}$.

2. $\hat\theta_{y^0|x^0} = \frac{M[0,0]}{M[0,0] + M[0,1]}$ and $\hat\theta_{y^1|x^0} = 1 - \hat\theta_{y^0|x^0}$.

3. $\hat\theta_{y^0|x^1} = \frac{M[1,0]}{M[1,0] + M[1,1]}$ and $\hat\theta_{y^1|x^1} = 1 - \hat\theta_{y^0|x^1}$.

Intuitive, right?
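Here is a short counting sketch of these estimates (the tiny data set is invented for illustration, not from the slides):

```python
from collections import Counter

# Synthetic samples (x[m], y[m]) from the network X -> Y
D = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

M_x = Counter(x for x, _ in D)   # M[x]
M_xy = Counter(D)                # M[x, y]

theta_x0 = M_x[0] / (M_x[0] + M_x[1])
theta_y0_x0 = M_xy[(0, 0)] / (M_xy[(0, 0)] + M_xy[(0, 1)])
theta_y0_x1 = M_xy[(1, 0)] / (M_xy[(1, 0)] + M_xy[(1, 1)])

print(theta_x0, theta_y0_x0, theta_y0_x1)  # 0.5, 0.75, 0.25
```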



What about the general case?

Actually, there is not much difference...



Consider a Bayesian network $G$.

For each variable $X_i$ we have a set of parameters $\theta_{X_i | Pa_{X_i}}$ specifying the probabilities of its values given its parents (i.e., its CPD).

The total set of parameters is $\theta = \cup_i\, \theta_{X_i | Pa_{X_i}}$.



The likelihood decomposes into local likelihoods:

$$\begin{aligned}
L(\theta : D) &= \prod_m P[\zeta[m]; \theta] \\
&= \prod_m \prod_i P[X_i[m] \mid Pa_{X_i}[m]; \theta] \\
&= \prod_i \prod_m P[X_i[m] \mid Pa_{X_i}[m]; \theta] \\
&= \prod_i L_i(\theta_{X_i | Pa_{X_i}} : D)
\end{aligned}$$

Each $L_i(\theta_{X_i | Pa_{X_i}} : D) = \prod_m P[X_i[m] \mid Pa_{X_i}[m]; \theta]$ is the local likelihood of $X_i$.



The local likelihoods are parameterized by disjoint sets of parameters. Therefore, maximizing the total likelihood is equivalent to maximizing each local likelihood separately.



Here is how to maximize a local likelihood.

Let $X$ be a variable and $U$ its parents. We have a parameter $\theta_{x|u}$ for each combination $x \in Val(X)$ and $u \in Val(U)$.

$$L_X(\theta_{X|U} : D) = \prod_m \theta_{x[m]|u[m]} = \prod_{u \in Val(U)} \prod_{x \in Val(X)} \theta_{x|u}^{M[u,x]}$$

where $M[u, x]$ is the number of times $x[m] = x$ and $u[m] = u$.



For each $u$ we maximize $\prod_{x \in Val(X)} \theta_{x|u}^{M[u,x]}$ separately and obtain:

$$\hat\theta_{x|u} := \frac{M[u,x]}{M[u]} = \text{the fraction of examples with } X = x \text{ among all examples that satisfy } U = u \text{ in the data set}$$

where $M[u] := \sum_x M[u,x]$.
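A generic counting sketch of this closed form (the helper name and the toy samples are hypothetical, not from the slides); parent assignments $u$ may be tuples:

```python
from collections import Counter

def mle_cpd(samples):
    # samples: a list of (u, x) pairs, where u is the parent assignment.
    # Returns {u: {x: M[u, x] / M[u]}}, the MLE CPD entries.
    M_ux = Counter(samples)
    M_u = Counter(u for u, _ in samples)
    cpd = {}
    for (u, x), count in M_ux.items():
        cpd.setdefault(u, {})[x] = count / M_u[u]
    return cpd

# Example: a variable X with values 'a'/'b' and one binary parent U
print(mle_cpd([(0, 'a'), (0, 'b'), (0, 'a'), (1, 'b'), (1, 'b')]))
# {0: {'a': 0.666..., 'b': 0.333...}, 1: {'b': 1.0}}
```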



Maximum likelihood estimation as M-Projection

The MLE principle gives a recipe for constructing estimators for different statistical models (for example, multinomials and Gaussians). As we have seen, for simple examples the resulting estimators are quite intuitive.

However, the same principle can be applied to a much broader range of parametric models.



Recall the notion of projection: finding the distribution, within a specified class, that is closest to a given target distribution.

Parameter estimation is similar in the sense that we select a distribution from a given class that is closest to our data.

As we show next, the MLE aims to find the distribution that is closest to the empirical distribution $\hat{P}_D$.



Theorem

The MLE $\hat\theta$ in a parametric family of distributions relative to a data set $D$ is the M-projection of $\hat{P}_D$ onto the parametric family:

$$\hat\theta = \arg\min_{\theta \in \Theta} KL(\hat{P}_D \,\|\, P_\theta)$$

Here, $\hat{P}_D(x) = \frac{|\{z \in D : z = x\}|}{|D|}$ and $KL(p \| q) = \mathbb{E}_{x \sim p}[\log(p(x)/q(x))]$.



Proof.

$$\begin{aligned}
\log L(\theta : D) &= \sum_m \log P[\zeta[m]; \theta] \\
&= \sum_\zeta \log P[\zeta; \theta] \cdot \sum_m I[\zeta[m] = \zeta] \\
&= \sum_\zeta M \hat{P}_D(\zeta) \log P[\zeta; \theta] \\
&= M \cdot \mathbb{E}_{\hat{P}_D}[\log P[\zeta; \theta]] \\
&= M \cdot \left( -H(\hat{P}_D) - KL(\hat{P}_D \,\|\, P_\theta) \right)
\end{aligned}$$

Maximizing this objective is equivalent to minimizing $KL(\hat{P}_D \,\|\, P_\theta)$, since $H(\hat{P}_D)$ is fixed.
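A small numeric illustration of the theorem (assumed Bernoulli family and invented counts, not from the slides): the $\theta$ minimizing $KL(\hat{P}_D \| P_\theta)$ over a grid is exactly the MLE frequency.

```python
import numpy as np

M_head, M_tail = 3, 7
p_hat = np.array([M_head, M_tail]) / (M_head + M_tail)  # empirical P_hat_D

thetas = np.linspace(0.001, 0.999, 999)
# KL(p_hat || Bernoulli(theta)) for each theta on the grid
kl = (p_hat[0] * np.log(p_hat[0] / thetas)
      + p_hat[1] * np.log(p_hat[1] / (1.0 - thetas)))

print(thetas[np.argmin(kl)])  # ~0.3 = M[head] / (M[head] + M[tail])
```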



Bayesian parameter estimation




The MLE seems plausible, but it can be overly simplistic in many cases.

Assume we perform an experiment of tossing a coin and get 3 heads out of 10. A reasonable conclusion would be that the probability of a head is ≈ 0.3, right?

What if we have a lot of experience with tossing random coins? What if a typical coin is very close to a fair coin?

This is called prior knowledge. We do not want the prior knowledge to be the absolute guide, but rather a reasonable starting point.



A full discussion of parameter estimation includes both analysis of the data and prior knowledge about the parameters.



Bayesian parameter estimation: joint probabilistic model

How do we model the prior knowledge?

One approach is to encode the prior knowledge about $\theta$ with a distribution. Think of it as a hierarchy between the different choices of $\theta$.

This is called a prior distribution and is denoted by $P(\theta)$.



In the previous model, the samples were i.i.d. according to a distribution $P[\cdot; \theta]$. In the current setup the samples are conditionally independent given $\theta$, since $\theta$ is itself a random variable.



Visually, the sampling process is a network in which $\theta$ is a parent of each sample $x[1], \ldots, x[M]$ (figure omitted).



In this model, the samples are drawn together with the parameter of the model.

The joint probability of the data and the parameter factorizes as follows:

$$P[D, \theta] = P[D \mid \theta] \cdot P(\theta) = P(\theta) \cdot \prod_m P[x[m] \mid \theta]$$

where $D = \{x[1], \ldots, x[M]\}$.



Bayesian parameter estimation: the posterior

We define the posterior distribution as follows:

$$P[\theta \mid D] = \frac{P[D \mid \theta] \cdot P(\theta)}{P[D]}$$

The posterior is actually what we are interested in, right?

It encodes our posterior knowledge about the choices of the parameter given the data and the prior knowledge.




Let's take a closer look at the posterior:

$$P[\theta \mid D] = \frac{P[D \mid \theta] \cdot P(\theta)}{P[D]}$$

The first term in the numerator is the likelihood.

The second is the prior distribution.

The denominator is a normalizing constant.

Posterior distribution ∝ likelihood × prior distribution.
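A grid-based sketch of "posterior ∝ likelihood × prior" for the coin example (the peaked prior below is an arbitrary illustrative choice, not from the slides):

```python
import numpy as np

M_head, M_tail = 3, 7
thetas = np.linspace(0.001, 0.999, 999)

prior = np.exp(-((thetas - 0.5) ** 2) / 0.02)   # assumed prior: peaked near a fair coin
likelihood = thetas**M_head * (1 - thetas)**M_tail
posterior = prior * likelihood
posterior /= posterior.sum()                     # normalize on the grid

print(thetas[np.argmax(posterior)])  # pulled between 0.3 (the data) and 0.5 (the prior)
```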




Bayesian parameter estimation: prediction

We are also very interested in making predictions.

This is done through the distribution of a new sample given the data set:

$$P[x[M+1] \mid D] = \int P[x[M+1] \mid \theta, D] \cdot P[\theta \mid D] \, d\theta = \int P[x[M+1] \mid \theta] \cdot P[\theta \mid D] \, d\theta$$



Let's revisit the coin flip example!

Assume that the prior is uniform over $\theta \in [0, 1]$.

What is the probability of a new sample given the data set?

$$P[x[M+1] \mid D] = \int P[x[M+1] \mid \theta] \cdot P[\theta \mid D] \, d\theta$$



Since $\theta$ is uniformly distributed,

$$P[\theta \mid D] = P[D \mid \theta] / P[D]$$

In addition, $P[x[M+1] = head \mid \theta] = \theta$. Therefore,

$$\begin{aligned}
P[x[M+1] = head \mid D] &= \int P[x[M+1] \mid \theta] \cdot P[\theta \mid D] \, d\theta \\
&= \frac{1}{P[D]} \int \theta \cdot \theta^{M[head]} \cdot (1 - \theta)^{M[tail]} \, d\theta \\
&= \frac{1}{P[D]} \cdot \frac{(M[head] + 1)! \, M[tail]!}{(M[head] + M[tail] + 2)!}
\end{aligned}$$

(See Beta functions.)


Similarly:

$$P[x[M+1] = tail \mid D] = \frac{1}{P[D]} \cdot \frac{(M[tail] + 1)! \, M[head]!}{(M[head] + M[tail] + 2)!}$$

Their sum is:

$$\frac{1}{P[D]} \int \theta^{M[head]} \cdot (1 - \theta)^{M[tail]} \, d\theta = \frac{1}{P[D]} \cdot \frac{M[head]! \, M[tail]!}{(M[head] + M[tail] + 1)!}$$



We normalize and obtain:

$$P[x[M+1] = head \mid D] = \frac{(M[head] + 1)! \, M[tail]!}{P[D] \, (M[head] + M[tail] + 2)!} \Big/ \frac{M[head]! \, M[tail]!}{P[D] \, (M[head] + M[tail] + 1)!} = \frac{M[head] + 1}{M[head] + M[tail] + 2}$$

This is similar to the MLE prediction, except that it adds one imaginary sample to each count. As the number of samples grows, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
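A tiny sketch contrasting the two predictors (Laplace's rule of succession, derived above, versus the raw MLE); the counts are invented:

```python
def predict_head_bayes(M_head, M_tail):
    # Bayesian predictive under a uniform prior: one imaginary sample per outcome.
    return (M_head + 1) / (M_head + M_tail + 2)

def predict_head_mle(M_head, M_tail):
    return M_head / (M_head + M_tail)

print(predict_head_bayes(3, 7), predict_head_mle(3, 7))          # 0.333... vs 0.3
print(predict_head_bayes(300, 700), predict_head_mle(300, 700))  # both close to 0.3
```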



Bayesian parameter estimation: Bayesian networks

We now turn to Bayesian estimation in the context of a Bayesian network. Recall that the Bayesian framework requires us to specify a joint distribution over the unknown parameters and the data instances.



We would like to introduce a simplified formula for the posterior distribution.

For this purpose, we make two simplifying assumptions on the decomposition of the prior.



The first assumption asserts a global decomposition of the prior.

Definition (Global parameter independence)

Let $G$ be a Bayesian network with parameters $\theta = (\theta_{X_1 | Pa_{X_1}}, \ldots, \theta_{X_n | Pa_{X_n}})$. A prior distribution satisfies global parameter independence if it has the form:

$$P(\theta) = \prod_i P(\theta_{X_i | Pa_{X_i}})$$




The second assumption asserts a local decomposition of the prior.

Definition (Local parameter independence)

Let $X$ be a variable with parents $U$. We say that the prior $P(\theta_{X|U})$ satisfies local parameter independence if

$$P(\theta_{X|U}) = \prod_u P(\theta_{X|u})$$



First, we decompose the distribution:

$$P[\theta \mid D] = \frac{P[D \mid \theta] \cdot P(\theta)}{P[D]}$$

Recall the decomposition of the likelihood:

$$P[D \mid \theta] = \prod_i L_i(\theta_{X_i | Pa_{X_i}} : D)$$

And by global parameter independence:

$$P[\theta \mid D] = \frac{1}{P[D]} \cdot \prod_i L_i(\theta_{X_i | Pa_{X_i}} : D) \cdot P(\theta_{X_i | Pa_{X_i}})$$



By the definition of local likelihoods:

$$\begin{aligned}
P[\theta \mid D] &= \frac{1}{P[D]} \cdot \prod_i L_i(\theta_{X_i | Pa_{X_i}} : D) \cdot P(\theta_{X_i | Pa_{X_i}}) \\
&= \frac{1}{P[D]} \cdot \prod_i \prod_m P[x_i[m] \mid Pa_{X_i}[m]; \theta_{X_i | Pa_{X_i}}] \cdot P(\theta_{X_i | Pa_{X_i}})
\end{aligned}$$



And by applying Bayes' rule,

$$P(\theta_{X_i | Pa_{X_i}} \mid D) = \frac{\prod_m P[x_i[m] \mid Pa_{X_i}[m]; \theta_{X_i | Pa_{X_i}}] \cdot P(\theta_{X_i | Pa_{X_i}})}{\prod_m P[x_i[m] \mid Pa_{X_i}[m]]}$$

we obtain

$$P[\theta \mid D] = \frac{1}{P[D]} \cdot \prod_i \left( P(\theta_{X_i | Pa_{X_i}} \mid D) \cdot \prod_m P[x_i[m] \mid Pa_{X_i}[m]] \right) = \prod_i P(\theta_{X_i | Pa_{X_i}} \mid D)$$

Finally, by local parameter independence:

$$P[\theta \mid D] = \prod_i \prod_{p \text{ instantiation of } Pa_{X_i}} P(\theta_{X_i | p} \mid D)$$



Bayesian parameter estimation: selecting a prior

It is left to choose the prior distribution.

We arrived at a very pleasing decomposition of the prior:

$$P(\theta) = \prod_i \prod_p P(\theta_{X_i | p})$$

Each $\theta_{X_i | p}$ behaves as a discrete distribution, i.e., a vector that sums to 1.

How would you choose the prior on each $\theta_{X_i | p}$?



A widely applicable distribution over discrete distributions is the Dirichlet distribution:

$$(a_1, \ldots, a_k) \sim Dir(\alpha_1, \ldots, \alpha_k) \quad \text{s.t.} \quad \sum_i a_i = 1 \ \text{and} \ a_i \in (0, 1)$$

The PDF is:

$$\frac{1}{B(\alpha_1, \ldots, \alpha_k)} \prod_i a_i^{\alpha_i - 1}, \quad \text{where } B(\alpha_1, \ldots, \alpha_k) = \frac{\prod_i \Gamma(\alpha_i)}{\Gamma\left(\sum_i \alpha_i\right)}$$

Here $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x} \, dx$ is the well-known Gamma function.



The following theorem shows that it is convenient to choose Dirichlet distributions as priors.

Theorem

If $\theta \sim Dir(\alpha_1, \ldots, \alpha_k)$ then $\theta \mid D \sim Dir(\alpha_1 + M[1], \ldots, \alpha_k + M[k])$, where $M[k]$ counts the number of occurrences of $k$ in $D$.

[See h.w.]
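A small sketch of this conjugate update (the prior hyperparameters and the data below are made up): the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts.

```python
from collections import Counter
import numpy as np

alpha = {'a': 1.0, 'b': 1.0, 'c': 1.0}   # assumed uniform prior Dir(1, 1, 1)
D = ['a', 'b', 'a', 'a', 'c', 'a', 'b']

M = Counter(D)
alpha_post = {k: alpha[k] + M[k] for k in alpha}  # Dir(alpha_k + M[k])
print(alpha_post)                                  # {'a': 5.0, 'b': 3.0, 'c': 2.0}

rng = np.random.default_rng(0)
theta_sample = rng.dirichlet(list(alpha_post.values()))  # one draw of theta | D
print(theta_sample, theta_sample.sum())                  # a probability vector
```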



Finally we have:

$$P[\theta \mid D] = \prod_i \prod_{p \text{ instantiation of } Pa_{X_i}} P(\theta_{X_i | p} \mid D)$$

where each prior is:

$$\theta_{X|p} \sim Dir(\alpha_{x^1|p}, \ldots, \alpha_{x^K|p})$$

and therefore the posterior is:

$$\theta_{X_i|p} \mid D \sim Dir(\alpha_{x^1|p} + M[p, x^1], \ldots, \alpha_{x^K|p} + M[p, x^K])$$



Bayesian parameter estimation: making predictions

The predictive model is dictated by the following distribution:

$$\begin{aligned}
P[X_1[M+1], \ldots, X_n[M+1] \mid D] &= \prod_i P[X_i[M+1] \mid Pa_{X_i}[M+1], D] \\
&= \prod_i \int P[X_i[M+1] \mid Pa_{X_i}[M+1], \theta_{X_i | Pa_{X_i}}] \cdot P[\theta_{X_i | Pa_{X_i}} \mid D] \, d\theta_{X_i | Pa_{X_i}}
\end{aligned}$$



Notice that:

$$(X_i[M+1] \mid Pa_{X_i}[M+1] = p, \theta_{X_i|p}) \sim Categorical(\theta_{X_i|p})$$

and

$$\theta_{X_i|p} \mid D \sim Dir(\alpha_{x^1|p} + M[p, x^1], \ldots, \alpha_{x^K|p} + M[p, x^K])$$

In analogy to the coin flip example, the posterior induces a predictive model in which:

$$P[X_i[M+1] = x_i \mid Pa_{X_i}[M+1] = p, D] = \frac{\alpha_{x_i|p} + M[p, x_i]}{\sum_x \left( \alpha_{x|p} + M[p, x] \right)}$$
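A sketch of this closed-form predictive for a single CPD entry (the hyperparameters and counts below are hypothetical):

```python
def predictive(alpha_p, counts_p):
    # alpha_p, counts_p: dicts mapping each value x of X_i, for a fixed
    # parent instantiation p, to alpha_{x|p} and M[p, x] respectively.
    total = sum(alpha_p[x] + counts_p[x] for x in alpha_p)
    return {x: (alpha_p[x] + counts_p[x]) / total for x in alpha_p}

alpha_p = {'x1': 1.0, 'x2': 1.0, 'x3': 1.0}   # assumed Dir(1, 1, 1) prior
counts_p = {'x1': 4, 'x2': 0, 'x3': 2}        # M[p, x] from a hypothetical data set

print(predictive(alpha_p, counts_p))  # {'x1': 5/9, 'x2': 1/9, 'x3': 3/9}
```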



Generalization analysis

One intuition that permeates our discussion is that more training instances give rise to more accurate parameter estimates.

Next, we provide some formal analysis that supports this intuition.



Generalization analysis: Almost sure convergence

Theorem

Let $P^*$ be the generating distribution, let $P(\cdot; \theta)$ be a parametric family of distributions, and let $\theta^* = \arg\min_\theta KL(P^* \| P(\cdot; \theta))$ be the M-projection of $P^*$ onto this family. In addition, let $\hat\theta = \arg\min_\theta KL(\hat{P}_D \| P(\cdot; \theta))$. Then,

$$\lim_{M \to \infty} P(\cdot; \hat\theta) = P(\cdot; \theta^*)$$

almost surely.



Generalization analysis: Convergence of multinomials

We revisit the coin flip example.

This time we are interested in measuring the number of samples required to approximate the probability of head/tail.



Assume we have a data set $D = \{x_1, \ldots, x_M\}$ of i.i.d. samples $x_i \sim \mathrm{Bernoulli}(\theta)$. We would like to estimate how large $M$ should be in order to ensure that the MLE is close to $\theta$.



Theorem

Let $\varepsilon, \delta > 0$ and let $M > \frac{1}{2\varepsilon^2} \log(2/\delta)$. Then:

$$P[|\hat\theta - \theta| < \varepsilon] \geq 1 - \delta$$

The probability is over $D$, a set of $M$ i.i.d. samples. Here, $\hat\theta$ is the MLE with respect to $D$.

[See h.w.]
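A one-line sketch computing the sample size prescribed by this (Hoeffding-style) bound; the ε and δ values below are arbitrary examples:

```python
import math

def required_samples(epsilon, delta):
    # Smallest integer M satisfying M > log(2 / delta) / (2 * epsilon**2)
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

print(required_samples(0.05, 0.01))  # 1060 samples give |theta_hat - theta| < 0.05 w.p. >= 0.99
```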
