PGM: Tirgul 10 Parameter Learning and Priors

TRANSCRIPT

Page 1: PGM: Tirgul 10 Parameter Learning and Priors

PGM: Tirgul 10
Parameter Learning and Priors

Page 2: PGM: Tirgul 10 Parameter Learning and Priors

Why learning?

Knowledge acquisition bottleneck:
Knowledge acquisition is an expensive process
Often we don't have an expert

Data is cheap:
Vast amounts of data become available to us

Learning allows us to build systems based on the data

Page 3: PGM: Tirgul 10 Parameter Learning and Priors

Learning Bayesian networks

[Figure: an Inducer takes data plus prior information and outputs a Bayesian network over E, R, B, A, C together with its CPTs, e.g. P(A | E, B) with rows for the four (E, B) configurations and entries such as .9/.1, .7/.3, .8/.2, .99/.01]

Page 4: PGM: Tirgul 10 Parameter Learning and Priors

Known Structure -- Complete Data

E, B, A: <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> ... <N,Y,Y>

[Figure: the Inducer takes the complete data and the fixed structure E, B -> A, whose CPT P(A | E, B) is initially unknown ("?"), and outputs the estimated CPT entries]

Network structure is specified; the Inducer needs to estimate parameters.
Data does not contain missing values.

Page 5: PGM: Tirgul 10 Parameter Learning and Priors

Unknown Structure -- Complete Data

E, B, A: <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> ... <N,Y,Y>

[Figure: the Inducer takes the complete data and outputs both the network structure over E, B, A and the estimated CPT P(A | E, B), starting from unknown ("?") entries]

Network structure is not specified; the Inducer needs to select arcs and estimate parameters.
Data does not contain missing values.

Page 6: PGM: Tirgul 10 Parameter Learning and Priors

Known Structure -- Incomplete Data

E, B, A: <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> ... <?,Y,Y>

[Figure: the Inducer takes the incomplete data and the fixed structure E, B -> A and estimates the CPT P(A | E, B)]

Network structure is specified; data contains missing values.
We consider assignments to missing values.

Page 7: PGM: Tirgul 10 Parameter Learning and Priors

Known Structure / Complete Data

Given a network structure G and a choice of parametric family for P(Xi | Pai), learn parameters for the network.

Goal: construct a network that is "closest" to the probability distribution that generated the data.

Page 8: PGM: Tirgul 10 Parameter Learning and Priors

Example: Binomial Experiment (Statistics 101)

When tossed, a thumbtack can land in one of two positions: Head or Tail.

We denote by θ the (unknown) probability P(H).

Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.

Page 9: PGM: Tirgul 10 Parameter Learning and Priors

Statistical Parameter Fitting

Consider instances x[1], x[2], ..., x[M] such that:
The set of values that x can take is known
Each is sampled from the same distribution
Each is sampled independently of the rest
(i.i.d. samples)

The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ).

This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc. For now, we focus on multinomial distributions.

Page 10: PGM: Tirgul 10 Parameter Learning and Priors

The Likelihood Function

How good is a particular θ? It depends on how likely it is to generate the observed data:

L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)

The likelihood for the sequence H, T, T, H, H is

L(\theta : D) = \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta

[Figure: plot of L(θ : D) as a function of θ over [0, 1]]
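A minimal numerical sketch of this likelihood curve (not part of the original slides), assuming NumPy is available; the sequence H, T, T, H, H above gives N_H = 3 and N_T = 2:

```python
import numpy as np

def binomial_likelihood(theta, n_heads, n_tails):
    # L(theta : D) = theta^{N_H} * (1 - theta)^{N_T}
    return theta ** n_heads * (1.0 - theta) ** n_tails

thetas = np.linspace(0.0, 1.0, 1001)
likelihood = binomial_likelihood(thetas, n_heads=3, n_tails=2)

# The curve peaks at theta = N_H / (N_H + N_T) = 0.6, the MLE derived later.
print(thetas[np.argmax(likelihood)])
```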

Page 11: PGM: Tirgul 10 Parameter Learning and Priors

Sufficient Statistics

To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}

N_H and N_T are sufficient statistics for the binomial distribution.

Page 12: PGM: Tirgul 10 Parameter Learning and Priors

Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.

Formally, s(D) is a sufficient statistic if, for any two datasets D and D',

s(D) = s(D') \Rightarrow L(\theta : D) = L(\theta : D')

[Figure: many datasets map to the same sufficient statistics]

Page 13: PGM: Tirgul 10 Parameter Learning and Priors


Maximum Likelihood Estimation

MLE Principle:

Choose parameters that maximize the likelihood function

This is one of the most commonly used estimators in statistics

Intuitively appealing

Page 14: PGM: Tirgul 10 Parameter Learning and Priors

Example: MLE in Binomial Data

Applying the MLE principle we get

\hat{\theta} = \frac{N_H}{N_H + N_T}

(which coincides with what one would expect).

Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.

[Figure: likelihood curve L(θ : D) over θ in [0, 1], maximized at 0.6]

Page 15: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters for a Bayesian Network

[Figure: network with edges E -> A, B -> A, A -> C]

Training data has the form:

D = \begin{pmatrix} E[1] & B[1] & A[1] & C[1] \\ \vdots & \vdots & \vdots & \vdots \\ E[M] & B[M] & A[M] & C[M] \end{pmatrix}

Page 16: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters for a Bayesian Network

Since we assume i.i.d. samples, the likelihood function is

L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)

Page 17: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters for a Bayesian Network

By definition of the network, we get

L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
             = \prod_m P(E[m] : \Theta) \, P(B[m] : \Theta) \, P(A[m] \mid B[m], E[m] : \Theta) \, P(C[m] \mid A[m] : \Theta)

Page 18: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters for a Bayesian Network

Rewriting terms, we get

L(\Theta : D) = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)

Page 19: PGM: Tirgul 10 Parameter Learning and Priors

General Bayesian Networks

Generalizing for any Bayesian network: the likelihood decomposes according to the structure of the network.

L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta)            (i.i.d. samples)
             = \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \Theta)        (network factorization)
             = \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \Theta)
             = \prod_i L_i(\Theta_i : D)
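A small executable sketch of this factorized (log-)likelihood; the structure, CPT values and records below are hypothetical illustrations, not the alarm network of the slides:

```python
import math

# Hypothetical structure: E and B are roots, A has parents (E, B).
structure = {"E": [], "B": [], "A": ["E", "B"]}
cpds = {
    "E": {(): {0: 0.9, 1: 0.1}},
    "B": {(): {0: 0.8, 1: 0.2}},
    "A": {(0, 0): {0: 0.99, 1: 0.01}, (0, 1): {0: 0.2, 1: 0.8},
          (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.1, 1: 0.9}},
}

def log_likelihood(data):
    # Sum over instances m and over nodes i of log P(x_i[m] | pa_i[m]).
    total = 0.0
    for record in data:                       # record: dict node -> value
        for node, parents in structure.items():
            pa = tuple(record[p] for p in parents)
            total += math.log(cpds[node][pa][record[node]])
    return total

data = [{"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0}]
print(log_likelihood(data))
```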

Page 20: PGM: Tirgul 10 Parameter Learning and Priors

General Bayesian Networks (Cont.)

Decomposition ⇒ independent estimation problems

If the parameters for each family are not related, then they can be estimated independently of each other.

Page 21: PGM: Tirgul 10 Parameter Learning and Priors

From Binomial to Multinomial

For example, suppose X can have the values 1, 2, ..., K.

We want to learn the parameters θ_1, θ_2, ..., θ_K.

Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed.

Likelihood function:

L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}

MLE:

\hat{\theta}_k = \frac{N_k}{N}
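A tiny sketch of these sufficient statistics and the MLE (the observation list is an invented example):

```python
from collections import Counter

def multinomial_mle(observations, values):
    # Sufficient statistics: N_k, the number of times each value k is observed.
    counts = Counter(observations)
    n = len(observations)
    # MLE: theta_k = N_k / N
    return {k: counts[k] / n for k in values}

print(multinomial_mle([1, 3, 2, 1, 1, 2], values=[1, 2, 3]))
# {1: 0.5, 2: 0.333..., 3: 0.166...}
```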

Page 22: PGM: Tirgul 10 Parameter Learning and Priors

Likelihood for Multinomial Networks

When we assume that P(Xi | Pai) is multinomial, we get further decomposition:

L_i(\Theta_i : D) = \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i)
                 = \prod_{pa_i} \prod_{m : Pa_i[m] = pa_i} P(x_i[m] \mid pa_i : \Theta_i)
                 = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}

Page 23: PGM: Tirgul 10 Parameter Learning and Priors

Likelihood for Multinomial Networks

When we assume that P(Xi | Pai) is multinomial, we get further decomposition:

L_i(\Theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}

For each value pa_i of the parents of X_i we get an independent multinomial problem.

The MLE is

\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}
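A sketch of this per-family MLE from counts; the record format (dicts from variable names to values) and the example data are assumptions for illustration:

```python
from collections import Counter

def family_mle(data, child, parents):
    # Sufficient statistics for this family: N(x_i, pa_i) and N(pa_i).
    joint = Counter()     # counts of (pa_i configuration, x_i value)
    marginal = Counter()  # counts of pa_i configuration
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(pa, record[child])] += 1
        marginal[pa] += 1
    # theta_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
    return {key: n / marginal[key[0]] for key, n in joint.items()}

data = [{"E": 1, "B": 0, "A": 1}, {"E": 1, "B": 0, "A": 0}, {"E": 0, "B": 1, "A": 1}]
print(family_mle(data, child="A", parents=["E", "B"]))
```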

Page 24: PGM: Tirgul 10 Parameter Learning and Priors

Maximum Likelihood Estimation

Consistency: the estimate converges to the best possible value as the number of examples grows.

To make this formal, we need to introduce some definitions.

Page 25: PGM: Tirgul 10 Parameter Learning and Priors

KL-Divergence

Let P and Q be two distributions over X. A measure of distance between P and Q is the Kullback-Leibler divergence:

KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

KL(P||Q) = 1 (when logs are in base 2) means that, on average, Q assigns to an instance half the probability that P assigns to it.

KL(P||Q) ≥ 0, and KL(P||Q) = 0 iff P and Q are equal.
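A direct sketch of the definition for two distributions given as aligned probability vectors (base-2 logs, matching the remark above; the vectors are made-up examples):

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)); terms with P(x) = 0 contribute 0.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))  # positive
print(kl_divergence(p, p))  # 0.0
```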

Page 26: PGM: Tirgul 10 Parameter Learning and Priors

Consistency

Let P(X | θ) be a parametric family (we need various regularity conditions that we won't go into here).
Let P*(X) be the distribution that generates the data.
Let θ̂_D be the MLE estimate given a dataset D.

Thm: as N → ∞, θ̂_D → θ* with probability 1, where

\theta^* = \arg\min_\theta KL(P^*(X) \,\|\, P(X : \theta))

Page 27: PGM: Tirgul 10 Parameter Learning and Priors

Consistency -- Geometric Interpretation

[Figure: in the space of probability distributions, P(X | θ*) is the point closest (in KL) to P* among the distributions that can be represented by P(X | θ)]

Page 28: PGM: Tirgul 10 Parameter Learning and Priors

Is MLE all we need?

Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?

Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet?

Page 29: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Inference

Frequentist approach:
Assumes there is an unknown but fixed parameter θ
Estimates θ with some confidence
Prediction by using the estimated parameter value

Bayesian approach:
Represents uncertainty about the unknown parameter θ
Uses probability to quantify this uncertainty: unknown parameters as random variables
Prediction follows from the rules of probability: expectation over the unknown parameters

Page 30: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Inference (cont.)

We can represent our uncertainty about the sampling process using a Bayesian network:

[Figure: parameter node θ with children X[1], X[2], ..., X[M] (observed data) and X[M+1] (query)]

The values of X are independent given θ.

The conditional probabilities, P(x[m] | θ), are the parameters in the model.

Prediction is now inference in this network.

Page 31: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Inference (cont.)

Prediction as inference in this network:

P(x[M+1] \mid x[1], \ldots, x[M]) = \int P(x[M+1] \mid \theta, x[1], \ldots, x[M]) \, P(\theta \mid x[1], \ldots, x[M]) \, d\theta
                                  = \int P(x[M+1] \mid \theta) \, P(\theta \mid x[1], \ldots, x[M]) \, d\theta

where

P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta) \, P(\theta)}{P(x[1], \ldots, x[M])}

i.e. posterior = likelihood × prior / probability of the data.

Page 32: PGM: Tirgul 10 Parameter Learning and Priors

Example: Binomial Data Revisited

Prior: uniform for θ in [0, 1], i.e. P(θ) = 1.

Then P(θ | D) is proportional to the likelihood L(θ : D):

P(\theta \mid x[1], \ldots, x[M]) \propto P(x[1], \ldots, x[M] \mid \theta) \, P(\theta)

(N_H, N_T) = (4, 1)

MLE for P(X = H) is 4/5 = 0.8.

Bayesian prediction is

P(x[M+1] = H \mid D) = \int \theta \, P(\theta \mid D) \, d\theta = \frac{5}{7} \approx 0.7142

[Figure: posterior density over θ in [0, 1]]
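A one-line check of both numbers; the closed form (N_H + 1)/(N_H + N_T + 2) is what the integral above evaluates to under the uniform prior:

```python
n_heads, n_tails = 4, 1

mle = n_heads / (n_heads + n_tails)              # 4/5 = 0.8
bayes = (n_heads + 1) / (n_heads + n_tails + 2)  # 5/7 ~= 0.7142

print(mle, bayes)
```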

Page 33: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Inference and MLE

In our example, MLE and Bayesian prediction differ.

But... if the prior is well-behaved (does not assign 0 density to any "feasible" parameter value), then both MLE and Bayesian prediction converge to the same value.

Both are consistent.

Page 34: PGM: Tirgul 10 Parameter Learning and Priors

Dirichlet Priors

Recall that the likelihood function is

L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}

A Dirichlet prior with hyperparameters α_1, ..., α_K is defined as

P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}

for legal θ_1, ..., θ_K.

Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

P(\Theta \mid D) \propto P(\Theta) \, P(D \mid \Theta) = \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}

Page 35: PGM: Tirgul 10 Parameter Learning and Priors

Dirichlet Priors (cont.)

We can compute the prediction on a new event in closed form: if P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

P(X[1] = k) = \int \theta_k \, P(\Theta) \, d\Theta = \frac{\alpha_k}{\sum_l \alpha_l}

Since the posterior is also Dirichlet, we get

P(X[M+1] = k \mid D) = \int \theta_k \, P(\Theta \mid D) \, d\Theta = \frac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}
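A small sketch of the conjugate update and the closed-form prediction above (the counts and hyperparameters are hypothetical):

```python
def dirichlet_posterior(alphas, counts):
    # Posterior hyperparameters: alpha_k + N_k for every outcome k.
    return [a + n for a, n in zip(alphas, counts)]

def predict(alphas):
    # P(X = k) = alpha_k / sum_l alpha_l for a Dirichlet with these hyperparameters.
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1.0, 1.0, 1.0]    # uniform prior over 3 outcomes
counts = [5, 2, 3]         # observed N_1, N_2, N_3
posterior = dirichlet_posterior(prior, counts)
print(predict(posterior))  # [(1+5)/13, (1+2)/13, (1+3)/13]
```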

Page 36: PGM: Tirgul 10 Parameter Learning and Priors

Dirichlet Priors -- Example

[Figure: Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2) and Dirichlet(5, 5) densities over θ in [0, 1]]

Page 37: PGM: Tirgul 10 Parameter Learning and Priors

Prior Knowledge

The hyperparameters α_1, ..., α_K can be thought of as "imaginary" counts from our prior experience.

Equivalent sample size = α_1 + ... + α_K.

The larger the equivalent sample size, the more confident we are in our prior.

Page 38: PGM: Tirgul 10 Parameter Learning and Priors

Effect of Priors

Prediction of P(X = H) after seeing data with N_H = 0.25·N_T, for different sample sizes.

[Figure: two plots of the prediction as a function of the number of samples. Left: different prior strengths α_H + α_T with a fixed ratio α_H / α_T. Right: fixed prior strength α_H + α_T with different ratios α_H / α_T]
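A sketch of the quantity being plotted, using the Dirichlet prediction (α_H + N_H)/(α_H + α_T + N_H + N_T); the prior strengths and sample sizes below are illustrative choices:

```python
def predict_heads(alpha_h, alpha_t, n_h, n_t):
    # Bayesian prediction of P(X = H) under a Beta/Dirichlet prior.
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

for strength in (1, 5, 20, 50):            # equivalent sample size, ratio fixed at 1:1
    alpha_h = alpha_t = strength / 2
    for n in (0, 20, 100):                 # total number of observations
        n_h, n_t = 0.2 * n, 0.8 * n        # data with N_H = 0.25 * N_T
        print(strength, n, predict_heads(alpha_h, alpha_t, n_h, n_t))
```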

Page 39: PGM: Tirgul 10 Parameter Learning and Priors

Effect of Priors (cont.)

In real data, Bayesian estimates are less sensitive to noise in the data.

[Figure: P(X = 1 | D) as a function of N for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5) and Dirichlet(10,10) priors, shown together with the underlying sequence of toss results (0/1)]

Page 40: PGM: Tirgul 10 Parameter Learning and Priors

Conjugate Families

The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy. The Dirichlet prior is a conjugate family for the multinomial likelihood.

Conjugate families are useful since:
For many distributions we can represent them with hyperparameters
They allow for sequential updates within the same representation
In many cases we have a closed-form solution for prediction

Page 41: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Networks and Bayesian Prediction

Priors for each parameter group are independent.
Data instances are independent given the unknown parameters.

[Figure: unrolled network with parameter nodes θ_X and θ_Y|X pointing to X[1], Y[1], ..., X[M], Y[M] (observed data) and to X[M+1], Y[M+1] (query), and the equivalent plate notation with plate index m over X[m], Y[m]]

Page 42: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Networks and Bayesian Prediction (Cont.)

We can also "read" from the network: complete data ⇒ posteriors on parameters are independent.

[Figure: the same unrolled network and plate notation as on the previous slide]

Page 43: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Prediction (cont.)

Since posteriors on parameters for each family are independent, we can compute them separately.

Posteriors for parameters within families are also independent: complete data ⇒ independent posteriors on θ_Y|X=0 and θ_Y|X=1.

[Figure: refined plate model in which θ_Y|X is split into separate parameter nodes θ_Y|X=0 and θ_Y|X=1, alongside θ_X]

Page 44: PGM: Tirgul 10 Parameter Learning and Priors

Bayesian Prediction (cont.)

Given these observations, we can compute the posterior for each multinomial θ_{Xi | pai} independently.

The posterior is Dirichlet with parameters

α(Xi=1 | pai) + N(Xi=1 | pai), ..., α(Xi=k | pai) + N(Xi=k | pai)

The predictive distribution is then represented by the parameters

\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}

Page 45: PGM: Tirgul 10 Parameter Learning and Priors

Assessing Priors for Bayesian Networks

We need α(xi, pai) for each node Xi.

We can use initial parameters Θ0 as prior information, together with an equivalent sample size parameter M0.

Then we let α(xi, pai) = M0 · P(xi, pai | Θ0).

This allows us to update a network using new data.
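A sketch of this prior-assessment step for one family (the numbers are hypothetical; in practice P(xi, pai | Θ0) would come from inference in the initial network Θ0):

```python
def prior_counts(m0, joint_prob):
    # alpha(x_i, pa_i) = M0 * P(x_i, pa_i | Theta_0)
    return {key: m0 * p for key, p in joint_prob.items()}

def updated_parameter(alpha, data_counts, pa):
    # theta~_{x_i | pa_i} = (alpha(x_i, pa_i) + N(x_i, pa_i)) / (alpha(pa_i) + N(pa_i))
    keys = [k for k in alpha if k[0] == pa]
    denom = sum(alpha[k] + data_counts.get(k, 0) for k in keys)
    return {k: (alpha[k] + data_counts.get(k, 0)) / denom for k in keys}

# Hypothetical joint P(E, A) under the initial network Theta_0, keyed by (E value, A value).
joint = {("e", "a"): 0.08, ("e", "na"): 0.02, ("ne", "a"): 0.09, ("ne", "na"): 0.81}
alpha = prior_counts(m0=10, joint_prob=joint)
data = {("e", "a"): 3, ("e", "na"): 1}   # counts N(A, E) observed so far
print(updated_parameter(alpha, data, pa="e"))
```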

Page 46: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters: Case Study (cont.)

Experiment:
Sample a stream of instances from the alarm network
Learn parameters using:
the MLE estimator
a Bayesian estimator with a uniform prior of different strengths

Page 47: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters: Case Study (cont.)

[Figure: KL divergence from the true distribution as a function of the number of samples M, for the MLE and for Bayesian estimators with uniform priors of strength M' = 5, 10, 20, 50]

Page 48: PGM: Tirgul 10 Parameter Learning and Priors

Learning Parameters: Summary

Estimation relies on sufficient statistics. For multinomials these are of the form N(xi, pai).

Parameter estimation:

MLE:
\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}

Bayesian (Dirichlet):
\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}

Bayesian methods also require a choice of priors.

Both MLE and Bayesian estimates are asymptotically equivalent and consistent.

Both can be implemented in an on-line manner by accumulating sufficient statistics.
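A sketch of the on-line flavor mentioned above: both estimators only need running counts, updated one instance at a time (the family, value names and prior counts are hypothetical):

```python
from collections import Counter

class FamilyEstimator:
    """Accumulates N(x_i, pa_i) and N(pa_i) incrementally for one family."""

    def __init__(self, alpha=None):
        self.alpha = alpha or {}   # optional prior counts alpha(x_i, pa_i)
        self.joint = Counter()     # N(x_i, pa_i)
        self.marginal = Counter()  # N(pa_i)

    def update(self, x, pa):
        self.joint[(x, pa)] += 1
        self.marginal[pa] += 1

    def mle(self, x, pa):
        return self.joint[(x, pa)] / self.marginal[pa]

    def bayes(self, x, pa, values):
        # (alpha + N)(x, pa) / sum over x' of (alpha + N)(x', pa)
        num = self.alpha.get((x, pa), 0) + self.joint[(x, pa)]
        den = sum(self.alpha.get((v, pa), 0) + self.joint[(v, pa)] for v in values)
        return num / den

est = FamilyEstimator(alpha={("a", "e"): 1, ("na", "e"): 1})
for x, pa in [("a", "e"), ("na", "e"), ("a", "e")]:
    est.update(x, pa)
print(est.mle("a", "e"), est.bayes("a", "e", values=["a", "na"]))
```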