20040611_A Brief Maximum Entropy Tutorial


  • A brief maximum entropy tutorial

  • Overview

    Statistical modeling addresses the problem of modeling the behavior of a random process.

    In constructing this model, we typically have at our disposal a sample of output from the process. From the sample, which constitutes an incomplete state of knowledge about the process, the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process.

    We can then use this representation to make predictions of the future behavior of the process.

  • Motivating example

    Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word "in".

    A model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of "in".

    To develop p, we collect a large sample of instances of the expert's decisions.

  • Motivating example

    Our goal is to

    Extract a set of facts about the decision-making process from the sample (the first task of modeling)

    Construct a model of this process (the second task)

  • Motivating example

    One obvious clue we might glean from the sample is the list of allowed translations:

    in -> {dans, en, à, au cours de, pendant}

    With this information in hand, we can impose our first constraint on our model p:

    p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

    This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys this equation.

    There are an infinite number of models p for which this identity holds.

  • Motivating example

    One model which satisfies the above equation is p(dans) = 1; in other words, the model always predicts dans.

    Another model which obeys this constraint predicts pendant and à each with probability 1/2.

    But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?

  • Motivating example

    Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is

    p(dans) = 1/5
    p(en) = 1/5
    p(à) = 1/5
    p(au cours de) = 1/5
    p(pendant) = 1/5

    This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge.

    It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.

  • Motivating example

    We might hope to glean more clues about the expert's decisions from our sample. Suppose we notice that the expert chose either dans or en 30% of the time.

    Once again there are many probability distributions consistent with these two constraints:

    p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
    p(dans) + p(en) = 3/10

    In the absence of any other knowledge, a reasonable choice for p is again the most uniform, that is, the distribution which allocates its probability as evenly as possible subject to the constraints:

    p(dans) = 3/20
    p(en) = 3/20
    p(à) = 7/30
    p(au cours de) = 7/30
    p(pendant) = 7/30
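
    As a sketch (in Python, with illustrative names that are not part of the original slides), the "split the constrained mass evenly, then split the remainder evenly" reasoning can be written out directly:

        # Most uniform p subject to: sum of all five = 1 and p(dans) + p(en) = 3/10.
        p = {}
        for w in ["dans", "en"]:
            p[w] = (3 / 10) / 2                  # 3/10 shared evenly by two phrases
        for w in ["à", "au cours de", "pendant"]:
            p[w] = (7 / 10) / 3                  # remaining 7/10 shared evenly by three
        print(p)                                 # dans, en -> 0.15; the rest -> 0.2333...
        print(sum(p.values()))                   # 1.0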

  • Motivating example

    Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:

    p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
    p(dans) + p(en) = 3/10
    p(dans) + p(à) = 1/2

    We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.
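
    One way to see what "most uniform subject to the constraints" means operationally is to hand the problem to a numerical optimizer. The sketch below assumes NumPy and SciPy are available; the setup is mine, not the slides', and simply maximizes the entropy -Σ p log p over the five phrases subject to the three constraints:

        import numpy as np
        from scipy.optimize import minimize

        # indices: 0 = dans, 1 = en, 2 = à, 3 = au cours de, 4 = pendant
        def neg_entropy(p):
            p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
            return float(np.sum(p * np.log(p)))  # minimizing this maximizes entropy

        constraints = [
            {"type": "eq", "fun": lambda p: p.sum() - 1.0},         # total probability is 1
            {"type": "eq", "fun": lambda p: p[0] + p[1] - 3 / 10},  # p(dans) + p(en) = 3/10
            {"type": "eq", "fun": lambda p: p[0] + p[2] - 1 / 2},   # p(dans) + p(à) = 1/2
        ]

        res = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
                       bounds=[(0.0, 1.0)] * 5, constraints=constraints)
        print(res.x)   # the most uniform p consistent with all three constraints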

  • Motivating example

    As we have added complexity, we have encountered two problems:

    First, what exactly is meant by "uniform", and how can one measure the uniformity of a model?

    Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?

  • Motivating example

    The maximum entropy method answers both these questions.

    Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown.

    In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.

    This is precisely the approach we took in selecting our model p at each step in the above example.

  • Maxent modeling

    Consider a random process which produces an output value y, a member of a finite set Y. For example, y may be any word in the set {dans, en, à, au cours de, pendant}.

    In generating y, the process may be influenced by some contextual information x, a member of a finite set X. For example, x could include the words in the English sentence surrounding "in".

    Our task is to construct a stochastic model that accurately represents the behavior of the random process: given a context x, the process outputs y.

  • Training data

    Collect a large number of samples (x1, y1), (x2, y2), ..., (xN, yN). Each sample would consist of a phrase x containing the words surrounding "in", together with the translation y of "in" which the process produced.

    We can summarize the training sample in terms of its empirical probability distribution p̃, defined by

    p̃(x, y) ≡ (1/N) × (number of times that (x, y) occurs in the sample)

    Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times; this sparsity is what motivates smoothing.
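
    A minimal sketch of how p̃(x, y) could be tabulated (the sample pairs here are invented purely for illustration):

        from collections import Counter

        # Invented toy sample of (context, translation) pairs.
        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]

        N = len(sample)
        p_tilde = {pair: count / N for pair, count in Counter(sample).items()}
        print(p_tilde[("in April", "en")])   # 0.4 on this toy sample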

  • Features and constraints

    The goal is to construct a statistical model of the process which generated the training sample p̃(x, y).

    The building blocks of this model will be a set of statistics of the training sample, e.g.:

    The frequency that "in" translated to either dans or en was 3/10

    The frequency that "in" translated to either dans or à was 1/2

    And so on

    These statistics are functions of the empirical distribution p̃(x, y) of the training sample.

  • Features and constraints

    We can also consider statistics that depend on the conditioning information x. E.g., in the training sample, if April is the word following "in", then the translation of "in" is en with frequency 9/10.

    To express this, we introduce the indicator function

    f(x, y) = 1 if y = en and April follows "in"; 0 otherwise

    The expected value of f with respect to the empirical distribution p̃(x, y) is

    p̃(f) ≡ Σ_{x,y} p̃(x, y) f(x, y)    (1)
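
    As a sketch of equation (1) on invented toy data (the sample, and the substring test standing in for "April follows in", are my own simplifications):

        from collections import Counter

        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        p_tilde = {pair: c / len(sample) for pair, c in Counter(sample).items()}

        def f(x, y):
            # Indicator feature: 1 if y is "en" and April follows "in", else 0.
            return 1 if y == "en" and "April" in x else 0

        # Equation (1): empirical expectation of f.
        p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde.items())
        print(p_tilde_f)   # 0.4 on this toy sample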

  • Features and constraints

    We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short.

  • Features and constraints

    When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it.

    We do this by constraining the expected value that the model assigns to the corresponding feature function f. The expected value of f with respect to the model p(y | x) is

    p(f) ≡ Σ_{x,y} p̃(x) p(y | x) f(x, y)    (2)

    where p̃(x) is the empirical distribution of x in the training sample.

  • Features and constraints

    We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require

    p(f) = p̃(f)    (3)

    We call the requirement (3) a constraint equation, or simply a constraint.

    Combining (1), (2) and (3) yields

    Σ_{x,y} p̃(x) p(y | x) f(x, y) = Σ_{x,y} p̃(x, y) f(x, y)
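
    A sketch of the two expectations in (1) and (2), and the constraint check (3), again on invented toy data; the placeholder model p(y | x) below is just the uniform conditional distribution:

        from collections import Counter

        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        N = len(sample)
        p_tilde_xy = {pair: c / N for pair, c in Counter(sample).items()}
        p_tilde_x = {x: c / N for x, c in Counter(x for x, _ in sample).items()}
        Y = ["dans", "en", "à", "au cours de", "pendant"]

        def f(x, y):
            return 1 if y == "en" and "April" in x else 0

        def p_model(y, x):
            return 1.0 / len(Y)                  # placeholder: uniform conditional model

        # Equation (2): expected value of f under the model.
        p_f = sum(p_tilde_x[x] * p_model(y, x) * f(x, y) for x in p_tilde_x for y in Y)
        # Equation (1): expected value of f in the training sample.
        p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())
        print(p_f, p_tilde_f)                    # constraint (3) asks these to be equal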

  • Features and constraints

    To sum up so far, we now have:

    A means of representing statistical phenomena inherent in a sample of data (namely, p̃(f))

    A means of requiring that our model of the process exhibit these phenomena (namely, p(f) = p̃(f))

    A feature is a binary-valued function of (x, y). A constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.

  • The maxent principle

    Suppose that we are given n feature functions f_i, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics.

    That is, we would like p to lie in the subset C of P defined by

    C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i = 1, 2, ..., n }    (4)

  • Figure 1

    [Figure: four panels (a)-(d) depicting the space P of probability models under zero, one, and two linear constraints C1, C2]

    If we impose no constraints, then all probability models p ∈ P are allowable.

    Imposing one linear constraint C1 restricts us to those p ∈ P which lie in the region defined by C1.

    A second linear constraint could determine p exactly, if the two constraints are satisfiable, i.e. the intersection of C1 and C2 is non-empty: p ∈ C1 ∩ C2.

    Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); then no p ∈ P can satisfy them both.

  • The maxent principle

    In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent.

    Furthermore, the linear constraints in our applications will not even come close to determining p ∈ P uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ ... ∩ Cn of allowable models will be infinite.

  • The maxent principle

    Among the models p ∈ C, the maximum entropy philosophy dictates that we select the distribution which is most uniform.

    A mathematical measure of the uniformity of a conditional distribution p(y | x) is provided by the conditional entropy

    H(p) ≡ - Σ_{x,y} p̃(x) p(y | x) log p(y | x)    (5)
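
    A sketch of equation (5) on the same kind of invented toy data; for the uniform placeholder model the conditional entropy comes out to log 5, the largest value possible here:

        import math
        from collections import Counter

        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        p_tilde_x = {x: c / len(sample)
                     for x, c in Counter(x for x, _ in sample).items()}
        Y = ["dans", "en", "à", "au cours de", "pendant"]

        def p_model(y, x):
            return 1.0 / len(Y)                  # placeholder conditional model

        # Equation (5): H(p) = - sum_{x,y} p~(x) p(y|x) log p(y|x)
        H = -sum(p_tilde_x[x] * p_model(y, x) * math.log(p_model(y, x))
                 for x in p_tilde_x for y in Y)
        print(H)                                 # log(5) ≈ 1.609 for the uniform model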

  • The maxent principle

    The principle of maximum entropy: to select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):

    p* = argmax_{p ∈ C} H(p)    (6)

  • Exponential form

    The maximum entropy principle presents us with a problem in constrained optimization: find the p* ∈ C which maximizes H(p). That is, find

    p* = argmax_{p ∈ C} H(p)
       = argmax_{p ∈ C} ( - Σ_{x,y} p̃(x) p(y | x) log p(y | x) )    (7)

  • Exponential form

    We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize H(p) subject to the following constraints:

    1. p(y | x) ≥ 0 for all x, y.

    2. Σ_y p(y | x) = 1 for all x. This and the previous condition guarantee that p is a conditional probability distribution.

    3. Σ_{x,y} p̃(x) p(y | x) f_i(x, y) = Σ_{x,y} p̃(x, y) f_i(x, y) for i ∈ {1, 2, ..., n}. In other words, p ∈ C, and so p satisfies the active constraints C.

  • Exponential form

    To solve this optimization problem, introduce the Lagrangian (written here for the equivalent problem of minimizing -H(p)):

    ξ(p, Λ, γ) ≡ Σ_{x,y} p̃(x) p(y | x) log p(y | x)
                 + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) - Σ_{x,y} p̃(x) p(y | x) f_i(x, y) )
                 + γ ( 1 - Σ_y p(y | x) )    (8)

    where λ_1, ..., λ_n and γ are Lagrange multipliers for the feature constraints and the normalization constraint, respectively.

  • Exponential form

    Holding Λ and γ fixed, set the derivative of ξ with respect to p(y | x) to zero:

    ∂ξ/∂p(y | x) = p̃(x) ( 1 + log p(y | x) ) - Σ_i λ_i p̃(x) f_i(x, y) - γ = 0    (9)

    Solving for p(y | x):

    log p(y | x) = Σ_i λ_i f_i(x, y) + γ/p̃(x) - 1

    p(y | x) = exp( Σ_i λ_i f_i(x, y) ) exp( γ/p̃(x) - 1 )    (10)

  • Exponential form

    We have thus found the parametric form of p, and so we now take up the task of solving for the optimal values of Λ and γ.

    Recognizing that the second factor in equation (10) is the factor corresponding to the second of the constraints listed above (the normalization of p(y | x)), we can rewrite (10) as

    p(y | x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )    (11)

    where Z(x), the normalizing factor, is given by

    Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )    (12)
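
    A sketch of the exponential form (11) and its normalizer (12); the single feature and its weight below are made up for illustration:

        import math

        Y = ["dans", "en", "à", "au cours de", "pendant"]

        def f1(x, y):                            # illustrative feature
            return 1 if y == "en" and "April" in x else 0

        features = [f1]
        lam = [1.5]                              # made-up weight for lambda_1

        def Z(x):
            # Equation (12): Z(x) = sum_y exp( sum_i lambda_i f_i(x, y) )
            return sum(math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                       for y in Y)

        def p(y, x):
            # Equation (11): p(y|x) = exp( sum_i lambda_i f_i(x, y) ) / Z(x)
            return math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) / Z(x)

        print(p("en", "in April"))               # pushed above 1/5 by the positive weight
        print(sum(p(y, "in April") for y in Y))  # 1.0: a proper conditional distribution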

  • Exponential form

    Proof of (12): the second constraint requires Σ_y p(y | x) = 1 for every x. Substituting (10),

    Σ_y exp( Σ_i λ_i f_i(x, y) ) exp( γ/p̃(x) - 1 ) = 1

    so that

    exp( γ/p̃(x) - 1 ) = 1 / Σ_y exp( Σ_i λ_i f_i(x, y) ) = 1 / Z(x)

    and substituting this back into (10) gives exactly the form (11):

    p(y | x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )

  • Exponential form

    We have found γ (it is absorbed into Z(x)) but not yet Λ. Towards this end we introduce some further notation. Define the dual function Ψ(Λ) as

    Ψ(Λ) ≡ ξ(p_Λ, Λ, γ)    (13)

    and the dual optimization problem as

    Find Λ* = argmax_Λ Ψ(Λ)    (14)

    Since p_Λ and γ are determined by Λ, the right-hand side of (14) has only the free variables Λ = {λ_1, λ_2, ..., λ_n}.

  • Exponential form

    Final result: the maximum entropy model subject to the constraints C has the parametric form p* = p_{Λ*} of (11), where the optimal parameters Λ* can be determined by maximizing the dual function Ψ(Λ).

  • Maximum likelihood

    The log-likelihood L_p̃(p) of the empirical distribution p̃ as predicted by a model p is defined by

    L_p̃(p) ≡ log Π_{x,y} p(y | x)^p̃(x,y) = Σ_{x,y} p̃(x, y) log p(y | x)    (15)

    It is easy to check that the dual function Ψ(Λ) of the previous section is, in fact, just the log-likelihood for the exponential model p_Λ; that is,

    Ψ(Λ) = L_p̃(p_Λ)    (16)

    where p_Λ has the parametric form of (11).

    With this interpretation, the result of the previous section can be rephrased as: the model p* ∈ C with maximum entropy is the model in the parametric family p_Λ(y | x) that maximizes the likelihood of the training sample p̃.
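
    Because Ψ(Λ) = Σ_i λ_i p̃(f_i) - Σ_x p̃(x) log Z(x) is an unconstrained, concave function of Λ, it can be maximized with any generic optimizer. The sketch below uses SciPy's L-BFGS rather than the algorithm of the later slides, on invented toy data with a single feature:

        import numpy as np
        from scipy.optimize import minimize

        Y = ["dans", "en", "à", "au cours de", "pendant"]
        # Invented toy sample and one feature, for illustration only.
        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        N = len(sample)
        features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0]

        xs = sorted({x for x, _ in sample})
        p_tilde_x = {x: sum(1 for x2, _ in sample if x2 == x) / N for x in xs}
        p_tilde_f = np.array([sum(f(x, y) for x, y in sample) / N for f in features])

        def neg_dual(lam):
            # -Psi(lam) = sum_x p~(x) log Z(x) - sum_i lam_i p~(f_i)
            psi = float(lam @ p_tilde_f)
            for x in xs:
                Zx = sum(np.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                         for y in Y)
                psi -= p_tilde_x[x] * np.log(Zx)
            return -psi

        res = minimize(neg_dual, x0=np.zeros(len(features)), method="L-BFGS-B")
        print(res.x)   # lambda* maximizing Psi, i.e. the maximum entropy / ML model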

  • Maximum likelihood

    To see why (16) holds, evaluate (8) at the parametric form p_Λ:

    Ψ(Λ) = ξ(p_Λ, Λ, γ)
         = Σ_{x,y} p̃(x) p_Λ(y | x) log p_Λ(y | x)
           + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) - Σ_{x,y} p̃(x) p_Λ(y | x) f_i(x, y) )
           + γ ( 1 - Σ_y p_Λ(y | x) )

    Since p_Λ(y | x) is normalized, the γ term vanishes.

  • Maximum likelihood

    Expanding log p_Λ(y | x) = Σ_i λ_i f_i(x, y) - log Z(x) in the first term, the Σ_i λ_i p_Λ(f_i) pieces cancel and we are left with

    Ψ(Λ) = Σ_i λ_i Σ_{x,y} p̃(x, y) f_i(x, y) - Σ_x p̃(x) log Z(x)
         = Σ_i λ_i p̃(f_i) - Σ_x p̃(x) log Z(x)

    Expanding (15) for the exponential model gives the same expression:

    L_p̃(p_Λ) = Σ_{x,y} p̃(x, y) log p_Λ(y | x)
             = Σ_{x,y} p̃(x, y) Σ_i λ_i f_i(x, y) - Σ_x p̃(x) log Z(x)
             = Σ_i λ_i p̃(f_i) - Σ_x p̃(x) log Z(x)

    which establishes (16): Ψ(Λ) = L_p̃(p_Λ).

  • Outline (maxent modeling summary)

    We began by seeking the conditional distribution p(y | x) which had maximal entropy H(p) subject to a set of linear constraints (7).

    Following the traditional procedure in constrained optimization, we introduced the Lagrangian ξ(p, Λ, γ), where Λ, γ are a set of Lagrange multipliers for the constraints we imposed on p(y | x).

    To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve ξ(p, Λ, γ) for p to get a parametric form for p in terms of Λ, γ; (2) then plug p back in to ξ(p, Λ, γ), this time solving for Λ, γ.

  • Outline (maxent modeling summary)

    The parametric form for p turns out to have the exponential form (11).

    The γ gives rise to the normalizing factor Z(x), given in (12).

    The Λ will be solved for numerically using the dual function (14). Furthermore, it so happens that this function, Ψ(Λ), is the log-likelihood for the exponential model p of (11). So what started as the maximization of entropy subject to a set of linear constraints turns out to be equivalent to the unconstrained maximization of the likelihood of a certain parametric family of distributions.

  • Outline (maxent modeling summary)

    Table 1 summarizes the primal-dual framework:

                        Primal                        Dual
    problem             argmax_{p ∈ C} H(p)           argmax_Λ Ψ(Λ)
    description         maximum entropy               maximum likelihood
    type of search      constrained optimization      unconstrained optimization
    search domain       p ∈ C                         real-valued vectors {λ_1, λ_2, ..., λ_n}
    solution            p*                            Λ*

    Kuhn-Tucker theorem: p* = p_{Λ*}

  • Computing the parameters

    Algorithm 1: Improved Iterative Scaling

    Input: feature functions f_1, f_2, ..., f_n; empirical distribution p̃(x, y)
    Output: optimal parameter values λ_i*; optimal model p_{Λ*}

    1. Start with λ_i = 0 for all i ∈ {1, 2, ..., n}

    2. Do for each i ∈ {1, 2, ..., n}:

       a. Let Δλ_i be the solution to

          Σ_{x,y} p̃(x) p_Λ(y | x) f_i(x, y) exp( Δλ_i f#(x, y) ) = p̃(f_i)    (18)

          where f#(x, y) ≡ Σ_i f_i(x, y)    (19)

       b. Update the value of λ_i according to: λ_i <- λ_i + Δλ_i

    3. Go to step 2 if not all the λ_i have converged
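
    The update equation (18) has to be solved for Δλ_i; below is a compact sketch of Algorithm 1 on invented toy data, solving (18) by one-dimensional root finding (SciPy's brentq), which is one convenient way to do it. Everything here (sample, features, bracket, iteration count) is illustrative rather than part of the original slides.

        import math
        from scipy.optimize import brentq

        Y = ["dans", "en", "à", "au cours de", "pendant"]
        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]    # invented toy data
        N = len(sample)
        features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0,
                    lambda x, y: 1.0 if y == "dans" else 0.0]
        xs = sorted({x for x, _ in sample})
        p_tilde_x = {x: sum(1 for x2, _ in sample if x2 == x) / N for x in xs}
        p_tilde_f = [sum(f(x, y) for x, y in sample) / N for f in features]

        def p_model(lam, y, x):                  # the exponential model (11), (12)
            score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
            return score(y) / sum(score(yy) for yy in Y)

        def f_sharp(x, y):                       # equation (19): f#(x, y) = sum_i f_i(x, y)
            return sum(f(x, y) for f in features)

        lam = [0.0] * len(features)              # step 1: lambda_i = 0 for all i
        for _ in range(50):                      # step 3: repeat until (near) convergence
            for i, fi in enumerate(features):    # step 2: treat each lambda_i in turn
                def g(d):                        # equation (18) as a root-finding problem
                    return sum(p_tilde_x[x] * p_model(lam, y, x) * fi(x, y)
                               * math.exp(d * f_sharp(x, y))
                               for x in xs for y in Y) - p_tilde_f[i]
                lam[i] += brentq(g, -20.0, 20.0) # step 2b: lambda_i <- lambda_i + delta
        print(lam)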