20040611_a brief maximum entropy tutorial
TRANSCRIPT
-
A brief maximum entropy tutorial
-
Overview
Statistical modeling addresses the problem of modeling the behavior of a random process.
In constructing this model, we typically have at our disposal a sample of output from the process.
From the sample, which constitutes an incomplete state of knowledge about the process, the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process.
We can then use this representation to make predictions of the future behavior of the process.
-
Motivating example
Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word in.
A model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of in.
To develop p, collect a large sample of instances of the expert's decisions.
-
Motivating example
Our goal is to:
Extract a set of facts about the decision-making process from the sample (the first task of modeling)
Construct a model of this process (the second task)
-
Motivating example
One obvious clue we might glean from the sample is the list of allowed translations:
in {dans, en, à, au cours de, pendant}
With this information in hand, we can impose our first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys this equation.
There are an infinite number of models p for which this identity holds.
-
Motivating example
One model which satisfies the above equation is p(dans) = 1; in other words, the model always predicts dans.
Another model which obeys this constraint predicts pendant with a probability of 1/2, and au cours de with a probability of 1/2.
But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?
-
Motivating example
Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is
p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5
This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge.
It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.
-
Motivating example
We might hope to glean more clues about the expert's decisions from our sample.
Suppose we notice that the expert chose either dans or en 30% of the time.
Once again there are many probability distributions consistent with these two constraints.
In the absence of any other knowledge, a reasonable choice for p is again the most uniform, that is, the distribution which allocates its probability as evenly as possible, subject to the constraints:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(en) = 3/10
The most uniform such distribution is
p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30
-
Motivating example
Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(en) = 3/10
p(dans) + p(à) = 1/2
We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.
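As a numerical aside (not part of the original tutorial), these small most-uniform problems can be solved directly; the sketch below uses scipy's SLSQP solver, and the phrase ordering, starting point, and solver choice are illustrative assumptions.

```python
# A numerical sketch of the motivating example: find the most uniform
# (maximum-entropy) model subject to each set of constraints in turn.
import numpy as np
from scipy.optimize import minimize

phrases = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)            # guard against log(0)
    return float(np.sum(p * np.log(p)))   # minimizing this maximizes entropy

def most_uniform(extra):
    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}] + extra
    res = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
                   bounds=[(0.0, 1.0)] * 5, constraints=cons)
    return dict(zip(phrases, np.round(res.x, 4)))

# Constraint 1 alone: recovers the uniform model, p = 1/5 everywhere.
print(most_uniform([]))
# Constraints 1-2: recovers p(dans) = p(en) = 3/20, the rest 7/30.
print(most_uniform([{"type": "eq", "fun": lambda p: p[0] + p[1] - 3/10}]))
# Constraints 1-3: the most uniform model is no longer obvious by hand.
print(most_uniform([{"type": "eq", "fun": lambda p: p[0] + p[1] - 3/10},
                    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1/2}]))
```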
-
Motivating example
As we have added complexity, we have encountered two problems:
First, what exactly is meant by uniform, and how can one measure the uniformity of a model?
Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?
-
Motivating example
The maximum entropy method answers both these questions.
Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown.
In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
This is precisely the approach we took in selecting our model p at each step in the above example.
-
Maxent modeling
Consider a random process which produces an output value y, a member of a finite set Y.
y may be any word in the set {dans, en, à, au cours de, pendant}
In generating y, the process may be influenced by some contextual information x, a member of a finite set X.
x could include the words in the English sentence surrounding in
Our goal is to construct a stochastic model that accurately represents the behavior of the random process: given a context x, the process will output y.
-
Training data
Collect a large number of samples (x1, y1), (x2, y2), ..., (xN, yN).
Each sample would consist of a phrase x containing the words surrounding in, together with the translation y of in which the process produced.
Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times (sparsity usually addressed by smoothing).
The training sample can be summarized by its empirical probability distribution:
p̃(x, y) ≡ (1/N) × (number of times that (x, y) occurs in the sample)
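As an illustration (an addition, not from the slides), the empirical distribution takes a few lines of Python; the sample pairs below are hypothetical, with x abbreviated to the single word following in.

```python
# A sketch of the empirical distribution p~(x, y): count each
# (context, translation) pair and divide by the sample size N.
from collections import Counter

samples = [("April", "en"), ("April", "en"), ("the", "dans"),
           ("the", "dans"), ("fact", "en")]
N = len(samples)
p_tilde = {pair: count / N for pair, count in Counter(samples).items()}
print(p_tilde)  # {('April', 'en'): 0.4, ('the', 'dans'): 0.4, ('fact', 'en'): 0.2}
```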
-
Features and constraints
The goal is to construct a statistical model of the process which generated the training sample p̃(x, y).
The building blocks of this model will be a set of statistics of the training sample p̃(x, y):
The frequency that in translated to either dans or en was 3/10
The frequency that in translated to either dans or au cours de was 1/2
And so on
-
Features and constraints
A statistic may also depend on the conditioning information x.
E.g., in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10.
To express such an event, we introduce the indicator function
f(x, y) = 1 if y = en and April follows in
        = 0 otherwise
The expected value of f with respect to the empirical distribution is
p̃(f) ≡ Σ_{x,y} p̃(x, y) f(x, y)   (1)
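A quick sketch (an illustrative addition) of this binary feature and its empirical expectation on the hypothetical toy sample from the previous snippet:

```python
# A binary feature f and its empirical expectation p~(f), equation (1).
from collections import Counter

samples = [("April", "en"), ("April", "en"), ("the", "dans"),
           ("the", "dans"), ("fact", "en")]
N = len(samples)
p_tilde_xy = {pair: c / N for pair, c in Counter(samples).items()}

def f(x, y):
    # Fires when the translation is "en" and "April" follows "in".
    return 1 if y == "en" and x == "April" else 0

# p~(f) = sum over (x, y) of p~(x, y) * f(x, y)
p_tilde_f = sum(p_xy * f(x, y) for (x, y), p_xy in p_tilde_xy.items())
print(p_tilde_f)  # 0.4 on this toy sample
```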
-
Features and constraints
We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f.
We call such a function a feature function, or feature for short.
-
Features and constraints
When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it.
We do this by constraining the expected value that the model assigns to the corresponding feature function f.
The expected value of f with respect to the model p(y|x) is
p(f) ≡ Σ_{x,y} p̃(x) p(y|x) f(x, y)   (2)
where p̃(x) is the empirical distribution of x in the training sample.
-
Features and constraints
We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require
p(f) = p̃(f)   (3)
We call the requirement (3) a constraint equation, or simply a constraint.
Combining (1), (2) and (3) yields
Σ_{x,y} p̃(x) p(y|x) f(x, y) = Σ_{x,y} p̃(x, y) f(x, y)
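A matching sketch (again an illustrative addition) of the model expectation (2) and the constraint check (3); the conditional model below is hypothetical.

```python
# The model expectation p(f) of equation (2), compared against the
# empirical expectation p~(f) as required by constraint (3).
from collections import Counter

phrases = ["dans", "en", "à", "au cours de", "pendant"]
samples = [("April", "en"), ("April", "en"), ("the", "dans"),
           ("the", "dans"), ("fact", "en")]
N = len(samples)
p_tilde_x = {x: c / N for x, c in Counter(x for x, _ in samples).items()}

def f(x, y):
    return 1 if y == "en" and x == "April" else 0

def p(y, x):
    # Hypothetical conditional model, for illustration only.
    if x == "April":
        return 0.9 if y == "en" else 0.1 / 4
    return 1 / 5

# p(f) = sum over x, y of p~(x) p(y|x) f(x, y)  -- equation (2)
p_f = sum(px * p(y, x) * f(x, y)
          for x, px in p_tilde_x.items() for y in phrases)
p_tilde_f = sum(f(x, y) for x, y in samples) / N
print(p_f, p_tilde_f)  # constraint (3) requires these two to be equal
```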
-
Features and constraints
To sum up so far, we now have:
A means of representing statistical phenomena inherent in a sample of data (namely, p̃(f))
A means of requiring that our model of the process exhibit these phenomena (namely, p(f) = p̃(f))
Feature: a binary-valued function of (x, y)
Constraint: an equation between the expected value of the feature function in the model and its expected value in the training data
-
The maxent principle
Suppose that we are given n feature functions f_i, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics.
That is, we would like p to lie in the subset C of P defined by
C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i ∈ {1, 2, ..., n} }   (4)
-
[Figure 1: four sketches of the space P of probability models]
Figure 1:
(a) If we impose no constraints, then all probability models are allowable.
(b) Imposing one linear constraint C1 restricts us to those p ∈ P which lie on the region defined by C1.
(c) A second linear constraint could determine p exactly, if the two constraints are satisfiable; where the intersection of C1 and C2 is non-empty, p ∈ C1 ∩ C2.
(d) Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); no p ∈ P can satisfy them both.
-
The maxent principle
In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent.
Furthermore, the linear constraints in our applications will not even come close to determining p ∈ P uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ ... ∩ Cn of allowable models will be infinite.
-
The maxent principle
Among the models p ∈ C, the maximum entropy philosophy dictates that we select the distribution which is most uniform.
A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy:
H(p) ≡ −Σ_{x,y} p̃(x) p(y|x) log p(y|x)   (5)
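To make (5) concrete, a small sketch (an illustrative addition) that evaluates the conditional entropy of the same hypothetical toy model used above:

```python
# The conditional entropy H(p) of equation (5) for a toy model.
import math

phrases = ["dans", "en", "à", "au cours de", "pendant"]
p_tilde_x = {"April": 0.4, "the": 0.4, "fact": 0.2}

def p(y, x):
    # Hypothetical conditional model, as in the earlier snippet.
    if x == "April":
        return 0.9 if y == "en" else 0.025
    return 0.2

# H(p) = -sum over x, y of p~(x) * p(y|x) * log p(y|x)
H = -sum(px * p(y, x) * math.log(p(y, x))
         for x, px in p_tilde_x.items() for y in phrases)
print(H)  # larger H means a more uniform conditional model
```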
-
The maxent principle
The principle of maximum entropy:
To select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):
p* = argmax_{p ∈ C} H(p)   (6)
-
Exponential form
The maximum entropy principle presents us with a problem in constrained optimization: find the p ∈ C which maximizes H(p). That is, find
p* = argmax_{p ∈ C} H(p)
   = argmax_{p ∈ C} ( −Σ_{x,y} p̃(x) p(y|x) log p(y|x) )   (7)
-
Exponential form
We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize H(p) subject to the following constraints:
1. p(y|x) ≥ 0 for all x, y.
2. Σ_y p(y|x) = 1 for all x. This and the previous condition guarantee that p is a conditional probability distribution.
3. Σ_{x,y} p̃(x) p(y|x) f_i(x, y) = Σ_{x,y} p̃(x, y) f_i(x, y) for i ∈ {1, 2, ..., n}. In other words, p ∈ C, and so p satisfies the active constraints C.
-
Exponential form
To solve this optimization problem, we recast it as the equivalent minimization of −H(p) and introduce the Lagrangian
ξ(p, Λ, γ) ≡ Σ_{x,y} p̃(x) p(y|x) log p(y|x)
  + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) − Σ_{x,y} p̃(x) p(y|x) f_i(x, y) )
  + γ ( 1 − Σ_y p(y|x) )   (8)
where Λ = {λ1, ..., λn} and γ are Lagrange multipliers.
-
Exponential form
Holding Λ and γ fixed, we set the derivative of ξ with respect to p(y|x) to zero:
∂ξ/∂p(y|x) = p̃(x) ( 1 + log p(y|x) ) − Σ_i λ_i p̃(x) f_i(x, y) − γ = 0   (9)
Solving for p(y|x):
log p(y|x) = −1 + γ/p̃(x) + Σ_i λ_i f_i(x, y)
p(y|x) = exp( Σ_i λ_i f_i(x, y) ) · exp( γ/p̃(x) − 1 )   (10)
-
Exponential form
We have thus found the parametric form of p, and so we now take up the task of solving for the optimal values of Λ and γ.
Recognizing that the second factor in this equation is the factor corresponding to the second of the constraints listed above, we can rewrite (10) as
p(y|x) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) )   (11)
where Z(x), the normalizing factor, is given by
Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )   (12)
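A short sketch (an illustrative addition, with an assumed weight λ = 1.5 on a single toy feature) of the parametric form (11)-(12):

```python
# The exponential form: p(y|x) = (1/Z(x)) exp(sum_i lambda_i f_i(x, y)),
# with the normalizer Z(x) from equation (12).
import math

phrases = ["dans", "en", "à", "au cours de", "pendant"]

def f0(x, y):
    # Hypothetical feature: "en" chosen after the context word "April".
    return 1 if y == "en" and x == "April" else 0

features, lambdas = [f0], [1.5]  # assumed weight, for illustration

def p(y, x):
    weight = lambda yy: math.exp(
        sum(l * f(x, yy) for l, f in zip(lambdas, features)))
    Z = sum(weight(yy) for yy in phrases)  # Z(x), equation (12)
    return weight(y) / Z                   # equation (11)

print(p("en", "April"))    # boosted by the active feature
print(p("dans", "April"))  # shares the remaining probability mass
```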
-
Proof of (12): imposing the second constraint, Σ_y p(y|x) = 1, on the parametric form (10) gives
Σ_y exp( Σ_i λ_i f_i(x, y) ) · exp( γ/p̃(x) − 1 ) = 1
so that
exp( γ/p̃(x) − 1 ) = 1 / Σ_y exp( Σ_i λ_i f_i(x, y) )
and hence
p(y|x) = exp( Σ_i λ_i f_i(x, y) ) / Σ_y exp( Σ_i λ_i f_i(x, y) ) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) )
with Z(x) as given in (12).
-
Exponential form
We have found γ but not yet Λ. Towards this end we introduce some further notation. Define the dual function Ψ(Λ) as
Ψ(Λ) ≡ ξ(p, Λ, γ)   (13)
with p and γ taken at the optimal values found above, and the dual optimization problem as
Find Λ* = argmax_Λ Ψ(Λ)   (14)
Since p and γ are fixed, the right-hand side of (14) has only the free variables Λ = {λ1, λ2, ..., λn}.
-
Exponential form
Final result: the maximum entropy model subject to the constraints C has the parametric form p* of (11), where Λ* can be determined by maximizing the dual function Ψ(Λ).
-
Maximum likelihood
The log-likelihood L_p̃(p) of the empirical distribution p̃, as predicted by a model p, is defined by
L_p̃(p) ≡ log Π_{x,y} p(y|x)^p̃(x,y) = Σ_{x,y} p̃(x, y) log p(y|x)   (15)
It is easy to check that the dual function of the previous section is, in fact, just the log-likelihood for the exponential model p; that is,
Ψ(Λ) = L_p̃(p_Λ)   (16)
where p_Λ has the parametric form of (11).
With this interpretation, the result of the previous section can be rephrased as:
The model p* ∈ C with maximum entropy is the model in the parametric family p_Λ(y|x) that maximizes the likelihood of the training sample p̃.
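A sketch (an illustrative addition) of the log-likelihood (15) for the one-feature toy model; by (16), this function of λ is also the dual Ψ(Λ), so maximizing it over λ solves the dual problem (14).

```python
# Log-likelihood (15) of the toy sample under the exponential model (11).
import math

phrases = ["dans", "en", "à", "au cours de", "pendant"]
samples = [("April", "en"), ("April", "en"), ("the", "dans"),
           ("the", "dans"), ("fact", "en")]

def f0(x, y):
    return 1 if y == "en" and x == "April" else 0

def p(y, x, lam):
    Z = sum(math.exp(lam * f0(x, yy)) for yy in phrases)
    return math.exp(lam * f0(x, y)) / Z

def log_likelihood(lam):
    # L(p) = sum over (x, y) of p~(x, y) log p(y|x)  -- equation (15)
    return sum(math.log(p(y, x, lam)) for x, y in samples) / len(samples)

for lam in (0.0, 1.0, 2.0):
    print(lam, round(log_likelihood(lam), 4))
```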
-
Maximum likelihood
To verify (16), substitute the definition of ξ from (8) into Ψ(Λ) ≡ ξ(p*, Λ, γ):
Ψ(Λ) = Σ_{x,y} p̃(x) p*(y|x) log p*(y|x)
  + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) − Σ_{x,y} p̃(x) p*(y|x) f_i(x, y) )
  + γ ( 1 − Σ_y p*(y|x) )
The last term vanishes, since p* is normalized (Σ_y p*(y|x) = 1), leaving
Ψ(Λ) = Σ_{x,y} p̃(x) p*(y|x) log p*(y|x)
  + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) − Σ_{x,y} p̃(x) p*(y|x) f_i(x, y) )
-
Substituting the parametric form (11), log p*(y|x) = Σ_i λ_i f_i(x, y) − log Z(x), into the first term:
Ψ(Λ) = Σ_{x,y} p̃(x) p*(y|x) ( Σ_i λ_i f_i(x, y) − log Z(x) )
  + Σ_i λ_i Σ_{x,y} p̃(x, y) f_i(x, y) − Σ_i λ_i Σ_{x,y} p̃(x) p*(y|x) f_i(x, y)
The first and last feature sums cancel, leaving
Ψ(Λ) = −Σ_{x,y} p̃(x) p*(y|x) log Z(x) + Σ_i λ_i Σ_{x,y} p̃(x, y) f_i(x, y)
  = −Σ_x p̃(x) log Z(x) + Σ_i λ_i p̃(f_i)   [since Σ_y p*(y|x) = 1]
By (15) and (11), this is exactly the log-likelihood L_p̃(p*), which proves (16).
-
Outline (Maxent modeling summary)
We began by seeking the conditional distribution p(y|x) which had maximal entropy H(p) subject to a set of linear constraints (7).
Following the traditional procedure in constrained optimization, we introduced the Lagrangian ξ(p, Λ, γ), where Λ, γ are a set of Lagrange multipliers for the constraints we imposed on p(y|x).
To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve ξ(p, Λ, γ) for p to get a parametric form for p in terms of Λ, γ; (2) then plug p back into ξ(p, Λ, γ), this time solving for Λ, γ.
-
Outline (Maxent modeling summary)
The parametric form for p turns out to have the exponential form (11).
The γ gives rise to the normalizing factor Z(x), given in (12).
The Λ are solved for numerically using the dual function (14). Furthermore, it so happens that this function, Ψ(Λ), is the log-likelihood for the exponential model p of (11).
So what started as the maximization of entropy subject to a set of linear constraints turns out to be equivalent to the unconstrained maximization of likelihood over a certain parametric family of distributions.
-
Outline (Maxent modeling summary)
Table 1 summarizes the primal-dual framework:

                  Primal                      Dual
problem           argmax_{p ∈ C} H(p)         argmax_Λ Ψ(Λ)
description       maximum entropy             maximum likelihood
type of search    constrained optimization    unconstrained optimization
search domain     p ∈ C                       real-valued vectors Λ = {λ1, λ2, ...}
solution          p*                          Λ*

Kuhn-Tucker theorem: p* = p_{Λ*}
-
Computing the parameters
Algorithm 1: Improved Iterative Scaling
Input: feature functions f1, f2, ..., fn; empirical distribution p̃(x, y)
Output: optimal parameter values λi*; optimal model p*
1. Start with λi = 0 for all i ∈ {1, 2, ..., n}
2. Do for each i ∈ {1, 2, ..., n}:
   a. Let Δλi be the solution to
      Σ_{x,y} p̃(x) p(y|x) f_i(x, y) exp( Δλi f#(x, y) ) = p̃(f_i)   (18)
      where f#(x, y) ≡ Σ_{i=1}^{n} f_i(x, y)   (19)
   b. Update the value of λi according to: λi ← λi + Δλi
3. Go to step 2 if not all the λi have converged
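To close, here is a self-contained sketch of Algorithm 1 on the toy data used in the earlier snippets. It is an illustrative addition, not the original authors' code: the two features, the bisection root-finder for (18), and the iteration caps are all assumptions made for the example.

```python
# Improved Iterative Scaling (Algorithm 1) on toy translation data.
import math
from collections import Counter

phrases = ["dans", "en", "à", "au cours de", "pendant"]
samples = [("April", "en"), ("April", "en"), ("the", "dans"),
           ("the", "dans"), ("fact", "en")]
N = len(samples)
features = [lambda x, y: 1 if y == "en" and x == "April" else 0,
            lambda x, y: 1 if y == "dans" else 0]

p_tilde_xy = {pair: c / N for pair, c in Counter(samples).items()}
p_tilde_x = {x: c / N for x, c in Counter(x for x, _ in samples).items()}
p_tilde_f = [sum(pxy * f(x, y) for (x, y), pxy in p_tilde_xy.items())
             for f in features]

def model(lambdas):
    def p(y, x):
        weight = lambda yy: math.exp(
            sum(l * f(x, yy) for l, f in zip(lambdas, features)))
        return weight(y) / sum(weight(yy) for yy in phrases)
    return p

def f_sharp(x, y):
    return sum(f(x, y) for f in features)   # equation (19)

lambdas = [0.0] * len(features)              # step 1
for _ in range(50):                          # step 3: repeat until converged
    p = model(lambdas)
    for i, f in enumerate(features):         # step 2
        # (a) solve equation (18) for delta-lambda_i by bisection;
        # the left-hand side is increasing in delta, so the root is
        # bracketed in [-10, 10] for this toy problem.
        def g(delta):
            return sum(p_tilde_x[x] * p(y, x) * f(x, y)
                       * math.exp(delta * f_sharp(x, y))
                       for x in p_tilde_x for y in phrases) - p_tilde_f[i]
        lo, hi = -10.0, 10.0
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
        lambdas[i] += (lo + hi) / 2          # (b) update lambda_i

print(lambdas)  # fitted weights; model(lambdas) is the maxent model p*
```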