20040611_A Brief Maximum Entropy Tutorial


  • A brief maximum entropy tutorial

  • Overview

    Statistical modeling addresses the problem of modeling the behavior of a random process.

    In constructing this model, we typically have at our disposal a sample of output from the process. From the sample, which constitutes an incomplete state of knowledge about the process, the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process.

    We can then use this representation to make predictions of the future behavior of the process.

  • Motivating example

    Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word "in".

    A model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of "in".

    To develop p, we collect a large sample of instances of the expert's decisions.

  • Motivating example

    Our goal is to

    Extract a set of facts about the decision-making process from the sample (the first task of modeling)

    Construct a model of this process (the second task)

  • Motivating example

    One obvious clue we might glean from the sample is the list of allowed translations:

    in -> {dans, en, à, au cours de, pendant}

    With this information in hand, we can impose our first constraint on our model p:

    p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

    This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys this equation.

    There are an infinite number of models p for which this identity holds.

  • Motivating example

    One model which satisfies the above equation is p(dans) = 1; in other words, the model always predicts dans.

    Another model which obeys this constraint predicts pendant and à each with probability 1/2.

    But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?

  • Motivating example

    Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is

    p(dans) = 1/5
    p(en) = 1/5
    p(à) = 1/5
    p(au cours de) = 1/5
    p(pendant) = 1/5

    This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge.

    It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.

  • Motivating example

    We might hope to glean more clues about the expert's decisions from our sample. Suppose we notice that the expert chose either dans or en 30% of the time.

    Once again there are many probability distributions consistent with these two constraints:

    p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
    p(dans) + p(en) = 3/10

    In the absence of any other knowledge, a reasonable choice for p is again the most uniform, that is, the distribution which allocates its probability as evenly as possible subject to the constraints:

    p(dans) = 3/20
    p(en) = 3/20
    p(à) = 7/30
    p(au cours de) = 7/30
    p(pendant) = 7/30
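
    As a sketch (in Python, with illustrative names that are not part of the original slides), the "split the constrained mass evenly, then split the remainder evenly" reasoning can be written out directly:

        # Most uniform p subject to: sum of all five = 1 and p(dans) + p(en) = 3/10.
        p = {}
        for w in ["dans", "en"]:
            p[w] = (3 / 10) / 2                  # 3/10 shared evenly by two phrases
        for w in ["à", "au cours de", "pendant"]:
            p[w] = (7 / 10) / 3                  # remaining 7/10 shared evenly by three
        print(p)                                 # dans, en -> 0.15; the rest -> 0.2333...
        print(sum(p.values()))                   # 1.0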

  • Motivating example

    Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:

    p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
    p(dans) + p(en) = 3/10
    p(dans) + p(à) = 1/2

    We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.
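
    One way to see what "most uniform subject to the constraints" means operationally is to hand the problem to a numerical optimizer. The sketch below assumes NumPy and SciPy are available; the setup is mine, not the slides', and simply maximizes the entropy -Σ p log p over the five phrases subject to the three constraints:

        import numpy as np
        from scipy.optimize import minimize

        # indices: 0 = dans, 1 = en, 2 = à, 3 = au cours de, 4 = pendant
        def neg_entropy(p):
            p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
            return float(np.sum(p * np.log(p)))  # minimizing this maximizes entropy

        constraints = [
            {"type": "eq", "fun": lambda p: p.sum() - 1.0},         # total probability is 1
            {"type": "eq", "fun": lambda p: p[0] + p[1] - 3 / 10},  # p(dans) + p(en) = 3/10
            {"type": "eq", "fun": lambda p: p[0] + p[2] - 1 / 2},   # p(dans) + p(à) = 1/2
        ]

        res = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
                       bounds=[(0.0, 1.0)] * 5, constraints=constraints)
        print(res.x)   # the most uniform p consistent with all three constraints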

  • Motivating example

    As we have added complexity, we have encountered two problems:

    First, what exactly is meant by "uniform", and how can one measure the uniformity of a model?

    Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?

  • Motivating example

    The maximum entropy method answers both these questions.

    Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown.

    In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.

    This is precisely the approach we took in selecting our model p at each step in the above example.

  • Maxent modeling

    Consider a random process which produces an output value y, a member of a finite set Y. For example, y may be any word in the set {dans, en, à, au cours de, pendant}.

    In generating y, the process may be influenced by some contextual information x, a member of a finite set X. For example, x could include the words in the English sentence surrounding "in".

    Our task is to construct a stochastic model that accurately represents the behavior of the random process: given a context x, the process outputs y.

  • Training data

    Collect a large number of samples (x1, y1), (x2, y2), ..., (xN, yN). Each sample would consist of a phrase x containing the words surrounding "in", together with the translation y of "in" which the process produced.

    We can summarize the training sample in terms of its empirical probability distribution p̃, defined by

    p̃(x, y) ≡ (1/N) × (number of times that (x, y) occurs in the sample)

    Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times; this sparsity is what motivates smoothing.
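
    A minimal sketch of how p̃(x, y) could be tabulated (the sample pairs here are invented purely for illustration):

        from collections import Counter

        # Invented toy sample of (context, translation) pairs.
        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]

        N = len(sample)
        p_tilde = {pair: count / N for pair, count in Counter(sample).items()}
        print(p_tilde[("in April", "en")])   # 0.4 on this toy sample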

  • Features and constraints

    The goal is to construct a statistical model of the process which generated the training sample p̃(x, y).

    The building blocks of this model will be a set of statistics of the training sample, e.g.:

    The frequency that "in" translated to either dans or en was 3/10

    The frequency that "in" translated to either dans or à was 1/2

    And so on

    These statistics are functions of the empirical distribution p̃(x, y) of the training sample.

  • Features and constraints

    We can also consider statistics that depend on the conditioning information x. E.g., in the training sample, if April is the word following "in", then the translation of "in" is en with frequency 9/10.

    To express this, we introduce the indicator function

    f(x, y) = 1 if y = en and April follows "in"; 0 otherwise

    The expected value of f with respect to the empirical distribution p̃(x, y) is

    p̃(f) ≡ Σ_{x,y} p̃(x, y) f(x, y)    (1)
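
    As a sketch of equation (1) on invented toy data (the sample, and the substring test standing in for "April follows in", are my own simplifications):

        from collections import Counter

        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        p_tilde = {pair: c / len(sample) for pair, c in Counter(sample).items()}

        def f(x, y):
            # Indicator feature: 1 if y is "en" and April follows "in", else 0.
            return 1 if y == "en" and "April" in x else 0

        # Equation (1): empirical expectation of f.
        p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde.items())
        print(p_tilde_f)   # 0.4 on this toy sample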

  • Features and constraints

    We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short.

  • Features and constraints

    When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it.

    We do this by constraining the expected value that the model assigns to the corresponding feature function f. The expected value of f with respect to the model p(y | x) is

    p(f) ≡ Σ_{x,y} p̃(x) p(y | x) f(x, y)    (2)

    where p̃(x) is the empirical distribution of x in the training sample.

  • Features and constraints

    We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require

    p(f) = p̃(f)    (3)

    We call the requirement (3) a constraint equation, or simply a constraint.

    Combining (1), (2) and (3) yields

    Σ_{x,y} p̃(x) p(y | x) f(x, y) = Σ_{x,y} p̃(x, y) f(x, y)
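
    A sketch of the two expectations in (1) and (2), and the constraint check (3), again on invented toy data; the placeholder model p(y | x) below is just the uniform conditional distribution:

        from collections import Counter

        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        N = len(sample)
        p_tilde_xy = {pair: c / N for pair, c in Counter(sample).items()}
        p_tilde_x = {x: c / N for x, c in Counter(x for x, _ in sample).items()}
        Y = ["dans", "en", "à", "au cours de", "pendant"]

        def f(x, y):
            return 1 if y == "en" and "April" in x else 0

        def p_model(y, x):
            return 1.0 / len(Y)                  # placeholder: uniform conditional model

        # Equation (2): expected value of f under the model.
        p_f = sum(p_tilde_x[x] * p_model(y, x) * f(x, y) for x in p_tilde_x for y in Y)
        # Equation (1): expected value of f in the training sample.
        p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())
        print(p_f, p_tilde_f)                    # constraint (3) asks these to be equal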

  • Features and constraints

    To sum up so far, we now have:

    A means of representing statistical phenomena inherent in a sample of data (namely, p̃(f))

    A means of requiring that our model of the process exhibit these phenomena (namely, p(f) = p̃(f))

    A feature is a binary-valued function of (x, y). A constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.

  • The maxent principle

    Suppose that we are given n feature functions f_i, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics.

    That is, we would like p to lie in the subset C of P defined by

    C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i = 1, 2, ..., n }    (4)

  • Figure 1

    [Figure: four panels (a)-(d) depicting the space P of probability models under zero, one, and two linear constraints C1, C2]

    If we impose no constraints, then all probability models p ∈ P are allowable.

    Imposing one linear constraint C1 restricts us to those p ∈ P which lie in the region defined by C1.

    A second linear constraint could determine p exactly, if the two constraints are satisfiable, i.e. the intersection of C1 and C2 is non-empty: p ∈ C1 ∩ C2.

    Alternatively, a second linear constraint could be inconsistent with the first (i.e., C1 ∩ C2 = ∅); then no p ∈ P can satisfy them both.

  • The maxent principle

    In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent.

    Furthermore, the linear constraints in our applications will not even come close to determining p ∈ P uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ ... ∩ Cn of allowable models will be infinite.

  • The maxent principle

    Among the models p ∈ C, the maximum entropy philosophy dictates that we select the distribution which is most uniform.

    A mathematical measure of the uniformity of a conditional distribution p(y | x) is provided by the conditional entropy

    H(p) ≡ - Σ_{x,y} p̃(x) p(y | x) log p(y | x)    (5)
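
    A sketch of equation (5) on the same kind of invented toy data; for the uniform placeholder model the conditional entropy comes out to log 5, the largest value possible here:

        import math
        from collections import Counter

        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        p_tilde_x = {x: c / len(sample)
                     for x, c in Counter(x for x, _ in sample).items()}
        Y = ["dans", "en", "à", "au cours de", "pendant"]

        def p_model(y, x):
            return 1.0 / len(Y)                  # placeholder conditional model

        # Equation (5): H(p) = - sum_{x,y} p~(x) p(y|x) log p(y|x)
        H = -sum(p_tilde_x[x] * p_model(y, x) * math.log(p_model(y, x))
                 for x in p_tilde_x for y in Y)
        print(H)                                 # log(5) ≈ 1.609 for the uniform model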

  • The maxent principle

    The principle of maximum entropy: to select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):

    p* = argmax_{p ∈ C} H(p)    (6)

  • Exponential form

    The maximum entropy principle presents us with a problem in constrained optimization: find the p* ∈ C which maximizes H(p). That is, find

    p* = argmax_{p ∈ C} H(p)
       = argmax_{p ∈ C} ( - Σ_{x,y} p̃(x) p(y | x) log p(y | x) )    (7)

  • Exponential form

    We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize H(p) subject to the following constraints:

    1. p(y | x) ≥ 0 for all x, y.

    2. Σ_y p(y | x) = 1 for all x. This and the previous condition guarantee that p is a conditional probability distribution.

    3. Σ_{x,y} p̃(x) p(y | x) f_i(x, y) = Σ_{x,y} p̃(x, y) f_i(x, y) for i ∈ {1, 2, ..., n}. In other words, p ∈ C, and so p satisfies the active constraints C.

  • Exponential form

    To solve this optimization problem, introduce the Lagrangian (written here for the equivalent problem of minimizing -H(p)):

    ξ(p, Λ, γ) ≡ Σ_{x,y} p̃(x) p(y | x) log p(y | x)
                 + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) - Σ_{x,y} p̃(x) p(y | x) f_i(x, y) )
                 + γ ( 1 - Σ_y p(y | x) )    (8)

    where λ_1, ..., λ_n and γ are Lagrange multipliers for the feature constraints and the normalization constraint, respectively.

  • Exponential form

    Holding Λ and γ fixed, set the derivative of ξ with respect to p(y | x) to zero:

    ∂ξ/∂p(y | x) = p̃(x) ( 1 + log p(y | x) ) - Σ_i λ_i p̃(x) f_i(x, y) - γ = 0    (9)

    Solving for p(y | x):

    log p(y | x) = Σ_i λ_i f_i(x, y) + γ/p̃(x) - 1

    p(y | x) = exp( Σ_i λ_i f_i(x, y) ) exp( γ/p̃(x) - 1 )    (10)

  • Exponential form

    We have thus found the parametric form of p, and so we now take up the task of solving for the optimal values of Λ and γ.

    Recognizing that the second factor in equation (10) is the factor corresponding to the second of the constraints listed above (the normalization of p(y | x)), we can rewrite (10) as

    p(y | x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )    (11)

    where Z(x), the normalizing factor, is given by

    Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )    (12)
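
    A sketch of the exponential form (11) and its normalizer (12); the single feature and its weight below are made up for illustration:

        import math

        Y = ["dans", "en", "à", "au cours de", "pendant"]

        def f1(x, y):                            # illustrative feature
            return 1 if y == "en" and "April" in x else 0

        features = [f1]
        lam = [1.5]                              # made-up weight for lambda_1

        def Z(x):
            # Equation (12): Z(x) = sum_y exp( sum_i lambda_i f_i(x, y) )
            return sum(math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                       for y in Y)

        def p(y, x):
            # Equation (11): p(y|x) = exp( sum_i lambda_i f_i(x, y) ) / Z(x)
            return math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) / Z(x)

        print(p("en", "in April"))               # pushed above 1/5 by the positive weight
        print(sum(p(y, "in April") for y in Y))  # 1.0: a proper conditional distribution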

  • Exponential form

    Proof of (12): the second constraint requires Σ_y p(y | x) = 1 for every x. Substituting (10),

    Σ_y exp( Σ_i λ_i f_i(x, y) ) exp( γ/p̃(x) - 1 ) = 1

    so that

    exp( γ/p̃(x) - 1 ) = 1 / Σ_y exp( Σ_i λ_i f_i(x, y) ) = 1 / Z(x)

    and substituting this back into (10) gives exactly the form (11):

    p(y | x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )

  • Exponential form

    We have found γ (it is absorbed into Z(x)) but not yet Λ. Towards this end we introduce some further notation. Define the dual function Ψ(Λ) as

    Ψ(Λ) ≡ ξ(p_Λ, Λ, γ)    (13)

    and the dual optimization problem as

    Find Λ* = argmax_Λ Ψ(Λ)    (14)

    Since p_Λ and γ are determined by Λ, the right-hand side of (14) has only the free variables Λ = {λ_1, λ_2, ..., λ_n}.

  • Exponential form

    Final result: the maximum entropy model subject to the constraints C has the parametric form p* = p_{Λ*} of (11), where the optimal parameters Λ* can be determined by maximizing the dual function Ψ(Λ).

  • Maximum likelihood

    The log-likelihood L_p̃(p) of the empirical distribution p̃ as predicted by a model p is defined by

    L_p̃(p) ≡ log Π_{x,y} p(y | x)^p̃(x,y) = Σ_{x,y} p̃(x, y) log p(y | x)    (15)

    It is easy to check that the dual function Ψ(Λ) of the previous section is, in fact, just the log-likelihood for the exponential model p_Λ; that is,

    Ψ(Λ) = L_p̃(p_Λ)    (16)

    where p_Λ has the parametric form of (11).

    With this interpretation, the result of the previous section can be rephrased as: the model p* ∈ C with maximum entropy is the model in the parametric family p_Λ(y | x) that maximizes the likelihood of the training sample p̃.
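
    Because Ψ(Λ) = Σ_i λ_i p̃(f_i) - Σ_x p̃(x) log Z(x) is an unconstrained, concave function of Λ, it can be maximized with any generic optimizer. The sketch below uses SciPy's L-BFGS rather than the algorithm of the later slides, on invented toy data with a single feature:

        import numpy as np
        from scipy.optimize import minimize

        Y = ["dans", "en", "à", "au cours de", "pendant"]
        # Invented toy sample and one feature, for illustration only.
        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]
        N = len(sample)
        features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0]

        xs = sorted({x for x, _ in sample})
        p_tilde_x = {x: sum(1 for x2, _ in sample if x2 == x) / N for x in xs}
        p_tilde_f = np.array([sum(f(x, y) for x, y in sample) / N for f in features])

        def neg_dual(lam):
            # -Psi(lam) = sum_x p~(x) log Z(x) - sum_i lam_i p~(f_i)
            psi = float(lam @ p_tilde_f)
            for x in xs:
                Zx = sum(np.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                         for y in Y)
                psi -= p_tilde_x[x] * np.log(Zx)
            return -psi

        res = minimize(neg_dual, x0=np.zeros(len(features)), method="L-BFGS-B")
        print(res.x)   # lambda* maximizing Psi, i.e. the maximum entropy / ML model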

  • Maximum likelihood

    To see why (16) holds, evaluate (8) at the parametric form p_Λ:

    Ψ(Λ) = ξ(p_Λ, Λ, γ)
         = Σ_{x,y} p̃(x) p_Λ(y | x) log p_Λ(y | x)
           + Σ_i λ_i ( Σ_{x,y} p̃(x, y) f_i(x, y) - Σ_{x,y} p̃(x) p_Λ(y | x) f_i(x, y) )
           + γ ( 1 - Σ_y p_Λ(y | x) )

    Since p_Λ(y | x) is normalized, the γ term vanishes.

  • Maximum likelihood

    Expanding log p_Λ(y | x) = Σ_i λ_i f_i(x, y) - log Z(x) in the first term, the Σ_i λ_i p_Λ(f_i) pieces cancel and we are left with

    Ψ(Λ) = Σ_i λ_i Σ_{x,y} p̃(x, y) f_i(x, y) - Σ_x p̃(x) log Z(x)
         = Σ_i λ_i p̃(f_i) - Σ_x p̃(x) log Z(x)

    Expanding (15) for the exponential model gives the same expression:

    L_p̃(p_Λ) = Σ_{x,y} p̃(x, y) log p_Λ(y | x)
             = Σ_{x,y} p̃(x, y) Σ_i λ_i f_i(x, y) - Σ_x p̃(x) log Z(x)
             = Σ_i λ_i p̃(f_i) - Σ_x p̃(x) log Z(x)

    which establishes (16): Ψ(Λ) = L_p̃(p_Λ).

  • Outline (maxent modeling summary)

    We began by seeking the conditional distribution p(y | x) which had maximal entropy H(p) subject to a set of linear constraints (7).

    Following the traditional procedure in constrained optimization, we introduced the Lagrangian ξ(p, Λ, γ), where Λ, γ are a set of Lagrange multipliers for the constraints we imposed on p(y | x).

    To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve ξ(p, Λ, γ) for p to get a parametric form for p in terms of Λ, γ; (2) then plug p back in to ξ(p, Λ, γ), this time solving for Λ, γ.

  • Outline (maxent modeling summary)

    The parametric form for p turns out to have the exponential form (11).

    The γ gives rise to the normalizing factor Z(x), given in (12).

    The Λ will be solved for numerically using the dual function (14). Furthermore, it so happens that this function, Ψ(Λ), is the log-likelihood for the exponential model p of (11). So what started as the maximization of entropy subject to a set of linear constraints turns out to be equivalent to the unconstrained maximization of the likelihood of a certain parametric family of distributions.

  • Outline (maxent modeling summary)

    Table 1 summarizes the primal-dual framework:

                        Primal                        Dual
    problem             argmax_{p ∈ C} H(p)           argmax_Λ Ψ(Λ)
    description         maximum entropy               maximum likelihood
    type of search      constrained optimization      unconstrained optimization
    search domain       p ∈ C                         real-valued vectors {λ_1, λ_2, ..., λ_n}
    solution            p*                            Λ*

    Kuhn-Tucker theorem: p* = p_{Λ*}

  • Computing the parameters

    Algorithm 1: Improved Iterative Scaling

    Input: feature functions f_1, f_2, ..., f_n; empirical distribution p̃(x, y)
    Output: optimal parameter values λ_i*; optimal model p_{Λ*}

    1. Start with λ_i = 0 for all i ∈ {1, 2, ..., n}

    2. Do for each i ∈ {1, 2, ..., n}:

       a. Let Δλ_i be the solution to

          Σ_{x,y} p̃(x) p_Λ(y | x) f_i(x, y) exp( Δλ_i f#(x, y) ) = p̃(f_i)    (18)

          where f#(x, y) ≡ Σ_i f_i(x, y)    (19)

       b. Update the value of λ_i according to: λ_i <- λ_i + Δλ_i

    3. Go to step 2 if not all the λ_i have converged
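
    The update equation (18) has to be solved for Δλ_i; below is a compact sketch of Algorithm 1 on invented toy data, solving (18) by one-dimensional root finding (SciPy's brentq), which is one convenient way to do it. Everything here (sample, features, bracket, iteration count) is illustrative rather than part of the original slides.

        import math
        from scipy.optimize import brentq

        Y = ["dans", "en", "à", "au cours de", "pendant"]
        sample = [("in April", "en"), ("in April", "en"), ("in April", "dans"),
                  ("in the box", "dans"), ("in time", "à")]    # invented toy data
        N = len(sample)
        features = [lambda x, y: 1.0 if y == "en" and "April" in x else 0.0,
                    lambda x, y: 1.0 if y == "dans" else 0.0]
        xs = sorted({x for x, _ in sample})
        p_tilde_x = {x: sum(1 for x2, _ in sample if x2 == x) / N for x in xs}
        p_tilde_f = [sum(f(x, y) for x, y in sample) / N for f in features]

        def p_model(lam, y, x):                  # the exponential model (11), (12)
            score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
            return score(y) / sum(score(yy) for yy in Y)

        def f_sharp(x, y):                       # equation (19): f#(x, y) = sum_i f_i(x, y)
            return sum(f(x, y) for f in features)

        lam = [0.0] * len(features)              # step 1: lambda_i = 0 for all i
        for _ in range(50):                      # step 3: repeat until (near) convergence
            for i, fi in enumerate(features):    # step 2: treat each lambda_i in turn
                def g(d):                        # equation (18) as a root-finding problem
                    return sum(p_tilde_x[x] * p_model(lam, y, x) * fi(x, y)
                               * math.exp(d * f_sharp(x, y))
                               for x in xs for y in Y) - p_tilde_f[i]
                lam[i] += brentq(g, -20.0, 20.0) # step 2b: lambda_i <- lambda_i + delta
        print(lam)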