Expectation Propagation


DESCRIPTION

This is the deck for a Hulu internal machine learning workshop; it introduces the background, theory, and applications of the expectation propagation method.

TRANSCRIPT

Page 1: Expectation propagation

Expectation Propagation: Theory and Application

Dong Guo, Research Workshop 2013, Hulu Internal

See more details in:
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/

Page 2: Expectation propagation

Outline

• Overview
• Background
• Theory
• Applications

Page 3: Expectation propagation

OVERVIEW  

Page 4: Expectation propagation

Bayesian Paradigm

• Infer the posterior distribution: Prior + Data -> Posterior -> Make decision

Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.

Page 5: Expectation propagation

Bayesian inference methods

• Exact inference
  – Belief propagation (exact on tree-structured graphs)
• Approximate inference
  – Stochastic (sampling)
  – Deterministic
    • Assumed density filtering
    • Expectation propagation
    • Variational Bayes

Page 6: Expectation propagation

Message passing

• A form of communication used in multiple domains of computer science
  – Parallel computing (MPI)
  – Object-oriented programming
  – Inter-process communication
  – Bayesian inference

• A family of methods for inferring posterior distributions

Page 7: Expectation propagation

Expectation Propagation

• Belongs to the message passing family

• Approximate method (iteration is needed)

• Very popular in Bayesian inference, especially for graphical models

Page 8: Expectation propagation

Researchers

• Thomas Minka
  – EP was proposed in his PhD thesis

• Kevin P. Murphy
  – Machine Learning: A Probabilistic Perspective

Page 9: Expectation propagation

BACKGROUND  

Page 10: Expectation propagation

Background

• (Truncated) Gaussian
• Exponential family
• Graphical model
• Factor graph
• Belief propagation
• Moment matching

Page 11: Expectation propagation

Gaussian and Truncated Gaussian

• Gaussian operations are the basis of EP inference (a sketch follows below)
  – Gaussian addition / multiplication / division
  – Gaussian integrals

• Truncated Gaussians are used in many EP applications

• See details here
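Since every EP update with Gaussian approximating factors reduces to multiplying and dividing Gaussians, here is a minimal sketch of those two operations in natural-parameter form (the function names and worked numbers are my own, not from the deck):

```python
def gauss_multiply(m1, v1, m2, v2):
    """Product of two Gaussian densities is an (unnormalized) Gaussian:
    precisions (1/v) add, and precision-weighted means (m/v) add."""
    tau = 1.0 / v1 + 1.0 / v2
    rho = m1 / v1 + m2 / v2
    return rho / tau, 1.0 / tau  # (mean, variance)

def gauss_divide(m1, v1, m2, v2):
    """Quotient of two Gaussians, used in EP to remove one factor from the
    posterior; precisions and precision-weighted means subtract."""
    tau = 1.0 / v1 - 1.0 / v2
    rho = m1 / v1 - m2 / v2
    return rho / tau, 1.0 / tau

# N(0,1) * N(1,4) gives mean 0.2, variance 0.8 ...
m, v = gauss_multiply(0.0, 1.0, 1.0, 4.0)
# ... and dividing N(1,4) back out recovers N(0,1).
print(gauss_divide(m, v, 1.0, 4.0))  # (0.0, 1.0)
```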

Page 12: Expectation propagation

Exponential family distribution

• Very good summary in Wikipedia
• Sufficient statistics of the Gaussian distribution: (x, x²)
• Typical distributions:

$$q(z) = h(z)\,g(\eta)\exp\{\eta^{T}u(z)\}$$

Note: the above 4 figures are from Wikipedia.
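As a concrete instance of the formula above, a small sketch (my own illustration) mapping a Gaussian between its (mean, variance) form and the natural parameters $\eta$ that pair with the sufficient statistics $u(x) = (x, x^2)$:

```python
def gaussian_to_natural(m, v):
    """N(x | m, v) = exp{eta1*x + eta2*x^2 - A(eta)}: eta1 = m/v, eta2 = -1/(2v)."""
    return m / v, -0.5 / v

def natural_to_gaussian(eta1, eta2):
    """Invert the mapping back to moment parameters."""
    v = -0.5 / eta2
    return eta1 * v, v  # (mean, variance)

# Round trip: (2.0, 3.0) -> natural parameters -> (2.0, 3.0)
print(natural_to_gaussian(*gaussian_to_natural(2.0, 3.0)))
```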

Page 13: Expectation propagation

Graphical Models

• Directed graph (Bayesian Network)
• Undirected graph (Conditional Random Field)

$$p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)$$

[Figure: an example directed graph and an undirected graph over variables x1, x2, x3, x4]
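To make the factorization concrete, a toy sketch (the chain structure and CPT numbers are hypothetical, my own) that evaluates $p(\mathbf{x}) = \prod_k p(x_k \mid \mathrm{pa}_k)$ on a three-node Bayesian network:

```python
# Hypothetical chain x1 -> x2 -> x3 over binary variables, with made-up CPTs.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}
p_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}

def joint(x1, x2, x3):
    # p(x) = p(x1) * p(x2 | x1) * p(x3 | x2): one local factor per node.
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]

# The product of local CPTs is a properly normalized joint distribution.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12
print(joint(1, 1, 0))  # probability of one configuration
```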

Page 14: Expectation propagation

Factor graph

• Expresses the relations between variable nodes explicitly
  – A relation on an edge becomes a factor node
• Hides the difference between BN and CRF during inference
• Makes inference more intuitive

[Figure: the example graphs above redrawn as factor graphs over x1, x2, x3, x4, with factor nodes such as f_a and f_c]

Page 15: Expectation propagation

BELIEF  PROPAGATION  

Page 16: Expectation propagation

Belief Propagation Overview

• Exact Bayesian method to infer marginal distributions
  – 'sum-product' message passing

• Key components
  – Calculating the posterior distribution of a variable node
  – Two kinds of messages

Page 17: Expectation propagation

Posterior distribution of a variable node

• Factor graph

$$p(\mathbf{X}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s), \text{ for any variable } x \text{ in the graph}$$

$$p(x) = \sum_{\mathbf{X} \setminus x} p(\mathbf{X}) = \sum_{\mathbf{X} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)} \Big[\sum_{X_s} F_s(x, X_s)\Big] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

in which $\mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s)$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Page 18: Expectation propagation

Message: factor -> variable node

• Factor graph

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{x_m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

in which $\{x_1, \ldots, x_M\}$ is the set of variables on which the factor $f_s$ depends

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Page 19: Expectation propagation

Message: variable -> factor node

• Factor graph

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Page 20: Expectation propagation

The complete steps of BP

• Steps to calculate the posterior distribution of a given variable node (a runnable sketch follows below)
  – Step 1: construct the factor graph
  – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
  – Step 3: apply the message passing steps recursively until the root node has received messages from all of its neighbors
  – Step 4: get the marginal distribution by multiplying all of the incoming messages

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
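A compact sketch of these four steps on the simplest non-trivial case, a chain factor graph over three binary variables (the structure and the potential tables are my own illustration, not from the slides):

```python
import numpy as np

# Chain x1 -- f12 -- x2 -- f23 -- x3; the factors are made-up tables.
f12 = np.array([[1.0, 0.5],
                [0.5, 2.0]])   # f12[x1, x2]
f23 = np.array([[2.0, 1.0],
                [1.0, 0.5]])   # f23[x2, x3]

# Step 2: treat x3 as the root; the leaf variable x1 sends a unit message.
mu_x1_to_f12 = np.ones(2)
# Step 3: pass messages toward the root.
mu_f12_to_x2 = f12.T @ mu_x1_to_f12   # sum over x1 of f12 * incoming message
mu_x2_to_f23 = mu_f12_to_x2           # x2 has no other incoming factors
mu_f23_to_x3 = f23.T @ mu_x2_to_f23   # sum over x2
# Step 4: marginal at the root = product of incoming messages, normalized.
p_x3 = mu_f23_to_x3 / mu_f23_to_x3.sum()

# Check against brute-force enumeration of the (unnormalized) joint.
joint = np.einsum('ab,bc->abc', f12, f23)
assert np.allclose(p_x3, joint.sum(axis=(0, 1)) / joint.sum())
print(p_x3)  # [2/3, 1/3] for these tables
```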

Page 21: Expectation propagation

BP: example

• Infer the marginal distribution of x_3
• Infer the marginal distributions of all variables

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.

Page 22: Expectation propagation

The posterior is sometimes intractable

• Examples
  – Infer the mean of a Gaussian distribution
  – Ad predictor

$$p(x \mid \theta) = (1 - w)\,\mathcal{N}(x \mid \theta, I) + w\,\mathcal{N}(x \mid 0, aI)$$

$$p(\theta) = \mathcal{N}(\theta \mid 0, bI)$$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.

Page 23: Expectation propagation

Distribution Approximation

Approximate $p(x)$ with $q(x)$, which belongs to the exponential family, such that $q(x) = h(x)\,g(\eta)\exp\{\eta^{T}u(x)\}$:

$$\mathrm{KL}(p \,\|\, q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx$$

$$= -\ln g(\eta) - \eta^{T}\,\mathbb{E}_{p(x)}[u(x)] + \text{const},$$

where the const terms are independent of the natural parameter $\eta$.

Minimize $\mathrm{KL}(p \,\|\, q)$ by setting the gradient with respect to $\eta$ to zero:

$$-\nabla \ln g(\eta) = \mathbb{E}_{p(x)}[u(x)]$$

By leveraging formula (2.226) in PRML:

$$\mathbb{E}_{q(x)}[u(x)] = -\nabla \ln g(\eta) = \mathbb{E}_{p(x)}[u(x)]$$

Page 24: Expectation propagation

Moment matching

• Moments of a distribution: the $k$'th moment is $M_k = \int_a^b x^k f(x)\,dx$

It is called moment matching when $q(x)$ is a Gaussian distribution; then $u(x) = (x, x^2)^T$:

$$\int q(x)\,x\,dx = \int p(x)\,x\,dx, \qquad \int q(x)\,x^2\,dx = \int p(x)\,x^2\,dx$$

$$\Rightarrow \mathrm{mean}_{q(x)} = \mathrm{mean}_{p(x)}, \qquad \mathrm{variance}_{q(x)} = \int q(x)\,x^2\,dx - (\mathrm{mean}_{q(x)})^2 = \mathrm{variance}_{p(x)}$$
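A numeric sketch of moment matching (my own example): project a two-component Gaussian mixture $p(x)$ onto a single Gaussian $q(x)$ by matching the expectations of $u(x) = (x, x^2)$, which is exactly the $\mathrm{KL}(p \,\|\, q)$ minimizer derived on the previous slide:

```python
import numpy as np

# p(x) = 0.7 * N(-1, 0.5) + 0.3 * N(2, 1.0), with made-up weights/moments.
ws = np.array([0.7, 0.3])   # mixture weights
ms = np.array([-1.0, 2.0])  # component means
vs = np.array([0.5, 1.0])   # component variances

# Moments of a mixture are the weight-averaged component moments.
mean_p = np.sum(ws * ms)              # E_p[x]
ex2_p = np.sum(ws * (vs + ms ** 2))   # E_p[x^2] = sum_i w_i * (v_i + m_i^2)
var_p = ex2_p - mean_p ** 2

# q(x) = N(mean_p, var_p) matches both expectations of u(x) = (x, x^2).
print("matched Gaussian: mean %.2f, variance %.2f" % (mean_p, var_p))
```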

Page 25: Expectation propagation

EXPECTATION PROPAGATION = Belief Propagation + Moment Matching?

Page 26: Expectation propagation

Key Idea

• Approximate each factor with a Gaussian distribution

• Approximate corresponding factor pairs one by one?

• Approximate each factor in turn in the context of all of the remaining factors (proposed by Minka)

Refine the factor $\tilde{f}_j(\theta)$ by ensuring that $q^{\mathrm{new}}(\theta) \propto \tilde{f}_j(\theta)\,q^{\backslash j}(\theta)$ is close to $f_j(\theta)\,q^{\backslash j}(\theta)$, in which $q^{\backslash j}(\theta) = \frac{q(\theta)}{\tilde{f}_j(\theta)}$

Page 27: Expectation propagation

EP: the detailed steps

1. Initialize all of the approximating factors $\tilde{f}_i(\theta)$.

2. Initialize the posterior approximation by setting $q(\theta) \propto \prod_i \tilde{f}_i(\theta)$.

3. Until convergence:

(a) Choose a factor $\tilde{f}_j(\theta)$ to refine.

(b) Remove $\tilde{f}_j(\theta)$ from the posterior by division: $q^{\backslash j}(\theta) = \frac{q(\theta)}{\tilde{f}_j(\theta)}$.

(c) Get the new posterior by setting the sufficient statistics of $q^{\mathrm{new}}(\theta)$ equal to those of $\frac{f_j(\theta)\,q^{\backslash j}(\theta)}{Z_j}$ (i.e., minimize $\mathrm{KL}\big(\frac{f_j(\theta)\,q^{\backslash j}(\theta)}{Z_j} \,\big\|\, q^{\mathrm{new}}(\theta)\big)$), in which $Z_j = \int f_j(\theta)\,q^{\backslash j}(\theta)\,d\theta$.

(d) Get the refined factor: $\tilde{f}_j(\theta) = Z_j\,\frac{q^{\mathrm{new}}(\theta)}{q^{\backslash j}(\theta)}$.

Page 28: Expectation propagation

Example: the clutter problem

• Infer the mean of a Gaussian distribution
• Want to try MLE, but the exact posterior is a mixture of 2^N Gaussians, which is intractable

• Approximate with
  – a single Gaussian approximating the mixture of Gaussians

$$p(x \mid \theta) = (1 - w)\,\mathcal{N}(x \mid \theta, I) + w\,\mathcal{N}(x \mid 0, aI)$$

$$p(\theta) = \mathcal{N}(\theta \mid 0, bI)$$

$$q(\theta) = \mathcal{N}(\theta \mid m, vI), \text{ and each factor } \tilde{f}_n(\theta) = \mathcal{N}(\theta \mid m_n, v_n I)$$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
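Putting the Page 27 loop together with this model, here is a sketch of 1-D EP for the clutter problem. To keep it short and transparent, the tilted moments in step (c) are computed by numerical quadrature rather than the closed forms in PRML, there is no damping or guard against negative site precisions (which robust EP code would add), and the constants, grid, and synthetic data are my own choices:

```python
import numpy as np

# 1-D clutter model: x_n ~ (1 - w) N(theta, 1) + w N(0, a), theta ~ N(0, b).
w, a, b = 0.2, 10.0, 100.0
rng = np.random.default_rng(0)
n, theta_true = 50, 2.0
is_clutter = rng.random(n) < w
x = np.where(is_clutter, rng.normal(0.0, np.sqrt(a), n),
             rng.normal(theta_true, 1.0, n))

def npdf(y, m, v):
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

# Sites (approximating factors) in natural form; zero = initially flat.
site_tau = np.zeros(n)   # site precisions
site_rho = np.zeros(n)   # site precision-times-mean
grid = np.linspace(-10.0, 10.0, 4001)

for sweep in range(10):
    for i in range(n):
        # Global q(theta): prior plus all sites (natural parameters add).
        tau_q = 1.0 / b + site_tau.sum()
        rho_q = site_rho.sum()
        # (b) cavity: divide out site i by subtracting its parameters.
        tau_c, rho_c = tau_q - site_tau[i], rho_q - site_rho[i]
        m_c, v_c = rho_c / tau_c, 1.0 / tau_c
        # (c) tilted distribution f_i(theta) * cavity; match its moments.
        lik = (1 - w) * npdf(x[i], grid, 1.0) + w * npdf(x[i], 0.0, a)
        tilted = lik * npdf(grid, m_c, v_c)
        Z = np.trapz(tilted, grid)
        m_new = np.trapz(grid * tilted, grid) / Z
        v_new = np.trapz(grid ** 2 * tilted, grid) / Z - m_new ** 2
        # (d) refined site = moment-matched posterior / cavity.
        site_tau[i] = 1.0 / v_new - tau_c
        site_rho[i] = m_new / v_new - rho_c

tau_q = 1.0 / b + site_tau.sum()
print("q(theta): mean %.3f, variance %.4f" % (site_rho.sum() / tau_q, 1.0 / tau_q))
```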

Page 29: Expectation propagation

Example: the clutter problem (2)

• Approximate a complex factor (e.g., a mixture of Gaussians) with a Gaussian

$f_n(\theta)$ in blue, $\tilde{f}_n(\theta)$ in red, and $q^{\backslash n}(\theta)$ in green. Remember that the variance of $q^{\backslash n}(\theta)$ is usually very small, so $\tilde{f}_n(\theta)$ only needs to approximate $f_n(\theta)$ over a small range.

Note: the above 2 figures are from the book 'Pattern Recognition and Machine Learning'.

Page 30: Expectation propagation

Application: Bayesian CTR predictor for Bing

• See the details here
  – Inference step by step
  – Making predictions

• Some insights (a sketch of the update follows below)
  – The variance of each feature weight decreases after every exposure
  – A sample with more features will have a bigger variance
• Independence assumption for the features
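A sketch in the style of the paper's online update (one ADF/EP step per impression, with a factorized Gaussian over feature weights and a probit link). The feature names and the prior are hypothetical; v and w are the truncated-Gaussian correction functions from the Graepel et al. paper:

```python
from math import erf, exp, pi, sqrt

def pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def v(z):            # mean correction: N(z) / Phi(z)
    return pdf(z) / cdf(z)

def w(z):            # variance correction: v(z) * (v(z) + z), always >= 0
    return v(z) * (v(z) + z)

def update(weights, active, y, beta=0.5):
    """One online update for label y in {+1, -1} (click / no click).
    weights maps feature -> (mean, variance); 'active' features fired."""
    total_var = beta ** 2 + sum(weights[f][1] for f in active)
    t = sum(weights[f][0] for f in active)
    z = y * t / sqrt(total_var)
    for f in active:
        m, s2 = weights[f]
        m += y * (s2 / sqrt(total_var)) * v(z)
        s2 *= 1.0 - (s2 / total_var) * w(z)   # variance shrinks each exposure
        weights[f] = (m, s2)

# Hypothetical impression with three active features, each with prior N(0, 1).
weights = {f: (0.0, 1.0) for f in ("ad=42", "query=shoes", "position=top")}
update(weights, list(weights), y=+1)
print(weights)
```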

Page 31: Expectation propagation

Experimentation

• The dataset is very inhomogeneous

• Performance (AUC in the table below)
  – Other metrics

• Pros: speed, low parameter-tuning cost, online learning support, interpretability, easy to add more factors

• Cons: sparsity
• Code

Model          FTRL     OWLQN    Ad predictor
AUC            0.638    0.641    0.639

Page 32: Expectation propagation

Application: XBOX skill rating system

See details on pages 793-798 of Machine Learning: A Probabilistic Perspective.

Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
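For flavor, a sketch of the two-player, no-draw update from the TrueSkill paper (the draw margin is taken as zero for brevity, and the prior constants are the paper's defaults; treat this as an illustration rather than the full factor-graph inference):

```python
from math import erf, exp, pi, sqrt

def v(t):  # additive mean correction, N(t) / Phi(t)
    return (exp(-0.5 * t * t) / sqrt(2.0 * pi)) / (0.5 * (1.0 + erf(t / sqrt(2.0))))

def w(t):  # multiplicative variance correction
    return v(t) * (v(t) + t)

def trueskill_update(winner, loser, beta=25.0 / 6.0):
    """Each player is (mu, sigma^2); beta is the performance noise."""
    (mu_w, s2_w), (mu_l, s2_l) = winner, loser
    c = sqrt(2.0 * beta ** 2 + s2_w + s2_l)
    t = (mu_w - mu_l) / c
    mu_w += s2_w / c * v(t)              # winner's skill estimate goes up
    mu_l -= s2_l / c * v(t)              # loser's goes down
    s2_w *= 1.0 - s2_w / c ** 2 * w(t)   # both become more certain
    s2_l *= 1.0 - s2_l / c ** 2 * w(t)
    return (mu_w, s2_w), (mu_l, s2_l)

# Two fresh players with the paper's default prior N(25, (25/3)^2).
p1 = p2 = (25.0, (25.0 / 3.0) ** 2)
print(trueskill_update(p1, p2))
```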

Page 33: Expectation propagation

Apply to all Bayesian models

• Infer.NET (Microsoft / Bishop)
  – A framework for running Bayesian inference in graphical models
  – Model-based machine learning

Page 34: Expectation propagation

References

• Books
  – Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
  – Chapter 22 of Machine Learning: A Probabilistic Perspective

• Papers
  – A Family of Algorithms for Approximate Bayesian Inference
  – From Belief Propagation to Expectation Propagation
  – TrueSkill: A Bayesian Skill Rating System
  – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine

• Roadmap for EP