MLPI Lecture 3: Advanced Sampling Techniques
TRANSCRIPT
Overview
• Collapsed Gibbs Sampling
• Sampling with Auxiliary Variables
• Slice Sampling
• Simulated Tempering & Parallel Tempering
• Swendsen-Wang Algorithm
• Hamiltonian Monte Carlo
2
Collapsed Gibbs Sampling
• Basic idea: replace the original conditional distribution with a conditional distribution of a marginal distribution, often called a reduced conditional distribution.
• In the example above, we consider a marginal distribution:
7
Collapsed Gibbs Sampling (cont'd)
• Draw , with  marginalized out, as:
• Draw
• Can we exchange the order of these two steps? Why?
8
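The slides' own model is not shown, so as a concrete sketch, consider a hypothetical two-component model: labels z_i ~ Bernoulli(pi) with pi ~ Beta(a, b), and observations x_i | z_i ~ Normal(mu[z_i], 1) with known means. Collapsed Gibbs integrates pi out and samples each z_i from its reduced conditional p(z_i | z_{-i}, x):

```python
import math, random

random.seed(0)

# Hypothetical model (assumed for illustration, not the slides' example):
# z_i ~ Bernoulli(pi), pi ~ Beta(a, b), x_i | z_i ~ Normal(mu[z_i], 1).
# Collapsed Gibbs marginalizes pi and updates each z_i from the
# reduced conditional p(z_i | z_{-i}, x).
a, b = 1.0, 1.0
mu = [-2.0, 2.0]

# synthetic data: first half from component 0, second half from component 1
truth = [0] * 50 + [1] * 50
x = [random.gauss(mu[t], 1.0) for t in truth]
n = len(x)

def normal_pdf(v, m):
    return math.exp(-0.5 * (v - m) ** 2) / math.sqrt(2 * math.pi)

z = [random.randint(0, 1) for _ in range(n)]
ones = sum(z)

for sweep in range(200):
    for i in range(n):
        ones -= z[i]                      # remove z_i from the counts
        # reduced conditional: pi is marginalized into a Polya-urn term
        p1 = (ones + a) / (n - 1 + a + b) * normal_pdf(x[i], mu[1])
        p0 = (n - 1 - ones + b) / (n - 1 + a + b) * normal_pdf(x[i], mu[0])
        z[i] = 1 if random.random() < p1 / (p0 + p1) else 0
        ones += z[i]

accuracy = sum(zi == t for zi, t in zip(z, truth)) / n
print(accuracy)
```

Because pi has been marginalized out, each z_i update depends on the current counts of the other labels, which is why the order of steps matters (next slide).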
Basic Guidelines
• Order of steps matters!
• Generally, one can move components from "being sampled" to "being conditioned on".
• Replacing outputs with intermediates would change the stationary distribution.
• A variable can be updated multiple times in an iteration.
9
Rao-Blackwell Theorem
Consider an example  and suppose we want to estimate . Suppose we have two tractable ways to do so:
(1) draw , and compute
11
Rao-Blackwell Theorem (cont'd)
(2) draw , where  is the marginal distribution, and compute
• Both are correct. By the Strong Law of Large Numbers, both  and  converge to  almost surely.
• Which one is better? Can you justify your answer?
12
Rao-Blackwell Theorem (cont'd)
• (Rao-Blackwell Theorem) The sample variance is reduced when some components are marginalized out. With the setting above, we have
• Generally, reducing the sample variance also leads to a reduction of the autocorrelation of the chain, thus improving the mixing performance.
13
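The variance reduction can be checked numerically. The toy model below is an assumption for illustration (the slides' example is elided): X ~ N(0, 1), Y | X ~ N(X, 1), and we estimate E[Y] = 0 either by averaging y samples or by averaging the conditional expectation E[Y | X] = X with Y marginalized out:

```python
import random, statistics

random.seed(0)

# Hypothetical model: X ~ N(0,1), Y | X ~ N(X, 1). We estimate E[Y] = 0.
# (1) plain Monte Carlo: average y samples      (per-sample variance 2)
# (2) Rao-Blackwellized: average E[Y | X] = X   (per-sample variance 1)

def estimates(n):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [random.gauss(x, 1) for x in xs]
    plain = sum(ys) / n
    rb = sum(xs) / n          # conditional expectation, Y marginalized out
    return plain, rb

reps = [estimates(200) for _ in range(2000)]
var_plain = statistics.variance(r[0] for r in reps)
var_rb = statistics.variance(r[1] for r in reps)
print(var_plain, var_rb)   # roughly 2/200 vs 1/200
```

Both estimators are unbiased, but the Rao-Blackwellized one has about half the variance here, as the theorem predicts.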
Sampling with Auxiliary Variables
• The Rao-Blackwell Theorem suggests that, in order to achieve better performance, one should try to marginalize out as many components as possible.
• However, in many cases, one may want to do the opposite, that is, to introduce additional variables to facilitate the simulations.
• For example, when the target distribution is multimodal, one may use an auxiliary variable to help the chain escape from local traps.
14
Use Auxiliary Variables
• Specify an auxiliary variable  and the joint distribution  such that  for certain .
• Design a chain to update  using the M-H algorithm or the Gibbs sampler.
• The samples of  can then be obtained through marginalization or conditioning.
15
Slice Sampler
• Sampling  is equivalent to sampling uniformly from the area under : .
• Gibbs sampling based on the uniform distribution over . Each iteration consists of two steps:
• Given ,
• Given ,
17
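The two Gibbs steps above can be sketched for a density whose slice has a closed form. The target f(x) = exp(-x^2/2) is an assumption made for illustration, since for it the slice {x : f(x) >= u} is simply an interval:

```python
import math, random, statistics

random.seed(0)

# Slice sampler for the unnormalized density f(x) = exp(-x^2 / 2),
# for which the slice {x : f(x) >= u} = [-sqrt(-2 ln u), sqrt(-2 ln u)]
# has a closed form. (In general, this horizontal step is the hard part.)

def f(x):
    return math.exp(-0.5 * x * x)

x, samples = 0.0, []
for _ in range(20000):
    u = random.uniform(0, f(x))          # vertical step: u | x ~ U(0, f(x))
    s = math.sqrt(-2 * math.log(u))      # slice boundary for this u
    x = random.uniform(-s, s)            # horizontal step: x | u uniform on the slice
    samples.append(x)

m, sd = statistics.mean(samples), statistics.stdev(samples)
print(m, sd)   # ≈ 0, ≈ 1 for the standard normal target
```

Each iteration is a Gibbs sweep over (x, u), and the chain cannot be trapped below a density threshold, which is why the slice sampler mixes well when the slice is tractable.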
Slice Sampler (Discussion)
• The slice sampler can mix very rapidly, as it will not be locally trapped.
• The slice sampler is often nontrivial to implement in practice. Drawing  is sometimes very difficult.
• For distributions of certain forms, which admit an easy way to draw , slice sampling is a good strategy.
19
Gibbs Measure
A Gibbs measure is a probability measure with a density of the following form:
Here,  is called the energy function,  is called the inverse temperature, and the normalizing constant  depends on .
21
Gibbs Measure (cont'd)
In the MCMC literature, we often parameterize a Gibbs measure using the temperature parameter , thus .
22
Tempered MCMC
Typical MCMC methods usually rely on local moves to explore the state space. What is the problem?
23
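The problem can be made concrete with a hedged numeric sketch. For the double-well energy E(x) = (x^2 - 1)^2 (a hypothetical example), the Gibbs density is pi_T(x) ∝ exp(-E(x)/T), and the density ratio between the barrier at x = 0 and a mode at x = 1 shows how local moves get trapped at low temperature:

```python
import math

# Double-well energy (assumed for illustration): two modes at x = ±1
# separated by a barrier at x = 0. Under pi_T(x) ∝ exp(-E(x)/T), the
# ratio pi_T(0)/pi_T(1) = exp(-(E(0) - E(1))/T) governs how easily a
# local-move chain crosses between the modes.

def E(x):
    return (x * x - 1) ** 2

ratios = []
for T in [0.1, 1.0, 10.0]:
    ratio = math.exp(-(E(0.0) - E(1.0)) / T)
    ratios.append(ratio)
    print(T, ratio)
```

At T = 0.1 the barrier is essentially impassable (ratio ≈ 5e-5), while at T = 10 it nearly vanishes (ratio ≈ 0.9). Tempered MCMC exploits exactly this: explore at high T, collect at low T.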
Simulated Tempering
Suppose we intend to sample from
Basic idea: augment the target distribution by including a temperature index , with the joint distribution given by
25
Simulated Tempering (cont'd)
• We only collect samples at the lowest temperature, .
• The chain mixes much faster at high temperatures, but we want to collect samples at the lowest temperature. So we have to constantly switch between temperatures.
26
Simulated Tempering (Algorithm)
One iteration of Simulated Tempering has two steps:
• (Base transition): update  at the same temperature, i.e. holding  fixed.
• (Temperature switching): with  fixed, propose  with  such that
• Accept the change with probability .
• Any drawbacks?
27
Simulated Tempering (Discussion)
• Set . Given , we should set  such that uphill moves from ( ) have a considerable probability of being accepted.
• Build the temperature ladder step by step until we have a sufficiently smooth distribution at the top.
• The time spent on the base level  is around . If we have too many levels, only a very small portion of samples can be used.
28
Simulated Tempering (Discussion)
• All temperature levels play an important role, so it is desirable to spend a comparable amount of time at each level. Setting  for each , we have
• The normalizing constants  are typically unknown, and estimating them is very difficult and expensive.
29
Parallel Tempering
(Basic idea): rather than jumping between temperatures, we simultaneously simulate multiple chains, each at a temperature level  (called a replica), and constantly swap samples between replicas.
30
Parallel Tempering (Algorithm)
Each iteration consists of the following steps:
• (Parallel update): simulate each replica with its own transition kernel.
• (Replica exchange): propose to swap states between two replicas (say the -th and -th, where ):
31
Parallel Tempering (Algorithm)
• The proposal is accepted with probability , where
• We collect samples from the base replica (the one with ).
• Why does this algorithm produce the desired distribution?
32
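Both steps can be sketched in a few lines. The double-well energy and the temperature ladder below are assumptions for illustration; the swap acceptance follows the Metropolis ratio for exchanging states between levels i and i+1:

```python
import math, random

random.seed(0)

# Parallel-tempering sketch on the double-well E(x) = 8(x^2 - 1)^2
# (a hypothetical target). One replica per temperature; after each
# parallel update, a random adjacent pair proposes to swap states with
# acceptance min(1, exp((E(x_i) - E(x_j)) * (1/T_i - 1/T_j))).

def E(x):
    return 8 * (x * x - 1) ** 2

T = [1.0, 2.0, 4.0, 8.0]
xs = [1.0] * len(T)
base_samples = []

for step in range(20000):
    # parallel update: one Metropolis step per replica at its own temperature
    for i in range(len(T)):
        y = xs[i] + random.gauss(0, 0.3)
        if random.random() < math.exp(min(0.0, (E(xs[i]) - E(y)) / T[i])):
            xs[i] = y
    # replica exchange between a random adjacent pair (i, i+1)
    i = random.randrange(len(T) - 1)
    delta = (E(xs[i]) - E(xs[i + 1])) * (1 / T[i] - 1 / T[i + 1])
    if random.random() < math.exp(min(0.0, delta)):
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    base_samples.append(xs[0])   # collect from the base replica

neg = sum(s < 0 for s in base_samples) / len(base_samples)
print(neg)
```

States that cross the barrier at high temperature migrate down the ladder through swaps, so the base replica visits both modes; unlike simulated tempering, every iteration yields a base-level sample and no pseudo-prior constants are needed.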
Parallel Tempering (Justification)
Let . We define
Obviously, the step of parallel update preserves the invariant distribution .
33
Parallel Tempering (Justification)
Note that the step of replica exchange is symmetric, i.e. the probabilities of going up and down are equal. Then, according to the Metropolis algorithm, we have  with
34
Parallel Tempering (Discussion)
• It is efficient and very easy to implement, especially in a parallel computing environment.
• It is often an art instead of a technique to tune a parallel tempering system.
• Parallel tempering is a special case of a large family of MCMC methods called Extended Ensemble Monte Carlo, which involves a collection of parallel Markov chains, with the simulation switching between them.
35
Swendsen-Wang Algorithm
The Swendsen-Wang algorithm (R. Swendsen and J. Wang, 1987) is an efficient Gibbs sampling algorithm for sampling from the extended Ising model.
36
Standard Ising Model
The standard Ising model is defined as
where  for each  is called a spin, and .
• Gibbs sampling is extremely slow, especially when the temperature is low.
37
Extended Ising Model
• We extend the model by introducing additional bond variables , one for each edge. Each bond has two states:  indicating connected and  indicating disconnected.
• We define a joint distribution that couples the spins and bonds,
38
Extended Ising Model (cont'd)
Here,  is described as follows:
• When ,  for every setting of
• When ,
39
Extended Ising Model (cont'd)
With this setting,  can be written as:
where :
• when ,  must be
• when ,  is set to zero with probability .
40
Swendsen-Wang Algorithm
Each iteration consists of two steps:
• (Clustering): conditioned on the spins , draw the bonds  independently. For an edge :
• If , set
• If , set  with probability , or  otherwise.
41
Swendsen-Wang Algorithm
• (Swapping): conditioned on the bonds , draw the spins .
• For each connected component, draw  or  with equal chance, and assign the resultant value to all nodes in the component.
42
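The two steps above can be sketched for a small grid. The lattice size, the coupling, and the bond probability p = 1 - exp(-2*beta) (the standard choice for pi(x) ∝ exp(beta * sum of x_i x_j over edges)) are assumptions for illustration:

```python
import math, random
from collections import deque

random.seed(0)

# Swendsen-Wang for a small L x L Ising model with spins in {-1, +1} and
# pi(x) ∝ exp(beta * sum over edges of x_i x_j). Aligned edges become
# bonds with probability p = 1 - exp(-2*beta); each connected component
# then receives a fresh spin value, ±1 with equal chance.

L, beta = 8, 0.6
p = 1 - math.exp(-2 * beta)
spins = {(i, j): random.choice([-1, 1]) for i in range(L) for j in range(L)}

def neighbors(i, j):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < L and 0 <= j + dj < L:
            yield (i + di, j + dj)

def sweep():
    # clustering step: sample bonds given the spins
    bonds = set()
    for (i, j) in spins:
        for n in ((i + 1, j), (i, j + 1)):
            if n in spins and spins[(i, j)] == spins[n] and random.random() < p:
                bonds.add(((i, j), n))
    # swapping step: one fresh spin per connected component (BFS)
    seen = set()
    for start in spins:
        if start in seen:
            continue
        value = random.choice([-1, 1])
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            spins[u] = value
            for v in neighbors(*u):
                if v not in seen and (((u, v) in bonds) or ((v, u) in bonds)):
                    seen.add(v)
                    queue.append(v)

for _ in range(100):
    sweep()
print(sum(spins.values()))   # total magnetization after 100 sweeps
```

Because whole clusters flip at once, large correlated regions are updated in a single move, which is why the algorithm mixes rapidly where single-spin Gibbs updates crawl.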
Swendsen-Wang Algorithm (Illustration)
In the case of a rectangular grid, this Gibbs sampling algorithm mixes very rapidly.
The following figures illustrate Gibbs sampling. Spin states up and down are shown by filled and empty circles. Bond states 1 and 0 are shown by thick lines and thin dotted lines. We start from a state with five connected components. (Remember that isolated spins count as connected components, albeit of size 1.)
[Figures: first, the bonds are updated, with the forbidden bonds highlighted; then the spins are updated; then the bonds again.]
Bonds are forbidden from forming wherever the two adjacent spins are in opposite states. The bonds that are not forbidden are set to the 1 state with probability p.
Other properties of the extended model
We already mentioned that the partition function Z is the same as that of the Ising model.
The marginal P(x) is correct, because when we sum the factor g_m over d_m, we get f_m. Summing over d_m is easy because it appears in only one factor.
We have summed out d and obtained the Ising model. What if we sum out x? The marginal P(d) is called the random cluster model. Summing over x for given d, all factors are constants. The number of states is 2^(number of clusters). Thus

P(d) = (1/Z̃) ∏_m [ p^{d_m} (1 − p)^{1 − d_m} ] · 2^{c(d)}    (10)

where c(d) is the number of connected components in the state d. Isolated spins whose neighbouring bonds are all zero count as single connected components.
The random cluster model can be generalized by replacing the number 2 by a parameter q:

P^{(q)}(d) = ∏_m [ p^{d_m} (1 − p)^{1 − d_m} ] · q^{c(d)}    (11)

The random cluster model can be simulated directly, just as the Ising model can be simulated directly; but the S-W method, augmenting the bonds with spins, is probably the most efficient way to simulate the model. For integer values of q, the appropriate spin system is the 'Potts model', the generalization of the Ising model from 2 spin states to q.
43
Swendsen-Wang Algorithm (Discussion)
• When  is large,  has a high probability of being set to one, i.e.  and  are likely to be connected.
• Experiments show that the Swendsen-Wang algorithm mixes very rapidly, especially for rectangular grids.
• Can you provide an intuitive explanation?
44
Swendsen-Wang Algorithm (Discussion)
• The Swendsen-Wang algorithm can be generalized to Potts models (nodes can take values from a finite set).
• The Swendsen-Wang algorithm has been widely used in image analysis applications, e.g. image segmentation (in this case, it is called Swendsen-Wang cut).
45
Hamiltonian Monte Carlo
• An MCMC method based on Hamiltonian dynamics. It was originally devised for molecular simulation.
• In 1987, a seminal paper by Duane et al. unified MCMC and molecular dynamics. They called it Hybrid Monte Carlo, which abbreviates to HMC.
• In many articles, people call it Hamiltonian Monte Carlo, as this name is considered to be more specific and informative, and it retains the same abbreviation "HMC".
46
Motivating Example: Free Fall
• The change of momentum  is caused by the accumulation/release of the potential energy:
• The change of location  is caused by the velocity, the derivative of the kinetic energy w.r.t. the momentum:
48
Hamiltonian Dynamics
• Hamiltonian dynamics is a generalized theory of classical mechanics, which provides an elegant and flexible abstraction of a dynamic system in physics.
• In Hamiltonian dynamics, a physical system is described by , where  and  are respectively the position and momentum of the -th entity.
49
Hamilton's Equations
The dynamics of the system is characterized by Hamilton's equations:
Here,  is called the Hamiltonian, which can be interpreted as the total energy of the system.
50
Hamilton's Equations (cont'd)
• The Hamiltonian  is often formulated as the sum of the potential energy  and the kinetic energy :
• With this setting, Hamilton's equations become:
51
Conservation of the Hamiltonian
The Hamiltonian is conserved, i.e., it is invariant over time:
Intuitively, this reflects the law of energy conservation.
52
Hamiltonian Reversibility
• Hamiltonian dynamics is reversible.
• Let the initial states be  and the states at time  be . Then, if we reverse the process, starting at , the states at time  would be .
• In the context of MCMC, this leads to the reversibility of the underlying chain.
53
Simulation of Hamiltonian Dynamics
A natural idea to simulate Hamiltonian dynamics is to use Euler's method over discretized time steps:
Is this a good method?
54
Leapfrog Method
Better results can be obtained with the leapfrog method:
More importantly, the leapfrog update is reversible.
55
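The difference between the two integrators is easy to see on the harmonic oscillator H(q, p) = q^2/2 + p^2/2 (a standard test case, assumed here for illustration), whose exact flow conserves H:

```python
# Euler vs. leapfrog (Stormer-Verlet) on the harmonic oscillator
# H(q, p) = q^2/2 + p^2/2 (unit mass). Euler lets the energy grow
# steadily; leapfrog keeps it nearly constant over the same trajectory.

def euler(q, p, eps, steps):
    for _ in range(steps):
        q, p = q + eps * p, p - eps * q   # both updates use the old state
    return q, p

def leapfrog(q, p, eps, steps):
    for _ in range(steps):
        p -= 0.5 * eps * q                # half step for momentum
        q += eps * p                      # full step for position
        p -= 0.5 * eps * q                # half step for momentum
    return q, p

def H(q, p):
    return 0.5 * (q * q + p * p)

q0, p0, eps, steps = 1.0, 0.0, 0.1, 200
qe, pe = euler(q0, p0, eps, steps)
ql, pl = leapfrog(q0, p0, eps, steps)
print(H(qe, pe), H(ql, pl))   # Euler drifts well above 0.5; leapfrog stays near 0.5
```

For this system, each Euler step multiplies the energy by exactly (1 + eps^2), so after 200 steps it has grown by a factor of about 7, while the leapfrog energy error stays bounded, which is what makes leapfrog the right choice inside HMC.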
Hamiltonian Monte Carlo
(Basic idea): consider the potential energy as the Gibbs energy, and introduce the "momentums" as auxiliary variables to control the dynamics.
59
Hamiltonian Monte Carlo (cont'd)
Suppose the target distribution is , then we form an augmented distribution as
Here, the locations  represent the variables of interest, and the momentums  control the dynamics of the simulation.
60
Hamiltonian Monte Carlo (Algorithm)
Each iteration of HMC comprises two steps:
• Gibbs update: sample the momentums  from the Gaussian prior given by
62
Hamiltonian Monte Carlo (Algorithm)
• Metropolis update: use Hamiltonian dynamics to propose a new state. Starting from , simulate the dynamic system with the leapfrog method for  steps with step size , which yields . The proposed state is accepted with probability:
63
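Both steps fit in a short sketch. The target here, a standard normal with U(q) = q^2/2 and K(p) = p^2/2, plus the step size and trajectory length, are assumptions for illustration (the final momentum negation is omitted since K is symmetric):

```python
import math, random, statistics

random.seed(0)

# Minimal HMC for the 1-D target pi(q) ∝ exp(-U(q)) with U(q) = q^2/2
# (a standard normal) and kinetic energy K(p) = p^2/2.

def U(q):  return 0.5 * q * q
def dU(q): return q

def hmc_step(q, eps=0.2, L=20):
    p = random.gauss(0, 1)                  # Gibbs update of the momentum
    q_new, p_new = q, p
    p_new -= 0.5 * eps * dU(q_new)          # leapfrog trajectory of L steps
    for i in range(L):
        q_new += eps * p_new
        if i != L - 1:
            p_new -= eps * dU(q_new)
    p_new -= 0.5 * eps * dU(q_new)
    # Metropolis correction for the discretization error
    dH = (U(q) + 0.5 * p * p) - (U(q_new) + 0.5 * p_new * p_new)
    return q_new if random.random() < math.exp(min(0.0, dH)) else q

q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)

m, sd = statistics.mean(samples), statistics.stdev(samples)
print(m, sd)   # ≈ 0, ≈ 1 for the standard normal target
```

With L = 20 leapfrog steps per iteration, consecutive samples are nearly independent; a random-walk proposal of comparable acceptance rate would move in far smaller steps.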
HMC (Discussion)
• If the simulation were exact, we would have , and thus the proposed state would always be accepted.
• In practice, there can be some deviation due to discretization, so we have to use the Metropolis rule to guarantee correctness.
64
HMC (Discussion)
• HMC has a high acceptance rate while allowing large moves along less-constrained directions at each iteration.
• This is a key advantage over random-walk proposals, which, in order to maintain a reasonably high acceptance rate, have to keep a very small step size, resulting in substantial correlation between consecutive samples.
65
Tuning HMC
• For efficient simulation, it is important to choose appropriate values for both the leapfrog step size  and the number of leapfrog steps per iteration .
• Tuning HMC (and indeed many generic sampling methods) often requires preliminary runs with different trial settings and different initial values, as well as careful analysis of the energy trajectories.
66
Tuning HMC (cont'd)
• In most cases,  and  can be tuned independently.
• Too small a step size would waste computation time, while too large a step size would cause unstable simulation and thus a low acceptance rate.
• One should choose  such that the energy trajectory is stable and the acceptance rate is maintained at a reasonably high level.
• One should choose  such that back-and-forth movement of the states can be observed.
67
Generic Sampling Systems
A number of software systems are available for sampling from models specified by the user.
• WinBUGS: based on BUGS (Bayesian inference Using Gibbs Sampling).
• Provides a friendly language for users to specify the model
• Runs only on Windows
• Note: development has been stopped since 2007.
68
Generic Sampling Systems (cont'd)
• JAGS: "Just Another Gibbs Sampler"
• Cross-platform support
• Uses a dialect of BUGS
• Extensible: allows users to write customized functions, distributions, and samplers
69
Generic Sampling Systems (cont'd)
• Stan: "Sampling Through Adaptive Neighborhoods"
• Core written in C++, with interfaces available in Python, R, Matlab, and Julia
• A user-friendly language for model specification
• Uses Hamiltonian Monte Carlo (HMC) and the No-U-Turn Sampler (NUTS) as its core algorithms
• Open source (GPLv3 licensed) and under active development on GitHub
70
Stan Example

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 1:N)
    y[n] ~ normal(alpha + beta * x[n], sigma);
}
71