Chapter 11. Stochastic Methods Rooted in Statistical Mechanics
Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-Learning Neural Algorithms
Byoung-Tak Zhang, School of Computer Science and Engineering
Seoul National University
Version: 20170926 → 20170928 → 20171011

Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion
(c) 2017 Biointelligence Lab, SNU
11.1 Introduction
• Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems
• Statistical mechanics
  - The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
  - The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
  - The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
  - The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.
• Statistical mechanics for the study of neural networks
  - Cragg and Temperley (1954) and Cowan (1968)
  - Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)
11.2 Statistical Mechanics (1/2)
p_i: probability of occurrence of state i of a stochastic system
    p_i ≥ 0 (for all i)  and  Σ_i p_i = 1
E_i: energy of the system when it is in state i

In thermal equilibrium, the probability of state i is given by the canonical (Gibbs) distribution:
    p_i = (1/Z) exp(−E_i / k_B T)
    Z = Σ_i exp(−E_i / k_B T)
exp(−E / k_B T): Boltzmann factor
Z: sum over states (partition function)
We set k_B = 1 and view −log p_i as "energy."
1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
11.2 Statistical Mechanics (2/2)
Helmholtz free energy:
    F = −T log Z
Average energy:
    <E> = Σ_i p_i E_i
Hence
    <E> − F = −T Σ_i p_i log p_i
Entropy:
    H = −Σ_i p_i log p_i
Thus, we have
    <E> − F = TH,  i.e.,  F = <E> − TH

Consider two systems A and A' in thermal contact, with entropy changes ΔH and ΔH'. The total entropy tends to increase:
    ΔH + ΔH' ≥ 0
The free energy of the system, F, tends to decrease and becomes a minimum in an equilibrium situation. The resulting probability distribution is the Gibbs distribution (the principle of minimum free energy).
Nature likes to find a physical system with minimum free energy.
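The identities above can be checked numerically. A minimal sketch, assuming an arbitrary toy set of state energies (not from the text), that computes Z, F, <E>, and H for a Gibbs distribution and confirms F = <E> − TH:

```python
import math

def gibbs_stats(energies, T):
    """Partition function quantities for the Gibbs distribution over the
    given state energies (k_B = 1): free energy, average energy, entropy."""
    Z = sum(math.exp(-E / T) for E in energies)
    p = [math.exp(-E / T) / Z for E in energies]
    F = -T * math.log(Z)                       # Helmholtz free energy
    avg_E = sum(pi * Ei for pi, Ei in zip(p, energies))
    H = -sum(pi * math.log(pi) for pi in p)    # entropy
    return F, avg_E, H

# toy four-state system at temperature T = 1.5
F, avg_E, H = gibbs_stats([0.0, 1.0, 2.0, 3.0], T=1.5)
# the identity F = <E> - T*H holds for any energies and temperature
```

Since H ≥ 0, the free energy never exceeds the average energy, consistent with F becoming a minimum at equilibrium.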
Markov property:
    P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)
Transition probability from state i at time n to state j at time n+1:
    p_ij = P(X_{n+1} = j | X_n = i)
    (p_ij ≥ 0 for all i, j  and  Σ_j p_ij = 1 for all i)
If the transition probabilities are fixed, the Markov chain is homogeneous. For a system with a finite number of possible states K, the transition probabilities constitute a K-by-K matrix (stochastic matrix):
    P = [ p_11 ... p_1K ]
        [  ...       ...  ]
        [ p_K1 ... p_KK ]
11.3 Markov Chains (1/9)
11.3 Markov Chains (2/9)
Generalization to the m-step transition probability:
    p_ij^(m) = P(X_{n+m} = x_j | X_n = x_i),  m = 1, 2, ...
    p_ij^(m+1) = Σ_k p_ik^(m) p_kj,  m = 1, 2, ...,  p_ik^(1) = p_ik
We can generalize further to the Chapman-Kolmogorov identity:
    p_ij^(m+n) = Σ_k p_ik^(m) p_kj^(n),  m, n = 1, 2, ...
11.3 Markov Chains (3/9)
Properties of Markov chains
- Recurrent: p_i = P(ever returning to state i) = 1
- Transient: p_i < 1
- Periodic with period d: the states can be divided into d disjoint subsets S_1, ..., S_d such that if i ∈ S_k and p_ij > 0, then
    j ∈ S_{k+1} for k = 1, ..., d−1
    j ∈ S_1 for k = d
- Aperiodic: d = 1
- Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j
- Communicate: states i and j communicate if each is accessible from the other
If two states communicate with each other, they belong to the same class. If all the states consist of a single class, the Markov chain is indecomposable or irreducible.
11.3 Markov Chains (4/9)
Figure 11.1: A periodic recurrent Markov chain with d = 3.
11.3 Markov Chains (5/9)
Ergodic Markov chains
Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.
v_i(k): proportion of time spent in state i after k returns,
    v_i(k) = k / Σ_{ℓ=1}^{k} T_i(ℓ)
where T_i(ℓ) is the time between successive returns to state i. Then
    lim_{k→∞} v_i(k) = π_i,  i = 1, 2, ..., K
11.3 Markov Chains (6/9)
Convergence to stationary distributions
Consider an ergodic Markov chain with stochastic matrix P, and let π^(n−1) denote the state distribution vector of the chain at time n−1. The state distribution vector at time n is
    π^(n) = π^(n−1) P
By iteration we obtain
    π^(n) = π^(n−1) P = π^(n−2) P^2 = π^(n−3) P^3 = ... = π^(0) P^n
where π^(0) is the initial value. In the limit, every row of P^n equals the stationary distribution π:
    lim_{n→∞} P^n = [ π_1 ... π_K ]   [ π ]
                    [  ...      ... ] = [ ... ]
                    [ π_1 ... π_K ]   [ π ]
Ergodic theorem:
1. lim_{n→∞} p_ij^(n) = π_j for all i
2. π_j > 0 for all j
3. Σ_{j=1}^{K} π_j = 1
4. π_j = Σ_{i=1}^{K} π_i p_ij for j = 1, 2, ..., K
11.3 Markov Chains (6/9)
Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x1 and x2 may be identified as up-to-date and behind, respectively.
P = [ 1/4  3/4 ]
    [ 1/2  1/2 ]

π^(0) = [ 1/6  5/6 ]

π^(1) = π^(0) P
      = [ 1/6  5/6 ] [ 1/4  3/4 ]
                     [ 1/2  1/2 ]
      = [ 11/24  13/24 ]

P^(2) = [ 0.4375  0.5625 ]    P^(3) = [ 0.4001  0.5999 ]    P^(4) = [ 0.4000  0.6000 ]
        [ 0.3750  0.6250 ]            [ 0.3999  0.6001 ]            [ 0.4000  0.6000 ]
11.3 Markov Chains (7/9)
Figure 11.3: State-transition diagram of the Markov chain for Example 2.
P = [  0    0    1  ]
    [ 1/3  1/6  1/2 ]
    [ 3/4  1/4   0  ]

Applying π_j = Σ_{i=1}^{K} π_i p_ij:
    π_1 = (1/3)π_2 + (3/4)π_3
    π_2 = (1/6)π_2 + (1/4)π_3
    π_3 = π_1 + (1/2)π_2
Solving, together with Σ_j π_j = 1:
    π_1 = 0.3953,  π_2 = 0.1395,  π_3 = 0.4652
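The balance equations of Example 2 can be verified exactly with rational arithmetic. A sketch, assuming the exact fractions 17/43, 6/43, 20/43, which reproduce the decimals above (the fractions are my reconstruction, not stated in the text):

```python
from fractions import Fraction as Fr

# transition matrix of Example 2 (each row sums to 1)
P = [[Fr(0),    Fr(0),    Fr(1)],
     [Fr(1, 3), Fr(1, 6), Fr(1, 2)],
     [Fr(3, 4), Fr(1, 4), Fr(0)]]

# candidate stationary distribution in exact form
pi = [Fr(17, 43), Fr(6, 43), Fr(20, 43)]

# pi_j = sum_i pi_i p_ij must hold exactly for every j (global balance)
balanced = all(sum(pi[i] * P[i][j] for i in range(3)) == pi[j]
               for j in range(3))
```

Exact fractions make the check an equality rather than a floating-point comparison.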
11.3 Markov Chains (8/9)
Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.
11.3 Markov Chains (9/9)
Principle of detailed balance: at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:
    π_i p_ij = π_j p_ji
Application: stationarity of π. For any state j,
    Σ_{i=1}^{K} π_i p_ij = Σ_{i=1}^{K} (π_i p_ij / π_j) π_j
                        = Σ_{i=1}^{K} p_ji π_j    (π_i p_ij = π_j p_ji, detailed balance)
                        = π_j                      (since Σ_{i=1}^{K} p_ji = 1)
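That detailed balance implies stationarity can be checked on a small chain built to satisfy it. A sketch, assuming an illustrative three-state target distribution and a symmetric proposal (this construction is not from the text):

```python
# a chain constructed so that detailed balance pi_i p_ij = pi_j p_ji holds
pi = [0.2, 0.3, 0.5]
K = len(pi)
tau = 1.0 / K                        # symmetric proposal probability
P = [[0.0] * K for _ in range(K)]
for i in range(K):
    for j in range(K):
        if i != j:
            P[i][j] = tau * min(1.0, pi[j] / pi[i])
    P[i][i] = 1.0 - sum(P[i][j] for j in range(K) if j != i)

# detailed balance holds pairwise ...
db = all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12
         for i in range(K) for j in range(K))
# ... and therefore pi is stationary: sum_i pi_i p_ij = pi_j
stat = all(abs(sum(pi[i] * P[i][j] for i in range(K)) - pi[j]) < 1e-12
           for j in range(K))
```

Off-diagonal flows equal tau * min(pi_i, pi_j), which is symmetric in i and j, so the pairwise balance is immediate.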
11.4 Metropolis Algorithm (1/3)
Metropolis algorithm: a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium. A modified Monte Carlo method; a Markov chain Monte Carlo (MCMC) method.

Algorithm Metropolis
1. X_n = x_i. Randomly generate a new state x_j.
2. ΔE = E(x_j) − E(x_i)
3. If ΔE < 0, then X_{n+1} = x_j;
   else if ΔE ≥ 0, then {
       Select a random number ξ ~ U[0,1].
       If ξ < exp(−ΔE / T), then X_{n+1} = x_j (accept);
       else X_{n+1} = x_i (reject).
   }
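The steps above can be sketched directly in code. A minimal version, assuming a finite state space with a uniform (symmetric) proposal; the toy two-state energies are illustrative, not from the text:

```python
import math, random

def metropolis(energy, states, n_steps, T, seed=0):
    """Metropolis sampler over a finite state space: a sketch of the
    algorithm above with a uniform (symmetric) proposal."""
    rng = random.Random(seed)
    x = states[0]
    counts = {s: 0 for s in states}
    for _ in range(n_steps):
        x_new = rng.choice(states)              # propose a random state
        dE = energy(x_new) - energy(x)
        if dE < 0 or rng.random() < math.exp(-dE / T):
            x = x_new                           # accept
        counts[x] += 1                          # a rejection keeps the old state
    return counts

# two-state system: the low-energy state is visited more often
E = {0: 0.0, 1: 1.0}
counts = metropolis(lambda s: E[s], [0, 1], n_steps=20000, T=1.0)
```

With energies 0 and 1 at T = 1, the visit ratio should settle near exp(1), consistent with the Gibbs distribution.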
11.4 Metropolis Algorithm (2/3)
Choice of transition probabilities
Proposed set of transition probabilities τ_ij:
1. τ_ij ≥ 0 (for all i, j): nonnegativity
2. Σ_j τ_ij = 1 (for all i): normalization
3. τ_ij = τ_ji (for all i, j): symmetry
Desired set of transition probabilities (for j ≠ i):
    p_ij = τ_ij (π_j / π_i)  for π_j / π_i < 1
    p_ij = τ_ij              for π_j / π_i ≥ 1
    p_ii = τ_ii + Σ_{j≠i} τ_ij (1 − α_ij) = 1 − Σ_{j≠i} α_ij τ_ij
Moving probability:
    α_ij = min(1, π_j / π_i)
11.4 Metropolis Algorithm (3/3)
How do we choose the ratio π_j / π_i? We choose the probability distribution to which we want the Markov chain to converge to be a Gibbs distribution:
    π_j = (1/Z) exp(−E_j / T)
    π_j / π_i = exp(−ΔE / T),  ΔE = E_j − E_i
Proof of detailed balance:
Case 1: ΔE < 0 (so π_j / π_i ≥ 1).
    π_i p_ij = π_i τ_ij = π_i τ_ji
    π_j p_ji = π_j (π_i / π_j) τ_ji = π_i τ_ji
Case 2: ΔE > 0 (so π_j / π_i < 1).
    π_i p_ij = π_i (π_j / π_i) τ_ij = π_j τ_ij = π_j τ_ji
    π_j p_ji = π_j τ_ji
In both cases π_i p_ij = π_j p_ji, so detailed balance holds.
11.5 Simulated Annealing (1/3)
Simulated annealing
• A stochastic relaxation technique for solving optimization problems.
• Improves the computational efficiency of the Metropolis algorithm.
• Makes random moves on the energy surface.
• Operate a stochastic system at a high temperature (where convergence to equilibrium is fast) and then iteratively lower the temperature (at T = 0, the Markov chain collapses on the global minima).
Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature.
    F = <E> − TH,  lim_{T→0} F = <E>
11.5 Simulated Annealing (2/3)
1. Initial value of the temperature. The initial value T_0 of the temperature is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated-annealing algorithm.
2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by
    T_k = α T_{k−1},  k = 1, 2, ..., K
where α is a constant smaller than, but close to, unity; typical values of α lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment, on average.
3. Final value of the temperature. The system is fixed and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.
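A minimal sketch of the two ingredients together: Metropolis moves inside an exponential cooling schedule T_k = α T_{k−1}. The 1-D toy energy and all parameter values are illustrative assumptions, not from the text:

```python
import math, random

def simulated_annealing(energy, neighbor, x0, T0=10.0, alpha=0.9,
                        n_temps=60, moves_per_temp=100, seed=1):
    """Simulated-annealing sketch: Metropolis moves at each temperature
    of the exponential schedule T_k = alpha * T_{k-1}."""
    rng = random.Random(seed)
    x, T, best = x0, T0, x0
    for _ in range(n_temps):
        for _ in range(moves_per_temp):
            x_new = neighbor(x, rng)
            dE = energy(x_new) - energy(x)
            if dE < 0 or rng.random() < math.exp(-dE / T):
                x = x_new                      # Metropolis accept/reject
            if energy(x) < energy(best):
                best = x                       # track the best state seen
        T *= alpha                             # exponential cooling
    return best

# toy 1-D energy with global minima at x = -2 and x = +2
f = lambda x: (x * x - 4.0) ** 2
step = lambda x, rng: x + rng.uniform(-0.5, 0.5)
x_best = simulated_annealing(f, step, x0=8.0)
```

Using the final state at one temperature as the starting point for the next is exactly ingredient 2 of the schedule above.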
11.5 Simulated Annealing (3/3)
Simulated Annealing for Combinatorial Optimization
11.6 Gibbs Sampling (1/2)
Gibbs sampling: an iterative adaptive scheme that generates a value from the conditional distribution of each component of the random vector X in turn, rather than all components at the same time.
X = (X_1, X_2, ..., X_K): a random vector of K components.
Assume we know the conditionals P(X_k | X_−k), where X_−k = (X_1, ..., X_{k−1}, X_{k+1}, ..., X_K).

Gibbs sampling algorithm (Gibbs sampler):
1. Initialize x_1(0), x_2(0), ..., x_K(0).
2. i ← 1
    x_1(i) ~ P(X_1 | x_2(i−1), x_3(i−1), ..., x_K(i−1))
    x_2(i) ~ P(X_2 | x_1(i), x_3(i−1), ..., x_K(i−1))
    x_3(i) ~ P(X_3 | x_1(i), x_2(i), x_4(i−1), ..., x_K(i−1))
    ...
    x_k(i) ~ P(X_k | x_1(i), ..., x_{k−1}(i), x_{k+1}(i−1), ..., x_K(i−1))
    ...
    x_K(i) ~ P(X_K | x_1(i), x_2(i), ..., x_{K−1}(i))
3. If the termination condition is not met, then i ← i + 1 and go to step 2.
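The scheme can be sketched for the smallest nontrivial case, K = 2 binary variables. The target joint table is an illustrative assumption, not from the text; each sweep samples each component from its full conditional:

```python
import random

# illustrative target joint over two binary variables
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def conditional(k, other):
    """P(X_k = 1 | the other variable = other), from the joint table."""
    if k == 0:
        p1, p0 = joint[(1, other)], joint[(0, other)]
    else:
        p1, p0 = joint[(other, 1)], joint[(other, 0)]
    return p1 / (p0 + p1)

def gibbs(n_sweeps, seed=0):
    rng = random.Random(seed)
    x = [0, 0]
    counts = {s: 0 for s in joint}
    for _ in range(n_sweeps):
        # visit each component in turn, sampling from its full conditional
        x[0] = 1 if rng.random() < conditional(0, x[1]) else 0
        x[1] = 1 if rng.random() < conditional(1, x[0]) else 0
        counts[(x[0], x[1])] += 1
    return counts

n = 50000
counts = gibbs(n)
```

The empirical state frequencies should approach the target joint, as the convergence theorem on the next slide asserts.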
11.6 Gibbs Sampling (2/2)
1. Convergence theorem. The random variable X_k(n) converges in distribution to the true probability distribution of X_k for k = 1, 2, ..., K as n approaches infinity; that is,
    lim_{n→∞} P(X_k(n) ≤ x | x_k(0)) = P_{X_k}(x)  for k = 1, 2, ..., K
where P_{X_k}(x) is the marginal cumulative distribution function of X_k.
2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables X_1(n), X_2(n), ..., X_K(n) converges to the true joint cumulative distribution of X_1, X_2, ..., X_K at a geometric rate in n.
3. Ergodic theorem. For any measurable function g of the random variables X_1, X_2, ..., X_K whose expectation exists, we have
    lim_{n→∞} (1/n) Σ_{i=1}^{n} g(X_1(i), X_2(i), ..., X_K(i)) = E[g(X_1, X_2, ..., X_K)]
with probability 1 (i.e., almost surely).
11.7 Boltzmann Machine (1/5)
Figure 11.5: Architectural graph of the Boltzmann machine; K is the number of visible neurons, and L is the number of hidden neurons. The distinguishing features of the machine are: 1. The connections between the visible and hidden neurons are symmetric. 2. The symmetric connections are extended to the visible and hidden neurons.
A stochastic machine consisting of stochastic neurons with symmetric synaptic connections.
x: state vector of the Boltzmann machine (BM)
w_ji: synaptic connection from neuron i to neuron j
Structure (weights):
    w_ji = w_ij for all i, j
    w_ii = 0 for all i
Energy:
    E(x) = −(1/2) Σ_i Σ_{j≠i} w_ji x_i x_j
Probability:
    P(X = x) = (1/Z) exp(−E(x)/T)
11.7 Boltzmann Machine (2/5)
Consider three events:
    A: X_j = x_j
    B: {X_i = x_i}_{i=1}^{K} with i ≠ j
    C: {X_i = x_i}_{i=1}^{K}
The joint event B excludes A, and the joint event C includes both A and B.
    P(C) = P(A, B) = (1/Z) exp((1/2T) Σ_i Σ_{j≠i} w_ji x_i x_j)
    P(B) = Σ_A P(A, B) = (1/Z) Σ_{x_j} exp((1/2T) Σ_i Σ_{j≠i} w_ji x_i x_j)
The component of the exponent involving x_j is
    (x_j / 2T) Σ_{i≠j} w_ji x_i
Hence
    P(A | B) = P(A, B) / P(B) = 1 / (1 + exp(−(x_j / T) Σ_{i≠j} w_ji x_i))
That is,
    P(X_j = x | {X_i = x_i}_{i=1, i≠j}^{K}) = φ((x/T) Σ_{i≠j} w_ji x_i)
where φ(v) = 1 / (1 + exp(−v)) is the logistic (sigmoid) function.
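The conditional above is the stochastic neuron update of the Boltzmann machine. A minimal sketch for two ±1 neurons with a positive coupling (this tiny example and its weight value are illustrative assumptions, not from the text):

```python
import math, random

def phi(v):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-v))

def update_neuron(x, w, j, T, rng):
    """Resample neuron j (values +/-1) of a Boltzmann machine state x
    according to P(X_j = +1 | rest) = phi((1/T) sum_{i != j} w_ji x_i)."""
    v = sum(w[j][i] * x[i] for i in range(len(x)) if i != j) / T
    x[j] = 1 if rng.random() < phi(v) else -1

rng = random.Random(0)
w = [[0.0, 1.0], [1.0, 0.0]]   # symmetric weights, zero self-connections
x = [1, -1]
agree = 0
n_sweeps = 10000
for _ in range(n_sweeps):
    update_neuron(x, w, 0, 1.0, rng)
    update_neuron(x, w, 1, 1.0, rng)
    agree += (x[0] == x[1])
```

With a positive coupling, agreeing configurations have lower energy, so at equilibrium the two neurons agree much more often than not.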
11.7 Boltzmann Machine (3/5)
Figure 11.6: Sigmoid-shaped function P(v).
Log-likelihood of the training sample ℑ:
    L(w) = log Π_{x_α ∈ ℑ} P(X_α = x_α) = Σ_{x_α ∈ ℑ} log P(X_α = x_α)
1. Positive phase. In this phase, the network operates in its clamped condition (i.e., under the direct influence of the training sample ℑ).
2. Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input.
11.7 Boltzmann Machine (4/5)
x_α: the state of the visible neurons (a subset of x)
x_β: the state of the hidden neurons (a subset of x)
Probability of the visible state:
    P(X_α = x_α) = (1/Z) Σ_{x_β} exp(−E(x)/T)
    Z = Σ_x exp(−E(x)/T)
Log-likelihood function given the training data ℑ:
    L(w) = log P(x | w) = Σ_{x_α ∈ ℑ} [ log Σ_{x_β} exp(−E(x)/T) − log Σ_x exp(−E(x)/T) ]
Derivative of the log-likelihood function:
    ∂L(w)/∂w_ji = (1/T) Σ_{x_α ∈ ℑ} [ Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i − Σ_x P(X = x) x_j x_i ]
11.7 Boltzmann Machine (5/5)
Mean firing rate in the positive phase (clamped):
    ρ_ji^+ = <x_j x_i>^+ = Σ_{x_α ∈ ℑ} Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i
Mean firing rate in the negative phase (free-running):
    ρ_ji^− = <x_j x_i>^− = Σ_{x_α ∈ ℑ} Σ_x P(X = x) x_j x_i
Thus, we may write
    ∂L(w)/∂w_ji = (1/T)(ρ_ji^+ − ρ_ji^−)
Gradient ascent to maximize L(w):
    Δw_ji = ε ∂L(w)/∂w_ji = η(ρ_ji^+ − ρ_ji^−),  η = ε/T
This is the Boltzmann machine learning rule.
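For a network small enough to enumerate, both correlation phases can be computed exactly and one learning step verified to increase the log-likelihood. A sketch, assuming a three-neuron machine (two visible, one hidden) and two illustrative training patterns, none of which are from the text:

```python
import math
from itertools import product

T = 1.0
N = 3  # neurons 0 and 1 are visible, neuron 2 is hidden; states are +/-1

def energy(s, w):
    """Boltzmann machine energy E(x) = -1/2 sum_{i != j} w_ji x_i x_j."""
    return -0.5 * sum(w[i][j] * s[i] * s[j]
                      for i in range(N) for j in range(N) if i != j)

def joint(w):
    """Exact Gibbs distribution by enumeration (tiny networks only)."""
    states = list(product([-1, 1], repeat=N))
    wt = {s: math.exp(-energy(s, w) / T) for s in states}
    Z = sum(wt.values())
    return {s: v / Z for s, v in wt.items()}

def log_likelihood(data, w):
    """Log-likelihood of the visible patterns, hidden neuron summed out."""
    p = joint(w)
    return sum(math.log(p[v + (-1,)] + p[v + (1,)]) for v in data)

def learning_step(data, w, eta=0.1):
    """One Boltzmann learning step, Delta w ~ rho+ - rho-, with both
    correlation phases computed exactly by enumeration."""
    p = joint(w)
    new_w = [row[:] for row in w]
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # negative phase: free-running correlation <x_j x_i>
            rho_neg = sum(p[s] * s[i] * s[j] for s in p)
            # positive phase: correlation with the visible neurons clamped
            rho_pos = 0.0
            for v in data:
                s0, s1 = v + (-1,), v + (1,)
                norm = p[s0] + p[s1]
                rho_pos += (p[s0] * s0[i] * s0[j] + p[s1] * s1[i] * s1[j]) / norm
            rho_pos /= len(data)
            new_w[i][j] = w[i][j] + eta * (rho_pos - rho_neg)
    return new_w

data = [(1, 1), (-1, -1)]        # visible patterns to be learned
w0 = [[0.0] * N for _ in range(N)]
w1 = learning_step(data, w0)
```

The positive-phase correlations here are averaged over the training set, which rescales the gradient but preserves its direction.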
11.8 Logistic Belief Nets
Figure 11.7: Directed (logistic) belief network.
A stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic connections.
Parents of node j:
    pa(X_j) ⊆ {X_1, X_2, ..., X_{j−1}}
Conditional probability:
    P(X_j = x_j | X_1 = x_1, ..., X_{j−1} = x_{j−1}) = P(X_j = x_j | pa(X_j))
Calculation of conditional probabilities:
1. w_ji = 0 for all X_i ∉ pa(X_j)
2. w_ji = 0 for i ≥ j (the network is acyclic)
Weight update rule:
    Δw_ji = η ∂L(w)/∂w_ji
11.9 Deep Belief Nets (1/4)
Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with that of Fig. 11.5, we see that unlike the Boltzmann machine, there are no connections among the visible neurons or among the hidden neurons in the RBM.
Maximum-likelihood learning in a restricted Boltzmann machine (RBM)
Sequential pre-training:
1. Update the hidden states h in parallel, given the visible states x.
2. Do the same, but in reverse: update the visible states x in parallel, given the hidden states h.
Maximum-likelihood learning:
    ∂L(w)/∂w_ji = ρ_ji^(0) − ρ_ji^(∞)
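In practice the ρ^(0) − ρ^(∞) difference is commonly approximated by a single up-down-up pass (contrastive divergence, CD-1). A minimal sketch with binary 0/1 units, no bias terms, and illustrative data; all of these choices are assumptions for brevity, not details from the text:

```python
import math, random

def sigmoid(v):
    """Numerically safe logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def sample(probs, rng):
    """Sample a layer of independent binary (0/1) units."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, W, eta, rng):
    """One contrastive-divergence (CD-1) update for a binary RBM with
    weight matrix W (hidden x visible); biases omitted for brevity."""
    H, V = len(W), len(W[0])
    # up-pass: hidden probabilities and sample, given the data vector
    ph0 = [sigmoid(sum(W[j][i] * v0[i] for i in range(V))) for j in range(H)]
    h0 = sample(ph0, rng)
    # down-pass: reconstruct the visible layer, then hidden probabilities
    pv1 = [sigmoid(sum(W[j][i] * h0[j] for j in range(H))) for i in range(V)]
    v1 = sample(pv1, rng)
    ph1 = [sigmoid(sum(W[j][i] * v1[i] for i in range(V))) for j in range(H)]
    # rho(0) - rho(1): data correlations minus reconstruction correlations
    for j in range(H):
        for i in range(V):
            W[j][i] += eta * (ph0[j] * v0[i] - ph1[j] * v1[i])
    return W

rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(2)]
W0 = [row[:] for row in W]
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(500):
    for v in data:
        cd1_step(v, W, eta=0.1, rng=rng)
```

The alternating up and down passes are exactly the two parallel update steps of the sequential pre-training described above.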
11.9 Deep Belief Nets (2/4)
Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.
Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.
11.9 Deep Belief Nets (3/4)
Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.
11.9 Deep Belief Nets (4/4)
Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.
11.10 Deterministic Annealing (1/5)
Deterministic annealing (DA) incorporates randomness into the energy function, which is then deterministically optimized at a sequence of decreasing temperatures (cf. simulated annealing: random moves on the energy surface).
Clustering via deterministic annealing:
    x: source (input) vector
    y: reconstruction (output) vector
Distortion measure:
    d(x, y) = ||x − y||^2
Expected distortion:
    D = Σ_x Σ_y P(X = x, Y = y) d(x, y)
      = Σ_x P(X = x) Σ_y P(Y = y | X = x) d(x, y)
Probability of the joint event:
    P(X = x, Y = y) = P(Y = y | X = x) P(X = x)
where P(Y = y | X = x) is the association probability.
11.10 Deterministic Annealing (2/5)
Table 11.2
Entropy as a randomness measure:
    H(X, Y) = −Σ_x Σ_y P(X = x, Y = y) log P(X = x, Y = y)
Constrained optimization of D as minimization of the Lagrangian:
    F = D − TH
    H(X, Y) = H(X) + H(Y | X)    (source entropy + conditional entropy)
    H(Y | X) = −Σ_x P(X = x) Σ_y P(Y = y | X = x) log P(Y = y | X = x)
The minimizing association probability is a Gibbs distribution:
    P(Y = y | X = x) = (1/Z_x) exp(−d(x, y)/T),  Z_x = Σ_y exp(−d(x, y)/T)
11.10 Deterministic Annealing (3/5)
Minimizing F with respect to the association probabilities P(Y = y | X = x) yields
    F* = min F = −T Σ_x P(X = x) log Z_x
Setting the derivative with respect to the code vectors to zero:
    ∂F*/∂y = Σ_x P(X = x, Y = y) ∂d(x, y)/∂y = 0  for all y ∈ Υ
The minimizing condition is
    (1/N) Σ_x P(Y = y | X = x) ∂d(x, y)/∂y = 0  for all y ∈ Υ
The deterministic annealing algorithm consists of minimizing the Lagrangian F* with respect to the code vectors at a high value of temperature T and then tracking the minimum while the temperature T is lowered.
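The algorithm alternates two deterministic steps at each temperature: compute Gibbs association probabilities, then re-estimate the code vectors, and finally lower T. A 1-D sketch with an illustrative two-group dataset and assumed schedule parameters (none of these values are from the text):

```python
import math, random

def da_clustering(data, n_codes, T0=5.0, alpha=0.85, n_temps=40, n_inner=25):
    """Deterministic-annealing clustering in 1-D: soft Gibbs associations
    at temperature T, centroid updates, then exponential cooling."""
    rng = random.Random(0)
    mean = sum(data) / len(data)
    # all code vectors start near the data mean (one effective cluster);
    # tiny perturbations let them split as T falls below a critical value
    y = [mean + 1e-3 * rng.random() for _ in range(n_codes)]
    T = T0
    for _ in range(n_temps):
        for _ in range(n_inner):
            new_y = []
            for k in range(n_codes):
                num = den = 0.0
                for x in data:
                    # association probabilities P(y_k | x) ~ exp(-d(x,y_k)/T)
                    w = [math.exp(-(x - y[m]) ** 2 / T) for m in range(n_codes)]
                    p = w[k] / sum(w)
                    num += p * x
                    den += p
                new_y.append(num / den)       # probability-weighted centroid
            y = new_y
        T *= alpha                            # cooling schedule
    return sorted(y)

data = [-4.1, -3.9, -4.0, 3.9, 4.0, 4.1]   # two well-separated groups
codes = da_clustering(data, 2)
```

As T is lowered, the soft associations harden and each code vector tracks the mean of one group, mirroring the phase-splitting behavior shown in Figure 11.13.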
11.10 Deterministic Annealing (4/5)
Figure 11.13: Clustering at various phases. The lines are equiprobability contours, p = ½ in (b), and p = ⅓ elsewhere: (a) 1 cluster (B = 0), (b) 2 clusters (B = 0.0049), (c) 3 clusters (B = 0.0056), (d) 4 clusters (B = 0.0100), (e) 5 clusters (B = 0.0156), (f) 6 clusters (B = 0.0347), and (g) 19 clusters (B = 0.0605). Here B = 1/T.
11.10 Deterministic Annealing (5/5)
Figure 11.14: Phase diagram for the Case Study in deterministic annealing. The number of effective clusters is shown for each phase. Here B = 1/T.
11.11 Analogy of DA with EM (1/2)
Suppose we view the association probability P(Y = y | X = x) as the expected value of a random binary variable V_xy defined as
    V_xy = 1 if the source vector x is assigned to code vector y
           0 otherwise
Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities P(Y = y | X = x).
2. Step 2 of DA (= M-step of EM): optimize the distortion measure d(x, y).
11.11 Analogy of DA with EM (2/2)
r: complete data, including the missing data z
d = d(r): incomplete data
Conditional pdf of r given the parameter vector θ:
    p_D(d | θ) = ∫_{ℜ(d)} p_c(r | θ) dr
where ℜ(d) is the subspace of ℜ determined by d = d(r).
Incomplete-data log-likelihood function: L(θ) = log p_D(d | θ)
Complete-data log-likelihood function: L_c(θ) = log p_c(r | θ)

Expectation-maximization (EM) algorithm
θ̂(n): value of θ at iteration n of EM
1. E-step:
    Q(θ, θ̂(n)) = E_{θ̂(n)}[L_c(θ)]
2. M-step:
    θ̂(n+1) = arg max_θ Q(θ, θ̂(n))
After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not decreased:
    L(θ̂(n+1)) ≥ L(θ̂(n))  for n = 0, 1, 2, ...
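The monotonicity property can be observed on a concrete model. A sketch of EM for a two-component 1-D Gaussian mixture in which, as a simplifying assumption, only the means are re-estimated (the data and all parameter values are illustrative, not from the text):

```python
import math

def em_gmm_1d(data, mu, n_iter=30, sigma=1.0, w=(0.5, 0.5)):
    """EM for a two-component 1-D Gaussian mixture with fixed, equal
    variances and mixing weights; only the means are re-estimated."""
    def loglik(mu):
        ll = 0.0
        for x in data:
            px = sum(wk * math.exp(-(x - mk) ** 2 / (2 * sigma ** 2)) /
                     (sigma * math.sqrt(2 * math.pi)) for wk, mk in zip(w, mu))
            ll += math.log(px)
        return ll

    history = [loglik(mu)]
    for _ in range(n_iter):
        # E-step: responsibilities (cf. the DA association probabilities)
        r = []
        for x in data:
            num = [wk * math.exp(-(x - mk) ** 2 / (2 * sigma ** 2))
                   for wk, mk in zip(w, mu)]
            s = sum(num)
            r.append([v / s for v in num])
        # M-step: means as responsibility-weighted averages
        mu = [sum(r[i][k] * data[i] for i in range(len(data))) /
              sum(r[i][k] for i in range(len(data))) for k in range(2)]
        history.append(loglik(mu))
    return mu, history

data = [-2.2, -1.8, -2.0, 1.9, 2.1, 2.0]
mu, history = em_gmm_1d(data, mu=[-1.0, 1.0])
```

The recorded log-likelihood never decreases across iterations, which is the guarantee stated above; the E-step responsibilities play the same role as the DA association probabilities.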
Summary and Discussion
• Statistical mechanics as a mathematical basis for the formulation of stochastic simulation / optimization / learning:
1. Metropolis algorithm
2. Simulated annealing
3. Gibbs sampling
• Stochastic learning machines:
1. (Classical) Boltzmann machine
2. Restricted Boltzmann machine (RBM)
3. Deep belief nets (DBN)
• Deterministic annealing (DA):
1. For optimization: connection to simulated annealing (SA)
2. For clustering: connection to expectation-maximization (EM)