Chapter 11. Stochastic Methods Rooted in Statistical Mechanics
Neural Networks and Learning Machines (Haykin)
Lecture Notes on Self-Learning Neural Algorithms
Byoung-Tak Zhang, School of Computer Science and Engineering
Seoul National University
Version: 20170926 → 20170928 → 20171011

Contents
11.1 Introduction
11.2 Statistical Mechanics
11.3 Markov Chains
11.4 Metropolis Algorithm
11.5 Simulated Annealing
11.6 Gibbs Sampling
11.7 Boltzmann Machine
11.8 Logistic Belief Nets
11.9 Deep Belief Nets
11.10 Deterministic Annealing (DA)
11.11 Analogy of DA with EM
Summary and Discussion
(c) 2017 Biointelligence Lab, SNU
11.1 Introduction
• Statistical mechanics as a source of ideas for unsupervised (self-organized) learning systems
• Statistical mechanics
  - The formal study of macroscopic equilibrium properties of large systems of elements that are subject to the microscopic laws of mechanics.
  - The number of degrees of freedom is enormous, making the use of probabilistic methods mandatory.
  - The concept of entropy plays a vital role in statistical mechanics, as it does in Shannon's information theory.
  - The more ordered the system, or the more concentrated the underlying probability distribution, the smaller the entropy will be.
• Statistical mechanics for the study of neural networks
  - Cragg and Temperley (1954) and Cowan (1968)
  - Boltzmann machine (Hinton and Sejnowski, 1983, 1986; Ackley et al., 1985)
11.2 Statistical Mechanics (1/2)
p_i: probability of occurrence of state i of a stochastic system
    p_i ≥ 0 (for all i)  and  Σ_i p_i = 1
E_i: energy of the system when it is in state i

In thermal equilibrium, the probability of state i is given by the canonical (Gibbs) distribution:
    p_i = (1/Z) exp(−E_i / k_B T)
    Z = Σ_i exp(−E_i / k_B T)
exp(−E / k_B T): Boltzmann factor
Z: sum over states (partition function)
We set k_B = 1 and view −log p_i as "energy."
1. States of low energy have a higher probability of occurrence than states of high energy.
2. As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
11.2 Statistical Mechanics (2/2)
Helmholtz free energy:
    F = −T log Z
Average energy:
    <E> = Σ_i p_i E_i
Hence
    <E> − F = −T Σ_i p_i log p_i
Entropy:
    H = −Σ_i p_i log p_i
Thus, we have
    <E> − F = TH,  i.e.,  F = <E> − TH

Consider two systems A and A' in thermal contact, with entropy changes ΔH and ΔH'. The total entropy tends to increase:
    ΔH + ΔH' ≥ 0
The free energy of the system, F, tends to decrease and becomes a minimum in an equilibrium situation. The resulting probability distribution is the Gibbs distribution (the principle of minimum free energy).
Nature likes to find a physical system with minimum free energy.
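The identities above can be checked numerically. A minimal sketch, assuming an arbitrary toy set of state energies (not from the text), that computes Z, F, <E>, and H for a Gibbs distribution and confirms F = <E> − TH:

```python
import math

def gibbs_stats(energies, T):
    """Partition function quantities for the Gibbs distribution over the
    given state energies (k_B = 1): free energy, average energy, entropy."""
    Z = sum(math.exp(-E / T) for E in energies)
    p = [math.exp(-E / T) / Z for E in energies]
    F = -T * math.log(Z)                       # Helmholtz free energy
    avg_E = sum(pi * Ei for pi, Ei in zip(p, energies))
    H = -sum(pi * math.log(pi) for pi in p)    # entropy
    return F, avg_E, H

# toy four-state system at temperature T = 1.5
F, avg_E, H = gibbs_stats([0.0, 1.0, 2.0, 3.0], T=1.5)
# the identity F = <E> - T*H holds for any energies and temperature
```

Since H ≥ 0, the free energy never exceeds the average energy, consistent with F becoming a minimum at equilibrium.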
Markov property:
    P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n)
Transition probability from state i at time n to state j at time n+1:
    p_ij = P(X_{n+1} = j | X_n = i)
    (p_ij ≥ 0 for all i, j  and  Σ_j p_ij = 1 for all i)
If the transition probabilities are fixed, the Markov chain is homogeneous. For a system with a finite number of possible states K, the transition probabilities constitute a K-by-K matrix (stochastic matrix):
    P = [ p_11 ... p_1K ]
        [  ...       ...  ]
        [ p_K1 ... p_KK ]
11.3 Markov Chains (1/9)
11.3 Markov Chains (2/9)
Generalization to the m-step transition probability:
    p_ij^(m) = P(X_{n+m} = x_j | X_n = x_i),  m = 1, 2, ...
    p_ij^(m+1) = Σ_k p_ik^(m) p_kj,  m = 1, 2, ...,  p_ik^(1) = p_ik
We can generalize further to the Chapman-Kolmogorov identity:
    p_ij^(m+n) = Σ_k p_ik^(m) p_kj^(n),  m, n = 1, 2, ...
11.3 Markov Chains (3/9)
Properties of Markov chains
- Recurrent: p_i = P(ever returning to state i) = 1
- Transient: p_i < 1
- Periodic with period d: the states can be divided into d disjoint subsets S_1, ..., S_d such that if i ∈ S_k and p_ij > 0, then
    j ∈ S_{k+1} for k = 1, ..., d−1
    j ∈ S_1 for k = d
- Aperiodic: d = 1
- Accessible: state j is accessible from state i if there is a finite sequence of transitions from i to j
- Communicate: states i and j communicate if each is accessible from the other
If two states communicate with each other, they belong to the same class. If all the states consist of a single class, the Markov chain is indecomposable or irreducible.
11.3 Markov Chains (4/9)
Figure 11.1: A periodic recurrent Markov chain with d = 3.
11.3 Markov Chains (5/9)
Ergodic Markov chains
Ergodicity: time average = ensemble average; i.e., the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.
v_i(k): proportion of time spent in state i after k returns,
    v_i(k) = k / Σ_{ℓ=1}^{k} T_i(ℓ)
where T_i(ℓ) is the time between successive returns to state i. Then
    lim_{k→∞} v_i(k) = π_i,  i = 1, 2, ..., K
11.3 Markov Chains (6/9)
Convergence to stationary distributions
Consider an ergodic Markov chain with stochastic matrix P, and let π^(n−1) denote the state distribution vector of the chain at time n−1. The state distribution vector at time n is
    π^(n) = π^(n−1) P
By iteration we obtain
    π^(n) = π^(n−1) P = π^(n−2) P^2 = π^(n−3) P^3 = ... = π^(0) P^n
where π^(0) is the initial value. In the limit, every row of P^n equals the stationary distribution π:
    lim_{n→∞} P^n = [ π_1 ... π_K ]   [ π ]
                    [  ...      ... ] = [ ... ]
                    [ π_1 ... π_K ]   [ π ]
Ergodic theorem:
1. lim_{n→∞} p_ij^(n) = π_j for all i
2. π_j > 0 for all j
3. Σ_{j=1}^{K} π_j = 1
4. π_j = Σ_{i=1}^{K} π_i p_ij for j = 1, 2, ..., K
11.3 Markov Chains (6/9)
Figure 11.2: State-transition diagram of the Markov chain for Example 1. The states x1 and x2 may be identified as up-to-date and behind, respectively.
P = [ 1/4  3/4 ]
    [ 1/2  1/2 ]

π^(0) = [ 1/6  5/6 ]

π^(1) = π^(0) P
      = [ 1/6  5/6 ] [ 1/4  3/4 ]
                     [ 1/2  1/2 ]
      = [ 11/24  13/24 ]

P^(2) = [ 0.4375  0.5625 ]    P^(3) = [ 0.4001  0.5999 ]    P^(4) = [ 0.4000  0.6000 ]
        [ 0.3750  0.6250 ]            [ 0.3999  0.6001 ]            [ 0.4000  0.6000 ]
11.3 Markov Chains (7/9)
Figure 11.3: State-transition diagram of the Markov chain for Example 2.
P = [  0    0    1  ]
    [ 1/3  1/6  1/2 ]
    [ 3/4  1/4   0  ]

Applying π_j = Σ_{i=1}^{K} π_i p_ij:
    π_1 = (1/3)π_2 + (3/4)π_3
    π_2 = (1/6)π_2 + (1/4)π_3
    π_3 = π_1 + (1/2)π_2
Solving, together with Σ_j π_j = 1:
    π_1 = 0.3953,  π_2 = 0.1395,  π_3 = 0.4652
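The balance equations of Example 2 can be verified exactly with rational arithmetic. A sketch, assuming the exact fractions 17/43, 6/43, 20/43, which reproduce the decimals above (the fractions are my reconstruction, not stated in the text):

```python
from fractions import Fraction as Fr

# transition matrix of Example 2 (each row sums to 1)
P = [[Fr(0),    Fr(0),    Fr(1)],
     [Fr(1, 3), Fr(1, 6), Fr(1, 2)],
     [Fr(3, 4), Fr(1, 4), Fr(0)]]

# candidate stationary distribution in exact form
pi = [Fr(17, 43), Fr(6, 43), Fr(20, 43)]

# pi_j = sum_i pi_i p_ij must hold exactly for every j (global balance)
balanced = all(sum(pi[i] * P[i][j] for i in range(3)) == pi[j]
               for j in range(3))
```

Exact fractions make the check an equality rather than a floating-point comparison.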
11.3 Markov Chains (8/9)
Figure 11.4: Classification of the states of a Markov chain and their associated long-term behavior.
11.3 Markov Chains (9/9)
Principle of detailed balance: at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition:
    π_i p_ij = π_j p_ji
Application: stationarity of π. For any state j,
    Σ_{i=1}^{K} π_i p_ij = Σ_{i=1}^{K} (π_i p_ij / π_j) π_j
                        = Σ_{i=1}^{K} p_ji π_j    (π_i p_ij = π_j p_ji, detailed balance)
                        = π_j                      (since Σ_{i=1}^{K} p_ji = 1)
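That detailed balance implies stationarity can be checked on a small chain built to satisfy it. A sketch, assuming an illustrative three-state target distribution and a symmetric proposal (this construction is not from the text):

```python
# a chain constructed so that detailed balance pi_i p_ij = pi_j p_ji holds
pi = [0.2, 0.3, 0.5]
K = len(pi)
tau = 1.0 / K                        # symmetric proposal probability
P = [[0.0] * K for _ in range(K)]
for i in range(K):
    for j in range(K):
        if i != j:
            P[i][j] = tau * min(1.0, pi[j] / pi[i])
    P[i][i] = 1.0 - sum(P[i][j] for j in range(K) if j != i)

# detailed balance holds pairwise ...
db = all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12
         for i in range(K) for j in range(K))
# ... and therefore pi is stationary: sum_i pi_i p_ij = pi_j
stat = all(abs(sum(pi[i] * P[i][j] for i in range(K)) - pi[j]) < 1e-12
           for j in range(K))
```

Off-diagonal flows equal tau * min(pi_i, pi_j), which is symmetric in i and j, so the pairwise balance is immediate.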
11.4 Metropolis Algorithm (1/3)
Metropolis algorithm: a stochastic algorithm for simulating the evolution of a physical system to thermal equilibrium. A modified Monte Carlo method; a Markov chain Monte Carlo (MCMC) method.

Algorithm Metropolis
1. X_n = x_i. Randomly generate a new state x_j.
2. ΔE = E(x_j) − E(x_i)
3. If ΔE < 0, then X_{n+1} = x_j;
   else if ΔE ≥ 0, then {
       Select a random number ξ ~ U[0,1].
       If ξ < exp(−ΔE / T), then X_{n+1} = x_j (accept);
       else X_{n+1} = x_i (reject).
   }
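The steps above can be sketched directly in code. A minimal version, assuming a finite state space with a uniform (symmetric) proposal; the toy two-state energies are illustrative, not from the text:

```python
import math, random

def metropolis(energy, states, n_steps, T, seed=0):
    """Metropolis sampler over a finite state space: a sketch of the
    algorithm above with a uniform (symmetric) proposal."""
    rng = random.Random(seed)
    x = states[0]
    counts = {s: 0 for s in states}
    for _ in range(n_steps):
        x_new = rng.choice(states)              # propose a random state
        dE = energy(x_new) - energy(x)
        if dE < 0 or rng.random() < math.exp(-dE / T):
            x = x_new                           # accept
        counts[x] += 1                          # a rejection keeps the old state
    return counts

# two-state system: the low-energy state is visited more often
E = {0: 0.0, 1: 1.0}
counts = metropolis(lambda s: E[s], [0, 1], n_steps=20000, T=1.0)
```

With energies 0 and 1 at T = 1, the visit ratio should settle near exp(1), consistent with the Gibbs distribution.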
11.4 Metropolis Algorithm (2/3)
Choice of transition probabilities
Proposed set of transition probabilities τ_ij:
1. τ_ij ≥ 0 (for all i, j): nonnegativity
2. Σ_j τ_ij = 1 (for all i): normalization
3. τ_ij = τ_ji (for all i, j): symmetry
Desired set of transition probabilities (for j ≠ i):
    p_ij = τ_ij (π_j / π_i)  for π_j / π_i < 1
    p_ij = τ_ij              for π_j / π_i ≥ 1
    p_ii = τ_ii + Σ_{j≠i} τ_ij (1 − α_ij) = 1 − Σ_{j≠i} α_ij τ_ij
Moving probability:
    α_ij = min(1, π_j / π_i)
11.4 Metropolis Algorithm (3/3)
How do we choose the ratio π_j / π_i? We choose the probability distribution to which we want the Markov chain to converge to be a Gibbs distribution:
    π_j = (1/Z) exp(−E_j / T)
    π_j / π_i = exp(−ΔE / T),  ΔE = E_j − E_i
Proof of detailed balance:
Case 1: ΔE < 0 (so π_j / π_i ≥ 1).
    π_i p_ij = π_i τ_ij = π_i τ_ji
    π_j p_ji = π_j (π_i / π_j) τ_ji = π_i τ_ji
Case 2: ΔE > 0 (so π_j / π_i < 1).
    π_i p_ij = π_i (π_j / π_i) τ_ij = π_j τ_ij = π_j τ_ji
    π_j p_ji = π_j τ_ji
In both cases π_i p_ij = π_j p_ji, so detailed balance holds.
11.5 Simulated Annealing (1/3)
Simulated annealing
• A stochastic relaxation technique for solving optimization problems.
• Improves the computational efficiency of the Metropolis algorithm.
• Makes random moves on the energy surface.
• Operate a stochastic system at a high temperature (where convergence to equilibrium is fast) and then iteratively lower the temperature (at T = 0, the Markov chain collapses on the global minima).
Two ingredients:
1. A schedule that determines the rate at which the temperature is lowered.
2. An algorithm, such as the Metropolis algorithm, that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature.
    F = <E> − TH,  lim_{T→0} F = <E>
11.5 Simulated Annealing (2/3)
1. Initial value of the temperature. The initial value T_0 of the temperature is chosen high enough to ensure that virtually all proposed transitions are accepted by the simulated-annealing algorithm.
2. Decrement of the temperature. Ordinarily, the cooling is performed exponentially, and the changes made in the value of the temperature are small. In particular, the decrement function is defined by
    T_k = α T_{k−1},  k = 1, 2, ..., K
where α is a constant smaller than, but close to, unity; typical values of α lie between 0.8 and 0.99. At each temperature, enough transitions are attempted so that there are 10 accepted transitions per experiment, on average.
3. Final value of the temperature. The system is fixed and annealing stops if the desired number of acceptances is not achieved at three successive temperatures.
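A minimal sketch of the two ingredients together: Metropolis moves inside an exponential cooling schedule T_k = α T_{k−1}. The 1-D toy energy and all parameter values are illustrative assumptions, not from the text:

```python
import math, random

def simulated_annealing(energy, neighbor, x0, T0=10.0, alpha=0.9,
                        n_temps=60, moves_per_temp=100, seed=1):
    """Simulated-annealing sketch: Metropolis moves at each temperature
    of the exponential schedule T_k = alpha * T_{k-1}."""
    rng = random.Random(seed)
    x, T, best = x0, T0, x0
    for _ in range(n_temps):
        for _ in range(moves_per_temp):
            x_new = neighbor(x, rng)
            dE = energy(x_new) - energy(x)
            if dE < 0 or rng.random() < math.exp(-dE / T):
                x = x_new                      # Metropolis accept/reject
            if energy(x) < energy(best):
                best = x                       # track the best state seen
        T *= alpha                             # exponential cooling
    return best

# toy 1-D energy with global minima at x = -2 and x = +2
f = lambda x: (x * x - 4.0) ** 2
step = lambda x, rng: x + rng.uniform(-0.5, 0.5)
x_best = simulated_annealing(f, step, x0=8.0)
```

Using the final state at one temperature as the starting point for the next is exactly ingredient 2 of the schedule above.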
11.5 Simulated Annealing (3/3)
Simulated Annealing for Combinatorial Optimization
11.6 Gibbs Sampling (1/2)
Gibbs sampling: an iterative adaptive scheme that generates a value from the conditional distribution of each component of the random vector X in turn, rather than all components at the same time.
X = (X_1, X_2, ..., X_K): a random vector of K components.
Assume we know the conditionals P(X_k | X_−k), where X_−k = (X_1, ..., X_{k−1}, X_{k+1}, ..., X_K).

Gibbs sampling algorithm (Gibbs sampler):
1. Initialize x_1(0), x_2(0), ..., x_K(0).
2. i ← 1
    x_1(i) ~ P(X_1 | x_2(i−1), x_3(i−1), ..., x_K(i−1))
    x_2(i) ~ P(X_2 | x_1(i), x_3(i−1), ..., x_K(i−1))
    x_3(i) ~ P(X_3 | x_1(i), x_2(i), x_4(i−1), ..., x_K(i−1))
    ...
    x_k(i) ~ P(X_k | x_1(i), ..., x_{k−1}(i), x_{k+1}(i−1), ..., x_K(i−1))
    ...
    x_K(i) ~ P(X_K | x_1(i), x_2(i), ..., x_{K−1}(i))
3. If the termination condition is not met, then i ← i + 1 and go to step 2.
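The scheme can be sketched for the smallest nontrivial case, K = 2 binary variables. The target joint table is an illustrative assumption, not from the text; each sweep samples each component from its full conditional:

```python
import random

# illustrative target joint over two binary variables
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def conditional(k, other):
    """P(X_k = 1 | the other variable = other), from the joint table."""
    if k == 0:
        p1, p0 = joint[(1, other)], joint[(0, other)]
    else:
        p1, p0 = joint[(other, 1)], joint[(other, 0)]
    return p1 / (p0 + p1)

def gibbs(n_sweeps, seed=0):
    rng = random.Random(seed)
    x = [0, 0]
    counts = {s: 0 for s in joint}
    for _ in range(n_sweeps):
        # visit each component in turn, sampling from its full conditional
        x[0] = 1 if rng.random() < conditional(0, x[1]) else 0
        x[1] = 1 if rng.random() < conditional(1, x[0]) else 0
        counts[(x[0], x[1])] += 1
    return counts

n = 50000
counts = gibbs(n)
```

The empirical state frequencies should approach the target joint, as the convergence theorem on the next slide asserts.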
11.6 Gibbs Sampling (2/2)
1. Convergence theorem. The random variable X_k(n) converges in distribution to the true probability distribution of X_k for k = 1, 2, ..., K as n approaches infinity; that is,
    lim_{n→∞} P(X_k(n) ≤ x | x_k(0)) = P_{X_k}(x)  for k = 1, 2, ..., K
where P_{X_k}(x) is the marginal cumulative distribution function of X_k.
2. Rate-of-convergence theorem. The joint cumulative distribution of the random variables X_1(n), X_2(n), ..., X_K(n) converges to the true joint cumulative distribution of X_1, X_2, ..., X_K at a geometric rate in n.
3. Ergodic theorem. For any measurable function g of the random variables X_1, X_2, ..., X_K whose expectation exists, we have
    lim_{n→∞} (1/n) Σ_{i=1}^{n} g(X_1(i), X_2(i), ..., X_K(i)) = E[g(X_1, X_2, ..., X_K)]
with probability 1 (i.e., almost surely).
11.7 Boltzmann Machine (1/5)
Figure 11.5: Architectural graph of the Boltzmann machine; K is the number of visible neurons, and L is the number of hidden neurons. The distinguishing features of the machine are: 1. The connections between the visible and hidden neurons are symmetric. 2. The symmetric connections are extended to the visible and hidden neurons.
A stochastic machine consisting of stochastic neurons with symmetric synaptic connections.
x: state vector of the Boltzmann machine (BM)
w_ji: synaptic connection from neuron i to neuron j
Structure (weights):
    w_ji = w_ij for all i, j
    w_ii = 0 for all i
Energy:
    E(x) = −(1/2) Σ_i Σ_{j≠i} w_ji x_i x_j
Probability:
    P(X = x) = (1/Z) exp(−E(x)/T)
11.7 Boltzmann Machine (2/5)
Consider three events:
    A: X_j = x_j
    B: {X_i = x_i}_{i=1}^{K} with i ≠ j
    C: {X_i = x_i}_{i=1}^{K}
The joint event B excludes A, and the joint event C includes both A and B.
    P(C) = P(A, B) = (1/Z) exp((1/2T) Σ_i Σ_{j≠i} w_ji x_i x_j)
    P(B) = Σ_A P(A, B) = (1/Z) Σ_{x_j} exp((1/2T) Σ_i Σ_{j≠i} w_ji x_i x_j)
The component of the exponent involving x_j is
    (x_j / 2T) Σ_{i≠j} w_ji x_i
Hence
    P(A | B) = P(A, B) / P(B) = 1 / (1 + exp(−(x_j / T) Σ_{i≠j} w_ji x_i))
That is,
    P(X_j = x | {X_i = x_i}_{i=1, i≠j}^{K}) = φ((x/T) Σ_{i≠j} w_ji x_i)
where φ(v) = 1 / (1 + exp(−v)) is the logistic (sigmoid) function.
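The conditional above is the stochastic neuron update of the Boltzmann machine. A minimal sketch for two ±1 neurons with a positive coupling (this tiny example and its weight value are illustrative assumptions, not from the text):

```python
import math, random

def phi(v):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-v))

def update_neuron(x, w, j, T, rng):
    """Resample neuron j (values +/-1) of a Boltzmann machine state x
    according to P(X_j = +1 | rest) = phi((1/T) sum_{i != j} w_ji x_i)."""
    v = sum(w[j][i] * x[i] for i in range(len(x)) if i != j) / T
    x[j] = 1 if rng.random() < phi(v) else -1

rng = random.Random(0)
w = [[0.0, 1.0], [1.0, 0.0]]   # symmetric weights, zero self-connections
x = [1, -1]
agree = 0
n_sweeps = 10000
for _ in range(n_sweeps):
    update_neuron(x, w, 0, 1.0, rng)
    update_neuron(x, w, 1, 1.0, rng)
    agree += (x[0] == x[1])
```

With a positive coupling, agreeing configurations have lower energy, so at equilibrium the two neurons agree much more often than not.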
11.7 Boltzmann Machine (3/5)
Figure 11.6: Sigmoid-shaped function P(v).
Log-likelihood of the training sample ℑ:
    L(w) = log Π_{x_α ∈ ℑ} P(X_α = x_α) = Σ_{x_α ∈ ℑ} log P(X_α = x_α)
1. Positive phase. In this phase, the network operates in its clamped condition (i.e., under the direct influence of the training sample ℑ).
2. Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input.
11.7 Boltzmann Machine (4/5)
x_α: the state of the visible neurons (a subset of x)
x_β: the state of the hidden neurons (a subset of x)
Probability of the visible state:
    P(X_α = x_α) = (1/Z) Σ_{x_β} exp(−E(x)/T)
    Z = Σ_x exp(−E(x)/T)
Log-likelihood function given the training data ℑ:
    L(w) = log P(x | w) = Σ_{x_α ∈ ℑ} [ log Σ_{x_β} exp(−E(x)/T) − log Σ_x exp(−E(x)/T) ]
Derivative of the log-likelihood function:
    ∂L(w)/∂w_ji = (1/T) Σ_{x_α ∈ ℑ} [ Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i − Σ_x P(X = x) x_j x_i ]
11.7 Boltzmann Machine (5/5)
Mean firing rate in the positive phase (clamped):
    ρ_ji^+ = <x_j x_i>^+ = Σ_{x_α ∈ ℑ} Σ_{x_β} P(X_β = x_β | X_α = x_α) x_j x_i
Mean firing rate in the negative phase (free-running):
    ρ_ji^− = <x_j x_i>^− = Σ_{x_α ∈ ℑ} Σ_x P(X = x) x_j x_i
Thus, we may write
    ∂L(w)/∂w_ji = (1/T)(ρ_ji^+ − ρ_ji^−)
Gradient ascent to maximize L(w):
    Δw_ji = ε ∂L(w)/∂w_ji = η(ρ_ji^+ − ρ_ji^−),  η = ε/T
This is the Boltzmann machine learning rule.
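For a network small enough to enumerate, both correlation phases can be computed exactly and one learning step verified to increase the log-likelihood. A sketch, assuming a three-neuron machine (two visible, one hidden) and two illustrative training patterns, none of which are from the text:

```python
import math
from itertools import product

T = 1.0
N = 3  # neurons 0 and 1 are visible, neuron 2 is hidden; states are +/-1

def energy(s, w):
    """Boltzmann machine energy E(x) = -1/2 sum_{i != j} w_ji x_i x_j."""
    return -0.5 * sum(w[i][j] * s[i] * s[j]
                      for i in range(N) for j in range(N) if i != j)

def joint(w):
    """Exact Gibbs distribution by enumeration (tiny networks only)."""
    states = list(product([-1, 1], repeat=N))
    wt = {s: math.exp(-energy(s, w) / T) for s in states}
    Z = sum(wt.values())
    return {s: v / Z for s, v in wt.items()}

def log_likelihood(data, w):
    """Log-likelihood of the visible patterns, hidden neuron summed out."""
    p = joint(w)
    return sum(math.log(p[v + (-1,)] + p[v + (1,)]) for v in data)

def learning_step(data, w, eta=0.1):
    """One Boltzmann learning step, Delta w ~ rho+ - rho-, with both
    correlation phases computed exactly by enumeration."""
    p = joint(w)
    new_w = [row[:] for row in w]
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # negative phase: free-running correlation <x_j x_i>
            rho_neg = sum(p[s] * s[i] * s[j] for s in p)
            # positive phase: correlation with the visible neurons clamped
            rho_pos = 0.0
            for v in data:
                s0, s1 = v + (-1,), v + (1,)
                norm = p[s0] + p[s1]
                rho_pos += (p[s0] * s0[i] * s0[j] + p[s1] * s1[i] * s1[j]) / norm
            rho_pos /= len(data)
            new_w[i][j] = w[i][j] + eta * (rho_pos - rho_neg)
    return new_w

data = [(1, 1), (-1, -1)]        # visible patterns to be learned
w0 = [[0.0] * N for _ in range(N)]
w1 = learning_step(data, w0)
```

The positive-phase correlations here are averaged over the training set, which rescales the gradient but preserves its direction.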
11.8 Logistic Belief Nets
Figure 11.7: Directed (logistic) belief network.
A stochastic machine consisting of multiple layers of stochastic neurons with directed synaptic connections.
Parents of node j:
    pa(X_j) ⊆ {X_1, X_2, ..., X_{j−1}}
Conditional probability:
    P(X_j = x_j | X_1 = x_1, ..., X_{j−1} = x_{j−1}) = P(X_j = x_j | pa(X_j))
Calculation of conditional probabilities:
1. w_ji = 0 for all X_i ∉ pa(X_j)
2. w_ji = 0 for i ≥ j (the network is acyclic)
Weight update rule:
    Δw_ji = η ∂L(w)/∂w_ji
11.9 Deep Belief Nets (1/4)
Figure 11.8: Neural structure of the restricted Boltzmann machine (RBM). Contrasting this with that of Fig. 11.5, we see that unlike the Boltzmann machine, there are no connections among the visible neurons or among the hidden neurons in the RBM.
Maximum-likelihood learning in a restricted Boltzmann machine (RBM)
Sequential pre-training:
1. Update the hidden states h in parallel, given the visible states x.
2. Do the same, but in reverse: update the visible states x in parallel, given the hidden states h.
Maximum-likelihood learning:
    ∂L(w)/∂w_ji = ρ_ji^(0) − ρ_ji^(∞)
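In practice the ρ^(0) − ρ^(∞) difference is commonly approximated by a single up-down-up pass (contrastive divergence, CD-1). A minimal sketch with binary 0/1 units, no bias terms, and illustrative data; all of these choices are assumptions for brevity, not details from the text:

```python
import math, random

def sigmoid(v):
    """Numerically safe logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def sample(probs, rng):
    """Sample a layer of independent binary (0/1) units."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, W, eta, rng):
    """One contrastive-divergence (CD-1) update for a binary RBM with
    weight matrix W (hidden x visible); biases omitted for brevity."""
    H, V = len(W), len(W[0])
    # up-pass: hidden probabilities and sample, given the data vector
    ph0 = [sigmoid(sum(W[j][i] * v0[i] for i in range(V))) for j in range(H)]
    h0 = sample(ph0, rng)
    # down-pass: reconstruct the visible layer, then hidden probabilities
    pv1 = [sigmoid(sum(W[j][i] * h0[j] for j in range(H))) for i in range(V)]
    v1 = sample(pv1, rng)
    ph1 = [sigmoid(sum(W[j][i] * v1[i] for i in range(V))) for j in range(H)]
    # rho(0) - rho(1): data correlations minus reconstruction correlations
    for j in range(H):
        for i in range(V):
            W[j][i] += eta * (ph0[j] * v0[i] - ph1[j] * v1[i])
    return W

rng = random.Random(0)
W = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(2)]
W0 = [row[:] for row in W]
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for _ in range(500):
    for v in data:
        cd1_step(v, W, eta=0.1, rng=rng)
```

The alternating up and down passes are exactly the two parallel update steps of the sequential pre-training described above.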
11.9 Deep Belief Nets (2/4)
Figure 11.9: Top-down learning, using a logistic belief network of infinite depth.
Figure 11.10: A hybrid generative model in which the two top layers form a restricted Boltzmann machine and the lower two layers form a directed model. The weights shown with blue shaded arrows are not part of the generative model; they are used to infer the feature values given the data, but they are not used for generating data.
11.9 Deep Belief Nets (3/4)
Figure 11.11: Illustrating the progression of alternating Gibbs sampling in an RBM. After sufficiently many steps, the visible and hidden vectors are sampled from the stationary distribution defined by the current parameters of the model.
11.9 Deep Belief Nets (4/4)
Figure 11.12: The task of modeling the sensory (visible) data is divided into two subtasks.
11.10 Deterministic Annealing (1/5)
Deterministic annealing (DA) incorporates randomness into the energy function, which is then deterministically optimized at a sequence of decreasing temperatures (cf. simulated annealing: random moves on the energy surface).
Clustering via deterministic annealing:
    x: source (input) vector
    y: reconstruction (output) vector
Distortion measure:
    d(x, y) = ||x − y||^2
Expected distortion:
    D = Σ_x Σ_y P(X = x, Y = y) d(x, y)
      = Σ_x P(X = x) Σ_y P(Y = y | X = x) d(x, y)
Probability of the joint event:
    P(X = x, Y = y) = P(Y = y | X = x) P(X = x)
where P(Y = y | X = x) is the association probability.
11.10 Deterministic Annealing (2/5)
Table 11.2
Entropy as a randomness measure:
    H(X, Y) = −Σ_x Σ_y P(X = x, Y = y) log P(X = x, Y = y)
Constrained optimization of D as minimization of the Lagrangian:
    F = D − TH
    H(X, Y) = H(X) + H(Y | X)    (source entropy + conditional entropy)
    H(Y | X) = −Σ_x P(X = x) Σ_y P(Y = y | X = x) log P(Y = y | X = x)
The minimizing association probability is a Gibbs distribution:
    P(Y = y | X = x) = (1/Z_x) exp(−d(x, y)/T),  Z_x = Σ_y exp(−d(x, y)/T)
11.10 Deterministic Annealing (3/5)
Minimizing F with respect to the association probabilities P(Y = y | X = x) yields
    F* = min F = −T Σ_x P(X = x) log Z_x
Setting the derivative with respect to the code vectors to zero:
    ∂F*/∂y = Σ_x P(X = x, Y = y) ∂d(x, y)/∂y = 0  for all y ∈ Υ
The minimizing condition is
    (1/N) Σ_x P(Y = y | X = x) ∂d(x, y)/∂y = 0  for all y ∈ Υ
The deterministic annealing algorithm consists of minimizing the Lagrangian F* with respect to the code vectors at a high value of temperature T and then tracking the minimum while the temperature T is lowered.
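The algorithm alternates two deterministic steps at each temperature: compute Gibbs association probabilities, then re-estimate the code vectors, and finally lower T. A 1-D sketch with an illustrative two-group dataset and assumed schedule parameters (none of these values are from the text):

```python
import math, random

def da_clustering(data, n_codes, T0=5.0, alpha=0.85, n_temps=40, n_inner=25):
    """Deterministic-annealing clustering in 1-D: soft Gibbs associations
    at temperature T, centroid updates, then exponential cooling."""
    rng = random.Random(0)
    mean = sum(data) / len(data)
    # all code vectors start near the data mean (one effective cluster);
    # tiny perturbations let them split as T falls below a critical value
    y = [mean + 1e-3 * rng.random() for _ in range(n_codes)]
    T = T0
    for _ in range(n_temps):
        for _ in range(n_inner):
            new_y = []
            for k in range(n_codes):
                num = den = 0.0
                for x in data:
                    # association probabilities P(y_k | x) ~ exp(-d(x,y_k)/T)
                    w = [math.exp(-(x - y[m]) ** 2 / T) for m in range(n_codes)]
                    p = w[k] / sum(w)
                    num += p * x
                    den += p
                new_y.append(num / den)       # probability-weighted centroid
            y = new_y
        T *= alpha                            # cooling schedule
    return sorted(y)

data = [-4.1, -3.9, -4.0, 3.9, 4.0, 4.1]   # two well-separated groups
codes = da_clustering(data, 2)
```

As T is lowered, the soft associations harden and each code vector tracks the mean of one group, mirroring the phase-splitting behavior shown in Figure 11.13.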
11.10 Deterministic Annealing (4/5)
Figure 11.13: Clustering at various phases. The lines are equiprobability contours, p = ½ in (b), and p = ⅓ elsewhere: (a) 1 cluster (B = 0), (b) 2 clusters (B = 0.0049), (c) 3 clusters (B = 0.0056), (d) 4 clusters (B = 0.0100), (e) 5 clusters (B = 0.0156), (f) 6 clusters (B = 0.0347), and (g) 19 clusters (B = 0.0605). Here B = 1/T.
11.10 Deterministic Annealing (5/5)
Figure 11.14: Phase diagram for the Case Study in deterministic annealing. The number of effective clusters is shown for each phase. Here B = 1/T.
11.11 Analogy of DA with EM (1/2)
Suppose we view the association probability P(Y = y | X = x) as the expected value of a random binary variable V_xy defined as
    V_xy = 1 if the source vector x is assigned to code vector y
           0 otherwise
Then the two steps of DA correspond to the two steps of EM:
1. Step 1 of DA (= E-step of EM): compute the association probabilities P(Y = y | X = x).
2. Step 2 of DA (= M-step of EM): optimize the distortion measure d(x, y).
11.11 Analogy of DA with EM (2/2)
r: complete data, including the missing data z
d = d(r): incomplete data
Conditional pdf of r given the parameter vector θ:
    p_D(d | θ) = ∫_{ℜ(d)} p_c(r | θ) dr
where ℜ(d) is the subspace of ℜ determined by d = d(r).
Incomplete-data log-likelihood function: L(θ) = log p_D(d | θ)
Complete-data log-likelihood function: L_c(θ) = log p_c(r | θ)

Expectation-maximization (EM) algorithm
θ̂(n): value of θ at iteration n of EM
1. E-step:
    Q(θ, θ̂(n)) = E_{θ̂(n)}[L_c(θ)]
2. M-step:
    θ̂(n+1) = arg max_θ Q(θ, θ̂(n))
After an iteration of the EM algorithm, the incomplete-data log-likelihood function is not decreased:
    L(θ̂(n+1)) ≥ L(θ̂(n))  for n = 0, 1, 2, ...
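The monotonicity property can be observed on a concrete model. A sketch of EM for a two-component 1-D Gaussian mixture in which, as a simplifying assumption, only the means are re-estimated (the data and all parameter values are illustrative, not from the text):

```python
import math

def em_gmm_1d(data, mu, n_iter=30, sigma=1.0, w=(0.5, 0.5)):
    """EM for a two-component 1-D Gaussian mixture with fixed, equal
    variances and mixing weights; only the means are re-estimated."""
    def loglik(mu):
        ll = 0.0
        for x in data:
            px = sum(wk * math.exp(-(x - mk) ** 2 / (2 * sigma ** 2)) /
                     (sigma * math.sqrt(2 * math.pi)) for wk, mk in zip(w, mu))
            ll += math.log(px)
        return ll

    history = [loglik(mu)]
    for _ in range(n_iter):
        # E-step: responsibilities (cf. the DA association probabilities)
        r = []
        for x in data:
            num = [wk * math.exp(-(x - mk) ** 2 / (2 * sigma ** 2))
                   for wk, mk in zip(w, mu)]
            s = sum(num)
            r.append([v / s for v in num])
        # M-step: means as responsibility-weighted averages
        mu = [sum(r[i][k] * data[i] for i in range(len(data))) /
              sum(r[i][k] for i in range(len(data))) for k in range(2)]
        history.append(loglik(mu))
    return mu, history

data = [-2.2, -1.8, -2.0, 1.9, 2.1, 2.0]
mu, history = em_gmm_1d(data, mu=[-1.0, 1.0])
```

The recorded log-likelihood never decreases across iterations, which is the guarantee stated above; the E-step responsibilities play the same role as the DA association probabilities.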
Summary and Discussion
• Statistical mechanics as a mathematical basis for the formulation of stochastic simulation / optimization / learning:
1. Metropolis algorithm
2. Simulated annealing
3. Gibbs sampling
• Stochastic learning machines:
1. (Classical) Boltzmann machine
2. Restricted Boltzmann machine (RBM)
3. Deep belief nets (DBN)
• Deterministic annealing (DA):
1. For optimization: connection to simulated annealing (SA)
2. For clustering: connection to expectation-maximization (EM)