Learning Bayesian networks
Most Slides by Nir Friedman
Some by Dan Geiger
2
Known Structure -- Incomplete Data
Network structure is specified; the data contains missing values.
We consider assignments to the missing values.
[Figure: an Inducer takes the known structure E, B → A with an unknown CPT P(A | E,B) (all entries "?"), together with data over E, B, A containing missing values, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>, and outputs estimated CPT entries such as .9/.1, .7/.3, .8/.2, .99/.01.]
3
Learning Parameters from Incomplete Data
Incomplete data: the posterior distributions can become interdependent.
Consequence:
ML parameters cannot be computed separately for each multinomial
The posterior is not a product of independent posteriors
[Figure: plate model with parameter nodes θ_X, θ_{Y|X=H}, θ_{Y|X=T} and per-instance nodes X[m], Y[m].]
4
Learning Parameters from Incomplete Data (cont.)
In the presence of incomplete data, the likelihood can have multiple global maxima.
Example: we can rename the values of a hidden variable H. If H has two values, the likelihood has two global maxima.
Similarly, local maxima are also replicated; many hidden variables make this a serious problem.
[Figure: network with hidden H and observed child Y.]
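A quick way to see the renaming symmetry is to swap the two values of H and check that the observed-data likelihood is unchanged. The sketch below (Python; all numbers and names are hypothetical) does exactly that for a minimal H → Y network:

    # Swapping the labels of the hidden H leaves P(Y) untouched, so both
    # parameterizations score identically: two global maxima of one likelihood.
    p_h = {"h0": 0.3, "h1": 0.7}                  # P(H)
    p_y_h = {"h0": 0.9, "h1": 0.2}                # P(Y = heads | H)

    def p_y(ph, pyh):                             # marginal P(Y = heads)
        return sum(ph[h] * pyh[h] for h in ph)

    swapped_ph = {"h0": p_h["h1"], "h1": p_h["h0"]}
    swapped_pyh = {"h0": p_y_h["h1"], "h1": p_y_h["h0"]}
    assert abs(p_y(p_h, p_y_h) - p_y(swapped_ph, swapped_pyh)) < 1e-12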
5
Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition:
If we had access to counts, we could estimate parameters.
However, missing values do not allow us to perform counts.
"Complete" the counts using the current parameter assignment.
6
Expectation Maximization (EM)
Using the current model θ, each record whose Y value is missing ("?") is completed fractionally, e.g.
    P(Y=H | X=H, Z=T, θ) = 0.3
    P(Y=H | X=T, Z=T, θ) = 0.4
and the fractions are accumulated into expected counts, e.g. N(X,Y) = 1.3, 0.4, 1.7, 1.6.
These numbers are placed for illustration; they have not been computed.
[Figure: data table over X, Y, Z with some Y entries missing ("?"), the current model over X, Y, Z, and the resulting expected-count table N(X,Y).]
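A minimal sketch of this count completion in Python (the records and probability table below are hypothetical stand-ins; in a real network, P(Y | X, Z, θ) would come from an inference call):

    # E-step count completion: each record with missing Y contributes
    # fractional counts P(Y=H | x, z, theta) and 1 - P(Y=H | x, z, theta).
    records = [("T", "T"), ("H", "T"), ("T", "T"), ("H", "T"), ("T", "H")]  # observed (X, Z); Y missing

    posterior = {("T", "T"): 0.4, ("H", "T"): 0.3}   # P(Y=H | X, Z, theta); stub for inference

    expected = {}                                     # expected counts N(X, Y)
    for x, z in records:
        p_h = posterior.get((x, z), 0.5)              # 0.5 for pairs not in the stub table
        expected[(x, "H")] = expected.get((x, "H"), 0.0) + p_h
        expected[(x, "T")] = expected.get((x, "T"), 0.0) + (1.0 - p_h)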
7
EM (cont.)
Initial network (G, θ0) with observed variables X1, X2, X3, hidden variable H, and observed variables Y1, Y2, Y3.
(E-step) Computation: from the training data and the current network, compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
(M-step) Reparameterize: use the expected counts to obtain the updated network (G, θ1).
Reiterate. A runnable miniature of this loop follows below.
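The miniature below is a self-contained Python sketch of the same loop on the smallest such network I could assume: a hidden binary H with a single observed binary child Y (standing in for the slide's X1..X3, H, Y1..Y3; all numbers are hypothetical):

    # EM for the tiny network H -> Y, where H is never observed.
    ys = ["T", "T", "H", "T", "H", "T"]                          # observed Y values

    p_h = {"H": 0.6, "T": 0.4}                                    # P(H)
    p_y_h = {("H", "H"): 0.7, ("T", "H"): 0.3,                    # P(Y | H)
             ("H", "T"): 0.2, ("T", "T"): 0.8}

    for _ in range(50):
        # E-step: expected counts N(H) and N(Y, H) from the posterior P(H | Y).
        n_h = {"H": 0.0, "T": 0.0}
        n_yh = {k: 0.0 for k in p_y_h}
        for y in ys:
            joint = {h: p_h[h] * p_y_h[(y, h)] for h in n_h}
            z = sum(joint.values())
            for h in n_h:
                n_h[h] += joint[h] / z
                n_yh[(y, h)] += joint[h] / z
        # M-step: reparameterize from the expected counts.
        total = sum(n_h.values())
        p_h = {h: n_h[h] / total for h in n_h}
        p_y_h = {(y, h): n_yh[(y, h)] / n_h[h] for (y, h) in n_yh}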
8
MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem.
Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better than the current point.
[Figure: likelihood surface L(θ|D) with the EM surrogate function constructed at the current point.]
9
EM in Practice
Initial parameters:
Random parameter setting
"Best" guess from another source
Stopping criteria:
Small change in likelihood of data
Small change in parameter values
Avoiding bad local maxima (see the sketch below):
Multiple restarts
Early "pruning" of unpromising ones
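The sketch below combines restarts with both stopping criteria on a deliberately tiny, hypothetical problem (a single coin with some flips missing; this likelihood happens to be unimodal, so all restarts agree here, but the loop structure is the point):

    import math, random

    flips = ["H", "T", None, "H", None, "T", "H"]       # None = missing value

    def log_lik(p):
        # Missing flips marginalize out: p + (1 - p) = 1 contributes log 1 = 0.
        return sum(math.log(p if x == "H" else 1 - p) for x in flips if x is not None)

    def em_step(p):
        # E-step: a missing flip is heads with probability p; M-step: reestimate.
        exp_heads = sum(x == "H" for x in flips) + p * sum(x is None for x in flips)
        return exp_heads / len(flips)

    best_p, best_ll = None, -math.inf
    for _ in range(5):                                   # multiple restarts
        p = random.uniform(0.05, 0.95)
        prev_ll = log_lik(p)
        while True:
            p_new = em_step(p)
            ll = log_lik(p_new)
            # Stop on a small change in likelihood or in parameter values.
            if abs(ll - prev_ll) < 1e-9 or abs(p_new - p) < 1e-9:
                break
            p, prev_ll = p_new, ll
        if ll > best_ll:
            best_p, best_ll = p_new, ll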
10
The setup of the EM algorithm
We start with a likelihood function parameterized by θ.
The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).
The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
    log P(x|θ) = log P(x,y|θ) - log P(y|x,θ)
(because P(x,y|θ) = P(x|θ) P(y|x,θ)).
11
The goal of the EM algorithm
The log-likelihood of an observation x has the form:
    log P(x|θ) = log P(x,y|θ) - log P(y|x,θ)
The goal: starting with a current parameter vector θ', EM aims to find a new vector θ such that P(x|θ) > P(x|θ'), with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).
For independent points (xi, yi), i=1,…,m, we can similarly write:
    Σi log P(xi|θ) = Σi log P(xi,yi|θ) - Σi log P(yi|xi,θ)
We will stick to one observation in our derivation, recalling that all derived equations can be modified by summing over x.
12
The Mathematics involved
Recall that the expectation of a random variable Y with a pdf P(y) is given by E[Y] = Σy y·p(y).
The expectation of a function L(Y) is given by E[L(Y)] = Σy L(y)·p(y).
A slightly harder example: Eθ'[log p(x,y|θ)] = Σy p(y|x,θ') log p(x,y|θ). This quantity is denoted Q(θ|θ'); a short numeric check follows below.
The expectation operator E is linear. For two random variables X, Y and constants a, b, the following holds:
    E[aX + bY] = a E[X] + b E[Y]
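The numeric check (Python; the two-value hidden variable and all probabilities are made up for illustration):

    # Q(theta | theta') = E_{theta'}[log p(x, y | theta)]
    #                   = sum_y p(y | x, theta') log p(x, y | theta)
    import math

    p_y_given_x_old = {"y0": 0.3, "y1": 0.7}         # p(y | x, theta')
    p_xy_new = {"y0": 0.2, "y1": 0.5}                # p(x, y | theta)

    Q = sum(p_y_given_x_old[y] * math.log(p_xy_new[y]) for y in p_y_given_x_old)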
13
The Mathematics involved (cont.)
Starting with log P(x|θ) = log P(x,y|θ) - log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields
    log P(x|θ) = Σy P(y|x,θ') log P(x,y|θ) - Σy P(y|x,θ') log P(y|x,θ)
where the first term is Eθ'[log p(x,y|θ)] = Q(θ|θ').
We now observe that
    Δ = log P(x|θ) - log P(x|θ')
      = Q(θ|θ') - Q(θ'|θ') + Σy P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]
and the last sum is a relative entropy, hence ≥ 0. So choosing θ* = argmaxθ Q(θ|θ') maximizes the difference Δ, and repeating this process leads to a local maximum of log P(x|θ).
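A numeric check of this inequality on a made-up one-parameter family (every number and the family itself are hypothetical):

    # Verify: log P(x|t) - log P(x|t') >= Q(t|t') - Q(t'|t'),
    # where the gap is the relative entropy term, which is >= 0.
    import math

    def model(t):
        p_xy = {"y0": 0.3 * t, "y1": 0.7 * (1 - t)}  # p(x, y | t), for one fixed x
        p_x = sum(p_xy.values())                      # p(x | t)
        post = {y: p_xy[y] / p_x for y in p_xy}       # p(y | x, t)
        return p_xy, p_x, post

    def Q(t, t_old):
        p_xy, _, _ = model(t)
        _, _, post_old = model(t_old)
        return sum(post_old[y] * math.log(p_xy[y]) for y in p_xy)

    t_old, t_new = 0.4, 0.55
    lhs = math.log(model(t_new)[1]) - math.log(model(t_old)[1])
    rhs = Q(t_new, t_old) - Q(t_old, t_old)
    assert lhs >= rhs - 1e-12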
14
The EM algorithm itself
Input: a likelihood function p(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
    E-step: compute Q(θ|θ') = Eθ'[log P(x,y|θ)]
    M-step: θ' ← argmaxθ Q(θ|θ')
until Δ = log P(x|θ) - log P(x|θ') < ε.
Comment: at the M-step one can actually choose any θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm, which is important when the argmax is hard to compute. A sketch follows below.
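A sketch of the Generalized EM idea on a toy coin model (hypothetical data; the exact M-step has a closed form here, so a half-step toward it stands in for "any θ that improves Q"):

    import math

    flips = ["H", "T", None, "H", None]               # None = missing
    H = sum(x == "H" for x in flips)
    T = sum(x == "T" for x in flips)
    M = sum(x is None for x in flips)

    def Q(p, p_old):
        # Expected complete-data log-likelihood under P(y | x, p_old).
        exp_h = H + p_old * M                          # expected heads count
        exp_t = T + (1 - p_old) * M                    # expected tails count
        return exp_h * math.log(p) + exp_t * math.log(1 - p)

    p = 0.2
    for _ in range(20):
        p_star = (H + p * M) / (H + T + M)             # exact M-step: argmax of Q(. | p)
        p_gem = p + 0.5 * (p_star - p)                 # GEM: any improving point suffices
        assert Q(p_gem, p) >= Q(p, p)                  # strict improvement until convergence
        p = p_gem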
16
Expectation Maximization (EM)
In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum.
Hence, EM is often run for a few iterations, and then Gradient Ascent steps are applied.
17
MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem.
Gradient Ascent: follow the gradient of the likelihood with respect to the parameters.
[Figure: likelihood surface L(θ|D) with gradient ascent steps.]
18
MLE from Incomplete Data
Both ideas:
Find local maxima only.
Require multiple restarts to find an approximation to the global maximum.
19
Gradient Ascent
Main result
Theorem GA:
    ∂ log P(D|θ) / ∂θ_{x_i,pa_i} = Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i,pa_i}
(here θ_{x_i,pa_i} = P(x_i | pa_i) is a CPT entry and o[m] is the m-th partially observed case).
This requires computing P(x_i, pa_i | o[m], θ) for all i, m.
Inference replaces taking derivatives.
20
Gradient Ascent (cont)
Proof:
    ∂ log P(D|θ) / ∂θ_{x_i,pa_i} = Σ_m ∂ log P(o[m]|θ) / ∂θ_{x_i,pa_i}
                                 = Σ_m [1 / P(o[m]|θ)] · ∂P(o[m]|θ) / ∂θ_{x_i,pa_i}
How do we compute ∂P(o[m]|θ) / ∂θ_{x_i,pa_i}?
21
Gradient Ascent (cont)
Since
    P(o|θ) = Σ_{x'_i,pa'_i} P(x'_i, pa'_i, o | θ)
           = Σ_{x'_i,pa'_i} P(o | x'_i, pa'_i, θ) P(x'_i | pa'_i, θ) P(pa'_i | θ)
           = Σ_{x'_i,pa'_i} P(o | x'_i, pa'_i, θ) P(pa'_i | θ) θ_{x'_i,pa'_i}
and ∂θ_{x'_i,pa'_i} / ∂θ_{x_i,pa_i} = 1 when (x'_i, pa'_i) = (x_i, pa_i) and 0 otherwise, differentiating the sum leaves only one term:
    ∂P(o|θ) / ∂θ_{x_i,pa_i} = P(o | x_i, pa_i, θ) P(pa_i | θ)
                            = P(x_i, pa_i, o | θ) / θ_{x_i,pa_i}
                            = P(o|θ) P(x_i, pa_i | o, θ) / θ_{x_i,pa_i}
22
Gradient Ascent (cont)
Putting it all together, we get
    ∂ log P(D|θ) / ∂θ_{x_i,pa_i} = Σ_m [1 / P(o[m]|θ)] · ∂P(o[m]|θ) / ∂θ_{x_i,pa_i}
                                 = Σ_m P(x_i, pa_i, o[m] | θ) / [P(o[m]|θ) θ_{x_i,pa_i}]
                                 = Σ_m P(x_i, pa_i | o[m], θ) / θ_{x_i,pa_i}
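To make the theorem concrete, here is a toy gradient ascent in Python for the smallest possible "network" (a single binary X with no parents, so inference is trivial; the data, rates, and the simple simplex projection are all hypothetical choices, not part of the slides):

    # Gradient ascent on theta_x = P(X = x) with some observations missing.
    # Per Theorem GA: d log P(D|theta) / d theta_x = sum_m P(X=x | o[m], theta) / theta_x.
    data = ["H", "T", None, "H", None, "H"]        # None = X unobserved in that record
    theta = {"H": 0.5, "T": 0.5}

    def posterior(x, obs, theta):
        # "Inference" for this one-node net: P(X = x | o[m], theta).
        if obs is None:
            return theta[x]
        return 1.0 if obs == x else 0.0

    lr = 0.05
    for _ in range(200):
        grad = {x: sum(posterior(x, o, theta) for o in data) / theta[x] for x in theta}
        mean = sum(grad.values()) / len(grad)
        # Project the gradient onto the simplex so theta remains a distribution.
        theta = {x: theta[x] + lr * (grad[x] - mean) for x in theta}
    # theta["H"] approaches 3/4, the MLE from the observed flips.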