![Page 1: Probabilistic Graphical Modelsepxing/Class/10708-16/slide/lecture4... · 2016. 2. 8. · Learning Graphical Models Scenarios: completely observed GMs directed undirected partially](https://reader034.vdocuments.mx/reader034/viewer/2022051902/5ff148e9e93f4910853fd37b/html5/thumbnails/1.jpg)
School of Computer Science
Probabilistic Graphical Models
Parameter Est. in fully observed BNs
Eric Xing
Lecture 4, January 25, 2016
Reading: KF-chap 17
[Figure: example graphical-model structures over X1–X4]
© Eric Xing @ CMU, 2005-2016 1
Learning Graphical Models

The goal: given a set of independent samples (assignments of random variables), find the best (e.g., the most likely) Bayesian network, both the DAG and the CPDs, e.g.:

(B,E,A,C,R) = (T,F,F,T,F)
(B,E,A,C,R) = (T,F,T,T,F)
...
(B,E,A,C,R) = (F,T,T,T,F)
[Figure: BN over nodes B, E, A, C, R, with an example CPT P(A | E, B) whose rows over the configurations of (B, E) are (0.9, 0.1), (0.2, 0.8), (0.9, 0.1), (0.01, 0.99)]
Structural learning
Parameter learning
Learning Graphical Models

Scenarios:
- completely observed GMs: directed / undirected
- partially or unobserved GMs: directed / undirected (an open research topic)

Estimation principles:
- Maximum likelihood estimation (MLE)
- Bayesian estimation
- Maximal conditional likelihood
- Maximal "margin"
- Maximum entropy

We use "learning" as a name for the process of estimating the parameters, and in some cases the topology of the network, from data.
ML Structural Learning for completely observed GMs
[Figure: example BN over E, R, B, A, C]

Data:
$(x_1^{(1)}, \ldots, x_n^{(1)})$
$(x_1^{(2)}, \ldots, x_n^{(2)})$
$\ldots$
$(x_1^{(M)}, \ldots, x_n^{(M)})$
Two "Optimal" Approaches

"Optimal" here means the employed algorithms guarantee to return a structure that maximizes the objective (e.g., log-likelihood).
- Many heuristics used to be popular, but they provide no guarantee of attaining optimality or interpretability, or even lack an explicit objective; e.g., structural EM, module networks, greedy structural search, deep learning via auto-encoders, etc.

We will learn two classes of algorithms for guaranteed structure learning, which are likely the only known methods enjoying such guarantees, but they apply only to certain families of graphs:
- Trees: the Chow-Liu algorithm (this lecture)
- Pairwise MRFs: covariance selection, neighborhood selection (later)
Structural Search

How many graphs over n nodes? $O(2^{n^2})$
How many trees over n nodes? $O(n!)$

But it turns out that we can find the exact optimal tree (under MLE)!
- Trick: the MLE score decomposes into edge-related terms
- Trick: in a tree, each node has only one parent!
- The Chow-Liu algorithm
Information-Theoretic Interpretation of ML

$$
\begin{aligned}
\ell(\theta_G; D) &= \log p(D \mid \theta_G, G) \\
&= \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big) \\
&= \sum_i \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \frac{\operatorname{count}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{M} \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i\mid\pi_i(G)}\big)
\end{aligned}
$$

From sum over data points to sum over counts of variable states.
Information-Theoretic Interpretation of ML (cont'd)

$$
\begin{aligned}
\ell(\hat\theta_G; D) &= \log \hat{p}(D \mid \hat\theta_G, G) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \hat{p}\big(x_i \mid \mathbf{x}_{\pi_i(G)}\big) \\
&= M \sum_i \sum_{x_i,\,\mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \frac{\hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{\hat{p}(x_i)\,\hat{p}\big(\mathbf{x}_{\pi_i(G)}\big)} + M \sum_i \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i) \\
&= M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\end{aligned}
$$

Decomposable score and a function of the graph structure.
Chow-Liu Tree Learning Algorithm

Objective function:
$$\ell(\hat\theta_G; D) = \log \hat{p}(D \mid \hat\theta_G, G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\quad\Rightarrow\quad C(G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)$$

Chow-Liu:
- For each pair of variables $x_i$ and $x_j$:
  - Compute the empirical distribution: $\hat{p}(X_i, X_j) = \dfrac{\operatorname{count}(x_i, x_j)}{M}$
  - Compute the mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \dfrac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\,\hat{p}(x_j)}$
- Define a graph with nodes $x_1, \ldots, x_n$
- Edge $(i, j)$ gets weight $\hat{I}(X_i, X_j)$
Chow-Liu Algorithm (cont'd)

Objective function:
$$C(G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)$$

Chow-Liu:
- Optimal tree BN: compute the maximum-weight spanning tree
- Direction in the BN: pick any node as root, then do a breadth-first search to define edge directions
- I-equivalence: different rootings of the same spanning tree yield I-equivalent BNs with the same score; e.g., for a tree over A, B, C, D, E,
$$C(G) = \hat{I}(A,B) + \hat{I}(A,C) + \hat{I}(C,D) + \hat{I}(C,E)$$
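The two steps above (pairwise empirical mutual information as edge weights, then a maximum-weight spanning tree) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code; the function names and the toy chain data in the test are hypothetical.

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) from paired discrete samples."""
    mi = 0.0
    for xv in set(x.tolist()):
        for yv in set(y.tolist()):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(X):
    """Edges of the maximum-weight spanning tree under pairwise MI.
    X: (M, n) array of M samples over n discrete variables."""
    M, n = X.shape
    # weight every candidate edge (i, j) by empirical mutual information
    edges = sorted(((empirical_mi(X[:, i], X[:, j]), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))          # union-find for Kruskal's algorithm
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

Rooting the returned tree at any node and orienting edges by BFS gives one member of the I-equivalence class.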
Structure Learning for General Graphs

Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.

Most structure learning approaches use heuristics that exploit score decomposition. Two heuristics that exploit decomposition in different ways:
- Greedy search through the space of node orders
- Local search over graph structures
ML Parameter Est. for completely observed GMs of given structure

[Figure: two-node GM Z → X]

The data: {(z1, x1), (z2, x2), (z3, x3), ..., (zN, xN)}
Parameter Learning

Assume G is known and fixed:
- from expert design, or
- from an intermediate outcome of iterative structure learning.

Goal: estimate θ from a dataset of N independent, identically distributed (iid) training cases D = {x_1, ..., x_N}.

In general, each training case $x_n = (x_{n,1}, \ldots, x_{n,M})$ is a vector of M values, one per node; the model can be:
- completely observable, i.e., every element in $x_n$ is known (no missing values, no hidden variables), or
- partially observable, i.e., $\exists i$ s.t. $x_{n,i}$ is not observed.

In this lecture we consider learning parameters for a BN with given structure that is completely observable:
$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) = \sum_i \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big)$$
Review of Density Estimation

- Can be viewed as single-node graphical models
- Instances of exponential family distributions
- Building blocks of general GMs
- MLE and Bayesian estimates

[GM: iid observations x1, x2, ..., xN; plate notation x_i, N]
Discrete Distributions

Bernoulli distribution: Ber(p)
$$P(x) = \begin{cases} 1 - p & \text{for } x = 0 \\ p & \text{for } x = 1 \end{cases}
\qquad\Leftrightarrow\qquad P(x) = p^x (1-p)^{1-x}$$

Multinomial distribution: Mult(1, θ)

Multinomial (indicator) variable:
$$X = [X_1, X_2, X_3, X_4, X_5, X_6], \quad X_j \in \{0, 1\}, \quad \sum_{j \in [1,\ldots,6]} X_j = 1,$$
where $X_j = 1$ w.p. $\theta_j$ and $\sum_{j \in [1,\ldots,6]} \theta_j = 1$.

$$p(x) = P(\{X_j = 1\},\ \text{where } j \text{ indexes the die face}) = \theta_j,
\qquad\text{e.g. } p(x) = \theta_A^{x_A}\,\theta_C^{x_C}\,\theta_G^{x_G}\,\theta_T^{x_T} = \prod_k \theta_k^{x_k}$$
Discrete Distributions (cont'd)

Multinomial distribution: Mult(n, θ)

Count variable:
$$n = \begin{bmatrix} n_1 \\ \vdots \\ n_K \end{bmatrix}, \qquad \text{where } \sum_k n_k = N$$

$$p(n) = \frac{N!}{n_1!\, n_2! \cdots n_K!}\, \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_K^{n_K} = \frac{N!}{n_1!\, n_2! \cdots n_K!} \prod_k \theta_k^{n_k}$$
Example: Multinomial Model

Data:
- We observe N iid die rolls (K-sided): D = {5, 1, K, ..., 3}

Representation:
- Unit basis vectors: $x_n = [x_{n,1}, x_{n,2}, \ldots, x_{n,K}]$, where $x_{n,k} \in \{0, 1\}$ and $\sum_{k=1}^K x_{n,k} = 1$

Model:
$$X_n:\quad P(\{X_{n,k} = 1\}) = \theta_k,\ k = 1, \ldots, K, \qquad \sum_k \theta_k = 1$$

How to write the likelihood of a single observation $x_n$?
$$P(x_n \mid \theta) = P\big(\{x_{n,k} = 1,\ \text{where } k \text{ indexes the die side of the } n\text{th roll}\}\big) = \theta_k = \prod_{k=1}^K \theta_k^{x_{n,k}}$$

The likelihood of dataset D = {x_1, ..., x_N}:
$$P(x_1, x_2, \ldots, x_N \mid \theta) = \prod_{n=1}^N P(x_n \mid \theta) = \prod_{n=1}^N \prod_k \theta_k^{x_{n,k}} = \prod_k \theta_k^{\sum_n x_{n,k}} = \prod_k \theta_k^{n_k}$$

[GM: iid observations x1, x2, ..., xN; plate notation x_i, N]
MLE: Constrained Optimization with Lagrange Multipliers

Objective function:
$$\ell(\theta; D) = \log P(D \mid \theta) = \log \prod_k \theta_k^{n_k} = \sum_k n_k \log \theta_k$$

We need to maximize this subject to the constraint $\sum_{k=1}^K \theta_k = 1$.

Constrained cost function with a Lagrange multiplier:
$$\tilde{\ell} = \sum_{k=1}^K n_k \log \theta_k + \lambda \Big( 1 - \sum_{k=1}^K \theta_k \Big)$$

Take derivatives w.r.t. $\theta_k$:
$$\frac{\partial \tilde{\ell}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0
\;\Rightarrow\; \lambda = \sum_k n_k = N
\;\Rightarrow\; \theta_{k,\mathrm{MLE}} = \frac{n_k}{N}
\quad\text{or}\quad \theta_{k,\mathrm{MLE}} = \frac{1}{N} \sum_n x_{n,k}$$
Frequency as sample mean.

Sufficient statistics:
- The counts $n = (n_1, \ldots, n_K)$, with $n_k = \sum_n x_{n,k}$, are sufficient statistics of data D.
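The closed-form result above, θ_k = n_k/N, i.e. the sample frequency, is easy to check numerically. A minimal sketch in NumPy; the eight die rolls are hypothetical toy data:

```python
import numpy as np

# hypothetical observations: 8 rolls of a 6-sided die (sides 1..6)
rolls = np.array([5, 1, 6, 3, 3, 5, 2, 5])
K = 6
X = np.eye(K)[rolls - 1]        # one-hot representation x_{n,k}
n_k = X.sum(axis=0)             # sufficient statistics: the counts n_k
theta_mle = n_k / len(rolls)    # theta_k = n_k / N (frequency as sample mean)
```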
Bayesian Estimation

Dirichlet distribution (prior):
$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$$

Posterior distribution of θ:
$$P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} \propto \prod_k \theta_k^{n_k} \prod_k \theta_k^{\alpha_k - 1} = \prod_k \theta_k^{\alpha_k + n_k - 1}$$

Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior. The Dirichlet parameters can be understood as pseudo-counts.

Posterior mean estimation:
$$\theta_k = \int \theta_k\, p(\theta \mid D)\, d\theta = C \int \theta_k \prod_{k'} \theta_{k'}^{\alpha_{k'} + n_{k'} - 1}\, d\theta = \frac{n_k + \alpha_k}{N + |\alpha|}$$

[GM: iid observations x1, x2, ..., xN with Dirichlet prior over θ; plate notation x_i, N]
More on the Dirichlet Prior

Where does the normalizing constant C(α) come from?
$$\int \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}\, d\theta_1 \cdots d\theta_K = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma\big(\sum_k \alpha_k\big)} = C(\alpha)^{-1}$$
- Integration by parts; Γ(·) is the gamma function: $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$. For integers, $\Gamma(n+1) = n!$

Marginal likelihood:
$$p(x_1, \ldots, x_N \mid \alpha) = \int p(x \mid \theta)\, p(\theta \mid \alpha)\, d\theta = \frac{C(\alpha)}{C(\alpha + n)}$$

Posterior in closed form:
$$P(\theta \mid \{x_1, \ldots, x_N\}, \alpha) = \frac{p(x \mid \theta)\, p(\theta \mid \alpha)}{p(n \mid \alpha)} = C(n + \alpha) \prod_k \theta_k^{\alpha_k + n_k - 1} = \mathrm{Dir}(\alpha + n)$$

Posterior predictive rate:
$$p(x_{N+1} = i \mid \{x_1, \ldots, x_N\}, \alpha) = \int p(x_{N+1} = i \mid \theta)\, p(\theta \mid n, \alpha)\, d\theta = \frac{n_i + \alpha_i}{N + |\alpha|}$$
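The posterior predictive rate reduces to smoothed counts, (n_i + α_i)/(N + |α|). A small numeric check; the pseudo-counts and observed counts below are hypothetical:

```python
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # hypothetical Dirichlet pseudo-counts
n = np.array([5, 1, 0])             # observed counts, N = 6
# p(x_{N+1} = i | D, alpha) = (n_i + alpha_i) / (N + |alpha|)
pred = (n + alpha) / (n.sum() + alpha.sum())
```

Even the never-observed third outcome gets nonzero predictive probability, which is exactly the smoothing effect of the pseudo-counts.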
Sequential Bayesian Updating

- Start with a Dirichlet prior: $P(\theta \mid \alpha) = \mathrm{Dir}(\theta : \alpha)$
- Observe N′ samples with sufficient statistics n′. The posterior becomes: $P(\theta \mid \alpha, n') = \mathrm{Dir}(\theta : \alpha + n')$
- Observe another N″ samples with sufficient statistics n″. The posterior becomes: $P(\theta \mid \alpha, n', n'') = \mathrm{Dir}(\theta : \alpha + n' + n'')$

So sequentially absorbing data in any order is equivalent to a batch update.
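Because the Dirichlet update is just addition of count vectors, the order-independence claim above can be checked directly. A sketch with hypothetical counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior pseudo-counts
n1 = np.array([3, 0, 1])            # sufficient statistics of first batch (N' = 4)
n2 = np.array([2, 4, 0])            # sufficient statistics of second batch (N'' = 6)

seq = (alpha + n1) + n2             # sequential absorption: Dir(alpha + n') then + n''
batch = alpha + (n1 + n2)           # single batch update: Dir(alpha + n' + n'')
posterior_mean = seq / seq.sum()    # (alpha_k + n_k) / (|alpha| + N)
```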
Hierarchical Bayesian Models

- θ are the parameters for the likelihood p(x | θ)
- α are the parameters for the prior p(θ | α)
- We can have hyper-hyper-parameters, etc.
- We stop when the choice of hyper-parameters makes no difference to the marginal likelihood; typically we make the top-level hyper-parameters constants.

Where do we get the prior?
- Intelligent guesses
- Empirical Bayes (Type-II maximum likelihood): computing point estimates of α:
$$\alpha_{\mathrm{MLE}} = \arg\max_\alpha\, p(n \mid \alpha)$$
Limitation of the Dirichlet Prior

[Figure: plate model with Dirichlet prior α over θ and observations x_i, i = 1..N]
The Logistic Normal Prior

$$\theta \sim \mathrm{LN}_{K-1}(\mu, \Sigma): \qquad
\gamma \sim \mathcal{N}_{K-1}(\mu, \Sigma),\quad \gamma_K = 0,\quad
\theta_i = \frac{e^{\gamma_i}}{C(\gamma)},\quad C(\gamma) = \sum_{i=1}^K e^{\gamma_i}$$

$$\log C(\gamma) = \log \sum_{i=1}^K e^{\gamma_i}
\quad\text{— log partition function / normalization constant}$$

[GM: γ ~ N(μ, Σ) generating θ, with observations x_i, i = 1..N]

- Pro: covariance structure
- Con: non-conjugate (we will discuss how to solve this later)
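Sampling from a logistic normal is just a Gaussian draw followed by a softmax. A minimal sketch, assuming the γ_K = 0 convention above; μ and Σ below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.zeros(2)                                    # K - 1 = 2 free components
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])                      # correlated components (the "pro")
gamma = rng.multivariate_normal(mu, Sigma, size=5)  # gamma ~ N_{K-1}(mu, Sigma)
gamma = np.hstack([gamma, np.zeros((5, 1))])        # fix gamma_K = 0
# theta_i = exp(gamma_i) / C(gamma), with C(gamma) = sum_i exp(gamma_i)
theta = np.exp(gamma) / np.exp(gamma).sum(axis=1, keepdims=True)
```

Unlike a Dirichlet draw, nearby components of θ can be positively or negatively correlated through Σ.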
Logistic Normal Densities
Continuous Distributions

Uniform probability density function:
$$p(x) = \begin{cases} 1/(b - a) & \text{for } a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$$

Normal (Gaussian) probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}$$
- The distribution is symmetric, and is often illustrated as a bell-shaped curve.
- Two parameters, µ (mean) and σ (standard deviation), determine the location and shape of the distribution.
- The highest point on the normal curve is at the mean, which is also the median and mode.
- The mean can be any numerical value: negative, zero, or positive.

Multivariate Gaussian:
$$p(X; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu) \Big\}$$
MLE for a Multivariate Gaussian

It can be shown that the MLE for µ and Σ is
$$\mu_{\mathrm{MLE}} = \frac{1}{N} \sum_n x_n, \qquad
\Sigma_{\mathrm{MLE}} = \frac{1}{N} \sum_n (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T = \frac{1}{N} S$$

where the scatter matrix is
$$S = \sum_n (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T = \sum_n x_n x_n^T - N \mu_{\mathrm{ML}} \mu_{\mathrm{ML}}^T$$

The sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^T$.

With $x_n = (x_{n,1}, \ldots, x_{n,K})^T$ stacked as the rows of the data matrix $X = (x_1^T; x_2^T; \ldots; x_N^T)$, note that $X^T X = \sum_n x_n x_n^T$ may not be full rank (e.g., if N < D), in which case $\Sigma_{\mathrm{ML}}$ is not invertible.
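The two equivalent forms of Σ_ML above (centered outer products vs. the sufficient-statistics form) can be verified numerically. A minimal NumPy sketch on hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))              # N = 500 samples, D = 3
N = len(X)
mu_ml = X.mean(axis=0)                     # mu_ML = (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N       # (1/N) sum_n (x_n - mu)(x_n - mu)^T
# the same estimate from the sufficient statistics sum_n x_n and sum_n x_n x_n^T
Sigma_alt = X.T @ X / N - np.outer(mu_ml, mu_ml)
```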
Bayesian Parameter Estimation for a Gaussian

There are various reasons to pursue a Bayesian approach:
- We would like to update our estimates sequentially over time.
- We may have prior knowledge about the expected magnitude of the parameters.
- The MLE for Σ may not be full rank if we don't have enough data.

We will restrict our attention to conjugate priors.

We will consider various cases, in order of increasing complexity:
- Known σ, unknown µ
- Known µ, unknown σ
- Unknown µ and σ
Bayesian Estimation: Unknown µ, Known σ

Normal prior:
$$P(\mu) = (2\pi\sigma_0^2)^{-1/2} \exp\big\{ -(\mu - \mu_0)^2 / 2\sigma_0^2 \big\}$$

Joint probability:
$$P(x, \mu) = (2\pi\sigma^2)^{-N/2} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 \Big\} \times (2\pi\sigma_0^2)^{-1/2} \exp\big\{ -(\mu - \mu_0)^2 / 2\sigma_0^2 \big\}$$

Posterior:
$$P(\mu \mid x) = (2\pi\tilde\sigma^2)^{-1/2} \exp\big\{ -(\mu - \tilde\mu)^2 / 2\tilde\sigma^2 \big\}$$
where
$$\tilde\mu = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0,
\qquad \tilde\sigma^2 = \Big( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \Big)^{-1},$$
and $\bar{x} = \frac{1}{N}\sum_n x_n$ is the sample mean.

[GM: observations x1, x2, ..., xN generated from µ; plate notation x_i, N]
Bayesian Estimation: Unknown µ, Known σ (cont'd)

$$\mu_N = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\,\bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\,\mu_0,
\qquad \frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}$$

- The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to the relative noise levels.
- The precision of the posterior, $1/\sigma_N^2$, is the precision of the prior, $1/\sigma_0^2$, plus one contribution of data precision, $1/\sigma^2$, for each observed data point.
- Sequentially updating the mean: µ∗ = 0.8 (unknown), (σ²)∗ = 0.1 (known)
- Effect of a single data point:
$$\mu_1 = \mu_0 + \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\,(x - \mu_0)$$
- Uninformative (vague/flat) prior, $\sigma_0^2 \to \infty$: $\mu_N \to \bar{x}$
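The convex-combination and additive-precision properties above can be checked with a short computation. A sketch using the slide's µ∗ = 0.8, (σ²)∗ = 0.1 scenario; the prior settings and sample size are hypothetical:

```python
import numpy as np

sigma2 = 0.1                                # known observation variance
mu0, sigma0_2 = 0.0, 1.0                    # prior N(mu0, sigma0^2)
rng = np.random.default_rng(2)
x = rng.normal(0.8, np.sqrt(sigma2), size=50)
N, xbar = len(x), x.mean()

prec_N = N / sigma2 + 1.0 / sigma0_2        # posterior precision = prior + N data terms
mu_N = (N / sigma2 * xbar + mu0 / sigma0_2) / prec_N
sigma_N2 = 1.0 / prec_N
```

The posterior mean always lies between the prior mean and the sample mean, and the posterior variance is strictly smaller than both σ²/N and σ₀².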
Other Scenarios

- Known µ, unknown λ = 1/σ²:
  - The conjugate prior for λ is a Gamma with shape a₀ and rate (inverse scale) b₀
  - The conjugate prior for σ² is an Inverse-Gamma
- Unknown µ and unknown σ²:
  - The conjugate prior is Normal-Inverse-Gamma
  - A semi-conjugate prior is also possible
- Multivariate case:
  - The conjugate prior is Normal-Inverse-Wishart

© Eric Xing @ CMU, 2005-2016 31
Estimation of Conditional Density

- Can be viewed as two-node graphical models
- Instances of GLIM
- Building blocks of general GMs
- MLE and Bayesian estimates

See supplementary slides.

[Figure: two-node GMs over Q and X]
MLE for General BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big) = \sum_i \sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big)$$

[Figure: decomposition illustrated with local data slices, e.g. conditioning on X2 = 0/1 and X5 = 0/1]
Plates

A plate is a "macro" that allows subgraphs to be replicated.

For iid (exchangeable) data, the likelihood is
$$p(D \mid \theta) = \prod_n p(x_n \mid \theta)$$

We can represent this as a Bayes net with N nodes. The rules of plates are simple:
- Repeat every structure in a box a number of times given by the integer in the corner of the box (e.g., N), updating the plate index variable (e.g., n) as you go.
- Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

[Figure: θ → X1, X2, ..., XN drawn explicitly, and the equivalent plate θ → Xn with plate index N]
Decomposable Likelihood of a BN

Consider the distribution defined by the directed acyclic GM:
$$p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$$

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the BN over X1, ..., X4 decomposed into four (node, parents) fragments]
MLE for BNs with Tabular CPDs

Assume each CPD is represented as a table (multinomial):
$$\theta_{ijk} \stackrel{\text{def}}{=} p\big(X_i = j \mid X_{\pi_i} = k\big)$$
Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.

The sufficient statistics are counts of family configurations:
$$n_{ijk} = \sum_n x_{n,i}^j\, x_{n,\pi_i}^k$$

The log-likelihood is
$$\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$$

Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
$$\theta_{ijk}^{\mathrm{ML}} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$$
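For one family (a node and its parent), the count-and-normalize recipe above amounts to a few lines of NumPy. A minimal sketch; the parent/child data are hypothetical:

```python
import numpy as np

# hypothetical fully observed data for one family: parent X_pa -> child X_i
pa    = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # parent configurations k
child = np.array([1, 0, 1, 1, 0, 0, 1, 1])   # child states j
K_child, K_pa = 2, 2

n = np.zeros((K_child, K_pa))
for j, k in zip(child, pa):
    n[j, k] += 1                             # family-configuration counts n_jk

# theta_jk = n_jk / sum_j' n_j'k: each column (parent config) is a distribution
theta = n / n.sum(axis=0, keepdims=True)
```

In a full BN, the same computation is repeated independently for every node, thanks to the decomposition of the log-likelihood.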
How to Define a Parameter Prior?

[Figure: BN with nodes Earthquake, Radio, Burglary, Alarm, Call]

Factorization:
$$p(\mathbf{x} \mid \theta) = \prod_{i=1}^M p\big(x_i \mid \mathbf{x}_{\pi_i}, \theta_i\big)$$

Local distributions defined by, e.g., multinomial parameters:
$$p\big(x_i^k \mid \mathbf{x}_{\pi_i}^j\big) = \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j}$$

How to define the parameter prior $p(\theta \mid G)$?

Assumptions (Geiger & Heckerman 97, 99):
- Complete model equivalence
- Global parameter independence
- Local parameter independence
- Likelihood and prior modularity
Global & Local Parameter Independence

Global parameter independence: for every DAG model,
$$p(\theta_m \mid G) = \prod_{i=1}^M p(\theta_i \mid G)$$

Local parameter independence: for every node,
$$p(\theta_i \mid G) = \prod_{j=1}^{q_i} p\big(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j} \mid G\big)$$

Example: $P(\text{Call} \mid \text{Alarm} = \text{YES})$ is independent of $P(\text{Call} \mid \text{Alarm} = \text{NO})$.

[Figure: BN with nodes Earthquake, Radio, Burglary, Alarm, Call]
Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

[Figure: samples 1 and 2 of (X1, X2) with parameters θ1 and θ2|1; global parameter independence across CPDs, local parameter independence within a CPD]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)

Discrete DAG models: $x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Multi}(\theta)$, with Dirichlet prior:
$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$$

Gaussian DAG models: $x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)$, with normal prior:
$$p(\mu \mid \nu, \Psi) = (2\pi)^{-n/2} |\Psi|^{-1/2} \exp\big\{ -\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu) \big\}$$

Normal-Wishart prior:
$$p(\mu \mid W, \alpha_\mu, \nu) = \mathrm{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big),$$
$$p(W \mid \alpha_w, T) = c(n, \alpha_w)\, |T|^{\alpha_w/2}\, |W|^{(\alpha_w - n - 1)/2} \exp\big\{ -\tfrac{1}{2} \operatorname{tr}(T W) \big\},
\qquad \text{where } W = \Sigma^{-1}.$$
Parameter Sharing

Consider a time-invariant (stationary) 1st-order Markov model:

[Figure: chain X1 → X2 → X3 → ... → XT with shared transition matrix A]

- Initial state probability vector: $\pi_k \stackrel{\text{def}}{=} p(X_1^k = 1)$
- State transition probability matrix: $A_{ij} \stackrel{\text{def}}{=} p\big(X_t^j = 1 \mid X_{t-1}^i = 1\big)$
- The joint:
$$p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^T p(X_t \mid X_{t-1})$$
- The log-likelihood:
$$\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^T \log p(x_{n,t} \mid x_{n,t-1}, A)$$

Again, we optimize each parameter separately:
- π is a multinomial frequency vector, and we've seen it before
- What about A?
Learning a Markov Chain Transition Matrix

- A is a stochastic matrix: $\sum_j A_{ij} = 1$
- Each row of A is a multinomial distribution, so the MLE of $A_{ij}$ is the fraction of transitions from i to j:
$$A_{ij}^{\mathrm{ML}} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^T x_{n,t-1}^i\, x_{n,t}^j}{\sum_n \sum_{t=2}^T x_{n,t-1}^i}$$

Application: if the states $X_t$ represent words, this is called a bigram language model.

Sparse data problem:
- If $i \to j$ did not occur in the data, we will have $A_{ij} = 0$; then any future sequence with word pair $i \to j$ will have zero probability.
- A standard hack is backoff smoothing or deleted interpolation with a smoothing distribution η:
$$\tilde{A}_{i\cdot} = \lambda \eta + (1 - \lambda) A_{i\cdot}^{\mathrm{ML}}$$
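The count-normalize-interpolate pipeline above fits in a few lines. A minimal sketch; the state sequence, the interpolation weight λ, and the uniform smoothing distribution η are hypothetical choices:

```python
import numpy as np

states = [0, 1, 1, 2, 0, 1, 2, 2, 0]           # one observed chain (hypothetical)
K = 3
counts = np.zeros((K, K))
for i, j in zip(states[:-1], states[1:]):
    counts[i, j] += 1                           # #(i -> j)

A_ml = counts / counts.sum(axis=1, keepdims=True)   # row-wise multinomial MLE

lam, eta = 0.1, np.full(K, 1.0 / K)             # deleted-interpolation style backoff
A_smooth = lam * eta + (1 - lam) * A_ml         # A~ = lam*eta + (1-lam)*A_ML
```

After smoothing, every row still sums to one, and no transition has probability exactly zero.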
Bayesian Language Model

Global and local parameter independence:

[Figure: chain X1 → X2 → X3 → ... → XT with independent rows A_i·, A_i′· of the transition matrix]

The posterior of $A_{i\cdot}$ and $A_{i'\cdot}$ is factorized despite the v-structure on $X_t$, because $X_{t-1}$ acts like a multiplexer.

Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix:
$$A_{ij}^{\mathrm{Bayes}} \stackrel{\text{def}}{=} p(j \mid i, D, \beta_i)
= \frac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|}
= \lambda_i \frac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i) A_{ij}^{\mathrm{ML}},
\qquad \text{where } \lambda_i = \frac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}$$

We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).
Example: HMM — Two Scenarios

Supervised learning: estimation when the "right answer" is known. Examples:
- GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands
- GIVEN: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls

Unsupervised learning: estimation when the "right answer" is unknown. Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: update the parameters θ of the model to maximize P(x | θ) — maximum likelihood (ML) estimation
Recall the Definition of an HMM

[Figure: HMM with hidden states y1, y2, ..., yT, emissions x1, x2, ..., xT, and shared transition matrix A]

Transition probabilities between any two states:
$$p\big(y_t^j = 1 \mid y_{t-1}^i = 1\big) = a_{i,j},$$
or
$$p\big(y_t \mid y_{t-1}^i = 1\big) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in I.$$

Start probabilities:
$$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M).$$

Emission probabilities associated with each state:
$$p\big(x_t \mid y_t^i = 1\big) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in I,$$
or in general:
$$p\big(x_t \mid y_t^i = 1\big) \sim f(\cdot \mid \theta_i),\ \forall i \in I.$$
Supervised ML Estimation

Given x = x1...xN for which the true state path y = y1...yN is known, define:
- $A_{ij}$ = # times state transition i→j occurs in y
- $B_{ik}$ = # times state i in y emits k in x

We can show that the maximum likelihood parameters are (with data $x_{n,t}, y_{n,t}$, $t = 1{:}T$, $n = 1{:}N$):
$$a_{ij}^{\mathrm{ML}} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^T y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^T y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$$
$$b_{ik}^{\mathrm{ML}} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^T y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^T y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$$

What if x is continuous? We can treat $\{(x_{n,t}, y_{n,t})\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian.
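With the state path observed, both estimators above are count-and-normalize, one multinomial per row. A minimal sketch on a single hypothetical labeled sequence:

```python
import numpy as np

# hypothetical fully labeled HMM data: hidden states y and emissions x
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])
x = np.array([2, 0, 1, 1, 2, 0, 1, 2])
K_y, K_x = 2, 3

A = np.zeros((K_y, K_y))
B = np.zeros((K_y, K_x))
for i, j in zip(y[:-1], y[1:]):
    A[i, j] += 1                            # transition counts A_ij
for i, k in zip(y, x):
    B[i, k] += 1                            # emission counts B_ik

a_ml = A / A.sum(axis=1, keepdims=True)     # a_ij = A_ij / sum_j' A_ij'
b_ml = B / B.sum(axis=1, keepdims=True)     # b_ik = B_ik / sum_k' B_ik'
```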
Supervised ML estimation, ctd.
 Intuition:
   When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data
 Drawback: Given little data, there may be overfitting:
   P(x|θ) is maximized, but θ is unreasonable
   0 probabilities – VERY BAD
 Example: Given 10 casino rolls, we observe
   x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
   y = F, F, F, F, F, F, F, F, F, F
 Then: a_FF = 1; a_FL = 0
   b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1
Pseudocounts
 Solution for small training sets: add pseudocounts
   A_ij = # times state transition i→j occurs in y + R_ij
   B_ik = # times state i in y emits k in x + S_ik
 R_ij, S_ik are pseudocounts representing our prior belief
 Total pseudocounts: R_i = Σ_j R_ij, S_i = Σ_k S_ik
 --- the "strength" of the prior belief, i.e., the total number of imaginary instances in the prior
 Larger total pseudocounts ⇒ stronger prior belief
 Small total pseudocounts: just enough to avoid 0 probabilities --- smoothing
 This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts
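A minimal sketch of pseudocount smoothing, assuming uniform pseudocounts R_ij = S_ik = 1 (Laplace smoothing); the counts below replay the overfit casino example, and the specific values are illustrative:

```python
# Smoothed ML estimates from raw counts plus pseudocounts.
import numpy as np

def smoothed_mle(A, B, R=1.0, S=1.0):
    A = np.asarray(A, float) + R     # transition counts + pseudocounts R_ij
    B = np.asarray(B, float) + S     # emission counts + pseudocounts S_ik
    return (A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))

# The overfit casino example: 10 Fair rolls, no Loaded data at all.
A = [[9, 0], [0, 0]]                        # only F->F transitions observed
B = [[2, 3, 2, 0, 1, 2], [0, 0, 0, 0, 0, 0]]
a, b = smoothed_mle(A, B)
# a_FL is no longer 0, and the never-seen Loaded row falls back to uniform.
```

With small pseudocounts the estimates stay close to the raw frequencies but no event is assigned probability 0, exactly as the slide describes.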
Summary: Learning GMs
 For fully observed BNs, the log-likelihood function decomposes into a sum of local terms, one per node; thus learning is also factored
 Structural learning: Chow-Liu; neighborhood selection
 Learning single-node GMs – density estimation: exponential family dist.; typical discrete distributions; typical continuous distributions; conjugate priors
 Learning two-node BNs: GLIMs; conditional density est.; classification
 Learning BNs with more nodes: local operations
Supplemental review:
Classification
 Generative and discriminative approaches
 (GM: two-node models, Q → X vs. Q ← X)
 Linear/logistic regression
 Two-node fully observed BNs
 Conditional mixtures
Classification:
 Goal: wish to learn f: X → Y
 Generative: modeling the joint distribution of all data
 Discriminative: modeling only the points at the boundary
Conditional Gaussian
 The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_N, y_N)}
 Both nodes are observed:
   Y is a class indicator vector: p(y_n) = multi(y_n : π) = Π_k (π_k)^{y_n^k}
   X is a conditional Gaussian variable with a class-specific mean:
     p(x_n | y_n^k = 1, μ, σ) = (2πσ²)^{-1/2} exp( −(x_n − μ_k)² / (2σ²) )
   ⇒ p(x, y | μ, σ, π) = Π_n Π_k [ π_k N(x_n : μ_k, σ) ]^{y_n^k}
 (GM: Y_i → X_i, plate over N)
MLE of conditional Gaussian
 Data log-likelihood:
   l(θ; D) = log Π_n p(x_n, y_n) = log Π_n p(y_n | π) + log Π_n p(x_n | y_n, μ, σ)
 MLE:
   π_k^MLE = argmax_π l(θ; D)  ⇒  π_k^MLE = Σ_n y_n^k / N = n_k / N
   --- the fraction of samples of class k
   μ_k^MLE = argmax_μ l(θ; D)  ⇒  μ_k^MLE = Σ_n y_n^k x_n / Σ_n y_n^k
   --- the average of samples of class k
 (GM: Y_i → X_i, plate over N)
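The two MLEs above reduce to class fractions and class-conditional means. A minimal sketch with made-up scalar data (K = 2 classes):

```python
# Conditional-Gaussian MLEs: pi_k = class fraction, mu_k = class mean.
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.2])
y = np.array([0, 0, 0, 1, 1])     # class labels (indicator form y_n^k)
K = 2
N = len(x)

pi_mle = np.array([(y == k).sum() / N for k in range(K)])  # n_k / N
mu_mle = np.array([x[y == k].mean() for k in range(K)])    # class averages
# pi_mle ≈ [0.6, 0.4]; mu_mle ≈ [1.0, 5.1]
```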
Bayesian estimation of conditional Gaussian
 Prior:
   P(π | α) = Dir(π : α)
   P(μ_k | ν, τ) = Normal(μ_k : ν, τ), ∀k
 Posterior mean (Bayesian est.):
   π_k^Bayes = (Σ_n y_n^k + α_k) / (N + Σ_{k'} α_{k'})
   μ_k^Bayes = [ (N_k/σ²) / (N_k/σ² + 1/τ²) ] μ_k^ML + [ (1/τ²) / (N_k/σ² + 1/τ²) ] ν,  where N_k = Σ_n y_n^k
 (GM: Y_i → X_i, plate over N)
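A sketch of the two posterior-mean formulas, reusing the toy data from the MLE example; the hyperparameter values (α, ν, τ²) and the known variance σ² are illustrative choices, not from the slides:

```python
# Posterior means under the conjugate Dirichlet and Normal priors.
import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.2])
y = np.array([0, 0, 0, 1, 1])
K, N = 2, len(x)
sigma2 = 1.0                 # known observation variance sigma^2
alpha = np.ones(K)           # Dir(1, ..., 1) prior on pi
nu, tau2 = 0.0, 10.0         # Normal(nu, tau^2) prior on each mu_k

Nk = np.array([(y == k).sum() for k in range(K)])
mu_ml = np.array([x[y == k].mean() for k in range(K)])

# pi_k^Bayes = (N_k + alpha_k) / (N + sum_k' alpha_k')
pi_bayes = (Nk + alpha) / (N + alpha.sum())
# mu_k^Bayes shrinks the ML mean toward the prior mean nu
w = (Nk / sigma2) / (Nk / sigma2 + 1.0 / tau2)
mu_bayes = w * mu_ml + (1 - w) * nu
```

The shrinkage weight w approaches 1 as N_k grows, so the Bayesian estimate recovers the MLE in the large-sample limit.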
Classification: Gaussian discriminative analysis
 The joint probability of a datum and its label is:
   p(y_n^k = 1, x_n | μ, σ) = p(y_n^k = 1) p(x_n | y_n^k = 1, μ, σ)
                            = π_k (2πσ²)^{-1/2} exp( −(x_n − μ_k)² / (2σ²) )
 Given a datum x_n, we predict its label using the conditional probability of the label given the datum:
   p(y_n^k = 1 | x_n, μ, σ) = π_k (2πσ²)^{-1/2} exp( −(x_n − μ_k)² / (2σ²) ) / Σ_{k'} π_{k'} (2πσ²)^{-1/2} exp( −(x_n − μ_{k'})² / (2σ²) )
 This is basic inference: introduce evidence, and then normalize
 (GM: Y_i → X_i)
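The "introduce evidence, then normalize" step is one line of code. A sketch using illustrative parameters (e.g., the MLEs fitted earlier):

```python
# Class posterior under the conditional-Gaussian model.
import numpy as np

pi = np.array([0.6, 0.4])      # illustrative class priors
mu = np.array([1.0, 5.1])      # illustrative class means
sigma2 = 1.0

def class_posterior(x_n):
    # joint p(y^k=1, x_n), dropping the shared (2*pi*sigma^2)^(-1/2)
    # factor, which cancels in the normalization
    joint = pi * np.exp(-(x_n - mu) ** 2 / (2 * sigma2))
    return joint / joint.sum()  # normalize over classes

p = class_posterior(0.9)
# p[0] is close to 1: x = 0.9 is far more likely under class 0
```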
Transductive classification
 Given x_n, what is its corresponding y_n, when we know the answer for a set of training data?
 Frequentist prediction: we fit π̂, μ̂, and σ̂ from the data first, and then plug in:
   p(y_n^k = 1 | x_n, π̂, μ̂, σ̂) = π̂_k N(x_n : μ̂_k, σ̂) / Σ_{k'} π̂_{k'} N(x_n : μ̂_{k'}, σ̂)
 Bayesian: we compute the posterior dist. of the parameters first …
 (GM: training pairs Y_i → X_i in a plate over N, plus the query pair Y_n → X_n)
Linear regression
 The data: {(x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_N, y_N)}
 Both nodes are observed:
   X is an input vector
   Y is a response vector
   (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
 A regression scheme can be used to model p(y|x) directly, rather than p(x, y)
 (GM: X_i → Y_i, plate over N)
A discriminative probabilistic model
 Let us assume that the target variable and the inputs are related by the equation:
   y_i = θ^T x_i + ε_i
 where ε_i is an error term of unmodeled effects or random noise
 Now assume that ε_i follows a Gaussian N(0, σ); then we have:
   p(y_i | x_i; θ) = 1/(√(2π) σ) · exp( −(y_i − θ^T x_i)² / (2σ²) )
 By the independence assumption:
   L(θ) = Π_{i=1}^n p(y_i | x_i; θ) = (1/(√(2π) σ))^n · exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − θ^T x_i)² )
Linear regression
 Hence the log-likelihood is:
   l(θ) = n log( 1/(√(2π) σ) ) − (1/σ²) · (1/2) Σ_{i=1}^n (y_i − θ^T x_i)²
 Do you recognize the last term?
 Yes, it is:
   J(θ) = (1/2) Σ_{i=1}^n (y_i − θ^T x_i)²
 It is the same as the MSE (up to a constant factor)! Maximizing the likelihood is minimizing the squared error.
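A quick sketch of this equivalence: the normal-equation solution (derived next) minimizes J(θ), and on synthetic noiseless data (made up here) it recovers the generating θ exactly, driving J(θ) to 0:

```python
# Check that minimizing J(theta) = 0.5 * sum (y_i - theta^T x_i)^2
# via the normal equations recovers theta on noiseless synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # 50 inputs, 3 features
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true                        # noiseless, so theta is exact

# theta* = (X^T X)^{-1} X^T y, solved without forming the inverse
theta_star = np.linalg.solve(X.T @ X, X.T @ y)

J = 0.5 * ((y - X @ theta_star) ** 2).sum()  # least-squares cost at theta*
```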
A recap:
 LMS update rule:
   θ^{t+1} = θ^t + α (y_n − (θ^t)^T x_n) x_n
   Pros: on-line, low per-step cost
   Cons: stochastic, maybe slow-converging
 Steepest descent:
   θ^{t+1} = θ^t + α Σ_{i=1}^n (y_i − (θ^t)^T x_i) x_i
   Pros: fast-converging, easy to implement
   Cons: a batch algorithm
 Normal equations:
   θ* = (X^T X)^{−1} X^T y
   Pros: a single-shot algorithm! Easiest to implement.
   Cons: need to compute the pseudo-inverse (X^T X)^{−1}; expensive; numerical issues (e.g., the matrix is singular …)
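The on-line vs. single-shot trade-off can be sketched by running the LMS rule against the normal-equation solution on the same synthetic data; the step size α and the data are illustrative choices:

```python
# LMS (stochastic) updates converging toward the normal-equation solution.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7])                    # noiseless synthetic data

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # single-shot solution

theta = np.zeros(2)
alpha = 0.05                                     # step size (illustrative)
for _ in range(20):                              # a few passes over the data
    for x_n, y_n in zip(X, y):
        theta += alpha * (y_n - theta @ x_n) * x_n   # LMS update

# theta is now close to theta_star
```

Each LMS step costs O(d) per example, while the normal equations cost O(nd² + d³) once; this is exactly the pros/cons trade-off listed above.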
Bayesian linear regression
Classification: generative and discriminative approaches (Q → X vs. Q ← X)
Regression: linear, conditional mixture, nonparametric (X → Y)
Density estimation: parametric and nonparametric methods (single-node X)
Simple GMs are the building blocks of complex BNs