School of Computer Science
Learning generalized linear models and tabular CPT of structured full BN
Probabilistic Graphical Models (10-708)
Lecture 9, Oct 15, 2007
Eric Xing
[Figure: a cell-signaling network (Receptor A, Receptor B, Kinase C, Kinase D, Kinase E, TF F, Gene G, Gene H) and its abstraction as a graphical model over nodes X1-X8]
Reading: J-Chap. 7,8.
Linear Regression
Let us assume that the target variable and the inputs are related by the equation
$y_i = \theta^T x_i + \varepsilon_i$
where $\varepsilon_i$ is an error term of unmodeled effects or random noise.
Now assume that $\varepsilon_i$ follows a Gaussian $N(0, \sigma^2)$; then we have:
$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$
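As a sanity check, here is a minimal numerical sketch (not part of the original slides; the data, seed, and variable names are illustrative) showing that under this Gaussian noise model, maximizing the conditional likelihood of y given x is the same as ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i = theta^T x_i + eps_i, eps_i ~ N(0, sigma^2)
N, d, sigma = 200, 3, 0.5
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=N)

# Gaussian conditional log-likelihood of the data for a given theta
def log_likelihood(theta, X, y, sigma):
    resid = y - X @ theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# Because the noise is Gaussian, the MLE of theta is the least-squares fit
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle)
print(log_likelihood(theta_mle, X, y, sigma) >= log_likelihood(theta_true, X, y, sigma))
```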
Logistic Regression (sigmoid classifier)
The conditional distribution: a Bernoulli
$p(y \mid x) = \mu(x)^y \,(1-\mu(x))^{1-y}$
where $\mu$ is a logistic function
$\mu(x) = \frac{1}{1+e^{-\theta^T x}}$
We can use the brute-force gradient method as in LR, but we can also apply generic laws by observing that $p(y \mid x)$ is an exponential family function, more specifically a generalized linear model.
Exponential Family
For a numeric random variable X,
$p(x \mid \eta) = h(x)\,\exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\, h(x)\,\exp\{\eta^T T(x)\}$
is an exponential family distribution with natural (canonical) parameter $\eta$.
Function $T(x)$ is a sufficient statistic.
Function $A(\eta) = \log Z(\eta)$ is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
[Figure: plate model with $X_n$ repeated over $N$ samples]
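As a concrete instance (an illustrative sketch, not from the slides), the Bernoulli distribution fits this template with $T(x) = x$, $h(x) = 1$, natural parameter $\eta = \log\frac{\mu}{1-\mu}$ and $A(\eta) = \log(1 + e^\eta)$:

```python
import numpy as np

# Bernoulli(x | mu) written in exponential-family form:
#   p(x | eta) = h(x) * exp(eta * T(x) - A(eta)),  T(x) = x, h(x) = 1
def bernoulli_expfam(x, eta):
    A = np.log1p(np.exp(eta))        # log normalizer A(eta) = log(1 + e^eta)
    return np.exp(eta * x - A)       # h(x) = 1

mu = 0.3
eta = np.log(mu / (1 - mu))          # natural parameter (log odds)
for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)
    print(x, standard, bernoulli_expfam(x, eta))   # the two forms agree
```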
Multivariate Gaussian Distribution
For a continuous vector random variable $X \in \mathbb{R}^k$:
$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \right\}$
$\quad = \exp\!\left\{ -\tfrac{1}{2}\,\mathrm{tr}(\Sigma^{-1} x x^T) + \mu^T \Sigma^{-1} x - \tfrac{1}{2}\mu^T \Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma| - \tfrac{k}{2}\log(2\pi) \right\}$
Exponential family representation (moment parameter $(\mu, \Sigma)$ vs. natural parameter $\eta$):
$\eta = \left[\, \Sigma^{-1}\mu \,;\; \mathrm{vec}\!\left(-\tfrac{1}{2}\Sigma^{-1}\right) \right]$, i.e. $\eta_1 = \Sigma^{-1}\mu$ and $\eta_2 = -\tfrac{1}{2}\Sigma^{-1}$
$T(x) = \left[\, x \,;\; \mathrm{vec}(x x^T) \right]$
$A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T \eta_2^{-1}\eta_1 - \tfrac{1}{2}\log|{-2\eta_2}|$
$h(x) = (2\pi)^{-k/2}$
Note: a k-dimensional Gaussian is a $(k + k^2)$-parameter distribution with a $(k + k^2)$-element vector of sufficient statistics (but because of symmetry and positivity, the parameters are constrained and have a lower degree of freedom).
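A small numerical sketch of the moment-to-natural conversion above (not from the slides; the particular $\mu$ and $\Sigma$ are illustrative). It checks that the two expressions for the log normalizer agree and that the moment parameters can be recovered from $(\eta_1, \eta_2)$:

```python
import numpy as np

# Moment parameters of a 2-dimensional Gaussian
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

# Natural parameters (up to the vec(.) flattening used on the slide):
#   eta1 = Sigma^{-1} mu,   eta2 = -1/2 Sigma^{-1}
Sigma_inv = np.linalg.inv(Sigma)
eta1 = Sigma_inv @ mu
eta2 = -0.5 * Sigma_inv

# Recover the moment parameters back from (eta1, eta2)
Sigma_back = np.linalg.inv(-2 * eta2)
mu_back = Sigma_back @ eta1
print(np.allclose(Sigma_back, Sigma), np.allclose(mu_back, mu))

# Log normalizer in both parameterizations:
A_moment = 0.5 * mu @ Sigma_inv @ mu + 0.5 * np.log(np.linalg.det(Sigma))
A_natural = (-0.25 * eta1 @ np.linalg.inv(eta2) @ eta1
             - 0.5 * np.log(np.linalg.det(-2 * eta2)))
print(np.isclose(A_moment, A_natural))
```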
Multinomial Distribution
For a binary vector random variable $x \sim \mathrm{multi}(x \mid \pi)$:
$p(x \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\!\left\{ \sum_k x_k \ln \pi_k \right\}$
$\quad = \exp\!\left\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right)\ln\!\left(1 - \sum_{k=1}^{K-1}\pi_k\right) \right\}$
$\quad = \exp\!\left\{ \sum_{k=1}^{K-1} x_k \ln\!\left(\frac{\pi_k}{1 - \sum_{k'=1}^{K-1}\pi_{k'}}\right) + \ln\!\left(1 - \sum_{k=1}^{K-1}\pi_k\right) \right\}$
Exponential family representation:
$\eta = \left[ \ln(\pi_k/\pi_K) \,;\; 0 \right]$
$T(x) = [x]$
$A(\eta) = -\ln\pi_K = \ln\!\left( \sum_{k=1}^{K} e^{\eta_k} \right)$
$h(x) = 1$
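A small illustrative check (not from the slides) of this parameterization: the natural parameters are the log-ratios against the last component, and a softmax recovers $\pi$:

```python
import numpy as np

# Multinomial (one-hot) distribution: eta_k = ln(pi_k / pi_K), last entry 0,
# and A(eta) = ln(sum_k exp(eta_k)) = -ln(pi_K).
pi = np.array([0.2, 0.5, 0.3])            # K = 3
eta = np.log(pi / pi[-1])                  # natural parameters

pi_back = np.exp(eta) / np.exp(eta).sum()  # softmax recovers the moments
A = np.log(np.exp(eta).sum())
print(np.allclose(pi_back, pi), np.isclose(A, -np.log(pi[-1])))
```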
Why Exponential Family?
Moment generating property:
$\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{d Z(\eta)}{d\eta} = \int T(x)\, h(x)\exp\{\eta^T T(x)\}\,\frac{1}{Z(\eta)}\, dx = E[T(x)]$
$\frac{d^2 A}{d\eta^2} = \int T(x)\, h(x)\exp\{\eta^T T(x)\}\,\frac{1}{Z(\eta)}\, T(x)\, dx \;-\; \int T(x)\, h(x)\exp\{\eta^T T(x)\}\, dx\; \frac{1}{Z^2(\eta)}\frac{dZ(\eta)}{d\eta}$
$\qquad = E[T^2(x)] - E[T(x)]^2 = \mathrm{Var}[T(x)]$
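A quick numerical verification of this property (an illustrative sketch, not from the slides), using the Bernoulli case where $A(\eta) = \log(1+e^\eta)$, $E[x] = \mu$ and $\mathrm{Var}[x] = \mu(1-\mu)$:

```python
import numpy as np

# Finite-difference check that dA/deta = E[T(x)] and d^2A/deta^2 = Var[T(x)]
def A(eta):
    return np.log1p(np.exp(eta))   # Bernoulli log normalizer

eta, h = 0.7, 1e-4
mu = 1.0 / (1.0 + np.exp(-eta))    # mean implied by eta

dA = (A(eta + h) - A(eta - h)) / (2 * h)                 # ~ E[x]
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2      # ~ Var[x]
print(np.isclose(dA, mu), np.isclose(d2A, mu * (1 - mu)))
```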
Moment Estimation
We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer $A(\eta)$.
The q-th derivative gives the q-th centered moment:
$\frac{dA(\eta)}{d\eta} = \text{mean}$
$\frac{d^2 A(\eta)}{d\eta^2} = \text{variance}$
$\ldots$
When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs. Canonical Parameters
The moment parameter $\mu$ can be derived from the natural (canonical) parameter:
$\frac{dA(\eta)}{d\eta} = E[T(x)] \overset{\text{def}}{=} \mu$
$A(\eta)$ is convex since
$\frac{d^2 A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] > 0$
Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):
$\eta \overset{\text{def}}{=} \psi(\mu)$
A distribution in the exponential family can be parameterized not only by $\eta$ (the canonical parameterization) but also by $\mu$ (the moment parameterization).
[Figure: plot of the convex log normalizer $A$ against $\eta$, with a point $\eta^*$ marked]
MLE for Exponential Family
For iid data, the log-likelihood is
$\ell(\eta; D) = \log \prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\} = \sum_n \log h(x_n) + \eta^T\!\left(\sum_n T(x_n)\right) - N A(\eta)$
Take derivatives and set to zero:
$\frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial \eta} = 0 \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N}\sum_n T(x_n) \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)$
This amounts to moment matching.
We can infer the canonical parameters using $\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})$.
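A minimal sketch of moment matching (not from the slides; the Poisson rate and sample size are illustrative): for a Poisson, $T(x) = x$ and $E[x] = \lambda$, so the MLE of $\lambda$ is the sample mean, and the canonical parameter follows via the link $\psi(\mu) = \log\mu$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Moment matching: mu_MLE = (1/N) sum_n T(x_n); for Poisson T(x) = x.
x = rng.poisson(lam=3.5, size=10_000)

mu_mle = x.mean()            # matched first moment, close to 3.5
eta_mle = np.log(mu_mle)     # canonical parameter eta_MLE = psi(mu_MLE)
print(mu_mle, eta_mle)
```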
Sufficiency
For $p(x \mid \theta)$, $T(x)$ is sufficient for $\theta$ if there is no information in $X$ regarding $\theta$ beyond that in $T(x)$.
We can throw away $X$ for the purpose of inference w.r.t. $\theta$.
Bayesian view: $p(\theta \mid T(x), x) = p(\theta \mid T(x))$
Frequentist view: $p(x \mid T(x), \theta) = p(x \mid T(x))$
The Neyman factorization theorem: $T(x)$ is sufficient for $\theta$ if
$p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))$
$\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))$
Examples
Gaussian:
$\eta = \left[\, \Sigma^{-1}\mu \,;\; \mathrm{vec}\!\left(-\tfrac{1}{2}\Sigma^{-1}\right) \right]$, $\;T(x) = \left[\, x \,;\; \mathrm{vec}(x x^T) \right]$, $\;A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|$, $\;h(x) = (2\pi)^{-k/2}$
$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n$
Multinomial:
$\eta = \left[ \ln(\pi_k/\pi_K) \,;\; 0 \right]$, $\;T(x) = [x]$, $\;A(\eta) = \ln\!\left(\sum_{k=1}^{K} e^{\eta_k}\right)$, $\;h(x) = 1$
$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$
Poisson:
$\eta = \log\lambda$, $\;T(x) = x$, $\;A(\eta) = \lambda = e^{\eta}$, $\;h(x) = \frac{1}{x!}$
$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$
Generalized Linear Models (GLIMs)
The graphical model:
- Linear regression
- Discriminative linear classification
- Commonality: model $E_p(Y) = \mu = f(\theta^T X)$. What is $p()$? The conditional distribution of $Y$. What is $f()$? The response function.
GLIM:
- The observed input $x$ is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$.
- The conditional mean $\mu$ is represented as a function $f(\xi)$ of $\xi$, where $f$ is known as the response function.
- The observed output $y$ is assumed to be characterized by an exponential family distribution with conditional mean $\mu$.
[Figure: plate model with $X_n \to Y_n$, repeated over $N$ samples]
GLIM, cont.
[Figure: the GLIM pipeline $x \xrightarrow{\theta} \xi \xrightarrow{f} \mu \xrightarrow{\psi} \eta$]
$p(y \mid \eta) = h(y)\exp\{\eta(x)^T y - A(\eta)\} \;\Rightarrow\; p(y \mid \eta) = h(y, \phi)\exp\!\left\{ \tfrac{1}{\phi}\big(\eta(x)^T y - A(\eta)\big) \right\}$ (with a scale/dispersion parameter $\phi$)
The choice of exponential family is constrained by the nature of the data $Y$. Example: $y$ is a continuous vector: multivariate Gaussian; $y$ is a class label: Bernoulli or multinomial.
The choice of the response function: following some mild constraints, e.g., range in [0,1], positivity, ...
Canonical response function: $f = \psi^{-1}(\cdot)$. In this case $\theta^T x$ directly corresponds to the canonical parameter $\eta$.
MLE for GLIMs with Natural Response
Log-likelihood:
$\ell = \sum_n \log h(y_n) + \sum_n \left( \theta^T x_n y_n - A(\eta_n) \right)$
Derivative of the log-likelihood:
$\frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \right) = \sum_n (y_n - \mu_n)\, x_n = X^T(y - \mu)$
This is a fixed-point function because $\mu$ is a function of $\theta$.
Online learning for canonical GLIMs: stochastic gradient ascent = least mean squares (LMS) algorithm:
$\theta^{t+1} = \theta^t + \rho\,(y_n - \mu_n^t)\, x_n$
where $\mu_n^t = (\theta^t)^T x_n$ and $\rho$ is a step size.
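Below is a minimal sketch of this online update (not from the slides; the data, step size, and function names are illustrative). Because the update only needs the mismatch $(y_n - \mu_n)$, the same loop covers linear regression (identity response) and logistic regression (sigmoid response):

```python
import numpy as np

rng = np.random.default_rng(2)

# Online LMS-style update for a canonical GLIM:
#   theta <- theta + rho * (y_n - mu_n) * x_n,  with mu_n = f(theta^T x_n)
def lms_glim(X, y, response, rho=0.05, epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xn, yn in zip(X, y):
            mu_n = response(theta @ xn)
            theta += rho * (yn - mu_n) * xn
    return theta

# Example: synthetic logistic-regression data
N, d = 500, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

theta_hat = lms_glim(X, y, response=lambda z: 1.0 / (1.0 + np.exp(-z)))
print(theta_hat)   # roughly aligned with theta_true given enough data
```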
Batch Learning for Canonical GLIMs
The Hessian matrix:
$H = \frac{d^2\ell}{d\theta\, d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\, x_n = -\sum_n x_n \frac{d\mu_n}{d\theta^T} = -\sum_n x_n \frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T} = -\sum_n x_n \frac{d\mu_n}{d\eta_n}\, x_n^T \quad (\text{since } \eta_n = \theta^T x_n)$
$\;\;= -X^T W X$
where $X = [x_n^T]$ is the design matrix (one row per case), $y$ is the vector of responses, and
$W = \mathrm{diag}\!\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right)$
which can be computed by calculating the second derivative of $A(\eta_n)$.
Recall LMS
Cost function in matrix form:
$J(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left( x_i^T\theta - y_i \right)^2 = \frac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y})$
where $X = [x_i^T]$ is the design matrix and $\vec{y}$ the vector of responses.
To minimize $J(\theta)$, take the derivative and set it to zero:
$\nabla_\theta J = \frac{1}{2}\nabla_\theta\left( \theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y} \right)$
$\qquad = \frac{1}{2}\left( \nabla_\theta\,\mathrm{tr}\,\theta^T X^T X\theta - 2\,\nabla_\theta\,\mathrm{tr}\,\vec{y}^T X\theta + \nabla_\theta\,\mathrm{tr}\,\vec{y}^T\vec{y} \right)$
$\qquad = \frac{1}{2}\left( X^T X\theta + X^T X\theta - 2 X^T\vec{y} \right) = X^T X\theta - X^T\vec{y} = 0$
$\Rightarrow\; X^T X\theta = X^T\vec{y}$ (the normal equations)
$\Rightarrow\; \theta^* = (X^T X)^{-1} X^T \vec{y}$
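A brief numerical sketch of the normal equations (not from the slides; the synthetic data is illustrative and $X^T X$ is assumed invertible). Solving the linear system directly gives the same minimizer as a least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)

# Solve the normal equations  X^T X theta = X^T y
N, d = 100, 4
X = rng.normal(size=(N, d))
y = X @ np.array([2.0, 0.0, -1.0, 3.0]) + 0.1 * rng.normal(size=N)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)    # assumes X^T X is invertible
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_star, theta_lstsq))       # same minimizer of J(theta)
```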
Iteratively Reweighted Least Squares (IRLS)
Recall Newton-Raphson methods with cost function $J$:
$\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta J$
For maximizing the GLIM log-likelihood we now have
$\nabla_\theta \ell = X^T(y - \mu), \qquad H = -X^T W X$
Now:
$\theta^{t+1} = \theta^t + (X^T W^t X)^{-1} X^T (y - \mu^t) = (X^T W^t X)^{-1}\left[ X^T W^t X\,\theta^t + X^T(y - \mu^t) \right] = (X^T W^t X)^{-1} X^T W^t z^t$
where the adjusted response is
$z^t = X\theta^t + (W^t)^{-1}(y - \mu^t)$
This can be understood as solving the following "iteratively reweighted least squares" problem:
$\theta^{t+1} = \arg\min_\theta\, (z - X\theta)^T W (z - X\theta)$
Example 1: Logistic Regression (sigmoid classifier)
The conditional distribution: a Bernoulli
$p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1-y}$
where $\mu$ is a logistic function
$\mu(x) = \frac{1}{1 + e^{-\eta(x)}}$
$p(y \mid x)$ is an exponential family function, with mean
$E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}}$
and canonical response function
$\eta = \xi = \theta^T x$
IRLS:
$\frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad W = \mathrm{diag}\big( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \big)$
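The following is a compact IRLS sketch for this logistic case (not from the slides; data, jitter constant, and iteration count are illustrative), using the update $\theta \leftarrow (X^T W X)^{-1} X^T W z$ with $W = \mathrm{diag}(\mu(1-\mu))$:

```python
import numpy as np

rng = np.random.default_rng(4)

# IRLS for logistic regression:
#   mu = sigmoid(X theta), W = diag(mu * (1 - mu))
#   z = X theta + W^{-1} (y - mu),  theta <- (X^T W X)^{-1} X^T W z
def irls_logistic(X, y, n_iter=20):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        w = mu * (1 - mu) + 1e-10            # diagonal of W, with tiny jitter
        z = X @ theta + (y - mu) / w         # adjusted response
        WX = X * w[:, None]
        theta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return theta

N, d = 400, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -1.5, 2.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)
print(irls_logistic(X, y))
```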
Logistic Regression: Practical Issues
It is very common to use regularized maximum likelihood:
$p(y = \pm 1 \mid x, \theta) = \sigma(y\,\theta^T x) = \frac{1}{1 + e^{-y\theta^T x}}, \qquad p(\theta) = \mathrm{Normal}(0, \lambda^{-1} I)$
$\ell(\theta) = \sum_n \log \sigma(y_n\theta^T x_n) - \frac{\lambda}{2}\theta^T\theta$
- IRLS takes $O(Nd^3)$ per iteration, where $N$ = number of training cases and $d$ = dimension of input $x$.
- Quasi-Newton methods, which approximate the Hessian, work faster.
- Conjugate gradient takes $O(Nd)$ per iteration, and usually works best in practice.
- Stochastic gradient descent can also be used if $N$ is large, c.f. the perceptron rule:
$\nabla\ell(\theta) = \big( 1 - \sigma(y_n\theta^T x_n) \big)\, y_n x_n - \lambda\theta$
Example 2: Linear Regression
The conditional distribution: a Gaussian
$p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\!\left\{ -\tfrac{1}{2}(y - \mu(x))^T\Sigma^{-1}(y - \mu(x)) \right\} \;\Rightarrow\; h(y)\exp\!\left\{ \eta(x)^T y - A(\eta) \right\}$
where $\mu$ is a linear function: $\mu(x) = \theta^T x = \eta(x)$
$p(y \mid x)$ is an exponential family function, with mean
$E[y \mid x] = \mu = \theta^T x$
and canonical response function (the identity)
$\eta = \xi = \theta^T x$
IRLS: $\frac{d\mu}{d\eta} = 1$, $W = I$, so
$\theta^{t+1} = (X^T W^t X)^{-1} X^T W^t z^t = \theta^t + (X^T X)^{-1} X^T (y - \mu^t)$   [a "rescaled" steepest-descent step, i.e. the normal equations]
and as $t \to \infty$, $\theta \to (X^T X)^{-1} X^T Y$.
Simple GMs are the building blocks of complex BNs:
- Classification: generative and discriminative approach [GM: $Q \to X$ or $X \to Q$]
- Regression: linear, conditional mixture, nonparametric [GM: $X \to Y$]
- Density estimation: parametric and nonparametric methods [GM: $(\mu, \sigma) \to X$]
An (incomplete) genealogy of graphical models
[Figure: genealogy chart of graphical model families]
The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.
But such designs are useful and indeed favored, because they put human knowledge to good use ...
MLE for General BNs
If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \left( \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right) = \sum_i \left( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right)$
How to define parameter prior $p(\theta \mid G)$?
[Figure: the earthquake network: Earthquake → Alarm ← Burglary, Earthquake → Radio, Alarm → Call]
Factorization: $p(\mathbf{x}) = \prod_{i=1}^{M} p(x_i \mid \mathbf{x}_{\pi_i})$
Local distributions defined by, e.g., multinomial parameters: $p(x_i^k \mid \mathbf{x}_{\pi_i}^j) = \theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j}$
Assumptions (Geiger & Heckerman 97, 99):
- Complete Model Equivalence
- Global Parameter Independence
- Local Parameter Independence
- Likelihood and Prior Modularity
Global & Local Parameter Independence
Global Parameter Independence: for every DAG model,
$p(\theta_m \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G)$
Local Parameter Independence: for every node,
$p(\theta_i \mid G) = \prod_{j=1}^{q_i} p(\theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j} \mid G)$
[Figure: the earthquake network again] e.g., $\theta_{Call \mid Alarm=YES}$ is independent of $\theta_{Call \mid Alarm=NO}$.
Parameter Independence, Graphical View
[Figure: a meta Bayesian network over samples 1, 2, ..., M in which the parameter nodes $\theta_1$ and $\theta_{2|1}$ are shared across the per-sample networks $X_1 \to X_2$, illustrating global and local parameter independence]
Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97, 99)
Discrete DAG models:
$x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Multi}(\theta)$
Dirichlet prior:
$P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k - 1}$
Gaussian DAG models:
$x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)$
Normal prior:
$p(\mu \mid \nu, \Psi) = (2\pi)^{-n/2}\, |\Psi|^{-1/2} \exp\!\left\{ -\tfrac{1}{2}(\mu - \nu)^T \Psi^{-1}(\mu - \nu) \right\}$
Normal-Wishart prior:
$p(\mu \mid \nu, \alpha_\mu, W) = \mathrm{Normal}\!\left( \nu, (\alpha_\mu W)^{-1} \right)$
$p(W \mid \alpha_w, T) = c(n, \alpha_w)\, |T|^{\alpha_w/2}\, |W|^{(\alpha_w - n - 1)/2} \exp\!\left\{ -\tfrac{1}{2}\mathrm{tr}(T W) \right\}, \quad \text{where } W = \Sigma^{-1}$
MLE for General BNs
If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \left( \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right) = \sum_i \left( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right)$
Example: Decomposable Likelihood of a Directed Model
Consider the distribution defined by the directed acyclic GM:
$p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)$
This is exactly like learning four separate small BNs, each of which consists of a node and its parents.
[Figure: the four-node network X1 → X2, X1 → X3, {X2, X3} → X4, and the four corresponding node-plus-parents sub-networks]
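A small sketch of this decomposition in code (not from the slides; the synthetic CPTs, data, and helper names are illustrative). The total log-likelihood is computed as a sum of four local terms, one per (node, parents) family, each fitted independently by counting:

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data from the 4-node network X1 -> X2, X1 -> X3, (X2, X3) -> X4
N = 5000
x1 = (rng.uniform(size=N) < 0.6).astype(int)
x2 = (rng.uniform(size=N) < np.where(x1 == 1, 0.8, 0.3)).astype(int)
x3 = (rng.uniform(size=N) < np.where(x1 == 1, 0.2, 0.7)).astype(int)
x4 = (rng.uniform(size=N) < 0.1 + 0.4 * x2 + 0.4 * x3).astype(int)
data = np.stack([x1, x2, x3, x4], axis=1)

families = {0: [], 1: [0], 2: [0], 3: [1, 2]}        # node -> parent indices

def local_loglik(node, parents, data):
    """MLE-fitted log-likelihood contribution of one (node, parents) family."""
    ll = 0.0
    cols = data[:, parents] if parents else np.zeros((len(data), 0), dtype=int)
    for pa_config in {tuple(row) for row in cols}:     # observed parent configs
        mask = np.all(cols == pa_config, axis=1)
        counts = np.bincount(data[mask, node], minlength=2)
        probs = counts / counts.sum()
        ll += np.sum(counts * np.log(np.clip(probs, 1e-12, None)))
    return ll

total = sum(local_loglik(i, pa, data) for i, pa in families.items())
print(total)   # log p(D | theta_MLE), assembled from four local terms
```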
MLE for BNs with Tabular CPDs
Assume each CPD is represented as a table (multinomial) where
$\theta_{ijk} \overset{\text{def}}{=} p(X_i = j \mid X_{\pi_i} = k)$
Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.
The sufficient statistics are counts of family configurations:
$n_{ijk} \overset{\text{def}}{=} \sum_n x_{n,i}^j\, x_{n,\pi_i}^k$
The log-likelihood is
$\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:
$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$
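In code, this MLE is just "count family configurations, then normalize over the child's states for each parent configuration". A minimal sketch (not from the slides; one child with a single parent, and the synthetic data, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Tabular-CPD MLE: n_jk counts (child state j, parent state k), then
# theta_jk = n_jk / sum_j' n_j'k for each parent configuration k.
child = rng.integers(0, 3, size=1000)     # X_i with 3 states (index j)
parent = rng.integers(0, 2, size=1000)    # X_pi with 2 states (index k)

n_jk = np.zeros((3, 2))
np.add.at(n_jk, (child, parent), 1)       # sufficient statistics

theta_jk = n_jk / n_jk.sum(axis=0, keepdims=True)
print(theta_jk)                            # each column sums to 1
```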
MLE and Kullback-Leibler Divergence
KL divergence:
$D\big(q(x)\,\|\,p(x)\big) = \sum_x q(x)\log\frac{q(x)}{p(x)}$
Empirical distribution:
$\tilde{p}(x) \overset{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N}\delta(x, x_n)$
where $\delta(x, x_n)$ is a Kronecker delta function.
$D\big(\tilde{p}(x)\,\|\,p(x \mid \theta)\big) = \sum_x \tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x \mid \theta)} = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \sum_x \tilde{p}(x)\log p(x \mid \theta)$
$\qquad = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \frac{1}{N}\sum_n \log p(x_n \mid \theta) = C - \frac{1}{N}\,\ell(\theta; D)$
Hence $\max_\theta$ (MLE) $\Leftrightarrow$ $\min_\theta$ (KL).
Parameter Sharing
Consider a time-invariant (stationary) 1st-order Markov model.
[Figure: chain X1 → X2 → X3 → ... → XT, with parameters π and A]
Initial state probability vector:
$\pi_k \overset{\text{def}}{=} p(X_1^k = 1)$
State transition probability matrix:
$A_{ij} \overset{\text{def}}{=} p(X_t^j = 1 \mid X_{t-1}^i = 1)$
The joint:
$p(X_{1:T} \mid \theta) = p(x_1 \mid \pi)\prod_{t=2}^{T} p(X_t \mid X_{t-1})$
The log-likelihood:
$\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}, A)$
Again, we optimize each parameter separately. $\pi$ is a multinomial frequency vector, and we've seen it before. What about $A$?
Learning a Markov Chain Transition Matrix
$A$ is a stochastic matrix: $\sum_j A_{ij} = 1$. Each row of $A$ is a multinomial distribution.
So the MLE of $A_{ij}$ is the fraction of transitions from $i$ to $j$:
$A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=2}^{T} x_{n,t-1}^i\, x_{n,t}^j}{\sum_n\sum_{t=2}^{T} x_{n,t-1}^i}$
Application: if the states $X_t$ represent words, this is called a bigram language model.
Sparse data problem: if $i \to j$ did not occur in the data, we will have $A_{ij} = 0$, and any further sequence with word pair $i \to j$ will have zero probability. A standard hack is backoff smoothing or deleted interpolation:
$\tilde{A}_{i \to \cdot} = \lambda\,\eta + (1 - \lambda)\, A_{i \to \cdot}^{ML}$
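A compact sketch of both steps (not from the slides; the chain, the uniform background distribution used for $\eta$, and the mixing weight are illustrative): estimate $A^{ML}$ by counting transitions, then interpolate with a background distribution to remove zeros:

```python
import numpy as np

rng = np.random.default_rng(8)

# MLE of a Markov-chain transition matrix: A_ij = #(i -> j) / #(i -> .),
# followed by simple interpolation smoothing with a background distribution.
K = 4
A_true = rng.dirichlet(np.ones(K), size=K)

def sample_chain(A, T):
    s = [rng.integers(K)]
    for _ in range(T - 1):
        s.append(rng.choice(K, p=A[s[-1]]))
    return s

sequences = [sample_chain(A_true, 50) for _ in range(200)]

counts = np.zeros((K, K))
for seq in sequences:
    for i, j in zip(seq[:-1], seq[1:]):
        counts[i, j] += 1

A_ml = counts / counts.sum(axis=1, keepdims=True)   # each row is a multinomial

lam, eta = 0.1, np.full(K, 1.0 / K)                 # backoff to uniform
A_smooth = lam * eta + (1 - lam) * A_ml
print(np.allclose(A_smooth.sum(axis=1), 1.0))       # rows still sum to 1
```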
Bayesian Language Model
Global and local parameter independence:
[Figure: chain X1 → X2 → X3 → ... → XT with parameter nodes π, $A_{i\cdot}$, $A_{i'\cdot}$ and hyperparameters α, β]
The posterior of $A_{i\cdot}$ and $A_{i'\cdot}$ is factorized despite the v-structure on $X_t$, because $X_{t-1}$ acts like a multiplexer.
Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix:
$A_{ij}^{Bayes} \overset{\text{def}}{=} p(j \mid i, D, \beta_i) = \frac{\#(i \to j) + \beta_{ij}}{\#(i \to \cdot) + |\beta_i|} = \lambda_i\,\beta'_{ij} + (1 - \lambda_i)\, A_{ij}^{ML}, \quad \text{where } \lambda_i = \frac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}$
We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.).
Example: HMM, Two Scenarios
Supervised learning: estimation when the "right answer" is known.
Examples:
- GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands.
- GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.
Unsupervised learning: estimation when the "right answer" is unknown.
Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: Update the parameters θ of the model to maximize P(x|θ) (maximum likelihood (ML) estimation).
Recall the Definition of an HMM
[Figure: HMM graphical model: hidden chain y1 → y2 → y3 → ... → yT with transition matrix A, emitting x1, x2, x3, ..., xT]
Transition probabilities between any two states:
$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$
or $\;p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, \ldots, a_{i,M}), \;\forall i \in I$
Start probabilities:
$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
Emission probabilities associated with each state:
$p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, \ldots, b_{i,K}), \;\forall i \in I$
or in general:
$p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \;\forall i \in I$
Supervised ML Estimation
Given $x = x_1\ldots x_N$ for which the true state path $y = y_1\ldots y_N$ is known, i.e. the data are $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\; n = 1{:}N\}$.
Define:
- $A_{ij}$ = # times state transition $i \to j$ occurs in $y$
- $B_{ik}$ = # times state $i$ in $y$ emits $k$ in $x$
We can show that the maximum likelihood parameters $\theta$ are:
$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n\sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$
$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n\sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$
What if $x$ is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T, n = 1{:}N\}$ as $N \times T$ observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
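A minimal counting sketch of these estimators (not from the slides; the tiny state and symbol sequences are illustrative):

```python
import numpy as np

# Supervised ML estimation for an HMM with known state paths:
# count transitions A_ij and emissions B_ik, then normalize each row.
states  = [[0, 0, 1, 1, 1, 0], [1, 1, 0, 0, 0, 0]]    # y: known state paths
symbols = [[2, 1, 0, 0, 2, 1], [0, 2, 2, 1, 1, 0]]    # x: observed emissions
M, K = 2, 3                                            # #states, #symbols

A = np.zeros((M, M))
B = np.zeros((M, K))
for y, x in zip(states, symbols):
    for t in range(1, len(y)):
        A[y[t - 1], y[t]] += 1                         # A_ij = #(i -> j) in y
    for yt, xt in zip(y, x):
        B[yt, xt] += 1                                 # B_ik = #(state i emits k)

a_ml = A / A.sum(axis=1, keepdims=True)                # a_ij = A_ij / sum_j' A_ij'
b_ml = B / B.sum(axis=1, keepdims=True)                # b_ik = B_ik / sum_k' B_ik'
print(a_ml)
print(b_ml)
```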
Supervised ML Estimation, ctd.
Intuition: when we know the underlying states, the best estimate of θ is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x|θ) is maximized, but θ is unreasonable (0 probabilities are VERY BAD).
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F
Then: aFF = 1; aFL = 0; bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
Solution for small training sets: add pseudocounts.
- $A_{ij}$ = # times state transition $i \to j$ occurs in $y$ + $R_{ij}$
- $B_{ik}$ = # times state $i$ in $y$ emits $k$ in $x$ + $S_{ik}$
$R_{ij}$, $S_{ik}$ are pseudocounts representing our prior belief.
Total pseudocounts $R_i = \sum_j R_{ij}$, $S_i = \sum_k S_{ik}$: the "strength" of the prior belief, i.e. the total number of imaginary instances in the prior.
Larger total pseudocounts imply a stronger prior belief.
Small total pseudocounts: just to avoid 0 probabilities (smoothing).
This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts.
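A short sketch of the pseudocount fix (not from the slides; it reuses the illustrative counts from the previous sketch and sets all pseudocounts to 1, corresponding to the uniform-prior case described above):

```python
import numpy as np

# Smoothed estimates: add prior counts R_ij, S_ik before normalizing.
A_counts = np.array([[4.0, 1.0], [2.0, 3.0]])             # observed transitions
B_counts = np.array([[1.0, 3.0, 2.0], [3.0, 0.0, 2.0]])   # observed emissions

R = np.ones_like(A_counts)     # R_ij: transition pseudocounts
S = np.ones_like(B_counts)     # S_ik: emission pseudocounts

a_smoothed = (A_counts + R) / (A_counts + R).sum(axis=1, keepdims=True)
b_smoothed = (B_counts + S) / (B_counts + S).sum(axis=1, keepdims=True)
print(b_smoothed)              # no zero probabilities, unlike the raw ML estimate
```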