

10-708: Probabilistic Graphical Models 10-708, Spring 2017

11 : Factor Analysis and State Space Models

Lecturer: Eric P. Xing Scribes: Shruti Palaskar

1 Overview

In this lecture, we study graphical models that have a continuous random vector as a latent variable, especially latent-state Gaussian models.

The Factor Analysis (FA) model is a simple latent variable model, where the latent variable is assumed to lie on a lower-dimensional linear subspace of the space of the observed variable. The graphical model for FA is the same as a mixture model, except that both the observed and latent variables are assumed to be continuous. FA can be viewed as a generalized dimensionality reduction technique. FA models are widely used in social sciences, behavioral sciences, marketing/recommendation systems, and other applied sciences.

The State Space Model (SSM) can be viewed as a chain of Factor Analysis models, where the latent variables are connected sequentially (as a time series), in a manner that is very similar to the Hidden Markov Model (HMM), following all Markov properties and independence assumptions. The HMM is a dynamical generalization of a mixture model; a dynamical generalization of FA leads to the Kalman filter, a method for time series analysis. SSMs are widely used in econometrics, navigation, control, etc.

2 Mathematics Review

2.1 Multivariate Gaussian

Let us recall that the probability density function for a multivariate Gaussian distribution is of the following form:

p(x | μ, Σ) = (2π)^{-n/2} |Σ|^{-1/2} exp{ -(1/2)(x − μ)^T Σ^{-1} (x − μ) }    (1)

If we represent a multivariate Gaussian distribution in the following block form:

p([x_1; x_2] | μ, Σ) = N([x_1; x_2] | [μ_1; μ_2], [Σ_{11} Σ_{12}; Σ_{21} Σ_{22}])    (2)

Then we can represent the marginal probability and conditional probability with µ and Σ:

p(x_2) = N(x_2 | m_2^m, V_2^m),    p(x_1 | x_2) = N(x_1 | m_{1|2}, V_{1|2})    (3)

m_2^m = μ_2,    m_{1|2} = μ_1 + Σ_{12} Σ_{22}^{-1} (x_2 − μ_2)    (4)

V_2^m = Σ_{22},    V_{1|2} = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21}    (5)
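These conditioning formulas are easy to sanity-check numerically. The following sketch (our own illustration, not part of the original notes; the function name and example matrices are arbitrary) computes m_{1|2} and V_{1|2} with NumPy:

import numpy as np

def gaussian_conditional(mu, Sigma, x2, k):
    """Parameters of p(x1 | x2) for a joint Gaussian, split after index k.

    mu, Sigma: joint mean and covariance; x2: observed lower block.
    Returns (m_{1|2}, V_{1|2}) as in equations (4) and (5).
    """
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    m = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)   # equation (4)
    V = S11 - S12 @ np.linalg.solve(S22, S21)        # equation (5)
    return m, V

# Example: a 3-dimensional Gaussian, conditioning on the last two coordinates.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.5, 0.2],
                  [0.1, 0.2, 1.0]])
m, V = gaussian_conditional(mu, Sigma, x2=np.array([0.5, 0.0]), k=1)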




2.2 Matrix Inversion

Let us now review matrix inversion for a block matrix. Consider the following block matrix M, and define the Schur complements M/H = E − F H^{-1} G and M/E = H − G E^{-1} F:

M = [E F; G H]

We can derive the inverse of the matrix, i.e. M^{-1}, as:

M^{-1} = [E F; G H]^{-1} = [I 0; −H^{-1} G I] [(M/H)^{-1} 0; 0 H^{-1}] [I −F H^{-1}; 0 I]

       = [E^{-1} + E^{-1} F (M/E)^{-1} G E^{-1}    −E^{-1} F (M/E)^{-1};    −(M/E)^{-1} G E^{-1}    (M/E)^{-1}]

We also get the matrix inversion lemma:

(E − F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H − G E^{-1} F)^{-1} G E^{-1}    (6)
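Equation (6) can be checked numerically on random matrices; the short sketch below (our own illustration, with arbitrary dimensions and a fixed seed) compares both sides of the lemma:

import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 5
E = 5.0 * np.eye(p) + rng.standard_normal((p, p))   # p x p, well conditioned
H = 5.0 * np.eye(q) + rng.standard_normal((q, q))   # q x q, well conditioned
F = rng.standard_normal((p, q))
G = rng.standard_normal((q, p))

lhs = np.linalg.inv(E - F @ np.linalg.inv(H) @ G)
Einv = np.linalg.inv(E)
rhs = Einv + Einv @ F @ np.linalg.inv(H - G @ Einv @ F) @ G @ Einv
assert np.allclose(lhs, rhs)   # matrix inversion lemma, equation (6)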

2.3 Matrix Algebra

We review the matrix trace and its derivatives in this section. The trace of a matrix is defined as follows:

tr[A] = ∑_i a_{ii}

We have cyclic permutations:

tr[ABC] = tr[CAB] = tr[BCA]

Taking derivatives of a trace:

∂tr[BA]/∂A = B^T

∂tr[x^T A x]/∂A = ∂tr[x x^T A]/∂A = x x^T

Also, we review the formula for the derivative of the log matrix determinant:

∂ log|A| / ∂A = A^{-1}

3 Factor Analysis

For density estimation given a set of observations, although the data vector may be high-dimensional, the data may actually lie near a lower-dimensional manifold, i.e. a lower-dimensional representation may be sufficient to represent the observed states. In such cases, data modeling can be done in two stages:

1. A point on the manifold is generated using a simple probability density



2. The observed data point is generated from another simple density centered on that point

The coordinates of such a point form the components of the latent random vector. If the manifold is a linear subspace, a model called Factor Analysis is obtained. In factor analysis, we assume that the latent random vector is Gaussian. FA can thus be viewed as an unsupervised linear regression model.

3.1 The FA model and Parameterization

The model can essentially be viewed as X → Y, where X is continuous and hidden, and Y is continuous and observed. Geometrically, it can be interpreted as sampling X from a Gaussian distribution in a low-dimensional subspace and then generating Y by sampling a Gaussian distribution conditioned on X. The following figure (from Michael Jordan's notes) represents this:

Figure 1: Dimensionality reduction with the Factor Analysis model

The advantage of such a model is that, since both X and Y | X are Gaussian, all marginal, conditional and joint distributions are also Gaussian. Therefore, we can easily determine these distributions by computing their means and covariances. We have:

p(x) = N(x; 0, I)

p(y | x) = N(y; μ + Λx, Ψ)

where Λ is called the factor loading matrix and Ψ is a diagonal covariance matrix. Via the formulae discussed in the mathematics review section, we write the joint distribution of X and Y as:

p([x; y]) = N([x; y] | [μ_x; μ_y], [Σ_{xx} Σ_{xy}; Σ_{yx} Σ_{yy}])



As we have already assumed the values of μ_x and Σ_{xx}, we calculate μ_y and Σ_{yy}, assuming the added noise is uncorrelated with the data, i.e. W ∼ N(0, Ψ):

μ_y = E[Y] = E[μ + ΛX + W]
    = μ + Λ E[X] + E[W]
    = μ + Λ·0 + 0 = μ

Σ_{yy} = Var[Y] = E[(Y − μ)(Y − μ)^T]
       = E[(μ + ΛX + W − μ)(μ + ΛX + W − μ)^T]
       = E[(ΛX + W)(ΛX + W)^T]
       = Λ E[XX^T] Λ^T + E[WW^T]
       = ΛΛ^T + Ψ

Here, the covariance of Y is a sum of the diagonal covariance matrix Ψ and the outer product of a tall, skinny matrix with itself, i.e. ΛΛ^T. Thus, although Σ_{yy} is a high-dimensional matrix, it has a low-rank-plus-diagonal structure.

Now, we write the covariance between X and Y as:

Σ_{xy} = Cov[X, Y] = E[(X − 0)(Y − μ)^T]
       = E[X(μ + ΛX + W − μ)^T]
       = E[XX^T Λ^T + XW^T] = Λ^T

Thus, the joint distribution of X and Y can be written as:

p([x; y]) = N([x; y] | [0; μ], [I Λ^T; Λ ΛΛ^T + Ψ])
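To make the generative view concrete, the sketch below (ours; Λ, Ψ, μ and the sample size are arbitrary choices) samples from the FA model and checks that the empirical covariance of Y is close to ΛΛ^T + Ψ:

import numpy as np

rng = np.random.default_rng(0)
p, q, N = 2, 5, 200_000                                  # latent dim p < observed dim q
Lam = rng.standard_normal((q, p))                        # factor loading matrix Lambda
Psi = np.diag(rng.uniform(0.5, 1.5, size=q))             # diagonal noise covariance
mu = rng.standard_normal(q)

X = rng.standard_normal((N, p))                          # x ~ N(0, I)
W = rng.standard_normal((N, q)) * np.sqrt(np.diag(Psi))  # w ~ N(0, Psi)
Y = mu + X @ Lam.T + W                                   # y = mu + Lambda x + w

emp_cov = np.cov(Y, rowvar=False)
print(np.abs(emp_cov - (Lam @ Lam.T + Psi)).max())       # small for large N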

3.2 Inference

Using equations (4) and (5) from the mathematics review section, we can compute the conditional probability

p(X | Y) = N(X | m_{x|y}, V_{x|y})

Let p be the dimension of the variable X, and q the dimension of the variable Y. For FA models, we have p < q. We use the matrix inversion lemma, equation (6), here, because computing the quantity

(I + Λ^T Ψ^{-1} Λ)^{-1}    (7)

requires us to invert a p×p matrix, which is easier than computing

(ΛΛ^T + Ψ)^{-1}    (8)



where we need to invert a q×q matrix. Using the equations presented above, we have:

p(X | Y) = N(X | m_{x|y}, V_{x|y})

V_{x|y} = Σ_{xx} − Σ_{xy} Σ_{yy}^{-1} Σ_{yx}
        = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ
        = (I + Λ^T Ψ^{-1} Λ)^{-1}

m_{x|y} = μ_x + Σ_{xy} Σ_{yy}^{-1} (Y − μ_y)
        = Λ^T (ΛΛ^T + Ψ)^{-1} (Y − μ)
        = [(ΛΛ^T + Ψ)(Λ^T)^{-1}]^{-1} (Y − μ)
        = [Λ + Ψ(Λ^T)^{-1}]^{-1} (Y − μ)
        = [Ψ(Ψ^{-1}Λ + (Λ^T)^{-1})]^{-1} (Y − μ)
        = [Ψ(Λ^T)^{-1}(Λ^T Ψ^{-1} Λ + I)]^{-1} (Y − μ)
        = (I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} (Y − μ)
        = V_{x|y} Λ^T Ψ^{-1} (Y − μ)

(Here (Λ^T)^{-1} is used only as a formal device; the resulting identity can be verified directly with the matrix inversion lemma.)

We see that the posterior covariance does not depend on the observed data. Also, computing the posterior mean is a linear operation in Y; it is equivalent to projecting Y onto the lower-dimensional subspace spanned by the columns of the loading matrix Λ.
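A minimal sketch of this posterior computation (ours, reusing the Λ, Ψ, μ notation above), using the cheaper p×p inversion of equation (7):

import numpy as np

def fa_posterior(y, mu, Lam, Psi_diag):
    """Posterior p(x | y) = N(m_{x|y}, V_{x|y}) for the FA model.

    Psi_diag holds the diagonal of Psi, so Psi^{-1} is an elementwise division;
    only a p x p matrix is inverted, as in equation (7).
    """
    p = Lam.shape[1]
    Lt_Psi_inv = Lam.T / Psi_diag                      # Lambda^T Psi^{-1}
    V = np.linalg.inv(np.eye(p) + Lt_Psi_inv @ Lam)    # V_{x|y}
    m = V @ Lt_Psi_inv @ (y - mu)                      # m_{x|y}
    return m, V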

3.3 Learning

We derived the conditional probability p(X | Y) in the previous section. We now learn the parameters Λ, Ψ and μ via the Expectation Maximization (EM) algorithm. If we had complete data, estimation of X would reduce to Gaussian density estimation, and Y is a linear function of x with additive white Gaussian noise W. Therefore, in the E-step we "fill in" X, and in the M-step we estimate Λ and Ψ using linear regression.

E-step: We start by computing the incomplete log likelihood:

ℓ(θ; D) = −(N/2) log|ΛΛ^T + Ψ| − (1/2) ∑_n (y_n − μ)^T (ΛΛ^T + Ψ)^{-1} (y_n − μ)

        = −(N/2) log|ΛΛ^T + Ψ| − (1/2) tr[(ΛΛ^T + Ψ)^{-1} S]

where S = ∑_n (y_n − μ)(y_n − μ)^T

Estimating μ is straightforward:

μ_ML = (1/N) ∑_n y_n

The variables Λ and Ψ are coupled together in the expression obtained via the matrix inversion lemma, and it is not possible to decouple them in closed form. However, if we pretend that all variables are observed, the parameters enter only linearly. Since we do not actually know the hidden variables, we use EM. To simplify the derivation, we assume that the data has been centered, so that Y | x ∼ N(Λx, Ψ).

The complete log likelihood is given by:

ℓ_c = ∑_n log p(x_n, y_n) = ∑_n [log p(x_n) + log p(y_n | x_n)]

    = −(N/2) log|I| − (1/2) ∑_n x_n^T x_n − (N/2) log|Ψ| − (1/2) ∑_n (y_n − Λx_n)^T Ψ^{-1} (y_n − Λx_n)

    = −(N/2) log|Ψ| − (1/2) ∑_n tr[x_n x_n^T] − (N/2) tr[S Ψ^{-1}]

where S = (1/N) ∑_n (y_n − Λx_n)(y_n − Λx_n)^T

We replace the unknown quantities with their expectations; by the law of total variance,

Var(Y) = Var(E(Y | X)) + E(Var(Y | X)),

we have, for ⟨S⟩ and ⟨X_n X_n^T⟩:

⟨S⟩ = (1/N) ∑_n (y_n y_n^T − y_n ⟨X_n⟩^T Λ^T − Λ ⟨X_n⟩ y_n^T + Λ ⟨X_n X_n^T⟩ Λ^T)

⟨X_n⟩ = E[X_n | y_n]

⟨X_n X_n^T⟩ = Var[X_n | y_n] + E[X_n | y_n] E[X_n | y_n]^T

where ⟨X_n⟩ = m_{x_n|y_n} and ⟨X_n X_n^T⟩ = V_{x_n|y_n} + m_{x_n|y_n} m_{x_n|y_n}^T

are our sufficient statistics as defined above. Thus, we conclude the E-step.
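In code, the E-step simply evaluates the posterior from Section 3.2 for every data point and accumulates the two expectations above; a sketch (ours; Y holds one observation per row):

import numpy as np

def fa_e_step(Y, mu, Lam, Psi_diag):
    """Return <X_n> (rows of M) and <X_n X_n^T> (slices of EXX) for each row of Y."""
    p = Lam.shape[1]
    Lt_Psi_inv = Lam.T / Psi_diag                      # Lambda^T Psi^{-1}
    V = np.linalg.inv(np.eye(p) + Lt_Psi_inv @ Lam)    # V_{x|y}, shared by all n
    M = (Y - mu) @ (V @ Lt_Psi_inv).T                  # row n is m_{x_n|y_n}
    EXX = V + M[:, :, None] * M[:, None, :]            # V_{x|y} + m m^T, shape (N, p, p)
    return M, EXX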

M-step: We now take the partial derivative of the complete log likelihood with respect to the two parameters, using the trace and determinant rules described in the mathematics review section:

∂⟨ℓ_c⟩/∂Ψ^{-1} = ∂/∂Ψ^{-1} ( −(N/2) log|Ψ| − (1/2) ∑_n tr[⟨X_n X_n^T⟩] − (N/2) tr[⟨S⟩ Ψ^{-1}] )

               = (N/2) Ψ − (N/2) ⟨S⟩   ⇒   Ψ^{t+1} = ⟨S⟩

and

∂⟨ℓ_c⟩/∂Λ = ∂/∂Λ ( −(N/2) log|Ψ| − (1/2) ∑_n tr[⟨X_n X_n^T⟩] − (N/2) tr[⟨S⟩ Ψ^{-1}] ) = −(N/2) Ψ^{-1} ∂⟨S⟩/∂Λ

          = −(N/2) Ψ^{-1} ∂/∂Λ ( (1/N) ∑_n (y_n y_n^T − y_n ⟨X_n⟩^T Λ^T − Λ ⟨X_n⟩ y_n^T + Λ ⟨X_n X_n^T⟩ Λ^T) )

          = Ψ^{-1} ∑_n y_n ⟨X_n⟩^T − Ψ^{-1} Λ ∑_n ⟨X_n X_n^T⟩   ⇒   Λ^{t+1} = ( ∑_n y_n ⟨X_n⟩^T ) ( ∑_n ⟨X_n X_n^T⟩ )^{-1}
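Given the E-step statistics, the M-step updates are one line each; a sketch (ours, continuing the fa_e_step outputs from the previous snippet and assuming the data has been centered so that μ = 0):

import numpy as np

def fa_m_step(Y, M, EXX):
    """Update Lambda and (diagonal) Psi from the E-step statistics M, EXX."""
    sum_yx = Y.T @ M                                   # sum_n y_n <X_n>^T, shape (q, p)
    sum_xx = EXX.sum(axis=0)                           # sum_n <X_n X_n^T>, shape (p, p)
    Lam = sum_yx @ np.linalg.inv(sum_xx)               # Lambda^{t+1}
    N = Y.shape[0]
    # <S> evaluated at the new Lambda; the cross terms simplify because
    # Lambda^{t+1} sum_xx = sum_yx.
    S = (Y.T @ Y - Lam @ sum_yx.T) / N
    Psi_diag = np.diag(S).copy()                       # Psi is constrained to be diagonal
    return Lam, Psi_diag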

There is a degeneracy in the FA model: since the loading matrix Λ appears only in an outer product with itself, the model is invariant to rotations and flips of the basis vectors that define the latent manifold. Consider any orthonormal matrix Q; if we replace Λ with ΛQ, the model remains the same: (ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T. This implies that there is no single best setting of this parameter. If we only need a low-dimensional subspace representation of our data, we do not need to worry about this degeneracy. But if we want to interpret the process that generated our data, this is not fool-proof, as a rotation can change the meaning of the factors. In general, such models are called unidentifiable, since two different people fitting parameters to the same data are not guaranteed to arrive at the same values.
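The rotational degeneracy is easy to see numerically; a tiny check (ours, with an arbitrary Λ and a 2-D rotation for Q):

import numpy as np

rng = np.random.default_rng(0)
Lam = rng.standard_normal((5, 2))
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])        # orthonormal: Q Q^T = I
assert np.allclose((Lam @ Q) @ (Lam @ Q).T, Lam @ Lam.T)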

4 State Space Models

A State Space Model (SSM) is a dynamical generalization of the FA model. As mentioned above, the SSM is structurally identical to a Hidden Markov Model, but the variables in an SSM follow continuous Gaussian distributions. Although the variables are continuous, we can use the properties of the Gaussian distribution to derive inference for SSMs quite simply. The following is a graphical representation of an SSM:

Figure 2: Graphical representation of a State Space Model

Here each y_t is a continuous random vector observed at time t. We assume there exists a latent sequence x_{1:T} that generates these observations; both x_t and y_t are Gaussian random vectors. We introduce a transition matrix A that determines the relationship between the latent variables, such that the mean of the state x_t at time t is linear in the state at time t−1:

x_t = A x_{t−1} + G w_t,    where w_t ∼ N(0, Q)

We add Gaussian white noise w_t to the model without affecting linearity, since a linear combination of Gaussians is also Gaussian.

We use an FA model at each time point t to represent the output. Let C be a loading matrix; C is shared across all (x_t, y_t) pairs, since we assume that all data points lie in the same low-dimensional space. We represent the output as:

y_t = C x_t + v_t,    where v_t ∼ N(0, R)

We again add Gaussian white noise v_t. Note that so far no assumptions are made on the Q and R matrices.

Let the starting point be:

x_0 ∼ N(0, Σ_0)



Hence, this is a linear dynamical system.

4.1 Linear Dynamical System for 2D Tracking - An SSM example

Here we review an application of an SSM for latent-state inference. In general, SSMs are not used for dimensionality reduction in the way FA is.

Consider a point moving in 2D space following the rule:

new position = old position + Δt × velocity + noise

This is a constant-velocity model with Gaussian noise. Consider the following formulation, where the hidden state contains the positions (x^1_t, x^2_t) and velocities (ẋ^1_t, ẋ^2_t):

[x^1_t; x^2_t; ẋ^1_t; ẋ^2_t] = [1 0 Δt 0; 0 1 0 Δt; 0 0 1 0; 0 0 0 1] [x^1_{t−1}; x^2_{t−1}; ẋ^1_{t−1}; ẋ^2_{t−1}] + noise

In this case, the observations y_t are simply (x^1_t, x^2_t), i.e. the projection of the hidden state onto the first two components:

[y^1_t; y^2_t] = [1 0 0 0; 0 1 0 0] [x^1_t; x^2_t; ẋ^1_t; ẋ^2_t] + noise

Figure 3: 2D Tracking
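A short simulation of this constant-velocity model (our own sketch; Δt, the noise covariances and the number of steps are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
dt, T = 0.1, 100
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])                  # constant-velocity dynamics
C = np.array([[1., 0, 0, 0],
              [0, 1., 0, 0]])                  # observe the position only
Q = 0.01 * np.eye(4)                           # process noise covariance
R = 0.10 * np.eye(2)                           # observation noise covariance

x = np.array([0.0, 0.0, 1.0, 0.5])             # initial position and velocity
xs, ys = [], []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(4), Q)
    y = C @ x + rng.multivariate_normal(np.zeros(2), R)
    xs.append(x)
    ys.append(y)
xs, ys = np.array(xs), np.array(ys)            # hidden trajectory and noisy observations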



4.2 Inference

As SSMs are structurally similar to HMMs, inference algorithms for HMMs transfer to SSMs intuitively. Let us consider the SSM versions of filtering and smoothing.

Filtering - Inference Problem 1
Given y_1, . . . , y_t we want to estimate p(x_t | y_{1:t}). Let us review the forward algorithm for HMMs here, as the filtering algorithm is very similar to it:

p(x_t = i | y_{1:t}) = α^i_t ∝ p(y_t | x_t = i) ∑_j p(x_t = i | x_{t−1} = j) α^j_{t−1}

This is a dynamic programming approach to the sequential Bayesian updates, i.e. exact online inference. We can apply the exact same idea to SSMs, where it is called Kalman filtering (discussed in Section 4.2.1).

Smoothing - Inference Problem 2
Given y_1, . . . , y_T we want to estimate p(x_t | y_{1:T}) for t < T. In HMMs, we use the forward-backward algorithm to perform smoothing. Similarly, we can apply this to SSMs, where it is called the Rauch-Tung-Striebel smoother. For HMMs we have:

p(x_t = i | y_{1:T}) = γ^i_t ∝ α^i_t ∑_j p(x_{t+1} = j | x_t = i) γ^j_{t+1}

The main difference between smoothing and filtering is that filtering only takes past information into account to estimate the hidden variables, while smoothing considers past as well as future information. Therefore, smoothing performs better than filtering in cases where future observations are available.

4.2.1 Kalman Filtering

Let us now derive the equations for Kalman filtering. As an example, consider the task of visual processing in the brain. We can model the world as consisting of hidden states that change over time; what we actually observe consists of noisy projections of the real hidden state of the world. For visual processing, we want to estimate

p(x_{t+1} | y_{1:t})

In an SSM, as all the conditional probabilities are linear Gaussian, the entire system defines one large multivariate Gaussian, and all marginals are Gaussian as well. We thus represent the belief state p(x_t | y_{1:t}) as a Gaussian with mean E(x_t | y_{1:t}) = x_{t|t} and covariance Cov(x_t | y_{1:t}) = P_{t|t}. Kalman filtering is a recursive process made up of the following steps:

1. Predict step/Time Update: Compute p(x_{t+1} | y_{1:t}) from the prior belief p(x_t | y_{1:t}) and the dynamic model p(x_{t+1} | x_t)

2. Update step/Measurement Update: Compute the new belief p(x_{t+1} | y_{1:t+1}) from the prediction p(x_{t+1} | y_{1:t}), the observation y_{t+1} and the observation model p(y_{t+1} | x_{t+1})

Now, let us derive the above update formulae.



Predict step: According to the dynamic model we have:

x_t = A x_{t−1} + G w_t,    where w_t ∼ N(0, Q)

We can do the one step ahead prediction as follows:

x_{t+1|t} = E(x_{t+1} | y_{1:t}) = E(A x_t + G w_t | y_{1:t}) = A x_{t|t}

P_{t+1|t} = E[(x_{t+1} − x_{t+1|t})(x_{t+1} − x_{t+1|t})^T | y_{1:t}]
          = E[(A x_t + G w_t − x_{t+1|t})(A x_t + G w_t − x_{t+1|t})^T | y_{1:t}]
          = E[(A x_t + G w_t − A x_{t|t})(A x_t + G w_t − A x_{t|t})^T | y_{1:t}]
          = A P_{t|t} A^T + G Q G^T

According to the observation model, we have:

y_t = C x_t + v_t,    where v_t ∼ N(0, R)

We can do the one-step-ahead prediction of the observation as follows:

E(y_{t+1} | y_{1:t}) = E(C x_{t+1} + v_{t+1} | y_{1:t}) = C x_{t+1|t} = y_{t+1|t}

E[(y_{t+1} − y_{t+1|t})(y_{t+1} − y_{t+1|t})^T | y_{1:t}] = C P_{t+1|t} C^T + R

E[(y_{t+1} − y_{t+1|t})(x_{t+1} − x_{t+1|t})^T | y_{1:t}] = C P_{t+1|t}

Update step: From the predict step we get:

p(x_{t+1}, y_{t+1} | y_{1:t}) = N(m_{t+1}, V_{t+1})

where m_{t+1} = [x_{t+1|t}; C x_{t+1|t}],    V_{t+1} = [P_{t+1|t}  P_{t+1|t} C^T; C P_{t+1|t}  C P_{t+1|t} C^T + R]

Using equation (2) from the mathematics review for conditional Gaussian distributions,

p([x_1; x_2] | μ, Σ) = N([x_1; x_2] | [μ_1; μ_2], [Σ_{11} Σ_{12}; Σ_{21} Σ_{22}]),    p(x_1 | x_2) = N(x_1 | m_{1|2}, V_{1|2})

where m_{1|2} = μ_1 + Σ_{12} Σ_{22}^{-1} (x_2 − μ_2) and V_{1|2} = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21}.

Hence, we can use the same formula to get the measurement updates:

x_{t+1|t+1} = x_{t+1|t} + K_{t+1}(y_{t+1} − C x_{t+1|t})

P_{t+1|t+1} = P_{t+1|t} − K_{t+1} C P_{t+1|t}

K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

where K_{t+1} is the Kalman gain matrix. For the time updates:

x_{t+1|t} = A x_{t|t}

P_{t+1|t} = A P_{t|t} A^T + G Q G^T

K_t can be precomputed, as it is independent of the data.
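Putting the time and measurement updates together, one Kalman-filter recursion can be written as follows (a sketch of the equations above; the function name and argument order are ours):

import numpy as np

def kalman_step(x, P, y_new, A, C, G, Q, R):
    """One KF recursion: predict p(x_{t+1} | y_{1:t}), then condition on y_{t+1}."""
    # Time update (predict step)
    x_pred = A @ x                                 # x_{t+1|t}
    P_pred = A @ P @ A.T + G @ Q @ G.T             # P_{t+1|t}
    # Measurement update
    S = C @ P_pred @ C.T + R                       # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)            # Kalman gain K_{t+1}
    x_new = x_pred + K @ (y_new - C @ x_pred)      # x_{t+1|t+1}
    P_new = P_pred - K @ C @ P_pred                # P_{t+1|t+1}
    return x_new, P_new

Iterating this step over the observations from the 2D tracking sketch above recovers the hidden trajectory up to the filtering error.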



4.2.2 Example of Kalman Filtering in 1-Dimension

Consider noisy observations of a 1D particle doing a random walk:

x_t = x_{t−1} + w,  w ∼ N(0, σ_x)        z_t = x_t + v,  v ∼ N(0, σ_z)

Using the Kalman filter equations with A = I, C = I, G = I, we get:

x_{t+1|t} = A x_{t|t} = x_{t|t}

P_{t+1|t} = A P_{t|t} A^T + G Q G^T = σ_t + σ_x

K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} = (σ_t + σ_x) / (σ_t + σ_x + σ_z)

x_{t+1|t+1} = x_{t+1|t} + K_{t+1}(z_{t+1} − C x_{t+1|t}) = [(σ_t + σ_x) z_{t+1} + σ_z x_{t|t}] / (σ_t + σ_x + σ_z)

P_{t+1|t+1} = P_{t+1|t} − K_{t+1} C P_{t+1|t} = (σ_t + σ_x) σ_z / (σ_t + σ_x + σ_z)

The term (z_{t+1} − C x_{t+1|t}) is called the innovation.
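The scalar recursions above can be run directly; a small sketch (ours, with arbitrary noise variances and initial belief):

import numpy as np

rng = np.random.default_rng(0)
sx, sz = 0.1, 1.0                       # process and observation noise variances
x_true, x_est, sigma = 0.0, 0.0, 1.0    # true state, belief mean, belief variance
for _ in range(50):
    x_true += rng.normal(0, np.sqrt(sx))            # random walk
    z = x_true + rng.normal(0, np.sqrt(sz))         # noisy observation
    K = (sigma + sx) / (sigma + sx + sz)            # scalar Kalman gain
    x_est = x_est + K * (z - x_est)                 # innovation-weighted update
    sigma = (sigma + sx) * sz / (sigma + sx + sz)   # posterior variance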

KF intuition: The new belief is a convex combination of the prediction from the prior and the observation, weighted by the Kalman gain matrix. If the observations are unreliable, σ_z (i.e. R) is large, hence K_{t+1} is small, and we pay more attention to the prediction. On the other hand, if the old prior is unreliable (large σ_t), or the process itself is very unpredictable (large σ_x), we pay more attention to the observation z_{t+1}.

4.2.3 Complexity Analysis of Kalman Filter

Let x_t ∈ R^{N_x} and y_t ∈ R^{N_y}. The complexity of one KF step is:

• Time update: P_{t+1|t} = A P_{t|t} A^T + G Q G^T takes O(N_x^2) time, assuming dense P and dense A.

• Measurement update: K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} takes O(N_y^3) time.

Hence the overall time is O(max{N_x^2, N_y^3}) in general.

4.2.4 Rauch-Tung-Striebel Smoothing

As with the Kalman filter, the RTS smoother equations are:

x_{t|T} = x_{t|t} + L_t(x_{t+1|T} − x_{t+1|t})

P_{t|T} = P_{t|t} + L_t(P_{t+1|T} − P_{t+1|t}) L_t^T

L_t = P_{t|t} A^T P_{t+1|t}^{-1}

Here we observe that the above equations follow the pattern: KF result + L_t times the difference between the smoothed and predicted results of the next step. Also, we use a backward computation here, i.e., we pretend to know things at time t+1, where such conditioning makes things simple, and we remove the conditioning at the end. We use the law of iterated expectation and the law of total variance to calculate x_t | y_{1:T}:

E[X | Z] = E[E[X | Y, Z] | Z]

Var[X | Z] = Var[E[X | Y, Z] | Z] + E[Var[X | Y, Z] | Z]



Hence, x_{t|T} is given by:

x_{t|T} = E[x_t | y_{1:T}]
        = E[E[x_t | x_{t+1}, y_{1:T}] | y_{1:T}]
        = E[E[x_t | x_{t+1}, y_{1:t}] | y_{1:T}]

so we first need the inner conditional expectation E[x_t | x_{t+1}, y_{1:t}].

Following the results of the KF, we can derive p(x_{t+1}, x_t | y_{1:t}) = N(m, V), where

m = [x_{t|t}; x_{t+1|t}],    V = [P_{t|t}  P_{t|t} A^T; A P_{t|t}  P_{t+1|t}]

all of which are available from the forward KF pass. Using the formulae for conditional Gaussian distributions, we can derive the equations for the RTS smoother:

x_{t|T} = E[E[x_t | x_{t+1}, y_{1:t}] | y_{1:T}]
        = x_{t|t} + L_t(x_{t+1|T} − x_{t+1|t})

P_{t|T} = Var[E[x_t | x_{t+1}, y_{1:t}] | y_{1:T}] + E[Var[x_t | x_{t+1}, y_{1:t}] | y_{1:T}]
        = P_{t|t} + L_t(P_{t+1|T} − P_{t+1|t}) L_t^T

where L_t = P_{t|t} A^T P_{t+1|t}^{-1}
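Given the filtered and predicted quantities from a forward KF pass, the backward RTS recursion is only a few lines; a sketch (ours; x_filt[t] and P_filt[t] hold x_{t|t} and P_{t|t}, while x_pred[t] and P_pred[t] hold x_{t+1|t} and P_{t+1|t}):

import numpy as np

def rts_smooth(x_filt, P_filt, x_pred, P_pred, A):
    """Backward RTS pass over T filtered estimates (lists or arrays of length T)."""
    T = len(x_filt)
    x_sm, P_sm = [None] * T, [None] * T
    x_sm[-1], P_sm[-1] = x_filt[-1], P_filt[-1]          # x_{T|T}, P_{T|T}
    for t in range(T - 2, -1, -1):
        L = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t])   # L_t
        x_sm[t] = x_filt[t] + L @ (x_sm[t + 1] - x_pred[t])
        P_sm[t] = P_filt[t] + L @ (P_sm[t + 1] - P_pred[t]) @ L.T
    return x_sm, P_sm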

4.3 Learning

The complete log likelihood of an SSM is given by:

ℓ_c(θ; D) = ∑_n log p(x_n, y_n) = ∑_n log p(x_{n,1}) + ∑_n ∑_t log p(x_{n,t} | x_{n,t−1}) + ∑_n ∑_t log p(y_{n,t} | x_{n,t})

          = f_1(x_1, Σ_0) + f_2({x_t x_{t−1}^T, x_t x_t^T, x_t : ∀t}; A, Q, G) + f_3({x_t x_t^T, x_t : ∀t}; C, R)

We use the EM algorithm to learn the SSM as well:

E-step: Via the Kalman filter and the RTS smoother, we infer the expected sufficient statistics

⟨x_t x_{t−1}^T⟩, ⟨x_t x_t^T⟩, ⟨x_t⟩, all conditioned on y_{1:T}

M-step: We apply MLE to the expected complete log likelihood:

⟨ℓ_c(θ; D)⟩ = f_1(⟨x_1⟩, Σ_0) + f_2({⟨x_t x_{t−1}^T⟩, ⟨x_t x_t^T⟩, ⟨x_t⟩ : ∀t}; A, Q, G) + f_3({⟨x_t x_t^T⟩, ⟨x_t⟩ : ∀t}; C, R)

5 Nonlinear Systems

In many applications, such as robotics, the motion and observation models are nonlinear:

x_t = f(x_{t−1}) + G w_t,    y_t = g(x_t) + v_t

Hence, we can no longer find an optimal closed-form solution to the filtering problem. Here f and g are nonlinear functions that are sometimes represented by neural networks. We can learn the parameters of these functions offline using the EM algorithm, via gradient descent in the M-step. We can also learn the parameters online by adding them to the state space, but this makes the problem even more nonlinear.

The Extended Kalman Filter (EKF) is a solution to this problem. We linearize the nonlinear functions f and g using a first-order Taylor expansion and then apply the standard Kalman filter. In effect, we approximate the stationary nonlinear system with a non-stationary linear system.

x_t ≈ f(x_{t−1|t−1}) + A_{x_{t−1|t−1}}(x_{t−1} − x_{t−1|t−1}) + G w_t

y_t ≈ g(x_{t|t−1}) + C_{x_{t|t−1}}(x_t − x_{t|t−1}) + v_t

where x_{t|t−1} = f(x_{t−1|t−1}), A_x = ∂f/∂x |_x and C_x = ∂g/∂x |_x
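A compact sketch of one EKF step (ours; the nonlinear functions f and g and their Jacobians A_x = ∂f/∂x and C_x = ∂g/∂x are passed in as callables):

import numpy as np

def ekf_step(x, P, y_new, f, g, jac_f, jac_g, G, Q, R):
    """One extended Kalman filter step: linearize f and g around the current estimates."""
    # Predict, linearizing the dynamics at x_{t-1|t-1}
    A = jac_f(x)                                   # A_x evaluated at x_{t-1|t-1}
    x_pred = f(x)                                  # x_{t|t-1} = f(x_{t-1|t-1})
    P_pred = A @ P @ A.T + G @ Q @ G.T
    # Update, linearizing the observation model at x_{t|t-1}
    C = jac_g(x_pred)                              # C_x evaluated at x_{t|t-1}
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y_new - g(x_pred))
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new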

The noise covariances Q and R are not changed, i.e. the additional error due to linearization is not modeled. There also exist cases where the hidden state is constant but the observation matrix is a time-varying vector; the observation at each time slice t is then a linear regression, and we can estimate the parameters recursively using the Kalman filter:

θ_{t+1} = θ_t + P_t R^{-1} (y_{t+1} − x_t^T θ_t) x_t

This is called the recursive least squares (RLS) algorithm. If the P_t R^{-1} term is approximated by a scalar constant (justified via stochastic approximation theory), we obtain the least mean squares (LMS) algorithm.
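A minimal RLS sketch (ours; here r denotes the observation noise variance R, and the gain is written in the equivalent P_t x_t / (x_t^T P_t x_t + r) form of the Kalman gain for this model):

import numpy as np

def rls(X, y, r=1.0, p0=100.0):
    """Recursive least squares for y_t = x_t^T theta + noise, one update per sample."""
    d = X.shape[1]
    theta = np.zeros(d)
    P = p0 * np.eye(d)                              # prior covariance of theta
    for x_t, y_t in zip(X, y):
        k = P @ x_t / (x_t @ P @ x_t + r)           # gain for this time slice
        theta = theta + k * (y_t - x_t @ theta)     # innovation-weighted update
        P = P - np.outer(k, x_t) @ P                # covariance update
    return theta

# Example: recover a 3-dimensional weight vector from noisy linear observations.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((500, 3))
y = X @ true_theta + 0.1 * rng.standard_normal(500)
print(rls(X, y))                                    # close to true_theta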