
Latent Factor Models

Geoff Gordon

Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy

Motivation

A key component of a cognitive tutor: student cognitive model

Tracks which skills a student currently knows (these are the latent factors)

[Diagram: skill nodes circle-area, rectangle-area, decompose-area, and the observed node right-answer]

Motivation

Student models are a key bottleneck in cognitive tutor authoring and performance

rough estimate: 20-80 hrs to hand-code model for 1 hr of content

result may be too simple, not rigorously verified

But, demonstrated improvements in learning from better models

E.g., Cen et al. [2007]: 12% less time to learn 6 geometry units (same retention) using a tutor w/ a more accurate model

This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis

Simple case: snapshot, no side information

Entry (i, j) = score of student i on item j:

            ITEMS: 1 2 3 4 5 6 …
STUDENT A          1 1 0 0 1 0 …
STUDENT B          0 1 1 0 0 0 …
STUDENT C          1 1 0 1 1 0 …
STUDENT D          1 0 0 1 1 0 …
                   … … … … … … …

Missing data

            ITEMS: 1 2 3 4 5 6 …
STUDENT A          1 ? ? ? 1 0 …
STUDENT B          0 ? 1 0 ? ? …
STUDENT C          1 1 ? ? ? 0 …
STUDENT D          1 0 0 1 ? ? …
                   … … … … … … …
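As a small illustration (not from the slides), such a partially observed matrix can be stored with NaN marking the unobserved "?" cells plus a Boolean mask; the models below only score the observed cells.

```python
import numpy as np

# Hypothetical encoding of the partially observed score matrix above:
# rows = students A-D, columns = items 1-6, np.nan marks a "?" entry.
X = np.array([
    [1,      np.nan, np.nan, np.nan, 1,      0],
    [0,      np.nan, 1,      0,      np.nan, np.nan],
    [1,      1,      np.nan, np.nan, np.nan, 0],
    [1,      0,      0,      1,      np.nan, np.nan],
])

mask = ~np.isnan(X)                       # True where a score was actually observed
print(int(mask.sum()), "of", X.size, "entries observed")
```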

Data matrix X

[Figure: data matrix X, one row per student (x1, …, xn), one column per item]

Simple case: model

[Diagram: latent factor model with unobserved U and V generating the observed X]

U: student latent factors (n students × k latent factors, unobserved)

V: item latent factors (m items × k latent factors, unobserved)

X: observed performance (n students × m items, observed)

Linear-Gaussian version

[Diagram: same model; the mean of Xij is determined by the student factor Ui and the item factor Vj]

U: Gaussian (0 mean, fixed variance)

V: Gaussian (0 mean, fixed variance)

X: Gaussian (fixed variance, mean Ui ⋅ Vj)
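Restating the slide's model as equations (the variances are unspecified fixed constants):

```latex
U_i \sim \mathcal{N}(0,\ \sigma_U^2 I_k), \qquad
V_j \sim \mathcal{N}(0,\ \sigma_V^2 I_k), \qquad
X_{ij} \mid U_i, V_j \sim \mathcal{N}(U_i \cdot V_j,\ \sigma^2)
```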

Matrix form: Principal Components Analysis

[Figure: data matrix X (rows x1, …, xn) ≈ compressed matrix U (rows u1, …, un) times basis matrix Vᵀ (rows v1, …, vk)]

PCA: the picture

PCA: matrix form

[Same figure: X ≈ U Vᵀ; the columns of V span the low-rank space]
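As a concrete sketch (not from the slides), the rank-k factorization X ≈ U Vᵀ can be computed with a truncated SVD of the column-centered data matrix; the toy matrix below is the student-by-item example from earlier.

```python
import numpy as np

def pca_factor(X, k):
    """Rank-k factorization X ~= U @ Vt via truncated SVD.

    Rows of X are students, columns are items. Returns U (n x k basis
    weights) and Vt (k x m basis vectors)."""
    # Center columns so the factors describe variation around the mean item score.
    Xc = X - X.mean(axis=0)
    U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = U_svd[:, :k] * s[:k]          # per-student basis weights
    return U, Vt[:k, :]               # k basis vectors over items

# Toy usage on the 4-student, 6-item score matrix from the earlier slide.
X = np.array([[1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0],
              [1, 1, 0, 1, 1, 0],
              [1, 0, 0, 1, 1, 0]], dtype=float)
U, Vt = pca_factor(X, k=2)
print(np.round(U @ Vt + X.mean(axis=0), 2))   # low-rank reconstruction of X
```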

Interpretation of factors

[Same figure, relabeled: rows of U are per-student basis weights; rows of Vᵀ are basis vectors over items]

Basis vectors are candidate “skills” or “knowledge components”; the weights are students’ knowledge levels.

PCA is a widely successful model

FACE IMAGES FROM Groundhog Day, EXTRACTED BY CAMBRIDGE FACE DB PROJECT

Data matrix: face images

[Figure: data matrix X, one row per image (x1, …, xn), one column per pixel]

Result of factoring

[Figure: X ≈ U Vᵀ; rows of U are per-image basis weights, rows of Vᵀ are basis vectors over pixels]

The basis vectors are often called “eigenfaces”.

Eigenfaces

IMAGE CREDIT: AT&T LABS CAMBRIDGE

PCA: the good

Unsupervised: need no human labels of latent state!

No worry about “expert blind spot”

Of course, labels helpful if available

Post-hoc human interpretation of latents is nice too—e.g., intervention design

PCA: the bad

Linear, Gaussian

PCA assumes E(X) is linear in UV

PCA assumes (X–E(X)) is i.i.d. Gaussian

Nonlinearity: conjunctive skills

[Surface plot: P(correct) as a function of skill 1 and skill 2]

Nonlinearity: disjunctive skills

[Surface plot: P(correct) as a function of skill 1 and skill 2]

Nonlinearity: “other”

[Surface plot: P(correct) as a function of skill 1 and skill 2]

Non-Gaussianity

Typical hand-developed skill-by-item matrix

          ITEMS: 1 2 3 4 5 6 …
SKILL 1          1 1 0 0 1 1 …
SKILL 2          0 0 1 1 0 1 …

Result of Gaussian assumption

[Figure: rows of the true and recovered V matrices, true vs. recovered]

The ugly: MLE only

PCA yields maximum-likelihood estimate

Good, right?

sadly, the usual reasons to want the MLE don’t apply here

e.g., consistency: the variance and bias of the estimates of U and V do not approach 0 (unless both #items per student and #students per item go to infinity)

Result: MLE is typically far too confident of itself

Too certain: example

[Figure: learned coefficients (e.g., a row of U) and the resulting predictions]

Result: “fold-in problem”

Nonsensical results when trying to apply learned model to a new student or item

Similar to overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples

Unlike overfitting, fold-in problem doesn’t necessarily go away with more data

Summary: 3 problems w/ PCA

Can’t handle nonlinearity

Can’t handle non-Gaussian distributions

Uses MLE only (⇒ fold-in problem)

Let’s look at each problem in turn

Nonlinearity

In PCA, had Xij ≈ Ui ⋅ Vj

What if

Xij ≈ exp(Ui ⋅ Vj)

Xij ≈ logistic(Ui ⋅ Vj)

Non-Gaussianity

In PCA, had Xij ∼ Normal(μ), μ = Ui ⋅ Vj

What if

Xij ∼ Poisson(μ)

Xij ∼ Binomial(p)

Exponential family review

Exponential family of distributions:

P(X | θ) = P0(X) exp(X⋅θ – G(θ))

G(θ) is always strictly convex, differentiable on interior of domain

• so G′ is strictly monotone (strictly generalized monotone in two or more dimensions)

Exponential family review

Exponential family PDF:

P(X | θ) = P0(X) exp(X⋅θ – G(θ))

• Surprising result: G’(θ) = g(θ) = E(X | θ)

• g⁻¹ = “link function” (g itself is the corresponding mean function)

• θ = “natural parameter”

• E(X | θ) = “expectation parameter”

Examples

Normal (natural parameter = mean): g = identity

Poisson (natural parameter = log rate): g = exp

Binomial (natural parameter = log odds): g = sigmoid
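A tiny illustration of these three mean functions g, mapping a natural parameter θ to E(X | θ); purely a sketch, not from the slides.

```python
import numpy as np

# Mean (inverse-link) functions g(theta) = E[X | theta] for the three families above.
mean_functions = {
    "normal":   lambda theta: theta,                         # identity
    "poisson":  lambda theta: np.exp(theta),                 # exp of the log rate
    "binomial": lambda theta: 1.0 / (1.0 + np.exp(-theta)),  # sigmoid of the log odds
}

theta = np.array([-2.0, 0.0, 2.0])
for family, g in mean_functions.items():
    print(family, np.round(g(theta), 3))
```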

Nonlinear & non-Gaussian

Let P(X | θ) be an exponential family with natural parameter θ

Predict Xij ∼ P(X | θij), where θij = Ui ⋅ Vj

e.g., in Poisson, E(Xij) = exp(θij)

e.g., in Binomial, E(Xij) = sigmoid(θij)

Optimization problem

max over U, V:  ∑ij log P(Xij | θij) + log P(U) + log P(V)

s.t. θij = Ui ⋅ Vj

• “Generalized linear” or “exponential family” PCA

• all P(…) terms are exponential families

• analogy to GLMs

[Collins et al, 2001][Gordon, 2002][Roy & Gordon, 2005]

Special cases

PCA, probabilistic PCA

Poisson PCA

k-means clustering

Max-margin matrix factorization (MMMF)

Almost: pLSI, pHITS, NMF

Comparison to AFM

p = probability correct

θ = student overall performance

β = skill difficulty

Q = item x skill matrix

γ = skill practice slope

T = number of practice opportunities

logit(pij) = θi + ∑k Qjk (βk + γk Tik),  where logit(p) = log(p / (1 − p))
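As a small sketch of the AFM (Additive Factor Model) prediction reconstructed above; all parameter values here are made up for illustration.

```python
import numpy as np

def afm_prob_correct(theta_i, beta, gamma, q_j, T_i):
    """AFM probability of a correct response:
    logit(p) = theta_i + sum_k q_jk * (beta_k + gamma_k * T_ik)."""
    logit_p = theta_i + q_j @ (beta + gamma * T_i)
    return 1.0 / (1.0 + np.exp(-logit_p))

# Toy example: 3 skills; item j requires skills 0 and 2.
q_j   = np.array([1.0, 0.0, 1.0])    # row of the item-by-skill matrix Q
beta  = np.array([-0.5, 0.2, 0.1])   # skill difficulties
gamma = np.array([0.3, 0.1, 0.2])    # skill practice slopes
T_i   = np.array([4, 0, 2])          # practice opportunities so far for this student
print(afm_prob_correct(theta_i=0.7, beta=beta, gamma=gamma, q_j=q_j, T_i=T_i))
```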

Theorem

• In GL PCA, finding U which maximizes likelihood (holding V fixed) is a convex optimization problem

• And, finding best V (holding U fixed) is a convex problem

• Further, Hessian is block diagonal

So, an efficient and effective optimization algorithm: alternately improve U and V
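A minimal sketch of that alternating scheme for the Bernoulli (logistic) case, not the authors' implementation: each convex subproblem is improved by a gradient step rather than solved exactly, and simple Gaussian priors on U and V stand in for the log P(U) and log P(V) terms.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_epca(X, mask, k, iters=200, lr=0.05, reg=0.1, seed=0):
    """Alternately improve U and V on
    sum_ij log P(X_ij | theta_ij) + log P(U) + log P(V), theta_ij = U_i . V_j,
    with Bernoulli observations; `mask` marks the observed entries."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    Xf = np.where(mask, X, 0.0)
    for _ in range(iters):
        resid = mask * (Xf - sigmoid(U @ V.T))   # d(log-likelihood)/d(theta) on observed cells
        U += lr * (resid @ V - reg * U)          # improve U with V held fixed
        resid = mask * (Xf - sigmoid(U @ V.T))
        V += lr * (resid.T @ U - reg * V)        # improve V with U held fixed
    return U, V

# Toy usage on the earlier 4x6 score matrix, with two entries treated as missing.
X = np.array([[1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 0, 0],
              [1, 1, 0, 1, 1, 0],
              [1, 0, 0, 1, 1, 0]], dtype=float)
mask = np.ones_like(X, dtype=bool)
mask[0, 1] = mask[2, 3] = False
U, V = logistic_epca(X, mask, k=2)
print(np.round(sigmoid(U @ V.T), 2))             # predicted P(correct) for every cell
```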

Example: compressing histograms w/ Poisson PCA

Points: observed frequencies in ℝ³

Hidden manifold: a 1-parameter family of multinomials

[Figure: observed points in the simplex with corners A, B, C]

Example (continued)

[Figures: the fitted model after iterations 1, 2, 3, 4, 5, and 9]

Remaining problem: MLE

Well-known rule of thumb: if MLE gets you in trouble due to overfitting, move to fully-Bayesian inference

Typical problem: computation

In our case, the computation is just fine if we’re a little clever

Additional wrinkle: switch to hierarchical model

Bayesian hierarchical exponential-family PCA

[Diagram: hierarchical model; shared priors R and S generate the student factors U and item factors V, which generate the observed X]

U: student latent factors (n students × k latent factors, unobserved)

V: item latent factors (m items × k latent factors, unobserved)

X: observed performance (mean determined by the student factor and item factor)

R: shared prior for student latents

S: shared prior for item latents

A little clever: MCMC

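The slide only notes that the computation works out with a little cleverness. Purely as an illustration (not the authors' sampler), the sketch below runs random-walk Metropolis over one student's factor vector Ui with V held fixed; only the unnormalized posterior is needed, so the normalizer P(X) never has to be computed.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_post_ui(u_i, V, x_i, obs, prior_var=1.0):
    """Unnormalized log posterior of one student's factors: Bernoulli likelihood
    over that student's observed items plus a Gaussian prior."""
    theta = V[obs] @ u_i
    loglik = np.sum(x_i[obs] * theta - np.log1p(np.exp(theta)))
    return loglik - 0.5 * np.dot(u_i, u_i) / prior_var

def metropolis_ui(u_i, V, x_i, obs, n_steps=500, step=0.2, seed=0):
    """Random-walk Metropolis over U_i with V held fixed (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    cur_lp = log_post_ui(u_i, V, x_i, obs)
    samples = []
    for _ in range(n_steps):
        prop = u_i + step * rng.standard_normal(u_i.shape)
        prop_lp = log_post_ui(prop, V, x_i, obs)
        if np.log(rng.random()) < prop_lp - cur_lp:   # accept/reject on the log scale
            u_i, cur_lp = prop, prop_lp
        samples.append(u_i.copy())
    return np.array(samples)

# Usage sketch: posterior samples for one student given 6 item factor vectors.
V = np.random.default_rng(1).standard_normal((6, 2))
x_i = np.array([1., 0., 1., 1., 0., 1.])
obs = np.array([0, 1, 2, 4])                 # indices of this student's observed items
draws = metropolis_ui(np.zeros(2), V, x_i, obs)
print(draws.mean(axis=0))                    # posterior mean estimate of U_i
```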

Experimental comparison: Geometry Area 1996-1997 data

Geometry tutor: 139 items presented to 59 students

On average, each student tested on 60 items

Results: hold-out error

Embedding dimension for the *EPCA models is k = 15

credit: Ajit Singh

Extensions

Relational models

Temporal models

Relational models

STUDENTS × ITEMS:
            1 2 3 4 5 6
john        1 1 0 0 1 0
sue         0 1 1 0 0 0
tom         1 1 0 1 1 0

TAGS × ITEMS:
            1 2 3 4 5 6
trig        1 1 0 0 1 0
story       0 1 1 0 0 0
hard        1 1 0 1 1 0

Relational hierarchical Bayesian exponential-family PCA

[Diagram: shared priors R, S, T generate the student factors U, item factors V, and tag factors Z; X is generated from U and V, and Y from V and Z]

X, Y: observed data

U: student latent factors (n students × k latent factors)

V: item latent factors (m items × k latent factors)

Z: tag latent factors (p tags × k latent factors)

R, S, T: shared priors

X ≈ f(UVᵀ)    Y ≈ g(VZᵀ)
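A compact sketch of the shared-factor idea behind X ≈ f(UVᵀ) and Y ≈ g(VZᵀ): a single objective in which the item factors V are reused by both factorizations. Squared error is used only for brevity; the model above uses exponential-family likelihoods and hierarchical priors.

```python
import numpy as np

def relational_loss(U, V, Z, X, Y, alpha=0.5, reg=0.1):
    """Joint objective sharing the item factors V between two relations:
    X (students x items) ~ U V^T and Y (items x tags) ~ V Z^T."""
    err_x = np.sum((X - U @ V.T) ** 2)       # fit of the student-item relation
    err_y = np.sum((Y - V @ Z.T) ** 2)       # fit of the item-tag relation
    penalty = reg * (np.sum(U**2) + np.sum(V**2) + np.sum(Z**2))
    return alpha * err_x + (1 - alpha) * err_y + penalty
```

Alternating minimization over U, V, and Z mirrors the earlier alternating scheme; alpha controls how strongly each relation influences the shared V.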

Example: brain imaging

2000 dictionary words

60 stimulus words

500 brain voxels

X = co-occurrence of (dictionary word, stimulus word) on web

Y = activation of voxel when presented with stimulus

Task: predict X

[Chart: mean squared error for EPCA, H-EPCA, HB-EPCA, and their relational versions]

credit: Ajit Singh

Temporal models

So far: latent factors of students and content

e.g., knowledge components

for student: skill at KC

for problem: need for KC

e.g., student affect

But limited idea of evolution through time

e.g., fixed-structure models: proficiency = a + b·x, where x = # practice opportunities, a = initial skill level, b = skill learning rate

Temporal models

For evolving factors, we expect far better results if we learn about time explicitly

learning curves, gaming state, affective state, motivational state, self-efficacy, …

[Diagram: dynamic model over transactions 1, 2, 3; latent state X1, X2, X3; observed properties of each transaction Y1, Y2, Y3; instructional decisions U1, U2, U3 as inputs]
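One common instantiation of this diagram is a linear-Gaussian state-space model. The sketch below (not taken from the slides) simply simulates such a model; dimensions and noise levels are chosen arbitrarily for illustration.

```python
import numpy as np

def simulate_lds(A, B, C, u_seq, q=0.05, r=0.1, x0=None, seed=0):
    """Simulate a linear-Gaussian state-space model:
        x_{t+1} = A x_t + B u_t + process noise
        y_t     = C x_t         + observation noise
    where x_t is the latent state, u_t the input (e.g., an instructional
    decision), and y_t the observed properties of transaction t."""
    rng = np.random.default_rng(seed)
    k, m = A.shape[0], C.shape[0]
    x = np.zeros(k) if x0 is None else x0
    states, obs = [], []
    for u in u_seq:
        x = A @ x + B @ u + q * rng.standard_normal(k)
        y = C @ x + r * rng.standard_normal(m)
        states.append(x.copy())
        obs.append(y.copy())
    return np.array(states), np.array(obs)

# Toy usage: 2-D latent state, scalar input, 3 observed features per transaction.
A = np.array([[0.9, 0.1], [0.0, 0.95]])
B = np.array([[0.2], [0.1]])
C = np.random.default_rng(1).standard_normal((3, 2))
u_seq = np.ones((10, 1))                # ten transactions with the same decision
states, obs = simulate_lds(A, B, C, u_seq)
print(obs.shape)                        # (10, 3)
```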

Example: Bayesian Evaluation & Assessment

[BECK ET AL., 2008]

[Diagram: the same dynamic structure, with properties of transactions observed, latent state hidden, and instructional decisions as inputs]

The hope

Fit a temporal model

Examine learned parameters and latent states

Discover important evolving factors which affect performance

learning curve, affective state, gaming state, …

Discover how they evolve

The hope

Reduce assumptions about what the factors are

Explore a wider variety of models

Model search guided by data

⇒ discover factors we might otherwise have missed

Walking: original data

[Video]

Thanks: Byron Boots, Sajid Siddiqi

Walking: original data

Thanks: Byron Boots, Sajid Siddiqi

[Diagram: same dynamic model for the walking data; latent state X1, X2, X3; observed joint angles Y1, Y2, Y3; desired direction as input U1, U2, U3]

Walking: learned model

[Video]

Steam: original data

[Video]

Steam: original data

[Diagram: same dynamic model for the steam data; latent state X1, X2, X3; observed pixels Y1, Y2, Y3; no control input]

Steam: learned model

[Video]
