Latent Semantic Variable Models
Thomas Hofmann
Intelligent Systems Group Technical University of Darmstadt, Germany
Fraunhofer Institute for Integrated Publication & Information Systems (IPSI)
© Thomas Hofmann, Technical University of Darmstadt & Fraunhofer IPSI. Pascal Workshop, Bohinj, Slovenia, March 23, 2005
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Latent Structure
Given a matrix that “encodes” data, e.g. co-occurrence counts
Potential problems: too large, too complicated, missing entries, noisy entries, lack of structure, ...
Is there a simpler way to explain entries?
There might be a latent structure underlying the data.
How can we reveal or discover this structure?
Matrix Decomposition

Common approach: approximately factorize the matrix (approximation = left factor x right factor), where the factors are typically constrained to be "thin" (dimension reduction). The factors constitute the latent structure.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Searching & Finding
What is by far the most popular “portal” to information?
Web search, news search, search in portals, site search for organisations
Ad Hoc Retrieval
Magic: Identify relevant documents based on short, ambiguous, and incomplete query
Search a document collection to find the ones that satisfy an immediate information need
By far the most popular form of information access:
85% of American Internet users have used an online search engine to find information on the Web (Fox, 2002)
29% of Internet users rely on a search engine on a typical day (Fox, 2002)
Document-Term Matrix
Figure: an example document d_i ("Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]") is mapped via term weighting to a row of the document-term matrix X, e.g. artificial: 1, intelligence: 2, interest: 0, artifact: 0, ... Rows of X are indexed by the documents d_1, ..., d_I of the collection D, columns by the terms w_1, ..., w_J of the lexicon/vocabulary W.
A 100-Millionth of a Typical Document-Term Matrix
Typical figures: 1,000,000 documents, vocabulary of 100,000 terms, sparseness < 0.1%, fraction depicted here 1e-8.
Robust Information Retrieval — Beyond Keyword-based Search
query "labour immigrants Germany" → match
query "German job market for immigrants" → ?
query "foreign workers in Germany" → ?
query "German green card" → ?
G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, The Vocabulary Problem in Human-System Communication: an Analysis and a Solution, Bell Communications Research, 1987.

Vocabulary mismatch problem: different people use different vocabulary to describe the same concept, so matching queries and documents based on keywords alone is insufficient.
Challenges
Compactness: few search terms, rarely redundant
Variability: synonyms and semantically related terms, expressions, writing styles, etc.
Ambiguity: terms with multiple senses (polysemes), e.g. Java, jaguar, bank, head
Quality & authority: correlation between quality and relevance

Average number of search terms per query: 2.5 (Source: Spink et al.: From E-sex to E-commerce: Web search changes, IEEE Computer, March 2002)
Latent Semantic Analysis
Perform a low-rank approximation of document-term matrix (typical rank 100-300)
General idea
Map documents (and terms) to a low-dimensional representation.
Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
Compute document similarity based on the inner product in the latent semantic space
Goals
Similar terms map to similar location in low dimensional space
Noise reduction by dimension reduction
M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995.
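The mapping and similarity computation described above can be sketched in a few lines (a minimal numpy sketch on illustrative toy data, not data from the talk):

```python
import numpy as np

# Toy document-term count matrix X (rows: documents, columns: terms).
# Docs 0-1 share one pair of terms, docs 2-3 another (illustrative data).
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 3., 1.],
              [0., 0., 1., 3.]])

# Map documents to a k-dimensional latent space via truncated SVD
# (typical LSA ranks are 100-300; k=2 suffices for the toy example).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * s[:k]        # LSA document vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents sharing terms end up close; unrelated ones stay far apart.
assert cos(doc_vecs[0], doc_vecs[1]) > cos(doc_vecs[0], doc_vecs[2])
```

With k far below the vocabulary size, documents sharing no literal terms can still end up close if their terms co-occur elsewhere in the collection, which is exactly the noise-reduction goal above.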
Singular Value Decomposition

For an arbitrary matrix A there exists a factorization (Singular Value Decomposition = SVD)

A = U Σ V^T

where (i) U has orthonormal columns, (ii) V has orthonormal columns, (iii) Σ = diag(σ_1, ..., σ_r) with ordered singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, and (iv) r = rank(A).
Low-rank Approximation

SVD can be used to compute optimal low-rank approximations.

Approximation problem: find the rank-k matrix A_k minimizing the Frobenius-norm error, A_k = argmin_{B: rank(B)=k} ||A - B||_F.

Solution via SVD: in column notation, A = Σ_{l=1}^{r} σ_l u_l v_l^T (a sum of rank-1 matrices); setting the small singular values to zero yields A_k = Σ_{l=1}^{k} σ_l u_l v_l^T.
C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
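The Eckart-Young result can be checked numerically; a small sketch (arbitrary random matrix, shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Best rank-k approximation: keep the k largest singular values, zero the rest.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error equals the square root of the sum of
# the discarded squared singular values.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```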
LSA Decomposition

The LSA decomposition via SVD can be summarized as X ≈ X_k = U_k Σ_k V_k^T (documents × terms orientation), where the rows of U_k Σ_k are the LSA document vectors and the rows of V_k the LSA term vectors.

Document similarity: inner products <x, x'> between LSA document vectors in the latent space.

Folding-in queries: a query q (a term vector) is mapped into the latent space as q̂ = q^T V_k Σ_k^{-1} and then compared to the document vectors.
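Folding a query into the latent space can be sketched as follows (toy matrix; the Σ_k^{-1} scaling is one common convention):

```python
import numpy as np

# Document-term matrix (rows: documents, columns: terms); toy data.
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 3., 1.],
              [0., 0., 1., 3.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # Vk: terms x k

doc_vecs = Uk * sk                           # LSA document vectors

# Folding-in: map a new query (term-count vector) into the latent space
# via q_hat = q V_k inv(Sigma_k), then compare it to the document vectors.
q = np.array([1., 1., 0., 0.])               # query using the first two terms
q_hat = (q @ Vk) / sk

sims = doc_vecs @ q_hat
# The query shares terms with documents 0 and 1 only.
assert sims[0] > sims[2] and sims[1] > sims[3]
```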
Latent Semantic Analysis
Latent semantic space: an illustrative example (courtesy of Susan Dumais, © Bellcore)
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Search as Statistical Inference
Document in bag-of-words representation
Figure: a document on China/US trade relations in bag-of-words form, with terms such as US, Disney, economic, intellectual property, relations, Beijing, human rights, free, negotiations, imports, China.
Given the context, how probable is it that terms like “China“ or “trade“ might occur?
Document expansion: Additional index terms can be added automatically via statistical inference
M. E. Maron and J. L. Kuhns, On Relevance, Probabilistic Indexing and Information Retrieval, Journal of the ACM, 7:216-244, 1960.
Language Model Paradigm in IR
Probabilistic relevance model with random variables d (document), q (query), and R (relevance).

Bayes' rule: P(d | q, R) ∝ P(q | d, R) · P(d | R), where

- P(d | R) is the prior probability of relevance for document d (e.g. quality, popularity),
- P(q | d, R) is the probability of generating a query q to ask for a relevant d,
- P(d | q, R) is the probability that document d is relevant for query q.
J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.
Language Model Paradigm
First contribution: prior probability of relevance

- simplest case: uniform (drops out for ranking)
- popularity: document usage statistics (e.g. library circulation records, download or access statistics, hyperlink structure)

Second contribution: query likelihood

- query terms q are treated as a sample drawn from an (unknown) relevant document
Language Model Paradigm
Query generation model: what might a query look like that asks for a specific document?
Maron & Kuhns: Indexer manually assigns probabilities for pre-specified set of tags/terms
Ponte & Croft: Statistical estimation problem
Think of a relevant document. Formulate a query by picking some of the keywords as query terms.
Environmentalists are blasting a Bush administration proposal to lift a ban on logging in remote areas of national forests, saying the move ignores popular support for protecting forests.
Naive Approach
Maximum likelihood estimation over documents and terms:

P(w | d) = n(d, w) / Σ_{w'} n(d, w')

where n(d, w) is the number of occurrences of term w in document d.

Zero-frequency problem: terms not occurring in a document get zero probability.
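The zero-frequency problem can be seen directly in a tiny maximum likelihood estimate (illustrative counts):

```python
import numpy as np

# Toy counts n(d, w) for one document over a 4-term vocabulary.
counts = np.array([3., 1., 0., 0.])

# Maximum likelihood estimate: P(w|d) = n(d,w) / sum_w' n(d,w').
p_ml = counts / counts.sum()
assert p_ml[2] == 0.0    # zero-frequency problem: unseen terms get probability 0

# Consequence: any query containing an unseen term gets zero likelihood.
query = [0, 2]           # term indices of the query
assert np.prod(p_ml[query]) == 0.0
```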
Estimation Problem
Crucial question: In which way can the document collection be utilized to improve probability estimates?
Figure: each document is treated as an (i.i.d.) sample for estimating its own language model; can the other documents in the collection be used to improve the estimates?
Probabilistic Latent Semantic Analysis
Figure: latent concepts (e.g. TRADE) mediate between documents and terms such as economic, imports, trade.

Concept expression probabilities are estimated based on all documents that deal with a concept. "Unmixing" of superimposed concepts is achieved by a statistical learning algorithm.

Conclusion: no prior knowledge about concepts is required; context and term co-occurrences are exploited.
pLSA – Latent Variable Model

Structural modeling assumption (mixture model):

P(w | d) = Σ_z P(w | z) P(z | d)

- P(w | d): document language model
- z: latent concepts or topics
- P(w | z): concept expression probabilities
- P(z | d): document-specific mixture proportions

Model fitting: by maximizing the likelihood of the observed term occurrences.
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
pLSA: Matrix Decomposition

The mixture model can be written as a matrix factorization. The equivalent symmetric (joint) model is

P(d, w) = Σ_z P(z) P(d | z) P(w | z),

i.e. P = U Σ V^T with pLSA document probabilities U (U_{dz} = P(d | z)), concept probabilities Σ = diag(P(z)), and pLSA term probabilities V (V_{wz} = P(w | z)).

Contrast to LSA/SVD: non-negativity and normalization constraints (an intimate relation to non-negative matrix factorization).
D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 13, pp. 556-562, 2001.
pLSA: Graphical Model

Figure (plate notation): for a single document d in the collection, each of its N word occurrences w_n(d) is generated by first drawing a concept z from P(z | d), which is shared by all words in the document, and then drawing the word from P(w | z), which is shared by all documents in the collection.
pLSA via Likelihood Maximization

Log-likelihood:

L = Σ_d Σ_w n(d, w) log Σ_z P(w | z) P(z | d)

with observed word frequencies n(d, w) and the predictive probability Σ_z P(w | z) P(z | d) of the pLSA mixture model.

Goal: find the model parameters argmax L, i.e. maximize the average predictive probability for observed word occurrences (a non-convex optimization problem).
Expectation Maximization Algorithm

E-step: posterior probability of the latent variables ("concepts"), i.e. the probability that the occurrence of term w in document d can be "explained" by concept z:

P(z | d, w) = P(w | z) P(z | d) / Σ_{z'} P(w | z') P(z' | d)

M-step: parameter estimation based on the "completed" statistics:

P(w | z) ∝ Σ_d n(d, w) P(z | d, w)   (how often is term w associated with concept z?)
P(z | d) ∝ Σ_w n(d, w) P(z | d, w)   (how often is document d associated with concept z?)
A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of Royal Statistical Society B, vol. 39, no. 1, pp. 1-38, 1977
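The E- and M-steps above can be sketched as a compact numpy implementation (a plain EM sketch on toy data; the original system additionally used tempering and sparse matrices):

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    """Fit pLSA by EM on a document-term count matrix N (docs x terms).

    Returns P(z|d) as a (D, K) array and P(w|z) as a (K, W) array.
    """
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]      # (D, K, W)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate from the "completed" statistics n(d,w) P(z|d,w)
        nz = N[:, None, :] * post                          # (D, K, W)
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# Two clearly separated "topics" in a toy corpus:
N = np.array([[5., 4., 0., 0.], [4., 5., 0., 0.],
              [0., 0., 5., 4.], [0., 0., 4., 5.]])
p_z_d, p_w_z = plsa_em(N, K=2)
# Documents 0 and 1 should share a dominant concept; 2 and 3 the other one.
z0 = p_z_d[0].argmax()
assert z0 == p_z_d[1].argmax() and z0 != p_z_d[2].argmax()
```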
EM Algorithm: Derivation
Q-parameterized lower bound on the log-likelihood (follows from Jensen's inequality), with observed (d, w) pairs indexed by r and an arbitrary distribution Q over concepts:

L = Σ_r log Σ_z P(w_r | z) P(z | d_r) ≥ Σ_r Σ_z Q(z | r) log [ P(w_r | z) P(z | d_r) / Q(z | r) ]
Tempered EM Algorithm
Heuristic approach to avoid overfitting: artificially increase the entropy of the E-step posteriors by exponentiating them with an inverse temperature β < 1.
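The tempering idea can be sketched as a modified E-step (a simplified sketch: here β exponentiates the whole posterior numerator, whereas the original formulation exponentiates only the likelihood part; β = 1 recovers standard EM):

```python
import numpy as np

def tempered_posterior(p_w_z, p_z_d, beta):
    """Tempered E-step sketch: P_beta(z|d,w) prop. to [P(w|z) P(z|d)]**beta.

    beta < 1 flattens (raises the entropy of) the posteriors, acting as a
    heuristic regularizer against overfitting.
    """
    post = (p_z_d[:, :, None] * p_w_z[None, :, :]) ** beta
    return post / post.sum(1, keepdims=True)

# Flattening check on small, valid probability tables (illustrative values):
p_w_z = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(w|z): 2 concepts x 2 terms
p_z_d = np.array([[0.9, 0.1]])               # P(z|d): 1 document
sharp = tempered_posterior(p_w_z, p_z_d, beta=1.0)
flat = tempered_posterior(p_w_z, p_z_d, beta=0.5)
assert flat.max() < sharp.max()              # tempered posteriors are less peaked
```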
Example (1)
Concepts (3 of 100) extracted from AP news
Concept 1: securities 94.96324, firm 88.74591, drexel 78.33697, investment 75.51504, bonds 64.23486, sec 61.89292, bond 61.39895, junk 61.14784, milken 58.72266, firms 51.26381, investors 48.80564, lynch 44.91865, insider 44.88536, shearson 43.82692, boesky 43.74837, lambert 40.77679, merrill 40.14225, brokerage 39.66526, corporate 37.94985, burnham 36.86570

Concept 2: ship 109.41212, coast 93.70902, guard 82.11109, sea 77.45868, boat 75.97172, fishing 65.41328, vessel 64.25243, tanker 62.55056, spill 60.21822, exxon 58.35260, boats 54.92072, waters 53.55938, valdez 51.53405, alaska 48.63269, ships 46.95736, port 46.56804, hazelwood 44.81608, vessels 43.80310, ferry 42.79100, fishermen 41.65175

Concept 3: india 91.74842, singh 50.34063, militants 49.21986, gandhi 48.86809, sikh 47.12099, indian 44.29306, peru 43.00298, hindu 42.79652, lima 41.87559, kashmir 40.01138, tamilnadu 39.54702, killed 39.47202, india's 39.25983, punjab 39.22486, delhi 38.70990, temple 38.38197, shining 37.62768, menem 35.42235, hindus 34.88001, violence 33.87917
Example (2)
Concepts (10 of 128) extracted from Science Magazine articles (12K)
Experimental Evaluation
Figure: average precision and relative gain in average precision of VSM, LSA, and pLSA on the Medline, CRAN, CACM, CISI, and TREC collections.
Summary of quantitative evaluation
Consistent improvements of retrieval accuracy
Relative improvements of average precision 15-45%
On TREC3: 18% improvement compared to the SMART baseline
Live Implementation
Latent Dirichlet Allocation

Joint probabilistic model: the document-specific mixture proportions θ are parameters in pLSA but a random variable with a Dirichlet density p(θ | α) in LDA.

Marginal distribution of a document (with z_r the concept associated with the r-th term and w_r the r-th term in the document):

p(w_1, ..., w_N | α, β) = ∫ p(θ | α) Π_r Σ_{z_r} p(z_r | θ) p(w_r | z_r, β) dθ

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, 993-1022, 2003.
Literature & Related Work
Language Model IR:
- J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.
- A. Berger and J. Lafferty, Information Retrieval as Statistical Translation, ACM SIGIR, 1999.

pLSA / pLSI:
- T. Hofmann, Probabilistic Latent Semantic Indexing, ACM SIGIR, 1999.
- T. Hofmann, Probabilistic Latent Semantic Analysis, Uncertainty in AI, 1999.
- T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis, Machine Learning Journal, 2000.
- T. Hofmann, J. Puzicha, P. Tufts, Finding Information: Intelligent Retrieval & Categorization, Whitepaper, Recommind, December 2002.

Latent Dirichlet Allocation:
- D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, 993-1022, 2003.
- W. Buntine, Variational Extensions to EM and Multinomial PCA, ECML, 2003.
- T. Minka and J. Lafferty, Expectation-propagation for the generative aspect model, UAI, 2002.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Concept-based Text Categorization
Text categorization algorithms: Perceptron, naive Bayes, k-nearest neighbor, SVM, AdaBoost, etc.
Term-based document representation: vulnerable to linguistic variations
Explore the use of concept-based document representation in text categorization
Concepts are extracted automatically using pLSA (potentially using unlabeled documents)
Learning algorithms: AdaBoost.MH and .MR
R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Terms & Concepts as Features
Two types of weak hypotheses
Indicator function for term features
Indicator function for concept features
Standard features used in TextBooster (or SVMs)
Features automatically extracted by pLSA
L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, ACM SIGIR 2003.
Improvements on Reuters-21578
Reuters 1987 newswire: 21,578 documents & 37,926 word types
Train pLSA model with all data, use ModApte split for categorization: 9,603 training documents, 3,299 test documents, 8,676 unused
Improvements on OHSUMED87
OHSUMED87 collection consists of Medline documents from the year 1987: 54,708 instances and 67,096 words of which 19,066 word types remain after pruning
Train pLSA model with all the remaining data. Randomly split the collection into 10 folds. 9 of them are used for training and the remaining one for testing.
Evaluation on top 50 categories
Literature & Related Work
Fisher kernel based on pLSA (Fisher scores):
- T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems (NIPS), volume 11, 1999.
- T. Hofmann, Learning the similarity of documents. In Advances in Neural Information Processing Systems (NIPS), volume 12, MIT Press, 2000.

Semantic kernels based on LSA:
- N. Cristianini, J. Shawe-Taylor, and H. Lodhi, Latent semantic kernels. In Proceedings 18th International Conference on Machine Learning, pp. 66-73, Morgan Kaufmann Publishers, 2001.

Semantic kernel engineering:
- J. Kandola, N. Cristianini, and J. Shawe-Taylor, Learning semantic similarity. In Advances in Neural Information Processing Systems (NIPS), volume 15, 2003.

Boosting with concept features:
- L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, ACM SIGIR, 2003.
- R. E. Schapire and Y. Singer, Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Probabilistic HITS
Probabilistic model of link structure:

- probabilistic graph model, i.e. a predictive model for additional links/nodes based on existing ones
- centered around the notion of "Web communities"
- a probabilistic version of HITS
- enables predicting the existence of hyperlinks: estimate the entropy of the Web graph
- can be combined with content: text at every node ...
Finding Latent Web Communities
Web community: a densely connected bipartite subgraph of source nodes s and target nodes t.

Probabilistic model pHITS (cf. the pLSA model, with documents and terms replaced by identically treated source and target nodes):

P(s, t) = Σ_z P(z) P(s | z) P(t | z)

- P(s | z): probability that a random out-link from s is part of community z
- P(t | z): probability that a random in-link of t is part of community z
Decomposing the Web Graph
Figure: decomposing a Web subgraph into Communities 1, 2, and 3.

- Links (probabilistically) belong to exactly one community.
- Nodes may belong to multiple communities.
Linking Hyperlinks and Content
pLSA and pHITS (probabilistic HITS) can be combined into one joint decomposition model: a shared latent variable z (concept/topic and Web community at once) generates, for a source document s with mixing proportions P(z | s), both words w via P(w | z) and link targets t via P(t | z).
D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), volume 13, 2001.
Example: Ulysses
Decomposition of a base set generated from Altavista with query “Ulysses” (combined decomposition based on links and text)
Community 1 (space mission): ulysses 0.022082, space 0.015334, page 0.013885, home 0.011904, nasa 0.008915, science 0.007417, solar 0.007143, esa 0.006757, mission 0.006090; top documents: ulysses.jpl.nasa.gov/ 0.028583, helio.estec.esa.nl/ulysses 0.026384, www.sp.ph.ic.ak.uk/Ulysses 0.026384

Community 2 (Ulysses S. Grant): grant 0.019197, s 0.017092, ulysses 0.013781, online 0.006809, war 0.006619, school 0.005966, poetry 0.005762, president 0.005259, civil 0.005065; top documents: www.lib.siu.edu/projects/usgrant/ 0.019358, www.whitehouse.gov/WH/glimpse/presidents/ug18.html 0.017598, saints.css.edu/mkelsey/gppg.html 0.015838

Community 3 (Joyce's Ulysses): page 0.020032, ulysses 0.013361, new 0.010455, web 0.009060, site 0.009009, joyce 0.008430, net 0.007799, teachers 0.007236, information 0.007170; top documents: http://www.purchase.edu/Joyce/Ulysses.htm 0.008469, http://www.bibliomania.com/Fiction/joyce/ulysses/index.html 0.007274, http://teachers.net/chatroom/ 0.005082
Literature & Related Work
Hypertext-induced Topic Search (HITS):
- J. Kleinberg, Authoritative sources in a hyperlinked environment. In Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
- K. Bharat and M. R. Henzinger, Improved algorithms for topic distillation in hyperlinked environments. ACM SIGIR, 1998.

Probabilistic HITS:
- D. Cohn and H. Chang, Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.
- D. Cohn and T. Hofmann, The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), volume 13, 2001.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Predictions & Recommendations
Goal: Predict future user behavior, ratings or preferences based on past behavior
Figure: given a typical user's profile of past item ratings, predict the unknown rating (?) of a further item.
Collaborative Filtering: Data

Users rate items (e.g. multimedia documents); a database stores the user profiles:

UserID | ItemID | Rating
10002  | 451    | 3
10221  | 647    | 4
10245  | 647    | 2
12344  | 801    | 5
...    | ...    | ...
pLSA for Collaborative Filtering
Application of pLSA to collaborative filtering: a rating variable needs to be added.

Figure: two graphical-model variants, a community variant (user u determines a latent community z, which together with the item y determines the rating v) and a categorized variant (the latent variable z mediates between user u and item v).
Gaussian & Multinomial pLSA
Key question: how to parameterize the conditional distribution of ratings v for item y in community z.

Gaussian pLSA: P(v | y, z) = N(v; μ_{y,z}, σ_{y,z}), with μ_{y,z} the mean rating of community z on item y and σ_{y,z} the standard deviation of the rating of community z on item y.

Multinomial pLSA: P(v | y, z) is a discrete distribution over the rating values.
Interests Group, Each Movie
Dis-Interests Group, Each Movie
Gaussian Probabilistic LSA
Illustration: characteristics of a user community.
Normalized Gaussian pLSA
Improvement: take into account that different users may differ in their mean ratings and in the dynamic range of their ratings.

Normalized Gaussian pLSA: ratings are standardized per user before fitting the Gaussian mixture.
EM Algorithm for Gaussian pLSA

Model parameters can be fitted via the Expectation Maximization (EM) algorithm, alternating two steps:

- E-step: compute the posterior P(z | u, y, v) ∝ P(z | u) N(v; μ_{y,z}, σ_{y,z})
- M-step: re-estimate P(z | u), μ_{y,z}, and σ_{y,z} from the posterior-weighted observations
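A compact sketch of one such EM iteration on (user, item, rating) triples (illustrative code with names of my own choosing; the σ update is omitted for brevity):

```python
import numpy as np

def gauss(v, mu, sigma):
    # Gaussian density N(v; mu, sigma), evaluated elementwise
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_step(triples, p_z_u, mu, sigma):
    """One EM iteration for Gaussian pLSA.

    triples: list of (user, item, rating); p_z_u: (U, K) memberships P(z|u);
    mu, sigma: (Y, K) per-item/community rating mean and standard deviation.
    """
    # E-step: P(z|u,y,v) proportional to P(z|u) N(v; mu[y,z], sigma[y,z])
    post = np.array([p_z_u[u] * gauss(v, mu[y], sigma[y]) for u, y, v in triples])
    post /= post.sum(1, keepdims=True)
    # M-step: posterior-weighted re-estimation of P(z|u) and mu
    new_p = np.zeros_like(p_z_u)
    num, den = np.zeros_like(mu), np.zeros_like(mu)
    for (u, y, v), q in zip(triples, post):
        new_p[u] += q
        num[y] += q * v
        den[y] += q
    new_p /= new_p.sum(1, keepdims=True)
    new_mu = np.where(den > 0, num / np.maximum(den, 1e-12), mu)
    return new_p, new_mu, post

triples = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 1.0)]
p_z_u = np.full((2, 2), 0.5)                      # 2 users, 2 communities
mu = np.array([[4.5, 1.5], [4.0, 2.0]])           # 2 items
sigma = np.ones((2, 2))
new_p, new_mu, post = em_step(triples, p_z_u, mu, sigma)
assert new_p[0, 0] > new_p[0, 1]   # user 0's high ratings fit community 0
```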
Experimental Setup
Data set: EachMovie data (courtesy of DEC): 1,623 movies, 61,265 users, >2.1M ratings on a 0-5 scale.

Evaluation protocol:
- leave-one-out prediction error
- for each user with at least M ≥ 2 available ratings, one rating is used for testing and M-1 for training
- different randomizations, averaged over 20 runs to establish statistical significance
Recommendation Accuracy
Prediction Accuracy
Prediction Accuracy (2)
How is prediction accuracy (MAE) affected by the (minimal) number of available ratings M?

M        | 2     | 5     | 10    | 20    | 50    | 100   | 200
Pearson  | 0.949 | 0.949 | 0.935 | 0.938 | 0.942 | 0.945 | 0.937
GpLSA    | 0.895 | 0.885 | 0.880 | 0.870 | 0.864 | 0.857 | 0.848
Prediction Accuracy (3)
How does prediction accuracy depend on the number of communities k in the Gaussian pLSA model?

k    | 5    | 10   | 20   | 30   | 40   | 50   | 60   | 75   | 100  | 150  | 200
RMS  | 1.21 | 1.20 | 1.18 | 1.17 | 1.17 | 1.17 | 1.18 | 1.18 | 1.18 | 1.19 | 1.20
MAE  | 0.93 | 0.92 | 0.91 | 0.90 | 0.90 | 0.90 | 0.91 | 0.91 | 0.92 | 0.93 | 0.93
Other Aspects
Off-line complexity:
- one EM iteration for Gaussian pLSA with k=40 takes ca. 30 secs
- approximately 40-120 iterations are required
- total training time: less than one hour

On-line complexity:
- computing a prediction is O(k) (k: number of communities)
- does not depend on the number of users/items: constant-time prediction

Model-based approach:
- once a pLSA model has been trained, predictions can be made without access to the original user profiles
- benefits for privacy & data security
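The O(k) prediction itself is just a k-dimensional dot product (a sketch with hypothetical stored model arrays of my own naming):

```python
import numpy as np

def predict(p_z_u, mu, u, y):
    # Predicted rating: v_hat = sum_z P(z|u) * mu[y, z]; O(k) regardless of
    # the number of users or items, and no raw user profiles are needed.
    return float(p_z_u[u] @ mu[y])

p_z_u = np.array([[0.8, 0.2]])   # stored community mixture of one user (k=2)
mu = np.array([[5.0, 1.0]])      # stored community rating means for one item
assert np.isclose(predict(p_z_u, mu, 0, 0), 4.2)
```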
Literature & Related Work
Collaborative filtering with latent variable models
T. Hofmann and J. Puzicha, Latent Class Models for Collaborative Filtering, IJCAI 1999.
T. Hofmann, “What People (Don’t) Want”, ECML 2001.
T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM TOIS 2004.
J.S. Breese, D. Heckerman, C. Kardie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, UAI, 1998
L. Ungar, D. Foster, Clustering Methods for Collaborative Filtering, AAAI Workshop on Recommendation Systems, 1998.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Conclusion
Latent semantic variable models: simple idea, simple algorithm.

Conceptual improvements (cf. Mike Titterington's and Wray Buntine's talks):
- more sophisticated Bayesian versions (e.g. LDA)
- various extensions (rating variable, correspondence problems, etc.) and relations to other approaches

Wide range of applications:
- information retrieval
- text categorization
- collaborative filtering
- Web modeling/retrieval
- (language modeling, multimedia retrieval)

Thanks to: David Cohn (Google), Jan Puzicha (Recommind), Lijuan Cai (Brown University)