Latent Semantic Variable Models
Thomas Hofmann
Intelligent Systems Group Technical University of Darmstadt, Germany
Fraunhofer Institute for Integrated Publication & Information Systems (IPSI)
© Thomas Hofmann, Technical University of Darmstadt & Fraunhofer IPSI. Pascal Workshop, Bohinj, Slovenia, March 23, 2005
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Latent Structure
Given a matrix that “encodes” data, e.g. co-occurrence counts
Potential problems: too large, too complicated, missing entries, noisy entries, lack of structure, ...
Is there a simpler way to explain entries?
There might be a latent structure underlying the data.
How can we reveal or discover this structure?
Matrix Decomposition

Common approach: approximately factorize the matrix (approximation = left factor x right factor), where the factors are typically constrained to be "thin" (dimension reduction). The factors constitute the latent structure.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Searching & Finding
What is by far the most popular “portal” to information?
Web search, news search, search in portals, site search for organisations
Ad Hoc Retrieval
Magic: Identify relevant documents based on short, ambiguous, and incomplete query
Search a document collection to find the ones that satisfy an immediate information need
By far the most popular form of information access:
85% of American Internet users have used an online search engine to find information on the Web (Fox, 2002)
29% of Internet users rely on a search engine on a typical day (Fox, 2002)
Document-Term Matrix
Figure: an example document d_i ("Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]") is mapped via term weighting to a row of the document-term matrix X, e.g. artificial: 1, intelligence: 2, interest: 0, artifact: 0, ... Rows of X are indexed by the documents d_1, ..., d_I of the collection D, columns by the terms w_1, ..., w_J of the lexicon/vocabulary W.
A 100-Millionth of a Typical Document-Term Matrix
Typical figures: 1,000,000 documents, vocabulary of 100,000 terms, sparseness < 0.1%, fraction depicted here 1e-8.
Robust Information Retrieval — Beyond Keyword-based Search
query "labour immigrants Germany" → match
query "German job market for immigrants" → ?
query "foreign workers in Germany" → ?
query "German green card" → ?
G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, The Vocabulary Problem in Human-System Communication: an Analysis and a Solution, Bell Communications Research, 1987.

Vocabulary mismatch problem: different people use different vocabulary to describe the same concept, so matching queries and documents based on keywords alone is insufficient.
Challenges
Compactness: few search terms, rarely redundant
Variability: synonyms and semantically related terms, expressions, writing styles, etc.
Ambiguity: terms with multiple senses (polysemes), e.g. Java, jaguar, bank, head
Quality & authority: correlation between quality and relevance

Average number of search terms per query: 2.5 (Source: Spink et al.: From E-sex to E-commerce: Web search changes, IEEE Computer, March 2002)
Latent Semantic Analysis
Perform a low-rank approximation of document-term matrix (typical rank 100-300)
General idea
Map documents (and terms) to a low-dimensional representation.
Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
Compute document similarity based on the inner product in the latent semantic space
Goals
Similar terms map to similar location in low dimensional space
Noise reduction by dimension reduction
M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995.
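The mapping and similarity computation described above can be sketched in a few lines (a minimal numpy sketch on illustrative toy data, not data from the talk):

```python
import numpy as np

# Toy document-term count matrix X (rows: documents, columns: terms).
# Docs 0-1 share one pair of terms, docs 2-3 another (illustrative data).
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 3., 1.],
              [0., 0., 1., 3.]])

# Map documents to a k-dimensional latent space via truncated SVD
# (typical LSA ranks are 100-300; k=2 suffices for the toy example).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * s[:k]        # LSA document vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents sharing terms end up close; unrelated ones stay far apart.
assert cos(doc_vecs[0], doc_vecs[1]) > cos(doc_vecs[0], doc_vecs[2])
```

With k far below the vocabulary size, documents sharing no literal terms can still end up close if their terms co-occur elsewhere in the collection, which is exactly the noise-reduction goal above.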
Singular Value Decomposition

For an arbitrary matrix A there exists a factorization (Singular Value Decomposition = SVD)

A = U Σ V^T

where (i) U has orthonormal columns, (ii) V has orthonormal columns, (iii) Σ = diag(σ_1, ..., σ_r) with ordered singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, and (iv) r = rank(A).
Low-rank Approximation

SVD can be used to compute optimal low-rank approximations.

Approximation problem: find the rank-k matrix A_k minimizing the Frobenius-norm error, A_k = argmin_{B: rank(B)=k} ||A - B||_F.

Solution via SVD: in column notation, A = Σ_{l=1}^{r} σ_l u_l v_l^T (a sum of rank-1 matrices); setting the small singular values to zero yields A_k = Σ_{l=1}^{k} σ_l u_l v_l^T.
C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
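The Eckart-Young result can be checked numerically; a small sketch (arbitrary random matrix, shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Best rank-k approximation: keep the k largest singular values, zero the rest.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error equals the square root of the sum of
# the discarded squared singular values.
err = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```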
LSA Decomposition

The LSA decomposition via SVD can be summarized as X ≈ X_k = U_k Σ_k V_k^T (documents × terms orientation), where the rows of U_k Σ_k are the LSA document vectors and the rows of V_k the LSA term vectors.

Document similarity: inner products <x, x'> between LSA document vectors in the latent space.

Folding-in queries: a query q (a term vector) is mapped into the latent space as q̂ = q^T V_k Σ_k^{-1} and then compared to the document vectors.
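Folding a query into the latent space can be sketched as follows (toy matrix; the Σ_k^{-1} scaling is one common convention):

```python
import numpy as np

# Document-term matrix (rows: documents, columns: terms); toy data.
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 3., 1.],
              [0., 0., 1., 3.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # Vk: terms x k

doc_vecs = Uk * sk                           # LSA document vectors

# Folding-in: map a new query (term-count vector) into the latent space
# via q_hat = q V_k inv(Sigma_k), then compare it to the document vectors.
q = np.array([1., 1., 0., 0.])               # query using the first two terms
q_hat = (q @ Vk) / sk

sims = doc_vecs @ q_hat
# The query shares terms with documents 0 and 1 only.
assert sims[0] > sims[2] and sims[1] > sims[3]
```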
Latent Semantic Analysis
Latent semantic space: an illustrative example (courtesy of Susan Dumais, © Bellcore)
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Search as Statistical Inference
Document in bag-of-words representation
Figure: a document on China/US trade relations in bag-of-words form, with terms such as US, Disney, economic, intellectual property, relations, Beijing, human rights, free, negotiations, imports, China.
Given the context, how probable is it that terms like “China“ or “trade“ might occur?
Document expansion: Additional index terms can be added automatically via statistical inference
M. E. Maron and J. L. Kuhns, On Relevance, Probabilistic Indexing and Information Retrieval, Journal of the ACM, 7:216-244, 1960.
Language Model Paradigm in IR
Probabilistic relevance model with random variables d (document), q (query), and R (relevance).

Bayes' rule: P(d | q, R) ∝ P(q | d, R) · P(d | R), where

- P(d | R) is the prior probability of relevance for document d (e.g. quality, popularity),
- P(q | d, R) is the probability of generating a query q to ask for a relevant d,
- P(d | q, R) is the probability that document d is relevant for query q.
J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.
Language Model Paradigm
First contribution: prior probability of relevance

- simplest case: uniform (drops out for ranking)
- popularity: document usage statistics (e.g. library circulation records, download or access statistics, hyperlink structure)

Second contribution: query likelihood

- query terms q are treated as a sample drawn from an (unknown) relevant document
Language Model Paradigm
Query generation model: what might a query look like that asks for a specific document?
Maron & Kuhns: Indexer manually assigns probabilities for pre-specified set of tags/terms
Ponte & Croft: Statistical estimation problem
Think of a relevant document. Formulate a query by picking some of the keywords as query terms.
Environmentalists are blasting a Bush administration proposal to lift a ban on logging in remote areas of national forests, saying the move ignores popular support for protecting forests.
Naive Approach
Maximum likelihood estimation over documents and terms:

P(w | d) = n(d, w) / Σ_{w'} n(d, w')

where n(d, w) is the number of occurrences of term w in document d.

Zero-frequency problem: terms not occurring in a document get zero probability.
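The zero-frequency problem can be seen directly in a tiny maximum likelihood estimate (illustrative counts):

```python
import numpy as np

# Toy counts n(d, w) for one document over a 4-term vocabulary.
counts = np.array([3., 1., 0., 0.])

# Maximum likelihood estimate: P(w|d) = n(d,w) / sum_w' n(d,w').
p_ml = counts / counts.sum()
assert p_ml[2] == 0.0    # zero-frequency problem: unseen terms get probability 0

# Consequence: any query containing an unseen term gets zero likelihood.
query = [0, 2]           # term indices of the query
assert np.prod(p_ml[query]) == 0.0
```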
Estimation Problem
Crucial question: In which way can the document collection be utilized to improve probability estimates?
Figure: each document is treated as an (i.i.d.) sample for estimating its own language model; can the other documents in the collection be used to improve the estimates?
Probabilistic Latent Semantic Analysis
Figure: latent concepts (e.g. TRADE) mediate between documents and terms such as economic, imports, trade.

Concept expression probabilities are estimated based on all documents that deal with a concept. "Unmixing" of superimposed concepts is achieved by a statistical learning algorithm.

Conclusion: no prior knowledge about concepts is required; context and term co-occurrences are exploited.
pLSA – Latent Variable Model

Structural modeling assumption (mixture model):

P(w | d) = Σ_z P(w | z) P(z | d)

- P(w | d): document language model
- z: latent concepts or topics
- P(w | z): concept expression probabilities
- P(z | d): document-specific mixture proportions

Model fitting: by maximizing the likelihood of the observed term occurrences.
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
pLSA: Matrix Decomposition

The mixture model can be written as a matrix factorization. The equivalent symmetric (joint) model is

P(d, w) = Σ_z P(z) P(d | z) P(w | z),

i.e. P = U Σ V^T with pLSA document probabilities U (U_{dz} = P(d | z)), concept probabilities Σ = diag(P(z)), and pLSA term probabilities V (V_{wz} = P(w | z)).

Contrast to LSA/SVD: non-negativity and normalization constraints (an intimate relation to non-negative matrix factorization).
D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 13, pp. 556-562, 2001.
pLSA: Graphical Model

Figure (plate notation): for a single document d in the collection, each of its N word occurrences w_n(d) is generated by first drawing a concept z from P(z | d), which is shared by all words in the document, and then drawing the word from P(w | z), which is shared by all documents in the collection.
pLSA via Likelihood Maximization

Log-likelihood:

L = Σ_d Σ_w n(d, w) log Σ_z P(w | z) P(z | d)

with observed word frequencies n(d, w) and the predictive probability Σ_z P(w | z) P(z | d) of the pLSA mixture model.

Goal: find the model parameters argmax L, i.e. maximize the average predictive probability for observed word occurrences (a non-convex optimization problem).
Expectation Maximization Algorithm

E-step: posterior probability of the latent variables ("concepts"), i.e. the probability that the occurrence of term w in document d can be "explained" by concept z:

P(z | d, w) = P(w | z) P(z | d) / Σ_{z'} P(w | z') P(z' | d)

M-step: parameter estimation based on the "completed" statistics:

P(w | z) ∝ Σ_d n(d, w) P(z | d, w)   (how often is term w associated with concept z?)
P(z | d) ∝ Σ_w n(d, w) P(z | d, w)   (how often is document d associated with concept z?)
A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of Royal Statistical Society B, vol. 39, no. 1, pp. 1-38, 1977
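The E- and M-steps above can be sketched as a compact numpy implementation (a plain EM sketch on toy data; the original system additionally used tempering and sparse matrices):

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    """Fit pLSA by EM on a document-term count matrix N (docs x terms).

    Returns P(z|d) as a (D, K) array and P(w|z) as a (K, W) array.
    """
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]      # (D, K, W)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate from the "completed" statistics n(d,w) P(z|d,w)
        nz = N[:, None, :] * post                          # (D, K, W)
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# Two clearly separated "topics" in a toy corpus:
N = np.array([[5., 4., 0., 0.], [4., 5., 0., 0.],
              [0., 0., 5., 4.], [0., 0., 4., 5.]])
p_z_d, p_w_z = plsa_em(N, K=2)
# Documents 0 and 1 should share a dominant concept; 2 and 3 the other one.
z0 = p_z_d[0].argmax()
assert z0 == p_z_d[1].argmax() and z0 != p_z_d[2].argmax()
```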
EM Algorithm: Derivation
Q-parameterized lower bound on the log-likelihood (follows from Jensen's inequality), with observed (d, w) pairs indexed by r and an arbitrary distribution Q over concepts:

L = Σ_r log Σ_z P(w_r | z) P(z | d_r) ≥ Σ_r Σ_z Q(z | r) log [ P(w_r | z) P(z | d_r) / Q(z | r) ]
Tempered EM Algorithm
Heuristic approach to avoid overfitting: artificially increase the entropy of the E-step posteriors by exponentiating them with an inverse temperature β < 1.
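The tempering idea can be sketched as a modified E-step (a simplified sketch: here β exponentiates the whole posterior numerator, whereas the original formulation exponentiates only the likelihood part; β = 1 recovers standard EM):

```python
import numpy as np

def tempered_posterior(p_w_z, p_z_d, beta):
    """Tempered E-step sketch: P_beta(z|d,w) prop. to [P(w|z) P(z|d)]**beta.

    beta < 1 flattens (raises the entropy of) the posteriors, acting as a
    heuristic regularizer against overfitting.
    """
    post = (p_z_d[:, :, None] * p_w_z[None, :, :]) ** beta
    return post / post.sum(1, keepdims=True)

# Flattening check on small, valid probability tables (illustrative values):
p_w_z = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(w|z): 2 concepts x 2 terms
p_z_d = np.array([[0.9, 0.1]])               # P(z|d): 1 document
sharp = tempered_posterior(p_w_z, p_z_d, beta=1.0)
flat = tempered_posterior(p_w_z, p_z_d, beta=0.5)
assert flat.max() < sharp.max()              # tempered posteriors are less peaked
```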
Example (1)
Concepts (3 of 100) extracted from AP news
Concept 1: securities 94.96324, firm 88.74591, drexel 78.33697, investment 75.51504, bonds 64.23486, sec 61.89292, bond 61.39895, junk 61.14784, milken 58.72266, firms 51.26381, investors 48.80564, lynch 44.91865, insider 44.88536, shearson 43.82692, boesky 43.74837, lambert 40.77679, merrill 40.14225, brokerage 39.66526, corporate 37.94985, burnham 36.86570

Concept 2: ship 109.41212, coast 93.70902, guard 82.11109, sea 77.45868, boat 75.97172, fishing 65.41328, vessel 64.25243, tanker 62.55056, spill 60.21822, exxon 58.35260, boats 54.92072, waters 53.55938, valdez 51.53405, alaska 48.63269, ships 46.95736, port 46.56804, hazelwood 44.81608, vessels 43.80310, ferry 42.79100, fishermen 41.65175

Concept 3: india 91.74842, singh 50.34063, militants 49.21986, gandhi 48.86809, sikh 47.12099, indian 44.29306, peru 43.00298, hindu 42.79652, lima 41.87559, kashmir 40.01138, tamilnadu 39.54702, killed 39.47202, india's 39.25983, punjab 39.22486, delhi 38.70990, temple 38.38197, shining 37.62768, menem 35.42235, hindus 34.88001, violence 33.87917
Example (2)
Concepts (10 of 128) extracted from Science Magazine articles (12K)
Experimental Evaluation
Figure: average precision and relative gain in average precision of VSM, LSA, and pLSA on the Medline, CRAN, CACM, CISI, and TREC collections.
Summary of quantitative evaluation
Consistent improvements of retrieval accuracy
Relative improvements of average precision 15-45%
On TREC3: 18% improvement compared to the SMART baseline
Live Implementation
Latent Dirichlet Allocation

Joint probabilistic model: the document-specific mixture proportions θ are parameters in pLSA but a random variable with a Dirichlet density p(θ | α) in LDA.

Marginal distribution of a document (with z_r the concept associated with the r-th term and w_r the r-th term in the document):

p(w_1, ..., w_N | α, β) = ∫ p(θ | α) Π_r Σ_{z_r} p(z_r | θ) p(w_r | z_r, β) dθ

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, 993-1022, 2003.
Literature & Related Work
Language Model IR:
- J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.
- A. Berger and J. Lafferty, Information Retrieval as Statistical Translation, ACM SIGIR, 1999.

pLSA / pLSI:
- T. Hofmann, Probabilistic Latent Semantic Indexing, ACM SIGIR, 1999.
- T. Hofmann, Probabilistic Latent Semantic Analysis, Uncertainty in AI, 1999.
- T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis, Machine Learning Journal, 2000.
- T. Hofmann, J. Puzicha, P. Tufts, Finding Information: Intelligent Retrieval & Categorization, Whitepaper, Recommind, December 2002.

Latent Dirichlet Allocation:
- D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, 993-1022, 2003.
- W. Buntine, Variational Extensions to EM and Multinomial PCA, ECML, 2003.
- T. Minka and J. Lafferty, Expectation-propagation for the generative aspect model, UAI, 2002.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Concept-based Text Categorization
Text categorization algorithms: Perceptron, naive Bayes, k-nearest neighbor, SVM, AdaBoost, etc.
Term-based document representation: vulnerable to linguistic variations
Explore the use of concept-based document representation in text categorization
Concepts are extracted automatically using pLSA (potentially using unlabeled documents)
Learning algorithms: AdaBoost.MH and .MR
R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Terms & Concepts as Features
Two types of weak hypotheses
Indicator function for term features
Indicator function for concept features
Standard features used in TextBooster (or SVMs)
Features automatically extracted by pLSA
L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, ACM SIGIR 2003.
Improvements on Reuters-21578
Reuters 1987 newswire: 21,578 documents & 37,926 word types
Train pLSA model with all data, use ModApte split for categorization: 9,603 training documents, 3,299 test documents, 8,676 unused
Improvements on OHSUMED87
OHSUMED87 collection consists of Medline documents from the year 1987: 54,708 instances and 67,096 words of which 19,066 word types remain after pruning
Train pLSA model with all the remaining data. Randomly split the collection into 10 folds. 9 of them are used for training and the remaining one for testing.
Evaluation on top 50 categories
Literature & Related Work
Fisher kernel based on pLSA (Fisher scores):
- T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems (NIPS), volume 11, 1999.
- T. Hofmann, Learning the similarity of documents. In Advances in Neural Information Processing Systems (NIPS), volume 12, MIT Press, 2000.

Semantic kernels based on LSA:
- N. Cristianini, J. Shawe-Taylor, and H. Lodhi, Latent semantic kernels. In Proceedings 18th International Conference on Machine Learning, pp. 66-73, Morgan Kaufmann Publishers, 2001.

Semantic kernel engineering:
- J. Kandola, N. Cristianini, and J. Shawe-Taylor, Learning semantic similarity. In Advances in Neural Information Processing Systems (NIPS), volume 15, 2003.

Boosting with concept features:
- L. Cai and T. Hofmann, Text Categorization by Boosting Automatically Extracted Concepts, ACM SIGIR, 2003.
- R. E. Schapire and Y. Singer, Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Probabilistic HITS
Probabilistic model of link structure:

- probabilistic graph model, i.e. a predictive model for additional links/nodes based on existing ones
- centered around the notion of "Web communities"
- a probabilistic version of HITS
- enables predicting the existence of hyperlinks: estimate the entropy of the Web graph
- can be combined with content: text at every node ...
Finding Latent Web Communities
Web community: a densely connected bipartite subgraph of source nodes s and target nodes t.

Probabilistic model pHITS (cf. the pLSA model, with documents and terms replaced by identically treated source and target nodes):

P(s, t) = Σ_z P(z) P(s | z) P(t | z)

- P(s | z): probability that a random out-link from s is part of community z
- P(t | z): probability that a random in-link of t is part of community z
Decomposing the Web Graph
Figure: decomposing a Web subgraph into Communities 1, 2, and 3.

- Links (probabilistically) belong to exactly one community.
- Nodes may belong to multiple communities.
Linking Hyperlinks and Content
pLSA and pHITS (probabilistic HITS) can be combined into one joint decomposition model: a shared latent variable z (concept/topic and Web community at once) generates, for a source document s with mixing proportions P(z | s), both words w via P(w | z) and link targets t via P(t | z).
D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), volume 13, 2001.
Example: Ulysses
Decomposition of a base set generated from Altavista with query “Ulysses” (combined decomposition based on links and text)
Community 1 (space mission): ulysses 0.022082, space 0.015334, page 0.013885, home 0.011904, nasa 0.008915, science 0.007417, solar 0.007143, esa 0.006757, mission 0.006090; top documents: ulysses.jpl.nasa.gov/ 0.028583, helio.estec.esa.nl/ulysses 0.026384, www.sp.ph.ic.ak.uk/Ulysses 0.026384

Community 2 (Ulysses S. Grant): grant 0.019197, s 0.017092, ulysses 0.013781, online 0.006809, war 0.006619, school 0.005966, poetry 0.005762, president 0.005259, civil 0.005065; top documents: www.lib.siu.edu/projects/usgrant/ 0.019358, www.whitehouse.gov/WH/glimpse/presidents/ug18.html 0.017598, saints.css.edu/mkelsey/gppg.html 0.015838

Community 3 (Joyce's Ulysses): page 0.020032, ulysses 0.013361, new 0.010455, web 0.009060, site 0.009009, joyce 0.008430, net 0.007799, teachers 0.007236, information 0.007170; top documents: http://www.purchase.edu/Joyce/Ulysses.htm 0.008469, http://www.bibliomania.com/Fiction/joyce/ulysses/index.html 0.007274, http://teachers.net/chatroom/ 0.005082
Literature & Related Work
Hypertext-induced Topic Search (HITS):
- J. Kleinberg, Authoritative sources in a hyperlinked environment. In Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
- K. Bharat and M. R. Henzinger, Improved algorithms for topic distillation in hyperlinked environments. ACM SIGIR, 1998.

Probabilistic HITS:
- D. Cohn and H. Chang, Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.
- D. Cohn and T. Hofmann, The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), volume 13, 2001.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Predictions & Recommendations
Goal: Predict future user behavior, ratings or preferences based on past behavior
Figure: given a typical user's profile of past item ratings, predict the unknown rating (?) of a further item.
Collaborative Filtering: Data

Users rate items (e.g. multimedia documents); a database stores the user profiles:

UserID | ItemID | Rating
10002  | 451    | 3
10221  | 647    | 4
10245  | 647    | 2
12344  | 801    | 5
...    | ...    | ...
pLSA for Collaborative Filtering
Application of pLSA to collaborative filtering: a rating variable needs to be added.

Figure: two graphical-model variants, a community variant (user u determines a latent community z, which together with the item y determines the rating v) and a categorized variant (the latent variable z mediates between user u and item v).
Gaussian & Multinomial pLSA
Key question: how to parameterize the conditional distribution of ratings v for item y in community z.

Gaussian pLSA: P(v | y, z) = N(v; μ_{y,z}, σ_{y,z}), with μ_{y,z} the mean rating of community z on item y and σ_{y,z} the standard deviation of the rating of community z on item y.

Multinomial pLSA: P(v | y, z) is a discrete distribution over the rating values.
Interests Group, Each Movie
Dis-Interests Group, Each Movie
Gaussian Probabilistic LSA
Illustration: characteristics of a user community.
Normalized Gaussian pLSA
Improvement: take into account that different users may differ in their mean ratings and in the dynamic range of their ratings.

Normalized Gaussian pLSA: ratings are standardized per user before fitting the Gaussian mixture.
EM Algorithm for Gaussian pLSA

Model parameters can be fitted via the Expectation Maximization (EM) algorithm, alternating two steps:

- E-step: compute the posterior P(z | u, y, v) ∝ P(z | u) N(v; μ_{y,z}, σ_{y,z})
- M-step: re-estimate P(z | u), μ_{y,z}, and σ_{y,z} from the posterior-weighted observations
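A compact sketch of one such EM iteration on (user, item, rating) triples (illustrative code with names of my own choosing; the σ update is omitted for brevity):

```python
import numpy as np

def gauss(v, mu, sigma):
    # Gaussian density N(v; mu, sigma), evaluated elementwise
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_step(triples, p_z_u, mu, sigma):
    """One EM iteration for Gaussian pLSA.

    triples: list of (user, item, rating); p_z_u: (U, K) memberships P(z|u);
    mu, sigma: (Y, K) per-item/community rating mean and standard deviation.
    """
    # E-step: P(z|u,y,v) proportional to P(z|u) N(v; mu[y,z], sigma[y,z])
    post = np.array([p_z_u[u] * gauss(v, mu[y], sigma[y]) for u, y, v in triples])
    post /= post.sum(1, keepdims=True)
    # M-step: posterior-weighted re-estimation of P(z|u) and mu
    new_p = np.zeros_like(p_z_u)
    num, den = np.zeros_like(mu), np.zeros_like(mu)
    for (u, y, v), q in zip(triples, post):
        new_p[u] += q
        num[y] += q * v
        den[y] += q
    new_p /= new_p.sum(1, keepdims=True)
    new_mu = np.where(den > 0, num / np.maximum(den, 1e-12), mu)
    return new_p, new_mu, post

triples = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 1.0)]
p_z_u = np.full((2, 2), 0.5)                      # 2 users, 2 communities
mu = np.array([[4.5, 1.5], [4.0, 2.0]])           # 2 items
sigma = np.ones((2, 2))
new_p, new_mu, post = em_step(triples, p_z_u, mu, sigma)
assert new_p[0, 0] > new_p[0, 1]   # user 0's high ratings fit community 0
```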
Experimental Setup
Data set: EachMovie data (courtesy of DEC): 1,623 movies, 61,265 users, >2.1M ratings on a 0-5 scale.

Evaluation protocol:
- leave-one-out prediction error
- for each user with at least M ≥ 2 available ratings, one rating is used for testing and M-1 for training
- different randomizations, averaged over 20 runs to establish statistical significance
Recommendation Accuracy
Prediction Accuracy
Prediction Accuracy (2)
How is prediction accuracy (MAE) affected by the (minimal) number of available ratings M?

M        | 2     | 5     | 10    | 20    | 50    | 100   | 200
Pearson  | 0.949 | 0.949 | 0.935 | 0.938 | 0.942 | 0.945 | 0.937
GpLSA    | 0.895 | 0.885 | 0.880 | 0.870 | 0.864 | 0.857 | 0.848
Prediction Accuracy (3)
How does prediction accuracy depend on the number of communities k in the Gaussian pLSA model?

k    | 5    | 10   | 20   | 30   | 40   | 50   | 60   | 75   | 100  | 150  | 200
RMS  | 1.21 | 1.20 | 1.18 | 1.17 | 1.17 | 1.17 | 1.18 | 1.18 | 1.18 | 1.19 | 1.20
MAE  | 0.93 | 0.92 | 0.91 | 0.90 | 0.90 | 0.90 | 0.91 | 0.91 | 0.92 | 0.93 | 0.93
Other Aspects
Off-line complexity:
- one EM iteration for Gaussian pLSA with k=40 takes ca. 30 secs
- approximately 40-120 iterations are required
- total training time: less than one hour

On-line complexity:
- computing a prediction is O(k) (k: number of communities)
- does not depend on the number of users/items: constant-time prediction

Model-based approach:
- once a pLSA model has been trained, predictions can be made without access to the original user profiles
- benefits for privacy & data security
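The O(k) prediction itself is just a k-dimensional dot product (a sketch with hypothetical stored model arrays of my own naming):

```python
import numpy as np

def predict(p_z_u, mu, u, y):
    # Predicted rating: v_hat = sum_z P(z|u) * mu[y, z]; O(k) regardless of
    # the number of users or items, and no raw user profiles are needed.
    return float(p_z_u[u] @ mu[y])

p_z_u = np.array([[0.8, 0.2]])   # stored community mixture of one user (k=2)
mu = np.array([[5.0, 1.0]])      # stored community rating means for one item
assert np.isclose(predict(p_z_u, mu, 0, 0), 4.2)
```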
Literature & Related Work
Collaborative filtering with latent variable models
T. Hofmann and J. Puzicha, Latent Class Models for Collaborative Filtering, IJCAI 1999.
T. Hofmann, “What People (Don’t) Want”, ECML 2001.
T. Hofmann, Latent Semantic Models for Collaborative Filtering, ACM TOIS 2004.
J.S. Breese, D. Heckerman, C. Kardie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, UAI, 1998
L. Ungar, D. Foster, Clustering Methods for Collaborative Filtering, AAAI Workshop on Recommendation Systems, 1998.
Introduction
Information Retrieval & Latent Semantic Indexing
Probabilistic Latent Semantic Indexing
Semantic Features for Text Categorization
Probabilistic HITS
Collaborative Filtering
Conclusion
Conclusion
Latent semantic variable models: simple idea, simple algorithm.

Conceptual improvements (cf. Mike Titterington's and Wray Buntine's talks):
- more sophisticated Bayesian versions (e.g. LDA)
- various extensions (rating variable, correspondence problems, etc.) and relations to other approaches

Wide range of applications:
- information retrieval
- text categorization
- collaborative filtering
- Web modeling/retrieval
- (language modeling, multimedia retrieval)

Thanks to: David Cohn (Google), Jan Puzicha (Recommind), Lijuan Cai (Brown University)