Matrix Decomposition Methods in Information Retrieval

Thomas Hofmann
Department of Computer Science, Brown University
www.cs.brown.edu/people/th
(& Chief Scientist, RecomMind Inc.)

In collaboration with: Jan Puzicha, UC Berkeley & RecomMind; David Cohen, CMU & Burning Glass
Overview

1. Introduction: A Brief History of Mechanical IR
2. Latent Semantic Analysis
3. Probabilistic Latent Semantic Analysis
4. Learning (from) Hyperlink Graphs
5. Collaborative Filtering
6. Future Work and Conclusion
1. Introduction: A Brief History of Mechanical IR
Memex – “As we may think.”
Vannevar Bush (1945)
The idea of an easily accessible, individually configurable storehouse of knowledge, the beginning of the literature on mechanized information retrieval:
“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”
“The world has arrived at an age of cheap complex devices of great reliability; and something is bound to come of it.”
Memex – “As we may think.”
Vannevar Bush (1945)
The civilizational challenge:
“The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.”
V. Bush, “As we may think”, Atlantic Monthly, 176 (1945), pp.101-108
The Thesaurus Approach
Hans Peter Luhn (1957, 1961)

- Words of similar or related meaning are grouped into "notional families"
- Encoding of documents in terms of notional elements
- Matching by measuring the degree of notional similarity
- A common language for annotating documents; key word in context (KWIC) indexing
- "… the faculty of interpretation is beyond the talent of machines."
- Statistical cues extracted by machines to assist the human indexer; vocabulary method for detecting similarities

H.P. Luhn, "A statistical approach to mechanical literature searching", New York, IBM Research Center, 1957.
H.P. Luhn, "The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Text", Information Retrieval and Machine Translation, 3(2), pp. 1021-1028, 1961.
To Punch or not to punch …
T. Joyce & R.M. Needham (1958)

- Lattices & hierarchies of search terms: "As in other systems, the documents are represented by holes in punched cards which represent the various terms, and in addition, when a hole is punched in any term card, all the terms at higher levels of the lattice […] are also punched."
- The postcoordinate revolution: card sorting at search time!
- "Investigations […] to lessen the physical work are continuing."

T. Joyce & R.M. Needham, "The Thesaurus Approach to Information Retrieval", American Documentation, 9, pp. 192-197, 1958.
Term Associations
Lauren B. Doyle (1962)

- Unusual co-occurrences of pairs of words = associations of words in text
- Statistical testing: chi-square and Pearson correlation coefficient to determine pairwise correlations
- Term association maps for interactive retrieval
- Today: semantic maps

L.B. Doyle, "Indexing and Abstracting by Association", Unisys Corporation, 1962.
Vector Space Model
Gerard Salton (1960s/70s)

- Instead of indexing documents by selected index terms, preserve (almost) all terms in automatic indexing
- Represent documents by a high-dimensional vector; each term can be associated with a weight
- Geometrical interpretation

G. Salton, "The SMART Retrieval System – Experiments in Automatic Document Processing", 1971.
Term-Document Matrix

D = {documents in database}
W = {terms in vocabulary}

Each entry of the term-document matrix records the count $c(d_i, w_j)$ of term $w_j$ in document $d_i$, possibly adjusted by term weighting.

[Figure: an I x J term-document matrix with rows $d_1, \ldots, d_I$ and columns $w_1, \ldots, w_J$. An example document ("Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications […]") is mapped to its count vector $x_d$, with entries such as artificial = 1, intelligence = 2, interest = 0, artifact = 0.]
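As a concrete illustration (not from the original slides), here is a minimal Python/NumPy sketch of building such a count matrix; the toy documents and all names are purely illustrative:

```python
# Minimal sketch of a term-document count matrix (toy data, illustrative only).
import numpy as np

docs = [
    "texas instruments developed a chip for artificial intelligence",
    "artificial intelligence boosts interest in intelligence research",
]

vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# C[i, j] = c(d_i, w_j): number of occurrences of term w_j in document d_i
C = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        C[i, col[w]] += 1

print(C.shape)  # (I documents, J terms)
```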
Documents in "Inner" Space

- Retrieval method: rank documents according to their similarity with the query
- Term weighting schemes, for example TF-IDF; used in the SMART system and many successor systems, high popularity

Similarity between document and query: cosine of the angle between query and document(s),

$\mathrm{sim}(d, q) = \cos(d, q) = \frac{\langle d, q \rangle}{\|d\| \, \|q\|}$

[Figure: documents and a query as vectors in 3-d term space, with example cosine similarities 0.75 and 0.64.]
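A hedged sketch of this ranking step, continuing the count matrix C from the previous sketch; the TF-IDF variant shown is one common choice, not necessarily the exact SMART weighting:

```python
# TF-IDF weighting and cosine-similarity ranking (illustrative variant).
import numpy as np

def tfidf(C):
    tf = C / np.maximum(C.sum(axis=1, keepdims=True), 1)  # term frequency
    df = (C > 0).sum(axis=0)                              # document frequency
    idf = np.log(C.shape[0] / np.maximum(df, 1))          # inverse document frequency
    return tf * idf

def rank_by_cosine(X, q):
    # sim(d, q) = <d, q> / (||d|| ||q||); returns documents sorted by similarity
    sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims
```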
Advantages of the Vector Space Model

- No subjective selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions: document clustering, relevance feedback (modifying the query vector)
- Geometric foundation
2. Latent Semantic Analysis
Limitations of the Vector Space Model

- Dimensionality: the vector space representation is high-dimensional (several 10-100K terms); learning and estimation have to deal with the curse of dimensionality.
- Sparseness: document vectors are typically very sparse; cosine similarity can be noisy and inaccurate.
- Semantics: the inner product can only match occurrences of exactly the same terms; the vector representation does not capture semantic relations between words.
- Independence: the bag-of-words representation is unable to capture phrases and semantic/syntactic regularities.
The Lost Meaning of Words …

Ambiguity and association in natural language:

- Polysemy: words often have a multitude of meanings and different types of usage (more urgent for very heterogeneous collections). The vector space model is unable to discriminate between different meanings of the same word, so $\mathrm{sim}(d,q) = \cos(d,q)$ overestimates the true similarity.
- Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic). No associations between words are made in the vector space representation, so $\mathrm{sim}(d,q) = \cos(d,q)$ underestimates the true similarity.
Polysemy and Context

Document similarity on the single-word level suffers from polysemy and lack of context.

[Figure: the word "saturn" contributes to similarity with space-related terms (ring, jupiter, space, voyager, planet) under meaning 1, but with car-related terms (car, company, dodge, ford) under meaning 2; a match on "saturn" contributes to similarity if both documents use the 1st meaning, but not if one uses the 2nd.]
Latent Semantic Analysis

General idea:
- Map documents (and terms) to a low-dimensional representation.
- Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
- Compute document similarity based on the inner product in the latent semantic space.

Goals:
- Similar terms map to similar locations in the low-dimensional space.
- Noise reduction by dimension reduction.
LSA: Matrix Decomposition by SVD

Dimension reduction by singular value decomposition of the term-document matrix:

$C = U \Sigma V^t \approx \hat{U} \hat{\Sigma} \hat{V}^t = \hat{C}$

- $C = (c_{ij})$, $c_{ij} = c(d_i, w_j)$: word frequencies, possibly transformed (document length normalization; sublinear transformation, e.g., log; global term weight)
- $U$, $V$: term/document vectors; $\hat{\Sigma}$: thresholded singular values
- $\hat{C}$: reconstructed term-document matrix, the L2-optimal approximation of the original
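A minimal sketch of this decomposition in NumPy, assuming C is a (documents x terms) count matrix as in the earlier slides; the number of retained dimensions k is a free parameter:

```python
# LSA by truncated SVD (sketch); keeps the k largest singular values.
import numpy as np

def lsa(C, k):
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    C_hat = (Uk * sk) @ Vtk  # rank-k reconstruction, optimal in the Frobenius norm
    doc_latent = Uk * sk     # rows: documents in the k-dim latent semantic space
    return C_hat, doc_latent, Uk, sk, Vtk
```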
Background: SVD

Singular Value Decomposition, definition:

$C_{n \times m} = U_{n \times n} \, \Sigma_{n \times n} \, V^t_{n \times m}$, truncated: $\hat{C}_{n \times m} = \hat{U}_{n \times k} \, \hat{\Sigma}_{k \times k} \, \hat{V}^t_{k \times m}$

- $U$, $V$: orthonormal columns
- $\Sigma$: diagonal with singular values (ordered)

Properties:
- Existence & uniqueness
- Thresholding small singular values yields an optimal low-rank approximation (in the sense of the Frobenius norm)
SVD and PCA

If (!) the rows of $C$ were shifted such that their mean is zero, then

$C C^t = U \Sigma V^t (U \Sigma V^t)^t = U \Sigma^2 U^t$,

and one would essentially perform a projection onto the principal axes defined by the columns of $U$.

Yet this shift would destroy the sparseness of the term-document matrix (and consequently might hurt the performance of SVD methods).
Canonical Analysis

Hirschfeld 1935, Hotelling 1936, Fisher 1940: correlation analysis for contingency tables

$c_{ij} = c_i c_j \left( 1 + \sum_{k=2}^{K} \rho_k u_{ik} v_{jk} \right)$, with marginals $c_i = \sum_{j=1}^{J} c_{ij}$, $c_j = \sum_{i=1}^{I} c_{ij}$, $\sum_i c_i = 1$

constraints:

$\sum_{i=1}^{I} c_i u_{ik} = \sum_{j=1}^{J} c_j v_{jk} = 0$, \quad $\sum_{i=1}^{I} c_i u_{ik} u_{il} = \sum_{j=1}^{J} c_j v_{jk} v_{jl} = \delta_{kl}$
Canonical & Correspondence Analysis

Correspondence analysis (as a method of scaling): Guttman 1941, Torgerson 1958, Benzécri 1969, Hill 1974; Whittaker 1967: "gradient analysis"

"reciprocal averaging":

$u_i = \rho^{-1} \sum_j \frac{c_{ij}}{c_i} v_j$, \quad $v_j = \rho^{-1} \sum_i \frac{c_{ij}}{c_j} u_i$

Solutions: unit vectors and scores of canonical analysis; SVD of the rescaled matrix with entries $c_{ij} / \sqrt{c_i c_j}$

(not exactly what is done in LSA)
Semantic Inner Product / Kernel

Similarity: inner product in the lower-dimensional space. Notice that

$\hat{C} \hat{C}^t = \hat{U} \hat{\Sigma} \hat{V}^t \hat{V} \hat{\Sigma} \hat{U}^t = (\hat{U} \hat{\Sigma})(\hat{U} \hat{\Sigma})^t$,

so the rows of $\hat{U} \hat{\Sigma}$ provide the lower-dimensional document representation.

For a given decomposition, additional documents or queries can be mapped into the semantic space (folding-in). Since $C = U \Sigma V^t$ implies $V^t = \Sigma^{-1} U^t C$, a new document/query $q$ is mapped to $\hat{q} = \hat{\Sigma}^{-1} \hat{U}^t q$.
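A small folding-in sketch consistent with the lsa() function above (documents as rows); note that the slide's formula $\hat{q} = \hat{\Sigma}^{-1} \hat{U}^t q$ assumes documents as columns, so with documents as rows the roles of $U$ and $V$ swap:

```python
# Folding a new term-count vector q into the latent space, then comparing
# it to documents by the inner product (cosine) in that space.
import numpy as np

def fold_in(q, Vtk, sk):
    return (Vtk @ q) / sk  # q_hat: k-dim coordinates of the new document/query

def latent_cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
```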
Term Associations from LSA
(taken from slide by S. Dumais)

[Figure: two term vectors plotted against axes Term 1 and Term 2, with a shared "concept" direction between them.]
LSA: Discussion

Pros:
- Low-dimensional document representation is able to capture synonyms.
- Noise removal and robustness by dimension reduction.
- Experimentally: advantages over the naïve vector space model.

Cons:
- "Formally": the L2 norm is inappropriate as a distance function for count vectors (the reconstruction may contain negative entries).
- "Conceptually":
  - The problem of polysemy is not addressed; principle of linear superposition, no active disambiguation.
  - The context of terms is not taken into account.
  - Directions in latent space are hard to interpret.
  - No probabilistic model of term occurrences.
  - [ad hoc selection of the number of dimensions, …]
Features of IR Methods

Features                          VSM   LSA
Quantitative relevance score      yes   yes
Partial query matching            yes   yes
Document similarity               yes   yes
Word correlations, synonyms       no    yes
Low-dimensional representation    no    yes
Notional families, concepts       no    not really
Dealing with polysemy             no    no
Probabilistic model               no    no
Sparse representation             yes   no
3. Probabilistic Latent Semantic Analysis
Documents as Information Sources

D = {documents in database}, W = {words in vocabulary}; term-document matrix with counts $c(d_i, w_j)$.

- "Real" document: empirical probability distribution given by relative frequencies, $\hat{P}(w|d) = \frac{c(d,w)}{c(d)}$
- "Ideal" document: a (memoryless) information source with distribution $P(w|d)$, of which the observed document is a sample (other documents?)
Information Source Models in IR

Bayes rule: probability of relevance of a document w.r.t. a query,

$P(d|q) \propto P(q|d) \, P(d)$ \quad (prior probability of relevance $P(d)$)

Query translation model:
- Probability that $q$ is "generated" from $d$: $P(q|d) = \prod_{t \in q} P(t|d)$
- Probability that a query term is generated: $P(t|d) = \sum_w P(t|w) \, P(w|d)$ \quad (translation model $P(t|w)$, language model $P(w|d)$)

J. Ponte & W.B. Croft, "A Language Model Approach to Information Retrieval", SIGIR 1998.
A. Berger & J. Lafferty, "Information Retrieval as Statistical Translation", SIGIR 1999.
Probabilistic Latent Semantic Analysis

- How can we learn document-specific language models? Sparseness problem, even for unigrams.
- Probabilistic dimension reduction techniques to overcome the data sparseness problem.
- Factor analysis for count data: factors = concepts.

$P(w|d) = \sum_z P(w|z) \, P(z|d)$, \quad $P(w,d) = \sum_z P(w|z) \, P(d|z) \, P(z)$

- $P(w|z)$: (topic) factor "sources"; $P(z|d)$: document-specific mixing proportions; latent variable $z$ with a "small" number of states

T. Hofmann, "Probabilistic Latent Semantic Analysis", UAI 1999.
PLSA: Graphical Model

$P(w|d) = \sum_z P(w|z) \, P(z|d)$

[Figure: plate diagram. For each of the N documents in the collection, a latent topic z is drawn with document-specific probability P(z|d), shared by all words in a document; a word w is then drawn from P(w|z), shared by all documents in the collection; c(d) word occurrences per document.]
Probabilistic Latent Semantic Space

- Documents are represented as points in a low-dimensional sub-simplex (dimensionality reduction for probability distributions).
- KL-divergence projection, not orthogonal.

[Figure: the probability simplex over words, with the sub-simplex spanned by $P(w|z_1)$, $P(w|z_2)$, $P(w|z_3)$; the empirical distribution $\hat{P}(w|d)$ is embedded in the simplex and projected onto $P(w|d)$ in the spanned sub-simplex.]
Positive Matrix Decomposition

Mixture decomposition in matrix notation:

$\tilde{C} = P_d \, \Sigma \, P_w^t$, with $\Sigma = \mathrm{diag}(P(z_1), \ldots, P(z_K))$, $(P_d)_{i,k} = P(d_i|z_k)$, $(P_w)_{j,k} = P(w_j|z_k)$

Constraints:
- Non-negativity of all matrices
- Normalization according to the L1-norm
- (no orthogonality)

D.D. Lee & H.S. Seung, "Learning the parts of objects by non-negative matrix factorization", Nature, 1999.
Positive Matrix Decomposition & SVD

Mixture decomposition in matrix notation:

$\tilde{C} = P_d \, \Sigma \, P_w^t$, \quad compare to \quad $C = U \Sigma V^t \approx \hat{U} \hat{\Sigma} \hat{V}^t = \hat{C}$

- Probabilistic approach vs. linear algebra decomposition
- Conditional independence assumption "replaces" the outer product
- Class-conditional distributions "replace" left/right eigenvectors
- Maximum likelihood instead of minimum L2 norm as criterion:

$\mathcal{L} = \sum_{i,j} c_{ij} \log \tilde{c}_{ij} = \sum_{i,j} c_{ij} \log \sum_z P(w_j|z) \, P(d_i|z) \, P(z)$
Expectation Maximization Algorithm

Maximizing the log-likelihood by (tempered) EM iterations.

E-step (posterior probabilities of latent variables):

$P(z|d,w) = \frac{P(d|z) \, P(w|z) \, P(z)}{\sum_{z'} P(d|z') \, P(w|z') \, P(z')}$, \quad tempered variant: $P(z|d,w) \propto [P(d|z) \, P(w|z)]^{\beta} \, P(z)$

(the probability that a term occurrence $w$ within $d$ is explained by topic $z$)

M-step (maximization of the expected complete log-likelihood):

$P(w|z) \propto \sum_d c(d,w) \, P(z|d,w)$, \quad $P(d|z) \propto \sum_w c(d,w) \, P(z|d,w)$, \quad $P(z) \propto \sum_{d,w} c(d,w) \, P(z|d,w)$
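A compact, hedged NumPy sketch of the plain (untempered, beta = 1) EM iterations above; it stores the posterior densely for clarity, whereas a real implementation would exploit the sparseness of C. All names are illustrative:

```python
# PLSA trained by EM (sketch). C: (documents x words) count matrix, K: #topics.
import numpy as np

def plsa_em(C, K, iters=100, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    I, J = C.shape
    Pd_z = rng.random((I, K)); Pd_z /= Pd_z.sum(axis=0)  # P(d|z), columns sum to 1
    Pw_z = rng.random((J, K)); Pw_z /= Pw_z.sum(axis=0)  # P(w|z)
    Pz = np.full(K, 1.0 / K)                             # P(z)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(d|z) P(w|z) P(z)
        post = Pd_z[:, None, :] * Pw_z[None, :, :] * Pz[None, None, :]
        post /= post.sum(axis=2, keepdims=True) + eps
        # M-step: reweight the posteriors by the observed counts c(d,w)
        weighted = C[:, :, None] * post
        Pw_z = weighted.sum(axis=0); Pw_z /= Pw_z.sum(axis=0) + eps
        Pd_z = weighted.sum(axis=1); Pd_z /= Pd_z.sum(axis=0) + eps
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum() + eps
    return Pd_z, Pw_z, Pz
```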
Example: Science Magazine Papers
- Dataset with approx. 12K papers from Science Magazine
- Selected concepts from a model with K=200
Example: TDT1 News Stories

- TDT1 = document collection with approx. 16,000 news stories (Reuters, CNN, years 1994/95)
- Results based on a decomposition with 128 concepts
- 2 main factors each for "flight" and "love" (most probable words, by $P(w|z)$):

"flight": plane, airport, crash, flight, safety, aircraft, air, passenger, board, airline | space, shuttle, mission, astronauts, launch, station, crew, nasa, satellite, earth

"love": home, family, like, just, kids, mother, life, happy, friends, cnn | film, movie, music, new, best, hollywood, love, actor, entertainment, star
Folding-in a Document/Query

- TDT1 collection: approx. 16,000 news stories; PLSA model with 128 dimensions
- Query keywords: "aid food medical people UN war"
- 4 most probable factors for the query; track posteriors for every keyword

4 selected factors with their most probable keywords:
- un, bosnian, serbs, bosnia, serb, sarajevo, nato, peacekeep., nations, peace, bihac, war
- iraq, iraqui, sanctions, kuwait, un, council, gulf, saddam, baghdad, hussein, resolution, border
- refugees, aid, rwanda, relief, people, camps, zaire, camp, food, rwandan, un, goma
- building, city, people, rescue, buildings, workers, kobe, victims, area, earthquake, disaster, missing
Folding-in a Document/Query

[Figure: posterior probabilities of the four factors for each query keyword (aid, food, medical, people, un, war) after EM iteration 1.]
Folding-in a Document/Query

[Figure: posterior probabilities of the four factors for each query keyword after EM iteration 2.]
Folding-in a Document/Query

[Figure: posterior probabilities of the four factors for each query keyword after EM iteration 5.]
Folding-in a Document/Query

[Figure: posterior probabilities of the four factors for each query keyword after further EM iterations.]
Experiments: Precision-Recall

4 test collections (each with approx. 1000-3500 docs)

[Figure: precision [%] vs. recall [%] curves on MED, CRAN, CACM, and CISI, comparing cos (vector space), LSI, and PLSI*.]
Experimental Results: TFIDF

[Figure: bar chart of average precision-recall on Medline, CRAN, CACM, and CISI for VSM, LSA, and PLSA.]
Experimental Results: TFIDF

[Figure: bar chart of the relative gain in average precision-recall (0-50%) over the baseline on Medline, CRAN, CACM, and CISI for VSM, LSA, and PLSA.]
From Probabilistic Models to Kernels: The Fisher Kernel

Use the idea of a Fisher kernel:
- Main idea: derive a kernel or similarity function from a generative model.
- How do ML estimates of the parameters change around a point in sample space?

Fisher scores derived from the model: $U_x = \nabla_\theta \log P(x|\theta)$ \quad ($x$: sample point, $\theta$: model parameters)

Kernel/similarity function: $\mathrm{sim}(x, y) = U_x^t \, I(\hat{\theta})^{-1} \, U_y$

T. Jaakkola & D. Haussler, "Exploiting Generative Models for Discriminative Training", NIPS 1999.
Semantic Kernel from PLSA: Outline

Outline of the technical derivation:
- Parameterize the multinomials by variance-stabilizing parameters (= square-root parameterization).
- Assume information orthogonality of the parameters for different multinomials (approximation).
- In each block, an isometric embedding with constant Fisher information is obtained (the inversion problem for the information matrix is circumvented).

… and the result …
Semantic Kernel from PLSA: Result

$\mathrm{sim}(d_i, d_m) = \sum_{k=1}^{K} P(z_k|d_i) \, P(z_k|d_m) + \alpha \sum_{j=1}^{J} \frac{c_{ij} \, c_{mj}}{c_i \, c_m} \sum_{k=1}^{K} P(z_k|d_i, w_j) \, P(z_k|d_m, w_j)$

- First term: topical overlap, the probability that a randomly chosen word in the first and in the second document refer to the same topic/concept.
- Second term: word overlap (do both documents contain common terms?), weighted by word sense (!) overlap (do both term occurrences refer to the same concept?).
- K=1 essentially reduces to the Vector Space Model (!)
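A hedged sketch of evaluating this similarity from the outputs of the plsa_em() sketch above; the weighting alpha and all helper names are illustrative, and the posteriors P(z|d,w) are recomputed from the fitted parameters:

```python
# PLSA Fisher-kernel-style similarity between documents i and m (sketch).
import numpy as np

def plsa_sim(C, Pd_z, Pw_z, Pz, i, m, alpha=1.0, eps=1e-12):
    # P(z|d) for all documents (Bayes: proportional to P(d|z) P(z))
    Pz_d = Pd_z * Pz
    Pz_d /= Pz_d.sum(axis=1, keepdims=True) + eps
    topical = Pz_d[i] @ Pz_d[m]                 # topical overlap term
    # posteriors P(z|d,w), one row per word, for documents i and m
    post_i = Pd_z[i] * Pw_z * Pz
    post_i /= post_i.sum(axis=1, keepdims=True) + eps
    post_m = Pd_z[m] * Pw_z * Pz
    post_m /= post_m.sum(axis=1, keepdims=True) + eps
    # empirical word frequencies c_ij / c_i
    pi = C[i] / (C[i].sum() + eps)
    pm = C[m] / (C[m].sum() + eps)
    lexical = alpha * np.sum(pi * pm * (post_i * post_m).sum(axis=1))
    return topical + lexical
```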
Text Categorization: SVM with PLSA

- Standard text collection: Reuters-21578 (5 main categories: earn, acq, money, grain, crude), with a standard kernel (SVM) and the PLSA Fisher kernel (SVM+)
- Substantial improvement if additional unlabeled documents are available

[Figure: bar chart of error % per category for SVM vs. SVM+ with 5%, 20%, and 100% of the labels.]
Latent Class Analysis: Example

- Document collection with approx. 1,400 abstracts on "clustering" (INSPEC 1991-1997); preprocessing: stemming, stop word list
- 4 main factors (K=128) for the term "SEGMENT" (most probable words):

image segmentation: imag, SEGMENT, textur, color, tissu, brain, slice, cluster, mri, volum
motion segmentation: video, sequenc, motion, frame, scene, SEGMENT, shot, imag, cluster, visual
line matching: constraint, line, match, locat, imag, geometri, impos, SEGMENT, fundament, recogn
speech recognition: speaker, speech, recogni, signal, train, HMM, sourc, speaker-indep., SEGMENT, sound
Document Similarity: Example (1)

Factor mixing proportions are shown for the factors ("image", "video", "line", "speech").

Multiresolution wavelet decomposition and neuro-fuzzy clustering for segmentation of radiographic images.
"Segmentation of medical images is a challenging problem in the field of image analysis. Several diagnostics are based on proper segmentation of the digitized image. Segmentation of medical images is needed for applications involving estimation of the boundary of an object, classification of tissue abnormalities, shape analysis, contour detection and texture segmentation. […]"
Factors: 0.5534, 0.0000, 0.0012, 0.0000

Unknown-multiple signal source clustering problem using ergodic HMM and applied to speaker classification.
"The authors consider signals originated from a sequence of sources. More specifically, the problems of segmenting such signals and relating the segments to their sources are addressed. This issue has wide applications in many fields. The report describes a resolution method that is based on an ergodic hidden Markov model (HMM), in which each HMM state corresponds to a signal source. […]"
Factors: 0.0002, 0.6689, 0.0455, 0.0000

relative similarity (VSM): 1.4    relative similarity (PLSA): 0.7
Document Similarity: Example (2)

McCalpin, J.P.; Nishenko, S.P.: Holocene paleoseismicity, temporal clustering, and probabilities of future large (M>7) earthquakes on the Wasatch fault zone, Utah.
"The chronology of M>7 paleoearthquakes on the central five segments of the Wasatch fault zone (WFZ) contains 16 earthquakes in the past 5500 years with an average repeat time of 350 years. Four of the central five segments ruptured between 620+or-30 and 1230+or-60 calendar years B.P. The remaining segment (Brigham City segment) has not ruptured in the past 2120+or-100 years. Comparison of the WFZ space-time diagram of paleoearthquakes with synthetic paleoseismic histories indicates that the observed temporal clusters and gaps have about an equal probability (depending on model assumptions) of reflecting random coincidence as opposed to intersegment contagion. Regional seismicity suggests […]"

Blatt, M.; Wiseman, S.; Domany, E.: Clustering data through an analogy to the Potts model.
"A new approach for clustering is proposed. This method is based on an analogy to a physical model; the ferromagnetic Potts model at thermal equilibrium is used as an analog computer for this hard optimization problem. We do not assume any structure of the underlying distribution of the data. Phase space of the Potts model is divided into three regions; ferromagnetic, super-paramagnetic and paramagnetic phases. The region of interest is that corresponding to the super-paramagnetic one, where domains of aligned spins appear. The range of temperatures where these structures are stable is indicated by […]"

relative similarity (VSM): 1.0    relative similarity (PLSA): 0.5
Features of IR Methods

Features                          LSA          PLSA
Quantitative relevance score      yes          yes
Partial query matching            yes          yes
Document similarity               yes          yes
Word correlations, synonyms       yes          yes
Low-dimensional representation    yes          yes
Notional families, concepts       not really   yes
Dealing with polysemy             no           yes
Probabilistic model               no           yes
Sparse representation             no           yes
4. Learning (from) Hyperlink Graphs
The Importance of Hyperlinks in IR

- Hyperlinks provide latent human annotation.
- Hyperlinks represent an implicit endorsement of the page being pointed to.
- Social structures are reflected in the Web graph (cyber/virtual/Web communities).
- Link structure allows assessment of page authority: it goes beyond content-based analysis and potentially discriminates between high- and low-quality sites.
HITS (Hyperlink Induced Topic Search)

Jon Kleinberg and the Smart group (IBM). HITS:
- Retrieve a subset of Web pages based on query-based search: result set + context graph
- Extract the hyperlink graph of the pages in the subset
- Rescoring method with hub and authority weights using the adjacency matrix of the Web subgraph:

Authority scores: $x_q^{(t+1)} = \sum_{p: (p,q) \in E} y_p^{(t)}$ \quad Hub scores: $y_p^{(t)} = \sum_{q: (p,q) \in E} x_q^{(t)}$

- Solution: left/right eigenvectors (SVD)

J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", 1998.
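A minimal power-iteration sketch of these updates; A is the adjacency matrix of the extracted Web subgraph (A[p, q] = 1 iff page p links to page q), and the normalization shown is one standard choice:

```python
# HITS hub/authority scores by power iteration (sketch).
import numpy as np

def hits(A, iters=50):
    n = A.shape[0]
    x = np.ones(n)  # authority scores
    y = np.ones(n)  # hub scores
    for _ in range(iters):
        x = A.T @ y              # authorities collect the hub weight of in-links
        x /= np.linalg.norm(x)
        y = A @ x                # hubs collect the authority weight of out-links
        y /= np.linalg.norm(y)
    # x and y converge to the dominant right/left singular vectors of A
    return x, y
```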
Learning a Semantic Model of the Web

Making sense of the text:
- Probabilistic latent semantic analysis
- Automatically identifies concepts and topics

Making sense of the link structure:
- Probabilistic graph model, i.e., a predictive model for additional links/nodes based on existing ones
- Centered around the notion of "Web communities"
- Probabilistic version of HITS
- Enables prediction of the existence of hyperlinks: estimate the entropy of the Web graph
Finding Web Communities

Web community: a densely connected bipartite subgraph.

Probabilistic model:

$P(s,t) = \sum_z P(z) \, P(s|z) \, P(t|z)$

[Figure: bipartite graph of source nodes s and target nodes t (identical node sets), coupled through the community variable z via $P(s|z)$ and $P(t|z)$.]
Decomposing the Web Graph

- Links (probabilistically) belong to exactly one community.
- Nodes may belong to multiple communities.

[Figure: a Web subgraph decomposed into Community 1, Community 2, and Community 3.]
Linking Hyperlinks and Content

PLSA and PHITS (probabilistic HITS) can be combined into one joint decomposition model.

[Figure: graphical model in which the latent concept/topic and Web community variable z, drawn with P(z|s), generates both words w with P(w|z) and link targets t with P(t|z).]
"Ulysses" Webs: Space, War, and Genius (no heroes wanted)

- Decomposition of a base set generated from Altavista with the query "Ulysses"
- Combined decomposition based on links and text

Factor 1 (space): ulysses 0.022082, space 0.015334, page 0.013885, home 0.011904, nasa 0.008915, science 0.007417, solar 0.007143, esa 0.006757, mission 0.006090
  ulysses.jpl.nasa.gov/ 0.028583, helio.estec.esa.nl/ulysses 0.026384, www.sp.ph.ic.ak.uk/Ulysses 0.026384

Factor 2 (war): grant 0.019197, s 0.017092, ulysses 0.013781, online 0.006809, war 0.006619, school 0.005966, poetry 0.005762, president 0.005259, civil 0.005065
  www.lib.siu.edu/projects/usgrant/ 0.019358, www.whitehouse.gov/WH/glimpse/presidents/ug18.html 0.017598, saints.css.edu/mkelsey/gppg.html 0.015838

Factor 3 (genius): page 0.020032, ulysses 0.013361, new 0.010455, web 0.009060, site 0.009009, joyce 0.008430, net 0.007799, teachers 0.007236, information 0.007170
  http://www.purchase.edu/Joyce/Ulysses.htm 0.008469, http://www.bibliomania.com/Fiction/joyce/ulysses/index.html 0.007274, http://teachers.net/chatroom/ 0.005082

D. Cohn & T. Hofmann, "The Missing Link", NIPS 2001.
5. Collaborative Filtering
Personalized Information Filtering

[Figure: Users/Customers connected to Objects by judgements/selections such as "likes" and "has seen".]
Predicting Preferences and Actions

User Profile:
Dr. Strangelove       *****
Three Colors: Blue    *****
Fargo                 *****
Pretty Woman          *
Movie? Rating?
Collaborative and Content-Based Filtering

Collaborative/social filtering:
- Properties of persons or similarities between persons are used to improve predictions.
- Makes use of user profile data.
- Formally: the starting point is a sparse matrix of user ratings.

Content-based filtering:
- Properties of objects or similarities between objects are used to improve predictions.
PLSA for Predicting User Ratings

Multi-valued (or real-valued) ratings $v \in \{0, 1, 2, 3, 4, 5\}$.

"Community-based" variant: the preference v is independent of the person u, given the latent state z.

[Figure: graphical model with user u and item y coupled through the latent variable z, which generates the rating v.]

- Each user is represented by a specific probability distribution.
- Analogy to IR: [user = document], [items = terms].
PLSA vs. Memory-Based Approaches

Standard approach: memory-based.
- Given the active user, compute the correlation with all user profiles in the database (e.g., Pearson).
- Transform the correlation into a relative weight and perform a weighted prediction over the neighbors.

PLSA:
- Explicitly decomposes preferences: interests are inherently "multi-dimensional"; no global similarity function is used (!)
- Probabilistic model
- Data mining: interest groups
EachMovie Data Set (I)

- EachMovie: >40K users, >1.6K movies, >2M votes
- Experimental evaluation: comparison with a memory-based method (competitive), leave-one-out protocol

Prediction accuracy:
Baseline       33.4
Memory         35.3
PLSA, K=20     39.9
PLSA, K=200    40.8
EachMovie Data Set (II)

Absolute deviation (lower is better):
Baseline       1.091
Memory         0.951
PLSA, K=20     0.947
PLSA, K=200    0.924
EachMovie Data Set (III)

Ranking score (exponential fall-off of weights with position in the recommendation list):
Baseline       26.95
Memory         27.89
PLSA, K=20     44.64
PLSA, K=200    45.98
Interest Groups, EachMovie

[Figure: example interest groups extracted from the EachMovie data.]
Dis-interest Groups, EachMovie

[Figure: example dis-interest groups extracted from the EachMovie data.]
6. Open Problems & Conclusions
Scalability of Matrix Decomposition

RecomMind Inc., Retrieval Engine:
- >1M documents
- >50K vocabulary
- >1K concepts

Internet Archive (www.archive.org):
- Large-scale Web experiments, >10M sites
Conclusion: Matrix Decomposition

- Enables semantic document indexing: concepts, notional families
- Increased robustness in information retrieval
- Text/data mining: finding regularities & patterns
- Improved categorization by providing more suitable document representations
- The probabilistic nature of the models allows the use of formal inference
- Very versatile: term-document matrix, adjacency matrix, rating matrix, etc.
Open Problems

Conceptual:
- Bayesian model learning and model combination
- Distributed learning of latent class models
- Relational Bayesian networks (Koller et al.)
- Principled ways to exploit sparseness in algorithm design
- Beyond bag-of-words models (string kernels, bigram language models)

Applications:
- Combining content filtering with collaborative filtering
- Personalized information retrieval
- Interactive retrieval using extracted structure
- Multimedia retrieval
- New application domains