

LBNL-43202

A Similarity-Based Probability Model for Latent Semantic Indexing

Chris H.Q. Ding

National Energy Research Scientific Computing Division, Ernest Orlando Lawrence Berkeley National Laboratory

University of California, Berkeley, California 94720

May 1999

This work was supported by the Director, Office of Computational and Technology Research, Division of Mathematical, Information, and Computational Sciences, and the Director, Office of Science, Office of Laboratory Policy and Infrastructure, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.


DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.


DISCLAIMER

Portions of this document may be illegible in electronic image products. Images are produced from the best available original document.


A Similarity-Based Probability Model for Latent Semantic Indexing

Chris H.Q. Ding
NERSC Division, Lawrence Berkeley National Laboratory

University of California, Berkeley, CA 94720

Abstract

A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statistical significance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions, answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clarified.

1 Introduction

Automatic document retrieval for a user query, such as searching documents with Internet search engines, often matches the keywords in the query to the index words for all documents in the database, following the vector-space model of information retrieval (IR) [1]. Latent semantic indexing (LSI) [2, 3, 4, 5] is a successful scheme for going beyond lexical matching to address the well-known problem of using individual keywords to identify the content of documents. LSI attempts to capture the underlying or latent semantic structures, which better index the documents than individual indexing terms, by the truncated singular value decomposition (SVD) of the term-document matrix X. The effectiveness of LSI has been demonstrated empirically in several text collections as increased average retrieval precision.

Clearly, a theoretical and quantitative understanding beyond empirical evidence is desirable. To date, several theoretical results or explanations [6, 7, 8, 9] have appeared, and these studies provide a better understanding of LSI. However, many fundamental questions remain unresolved.

In this paper, we outline a dual probabilistic model for LSI based on the similarity concept widely used in the vector-space model. For this model, the term-term similarity matrix X X^T and the document-document similarity matrix X^T X naturally arise during the maximum likelihood estimation, and LSI is the optimal solution of the model.

From a statistical point of view, LSI amounts to an effective dimensionality reduction, i.e., reducing the problem dimension to the k-dim LSI space. Dimensions with small singular values are thus often viewed as representing semantic noise and are ignored. This generic argument, considering its fundamental importance, needs to be clarified, quantified, and verified. Our model provides a mechanism to do so by checking the statistical significance of the semantic dimensions: if a few semantic dimensions can effectively characterize the data statistically, as indicated by the likelihood of the model, we believe they also effectively represent the semantic meanings/relationships as defined by the cosine similarity.

Thus the likelihood is the key to the verification of the optimal semantic subspace that LSI advocates. We give some theoretical results and an illustrative example to support the existence of such an optimal semantic subspace. In doing so, a criterion for determining the optimal semantic dimensions can be defined, addressing a critical open question in LSI with practical importance.

2 Latent Semantic Indexing

In the vector-space model of information retrieval, the term-to-document association relation is represented as a term-document matrix

    X = (x_1 \cdots x_n) = \begin{pmatrix} t^1 \\ \vdots \\ t^d \end{pmatrix}    (1)

containing the frequencies of the d index terms occurring in the n documents, properly weighted by other factors [3, 10]. Here x_i is a d-component column vector representing a document ((x_i)_\alpha \equiv x_\alpha^i), and t^\alpha is an n-component row vector representing a term ((t^\alpha)_i \equiv x_\alpha^i). (In this paper, capital letters refer to matrices and boldface lower-case letters to vectors; subscripts refer to documents, superscripts refer to terms; \alpha, \beta sum over all d terms and i, j sum over all n documents.) Given a user query q, consisting of a set of terms (keywords), the system calculates an n-component score vector s = q^T X as the relevance of each document to the query. The relevant documents are sorted according to the score and returned to the user.
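To make the scoring step concrete, the following is a minimal sketch in Python/NumPy; the terms, documents, and weights are hypothetical, invented for illustration only. It builds a small term-document matrix and computes the score vector s = q^T X.

    import numpy as np

    # Hypothetical 4-term x 3-document matrix X (rows = terms, columns = documents);
    # the entries are illustrative term weights, not data from the paper.
    X = np.array([[1.0, 0.0, 2.0],   # term "matrix"
                  [0.0, 1.0, 1.0],   # term "retrieval"
                  [1.0, 1.0, 0.0],   # term "semantic"
                  [0.0, 2.0, 0.0]])  # term "indexing"

    # A query containing the terms "retrieval" and "semantic".
    q = np.array([0.0, 1.0, 1.0, 0.0])

    # Vector-space scoring: s = q^T X gives one relevance score per document.
    s = q @ X
    print(s)                 # scores, e.g. [1. 2. 1.]
    print(np.argsort(-s))    # documents ranked by decreasing score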

LSI re-represents both terms and documents in a new vector space with a smaller number k of dimensions, in order to capture the underlying or latent structures (indices). This is done through the truncated singular value decomposition, X \approx U_k \Sigma_k V_k^T, or explicitly,

    X \approx (u_1 \cdots u_k) \, \mathrm{diag}(\sigma_1, \ldots, \sigma_k) \begin{pmatrix} v_1^T \\ \vdots \\ v_k^T \end{pmatrix}    (2)

where u_1, ..., u_k and v_1, ..., v_k are the left and right singular vectors and \sigma_1, ..., \sigma_k are the singular values. Mathematically, the truncated SVD is the best approximation of X in the reduced k-dimensional subspace. In this k-dim LSI subspace, the query is represented as q^T U_k, and the documents are represented as the columns of \Sigma_k V_k^T. The score vector is calculated as s = (q^T U_k)(\Sigma_k V_k^T).

Here we point out an important feature of LSI: if the document vectors (columns of X) are normalized to one in the original d-dim space, their representations (columns of \Sigma_k V_k^T) in the reduced k-dim LSI subspace are also normalized to one. To prove this, we have

    (\Sigma_k V_k^T)^T (\Sigma_k V_k^T) = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) \approx X^T X.    (3)

Since the columns of X are normalized, the diagonal elements of X^T X are all ones, which implies the normalization of the columns of \Sigma_k V_k^T. Therefore, LSI preserves the uniform scale if we start with normalized document vectors. For this reason, we believe that documents should be normalized before LSI is applied. We assume so in this paper, without loss of generality. Therefore, cosine similarity is equivalent to dot-product similarity in both spaces. Note that Eq. 3 also indicates that the document-document similarities are preserved in the k-dim LSI subspace. [6] further proved that this preservation is an optimal one.

Typically taking k = 200-300 (far less than either d or n), LSI increases the retrieval precision for the query. The optimal k to achieve the best precision is currently determined by exhaustive evaluation. How to calculate it directly from X remains an open question [4].
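A minimal sketch of this pipeline, assuming NumPy's standard SVD routine and reusing the same kind of toy data as the sketch above (all values hypothetical): it normalizes the document vectors, truncates the SVD at rank k, and scores a query in the k-dim LSI subspace as (q^T U_k)(\Sigma_k V_k^T).

    import numpy as np

    def lsi_truncate(X, k):
        # Truncated SVD X ~= U_k Sigma_k V_k^T (cf. Eq. 2).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k], s[:k], Vt[:k, :]

    # Same toy term-document matrix and query as in the previous sketch.
    X = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0]])
    q = np.array([0.0, 1.0, 1.0, 0.0])

    # Normalize document vectors (columns of X) to unit length, as recommended in the text.
    Xn = X / np.linalg.norm(X, axis=0)

    k = 2
    Uk, sk, Vkt = lsi_truncate(Xn, k)

    docs_k = np.diag(sk) @ Vkt             # documents in the LSI subspace: Sigma_k V_k^T
    print(np.linalg.norm(docs_k, axis=0))  # close to 1, as argued around Eq. 3

    q_k = q @ Uk                           # query representation q^T U_k
    scores = q_k @ docs_k                  # score vector (q^T U_k)(Sigma_k V_k^T)
    print(scores)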

3 Similarity matrices

Our starting point is the understanding of the matrices X^T X and X X^T, since they determine the SVD and give rise to the latent semantic vectors u_1, ..., u_k and v_1, ..., v_k. X^T X is the similarity matrix between documents, using the cosine or dot-product similarity measure of vector-space IR models:

    \mathrm{sim}(x_i, x_j) = x_i \cdot x_j = \sum_{\alpha=1}^{d} x_\alpha^i x_\alpha^j.    (4)

Note the document-document similarity is defined in the space spanned by the d indexing terms (term space). This similarity measure is of fundamental importance in vector-space IR models. The dot-product between two term vectors t^1, t^2 (rows of X),

    \mathrm{sim}(t^1, t^2) = t^1 \cdot t^2 = \sum_{i=1}^{n} t_i^1 t_i^2,    (5)

measures their co-occurrences through all documents in the collection, and therefore their closeness or similarity [1, 2]. X X^T contains the dot-products of all pairs of terms and is the term-term similarity matrix. In several automatic text categorization methods, terms are often first clustered according to their co-occurrences in documents using X X^T. A statistical model of LSI in document space should involve both X^T X and X X^T. We will show later that they indeed arise naturally in our model.

Here we emphasize the dual relationship between documents and terms. As discussed above, similarity between documents is defined in term-space, and similarity between terms is defined in document-space. This fundamental relationship between documents and terms naturally corresponds to the occurrence of right and left singular vectors in the SVD, and is a key feature of our statistical modeling.
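Both similarity matrices are simple to form explicitly. The sketch below (random toy data, for illustration only) computes X^T X and X X^T for a matrix with normalized document columns and checks that the diagonal of X^T X is all ones, the property used in the argument around Eq. 3.

    import numpy as np

    rng = np.random.default_rng(0)
    # Random toy term-document matrix (20 terms x 8 documents), illustrative only.
    X = rng.random((20, 8))
    X = X / np.linalg.norm(X, axis=0)          # normalize document vectors

    doc_sim = X.T @ X     # n x n document-document similarities (Eq. 4)
    term_sim = X @ X.T    # d x d term-term co-occurrence similarities (Eq. 5)

    print(np.allclose(np.diag(doc_sim), 1.0))  # normalized documents: unit self-similarity
    print(doc_sim.shape, term_sim.shape)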

4 Dual Probability Model

If we view each document as a data entry in the d-dimensional term-space (index space), there are reasons to believe that documents do not occur entirely randomly. Thus we assume they occur according to a certain probability distribution, and can be modeled by standard statistical methods. This idea is similar to, e.g., the Naive Bayes document classification approach, where documents are assumed to be governed or generated by a probability distribution.

Consider a column vector c representing a document, characterizing a latent semantic structure in LSI. The probability of the occurrence of a document x_i is related to its similarity (cf. Eq. 4) to the latent structure vector c. Motivated by the widespread use of the Gaussian distribution, we assume the documents are distributed according to the probability

    p(x_i | c) = e^{(x_i \cdot c)^2} / Z(c)    (6)

where the normalization constant Z(c) = \int \exp((x \cdot c)^2) dx. The next step is to find c as the optimal parameter for the probability model, subject to the constraint |c| = 1. For this purpose, we use maximum likelihood estimation (MLE), a standard method in statistics. In MLE, we try to find the c that maximizes the following log-likelihood function:

    \ell(c) = \log \prod_{i=1}^{n} p(x_i | c) = c^T X X^T c - n \log Z(c)    (7)

assuming the data are independently, identically distributed. Here we have used

    \sum_{i=1}^{n} (x_i \cdot c)^2 = c^T X X^T c.    (8)

The term-term similarity matrix X X^T arises as a natural consequence of the model. We point out that it is the term similarity matrix X X^T that arises here, not the document similarity matrix X^T X (as one might have expected). Here the documents are the data, which live in the index space (term space). X X^T measures the "correlation" between components of the data when properly normalized, and would not change much if more data are included; thus it serves a role similar to the covariance matrix in a principal component analysis. The conventional covariance matrix on the data set x_1, ..., x_n must subtract the averages and would look qualitatively different from X X^T due to the sparsity of the data matrix X. Thus LSI is not principal component analysis, although they are similar.

In general, finding the c that maximizes \ell(c) involves a rather complicated numerical procedure, particularly difficult because Z(c) is an integral in a d = 10^3-10^5 dimensional space and is analytically intractable. However, note that n \log Z(c) is a very slowly changing function in comparison to the first term c^T X X^T c, and thus can be ignored. In essence, c is similar to the mean vector \mu in a Gaussian distribution, where the normalization Z is independent of \mu. Thus we expect Z(c) to be nearly independent of c.

Therefore, we need only maximize the first term, c^T X X^T c. The symmetric positive-definite matrix X X^T has the spectral decomposition

    X X^T = \sum_{\alpha=1}^{d} \lambda_\alpha u_\alpha u_\alpha^T,  \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d,

where \lambda_\alpha, u_\alpha are the \alpha-th eigenvalue and eigenvector. Thus the optimal solution is c = u_1.
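As a numerical sanity check of this conclusion (not part of the original paper), one can verify that no unit vector gives a larger value of c^T X X^T c than the leading eigenvector u_1. The sketch below uses a random toy matrix and NumPy's symmetric eigensolver.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random((20, 8))
    X = X / np.linalg.norm(X, axis=0)

    S = X @ X.T                          # term-term matrix X X^T (d x d)
    evals, evecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    u1 = evecs[:, -1]                    # leading eigenvector
    best = u1 @ S @ u1                   # = lambda_1, the Rayleigh-quotient maximum

    for _ in range(1000):                # random unit vectors never beat u_1
        c = rng.normal(size=S.shape[0])
        c /= np.linalg.norm(c)
        assert c @ S @ c <= best + 1e-12
    print("max of c^T X X^T c =", best, "= lambda_1 =", evals[-1])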

We can improve the statistical modeling of the data by using k characteristic document vectors, generalizing Eq. 6 to

    p(x | c_1, ..., c_k) \propto e^{(x \cdot c_1)^2 + \cdots + (x \cdot c_k)^2}    (9)

with the constraint that they are mutually orthogonal. Following the same maximum likelihood estimation procedure, the optimal solutions for the model parameters c_1, ..., c_k are the first k eigenvectors of X X^T, namely u_1, ..., u_k, exactly the left singular vectors of LSI/SVD (cf. Eq. 2). The optimal model is therefore

    p(x | u_1, ..., u_k) = e^{(x \cdot u_1)^2 + \cdots + (x \cdot u_k)^2} / Z_k    (10)

where Z_k = Z(u_1, ..., u_k) is the normalization constant.

The above analysis of modeling documents is carried out in term-space. We can also model the terms t^1, ..., t^d as defined by their co-occurrences in the document collection, i.e., the document space. In this model, the data are the terms, indexed by documents. Consider a (normalized) row vector r representing a term. Using the term similarity measure of Eq. 5, we assume r characterizes the data according to the probability

    p(t | r) = e^{(t \cdot r)^2} / Z(r),    (11)

similar to Eq. 6. To find the optimal r, we calculate the log-likelihood

    \ell(r) = \log \prod_{\alpha=1}^{d} p(t^\alpha | r) = r^T X^T X r - d \log Z(r)    (12)

after some algebra, and noting

    \sum_{\alpha=1}^{d} t_i^\alpha t_j^\alpha = (X^T X)_{ij}.    (13)

The document-document similarity matrix X^T X arises again as a direct consequence of the model. The symmetric positive-definite matrix X^T X has the spectral decomposition

    X^T X = \sum_{i=1}^{n} \zeta_i v_i v_i^T,  \zeta_1 \geq \cdots \geq \zeta_n,

where \zeta_i, v_i are the i-th eigenvalue and eigenvector. Thus the optimal solution is r = v_1.

We may also use k characteristic row vectors to model the data, and the optimal solutions are the right singular vectors v_1, ..., v_k of the SVD. Thus we obtain the final probability representation

    p(t | v_1, ..., v_k) = e^{(t \cdot v_1)^2 + \cdots + (t \cdot v_k)^2} / Z_k'.    (14)

We have constructed a dual probability model: one for documents in term-space using the document similarity, and another for terms in document-space using the term similarity. For both models, the optimal solutions for the model parameters are found to be exactly the LSI/SVD vectors. Thus LSI is the optimal solution of the model, and we refer to u_1, ..., u_k and v_1, ..., v_k as latent semantic or index vectors, meaning they identify the latent structures in LSI.

Eqs. 10 and 14 are dual probability representations of LSI. This dual relationship is further enhanced by the following facts: (a) X X^T and X^T X have the same eigenvalues,

    \lambda_j = \zeta_j = \sigma_j^2,  j = 1, ..., k;

(b) the left and right LSI vectors are related by

    u_j = (1/\sigma_j) X v_j,  j = 1, ..., k.

Thus both probability models have the same maximum log-likelihood

    \ell_k = \sigma_1^2 + \cdots + \sigma_k^2 - n \log Z_k    (15)

with the only difference in the normalization constants. This is a direct consequence of the dual relationship between terms and documents. In particular, for statistical modeling of the observed term-document co-occurrence data, both probability models should be considered with the same number k, as is the case in the SVD. Eq. 15 also indicates that the statistical significance of each LSI vector is proportional to the square of its singular value (\sigma_j^2 in the likelihood). Therefore, the contribution of an LSI vector with a small singular value is much smaller than \sigma_j itself as it appears in the SVD (cf. Eq. 2). This is an important result of the theory.
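The duality facts (a) and (b) are easy to confirm numerically. The sketch below (random toy data, NumPy SVD) checks that X X^T and X^T X share the nonzero eigenvalues \sigma_j^2 and that u_j = (1/\sigma_j) X v_j for each singular triplet.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.random((20, 8))
    X = X / np.linalg.norm(X, axis=0)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # (a) X X^T and X^T X share the nonzero eigenvalues sigma_j^2.
    lam_term = np.linalg.eigvalsh(X @ X.T)[::-1][:len(s)]
    lam_doc = np.linalg.eigvalsh(X.T @ X)[::-1][:len(s)]
    print(np.allclose(lam_term, s**2), np.allclose(lam_doc, s**2))

    # (b) u_j = (1/sigma_j) X v_j relates the left and right singular vectors.
    for j in range(len(s)):
        uj = (X @ Vt[j]) / s[j]
        print(np.allclose(uj, U[:, j]))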

5 Optimal Semantic Subspace

The central theme in LSI is that the LSI subspace captures the essential meaningful semantic associations while reducing redundant and noisy semantic information. Our model provides the means to verify this claim, by measuring the statistical significance of the LSI vectors. We can compute the numerical values of the likelihood and verify that as more latent index vectors are included in the probability density Eq. 10, the likelihood of the model increases, indicating an improvement in the quality of the statistical model and hence the effectiveness of LSI. We further conjecture that latent index vectors with small eigenvalues contain statistically insignificant information, and their inclusion in the probability density will not increase the likelihood. In LSI, they represent redundant and noisy semantic information.

Thus the likelihood is the key to verifying the existence of the optimal semantic subspace. The log-likelihood for the k latent vectors is

    \ell_k = \sum_{i=1}^{n} [(x_i \cdot u_1)^2 + \cdots + (x_i \cdot u_k)^2] - n \log Z_k = \sigma_1^2 + \cdots + \sigma_k^2 - n \log Z_k.    (16)

In general, Z_k \equiv Z(u_1, ..., u_k) is difficult to calculate, because it is an integral in a high d-dimensional space (d = 10^3-10^5):

    Z_k = \int \cdots \int e^{(x \cdot u_1)^2 + \cdots + (x \cdot u_k)^2} dx_1 \cdots dx_d.

Fortunately, in maximum likelihood analysis, what matters is the relative variation of the log-likelihood vs. k, not its absolute value. To this goal, we may proceed using statistical sampling. In statistical modeling, the data (documents) are samples drawn randomly from the population; thus

    Z_k \approx \sum_{i=1}^{n} e^{(x_i \cdot u_1)^2 + \cdots + (x_i \cdot u_k)^2} dx_i.    (17)

[Figure 1: The log-likelihood for modeling documents in term space (document model, N = 17, T = 16).]

This is an unbiased estimate, and the approximation improves as n increases. If all dx_i have the same size, we can take them out of the sum. In the general case, we may also factor them out by a properly defined average \langle dx \rangle = (\eta / n) \sum_{i=1}^{n} dx_i, where \eta \approx 1 and is weakly dependent on k. We may further absorb the difference in the discrete summation (a proportionality constant of Eq. 17) into \langle dx \rangle, and obtain

    Z_k = \langle dx \rangle \sum_{i=1}^{n} e^{(x_i \cdot u_1)^2 + \cdots + (x_i \cdot u_k)^2} = \langle dx \rangle \tilde{Z}_k.    (18)

The key point here is that \langle dx \rangle depends on the given text collection, but is independent (or very weakly dependent) of k. In the following likelihood analysis, we will ignore \langle dx \rangle and compute \tilde{Z}_k only. Thus we have a practical method to evaluate the likelihood \ell_k.
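Putting Eqs. 16-18 together gives a directly computable likelihood curve. The sketch below is one possible implementation under the stated approximation (\langle dx \rangle dropped, \tilde{Z}_k used in place of Z_k); the term-document matrix is random toy data, so the shape of the curve is only illustrative, not the curve of the paper's example.

    import numpy as np

    rng = np.random.default_rng(3)
    # Hypothetical term-document matrix (50 terms x 30 documents), sparse-ish toy data.
    X = rng.random((50, 30)) * (rng.random((50, 30)) < 0.2) + 1e-6
    X = X / np.linalg.norm(X, axis=0)
    d, n = X.shape

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    def log_likelihood(k):
        # l_k = sigma_1^2 + ... + sigma_k^2 - n log Z~_k  (Eqs. 16-18, <dx> dropped)
        proj = (U[:, :k].T @ X) ** 2            # (x_i . u_j)^2 for j <= k and all documents i
        Zk = np.exp(proj.sum(axis=0)).sum()     # Z~_k = sum_i exp(sum_j (x_i . u_j)^2)
        return np.sum(s[:k] ** 2) - n * np.log(Zk)

    for k in range(1, 16):
        print(k, round(log_likelihood(k), 3))   # look for a plateau or peak as k grows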

5.1 An illustrative example

Here we use a concrete example to illustrate some of the useful concepts. For this goal, we adopt the example of 17 book titles reviewed in SIAM Review [5]. They are indexed by 16 terms, resulting in a 16x17 term-document matrix. After normalizing each document vector (column of X) to 1, we compute the left singular vectors (eigenvectors of X X^T) and the log-likelihood (cf. Eq. 16), as shown in Figure 1. The likelihood increases steadily as k

increases from 1 to 5, clearly indicating that the probability model provides better statistical modeling of the documents as more LSI vectors are included; the likelihood fluctuates for k > 6, indicating that no meaningful statistical information is represented by those latent index vectors with smaller eigenvalues.

[Figure 2: The log-likelihood for modeling terms in document space (term model, N = 17, T = 16).]

To model the terms, we computed the right singular vectors (eigenvectors of X^T X) and the log-likelihood, as shown in Figure 2. One sees the likelihood peaks at around k = 7 and fluctuates afterwards. These two likelihood curves behave qualitatively similarly, indicating the kind of feature we expect if we believe LSI vectors with small singular values are statistically unimportant. It would be very interesting to repeat these calculations on much larger text collections. Clearly the optimal k can be determined by this statistical model. In this collection, k_opt = 5-7.

5.2 Likelihood Analysis

One may ask if the likelihood curves for the book-title collection will hold for general cases. After all, as more parameters are included in the model, one would expect the likelihood to continue to increase. The answer is that even though Z_k(c) changes very slowly indeed, an approximation is still made in finding the optimal analytic solution to Eq. 9 in the MLE procedure. Thus the likelihood is not guaranteed to monotonically increase in our model.

Given this clarification, we have some theoretical indications that the likelihood behavior of the book-title example is likely true for general cases. We can prove the following relation

    \ell_{k+1} \approx \ell_k    (19)

for reasonably large k. By "reasonably large k", we mean that in

    \tilde{Z}_{k+1} = \sum_{i=1}^{n} e^{(x_i \cdot u_1)^2 + \cdots + (x_i \cdot u_k)^2} e^{(x_i \cdot u_{k+1})^2},

e^{(x_i \cdot u_{k+1})^2} is statistically independent of e^{(x_i \cdot u_1)^2 + \cdots + (x_i \cdot u_k)^2}, so we can write

    \tilde{Z}_{k+1} \approx \tilde{Z}_k \cdot \frac{1}{n} \sum_{i=1}^{n} e^{(x_i \cdot u_{k+1})^2} \approx \tilde{Z}_k (1 + \sigma_{k+1}^2 / n)

after expanding e^{(x_i \cdot u_{k+1})^2} \approx 1 + (x_i \cdot u_{k+1})^2, since |x_i \cdot u_{k+1}| < 1, and using Eq. 8. Substituting this into Eq. 16 for \ell_{k+1}, we obtain Eq. 19.

This relation indicates a plateau or a peak in the likelihood curve, instead of a monotonic increase. The theory does not predict whether it will be a peak or a plateau.

6 Invariance Properties

We have outlined the dual model and worked out a few results. Here we mention the invariance properties of the model. First, the model is invariant with respect to (w.r.t.) the order in which terms or documents are indexed, since it depends on the dot-product, which is invariant w.r.t. the order. The SVD and the singular vectors and values are also invariant, since they depend on X X^T and X^T X, both of which are invariant w.r.t. the order.

Second, the projections into the k-dim subspace preserve the dot-product similarity. The document projections, columns of U_k^T X = \Sigma_k V_k^T, preserve the similarity as shown in Eq. 3. The term projections, columns of V_k^T X^T = \Sigma_k U_k^T, also preserve the term-term similarity,

    (\Sigma_k U_k^T)^T (\Sigma_k U_k^T) = (U_k \Sigma_k V_k^T)(U_k \Sigma_k V_k^T)^T \approx X X^T,    (20)

up to the minor difference due to the truncation in the SVD. In particular, if these quantities are normalized in the original space, they will remain normalized in the LSI subspace.
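The preservation statements in Eqs. 3 and 20 can be checked directly; the sketch below (random toy data) compares the document-document and term-term similarity matrices computed from the k-dim projections with X^T X and X X^T, where the residual is just the SVD truncation error.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.random((20, 8))
    X = X / np.linalg.norm(X, axis=0)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 5
    docs_k = np.diag(s[:k]) @ Vt[:k, :]      # document projections Sigma_k V_k^T
    terms_k = np.diag(s[:k]) @ U[:, :k].T    # term projections Sigma_k U_k^T

    # Similarities in the k-dim subspace vs. the original space (Eqs. 3 and 20);
    # the Frobenius norms below shrink to zero as k approaches the rank of X.
    print(np.linalg.norm(docs_k.T @ docs_k - X.T @ X))
    print(np.linalg.norm(terms_k.T @ terms_k - X @ X.T))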

Third, the model is invariant with respect to incorporating a scale parameter s, an average similarity, into Eq. 9,

    p(x_i | c_1, ..., c_k) \propto e^{[(x_i \cdot c_1)^2 + \cdots + (x_i \cdot c_k)^2] / 2s^2},    (21)

similar to the standard deviation in Gaussian distributions. We obtain the same LSI vectors and the same likelihood curves, except that the vertical scale is enlarged.

7 Related work

Traditional IR probabilistic models, such as the binary independence retrieval model [11, 12], focus on relevance to queries. There, relevance to a specific query is pre-determined or iteratively determined through relevance feedback, on an individual-query basis. Our new approach focuses on the data, the term-document matrix X, ignoring query-specific information at present.

As discussed above, the similarity matrices X X^T and X^T X are key considerations of our model. X^T X is used as the primary target in the multi-dimensional scaling interpretation [6] of LSI, where it is shown that LSI/SVD is the best approximation to X^T X in the reduced k-dimensional subspace. There, the document-document similarity is also generalized to include arbitrary weighting, which can be similarly carried out in our model.

The minimum description length principle is used in [9] to determine the optimal k, which is rather close to the experimentally determined value. The relations between the model and the term-document matrix there require further clarification, however.

8 Concluding remarks

In this paper, we introduced a dual probability model for LSI based on the fundamental cosine (dot-product) similarity measure. The similarity matrices are then direct consequences of the model, and the latent semantic vectors of LSI/SVD are the optimal solutions of the model. The latent semantic relationships, as characterized by the latent semantic vectors, are then related to statistical significance as they are used to characterize (parametrize) the probability distribution. The likelihood is then proposed to quantify this significance. Both the illustrative example and our theoretical understanding (cf. Eq. 19) indicate a plateau (or peak) in the likelihood curve. This signals the existence of an optimal semantic subspace with much smaller dimension that effectively captures the essential associative semantic relationship between terms and documents, consistent with the empirical evidence and the general intuition.

LSI/SVD techniques have been used in information filtering (document classification) and computational linguistics (e.g., [4, 13, 14]). Our model applies to these cases too, as long as the semantic structure defined by the dot-product similarity remains the essential relationship there. In text classification [4, 14], documents are projected into the LSI subspace; the same semantic relations remain in this new feature space as in the retrieval case. In word sense disambiguation [13], the relevant relationship is the cosine between two vectors in the context space, and thus our theory applies here also. In all these cases, it is the appropriate similarity matrix, not the conventional covariance matrix, that determines the meaningful reduced subspace.

The dual probability model outlined here is aconstructive model. It can be further extended.One may try to add the query-related informationfor IR, or other factors relevant for the particularapplication.

In summary, we believe this model establishes a sound theoretical framework for LSI and LSI/SVD-related dimensionality reduction methods, and answers some of the fundamental questions in information retrieval and filtering.

Acknowledgement. I thank Drs. Hongyuan Zha, Osni Marques, Horst Simon, and Inderjit Dhillon for valuable conversations. This work is supported by the Director, Office of Computational and Technology Research, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC03-76SF00098, and by the Director, Office of Science, Office of Laboratory Policy and Infrastructure, of the U.S. Department of Energy under contract number DE-AC03-76SF00098.

References

[1] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[2] S. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Furnas, R.A. Harshman. Indexing by latent semantic analysis. J. Amer. Soc. Info. Sci., 41(6), pp. 391-407, 1990.

[3] S.T. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 23(2), pp. 229-236, 1991.

[4] S.T. Dumais. Using LSI for information filtering: TREC-3 experiments. In D. Harman (Ed.), Overview of TREC-3, National Institute of Standards and Technology Special Publication 500-335, 1995.

[5] M.W. Berry, S.T. Dumais, G.W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pp. 573-595, 1995.

[6] B.T. Bartell, G.W. Cottrell, and R.K. Belew. Representing documents using an explicit model of their similarities. J. Amer. Soc. Info. Sci., 46, pp. 251-271, 1995.

[7] C.H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala. Proc. of Symposium on Principles of Database Systems (PODS), Seattle, Washington, June 1998. ACM Press.

[8] H. Zha, O. Marques and H. Simon. A subspace-based model for information retrieval with applications in latent semantic indexing. Proceedings of Irregular '98, Lecture Notes in Computer Science, Vol. 1457, pp. 29-42, Springer-Verlag, 1998.

[9] R.E. Story. An explanation of the effectiveness of latent semantic indexing by means of a Bayesian regression model. Information Processing & Management, 32(3), pp. 329-344, 1996.

[10] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), pp. 513-523, 1988.

[11] C.J. van Rijsbergen. Information Retrieval. 2nd Ed. Butterworths, 1979.

[12] N. Fuhr. Probabilistic models in information retrieval. Computer Journal, v. 35, pp. 243-255, 1992.

[13] H. Schutze. Dimensions of meaning. In Proceedings of Supercomputing '92, pp. 787-796. IEEE Press, 1992.

[14] Y. Yang. Noise reduction in a statistical approach to text categorization. In Proc. of the 18th Annual ACM SIGIR Conference (SIGIR'95), pp. 256-263, 1995.