
International Journal of Innovative Research in Advanced Engineering (IJIRAE), Volume 1, Issue 1 (April 2014). ISSN: 2278-2311. http://ijirae.com

    A Review on Topic Modeling in Information Retrieval

Sawant Ganesh S., Computer Department, DCOER, University of Pune. [email protected]

Kanawade Bhavana R., IT Department, DCOER, University of Pune. [email protected]

Abstract: Today's age is the information age; we require information for every aspect of our work. With the advent of the internet, more and more online text is generated every day. But this text information is in a raw, unorganized form and hence of little use unless we refine it and understand what it is about. Once we understand the underlying subject or theme of the available text, we can easily classify the text according to its content, i.e. its theme. Topic modeling is about identifying the various hidden themes in a given text collection; we call these themes topics. Topic modeling algorithms aim to discover and annotate large archives of documents with thematic information based on the semantics and context of words. The given text collection is called a corpus: simply a collection of text documents from which we identify the underlying themes. Here we discuss different topic models that look through a corpus of words and group them into topics based on similarity and context.

Keywords: Text information, theme, topic, corpus, annotate, context.

I. INTRODUCTION

The organization of a large collection of text is a tedious but important task. As information is an important asset of any organization, there is a need to know about the text collection we have: just having a large collection of text without knowing what it is about is of no use. To make proper use of the available text, we need to identify the subjects discussed in the collection. These subjects are referred to as topics. Topics are abstract keywords which tell us about the contents of a document; informally, a topic captures what a document is about [1]. In short, a topic represents a cluster of words carrying thematic information. It can also be viewed as a title or heading representing a particular block of text. We can relate a topic to a document representative: as a document representative represents a whole document, a topic represents a collection of text. The topic generation process, however, is different from the generation of a document representative.

Topics are generated from a large corpus, i.e. a collection of text, by analysing the semantics and context of words. Analysing the context of a word helps us assign that word to its appropriate topic. If we don't consider the context of a word, i.e. the situation in which the word is used, we may end up with a problem called vocabulary mismatch: the word under consideration gets assigned to an unrelated topic. Hence considering the context of a word is very important in topic modeling. The basic idea is that humans use a particular word in a particular context. To illustrate, suppose there is a word "charger" and we want to assign it to a particular topic. For this assignment we need to consider the situation, i.e. the context, in which "charger" is used: it could be a cell phone charger, a laptop charger, or something else. Assessing the context of a word is thus a crucial task; otherwise, in the above example, we might assign an occurrence of "charger" that refers to a cell phone charger to a topic that discusses laptops. This situation, where one word carries multiple meanings, is called polysemy [2]. Topic modeling helps remove this kind of ambiguity [3]; we will address this problem later. So far this was the general idea of topic modeling and why we need it. One definition of topic modeling is: a way of text mining for identifying patterns in a corpus.

In this paper we discuss various topic models, from traditional manually built topic models to the latest ones. Section II describes why there is a need for topic modeling, along with problems such as polysemy and homonymy ([2],[3]) that gave rise to it. Section III describes the literature studied; there we focus on manually built topic models and the term similarity based model, and also discuss topic models such as LSI [5], which makes use of the SVD technique, and LDA [4], which makes use of probability distributions.

II. NEED FOR TOPIC MODELING

In this section we look at some problems that gave rise to topic modeling.

A. Large volume of information

As discussed earlier, with the increase in online information we simply don't have the human power to read and study it all. Hence there is a need for an automatic technique that will scan all documents in a corpus and discover the topics, i.e. the subjects or themes discussed in the documents. Once we know the underlying themes of the documents, their utility increases, and these themes can be used in many applications.


B. Polysemy and homonymy

Context mismatch is a problem in which a word used in one context changes the semantics of a topic when used in another. Consider the word "system": used with "operating" it forms "operating system", and used with "sound" it forms "sound system". These two combinations represent different concepts, so the context in which a word occurs matters. Topic modeling aims at finding the semantic relationships among words; the context of each word is considered, and words accordingly become part of the appropriate topics. This problem is called the polysemy and homonymy problem [3]: in polysemy one word may have several meanings, while in homonymy several words share the same meaning. This is shown in Fig. 1 [3].

Polysemy: Word_1 → {Meaning_1, …, Meaning_n}
Homonymy: {Word_1, …, Word_n} → Meaning_1

Fig. 1: Polysemy vs. homonymy [3].

If we don't address the problem of polysemy and homonymy, the precision of results suffers, as many unrelated topics can become part of the retrieved answer set.

III. LITERATURE SURVEY

In this section we trace the evolution of topic models and the techniques used to extract topics, from the earliest manually built models to the latest probabilistic models such as Latent Dirichlet Allocation (LDA), discussing each with respect to the technique used, its advantages, and its shortcomings.

A. Manually built topic models [6],[7]

Manually built topic models are the earliest topic models. As the name suggests, they are man-made: constructed by humans based on their knowledge of language. As humans can better understand the themes within a text collection, they can manually assign a particular document or set of documents to a particular topic, using the predefined knowledge and rules they have. These models give better retrieval results for a subset of queries and hence require query expansion techniques to improve the performance of retrieved results [6]. The advantage of this kind of model is its precise results for a particular topic: because the models are manually constructed and humans have good knowledge of language, documents are precisely allocated to their respective topics. But the model has shortcomings: building it is a time-consuming and labour-intensive process [7]. Despite being labour intensive and time consuming, these models have remained attractive and are used in scenarios such as user feedback systems and open resources such as WordNet. Fig. 2 gives a general overview of a manually built topic model. In it, M1 is a model with parameters D, T, H, and f(D), where D is the set of input documents, also called the corpus; T is the set of topics extracted from the documents in D; f(D) is the processing function that helps identify topics, based on user knowledge; and H is the human who uses his knowledge to identify topics in the corpus.

B. Term similarity model [8]

The term similarity model considers the closeness of one document to another based on the similar terms present in the documents. This is accomplished using various similarity measures such as Dice's coefficient, Jaccard's coefficient, and the cosine measure [8]. Once we identify which documents are close to each other, we can assign them to one particular topic. Unlike manually built topic models, this model is automatic: there is no need for a human to physically assign documents to topics. Instead, the model applies an algorithm implementing one of the similarity measures to identify which documents are close to each other based on the common terms present in them. One advantage of this model is its automatic nature; since no human is needed to assign documents to topics, the tedious and labour-intensive process becomes fast and flexible. But the model has one shortcoming: similarity between documents is based on the number of common terms present in them, i.e. the context of words is not considered. To elaborate, let us consider an example.


Fig. 2: Manually built topic model. M1 = {D, T, H, f(D)}, where:
  D = set of input documents, D = {d1, d2, d3, …, dn};
  T = set of topics generated, T = {t1, t2, t3, …, tn};
  H = human responsible for assigning documents to particular topics;
  f(D) = T, i.e. f: D → T, mapping {d1, d2, d3, …, dn} to {t1, t2, …, tn}; the mapping is based on user knowledge.

Fig. 3: Term similarity based model. M1 = {D, T, f(D)}, with D and T as in Fig. 2;
  f(D) = T, i.e. f: D → T; the mapping is based on term similarity measures, and no human H is involved.

Let there be two documents named Train and Car, consisting of the following words:

Train = {train, fast, driver, station, long, speed}
Car = {car, speed, racing, driver, red, fast}

There are three common terms in both documents: "fast", "speed", and "driver". If we apply one of the term similarity measures, we conclude that these documents are closely related to each other. But this is not the true case: one document describes a train and the other a car. Yes, documents containing common terms are related, but we need to consider the context in which those common terms are used; only then can we precisely comment on the similarity between two documents. Fig. 3 shows the general model based on term similarity. We can easily identify the difference between the manually built topic model in Fig. 2 and the term similarity based model in Fig. 3: there is no H parameter in the term similarity based model, because topic assignment is done automatically, without the help of a human. Another key difference is the mapping function f(D): in manually built topic models this function is influenced by human knowledge, while in the term similarity model it is associated with the similarity measures mentioned in [8]. The sketch below illustrates these measures.
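To make the comparison concrete, the following is a minimal sketch (ours, not taken from [8]) of the three similarity measures applied to the Train and Car word sets above; the function names and the set-based (binary) formulation are our assumptions.

```python
import math

# Set-based (binary) forms of the similarity measures discussed above.
# Function names and this formulation are illustrative assumptions.

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def dice(a, b):
    # 2|A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(a, b):
    # |A ∩ B| / sqrt(|A| * |B|) for binary term vectors
    return len(a & b) / math.sqrt(len(a) * len(b))

train = {"train", "fast", "driver", "station", "long", "speed"}
car = {"car", "speed", "racing", "driver", "red", "fast"}

print(jaccard(train, car))  # 3 shared / 9 distinct terms = 0.33
print(dice(train, car))     # 2*3 / (6 + 6) = 0.50
print(cosine(train, car))   # 3 / sqrt(6*6) = 0.50
```

All three measures rate the pair as fairly similar even though one document is about trains and the other about cars, which is exactly the context-blindness this section describes.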

C. Latent Semantic Analysis [9]

Latent Semantic Analysis (LSA) is a true topic model which considers the semantic relationships among terms and documents to discover the underlying semantic relations; that is, LSA tries to capture the meaning behind words, i.e. topics. The difference between words and topics is that words are observable while topics are not: they are hidden, or latent. LSA uses a dimensionality reduction technique to capture conceptual and semantic relationships among words and documents [9],[5], and hence addresses the problem of polysemy in a clean and effective way. This is because, to identify semantic relations, LSA identifies the context in which a word is used rather than the occurrences of common words. To identify latent semantic and contextual meaning, LSA makes use of a statistical method known as singular value decomposition (SVD). SVD is a dimensionality reduction technique that takes a term-document matrix as input and discovers the semantic relationships among terms and documents.

The first step in LSA modeling is to prepare a two-dimensional term-document matrix as shown in Fig. 4. The first dimension of this matrix is the terms present in the documents, forming the row vectors; the second dimension is the documents themselves, represented as column vectors. The entries in the matrix are the frequencies of occurrence of each word in each document.

Next, LSA applies SVD to the term-document matrix. Let this original term-document matrix be A_{m×n}. It can be decomposed into a product of three matrices as

A_{m×n} = U_{m×m} S_{m×n} V^T_{n×n}

where U and V are orthogonal matrices, i.e. U U^T = I and V^T V = I. The columns of U are the orthonormal eigenvectors of A A^T, the columns of V are the orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of the eigenvalues of A A^T (the singular values) in descending order. For example, as in [9], consider the document collection shown in Fig. 4. First prepare the term-document matrix as shown in Fig. 5, then apply the matrix decomposition formula above, i.e. SVD. After solving the matrix multiplication we get the final matrix A as shown in Fig. 6; this matrix describes the semantic relationships among words within documents.

Now let us analyse the output of the matrix decomposition shown in Fig. 6. Consider documents C2 and C4: document C2 contains the term "user" and document C4 contains the term "human". Obviously "human" and "user" refer to the same entity, meaning they are semantically related, and this is captured in Fig. 6. The entry for "human" corresponding to document C2 is 0 (zero) in Fig. 5, and the entry corresponding to document C4 is 1 (one); that is, with respect to Fig. 5, the word "human" is not related to document C2, as its entry is zero.



Now observe the decomposed matrix in Fig. 6: the entry of "human" for document C2 is 0.40, meaning the term "human" is related to document C2, as the value 0.40 is above zero, indicating some kind of relation. This relation exists because "human" and "user" refer to the same entity. In this way LSI identifies latent semantic relationships among words and documents. The identification of semantic relationships is the main feature of LSI, which makes it a true and attractive topic model. The SVD decomposition technique itself is not discussed here, but its full working can be found in [9],[5]; online tutorials explaining the basics of SVD matrix decomposition are also available.

Fig. 4: Document collection.
Fig. 5: Term-document matrix.
Fig. 6: Matrix decomposition of Fig. 5 using SVD.
Fig. 7: Graphical model for Latent Dirichlet Allocation.
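Since Figs. 4-6 are not reproduced here, the following hedged sketch runs the same pipeline on a hypothetical toy term-document matrix using NumPy: decompose with SVD, truncate to k = 2 dimensions, and reconstruct, so that zero entries can become nonzero where latent relations exist (as with "human" and C2 above).

```python
import numpy as np

# Hypothetical toy term-document matrix (rows: terms, columns: documents);
# entries are raw term counts, standing in for Fig. 5.
terms = ["human", "user", "interface", "tree", "graph"]
A = np.array([
    [1, 0, 0, 1, 0],   # "human"
    [0, 1, 1, 0, 0],   # "user"
    [1, 1, 1, 0, 0],   # "interface"
    [0, 0, 0, 1, 1],   # "tree"
    [0, 0, 1, 0, 1],   # "graph"
], dtype=float)

# SVD: A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (dimensionality reduction).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# In A_k, entries that were 0 in A may become nonzero, exposing latent
# term-document relations of the kind discussed for "human" and C2.
np.set_printoptions(precision=2, suppress=True)
print(A_k)
```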

There was no clear theoretical reason to adopt the SVD technique in particular; one can instead fit models to the data directly, using maximum likelihood or Bayesian methods [4]. In this context, Hofmann in 1999 presented the probabilistic LSI model (pLSI) as an alternative to LSI [13]. The pLSI model, as the name suggests, considers the probability with which a word can be assigned to a topic, but does not consider how documents could be assigned to topics: pLSI works at the term level, not at the document level. To overcome this shortcoming, a new model was proposed, named Latent Dirichlet Allocation (LDA) [4].

D. Latent Dirichlet Allocation (LDA) [4],[10]

LDA is a probabilistic topic model which uses probability distributions to assign the words in a document to particular topics. Probabilistic topic models are a suite of algorithms that focus on discovering hidden thematic and contextual information in a set of documents [10]. The underlying intuition behind LDA is that documents are mixtures of multiple topics. For example, a document on computer science can cover topics such as data structures, algorithms, theory of computation, and computer networks; these topics are distributed over the document in equal or unequal proportions. There are two types of variables in LDA: hidden variables and observed variables. The observed variables are the words within documents, while the hidden variables describe the topic structure. In Fig. 7, hidden variables such as the topic proportions, topic assignments, and topics are shown as unshaded circles, while the observed variables, i.e. the words within documents, are shown as a shaded circle [10]. The LDA model follows a generative process, a process by which documents are generated: data arise from hidden random variables, and these variables form the topic structure. The hidden structure is inferred from documents by computing the posterior distribution, the conditional distribution of the hidden variables given the documents. The word "Dirichlet" in Latent Dirichlet Allocation refers to the distribution used to draw the per-document topic distribution, i.e. it specifies how topics are distributed in a particular document. In the generative process, the output of this Dirichlet draw is used to assign the words of a document to different topics, as sketched below.
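The generative process can be made concrete with a short sketch; everything here (the vocabulary, the number of topics, the hyperparameter values, and the topic-word distributions) is a hypothetical illustration, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["data", "algorithm", "network", "packet", "proof", "theorem"]
K, alpha = 3, 0.5              # number of topics, Dirichlet hyperparameter

# beta[k] is topic k's distribution over the vocabulary (here drawn at
# random; in LDA proper these are themselves hidden variables).
beta = rng.dirichlet(np.ones(len(vocab)), size=K)

def generate_document(n_words):
    # Draw the per-document topic proportions theta_d from a Dirichlet.
    theta = rng.dirichlet(alpha * np.ones(K))
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # topic assignment z_{d,n}
        w = rng.choice(len(vocab), p=beta[z])  # observed word w_{d,n}
        words.append(vocab[w])
    return words

print(generate_document(8))
```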


Formally, LDA can be represented with the following notation. The topics are β_{1:K}; θ_d is the vector of topic proportions for the d-th document, with θ_{d,k} the proportion of topic k in document d; z_d are the topic assignments for the d-th document, with z_{d,n} the topic assignment for the n-th word in document d; and w_d are the observed words in document d, with w_{d,n} the n-th observed word in document d. The joint distribution over the hidden and observed variables is then given as follows [10]:

p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) = ∏_{i=1}^{K} p(β_i) ∏_{d=1}^{D} p(θ_d) ( ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}) )

We assume that the topics are computed in advance and that all documents in the corpus share the same set of topics. To accomplish the task of topic identification, LDA makes use of a topic modeling algorithm. These algorithms fall into two broad categories: sampling-based algorithms [14] and variational algorithms [15]. Gibbs sampling is a commonly used sampling algorithm; it makes use of a Markov chain, a sequence of random variables, each dependent on the previous one. Variational algorithms transform the inference problem addressed by sampling algorithms into an optimization problem [10].

Basically, LDA is the basic probabilistic topic model. Relaxing some of its constraints yields various variants of LDA, such as the pachinko topic model and the correlated topic model, which take into account correlations between topics; the spherical topic model, which accounts for words unlikely to occur in a topic; sparse topic models; and, lastly, bursty topic models, a more realistic model of word counts.

IV. SELECTION OF TOPIC MODEL

Given a corpus consisting of n documents, the question that arises is: which topic model should I use? To answer this question, one needs to evaluate the topic models. For evaluation, first dedicate some portion of the corpus as a test set; then fit a variety of topic models to the rest of the corpus and calculate an approximate fit measure for each trained model on the test set; lastly, choose the topic model that gives the best performance [10]. A sketch of this procedure follows.
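A hedged sketch of the procedure, using a randomly generated document-term matrix as a stand-in corpus and held-out perplexity (lower is better) as the approximate fit measure:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: a random document-term count matrix standing in
# for real data (40 documents over a 200-term vocabulary).
X = np.random.default_rng(0).poisson(0.5, size=(40, 200))

# Hold out a portion of the corpus as a test set.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit a variety of models (here: different numbers of topics) on the
# rest of the corpus and score each on the held-out documents.
best_k, best_score = None, float("inf")
for k in (2, 5, 10, 20):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    score = lda.perplexity(X_test)   # approximate fit; lower is better
    if score < best_score:
        best_k, best_score = k, score

print("chosen number of topics:", best_k)
```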

V. CONCLUSIONS

In this paper we covered the basic concept of topic modeling, starting with why there is a need for topic modeling and the problems that gave rise to it. We then looked at different topic models, from the earliest manually built topic models to the latest, LDA and its variants. SVD was the basic matrix decomposition technique used to identify semantic and contextual relationships between words. These models are very helpful in today's information age, where organizing large collections of digitized information is a tedious task. Having a large set of data without knowing what information is in it is of no use; topic models understand our data collections and discover the fruitful hidden themes residing inside them, which helps us make better use of our data sets.

REFERENCES

[1] Megan R. Brett, "Topic Modeling: A Basic Introduction." Available: http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/.
[2] Ingrid Lossius Falkum, "Generativity, Relevance and the Problem of Polysemy."
[3] Gergely Petho, "What is Polysemy? A Survey of Current Research and Results," Department of German Linguistics, University of Debrecen, Hungary.
[4] David M. Blei et al., "Latent Dirichlet Allocation," Journal of Machine Learning Research 3, 2003.
[5] Scott Deerwester et al., "Indexing by Latent Semantic Analysis," Graduate Library School, University of Chicago.
[6] X. Wei and W.B. Croft, "Investigating Retrieval Performance with Manually-Built Topic Models," in Proceedings of RIAO 2007.
[7] K. Sparck Jones, "Automatic Keyword Classification for Information Retrieval," London: Butterworths, 1971.
[8] C.J. van Rijsbergen, "Information Retrieval." Available: http://www.dcs.gla.ac.uk/keith/preface.html.
[9] T.K. Landauer, P.W. Foltz, and D. Laham, "An Introduction to Latent Semantic Analysis," Discourse Processes, 25, 259-284, 1998.
[10] David M. Blei, "Introduction to Probabilistic Topic Models," Princeton University.
[11] Xing Yi and James Allan, "A Comparative Study of Utilizing Topic Models," University of Massachusetts, Amherst, USA.
[12] George W. Furnas, Scott Deerwester, Susan T. Dumais, et al., "Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure."
[13] T. Hofmann, "Probabilistic Latent Semantic Analysis," UAI 1999.
[14] M. Steyvers and T. Griffiths, "Probabilistic Topic Models," in T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning, 2006.
[15] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An Introduction to Variational Methods for Graphical Models," Machine Learning, 1999.