
Page 1: Title

Vritti Cognitive Search – Discovering concepts and trends in a large body of text

MS Computer Science Project, Final Presentation

Under the guidance of Dr. R. Bhaskaran, Head of Department, School of Mathematics, Madurai Kamaraj University, Madurai

S Gopi, 092504174, Course: MS (Computer Science)

MS Computer Science, Manipal University

Page 2: Table of Contents

1. Project Objective and Goal
2. Methodology and Design
3. Algorithms
4. System Design and Implementation
5. Results
6. Conclusion
7. Proposed Future Work

Page 3: Project Objective

Develop a system named Vritti for extracting concepts and trends from a large body of text.

Enable users to search through a large body of text / documents with ease.

Leverage the keyword-based search framework by augmenting it with text mining algorithms.

Page 4: Project Goal – In Technical Terms

Precision = Relevant Retrieved / Total Retrieved

Recall = Relevant Retrieved / Total Relevant

Traditional search focuses on precision; Vritti focuses on recall. Vritti uses berry picking and the open literature discovery process.
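These two measures can be computed directly from the retrieved and relevant document sets. A minimal sketch in Python (the function name and document ids are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / total retrieved; recall = relevant retrieved / total relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 50 documents retrieved; 35 of them are among the 40 relevant documents.
p, r = precision_recall(range(50), list(range(15, 50)) + list(range(100, 105)))
```

Raising recall means growing the intersection relative to the relevant set, even at the cost of retrieving more documents overall.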

Page 5: Vritti Search vs. Traditional Search

[Diagram: Traditional search loops Query → Search → Results → refine query manually, with precision focused at each iteration. Vritti search loops Query → Search → Results → Concept extraction → Concept interactions → Query refinement, with increased recall at every cycle.]

The first step of text exploration is search, followed by discovering concepts and their associated relationships.

Equipped with these concepts, which present a high-level view of the underlying documents, users should be able to search and infer information from a large body of text with ease.

Page 6: Project Goal

Provide algorithms for increased search effectiveness, in terms of recall, by presenting users with concepts in addition to the documents matching the given query.

Allow users to interact between the search results and the discovered concepts in the form of query expansion or modification.

Page 7: Project Methodology

1. Literature Survey
2. Data Preparation
3. Algorithm Selection / Creation and Validation
4. System Use Cases / Story Boards
5. User Interface Design
6. High Level System Design
7. System Build and Unit Testing
8. System Testing
9. Documentation and Final Write-up

Page 8: Vritti Algorithms

1. Search Result Ranking
2. Keyword Weighting Scheme
3. Unigram Discovery
4. Bigram and Trigram Discovery
5. Collocation Algorithm
6. Association Discovery Algorithm
7. Search Result Clustering – NMF Clustering

Page 9: 1. Search Result Ranking

Search result ranking uses the vector space model (Salton et al., 1975). Every document is represented by a multidimensional vector.

Each component of the vector is a particular keyword in the document.

The value of the component depends on the degree of relationship between the term and the underlying document. Term weighting schemes decide the relationship between the term and the document.

Vector cosine similarity decides document–query or document–document similarity.

Salton, G., Wong, A., & Yang, C. S. (1975, Nov). A vector space model for automatic indexing. Communications of the ACM.

http://lucene.apache.org/java/docs/index.html
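As a minimal sketch of the idea (raw term counts stand in for the weighted vector components, and the whitespace tokenizer is deliberately naive):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[term] * vb[term] for term in va)          # shared-term products
    norm = (math.sqrt(sum(c * c for c in va.values())) *
            math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

sim = cosine_similarity("concept discovery in text", "text concept mining")
```

The same function scores document–query and document–document pairs: both are just vectors in the term space.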

Page 10: 2. Keyword Weighting Scheme

Independence assumptions:

I1 – The distribution of terms in relevant documents is independent, and their distribution in all documents is independent.

I2 – The distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent.

Ordering principles:

O1 – Probable relevance is based only on the presence of query terms in the documents.

O2 – Probable relevance is based on both the presence and absence of query terms in the documents.

Combining the assumptions and ordering principles gives four weighting functions:

         I1    I2
    O1   F1    F2
    O2   F3    F4

F4 = log [ (relevant documents having the term / relevant documents not having the term) ÷ (non-relevant documents having the term / non-relevant documents not having the term) ]

Page 11: Weighting Scheme Notation

    r        Number of relevant documents that contain the term
    n        Number of documents that contain the term
    R        Number of relevant documents in the collection
    N        Number of documents in the collection
    n-r      Number of non-relevant documents that contain the term
    R-r      Number of relevant documents that do not contain the term
    N-n      Number of documents that do not contain the term
    N-R      Number of non-relevant documents
    N-n-R+r  Number of non-relevant documents that do not contain the term

In this notation, F4 = log [ (r / (R-r)) / ((n-r) / (N-n-R+r)) ].
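Using these definitions, the F4 weight can be computed directly. A sketch with the conventional 0.5 smoothing added to keep the logarithm finite (the smoothing constant is an assumption, not stated on the slide):

```python
import math

def f4_weight(r, n, R, N, k=0.5):
    """F4 relevance weight. r: relevant docs with the term, n: docs with the term,
    R: relevant docs in the collection, N: docs in the collection.
    k = 0.5 is the usual smoothing constant (an assumption, not on the slide)."""
    return math.log(((r + k) * (N - n - R + r + k)) /
                    ((n - r + k) * (R - r + k)))

indifferent = f4_weight(r=5, n=50, R=10, N=100)   # term spread evenly: weight near 0
selective = f4_weight(r=8, n=20, R=10, N=100)     # term concentrated in relevant docs
```

A term that occurs at the same rate in relevant and non-relevant documents scores near zero; a term concentrated in the relevant set gets a large positive weight.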

Page 12: 3. Unigram Discovery

Flow: Search → Candidate Unigrams → Filters → Weighting

1. Search the document collection for the given query.
2. Extract terms of length one (unigrams) from the search result.
3. The terms should have a minimum frequency count; Vritti uses 3 as the threshold.
4. The terms should be alphanumeric.
5. The terms should not be English stop words.
6. Apply the weighting scheme on these terms; a weight is derived for each term.
7. Select the top N terms based on a threshold.
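The seven steps above can be sketched as a small pipeline. The stop word list is a tiny illustrative stand-in, and plain frequency stands in for Vritti's weighting scheme:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "of", "and", "for", "to", "is"}  # illustrative only

def discover_unigrams(search_results, min_count=3, top_n=10):
    """Steps 2-7: extract unigrams, filter by frequency / alphanumeric / stop words,
    then rank. Plain frequency stands in for Vritti's weighting scheme here."""
    counts = Counter(
        token for doc in search_results
        for token in re.findall(r"[a-z0-9]+", doc.lower())   # alphanumeric terms only
    )
    candidates = Counter({
        term: count for term, count in counts.items()
        if count >= min_count and term not in STOP_WORDS
    })
    return [term for term, _ in candidates.most_common(top_n)]

top = discover_unigrams([
    "gene expression in the cell",
    "gene expression data for the gene study",
    "expression data and data quality",
])
```

With a real weighting scheme, step 6 would replace the raw counts before the top-N cut.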

Page 13: 4. Bigram and Trigram Discovery

Flow (similar to unigrams): Search → Candidate Bigrams / Trigrams → Apply Collocation → Weighting

Say a search yields M documents. If we split these documents into n (non-unique) words, we can eventually have nC2 bigrams and nC3 trigrams. It is computationally infeasible to process all these bigrams, as most of them will not make any sense.

Hence we apply a collocation algorithm to extract only meaningful bigrams and trigrams.

Page 14: 5. Collocation Algorithm

Likelihood ratios are used for collocation. Given two hypotheses, the likelihood ratio is a number that tells how much more likely one hypothesis is than the other.

Hypothesis 1 is a formalization of independence: the occurrence of w2 is independent of the previous occurrence of w1, i.e. P(w2 | w1) = P(w2 | ¬w1) = p.

Hypothesis 2 is a formalization of dependence: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1). It serves as good evidence for an interesting collocation.

Let c1, c2 and c12 be the number of occurrences of w1, w2 and w1w2 in the document collection (taking N as the total number of tokens). We can derive p, p1 and p2 as

    p = c2 / N,   p1 = c12 / c1,   p2 = (c2 - c12) / (N - c1)

Page 15: Collocation Likelihood Ratio

Assuming a binomial distribution of words, b(k; n, x) = C(n, k) x^k (1 - x)^(n - k),

the likelihood of getting the counts for w1, w2 and w1w2 that we actually observe is

    L(H1) = b(c12; c1, p) · b(c2 - c12; N - c1, p)
    L(H2) = b(c12; c1, p1) · b(c2 - c12; N - c1, p2)

Given the above, the log likelihood ratio is defined as log λ = log [ L(H1) / L(H2) ].

The bigrams are finally ranked based on their likelihood ratios, and the top N among them are selected.
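A sketch of the ranking statistic, following the Manning & Schütze formulation (the counts below are illustrative; the binomial coefficients cancel in the ratio, so only the log-likelihood terms are needed):

```python
import math

def log_l(k, n, x):
    """Log-likelihood of k successes in n binomial trials with success rate x."""
    x = min(max(x, 1e-12), 1 - 1e-12)   # guard against log(0)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def likelihood_ratio(c1, c2, c12, N):
    """-2 log(lambda) for the bigram w1 w2: higher means stronger collocation."""
    p, p1, p2 = c2 / N, c12 / c1, (c2 - c12) / (N - c1)
    return -2 * (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
                 - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))

# 'strong' co-occurs far above chance (expected c12 ~ 1.2); 'weak' is near chance.
strong = likelihood_ratio(c1=300, c2=400, c12=100, N=100000)
weak = likelihood_ratio(c1=300, c2=400, c12=2, N=100000)
```

Ranking bigrams by this score and keeping the top N realizes the final step on the slide.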

Page 16: 6. Association Discovery

Given a term-document matrix A:

1. Compute the transpose of A.
2. Compute the co-weight matrix B by multiplying AT by A.
3. Compute matrix C by transforming co-weights into pairwise similarities using Jaccard's coefficient.
4. Transform C into a row-normalized matrix D by converting row vectors into unit vectors.
5. Compute the transpose of D by changing rows into columns and columns into rows.
6. Compute the cosine similarity matrix E by multiplying DT with D.

Since row i of E represents the neighborhood of term i, for a given row the nearest neighbor of term i is a term other than itself with the largest similarity value. Thus, for every term, the associated terms are discovered in Vritti.
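The six steps can be sketched with NumPy. Here A is docs × terms, so Aᵀ·A is the term-term co-weight matrix; because rows hold terms, the slide's Dᵀ·D step appears as D·Dᵀ in this orientation. The toy matrix is illustrative and assumes no all-zero terms:

```python
import numpy as np

def nearest_associations(A):
    """Steps 1-6: co-weights -> Jaccard similarities -> unit rows -> cosine matrix."""
    B = A.T @ A                                        # co-weight matrix (terms x terms)
    diag = np.diag(B)
    C = B / (diag[:, None] + diag[None, :] - B)        # Jaccard's coefficient, pairwise
    D = C / np.linalg.norm(C, axis=1, keepdims=True)   # row vectors -> unit vectors
    E = D @ D.T                                        # cosine similarity of term rows
    np.fill_diagonal(E, -np.inf)                       # nearest neighbor must differ from the term
    return E.argmax(axis=1)                            # associated term for each term

# Terms 0 and 1 always co-occur; term 2 appears alone.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
nn = nearest_associations(A)
```

For the toy matrix, terms 0 and 1 pick each other as nearest associated terms, as expected.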

Page 17: 7. Search Result Clustering – NMF

The generic non-negative matrix factorization problem can be stated as follows: given a non-negative matrix A ∈ R^(m×n) and a positive integer k < min{m, n}, find non-negative matrices W ∈ R^(m×k) and H ∈ R^(k×n) that minimize || A - WH ||.

The multiplicative update algorithm is as follows:

    W = rand(m, k)    % initialize W as a random dense matrix
    H = rand(k, n)    % initialize H as a random dense matrix
    for i = 1 : max_iterations
        H = H .* (W'A) ./ (W'WH)
        W = W .* (AH') ./ (WHH')

Pipeline: Input text → term-document matrix of index terms with TF-IDF weighting → NMF clustering → weight and feature matrices:

    A(docs × terms) = W(docs × features) × H(features × terms)
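The multiplicative updates translate almost line-for-line into NumPy. A sketch (the small epsilon guarding against division by zero and the iteration count are assumptions):

```python
import numpy as np

def nmf(A, k, iters=1000, seed=0, eps=1e-9):
    """Multiplicative-update NMF: A (m x n) ~= W (m x k) @ H (k x n), all non-negative."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))        # initialize W as a random dense matrix
    H = rng.random((k, n))        # initialize H as a random dense matrix
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Rank-2 non-negative matrix; the factorization should reconstruct it closely.
A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0]) + np.outer([0.0, 1.0, 2.0], [0.0, 1.0, 1.0])
W, H = nmf(A, k=2)
error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

The updates preserve non-negativity because each factor is only ever multiplied by non-negative ratios, which is what makes the rows of H usable as concept candidates.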

Page 18: High Level Design – Technology

Python 2.7
Web.py
PyLucene
XAMPP / Apache HTTP
NLTK (Natural Language Toolkit)

Page 19: Indexing Module; Search and Ranking Module

[Diagrams: indexing module, search and ranking module.]

Page 20: Text Mining Module

[Diagram: text mining module.]

Page 21: Workflow Module

The workflow module follows the chain-of-command design pattern.

The Command class is the basic processing unit.

The Chain class links all the command classes together. When a list of command objects is passed to the chain object, they are executed in serial fashion.

[Diagram: a Chain links Search, Unigram Discovery, Weight Assignment and Association commands, which share a common Context.]
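A minimal sketch of the pattern. The Search and UnigramDiscovery commands and the dict-based context are illustrative, not Vritti's actual classes:

```python
class Command:
    """Basic processing unit: each command transforms a shared context."""
    def execute(self, context):
        raise NotImplementedError

class Chain:
    """Links command objects together and executes them serially over one context."""
    def __init__(self, commands):
        self.commands = list(commands)

    def run(self, context):
        for command in self.commands:
            command.execute(context)
        return context

class Search(Command):
    def execute(self, ctx):
        ctx["results"] = [doc for doc in ctx["corpus"] if ctx["query"] in doc]

class UnigramDiscovery(Command):
    def execute(self, ctx):
        ctx["unigrams"] = sorted({w for doc in ctx["results"] for w in doc.split()})

ctx = Chain([Search(), UnigramDiscovery()]).run(
    {"corpus": ["gene expression data", "protein folding"], "query": "gene"})
```

Because every command reads and writes the same context, new processing steps can be added to the chain without changing the existing ones.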

Page 22: Deployment Overview

[Diagram: deployment components – web browser, XAMPP / Apache HTTP, Python Web.py application (port 8080), PyLucene.]

Page 23: User Interface

[Screenshots: landing page, search screen.]

Page 24: Unigram Discovery and Association Analysis

[Screenshots: unigram discovery, association analysis.]

Page 25: Search Result Clustering

[Screenshot: search result clustering.]

Page 26: Inverted Index

Data Source
◦ For building and testing Vritti we use the National Science Foundation (NSF) Research Award Abstracts 1990-2003 data set. This dataset contains 129,000 abstracts describing NSF awards for basic research.

Index Creation
◦ Ingested data are stored as inverted indices for faster search performance.
◦ Apache Lucene is used for storing the inverted index.
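The idea behind an inverted index is a postings list per term: instead of scanning documents for a term, look the term up and get the documents directly. A toy sketch (Lucene's on-disk format is far more involved):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(["NSF research award", "research abstract"])
```

A query then becomes set operations over postings lists, which is what makes search fast even over 129,000 abstracts.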

Page 27: Inverted Index Data Dictionary

[Table: inverted index data dictionary.]

Page 28: Results

    Keyword     | Total Documents Retrieved | Relevant Documents | Relevant Documents Retrieved | Precision | Recall
    Awards      | 50 | 35 | 32 | 70% | 91%
    Oviposition | 50 | 33 | 30 | 66% | 90%
    Hydrography | 50 | 20 | 18 | 40% | 90%
    Isoelectric | 50 | 27 | 24 | 54% | 88%

• Vritti performs consistently on the recall measure.

• The aim of Vritti is to have a good recall rate without worrying about precision. However, if we remove the cap on the number of documents returned for a search result, the precision measure will also increase considerably.

Page 29: Conclusion

By focusing on recall and providing users with sophisticated text mining and query expansion capabilities, Vritti carves a niche for itself within the information retrieval systems available today. In addition to being a stand-alone system, Vritti can also serve as a platform for text mining professionals to jump-start their analysis. Vritti can be expanded by augmenting a range of other technologies, including:

• Document polarity discovery
• Text sentiment analysis
• Markov chain models for automatic sentence construction
• Language models for spell check and query expansion

and many others. Vritti project documents and source code have been uploaded to the Google Code project at http://code.google.com/p/vritti/. Under the Apache License 2.0, Vritti is now open source software, allowing students, researchers and fellow programmers to use, develop and maintain Vritti going forward.

Page 30: Vritti Commercial Applications

• CRM – analyze customer responses
• Ticketing systems – mine for frequently occurring problems / themes
• Stock exchange trade chats – find suspicious transactions
• Social network applications – understand discussions among members

Page 31: Thank You

Page 32: Backup Slides

Page 33: Concepts and Trends

We define a concept as a word or a phrase that describes a meaningful subject within a particular field.

Vritti discovers concepts within the context of the corpus under consideration.

Trends are defined as concepts recurring across multiple documents inside the corpus.

Page 34: Vritti Text Exploration

The first step of text exploration is search, followed by discovering concepts and their associated relationships.

Equipped with these concepts, which present a high-level view of the underlying documents, users should be able to search and infer information from a large body of text with ease.

Page 35: Motivation

Subsumption – a learner, supported by an appropriate environment, shall be able to attach a new concept to those existent inside his or her cognitive structure.

Vritti aims to apply the same idea to searching and text exploration: make search a more natural phenomenon by enhancing the search experience of the information seeker.

Joseph D. Novak & Alberto J. Cañas, Florida Institute for Human and Machine Cognition, Technical Report IHMC CmapTools 2006-01 Rev 2008-01.

Page 36: Literature Survey

Text exploration
◦ Literature Based Discovery (LBD)
◦ Berry picking

IR models and weighting schemes
◦ Vector space models
◦ Term weighting schemes
◦ Search ranking schemes

Concept definition and discovery
◦ Word space models
◦ Random projections
◦ Document clustering (Lingo, Non-negative Matrix Factorization)
◦ Scalar clustering

Page 37: Literature Based Discovery (LBD)

Concept discovery in text was hugely popularized by the work of Dr. Swanson in trying to identify the relationship between fish oil and Raynaud's syndrome.

The focus of Dr. Swanson's work was to identify concepts and their relationships in bibliographic databases. His technique is known as Literature Based Discovery (LBD), and he defines it as a process of finding complementary structures in disjoint science literatures.

Janneck, M. C. (2006). Recent Advances in Literature Based Discovery. Journal of the American Society for Information Science and Technology, JASIST.

Page 38: LBD Open Discovery Process

[Diagram: the LBD open discovery process.]

Page 39: Berry Picking

Why is it necessary for the searcher to find a way to represent the information need in a query understandable by the system?

Why not have the system make it possible for searchers to express the need directly, as they ordinarily would, instead of in an artificial query representation for the system's consumption?

Berry picking challenges the current keyword search methodology in four areas:

1. Nature of the query
2. Nature of the overall search process
3. Range of search techniques used
4. Information domain or territory where the search is conducted

Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online Review, 407-424.

Page 40: Traditional Search vs. Berry Picking

[Diagram: traditional search compared with berry picking.]

Page 41: IR Models and Weighting Schemes

The central premise of any information retrieval system is to identify relevant and irrelevant documents for a given query.

Systems perform this relevance judgement using a ranking algorithm. Ranking algorithms use index terms. An index term is simply a word whose semantics helps in remembering the document's main theme.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press.

Page 42: Vector Space Model

In the vector space model (Salton et al., 1975), every document is represented by a multidimensional vector.

Each component of the vector is a particular keyword in the document.

The value of the component depends on the degree of relationship between the term and the underlying document. Term weighting schemes decide the relationship between the term and the document.

Vector cosine similarity decides document–query or document–document similarity.

Salton, G., Wong, A., & Yang, C. S. (1975, Nov). A vector space model for automatic indexing. Communications of the ACM.

Page 43: IR Model Math Schemes

Several mathematical schemes, based on the type of IR model, have been developed to identify index terms.
◦ Spärck Jones developed IDF, the inverse document frequency weighting.
◦ Probabilistic IDF, called IDFP, was developed by Robertson.
◦ All the above-mentioned weighting schemes decide the weight of a term based on its presence in the document.

Robertson, S. (2004). Understanding Inverse Document Frequency: On Theoretical Arguments for IDF. Journal of Documentation, 503-520.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 11-21.

Page 44: Term Weighting

Binary: in the simplest case, the association is binary: aij = 1 when keyword i occurs in document j, aij = 0 otherwise.

Term frequency: aij = tfij, where tfij denotes how many times term i occurs in document j.

TF-IDF: aij = tfij · log(N / dfi), where dfi denotes the number of documents in which term i appears and N represents the total number of documents in the collection.

Manning, C. D., Raghavan, P., & Schütze, H. Introduction to Information Retrieval. Cambridge University Press.
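The three schemes differ only in how a_ij is computed. A sketch of the TF-IDF variant with naive whitespace tokenization:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight a_ij = tf_ij * log(N / df_i) for each term i in document j."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))  # document frequency
    return [{term: tf * math.log(N / df[term]) for term, tf in Counter(tokens).items()}
            for tokens in tokenized]

weights = tfidf(["a b a", "b c"])
```

Note that a term appearing in every document gets log(N/N) = 0: the IDF factor deliberately discounts terms that cannot discriminate between documents.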

Page 45: Search Ranking Schemes

A combination of the Vector Space Model and the Boolean model determines how relevant a given document is to a user's query.

The Boolean model is used first to narrow down the documents that need to be scored, based on the use of Boolean logic in the query specification.

The more times a query term appears in a document, relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.

The score of query q for document d correlates to the cosine distance or dot product between the document and query vectors in a Vector Space Model (VSM). A document whose vector is closer to the query vector in that model is scored higher.

Apache Lucene, Search Ranking Scheme

Page 46: Concept Definition and Discovery

A concept is a word or a phrase that describes a meaningful subject within a particular field.

Principal orthogonal vectors in the VSM are good concept candidates.

Non-Poisson-distributed words or co-occurring words are good concept candidates.

Srinivasan, P. (1992). Thesaurus Construction. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval: Data Structures & Algorithms (pp. 161-218). Englewood Cliffs: Prentice Hall.

Page 47: Word Space Models

VSMs treat words as indicators of content; there is no exact mapping from words to concepts.

In a word space model, a high-dimensional vector space is produced by collecting the data in a co-occurrence matrix F, such that each row Fw represents a unique word w and each column Fc represents a context c, typically a multi-word segment such as a document, or a word.
◦ Latent Semantic Analysis (LSA) is an example of a word space model that uses document-based co-occurrences.
◦ Hyperspace Analogue to Language (HAL) is an example of a model that uses word-based co-occurrences.

Asoh, L. S. (2001). Computing with Large Random Patterns.

Page 48: Random Projections

Accumulate context vectors based on the occurrence of words in context.

This is a two-step operation:
◦ First, each context (e.g. each document or each word) in the data is assigned a unique and randomly generated representation called an index vector. These index vectors are sparse, high-dimensional and ternary: their dimensionality is on the order of thousands, and they consist of a small number of randomly distributed +1s and -1s, with the rest of the elements of the vector set to 0.
◦ Then, context vectors are produced by scanning through the text; each time a word occurs in a context, that context's d-dimensional index vector is added to the context vector for the word in question. Words are thus represented by d-dimensional context vectors that are effectively the sum of the words' contexts.

Kanerva, P. (1988). Sparse Distributed Memory. The MIT Press.
Sahlgren, M. (2005). An Introduction to Random Indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005.
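A sketch of the two-step operation, using documents as contexts. The dimensionality, sparsity and toy corpus are illustrative choices, not Vritti's actual parameters:

```python
import numpy as np

def random_indexing(docs, d=512, nonzeros=8, seed=0):
    """Step 1: give each document a sparse ternary index vector (+1/-1 entries).
    Step 2: a word's context vector is the sum of index vectors of the docs it occurs in."""
    rng = np.random.default_rng(seed)
    index_vectors = []
    for _ in docs:
        vector = np.zeros(d)
        positions = rng.choice(d, size=nonzeros, replace=False)
        vector[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
        index_vectors.append(vector)
    context = {}
    for index_vector, doc in zip(index_vectors, docs):
        for word in set(doc.lower().split()):
            context[word] = context.get(word, np.zeros(d)) + index_vector
    return context

ctx = random_indexing(["gene expression", "gene regulation", "stock market"])
```

Words sharing contexts ("gene" and "expression") end up with overlapping context vectors, while unrelated words ("market") stay nearly orthogonal, because random sparse vectors in high dimensions are nearly orthogonal to each other.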

Page 49: Random Projections (continued)

Every word / document is represented as a vector, which is the sum of all the corresponding context vectors.

Searching for a word can be performed at a context / concept level.

The method is incremental: context vectors can be used for similarity computations even after only a few examples.

The dimensionality d does not change. New examples do not change d; hence the method is scalable to large data sets.

Page 50: Document Clustering

Documents tend to cluster around the underlying concepts they represent.

Clustering search results is a way of discovering concepts in a document corpus.

Vritti implements two document clustering algorithms:
◦ Lingo
◦ Non-negative Matrix Factorization

Page 51: Lingo Algorithm

Lingo combines frequent phrase discovery and latent semantic indexing to separate documents into meaningful groups.

Pipeline:
1. Input text.
2. Extract frequent phrases – phrases that appear a specific number of times, do not cross sentence boundaries, and neither begin nor end with a stop word.
3. Build a term-document matrix of index terms with TF-IDF weighting.
4. Apply singular value decomposition, A = USVt, taking k singular values to obtain an abstract-concepts-versus-terms matrix.
5. For every frequent phrase, form a column vector over the term space, giving a terms-versus-frequent-phrases matrix.
6. Multiply to obtain a concepts-versus-frequent-phrases matrix.
7. Apply the VSM to find the documents matching each frequent phrase for clustering.

Osiński, S., & Weiss, D. (2005). A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems.

Page 52: NMF Clustering

Unsupervised learning algorithms such as principal component analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties.

In Vritti, we leverage this idea to factorize a term-document matrix to find the underlying semantic representation. Similar to SVD, NMF tries to find low-dimensional representations that can be good candidates for concepts.

Page 53: Scalar Clustering

1. From the input text, create a term-document matrix A.
2. Multiply A with transpose(A) to get the term-term matrix B.
3. Compute Jaccard's coefficient for each entry of B, P(A ∩ B) / (P(A) + P(B) - P(A ∩ B)), giving matrix C.
4. Make C unit-normal and multiply C and transpose(C) to get cosine distances; the new matrix is D.
5. Matrix D is a term-term matrix holding an analogy score for each term and its associated terms.

Scalar clustering will be used to find analogous words for every word.

Page 54: 2. Data Preparation

Data Source
◦ For building and testing Vritti we will use the National Science Foundation (NSF) Research Award Abstracts 1990-2003 data set. This dataset contains 129,000 abstracts describing NSF awards for basic research.

Index Creation
◦ Ingested data will be stored as inverted indices for faster search performance.
◦ Apache Lucene will be used for storing the inverted index.

Page 55: Inverted Index Data Dictionary

[Table: inverted index data dictionary.]

Page 56: 2. Data Preparation (continued)

Random projection will be implemented using the Semantic Vectors package, a project by the University of Pittsburgh Office of Technology Management.

Word statistics will be stored in an in-memory database called Redis.

Page 57: 3. Algorithm Selection and Validation

Term weighting – TF-IDF
Search ranking – VSM and Boolean model
Concept extraction
◦ Words ranked by IDF
◦ Words not following a Poisson distribution
◦ Word co-occurrence patterns for key phrase extraction
◦ First-order, second-order and third-order word associations
◦ Scalar clustering for word analogue extraction
◦ Random projection index vectors for concept searching
◦ Lingo document clustering
◦ NMF clustering

Page 58: 4. System Use Cases and Story Boards

Story board 1 – Data Loading
◦ Users can load documents, a POP3 account, or a URL. Vritti will create the inverted index, random projections and word statistics of the corpus.

Story board 2 – High Level View of Corpus
◦ Keyword and key phrase display.
◦ For every keyword and key phrase, associated words up to the third level of association will be displayed.
◦ The user can start a concept search or a keyword search based on these keywords.

Page 59: Story Boards (continued)

Story board 3 – Search
◦ Search displays the top N matching documents.
◦ For each search, the top N words which distinguish the search result from the rest of the corpus are displayed. Associations and analogues of these words can be viewed.
◦ Based on these words, the user can refine the search query and search again. Vritti will append the selected keywords to the search string.

Story board 4 – Search Result Clustering
◦ Cluster the search result; the user can select either the Lingo or the NMF clustering algorithm.
◦ Every cluster will be labeled with a theme extracted from the documents under the cluster.
◦ Associations and analogues for these cluster labels can be found.
◦ Users can continue searching by modifying their query based on these cluster labels.

Page 60: 5. User Interface Design

Page 61: Landing Page Wireframe

Vritti – Mining concepts from a large body of text

Navigation: Start | Setup | Search | Analyze | Themes

Page 62: Setup Screen Wireframe

Navigation: Start | Setup | Search | Analyze | Themes

1. Select data source: directory / file, URL, or POP3
2. Stop words
3. Advanced parameters

Page 63: Wireframe – Phrases and Words

Navigation: Start | Setup | Search | Analyze | Themes

Top N phrases
Top N words | Associated words | Analogues

Page 64: Search Screen Wireframe

Navigation: Start | Setup | Search | Analyze | Themes

Search box; search results displayed here.

Page 65: Themes Screen Wireframe

Navigation: Start | Setup | Search | Analyze | Themes

Select algorithm; discovered themes are displayed here. For a selected theme, the associated themes and the strength of association are displayed here.