lemur application toolkit kanishka p pathak bioinformatics cis 595


Upload: dayna-carr

Post on 29-Dec-2015


Lemur Application Toolkit

Kanishka P Pathak

Bioinformatics

CIS 595

Introduction

A language model (LM) is a probabilistic mechanism for generating text.

In the past several years, there has been significant interest in the use of language modeling for text and natural language processing tasks.

We now have text information retrieval (IR) based on statistical language modeling.

Previous work

The first statistical modeler was Claude Shannon.

He thought of human language as a statistical source and …

He measured how well simple n-gram models did at predicting and compressing natural text.

For many years, language models were used in speech recognition.

However, basic language modeling ideas have been used in information retrieval for quite some time.

Some of the previous models are:
Naïve Bayes model
Robertson and Sparck Jones (RSJ) model

Their limitations

Naïve Bayes: suffers from the “independence assumptions” it makes

RSJ: relies on the distribution of query terms in “relevant” and “non-relevant” documents

Turning the problem around

Ponte and Croft proposed a smoothed version of the document unigram model to assign a score to a query.

Berger and J. Lafferty built on this model. Their approach:

“predict the input (i.e. the query)”

This opened up new ways to think about information retrieval.
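To make the idea of scoring a query with a smoothed document unigram model concrete, here is a minimal Python sketch. It is an illustration only: the smoothing shown is Jelinek-Mercer (linear interpolation with the collection model), which is an assumption and not necessarily the smoothing used in the work cited above. The function name, the weight `lam`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram model.

    Assumption: Jelinek-Mercer smoothing, i.e. a linear mix of the
    document model and the collection model with weight `lam`.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)

    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term never occurs anywhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection, made up for illustration.
docs = {
    "d1": "language models generate text".split(),
    "d2": "cams for tank fire control".split(),
}
collection = [term for tokens in docs.values() for term in tokens]
query = "language text".split()

# Rank documents by how well each one "predicts" the query.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
```

Without smoothing, any document missing a single query term would receive probability zero; mixing in the collection model avoids that.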

Lemur

‘Lemur’ is a nocturnal, monkey-like African animal largely confined to the island of Madagascar.

The name was chosen partly because of its resemblance to LM/IR.

Secondly, because the LM community was an island to the IR community.

What is the Lemur project?

It is a research project being carried out by the computer science departments at the Univ. of Massachusetts and Carnegie Mellon University.

It is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA).

It is designed to facilitate research in language modeling and information retrieval.

It is written in C/C++ and runs under Unix as well as Windows.

Components and their interaction

The toolkit

The Lemur toolkit is available on the site www-2.cs.cmu.edu/~lemur

To use the toolkit:
download
compile
execute

Example of applications

Pre-processing:
ParseQuery
ParseToFile

Building/Adding Index:
PushIndexer
BuildBasicIndex

Retrieval/Evaluation:
RetEval
StructQueryEval

Summarization:
BasicSummApp
MMRSummApp

What do we need to run an application?

Text documents in a format that Lemur accepts (TREC format)

Parameter file

Document format in Lemur

There are 5 document formats supported by Lemur:
TREC
WEB
CHINESE
CHINESECHAR
ARABIC

Example of a Document format

Say we take the document format “web”:

<DOC>
<DOCNO> any_number_here </DOCNO>
Text here
</DOC>

<DOC>
<DOCNO> any_number_here </DOCNO>
Text here
</DOC>

Example of Document format

<DOC>
<DOCNO> 251 </DOCNO>

Ballistic Cam Design

This paper presents a digital computer program for the rapid calculation of manufacturing data essential to the design of preproduction cams which are utilized in ballistic computers of tank fire control systems. The cam profile generated introduces the superelevation angle required by tank main armament for a particular type ammunition.

CACM November, 1961

Archambault, M.

CA611117 JB March 15, 1978 10:37 PM

</DOC>
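As a rough illustration of how the `<DOC>` / `<DOCNO>` structure shown above can be read, the following Python sketch splits a text into (document number, body) pairs. It is not Lemur's own parser; the regular expression, the function name `parse_trec`, and the second sample document (number 252) are assumptions made for this example.

```python
import re

# One <DOC> block: a <DOCNO> element followed by free text up to </DOC>.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>\s*(.*?)\s*</DOCNO>(.*?)</DOC>",
    re.DOTALL,
)


def parse_trec(text):
    """Return a list of (docno, body) pairs found in `text`."""
    return [(docno, body.strip()) for docno, body in DOC_RE.findall(text)]


# Sample input; document 252 is made up for illustration.
sample = """
<DOC>
<DOCNO> 251 </DOCNO>
Ballistic Cam Design
</DOC>
<DOC>
<DOCNO> 252 </DOCNO>
Text here
</DOC>
"""

parsed = parse_trec(sample)
```

Each source file may hold many such blocks one after another, which is why the sketch returns a list.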

Example of what a parameter file looks like

Say we are creating a parameter file for the application ‘BuildBasicIndex’.

The parameter file needs to have the following contents:

1. inputFile: the path to the source file
2. outputPrefix: a prefix name for your index
3. maxDocuments: maximum number of documents to index (default 1000000)
4. maxMemory: maximum amount of memory to be used for indexing (default 128MB)

Eg:
inputFile=/usr/mydata/source;
outputPrefix=/usr/mydata/index;
maxDocuments=200000;

C:\lemur>BuildBasicIndex c:\lemur\buildpa

The indexed file generated is: /usr/mydata/index.bsc
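The `key=value;` layout in the example above is simple enough to generate from a script. The Python sketch below writes and re-reads such a file; it only mirrors the syntax visible in the slide, so the toolkit's real parsing rules may differ, and the helper names `format_params` and `parse_params` are hypothetical.

```python
def format_params(params):
    """Render a dict as `key=value;` lines, one parameter per line."""
    return "\n".join(f"{key}={value};" for key, value in params.items()) + "\n"


def parse_params(text):
    """Parse `key=value;` lines back into a dict of strings."""
    params = {}
    for line in text.splitlines():
        # Drop surrounding whitespace and the trailing semicolon.
        line = line.strip().rstrip(";").strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


# Values taken from the example in the slide.
params = {
    "inputFile": "/usr/mydata/source",
    "outputPrefix": "/usr/mydata/index",
    "maxDocuments": 200000,
}
text = format_params(params)
```

Writing `text` to a file such as `parambasic.txt` would give a parameter file in the same shape as the example.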

Contd….

Run the application with the parameter file as the only argument

OR

as the first argument, if the application can take other parameters from the command line

Example

C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt

OR

C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt c:\lemur\source.txt

Where,
BuildBasicIndex is the application
parambasic.txt is a parameter file for BuildBasicIndex
source.txt is the file containing the source document
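The same invocation can be scripted. The Python sketch below only assembles the argument list in the order described above (application, parameter file, then any further arguments) and offers a thin wrapper around the standard `subprocess.run`. The helper names are hypothetical, and actually running it assumes the Lemur executables are installed and on the PATH.

```python
import subprocess


def build_command(application, param_file, *extra_args):
    """Return the argument list: application, parameter file, then extras."""
    return [application, param_file, *extra_args]


def run_application(application, param_file, *extra_args):
    """Run the application and return its exit code.

    Assumes the Lemur executable is installed and reachable on the PATH.
    """
    command = build_command(application, param_file, *extra_args)
    return subprocess.run(command).returncode


# Paths taken from the example in the slide.
cmd = build_command(
    "BuildBasicIndex",
    r"c:\lemur\parambasic.txt",
    r"c:\lemur\source.txt",
)
```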

Lemur API

The Lemur API is intended to allow a programmer to use the toolkit for special-purpose applications that are not implemented in the toolkit itself.

The API interfaces are grouped at three different levels:

1. Utility level
2. Indexer level
3. Retrieval level

API levels

Utility level: Includes common utilities such as memory management, default exception handler, program argument handler.

Indexer level: Converts the raw text into efficient data structures so that the information (i.e. word counts) may be accessed conveniently and efficiently later.

Retrieval level: Most useful for users who want to build a prototype system or evaluation system.
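To give a feel for what the indexer level does conceptually, here is a small Python sketch that turns raw text into an inverted index of word counts. It illustrates the general idea only and does not use or reproduce the Lemur API; `build_index` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: count} and record each document's length."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_lengths[doc_id] = len(tokens)
        for term, count in Counter(tokens).items():
            postings[term][doc_id] = count
    return dict(postings), doc_lengths


# Sample documents, made up for illustration.
docs = {
    "251": "ballistic cam design for cam profiles",
    "252": "text here",
}
postings, doc_lengths = build_index(docs)
```

With counts and lengths stored this way, a retrieval model can look up how often a query term occurs in each document without rescanning the raw text.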

Future Developments

Summarizing
Filtering
Question Answering
Language generation

References

www-2.cs.cmu.edu/~lemur

“A language modeling approach to information retrieval” by Jay M. Ponte and W. Bruce Croft (CS, UMass Amherst)

THANK YOU

Any questions?