WMES3103 : INFORMATION RETRIEVAL TEXT OPERATIONS

Upload: aleron

Post on 05-Jan-2016


Page 1: WMES3103 : INFORMATION RETRIEVAL

WMES3103 : INFORMATION RETRIEVAL

TEXT OPERATIONS

Page 2: WMES3103 : INFORMATION RETRIEVAL

INTRODUCTION

Not all words in a document are equally significant for representing its content or meaning.

Some words carry more meaning than others.

Nouns or groups of nouns are the most representative of a document's content.

Therefore, the text of each document in a collection needs to be preprocessed before its words are used as index terms.

Page 3: WMES3103 : INFORMATION RETRIEVAL

Using the set of all words in a collection to index documents introduces too much noise for the retrieval task.

Reducing noise = reducing the number of words which can be used to refer to a document.

Preprocessing = the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.

Preprocessing is expected to improve information retrieval performance.

Page 4: WMES3103 : INFORMATION RETRIEVAL

However, some search engines on the Web omit preprocessing.

Every word in the document is an index term.

This is supposed to make the retrieval task simpler and easier for the user.

Page 5: WMES3103 : INFORMATION RETRIEVAL

DOCUMENT PREPROCESSING

Text operations = text transformations

5 main operations :

a. Lexical analysis of the text - digits, hyphens, punctuation marks, and the case of letters

b. Elimination of stop words - filter out words which are not useful in the retrieval process

c. Stemming of the remaining words - remove affixes (prefixes and suffixes)

Page 6: WMES3103 : INFORMATION RETRIEVAL

d. Selection of index terms – choose words/stems (or groups of words) to be used as indexing terms

e. Construction of term categorization structures such as a thesaurus, or extraction of structure directly represented in the text, to allow the expansion of the original query with related terms

a - d = production of a set of good index terms

e = building of categorization hierarchies to capture relationships among terms

Page 7: WMES3103 : INFORMATION RETRIEVAL

LEXICAL ANALYSIS OF TEXT

Lexical analysis changes the text of the documents into words to be adopted as index terms.

Objective - identify the words in the text.

Four cases need attention: digits, hyphens, punctuation marks, and the case of letters.

Numbers - usually not good index terms (eg. 1910, 1999), but some, eg. 510 B.C., are unique.

Hyphens - usually break up the words (eg. state-of-the-art = state of the art), but some words, eg. gilt-edged, B-49, are unique words which require their hyphens.

Punctuation marks - remove totally unless significant, eg. in program code x.id and xid are different.

Case of letters - usually not important; all text can be converted to either upper or lower case.
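The four cases above can be sketched as a small tokenizer. This is only an illustration under stated assumptions: the function name `tokenize` and the exception list `KEEP_HYPHENATED` are hypothetical, all numbers are dropped, and punctuation is always discarded, so the x.id case from the slide is not handled.

```python
import re

# Hypothetical exception list: hyphenated terms that should stay whole.
KEEP_HYPHENATED = {"gilt-edged", "b-49"}

def tokenize(text, keep_hyphenated=KEEP_HYPHENATED):
    """Split text into lowercase candidate index terms."""
    tokens = []
    # A raw token is a run of letters/digits, optionally joined by hyphens.
    # Every other character (punctuation, spaces) acts as a separator.
    for raw in re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text):
        word = raw.lower()  # case of letters: fold everything to lower case
        if word in keep_hyphenated:
            tokens.append(word)  # keep the hyphen for listed exceptions
            continue
        for part in word.split("-"):  # otherwise break up at hyphens
            if part.isdigit():
                continue  # drop pure numbers such as 1910 or 1999
            tokens.append(part)
    return tokens

print(tokenize("State-of-the-art B-49 bombers, built in 1999!"))
```

A real system would need a rule or a list for the numbers and punctuation that should survive, such as dates like 510 B.C. or identifiers in program code.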

Page 8: WMES3103 : INFORMATION RETRIEVAL

ELIMINATION OF STOPWORDS

A word which occurs in 80% of the documents in a collection is useless for retrieval.

Such words are called stopwords and are filtered out as potential index terms (eg. articles, prepositions, conjunctions).

Eliminating stopwords reduces the size of the indexing structure.

The indexing structure can be compressed by 40%.

Some verbs, adverbs and adjectives can also be treated as stopwords.
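As a rough sketch of the 80% rule, the candidate stopwords of a collection can be found by counting document frequency. The function name `frequent_words`, the `threshold` parameter, and the whitespace tokenization are assumptions made for this example only.

```python
def frequent_words(documents, threshold=0.8):
    """Return the words that occur in at least `threshold` of the documents."""
    doc_freq = {}
    for doc in documents:
        # set() counts each word once per document, not once per occurrence.
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    total = len(documents)
    return {w for w, df in doc_freq.items() if df / total >= threshold}

docs = ["the cat sat", "the dog ran", "the bird flew", "the fish swam", "a cow ate"]
print(frequent_words(docs))
```

Here "the" occurs in 4 of the 5 documents, so it is the only word that reaches the threshold.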

Page 9: WMES3103 : INFORMATION RETRIEVAL

425 stopwords are identified in W.B. Frakes and R. Baeza-Yates. Information retrieval : data structures & algorithms. Englewood Cliffs : Prentice Hall, 1992.

Programs in C for lexical analysis are also provided there.

Elimination of stopwords might reduce recall (eg. “To be or not to be” - all eliminated except “be” - no or irrelevant retrieval).
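The recall problem can be seen with a few lines of code. The stop list below is a deliberately tiny, hypothetical one (it is not the 425-word list from the book), and "be" is left out of it only to mirror the example on the slide.

```python
# Hypothetical stop list for illustration; not the published 425-word list.
STOPWORDS = {"a", "an", "and", "in", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stopwords]

query = "to be or not to be".split()
print(remove_stopwords(query))
```

Only "be" survives, so the query no longer identifies the phrase the user was looking for.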

Page 10: WMES3103 : INFORMATION RETRIEVAL

STEMMING

Stem = the portion of a word which is left after the removal of its affixes (i.e. prefixes and suffixes).

Stemming reduces variants of the same root to a common concept.

It reduces the size of the indexing structure because the number of distinct index terms is reduced.

Many Web search engines do not use stemming.
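A minimal sketch of suffix removal is given below. It is a naive suffix stripper written for illustration, not an established stemming algorithm; the suffix list, the `min_stem` parameter, and the name `naive_stem` are assumptions, and prefixes are not handled.

```python
# Hypothetical suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

variants = ["connect", "connected", "connecting", "connects"]
print({naive_stem(w) for w in variants})
```

The four variants collapse into one index term, which is how stemming reduces the number of distinct terms in the index.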

Page 11: WMES3103 : INFORMATION RETRIEVAL

INDEX TERM SELECTION

If a full text representation of the text is adopted, then all words in the text are used as index terms = full text indexing.

Otherwise, the words to be used as index terms need to be selected.

Not all words will be selected.

In bibliographic sciences, the selection is done by a specialist.

The alternative method is automatic selection.
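The difference between full text indexing and indexing with selected terms can be sketched with a simple inverted index. The function name `build_index` and the optional `vocabulary` argument are assumptions for this example; how the vocabulary is chosen (by a specialist or automatically) is outside the sketch.

```python
from collections import defaultdict

def build_index(documents, vocabulary=None):
    """Map each index term to the set of ids of the documents containing it.

    With vocabulary=None every word is an index term (full text indexing);
    otherwise only the words listed in `vocabulary` are indexed.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            if vocabulary is None or word in vocabulary:
                index[word].add(doc_id)
    return dict(index)

docs = ["the retrieval of information", "the compression of text"]
print(sorted(build_index(docs)))
print(sorted(build_index(docs, vocabulary={"retrieval", "compression"})))
```

The second index is much smaller because only the selected terms are kept.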

Page 12: WMES3103 : INFORMATION RETRIEVAL

THESAURI

Consists of :
  a precompiled list of important words in a given discipline
  for each word, a set of related words

Words and concepts

Aim :
  to provide a standard vocabulary for indexing and searching
  to assist users with locating terms for proper query formulation
  to provide classified hierarchies that allow the broadening and narrowing of the current request according to user needs

Page 15: WMES3103 : INFORMATION RETRIEVAL

Main components of a thesaurus - index terms, relationships among terms (BT = broader term, NT = narrower term, RT = related term) and a layout design for the term relationships; sometimes a definition or explanation (eg. seal (animal) and seal (document)).

Controlled vocabulary for indexing and searching - useful for an established body of knowledge with established terms.

Web - thesaurus or free-text searching?
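A thesaurus entry with its BT, NT and RT relationships can be sketched as a nested dictionary, with query broadening and narrowing done by following one kind of relationship. All entries below are invented for illustration, and the names `THESAURUS` and `expand_query` are hypothetical.

```python
# Invented entries for illustration only; a real thesaurus is compiled by specialists.
THESAURUS = {
    "seal (animal)": {"BT": ["mammals"], "NT": ["harp seal"], "RT": ["walrus"]},
    "seal (document)": {"BT": ["authentication marks"], "NT": [], "RT": ["signature"]},
}

def expand_query(term, relations=("NT",)):
    """Return the term followed by the terms linked to it by `relations`."""
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for relation in relations:
        expanded.extend(entry.get(relation, []))
    return expanded

print(expand_query("seal (animal)", relations=("BT",)))  # broaden the request
print(expand_query("seal (animal)", relations=("NT",)))  # narrow the request
```

The qualifiers "(animal)" and "(document)" keep the two meanings of "seal" apart, which is the role of the definition or explanation mentioned above.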

Page 16: WMES3103 : INFORMATION RETRIEVAL

eg. Yahoo - presents the user with a term classification hierarchy that reduces the space to be searched

Page 21: WMES3103 : INFORMATION RETRIEVAL

OTHERS

Document clustering - grouping similar or related documents into classes; it is an operation on all documents in the collection, not an operation on the text of a single document.

Text compression - ways to represent the data in fewer bits and bytes; it greatly reduces the amount of space needed to store text on computers and the time needed to transmit it, and the original text can be reconstructed from the compressed text.
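The two properties of text compression named above (fewer bytes, and exact reconstruction of the original text) can be checked with a short sketch. It uses Python's standard `zlib` module as one possible compressor; the slides do not name a particular method, and the function name `round_trip` is an assumption.

```python
import zlib

def round_trip(text):
    """Compress a string, then decompress it; return both results."""
    original = text.encode("utf-8")
    compressed = zlib.compress(original)
    restored = zlib.decompress(compressed).decode("utf-8")
    return compressed, restored

sample = "information retrieval " * 50
compressed, restored = round_trip(sample)
print(len(sample.encode("utf-8")), "bytes before,", len(compressed), "bytes after")
print(restored == sample)
```

Repetitive text such as this sample compresses very well; the gain on ordinary documents is smaller but the text is still recovered exactly.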