wmes3103 information retrieval week 1 and 2. what is information retrieval? information retrieval...
Post on 21-Dec-2015
WMES3103
INFORMATION RETRIEVAL
WEEK 1 AND 2
WHAT IS INFORMATION RETRIEVAL?
Information Retrieval (IR)
Lancaster (1968):
An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.
• IR: the process of getting/retrieving information
• Now: a lot of information, both print and electronic
• Requirement: obtain information quickly and accurately
• IR aims to provide fast, effective and efficient methods of representing, managing, searching, retrieving and presenting such information
• IR = the representation, storage and organization of, and access to, information items
Computer science perspective
• Design and build a large-scale system that will store, manipulate, retrieve and display electronic information of any kind
• Text, audio, image and graphics are stored in such a way that they are available for interaction with human or machine
Library and information perspectives
• Search features: author (au), title (ti), subject (su), keywords
• Relevance of retrieved items
Examples of IRS
3 challenges for IR researchers and practitioners
• Technical challenge: what tools should IR systems provide to allow effective and efficient manipulation of information within such diverse media as text, image, video and audio?
• Interaction challenge: what features should IR systems provide in order to support a wide variety of users in their search for relevant information?
• Evaluation challenge: how can we evaluate which tools and features are effective and usable, given the increasing diversity of end-users and information-seeking situations?
3 basic areas of research
• Content analysis: describing the contents of the documents in a form suitable for computer processing
• Information structures: exploiting relationships between documents to improve the efficiency and effectiveness of retrieval strategies
• Evaluation: measurement of the effectiveness of retrieval
Information Retrieval System
• Information Retrieval System = IRS
• Before: index documents and retrieve, e.g. the OPAC of a library (cataloguing)
• Now: modelling, document classification and categorization, system architecture, user interface, data visualization, filtering, languages, e.g. the WWW
Basic Information Retrieval Process
1. Question OR full description of the user's information needs
2. Translate into a query OR keywords which summarize the description of the user's information needs
3. Query processed by a search engine or IRS
4. IRS retrieves information which is useful/relevant to the user
Basic Concepts in Information Retrieval
• User Task
• Logical View of documents
User Task
• A user has to translate his information needs into a query in the language provided by the system
• Specify a set of words
• English language statement: I want a book by J. K. Rowling titled The Chamber of Secrets
• Query entered in a computer system:
    Au = Rowling
    Ti = Chamber of Secrets
    “Chamber of Secrets”
    Rowling AND Stone
    Au rowling ti chamber of secrets ti stone
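
The fielded queries above (Au = ..., Ti = ...) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the `parse_query` and `search` helper names, and the `;` separator between fields are not taken from any particular OPAC.

```python
# Illustrative sample records only; the field names "au" and "ti" follow the
# author/title search features mentioned in the slides.
CATALOGUE = [
    {"au": "Rowling, J. K.", "ti": "Harry Potter and the Chamber of Secrets"},
    {"au": "Rowling, J. K.", "ti": "Harry Potter and the Philosopher's Stone"},
    {"au": "Lancaster, F. W.", "ti": "Information Retrieval Systems"},
]


def parse_query(query):
    """Split 'Au = Rowling; Ti = Stone' into {'au': 'rowling', 'ti': 'stone'}.

    The ';' separator is an assumption of this sketch, not a standard syntax.
    """
    fields = {}
    for part in query.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            fields[name.strip().lower()] = value.strip().lower()
    return fields


def search(query, records=CATALOGUE):
    """Return the records whose fields contain every value given in the query."""
    fields = parse_query(query)
    if not fields:
        return []  # nothing recognisable was asked for
    return [
        rec for rec in records
        if all(value in rec.get(name, "").lower() for name, value in fields.items())
    ]


for rec in search("Au = Rowling; Ti = Chamber of Secrets"):
    print(rec["ti"])
```

The point of the sketch is the translation step: the user's English statement has to be broken into field/value pairs before the system can match it against the stored records.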
2 User Tasks
• 2 user tasks: browsing and retrieval
• Browsing: the process of retrieving info whereby the main objective is not clearly defined from the beginning and whose purpose might change during the interaction with the system
    E.g. a user searches the internet for info about marine organisms, then looks for info about Australian aborigines: the user is said to be browsing in the collection and not searching
    E.g. searching for a book on the library shelves
• Retrieval: the process of retrieving info whereby the main objective is clearly defined from the onset of the searching process, e.g. searching for a book on the library shelves
2 actions when a user interacts with an IRS
• 2 actions can be identified when a user interacts with an IRSYS: pulling and pushing actions
• Pulling action: the user requests info in an interactive way, e.g. browsing and retrieval
• Pushing action: info is pushed towards the user periodically through the use of specified or specially designed software; also known as filtering
    E.g. Yahoo Messenger service: alerts the user each time a new message arrives
    E.g. online stock exchange
Interaction of the user with IRSYS through distinct tasks
[Figure: diagram with the labels USER, IR, Browsing and DB]
Logical View of Documents
• Documents in a collection are represented by a set of index terms or keywords
• Keywords
• Abstract
• Full text
Logical View of Documents
[Figure: Documents → Indexing process (index terms assigned by humans or extracted from the text of the document) → Keywords/subject headings = logical view of the document]
LISANET – search by abstract
MJLIS – E-Journal
If full text:
• Each word in the text is a keyword
• Most complex form
• Expensive
• If the full text is too large, there are mechanisms built into the IRS to reduce the number of keywords:
Logical view of documents (continued)
1. Stop words (e.g. articles and connectives: a, the, an, and, of, etc.)
2. Stemming (reduce distinct words to their common grammatical root), e.g. diary** will find diary or diaries
3. Truncation, e.g. catalog* will retrieve catalog, catalogs, catalogue, catalogues
4. Noun words (eliminates adjectives, adverbs, verbs), e.g. run will represent runs, running
5. Compression
Conversion Process
Logical view of documents (continued)
• This conversion process is known as text operation or transformation
• It reduces the complexity of the document representation and allows the logical view to move from that of a full text to that of a set of index terms
• On the other hand, human-assigned keywords provide the most concise logical view of a document but might lead to retrieval of poor quality: different interpretations, limited keywords if using a thesaurus
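
As a rough illustration of the text operations listed above, the following Python sketch turns a piece of full text into a reduced set of index terms. It is only a sketch under stated assumptions: the stop-word list is a tiny made-up sample, `naive_stem` is a crude plural stripper standing in for a real stemming algorithm, and the function names are hypothetical.

```python
import re

# Small illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}


def naive_stem(word):
    """Crude plural stripping, used here as a stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # diaries -> diary
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # catalogues -> catalogue
    return word


def index_terms(text):
    """Full text -> sorted set of index terms (the logical view)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({naive_stem(t) for t in tokens if t not in STOP_WORDS})


def truncation_match(pattern, terms):
    """Truncation: 'catalog*' matches every term that starts with 'catalog'."""
    prefix = pattern.rstrip("*")
    return [t for t in terms if t.startswith(prefix)]


print(index_terms("The diaries of a cataloguer and the catalogues of the library"))
print(truncation_match("catalog*", ["catalog", "catalogs", "catalogue", "cat"]))
```

Eleven words of full text shrink to four index terms here, which is the reduction in complexity the slide describes.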
2 modes of retrieval
• Ad hoc: the documents in the IRS remain static but new queries are submitted to the system, e.g. a CD-ROM database
• Filtering: the queries remain relatively static but new documents come into the IRS, e.g. the stock market
Filtering
• Construct a user profile that reflects the user's preferences; the profile is matched against incoming documents to find a match or a hit
• Retrieve only documents of interest to the user, as specified in the user profile
• The user selects relevant documents from the list
• Filtered documents can also be ranked to further assist the user as to relevance
• Construction of a user profile: the user provides the necessary keywords, or the system collects info about preferences from the user and uses this to construct a user profile dynamically
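
The filtering idea above can be sketched in a few lines of Python. This is a minimal sketch, assuming the user profile is just a set of keywords and that a "hit" means at least `min_hits` profile keywords occur in an incoming document; the function names and the sample documents are made up for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def filter_stream(profile_keywords, incoming_docs, min_hits=1):
    """Match a static user profile against new documents.

    Returns the matching documents, ranked by how many profile
    keywords each one contains (highest first).
    """
    profile = {k.lower() for k in profile_keywords}
    hits = []
    for doc in incoming_docs:
        score = len(profile & tokenize(doc))
        if score >= min_hits:
            hits.append((score, doc))
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep arrival order
    return [doc for _, doc in hits]


profile = ["stock", "market", "shares"]
incoming = [
    "Market update for today",
    "New library catalogue released",
    "Stock market closes higher as shares rally",
]
for doc in filter_stream(profile, incoming):
    print(doc)
```

Note the contrast with ad hoc retrieval: here the profile (the query) stays fixed while the list of incoming documents keeps changing.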
INFORMATION RETRIEVAL PROCESS
A. DEFINE TEXT DATABASE
• The text database has to be defined before the retrieval process begins
• Done by the database manager: documents to be used, operations to be performed on the text, text model
• The original documents are transformed into a logical view of the documents via the various text operations
• The database manager will then build up the index of the text, manually or computer generated
• The retrieval system is tested
B. RETRIEVAL PROCESS
• The IRS can be used once the document database has been indexed
• The user puts or presents his question/user need to the IRS
• The question is changed to a logical view of the document via the text operations
• The query operation will present this to the system in a form understandable by the system
• The query is processed to obtain the retrieved documents
• The retrieved documents are ranked according to relevance
• The retrieved documents are sent to the user
• The user looks through the ranked documents and can modify the question/user need/query via the user feedback cycle
• The same process is repeated
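
The retrieval steps above, including the feedback cycle, can be sketched as follows. This is an illustrative Python sketch, not a real ranking algorithm: relevance is assumed to be simply the number of query terms a document shares with the query, and the documents and function names are invented for the example.

```python
import re


def terms(text):
    """The same text operation is applied to documents and to the query."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rank(query, docs):
    """Return the documents sharing at least one term with the query, best match first."""
    query_terms = terms(query)
    scored = [(len(query_terms & terms(doc)), doc) for doc in docs]
    matched = [(score, doc) for score, doc in scored if score > 0]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in matched]


docs = [
    "Marine organisms of the coral reef",
    "Coral reef tourism in Australia",
    "Australian aborigines and their art",
]

query = "marine organisms"
print(rank(query, docs))

# User feedback cycle: after looking at the ranked list, the user
# refines the query and the same process is repeated.
query = query + " coral reef"
print(rank(query, docs))
```

The first query retrieves one document; after the user adds terms in the feedback cycle, a second document is retrieved and ranked below the first.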
DEVELOPMENT
• For the past 4000 years, man has been organizing information for retrieval and usage
• It started out with a table of contents for a book
• Then, the amount of information extended over a number of books
• A specialized data structure is needed to ensure faster access to the stored info
• The oldest and most popular form of data structure for fast IR is a collection of words or concepts with which are associated pointers to the related info = INDEX
• Previously: manual
Development (continued)
• Now, with the advent of computers, large indexes can be generated automatically. These automatic indexes provide the logical view of the document as perceived by the system and not the user
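
An index in the sense described above (words with pointers to the related information) is commonly built as an inverted index. The Python sketch below is a minimal illustration under simple assumptions: documents are plain strings identified by made-up numeric ids, every word is indexed, and `build_index` and `lookup` are hypothetical helper names.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def lookup(index, *words):
    """AND query: ids of the documents containing every given word."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "Rowling Chamber of Secrets",
    2: "Rowling Philosopher's Stone",
    3: "Library catalogue",
}
index = build_index(docs)
print(lookup(index, "Rowling"))
print(lookup(index, "Rowling", "Stone"))
```

Because the index is generated from the words of the documents themselves, it reflects the logical view as perceived by the system, which is the point made on the slide.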
2 different views of the IR problem:
• Computer-centered: building efficient indexes, processing user queries with high performance, developing ranking algorithms which will improve the quality of the answer set
• Human-centered: studying the behavior of the user, understanding his main needs, and determining how such understanding affects the organization and the operation of the IRSYS
IR in the Library
• Libraries were the first users of IRSYS to retrieve information
• Usually developed by academic institutions and later by commercial vendors
• 1st generation: automation of the card catalog; allowed searches based on author and title
• 2nd generation: increased search functionality, i.e. searching by subject headings, keywords and complex queries (OPAC)
• 3rd generation: graphical interfaces, electronic forms, hypertext features, open system architecture (digital libraries)
The Web and Digital Libraries
• Search engines on the web are still using indexes which are similar to the ones used by libraries years ago
• So, what has changed?
• Advances in computer technology have led to:
    Cheaper access to various sources of information
    Greater access to networks due to advances in all kinds of digital communication
    Freedom to post information on the web
Problems
• People still find it difficult to retrieve info relevant to their information needs from the web
• Issues to address:
    Dynamic world on the web
    Demand for access and quick response
    Quality of the retrieval task is affected by user interaction with the system
THANK YOU