Intelligent Information Retrieval (CS 336)
Xiaoyan Li
Spring 2006
Modified from Lisa Ballesteros’s slides
What is Information Retrieval?
• Includes the following:
– Organization
– Storage/Representation
– Manipulation/Analysis
– Search/Retrieval
• How far back in history can we find examples?
IR Through the Ages
• 3rd Century BCE
– Library of Alexandria
• 500,000 volumes
• catalogs and classifications
• 13th Century A.D.
– First concordance of the Bible
• What is a concordance?
• 15th Century A.D.
– Invention of printing
• 1600
– University of Oxford Library
• All books printed in England
IR Through the Ages
• 1755
– Johnson’s Dictionary
• Set standard for dictionaries
• Included common language
• Helped standardize spelling
• 1800
– Library of Congress
• 1828
– Webster’s Dictionary
• Significantly larger than previous dictionaries
• Standardized American spelling
• 1852
– Roget’s Thesaurus
IR Through the Ages
• 1876
– Dewey Decimal Classification
• 1880’s
– Carnegie Public Libraries
• 1,681 built (first public library 1850)
• 1930’s
– Punched card retrieval systems
• 1940’s
– Bush’s Memex
– Shannon’s Communication Theory
– Zipf’s “Law”
Historical Summary
• 1960’s
– Basic advances in retrieval and indexing techniques
• 1970’s
– Probabilistic and vector space models
– Clustering, relevance feedback
– Large, on-line, Boolean information services
– Fast string matching
• 1980’s
– Natural Language Processing and IR
– Expert systems and IR
– Off-the-shelf IR systems
IR Through the Ages
• Late 1980’s
– First mini-computer and PC systems incorporating “relevance ranking”
• Early 1990’s
– Information storage revolution
• 1992
– First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)
IR Through the Ages
• Mid 1990’s to present
– Multimedia databases
• 1994 to present
– The Internet and Web explosion
• e.g. Google, Yahoo, Lycos, Infoseek (now Go)
• 1995 to present
– Digital Libraries
– Data Mining
– Agents and Filtering
– Knowledge and Distributed Intelligence
– Information Organization
– Knowledge Management
Historical Summary
• 1990’s
– Large-scale, full-text IR and filtering experiments and systems (TREC)
– Dominance of ranking
– Many web-based retrieval engines
– Interfaces and browsing
– Multimedia and multilingual
– Machine learning techniques
Trends in IR Technology
[Timeline chart: from roughly 1970 to 1990 and beyond, on-line information grows from gigabytes to terabytes to petabytes as batch systems give way to interactive systems, database systems, cheap storage, the Internet, and multimedia. Technologies progress from Boolean retrieval and filtering to ranked retrieval, distributed retrieval, concept-based retrieval, image and video retrieval, information extraction, visualization, summarization, data mining, and ranked filtering.]
A 1-page word document without any images takes roughly 10 kilobytes (KB) of disk space, so 1 terabyte holds about one hundred million image-free word documents, and 1 petabyte is one thousand terabytes (the arithmetic is sketched below).
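As a quick worked check of these figures (a back-of-the-envelope sketch; rounded decimal units are assumed throughout):

```python
# Rough arithmetic behind the storage figures above (decimal units assumed).
doc_size = 10 * 10**3      # ~10 KB per image-free one-page document
terabyte = 10**12          # bytes
petabyte = 10**15          # bytes

print(terabyte // doc_size)   # 100_000_000 -> about one hundred million documents per TB
print(petabyte // terabyte)   # 1_000       -> one thousand terabytes per PB
```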
Historical Summary
• The Future
– Logic-based IR?
– NLP?
– Integration with other functionality
– Distributed, heterogeneous database access
– IR in context
– “Anytime, Anywhere”
Information Retrieval
• Ad Hoc Retrieval
– Given a query and a large database of text objects, find the relevant objects
• Distributed Retrieval
– Many distributed databases
• Information Filtering
– Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match
• Multimedia Retrieval
– Databases of other types of unstructured data, e.g. images, video, audio
Information Retrieval
• Multilingual Retrieval
– Retrieval in a language other than English
• Cross-language Retrieval
– Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)
What does an IR system do?
• Generate a representation of each document
– essentially pick the best words and/or phrases
• Generate a query representation
– if documents are processed specially, queries must be processed the same way
– possibly weight query words
• Match queries and documents
– find relevant documents
• Perhaps, rank and sort documents (a rough sketch of this pipeline follows)
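A minimal, hypothetical sketch of that pipeline: a toy inverted index stands in for the document representation, the query is tokenized the same way, and documents are ranked by a simple overlap count. Real systems use much richer representations and scoring; this only illustrates the four steps.

```python
from collections import defaultdict

# Toy collection and query (illustrative data only).
docs = {
    1: "information retrieval finds relevant documents",
    2: "databases store structured records",
}
query = "relevant information"

# 1. Document representation: an inverted index from word -> set of doc ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# 2. Query representation: the same tokenization applied to the query.
query_words = query.lower().split()

# 3. Match: collect documents containing each query word.
# 4. Rank: score by how many query words each document contains.
scores = defaultdict(int)
for word in query_words:
    for doc_id in index.get(word, ()):
        scores[doc_id] += 1

ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)   # [(1, 2)]: document 1 matches both query words
```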
Information Retrieval
• Text Representation (Indexing)
– given a text document, identify the concepts that describe the content and how well they describe it
• what makes a “good” representation?
• how is a representation generated from text?
• what are retrievable objects and how are they organized?
• Representing an Information Need (Query Formulation)
– describe and refine information needs as explicit queries
• what is an appropriate query language?
• how can interactive query formulation and refinement be supported?
Information Retrieval
• Comparing Representations (Retrieval)
– compare text and information need representations to determine which documents are likely to be relevant (illustrated below)
• what is a “good” model of retrieval?
• how is uncertainty represented?
• Evaluating Retrieved Text (Feedback)
– present documents for user evaluation and modify the query based on feedback
• what are good metrics?
• what constitutes a good experimental testbed?
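One common (though by no means the only) retrieval model treats documents and queries as term-count vectors and scores them with cosine similarity, so every document gets a graded score rather than a yes/no match. The sketch below uses toy data and an illustrative helper named `cosine`.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = Counter("interest rates and bank interest".split())
query = Counter("bank interest rates".split())
print(round(cosine(query, doc), 3))   # a higher score suggests greater likely relevance
```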
Information Retrieval and Filtering
[Diagram: an information need is turned into a representation (the query); text objects are turned into a representation (the indexed objects); comparison of the query against the indexed objects yields retrieved objects, and evaluation/feedback on those objects refines the query.]
Features of a Modern IR Product
• Effective “relevance ranking”
• Simple free text (“natural language”) query capability
• Boolean and proximity operators
• Term weighting
• Query formulation assistance
• Query by example
• Filtering
• Field-based retrieval
• Distributed architecture
• Index anything
• Fast retrieval
• Information Organization
Typical Systems
• IR systems
– Verity, Fulcrum, Excalibur
• Database systems
– Oracle, Informix
• Web search and in-house systems
– West, LEXIS/NEXIS, Dialog
– Yahoo, Google, MSN, AskJeeves
IR vs. Database Systems
• Emphasis on effective, efficient retrieval of unstructured data
• IR systems typically have very simple schemas
• Query languages emphasize free text, although Boolean combinations of words are also common
IR vs. Database Systems
• Matching is more complex than with structured data (semantics less obvious)
– easy to retrieve the wrong objects
– need to measure the accuracy of retrieval (an example follows below)
• Less focus on concurrency control and recovery, although update is very important
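As a small example of measuring retrieval accuracy, two standard IR metrics are precision and recall; the document ids below are made up.

```python
retrieved = {1, 2, 3, 5}   # what the system returned
relevant = {1, 2, 4}       # what a human judged relevant

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # fraction of returned docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were returned

print(precision, recall)   # 0.5 0.666...
```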
Ambiguity Complicates the Task
• Synonyms: many ways to express a concept
– lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure
– failure to use specific words => failure to get the doc
• Words have many meanings
– How many different meanings are there for “bank”?
Ambiguity Complicates the Task
• Difficult to Specify Important but Vague Concepts
– e.g. will interest rates be raised in the next six months?
• Spelling variants/spelling errors
Basic Automatic Indexing
• Parse documents to recognize structure
– e.g. title, date, other fields
• Scan for word tokens
– numbers, special characters, hyphenation, capitalization, etc.
– languages like Chinese need segmentation
– record positional information for proximity operators
• Stopword removal (a small example follows this list)
– based on a short list of common words such as “the”, “and”, “or”
– saves storage overhead of very long indexes
– can be dangerous (e.g. “Mr. The”, “and-or gates”)
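A toy sketch of the scanning and stopword steps above. The regular expression and stopword list are illustrative choices, not what a production indexer uses; positions are recorded so proximity operators remain possible.

```python
import re

STOPWORDS = {"the", "and", "or", "of", "a"}

def tokenize(text):
    """Yield (position, token) pairs, lowercased, with stopwords skipped."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    for pos, tok in enumerate(tokens):
        if tok not in STOPWORDS:
            yield pos, tok

print(list(tokenize("The design of AND-OR gates")))
# [(1, 'design'), (3, 'and-or'), (4, 'gates')]
# note: the hyphenated token "and-or" survives even though "and" and "or" are stopwords
```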
Basic Automatic Indexing
• Stem words
– group word variants such as plurals via morphological processing (sketched below)
• computer, computers, computing, computed, computation, computerized, computerize, computerizable
– can make mistakes but generally preferred
• Optional
– phrase indexing
– thesaurus classes
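A brief stemming sketch, assuming the NLTK package is installed; its PorterStemmer is one common stemmer, not the only choice.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
variants = ["computer", "computers", "computing", "computed", "computation"]
print({w: stemmer.stem(w) for w in variants})
# The variants collapse to a shared stem (roughly "comput"), which shows both
# the benefit (grouping variants) and the risk (conflating distinct words).
```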
How do you rank results?
• What does it mean for a document to be important/relevant?
– Even human assessors do not agree with each other.
• Word matching is imperfect; how do we decide which documents are most important?
How do you rank results?
• How do we decide which documents are most important?
– Count words
• high frequency words indicate document “aboutness”
– Weight infrequent corpus words more strongly (see the tf-idf sketch below)
• can be strong signifiers of meaning; easier to partition
– Determine meaning by analyzing the text surrounding a word
– Give extra weight to title words, etc.
– Make sense of references given, citations received, etc.
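The first two heuristics are roughly what tf-idf weighting does: count a word's frequency in the document and weight it by how rare the word is across the corpus. The sketch below uses one common formulation on a toy corpus; exact formulas vary between systems.

```python
import math
from collections import Counter

corpus = [
    "bank raises interest rates",
    "river bank flooding",
    "interest rates and bank loans",
]
docs = [Counter(text.split()) for text in corpus]
N = len(docs)

def tf_idf(term, doc):
    tf = doc[term]                            # term frequency in this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    idf = math.log(N / df) if df else 0.0     # rarer terms get a larger weight
    return tf * idf

# "flooding" occurs in only one document, so it is weighted more strongly
# there than the very common word "bank".
print(tf_idf("flooding", docs[1]), tf_idf("bank", docs[1]))
```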
Free Text Search Engines
• Different engines use different ranking strategies (often a trade secret)
– Word frequency
– Placement in document
– Popularity of document
– Number of links to document
– Business relationships, etc.
Announcement:
• Writing assignment: due next Monday
– Create 10 topics/queries and run them on three popular Web search engines: Google.com, Yahoo.com, and Ask.com. Write a report comparing the three search engines and discussing why IR is so hard.
• Next Lecture: Query languages. (Ch. 4)