9/4/2001 Information Organization and Retrieval

Introduction to Information Retrieval

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Lecture authors: Marti Hearst & Ray Larson

Review: Information Overload

• “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

Information Organization and Retrieval

• To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.

• Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.

• To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.

• Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.

The Oxford English Dictionary, cf. Rowley

Information Life Cycle

[Figure: information life cycle diagram. Labels: Creation, Utilization, Searching; Active, Semi-Active, Inactive; Retention/Mining; Disposition, Discard; Using/Creating, Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering.]

Note: This version of the Life cycle is based on the report of a conference on the Social Aspects of Digital Libraries held at UCLA. - C. Borgman, PI

Authoring/Modifying

• Converting Data+Information+Knowledge to New Information.

• Creating information from observation, thought.

• Editing and Publication.

• Gatekeeping

Organizing/Indexing

• Collecting and Integrating information.

• Affects Data, Information and Metadata.

• “Metadata” Describes data and information.
  – More on this later.

• Organizing Information.
  – Types of organization?

• Indexing

Storing/Retrieving

• Information Storage
  – How and Where is Information stored?

• Retrieving Information.
  – How is information recovered from storage
  – How to find needed information
  – Linked with Accessing/Filtering stage

Distribution/Networking

• Transmission of information
  – How is information transmitted?

• Networks vs Broadcast.

Accessing/Filtering

• Using the organization created in the O/I stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)

Using/Creating

• Using Information.

• Transformation of Information to Knowledge.

• Knowledge to New Data and New Information.

Key issues in this course

• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs.
  – Retrieving

• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them.
  – Organizing

Key Issues

[Figure: the information life cycle diagram again. Labels: Creation, Utilization, Searching; Active, Semi-Active, Inactive; Retention/Mining; Disposition, Discard; Using/Creating, Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering.]

This Week

• Introduction to IR
  – Modern IR textbook topics

• The Information Seeking Process

Textbook Topics

More Detailed View

What We’ll Cover

[Figure legend: A Lot / A Little]

Search and Retrieval
Outline of Part I of SIMS 202

• The Search Process
• Information Retrieval Models
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
  – Precision/Recall
  – Relevance
  – User Studies
• System and Implementation Issues
• Web-Specific Issues
• User Interface Issues
• Special Kinds of Search

What is an Information Need?

The Standard Retrieval Interaction Model

Standard Model

• Assumptions:
  – Maximizing precision and recall simultaneously
  – The information need remains static
  – The value is in the resulting document set
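The two quantities the standard model tries to maximize can be computed from a retrieved set and a set of relevance judgments. The following is a minimal sketch; the function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query; both arguments are sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 documents retrieved, 3 of the 6 relevant ones among them.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d4", "d7", "d8", "d9"}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.75 0.5
```

Retrieving more documents can only raise recall, but it usually lowers precision, which is why maximizing both at once is an assumption rather than a given.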

Problem with Standard Model:

• Users learn during the search process:
  – Scanning titles of retrieved documents
  – Reading retrieved documents
  – Viewing lists of related topics/thesaurus terms
  – Navigating hyperlinks

• Some users don’t like long disorganized lists of documents

IR is an Iterative Process

[Figure: diagram with the labels Repositories, Workspace, Goals.]

IR is a Dialog

– The exchange doesn’t end with first answer

– User can recognize elements of a useful answer

– Questions and understanding change as the process continues.

“Berry-Picking” as an Information Seeking Strategy (Bates 90)

• Standard IR model
  – assumes the information need remains the same throughout the search process

• Berry-picking model
  – interesting information is scattered like berries among bushes
  – the query is continually shifting

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

[Figure: search path passing through successive queries Q0, Q1, Q2, Q3, Q4, Q5.]

Berry-picking model (cont.)

• The query is continually shifting

• New information may yield new ideas and new directions

• The information need
  – is not satisfied by a single, final retrieved set
  – is satisfied by a series of selections and bits of information found along the way.

Information Seeking Behavior

• Two parts of a process:
  – search and retrieval
  – analysis and synthesis of search results

• This is a fuzzy area; we will look at several different working theories.

Search Tactics and Strategies

• Search Tactics
  – Bates 79

• Search Strategies
  – Bates 89
  – O’Day and Jeffries 93

Tactics vs. Strategies

• Tactic: short term goals and maneuvers
  – operators, actions

• Strategy: overall planning
  – link a sequence of operators together to achieve some end

Information Search Tactics (after Bates 79)

• Monitoring tactics
  – keep search on track

• Source-level tactics
  – navigate to and within sources

• Term and Search Formulation tactics
  – designing search formulation
  – selection and revision of specific terms within search formulation

Term Tactics

• Move around the thesaurus
  – superordinate, subordinate, coordinate
  – neighbor (semantic or alphabetic)
  – trace: pull out terms from information already seen as part of search (titles, etc.)
  – morphological and other spelling variants
  – antonyms (contrary)

Source-level Tactics

• “Bibble”:
  – look for a pre-defined result set
  – e.g., a good link page on web

• Survey:
  – look ahead, review available options
  – e.g., don’t simply use the first term or first source that comes to mind

• Cut:
  – eliminate large proportion of search domain
  – e.g., search on rarest term first
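The Cut tactic has a direct counterpart inside a retrieval system: when several terms are ANDed, starting from the term with the fewest postings shrinks the candidate set fastest. A minimal sketch, assuming a toy index that maps each term to a set of document ids (the index contents and the function name are invented):

```python
def intersect_rarest_first(index, terms):
    """AND together the posting sets of terms, starting from the rarest term."""
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        if not result:
            break  # nothing left to cut
        result &= p
    return result

# Hypothetical postings: "the" is everywhere, "zipf" is rare.
index = {
    "the": {1, 2, 3, 4, 5, 6},
    "retrieval": {2, 4, 5},
    "zipf": {4},
}
print(intersect_rarest_first(index, ["the", "retrieval", "zipf"]))  # {4}
```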

Source-level Tactics (cont.)

• Stretch
  – use source in unintended way
  – e.g., use patents to find addresses

• Scaffold
  – take an indirect route to goal
  – e.g., when looking for references to obscure poet, look up contemporaries

• Cleave
  – binary search in an ordered file
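Cleave is the human version of binary search: open the ordered file in the middle and discard the half that cannot contain the target. A minimal sketch using Python's standard `bisect` module; the entry list is invented for illustration.

```python
from bisect import bisect_left

def cleave(sorted_entries, target):
    """Return the index of target in a sorted list, or -1 if it is absent."""
    i = bisect_left(sorted_entries, target)
    if i < len(sorted_entries) and sorted_entries[i] == target:
        return i
    return -1

# Hypothetical alphabetically ordered file.
entries = ["auden", "bates", "hearst", "larson", "luhn", "salton"]
print(cleave(entries, "luhn"))  # 4
```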

Monitoring Tactics (strategy-level)

• Check
  – compare original goal with current state

• Weigh
  – make a cost/benefit analysis of current or anticipated actions

• Pattern
  – recognize common strategies

• Correct Errors

• Record
  – keep track of (incomplete) paths

Additional Considerations (Bates 79)

• Add a Sort tactic!

• More detail is needed about short-term cost/benefit decision rule strategies

• When to stop?
  – How to judge when enough information has been gathered?
  – How to decide when to give up an unsuccessful search?
  – When to stop searching in one source and move to another?

Implications

• Interfaces should make it easy to store intermediate results

• Interfaces should make it easy to follow trails with unanticipated results

• Makes evaluation more difficult.

• Later in the course:
  – More on Search Process and Strategies
  – User interfaces to improve IR process
  – Incorporation of Content Analysis into better systems

Restricted Form of the IR Problem

• The system has available only pre-existing, “canned” text passages.

• Its response is limited to selecting from these passages and presenting them to the user.

• It must select, say, 10 or 20 passages out of millions or billions!

Information Retrieval

• Revised Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries.

• This set of assumptions underlies the field of Information Retrieval.

Some IR History

– Roots in the scientific “Information Explosion” following WWII

– Interest in computer-based IR from mid 1950’s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed (‘60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical advances (‘70s)
  • Refinements and Advances in application (‘80s)
  • User Interfaces, Large-scale testing and application (‘90s)

Structure of an IR System

[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19. Search line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests. Storage line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations. Comparison/Matching; Potentially Relevant Documents. Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]

Relevance (introduction)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely.
    • Who is buried in Grant’s Tomb? Grant.
  – Partially answer question.
    • Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
    • What is lymphedema? Look in this Medical Dictionary.
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...

Query Languages

• A way to express the question (information need)

• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)

Simple query language: Boolean

– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT

Boolean Queries

• Cat

• Cat OR Dog

• Cat AND Dog

• (Cat AND Dog)

• (Cat AND Dog) OR Collar

• (Cat AND Dog) OR (Collar AND Leash)

• (Cat OR Dog) AND (Collar OR Leash)
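Each of these queries maps onto set operations over an inverted index: OR is union, AND is intersection, and NOT is the complement against the whole collection. A small sketch with five invented documents (the collection and variable names are assumptions for illustration):

```python
# Hypothetical collection: document id -> text.
docs = {
    1: "cat collar",
    2: "dog leash",
    3: "cat dog",
    4: "collar leash",
    5: "cat dog collar leash",
}

# Inverted index: term -> set of ids of the documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
cat, dog = index["cat"], index["dog"]
collar, leash = index["collar"], index["leash"]

print(cat | dog)                        # Cat OR Dog
print(cat & dog)                        # Cat AND Dog
print((cat & dog) | (collar & leash))   # (Cat AND Dog) OR (Collar AND Leash)
print((cat | dog) & (collar | leash))   # (Cat OR Dog) AND (Collar OR Leash)
print(all_ids - cat)                    # NOT Cat
```

Moving the parentheses changes the answer: on this collection the last two compound queries return {3, 4, 5} and {1, 2, 5} respectively.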

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:

• Cat x x x x
• Dog x x x x x
• Collar x x x x
• Leash x x x x

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:

• Cat x x

• Dog x x

• Collar x x

• Leash x x
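The working and failing combinations can be listed exhaustively by trying all 16 ways the four terms can be present or absent in a document. This is only an illustrative sketch; the function and variable names are not from the lecture.

```python
from itertools import product

TERMS = ("Cat", "Dog", "Collar", "Leash")

def matches(cat, dog, collar, leash):
    """(Cat OR Dog) AND (Collar OR Leash)."""
    return (cat or dog) and (collar or leash)

satisfying, failing = [], []
for combo in product([False, True], repeat=4):
    present = tuple(t for t, flag in zip(TERMS, combo) if flag)
    (satisfying if matches(*combo) else failing).append(present)

print(len(satisfying), len(failing))  # 9 7
for present in satisfying:
    print(" + ".join(present))
```

A document works only if it contains at least one term from each parenthesized group, so a document with only Cat and Dog fails, as does one with only Collar and Leash.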

Boolean Logic

[Figure: Venn diagrams of sets A, B, and C.]

De Morgan’s Law:

$$\overline{A \cup B} = \overline{A} \cap \overline{B}$$

$$\overline{A \cap B} = \overline{A} \cup \overline{B}$$

Boolean Queries

– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
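Because a and b can each take only two values, the three rules can be checked by brute force over the four truth assignments. A short sketch (the function name is made up):

```python
from itertools import product

def de_morgan_holds():
    """Check the three rewrite rules for every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

print(de_morgan_holds())  # True
```

In a retrieval system these rules let a query processor rewrite a query into an equivalent form, for example to push NOT down to individual terms.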

Boolean Logic

[Figure: Venn diagram of three terms t1, t2, t3 over documents D1 to D11. The circles divide the plane into eight regions m1 to m8; each region is a conjunction of t1, t2, and t3, with each term either present or negated.]

Boolean Searching

“Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Figure: diagram of the four concepts Cracks, Beams, Width measurement, Prestressed concrete.]

Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
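The relaxed query amounts to "any three of the four concepts". One way to sketch that, assuming each document is reduced to a set of concept codes, is to count matched terms against a threshold, or to generate the OR-of-ANDs form mechanically. The function names and the `quorum` parameter are illustrative, not from the lecture.

```python
from itertools import combinations

def relaxed_match(doc_terms, query_terms, quorum):
    """True if the document contains at least `quorum` of the query terms."""
    return len(set(query_terms) & set(doc_terms)) >= quorum

def expand_relaxed(query_terms, quorum):
    """Spell the quorum condition out as an OR of ANDs."""
    return " OR ".join(
        "(" + " AND ".join(group) + ")"
        for group in combinations(query_terms, quorum)
    )

query = ["C", "B", "W", "P"]  # Cracks, Beams, Width measurement, Prestressed concrete
print(expand_relaxed(query, 3))
print(relaxed_match({"C", "B", "P"}, query, 3))  # True
print(relaxed_match({"C", "B"}, query, 3))       # False
```

The generated expression contains the same four conjunctions as the relaxed query above, though possibly in a different order.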

Pseudo-Boolean Queries

• A new notation, from web search
  – +cat dog +collar leash

• Does not mean the same thing!

• Need a way to group combinations.

• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”

[Figure: retrieval pipeline with the labels Information need, text input, Parse, Query, Pre-process, Index, Collections, Rank.]
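A sketch of one common reading of the "+" notation, stated here as an assumption rather than as any particular engine's behavior: terms marked "+" must appear, and unmarked terms are optional and only affect ranking. Quoted phrases are not handled, and all names below are invented.

```python
def parse_plus_query(query):
    """Split a query such as '+cat dog +collar leash' into (required, optional) term sets."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        else:
            optional.add(token)
    return required, optional

def search(docs, query):
    """Keep documents holding every required term; rank by number of optional terms matched."""
    required, optional = parse_plus_query(query)
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if required <= words:
            scored.append((len(optional & words), doc_id))
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical collection.
docs = {1: "cat collar", 2: "cat dog collar leash", 3: "dog leash", 4: "cat collar leash"}
print(search(docs, "+cat dog +collar leash"))  # [2, 4, 1]
```

Under this reading the query behaves like cat AND collar, with dog and leash as ranking hints, which is not the same as (Cat OR Dog) AND (Collar OR Leash): document 3 satisfies the Boolean query but is dropped here.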

Result Sets

• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set

• Example: Dialog query
  • (Redford AND Newman)
  • -> S1 1450 documents
  • (S1 AND Sundance)
  • -> S2 898 documents

[Figure: retrieval pipeline with the labels Information need, text input, Parse, Query, Pre-process, Index, Collections, Rank, Reformulated Query, Re-Rank.]
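Running a reformulated query against an earlier result set, in the style of the S1/S2 example, can be sketched as an intersection that starts from the saved set instead of the whole collection. The tiny index below is invented and does not reproduce the document counts above; `run_query` and `within` are illustrative names.

```python
def run_query(index, required_terms, within=None):
    """AND the posting sets of required_terms, optionally restricted to an earlier result set."""
    result = None if within is None else set(within)
    for term in required_terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
    return result if result is not None else set()

# Hypothetical postings.
index = {
    "redford": {1, 2, 3, 5},
    "newman": {2, 3, 5, 8},
    "sundance": {3, 5, 9},
}
s1 = run_query(index, ["redford", "newman"])      # like S1
s2 = run_query(index, ["sundance"], within=s1)    # like S2 = S1 AND Sundance
print(len(s1), len(s2))  # 3 2
```

For a pure AND refinement, searching within S1 gives the same documents as rerunning the longer query on the entire collection; the saved set mainly avoids repeating work.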

Ordering of Retrieved Documents

• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?

• Fancier methods have been investigated
  – p-norm is most famous
    • usually impractical to implement
    • usually hard for user to understand

Boolean

• Advantages
  – simple queries are easy to understand
  – relatively easy to implement

• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined

• Dominant language in commercial systems until the WWW

Faceted Boolean Query

• Strategy: break query into facets (polysemous with earlier meaning of facets)
  – conjunction of disjunctions
      (a1 OR a2 OR a3)
      AND (b1 OR b2)
      AND (c1 OR c2 OR c3 OR c4)
  – each facet expresses a topic
      (“rain forest” OR jungle OR amazon)
      AND (medicine OR remedy OR cure)
      AND (Smith OR Zhou)

Faceted Boolean Query

• Query still fails if one facet missing

• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied

– Also called Quorum ranking, Overlap ranking, and Best Match

• Problem: Facets still undifferentiated

• Alternative: assign weights to facets
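Coordination level ranking can be sketched by counting, for each document, how many facets have at least one of their terms present. The example assumes documents are already reduced to sets of terms, with phrases such as "rain forest" indexed as single units; the documents and function names are invented.

```python
def coordination_level(doc_terms, facets):
    """Number of facets (OR-groups) with at least one term in the document."""
    return sum(1 for facet in facets if facet & doc_terms)

def rank_by_coordination(docs, facets):
    """Order documents by how many facets they satisfy, dropping those that satisfy none."""
    scored = [(coordination_level(terms, facets), doc_id) for doc_id, terms in docs.items()]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [(doc_id, score) for score, doc_id in ordered if score > 0]

facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]
# Hypothetical documents.
docs = {
    "d1": {"jungle", "remedy", "zhou"},
    "d2": {"amazon", "cure"},
    "d3": {"smith"},
    "d4": {"desert"},
}
ranking = rank_by_coordination(docs, facets)
print(ranking)  # [('d1', 3), ('d2', 2), ('d3', 1)]

# The strict faceted Boolean query keeps only documents that satisfy every facet.
strict = [doc_id for doc_id, score in ranking if score == len(facets)]
print(strict)  # ['d1']
```

Assigning weights to facets would replace the count with a weighted sum, so that missing an important facet costs more than missing a minor one.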

Proximity Searches

• Proximity: terms occur within K positions of one another
  – pen w/5 paper

• A “Near” function can be more vague
  – near(pen, paper)

• Sometimes order can be specified

• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”

• Phrase Variants
  – “retrieval of information” “information retrieval”
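A proximity operator such as pen w/5 paper needs word positions, not just term presence. The sketch below scans a tokenized text directly; a real system would use a positional index, and the exact meaning of "within K" varies between systems, so the convention here (position difference of at most K) is an assumption, as are the function names.

```python
def positions(tokens, term):
    """Word offsets at which term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within_k(text, a, b, k, ordered=False):
    """True if terms a and b occur within k word positions of each other."""
    tokens = text.lower().split()
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            gap = j - i
            if ordered and gap < 0:
                continue  # b must follow a when order is specified
            if 0 < abs(gap) <= k:
                return True
    return False

text = "she picked up the pen and a sheet of paper"
print(within_k(text, "pen", "paper", 5))                 # True
print(within_k(text, "pen", "paper", 4))                 # False
print(within_k(text, "paper", "pen", 5, ordered=True))   # False
```

A phrase is the special case of an ordered match with k = 1.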

Filters

• Filters: Reduce set of candidate docs
• Often specified simultaneously with the query
• Usually restrictions on metadata
  – restrict by:
    • date range
    • internet domain (.edu .com .berkeley.edu)
    • author
    • size
    • limit number of documents returned
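Metadata filters can be sketched as predicates applied to the candidate set before or after matching. The record fields (`date`, `host`, `author`) and the sample records are hypothetical.

```python
from datetime import date

def apply_filters(docs, start=None, end=None, domain=None, author=None, limit=None):
    """Keep documents whose metadata passes every filter that is set."""
    kept = []
    for doc in docs:
        if start and doc["date"] < start:
            continue
        if end and doc["date"] > end:
            continue
        if domain and not doc["host"].endswith(domain):
            continue
        if author and doc["author"] != author:
            continue
        kept.append(doc)
    return kept[:limit] if limit is not None else kept

# Hypothetical candidate documents.
docs = [
    {"id": 1, "date": date(2001, 9, 4), "host": "sims.berkeley.edu", "author": "author_a"},
    {"id": 2, "date": date(1999, 1, 15), "host": "example.com", "author": "author_b"},
    {"id": 3, "date": date(2000, 6, 1), "host": "example.edu", "author": "author_c"},
]
hits = apply_filters(docs, start=date(2000, 1, 1), domain=".edu")
print([d["id"] for d in hits])  # [1, 3]
```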

Next

• Statistical Properties of Text

• Preparing information for search: Lexical analysis

• Introduction to the Vector Space model of IR.
