2004.09.07 - SLIDE 1 IS 202 – FALL 2004 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2004 http://www.sims.berkeley.edu/academics/courses/is202/f04/ SIMS 202: Information Organization and Retrieval Lecture 3: Intro to Information Retrieval

2004.09.07 - SLIDE 1 IS 202 – FALL 2004

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2004

http://www.sims.berkeley.edu/academics/courses/is202/f04/

SIMS 202: Information Organization and Retrieval

Lecture 3: Intro to Information Retrieval

2004.09.07 - SLIDE 2 IS 202 – FALL 2004

Lecture Overview

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

• Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2004.09.07 - SLIDE 4 IS 202 – FALL 2004

Review: Information Overload

• “The world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.” (Varian & Lyman)

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

2004.09.07 - SLIDE 5 IS 202 – FALL 2004

Key Issues In This Course

• How to describe information resources or information-bearing objects in ways so that they may be effectively used by those who need to use them
  – Organizing

• How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs
  – Retrieving

2004.09.07 - SLIDE 6 IS 202 – FALL 2004

Key Issues

[Figure: diagram with the labels Creation, Utilization, Searching, Active, Semi-Active, Inactive, Retention/Mining, Disposition, Discard, Using, Creating, Authoring, Modifying, Organizing, Indexing, Storing, Retrieval, Distribution, Networking, Accessing, Filtering]

2004.09.07 - SLIDE 7 IS 202 – FALL 2004

IR Topics for 202

• The Search Process
• Information Retrieval Models
  – Boolean, Vector, and Probabilistic
• Web-Specific Issues
• Content Analysis/Zipf Distributions
• Evaluation of IR Systems
  – Precision/Recall
  – Relevance
  – User Studies
• User Interface Issues
• Special Kinds of Search

2004.09.07 - SLIDE 8 IS 202 – FALL 2004

Lecture Overview

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

• Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2004.09.07 - SLIDE 9 IS 202 – FALL 2004

Web Search Questions

• What do people search for?

• How do people use search engines?
  – How often do people find what they are looking for?
  – How difficult is it for people to find what they are looking for?

• How can search engines be improved?

2004.09.07 - SLIDE 10 IS 202 – FALL 2004

What Do People Search for on the Web?

• Study by Spink et al., Oct 98
  – www.shef.ac.uk/~is/publications/infres/paper53.html
  – Survey on Excite, 13 questions
  – Data for 316 surveys

• (If you are interested in this, Amanda Spink has a new book entitled “Web Search: Public Searching On the Web”)

2004.09.07 - SLIDE 11 IS 202 – FALL 2004

What Do People Search for on the Web?

• Topics
  – Genealogy/Public Figure: 12%
  – Computer related: 12%
  – Business: 12%
  – Entertainment: 8%
  – Medical: 8%
  – Politics & Government: 7%
  – News: 7%
  – Hobbies: 6%
  – General info/surfing: 6%
  – Science: 6%
  – Travel: 5%
  – Arts/education/shopping/images: 14%

• Something is missing…

2004.09.07 - SLIDE 12 IS 202 – FALL 2004

What Do People Search for on the Web?

Most frequent terms (50,000 queries from Excite, 1997):

• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes

2004.09.07 - SLIDE 13 IS 202 – FALL 2004

Why Do These Differ?

• Self-reporting survey

• The nature of language
  – Only a few ways to say certain things
  – Many different ways to express most concepts
    • UFO, flying saucer, space ship, satellite
    • How many ways are there to talk about history?

2004.09.07 - SLIDE 14 IS 202 – FALL 2004

What is on the Web?

List of 31,928,892 terms from analysis of 49,602,191 web pages

• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information

Source: http://elib.cs.berkeley.edu/docfreq/index.html
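
A term list like this one comes from tokenizing every page and counting terms. The slide does not say whether the numbers are total occurrences or the number of pages containing the term (the source URL suggests document frequency, yet the top counts exceed the stated page total), so the minimal sketch below computes both. The function name `term_counts`, the tokenizer, and the three sample pages are assumptions for illustration, not the method behind the Berkeley list. Splitting on non-alphanumeric characters may be one reason single letters such as "s" and "t" show up as terms.

```python
import re
from collections import Counter

def term_counts(pages):
    """Return (total occurrences, number of pages containing) per term."""
    occurrences, doc_freq = Counter(), Counter()
    for text in pages:
        # Lowercase and split on anything that is not a letter or digit,
        # so "Berkeley's" becomes "berkeley" and "s".
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        occurrences.update(tokens)      # count every occurrence
        doc_freq.update(set(tokens))    # count each page at most once
    return occurrences, doc_freq

# Hypothetical three-page collection, for illustration only.
pages = [
    "The home page of the new information site",
    "Information about the campus and the city",
    "A new page",
]
occurrences, doc_freq = term_counts(pages)
for term in ("the", "page", "information"):
    print(doc_freq[term], occurrences[term], term)
```

Sorting either counter with `most_common()` gives a ranked list in the same form as the slide.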

2004.09.07 - SLIDE 15 IS 202 – FALL 2004

Intranet Queries (Aug 2000)

• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid

2004.09.07 - SLIDE 16 IS 202 – FALL 2004

Intranet Queries

• Summary of sample data from 3 weeks of UCB queries
  – 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  – 6.7% Schedule of classes or final exams (6222)
  – 5.4% Summer Session (5041)
  – 3.2% Extension (2932)
  – 3.1% Academic Calendar (2846)
  – 2.4% Directories (2202)
  – 1.7% Career Center (1588)
  – 1.7% Housing (1583)
  – 1.5% Map (1393)

• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
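
The slide gives each category as both a share and a raw count, so the size of the three-week sample can be estimated from any row as count divided by share. The short check below is a sketch of that arithmetic; the total is not stated on the slide, and the helper name `implied_total` is hypothetical.

```python
# (share of all queries in percent, query count), as listed on the slide
categories = {
    "Telebears/BearFacts/InfoBears/BearLink": (13.2, 12297),
    "Schedule of classes or final exams": (6.7, 6222),
    "Summer Session": (5.4, 5041),
    "Extension": (3.2, 2932),
    "Academic Calendar": (3.1, 2846),
    "Directories": (2.4, 2202),
    "Career Center": (1.7, 1588),
    "Housing": (1.7, 1583),
    "Map": (1.5, 1393),
}

def implied_total(percent, count):
    """Estimate the sample size from one category's share and count."""
    return count / (percent / 100.0)

estimates = {name: implied_total(p, c) for name, (p, c) in categories.items()}
for name, total in estimates.items():
    print(f"{total:9.0f}  {name}")

low, high = min(estimates.values()), max(estimates.values())
covered = sum(p for p, _ in categories.values())
print(f"implied sample size: about {low:.0f} to {high:.0f} queries")
print(f"listed categories cover {covered:.1f}% of the sample")
```

Every row points to a sample in the low 90,000s (the spread comes from rounding the percentages), and the nine listed categories account for a bit under 40% of it.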

2004.09.07 - SLIDE 17 IS 202 – FALL 2004

Queries as Zeitgeist

From: http://www.google.com/press/zeitgeist.html

2004.09.07 - SLIDE 18 IS 202 – FALL 2004

How DO people search?

• Different approaches for different tasks

• Models of the search process attempt to summarize how people interact with information resources when seeking information
  – Standard IR model
  – Alternative models

2004.09.07 - SLIDE 19 IS 202 – FALL 2004

The Standard Retrieval Interaction Model

2004.09.07 - SLIDE 20 IS 202 – FALL 2004

Standard Model of IR

• Assumptions:
  – The goal is maximizing precision and recall simultaneously
  – The information need remains static
  – The value is in the resulting document set
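
The first assumption refers to two measures that the slide does not define. Under the usual definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch with made-up document ids (the function name and the sample sets are assumptions for illustration):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)    # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d9", "d10"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50) and two of the five relevant documents were found (recall 0.40). Retrieving more documents can raise recall but tends to lower precision, which is why maximizing both at once is stated as a goal rather than a given.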

2004.09.07 - SLIDE 21 IS 202 – FALL 2004

Problems with Standard Model

• Users learn during the search process:
  – Scanning titles of retrieved documents
  – Reading retrieved documents
  – Viewing lists of related topics/thesaurus terms
  – Navigating hyperlinks

• Some users don’t like long and (apparently) disorganized lists of documents

2004.09.07 - SLIDE 22 IS 202 – FALL 2004

IR is an Iterative Process

[Figure: diagram with the labels Repositories/Resources, Workspace, Goals/Needs]

2004.09.07 - SLIDE 23 IS 202 – FALL 2004

IR is a Dialog

• The exchange doesn’t end with the first answer
• Users can recognize elements of a useful answer, even when incomplete
• Questions and understanding change as the process continues

2004.09.07 - SLIDE 24 IS 202 – FALL 2004

Bates’ “Berry-Picking” Model

• Standard IR model
  – Assumes the information need remains the same throughout the search process

• Berry-picking model
  – Interesting information is scattered like berries among bushes
  – The query is continually shifting

2004.09.07 - SLIDE 25 IS 202 – FALL 2004

Berry-Picking Model

[Figure: a search path through a sequence of queries labeled Q0, Q1, Q2, Q3, Q4, Q5]

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

2004.09.07 - SLIDE 26 IS 202 – FALL 2004

Berry-Picking Model (cont.)

• The query is continually shifting

• New information may yield new ideas and new directions

• The information need
  – Is not satisfied by a single, final retrieved set
  – Is satisfied by a series of selections and bits of information found along the way
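
The contrast with a single retrieved set can be made concrete with a toy session. Berry-picking describes searcher behavior, not an algorithm, so the loop below is only an illustrative sketch: the four-document collection, the `search` function, and the rule for shifting the query (`follow_new_word`) are all hypothetical. It keeps a "basket" of selections across a trail of queries, where each new query is prompted by something found at the previous stop.

```python
# Toy collection; ids and titles are hypothetical.
DOCS = {
    1: "lunar excursion module design",
    2: "apollo lunar landing history",
    3: "apollo program budget and history",
    4: "saturn rocket engines",
}

def search(query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    return [doc_id for doc_id, text in DOCS.items()
            if all(w in text.split() for w in words)]

def follow_new_word(query, new_docs):
    """Stand-in for the searcher: shift to the first word of the most
    recently found document that is not already in the query."""
    if not new_docs:
        return None                     # nothing new was found, so stop
    seen = set(query.split())
    for word in DOCS[new_docs[-1]].split():
        if word not in seen:
            return word
    return None

def berry_pick(first_query, reformulate, max_steps=5):
    """Follow a shifting query, collecting selections along the way."""
    basket, trail, query = [], [], first_query
    for _ in range(max_steps):
        if query is None:
            break
        trail.append(query)
        new_docs = [d for d in search(query) if d not in basket]
        basket.extend(new_docs)         # keep what was picked at this stop
        query = reformulate(query, new_docs)
    return basket, trail

basket, trail = berry_pick("lunar", follow_new_word)
print("single query 'lunar':", search("lunar"))
print("query trail:", trail)
print("basket:", basket)
```

The single query returns documents 1 and 2, while the trail "lunar", "apollo", "program" ends with documents 1, 2, and 3 in the basket. Document 3 never matches the original query; it is reached only because the query shifted.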

2004.09.07 - SLIDE 27 IS 202 – FALL 2004

Information Seeking Behavior

• Two parts of a process:
  – Search and retrieval
  – Analysis and synthesis of search results

• This is a fuzzy area
  – We will look (briefly) at some different working theories

2004.09.07 - SLIDE 28 IS 202 – FALL 2004

Search Tactics and Strategies

• Search Tactics
  – Bates 1979

• Search Strategies
  – Bates 1989
  – O’Day and Jeffries 1993

2004.09.07 - SLIDE 29 IS 202 – FALL 2004

Tactics vs. Strategies

• Tactic: short-term goals and maneuvers
  – Operators, actions

• Strategy: overall planning
  – Link a sequence of operators together to achieve some end

2004.09.07 - SLIDE 30 IS 202 – FALL 2004

Information Search Tactics

• Monitoring tactics
  – Keep search on track

• Source-level tactics
  – Navigate to and within sources

• Term and Search Formulation tactics
  – Designing search formulation
  – Selection and revision of specific terms within search formulation

2004.09.07 - SLIDE 31 IS 202 – FALL 2004

Monitoring Tactics (Strategy-Level)

• Check
  – Compare original goal with current state

• Weigh
  – Make a cost/benefit analysis of current or anticipated actions

• Pattern
  – Recognize common strategies

• Correct Errors

• Record
  – Keep track of (incomplete) paths

2004.09.07 - SLIDE 32 IS 202 – FALL 2004

Source-Level Tactics

• “Bibble”:
  – Look for a pre-defined result set
    • E.g., a good link page on web

• Survey:
  – Look ahead, review available options
    • E.g., don’t simply use the first term or first source that comes to mind

• Cut:
  – Eliminate large proportion of search domain
    • E.g., search on rarest term first
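
The "Cut" example has a direct counterpart in how a conjunctive (AND) query can be evaluated: starting from the term with the fewest postings shrinks the candidate set as early as possible. The sketch below assumes a tiny in-memory index; the `INDEX` contents and the function name `and_query` are hypothetical.

```python
# Hypothetical inverted index: term -> ids of documents containing it.
INDEX = {
    "the": [1, 2, 3, 4, 5, 6, 7, 8],
    "lunar": [2, 4, 7],
    "excursion": [4, 7, 8],
}

def and_query(terms, index):
    """Intersect postings lists, starting with the rarest term."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    candidates = set(postings[0])       # rarest term: smallest candidate set
    for plist in postings[1:]:
        candidates &= set(plist)        # each step can only shrink the set
        if not candidates:
            break                       # nothing left to match
    return sorted(candidates)

print(and_query(["the", "lunar", "excursion"], INDEX))
```

With this index the search starts from the three documents containing "lunar" instead of the eight containing "the", and returns documents 4 and 7. A term that is absent from the index empties the candidate set immediately.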

2004.09.07 - SLIDE 33 IS 202 – FALL 2004

Search Formulation Tactics

• Specify
  – Use terms that are as specific as possible

• Exhaust
  – Use all possible elements in a query

• Reduce
  – Subtract elements from a query

• Parallel
  – Use synonyms and parallel terms

• Pinpoint
  – Reducing parallel terms and refocusing query

• Block
  – To reject or block some terms, even at the cost of losing some relevant documents
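
Several of these tactics map onto Boolean set operations: Parallel widens a concept with OR, Exhaust adds another required element with AND, Block removes documents with NOT, and Pinpoint drops parallel terms again. The sketch below shows that mapping on a hypothetical five-document index; the postings and helper names are assumptions, and the mapping is one reading of the tactics, not Bates' formal definitions.

```python
# Hypothetical index: term -> set of ids of documents containing it.
INDEX = {
    "ufo": {1, 2},
    "saucer": {2, 3},
    "spaceship": {4},
    "satellite": {4, 5},
    "sighting": {1, 2, 3, 5},
}

def any_of(terms):
    """OR: documents containing at least one of the terms."""
    result = set()
    for term in terms:
        result |= INDEX.get(term, set())
    return result

# PARALLEL: widen the concept with synonyms and parallel terms.
parallel = any_of(["ufo", "saucer", "spaceship", "satellite"])

# EXHAUST: add another required element to the query.
exhaust = parallel & INDEX["sighting"]

# BLOCK: reject a term, accepting that some relevant documents are lost.
block = exhaust - INDEX["satellite"]

# PINPOINT: cut the parallel terms back down and refocus.
pinpoint = any_of(["ufo"]) & INDEX["sighting"]

for name, docs in [("parallel", parallel), ("exhaust", exhaust),
                   ("block", block), ("pinpoint", pinpoint)]:
    print(f"{name:9s} {sorted(docs)}")
```

Blocking "satellite" removes document 5 even though it mentions a sighting, which is the cost the slide warns about.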

2004.09.07 - SLIDE 34 IS 202 – FALL 2004

Term Tactics

• Move around a thesaurus
  – Superordinate, subordinate, coordinate
  – Neighbor (semantic or alphabetic)
  – Trace – pull out terms from information already seen as part of search (titles, etc.)
  – Morphological and other spelling variants
  – Antonyms (contrary)

2004.09.07 - SLIDE 35 IS 202 – FALL 2004

Additional Considerations (Bates 79)

• More detail is needed about short-term cost/benefit decision rule strategies

• When to stop?
  – How to judge when enough information has been gathered?
  – How to decide when to give up an unsuccessful search?
  – When to stop searching in one source and move to another?

2004.09.07 - SLIDE 36 IS 202 – FALL 2004

Implications

• Search interfaces should make it easy to store intermediate results

• Interfaces should make it easy to follow trails with unanticipated results (and find your way back)

• This all makes evaluation of the search, the interface and the search process more difficult

2004.09.07 - SLIDE 37 IS 202 – FALL 2004

More Later…

• Later in the course:
  – More on Search Process and Strategies
  – User interfaces to improve IR process
  – Incorporation of Content Analysis into better systems

2004.09.07 - SLIDE 38 IS 202 – FALL 2004

Restricted Form of the IR Problem

• The system has available only pre-existing, “canned” text passages

• Its response is limited to selecting from these passages and presenting them to the user

• It must select, say, 10 or 20 passages out of millions or billions!

2004.09.07 - SLIDE 39 IS 202 – FALL 2004

Information Retrieval

• Revised Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries

• This set of assumptions underlies the field of Information Retrieval

2004.09.07 - SLIDE 40 IS 202 – FALL 2004

Relevance (Introduction)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely
    • Who is buried in Grant’s tomb? Grant? or no one?
  – Partially answer question
    • Where is Danville? Near Walnut Creek.
    • Where is Dublin?
  – Suggest a source for more information.
    • What is lymphedema? Look in this Medical Dictionary.
  – Give background information
  – Remind the user of other knowledge
  – Others...

2004.09.07 - SLIDE 41 IS 202 – FALL 2004

Relevance

• “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.”

(Saracevic, 1975, p. 324)

2004.09.07 - SLIDE 42 IS 202 – FALL 2004

Define your own relevance

• Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor

• Where…

From Saracevic, 1975 and Schamber 1990

2004.09.07 - SLIDE 43 IS 202 – FALL 2004

A. Gages

• Measure

• Degree

• Extent

• Judgement

• Estimate

• Appraisal

• Relation

2004.09.07 - SLIDE 44 IS 202 – FALL 2004

B. Aspect

• Utility

• Matching

• Informativeness

• Satisfaction

• Appropriateness

• Usefulness

• Correspondence

2004.09.07 - SLIDE 45 IS 202 – FALL 2004

C. Object judged

• Document

• Document representation

• Reference

• Textual form

• Information provided

• Fact

• Article

2004.09.07 - SLIDE 46 IS 202 – FALL 2004

D. Frame of reference

• Question

• Question representation

• Research stage

• Information need

• Information used

• Point of view

• Request

2004.09.07 - SLIDE 47 IS 202 – FALL 2004

E. Assessor

• Requester

• Intermediary

• Expert

• User

• Person

• Judge

• Information specialist
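
The five lists A to E fill the slots of the "Define your own relevance" template, so picking one entry from each list yields a definition. The sketch below enumerates them. The word lists are taken from the slides; the exact sentence wording, the helper `article`, and the function name `definitions` are assumptions made for illustration.

```python
from itertools import product

GAGES = ["measure", "degree", "extent", "judgement",
         "estimate", "appraisal", "relation"]
ASPECTS = ["utility", "matching", "informativeness", "satisfaction",
           "appropriateness", "usefulness", "correspondence"]
OBJECTS = ["document", "document representation", "reference",
           "textual form", "information provided", "fact", "article"]
FRAMES = ["question", "question representation", "research stage",
          "information need", "information used", "point of view", "request"]
ASSESSORS = ["requester", "intermediary", "expert", "user",
             "person", "judge", "information specialist"]

def article(phrase):
    """Crude a/an choice; adequate for the phrases listed above."""
    return "an" if phrase[0] in "aeio" else "a"

def definitions():
    """Yield one sentence per combination of (A, B, C, D, E)."""
    for gage, aspect, obj, frame, assessor in product(
            GAGES, ASPECTS, OBJECTS, FRAMES, ASSESSORS):
        yield (f"Relevance is the {gage} of {aspect} existing between "
               f"{article(obj)} {obj} and {article(frame)} {frame} "
               f"as judged by {article(assessor)} {assessor}.")

all_definitions = list(definitions())
print(len(all_definitions))
print(all_definitions[0])
print(all_definitions[-1])
```

Seven choices in each of five slots give 7^5 = 16,807 possible definitions, which is one way to see why the slides treat relevance as hard to pin down.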

2004.09.07 - SLIDE 48 IS 202 – FALL 2004

Lecture Overview

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

• Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2004.09.07 - SLIDE 49 IS 202 – FALL 2004

Visions of IR Systems

• Rev. John Wilkins, 1600’s: The Philosophic Language and tables

• Wilhelm Ostwald and Paul Otlet, 1910’s: The “monographic principle” and Universal Classification

• Emanuel Goldberg, 1920’s - 1940’s

• H.G. Wells, “World Brain: The idea of a permanent World Encyclopedia.” (Introduction to the Encyclopédie Française, 1937)

• Vannevar Bush, “As We May Think.” Atlantic Monthly, 1945.

2004.09.07 - SLIDE 50 IS 202 – FALL 2004

Card-Based IR Systems

• Uniterm (Casey, Perry, Berry, Kent: 1958)
  – Developed and used from the mid 1940’s

EXCURSION 43821
   90  241   52   63   34   25   66   17   58   49
  130  281   92   83   44   75   86   57   88  119
  640       122   93  104  115  146   97  158  139
  870       342                      157  178  199
                                     207  248  269
                                          298

LUNAR 12457
  110  181   12   73   44   15   46    7   28   39
  430  241   42  113   74   85   76   17   78   79
  820  761  602  233  134   95  136   37  118  109
       901  982       194  165       127  198  179
                                     377  288
                                     407
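
A search on both terms in a card system of this kind comes down to finding the document numbers that appear on both cards. The sketch below does that comparison with set intersection. The numbers are read off the two cards on the slide, on the assumption that digit runs fused in the transcript (such as "49130") split into separate entries ("49" and "130"); the function name `coincidence` is hypothetical.

```python
# Document numbers posted on each card, as read from the slide.
EXCURSION = [
    90, 241, 52, 63, 34, 25, 66, 17, 58, 49,
    130, 281, 92, 83, 44, 75, 86, 57, 88, 119,
    640, 122, 93, 104, 115, 146, 97, 158, 139,
    870, 342, 157, 178, 199, 207, 248, 269, 298,
]
LUNAR = [
    110, 181, 12, 73, 44, 15, 46, 7, 28, 39,
    430, 241, 42, 113, 74, 85, 76, 17, 78, 79,
    820, 761, 602, 233, 134, 95, 136, 37, 118, 109,
    901, 982, 194, 165, 127, 198, 179, 377, 288, 407,
]

def coincidence(*cards):
    """Return the document numbers that appear on every card."""
    common = set(cards[0])
    for card in cards[1:]:
        common &= set(card)
    return sorted(common)

print(coincidence(EXCURSION, LUNAR))   # [17, 44, 241]
```

Under that reading, documents 17, 44, and 241 are indexed under both LUNAR and EXCURSION. This is the same operation as a Boolean AND over two postings lists.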

2004.09.07 - SLIDE 51 IS 202 – FALL 2004

Card Systems

• Batten Optical Coincidence Cards (“Peek-a-Boo Cards”), 1948

[Figure: two cards labeled Lunar and Excursion]

2004.09.07 - SLIDE 52 IS 202 – FALL 2004

Card Systems

• Zatocode (edge-notched cards) Mooers, 1951

Document 1 Title: lksd ksdj sjd sjsjfkl Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

Document 200 Title: Xksd Lunar sjd sjsjfkl Author: Jones, R. Abstract: Lunar uejm jshy ksd jh uyw hhy jha jsyhe

Document 34 Title: lksd ksdj sjd Lunar Author: Smith, J. Abstract: lksf uejm jshy ksd jh uyw hhy jha jsyhe

2004.09.07 - SLIDE 53 IS 202 – FALL 2004

Computer-Based Systems

• Bagley’s 1951 MS thesis from MIT suggested that searching 50 million item records, each containing 30 index terms, would take approximately 41,700 hours
  – Due to the need to move and shift the text in core memory while carrying out the comparisons

• 1957 – Desk Set with Katharine Hepburn and Spencer Tracy – EMERAC
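
The scale of Bagley's estimate is easier to see per comparison. The back-of-the-envelope check below assumes a sequential scan in which every index term of every record is examined once; that assumption, and the variable names, are not from the slide.

```python
records = 50_000_000           # item records (from the slide)
terms_per_record = 30          # index terms per record (from the slide)
hours = 41_700                 # estimated search time (from the slide)

# Assumption: one comparison per index term per record.
term_comparisons = records * terms_per_record
seconds = hours * 3600
seconds_per_term = seconds / term_comparisons
years = hours / (24 * 365)

print(f"{term_comparisons:,} term comparisons")
print(f"{seconds_per_term:.3f} seconds per term")
print(f"{years:.2f} years of continuous running")
```

Under that assumption the estimate works out to about a tenth of a second per index term and nearly five years for a single pass over the file.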

2004.09.07 - SLIDE 54 IS 202 – FALL 2004

Historical Milestones in IR Research

• 1958 Statistical Language Properties (Luhn)
• 1960 Probabilistic Indexing (Maron & Kuhns)
• 1961 Term association and clustering (Doyle)
• 1965 Vector Space Model (Salton)
• 1968 Query expansion (Rocchio, Salton)
• 1972 Statistical Weighting (Sparck Jones)
• 1975 2-Poisson Model (Harter, Bookstein, Swanson)
• 1976 Relevance Weighting (Robertson, Sparck Jones)
• 1980 Fuzzy sets (Bookstein)
• 1981 Probability without training (Croft)

2004.09.07 - SLIDE 55 IS 202 – FALL 2004

Historical Milestones in IR Research (cont.)

• 1983 Linear Regression (Fox)
• 1983 Probabilistic Dependence (Salton, Yu)
• 1985 Generalized Vector Space Model (Wong, Raghavan)
• 1987 Fuzzy logic and RUBRIC/TOPIC (Tong, et al.)
• 1990 Latent Semantic Indexing (Dumais, Deerwester)
• 1991 Polynomial & Logistic Regression (Cooper, Gey, Fuhr)
• 1992 TREC (Harman)
• 1992 Inference networks (Turtle, Croft)
• 1994 Neural networks (Kwok)

2004.09.07 - SLIDE 56 IS 202 – FALL 2004

Boolean IR Systems

• Synthex at SDC, 1960
• Project MAC at MIT, 1963 (interactive)
• BOLD at SDC, 1964 (Harold Borko)
• 1964 New York World’s Fair – Becker and Hayes produced system to answer questions (based on airline reservation equipment)
• SDC began production for a commercial service in 1967 – ORBIT
• NASA-RECON (1966) becomes DIALOG
• 1972 Data Central/Mead introduced LEXIS – Full text
• Online catalogs – late 1970’s and 1980’s

2004.09.07 - SLIDE 57 IS 202 – FALL 2004

The Internet and the WWW

• Gopher, Archie, Veronica, WAIS
• Tim Berners-Lee, 1991 creates WWW at CERN – originally hypertext only
• Web-crawler
• Lycos
• Alta Vista
• Inktomi
• Google
• (and many others)

2004.09.07 - SLIDE 58 IS 202 – FALL 2004

Information Retrieval – Historical View

Research

• Boolean model, statistics of language (1950’s)
• Vector space model, probabilistic indexing, relevance feedback (1960’s)
• Probabilistic querying (1970’s)
• Fuzzy set/logic, evidential reasoning (1980’s)
• Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)

Industry

• DIALOG, Lexis-Nexis
• STAIRS (Boolean based)
• Information industry (O($B))
• Verity TOPIC (fuzzy logic)
• Internet search engines (O($100B??)) (vector space, probabilistic)

2004.09.07 - SLIDE 59 IS 202 – FALL 2004

Lecture Overview

• Introduction to Information Retrieval

• The Information Seeking Process

• Information Retrieval History and Developments

• Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Fred Gey

2004.09.07 - SLIDE 60 IS 202 – FALL 2004

Mano Marks on MIR

• The authors make a distinction between data retrieval and information retrieval. What is that distinction? When would data retrieval be more appropriate than information retrieval?

• When would information retrieval be more appropriate?

• In this context, what is data? What is information?

2004.09.07 - SLIDE 61 IS 202 – FALL 2004

Melissa Chan on Bates

• Bates published this berry-picking article in 1989, stating that real-life queries tend to shift and evolve as a user retrieves information. How do Bates’ search strategies of footnote chasing, citation searching, journal run, area scanning, subject searches, and author searches parallel a research search on the Internet/online libraries today? Which methods do you use more frequently?

• Online Libraries 15 Years Later...Would you need to redesign Berkeley's online library to fit the search methods listed by Bates? Does the current design limit or expand your ability to "berry pick" among the library collections? See http://melvyl.cdlib.org/

2004.09.07 - SLIDE 62 IS 202 – FALL 2004

Irina Lib on Berlin

• The authors of TeamInfo put a lot of effort into organizing information into categories to minimize searching. With Google advocating the "search, not sort" approach to e-mail, do you think this approach works for a group memory system? Do you think it works well for individual systems?

• TeamInfo was tested on a relatively small, homogeneous group of people. Do you think a system such as TeamInfo would work well for larger, more heterogeneous groups? What problems, if any, would arise?

2004.09.07 - SLIDE 63 IS 202 – FALL 2004

Jen King on Munro

• What are the possible flaws with using social navigation (“navigation towards a cluster of people or navigation because other people have looked at something”) as a theoretical framework for design? One suggestion: if we base a design upon how an aggregate of people appear to use something, we will inevitably exclude some portion of the audience that doesn’t conform to the norm (Amazon.com recommendations are a possible example of this phenomenon).

• Non-verbal cues are an important element of human communication. Could social navigation help provide the contextual cues that non-verbal communication provides in helping individuals comprehend information?

2004.09.07 - SLIDE 64 IS 202 – FALL 2004

Jen King on Munro

• The central point of social navigation made in the reading is a shift from thinking about computers as external objects humans act upon to a ubiquitous computing environment where humans are engaged with computers in many contexts, both individually and as part of a social group. The authors note that an alternate design possibility includes a “move away from ‘dead’ information spaces we see on the Internet today and in every way possible open up the spaces for seeing other users — both directly and indirectly,” (p.6) or in other words, creating a “virtual reality” where the presence of other people (and not merely unidirectional web pages) define the environment. Have you encountered any computerized social environments that you thought worked well? If not, how did they fail? Do you agree that interacting directly with other users online is the future of information spaces?

2004.09.07 - SLIDE 65 IS 202 – FALL 2004

Next Time

• Boolean Queries and Text Processing

• Readings (note – slight rearrangement of the web site and readings)
  – (Background) MIR Ch. 2 and Ch. 4
  – How to Use Controlled Vocabularies More Effectively in Online Searching (Bates)
  – Improving Full-Text Precision on Short Queries using Simple Constraints (Hearst)