University of Malta CSA1013: Information Search and Retrieval © 2003- Chris Staff

Post on 11-Jan-2016

cstaff@cs.um.edu.mt, University of Malta

CSA1013: Information Search and Retrieval © 2003- Chris Staff

CSA1013

Historical Perspectives of Information Search and Retrieval

Dr. Christopher Staff
Department of Computer Science & AI
University of Malta

Aims and Objectives

• What is Information Search and Retrieval?

• What’s the “state-of-the-art”?

• How did we get here?

• What are the issues?

• Where are we likely to go next?

What’s Information Search and Retrieval?

• What’s information?
  – Structured vs. unstructured

• Where is it?

• Question answering vs. information lack or information need

What’s the “state-of-the-art”?

• Information Retrieval in the “real” world
  – Web-based search engines
    • Google, AllTheWeb, AltaVista, etc.

• Web directories
  – Yahoo, Excite, etc.

What’s the “state-of-the-art”?

• Google, and Google-like search engines
  – Index > 24 billion web pages (pdf, doc, html, …)
  – User expresses “Query”
    • terms, natural language query, etc.
  – System “compares” query to indexed documents
  – Returns “list” of “relevant” documents
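
The index / query / compare / return cycle on this slide can be sketched in a few lines of Python. This is a toy illustration only, not how Google or any real engine works: the function names (`tokenize`, `build_index`, `search`), the three sample documents and the scoring rule (count of distinct query terms matched) are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical mini-collection, invented for this sketch.
docs = {
    "d1": "papyrus scrolls in the library of alexandria",
    "d2": "the printing press and the book index",
    "d3": "library retrieval systems and full-text indexing",
}
index = build_index(docs)
results = search(index, "library of alexandria")
print(results)
```

Here the “list” of “relevant” documents is simply every document sharing at least one term with the query, ordered by overlap. Real systems differ mainly in how they weight that comparison.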

What’s the “state-of-the-art”?

• Recent study by Jansen & Spink [Jansen] shows:
  – |Query| = 2.14 terms [Spink]
  – Queries with 1 term = 53%!
  – 54% of users are satisfied with first page of results (list of 10 documents)
  – 80% of users view no more than 10-20 results
  – 27.6% read only one document!
  – 66% read < 5 documents

Has life always been this good?

• It would seem that we’re living in information heaven

• Any info we seek is just a couple of query terms away

• Although the majority of queries appear to be “trivial”, the reality is quite different

Has life always been this good?

• What if we want to find all relevant information? (“The Invisible Web”)

• What if we want to find something that is difficult to describe?

• What if we don’t know what we’re looking for?
  – What tools do we use to find info in encyclopaedias, dictionaries, newspapers, reference manuals, novels and other books?

Here beginneth the history lesson…

• People have devised tools to find information again ever since we learnt to write things down…

• Think of information stored on your personal computers… how do you find something that you wrote last month, last year?

Prehistory!

• Well, nearly!

• Early writings
  – Papyrus scrolls
  – No paragraphs, page numbers, etc.
  – Couldn’t “scroll to the end” to read an index
  – Instead, Greek/Roman libraries used “sillybus”/“index” of title

Greeks/Romans

• 3rd century BC, Greeks probably use alphabetization in Library of Alexandria

• Around 2nd century BC (Rome), evidence of hierarchies of information/classification systems
  – Greeks probably earlier

• Also, Tables of Contents date from around 2nd century BC (Pliny the Elder reports before 79AD)

Printing Press

• Not much else was to happen until 1455, with the advent of the printing press

• Previously, still difficult to refer to information “within” a book, because copies were inaccurate
  – Info on one page in one book could be on a different page in other copies

Indices and the Printing Press

• Still, alphabetization was on initial letter, then on first four letters…

• Not until 18th Century did full alphabetization occur!

The Second World War and beyond

• In 1945, Vannevar Bush publishes “As We May Think” in the Atlantic Monthly

• In 1949, Warren Weaver writes that if Chinese is English + codification, then Machine Translation should be possible

• These give rise to “intelligent” and “statistical” (or surface-based) approaches to Information Search and Retrieval respectively (amongst other things :-))

Intelligent vs. Surface-based

“Concepts”: 1950’s

• Lay in waiting for years, because hardware/software not around

“Words”: 1950’s

• First approaches were “Key Words in Context” (KWIC)
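
To give a feel for KWIC, here is a minimal sketch that lists each occurrence of a keyword with a few words of context on either side. It is an assumption-laden simplification: a full KWIC index covers every significant word and sorts the entries, whereas this handles one keyword at a time. The function name `kwic`, the `width` parameter and the sample sentence are invented for the example.

```python
def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines


# Hypothetical sample text, invented for this sketch.
sample = "The index of a scroll was a label. A modern index lists terms."
entries = kwic(sample, "index")
for left, word, right in entries:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {word} | {right}")
```

Lining the keyword up in a central column is what lets a reader scan the contexts quickly; no understanding of the text is needed, which is why this sits on the “Words” side.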

Intelligent vs. Surface-based

“Concepts”: 1960’s

• Generality in AI (John McCarthy)

“Words”: 1960’s

• Boolean Search

• Measures of performance effectiveness

• Thesaural Lookup

• Vector Space Model

Intelligent vs. Surface-based

“Concepts”: 1970’s

• Expert Systems

• Still about “understanding” information and reasoning with and about it

“Words”: 1970’s

• Explosion in availability of electronic text collections

• Library Retrieval Systems

• Full-text indexing

• Probabilistic IR

• Relevance Feedback

Intelligent vs. Surface-based

“Concepts”: 1980’s

• Conceptual IR

• Knowledge Representation Languages

• Lenat’s CYC

• Contextual Reasoning

• 5th Generation Computing, Japan

• LSI feeds Statistical IR

“Words”: 1980’s

• OPACs

• IR used by non-specialists

• Extended Boolean IR

• Word Sense Disambiguation

• Statistical IR (LSI, etc.)

• Internet

Intelligent vs. Surface-based

“Concepts”: 1990’s

• Better language processing
  – information extraction
  – entity name recognition

• Advances in contextual reasoning, ontologies

“Words”: 1990’s

• WWW (1995 c. 10M pages, 2003 c. 3B!)

• Multimedia Indexing & Retrieval

• Web-based search engines

Intelligent vs. Surface-based

“Concepts”: 2000’s

• Semantic Web

“Words”: 2000’s

• Faster processors

• More memory

• Cheaper storage space

• More superficial comparisons

Intelligent vs. Surface-based

“Concepts”: The future

• Computers that can find precisely the information you seek
  – Even if the answer is non-obvious
  – Or the answer needs to be the result of reasoning

• MyLifeBits

“Words”: The future

• Computers that can approximate the information you seek
  – At much less cost
  – At the expense of “correctness”

• MyLifeBits

Main Issues

• Architecture to handle ever increasing numbers of docs + efficient data structures

• Freshness, indexing and retrieval speed (Efficient algorithms)

• What is “relevance”? (Better, cheaper and more accurate algorithms to understand what the user really wants)

Main References

• Paijmans, J.J., last updated 2004, “The Retrieval of Information from historical perspective”, http://pi0959.kub.nl/Paai/Onderw/V-I/Content/history.html

• American Society of Indexers, last updated 2005, “How Information Retrieval Started”, http://www.asindexing.org/site/history.shtml

• [Jansen] Jansen, B.J., and Spink, A., 2003, “An Analysis of Web Documents Retrieved and Viewed”, in Proceedings of the 4th International Conference on Internet Computing, Las Vegas, Nevada, 23-26 June 2003. http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/pages_viewed.pdf

• [Spink] Spink, A., et al., 2001, “Searching the Web: The Public and their Queries”, in JASIST 2001. http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html
