University of Malta CSA1013: Information Search and Retrieval © 2003- Chris Staff (24 slides)


CSA1013

Historical Perspectives of Information Search and Retrieval

Dr. Christopher Staff
Department of Computer Science & AI
University of Malta
cstaff@cs.um.edu.mt


Aims and Objectives

• What is Information Search and Retrieval?
• What’s the “state-of-the-art”?
• How did we get here?
• What are the issues?
• Where are we likely to go next?


What’s Information Search and Retrieval?

• What’s information?
  – Structured vs. unstructured
• Where is it?
• Question answering vs. Information lack or information need


What’s the “state-of-the-art”?

• Information Retrieval in the “real” world
  – Web-based search engines
    • Google, AllTheWeb, AltaVista, etc.
• Web directories
  – Yahoo, Excite, etc.


What’s the “state-of-the-art”?

• Google, and Google-like search engines
  – Index > 24 billion web pages (pdf, doc, html, …)
  – User expresses “Query”
    • terms, natural language query, etc
  – System “compares” query to indexed documents
  – Returns “list” of “relevant” documents
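The index/query/compare/return steps above can be illustrated with a toy example. This is only a minimal sketch, not how Google or any real engine works: the three documents, the names build_index and search, and the scoring rule (count of distinct query terms matched) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
docs = {
    "d1": "history of information retrieval",
    "d2": "web search engines index web pages",
    "d3": "information search on the web",
}
index = build_index(docs)
results = search(index, "web information")
print(results)
```

Here d3 comes first because it matches both query terms, while d1 and d2 each match only one.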


What’s the “state-of-the-art”?

• Recent study by Jansen & Spink [Jansen] shows:
  – Average query length |Query| = 2.14 terms [Spink]
  – Queries with 1 term = 53%!
  – 54% of users are satisfied with the first page of results (list of 10 documents)
  – 80% of users view no more than 10-20 results
  – 27.6% read only one document!
  – 66% read < 5 documents


Has life always been this good?

• It would seem that we’re living in information heaven

• Any info we seek is just a couple of query terms away

• Although the majority of queries appear to be “trivial”, the reality is quite different


Has life always been this good?

• What if we want to find all relevant information? (“The Invisible Web”)

• What if we want to find something that is difficult to describe?

• What if we don’t know what we’re looking for?
  – What tools do we use to find info in encyclopaedias, dictionaries, newspapers, reference manuals, novels and other books?


Here beginneth the history lesson…

• People have devised tools to find information again ever since we learnt to write things down…

• Think of information stored on your personal computers… how do you find something that you wrote last month, last year?


Prehistory!

• Well, nearly!
• Early writings
  – Papyrus scrolls
  – No paragraph, page numbers, etc
  – Couldn’t “scroll to the end” to read an index
  – Instead, Greek/Roman libraries used “sillybus”/“index” of title


Greeks/Romans

• 3rd century BC, Greeks probably use alphabetization in Library of Alexandria
• Around 2nd century BC (Rome), evidence of hierarchies of information/classification systems
  – Greeks probably earlier
• Also, Tables of Contents date from around 2nd century BC (Pliny the Elder reports before 79AD)


Printing Press

• Not much else was to happen until 1455, with the advent of the printing press

• Previously, it was still difficult to refer to information “within” a book, because copies were inaccurate
  – Info on one page in one book could be on a different page in other copies


Indices and the Printing Press

• Still, alphabetization was on initial letter, then on first four letters…

• Not until the 18th century did full alphabetization occur!


The Second World War and beyond

• In 1945, Vannevar Bush publishes “As We May Think” in the Atlantic Monthly

• In 1949, Warren Weaver writes that if Chinese is English + codification, then Machine Translation should be possible

• These give rise to “intelligent” and “statistical” (or surface-based) approaches to Information Search and Retrieval respectively (amongst other things :-))


Intelligent vs. Surface-based

“Concepts”, 1950’s (intelligent)
• Lay in waiting for years, because hardware/software not around

“Words”, 1950’s (surface-based)
• First approaches were “Key Words in Context” (KWIC)
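To give a feel for what a KWIC index looks like, here is a minimal sketch. It is an illustration only, not a reconstruction of any 1950’s system: the function name kwic, the small stopword list and the two sample titles are assumptions. Every non-stopword in a title becomes an index entry, shown with the words to its left and right.

```python
def kwic(titles, stopwords=frozenset({"a", "an", "and", "in", "of", "the", "to"})):
    """Return (keyword, left context, right context) rows sorted by keyword."""
    rows = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in stopwords:
                continue  # stopwords are never used as index keywords
            left = " ".join(words[:i])
            right = " ".join(words[i + 1:])
            rows.append((word.lower(), left, right))
    return sorted(rows)

# Hypothetical titles to index.
rows = kwic(["The Retrieval of Information", "Searching the Web"])
for keyword, left, right in rows:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {keyword} | {right}")
```

The keyword column ends up in alphabetical order (information, retrieval, searching, web), so a reader can scan for a word and see the context in which it occurs.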


Intelligent vs. Surface-based

1960’s (intelligent)
• Generality in AI (John McCarthy)

1960’s (surface-based)
• Boolean Search
• Measures of performance effectiveness
• Thesaural Lookup
• Vector Space Model
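The difference between Boolean Search and the Vector Space Model can be sketched on a toy collection. This is a simplified illustration under stated assumptions: the three documents and the names boolean_and, cosine and vector_rank are hypothetical, and the vectors use raw term counts rather than the tf-idf weights usually associated with the model.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "boolean search in library systems",
    "d2": "vector space model for text search",
    "d3": "thesaurus lookup for library catalogues",
}
tokens = {d: text.split() for d, text in docs.items()}

def boolean_and(terms):
    """Boolean retrieval: ids of documents containing every term."""
    return sorted(d for d, words in tokens.items() if all(t in words for t in terms))

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_rank(query):
    """Vector space retrieval: rank documents by cosine similarity to the query."""
    q = Counter(query.split())
    scored = [(cosine(q, Counter(words)), d) for d, words in tokens.items()]
    return [d for score, d in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]

and_hits = boolean_and(["library", "search"])
ranked = vector_rank("library search")
print(and_hits, ranked)
```

The Boolean AND query returns only d1, the one document containing both terms, with no ordering. The vector space ranking returns all three documents, ordered by how similar each is to the query, so partial matches are still retrieved.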


Intelligent vs. Surface-based

1970’s (intelligent)
• Expert Systems
• Still about “understanding” information and reasoning with and about it

1970’s (surface-based)
• Explosion in availability of electronic text collections
• Library Retrieval Systems
• Full-text indexing
• Probabilistic IR
• Relevance Feedback


Intelligent vs. Surface-based

1980’s (intelligent)
• Conceptual IR
• Knowledge Rep Langs
• Lenat’s CYC
• Contextual Reasoning
• 5th Generation Computing, Japan
• LSI feeds Statistical IR

1980’s (surface-based)
• OPACs
• IR used by non-specialists
• Extended Boolean IR
• Word Sense Disambiguation
• Statistical IR (LSI, etc)
• Internet


Intelligent vs. Surface-based

1990’s (intelligent)
• Better language processing
• information extraction
• entity name recognition
• Advances in contextual reasoning, ontologies

1990’s (surface-based)
• WWW (1995 c. 10M pages, 2003 c. 3B!)
• Multimedia Indexing & Retrieval
• Web-based search engines


Intelligent vs. Surface-based

2000’s (intelligent)
• Semantic Web

2000’s (surface-based)
• Faster processors
• More memory
• Cheaper storage space
• More superficial comparisons


Intelligent vs. Surface-based

The future (intelligent)
• Computers that can find precisely the information you seek
  – Even if the answer is non-obvious
  – Or the answer needs to be the result of reasoning
• MyLifeBits

The future (surface-based)
• Computers that can approximate the information you seek
  – At much less cost
  – At the expense of “correctness”
• MyLifeBits


Main Issues

• Architecture to handle ever increasing numbers of docs + efficient data structures

• Freshness, indexing and retrieval speed (Efficient algorithms)

• What is “relevance”? (Better, cheaper and more accurate algorithms to understand what the user really wants)


Main References

• Paijmans, J.J., last updated 2004, “The Retrieval of Information from historical perspective”, http://pi0959.kub.nl/Paai/Onderw/V-I/Content/history.html

• American Society of Indexers, last updated 2005, “How Information Retrieval Started”, http://www.asindexing.org/site/history.shtml

• [Jansen] Jansen, B.J., and Spink, A., 2003, ‘An Analysis of Web Documents Retrieved and Viewed’, in Proceedings of the 4th International Conference on Internet Computing, Las Vegas, Nevada, 23-26 June 2003. http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/pages_viewed.pdf

• [Spink] Spink, A., et al., 2001, ‘Searching the Web: The Public and their Queries’, in JASIST 2001. http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html