Information Retrieval: Overview and Introduction

All slides unless specifically mentioned are copyright © Anton Leuski & Donald Metzler

Tuesday, January 10, 2012


Administrativa
What is Information Retrieval (IR)?
Issues in IR
Dimensions of IR
Course goals


What?

§ CSCI 599. Special Topics. Applications of Natural Language Processing: Information Retrieval

§ Two other Applications of Natural Language Processing (NLP) courses
  – Machine Translation
  – Information Extraction

§ Related courses
  – CSCI 544. Natural Language Processing
  – CSCI 562. Empirical Methods in Natural Language Processing
  – CSCI 572. Information Retrieval and Web Search Engines
  – CSCI 599. Data Mining and Statistical Inference
  – CSCI 599. Social Media Analysis


Who?

§ Anton Leuski
  – Institute for Creative Technologies

§ Donald Metzler
  – Information Sciences Institute


Where?

§ Here: GFS 118

§ Web: http://projects.ict.usc.edu/nld/ir-class/
  – schedule
  – lecture notes
  – homework assignments
  – discussions


When?

§ Every Tuesday and Thursday, 3:30-4:50 PM.

§ Office hours: after each lecture

§ See the schedule on the web site for more details


Grading

§ 3 programming/homework assignments: 30%

§ Midterm exam: 20%

§ Final exam: 20%

§ Final project: 25%

§ Discussion participation: 5%


Assignments

§ Homework tasks might include
  – modifying the "ranking function" or "indexer" of an open source information retrieval toolkit (Lucene) for some search task
  – writing code to cluster documents based on their similarity
  – writing code to automatically evaluate the quality of search results
  – developing a system to automatically summarize a stream of Twitter messages

§ Framework: Lucene
  – http://lucene.apache.org/java/docs/index.html
  – Java-based, open source search engine

§ Final project
  – we will announce a number of topics to choose from in the middle of the semester
  – you can propose your own project, but the topic has to be approved by us


Reading

§ Books
  – W. B. Croft, D. Metzler, and T. Strohman. Search Engines: Information Retrieval in Practice. 2009.
  – S. Buettcher, C. L. A. Clarke, G. V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. 2010.
  – C. D. Manning, P. Raghavan and H. Schütze. Introduction to Information Retrieval. 2008.
  – C. J. van Rijsbergen. Information Retrieval. 1979.
    § http://www.dcs.gla.ac.uk/Keith/Preface.html
  – I. H. Witten, A. Moffat, T. C. Bell. Managing Gigabytes. 1999.
  – A. Moffat, J. Zobel, D. Hawking. Recommended Reading for IR Research Students. 2004.

§ Papers
  – TBA


Example: Web Search

§ Document (web page) retrieval in response to a query

§ Quite effective (at some things)

§ Highly visible (mostly)

§ Commercially successful (some of them)

§ Is that it?


Information Retrieval

§ “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

§ To solve the “information overload” problem

§ IR is interdisciplinary
  – computer science
  – mathematics
  – information science
  – information architecture
  – cognitive psychology
  – linguistics
  – statistics


History

§ Since the beginning of the written word, people have tried to organize information

§ 3rd century BC: Library of Alexandria

§ 1689: Vincentius Placcius invented a note-taking machine.

§ 1880s-1890s: Herman Hollerith invents the recording of data on a machine-readable medium

§ 1920s-1930s: Emanuel Goldberg submits patents for his "Statistical Machine", a document search engine that used photoelectric cells and pattern recognition to search the metadata on rolls of microfilmed documents.


History

§ 1945: MEMory EXtender (Memex) by Vannevar Bush
  – hypothetical electro-mechanical system
  – “Proto-hypertext”
    § lateral browsing of microfilms following links between individual frames (associative trails)
  – Features
    § extending
    § storing
    § consulting the records
  – Missing features
    § search
    § metadata


History

§ 1950: The term “information retrieval” appears to have been coined by Calvin Mooers.

§ 1950s-1960s: First automated IR systems.
  – SMART
  – MEDLARS
    § MeSH

§ 1970s: First online IR systems.
  – MEDLINE.
  – Lockheed's Dialog.
  – Hypertext.

§ 1978: 1st SIGIR conference

§ 1989: WWW proposals

§ 1992: 1st TREC conference

§ mid-1990s: First Web search engines


Knowledge Navigator

§ An IR system mockup by Apple Computer from 1988

§ A device that can access a large networked database of hypertext information, and use software agents to assist searching for information

§ http://www.youtube.com/watch?v=QRH8eimU_20


IR is not databases

            Databases                                 IR
Data        structured                                unstructured
Fields      well-defined semantics (SSN, age, ...)    no well-defined semantics (text fields)
Queries     well-formed (relational algebra, SQL)     free text, some fuzzy operators
Matching    exact (results are always “correct”)      imprecise

Example database query: SELECT * FROM Accounts WHERE balance > 50000 ORDER BY name;

Example IR query: bank scandals in California
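
The database side of this comparison can be tried directly. The following is a minimal sketch using Python's built-in sqlite3 module; the table contents are made up for illustration and are not part of the slides. The point is that the result is exact: every returned row satisfies the predicate, and there is no notion of a "better" or "worse" match.

```python
import sqlite3

# Hypothetical in-memory table; the names and balances are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (name TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO Accounts VALUES (?, ?)",
    [("Carol", 72000), ("Alice", 51000), ("Bob", 1200)],
)

# Exact matching: a row is returned if and only if it satisfies the predicate.
rows = conn.execute(
    "SELECT name, balance FROM Accounts WHERE balance > 50000 ORDER BY name"
).fetchall()
print(rows)
conn.close()
```

A free-text query such as "bank scandals in California" has no such predicate to evaluate, which is why IR systems rank documents instead of filtering them.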


Question: Is grep an IR system?


Example: Text Matching

§ How do you measure “aboutness”?

§ Exact matching of words is not enough
  – Many different ways to write the same thing in a “natural language” like English
  – e.g., does a news story containing the text “bank director in LA steals funds” match the query “bank scandals in California”?
  – Some stories will be better matches than others
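
The example above can be checked in a few lines. This is only a sketch: the tokenizer and the overlap score are illustrative assumptions, not a method taken from the slides. It contrasts a grep-style verbatim test with a simple count of shared words.

```python
import re

def words(text):
    """Lowercase a string and split it into a set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

story = "bank director in LA steals funds"
query = "bank scandals in California"

# grep-style test: is the whole query string present verbatim?
exact_hit = query.lower() in story.lower()

# Word-level test: which query words also occur in the story?
shared = words(query) & words(story)
overlap = len(shared) / len(words(query))

print(exact_hit, sorted(shared), overlap)
```

The verbatim test fails outright, and the word-level test finds only "bank" and "in". Neither notices that "LA" relates to "California" or that "steals funds" suggests a scandal, which is the gap the slide is pointing at.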


Search Process

[Figure: the search process. An information need is represented as a query, and text objects are represented as indexed objects; comparing the two yields retrieved objects, which feed evaluation/feedback.]
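
The stages of the search process can be sketched as a toy program. Everything here is an illustrative assumption (the three-document collection, the tokenizer, and scoring by number of shared terms); it is not how Lucene or any particular engine works. It only shows representation, indexing, and comparison as separate steps.

```python
import re
from collections import defaultdict

def represent(text):
    """Representation step: lowercase text and reduce it to a set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Text objects: a tiny made-up collection.
docs = {
    1: "bank director in LA steals funds",
    2: "California bank scandals widen",
    3: "Dow takes another beating",
}

# Indexed objects: an inverted index from each term to the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in represent(text):
        index[term].add(doc_id)

def search(query):
    """Comparison step: rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in represent(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(search("bank scandals in California"))
```

The evaluation/feedback stage is left out of this sketch; it would compare the returned ranking against judgments of which documents are actually relevant.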


Issues in IR

§ Information need and user interaction

§ Relevance

§ Representation

§ Comparison

§ Evaluation

[Figure: the search process diagram (information need, query, text objects, indexed objects, retrieved objects), annotated with the issues listed above: representation, comparison, relevance, and evaluation/feedback.]


Users and Information Needs

§ Search is user-centered

§ Keyword queries are often poor descriptions of actual information needs

§ Interaction and context are important for understanding user intent

§ Query refinement techniques such as query expansion, query suggestion, and relevance feedback can improve ranking
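
As a rough illustration of query expansion, the sketch below adds related terms to a query before matching. The RELATED_TERMS table is hypothetical and hand-written for this example; a real system would derive such terms from a thesaurus, query logs, or relevance feedback.

```python
# Hypothetical hand-written expansion table (an assumption for illustration only).
RELATED_TERMS = {
    "california": ["la"],
    "scandals": ["steals", "fraud"],
}

def expand(query):
    """Return the query terms plus any related terms listed in RELATED_TERMS."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in RELATED_TERMS.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

story = set("bank director in la steals funds".split())
original = "bank scandals in California"

# Number of story words matched before and after expansion.
before = len(story & set(original.lower().split()))
after = len(story & set(expand(original)))
print(before, after)
```

With this table the number of matched words rises from two to four, which hints at why expansion can help when the user's keywords differ from the document's wording.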


Relevance

§ What is it?

§ Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine

§ Many factors influence a person's decision about what is relevant: e.g., task, context, novelty, style

§ Topical relevance (same topic) vs. user relevance (everything else)

§ Retrieval models define a view of relevance
  – Ranking algorithms used in search engines are based on retrieval models
  – Most models describe statistical properties of text rather than linguistic ones
    § i.e., counting simple text features such as words instead of parsing and analyzing the sentences
    § Statistical approach to text processing started with Luhn in the 50s
    § Linguistic features can be part of a statistical model


Representation

§ Most successful approaches are statistical
  – directly, or an effort to capture and use word probabilities

§ Why not natural language understanding?
  – computer understands documents and query and matches them
  – state of the art is brittle in unrestricted domains
  – can be highly successful in predictable settings, though
    § information extraction on terrorism/takeovers (MUC)
    § medical or legal settings with restricted vocabulary

§ Could use manually assigned headings
  – e.g., Library of Congress headings, Dewey Decimal headings
    § expensive and human agreement is not good
    § hard to predict what headings are “interesting”

§ Statistical and not lexical
  – count words
  – lexical information plays secondary role


Example: “Bag of Words”

§ Ignoring the word order

§ Popular and effective

§ Similar vocabulary → similar content

§ Consider reordering words in a headline
  – Random: beating takes points falling another Dow 355
  – Alphabetical: 355 another beating Dow falling points
  – “Interesting”: Dow points beating falling 355 another
  – Original: Dow takes another beating, falling 355 points
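
A bag of words can be modeled as a multiset of tokens. The sketch below uses Python's collections.Counter with a simple tokenizer (an assumption; real systems tokenize more carefully) to show that the random reordering and the original headline have exactly the same representation.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each lowercase token to its count, discarding order and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

original = "Dow takes another beating, falling 355 points"
shuffled = "beating takes points falling another Dow 355"
alphabetical = "355 another beating Dow falling points"

same_bag = bag_of_words(original) == bag_of_words(shuffled)

# The alphabetical version as listed omits "takes", so its bag differs by one word.
missing = bag_of_words(original) - bag_of_words(alphabetical)
print(same_bag, dict(missing))
```

Under this representation a system cannot tell the original headline from its random reordering, yet the shared vocabulary is usually enough to judge what the text is about.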


What is this about?

§ 16 × said

§ 14 × McDonalds

§ 12 × fat

§ 11 × fries

§ 8 × new

§ 6 × company french nutrition

§ 5 × food oil percent reduce taste Tuesday

§ 4 × amount change health Henstenburg make obesity

§ 3 × acids consumer fatty polyunsaturated US

§ 2 × amounts artery Beemer cholesterol clogging director down eat estimates expert fast formula impact initiative moderate plans restaurant saturated trans win

§ 1 × ... added addition adults advocate affect afternoon age Americans Asia battling beef bet brand Britt Brook Browns calorie center chain chemically … crispy customers cut … vegetable weapon weeks Wendys Wootan worldwide years York

Copyright © James Allan


The start of the original

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. ...

http://money.cnn.com/2002/09/03/news/companies/mcdonalds/index.htm

Copyright © James Allan
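
A word-count list like the one on the previous slide can be produced with a few lines of code. This sketch runs on the headline and first sentence only, and its stopword list is an assumption made for illustration, so its counts are smaller than those on the slide, which were computed over the full article.

```python
import re
from collections import Counter

# A small illustrative stopword list (an assumption, not the one used for the slide).
STOPWORDS = {"the", "of", "in", "its", "is", "to", "as", "it", "all", "with"}

text = """McDonald's slims down spuds. Fast-food chain to reduce certain types
of fat in its french fries with new cooking oil. McDonald's Corp. is cutting
the amount of "bad" fat in its french fries nearly in half, the fast-food chain
said Tuesday as it moves to make all its fried menu items healthier."""

# Tokenize into lowercase words, keeping an apostrophe suffix such as "mcdonald's" together.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
counts = Counter(t for t in tokens if t not in STOPWORDS)

# Print the terms that occur more than once, most frequent first.
for term, n in counts.most_common():
    if n > 1:
        print(n, term)
```

Even on this short excerpt the repeated terms (the company name, "fat", "french", "fries") already suggest the topic.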


The Point?

§ Basis of most IR is a very simple approach
  – find words in documents
  – compare them to words in a query
  – this approach is very effective!

§ Other types of features are often used
  – phrases
  – link structure
  – named entities (people, locations, organizations)
  – special features (chemical names, product names)

§ Focus is on improving accuracy, speed
  – …and on extending ideas elsewhere


Comparison

§ Retrieval model
  – provides a mathematical framework for defining the matching process
  – includes explanation of assumptions
  – basis of many ranking algorithms
  – can be implicit

§ Some models that we will cover
  – boolean
  – vector space
  – inference networks
  – language models
  – relevance models


Evaluation

§ Experimental procedures and measures for comparing system output with user expectations

– Originated in Cranfield experiments in the 60s

§ IR evaluation methods now used in many fields

§ Typically use test collection of documents, queries, and relevance judgments

– Most commonly used are TREC collections

§ Recall and precision are two examples of effectiveness measures
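
Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved. The sketch below computes both for a made-up result list and a made-up set of relevance judgments (the document ids are hypothetical).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical system output and relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]

p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).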


IR is not Search Engines

§ A search engine is the practical application of information retrieval techniques to large scale text collections

§ Information Retrieval
  – Information needs: User interaction
  – Relevance: Effective ranking
  – Representation: How to represent things
  – Comparison: How to match things
  – Evaluation: Testing and measuring

§ Search Engines
  – Performance: Efficient search and indexing
  – Incorporating new data: Coverage and freshness
  – Scalability: Growing with data and users
  – Adaptability: Tuning for applications
  – Specific problems: e.g. Spam


Dimensions of IR

§ IR is not just for the Web

§ IR is not just search

§ 3 dimensions:
  – data
  – application/domain
  – task


Data

§ Text

§ Multiple languages
  – accessing Chinese collection using English

§ Scanned Text (handwritten or typed)
  – either word images or OCRed text with errors

§ Images
  – features?

§ Video
  – features?

§ Speech (audio)
  – ASR output (with errors)

§ Music
  – features?


Application

§ Web

§ Enterprise
  – like web, but smaller, more focused, more controlled

§ Desktop
  – smaller scale; different file formats; very user-centered

§ Forums
  – shorter than web; threads; typos

§ Social/twitter
  – short; threads; typos

§ P2P
  – distributed aspects


Application (continued)

§ Literature
  – the original domain; cross-references; citations

§ Legal
  – specific language; well-defined guidelines

§ Medical
  – similar to legal; unusual vocabulary

§ Personal Information Management (PIM)
  – contacts and schedules


Tasks

§ Search
  – collection is “static”, queries are “dynamic”

§ Filtering & Routing
  – think newswire; query is “static”, documents are “dynamic”

§ Detection & Tracking
  – newswire again; new topic discovery and tracking

§ Classification & Clustering
  – grouping similar documents together for analysis

§ Summarization
  – locating most important pieces

§ Question answering
  – factual information

§ Collaborative
  – recommender systems; think Amazon reviews
  – multi-agent search


Dimensions of IR

Content                                Applications    Tasks
Text                                   Web             Search
Multiple languages                     Enterprise      Filtering & Routing
Scanned Text (handwritten or typed)    Desktop         Detection & tracking
Images                                 Forum           Classification
Video                                  P2P             Question answering
Speech (audio)                         Literature      Summarization
Music                                  Legal           Collaborative
                                       PIM


Assignment

§ Watch the “Knowledge Navigator” video

§ Think about how you would build such a system
  – what are the tasks the system performs?
  – what are the challenges?

§ Write down
  – what IR dimensions that we mentioned are covered in the video?
  – what dimensions are covered that we did not mention?


Course Goals

§ Understand what IR is

§ Analyze core issues
  – ... and how they vary under different conditions ...

§ Consider core solutions
  – ... and how they can be applied under different conditions ...

§ Acquire some practical skills – how to apply that knowledge


Schedule

§ Core IR
  – search engines
  – architecture
  – text processing
  – indexes
  – retrieval models
  – evaluation
  – user modeling

§ Topics in IR
  – filtering
  – multimedia: image & audio
  – cross-lingual
  – web search & advertising
  – distributed & p2p
  – question answering
  – social
  – semi-structured


Summary

§ IR is a large interdisciplinary field with a long history

§ IR deals with many different data types, applications, and tasks

§ At the core of IR is the match or comparison operation
