Page 1: INFM 700: Session 7, Search (Part I): Introduction to Information Retrieval

Paul Jacobs
The iSchool, University of Maryland

Monday, November 9, 2009

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Page 2: Goals for Search Sessions

• Understand the basic issues in information retrieval (searching primarily unstructured text)
• Know the techniques generally used by modern search engines
• Learn how search engines can be used most effectively in information architecture

Page 3: Today’s Topics

• Introduction to Information Retrieval
• Keywords, inverted indices, and Boolean retrieval
• The vector space model, ranked retrieval
• Major issues
• Some additional tricks
• Examples: web search and site search

Sections: IR Intro, Boolean, Vector Space, Issues & Tricks

Page 4: Levels of Structure

• Different types of data
    • Structured data
    • Semi-structured data
    • Unstructured data
• How do you provide access to unstructured data?
    • Manually develop an organization system (add structure)
    • Provide search capabilities

Page 5: What is search?

• Search is query-based access
• How is this different from browsing?
• Things one can search on:
    • Content
    • Metadata
    • Organization systems
    • Labels
    • …

Page 6: Some Key Concepts

• Different search paradigms
    • Boolean, “keyword”
    • “Natural language” or “free text” (full text) search
    • Current search engines are primarily full text and statistical
• The fundamental challenge: words & concepts
• The basic method: weighting and context
• Other tricks (there are many!)
    • Structuring
    • Popularity and importance (of pages, documents)
    • Metadata and thesauri
    • User feedback

Page 7: The Central Problem in IR

[Diagram: the searcher has concepts and expresses them as a query; the authors have concepts and express them as documents.]

Do these represent the same concepts?

Page 8: Architecture of IR Systems

[Diagram: Online, the query passes through a representation function to give a query representation. Offline, the documents pass through a representation function to give a document representation, which is stored in the index. A comparison function matches the query representation against the index and returns hits.]

Page 9: How do we represent text?

• Remember: computers don’t “understand” documents or queries
• Simple, yet effective approach: “bag of words”
    • Treat all the words in a document as index terms
    • Assign a “weight” to each term based on “importance”
    • Disregard order, structure, meaning, etc. of the words
• Assumptions
    • Term occurrence is independent (of other terms)
    • Document relevance is independent (of other documents)
    • “Words” can be defined

Page 10: What’s a word?

Chinese: 天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。

Arabic: وقال مارك ريجيف - الناطق باسم الخارجية الإسرائيلية - إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982.

Russian: Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.

Hindi: भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फ़ीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है

Japanese: 日米連合で台頭中国に対処…アーミテージ前副長官提言

Korean: 조재영 기자 = 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의 보도를 부인했다.

Page 11: Sample Document

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

“Bag of Words”

14 × McDonald’s
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
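A minimal sketch of how a “bag of words” like the one above can be computed. The tokenizer (lowercasing, keeping a trailing apostrophe suffix) and the short sample sentence are assumptions made for illustration; the slides do not specify how tokens were extracted.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Word order and sentence structure are deliberately thrown away.
    """
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Hypothetical sample sentence, not the full article from the slide.
bag = bag_of_words("McDonald's slims down spuds. McDonald's fries, new fries, new oil.")
for term, count in bag.most_common():
    print(f"{count} x {term}")
```

Because only counts survive, any reordering of the same words produces the same bag.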

Page 12: Why does “bag of words” work (at all)?

• Words alone tell us a lot about content!
    • Words are our main tool for describing concepts
    • Words in context are especially powerful
• Getting beyond words is hard
• Structure usually (but not always) can be guessed from content
    • “355 back correction Dow pulls signaling”
    • “blind Venetian” vs. “Venetian blind”

Page 13: Boolean Retrieval

• Users express queries as a Boolean (logical) expression
    • “Terms” (usually words or phrases) joined by AND, OR, NOT
    • Can be arbitrarily nested
    • Difference between “term” and “keyword”?
• Retrieval is based on the notion of sets
    • Any given query divides the collection into two sets: retrieved, not-retrieved (complement)
    • Pure Boolean systems do not define an ordering of the results (no ranking)
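The set view of Boolean retrieval can be sketched with Python sets. The three-document collection below is hypothetical and exists only to show that a query splits the collection into a retrieved set and its complement, with no ordering inside either set.

```python
# Hypothetical mini-collection: document id -> set of index terms.
docs = {
    1: {"quick", "brown", "fox"},
    2: {"lazy", "dog"},
    3: {"quick", "dog", "fox"},
}
all_docs = set(docs)

def containing(term):
    """Return the set of document ids whose term set includes `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# (fox OR dog) AND NOT quick: union, then set difference.
retrieved = (containing("fox") | containing("dog")) - containing("quick")
not_retrieved = all_docs - retrieved   # the complement

print(sorted(retrieved), sorted(not_retrieved))
```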

Page 14: AND/OR/NOT

[Venn diagram: sets A, B, and C drawn inside the set of all documents.]

Page 15: Logic Tables

A OR B
         B=0  B=1
  A=0     0    1
  A=1     1    1

A AND B
         B=0  B=1
  A=0     0    0
  A=1     0    1

A NOT B (= A AND NOT B)
         B=0  B=1
  A=0     0    0
  A=1     1    0

NOT B
         B=0  B=1
          1    0

Page 16: Representing Documents

Document 1: The quick brown fox jumped over the lazy dog’s back.

Document 2: Now is the time for all good men to come to the aid of their party.

Stopword List: the, is, for, to, of

Term     Document 1   Document 2
aid          0            1
all          0            1
back         1            0
brown        1            0
come         0            1
dog          1            0
fox          1            0
good         0            1
jump         1            0
lazy         1            0
men          0            1
now          0            1
over         1            0
party        0            1
quick        1            0
their        0            1
time         0            1
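A small sketch of how the two incidence columns can be derived from the document text. The stopword list is the one on the slide; the `NORMALIZE` table is a hand-written stand-in for stemming (the slide indexes “jump” and “dog” rather than “jumped” and “dog’s”), and the alphabetical term order is an assumption that happens to reproduce the slide's bit strings.

```python
import re

STOPWORDS = {"the", "is", "for", "to", "of"}      # stopword list from the slide
NORMALIZE = {"jumped": "jump", "dog's": "dog"}    # assumed stand-in for a stemmer

def index_terms(text):
    """Tokenize, normalize, and drop stopwords; returns a set of index terms."""
    tokens = re.findall(r"[a-z]+(?:'s)?", text.lower().replace("’", "'"))
    terms = {NORMALIZE.get(tok, tok) for tok in tokens}
    return terms - STOPWORDS

doc1 = index_terms("The quick brown fox jumped over the lazy dog’s back.")
doc2 = index_terms("Now is the time for all good men to come to the aid of their party.")
vocabulary = sorted(doc1 | doc2)                  # alphabetical term order

def incidence(terms):
    """One bit per vocabulary term: 1 if the document contains it, else 0."""
    return "".join("1" if term in terms else "0" for term in vocabulary)

print(incidence(doc1))
print(incidence(doc2))
```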

Page 17: Boolean View of a Collection

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0

• Each column represents the view of a particular document: What terms are contained in this document?
• Each row represents the view of a particular term: What documents contain this term?
• To execute a query, pick out the rows corresponding to the query terms and then apply the logic table of the corresponding Boolean operator

Page 18: Sample Queries

Term                      Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
dog                         0      0      1      0      1      0      0      0
fox                         0      0      1      0      1      0      1      0
dog AND fox                 0      0      1      0      1      0      0      0     Doc 3, Doc 5
dog OR fox                  0      0      1      0      1      0      1      0     Doc 3, Doc 5, Doc 7
dog NOT fox                 0      0      0      0      0      0      0      0     empty
fox NOT dog                 0      0      0      0      0      0      1      0     Doc 7

Term                      Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
good                        0      1      0      1      0      1      0      1
party                       0      0      0      0      0      1      0      1
over                        1      0      1      0      1      0      1      1
good AND party              0      0      0      0      0      1      0      1     Doc 6, Doc 8
good AND party NOT over     0      0      0      0      0      1      0      0     Doc 6
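The sample queries can be reproduced by applying the logic tables bit by bit to the term rows. This is only a sketch: the helper names `and_`, `or_`, `not_`, and `hits` are made up for illustration, and each row is written as a list of eight bits, one per document.

```python
# Term rows for the 8-document collection (position i is Doc i+1).
rows = {
    "dog":   [0, 0, 1, 0, 1, 0, 0, 0],
    "fox":   [0, 0, 1, 0, 1, 0, 1, 0],
    "good":  [0, 1, 0, 1, 0, 1, 0, 1],
    "party": [0, 0, 0, 0, 0, 1, 0, 1],
    "over":  [1, 0, 1, 0, 1, 0, 1, 1],
}

def and_(a, b):
    return [x & y for x, y in zip(a, b)]

def or_(a, b):
    return [x | y for x, y in zip(a, b)]

def not_(a):
    return [1 - x for x in a]

def hits(row):
    """Translate a result row back into document names."""
    return [f"Doc {i}" for i, bit in enumerate(row, start=1) if bit]

print(hits(and_(rows["dog"], rows["fox"])))            # dog AND fox
print(hits(or_(rows["dog"], rows["fox"])))             # dog OR fox
print(hits(and_(rows["dog"], not_(rows["fox"]))))      # dog NOT fox
print(hits(and_(and_(rows["good"], rows["party"]), not_(rows["over"]))))
```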

Page 19: Inverted Index

The term-document matrix from Page 17 is converted into postings: for each term, the list of documents that contain it.

Term     Postings
aid      4, 8
all      2, 4, 6
back     1, 3, 7
brown    1, 3, 5, 7
come     2, 4, 6, 8
dog      3, 5
fox      3, 5, 7
good     2, 4, 6, 8
jump     3
lazy     1, 3, 5, 7
men      2, 4, 8
now      2, 6, 8
over     1, 3, 5, 7, 8
party    6, 8
quick    1, 3
their    1, 5, 7
time     2, 4, 6
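A sketch of building the postings from the matrix. It assumes the 17 terms are in alphabetical order and that each bit string is one document column (Doc 1 through Doc 8); under that reading the result agrees with the postings shown on the slide. The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

VOCAB = ["aid", "all", "back", "brown", "come", "dog", "fox", "good", "jump",
         "lazy", "men", "now", "over", "party", "quick", "their", "time"]

# One incidence string per document, Doc 1 to Doc 8.
COLUMNS = ["00110000010010110", "01001001001100001",
           "00110110110010100", "11001001001000001",
           "00010110010010010", "01001001000101001",
           "00110010010010010", "10001001001111000"]

def build_inverted_index(columns, vocab):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    # Visiting documents in ascending id order keeps every postings list sorted.
    for doc_id, column in enumerate(columns, start=1):
        for term, bit in zip(vocab, column):
            if bit == "1":
                index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(COLUMNS, VOCAB)
for term in VOCAB:
    print(f"{term:6s} {index[term]}")
```

Storing only the document ids where a term occurs is what makes the index compact: the zeros of the matrix are never written down.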

Page 20: Boolean Retrieval

• To execute a Boolean query:
    • Build query syntax tree
    • For each clause, look up postings
    • Traverse postings and apply Boolean operator
• Efficiency analysis
    • Postings traversal is linear (assuming sorted postings)
    • Start with shortest posting first

Example: ( fox or dog ) and quick

[Query syntax tree: AND at the root with children OR and quick; OR has children fox and dog.]

Postings:
  fox: 3 5 7
  dog: 3 5
  OR = union: 3 5 7
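The linear traversal of sorted postings can be sketched as two merge routines. The function names are hypothetical, and the postings for “quick” (documents 1 and 3) are taken from the inverted index on the previous page; the slide itself only works through the OR step.

```python
def intersect(a, b):
    """AND: walk two sorted postings lists in step, keeping ids found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two sorted postings lists, emitting each id once."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:                      # same id in both lists
            out.append(a[i])
            i += 1
            j += 1
    return out

postings = {"fox": [3, 5, 7], "dog": [3, 5], "quick": [1, 3]}

# ( fox or dog ) and quick, evaluated bottom-up over the syntax tree.
fox_or_dog = union(postings["fox"], postings["dog"])
result = intersect(fox_or_dog, postings["quick"])
print(fox_or_dog, result)
```

Each routine touches every posting at most once, which is why the cost is linear in the combined length of the two lists.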

Page 21: Why Boolean Retrieval Works

• Boolean operators approximate concepts
• How so?
    • AND can identify relationships between concepts (e.g., interest rate, web design)
    • OR can identify alternate terminology (e.g., interest percentage, HTML layout, etc.)
    • NOT can filter alternate meanings (e.g., conflict AND interest AND NOT rate, NOT spider)

Page 22: Why Boolean Retrieval Fails

• It’s really hard to come up with the “right” queries
    • Casual searchers have difficulty with the logic
    • Some concepts are just hard to express, e.g., “corporate mergers & acquisitions” (IBM acquired Lotus)
• Relevance is not absolute; some documents are more relevant, or more helpful, than others

Page 23: Ranked Retrieval in the Vector Space Model

• Order documents by how likely they are to be relevant to the information need
    • Estimate relevance(q, di)
    • Sort documents by relevance
    • Display sorted results, usually one screen at a time
• How do we estimate relevance?
    • Assume that document d is relevant to query q if they share terms in common
    • Replace relevance(q, di) with sim(q, di) (similarity)
    • Compute similarity of vector representations

Page 24: Vector Representation

• “Bags of words” can be represented as vectors
    • Why? Computational efficiency, ease of manipulation
    • Geometric metaphor: “arrows”
• A vector is a set of values recorded in any consistent order

“The quick brown fox jumped over the lazy dog’s back”

[ 1 1 1 1 1 1 1 1 2 ]

1st position corresponds to “back”
2nd position corresponds to “brown”
3rd position corresponds to “dog”
4th position corresponds to “fox”
5th position corresponds to “jump”
6th position corresponds to “lazy”
7th position corresponds to “over”
8th position corresponds to “quick”
9th position corresponds to “the”
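A sketch of turning a bag of words into a vector with a fixed term order. The token list is written out by hand with the normalization the slide implies (“jumped” becomes “jump”, “dog’s” becomes “dog”); `to_vector` is a hypothetical helper name.

```python
from collections import Counter

# The consistent order: position k of every vector refers to VOCAB[k].
VOCAB = ["back", "brown", "dog", "fox", "jump", "lazy", "over", "quick", "the"]

def to_vector(terms, vocab=VOCAB):
    """Count each term, then read the counts off in vocabulary order."""
    counts = Counter(terms)
    return [counts[term] for term in vocab]

terms = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog", "back"]
vector = to_vector(terms)
print(vector)
```

Any order works as long as every document and query uses the same one; alphabetical order is just convenient.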

Page 25: Vector Space Model

Assumption: Documents that are “close together” in vector space “talk about” the same things

[Figure: document vectors d1 through d5 drawn in a space with term axes t1, t2, and t3; the angles θ and φ are marked between vectors.]

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

Page 26: Similarity Metric

• How about |d1 - d2|?
• Instead of Euclidean distance, use the “angle” between the vectors
    • It all boils down to the inner product (dot product) of vectors

$$\cos(\theta) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|}$$

$$sim(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$

Page 27: Components of Similarity

• The “inner product” (aka dot product) is the key to the similarity function

$$\vec{d_j} \cdot \vec{d_k} = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$$

Example: [1 2 3 0 2] · [2 0 1 0 2] = 1×2 + 2×0 + 3×1 + 0×0 + 2×2 = 9

• The denominator handles document length normalization

$$|\vec{d_j}| = \sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}$$

Example: |[1 2 3 0 2]| = √(1 + 4 + 9 + 0 + 4) = √18 ≈ 4.24
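The two components combine into cosine similarity as sketched below. The vectors [1 2 3 0 2] and [2 0 1 0 2] are the reading of the slide's worked examples used here; the function names are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def dot(u, v):
    """Inner product: sum of pairwise products of term weights."""
    return sum(a * b for a, b in zip(u, v))

def length(v):
    """Euclidean length of a vector, used for normalization."""
    return math.sqrt(dot(v, v))

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denom = length(u) * length(v)
    return dot(u, v) / denom if denom else 0.0

d_j = [1, 2, 3, 0, 2]
d_k = [2, 0, 1, 0, 2]
print(dot(d_j, d_k), round(length(d_j), 2), round(cosine(d_j, d_k), 3))
```

Dividing by the lengths means a long document does not win simply because it contains more words.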

Page 28: Term Weighting

• Term weights consist of two components
    • Local: how important is the term in this doc?
    • Global: how important is the term in the collection?
• Here’s the intuition:
    • Terms that appear often in a document should get high weights
    • Terms that appear in many documents should get low weights
• How do we capture this mathematically?
    • Term frequency (local)
    • Inverse document frequency (global)

Page 29: TF.IDF Term Weighting

$$w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{n_i}$$

$w_{i,j}$: weight assigned to term i in document j
$\mathrm{tf}_{i,j}$: number of occurrences of term i in document j
$N$: number of documents in the entire collection
$n_i$: number of documents with term i

Page 30: TF.IDF Example

Term frequencies (tf) for a collection of N = 4 documents, with idf = log10(N / n_i):

Term           Doc 1  Doc 2  Doc 3  Doc 4    idf
complicated      -      -      5      2     0.301
contaminated     4      1      3      -     0.125
fallout          5      -      4      3     0.125
information      6      3      3      2     0.000
interesting      -      1      -      -     0.602
nuclear          3      -      7      -     0.301
retrieval        -      6      1      4     0.125
siberia          2      -      -      -     0.602

The same data as an inverted index, with each posting written as (document, tf):

Term           idf     Postings
complicated    0.301   3,5  4,2
contaminated   0.125   1,4  2,1  3,3
fallout        0.125   1,5  3,4  4,3
information    0.000   1,6  2,3  3,3  4,2
interesting    0.602   2,1
nuclear        0.301   1,3  3,7
retrieval      0.125   2,6  3,1  4,4
siberia        0.602   1,2
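A sketch that recomputes the example's idf values and TF.IDF weights from the (document, tf) postings. It assumes base-10 logarithms, which is what reproduces 0.301, 0.125, and 0.602 for N = 4; the assignment of postings to terms is the reading of the example used here, and the function names are hypothetical.

```python
import math

N = 4  # documents in the collection

# Postings as (document id, term frequency).
postings = {
    "complicated":  [(3, 5), (4, 2)],
    "contaminated": [(1, 4), (2, 1), (3, 3)],
    "fallout":      [(1, 5), (3, 4), (4, 3)],
    "information":  [(1, 6), (2, 3), (3, 3), (4, 2)],
    "interesting":  [(2, 1)],
    "nuclear":      [(1, 3), (3, 7)],
    "retrieval":    [(2, 6), (3, 1), (4, 4)],
    "siberia":      [(1, 2)],
}

def idf(term):
    """Inverse document frequency: log10(N / n_i), n_i = postings list length."""
    return math.log10(N / len(postings[term]))

def tfidf_weights():
    """Return {term: {doc_id: tf * idf}} for every posting."""
    return {term: {doc: tf * idf(term) for doc, tf in plist}
            for term, plist in postings.items()}

weights = tfidf_weights()
for term in postings:
    print(f"{term:13s} idf={idf(term):.3f}")
```

A term that occurs in every document gets idf 0, so it contributes nothing to any score no matter how often it appears.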

Page 31: Document Scoring Algorithm

• Initialize accumulators to hold document scores
• For each query term t in the user’s query
    • Fetch t’s postings
    • For each document, score_doc += w_{t,d} × w_{t,q}
• Apply length normalization to the scores at end
• Return top N documents
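The scoring loop can be sketched with a dictionary of accumulators. Everything in the example data is hypothetical: the index maps a term to (document id, weight) postings, `doc_lengths` holds precomputed vector lengths, and the query weights are all 1.0. Only the document length is divided out, since the query length is the same for every document and does not affect the ranking.

```python
import heapq
from collections import defaultdict

def rank(query_weights, index, doc_lengths, top_n=10):
    """Term-at-a-time scoring with one accumulator per candidate document."""
    scores = defaultdict(float)                       # initialize accumulators
    for term, w_tq in query_weights.items():          # each query term t
        for doc_id, w_td in index.get(term, []):      # fetch t's postings
            scores[doc_id] += w_td * w_tq
    for doc_id in scores:                             # length normalization at end
        scores[doc_id] /= doc_lengths[doc_id]
    return heapq.nlargest(top_n, scores.items(), key=lambda item: item[1])

# Hypothetical weighted index and document lengths.
index = {"fox": [(1, 1.0), (2, 2.0)], "dog": [(2, 1.0), (3, 1.5)]}
doc_lengths = {1: 1.25, 2: 5 ** 0.5, 3: 3.0}
ranked = rank({"fox": 1.0, "dog": 1.0}, index, doc_lengths, top_n=3)
print(ranked)
```

Only documents that share at least one term with the query ever receive an accumulator, so the rest of the collection is never touched.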

Page 32: Summary thus far…

• Represent documents (and queries) as “bags of words” (terms)
• Derive term weights based on frequency
• Use weighted term vectors for each document, query
• Compute a vector-based similarity score
• Display sorted, ranked results

Page 33: Issues and Tricks

• What’s a word/term?
    • We can ignore words (“stop words”), combine (phrases), split up (“stem”) words
    • Other special treatment (e.g., names, categories)
• Query formulation/suggestion
• Type of information need
• Popularity
    • Based on link analysis/PageRank
    • Based on click-through, other
• Structuring and tagging (e.g., “best bets”)

Page 34: Issues and Tricks (cont’d)

• Thesaurus/query expansion
    • Based on meaning, conceptual relationships
    • Based on decomposition/type
• User feedback/“More like this”
• Clustering/grouping of results

Page 35: Morphological Variation

• Handling morphology: related concepts have different forms
    • Inflectional morphology: same part of speech
        dogs = dog + PLURAL
        broke = break + PAST
    • Derivational morphology: different parts of speech
        destruction = destroy + ion
        researcher = research + er
• Different morphological processes:
    • Prefixing
    • Suffixing
    • Infixing
    • Reduplication

Page 36: Stemming

• Dealing with morphological variation: index stems instead of words
    • Stem: a word equivalence class that preserves the central concept
• How much to stem?
    • organization → organize → organ?
    • resubmission → resubmit/submission → submit?
    • reconstructionism?
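A deliberately crude suffix stripper makes the “how much to stem?” question concrete. This is an illustration only, not Porter's algorithm or any stemmer named in the slides; the suffix list and the `min_stem` threshold are arbitrary assumptions.

```python
# Illustrative suffix list, longest first so "ization" wins over "ion".
SUFFIXES = ["ization", "ation", "ion", "ize", "ing", "ed", "s"]

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["dogs", "jumped", "organization", "organize", "organ"]:
    print(word, "->", crude_stem(word))
```

With these rules “organization”, “organize”, and “organ” all land in the same equivalence class, which is exactly the over-stemming risk the slide raises: a query about organs would match documents about organizations.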

Page 37: Does Stemming Work?

• Generally, yes! (in English)
    • Helps more for longer queries, fewer results
    • Lots of work done in this area
• But used very sparingly in web search. Why?

Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.

Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.

David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.

And others…

Page 38: Beyond Words…

• Stemming/tokenization = specific instance of a general problem: what is it?
• Other units of indexing
    • Concepts (e.g., from WordNet)
    • Named entities
    • Relations
    • …

Page 39: Recap

• Introduction to Information Retrieval
• Boolean retrieval
• Ranked retrieval: term weighting, the vector space model
• Advanced methods, things to think about
• Next time: Deploying search engines