Transcript of Lecture 2 of the CSE509: Web Science and Technology Summer Course
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group
Institute of Business Administration (IBA)
CSE509: Introduction to Web Science and Technology
Lecture 2: Basic Information Retrieval Models
Last Time…
What is Web Science?
Why Do We Need Web Science?
Implications of Web Science
Web Science Case-Study: Diff-IE
July 16, 2011
Today
Basic Information Retrieval Approaches
Bag of Words Assumption
Information Retrieval Models: Boolean model, vector-space model, topic/language models
Computer-Based Information Management
Basic problem: how can we use computers to help humans store, organize, and retrieve information?
What approaches have been taken and what has been successful?
Three Major Approaches
Database approach
Expert-system approach
Information retrieval approach
Database Approach
Information is stored in a highly structured way: data is stored in relational tables as tuples
Simple data model and query language: the relational model and the SQL query language
Clear interpretation of data and query
No ambition to be “intelligent” like humans
Main focus on system efficiency
Pros and cons?
Expert-System Approach
Information is stored as a set of logical predicates, e.g., Bird(X), Cat(X), Fly(X)
Given a query, the system infers the answer through logical inference, e.g., from Bird(Ostrich), does Fly(Ostrich) follow?
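The inference step can be sketched with a toy forward-chaining loop; the facts, the rule Bird(X) implies Fly(X), and all names here are made up for illustration (and the rule's wrong conclusion about ostriches hints at why hand-written rule bases are brittle):

```python
# Toy forward chaining over unary predicates (illustrative only; names made up).
facts = {("Bird", "Ostrich"), ("Cat", "Tom")}
rules = [("Bird", "Fly")]  # if Bird(X) then Fly(X)

def infer(facts, rules):
    """Repeatedly apply the rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, arg in list(derived):
                if pred == premise and (conclusion, arg) not in derived:
                    derived.add((conclusion, arg))
                    changed = True
    return derived

print(("Fly", "Ostrich") in infer(facts, rules))  # True
```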
Pros and cons?
Information Retrieval Approach
Uses existing text documents as the information source: no special structuring or database construction required
Text-based query language: keyword-based or natural-language queries
Pros and cons?
Database vs. Information Retrieval
                         IR                                         Databases
What we’re retrieving    Mostly unstructured. Free text             Structured data. Clear semantics
                         with some metadata.                        based on a formal model.
Queries we’re posing     Vague, imprecise information needs         Formally (mathematically) defined
                         (often expressed in natural language).     queries. Unambiguous.
Results we get           Sometimes relevant, often not.             Exact. Always correct in a
                                                                    formal sense.
Interaction with system  Interaction is important.                  One-shot queries.
Other issues             Issues downplayed.                         Concurrency, recovery, atomicity
                                                                    are all critical.
Main Challenge of Information Retrieval Approach
Interpretation of the query and the data is not straightforward
Both queries and data are fuzzy: unstructured text and natural-language queries
Need for an Information Retrieval Model
Computers do not understand the document or the query
To implement the information retrieval approach, a model that can be realized on a computer is essential
What is a Model?
A model is a construct designed to help us understand a complex system: a particular way of “looking at things”
Models work by adopting simplifying assumptions
Different types of models: conceptual models, physical analog models, mathematical models, …
Major Simplification: Bag of Words
Consider each document as a “bag of words”
Consider queries as a “bag of words” as well
Great oversimplification, but works adequately in many cases
Still, how do we match documents and queries?
Sample Document
McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
“Bag of Words”:
16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
Vector Representation
“Bags of words” can be represented as vectors, for computational efficiency and ease of manipulation
A vector is a set of values recorded in any consistent order
“The quick brown fox jumped over the lazy dog’s back”
[ 1 1 1 1 1 1 1 1 2 ]
Positions correspond, in alphabetical order, to: “back”, “brown”, “dog”, “fox”, “jump”, “lazy”, “over”, “quick”, “the”
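The same vector can be built with a few lines of Python, assuming the tokens have already been lower-cased and stemmed (“jumped” → “jump”, “dog’s” → “dog”):

```python
from collections import Counter

# Tokens assumed already lower-cased and stemmed, as implied by the slide
tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog", "back"]
vocab = sorted(set(tokens))          # alphabetical positions, matching the slide
counts = Counter(tokens)
vector = [counts[t] for t in vocab]
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]  ("the" occurs twice)
```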
Representing Documents
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.

Stopword list: the, is, for, to, of

Term     Document 1   Document 2
aid          0            1
all          0            1
back         1            0
brown        1            0
come         0            1
dog          1            0
fox          1            0
good         0            1
jump         1            0
lazy         1            0
men          0            1
now          0            1
over         1            0
party        0            1
quick        1            0
their        0            1
time         0            1
Boolean Model
Return all documents that contain the words in the query
Weights assigned to terms are either “0” or “1”: “0” represents absence (the term is not in the document), “1” represents presence (the term is in the document)
No notion of “ranking”: a document is either a match or a non-match
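A minimal sketch of Boolean (AND) retrieval over a made-up toy corpus:

```python
# Toy corpus; document ids and texts are made up for illustration.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "quick lazy fox",
}

def boolean_and_search(query, docs):
    """Return ids of documents containing every query term (weights are 0/1)."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())]

print(boolean_and_search("quick fox", docs))  # ['d1', 'd3']
```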
Evaluation: Precision and Recall
Q: Are all matching documents what users want?
Basic idea: a model is good if it returns a document if and only if it is relevant
R: set of relevant documents
D: set of documents returned by a model
Precision = |R ∩ D| / |D|
Recall = |R ∩ D| / |R|
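These two measures translate directly into code; the document-id sets below are hypothetical:

```python
def precision_recall(relevant, returned):
    """Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|."""
    hits = len(relevant & returned)
    return hits / len(returned), hits / len(relevant)

# Hypothetical document-id sets
R = {1, 2, 3, 4}   # relevant documents
D = {2, 3, 5}      # documents returned by the model
p, r = precision_recall(R, D)
print(round(p, 2), r)  # 0.67 0.5
```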
Ranked Retrieval
Order documents by how likely they are to be relevant to the information need
Arranging documents by relevance is closer to how humans think (some documents are “better” than others) and closer to user behavior (users can decide when to stop reading)
Best (partial) match: documents need not contain all query terms, although documents with more query terms should rank “better”
Similarity-Based Queries
Let’s replace relevance with “similarity”: rank documents by their similarity to the query
Treat the query as if it were a document: create a query bag-of-words
Find its similarity to each document
Rank order the documents by similarity
Surprisingly, this works pretty well!
Vector Space Model
Documents “close together” in vector space “talk about” the same things
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
[Figure: documents d1–d5 plotted as vectors in a space whose axes are terms t1, t2, t3; the angles θ and φ between vectors indicate how close (similar) they are]
Similarity Metric
Angle between the vectors
sim(dj, dk) = cos(θ) = (dj · dk) / (|dj| |dk|) = Σ_{i=1..n} w_{i,j} w_{i,k} / ( √(Σ_{i=1..n} w_{i,j}²) · √(Σ_{i=1..n} w_{i,k}²) )

where dj is a document vector and dk is the query vector
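The cosine can be computed directly from the weight vectors; a minimal sketch:

```python
import math

def cosine_similarity(dj, dk):
    """cos(θ): dot product of the vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(dj, dk))
    norm_j = math.sqrt(sum(a * a for a in dj))
    norm_k = math.sqrt(sum(b * b for b in dk))
    return dot / (norm_j * norm_k)

# Two toy term-weight vectors sharing one of two terms
print(round(cosine_similarity([1, 1, 0], [1, 0, 0]), 3))  # 0.707
```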
How do we Weight Document Terms?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically? Term frequency and inverse document frequency
TF.IDF Term Weighting
Simple, yet effective!
w_{i,j} = tf_{i,j} × log(N / n_i)

w_{i,j}: weight assigned to term i in document j
tf_{i,j}: number of occurrences of term i in document j
N: number of documents in the entire collection
n_i: number of documents containing term i
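A one-line implementation of the weighting scheme, using base-10 logarithms as in the worked examples that follow:

```python
import math

def tf_idf(tf, df, N):
    """w_ij = tf_ij * log10(N / n_i), with df playing the role of n_i."""
    return tf * math.log10(N / df)

# A term occurring 3 times in a document and appearing in 2 of 4 documents:
print(round(tf_idf(3, 2, 4), 2))  # 0.9
```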
TF.IDF Example
[Worked example: a term frequency (tf) table for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across four documents; the idf of each term, e.g., 0.602 for a term appearing in one of the four documents and 0.000 for a term appearing in all four; and the resulting tf.idf weights W_{i,j}]
Normalizing Document Vectors
Recall our similarity function:
Normalize document vectors in advance
Use the “cosine normalization” method: divide each term weight by the length of the vector
sim(dj, dk) = (dj · dk) / (|dj| |dk|) = Σ_{i=1..n} w_{i,j} w_{i,k} / ( √(Σ_{i=1..n} w_{i,j}²) · √(Σ_{i=1..n} w_{i,k}²) )
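Cosine normalization is a single division per weight; a minimal sketch:

```python
import math

def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]

print(cosine_normalize([3.0, 4.0]))  # [0.6, 0.8]
```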
Normalization Example
[Worked example: the tf and tf.idf weight (W_{i,j}) tables from the previous slide, with each document vector’s length computed (1.70, 0.97, 2.67, and 0.87 for documents 1–4) and each weight divided by its document’s length to give the normalized weights W′_{i,j}]
Retrieval Example
Query: contaminated retrieval

[Worked example: the query is represented as a vector over the same eight terms, with weight 1 on “contaminated” and “retrieval”; comparing it against each document’s normalized weights W′_{i,j} gives similarity scores of 0.29, 0.90, 0.19, and 0.57 for documents 1–4]

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
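The whole pipeline (tf.idf weighting, cosine normalization, ranking against the query) can be sketched end-to-end; the three documents below are made up and much smaller than the slide's example:

```python
import math

# Toy corpus; ids and token lists are made up for illustration.
docs = {
    "d1": ["nuclear", "fallout", "contaminated"],
    "d2": ["information", "retrieval", "retrieval"],
    "d3": ["siberia", "siberia", "interesting"],
}
query = ["contaminated", "retrieval"]

vocab = sorted({t for toks in docs.values() for t in toks})
N = len(docs)
df = {t: sum(t in toks for toks in docs.values()) for t in vocab}

def normalized_weights(tokens):
    """tf.idf weights for one document, cosine-normalized."""
    w = [tokens.count(t) * math.log10(N / df[t]) for t in vocab]
    length = math.sqrt(sum(x * x for x in w))
    return [x / length for x in w]

query_vec = [1 if t in query else 0 for t in vocab]
scores = {d: sum(a * b for a, b in zip(normalized_weights(toks), query_vec))
          for d, toks in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # ['d2', 'd1', 'd3']
```

Document d2 wins because “retrieval” occurs twice in it, while d3 contains no query term at all.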
Language Models
Based on the notion of probabilities and processes for generating text
Documents are ranked by the probability that they generated the query
Best/partial match
What is a Language Model?
Probability distribution over strings of text
How likely is a string in a given “language”?
Probabilities depend on what language we’re modeling
p1 = P(“a quick brown dog”)
p2 = P(“dog quick a brown”)
p3 = P(“быстрая brown dog”)
p4 = P(“быстрая собака”)
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4
How do we Model a Language?
Brute force counts? Think of all the things that have ever been said or will ever be said, of any length, and count how often each one occurs
Is understanding the path to enlightenment? Figure out how meaning and thoughts are expressed, and build a model based on this
Throw up our hands and admit defeat?
Unigram Language Model
Assume each word is generated independently. Obviously this is not true… but it seems to work well in practice!
The probability of a string given a model:

P(q1 … qk | M) = ∏_{i=1..k} P(qi | M)
The probability of a sequence of words decomposes into a product of the probabilities of individual words
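In practice the product is computed in log space to avoid floating-point underflow; a sketch, with a hypothetical unigram model M matching the probabilities used in the example below:

```python
import math

def unigram_log_prob(query_terms, model):
    """log P(q1..qk | M): sum of per-word log probabilities (independence assumption)."""
    return sum(math.log(model[t]) for t in query_terms)

# Hypothetical unigram model M
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
lp = unigram_log_prob(["the", "man", "likes", "the", "woman"], M)
print(f"{math.exp(lp):.1e}")  # 8.0e-08
```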
Physical Metaphor
Colored balls are randomly drawn from an urn (with replacement)
[Figure: an urn of nine colored balls; the probability of drawing a particular sequence of four balls, given the model M of word proportions, is the product of the individual draw probabilities: (4/9) × (2/9) × (4/9) × (3/9)]
An Example
P(“the man likes the woman” | M)
= P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M)
= 0.2 × 0.01 × 0.02 × 0.2 × 0.01
= 0.00000008

Model M:
P(w)   w
0.2    the
0.1    a
0.01   man
0.01   woman
0.03   said
0.02   likes
…
QUESTIONS?