Transcript of Lecture 2 of the CSE509: Web Science and Technology Summer Course
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group
Institute of Business Administration (IBA)
CSE509: Introduction to Web Science and Technology
Lecture 2: Basic Information Retrieval Models
Last Time…
What is Web Science?
Why Do We Need Web Science?
Implications of Web Science
Web Science Case-Study: Diff-IE
July 16, 2011
Today
Basic Information Retrieval Approaches
Bag of Words Assumption
Information Retrieval Models: Boolean model, vector-space model, topic/language models
Computer-Based Information Management
Basic problem: how can we use computers to help humans store, organize, and retrieve information?
What approaches have been taken and what has been successful?
Three Major Approaches
Database approach
Expert-system approach
Information retrieval approach
Database Approach
Information is stored in a highly structured way: data is stored in relational tables as tuples
Simple data model and query language: the relational model and the SQL query language
Clear interpretation of data and query
No ambition to be “intelligent” like humans
Main focus on system efficiency
Pros and cons?
Expert-System Approach
Information is stored as a set of logical predicates, e.g., Bird(X), Cat(X), Fly(X)
Given a query, the system infers the answer through logical inference, e.g., from Bird(Ostrich), does Fly(Ostrich) follow?
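The inference step can be sketched with a toy forward-chaining loop; the facts, the rule Bird(X) implies Fly(X), and all names here are made up for illustration (and the rule's wrong conclusion about ostriches hints at why hand-written rule bases are brittle):

```python
# Toy forward chaining over unary predicates (illustrative only; names made up).
facts = {("Bird", "Ostrich"), ("Cat", "Tom")}
rules = [("Bird", "Fly")]  # if Bird(X) then Fly(X)

def infer(facts, rules):
    """Repeatedly apply the rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, arg in list(derived):
                if pred == premise and (conclusion, arg) not in derived:
                    derived.add((conclusion, arg))
                    changed = True
    return derived

print(("Fly", "Ostrich") in infer(facts, rules))  # True
```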
Pros and cons?
Information Retrieval Approach
Uses existing text documents as the information source: no special structuring or database construction required
Text-based query language: keyword-based or natural-language queries
Pros and cons?
Database vs. Information Retrieval
                         IR                                         Databases
What we’re retrieving    Mostly unstructured. Free text             Structured data. Clear semantics
                         with some metadata.                        based on a formal model.
Queries we’re posing     Vague, imprecise information needs         Formally (mathematically) defined
                         (often expressed in natural language).     queries. Unambiguous.
Results we get           Sometimes relevant, often not.             Exact. Always correct in a
                                                                    formal sense.
Interaction with system  Interaction is important.                  One-shot queries.
Other issues             Issues downplayed.                         Concurrency, recovery, atomicity
                                                                    are all critical.
Main Challenge of Information Retrieval Approach
Interpretation of the query and the data is not straightforward
Both queries and data are fuzzy: unstructured text and natural-language queries
Need for an Information Retrieval Model
Computers do not understand the document or the query
To implement the information retrieval approach, a model that can be realized on a computer is essential
What is a Model?
A model is a construct designed to help us understand a complex system: a particular way of “looking at things”
Models work by adopting simplifying assumptions
Different types of models: conceptual models, physical analog models, mathematical models, …
Major Simplification: Bag of Words
Consider each document as a “bag of words”
Consider queries as a “bag of words” as well
Great oversimplification, but works adequately in many cases
Still, how do we match documents and queries?
Sample Document
McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
“Bag of Words”:
16 × said
14 × McDonalds
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
Vector Representation
“Bags of words” can be represented as vectors, for computational efficiency and ease of manipulation
A vector is a set of values recorded in any consistent order
“The quick brown fox jumped over the lazy dog’s back”
[ 1 1 1 1 1 1 1 1 2 ]
Positions correspond, in alphabetical order, to: “back”, “brown”, “dog”, “fox”, “jump”, “lazy”, “over”, “quick”, “the”
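The same vector can be built with a few lines of Python, assuming the tokens have already been lower-cased and stemmed (“jumped” → “jump”, “dog’s” → “dog”):

```python
from collections import Counter

# Tokens assumed already lower-cased and stemmed, as implied by the slide
tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog", "back"]
vocab = sorted(set(tokens))          # alphabetical positions, matching the slide
counts = Counter(tokens)
vector = [counts[t] for t in vocab]
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]  ("the" occurs twice)
```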
Representing Documents
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.

Stopword list: the, is, for, to, of

Term     Document 1   Document 2
aid          0            1
all          0            1
back         1            0
brown        1            0
come         0            1
dog          1            0
fox          1            0
good         0            1
jump         1            0
lazy         1            0
men          0            1
now          0            1
over         1            0
party        0            1
quick        1            0
their        0            1
time         0            1
Boolean Model
Return all documents that contain the words in the query
Weights assigned to terms are either “0” or “1”: “0” represents absence (the term is not in the document), “1” represents presence (the term is in the document)
No notion of “ranking”: a document is either a match or a non-match
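A minimal sketch of Boolean (AND) retrieval over a made-up toy corpus:

```python
# Toy corpus; document ids and texts are made up for illustration.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "quick lazy fox",
}

def boolean_and_search(query, docs):
    """Return ids of documents containing every query term (weights are 0/1)."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())]

print(boolean_and_search("quick fox", docs))  # ['d1', 'd3']
```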
Evaluation: Precision and Recall
Q: Are all matching documents what users want?
Basic idea: a model is good if it returns a document if and only if it is relevant
R: set of relevant documents
D: set of documents returned by a model
Precision = |R ∩ D| / |D|
Recall = |R ∩ D| / |R|
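These two measures translate directly into code; the document-id sets below are hypothetical:

```python
def precision_recall(relevant, returned):
    """Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|."""
    hits = len(relevant & returned)
    return hits / len(returned), hits / len(relevant)

# Hypothetical document-id sets
R = {1, 2, 3, 4}   # relevant documents
D = {2, 3, 5}      # documents returned by the model
p, r = precision_recall(R, D)
print(round(p, 2), r)  # 0.67 0.5
```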
Ranked Retrieval
Order documents by how likely they are to be relevant to the information need
Arranging documents by relevance is closer to how humans think (some documents are “better” than others) and closer to user behavior (users can decide when to stop reading)
Best (partial) match: documents need not contain all query terms, although documents with more query terms should rank “better”
Similarity-Based Queries
Let’s replace relevance with “similarity”: rank documents by their similarity to the query
Treat the query as if it were a document: create a query bag-of-words
Find its similarity to each document
Rank order the documents by similarity
Surprisingly, this works pretty well!
Vector Space Model
Documents “close together” in vector space “talk about” the same things
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
[Figure: documents d1–d5 plotted as vectors in a space whose axes are terms t1, t2, t3; the angles θ and φ between vectors indicate how close (similar) they are]
Similarity Metric
Angle between the vectors
sim(dj, dk) = cos(θ) = (dj · dk) / (|dj| |dk|) = Σ_{i=1..n} w_{i,j} w_{i,k} / ( √(Σ_{i=1..n} w_{i,j}²) · √(Σ_{i=1..n} w_{i,k}²) )

where dj is a document vector and dk is the query vector
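The cosine can be computed directly from the weight vectors; a minimal sketch:

```python
import math

def cosine_similarity(dj, dk):
    """cos(θ): dot product of the vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(dj, dk))
    norm_j = math.sqrt(sum(a * a for a in dj))
    norm_k = math.sqrt(sum(b * b for b in dk))
    return dot / (norm_j * norm_k)

# Two toy term-weight vectors sharing one of two terms
print(round(cosine_similarity([1, 1, 0], [1, 0, 0]), 3))  # 0.707
```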
How do we Weight Document Terms?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically? Term frequency and inverse document frequency
TF.IDF Term Weighting
Simple, yet effective!
w_{i,j} = tf_{i,j} × log(N / n_i)

w_{i,j}: weight assigned to term i in document j
tf_{i,j}: number of occurrences of term i in document j
N: number of documents in the entire collection
n_i: number of documents containing term i
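A one-line implementation of the weighting scheme, using base-10 logarithms as in the worked examples that follow:

```python
import math

def tf_idf(tf, df, N):
    """w_ij = tf_ij * log10(N / n_i), with df playing the role of n_i."""
    return tf * math.log10(N / df)

# A term occurring 3 times in a document and appearing in 2 of 4 documents:
print(round(tf_idf(3, 2, 4), 2))  # 0.9
```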
TF.IDF Example
[Worked example: a term frequency (tf) table for the terms nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across four documents; the idf of each term, e.g., 0.602 for a term appearing in one of the four documents and 0.000 for a term appearing in all four; and the resulting tf.idf weights W_{i,j}]
Normalizing Document Vectors
Recall our similarity function:
Normalize document vectors in advance
Use the “cosine normalization” method: divide each term weight by the length of the vector
sim(dj, dk) = (dj · dk) / (|dj| |dk|) = Σ_{i=1..n} w_{i,j} w_{i,k} / ( √(Σ_{i=1..n} w_{i,j}²) · √(Σ_{i=1..n} w_{i,k}²) )
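Cosine normalization is a single division per weight; a minimal sketch:

```python
import math

def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]

print(cosine_normalize([3.0, 4.0]))  # [0.6, 0.8]
```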
Normalization Example
[Worked example: the tf and tf.idf weight (W_{i,j}) tables from the previous slide, with each document vector’s length computed (1.70, 0.97, 2.67, and 0.87 for documents 1–4) and each weight divided by its document’s length to give the normalized weights W′_{i,j}]
Retrieval Example
Query: contaminated retrieval

[Worked example: the query is represented as a vector over the same eight terms, with weight 1 on “contaminated” and “retrieval”; comparing it against each document’s normalized weights W′_{i,j} gives similarity scores of 0.29, 0.90, 0.19, and 0.57 for documents 1–4]

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
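The whole pipeline (tf.idf weighting, cosine normalization, ranking against the query) can be sketched end-to-end; the three documents below are made up and much smaller than the slide's example:

```python
import math

# Toy corpus; ids and token lists are made up for illustration.
docs = {
    "d1": ["nuclear", "fallout", "contaminated"],
    "d2": ["information", "retrieval", "retrieval"],
    "d3": ["siberia", "siberia", "interesting"],
}
query = ["contaminated", "retrieval"]

vocab = sorted({t for toks in docs.values() for t in toks})
N = len(docs)
df = {t: sum(t in toks for toks in docs.values()) for t in vocab}

def normalized_weights(tokens):
    """tf.idf weights for one document, cosine-normalized."""
    w = [tokens.count(t) * math.log10(N / df[t]) for t in vocab]
    length = math.sqrt(sum(x * x for x in w))
    return [x / length for x in w]

query_vec = [1 if t in query else 0 for t in vocab]
scores = {d: sum(a * b for a, b in zip(normalized_weights(toks), query_vec))
          for d, toks in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # ['d2', 'd1', 'd3']
```

Document d2 wins because “retrieval” occurs twice in it, while d3 contains no query term at all.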
Language Models
Based on the notion of probabilities and processes for generating text
Documents are ranked by the probability that they generated the query
Best/partial match
What is a Language Model?
Probability distribution over strings of text
How likely is a string in a given “language”?
Probabilities depend on what language we’re modeling
p1 = P(“a quick brown dog”)
p2 = P(“dog quick a brown”)
p3 = P(“быстрая brown dog”)
p4 = P(“быстрая собака”)
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4
How do we Model a Language?
Brute force counts? Think of all the things that have ever been said or will ever be said, of any length, and count how often each one occurs
Is understanding the path to enlightenment? Figure out how meaning and thoughts are expressed, and build a model based on this
Throw up our hands and admit defeat?
Unigram Language Model
Assume each word is generated independently. Obviously this is not true… but it seems to work well in practice!
The probability of a string given a model:

P(q1 … qk | M) = ∏_{i=1..k} P(qi | M)
The probability of a sequence of words decomposes into a product of the probabilities of individual words
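In practice the product is computed in log space to avoid floating-point underflow; a sketch, with a hypothetical unigram model M matching the probabilities used in the example below:

```python
import math

def unigram_log_prob(query_terms, model):
    """log P(q1..qk | M): sum of per-word log probabilities (independence assumption)."""
    return sum(math.log(model[t]) for t in query_terms)

# Hypothetical unigram model M
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
lp = unigram_log_prob(["the", "man", "likes", "the", "woman"], M)
print(f"{math.exp(lp):.1e}")  # 8.0e-08
```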
Physical Metaphor
Colored balls are randomly drawn from an urn (with replacement)
[Figure: an urn of nine colored balls; the probability of drawing a particular sequence of four balls, given the model M of word proportions, is the product of the individual draw probabilities: (4/9) × (2/9) × (4/9) × (3/9)]
An Example
P(“the man likes the woman” | M)
= P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M)
= 0.2 × 0.01 × 0.02 × 0.2 × 0.01
= 0.00000008

Model M:
P(w)   w
0.2    the
0.1    a
0.01   man
0.01   woman
0.03   said
0.02   likes
…
QUESTIONS?