Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 2
March 26, 2006
http://www.ee.technion.ac.il/courses/049011
Information Retrieval
Information Retrieval Setting

[Diagram] A user with an "information need" ("I want information about Michael Jordan, the machine learning expert") issues the query +"Michael Jordan" -basketball to the IR system, which searches the document collection and returns a ranked list of retrieved documents:
1. Michael I. Jordan's homepage
2. NBA.com
3. Michael Jordan on TV
The user gives feedback ("No. 1 is good, the rest are bad"), and the system returns a revised ranked list:
1. Michael I. Jordan's homepage
2. M.I. Jordan's pubs
3. Graphical Models
Information Retrieval vs. Data Retrieval

Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
Ex: Get documents about Michael Jordan, the machine learning expert.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Ex: SELECT doc FROM corpus
    WHERE (doc.text CONTAINS "Michael Jordan")
    AND NOT (doc.text CONTAINS "basketball")
Information Retrieval vs. Data Retrieval

                     Information Retrieval        Data Retrieval
Data                 Free text, unstructured      Database tables, structured
Queries              Keywords, natural language   SQL, relational algebras
Results              Approximate matches          Exact matches
Results              Ordered by relevance         Unordered
Accessibility        Non-expert humans            Knowledgeable users or automatic processes
Information Retrieval Systems

[Diagram] Architecture of an IR system: the user's query goes to the query processor, which turns it into a system query against the index. On the indexing side, raw docs from the corpus pass through the text processor; the resulting tokenized docs go to the indexer, which writes postings into the index. Retrieved docs are ordered by the ranking procedure and returned to the user as a ranked list of retrieved docs.
Search Engines

[Diagram] A search engine has the same architecture, with the Web as its corpus: a crawler fetches raw docs into a repository, and a global analyzer processes the repository before the text processor, indexer, query processor, and ranking procedure operate as in a generic IR system.
Classical IR vs. Web IR

                     Classical IR      Web IR
Volume               Large             Huge
Data quality         Clean, no dups    Noisy, dups
Data change rate     Infrequent        In flux
Data accessibility   Accessible        Partially accessible
Format diversity     Homogeneous       Widely diverse
Documents            Text              Hypertext
# of matches         Small             Large
IR techniques        Content-based     Link-based
Outline

- Abstract formulation
- Models for relevance ranking
- Retrieval evaluation
- Query languages
- Text processing
- Indexing and searching
Abstract Formulation

Ingredients:
- D: document collection
- Q: query space
- f: D × Q → R: relevance scoring function
- For every q in Q, f induces a ranking (partial order) ≤q on D

Functions of an IR system:
- Preprocess D and create an index I
- Given q in Q, use I to produce a permutation π(q) on D

Goals:
- Accuracy: π(q) should be "close" to ≤q
- Compactness: index should be compact
- Response time: answers should be given quickly
Document Representation

- T = { t1, ..., tk }: a "token space" (a.k.a. "feature space" or "term space")
  Ex: all words in English
  Ex: phrases, URLs, ...
- A document: a real vector d in R^k
  di: "weight" of token ti in d
  Ex: di = normalized # of occurrences of ti in d
Classic IR (Relevance) Models

- The Boolean model
- The Vector Space Model (VSM)
The Boolean Model

- A document: a boolean vector d in {0,1}^k
  di = 1 iff ti belongs to d
- A query: a boolean formula q over tokens, i.e., q: {0,1}^k → {0,1}
  Ex: "Michael Jordan" AND (NOT basketball)
  Ex: +"Michael Jordan" -basketball
- Relevance scoring function: f(d,q) = q(d)
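As a small illustration, here is a sketch of the Boolean model in Python; the token space and the two documents are made-up examples, not part of any real system.

```python
# Boolean model sketch: documents are boolean vectors over a small token
# space, and a query is a boolean function on those vectors.
# The token space and document contents are illustrative assumptions.

TOKENS = ["michael", "jordan", "basketball", "learning"]

def to_vector(doc_tokens):
    """Represent a document as a boolean vector over TOKENS."""
    return [1 if t in doc_tokens else 0 for t in TOKENS]

d1 = to_vector({"michael", "jordan", "learning"})    # ML researcher page
d2 = to_vector({"michael", "jordan", "basketball"})  # NBA page

def query(d):
    """Boolean query +"michael jordan" -basketball evaluated on vector d."""
    michael, jordan, basketball, _ = d
    return bool(michael and jordan and not basketball)

# f(d, q) = q(d): a document either matches (True) or does not (False).
print(query(d1))  # True
print(query(d2))  # False
```

Note that the score is a single bit per document, which is exactly the coarseness criticized on the next slide.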
The Boolean Model: Pros & Cons

- Advantages: simplicity for users
- Disadvantages: relevance scoring is too coarse
The Vector Space Model (VSM)

- A document: a real vector d in R^k
  di = weight of ti in d (usually TF-IDF score)
- A query: a real vector q in R^k
  qi = weight of ti in q
- Relevance scoring function: f(d,q) = sim(d,q), the "similarity" between d and q
Popular Similarity Measures

- L1 or L2 distance (d and q are first normalized to have unit norm)
- Cosine similarity: sim(d,q) = (d · q) / (||d|| ||q||)

[Diagram: the vectors d and q, their difference d - q, and the angle between them]
TF-IDF Score: Motivation

Motivating principle: a term ti is relevant to a document d if:
- ti occurs many times in d relative to other terms that occur in d
- ti occurs many times in d relative to its number of occurrences in other documents

Examples:
- 10 out of 100 terms in d are "java"
- 10 out of 10,000 terms in d are "java"
- 10 out of 100 terms in d are "the"
TF-IDF Score: Definition

Notations:
- n(d,ti) = # of occurrences of ti in d
- N = Σi n(d,ti) (# of tokens in d)
- Di = # of documents containing ti
- D = # of documents in the collection

TF(d,ti): "Term Frequency"
- Ex: TF(d,ti) = n(d,ti) / N
- Ex: TF(d,ti) = n(d,ti) / (maxj { n(d,tj) })

IDF(ti): "Inverse Document Frequency"
- Ex: IDF(ti) = log(D/Di)

TFIDF(d,ti) = TF(d,ti) × IDF(ti)
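The definitions above can be sketched directly, using TF(d,t) = n(d,t)/N and IDF(t) = log(D/Di); the tiny corpus is an illustrative assumption.

```python
# TF-IDF sketch over a three-document toy corpus.
import math

corpus = [
    "michael jordan machine learning".split(),
    "michael jordan nba basketball".split(),
    "graphical models machine learning".split(),
]

def tf(doc, term):
    """Term frequency: n(d,t) / N, occurrences over total tokens in doc."""
    return doc.count(term) / len(doc)

def idf(term):
    """Inverse document frequency: log(D / D_t)."""
    d_t = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / d_t)

def tfidf(doc, term):
    return tf(doc, term) * idf(term)

# "michael" appears in 2 of 3 docs, "basketball" in only 1, so the rarer
# term gets the larger IDF boost at equal term frequency.
print(tfidf(corpus[1], "basketball") > tfidf(corpus[1], "michael"))  # True
```

This matches the motivating principle: within a document, rare-in-collection terms outrank common ones even when their raw counts are equal.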
VSM: Pros & Cons

- Advantages: better granularity in relevance scoring; good performance in practice; efficient implementations
- Disadvantages: assumes term independence
Retrieval Evaluation

Notations:
- D: document collection
- Dq: documents in D that are "relevant" to query q
  Ex: documents whose f(d,q) is above some threshold
- Lq: list of results on query q

[Diagram: D contains the sets Lq and Dq, with overlap Lq ∩ Dq]

Recall: |Lq ∩ Dq| / |Dq|
Precision: |Lq ∩ Dq| / |Lq|
Recall & Precision: Example

Relevant docs: d123, d56, d9, d25, d3

List A: 1. d123   2. d84   3. d56   4. d6    5. d8
        6. d9     7. d511  8. d129  9. d187  10. d25
Recall(A) = 80%, Precision(A) = 40%

List B: 1. d81    2. d74   3. d56   4. d123  5. d511
        6. d25    7. d9    8. d129  9. d3    10. d5
Recall(B) = 100%, Precision(B) = 50%
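The example numbers can be reproduced with a short sketch of the two set ratios (recall = |L ∩ Dq| / |Dq|, precision = |L ∩ Dq| / |L|):

```python
# Recall and precision for the example lists above.

relevant = {"d123", "d56", "d9", "d25", "d3"}

list_a = ["d123", "d84", "d56", "d6", "d8",
          "d9", "d511", "d129", "d187", "d25"]
list_b = ["d81", "d74", "d56", "d123", "d511",
          "d25", "d9", "d129", "d3", "d5"]

def recall(results, relevant):
    """Fraction of the relevant docs that were retrieved."""
    return len(set(results) & relevant) / len(relevant)

def precision(results, relevant):
    """Fraction of the retrieved docs that are relevant."""
    return len(set(results) & relevant) / len(results)

print(recall(list_a, relevant), precision(list_a, relevant))  # 0.8 0.4
print(recall(list_b, relevant), precision(list_b, relevant))  # 1.0 0.5
```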
Precision@k and Recall@k

Notations:
- Dq: documents in D that are "relevant" to q
- Lq,k: top k results on the list

Recall@k: |Lq,k ∩ Dq| / |Dq|
Precision@k: |Lq,k ∩ Dq| / k
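The @k variants are the same set ratios restricted to the top k entries of the list; a sketch using List A and the relevant set from the example slides:

```python
# Precision@k and Recall@k on a truncated result list.

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8",
          "d9", "d511", "d129", "d187", "d25"]

def recall_at_k(results, relevant, k):
    """|top-k ∩ relevant| / |relevant|."""
    return len(set(results[:k]) & relevant) / len(relevant)

def precision_at_k(results, relevant, k):
    """|top-k ∩ relevant| / k."""
    return len(set(results[:k]) & relevant) / k

# Among the top 3 of List A, d123 and d56 are relevant.
print(precision_at_k(list_a, relevant, 3))  # 2/3
print(recall_at_k(list_a, relevant, 3))     # 2/5 = 0.4
```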
Precision@k: Example

List A: 1. d123   2. d84   3. d56   4. d6    5. d8
        6. d9     7. d511  8. d129  9. d187  10. d25
List B: 1. d81    2. d74   3. d56   4. d123  5. d511
        6. d25    7. d9    8. d129  9. d3    10. d5

[Chart: precision@k (0%-100%) as a function of k = 1, ..., 10 for List A and List B]
Recall@k: Example

List A: 1. d123   2. d84   3. d56   4. d6    5. d8
        6. d9     7. d511  8. d129  9. d187  10. d25
List B: 1. d81    2. d74   3. d56   4. d123  5. d511
        6. d25    7. d9    8. d129  9. d3    10. d5

[Chart: recall@k (0%-100%) as a function of k = 1, ..., 10 for List A and List B]
"Interpolated" Precision

Notations:
- Dq: documents in D that are "relevant" to q
- r: a recall level (e.g., 20%)
- k(r): first k such that recall@k >= r

Interpolated precision at recall level r = max { precision@k : k >= k(r) }
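A sketch of this definition: find the first k whose recall@k reaches r, then take the best precision@k at or beyond that point (List A and the relevant set follow the example slides).

```python
# Interpolated precision at a given recall level r.

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8",
          "d9", "d511", "d129", "d187", "d25"]

def precision_at_k(results, k):
    return len(set(results[:k]) & relevant) / k

def recall_at_k(results, k):
    return len(set(results[:k]) & relevant) / len(relevant)

def interpolated_precision(results, r):
    n = len(results)
    # k(r): first k with recall@k >= r (assumes r is attainable on this list)
    k_r = next(k for k in range(1, n + 1) if recall_at_k(results, k) >= r)
    return max(precision_at_k(results, k) for k in range(k_r, n + 1))

# recall@1 = 0.2 already, and precision@1 = 1.0 is the best tail value.
print(interpolated_precision(list_a, 0.2))  # 1.0
```

Taking the max over the tail makes the precision-recall curve monotonically non-increasing, which is why the plotted curves look like step functions.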
Precision vs. Recall: Example

List A: 1. d123   2. d84   3. d56   4. d6    5. d8
        6. d9     7. d511  8. d129  9. d187  10. d25
List B: 1. d81    2. d74   3. d56   4. d123  5. d511
        6. d25    7. d9    8. d129  9. d3    10. d5

[Chart: interpolated precision (0%-100%) as a function of recall level (0%-100%) for List A and List B]
Query Languages: Keyword-Based

- Single-word queries
  Ex: Michael Jordan machine learning
- Context queries
  Phrases. Ex: "Michael Jordan" "machine learning"
  Proximity. Ex: "Michael Jordan" at distance of at most 10 words from "machine learning"
- Boolean queries
  Ex: +"Michael Jordan" -basketball
- Natural language queries
  Ex: "Get me pages about Michael Jordan, the machine learning expert."
Query Languages: Pattern Matching

- Prefixes. Ex: prefix:comput
- Suffixes. Ex: suffix:net
- Regular expressions. Ex: [0-9]+th world-wide web conference
Text Processing

- Lexical analysis & tokenization: split text into words, downcase letters, filter out punctuation marks, digits, hyphens
- Stopword elimination: better retrieval accuracy, more compact index. Ex: "to be or not to be"
- Stemming. Ex: "computer", "computing", "computation" → comput
- Index term selection: keywords vs. full text
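The pipeline above can be sketched end to end. The stopword list and the suffix-stripping rules below are illustrative assumptions, not a real stemmer; a production system would use something like the Porter stemmer.

```python
# Text-processing sketch: tokenize + downcase, drop stopwords, crude stemming.
import re

STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "is", "at"}

def stem(word):
    """Crude suffix stripping: computer/computing/computation -> comput."""
    for suffix in ("ation", "ing", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def process(text):
    # Tokenize on letter runs (this also drops digits, hyphens, punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(process("Computing computation: the computer!"))
# ['comput', 'comput', 'comput']
```

All three word forms collapse to the same index term, so a query for any one of them matches documents containing the others.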
Inverted Index

d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).

Vocabulary and postings (term: (doc, position)):
author: (d1,4)
berkeley: (d1,13)
date: (d2,9)
famous: (d2,2)
graphical: (d1,6)
jordan: (d1,2), (d2,6)
legend: (d2,4)
like: (d2,7)
michael: (d1,1), (d2,5)
model: (d1,7), (d2,10)
nba: (d2,3)
professor: (d1,10)
uc: (d1,12)
Inverted Index Structure

- Vocabulary file (term1, term2, ...): usually fits in main memory
- Postings file (postings list 1, postings list 2, ...): stored on disk
Searching an Inverted Index

Given: t1, t2: query terms; L1, L2: corresponding posting lists.
Need: a ranked list of the docs in the intersection of L1 and L2.

Solution 1: If L1 and L2 are comparable in size, "merge" L1 and L2 to find the docs in their intersection, and then order them by rank. (Running time: O(|L1| + |L2|).)

Solution 2: If L1 is considerably shorter than L2, binary search for each posting of L1 in L2 to find the intersection, and then order them by rank. (Running time: O(|L1| × log|L2|).)
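Both strategies can be sketched on sorted lists of doc ids (the example lists are illustrative assumptions):

```python
# Two ways to intersect sorted posting lists.
import bisect

def merge_intersect(l1, l2):
    """Solution 1: linear merge, O(|L1| + |L2|)."""
    i = j = 0
    out = []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return out

def binsearch_intersect(short, long):
    """Solution 2: binary-search each posting of the short list, O(|L1| log |L2|)."""
    out = []
    for doc in short:
        k = bisect.bisect_left(long, doc)
        if k < len(long) and long[k] == doc:
            out.append(doc)
    return out

l1 = [2, 5, 9]
l2 = [1, 2, 3, 5, 7, 8, 9, 12]
print(merge_intersect(l1, l2))      # [2, 5, 9]
print(binsearch_intersect(l1, l2))  # [2, 5, 9]
```

The crossover point between the two depends on the size ratio: roughly, Solution 2 wins when |L1| log|L2| < |L1| + |L2|.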
Search Optimization

Improvement: order the docs in the posting lists by static rank (e.g., PageRank). Then the top matches can be output without scanning the whole lists.
Index Construction

1. Given a stream of documents, store (did, tid, pos) triplets in a file.
2. Sort and group the file by tid.
3. Extract posting lists.
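The three steps above can be sketched in memory (a real indexer would sort on disk); the two documents are illustrative assumptions.

```python
# Batch index construction: emit triplets, sort by term, group into lists.
from itertools import groupby

docs = {
    "d1": "michael jordan author graphical models".split(),
    "d2": "nba legend michael jordan models".split(),
}

# Step 1: stream out (term, doc_id, pos) triplets.
triplets = [(term, did, pos)
            for did, tokens in sorted(docs.items())
            for pos, term in enumerate(tokens, start=1)]

# Step 2: sort by term (then by doc id and position).
triplets.sort()

# Step 3: group consecutive triplets with the same term into a posting list.
index = {term: [(did, pos) for _, did, pos in group]
         for term, group in groupby(triplets, key=lambda t: t[0])}

print(index["michael"])  # [('d1', 1), ('d2', 3)]
print(index["models"])   # [('d1', 5), ('d2', 5)]
```

Sorting first is what makes the grouping a single linear pass, which is why the batch approach scales to collections far larger than memory.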
Index Maintenance

- Naïve updates of an inverted index can be very costly: they require random access, and a single change may cause many insertions/deletions.
- Batch updates
- Two indices:
  Main index (created in batch, large, compressed)
  "Stop-press" index (incremental, small, uncompressed)
Index Maintenance

- If a page d is inserted/deleted, the "signed" postings (did, tid, pos, I/D) are added to the stop-press index.
- Given a query term t, fetch its list Lt from the main index, and the two lists Lt,+ and Lt,- from the stop-press index.
- Result is: (Lt ∪ Lt,+) \ Lt,-
- When the stop-press index grows too large, it is merged into the main index.
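A sketch of answering a query from the two indices, computing (Lt ∪ Lt,+) \ Lt,- over doc-id sets; the example contents are illustrative assumptions.

```python
# Query answering with a main index plus a stop-press index.

main_index = {"jordan": {1, 2, 7}}   # Lt from the main (batch) index
stoppress_add = {"jordan": {9}}      # Lt,+ : postings inserted since the batch
stoppress_del = {"jordan": {2}}      # Lt,- : postings deleted since the batch

def lookup(term):
    """Result = (Lt ∪ Lt,+) \\ Lt,- ."""
    lt = main_index.get(term, set())
    lt_plus = stoppress_add.get(term, set())
    lt_minus = stoppress_del.get(term, set())
    return (lt | lt_plus) - lt_minus

print(sorted(lookup("jordan")))  # [1, 7, 9]
```

Deletions never touch the compressed main index at query time; they are only applied physically during the periodic merge.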
Index Compression

Delta compression: store the gaps between successive doc ids in a posting list rather than the ids themselves.
- Saves a lot for popular terms
- Doesn't save much for rare terms (but these don't take much space anyway)

michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), ...
becomes
michael: (1000007,5), (2,12), (4,77), (22,88), ...
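A sketch of delta compression of the doc ids (ignoring the in-document positions): keep the first id, then store successive gaps; decoding is a running sum.

```python
# Delta (gap) encoding of a sorted list of doc ids.

def delta_encode(doc_ids):
    """First id as-is, then the gap to each successive id."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    """Running sum recovers the original doc ids."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out

ids = [1000007, 1000009, 1000013, 1000035]
print(delta_encode(ids))                        # [1000007, 2, 4, 22]
print(delta_decode(delta_encode(ids)) == ids)   # True
```

The gaps of a popular term are small numbers, which is what the variable-length encodings on the next slide exploit.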
Variable Length Encodings

How to encode gaps succinctly?
- Option 1: Fixed-length binary encoding. Effective when all gap lengths are equally likely. No savings over storing doc ids.
- Option 2: Unary encoding. Gap x is encoded by x-1 1's followed by a 0. Effective when large gaps are very rare (Pr(x) = 1/2^x).
- Option 3: Gamma encoding. Gap x is encoded by (ℓx, bx), where bx is the binary encoding of x and ℓx is the length of bx, encoded in unary. Encoding length: about 2·log(x).
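A sketch of gamma encoding as defined on this slide: the length of x's binary representation in unary (using the slide's convention that k is written as k-1 ones followed by a zero), then the binary representation itself. Note the textbook gamma code drops the leading 1 bit of bx; either way the code length is about 2·log(x).

```python
# Gamma encoding/decoding of a positive gap x, per the slide's definition.

def unary(k):
    """k encoded as k-1 ones followed by a zero."""
    return "1" * (k - 1) + "0"

def gamma_encode(x):
    b = bin(x)[2:]            # bx: binary encoding of x
    return unary(len(b)) + b  # lx in unary, then the bits of bx

def gamma_decode(bits):
    length = bits.index("0") + 1   # unary part: ones up to the first zero
    b = bits[length: 2 * length]   # next `length` bits are x in binary
    return int(b, 2)

for x in (1, 5, 9, 22):
    code = gamma_encode(x)
    assert gamma_decode(code) == x
    print(x, code)
```

For example, x = 5 has bx = 101 and ℓx = 3, giving the 6-bit code 110101; small gaps get short codes, which pairs well with delta compression.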
End of Lecture 2