Algorithms for Large Data Sets
Ziv Bar-Yossef, Lecture 2, March 26, 2006
http://www.ee.technion.ac.il/courses/049011

TRANSCRIPT

Page 1: Algorithms for Large Data Sets

Ziv Bar-Yossef
Lecture 2
March 26, 2006
http://www.ee.technion.ac.il/courses/049011

Page 2: Information Retrieval

Page 3: Information Retrieval Setting

[Diagram: a User with an "information need" ("I want information about Michael Jordan, the machine learning expert") poses the query +"Michael Jordan" -basketball to an IR System. The system searches the document collection and returns a ranked list of retrieved documents: 1. Michael I. Jordan's homepage, 2. NBA.com, 3. Michael Jordan on TV. The user sends feedback ("No. 1 is good, the rest are bad"), and the system returns a revised ranked list: 1. Michael I. Jordan's homepage, 2. M.I. Jordan's pubs, 3. Graphical Models.]

Page 4: Information Retrieval vs. Data Retrieval

Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
Ex: Get documents about Michael Jordan, the machine learning expert.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Ex: SELECT doc FROM corpus
    WHERE (doc.text CONTAINS "Michael Jordan")
    AND NOT (doc.text CONTAINS "basketball").

Page 5: Information Retrieval vs. Data Retrieval

                 Information Retrieval        Data Retrieval
Data             Free text, unstructured      Database tables, structured
Queries          Keywords, natural language   SQL, relational algebras
Results          Approximate matches          Exact matches
Ordering         Ordered by relevance         Unordered
Accessibility    Non-expert humans            Knowledgeable users or automatic processes

Page 6: Information Retrieval Systems

[Diagram: IR system architecture. The User issues a user query to the query processor, which turns it into a system query against the index. On the corpus side, raw docs from the Corpus pass through the text processor; the resulting tokenized docs go to the indexer, which writes postings into the index. Retrieved docs are ordered by the ranking procedure and returned to the User as ranked retrieved docs.]

Page 7: Search Engines

[Diagram: search engine architecture, the same as the IR system above except that the Corpus is replaced by the Web: a crawler fetches raw docs into a repository, a global analyzer processes the repository, and the text processor / indexer pipeline builds the index from there.]

Page 8: Classical IR vs. Web IR

                     Classical IR       Web IR
Volume               Large              Huge
Data quality         Clean, no dups     Noisy, dups
Data change rate     Infrequent         In flux
Data accessibility   Accessible         Partially accessible
Format diversity     Homogeneous        Widely diverse
Documents            Text               Hypertext
# of matches         Small              Large
IR techniques        Content-based      Link-based

Page 9: Outline

- Abstract formulation
- Models for relevance ranking
- Retrieval evaluation
- Query languages
- Text processing
- Indexing and searching

Page 10: Abstract Formulation

Ingredients:
- D: document collection
- Q: query space
- f: D × Q → R: relevance scoring function
- For every q in Q, f induces a ranking (partial order) ≼_q on D

Functions of an IR system:
- Preprocess D and create an index I
- Given q in Q, use I to produce a permutation π on D

Goals:
- Accuracy: π should be "close" to ≼_q
- Compactness: the index should be compact
- Response time: answers should be given quickly

Page 11: Document Representation

T = { t_1, …, t_k }: a "token space" (a.k.a. "feature space" or "term space")
- Ex: all words in English
- Ex: phrases, URLs, …

A document: a real vector d in R^k
- d_i: "weight" of token t_i in d
- Ex: d_i = normalized # of occurrences of t_i in d

Page 12: Classic IR (Relevance) Models

- The Boolean model
- The Vector Space Model (VSM)

Page 13: The Boolean Model

A document: a boolean vector d in {0,1}^k
- d_i = 1 iff t_i belongs to d

A query: a boolean formula q over tokens, q: {0,1}^k → {0,1}
- Ex: "Michael Jordan" AND (NOT basketball)
- Ex: +"Michael Jordan" -basketball

Relevance scoring function: f(d,q) = q(d)
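A minimal sketch of the Boolean model, representing each document by its set of tokens rather than an explicit {0,1}^k vector (an equivalent, illustrative choice); the document contents are made up for the example:

    from typing import Set

    def query(doc: Set[str]) -> bool:
        # The boolean formula +"michael jordan" -basketball
        return "michael jordan" in doc and "basketball" not in doc

    docs = {
        "d1": {"michael jordan", "graphical models", "professor"},
        "d2": {"michael jordan", "nba", "basketball"},
    }
    matches = [name for name, tokens in docs.items() if query(tokens)]
    print(matches)  # ['d1']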

Page 14: The Boolean Model: Pros & Cons

Advantages:
- Simplicity for users

Disadvantages:
- Relevance scoring is too coarse

Page 15: The Vector Space Model (VSM)

A document: a real vector d in R^k
- d_i = weight of t_i in d (usually a TF-IDF score)

A query: a real vector q in R^k
- q_i = weight of t_i in q

Relevance scoring function: f(d,q) = sim(d,q), a "similarity" between d and q

Page 16: Popular Similarity Measures

- L1 or L2 distance: ||d − q||_1 or ||d − q||_2 (smaller means more similar); d and q are first normalized to have unit norm
- Cosine similarity: cos(d,q) = ⟨d,q⟩ / (||d||_2 · ||q||_2)

[Diagram: the vectors d and q, their difference d − q, and the angle between them.]
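A minimal sketch of cosine similarity over sparse term-weight vectors, represented here as term → weight dicts (a representation chosen for readability, not the lecture's notation):

    import math

    def cosine(d: dict, q: dict) -> float:
        # <d,q> / (||d|| * ||q||)
        dot = sum(w * q.get(t, 0.0) for t, w in d.items())
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    print(cosine({"jordan": 2.0, "model": 1.0}, {"jordan": 1.0}))  # ~0.894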

Page 17: TF-IDF Score: Motivation

Motivating principle: a term t_i is relevant to a document d if:
- t_i occurs many times in d relative to other terms that occur in d
- t_i occurs many times in d relative to its number of occurrences in other documents

Examples:
- 10 out of 100 terms in d are "java" (high term frequency: strong evidence of relevance)
- 10 out of 10,000 terms in d are "java" (low term frequency: weak evidence)
- 10 out of 100 terms in d are "the" ("the" is frequent in every document, so it says nothing about d)

Page 18: TF-IDF Score: Definition

- n(d,t_i) = # of occurrences of t_i in d
- N = Σ_i n(d,t_i) (# of tokens in d)
- D_i = # of documents containing t_i
- D = # of documents in the collection

TF(d,t_i): "Term Frequency"
- Ex: TF(d,t_i) = n(d,t_i) / N
- Ex: TF(d,t_i) = n(d,t_i) / max_j { n(d,t_j) }

IDF(t_i): "Inverse Document Frequency"
- Ex: IDF(t_i) = log(D / D_i)

TFIDF(d,t_i) = TF(d,t_i) × IDF(t_i)
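A minimal sketch of the first TF variant (n(d,t)/N) combined with IDF = log(D/D_i), over a toy corpus of token lists (the corpus is illustrative; log base is left as the natural log):

    import math
    from collections import Counter

    def tfidf(corpus, doc, term):
        # TF(d,t) = n(d,t) / N, where N = total # of tokens in d
        tf = Counter(doc)[term] / len(doc)
        # IDF(t) = log(D / D_i), where D_i = # of docs containing t
        d_i = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / d_i) if d_i else 0.0
        return tf * idf

    corpus = [["java", "code", "java"], ["the", "code"], ["the", "cat"]]
    print(tfidf(corpus, corpus[0], "java"))  # (2/3) * log(3/1) ~ 0.73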

Page 19: VSM: Pros & Cons

Advantages:
- Better granularity in relevance scoring
- Good performance in practice
- Efficient implementations

Disadvantages:
- Assumes term independence

Page 20: Retrieval Evaluation

Notations:
- D: document collection
- D_q: documents in D that are "relevant" to query q (ex: f(d,q) is above some threshold)
- L_q ⊆ D: list of results on query q

Recall = |L_q ∩ D_q| / |D_q|
Precision = |L_q ∩ D_q| / |L_q|
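A minimal sketch of the two measures as defined above, using the relevant set and List A from the example on the next slide:

    def recall(results, relevant):
        return len(set(results) & set(relevant)) / len(set(relevant))

    def precision(results, relevant):
        return len(set(results) & set(relevant)) / len(results)

    relevant = {"d123", "d56", "d9", "d25", "d3"}
    list_a = ["d123", "d84", "d56", "d6", "d8",
              "d9", "d511", "d129", "d187", "d25"]
    print(recall(list_a, relevant), precision(list_a, relevant))  # 0.8 0.4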

Page 21: Recall & Precision: Example

Relevant docs: d123, d56, d9, d25, d3

List A: 1. d123, 2. d84, 3. d56, 4. d6, 5. d8, 6. d9, 7. d511, 8. d129, 9. d187, 10. d25
Recall(A) = 80%, Precision(A) = 40%

List B: 1. d81, 2. d74, 3. d56, 4. d123, 5. d511, 6. d25, 7. d9, 8. d129, 9. d3, 10. d5
Recall(B) = 100%, Precision(B) = 50%

Page 22: Precision@k and Recall@k

Notations:
- D_q: documents in D that are "relevant" to q
- L_{q,k}: top k results on the list

Recall@k = |L_{q,k} ∩ D_q| / |D_q|
Precision@k = |L_{q,k} ∩ D_q| / k
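A minimal sketch of the truncated measures, reusing the list/set representation from the previous sketch:

    def precision_at_k(results, relevant, k):
        return len(set(results[:k]) & set(relevant)) / k

    def recall_at_k(results, relevant, k):
        return len(set(results[:k]) & set(relevant)) / len(set(relevant))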

Page 23: Precision@k: Example

(Lists A and B as on the previous slide.)

[Plot: precision@k for k = 1..10, for List A vs. List B.]

Page 24: Recall@k: Example

(Lists A and B as on the previous slide.)

[Plot: recall@k for k = 1..10, for List A vs. List B.]

Page 25: "Interpolated" Precision

Notations:
- D_q: documents in D that are "relevant" to q
- r: a recall level (e.g., 20%)
- k(r): first k such that recall@k >= r

Interpolated precision at recall level r = max { precision@k : k >= k(r) }
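A minimal sketch of this definition, reusing precision_at_k and recall_at_k from the sketch above; it returns 0.0 if recall never reaches r (a boundary choice made here, not specified on the slide):

    def interpolated_precision(results, relevant, r):
        n = len(results)
        ks = [k for k in range(1, n + 1)
              if recall_at_k(results, relevant, k) >= r]
        if not ks:
            return 0.0
        k_r = ks[0]  # k(r): first k with recall@k >= r
        return max(precision_at_k(results, relevant, k)
                   for k in range(k_r, n + 1))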

Page 26: Precision vs. Recall: Example

(Lists A and B as on the previous slides.)

[Plot: interpolated precision vs. recall level (0%..100%) for List A and List B.]

Page 27: Query Languages: Keyword-Based

- Single-word queries. Ex: Michael Jordan machine learning
- Context queries:
  - Phrases. Ex: "Michael Jordan" "machine learning"
  - Proximity. Ex: "Michael Jordan" at distance of at most 10 words from "machine learning"
- Boolean queries. Ex: +"Michael Jordan" -basketball
- Natural language queries. Ex: "Get me pages about Michael Jordan, the machine learning expert."

Page 28: Query Languages: Pattern Matching

- Prefixes. Ex: prefix:comput
- Suffixes. Ex: suffix:net
- Regular expressions. Ex: [0-9]+th world-wide web conference
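A minimal sketch of the regular-expression example using Python's re module (the matched text is made up):

    import re

    pattern = re.compile(r"[0-9]+th world-wide web conference", re.IGNORECASE)
    text = "Proceedings of the 15th World-Wide Web Conference"
    print(bool(pattern.search(text)))  # True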

Page 29: Text Processing

- Lexical analysis & tokenization: split text into words; downcase letters; filter out punctuation marks, digits, hyphens
- Stopword elimination: better retrieval accuracy, more compact index. Ex: "to be or not to be"
- Stemming: Ex: "computer", "computing", "computation" → comput
- Index term selection: keywords vs. full text
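A minimal sketch of the pipeline: tokenization, stopword elimination, and a toy suffix-stripping stemmer (a stand-in for a real algorithm such as Porter's; the stopword list is abridged):

    import re

    STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "is", "at"}

    def process(text):
        # Split into words, downcase, drop punctuation and digits
        tokens = re.findall(r"[a-z]+", text.lower())
        # Stopword elimination
        tokens = [t for t in tokens if t not in STOPWORDS]
        # Toy stemmer: strip a few common suffixes
        return [re.sub(r"(ation|ing|er)$", "", t) for t in tokens]

    print(process("Computer, computing, computation!"))
    # ['comput', 'comput', 'comput']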

Page 30: Inverted Index

d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).

Vocabulary → Postings:

author: (d1,4)
berkeley: (d1,13)
date: (d2,9)
famous: (d2,2)
graphical: (d1,6)
jordan: (d1,2), (d2,6)
legend: (d2,4)
like: (d2,7)
michael: (d1,1), (d2,5)
model: (d1,7), (d2,10)
nba: (d2,3)
professor: (d1,10)
uc: (d1,12)
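A minimal sketch of building such postings from already-tokenized docs (stopword removal and stemming assumed done; the token lists are abridged, so positions differ from the slide):

    from collections import defaultdict

    def build_index(docs):
        index = defaultdict(list)
        for doc_id in sorted(docs):
            for pos, term in enumerate(docs[doc_id], start=1):
                index[term].append((doc_id, pos))  # posting: (doc id, position)
        return dict(index)

    docs = {"d1": ["michael", "jordan", "author", "graphical", "model"],
            "d2": ["michael", "jordan", "like", "date", "model"]}
    print(build_index(docs)["model"])  # [('d1', 5), ('d2', 5)]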

Page 31: Inverted Index Structure

[Diagram: the index consists of a vocabulary file (term1, term2, ...), which usually fits in main memory, and a postings file (postings list 1, postings list 2, ...), which is stored on disk; each term points to its postings list.]

Page 32: Searching an Inverted Index

Given: t1, t2: query terms; L1, L2: the corresponding posting lists. Need to get a ranked list of the docs in the intersection of L1 and L2.

Solution 1: If L1, L2 are comparable in size, "merge" L1 and L2 to find the docs in their intersection, and then order them by rank. (Running time: O(|L1| + |L2|))

Solution 2: If L1 is considerably shorter than L2, binary search each posting of L1 in L2 to find the intersection, and then order them by rank. (Running time: O(|L1| × log|L2|))
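A minimal sketch of the two solutions, assuming posting lists are sorted lists of doc ids:

    import bisect

    def intersect_merge(l1, l2):
        # Solution 1: lockstep merge of two sorted lists, O(|L1| + |L2|)
        out, i, j = [], 0, 0
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                out.append(l1[i]); i += 1; j += 1
            elif l1[i] < l2[j]:
                i += 1
            else:
                j += 1
        return out

    def intersect_binary_search(l1, l2):
        # Solution 2: binary-search each posting of the shorter list,
        # O(|L1| * log|L2|)
        out = []
        for x in l1:
            k = bisect.bisect_left(l2, x)
            if k < len(l2) and l2[k] == x:
                out.append(x)
        return out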

Page 33: Search Optimization

Improvement: order the docs in the posting lists by static rank (e.g., PageRank). Then the top matches can be output without scanning the whole lists.

Page 34: Index Construction

1. Given a stream of documents, store (did, tid, pos) triplets in a file
2. Sort and group the file by tid
3. Extract the posting lists
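A minimal sketch of the three steps done in memory; a real implementation would external-sort the triplet file on disk:

    def construct_index(triplets):
        # triplets of (did, tid, pos); sorting by tid groups all
        # postings of a term together
        triplets.sort(key=lambda t: (t[1], t[0], t[2]))
        index = {}
        for did, tid, pos in triplets:
            index.setdefault(tid, []).append((did, pos))  # extract posting lists
        return index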

Page 35: Index Maintenance

Naïve updates of an inverted index can be very costly:
- They require random access
- A single change may cause many insertions/deletions

Instead: batch updates, using two indices:
- Main index (created in batch, large, compressed)
- "Stop-press" index (incremental, small, uncompressed)

Page 36: Index Maintenance

If a page d is inserted/deleted, the "signed" postings (did, tid, pos, I/D) are added to the stop-press index.

Given a query term t, fetch its list L_t from the main index, and the two lists L_{t,+} (insertions) and L_{t,-} (deletions) from the stop-press index.

Result is: (L_t ∪ L_{t,+}) − L_{t,-}

When the stop-press index grows too large, it is merged into the main index.
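A minimal sketch of answering a query term against the main and stop-press indices, treating posting lists as sets for brevity:

    def query_term(main, stop_plus, stop_minus):
        # (L_t union L_t,+) minus L_t,-
        return (set(main) | set(stop_plus)) - set(stop_minus)

    print(query_term({"d1", "d2"}, {"d3"}, {"d2"}))  # {'d1', 'd3'}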

Page 37: Index Compression

Delta compression: store each doc id as the gap from the previous doc id, rather than the id itself.
- Saves a lot for popular terms
- Doesn't save much for rare terms (but these don't take much space anyway)

Before: michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), …
After:  michael: (1000007,5), (2,12), (4,77), (22,88), …
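A minimal sketch of gap encoding and decoding for the sorted doc-id component of a posting list:

    def delta_encode(doc_ids):
        # Keep the first doc id, then store gaps from the previous one
        return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def delta_decode(gaps):
        out, total = [], 0
        for g in gaps:
            total += g
            out.append(total)
        return out

    print(delta_encode([1000007, 1000009, 1000013, 1000035]))
    # [1000007, 2, 4, 22]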

Page 38: Variable Length Encodings

How to encode gaps succinctly?

Option 1: Fixed-length binary encoding.
- Effective when all gap lengths are equally likely
- No savings over storing doc ids

Option 2: Unary encoding.
- Gap x is encoded by x−1 1's followed by a 0
- Effective when large gaps are very rare (Pr(x) = 1/2^x)

Option 3: Gamma encoding.
- Gap x is encoded by the pair (ℓ_x, b_x), where b_x is the binary encoding of x and ℓ_x is the length of b_x, encoded in unary
- Encoding length: about 2·log(x)
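A minimal sketch of gamma encoding a gap x ≥ 1: the length of x's binary form in unary (using the slide's unary convention), followed by the binary form with its leading 1 dropped, since that bit is always 1:

    def gamma_encode(x):
        assert x >= 1
        b = bin(x)[2:]                    # binary encoding of x
        unary = "1" * (len(b) - 1) + "0"  # its length, in unary
        return unary + b[1:]              # leading 1 of b is implied

    print(gamma_encode(9))  # '1110001': 7 bits, about 2*log2(9)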

Page 39: End of Lecture 2