Algorithms for Large Data Sets
Ziv Bar-Yossef, Lecture 2, March 26, 2006
http://www.ee.technion.ac.il/courses/049011

TRANSCRIPT

Page 1: Algorithms for Large Data Sets

Ziv Bar-Yossef
Lecture 2
March 26, 2006
http://www.ee.technion.ac.il/courses/049011

Page 2: Information Retrieval

Page 3: Information Retrieval Setting

[Diagram: a User with an "information need" ("I want information about Michael Jordan, the machine learning expert") poses the query +"Michael Jordan" -basketball to an IR System. The system searches the document collection and returns a ranked list of retrieved documents: 1. Michael I. Jordan's homepage, 2. NBA.com, 3. Michael Jordan on TV. The user sends feedback ("No. 1 is good, the rest are bad"), and the system returns a revised ranked list: 1. Michael I. Jordan's homepage, 2. M.I. Jordan's pubs, 3. Graphical Models.]

Page 4: Information Retrieval vs. Data Retrieval

Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
Ex: Get documents about Michael Jordan, the machine learning expert.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Ex: SELECT doc FROM corpus
    WHERE (doc.text CONTAINS "Michael Jordan")
    AND NOT (doc.text CONTAINS "basketball").

Page 5: Information Retrieval vs. Data Retrieval

                 Information Retrieval        Data Retrieval
Data             Free text, unstructured      Database tables, structured
Queries          Keywords, natural language   SQL, relational algebras
Results          Approximate matches          Exact matches
Ordering         Ordered by relevance         Unordered
Accessibility    Non-expert humans            Knowledgeable users or automatic processes

Page 6: Information Retrieval Systems

[Diagram: IR system architecture. The User issues a user query to the query processor, which turns it into a system query against the index. On the corpus side, raw docs from the Corpus pass through the text processor; the resulting tokenized docs go to the indexer, which writes postings into the index. Retrieved docs are ordered by the ranking procedure and returned to the User as ranked retrieved docs.]

Page 7: Search Engines

[Diagram: search engine architecture, the same as the IR system above except that the Corpus is replaced by the Web: a crawler fetches raw docs into a repository, a global analyzer processes the repository, and the text processor / indexer pipeline builds the index from there.]

Page 8: Classical IR vs. Web IR

                     Classical IR       Web IR
Volume               Large              Huge
Data quality         Clean, no dups     Noisy, dups
Data change rate     Infrequent         In flux
Data accessibility   Accessible         Partially accessible
Format diversity     Homogeneous        Widely diverse
Documents            Text               Hypertext
# of matches         Small              Large
IR techniques        Content-based      Link-based

Page 9: Outline

- Abstract formulation
- Models for relevance ranking
- Retrieval evaluation
- Query languages
- Text processing
- Indexing and searching

Page 10: Abstract Formulation

Ingredients:
- D: document collection
- Q: query space
- f: D × Q → R: relevance scoring function
- For every q in Q, f induces a ranking (partial order) ≼_q on D

Functions of an IR system:
- Preprocess D and create an index I
- Given q in Q, use I to produce a permutation π on D

Goals:
- Accuracy: π should be "close" to ≼_q
- Compactness: the index should be compact
- Response time: answers should be given quickly

Page 11: Document Representation

T = { t_1, …, t_k }: a "token space" (a.k.a. "feature space" or "term space")
- Ex: all words in English
- Ex: phrases, URLs, …

A document: a real vector d in R^k
- d_i: "weight" of token t_i in d
- Ex: d_i = normalized # of occurrences of t_i in d

Page 12: Classic IR (Relevance) Models

- The Boolean model
- The Vector Space Model (VSM)

Page 13: The Boolean Model

A document: a boolean vector d in {0,1}^k
- d_i = 1 iff t_i belongs to d

A query: a boolean formula q over tokens, q: {0,1}^k → {0,1}
- Ex: "Michael Jordan" AND (NOT basketball)
- Ex: +"Michael Jordan" -basketball

Relevance scoring function: f(d,q) = q(d)
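A minimal sketch of the Boolean model, representing each document by its set of tokens rather than an explicit {0,1}^k vector (an equivalent, illustrative choice); the document contents are made up for the example:

    from typing import Set

    def query(doc: Set[str]) -> bool:
        # The boolean formula +"michael jordan" -basketball
        return "michael jordan" in doc and "basketball" not in doc

    docs = {
        "d1": {"michael jordan", "graphical models", "professor"},
        "d2": {"michael jordan", "nba", "basketball"},
    }
    matches = [name for name, tokens in docs.items() if query(tokens)]
    print(matches)  # ['d1']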

Page 14: The Boolean Model: Pros & Cons

Advantages:
- Simplicity for users

Disadvantages:
- Relevance scoring is too coarse

Page 15: The Vector Space Model (VSM)

A document: a real vector d in R^k
- d_i = weight of t_i in d (usually a TF-IDF score)

A query: a real vector q in R^k
- q_i = weight of t_i in q

Relevance scoring function: f(d,q) = sim(d,q), a "similarity" between d and q

Page 16: Popular Similarity Measures

- L1 or L2 distance: ||d − q||_1 or ||d − q||_2 (smaller means more similar); d and q are first normalized to have unit norm
- Cosine similarity: cos(d,q) = ⟨d,q⟩ / (||d||_2 · ||q||_2)

[Diagram: the vectors d and q, their difference d − q, and the angle between them.]
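A minimal sketch of cosine similarity over sparse term-weight vectors, represented here as term → weight dicts (a representation chosen for readability, not the lecture's notation):

    import math

    def cosine(d: dict, q: dict) -> float:
        # <d,q> / (||d|| * ||q||)
        dot = sum(w * q.get(t, 0.0) for t, w in d.items())
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    print(cosine({"jordan": 2.0, "model": 1.0}, {"jordan": 1.0}))  # ~0.894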

Page 17: TF-IDF Score: Motivation

Motivating principle: a term t_i is relevant to a document d if:
- t_i occurs many times in d relative to other terms that occur in d
- t_i occurs many times in d relative to its number of occurrences in other documents

Examples:
- 10 out of 100 terms in d are "java" (high term frequency: strong evidence of relevance)
- 10 out of 10,000 terms in d are "java" (low term frequency: weak evidence)
- 10 out of 100 terms in d are "the" ("the" is frequent in every document, so it says nothing about d)

Page 18: TF-IDF Score: Definition

- n(d,t_i) = # of occurrences of t_i in d
- N = Σ_i n(d,t_i) (# of tokens in d)
- D_i = # of documents containing t_i
- D = # of documents in the collection

TF(d,t_i): "Term Frequency"
- Ex: TF(d,t_i) = n(d,t_i) / N
- Ex: TF(d,t_i) = n(d,t_i) / max_j { n(d,t_j) }

IDF(t_i): "Inverse Document Frequency"
- Ex: IDF(t_i) = log(D / D_i)

TFIDF(d,t_i) = TF(d,t_i) × IDF(t_i)
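A minimal sketch of the first TF variant (n(d,t)/N) combined with IDF = log(D/D_i), over a toy corpus of token lists (the corpus is illustrative; log base is left as the natural log):

    import math
    from collections import Counter

    def tfidf(corpus, doc, term):
        # TF(d,t) = n(d,t) / N, where N = total # of tokens in d
        tf = Counter(doc)[term] / len(doc)
        # IDF(t) = log(D / D_i), where D_i = # of docs containing t
        d_i = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / d_i) if d_i else 0.0
        return tf * idf

    corpus = [["java", "code", "java"], ["the", "code"], ["the", "cat"]]
    print(tfidf(corpus, corpus[0], "java"))  # (2/3) * log(3/1) ~ 0.73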

Page 19: VSM: Pros & Cons

Advantages:
- Better granularity in relevance scoring
- Good performance in practice
- Efficient implementations

Disadvantages:
- Assumes term independence

Page 20: Retrieval Evaluation

Notations:
- D: document collection
- D_q: documents in D that are "relevant" to query q (ex: f(d,q) is above some threshold)
- L_q ⊆ D: list of results on query q

Recall = |L_q ∩ D_q| / |D_q|
Precision = |L_q ∩ D_q| / |L_q|
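A minimal sketch of the two measures as defined above, using the relevant set and List A from the example on the next slide:

    def recall(results, relevant):
        return len(set(results) & set(relevant)) / len(set(relevant))

    def precision(results, relevant):
        return len(set(results) & set(relevant)) / len(results)

    relevant = {"d123", "d56", "d9", "d25", "d3"}
    list_a = ["d123", "d84", "d56", "d6", "d8",
              "d9", "d511", "d129", "d187", "d25"]
    print(recall(list_a, relevant), precision(list_a, relevant))  # 0.8 0.4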

Page 21: Recall & Precision: Example

Relevant docs: d123, d56, d9, d25, d3

List A: 1. d123, 2. d84, 3. d56, 4. d6, 5. d8, 6. d9, 7. d511, 8. d129, 9. d187, 10. d25
Recall(A) = 80%, Precision(A) = 40%

List B: 1. d81, 2. d74, 3. d56, 4. d123, 5. d511, 6. d25, 7. d9, 8. d129, 9. d3, 10. d5
Recall(B) = 100%, Precision(B) = 50%

Page 22: Precision@k and Recall@k

Notations:
- D_q: documents in D that are "relevant" to q
- L_{q,k}: top k results on the list

Recall@k = |L_{q,k} ∩ D_q| / |D_q|
Precision@k = |L_{q,k} ∩ D_q| / k
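A minimal sketch of the truncated measures, reusing the list/set representation from the previous sketch:

    def precision_at_k(results, relevant, k):
        return len(set(results[:k]) & set(relevant)) / k

    def recall_at_k(results, relevant, k):
        return len(set(results[:k]) & set(relevant)) / len(set(relevant))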

Page 23: Precision@k: Example

(Lists A and B as on the previous slide.)

[Plot: precision@k for k = 1..10, for List A vs. List B.]

Page 24: Recall@k: Example

(Lists A and B as on the previous slide.)

[Plot: recall@k for k = 1..10, for List A vs. List B.]

Page 25: "Interpolated" Precision

Notations:
- D_q: documents in D that are "relevant" to q
- r: a recall level (e.g., 20%)
- k(r): first k such that recall@k >= r

Interpolated precision at recall level r = max { precision@k : k >= k(r) }
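A minimal sketch of this definition, reusing precision_at_k and recall_at_k from the sketch above; it returns 0.0 if recall never reaches r (a boundary choice made here, not specified on the slide):

    def interpolated_precision(results, relevant, r):
        n = len(results)
        ks = [k for k in range(1, n + 1)
              if recall_at_k(results, relevant, k) >= r]
        if not ks:
            return 0.0
        k_r = ks[0]  # k(r): first k with recall@k >= r
        return max(precision_at_k(results, relevant, k)
                   for k in range(k_r, n + 1))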

Page 26: Precision vs. Recall: Example

(Lists A and B as on the previous slides.)

[Plot: interpolated precision vs. recall level (0%..100%) for List A and List B.]

Page 27: Query Languages: Keyword-Based

- Single-word queries. Ex: Michael Jordan machine learning
- Context queries:
  - Phrases. Ex: "Michael Jordan" "machine learning"
  - Proximity. Ex: "Michael Jordan" at distance of at most 10 words from "machine learning"
- Boolean queries. Ex: +"Michael Jordan" -basketball
- Natural language queries. Ex: "Get me pages about Michael Jordan, the machine learning expert."

Page 28: Query Languages: Pattern Matching

- Prefixes. Ex: prefix:comput
- Suffixes. Ex: suffix:net
- Regular expressions. Ex: [0-9]+th world-wide web conference
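A minimal sketch of the regular-expression example using Python's re module (the matched text is made up):

    import re

    pattern = re.compile(r"[0-9]+th world-wide web conference", re.IGNORECASE)
    text = "Proceedings of the 15th World-Wide Web Conference"
    print(bool(pattern.search(text)))  # True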

Page 29: Text Processing

- Lexical analysis & tokenization: split text into words; downcase letters; filter out punctuation marks, digits, hyphens
- Stopword elimination: better retrieval accuracy, more compact index. Ex: "to be or not to be"
- Stemming: Ex: "computer", "computing", "computation" → comput
- Index term selection: keywords vs. full text
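A minimal sketch of the pipeline: tokenization, stopword elimination, and a toy suffix-stripping stemmer (a stand-in for a real algorithm such as Porter's; the stopword list is abridged):

    import re

    STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "is", "at"}

    def process(text):
        # Split into words, downcase, drop punctuation and digits
        tokens = re.findall(r"[a-z]+", text.lower())
        # Stopword elimination
        tokens = [t for t in tokens if t not in STOPWORDS]
        # Toy stemmer: strip a few common suffixes
        return [re.sub(r"(ation|ing|er)$", "", t) for t in tokens]

    print(process("Computer, computing, computation!"))
    # ['comput', 'comput', 'comput']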

Page 30: Inverted Index

d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).

Vocabulary → Postings:

author: (d1,4)
berkeley: (d1,13)
date: (d2,9)
famous: (d2,2)
graphical: (d1,6)
jordan: (d1,2), (d2,6)
legend: (d2,4)
like: (d2,7)
michael: (d1,1), (d2,5)
model: (d1,7), (d2,10)
nba: (d2,3)
professor: (d1,10)
uc: (d1,12)
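A minimal sketch of building such postings from already-tokenized docs (stopword removal and stemming assumed done; the token lists are abridged, so positions differ from the slide):

    from collections import defaultdict

    def build_index(docs):
        index = defaultdict(list)
        for doc_id in sorted(docs):
            for pos, term in enumerate(docs[doc_id], start=1):
                index[term].append((doc_id, pos))  # posting: (doc id, position)
        return dict(index)

    docs = {"d1": ["michael", "jordan", "author", "graphical", "model"],
            "d2": ["michael", "jordan", "like", "date", "model"]}
    print(build_index(docs)["model"])  # [('d1', 5), ('d2', 5)]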

Page 31: Inverted Index Structure

[Diagram: the index consists of a vocabulary file (term1, term2, ...), which usually fits in main memory, and a postings file (postings list 1, postings list 2, ...), which is stored on disk; each term points to its postings list.]

Page 32: Searching an Inverted Index

Given: t1, t2: query terms; L1, L2: the corresponding posting lists. Need to get a ranked list of the docs in the intersection of L1 and L2.

Solution 1: If L1, L2 are comparable in size, "merge" L1 and L2 to find the docs in their intersection, and then order them by rank. (Running time: O(|L1| + |L2|))

Solution 2: If L1 is considerably shorter than L2, binary search each posting of L1 in L2 to find the intersection, and then order them by rank. (Running time: O(|L1| × log|L2|))
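A minimal sketch of the two solutions, assuming posting lists are sorted lists of doc ids:

    import bisect

    def intersect_merge(l1, l2):
        # Solution 1: lockstep merge of two sorted lists, O(|L1| + |L2|)
        out, i, j = [], 0, 0
        while i < len(l1) and j < len(l2):
            if l1[i] == l2[j]:
                out.append(l1[i]); i += 1; j += 1
            elif l1[i] < l2[j]:
                i += 1
            else:
                j += 1
        return out

    def intersect_binary_search(l1, l2):
        # Solution 2: binary-search each posting of the shorter list,
        # O(|L1| * log|L2|)
        out = []
        for x in l1:
            k = bisect.bisect_left(l2, x)
            if k < len(l2) and l2[k] == x:
                out.append(x)
        return out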

Page 33: Search Optimization

Improvement: order the docs in the posting lists by static rank (e.g., PageRank). Then the top matches can be output without scanning the whole lists.

Page 34: Index Construction

1. Given a stream of documents, store (did, tid, pos) triplets in a file
2. Sort and group the file by tid
3. Extract the posting lists
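A minimal sketch of the three steps done in memory; a real implementation would external-sort the triplet file on disk:

    def construct_index(triplets):
        # triplets of (did, tid, pos); sorting by tid groups all
        # postings of a term together
        triplets.sort(key=lambda t: (t[1], t[0], t[2]))
        index = {}
        for did, tid, pos in triplets:
            index.setdefault(tid, []).append((did, pos))  # extract posting lists
        return index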

Page 35: Index Maintenance

Naïve updates of an inverted index can be very costly:
- They require random access
- A single change may cause many insertions/deletions

Instead: batch updates, using two indices:
- Main index (created in batch, large, compressed)
- "Stop-press" index (incremental, small, uncompressed)

Page 36: Index Maintenance

If a page d is inserted/deleted, the "signed" postings (did, tid, pos, I/D) are added to the stop-press index.

Given a query term t, fetch its list L_t from the main index, and the two lists L_{t,+} (insertions) and L_{t,-} (deletions) from the stop-press index.

Result is: (L_t ∪ L_{t,+}) − L_{t,-}

When the stop-press index grows too large, it is merged into the main index.
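A minimal sketch of answering a query term against the main and stop-press indices, treating posting lists as sets for brevity:

    def query_term(main, stop_plus, stop_minus):
        # (L_t union L_t,+) minus L_t,-
        return (set(main) | set(stop_plus)) - set(stop_minus)

    print(query_term({"d1", "d2"}, {"d3"}, {"d2"}))  # {'d1', 'd3'}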

Page 37: Index Compression

Delta compression: store each doc id as the gap from the previous doc id, rather than the id itself.
- Saves a lot for popular terms
- Doesn't save much for rare terms (but these don't take much space anyway)

Before: michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), …
After:  michael: (1000007,5), (2,12), (4,77), (22,88), …
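A minimal sketch of gap encoding and decoding for the sorted doc-id component of a posting list:

    def delta_encode(doc_ids):
        # Keep the first doc id, then store gaps from the previous one
        return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def delta_decode(gaps):
        out, total = [], 0
        for g in gaps:
            total += g
            out.append(total)
        return out

    print(delta_encode([1000007, 1000009, 1000013, 1000035]))
    # [1000007, 2, 4, 22]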

Page 38: Variable Length Encodings

How to encode gaps succinctly?

Option 1: Fixed-length binary encoding.
- Effective when all gap lengths are equally likely
- No savings over storing doc ids

Option 2: Unary encoding.
- Gap x is encoded by x−1 1's followed by a 0
- Effective when large gaps are very rare (Pr(x) = 1/2^x)

Option 3: Gamma encoding.
- Gap x is encoded by the pair (ℓ_x, b_x), where b_x is the binary encoding of x and ℓ_x is the length of b_x, encoded in unary
- Encoding length: about 2·log(x)
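A minimal sketch of gamma encoding a gap x ≥ 1: the length of x's binary form in unary (using the slide's unary convention), followed by the binary form with its leading 1 dropped, since that bit is always 1:

    def gamma_encode(x):
        assert x >= 1
        b = bin(x)[2:]                    # binary encoding of x
        unary = "1" * (len(b) - 1) + "0"  # its length, in unary
        return unary + b[1:]              # leading 1 of b is implied

    print(gamma_encode(9))  # '1110001': 7 bits, about 2*log2(9)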

Page 39: End of Lecture 2