
Searching the Web: Basic Information Retrieval



TRANSCRIPT

Page 1:

Searching the Web

Basic Information Retrieval

Page 2:

Who I Am
- Associate Professor at UCLA Computer Science
- Ph.D. from Stanford in Computer Science
- B.S. from SNU in Physics
- Got involved in early Web-search engine projects, particularly in the Web-crawling part
- Research on search engines and the social Web

Page 3:

Brief Overview of the Course
- Basic principles and theories behind Web-search engines
- Not much discussion of implementation or tools, but I will be happy to discuss them if there are any questions
- Topics
  - Basic IR models, data structures, and algorithms
  - Topic-based models
    - Latent Semantic Indexing
    - Latent Dirichlet Allocation
  - Link-based ranking
  - Search-engine architecture
  - Issues of scale, Web crawling

Page 4:

Who Are You?
- Background
- Expectation
- Career goal

Page 5:

Today’s Topic: Basic Information Retrieval (IR)
- Three approaches to computer-based information management
- Bag-of-words assumption
- Boolean model
  - String-matching algorithm
  - Inverted index
- Vector-space model
  - Document-term matrix
  - TF-IDF vector and cosine similarity
- Phrase queries
- Spell correction

Page 6:

Computer-Based Information Management
- Basic problem: How do we use computers to help humans store, organize, and retrieve information?
- What approaches have been taken, and what has been successful?

Page 7:

Three Major Approaches
- Database approach
- Expert-system approach
- Information-retrieval approach

Page 8:

Database Approach
- Information is stored in a highly structured way
  - Data is stored in relational tables as tuples
- Simple data model and query language
  - Relational model and the SQL query language
  - Clear interpretation of data and queries
- No ambition to be “intelligent” like humans
  - Mainly focused on highly efficient systems: “performance, performance, performance”
- It has been hugely successful
  - All major businesses use an RDB system; a >$20B market
- What are the pros and cons?

Page 9:

Expert-System Approach
- Information is stored as a set of logical predicates
  - Bird(x), Cat(x), Fly(x), …
- Given a query, the system infers the answer through logical inference
  - Bird(Ostrich): Fly(Ostrich)?
- Popular approach in the 80s, but it has not been successful for general information retrieval
- What are the pros and cons?

Page 10:

Information-Retrieval Approach
- Uses existing text documents as the information source
  - No special structuring or database construction required
- Text-based query language
  - Keyword-based query or natural-language query
- The system returns the best-matching documents given the query
- Had limited appeal until the Web became popular
- What are the pros and cons?

Page 11:

Main Challenge of the IR Approach
- Relational model: interpretation of query and data is straightforward
  - Student(name, birthdate, major, GPA)
  - SELECT * FROM Student WHERE GPA > 3.0
- Information retrieval: both queries and data are “fuzzy”
  - Unstructured text and “natural language” queries
  - What documents are good matches for a query?
  - Computers do not “understand” the documents or the queries
- Developing a “model” that a computer can execute is essential to implement this approach

Page 12:

Bag of Words: Major Simplification
- Consider each document as a “bag of words”
  - “bag” vs “set”: ignore word ordering, but keep word counts
- Consider queries as bags of words as well
- A great oversimplification, but it works adequately in many cases
  - “John loves only Jane” vs “Only John loves Jane”
  - The limitation still shows up in current search engines
- Still, how do we match documents and queries? (A small sketch follows below.)
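A minimal sketch of the bag-of-words representation, assuming simple whitespace tokenization; it shows that the two example sentences become identical bags:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; real tokenizers do much more.
    return Counter(text.lower().split())

print(bag_of_words("John loves only Jane") == bag_of_words("Only John loves Jane"))
# True: word order is lost, only the counts remain
```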

Page 13:

Boolean Model
- Return all documents that contain the words in the query
- The simplest model for information retrieval
- No notion of “ranking”: a document is either a match or a non-match
- Q: How do we find and return matching documents? Basic algorithm? Useful data structure?
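As a naive baseline (before the inverted index discussed later), a Boolean match can be checked by scanning every document; the tiny corpus here is made up for illustration:

```python
def boolean_match(query, documents):
    """Return the ids of documents that contain every query word (no ranking)."""
    q_words = set(query.lower().split())
    return [doc_id for doc_id, text in documents.items()
            if q_words <= set(text.lower().split())]

docs = {1: "UCLA is in Los Angeles",
        2: "Stanford Physics department",
        3: "UCLA Physics department"}
print(boolean_match("UCLA Physics", docs))  # [3]
```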

Page 14:

String-Matching Algorithm
- Given the string “abcde”, find which documents contain it
- Q: What is the computational complexity of naïvely matching a string of length m against a document of length n?
- Q: Any more efficient way?
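A sketch of the naive approach: try every start position, which takes O(n·m) character comparisons in the worst case. The example strings are the ones used on the following slides.

```python
def naive_match(D, W):
    """Try every start position in D: O(|D| * |W|) comparisons in the worst case."""
    n, m = len(D), len(W)
    for s in range(n - m + 1):
        if D[s:s + m] == W:
            return s
    return -1  # no match

print(naive_match("ABCABABABC", "ABABC"))  # 5
```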

Page 15:

String Matching Example (1)

  m: 0123456789
  D: ABCABABABC (doc)
  W: ABABC (word)
  i: 01234

Page 16:

String Matching Example (2)

  m: 0123456789
  D: ABCABABABC (doc)
  W: ABABC (word)
  i: 01234

  Two cursors: m = 2, i = 1
  - m: the beginning of the matching part in D
  - i: the location of the matching character in W

Page 17:

String Matching Example (2)

  m: 0123456789
  D: ABCABABABC (doc)
  W: ABABC (word)
  i: 01234

  Mismatch at m = 0, i = 2
  Q: What can we do? Start again at m = 1, i = 0?

Page 18:

String Matching Example (3)

  m: 0123456789
  D: ABCABABABC (doc)
  W: ABABC (word)
  i: 01234

  Mismatch at m = 3, i = 4
  Q: What can we do? Start at m = 7, i = 0?

Page 19:

Algorithm KMP
- If no substring of W is self-repeated, we can slide W “completely” past the matched portion:
    m <- m + i
    i <- 0
- If a suffix of the matched part is equal to a prefix of W, we have to slide back a little bit:
    m <- m + i - x   // x is how much to slide back
    i <- x
- The exact value of x depends on the length of the prefix that matches the suffix of the matched part
- T[0…|W|-1]: the “slide-back” table recording the x values

Page 20:

Algorithm KMP

  W: string to look for
  D: document
  T: “slide-back” table in case of mismatch

  m <- 0, i <- 0
  while (m + i) < |D| do:
      if W[i] = D[m + i]:
          i <- i + 1
          if i = |W|: return m
      otherwise:
          m <- m + i - T[i]
          if i > 0: i <- T[i]
  return no-match

  (A runnable version is sketched below.)
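A runnable Python version of the pseudocode above; a sketch assuming the table convention of the next slide, namely T[0] = -1 and, for i ≥ 1, T[i] = length of the longest proper prefix of W that is also a suffix of W[0…i-1]:

```python
def build_table(W):
    """Slide-back table T: T[0] = -1; for i >= 1, T[i] is the length of the
    longest proper prefix of W that is also a suffix of W[:i]."""
    pi = [0] * len(W)                 # classic prefix function of W
    k = 0
    for j in range(1, len(W)):
        while k > 0 and W[j] != W[k]:
            k = pi[k - 1]
        if W[j] == W[k]:
            k += 1
        pi[j] = k
    return [-1] + pi[:-1]

def kmp_search(D, W):
    """Return the first index where W occurs in D, or -1 if there is no match."""
    T = build_table(W)
    m = i = 0
    while m + i < len(D):
        if W[i] == D[m + i]:
            i += 1
            if i == len(W):
                return m
        else:
            m = m + i - T[i]          # slide W forward
            if i > 0:
                i = T[i]
    return -1

print(kmp_search("ABCABABABC", "ABABC"))  # 5
```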

Page 21:

Algorithm KMP: The T[i] Table

  W: ABCDABD (word)
  i: 0123456

  On a mismatch: m <- m + i - T[i]
  T[0] = -1, T[1] = 0
  Q: What should T[i] be for i = 2…6?
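Using the build_table sketch above on the word from this slide fills in the remaining entries:

```python
print(build_table("ABCDABD"))  # [-1, 0, 0, 0, 0, 1, 2]
```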

Page 22:

Data Structure for Quick Document Matching
- Boolean model: find all documents that contain the keywords in Q
- Q: What data structure will be useful to do this fast?

Page 23:

Inverted Index
- Allows quick lookup of the ids of documents containing a particular word
- Q: How can we use this to answer “UCLA Physics”?

  Lexicon/dictionary (DIC)   Postings lists
  Stanford  ->  PL(Stanford): 3 8 10 13 16 20
  UCLA      ->  PL(UCLA):     1 2 3 9 16 18
  MIT       ->  PL(MIT):      4 5 8 10 13 19 20 22
  …
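A sketch of how the postings lists answer the Boolean query “UCLA Physics”: intersect the lists of the query words. The lists below copy the slide’s example; the “Physics” list is hypothetical, since the slide does not show one.

```python
# Postings lists from the slide; "Physics" is a made-up list for illustration.
index = {
    "Stanford": [3, 8, 10, 13, 16, 20],
    "UCLA":     [1, 2, 3, 9, 16, 18],
    "MIT":      [4, 5, 8, 10, 13, 19, 20, 22],
    "Physics":  [2, 9, 11, 16],
}

def boolean_and(words, index):
    """Intersect the postings lists (via sets here; a merge of sorted lists in practice)."""
    lists = [set(index.get(w, [])) for w in words]
    return sorted(set.intersection(*lists)) if lists else []

print(boolean_and(["UCLA", "Physics"], index))  # [2, 9, 16]
```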

Page 24:

Inverted Index
- Allows quick lookup of the ids of documents containing a particular word

  Lexicon/dictionary (DIC)   Postings lists
  Stanford  ->  PL(Stanford): 3 8 10 13 16 20
  UCLA      ->  PL(UCLA):     1 2 3 9 16 18
  MIT       ->  PL(MIT):      4 5 8 10 13 19 20 22
  …

Page 25:

Size of Inverted Index (1)
- Assume 100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid
- Q: Document collection size?
- Q: Inverted index size?
- Heap’s Law: vocabulary size = k · n^b, with 30 < k < 100 and 0.4 < b < 1
  - k = 50 and b = 0.5 is a good rule of thumb
- (A back-of-envelope calculation is sketched below.)
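A back-of-envelope calculation for the two questions, under the stated assumptions (treating 10B/word as the average token length and using k = 50, b = 0.5 in Heap’s law):

```python
num_docs        = 100_000_000
doc_size        = 10_000      # bytes per document
uniq_per_doc    = 1_000       # unique words per document
bytes_per_word  = 10
bytes_per_docid = 4

collection_size = num_docs * doc_size                        # 1e12 B  ~ 1 TB
postings_size   = num_docs * uniq_per_doc * bytes_per_docid  # 4e11 B  ~ 400 GB

# Heap's law: vocabulary ~ k * n**b, with n = total word occurrences.
n = num_docs * (doc_size // bytes_per_word)   # ~1e11 tokens
vocabulary = 50 * n ** 0.5                    # ~16M distinct words
dictionary_size = vocabulary * bytes_per_word # ~160 MB, excluding pointers

print(collection_size, postings_size, int(vocabulary), int(dictionary_size))
```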

Page 26:

Size of Inverted Index (2)
- Q: Between the dictionary and the postings lists, which one is larger?
- Q: Lengths of the postings lists?
  - Zipf’s law: collection term frequency ∝ 1 / frequency rank
- Q: How do we construct an inverted index?

Page 27:

Inverted Index Construction

  C: set of all documents (corpus)
  DIC: dictionary of the inverted index
  PL(w): postings list of word w

  For each document d ∈ C:
      Extract all words in content(d) into W
      For each w ∈ W:
          If w ∉ DIC, then add w to DIC
          Append id(d) to PL(w)

  Q: What if the index is larger than main memory?
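A runnable in-memory version of the construction above; a sketch that assumes the whole index fits in main memory, which is exactly what the question challenges:

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: dict mapping doc id -> document text."""
    PL = defaultdict(list)                     # word w -> postings list PL(w)
    for doc_id, text in corpus.items():
        for w in set(text.lower().split()):    # each word of d counted once
            PL[w].append(doc_id)               # append id(d) to PL(w)
    return PL                                  # its keys form the dictionary DIC

index = build_index({1: "UCLA physics", 2: "Stanford physics", 3: "UCLA CS"})
print(index["ucla"], index["physics"])  # [1, 3] [1, 2]
```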

Page 28:

Inverted-Index Construction
- For a large text corpus
  - Block-sort-based construction: partition and merge

Page 29:

Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if it is “relevant”
- R: set of “relevant” documents; D: set of documents returned by the model

  Precision = |R ∩ D| / |D|
  Recall    = |R ∩ D| / |R|
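The two measures as a short sketch; the document sets here are hypothetical:

```python
def precision_recall(D, R):
    """D: documents returned by the model, R: relevant documents."""
    D, R = set(D), set(R)
    hit = len(D & R)
    return hit / len(D), hit / len(R)

# 3 of the 4 returned documents are relevant; 6 documents are relevant overall.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))  # (0.75, 0.5)
```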

Page 30:

Vector-Space Model
- Main problem of the Boolean model
  - Too many matching documents when the corpus is large
  - Any way to “rank” documents?
- Matrix interpretation of the Boolean model
  - Document-term matrix with a Boolean 0-or-1 value in each entry
- Basic idea
  - Assign real-valued weights to the matrix entries depending on the importance of the term
  - “the” vs “UCLA”
- Q: How should we assign the weights?

Page 31:

TF-IDF Vector
- A term t is important for document d
  - if t appears many times in d, or
  - if t is a “rare” term
- TF: term frequency
  - the number of occurrences of t in d
- DF: document frequency
  - the number of documents containing t; IDF (inverse document frequency) is based on its inverse
- TF-IDF weight: TF × log(N/DF), where N is the number of documents in the corpus
- Q: How do we use it to compute query-document relevance?

Page 32:

Cosine Similarity
- Represent both the query and the document as TF-IDF vectors
- Take the inner product of the two normalized vectors to compute their similarity:

  sim(Q, D) = (Q · D) / (|Q| |D|)

- Note: |Q| does not matter for document ranking; division by |D| penalizes longer documents

Page 33:

Cosine Similarity: Example
- idf(UCLA) = 10, idf(good) = 0.1, idf(university) = idf(car) = idf(racing) = 1
- Q = (UCLA, university), D = (car, racing)
- Q = (UCLA, university), D = (UCLA, good)
- Q = (UCLA, university), D = (university, good)
- (These cases are worked out in the sketch below.)
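A worked sketch of the example above, assuming TF = 1 for every listed term and using the given idf values directly as weights (so weight = TF × idf):

```python
import math

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vec(words):
    return {w: idf[w] for w in words}          # TF = 1 for each term

def cosine(q, d):
    dot = sum(q[w] * d[w] for w in q if w in d)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(q) * norm(d))

Q = vec(["UCLA", "university"])
print(cosine(Q, vec(["car", "racing"])))        # 0.0   -- no shared terms
print(cosine(Q, vec(["UCLA", "good"])))         # ~0.99 -- shares the rare term
print(cosine(Q, vec(["university", "good"])))   # ~0.10 -- shares only a common term
```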

Page 34:

Finding High Cosine-Similarity Documents
- Q: Under the vector-space model, do precision and recall make sense?
- Q: How do we find the documents with the highest cosine similarity in the corpus?
- Q: Any way to avoid a complete scan of the corpus?

Page 35:

Inverted Index for TF-IDF
- Q · di = 0 if di contains no query words, so consider only the documents that contain query words
- Inverted index: word → documents
- The lexicon stores each word with its IDF (e.g., Stanford: 1/3530, UCLA: 1/9860, MIT: 1/937, …)
- Each postings-list entry stores a docid and the TF of the word in that document (e.g., docids D1, D14, D376, …, with TF values such as 2 and 308)
- (TF may be normalized by document size)
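A sketch of scoring with such an index: the lexicon entry carries the IDF, each posting carries (docid, TF), and only documents containing query words accumulate a score. The numbers below are illustrative, not the slide’s, and length normalization is omitted.

```python
from collections import defaultdict

index = {
    "ucla":    {"idf": 4.0, "postings": [("D1", 2), ("D14", 1)]},
    "physics": {"idf": 3.0, "postings": [("D14", 5), ("D376", 1)]},
}

def score(query_words, index):
    """Accumulate Q . d over the documents that contain at least one query word."""
    scores = defaultdict(float)
    for w in query_words:
        entry = index.get(w)
        if entry is None:
            continue
        q_weight = entry["idf"]                  # query TF assumed to be 1
        for doc_id, tf in entry["postings"]:
            scores[doc_id] += q_weight * tf * entry["idf"]
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score(["ucla", "physics"], index))
# [('D14', 61.0), ('D1', 32.0), ('D376', 9.0)]
```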

Page 36:

Phrase Queries
- “Harvard University Boston” exactly as a phrase
- Q: How can we support this query?
- Two approaches: biword index, positional index
- Q: Pros and cons of each approach?
- Rule of thumb: a 2x–4x size increase for a positional index compared to a docid-only index
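A minimal sketch of the positional-index approach: each word maps to the positions where it occurs in each document, and a phrase matches when the words appear at consecutive positions. The tiny index below is made up for illustration.

```python
# word -> {docid: positions of the word in that document}
pos_index = {
    "harvard":    {1: [0], 2: [3]},
    "university": {1: [1], 2: [7]},
    "boston":     {1: [2], 3: [0]},
}

def phrase_match(phrase, index):
    words = phrase.lower().split()
    docs = set.intersection(*(set(index.get(w, {})) for w in words))
    hits = []
    for d in docs:
        starts = set(index[words[0]][d])
        for offset, w in enumerate(words[1:], start=1):
            starts &= {p - offset for p in index[w][d]}   # word w must sit at start+offset
        if starts:
            hits.append(d)
    return hits

print(phrase_match("Harvard University Boston", pos_index))  # [1]
```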

Page 37:

Spell Correction
- Q: What may the user have truly intended with the query “Britnie Spears”? How can we find the correct spelling?
- Given a user-typed word w, find its correct spelling c
- Probabilistic approach: find the c with the highest probability P(c|w)
  - Q: How do we estimate it?
- Bayes’ rule: P(c|w) = P(w|c) P(c) / P(w)
  - Q: What are these probabilities and how can we estimate them? (A small sketch follows below.)
- Rule of thumb: about 3/4 of misspellings are within edit distance 1, and 98% are within edit distance 2
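A minimal Norvig-style sketch of this probabilistic approach: P(c) comes from word counts in a corpus (the tiny counts below are made up), and P(w|c) is approximated crudely by preferring candidates at smaller edit distance rather than by a real error model.

```python
from collections import Counter

WORDS = Counter({"britney": 50, "spears": 40, "britain": 30, "brittle": 5})

def edits1(w):
    """All strings within edit distance 1 of w."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    deletes    = [a + b[1:]               for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:]           for a, b in splits if b for c in letters]
    inserts    = [a + c + b               for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(w):
    """Pick argmax P(c|w) ~ P(w|c) P(c): known word, else edit distance 1, else 2."""
    candidates = ({w} & WORDS.keys()) \
        or (edits1(w) & WORDS.keys()) \
        or ({e2 for e1 in edits1(w) for e2 in edits1(e1)} & WORDS.keys()) \
        or {w}
    return max(candidates, key=lambda c: WORDS[c])    # P(c) from the counts

print(correct("britnie"))  # britney
```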

Page 38:

Summary
- Boolean model
- Vector-space model
  - TF-IDF weight, cosine similarity
- String-matching algorithm
  - Algorithm KMP
- Inverted index
  - Boolean model, TF-IDF model
- Phrase queries
- Spell correction