google & document retrieval qing li school of computing and informatics arizona state university

Google & Document Retrieval

Qing Li

School of Computing and Informatics, Arizona State University


Outline

Brief introduction to Google

Architecture of a Web search engine

Key techniques of search engines
• Indexing
• Matching & ranking

Open-source search engines


Google Search Engine

“Google”
• A number: 1 followed by 100 zeros (a play on “googol”)
• Reflects the company's mission: to organize the immense amount of information available on the web

Information types
• Text
• Image
• Video


Google Service


Google Web Searching


Life of a Google Query


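Web Search System

[Figure: architecture of a web search system. Crawling collects data from the Web; indexing turns the data into an index; the user sends a query to the search engine, which searches the index and returns information. Sample index entries: K1 → d1, d2; K2 → d1, d3; …]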


Conventional Overview of Text Retrieval

[Figure: two-part diagram. Text processing: raw text → text analysis → index. User/system interaction: info needs → analysis of info needs → query. The search engine matches and ranks the query against the index, supported by knowledge resources & tools, and returns the retrieval result.]


Text Processing (1) - Indexing

An index is a list of terms with relevant information
• Frequency of terms
• Location of terms
• Etc.

Index terms represent document content and separate documents from one another
• E.g. “economy” vs. “computer” in a news article of the Financial Times

To get the index
• Extraction of index terms
• Computation of their weights


Text Processing (2) - Extraction

Extraction of index terms
• Word or phrase level
• Morphological analysis (stemming in English)
  • “information”, “informed”, “informs”, “informative” → “inform”
• Removal of common words using a stop list
  • “a”, “an”, “the”, “is”, “are”, “am”, …
• n-gram
  • “정보검색시스템” (“information retrieval system”) => “_정”, “정보”, “보검”, “검색”, … (bi-gram)
  • Surprisingly effective in some languages
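The extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular engine: the stop list, the suffix list in `SUFFIXES`, and the function names (`tokenize`, `crude_stem`, `index_terms`, `char_bigrams`) are all assumptions made for this example, and the suffix stripping is far cruder than a real stemmer.

```python
import re

# Hypothetical stop list; real systems use a much longer one.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Crude suffix list for illustration only (an assumption, not a real stemmer).
SUFFIXES = ("ation", "ative", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [crude_stem(w) for w in tokenize(text) if w not in STOP_WORDS]


def char_bigrams(text, pad="_"):
    """Return overlapping character bi-grams, with one leading pad character."""
    padded = pad + text
    return [padded[i : i + 2] for i in range(len(padded) - 1)]
```

With these definitions, `crude_stem` maps "information", "informed", "informs" and "informative" to "inform", and `char_bigrams("abc")` returns `["_a", "ab", "bc"]`.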


An Example

Building the indexing vocabulary for a collection of 1,033 abstracts in biomedicine:

• Identify all unique words in the collection → 13,471 terms
• Delete 170 common function words included in the stop list → 13,301 terms left
• Delete all terms with collection frequency equal to 1 (terms occurring in one doc with frequency 1) → 7,236 terms left
• Remove terminal “s” endings & combine identical word forms → 6,056 terms left
• Delete 30 very high-frequency terms occurring in over 24% of the documents → 6,026 terms left

Final indexing vocabulary: 6,026 terms
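The pruning steps in this example can be sketched as follows. This is only a rough illustration under stated assumptions: `prune_vocabulary`, `strip_s` and the parameter `max_doc_fraction` are hypothetical names, the input is assumed to be a list of already tokenized documents, and the term counts quoted above come from the slide's collection, not from this code.

```python
from collections import Counter


def strip_s(word):
    """Remove one terminal "s" so that simple plurals share a word form (crude)."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def prune_vocabulary(docs, stop_words, max_doc_fraction=0.24):
    """Return the indexing vocabulary for a list of tokenized documents."""
    # Collection frequency: total occurrences of each word over all documents.
    coll_freq = Counter(word for doc in docs for word in doc)
    vocab = set(coll_freq)                              # all unique words
    vocab -= set(stop_words)                            # delete stop-list words
    vocab = {w for w in vocab if coll_freq[w] > 1}      # delete frequency-1 terms
    merged = {strip_s(w) for w in vocab}                # combine word forms
    # Document frequency of each merged word form.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({strip_s(w) for w in doc if w in vocab})
    # Delete terms that occur in more than the given fraction of documents.
    limit = max_doc_fraction * len(docs)
    return {w for w in merged if doc_freq[w] <= limit}
```

The order of the steps follows the example; a different order (for instance merging word forms before counting frequencies) would give a different vocabulary.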


Text Processing (3) – Term Weight

Calculation of term weights
• Statistical weights using frequency information
• Importance of a term in a document
• E.g. TF*IDF
  • TF: total frequency of a term in a document
  • IDF: inverse document frequency
  • DF: in how many documents does the term appear?
  • High TF, low DF → good word to represent the text
  • High TF, high DF → bad word


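An Example

[Figure: Document 1 and Document 2, the two sample documents]

TF for “Arizona”
• In Doc 1 is 1
• In Doc 2 is 2

DF for “Arizona”
• In this collection (Doc 1 & Doc 2) is 2
• IDF = 1/DF = 1/2

TW = TF * IDF

Normalization of TF is critical to retrieval effectiveness
• It prevents a bias towards longer documents
• TF = 0.5 + 0.5 * (TF / Max TF), where Max TF is the highest term frequency in the same document

TW = TF * log2(N / DF + 1)
• N is the number of documents in the collection
• The logarithm compresses extreme values, e.g. log10(10^-34) is -34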
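The weighting formulas in this example can be tried out with a short sketch. It assumes that `N / DF + 1` means `(N / DF) + 1`, that documents are non-empty token lists, and that the two sample documents used in the usage note below are invented stand-ins (the slide's own documents are not recoverable). The function name `term_weights` is hypothetical.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (0.5 + 0.5 * TF / MaxTF) * log2(N / DF + 1)
    """
    n_docs = len(docs)
    # DF: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        if not tf:                      # empty document: nothing to weight
            weights.append({})
            continue
        max_tf = max(tf.values())
        weights.append({
            term: (0.5 + 0.5 * count / max_tf)
            * math.log2(n_docs / doc_freq[term] + 1)
            for term, count in tf.items()
        })
    return weights
```

For example, with the invented documents `["arizona", "state", "university"]` and `["arizona", "arizona", "university", "phoenix"]`, "Arizona" has TF 1 and 2 and DF 2 as in the example, and its weight is 1.0 in both documents because the normalized TF is 1.0 and log2(2/2 + 1) = 1.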


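Text Processing (4) - Storing indexing results

From raw text to index

[Figure: Document 1 and Document 2 are converted into an index table with the columns “Index Word” and “Word Info.”; the rows shown include “Arizona” and “University”, with the entries “1 1 2 2” and “1 1 1 1”, followed by further rows (…).]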


Text Processing (5) - Storing indexing results

Inverted File, …

[Figure: inverted file. A directory lists the terms (search, Google, …, ASU, …, tiger) together with pointers into a posting file; the posting file entries refer to the documents that contain each term (Doc #1, Doc #2, Doc #5, …). A query is answered by looking up its terms in the directory.]
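The inverted file idea can be sketched with plain dictionaries. This is a simplified in-memory version under stated assumptions: a real engine keeps the directory and the posting file as separate on-disk structures, and the names `build_inverted_index` and `boolean_and` as well as the sample documents in the usage note are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings.

    `docs` maps a document id to its list of tokens.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def boolean_and(index, query_terms):
    """Return the sorted ids of documents that contain every query term."""
    doc_sets = [{doc_id for doc_id, _ in index.get(term, [])}
                for term in query_terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []
```

For instance, indexing `{1: ["google", "search"], 2: ["asu", "search", "search"], 5: ["tiger", "asu"]}` gives the posting list `[(1, 1), (2, 2)]` for "search", and the query `["asu", "search"]` matches only document 2.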


Matching & Ranking

Ranking
• Retrieval model
  • Boolean (exact)
  • Vector Space
  • Probabilistic
  • Inference Net
  • Language Model …

Weighting schemes
• Index terms, query terms
• Parameters in formulas


Vector Space Model

Treat document and query as a vector.

(DOC 1) ... dog ........ dog ....
Doc 1 = <2> (axis: dog)

(DOC 2) ... cat ........ cat ...................... dog .............. dog ....................
Doc 2 = <2, 2> (axes: cat, dog)

[Figure: Doc 1 plotted at 2 on the dog axis; Doc 2 plotted at (2, 2) in the cat-dog plane]



[Figure: Doc, Query 1 and Query 2 plotted in the cat-dog plane]

(DOC) ... cat ........ cat ...................... dog .............. dog ....................
Doc = <2, 2>

Query 1: dog
Query 2: cat, dog

Query 1 = <1>
Query 2 = <1, 1>

COS(Q1, Doc) < COS(Q2, Doc)

If we use angles as a similarity measure, then Q2 is more similar to Doc than Q1 is.
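As a check on the inequality COS(Q1, Doc) < COS(Q2, Doc), the two cosines can be worked out directly. This assumes the axis order (cat, dog), so that Query 1 becomes <0, 1> in the same two-dimensional space as Doc (an assumption; the slide writes Query 1 as <1>):

```latex
\cos(Q_1, \mathrm{Doc}) = \frac{0 \cdot 2 + 1 \cdot 2}{\sqrt{0^2 + 1^2}\,\sqrt{2^2 + 2^2}} = \frac{2}{2\sqrt{2}} \approx 0.707

\cos(Q_2, \mathrm{Doc}) = \frac{1 \cdot 2 + 1 \cdot 2}{\sqrt{1^2 + 1^2}\,\sqrt{2^2 + 2^2}} = \frac{4}{\sqrt{2} \cdot 2\sqrt{2}} = 1
```

Query 2 points in exactly the same direction as Doc, so its cosine is 1, while Query 1 is 45 degrees away.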



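Given

$$x = (x_1, x_2, \ldots, x_n), \qquad y = (y_1, y_2, \ldots, y_n)$$

Dot product

$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$

Cosine Similarity

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$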



<DOC 1>... cat ........ dog ......................dog................mouse .....dog........mouse ........................

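Q = <cat, mouse>

[Figure: D1 and Q plotted on the cat, mouse and dog axes]

With the axes ordered (cat, mouse, dog): D1 = (1, 2, 3), Q = (1, 1, 0)

Similarity = (1*1 + 2*1 + 3*0) / (length of line D1 * length of line Q) = 3 / (√14 * √2) ≈ 0.567

Term weight is only decided by the term frequency

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$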
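The cosine measure is easy to check numerically. The following is a minimal sketch using only the standard library; the function names `cosine_similarity` and `rank` are illustrative, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math


def cosine_similarity(x, y):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                      # convention for an all-zero vector
    return dot / (norm_x * norm_y)


def rank(query, docs):
    """Return (doc_name, score) pairs sorted by decreasing similarity."""
    scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For D1 = (1, 2, 3) and Q = (1, 1, 0), `cosine_similarity` gives 3 / √28, about 0.567, which matches the calculation above when the two lengths are multiplied.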


Matching & Ranking

Techniques for efficiency
• New storage structures, esp. for new document types
• Use of accumulators for efficient generation of ranked output
• Compression/decompression of indexes

Techniques for Web search engines
• Use of hyperlinks
  • PageRank: inlinks & outlinks
  • HITS: authority vs. hub pages
• In conjunction with directory services (e.g. Yahoo)
• ...


PageRank

Basic idea: more links to a page implies a better page
• But not all links are created equal
• Links from a more important page should count more than links from a weaker page

Basic PageRank PR(A) for page A:

$$PR(A) = \sum_{x \text{ pointing to } A} \frac{PR(x)}{\mathrm{outDegree}(x)}$$

• outDegree(x) = number of edges leaving page x = number of hyperlinks on page x
• Page x distributes its rank boost over all the pages it points to

[Figure: three-page graph in which A links to B and C, B links to C, and C links to A]

PR(A) = PR(C) / 1
PR(B) = PR(A) / 2
PR(C) = PR(A) / 2 + PR(B) / 1


PageRank

PageRank definition is recursive
• Rank of a page depends on and influences other pages
• Eventually, ranks will converge

To compute PageRank:
• Choose an arbitrary initial R_old and use it to compute R_new
• Repeat, setting R_old to R_new, until R converges (the difference between old and new R is sufficiently small)
• Rank values typically converge in 50-100 iterations
• Rank orders converge even faster
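The iteration described above can be sketched as follows. This is a minimal version of the basic (undamped) PageRank under stated assumptions: every page has at least one outlink, every link target is itself a key of `out_links`, and the names `basic_pagerank`, `tol` and `max_iter` are chosen for this example. The three-page graph in the usage note is the one implied by the equations PR(A) = PR(C)/1, PR(B) = PR(A)/2, PR(C) = PR(A)/2 + PR(B)/1.

```python
def basic_pagerank(out_links, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = sum of PR(x) / outDegree(x) over pages x linking to A.

    `out_links` maps each page to the list of pages it links to.
    """
    pages = list(out_links)
    # Arbitrary initial ranks R_old; here uniform and summing to 1.
    r_old = {page: 1.0 / len(pages) for page in pages}
    for _ in range(max_iter):
        r_new = {page: 0.0 for page in pages}
        for page, targets in out_links.items():
            share = r_old[page] / len(targets)   # rank spread over the outlinks
            for target in targets:
                r_new[target] += share
        # Largest change of any page's rank in this round.
        diff = max(abs(r_new[page] - r_old[page]) for page in pages)
        r_old = r_new
        if diff < tol:
            break
    return r_old
```

For the graph `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}` the ranks settle near PR(A) = 0.4, PR(B) = 0.2, PR(C) = 0.4, which satisfies the three equations. On graphs with rank sinks this basic version does not behave well, which is what the extended formula addresses.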


Problems with Basic PageRank

Web is not a strongly connected graph
• Rank sink: a single page (node) with no outward links
• Nodes that are not part of the sink get a rank of 0


Extended PageRank

Remove all nodes without outlinks
• No rank for these pages

Add decay factor, d

$$PR(A) = \frac{1 - d}{n} + d \sum_{x \text{ pointing to } A} \frac{PR(x)}{\mathrm{outDegree}(x)}$$

• n is the number of nodes/pages
• d is a constant, typically between 0.8 and 0.9
• d represents the fraction of a page's rank that is distributed among the pages it links to; the rest of the rank is distributed among all pages

In the random surfer model, the decay factor models the user getting bored (or unhappy) with the links on a given page: with probability 1 - d the user jumps to any random page (not linked to) instead of following a link.


Example

Set d = 0.5 and ignore n (use 1 - d in place of (1 - d)/n):

$$PR(A) = (1 - d) + d \sum_{x \text{ pointing to } A} \frac{PR(x)}{\mathrm{outDegree}(x)}$$

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

For a small number of pages, the equations can be solved directly:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
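The direct solution can be reproduced exactly by treating the three equations as a linear system. The sketch below uses a small Gauss-Jordan elimination over `fractions.Fraction`; the helper `solve_linear` is written for this example only (it is not a library function) and assumes the system has a unique solution.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    # Augmented matrix [a | b] with Fraction entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so that the pivot becomes 1.
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]


half = Fraction(1, 2)
# Unknowns ordered (PR(A), PR(B), PR(C)), with all PR terms moved to the left:
#    PR(A)                      - 1/2 PR(C) = 1/2
#   -1/4 PR(A) +     PR(B)                  = 1/2
#   -1/4 PR(A) - 1/2 PR(B) +        PR(C)   = 1/2
coefficients = [
    [1, 0, -half],
    [-half / 2, 1, 0],
    [-half / 2, -half, 1],
]
pr_a, pr_b, pr_c = solve_linear(coefficients, [half, half, half])
```

The result is PR(A) = 14/13, PR(B) = 10/13 and PR(C) = 15/13, the values given in the example. For web-scale graphs such a direct solve is impractical, which is why the iteration method is used instead.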


Example

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Set the initial values of PR(A), PR(B), PR(C) to 1. Each newly computed value is used immediately in the equations that follow it.

After the first iteration
• PR(A) = 0.5 + 0.5 * 1 = 1
• PR(B) = 0.5 + 0.5 (1 / 2) = 0.75
• PR(C) = 0.5 + 0.5 (1 / 2 + 0.75) = 1.125

After the second iteration
• PR(A) = 0.5 + 0.5 * 1.125 = 1.0625
• PR(B) = 0.5 + 0.5 (1.0625 / 2) = 0.765625
• PR(C) = 0.5 + 0.5 (1.0625 / 2 + 0.765625) = 1.1484375


Example

For large numbers of pages, use the iteration method:

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
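The table can be reproduced with a few lines of Python. The sketch hard-codes the three equations of this example (it is not a general PageRank routine), and it updates the values in place in the order A, B, C, so each new value is used immediately; that ordering is what the first two rows of the table imply. The name `iterate_example` is chosen for this illustration.

```python
def iterate_example(d=0.5, iterations=12):
    """Return [(PR(A), PR(B), PR(C)), ...] for iteration 0 up to `iterations`."""
    pr_a = pr_b = pr_c = 1.0              # iteration 0: all ranks start at 1
    history = [(pr_a, pr_b, pr_c)]
    for _ in range(iterations):
        # In-place updates: PR(B) already sees the new PR(A),
        # and PR(C) sees the new PR(A) and PR(B).
        pr_a = (1 - d) + d * pr_c
        pr_b = (1 - d) + d * (pr_a / 2)
        pr_c = (1 - d) + d * (pr_a / 2 + pr_b)
        history.append((pr_a, pr_b, pr_c))
    return history


for step, (a, b, c) in enumerate(iterate_example()):
    print(f"{step:2d}  {a:.8f}  {b:.8f}  {c:.8f}")
```

After 12 iterations the values agree with the direct solution 14/13, 10/13, 15/13 to about eight decimal places.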


Problems with PageRank

Biased against new web pages, which have few inlinks
• Can be solved by a boost factor

No balance between relevancy and popularity
• Very popular pages (such as search engines and web portals) may be ranked artificially high due to their popularity (even if not very related to the query)

Despite these problems, PageRank seems to work fairly well in practice


Open-Source Search Engine Code

Lucene Search Engine
• http://lucene.apache.org/

SWISH
• http://swish-e.org/

Glimpse
• http://webglimpse.net/

and more


Reference

L. Page & S. Brin (1998). The PageRank citation ranking: Bringing order to the web. Stanford Digital Library Technologies, Working Paper 1999-0120.

Steven Levy (2004). All Eyes on Google. Newsweek, April 12, 2004.

E. Brown, J. Callan, B. Croft (1994). Fast Incremental Indexing for Full-Text Information Retrieval. Proceedings of the 20th International Conference on Very Large Databases (VLDB).

Lawrence Page and Sergey Brin (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World Wide Web Conference (WWW 98).
