Google & Document Retrieval
Qing Li
School of Computing and Informatics, Arizona State University

Outline
Brief introduction to Google
Architecture of a Web search engine
Key techniques of a search engine
• Indexing
• Matching & ranking
Open-source search engines

Google Search Engine
“Google”
• A play on “googol”, the number 1 followed by 100 zeros
• Reflects the company's mission: to organize the immense amount of information available on the Web
Information types
• Text
• Image
• Video

Google Service

Google Web Searching

Life of a Google Query

Web Search System
[Figure: crawling collects data from the Web; indexing turns the data into an index (e.g. K1 → d1, d2; K2 → d1, d3; …); the search engine searches the index to answer the user's query with information.]

Conventional Overview of Text Retrieval
[Figure: on the text processing side, raw text passes through text analysis to produce the index; on the user/system interaction side, info needs pass through analysis of info needs to produce the query. The search engine matches and ranks the index against the query to give the retrieval result. Knowledge resources & tools appear as a shared component.]

Text Processing (1) - Indexing
An index is a list of terms with relevant information
• Frequency of terms
• Location of terms
• Etc.
Index terms represent document content and distinguish documents from one another
• E.g. “economy” vs. “computer” in a Financial Times news article
To build the index
• Extract index terms
• Compute their weights
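As a minimal sketch of the per-term information listed above (frequency and location), the following Python snippet indexes a single document. The function name `index_document`, the whitespace tokenization, and the sample sentence are assumptions made for illustration only.

```python
from collections import defaultdict

def index_document(text):
    """Map each term to its frequency and word positions in one document."""
    info = defaultdict(lambda: {"freq": 0, "positions": []})
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    for pos, term in enumerate(text.lower().split()):
        info[term]["freq"] += 1
        info[term]["positions"].append(pos)
    return dict(info)

# Hypothetical sample document.
entry = index_document("Arizona State University is in Arizona")
arizona_info = entry["arizona"]  # frequency 2, positions 0 and 5
```

A real indexer would also handle punctuation and store weights, which later slides cover.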

Text Processing (2) - Extraction
Extraction of index terms
• Word or phrase level
• Morphological analysis (stemming in English)
  • “information”, “informed”, “informs”, “informative” → “inform”
• Removal of common words listed in a stop list
  • “a”, “an”, “the”, “is”, “are”, “am”, …
• n-gram
  • “정보검색시스템” (Korean for “information retrieval system”) => “_정”, “정보”, “보검”, “검색”, … (bi-gram)
  • Surprisingly effective in some languages
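The extraction steps above can be sketched roughly as follows. The suffix rules in `toy_stem` are a hand-picked toy list chosen only to reproduce the “inform” example; they are not a real stemming algorithm such as Porter's. The names `STOP_WORDS`, `toy_stem`, `extract_terms`, and `char_ngrams` are hypothetical.

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}
SUFFIXES = ("ation", "ative", "ed", "s")  # toy rules, longest first

def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

def char_ngrams(text, n=2, pad="_"):
    """Character n-grams, with a leading pad so the first character starts a gram."""
    padded = pad * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

stems = [toy_stem(w) for w in ["information", "informed", "informs", "informative"]]
terms = extract_terms("The system informs an informed user")
bigrams = char_ngrams("정보검색시스템")
```

Character n-grams need no word segmentation or dictionary, which may be part of why they work well for languages where word boundaries are hard to find.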

An Example
Identify all unique words in a collection of 1,033 abstracts in biomedicine: 13,471 terms
Delete 170 common function words included in the stop list: 13,301 terms left
Delete all terms with collection frequency equal to 1 (terms occurring in one document with frequency 1): 7,236 terms left
Remove terminal “s” endings and combine identical word forms: 6,056 terms left
Delete 30 very high-frequency terms occurring in over 24% of the documents: 6,026 terms left
Final indexing vocabulary: 6,026 terms
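The pruning steps in this example can be sketched as one small function. This is a sketch under assumptions: the toy corpus, the stop list, and the name `build_vocabulary` are invented for illustration, and the document-frequency threshold is raised to 50% because the toy corpus has only four documents.

```python
from collections import Counter

def strip_s(term):
    """Remove a terminal "s" so singular and plural forms merge."""
    return term[:-1] if term.endswith("s") and len(term) > 1 else term

def build_vocabulary(docs, stop_words, max_doc_fraction=0.24):
    """docs is a list of token lists; returns the final indexing vocabulary."""
    # Step 1: all unique words with their collection frequency.
    coll_freq = Counter(tok for doc in docs for tok in doc)
    # Step 2: delete stop words.
    terms = {t for t in coll_freq if t not in stop_words}
    # Step 3: delete terms with collection frequency 1.
    terms = {t for t in terms if coll_freq[t] > 1}
    # Step 4: remove terminal "s" and combine identical word forms.
    merged = {strip_s(t) for t in terms}
    # Step 5: delete terms occurring in too large a fraction of documents.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({strip_s(t) for t in doc})
    limit = max_doc_fraction * len(docs)
    return {t for t in merged if doc_freq[t] <= limit}

# Hypothetical toy corpus of four tokenized abstracts.
DOCS = [
    ["the", "cell", "cells", "protein"],
    ["the", "cell", "gene"],
    ["the", "protein", "enzyme"],
    ["the", "genes", "enzyme", "cell"],
]
vocabulary = build_vocabulary(DOCS, stop_words={"the"}, max_doc_fraction=0.5)
```

Here “cell” is dropped because it appears in three of four documents, while “gene” and “genes” are dropped earlier because each occurs only once in the collection.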

Text Processing (3) - Term Weight
Calculation of term weights
• Statistical weights using frequency information
• A weight reflects the importance of a term in a document
• E.g. TF*IDF
  • TF: term frequency, the number of times a term occurs in a document
  • IDF: inverse document frequency
  • DF: document frequency, the number of documents in which the term appears
  • High TF, low DF → good word to represent the text
  • High TF, high DF → bad word

An Example
TF for “Arizona”
• In Doc 1 is 1
• In Doc 2 is 2
DF for “Arizona”
• In this collection (Doc 1 & Doc 2) is 2, so IDF = 1/DF = ½
TW = TF * IDF
Normalization of TF is critical to retrieval effectiveness
• Prevents a bias towards longer documents
• Normalized TF = 0.5 + 0.5 * (TF / Max TF)
TW = TF * log2(N / DF + 1), where N is the number of documents in the collection
[Figure: Document 1 and Document 2]
Note: log10(10^-34) = -34
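The two formulas on this slide can be combined in a short sketch. The function names are hypothetical. The slide does not give Max TF for either document, so the value 2 used below is an assumption made only to have numbers to plug in.

```python
import math

def normalized_tf(tf, max_tf):
    """Normalized TF = 0.5 + 0.5 * (TF / Max TF)."""
    return 0.5 + 0.5 * (tf / max_tf)

def term_weight(tf, max_tf, n_docs, df):
    """TW = normalized TF * log2(N / DF + 1)."""
    return normalized_tf(tf, max_tf) * math.log2(n_docs / df + 1)

# "Arizona": N = 2 documents, DF = 2; Max TF = 2 is assumed for both documents.
tw_doc1 = term_weight(tf=1, max_tf=2, n_docs=2, df=2)
tw_doc2 = term_weight(tf=2, max_tf=2, n_docs=2, df=2)
```

With these assumed values the log factor is log2(2) = 1, so the weights reduce to the normalized TF values 0.75 and 1.0.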

Text Processing (4) - Storing indexing results
From raw text to index
[Figure: Document 1 and Document 2 are converted into an index table with the columns “Index Word” (Arizona, University, …) and “Word Info.” (entries shown: “1 1 2 2” and “1 1 1 1”).]

Text Processing (5) - Storing indexing results
Inverted file, …
[Figure: an inverted file. The directory lists terms (“search”, …, “ASU”, …, “tiger”), shown with the numbers 3, 2, 2, and pointers into the posting file; the posting file lists document numbers, which point to the documents (Doc #1, Doc #2, Doc #5, ...). A query is matched against the directory terms.]
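A minimal sketch of the directory/posting-file layout, assuming a directory that maps each term to a (count, offset) pair into one flat posting list. The document texts are invented for illustration, and `build_inverted_file` and `lookup` are hypothetical names.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs maps doc_id to text. Returns (directory, posting_file)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    directory = {}
    posting_file = []
    for term in sorted(postings):
        doc_ids = sorted(postings[term])
        # The directory stores how many postings a term has and where they start.
        directory[term] = (len(doc_ids), len(posting_file))
        posting_file.extend(doc_ids)
    return directory, posting_file

def lookup(term, directory, posting_file):
    """Return the sorted document ids that contain the term."""
    count, offset = directory.get(term.lower(), (0, 0))
    return posting_file[offset:offset + count]

# Hypothetical document contents.
DOCS = {1: "ASU search", 2: "tiger search", 5: "ASU tiger search"}
directory, posting_file = build_inverted_file(DOCS)
hits = lookup("search", directory, posting_file)
```

Keeping the postings in one contiguous list means a query term costs one directory lookup plus one sequential read.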

Matching & Ranking
Ranking
• Retrieval model
  • Boolean (exact)
  • Vector space
  • Probabilistic
  • Inference net
  • Language model …
Weighting schemes
• Index terms, query terms
• Parameters in formulas

Vector Space Model
Treat each document and query as a vector.
(DOC 1) ... dog ... dog ...
Doc 1 = <2>
(DOC 2) ... cat ... cat ... dog ... dog ...
Doc 2 = <2, 2>
[Figure: Doc 1 plotted at 2 on a single “dog” axis; Doc 2 plotted at (2, 2) on “dog” and “cat” axes.]

Vector Space Model
(DOC) ... cat ... cat ... dog ... dog ...
Doc = <2, 2>
Query 1: dog
Query 2: cat, dog
Query 1 = <1> (that is, <1, 0> in the two-dimensional dog/cat space)
Query 2 = <1, 1>
COS(Q1, Doc) < COS(Q2, Doc)
If we use the angle between vectors as a similarity measure, then Q2 is more similar to Doc than Q1 is.
[Figure: Doc, Query 1 and Query 2 drawn as vectors on “dog” and “cat” axes.]

Vector Space Model
Given $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$
Dot product
$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$
Cosine similarity
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

Vector Space Model
<DOC 1> ... cat ... dog ... dog ... mouse ... dog ... mouse ...
Q = <cat, mouse>
D1 = (1, 2, 3), Q = (1, 1, 0), with components ordered (cat, mouse, dog)
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
Similarity = (1*1 + 2*1 + 3*0) / (length of D1 * length of Q) = 3 / (√14 * √2) ≈ 0.567
Term weight is decided only by the term frequency
[Figure: D1 and Q drawn as vectors on “cat”, “mouse” and “dog” axes.]
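The cosine similarity from the vector space slides can be checked numerically with a few lines of Python. This is a sketch, with `cosine` as a hypothetical helper name; it assumes both vectors have the same length and are non-zero.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Two-term example: Doc = <2, 2>, Query 1 = <1, 0>, Query 2 = <1, 1>.
sim_q1 = cosine([1, 0], [2, 2])
sim_q2 = cosine([1, 1], [2, 2])

# Three-term example: D1 = (1, 2, 3), Q = (1, 1, 0).
sim_d1 = cosine([1, 2, 3], [1, 1, 0])
```

Query 2 points in the same direction as Doc, so its cosine is 1, while Query 1 sits at a 45 degree angle to Doc.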

Matching & Ranking
Techniques for efficiency
• New storage structures, especially for new document types
• Use of accumulators for efficient generation of ranked output
• Compression/decompression of indexes
Techniques for Web search engines
• Use of hyperlinks
  • PageRank: inlinks & outlinks
  • HITS: authority vs. hub pages
• In conjunction with directory services (e.g. Yahoo)
• ...

PageRank
Basic idea: more links to a page imply a better page
• But not all links are created equal
• Links from a more important page should count more than links from a weaker page
Basic PageRank PR(A) for page A:
$$PR(A) = \sum_{x \text{ points to } A} \frac{PR(x)}{\mathrm{outDegree}(x)}$$
• outDegree(x) = number of edges leaving page x = number of hyperlinks on page x
• Page x distributes its rank boost over all the pages it points to
Example (A links to B and C, B links to C, C links to A):
PR(A) = PR(C) / 1
PR(B) = PR(A) / 2
PR(C) = PR(A) / 2 + PR(B) / 1

PageRank
The PageRank definition is recursive
• The rank of a page depends on and influences the ranks of other pages
• Eventually, ranks will converge
To compute PageRank:
• Choose an arbitrary initial R_old and use it to compute R_new
• Repeat, setting R_old to R_new, until R converges (the difference between old and new R is sufficiently small)
• Rank values typically converge in 50-100 iterations
• Rank orders converge even faster
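The R_old / R_new loop can be sketched as below for the basic (undamped) PageRank. The graph is the three-page example implied by the earlier equations (A links to B and C, B links to C, C links to A). The function name, tolerance, and iteration cap are assumptions, and the sketch assumes every link target also appears as a key in `links`.

```python
def basic_pagerank(links, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = sum of PR(x) / outDegree(x) over pages x linking to A."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}  # arbitrary initial R_old
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for x, targets in links.items():
            for target in targets:
                # Page x splits its rank evenly over its outgoing links.
                new_rank[target] += rank[x] / len(targets)
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank  # R_old <- R_new
    return rank

LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = basic_pagerank(LINKS)
```

On this small, strongly connected graph the values settle near PR(A) = PR(C) = 1.2 and PR(B) = 0.6 when starting from all ones.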

Problems with Basic PageRank
The Web is not a strongly connected graph
• Rank sink: a single page (node) with no outward links
• Nodes that are not part of a sink end up with a rank of 0

Extended PageRank
Remove all nodes without outlinks
• No rank for these pages
Add decay factor, d
$$PR(A) = \frac{1 - d}{n} + d \sum_{x \text{ points to } A} \frac{PR(x)}{\mathrm{outDegree}(x)}$$
• n is the number of nodes/pages
• d is a constant, typically between 0.8 and 0.9
• d represents the fraction of a page's rank that is distributed among the pages it links to; the rest of the rank is distributed among all pages
In the random surfer model, the decay factor models the user getting bored (or unhappy) with the links on a given page and, with probability 1 - d, jumping to a random page (not necessarily one it links to)
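A sketch of the extended formula with the decay factor d and the (1 - d) / n term. The value d = 0.85 is an assumed setting within the 0.8 to 0.9 range mentioned above, the graph is the earlier three-page example, and the function name `pagerank` is hypothetical. Pages without outlinks are assumed to have been removed already.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """PR(A) = (1 - d) / n + d * sum of PR(x) / outDegree(x) over x linking to A."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page receives the share spread evenly over all pages.
        new_rank = {p: (1 - d) / n for p in pages}
        for x, targets in links.items():
            for target in targets:
                new_rank[target] += d * rank[x] / len(targets)
        converged = max(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank

LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
total = sum(ranks.values())
```

With the (1 - d) / n form and no dangling pages, the ranks keep summing to 1, so they can be read as the probability of finding the random surfer on each page.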

Example
$$PR(A) = \frac{1 - d}{n} + d \sum_{x \text{ points to } A} \frac{PR(x)}{\mathrm{outDegree}(x)}$$
Set d = 0.5 and ignore n (use 1 - d in place of (1 - d) / n)
Small graphs can be solved directly
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
Get:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
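The direct solution can be reproduced exactly by treating the three equations as a linear system. The sketch below uses Python's `fractions.Fraction` with a small Gauss-Jordan elimination. `solve_linear` is a hypothetical helper that assumes the system has a unique solution.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [rv - factor * cv for rv, cv in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

half, quarter = Fraction(1, 2), Fraction(1, 4)
COEFFICIENTS = [
    [1, 0, -half],         # PR(A) - 0.5 PR(C) = 0.5
    [-quarter, 1, 0],      # PR(B) - 0.25 PR(A) = 0.5
    [-quarter, -half, 1],  # PR(C) - 0.25 PR(A) - 0.5 PR(B) = 0.5
]
pr_a, pr_b, pr_c = solve_linear(COEFFICIENTS, [half, half, half])
```

Exact fractions avoid rounding, so the results can be compared directly with 14/13, 10/13 and 15/13.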

Example
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
Set the initial values of PR(A), PR(B), PR(C) to 1. Each update uses the most recently computed values.
After the first iteration
• PR(A) = 0.5 + 0.5 * 1 = 1
• PR(B) = 0.5 + 0.5 (1 / 2) = 0.75
• PR(C) = 0.5 + 0.5 (1 / 2 + 0.75) = 1.125
After the second iteration
• PR(A) = 0.5 + 0.5 * 1.125 = 1.0625
• PR(B) = 0.5 + 0.5 (1.0625 / 2) = 0.765625
• PR(C) = 0.5 + 0.5 (1.0625 / 2 + 0.765625) = 1.1484375

Example
For large graphs, use the iteration method
Iteration PR(A) PR(B) PR(C)
0 1 1 1
1 1 0.75 1.125
2 1.0625 0.765625 1.1484375
3 1.07421875 0.76855469 1.15283203
4 1.07641602 0.76910400 1.15365601
5 1.07682800 0.76920700 1.15381050
6 1.07690525 0.76922631 1.15383947
7 1.07691973 0.76922993 1.15384490
8 1.07692245 0.76923061 1.15384592
9 1.07692296 0.76923074 1.15384611
10 1.07692305 0.76923076 1.15384615
11 1.07692307 0.76923077 1.15384615
12 1.07692308 0.76923077 1.15384615
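The table can be reproduced with a short loop. Each value is updated in place, so PR(B) uses the new PR(A) and PR(C) uses the new PR(A) and PR(B); this is what the numbers in the table imply. The function name `iterate` is hypothetical.

```python
def iterate(n_iters, d=0.5):
    """Run the three update equations in place, recording every iteration."""
    a = b = c = 1.0  # initial values of PR(A), PR(B), PR(C)
    history = [(a, b, c)]
    for _ in range(n_iters):
        a = (1 - d) + d * c
        b = (1 - d) + d * (a / 2)
        c = (1 - d) + d * (a / 2 + b)
        history.append((a, b, c))
    return history

history = iterate(12)
for i, (a, b, c) in enumerate(history):
    print(f"{i:2d} {a:.8f} {b:.8f} {c:.8f}")
```

After 12 iterations the values agree with the direct solution 14/13, 10/13 and 15/13 to about eight decimal places.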

Problems with PageRank
Biased against new Web pages, which have few inlinks yet
• Can be mitigated with a boost factor
No balance between relevancy and popularity
• Very popular pages (such as search engines and Web portals) may be ranked artificially high due to their popularity, even if they are not very related to the query
Despite these problems, PageRank seems to work fairly well in practice

Open-Source Search Engine Code
Lucene Search Engine
• http://lucene.apache.org/
SWISH
• http://swish-e.org/
Glimpse
• http://webglimpse.net/
and more