REVIEW OF “The Anatomy of a Large-Scale Hypertextual Web Search Engine”

Sergey Brin and Lawrence Page designed ‘Google’ to build a search engine that can crawl and index the web quickly and efficiently while coping with huge, uncontrolled hypertext collections. One of their main goals was to improve the quality and scalability of search. Another was to set up a system that supports novel research on large-scale web data, one that a reasonable number of people can actually use for their academic research.

Google makes efficient use of storage space to store the index, which allows search quality to scale as the web grows, and its data structures are optimized for fast, efficient access. To achieve high precision, Google uses the link structure of the web to compute a quality ranking for each page, called PageRank. A page's PageRank is the probability that a ‘random surfer’ visits it. The ranking also involves a damping factor: the probability that, at each page, the random surfer gets bored and requests another random page instead of following a link. PageRank allows for personalization and makes it nearly impossible to deliberately mislead the system into giving a page a higher ranking.
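To make the random-surfer model concrete, here is a minimal Python sketch of the iteration behind the paper's formula PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where the pages Ti link to A and C(T) is the number of links out of T. The toy graph and iteration count below are illustrative assumptions; the paper suggests a damping factor d of about 0.85.

    # Minimal sketch: PageRank by repeated application of the formula,
    # on a link graph given as {page: [pages it links to]}.
    def pagerank(links, d=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 for p in pages}              # start with uniform scores
        for _ in range(iterations):
            new_rank = {}
            for p in pages:
                # Rank flowing into p: each page q linking to p contributes
                # its own rank divided by its outgoing-link count C(q).
                incoming = sum(rank[q] / len(links[q])
                               for q in pages if p in links[q])
                # The (1 - d) term is the bored surfer jumping to a random page.
                new_rank[p] = (1 - d) + d * incoming
            rank = new_rank
        return rank

    # Toy example: A and B both link to C; C links back to A.
    print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))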

The text of a link is associated both with the page the link is on and with the page the link points to. This anchor-text propagation improves search quality, but exploiting it efficiently was a challenge because of the heavy data processing it requires. Along with PageRank, Google keeps track of the location information of all hits and some visual presentation details, and it stores the full raw HTML of every page in the repository.

Most of Google's architecture is implemented in C or C++ for efficiency and runs on either Solaris or Linux. Its data structures include BigFiles, a document index, a lexicon, forward and inverted indexes, and a huge repository, and they are designed to avoid disk seeks whenever possible, since a seek costs far more than the sequential read that follows it. Google has a fast distributed crawling system in which the URL server and the crawlers are implemented in Python. Each crawler maintains its own DNS cache to reduce the number of DNS lookups, and it uses asynchronous IO and a number of queues.
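The paper describes this design without giving code, so the following is a hypothetical Python sketch, not Google's implementation: one crawler task pulls URLs from a shared queue and resolves hostnames through a local cache using asynchronous lookups; fetching and parsing are elided.

    import asyncio
    from urllib.parse import urlparse

    DNS_CACHE = {}   # hostname -> IP address, kept per crawler

    async def resolve(host):
        # Consult the local cache first; hit the network only on a miss.
        if host not in DNS_CACHE:
            infos = await asyncio.get_running_loop().getaddrinfo(host, 80)
            DNS_CACHE[host] = infos[0][4][0]     # first resolved address
        return DNS_CACHE[host]

    async def crawler(queue):
        while True:
            url = await queue.get()
            ip = await resolve(urlparse(url).hostname)
            # ...fetch the page from ip with non-blocking IO, store it,
            # and enqueue any newly discovered URLs...
            queue.task_done()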

Indexing involves parsing, indexing documents into barrels using multiple indexers running in parallel, and sorting. Google's ranking system is designed so that no single factor has too much influence. The IR score of a document is computed as the dot product of the vector of count-weights with the vector of type-weights; the IR score is then combined with PageRank to give the document its final rank.
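To illustrate this scoring step, here is a minimal Python sketch. The specific type-weights, the simple cap standing in for the paper's tapering count-weights, and the weighted-sum combination with PageRank are all illustrative assumptions; the paper describes the structure of the computation but does not publish its parameters.

    # Illustrative hit types and type-weights (assumed values).
    TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "plain": 1.0}

    def ir_score(hit_counts):
        # Dot product of count-weights with type-weights. Counts are
        # capped so that a flood of one hit type cannot dominate,
        # echoing the goal that no single factor has too much influence.
        score = 0.0
        for hit_type, count in hit_counts.items():
            count_weight = min(count, 8)         # assumed cap
            score += count_weight * TYPE_WEIGHTS[hit_type]
        return score

    def final_rank(hit_counts, page_rank, alpha=0.5):
        # The paper combines the IR score with PageRank but does not
        # give the exact function; a weighted sum is one plausible choice.
        return alpha * ir_score(hit_counts) + (1 - alpha) * page_rank

    print(final_rank({"title": 1, "plain": 12}, page_rank=0.8))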

For multi-word searches, Google uses a more complex algorithm that also weights hits occurring close together more highly. Google additionally considers feedback from trusted users when adjusting its ranking function.

Google produces better results than the major commercial search engines for most searches. It has evolved to overcome a number of bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO during its various operations. Because crawling and indexing are efficient, the index can be kept up to date and major changes can be tested relatively quickly. Google still lacks optimizations such as query caching and sub-indices on common terms, and its inventors intended to speed it up considerably through distribution and through hardware, software, and algorithmic improvements. They wished to make Google a high-quality search tool for searchers and researchers all around the world, sparking the next generation of search engine technology.

KOSURU SAI MALLESWAR; SC09B093; SEM-6.
