

    REPORT TITLED

SEARCH ENGINES: CONCEPT, TECHNOLOGY AND CHALLENGES

    Submitted

    in partial fulfillment of

    the requirements for the Degree of

    Master of Computer Applications

    (MCA)

    By

    Mrunalini S. Shinde

    Roll No. 092011015

Under the guidance of Prof. L. C. Nene

    Department Of Computer Technology

    Veermata Jijabai Technological Institute

    (Autonomous Institute, Affiliated To University of Mumbai)

    Mumbai 400019

    Year 2011-2012


    VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE

    MATUNGA, MUMBAI 400019

    CERTIFICATE

    This is to certify that the seminar report titled

Search Engines: Concept, Technology and Challenges

has been completed successfully by

Miss. Mrunalini S. Shinde, Roll No. 092011015

Class: MCA-VI, in the Academic Year 2011-2012

    Evaluator:

    Date:


Contents

1 Introduction
  1.1 Search Engines
  1.2 History of Search Engines
2 Components of Search Engine
3 How Search Engines Work?
  3.1 Web Crawling
  3.2 Indexing
  3.3 Searching
  3.4 Relevance Ranking
4 Types of Search Engines
  4.1 Crawler-Based Search Engines
  4.2 Human-Powered Directories
  4.3 Hybrid Search Engines
5 Search Engine Ranking Algorithms
  5.1 TF-IDF Ranking Algorithm
  5.2 PageRank Algorithm
6 Challenges to Search Engines
7 Search Engine Optimization (SEO)
  7.1 SEO
  7.2 Advantages of SEO
8 Challenges to SEOs
9 Case Study: Google Search
  9.1 Introduction
  9.2 Architecture & Working
10 Conclusion
11 References


    1. Introduction

    1.1 Search Engines

A search engine is a tool or program designed to search for information on the WWW on the basis of specified keywords and to return a list of the documents in which those keywords were found.

    Internet search engines are special sites on the Web that are designed to help people find

    information stored on other sites. There are differences in the ways various search engines

    work, but they all perform three basic tasks:

They search the Internet -- or select pieces of the Internet -- based on important words.

They keep an index of the words they find, and where they find them.

They allow users to look for words or combinations of words found in that index.

    1.2 History of Search Engines

    Gopher

Gopher was developed in 1991 and was in use up to 1996. It was an Internet server from which hierarchically organized text files could be retrieved from all over the world.

    Developed at the University of Minnesota, whose sports teams are called The Golden

    Gophers.

HyperGopher could also display GIF and JPEG graphic images.

    Three important Gopher applications were Archie, Veronica and Jughead.

    Archie was a tool for indexing FTP archives, allowing people to find specific files. It is

    considered to be the first Internet search engine.


Veronica, i.e. "Very Easy Rodent-Oriented Net-wide Index to Computer Archives", is a search engine system for the Gopher protocol, developed in 1992 by Steven Foster and Fred Barrie at the University of Nevada, Reno.

    Veronica is a constantly updated database of the names of almost every menu item on

    thousands of Gopher servers. The Veronica database can be searched from most major

    Gopher menus.

Jughead, i.e. "Jonzy's Universal Gopher Hierarchy Excavation And Display", is a search engine system for the Gopher protocol. Jughead was developed by Rhett Jones in 1993 at the University of Utah. It is distinct from Veronica in that it searches a single server at a time.

However, Gopher lost importance with the introduction of the first graphical browser, viz. Mosaic.

    Wide Area Information Servers (W.A.I.S.)

    W.A.I.S. coexisted with Gopher.

    For Gopher, files had to be stored in a predetermined manner in databases.

    The W.A.I.S. user had to connect to known databases in order to retrieve information or files.

It met the same fate as Gopher, i.e. it became superfluous with the introduction of browsers and search engines.

    Wandex

The first real search engine, in the form that we know search engines today, didn't come into

    being until 1993. It was developed by Matthew Gray, and it was called Wandex. Wandex

    indexed the files and allowed users to search for them. This technology was the first program

    to crawl the Web, and later became the basis for all search crawlers.


    2. Components of Search Engine

    Fig: Components Of Search Engine

    Search Form

The search form can be considered the user interface of the search engine. It is a simple form where the user enters a query, usually in the form of keywords.


    Query Parser

    The Query Parser tokenizes the input and looks for operators and filters.

    Index

    Index is the file that is created by the web crawler and is used as a lookup by the Query

    Engine.

    Query Engine

    The Query Engine finds the web pages that match the given criteria using index.

    Relevance Ranker

    The Relevance ranker is the search engine algorithm that ranks the results of search in order

    of relevance.

    Formatter

This deals with the way results are laid out and displayed to the user. It shows the results in

    order of importance as decided by the relevance ranking.
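To make the data flow between these components concrete, the following Python sketch wires a toy search-form input through a query parser, an in-memory index, a query engine, a frequency-based relevance ranker and a formatter. The class and function names (SimpleSearchEngine, parse_query) and the sample pages are illustrative assumptions, not part of any real engine.

from collections import defaultdict

class SimpleSearchEngine:
    def __init__(self, pages):
        # Index: word -> set of page ids (here built directly; in practice built from crawled pages)
        self.pages = pages
        self.index = defaultdict(set)
        for page_id, text in pages.items():
            for word in text.lower().split():
                self.index[word].add(page_id)

    def parse_query(self, query):
        # Query parser: tokenize the user's keywords from the search form
        return query.lower().split()

    def search(self, query):
        # Query engine: use the index to find pages containing every query term
        terms = self.parse_query(query)
        matches = set(self.pages)
        for term in terms:
            matches &= self.index.get(term, set())
        # Relevance ranker: order matches by how often the query terms occur on the page
        ranked = sorted(
            matches,
            key=lambda p: sum(self.pages[p].lower().split().count(t) for t in terms),
            reverse=True,
        )
        # Formatter: lay the results out in ranked order
        return ["%d. %s" % (rank, page_id) for rank, page_id in enumerate(ranked, 1)]

engine = SimpleSearchEngine({
    "page1": "search engines index the web",
    "page2": "web crawlers build the index used by search engines",
})
print(engine.search("search index"))

Running the example prints the matching page identifiers in ranked order, mirroring the pipeline described above.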


3. How Search Engines Work?

    Search engines use software robots to survey the Web and build their databases. Web

documents are retrieved and indexed. When you enter a query at a search engine website, your input is checked against the search engine's keyword indices. Search engines look

    through their own databases of information in order to find what it is that you are looking for.

    The best matches are then returned to you as hits.

    Fig: Internal Working Of Search Engine


    3.1 Web Crawling

    Before a search engine can tell you where a file or document is, it must be found. To find

    information on the hundreds of millions of Web pages that exist, a search engine employs

    special software robots, called spiders, to build lists of the words found on Web sites.

    When a spider is building its lists, the process is called Web crawling. In order to build and

    maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
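The following is a minimal sketch of such a spider, assuming only Python's standard library; the seed URL, page limit and the crude "list of words per page" are illustrative assumptions, not how any production crawler works.

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag on the page
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch pages, record their words, follow links."""
    to_visit, visited, word_lists = [seed_url], set(), {}
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        visited.add(url)
        word_lists[url] = html.split()          # crude "list of words" found on the page
        parser = LinkExtractor()
        parser.feed(html)
        to_visit.extend(urljoin(url, link) for link in parser.links)
    return word_lists

# Example (hypothetical seed URL):
# pages = crawl("http://example.com", max_pages=5)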

    3.2 Indexing

Indexing is extracting the content and storing it, i.e. assigning each word to the page under which it will be found later on when users are searching.

It uses techniques similar to those used when handling actual queries, such as the following:

Stopword lists: words that do not contribute to the meaning.
Examples: a, an, in, the, we, you, do, and, etc.

Word stemming: creating a canonical form.
Example: words -> word, swimming -> swim, etc.

Thesaurus: words with identical/similar meaning; synonyms.

Capitalization: mostly ignored (content is important, not case).

    Some search engines also index different file types.

    Example: Google also indexes PDF files
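A small Python sketch of these indexing steps (stopword removal, crude stemming and case folding) is given below; the stopword list and suffix-stripping rules are illustrative assumptions, and real engines use proper stemmers such as Porter's.

STOPWORDS = {"a", "an", "in", "the", "we", "you", "do", "and"}

def crude_stem(word):
    # Real engines use full stemmers; this only trims a few common suffixes
    for suffix in ("ming", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    terms = []
    for token in text.lower().split():      # capitalization is ignored
        if token in STOPWORDS:              # stopwords do not contribute meaning
            continue
        terms.append(crude_stem(token))     # store the canonical (stemmed) form
    return terms

print(index_terms("We do swimming in the words"))   # -> ['swim', 'word']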

    3.3 Searching

    There are two primary methods of text searching--keyword and concept.

    KEYWORD SEARCHING:

    This is the most common form of text search on the Web. Most search engines do their text

    query and retrieval using keywords.

    Unless the author of the Web document specifies the keywords for her document (this is

    possible by using meta tags in HTML), it's up to the search engine to determine them.


    Essentially, this means that search engines pull out and index words that are believed to be

    significant.

    Words that are mentioned towards the top of a document and words that are repeated several

    times throughout the document are more likely to be deemed important.

    CONCEPT BASED SEARCHING:

    Unlike keyword search systems, concept-based search systems try to determine what you

    mean, not just what you say. In the best circumstances, a concept-based search returns hits

    on documents that are "about" the subject/theme you're exploring, even if the words in the

    document don't precisely match the words you enter into the query.

    This is also known as clustering -- which essentially means that words are examined in

    relation to other words found nearby.

    3.4 Relevance Ranking

    Search for anything using your favorite crawler-based search engine. Nearly instantly, the

    search engine will sort through the millions of pages it knows about and present you with

    ones that match your topic. The matches will even be ranked, so that the most relevant ones

    come first.

    Of course, the search engines don't always get it right. Non-relevant pages make it through,

    and sometimes it may take a little more digging to find what you are looking for. But, by and

    large, search engines do an amazing job.

    How do crawler-based search engines go about determining relevancy, when confronted

    with hundreds of millions of web pages to sort through? They follow a set of rules, known as

    an algorithm. Exactly how a particular search engine's algorithm works is a closely-kept trade

    secret. However, all major search engines follow the general rules below.

    Location, Location, Location...and Frequency

One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.


    Search engines will also check to see if the search keywords appear near the top of a web

    page, such as in the headline or in the first few paragraphs of text. They assume that any page

    relevant to the topic will mention those words right from the beginning.

    Frequency is the other major factor in how search engines determine relevancy. A search

    engine will analyze how often keywords appear in relation to other words in a web page.

    Those with a higher frequency are often deemed more relevant than other web pages.
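As a rough illustration of this location/frequency method, the following Python sketch scores a page higher when query keywords occur often and also appear in the first part of the text; the top-of-page fraction and the bonus weight are arbitrary assumptions.

def location_frequency_score(page_text, keywords, top_fraction=0.2):
    words = page_text.lower().split()
    top_words = words[: max(1, int(len(words) * top_fraction))]   # "top of the page"
    score = 0.0
    for kw in keywords:
        kw = kw.lower()
        score += words.count(kw)          # frequency of the keyword on the whole page
        if kw in top_words:
            score += 5                    # bonus: keyword appears near the top
    return score

page = "Search engines rank pages. A search engine checks keyword location and frequency."
print(location_frequency_score(page, ["search", "keyword"]))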

    Off The Page Factors

Off the page factors are those that a webmaster cannot easily influence. Chief among these is

    link analysis. By analyzing how pages link to each other, a search engine can both determine

    what a page is about and whether that page is deemed to be "important" and thus deserving of

    a ranking boost.

    Another off the page factor is clickthrough measurement. In short, this means that a search

    engine may watch what results someone selects for a particular search, and then eventually

    drop high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages

    that do pull in visitors.

    Number of other Web pages that link to the page in question. Google uses this to calculate

    page rank.


4. Types of Search Engines

    The term "search engine" is often used generically to describe both crawler-based search

    engines and human-powered directories. These two types of search engines gather their

    listings in radically different ways.

    4.1 Crawler-Based Search Engines

    Crawler-based search engines, such as Google, create their listings automatically. They

    "crawl" or "spider" the web, then people search through what they have found. If you changeyour web pages, crawler-based search engines eventually find these changes, and that can

    affect how you are listed. Page titles, body copy and other elements all play a role.

    4.2 Human-Powered Directories

    A human-powered directory, such as the Open Directory, depends on humans for its listings.

    You submit a short description to the directory for your entire site, or editors write one for

    sites they review. A search looks for matches only in the descriptions submitted.

    Changing your web pages has no effect on your listing. Things that are useful for improving a

    listing with a search engine have nothing to do with improving a listing in a directory. The

    only exception is that a good site, with good content, might be more likely to get reviewed for

    free than a poor site.

    4.3 "Hybrid Search Engines" Or Mixed Results

    In the web's early days, it used to be that a search engine either presented crawler-based

    results or human-powered listings. Today, it is extremely common for both types of results to

    be presented. Usually, a hybrid search engine will favor one type of listings over another.

However, it will often also present the other type of results, for example crawler-based results (as provided by Inktomi), especially for more obscure queries.


    5. Search Engine Ranking Algorithms

    After the database has been created and placed in the search engine computer's

memory, the device is finally ready to perform searches and deliver results. Only now does another device come into play: the ranking algorithm. All search engines,

    including directories, score the relevancy of web pages through these mathematical

    machines. Their purpose is to deliver links to web pages most relevant to each search

    phrase. Rightfully so, these automatic mechanisms are a source of great pride and

    revenue for their inventors.

    5.1 TF-IDF Ranking Algorithm

This algorithm calculates the rank of a page based on the following two concepts:

    1. Term Frequency (TF) i.e. how frequently the term appears on the page.

    2. Inverse Document Frequency (IDF) i.e. rare words are likely to be more important.

Consider, for example, the query "Mahendra Singh Dhoni".

    A good answer contains all three words, and the more frequently the better; we call this Term

    Frequency (TF).

Some query terms are more important, i.e. have better discriminating power, than others. For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra"; we call this Inverse Document Frequency (IDF).

w_ij = tf_ij * log2(N / n)

where
w_ij  : weight of term T_j in document D_i
tf_ij : frequency of term T_j in document D_i
N     : number of documents in the collection
n     : number of documents in which term T_j occurs at least once
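A short worked sketch of this weighting, using a three-document toy collection (the documents and query terms are illustrative):

import math

documents = {
    "d1": "mahendra singh dhoni is a cricketer",
    "d2": "mahendra is a common first name",
    "d3": "singh is a common surname",
}

def tfidf(term, doc_id):
    docs = {d: text.lower().split() for d, text in documents.items()}
    tf = docs[doc_id].count(term)                            # tf_ij: occurrences of term T_j in document D_i
    n = sum(1 for words in docs.values() if term in words)   # n: documents containing the term
    if tf == 0 or n == 0:
        return 0.0
    N = len(docs)                                            # N: number of documents in the collection
    return tf * math.log2(N / n)

print(round(tfidf("dhoni", "d1"), 3))      # 1.585 -> rarer term, higher weight
print(round(tfidf("mahendra", "d1"), 3))   # 0.585 -> appears in more documents

Here "dhoni" occurs in only one of the three documents, so it receives a higher weight than "mahendra", matching the intuition that rarer terms discriminate better.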


    5.2 PageRank Algorithm

PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weight to each element of a hyperlinked set of

    documents, such as the World Wide Web, with the purpose of measuring its relative

    importance within the set.

Fig: Mathematical PageRanks for a simple network, expressed as percentages

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. Also, C(A) is

    defined as the number of links going out of page A. The PageRank of a page A is given as

    follows:

    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

    Note that the PageRanks form a probability distribution over web pages, so the sum of all

web pages' PageRanks will be one.

    PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to

    the principal eigenvector of the normalized link matrix of the web.
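A simple iterative computation of this formula can be sketched in Python as follows; the link graph, damping factor and iteration count are illustrative assumptions.

def pagerank(links, d=0.85, iterations=50):
    """links: mapping page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {page: 1.0 for page in pages}                  # initial PageRank for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * incoming       # PR(A) = (1-d) + d * sum(...)
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))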


    Simplified algorithm

    Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or

    multiple outbound links from one single page to another single page, are ignored. PageRank

    is initialized to the same value for all pages. In the original form of PageRank, the sum of

    PageRank over all pages was the total number of pages on the web at that time, so each page

    in this example would have an initial PageRank of 1. However, later versions of PageRank,

    and the remainder of this section, assume a probability distribution between 0 and 1. Hence

    the initial value for each page is 0.25.

    The PageRank transferred from a given page to the targets of its outbound links upon the next

    iteration is divided equally among all outbound links.

    If the only links in the system were from pages B, C, and D to A, each link would transfer

    0.25 PageRank to A upon the next iteration, for a total of 0.75.

    Suppose instead that page B had a link to pages C and A, while page D had links to all three

    pages. Thus, upon the next iteration, page B would transfer half of its existing value, or

    0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it

    would transfer one third of its existing value, or approximately 0.083, to A.

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links L(v).

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = sum over all v in B_u of ( PR(v) / L(v) )

i.e. the PageRank value for a page u is dependent on the PageRank values of each page v contained in the set B_u (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
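The four-page example above can be checked with a one-iteration sketch of this simplified formula (assuming, as in the text, that C still links only to A and that A itself has no outbound links):

# Pages A, B, C, D, each starting with PageRank 0.25.
# B links to C and A; D links to A, B and C; C links only to A.
links = {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

def one_iteration(pr, links):
    # PR(u) = sum over v in B_u of PR(v) / L(v)
    return {page: sum(pr[src] / len(targets)
                      for src, targets in links.items() if page in targets)
            for page in pr}

print(one_iteration(pr, links))
# A receives 0.125 from B, 0.25 from C and ~0.083 from D, as described in the text.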


    6. Challenges to Search Engines

1. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research.

2. Automated search engines that rely on keyword matching usually return too many low-quality matches.

3. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.

4. Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection, do not contribute new information to search results, and thus annoy users.

5. Uniform sampling of web pages.

6. Modelling the web graph and utilising its structure in search engines.


    7. Search Engine Optimization

7.1 SEO

    From the above discussion it is clear that there are two things involved in getting a web page

    into search engine results:

1. Getting into the search engine index.
2. Getting the web page to the top of the final sorted results before display.

    Accomplishing step 1 is relatively easy. You just need to let the search engine spider know

    that the new web page exists. You can do this by pointing to the new page from an existing

    web page that is already indexed. Some search engines also provide an option to suggest a

    new URL for inclusion into their index.

    Step 2 is the tough part. Most of the Search Engine Optimization tasks revolve around this.

    Search engines spend a lot of time and effort on making their algorithms find the best way to

    rank sites. According to Google, there are over 200 factors that determine the rank of a web

    page in the results.

Thus, Search Engine Optimization is the process of trying to get your web page to rank at the top of the search engine results for the keywords that are important to you.

    7.2 Advantages of SEO

1) It improves the ranking of the website and hence improves the turnover of the firm or company.

2) SEO can increase the number of visitors who are actively searching for your website and products.

3) SEO increases brand visibility and hence can substantially increase sales.

4) SEO services are highly cost-effective; the SEO of a site does not require much capital.

5) SEO also increases flexibility, visibility and targeted traffic, gives long-term top positioning, and much more.


    8. Challenges To SEOs

    1. The challenge of keyword research

You want your website to be ranked well in all suitable searches users perform. This requires choosing and using the appropriate keywords in an appropriate manner, i.e. some amount of keyword research.

Thinking from the user's perspective and deciding the keywords for any website is indeed a

    challenge.

The other issue here can be the incorrect or misleading use of the same keywords by other sites competing for relevance ranking.

    2. The challenge of optimizing your page

Many SEOs look at optimization as their major challenge. SEOs are often so focused on where the keywords are on their page that they are not really creating the level of "content quality" or value that they might have if they were not so preoccupied with optimization. You've all seen pages that are stacked with keywords, but how do these read, quality-wise, to the average human being? Usually they reach only a fraction of the potential they could have with just a slightly different approach.

I've even seen people so focused on the optimization process that they forget to include any call to action. What good is even a number 1 ranking for the best possible phrase if, when people read the page, they just leave again because there is no call to action?

So what am I suggesting? The solution is simple. Write with your researched keyword phrase in mind, but do not try to optimize and create new content at the exact same time. Instead, think about it as a two-step process and take these as two separate steps.

Step 1: Create new content for your readers. Write some unique, original content that you feel you can be proud of, something you know your visitors will find interesting and useful.


    Don't even think about keyword density, or keyword prominence or keyword placement. Stay

focused on the message of your content, so that you end up with a well-crafted page that reads well when you read it out loud, and make sure it serves the needs of your readers. Make sure there is a significant call to action. Okay, now you've created your content; you have done the

    hard part.

Step 2: Now go back and do a simple re-write for the search engines. You'll be re-writing a finished article, making only mild changes to it for the purpose of optimization.

Tip: If you've done good keyword research and are optimizing for the right phrases, you'll find that with this two-step approach you create much better quality articles for your visitors, and because you have done good research, the optimization is not that hard.

3. The challenge of knowing exactly what search engine spiders are doing and how far they are crawling into your Web site

How do you know exactly what those search engine spiders are doing, how often they visit, which pages they visit, how long they stay and a whole lot more?

There are tools, such as Robot Manager Professional, that allow you to track some really fascinating information about search engine spiders.

    4. Are you trying to get search engine spiders to come back and re-visit

    your Web site more often?

    Are you trying to use a Meta Tag to get search engine spiders to come back and revisit your

    Web site every so often?

Save your time. Search engine robots actually run much better based on their own schedule, and just because you include a Meta Revisit tag does not mean the robot is going to pay attention to it.


    5. Content Freshness:

There is one tip which is extremely effective in getting search engine robots to come and visit you much more often. We refer to this as the "content freshness" factor. It's not complex to understand.

What you want to do is start adding pages to your Web site on a regular, consistent basis. For example, if you were to add one new article per month to your Web site, that would be good, but even better would be to add one new article to your Web site per week. The key is

    consistency. Each time the robot returns and sees more new content it tends to adjust the

frequency of its schedule. There are other benefits as well; even if you were only to add new content regularly to your site for the sake of your visitors, you may enjoy some of these other benefits too.


    9. Case Study: Google Search

9.1 Introduction

The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank. Second, Google utilizes links to improve search results.

    9.2 Architecture & Working

    Fig: High Level Google Architecture


    Working

    In Google, the web crawling is done by several distributed crawlers. There is a URLserver

that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a

    repository. Every web page has an associated ID number called a docID which is assigned

    whenever a new URL is parsed out of a web page.

    The indexing function is performed by the indexer and the sorter. The indexer performs a

    number of functions. It reads the repository, uncompresses the documents, and parses them.

    Each document is converted into a set of word occurrences called hits. The hits record the

    word, position in document, an approximation of font size, and capitalization.

    The indexer distributes these hits into a set of "barrels", creating a partially sorted forward

    index. The indexer performs another important function. It parses out all the links in every

    web page and stores important information about them in an anchors file. This file contains

    enough information to determine where each link points from and to, and the text of the link.

    The URLresolver reads the anchors file and converts relative URLs into absolute URLs and

    in turn into docIDs. It puts the anchor text into the forward index, associated with the docID

    that the anchor points to. It also generates a database of links which are pairs of docIDs. The

    links database is used to compute PageRanks for all the documents.

    The sorter takes the barrels, which are sorted by docID and resorts them by wordID to

    generate the inverted index. This is done in place so that little temporary space is needed for

    this operation. The sorter also produces a list of wordIDs and offsets into the inverted index.

    A program called DumpLexicon takes this list together with the lexicon produced by the

    indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web

    server and uses the lexicon built by DumpLexicon together with the inverted index and the

    PageRanks to answer queries.
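The forward-index/inverted-index idea at the heart of this pipeline can be sketched in Python as follows; the docIDs, documents and hit format are simplified assumptions and omit font size, capitalization and the barrel structure.

from collections import defaultdict

documents = {1: "google crawls the web", 2: "the web is indexed by google"}

# Forward index: docID -> list of (word, position) hits, as produced by the indexer
forward_index = {
    doc_id: [(word, pos) for pos, word in enumerate(text.lower().split())]
    for doc_id, text in documents.items()
}

# Inverted index: word -> list of (docID, position), as produced by the sorter
inverted_index = defaultdict(list)
for doc_id, hits in forward_index.items():
    for word, pos in hits:
        inverted_index[word].append((doc_id, pos))

print(inverted_index["google"])    # -> [(1, 0), (2, 5)]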


    10. Conclusions

Though there are many search engines available on the web, the searching methods and the engines themselves still need to go a long way for efficient retrieval of information on relevant topics.

    Indexing the entire web and building one huge integrated index will further deteriorate

retrieval effectiveness, since the web is growing at an exponential rate. Building indexes in a hierarchical manner can be considered an alternative.

The current generation of search tools and services has to significantly improve its retrieval effectiveness. Otherwise, the web will continue to evolve towards an information entertainment center for users with no specific search objectives.

Choosing the right search engine is also a challenge, and several factors should be considered while deciding.

SEOs can do a great deal to help increase search engines' efficiency and improve the performance of your websites.


    11. References

1. Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"

2. www.perfect-optimization.com

3. www.google.com

4. http://computer.howstuffworks.com

5. http://www.searchengineworkshops.com/articles/5-challenges.html

6. Monica R. Henzinger, "Algorithmic Challenges in Web Search Engines"

7. http://searchenginewatch.com/article/2065267/International-SEO-Challenges-and-Tips