8/22/2019 Seminar Formatkhjj
REPORT TITLED
SEARCH ENGINES CONCEPT, TECHNOLOGY AND
CHALLENGES
Submitted
in partial fulfillment of
the requirements for the Degree of
Master of Computer Applications
(MCA)
By
Mrunalini S. Shinde
Roll No. 092011015
Under the guidance of Prof. L. C. Nene
Department Of Computer Technology
Veermata Jijabai Technological Institute
(Autonomous Institute, Affiliated To University of Mumbai)
Mumbai 400019
Year 2011-2012
VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE
MATUNGA, MUMBAI 400019
CERTIFICATE
This is to certify that the seminar report titled
Search Engines: Concept, Technology and Challenges
has been completed successfully by
Miss. Mrunalini S. Shinde
Roll No. 092011015
Class: MCA-VI, in Academic Year 2011-2012
Evaluator:
Date:
Contents
1 Introduction
1.1 Search Engines
1.2 History of Search Engines
2 Components of Search Engine
3 How Search Engines Work
3.1 Web Crawling
3.2 Indexing
3.3 Searching
3.4 Relevance Ranking
4 Types of Search Engines
4.1 Crawler-Based Search Engines
4.2 Human-Powered Directories
4.3 Hybrid Search Engines
5 Search Engine Ranking Algorithms
5.1 TF-IDF Ranking Algorithm
5.2 PageRank Algorithm
6 Challenges to Search Engines
7 Search Engine Optimization (SEO)
7.1 SEOs
7.2 Advantages of SEO
8 Challenges to SEOs
9 Case Study: Google Search
9.1 Introduction
9.2 Architecture & Working
10 Conclusion
11 References
1. Introduction
1.1 Search Engines
A search engine is a tool or program designed to search for information on the World Wide Web on the basis of specified keywords and to return a list of the documents where the keywords were found.
Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
1. They search the Internet -- or select pieces of the Internet -- based on important words.
2. They keep an index of the words they find, and where they find them.
3. They allow users to look for words or combinations of words found in that index.
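The three basic tasks above can be sketched in a few lines of code. The tiny "web" dictionary and all page names below are hypothetical stand-ins for gathered documents; this is an illustration of the idea, not any engine's actual implementation.

```python
# Task 1's output: a small collection of gathered documents (made up).
web = {
    "page1": "search engines index words found on web pages",
    "page2": "users search an index of words",
    "page3": "web crawlers gather pages",
}

# Task 2: keep an index of the words found, and where they were found.
index = {}
for page, text in web.items():
    for word in text.split():
        index.setdefault(word, set()).add(page)

# Task 3: let users look for words or combinations of words in that index.
def lookup(*words):
    """Return the pages containing ALL of the given words."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(sorted(lookup("search", "index")))  # pages containing both words
```

A real index also records positions and other metadata per occurrence, as discussed in the indexing section later.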
1.2 History of Search Engines
Gopher
Gopher was developed in 1991 and was in use up to 1996. It was an Internet service from which hierarchically organized text files could be retrieved from all over the world.
It was developed at the University of Minnesota, whose sports teams are called the Golden Gophers.
HyperGopher could also display GIF and JPEG graphic images.
Three important search tools of the Gopher era were Archie, Veronica and Jughead.
Archie was a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine.
Veronica, i.e. "Very Easy Rodent-Oriented Net-wide Index to Computerized Archives", is a search engine system for the Gopher protocol, developed in 1992 by Steven Foster and Fred Barrie at the University of Nevada, Reno.
Veronica is a constantly updated database of the names of almost every menu item on thousands of Gopher servers. The Veronica database can be searched from most major Gopher menus.
Jughead, i.e. "Jonzy's Universal Gopher Hierarchy Excavation And Display", is a search engine system for the Gopher protocol. Jughead was developed by Rhett Jones in 1993 at the University of Utah. It is distinct from Veronica in that it searches a single server at a time.
However, Gopher lost importance with the introduction of the first graphical browser, Mosaic.
Wide Area Information Servers (W.A.I.S.)
W.A.I.S. coexisted with Gopher.
For Gopher, files had to be stored in a predetermined manner in databases.
The W.A.I.S. user had to connect to known databases in order to retrieve information or files.
It met the same fate as Gopher, i.e. it became superfluous with the introduction of browsers and search engines.
Wandex
The first real search engine, in the form that we know search engines today, didn't come into
being until 1993. It was developed by Matthew Gray, and it was called Wandex. Wandex
indexed the files and allowed users to search for them. This technology was the first program
to crawl the Web, and later became the basis for all search crawlers.
2. Components of Search Engine
Fig: Components Of Search Engine
Search Form
The Search Form can be considered the user interface of a search engine. It is a simple form where the user enters a query, usually in the form of keywords.
Query Parser
The Query Parser tokenizes the input and looks for operators and filters.
Index
Index is the file that is created by the web crawler and is used as a lookup by the Query
Engine.
Query Engine
The Query Engine finds the web pages that match the given criteria using index.
Relevance Ranker
The Relevance ranker is the search engine algorithm that ranks the results of search in order
of relevance.
Formatter
The Formatter deals with the way results are laid out and displayed to the user. It shows the results in
order of importance as decided by the relevance ranking.
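The Query Parser's role described above can be illustrated with a small sketch. The operator and filter syntax shown here ("OR", a leading "-" for exclusion, "site:" filters) is an assumption made for illustration, not any particular engine's query grammar.

```python
# Hypothetical sketch of a Query Parser: tokenize the raw query and
# separate ordinary keywords from operators and filters.
def parse_query(raw):
    keywords, excluded, filters, operators = [], [], {}, []
    for token in raw.split():
        if token.upper() in ("AND", "OR", "NOT"):
            operators.append(token.upper())       # boolean operator
        elif token.startswith("-"):
            excluded.append(token[1:])            # excluded term
        elif ":" in token:
            name, _, value = token.partition(":")
            filters[name] = value                 # e.g. site:edu
        else:
            keywords.append(token.lower())        # plain keyword
    return {"keywords": keywords, "excluded": excluded,
            "filters": filters, "operators": operators}

print(parse_query("search engines site:edu -spam OR crawlers"))
```

The parsed structure would then be handed to the Query Engine, which looks the keywords up in the index.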
3. How Search Engines Work
Search engines use software robots to survey the Web and build their databases. Web
documents are retrieved and indexed. When you enter a query at a search engine website, your input is checked against the search engine's keyword indices. Search engines look
through their own databases of information in order to find what it is that you are looking for.
The best matches are then returned to you as hits.
Fig: Internal Working Of Search Engine
3.1 Web Crawling
Before a search engine can tell you where a file or document is, it must be found. To find
information on the hundreds of millions of Web pages that exist, a search engine employs
special software robots, called spiders, to build lists of the words found on Web sites.
When a spider is building its lists, the process is called Web crawling. In order to build and
maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
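A spider's traversal can be sketched as a breadth-first walk over the link graph. To keep the example self-contained it crawls a toy in-memory "web" (page name to text and outgoing links) instead of fetching over HTTP; real spiders also obey robots.txt and throttle their requests. All page names here are hypothetical.

```python
from collections import deque

# Toy web: page -> (text on the page, list of outgoing links).
pages = {
    "home": ("welcome to the site", ["about", "news"]),
    "about": ("about this site", ["home"]),
    "news": ("site news and updates", ["home", "about"]),
}

def crawl(start):
    """Breadth-first crawl, building lists of the words found on each page."""
    word_lists = {}
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        text, links = pages[page]
        word_lists[page] = text.split()
        for link in links:                 # follow links to unseen pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return word_lists

print(sorted(crawl("home")))  # every page reachable from "home" is visited
```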
3.2 Indexing
Indexing is extracting the content and storing it, i.e. assigning each word to the pages under which it will be found later on when users are searching.
It uses techniques similar to those used when handling actual queries, such as the following:
Stopword lists: words that do not contribute to the meaning.
Examples: a, an, in, the, we, you, do, and, etc.
Word stemming: creating a canonical form.
Examples: words -> word, swimming -> swim, etc.
Thesaurus: words with identical or similar meaning; synonyms.
Capitalization: mostly ignored (content is important, not case).
Some search engines also index different file types.
Example: Google also indexes PDF files.
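The normalization steps named above can be sketched as follows. The stopword list is taken from the examples in the text; the suffix-stripping rules are a deliberately crude assumption standing in for a real stemmer (production engines use algorithms such as Porter stemming).

```python
# Stopwords from the examples above.
STOPWORDS = {"a", "an", "in", "the", "we", "you", "do", "and"}

def stem(word):
    """Very naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ming", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    tokens = text.lower().split()              # capitalization ignored
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(normalize("We do Swimming in the Words"))  # -> ['swim', 'word']
```

The same normalization is applied to queries at search time, so "Swimming" in a query matches "swim" in the index.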
3.3 Searching
There are two primary methods of text searching--keyword and concept.
KEYWORD SEARCHING:
This is the most common form of text search on the Web. Most search engines do their text
query and retrieval using keywords.
Unless the author of the Web document specifies the keywords for her document (this is
possible by using meta tags in HTML), it's up to the search engine to determine them.
Essentially, this means that search engines pull out and index words that are believed to be
significant.
Words that are mentioned towards the top of a document and words that are repeated several
times throughout the document are more likely to be deemed important.
CONCEPT BASED SEARCHING:
Unlike keyword search systems, concept-based search systems try to determine what you
mean, not just what you say. In the best circumstances, a concept-based search returns hits
on documents that are "about" the subject/theme you're exploring, even if the words in the
document don't precisely match the words you enter into the query.
This is also known as clustering -- which essentially means that words are examined in
relation to other words found nearby.
3.4 Relevance Ranking
Search for anything using your favorite crawler-based search engine. Nearly instantly, the
search engine will sort through the millions of pages it knows about and present you with
ones that match your topic. The matches will even be ranked, so that the most relevant ones
come first.
Of course, the search engines don't always get it right. Non-relevant pages make it through,
and sometimes it may take a little more digging to find what you are looking for. But, by and
large, search engines do an amazing job.
How do crawler-based search engines go about determining relevancy, when confronted
with hundreds of millions of web pages to sort through? They follow a set of rules, known as
an algorithm. Exactly how a particular search engine's algorithm works is a closely-kept trade
secret. However, all major search engines follow the general rules below.
Location, Location, Location...and Frequency
One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.
Search engines will also check to see if the search keywords appear near the top of a web
page, such as in the headline or in the first few paragraphs of text. They assume that any page
relevant to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search
engine will analyze how often keywords appear in relation to other words in a web page.
Those with a higher frequency are often deemed more relevant than other web pages.
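The location/frequency idea can be illustrated with a toy scoring function. The cutoff and weights below are made-up values for illustration; real engines keep their exact formulas secret, as the text notes.

```python
# Illustrative location/frequency scoring: each occurrence of the
# keyword counts, and occurrences near the top of the page count extra.
def location_frequency_score(words, keyword, early_cutoff=5, early_weight=2.0):
    score = 0.0
    for position, word in enumerate(words):
        if word == keyword:
            # words mentioned towards the top are deemed more important
            score += early_weight if position < early_cutoff else 1.0
    return score

page = "python tutorial for beginners covering python basics and more python".split()
print(location_frequency_score(page, "python"))  # -> 4.0 (one early hit, two later)
```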
Off The Page Factors
Off-the-page factors are those that webmasters cannot easily influence. Chief among these is
link analysis. By analyzing how pages link to each other, a search engine can both determine
what a page is about and whether that page is deemed to be "important" and thus deserving of
a ranking boost.
Another off the page factor is clickthrough measurement. In short, this means that a search
engine may watch what results someone selects for a particular search, and then eventually
drop high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages
that do pull in visitors.
The number of other Web pages that link to the page in question is another such factor; Google uses this to calculate PageRank.
4. Types of Search Engine
The term "search engine" is often used generically to describe both crawler-based search
engines and human-powered directories. These two types of search engines gather their
listings in radically different ways.
4.1 Crawler-Based Search Engines
Crawler-based search engines, such as Google, create their listings automatically. They
"crawl" or "spider" the web, then people search through what they have found. If you changeyour web pages, crawler-based search engines eventually find these changes, and that can
affect how you are listed. Page titles, body copy and other elements all play a role.
4.2 Human-Powered Directories
A human-powered directory, such as the Open Directory, depends on humans for its listings.
You submit a short description to the directory for your entire site, or editors write one for
sites they review. A search looks for matches only in the descriptions submitted.
Changing your web pages has no effect on your listing. Things that are useful for improving a
listing with a search engine have nothing to do with improving a listing in a directory. The
only exception is that a good site, with good content, might be more likely to get reviewed for
free than a poor site.
4.3 "Hybrid Search Engines" Or Mixed Results
In the web's early days, it used to be that a search engine either presented crawler-based
results or human-powered listings. Today, it is extremely common for both types of results to
be presented. Usually, a hybrid search engine will favor one type of listings over another.
However, such an engine may also present the other type -- for example, favoring human-powered listings but showing crawler-based results (as provided by Inktomi), especially for more obscure queries.
5. Search Engine Ranking Algorithms
After the database has been created and placed in the search engine computer's
memory, the device is finally ready to perform searches and deliver results. Only now does another device come into play: the ranking algorithm. All search engines,
including directories, score the relevancy of web pages through these mathematical
machines. Their purpose is to deliver links to web pages most relevant to each search
phrase. Rightfully so, these automatic mechanisms are a source of great pride and
revenue for their inventors.
5.1 TF-IDF Ranking Algorithm
This algorithm calculates the page's rank based on the following two concepts:
1. Term Frequency (TF), i.e. how frequently the term appears on the page.
2. Inverse Document Frequency (IDF), i.e. rare words are likely to be more important.
Consider for example the query "Mahendra Singh Dhoni".
A good answer contains all three words, and the more frequently the better; we call this Term Frequency (TF).
Some query terms are more important -- have better discriminating power -- than others. For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra"; we call this Inverse Document Frequency (IDF).
wij = tfij * log2(N / nj)
wij : weight of term Tj in document Di
tfij : frequency of term Tj in document Di
N : number of documents in the collection
nj : number of documents where term Tj occurs at least once
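The weight formula above can be computed directly over a toy collection. The three documents below are made up for illustration; note how "dhoni" (appearing in one document) gets a higher weight than "mahendra" (appearing in two), matching the IDF intuition.

```python
import math

# Toy document collection (hypothetical).
docs = {
    "d1": "mahendra singh dhoni plays cricket",
    "d2": "mahendra is a common first name",
    "d3": "singh is a common surname",
}
N = len(docs)

def weight(term, doc_id):
    """wij = tfij * log2(N / nj) for term Tj in document Di."""
    tf = docs[doc_id].split().count(term)                   # tfij
    n = sum(term in d.split() for d in docs.values())       # nj
    return tf * math.log2(N / n) if n else 0.0

print(weight("dhoni", "d1"))     # rare term: high weight
print(weight("mahendra", "d1"))  # common term: lower weight
```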
5.2 PageRank Algorithm
PageRank is a link analysis algorithm, named after Larry Page and used by Google Internet
search engine that assigns a numerical weight to each element of a hyperlinked set of
documents, such as the World Wide Web, with the purpose of measuring its relative
importance within the set.
Fig: Mathematical PageRanks for a simple network, expressed as percentages
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also, C(A) is
defined as the number of links going out of page A. The PageRank of a page A is given as
follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all
web pages PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to
the principal eigenvector of the normalized link matrix of the web.
Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or
multiple outbound links from one single page to another single page, are ignored. PageRank
is initialized to the same value for all pages. In the original form of PageRank, the sum of
PageRank over all pages was the total number of pages on the web at that time, so each page
in this example would have an initial PageRank of 1. However, later versions of PageRank,
and the remainder of this section, assume a probability distribution between 0 and 1. Hence
the initial value for each page is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the next
iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer
0.25 PageRank to A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, while page D had links to all three
pages. Thus, upon the next iteration, page B would transfer half of its existing value, or
0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it
would transfer one third of its existing value, or approximately 0.083, to A.
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links L(v).
In the general case, the PageRank value for any page u can be expressed as:
PR(u) = sum over all v in Bu of PR(v) / L(v),
i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
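The iterative computation described above can be sketched as follows. So that the ranks form a probability distribution summing to one, this sketch divides the (1 - d) term by the number of pages N; the four-page link graph is a hypothetical example (note that D, which receives no inbound links, ends up with the lowest rank).

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterative PageRank over a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}           # initial value 1/N per page
    for _ in range(iterations):
        pr = {
            p: (1 - d) / n
               + d * sum(pr[v] / len(links[v])  # PR(v) / L(v) from each linker
                         for v in pages if p in links[v])
            for p in pages
        }
    return pr

# Hypothetical graph: A -> B; B -> A, C; C -> A; D -> A, B, C.
ranks = pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]})
print(ranks)
```

The ranks sum to one, and repeated iteration converges to the principal eigenvector of the normalized link matrix, as the text states.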
6. Challenges to Search Engines
1. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research.
2. Automated search engines that rely on keyword matching usually return too many low-quality matches.
3. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.
4. Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection, do not contribute new information to search results, and thus annoy users.
5. Uniform sampling of web pages is difficult.
6. Modelling the web graph and utilizing its structure in a search engine is challenging.
7. Search Engine Optimization
7.1 SEOs
From the above discussion it is clear that there are two things involved in getting a web page
into search engine results:
1. Getting into the search engine index.
2. Getting the web page to the top of the final sorted results before display.
Accomplishing step 1 is relatively easy. You just need to let the search engine spider know
that the new web page exists. You can do this by pointing to the new page from an existing
web page that is already indexed. Some search engines also provide an option to suggest a
new URL for inclusion into their index.
Step 2 is the tough part. Most of the Search Engine Optimization tasks revolve around this.
Search engines spend a lot of time and effort on making their algorithms find the best way to
rank sites. According to Google, there are over 200 factors that determine the rank of a web
page in the results.
Thus, Search Engine Optimization is the process of trying to get your web page to rank at the top of the search engine results for the keywords that are important to you.
7.2 Advantages of SEO
1) It improves the ranking of the website and hence improves the turnover of the firm or company.
2) SEO can increase the number of visitors who are actively searching for your website and products.
3) SEO increases brand visibility and hence increases sales.
4) SEO services are highly cost-effective; optimizing a site does not require much capital.
5) SEO also increases flexibility, visibility and targeted traffic, gives long-term top positioning, and much more.
8. Challenges To SEOs
1. The challenge of keyword research
You want your website to rank well in all relevant searches users perform. This requires choosing and using the appropriate keywords in an appropriate manner, i.e. some amount of keyword research.
Thinking from the user's perspective and deciding the keywords for any website is indeed a challenge.
The other issue here can be the incorrect or misleading use of the same keywords by other sites competing for relevance ranking.
2. The challenge of optimizing your page
Many SEOs look at optimization as their major challenge. They are so focused on where the keywords are on their page that they are not really creating the level of "content quality" or value that they might have if they were not so preoccupied with optimization! You've all seen pages that are stacked with keywords, but how do these read, quality-wise, to the average human being? Let's just say they reach only a fraction of their potential compared with a slightly different approach.
I've even seen people so focused on the optimization process that they forget to include any call to action. What good is even a number 1 ranking for the best possible phrase if, when people read the page, they just leave again because there is no call to action?
So what am I suggesting? The solution is simple. Write with your researched keyword phrase in mind, but do not try to optimize and create new content at the exact same time. Instead, think of it as a two-step process.
Step 1 -- Create new content for your readers: Write some unique, original content that you feel you can be proud of, something you know your visitors will find interesting and useful.
Don't even think about keyword density, keyword prominence or keyword placement. Stay focused on the message of your content so that you end up with a well-crafted page that reads well when you read it out loud, and make sure it serves the needs of your readers. Make sure there is a significant call to action. Okay, now you've created your content; you have done the hard part.
Step 2 -- Now go back and do a simple rewrite for the search engines: You'll be rewriting a finished article, making only mild changes to it for the purpose of optimization.
Tip: If you've done good keyword research and are optimizing for the right phrases, you'll find that with this two-step approach you'll be creating much better-quality articles for your visitors, and because you have done good research, the optimization is not that hard.
3. The challenge of knowing exactly what search engine spiders are doing
and how far they are crawling into your Web site.
It is hard to know exactly what those search engine spiders are doing: how often they visit, which pages they visit, how long they stay, and a whole lot more.
There are some tools, like Robot Manager Professional, that allow you to track some really fascinating information about search engine spiders.
4. Are you trying to get search engine spiders to come back and re-visit
your Web site more often?
Are you trying to use a Meta tag to get search engine spiders to come back and revisit your Web site every so often? Save your time. Search engine robots actually run on their own schedules, and just because you include a Meta Revisit tag does not mean the robot is going to pay attention to it.
5. Content Freshness
There is one tip which is extremely effective in getting search engine robots to visit you much more often, referred to as the "content freshness" factor. It is not complex to understand. What you want to do is start adding pages to your Web site on a regular, consistent basis. For example, adding one new article per month to your Web site would be good, but even better would be adding one new article per week. The key is consistency. Each time the robot returns and sees more new content, it tends to adjust the frequency of its schedule. There are other benefits as well: even if you were only to add new content regularly for the sake of your visitors, you would enjoy some of these benefits too.
9. Case Study: Google Search
9.1 Introduction
The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page; this ranking is called PageRank. Second, Google utilizes link (anchor) text to improve search results.
9.2 Architecture & Working
Fig: High Level Google Architecture
Working
In Google, the web crawling is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.
The indexing function is performed by the indexer and the sorter. The indexer performs a
number of functions. It reads the repository, uncompresses the documents, and parses them.
Each document is converted into a set of word occurrences called hits. The hits record the
word, position in document, an approximation of font size, and capitalization.
The indexer distributes these hits into a set of "barrels", creating a partially sorted forward
index. The indexer performs another important function. It parses out all the links in every
web page and stores important information about them in an anchors file. This file contains
enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and
in turn into docIDs. It puts the anchor text into the forward index, associated with the docID
that the anchor points to. It also generates a database of links which are pairs of docIDs. The
links database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID and resorts them by wordID to
generate the inverted index. This is done in place so that little temporary space is needed for
this operation. The sorter also produces a list of wordIDs and offsets into the inverted index.
A program called DumpLexicon takes this list together with the lexicon produced by the
indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web
server and uses the lexicon built by DumpLexicon together with the inverted index and the
PageRanks to answer queries.
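The sorter's core job described above, turning the forward index (hits grouped by docID) into an inverted index (postings grouped by wordID), can be sketched as follows. The hit data below is invented for illustration and omits the font-size and capitalization fields that Google's hits record.

```python
from collections import defaultdict

# Hypothetical forward index: docID -> list of (wordID, position) hits.
forward_index = {
    1: [(101, 0), (205, 1), (101, 2)],   # docID 1: wordID 101 at positions 0 and 2
    2: [(205, 0), (309, 1)],
}

def invert(forward):
    """Resort hits by wordID to produce the inverted index."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):              # barrels are sorted by docID
        for word_id, position in forward[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)

print(invert(forward_index)[205])  # postings for wordID 205 across documents
```

The searcher would then look a query word's wordID up in this inverted index to find every document (and position) where it occurs.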
10. Conclusions
Though there are many search engines available on the web, the searching methods and the
engines need to go a long way for efficient retrieval of information on relevant topics.
Indexing the entire web and building one huge integrated index will further deteriorate retrieval effectiveness, since the web is growing at an exponential rate. Building indexes in a hierarchical manner can be considered as an alternative.
The current generation of search tools and services has to significantly improve its retrieval effectiveness. Otherwise, the web will continue to evolve towards an information entertainment center for users with no specific search objectives.
Choosing the right search engine also is a challenge and several factors should be considered
while deciding it.
SEOs can do a great job of helping to increase search engines' efficiency and improve the performance of your websites.
11. References
1. Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
2. www.perfect-optimization.com
3. www.google.com
4. http://computer.howstuffworks.com
5. http://www.searchengineworkshops.com/articles/5-challenges.html
6. Monica R. Henzinger, "Algorithmic Challenges in Web Search Engines"
7. http://searchenginewatch.com/article/2065267/International-SEO-Challenges-and-Tips