

    REPORT TITLED

SEARCH ENGINES: CONCEPT, TECHNOLOGY AND CHALLENGES

    Submitted

    in partial fulfillment of

    the requirements for the Degree of

    Master of Computer Applications

    (MCA)

    By

    Mrunalini S. Shinde

    Roll No. 092011015

Under the guidance of Prof. L. C. Nene

    Department Of Computer Technology

    Veermata Jijabai Technological Institute

    (Autonomous Institute, Affiliated To University of Mumbai)

    Mumbai 400019

    Year 2011-2012


    VEERMATA JIJABAI TECHNOLOGICAL INSTITUTE

    MATUNGA, MUMBAI 400019

    CERTIFICATE

    This is to certify that the seminar report titled

Search Engines: Concept, Technology and Challenges

has been completed successfully by

Miss. Mrunalini S. Shinde, Roll No. 092011015

Class: MCA-VI, in the Academic Year 2011-2012

    Evaluator:

    Date:


Contents

1 Introduction
  1.1 Search Engines
  1.2 History of Search Engines
2 Components of Search Engine
3 How Search Engines Work?
  3.1 Web Crawling
  3.2 Indexing
  3.3 Searching
  3.4 Relevance Ranking
4 Types of Search Engines
  4.1 Crawler-Based Search Engines
  4.2 Human-Powered Directories
  4.3 Hybrid Search Engines
5 Search Engine Ranking Algorithms
  5.1 TF-IDF Ranking Algorithm
  5.2 PageRank Algorithm
6 Challenges to Search Engines
7 Search Engine Optimization (SEO)
  7.1 SEO
  7.2 Advantages of SEO
8 Challenges to SEOs
9 Case Study: Google Search
  9.1 Introduction
  9.2 Architecture & Working
10 Conclusion
11 References


    1. Introduction

    1.1 Search Engines

A search engine is a tool or program designed to search for information on the WWW on the basis of specified keywords and to return a list of the documents in which those keywords were found.

    Internet search engines are special sites on the Web that are designed to help people find

    information stored on other sites. There are differences in the ways various search engines

    work, but they all perform three basic tasks:

They search the Internet -- or select pieces of the Internet -- based on important words.

They keep an index of the words they find, and where they find them.

They allow users to look for words or combinations of words found in that index.

    1.2 History of Search Engines

    Gopher

Gopher was developed in 1991 and was in use up to 1996. It was an Internet server from which hierarchically organized text files could be retrieved from all over the world.

    Developed at the University of Minnesota, whose sports teams are called The Golden

    Gophers.

HyperGopher could also display GIF and JPEG graphic images.

    Three important Gopher applications were Archie, Veronica and Jughead.

    Archie was a tool for indexing FTP archives, allowing people to find specific files. It is

    considered to be the first Internet search engine.


Veronica, i.e. "Very Easy Rodent-Oriented Net-wide Index to Computer Archives", is a search engine system for the Gopher protocol, developed in 1992 by Steven Foster and Fred Barrie at the University of Nevada, Reno.

    Veronica is a constantly updated database of the names of almost every menu item on

    thousands of Gopher servers. The Veronica database can be searched from most major

    Gopher menus.

Jughead, i.e. "Jonzy's Universal Gopher Hierarchy Excavation And Display", is a search engine system for the Gopher protocol. Jughead was developed by Rhett Jones in 1993 at the University of Utah. It is distinct from Veronica in that it searches a single server at a time.

However, Gopher lost importance with the introduction of the first graphical browser, viz. Mosaic.

    Wide Area Information Servers (W.A.I.S.)

    W.A.I.S. coexisted with Gopher.

    For Gopher, files had to be stored in a predetermined manner in databases.

    The W.A.I.S. user had to connect to known databases in order to retrieve information or files.

It met the same fate as Gopher, i.e. it became superfluous with the introduction of browsers and search engines.

    Wandex

The first real search engine, in the form that we know search engines today, didn't come into

    being until 1993. It was developed by Matthew Gray, and it was called Wandex. Wandex

    indexed the files and allowed users to search for them. This technology was the first program

    to crawl the Web, and later became the basis for all search crawlers.


    2. Components of Search Engine

    Fig: Components Of Search Engine

    Search Form

The search form can be considered the user interface of the search engine. It is a simple form where the user enters a query, usually in the form of keywords.


    Query Parser

    The Query Parser tokenizes the input and looks for operators and filters.

    Index

    Index is the file that is created by the web crawler and is used as a lookup by the Query

    Engine.

    Query Engine

    The Query Engine finds the web pages that match the given criteria using index.

    Relevance Ranker

    The Relevance ranker is the search engine algorithm that ranks the results of search in order

    of relevance.

    Formatter

This deals with the way results are laid out and displayed to the user. It shows the results in

    order of importance as decided by the relevance ranking.
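To make the data flow between these components concrete, the following Python sketch wires a toy search-form input through a query parser, an in-memory index, a query engine, a frequency-based relevance ranker and a formatter. The class and function names (SimpleSearchEngine, parse_query) and the sample pages are illustrative assumptions, not part of any real engine.

from collections import defaultdict

class SimpleSearchEngine:
    def __init__(self, pages):
        # Index: word -> set of page ids (here built directly; in practice built from crawled pages)
        self.pages = pages
        self.index = defaultdict(set)
        for page_id, text in pages.items():
            for word in text.lower().split():
                self.index[word].add(page_id)

    def parse_query(self, query):
        # Query parser: tokenize the user's keywords from the search form
        return query.lower().split()

    def search(self, query):
        # Query engine: use the index to find pages containing every query term
        terms = self.parse_query(query)
        matches = set(self.pages)
        for term in terms:
            matches &= self.index.get(term, set())
        # Relevance ranker: order matches by how often the query terms occur on the page
        ranked = sorted(
            matches,
            key=lambda p: sum(self.pages[p].lower().split().count(t) for t in terms),
            reverse=True,
        )
        # Formatter: lay the results out in ranked order
        return ["%d. %s" % (rank, page_id) for rank, page_id in enumerate(ranked, 1)]

engine = SimpleSearchEngine({
    "page1": "search engines index the web",
    "page2": "web crawlers build the index used by search engines",
})
print(engine.search("search index"))

Running the example prints the matching page identifiers in ranked order, mirroring the pipeline described above.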


3. How Search Engines Work?

    Search engines use software robots to survey the Web and build their databases. Web

documents are retrieved and indexed. When you enter a query at a search engine website, your input is checked against the search engine's keyword indices. Search engines look

    through their own databases of information in order to find what it is that you are looking for.

    The best matches are then returned to you as hits.

    Fig: Internal Working Of Search Engine


    3.1 Web Crawling

    Before a search engine can tell you where a file or document is, it must be found. To find

    information on the hundreds of millions of Web pages that exist, a search engine employs

    special software robots, called spiders, to build lists of the words found on Web sites.

    When a spider is building its lists, the process is called Web crawling. In order to build and

    maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
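The following is a minimal sketch of such a spider, assuming only Python's standard library; the seed URL, page limit and the crude "list of words per page" are illustrative assumptions, not how any production crawler works.

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag on the page
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch pages, record their words, follow links."""
    to_visit, visited, word_lists = [seed_url], set(), {}
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue
        visited.add(url)
        word_lists[url] = html.split()          # crude "list of words" found on the page
        parser = LinkExtractor()
        parser.feed(html)
        to_visit.extend(urljoin(url, link) for link in parser.links)
    return word_lists

# Example (hypothetical seed URL):
# pages = crawl("http://example.com", max_pages=5)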

    3.2 Indexing

Indexing is extracting the content and storing it, i.e. assigning each word to the page under which it will be found later on when users are searching.

It uses techniques similar to those used when handling actual queries, such as the following:

Stopword lists: words that do not contribute to the meaning.
Examples: a, an, in, the, we, you, do, and, etc.

Word stemming: creating a canonical form.
Example: words -> word, swimming -> swim, etc.

Thesaurus: words with identical/similar meaning; synonyms.

Capitalization: mostly ignored (content is important, not case).

    Some search engines also index different file types.

    Example: Google also indexes PDF files
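A small Python sketch of these indexing steps (stopword removal, crude stemming and case folding) is given below; the stopword list and suffix-stripping rules are illustrative assumptions, and real engines use proper stemmers such as Porter's.

STOPWORDS = {"a", "an", "in", "the", "we", "you", "do", "and"}

def crude_stem(word):
    # Real engines use full stemmers; this only trims a few common suffixes
    for suffix in ("ming", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    terms = []
    for token in text.lower().split():      # capitalization is ignored
        if token in STOPWORDS:              # stopwords do not contribute meaning
            continue
        terms.append(crude_stem(token))     # store the canonical (stemmed) form
    return terms

print(index_terms("We do swimming in the words"))   # -> ['swim', 'word']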

    3.3 Searching

    There are two primary methods of text searching--keyword and concept.

    KEYWORD SEARCHING:

    This is the most common form of text search on the Web. Most search engines do their text

    query and retrieval using keywords.

    Unless the author of the Web document specifies the keywords for her document (this is

    possible by using meta tags in HTML), it's up to the search engine to determine them.


    Essentially, this means that search engines pull out and index words that are believed to be

    significant.

    Words that are mentioned towards the top of a document and words that are repeated several

    times throughout the document are more likely to be deemed important.

    CONCEPT BASED SEARCHING:

    Unlike keyword search systems, concept-based search systems try to determine what you

    mean, not just what you say. In the best circumstances, a concept-based search returns hits

    on documents that are "about" the subject/theme you're exploring, even if the words in the

    document don't precisely match the words you enter into the query.

    This is also known as clustering -- which essentially means that words are examined in

    relation to other words found nearby.

    3.4 Relevance Ranking

    Search for anything using your favorite crawler-based search engine. Nearly instantly, the

    search engine will sort through the millions of pages it knows about and present you with

    ones that match your topic. The matches will even be ranked, so that the most relevant ones

    come first.

    Of course, the search engines don't always get it right. Non-relevant pages make it through,

    and sometimes it may take a little more digging to find what you are looking for. But, by and

    large, search engines do an amazing job.

    How do crawler-based search engines go about determining relevancy, when confronted

    with hundreds of millions of web pages to sort through? They follow a set of rules, known as

    an algorithm. Exactly how a particular search engine's algorithm works is a closely-kept trade

    secret. However, all major search engines follow the general rules below.

    Location, Location, Location...and Frequency

One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.


    Search engines will also check to see if the search keywords appear near the top of a web

    page, such as in the headline or in the first few paragraphs of text. They assume that any page

    relevant to the topic will mention those words right from the beginning.

    Frequency is the other major factor in how search engines determine relevancy. A search

    engine will analyze how often keywords appear in relation to other words in a web page.

    Those with a higher frequency are often deemed more relevant than other web pages.
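As a rough illustration of this location/frequency method, the following Python sketch scores a page higher when query keywords occur often and also appear in the first part of the text; the top-of-page fraction and the bonus weight are arbitrary assumptions.

def location_frequency_score(page_text, keywords, top_fraction=0.2):
    words = page_text.lower().split()
    top_words = words[: max(1, int(len(words) * top_fraction))]   # "top of the page"
    score = 0.0
    for kw in keywords:
        kw = kw.lower()
        score += words.count(kw)          # frequency of the keyword on the whole page
        if kw in top_words:
            score += 5                    # bonus: keyword appears near the top
    return score

page = "Search engines rank pages. A search engine checks keyword location and frequency."
print(location_frequency_score(page, ["search", "keyword"]))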

    Off The Page Factors

Off the page factors are those that a webmaster cannot easily influence. Chief among these is

    link analysis. By analyzing how pages link to each other, a search engine can both determine

    what a page is about and whether that page is deemed to be "important" and thus deserving of

    a ranking boost.

    Another off the page factor is clickthrough measurement. In short, this means that a search

    engine may watch what results someone selects for a particular search, and then eventually

    drop high-ranking pages that aren't attracting clicks, while promoting lower-ranking pages

    that do pull in visitors.

    Number of other Web pages that link to the page in question. Google uses this to calculate

    page rank.


4. Types of Search Engines

    The term "search engine" is often used generically to describe both crawler-based search

    engines and human-powered directories. These two types of search engines gather their

    listings in radically different ways.

    4.1 Crawler-Based Search Engines

    Crawler-based search engines, such as Google, create their listings automatically. They

    "crawl" or "spider" the web, then people search through what they have found. If you changeyour web pages, crawler-based search engines eventually find these changes, and that can

    affect how you are listed. Page titles, body copy and other elements all play a role.

    4.2 Human-Powered Directories

    A human-powered directory, such as the Open Directory, depends on humans for its listings.

    You submit a short description to the directory for your entire site, or editors write one for

    sites they review. A search looks for matches only in the descriptions submitted.

    Changing your web pages has no effect on your listing. Things that are useful for improving a

    listing with a search engine have nothing to do with improving a listing in a directory. The

    only exception is that a good site, with good content, might be more likely to get reviewed for

    free than a poor site.

    4.3 "Hybrid Search Engines" Or Mixed Results

    In the web's early days, it used to be that a search engine either presented crawler-based

    results or human-powered listings. Today, it is extremely common for both types of results to

    be presented. Usually, a hybrid search engine will favor one type of listings over another.

However, it will often also present the other type of results, for example crawler-based results (as provided by Inktomi), especially for more obscure queries.


    5. Search Engine Ranking Algorithms

    After the database has been created and placed in the search engine computer's

memory, the device is finally ready to perform searches and deliver results. Only now does another device come into play: the ranking algorithm. All search engines,

    including directories, score the relevancy of web pages through these mathematical

    machines. Their purpose is to deliver links to web pages most relevant to each search

    phrase. Rightfully so, these automatic mechanisms are a source of great pride and

    revenue for their inventors.

    5.1 TF-IDF Ranking Algorithm

This algorithm calculates the rank of a page based on the following two concepts:

    1. Term Frequency (TF) i.e. how frequently the term appears on the page.

    2. Inverse Document Frequency (IDF) i.e. rare words are likely to be more important.

Consider, for example, the query "Mahendra Singh Dhoni".

    A good answer contains all three words, and the more frequently the better; we call this Term

    Frequency (TF).

Some query terms are more important, i.e. have better discriminating power, than others. For example, an answer containing only "Dhoni" is likely to be better than an answer containing only "Mahendra"; we call this Inverse Document Frequency (IDF).

w_ij = tf_ij * log2(N / n)

where
w_ij  : weight of term T_j in document D_i
tf_ij : frequency of term T_j in document D_i
N     : number of documents in the collection
n     : number of documents in which term T_j occurs at least once
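A short worked sketch of this weighting, using a three-document toy collection (the documents and query terms are illustrative):

import math

documents = {
    "d1": "mahendra singh dhoni is a cricketer",
    "d2": "mahendra is a common first name",
    "d3": "singh is a common surname",
}

def tfidf(term, doc_id):
    docs = {d: text.lower().split() for d, text in documents.items()}
    tf = docs[doc_id].count(term)                            # tf_ij: occurrences of term T_j in document D_i
    n = sum(1 for words in docs.values() if term in words)   # n: documents containing the term
    if tf == 0 or n == 0:
        return 0.0
    N = len(docs)                                            # N: number of documents in the collection
    return tf * math.log2(N / n)

print(round(tfidf("dhoni", "d1"), 3))      # 1.585 -> rarer term, higher weight
print(round(tfidf("mahendra", "d1"), 3))   # 0.585 -> appears in more documents

Here "dhoni" occurs in only one of the three documents, so it receives a higher weight than "mahendra", matching the intuition that rarer terms discriminate better.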


    5.2 PageRank Algorithm

PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weight to each element of a hyperlinked set of

    documents, such as the World Wide Web, with the purpose of measuring its relative

    importance within the set.

Fig: Mathematical PageRanks for a simple network, expressed as percentages

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. Also, C(A) is

    defined as the number of links going out of page A. The PageRank of a page A is given as

    follows:

    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

    Note that the PageRanks form a probability distribution over web pages, so the sum of all

web pages' PageRanks will be one.

    PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to

    the principal eigenvector of the normalized link matrix of the web.
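A simple iterative computation of this formula can be sketched in Python as follows; the link graph, damping factor and iteration count are illustrative assumptions.

def pagerank(links, d=0.85, iterations=50):
    """links: mapping page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {page: 1.0 for page in pages}                  # initial PageRank for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * incoming       # PR(A) = (1-d) + d * sum(...)
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))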


    Simplified algorithm

    Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or

    multiple outbound links from one single page to another single page, are ignored. PageRank

    is initialized to the same value for all pages. In the original form of PageRank, the sum of

    PageRank over all pages was the total number of pages on the web at that time, so each page

    in this example would have an initial PageRank of 1. However, later versions of PageRank,

    and the remainder of this section, assume a probability distribution between 0 and 1. Hence

    the initial value for each page is 0.25.

    The PageRank transferred from a given page to the targets of its outbound links upon the next

    iteration is divided equally among all outbound links.

    If the only links in the system were from pages B, C, and D to A, each link would transfer

    0.25 PageRank to A upon the next iteration, for a total of 0.75.

    Suppose instead that page B had a link to pages C and A, while page D had links to all three

    pages. Thus, upon the next iteration, page B would transfer half of its existing value, or

    0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it

    would transfer one third of its existing value, or approximately 0.083, to A.

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links L(v).

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = sum over all v in B_u of ( PR(v) / L(v) )

i.e. the PageRank value for a page u is dependent on the PageRank values of each page v contained in the set B_u (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
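The four-page example above can be checked with a one-iteration sketch of this simplified formula (assuming, as in the text, that C still links only to A and that A itself has no outbound links):

# Pages A, B, C, D, each starting with PageRank 0.25.
# B links to C and A; D links to A, B and C; C links only to A.
links = {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

def one_iteration(pr, links):
    # PR(u) = sum over v in B_u of PR(v) / L(v)
    return {page: sum(pr[src] / len(targets)
                      for src, targets in links.items() if page in targets)
            for page in pr}

print(one_iteration(pr, links))
# A receives 0.125 from B, 0.25 from C and ~0.083 from D, as described in the text.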


    6. Challenges to Search Engines

1. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research.

2. Automated search engines that rely on keyword matching usually return too many low-quality matches.

3. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines.

4. Web search engines try to avoid having duplicate and near-duplicate pages in their collection, since such pages increase the time it takes to add useful content to the collection, do not contribute new information to search results, and thus annoy users.

5. Uniform sampling of web pages.

6. Modelling the web graph and utilising its structure in search engines.


    7. Search Engine Optimization

7.1 SEO

    From the above discussion it is clear that there are two things involved in getting a web page

    into search engine results:

1. Getting into the search engine index.
2. Getting the web page to the top of the final sorted results before display.

    Accomplishing step 1 is relatively easy. You just need to let the search engine spider know

    that the new web page exists. You can do this by pointing to the new page from an existing

    web page that is already indexed. Some search engines also provide an option to suggest a

    new URL for inclusion into their index.

    Step 2 is the tough part. Most of the Search Engine Optimization tasks revolve around this.

    Search engines spend a lot of time and effort on making their algorithms find the best way to

    rank sites. According to Google, there are over 200 factors that determine the rank of a web

    page in the results.

Thus, Search Engine Optimization is the process of trying to get your web page to rank at the top of the search engine results for the keywords that are important to you.

    7.2 Advantages of SEO

1) It improves the ranking of the website and hence improves the turnover of the firm or company.

2) SEO can increase the number of visitors who are actively searching for your website and products.

3) SEO increases brand visibility and hence can substantially increase sales.

4) SEO services are highly cost-effective; the SEO of a site does not require much capital.

5) SEO also increases flexibility, visibility and targeted traffic, gives long-term top positioning, and much more.


    8. Challenges To SEOs

    1. The challenge of keyword research

You want your website to be ranked well in all suitable searches users perform. This requires choosing and using the appropriate keywords in an appropriate manner, i.e. some amount of keyword research.

Thinking from the user's perspective and deciding the keywords for any website is indeed a

    challenge.

The other issue here can be the incorrect or misleading use of the same keywords by other sites competing for relevance ranking.

    2. The challenge of optimizing your page

Many SEOs look at optimization as their major challenge. SEOs are often so focused on where the keywords are on their page that they are not really creating the level of "content quality" or value that they might have if they were not so preoccupied with optimization. You've all seen pages that are stacked with keywords, but how do these read, quality-wise, to the average human being? Usually they reach only a fraction of the potential they could have with just a slightly different approach.

I've even seen people so focused on the optimization process that they forget to include any call to action. What good is even a number 1 ranking for the best possible phrase if, when people read the page, they just leave again because there is no call to action?

So what am I suggesting? The solution is simple. Write with your researched keyword phrase in mind, but do not try to optimize and create new content at the exact same time. Instead, think about it as a two-step process and take these as two separate steps.

Step 1: Create new content for your readers. Write some unique, original content that you feel you can be proud of, something you know your visitors will find interesting and useful.


    Don't even think about keyword density, or keyword prominence or keyword placement. Stay

focused on the message of your content, so that you end up with a well-crafted page that reads well when you read it out loud, and make sure it serves the needs of your readers. Make sure there is a significant call to action. Okay, now you've created your content; you have done the

    hard part.

Step 2: Now go back and do a simple re-write for the search engines. You'll be re-writing a finished article, making only mild changes to it for the purpose of optimization.

Tip: If you've done good keyword research and are optimizing for the right phrases, you'll find that with this two-step approach you create much better quality articles for your visitors, and because you have done good research, the optimization is not that hard.

3. The challenge of knowing exactly what search engine spiders are doing and how far they are crawling into your Web site

How do you know exactly what those search engine spiders are doing, how often they visit, which pages they visit, how long they stay and a whole lot more?

There are tools, such as Robot Manager Professional, that allow you to track some really fascinating information about search engine spiders.

    4. Are you trying to get search engine spiders to come back and re-visit

    your Web site more often?

    Are you trying to use a Meta Tag to get search engine spiders to come back and revisit your

    Web site every so often?

Save your time. Search engine robots actually run much better based on their own schedule, and just because you include a Meta Revisit tag does not mean the robot is going to pay attention to it.


    5. Content Freshness:

There is one tip which is extremely effective in getting search engine robots to come and visit you much more often. We refer to this as the "content freshness" factor. It's not complex to understand.

What you want to do is start adding pages to your Web site on a regular, consistent basis. For example, if you were to add one new article per month to your Web site, that would be good, but even better would be to add one new article to your Web site per week. The key is

    consistency. Each time the robot returns and sees more new content it tends to adjust the

frequency of its schedule. There are other benefits as well; even if you were only to add new content regularly to your site for the sake of your visitors, you may enjoy some of these other benefits too.


    9. Case Study: Google Search

9.1 Introduction

The Google search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank. Second, Google utilizes links to improve search results.

    9.2 Architecture & Working

    Fig: High Level Google Architecture


    Working

    In Google, the web crawling is done by several distributed crawlers. There is a URLserver

that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a

    repository. Every web page has an associated ID number called a docID which is assigned

    whenever a new URL is parsed out of a web page.

    The indexing function is performed by the indexer and the sorter. The indexer performs a

    number of functions. It reads the repository, uncompresses the documents, and parses them.

    Each document is converted into a set of word occurrences called hits. The hits record the

    word, position in document, an approximation of font size, and capitalization.

    The indexer distributes these hits into a set of "barrels", creating a partially sorted forward

    index. The indexer performs another important function. It parses out all the links in every

    web page and stores important information about them in an anchors file. This file contains

    enough information to determine where each link points from and to, and the text of the link.

    The URLresolver reads the anchors file and converts relative URLs into absolute URLs and

    in turn into docIDs. It puts the anchor text into the forward index, associated with the docID

    that the anchor points to. It also generates a database of links which are pairs of docIDs. The

    links database is used to compute PageRanks for all the documents.

    The sorter takes the barrels, which are sorted by docID and resorts them by wordID to

    generate the inverted index. This is done in place so that little temporary space is needed for

    this operation. The sorter also produces a list of wordIDs and offsets into the inverted index.

    A program called DumpLexicon takes this list together with the lexicon produced by the

    indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web

    server and uses the lexicon built by DumpLexicon together with the inverted index and the

    PageRanks to answer queries.
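The forward-index/inverted-index idea at the heart of this pipeline can be sketched in Python as follows; the docIDs, documents and hit format are simplified assumptions and omit font size, capitalization and the barrel structure.

from collections import defaultdict

documents = {1: "google crawls the web", 2: "the web is indexed by google"}

# Forward index: docID -> list of (word, position) hits, as produced by the indexer
forward_index = {
    doc_id: [(word, pos) for pos, word in enumerate(text.lower().split())]
    for doc_id, text in documents.items()
}

# Inverted index: word -> list of (docID, position), as produced by the sorter
inverted_index = defaultdict(list)
for doc_id, hits in forward_index.items():
    for word, pos in hits:
        inverted_index[word].append((doc_id, pos))

print(inverted_index["google"])    # -> [(1, 0), (2, 5)]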


    10. Conclusions

Though there are many search engines available on the web, the searching methods and the engines themselves still need to go a long way for efficient retrieval of information on relevant topics.

    Indexing the entire web and building one huge integrated index will further deteriorate

retrieval effectiveness, since the web is growing at an exponential rate. Building indexes in a hierarchical manner can be considered an alternative.

The current generation of search tools and services has to significantly improve its retrieval effectiveness. Otherwise, the web will continue to evolve towards an information entertainment center for users with no specific search objectives.

Choosing the right search engine is also a challenge, and several factors should be considered while deciding.

SEOs can do a great deal to help increase search engines' efficiency and improve the performance of your websites.


    11. References

1. Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"

2. www.perfect-optimization.com

3. www.google.com

4. http://computer.howstuffworks.com

5. http://www.searchengineworkshops.com/articles/5-challenges.html

6. Monica R. Henzinger, "Algorithmic Challenges in Web Search Engines"

7. http://searchenginewatch.com/article/2065267/International-SEO-Challenges-and-Tips