web search engines ( mr.mirza )

SEARCH ENGINES

ABHAY KHANDALKAR, KISHOR MANE

SAIF ALI MIRZA, ESHA PALAV

DAKSHEH RATHOD

INTRODUCTION

What is a Search Engine?

• Search engines are the key to finding specific information on the vast expanse of the World Wide Web.

• Without search engines, it would be virtually impossible to locate anything on the Web without knowing a specific URL.

• Search engines use automated software (known as robots or spiders) that follow links on the websites, thus harvesting information as they go.

• Search engines are also known as answer machines. When a person performs an online search, the search engine scours its corpus of billions of documents and does two things:

1.) It returns only those results that are relevant or useful to the searcher's query;

2.) It ranks those results according to the popularity of the websites serving the information. It is both relevance and popularity that the process of SEO is meant to influence.

What is SEO?

• Imagine for a minute that you are a librarian.

• People across the world depend on you for the exact information they need.

• For this you need a system that knows what is inside every book & how the books relate to each other.

• The system needs to take in a lot of information & send out the best answers to people's questions.

• Search engines like Google & Yahoo are the librarians of the Internet.

• Their systems collect information about every page on the web so they can help people find exactly what they are looking for.

• Every search engine has a secret recipe, called an algorithm, for turning all that information into useful search results.

• Pages with higher rankings are found by more people.

• The key to a higher ranking is making sure your website has the ingredients search engines need for their algorithm. This is called Search Engine Optimization.

Search Algorithm

1.Words matter

2.Titles matter

3.Links matter

4.Words in Links

5.Reputation

Search Engine Optimization

– The process of maximizing the number of visitors to a particular website by ensuring that the site appears high on the list of results returned by search engines.

Birth of Search Engines

• The concept of hypertext & memory extension came to life in July 1945 when Vannevar Bush’s “As We May Think” was published in The ‘Atlantic Monthly’.

• He urged scientists to work together to help build a body of knowledge for all mankind.

• He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage & retrieval system. Vannevar Bush named this device a “MEMEX”.

• Ted Nelson created Project Xanadu in 1960 & coined the term “hypertext” in 1963. Much of the inspiration to create the WWW was drawn from Ted’s work, hence he is called the father of ‘Search Engines’.

• ARPANET is the network which eventually led to the Internet.

• The ARPANET's packet switching was based on concepts & designs by, among others, American scientists Leonard Kleinrock & Paul Baran.

• The ARPANET was an early packet switching network & the first network to implement the protocol suite ‘TCP/IP’.

• The ARPANET was operated by the military during the two decades of its existence until 1990.

EVOLUTION OF SEARCH ENGINES

• The first Internet search engine was Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal.

• Archie was a database of filenames from public FTP sites, which it would match against the user's query.

• Later Veronica was developed. It served the same purpose as Archie, but it worked on plain text files.

• Soon another user interface named Jughead appeared with the same purpose as Veronica. Both were used for files sent via Gopher, which was created as an alternative to Archie by Mark McCahill at the University of Minnesota in 1991.

• Tim Berners-Lee designed & built the first web browser & editor, called WorldWideWeb, and the first web server, called httpd.

• The first website built was http://info.cern.ch/ & was put online on August 6, 1991.

• In 1994, Berners-Lee founded the World Wide Web Consortium at the Massachusetts Institute of Technology.

• In 1993, Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB.

• ALIWEB crawled meta information & allowed users to submit the pages they wanted indexed, along with their own page descriptions.

What is a Computer Bot?

• Computer robots are simply programs that automate repetitive tasks at speeds impossible for humans to reproduce.

• The term ‘Bot’ on the Internet is usually used to describe anything that interfaces with the user or that collects data.

• Search engines use “spiders” which search the web for information. They read the contents of pages for indexing & also record the links.

• Another example is chatterbots, which attempt to act like a human & communicate with humans on a given topic.
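The spider behaviour described above (reading page contents for indexing and recording the links) can be sketched with Python's standard library. This is a minimal illustration, not how any particular search engine does it; the class name LinkRecorder, the function read_page and the example.com URL are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkRecorder(HTMLParser):
    """Collects the words and the outgoing links of one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.words = []   # page contents, kept for indexing
        self.links = []   # absolute URLs found in <a href="..."> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def read_page(base_url, html):
    """Parse one page and return (words, links)."""
    parser = LinkRecorder(base_url)
    parser.feed(html)
    parser.close()
    return parser.words, parser.links


# Hypothetical page content; a real spider would download it first.
sample = '<html><body><h1>Search Engines</h1><a href="/history.html">History</a></body></html>'
words, links = read_page("http://example.com/index.html", sample)
```

The recorded links tell the spider where to go next, while the words are what an indexer would store.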

PRE MODERN SEARCH ENGINES

Primitive Web Search

By December of 1993, three full-fledged, bot-fed search engines had surfaced on the web: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider.

JumpStation & the WWW Worm gathered info about the titles & URLs of webpages & retrieved these using a simple linear search.

The problem with JumpStation & the WWW Worm was that they listed results in the order in which they found them & provided no discrimination.

The RBSE spider implemented a ranking system.
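A simple linear search over titles and URLs, as described for JumpStation and the WWW Worm, might look roughly like the sketch below. The record layout and the sample data are assumptions; the point is only that matches come back in discovery order, with no ranking.

```python
def linear_search(records, term):
    """Scan every record in turn and return the matches in the order found."""
    term = term.lower()
    return [
        record
        for record in records
        if term in record["title"].lower() or term in record["url"].lower()
    ]


# Hypothetical records, listed in the order a bot discovered them.
records = [
    {"title": "Surfing in Hawaii", "url": "http://example.com/surf"},
    {"title": "Cooking basics", "url": "http://example.com/cook"},
    {"title": "Web surfing tips", "url": "http://example.org/tips"},
]
hits = linear_search(records, "surf")
```

Every query touches every record, and nothing tells the user which hit is the better one. That is the gap a ranking system fills.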

ALTA VISTA

AltaVista brought numerous search tips & advanced search features to the web.

It had nearly unlimited bandwidth for its time, was the first to allow natural language queries, offered advanced searching techniques & allowed users to add or delete their own URL within 24 hrs.

Due to mismanagement & fear of result manipulation, AltaVista was largely driven into irrelevancy.

WEB CRAWLER

On April 20, 1994, Brian Pinkerton of the University of Washington released WebCrawler.

It was the first crawler which indexed entire pages.

In 1997 Excite bought out WebCrawler.

Lycos was the next major search development, designed at Carnegie Mellon University around July of 1994 by Michael Mauldin.

Lycos provided ranked relevance retrieval, prefix matching & word proximity.

In October 1994, Lycos ranked first on Netscape’s list of search engines by finding the most hits on the word “surf”.

By November 1996, Lycos had indexed over 60 million documents, more than any other web search engine.

In April 1994, David Filo & Jerry Yang created the Yahoo! Directory as a collection of their favourite web pages.

As their number of links grew, they had to reorganize & become a searchable directory.

On September 26, 2014, Yahoo! announced they would close the Yahoo! Directory at the end of 2014.

LookSmart was founded in 1995.

LookSmart was too dependent on MSN, & in 2003 Microsoft announced that it was dropping LookSmart, which basically killed LookSmart's business model.

The Inktomi Corporation came about on May 20, 1996 with its search engine HotBot.

They failed to develop a profitable business model & sold out to Yahoo! for approximately $235 million in December 2002.

Ask.com (formerly Ask Jeeves):

In April 1997, Ask Jeeves was launched as a natural language search engine. It used human editors to match the search queries.

In 2001, Ask Jeeves bought Teoma to replace DirectHit search technology.

On March 21, 2005 Barry Diller’s IAC agreed to acquire Ask Jeeves for 1.85 billion dollars.

In 2006, Ask Jeeves was renamed to Ask.

MODERN SEARCH ENGINES

MICROSOFT

In 1998 MSN Search was launched, but Microsoft did not get serious about search until after Google proved the business model.

On September 11, 2006, Microsoft launched their in-house search technology as the Live Search product.

On June 1, 2009, Microsoft launched Bing, a new search service which changed the search landscape by placing inline search suggestions for related searches directly in the result set.

For example, when one searches for credit cards, Bing will suggest related phrases like “credit card types”, “apply for credit cards”, “advice on credit cards”, etc.

Yahoo!

Getting Into Search: Yahoo! was founded in 1994 by David Filo and Jerry Yang as a directory of websites.

Overture purchased AllTheWeb and AltaVista in 2003. Yahoo! purchased Inktomi in December, 2002, and then consumed Overture in July, 2003 & combined the technologies from the various search companies they bought to make a new search engine.

On March 20, 2005 Yahoo! purchased Flickr, a popular photo sharing site.

On December 9, 2005, Yahoo! purchased Del.icio.us, a social bookmarking site.

Yahoo! also made a strong push to promote Yahoo! Answers, a popular free community-driven question answering service.

On July 29, 2009, Yahoo! decided to give up on search and signed a 10-year deal to syndicate Bing ads and algorithmic results on their website.

GOOGLE

In 1995, Larry Page met Sergey Brin at Stanford.

By January of 1996, Larry and Sergey had begun collaborating on a search engine called BackRub, named for its unique ability to analyze the "back links" pointing to a given website.

A year later, their unique approach to link analysis was earning BackRub a growing reputation.

BackRub ranked pages using citation notation. In the PageRank algorithm, links count as votes: a page's rank depends on how many pages link to it & how trustworthy those linking pages are.
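The "links count as votes" idea can be illustrated with a small power-iteration sketch. This is a textbook-style simplification, not Google's production algorithm; the damping value of 0.85, the iteration count and the four-page toy graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate a rank for every page in a {page: [pages it links to]} graph."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline, independent of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a highly ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this toy graph, page c collects the most votes. Page a receives only one link, but it comes from the highly ranked c, so a ends up well above b and d, which nobody links to.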

In 1998, Google was launched. Sergey tried to shop their PageRank technology, but nobody was interested in buying or licensing their search technology at that time.

How Do Search Engines Work?

The Main Parts of a Search Engine

• Spider (or “web crawler”)

• Indexer

• Search software (an algorithm)

1. Web Crawling

• A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering.

• Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data.

• Web crawlers are mainly used to create a copy of all the visited pages & are also used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.

• For example, imagine the World Wide Web as a network of stops in a big city subway system. Each stop is a unique document (usually a web page, but sometimes a PDF, JPG, or other file). The search engines need a way to “crawl” the entire city and find all the stops along the way, so they use the best path available: links.

• Links allow the search engines' automated robots, called "crawlers" or "spiders," to reach the many billions of interconnected documents on the web.

• Once the engines find these pages, they decipher the code from them and store selected pieces in massive databases, to be recalled later when needed for a search query.

• The monstrous storage facilities hold thousands of machines processing large quantities of information very quickly.

• When a person performs a search at any of the major engines, they demand results instantaneously; even a one- or two-second delay can cause dissatisfaction, so the engines work hard to provide answers as fast as possible.
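The crawling process described above, following links from stop to stop until every reachable document has been visited, can be sketched as a breadth-first traversal. To keep the example self-contained it walks a small in-memory "web" instead of making network requests; the names crawl, fetch_links and toy_web are assumptions for illustration.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from a seed, following links, each page once."""
    visited = []           # pages in the order they were crawled
    seen = {seed}          # everything already queued, to avoid loops
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link structure standing in for real pages.
toy_web = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["story1", "story2"],
    "story1": [],
    "story2": ["home"],
}
order = crawl("home", lambda url: toy_web.get(url, []))
```

A real crawler would replace the lambda with code that downloads a page and extracts its links, and would also respect politeness rules such as robots.txt, which are outside the scope of this sketch.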

Indexing

• Search engine indexing is the process of a search engine collecting, parsing and storing data for use by the search engine.

• The actual search engine index is the place where all the data the search engine has collected is stored. The index provides the results for search queries; only pages stored in the index can appear on the search engine results page.

• Without an index, the search engine would need considerable time and effort for each search query, because it would have to scan every web page and every other piece of data it has access to in order to be sure it missed nothing related to the keyword.

• There are many different parts to a search engine index, such as design factors and data structures.

• The design factors decide how the index actually works. They include:

• Index size: the amount of computer storage necessary to support the index.

• Storage techniques: the decision of how the information should be stored. Larger files are compressed, while smaller files are simply filtered.

• Fault tolerance: how important it is for the search engine index to be reliable.

• Lookup speed: how quickly a word can be found when the data is searched in the inverted index.

• Maintenance: an important factor as well, because the better maintained a search engine index is, the better it works.
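The inverted index mentioned under lookup speed can be illustrated as follows. This is a minimal sketch assuming whitespace tokenisation and a tiny hand-made document set; real indexes add compression, ranking data and much more.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents keyed by id.
documents = {
    1: "web search engines",
    2: "web crawlers follow links",
    3: "search engines index the web",
}
index = build_index(documents)
matches = lookup(index, "web search")
```

Because each word points directly at the documents that contain it, answering a query means intersecting a few sets instead of rescanning every page.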

What is a Search Engine Algorithm?

• A search algorithm is defined as a math formula that takes a problem as input and returns a solution to the problem, usually after evaluating a number of possible solutions.

• In simple words, a search engine algorithm is a set of rules, or a unique formula, that the search engine uses to determine the significance or rankings of a web page, and each search engine has its own set of rules.

• A search algorithm sorts results on the basis of many things, such as the location of keywords, synonyms, adjacent words, etc.

• But there are certain things that all search engine algorithms have in common:

• Relevancy

• Individual Factors

• Off-Page Factors

SEARCH ALGORITHM PRINCIPLES

Relevancy

• This is the first thing every search engine checks.

• The algorithm determines whether the web page has any relevancy at all for the particular keyword.

• The location of keywords on the page is also important for its relevancy.

• Web pages that have the keywords in the title, the headline or the first few lines of the text will rank better for that keyword than pages that do not have these features.
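The effect of keyword location on relevancy can be sketched with a toy scoring function. The weights and the "first 20 words" cut-off below are assumptions invented for the example; real search engines do not publish their weighting.

```python
def relevancy_score(page, keyword, title_weight=3.0, lead_weight=2.0, lead_words=20):
    """Score a page for one keyword, rewarding title and early-text matches."""
    keyword = keyword.lower()
    title_words = page["title"].lower().split()
    body_words = page["body"].lower().split()

    score = 0.0
    if keyword in title_words:
        score += title_weight                      # keyword in the title
    if keyword in body_words[:lead_words]:
        score += lead_weight                       # keyword in the first few lines
    if body_words:
        # Small contribution from how often the keyword appears overall.
        score += body_words.count(keyword) / len(body_words)
    return score


# Hypothetical pages for illustration.
page_a = {"title": "Python tutorial", "body": "python is a language used on the web"}
page_b = {"title": "Home", "body": "filler " * 30 + "python"}
page_c = {"title": "Home", "body": "nothing relevant here"}

score_a = relevancy_score(page_a, "python")
score_b = relevancy_score(page_b, "python")
score_c = relevancy_score(page_c, "python")
```

Page A mentions the keyword in its title and opening words, so it scores far above page B, where the keyword only appears deep in the text. Page C has no relevancy for the keyword at all.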

Individual Factors

• A second part of search engine algorithms is the individual factors that make a particular search engine different from every other search engine out there.

• Each search engine has unique algorithms, and the individual factors of these algorithms are why a search query turns up different results on Google than on MSN or Yahoo!.

• One of the most common individual factors is the number of pages a search engine indexes.

• Some engines simply have more pages indexed, or index them more frequently, and this can give different results on each search engine.

• Some search engines also penalize for spamming, while others do not.

Off-Page Factors

• Another part of the algorithm that is still individual to each search engine is off-page factors.

• Off-page factors are such things as click-through measurement and linking.

• Click-through rates and the frequency of linking can be an indicator of how relevant a web page is to actual users and visitors, and this can cause an algorithm to rank the web page higher.

• Off-page factors are harder for webmasters to craft, but can have an enormous effect on a page's rank, depending on the search engine algorithm.
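As a rough illustration of how click-through measurement and linking could feed into a score, the sketch below combines the two signals. The formula, the weights and the logarithmic damping of link counts are assumptions made for this example only; no search engine is known to use exactly this.

```python
import math


def off_page_score(clicks, impressions, inbound_links, ctr_weight=0.5, link_weight=0.5):
    """Combine click-through rate and inbound link count into one number."""
    # Click-through rate: how often users who saw the result clicked it.
    ctr = clicks / impressions if impressions else 0.0
    # log1p dampens the link count so the first links matter most.
    link_signal = math.log1p(inbound_links)
    return ctr_weight * ctr + link_weight * link_signal


baseline = off_page_score(clicks=10, impressions=100, inbound_links=0)
more_links = off_page_score(clicks=10, impressions=100, inbound_links=50)
more_clicks = off_page_score(clicks=50, impressions=100, inbound_links=0)
```

Both a higher click-through rate and more inbound links raise the score, which mirrors the point that these signals depend on other users and other sites rather than on the page itself.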

Classified List of search engines

They are classified based on Content/Topic, Type of Information and Model. They are further sub categorized as:

● Content/Topic

– General:

• A general search engine operates using a search algorithm. Websites that are listed in the search engine's directory are searched for information based on various search qualities. The goal is that the user gets relevant results with useful pages.

• Examples: baidu.com, bing.com, duckduckgo.com, exalead.com, google.co.in, munax.com

– Metasearch Engines:

• A metasearch engine (or aggregator) is a search tool that uses other search engines' data to produce its own results from the Internet. Metasearch engines take input from a user and simultaneously send out queries to third-party search engines for results.

• Examples: Blingo, Yippy, DeeperWeb, Dogpile, Excite, HotBot, Info.com, Mamma, Metacrawler, Mobissimo, Otalo, and Skyscanner.
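The metasearch idea can be sketched by forwarding one query to several engines and merging their result lists. Here the "engines" are stand-in functions returning fixed lists, and the merge uses reciprocal-rank scoring, which is one simple choice among many and not something taken from any engine listed above. For brevity the engines are called one after another rather than simultaneously.

```python
def metasearch(query, engines):
    """Query every engine and merge the results into one ranked list."""
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            # A result ranked higher by an engine earns a larger share.
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest combined score first; ties broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical third-party engines with canned results.
def engine_a(query):
    return ["x.example", "y.example", "z.example"]


def engine_b(query):
    return ["y.example", "x.example"]


def engine_c(query):
    return ["y.example", "w.example"]


merged = metasearch("search engines", [engine_a, engine_b, engine_c])
```

A result that several engines rank highly rises to the top of the merged list, while a result seen by only one engine at a low position sinks.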

– Geographically Limited:

• A geographically limited search engine has a limited geographical scope, which means it returns only search results related to that geographical area. Country versions of general engines, such as Google.co.in, are not counted here. The following are some search engines that each serve a specific geographical area.

• Examples: Accoona, Ansearch, Biglobe, Daum, Egerin, Goo, Leit.is, Maktoob, Miner.hu, Najdi.si, Naver, Onkosh, Rambler, Rediff, SAPO, Search.ch, Sesam, Seznam, Ziplocal, etc.

– Semantic:

• Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Major web search engines like Google and Bing incorporate some elements of semantic search.

• Examples: Sophia Search, Evi, Yummy, Swoogle.
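One of the points listed for semantic search, synonyms, can be illustrated with a very small sketch. The synonym table and the matching rule are assumptions invented for this example; real semantic search relies on far richer models of intent and context.

```python
# Hypothetical synonym table; a real system would use a much larger resource.
SYNONYMS = {
    "cheap": {"inexpensive", "affordable"},
    "car": {"automobile", "auto"},
}


def semantic_match(document, query, synonyms=SYNONYMS):
    """True if every query word, or one of its synonyms, occurs in the document."""
    doc_words = set(document.lower().split())
    for word in query.lower().split():
        candidates = {word} | synonyms.get(word, set())
        if not candidates & doc_words:
            return False
    return True


found = semantic_match("affordable automobile for sale", "cheap car")
not_found = semantic_match("affordable bicycle for sale", "cheap car")
```

A plain keyword match would miss the first document because it contains neither "cheap" nor "car", while the synonym-aware match finds it.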

– Business:

• Business search helps us to keep in touch with the global world. All the latest information regarding the dynamic market is available with just a click.

• Examples: Business.com, Getit Infoservices Private Limited, GenieKnows, GlobalSpec, Nexis(Lexis Nexis), Thomasnet, Justdial.

– Academic Materials Only:

• Examples: BASE, CiteULike, Google Scholar, Library of Congress, Shodan, Noodle Education, SkilledUp, Chegg.

– Enterprise:

• Examples: Funnelback, Jumper 2.0, Oracle Corporation, Q-Sensei, TeraText, SimilarWeb, Swifttype.

– Jobs:

• Examples: Adzuna, Bixee.com, CareerBuilder.com, Craigslist, Dice.com, Eluta.com, Hotjobs.com, JobStreet.com, Incruit, Indeed.com, Glassdoor.com, LinkUp.com, Monster.com, Naukri.com.

– Medical:

• Examples: Bing Health, Bioinformatic Harvester, CiteAb, EB-eye, Entrez, GenieKnows, GoPubMed, Healia, Healthline, Nextbio, Quertle, Searchmedica, WebMD.

– News:

• Examples: Bing News, Daylife, Google News, MagPortal, Newslookup, Nexis, Topix.net, Trapit, Yahoo! News.

● Type of Information

– Blog:

• Examples: Amatomu, Bloglines, BlogScope, IceRocket, Munax, Regator, Technorati.

– Multimedia:

• Examples: Bing Videos, blinkx, FindSounds, Google Videos, Munax’s Play Audio Video, Picsearch, Pixsta, Podscope, ScienceStage, SeeqPod, Songza, TinEye, TV Genius, Veveo.

– Source code:

• Examples: Google Code Search, Koders, Krugle.

– BitTorrent:

• Examples: BTDigg, Isohunt, Mininova, The Pirate Bay, TorrentSpy, Torrentz, Torrentus.

– Maps:

• Examples: Bing Maps, Geoportail, Google Maps, MapQuest, Nokia Maps, OpenStreetMap, Wikiloc, WikiMapia, Yahoo! Maps.

– Price:

• Examples: Bing Shopping, Google Shopping, Kelkoo, MySimon, PriceGrabber, PriceRunner, PriceSCAN, Pronto.com, Shopping.com, ShopWiki, Shopzilla, SwoopThat.com, TheFind.com.

● Model

– Privacy search engines:

• Examples: DuckDuckGo, Ixquick.

– Open source search engines:

• Examples: DataparkSearch, Gigablast, Grub, ht://Dig, Isearch, Lemur Toolkit & Indri Search Engine, Lucene, Namazu, Nutch, Recoll, Sciencenet, Searchdaimon, Seeks, etc.

– Social search engines:

• Examples: ChaCha Search, Delver, Eurekster, Mahalo.com, Rollyo, Search Team, Sproose, Trexy.

GOOGLE AS A SEARCH ENGINE

Winning the Search War

• In 1998, Andy Bechtolsheim gave Google's founders $100,000 in seed funding, and in 1999 Google received $25 million from investors including Sequoia Capital.

• In 1999 AOL selected Google as a search partner.

• In 2000 Google also launched their popular Google Toolbar.

• On May 1, 2002, AOL announced they would use Google to deliver their search related ads, which was a strong turning point in Google's battle against Overture.

• In 2003 Google also launched their AdSense program, which allowed them to expand their ad network by selling targeted ads on other websites.

Google Maps

Google News

Google Book Search

Google Scholar

Google Blog Search

Google Base

Google Video

VERTICALS GALORE

Google Universal Search

Email

Analytics

Radio ads

Office productivity software

Calendar

Checkout

Working of Google

• Crawling & Indexing:- Search starts with the web, which is made up of over 60 trillion individual pages & is constantly growing. Google navigates the web by crawling, which means it follows links from page to page. Pages are sorted by their content & other factors, & Google keeps track of it all in ‘The Index’ (which is over 100 million gigabytes in size).

• Algorithms:- The algorithms get to work looking for clues to better understand what the user means. Based on these clues, relevant documents are pulled from ‘The Index’. The results are ranked according to factors such as freshness, site & page quality, safe search, user context, translation & universal search. These results can take a variety of forms. (All this happens in about 1/8th of a second.)

The Search Lab: The algorithms are constantly changing. These changes begin as ideas in the minds of the engineers. They take these ideas and run experiments, analyze the results, tweak them & run them again & again to get the following results:

• Knowledge Graph:- Provides results based on a database of real world people, places, things & the connections between them.

• Snippets:- Shows small previews of information, such as a page’s title & short descriptive text, for each search result.

• News:- Includes results from online newspapers & blogs from around the world.

• Answers:- Displays immediate answers & information for things such as the weather, sports scores & quick facts.

• Videos:- Shows video-based results with thumbnails so you can quickly decide which video to watch.

• Refinements:- Provides features like ‘Advanced Search’, related searches & other search tools, all of which help one refine the search.

• Voice search:- With the Google search app, simply say what you want and get answers spoken right back to you.

• Mobile:- Includes improvements designed specifically for mobile devices such as tablets & smartphones.
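The snippet feature listed above (a page title plus a short descriptive text) can be sketched as follows. The 80-character window and the function name make_snippet are assumptions; this only illustrates the idea of cutting a preview around the query, not how Google builds its snippets.

```python
def make_snippet(title, body, query, width=80):
    """Return the page title and a short excerpt of the body around the query."""
    text = " ".join(body.split())              # normalise whitespace
    position = text.lower().find(query.lower())
    # Centre the window on the query if it occurs, else start at the top.
    start = max(position - width // 2, 0) if position >= 0 else 0
    excerpt = text[start:start + width]
    if start > 0:
        excerpt = "..." + excerpt
    if start + width < len(text):
        excerpt = excerpt + "..."
    return {"title": title, "text": excerpt}


# Hypothetical page content.
body = "Intro filler " * 10 + "The inverted index maps words to pages. " + "More filler " * 10
snippet = make_snippet("How indexing works", body, "inverted index")
```

The excerpt stays short regardless of page length and always contains the words the user searched for, when they occur on the page.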

Fighting Spam

Google fights spam 24/7 to keep the results relevant. The majority of spam removal is automatic. Other questionable documents are examined by hand, and in case any spam is detected, manual action is taken.

Types of Spam:-

• Unnatural links from a site:- Google detected a pattern of unnatural, artificial, deceptive or manipulative outbound links on the site. This may be the result of selling links that pass PageRank or participating in link schemes.

• Cloaking &/or sneaky redirects:- The site appears to be cloaking (displaying different content to human users than is shown to search engines) or redirecting users to a different page than the one Google saw.

• Hacked site:- Some pages on the site may have been hacked by a third party to display spammy content or links. Website owners should take immediate action to clean their sites & fix any security vulnerabilities.

• Hidden text &/or keyword stuffing:- Some of the pages may contain hidden text &/or keyword stuffing.
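Keyword stuffing can be illustrated with a simple keyword density check. This is a naive heuristic written for this example, with an arbitrary 20% threshold; it is not how Google detects spam.

```python
def keyword_density(text, keyword):
    """Fraction of the words in the text that are exactly the keyword."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def looks_stuffed(text, keyword, threshold=0.2):
    """Flag text whose keyword density exceeds the (assumed) threshold."""
    return keyword_density(text, keyword) > threshold


# Hypothetical page texts.
stuffed = "cheap shoes cheap shoes buy cheap shoes cheap"
normal = "we sell cheap and durable running shoes for every season"

stuffed_flag = looks_stuffed(stuffed, "cheap")
normal_flag = looks_stuffed(normal, "cheap")
```

In the stuffed text half of all words are the keyword, while the normal sentence uses it once in ten words, so only the first is flagged.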

And that’s how Google’s search engine works. Behind a simple page of results is a complex system, carefully crafted & tested, to support more than one hundred billion searches each month.

• A web search engine is a software system that is designed to search for information on the World Wide Web.

• The working process of a search engine starts with web crawling, followed by indexing and searching, which uses an algorithm to give relevant search results within a fraction of a second.

• The history of search engines began in 1945, when Vannevar Bush proposed the visionary idea of maintaining a record of all the knowledge available to mankind, which led to an era of revolution for search engines.

• There are various types of search engines, such as metasearch, business, educational and social search engines, engines such as LookSmart, Lycos, Microsoft and Yahoo!, and human answer machines like Quora.

• A search engine saves time and gives us the precise & relevant information we need.