
Retrieving Information on the Web

Presented by

Md. Zaheed Iftekhar

Course: Information Retrieval (IFT6255)

Professor: Jian E. Nie, DIRO, University of Montreal

April 9th, 2003


2

Overview

• Web search: general description

– Introduction of web, search engines

– Definitions

– Major search engines

– Current technologies

• The future

– Where is the technology heading

– Proposal for further improvement

• Conclusion

• References


3

History of the Web

• In 1990 the World Wide Web (WWW) was developed by Tim Berners-Lee at CERN to organize research documents available on the Internet.

• Combined idea of documents available by FTP with the idea of hypertext to link documents.

• Developed initial HTTP network protocol, URLs, HTML, and first “web server.”


4

World Wide Web

• Ted Nelson developed the idea of hypertext in 1965.

• Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.

• ARPANET was developed in the late 1960s and early 1970s.

• The basic technology was in place in the 1970s, but it took the PC revolution and widespread networking to inspire the web and make it practical.


5

Web Browser

• Early browsers were developed in 1992 (Erwise, ViolaWWW).

• In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed Mosaic.

• Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC).

• Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.


6

Web Search

• By the late 1980s, many files were available by anonymous FTP.

• In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”):

– Assembled lists of files available on many FTP servers.

– Allowed regex search of these file names.

• In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.


7

Web Search

• In 1993, early web robots (spiders) were built to collect URLs:

– Wanderer

– ALIWEB (Archie-Like Index of the WEB)

– WWW Worm (indexed URLs and titles for regex search)

• In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.


8

Web Search

• In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (became part of Excite and AOL).

• The same year, Michael “Fuzzy” Mauldin, a researcher at CMU, developed Lycos.

– First to use a standard IR system.

– First to index a large set of pages.

• In late 1995, DEC developed AltaVista. Supported Boolean operators, phrases, and “reverse pointer” queries.

• In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google.


9

Spiders (Robots/Bots/Crawlers)

• Start with a comprehensive set of root URLs from which to begin the search.

• Follow all links on these pages recursively to find additional pages.

• Index all newly found pages in an inverted index as they are encountered.

• May allow users to directly submit pages to be indexed (and crawled from).
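
The inverted index mentioned above can be illustrated with a minimal sketch. The page texts, the example.com URLs, and the helper names (`tokenize`, `build_inverted_index`, `search_all`) are assumptions made for illustration; real engines add ranking, stemming, and compressed postings.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase the text and split it into alphanumeric terms.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(pages):
    # Map each term to the set of URLs whose text contains it.
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search_all(index, query):
    # Return the URLs that contain every term of the query (Boolean AND).
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages standing in for documents found by a spider.
pages = {
    "http://example.com/a": "Web search engines crawl the web",
    "http://example.com/b": "Search engines rank pages",
    "http://example.com/c": "Spiders crawl pages",
}
index = build_inverted_index(pages)
print(sorted(search_all(index, "search engines")))
```

Because each term points directly at the pages containing it, a query only touches the postings of its own terms rather than scanning every page.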


10

Breadth-first Search



11

Depth-first Search



12

Search Strategy Trade-Offs

• Breadth-first explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). Standard spidering method.

• Depth-first requires memory of only depth times branching-factor (linear in depth) but gets “lost” pursuing a single thread.

• Both strategies are implementable using a queue of links (URLs).
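
As a rough illustration of the two strategies, the sketch below walks a small hypothetical link graph (the `LINKS` dictionary and the `traverse` function are invented for this example). The only difference between the two modes is the end of the queue at which new links are inserted. Links are marked as seen when they are queued, which is a simplification.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "E": ["F"],
}

def traverse(graph, root, strategy="bfs"):
    frontier = deque([root])
    seen = {root}
    order = []
    while frontier:
        node = frontier.popleft()
        order.append(node)
        new_links = []
        for link in graph.get(node, []):
            if link not in seen:
                seen.add(link)
                new_links.append(link)
        if strategy == "bfs":
            # Breadth-first: new links wait behind everything already queued.
            frontier.extend(new_links)
        else:
            # Depth-first: new links jump to the front, keeping their order.
            frontier.extendleft(reversed(new_links))
    return order

print(traverse(LINKS, "A", "bfs"))
print(traverse(LINKS, "A", "dfs"))
```

In the breadth-first run the frontier holds a whole level at a time, which is where the memory cost noted above comes from; the depth-first run keeps a shorter frontier but visits C last because it follows the B branch to the bottom first.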


13

Avoiding Page Duplication

• Must detect when revisiting a page that has already been spidered (web is a graph not a tree).

• Must efficiently index visited pages to allow rapid recognition test.

– Tree indexing (e.g. trie)

– Hashtable

• Index page using URL as a key.

– Must canonicalize URLs (e.g. delete ending “/”)

– Does not detect duplicated or mirrored pages.

• Index page using textual content as a key.

– Requires first downloading page.
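
Both kinds of key can be sketched in a few lines. The canonicalization rules chosen here (lowercase scheme and host, drop the fragment, strip the trailing slash) and the use of a SHA-1 digest over whitespace-normalized text are assumptions for illustration, not a complete or standard rule set.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    # Normalize a URL so trivially different spellings share one key.
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def content_key(text):
    # Fingerprint the page text; identical mirrors get the same key.
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

seen_urls = set()
seen_content = set()

def is_new_page(url, text):
    # Return True only if neither the URL nor the content was seen before.
    url_key = canonicalize(url)
    text_key = content_key(text)
    if url_key in seen_urls or text_key in seen_content:
        return False
    seen_urls.add(url_key)
    seen_content.add(text_key)
    return True

print(is_new_page("http://example.com/docs/", "Hello web"))
print(is_new_page("HTTP://Example.com/docs", "Hello again"))
print(is_new_page("http://mirror.example.org/docs", "Hello   web"))
```

The second call is rejected by the URL key and the third by the content key, which shows why the content key catches mirrors that the URL key misses, at the cost of downloading the page first.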


14

Spidering Algorithm

Initialize queue (Q) with initial set of known URLs.
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q.
    If L is not to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt…) continue loop.
    If already visited L, continue loop.
    Download page, P, for L.
    If cannot download P (e.g. 404 error, robot excluded) continue loop.
    Index P (e.g. add to inverted index or store cached copy).
    Parse P to obtain list of new links N.
    Append N to the end of Q.
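
The loop above could be sketched in Python roughly as follows. To keep the example runnable without network access, downloading is delegated to a `fetch` callable that returns the page HTML or `None` on failure; `FAKE_WEB`, `spider`, and `LinkParser` are hypothetical names, and "indexing" is reduced to storing a cached copy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def spider(seeds, fetch, page_limit=100):
    queue = deque(seeds)          # Q, initialized with the known URLs
    visited = set()
    cache = {}                    # stands in for the index
    while queue and len(cache) < page_limit:
        url = queue.popleft()     # pop L from the front of Q
        if urlsplit(url).path.lower().endswith(SKIP_EXTENSIONS):
            continue              # not an HTML page
        if url in visited:
            continue              # already visited
        visited.add(url)
        html = fetch(url)         # download P
        if html is None:
            continue              # could not download
        cache[url] = html         # "index" P by storing a cached copy
        parser = LinkParser()
        parser.feed(html)         # parse P for new links N
        queue.extend(urljoin(url, href) for href in parser.links)
    return cache

# Hypothetical in-memory web used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/logo.gif">logo</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/missing.html">gone</a>',
}
crawled = spider(["http://example.com/"], FAKE_WEB.get)
print(list(crawled))
```

A real crawler would replace `FAKE_WEB.get` with an HTTP client that also honors robot exclusion and adds a time limit; those parts are deliberately left out of this sketch.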


15

Queueing Strategy

• How new links are added to the queue determines the search strategy.

• FIFO (append to end of Q) gives breadth-first search.

• LIFO (add to front of Q) gives depth-first search.

• Heuristically ordering the Q gives a “focused crawler” that directs its search towards “interesting” pages.
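
A heuristically ordered queue can be sketched with a priority queue. The link graph, the page names, and the scoring rule (pages whose name starts with "ir-" count as interesting) are all assumptions for illustration; a real focused crawler would score links with a classifier or with the text around the link.

```python
import heapq

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["sports", "ir-course"],
    "sports": ["scores"],
    "ir-course": ["ir-indexing", "contact"],
}

def score(url):
    # Toy heuristic: pages about information retrieval are "interesting".
    return 1 if url.startswith("ir-") else 0

def focused_order(graph, root):
    counter = 0                                # tie-breaker keeps FIFO order
    heap = [(-score(root), counter, root)]     # negate: heapq is a min-heap
    seen = {root}
    order = []
    while heap:
        _, _, url = heapq.heappop(heap)        # most interesting page first
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(heap, (-score(link), counter, link))
    return order

order = focused_order(LINKS, "home")
print(order)
```

With a constant score the tie-breaker makes this behave like the FIFO (breadth-first) case; the heuristic only changes the order when some links score higher than others.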


16

Source: http://www.bruceclay.com


17

Google

• Google is a search engine that maintains its own spider-based index.

• Google also has a directory that is powered by the Open Directory.

• Google supports:

– Boolean search

– Phrase

– Similarity

– Proximity

Source: lookoff.com, http://www.bruceclay.com


18

Google

Strengths

• The interface is tremendously simple, but the quality of results is not significantly impeded

• Accuracy for common topics

Weaknesses

• Lack of power features

• Coverage of the Internet is much less than some competitors

• No OR keyword support for Boolean searches



19

Yahoo!

Strengths

• Coverage of the Internet is excellent

• Links are generally quite up to date and free of spam and poor quality sites

• Human maintainers ensure that sites are placed correctly within the relevant topic

• The search interface is very fast

• Yahoo integrates with indexed searches after presenting Yahoo topic areas

• Accuracy for common topics

Weaknesses

• The search interface is very effective for general searches but could be better for powerful searches

• Not all relevant sites are listed in Yahoo - they have to be submitted and accepted.



20

Ask Jeeves

Strengths

• A simple interface makes it very easy to form queries. Excellent for new users and children.

• If your query corresponds to a pre-packaged answer, you can expect some surprisingly good results. Millions of bundled answers provide premium results that are superior to standard index searches.

• The site is actively maintained.

• An integrated metacrawler provides results for your search from Goto, AltaVista, Mamma and 4Anything.

• The search code is very fast.

Weaknesses

• The site reportedly accepts payment for top spots, sometimes placing links of dubious quality at the top of results.

• No advanced search.

• Very little power in constructing keyword queries.

• Little control over filtering results.


21

MSN

Strengths

• Very active news portal with updated and well-presented headlines.

• Integrated single sign-on with Hotmail, MSN, etc.

• Configurable interface lets you customize content, layout and colors.

• Very actively maintained.

• Many interesting (although often commercially-oriented) services tied into the MSN network.

• Nationalized versions for quite a few countries, providing more specific content and news feeds.

• Ability to save (i.e. tag) results to quickly filter search results into a candidates list.

Weaknesses

• Not a low-bandwidth interface. Slow modem users should beware.

• Mediocre search interface

• Less web coverage than most search engines


22

Program        Pages (#)   Class      FAQ  FTP  Index  Meta  Misc  News  Portal
Dejanews       300M msg    Best       N    N    N      N     Y     Y     N
Raging         250M        Best       N    N    Y      N     N     N     N
Yahoo          500T        Best       N    N    N      N     N     N     Y
AllTheWeb      300M        Excellent  N    N    Y      N     N     N     N
AltaVista      250M        Excellent  N    N    Y      N     N     Y     Y
FAQS           3300 FAQs   Excellent  Y    N    N      N     Y     N     N
FTPSearch      100M file   Excellent  N    Y    N      N     N     N     N
Search.com     N/A         Excellent  N    N    N      Y     N     N     N
About          ?           Good       N    N    N      N     Y     N     Y
AskJeeves      8M Ques.    Good       Y    N    Y      N     N     N     Y
DirectHit      ?           Good       N    N    N      N     N     N     Y
Excite         ?           Good       N    N    Y      N     N     Y     Y
Go             50M?        Good       N    N    Y      N     N     N     Y
Google         100M?       Good       N    N    Y      N     N     N     N
HotBot
Lycos          250M?       Good       N    Y    Y      N     N     N     Y
MetaCrawler    N/A         Good       N    N    N      Y     N     N     N
MSN
NorthernLight  200M?       Good       N    N    Y      N     N     Y     N
OpenDirectory  1M?         Good       N    N    N      N     N     N     Y
WebCenter      500T?       Good       N    N    N      N     N     N     Y
DogPile        N/A         Okay       N    Y    N      Y     Y     Y     Y
GoTo           ?           Okay       N    N    Y      N     N     N     Y
InfoSpace      very few    Okay       N    N    Y      N     Y     N     N
iWon           350M?       Okay       N    N    Y      N     Y     N     N
Snap           ?           Okay       N    N    Y      N     N     N     Y
Mamma          n/a         Weak       N    N    N      Y     N     N     N



27

Conclusion

• Intelligent agent technology could be used to improve search methods.

• Quantum search methods could also be explored.


28


Thank you all!
