  • Retrieving Information on the Web

    Presented by: Md. Zaheed Iftekhar
    Course: Information Retrieval (IFT6255)
    Professor: Jian E. Nie, DIRO, University of Montreal
    April 9th, 2003

  • Overview

    Web search: general description
    Introduction of web, search engines
    Definitions
    Major search engines
    Current technologies
    The future: where the technology is heading
    Proposal for further improvement
    Conclusion
    References

  • History of the Web

    In 1990 the World Wide Web (WWW) was developed by Tim Berners-Lee at CERN to organize research documents available on the Internet.
    Combined the idea of documents available by FTP with the idea of hypertext to link documents.
    Developed the initial HTTP network protocol, URLs, HTML, and the first web server.

  • World Wide Web

    Ted Nelson developed the idea of hypertext in 1965.
    Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
    ARPANET was developed in the early 1970s.
    The basic technology was in place in the 1970s, but it took the PC revolution and widespread networking to inspire the web and make it practical.

  • Web Browser

    Early browsers were developed in 1992 (Erwise, ViolaWWW).
    In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed Mosaic.
    Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC).
    Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.

  • Web Search

    By the late 1980s many files were available by anonymous FTP.
    In 1990, Alan Emtage of McGill Univ. developed Archie (short for "archives").
      Assembled lists of files available on many FTP servers.
      Allowed regex search of these file names.
    In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

  • Web Search

    In 1993, early web robots (spiders) were built to collect URLs:
      Wanderer
      ALIWEB (Archie-Like Index of the WEB)
      WWW Worm (indexed URLs and titles for regex search)
    In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

  • Web Search

    In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (it later became part of Excite and AOL).
    The same year, Fuzzy Mauldin, a grad student at CMU, developed Lycos.
      First to use a standard IR system.
      First to index a large set of pages.
    In late 1995, DEC developed AltaVista.
      Supported boolean operators, phrases, and reverse pointer queries.
    In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google.

  • Spiders (Robots/Bots/Crawlers)

    Start with a comprehensive set of root URLs from which to start the search.
    Follow all links on these pages recursively to find additional pages.
    Index all newly found pages in an inverted index as they are encountered.
    May allow users to directly submit pages to be indexed (and crawled from).
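The indexing step mentioned on the slide above can be illustrated with a very small inverted index. This is a minimal sketch, not the structure any particular search engine uses: the `InvertedIndex` class, the `tokenize` helper, and the `example.org` URLs are all hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of page URLs that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs

    def add_page(self, url, text):
        # Record the URL under every term that occurs in the page text.
        for term in tokenize(text):
            self.postings[term].add(url)

    def lookup(self, term):
        # Return the matching URLs in a stable (sorted) order.
        return sorted(self.postings.get(term.lower(), set()))


# Hypothetical pages, indexed as a spider would encounter them.
index = InvertedIndex()
index.add_page("http://example.org/a", "Web search engines crawl the web")
index.add_page("http://example.org/b", "Spiders follow links")
```

Looking up a term then returns the pages that contain it, for example `index.lookup("web")`.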

  • Web search: Breadth-first Search

  • Web search: Depth-first Search

  • Search Strategy Trade-Offs

    Breadth-first explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). It is the standard spidering method.
    Depth-first requires memory of only depth times branching-factor (linear in depth) but can get lost pursuing a single thread.
    Both strategies can be implemented using a queue of links (URLs).
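The memory trade-off above can be checked on a toy case. The sketch below assumes an idealized complete tree (every page has the same number of links and there are no cycles), which the real web is not. The function name `max_frontier` is hypothetical. It reports the largest number of pending nodes each strategy holds at once.

```python
from collections import deque


def max_frontier(branching, depth, breadth_first):
    """Traverse an implicit complete tree and return the peak frontier size.

    Each frontier entry is just the depth of a pending node, which is all
    that is needed to know whether the node has children.
    """
    frontier = deque([0])  # start with the root at depth 0
    peak = 1
    while frontier:
        # Breadth-first takes the oldest entry, depth-first the newest.
        level = frontier.popleft() if breadth_first else frontier.pop()
        if level < depth:
            frontier.extend([level + 1] * branching)
        peak = max(peak, len(frontier))
    return peak
```

With a branching factor of 3 and depth 4, breadth-first peaks at 3^4 = 81 pending nodes (the whole last level), while depth-first peaks at 1 + 4 × (3 − 1) = 9.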

  • Avoiding Page Duplication

    Must detect when revisiting a page that has already been spidered (the web is a graph, not a tree).
    Must efficiently index visited pages to allow a rapid recognition test:
      Tree indexing (e.g. trie)
      Hashtable
    Index page using URL as a key.
      Must canonicalize URLs (e.g. delete ending /).
      Does not detect duplicated or mirrored pages.
    Index page using textual content as a key.
      Requires first downloading the page.
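Both keying options above can be sketched with Python's standard library. The canonicalization rules chosen here (lowercase scheme and host, drop default ports and fragments, strip a trailing slash) are one plausible set, assumed for illustration. The helper names `canonicalize`, `first_visit`, and `content_key` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Reduce equivalent spellings of a URL to one key (assumed rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Delete the ending slash, but keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it names a position inside the same page.
    return urlunsplit((scheme, host, path, parts.query, ""))


visited = set()  # hash table of canonical URLs already spidered


def first_visit(url):
    """Return True the first time a URL is seen, False on revisits."""
    key = canonicalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True


def content_key(page_text):
    """Key a page by its text, so exact mirrors share one key."""
    return hashlib.sha1(page_text.encode("utf-8")).hexdigest()
```

The URL key is cheap because it needs no download, but only the content key catches mirrored copies served from different URLs, and it matches only byte-identical text.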

  • Spidering Algorithm

    Initialize queue (Q) with initial set of known URLs.
    Until Q is empty or the page or time limit is exhausted:
      Pop URL, L, from front of Q.
      If L is not to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt), continue loop.
      If already visited L, continue loop.
      Download page, P, for L.
      If cannot download P (e.g. 404 error, robot excluded), continue loop.
      Index P (e.g. add to inverted index or store cached copy).
      Parse P to obtain list of new links N.
      Append N to the end of Q.
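The loop above can be sketched in Python. To keep the example self-contained and runnable without a network, downloading and link parsing are replaced by a lookup in a hypothetical in-memory dictionary (`TOY_WEB`). The names `fetch`, `spider`, and the `a.example` URLs are assumptions made for illustration, and the time limit is omitted.

```python
from collections import deque

# File types the slide treats as "not an HTML page".
SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("home page",
                          ["http://a.example/x", "http://a.example/logo.gif"]),
    "http://a.example/x": ("page x",
                           ["http://a.example/", "http://a.example/y",
                            "http://a.example/missing"]),
    "http://a.example/y": ("page y", []),
}


def fetch(url):
    """Stand-in for an HTTP download; None means the page is unavailable."""
    return TOY_WEB.get(url)


def spider(seeds, page_limit=100):
    queue = deque(seeds)   # Q, initialized with the known URLs
    visited = set()
    index = {}             # URL -> text, standing in for the index step
    while queue and len(index) < page_limit:
        url = queue.popleft()                      # pop L from front of Q
        if url.lower().endswith(SKIP_EXTENSIONS):  # not an HTML page
            continue
        if url in visited:                         # already visited L
            continue
        visited.add(url)
        page = fetch(url)                          # download P
        if page is None:                           # e.g. 404 error
            continue
        text, links = page
        index[url] = text                          # index P
        queue.extend(links)                        # append N to end of Q
    return index
```

Calling `spider(["http://a.example/"])` visits the three HTML pages in breadth-first order and skips the image, the repeated home link, and the missing page.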

  • Queueing Strategy

    How new links are added to the queue determines the search strategy.
    FIFO (append to end of Q) gives breadth-first search.
    LIFO (add to front of Q) gives depth-first search.
    Heuristically ordering the Q gives a focused crawler that directs its search towards interesting pages.
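The three queueing choices above can be compared on a small link graph. This is a sketch under stated assumptions: the `crawl_order` function, the graph, and the "interest" scores are all hypothetical. A real focused crawler would compute scores from page content or link context rather than read them from a table.

```python
import heapq
from collections import deque
from itertools import count


def crawl_order(graph, start, strategy, score=None):
    """Return the visiting order for 'fifo', 'lifo', or 'best' queueing."""
    visited, order = set(), []
    tie = count()  # tie-breaker so heap entries with equal scores stay ordered
    if strategy == "best":
        frontier = [(-score(start), next(tie), start)]
    else:
        frontier = deque([start])
    while frontier:
        if strategy == "best":
            url = heapq.heappop(frontier)[2]   # highest score first
        else:
            url = frontier.popleft()           # always pop from the front
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        links = graph.get(url, [])
        if strategy == "fifo":
            frontier.extend(links)                  # append to end of Q
        elif strategy == "lifo":
            frontier.extendleft(reversed(links))    # add to front of Q
        else:
            for link in links:
                heapq.heappush(frontier, (-score(link), next(tie), link))
    return order


# Hypothetical link graph and interest scores.
GRAPH = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
SCORES = {"A": 0, "B": 1, "C": 5, "D": 2, "E": 9, "F": 3}
```

On this graph FIFO yields A, B, C, D, E, F (breadth-first), LIFO yields A, B, D, E, C, F (depth-first), and the score-ordered queue yields A, C, F, B, E, D.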

  • Source: http://www.bruceclay.com


  • Google

    Google is a search engine that maintains its own spider-based index. Google also has a directory that is powered by the Open Directory.
    Google supports:
      Boolean search
      Phrase
      Similarity
      Proximity
    Source: lookoff.com, http://www.bruceclay.com
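Boolean, phrase, and proximity queries of the kind listed above are commonly answered from a positional index, which records where each term occurs in each document. The sketch below illustrates that general idea only and makes no claim about how Google implements these features. The documents and the names `build_positions`, `boolean_and`, `phrase`, and `near` are hypothetical.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
DOCS = {
    1: "web search engines index the web",
    2: "search the web with a spider",
    3: "engines for web search",
}


def build_positions(docs):
    """Build term -> {doc id -> [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def boolean_and(index, a, b):
    """Documents containing both terms."""
    return sorted(set(index.get(a, {})) & set(index.get(b, {})))


def _within(index, a, b, max_gap, ordered):
    hits = []
    for doc_id in boolean_and(index, a, b):
        for pa in index[a][doc_id]:
            gaps = (pb - pa for pb in index[b][doc_id])
            if any(0 < (g if ordered else abs(g)) <= max_gap for g in gaps):
                hits.append(doc_id)
                break
    return hits


def phrase(index, a, b):
    """Documents where b immediately follows a."""
    return _within(index, a, b, 1, ordered=True)


def near(index, a, b, max_gap):
    """Documents where a and b occur within max_gap words, in either order."""
    return _within(index, a, b, max_gap, ordered=False)


POSITIONS = build_positions(DOCS)
```

For example, `phrase(POSITIONS, "web", "search")` matches documents 1 and 3 but not document 2, where the two words appear in the opposite order and are not adjacent.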

  • Google

    Strengths
      The interface is tremendously simple, but the quality of results is not significantly impeded.
      Accuracy for common topics.

    Weaknesses
      Lack of power features.
      Coverage of the Internet is much less than some competitors.
      No OR keyword support for boolean searches.

    Source: lookoff.com, http://www.bruceclay.com

  • Yahoo!

    Strengths
      Coverage of the Internet is excellent.
      Links are generally quite up to date and free of spam and poor quality sites.
      Human maintainers ensure that sites are placed correctly within the relevant topic.
      The search interface is very fast.
      Yahoo integrates with indexed searches after presenting Yahoo topic areas.
      Accuracy for common topics.

    Weaknesses
      The search interface is very effective for general searches but could be better for powerful searches.
      Not all relevant sites are listed in Yahoo; they have to be submitted and accepted.

    Source: lookoff.com, http://www.bruceclay.com

  • Ask Jeeves

    Strengths
      A simple interface makes it very easy to form queries.
      Excellent for new users and children.
      If your query corresponds to a pre-packaged answer, you can expect some surprisingly good results.
      Millions of bundled answers provide premium answers that are superior to standard index searches.
      The site is actively maintained.
      An integrated metacrawler provides results for your search from Goto, AltaVista, Mamma and 4Anything.
      The search code is very fast.

    Weaknesses
      The site supposedly takes pay for top spots, sometimes placing dubious quality links at the top of results.
      No advanced search.
      Very little power in constructing your keywords.
      Little control over filtering results.

  • MSN

    Strengths
      Very active news portal with updated and well-presented headlines.
      Integrated single sign-on with hotmail, msn, etc.
      Configurable interface lets you customize content, layout and colors.
      Very actively maintained.
      Many interesting (although often commercially-oriented) services tied into the MSN network.
      Nationalized versions for quite a few countries providing a more specific content and news feed.
      Ability to save (i.e. tag) results to quickly filter search results into a candidates list.

    Weaknesses
      Not a low-bandwidth interface. Slow modem users should beware.
      Mediocre search interface.
      Less web coverage than most search engines.

  • Search engine comparison

    Program        Pages (#)   Class      FAQ  FTP  Index  Meta  Misc  News  Portal
    Dejanews       300M msg    Best       N    N    N      N     Y     Y     N
    Raging         250M        Best       N    N    Y      N     N     N     N
    Yahoo          500T        Best       N    N    N      N     N     N     Y
    AllTheWeb      300M        Excellent  N    N    Y      N     N     N     N
    AltaVista      250M        Excellent  N    N    Y      N     N     Y     Y
    FAQS           3300 FAQs   Excellent  Y    N    N      N     Y     N     N
    FTPSearch      100M file   Excellent  N    Y    N      N     N     N     N
    Search.com     N/A         Excellent  N    N    N      Y     N     N     N
    About          ?           Good       N    N    N      N     Y     N     Y
    AskJeeves      8M Ques.    Good       Y    N    Y      N     N     N     Y
    DirectHit      ?           Good       N    N    N      N     N     N     Y
    Excite         ?           Good       N    N    Y      N     N     Y     Y
    Go             50M?        Good       N    N    Y      N     N     N     Y
    Google         100M?       Good       N    N    Y      N     N     N     N
    HotBot         150M?       Good       N    N    Y      N     N     N     Y
    Lycos          250M?       Good       N    Y    Y      N     N     N     Y
    MetaCrawler    N/A         Good       N    N    N      Y     N     N     N
    MSN            120M?       Good       N    N    Y      N     N     N     Y
    NorthernLight  200M?       Good       N    N    Y      N     N     Y     N
    OpenDirectory  1M?         Good       N    N    N      N     N     N     Y
    WebCenter      500T?       Good       N    N    N      N     N     N     Y
    DogPile        N/A         Okay       N    Y    N      Y     Y     Y     Y
    GoTo           ?           Okay       N    N    Y      N     N     N     Y
    InfoSpace      very few    Okay       N    N    Y      N     Y     N     N
    iWon           350M?       Okay       N    N    Y      N     Y     N     N
    Snap           ?           Okay       N    N    Y      N     N     N     Y
    Mamma          n/a         Weak       N    N    N      Y     N     N     N


  • Conclusion

    Intelligent agent technology could be used to improve search methods.

    Quantum search methods could also be explored.

  • Web search

    Thank you all!