S. Lawrence and C.L. Giles Presented by Robert Cadwgan-Evans, Simon Munday Searching the World Wide Web


S. Lawrence and C.L. Giles

Presented by

Robert Cadwgan-Evans, Simon Munday

Searching the World Wide Web

Introduction

• Analyse the paper

– Coverage of search engines

– Size of the Indexable Web

• Consider search and Internet development from 1998-today

• The future of searching

Paper Outline

• Published April 1998, data collected in 1997

• Investigates the comparative coverage of the internet by major search engines of the time

• Attempts to put a figure on the size of the web

• Important because it provides a way to measure the size of the web

Search Engine Coverage: The Test

Coverage: Percentage of the unique list that an individual engine returns in its queries

Diagram: 575 queries are submitted to each of six engines (HotBot, AltaVista, Northern Light, Excite, Infoseek, Lycos); the results from all engines are combined into a list of unique results from all queries
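The coverage measure defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own code: the engine names and result URLs are made-up placeholders, and `coverage` is a hypothetical helper name.

```python
def coverage(results_by_engine):
    """Return each engine's coverage (%) of the combined unique result list.

    results_by_engine maps an engine name to the set of result URLs it
    returned across all queries (hypothetical toy data in this sketch).
    """
    # Union of every engine's results: the "list of unique results".
    unique_results = set().union(*results_by_engine.values())
    return {
        engine: 100.0 * len(urls) / len(unique_results)
        for engine, urls in results_by_engine.items()
    }


# Toy data, not taken from the paper.
toy_results = {
    "EngineA": {"u1", "u2", "u3"},
    "EngineB": {"u2", "u4"},
    "EngineC": {"u5"},
}
toy_coverage = coverage(toy_results)
```

Because coverage is measured against the combined list rather than the whole web, an engine can score well here and still index only a fraction of all pages.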

Search Engine Coverage: Results

Results of search engine coverage using this test:

Search Engine     Coverage (%)
HotBot            57.5
AltaVista         46.5
Northern Light    32.9
Excite            23.1
Infoseek          16.5
Lycos             4.41

Even the most successful of the engines, HotBot, doesn’t manage to cover two thirds of the result set from all engines

Size of the Indexable Web: Method

Estimated from an analysis of the overlap between search engines

Diagram: the set N containing two overlapping sets Na and Nb, with their overlap N0

N: set of indexable web pages

Na: set of results returned by search engine A

Nb: set of results returned by search engine B

N0: set of results returned by both A and B (the overlap)

An estimate of the fraction of the indexable web covered by engine A, Pa, can be calculated from the sizes of these sets:

Pa = N0 / Nb

From this fraction an estimate for the overall size of the indexable web, N, can be calculated, where Sa is the total number of pages indexed by engine A:

N = Sa / Pa
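The two formulas above can be written as a short Python sketch. The function name and the input numbers are assumptions chosen only to show the arithmetic; they are not figures from the paper.

```python
def estimate_web_size(n_overlap, n_b, s_a):
    """Estimate the indexable web size N from two engines' overlap.

    n_overlap: number of results returned by both A and B (N0)
    n_b:       number of results returned by engine B (Nb)
    s_a:       total number of pages indexed by engine A (Sa)
    """
    if n_b <= 0 or n_overlap <= 0:
        raise ValueError("need a non-empty result set for B and a non-empty overlap")
    p_a = n_overlap / n_b      # Pa = N0 / Nb
    return s_a / p_a           # N = Sa / Pa


# Hypothetical numbers chosen only to illustrate the arithmetic.
example_n = estimate_web_size(n_overlap=250, n_b=1000, s_a=50_000_000)
```

In this made-up case engine A finds a quarter of engine B's results, so A is assumed to cover a quarter of the web, and the estimate is four times the size of A's index.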

Size of the Indexable Web: Examples

Diagram: two Venn diagrams of Na and Nb within N, one with a small overlap N0 and one with a large overlap N0

Little overlap shows that each engine is missing many of the results the other finds, so not much of the web is covered

Big overlap shows that the sets are almost complete, so they must contain most of the web

• Works on the assumption of randomness and independence
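A toy example can show why the independence assumption matters. The sketch below is an illustration under stated assumptions, not the paper's experiment: the "web" is 1000 numbered pages, and the two pairs of indexes are constructed by hand so that one pair is independent and the other is strongly correlated.

```python
def overlap_estimate(index_a, index_b):
    """Estimate web size from two engines' indexes, given as sets of page ids."""
    p_a = len(index_a & index_b) / len(index_b)   # Pa = N0 / Nb
    return len(index_a) / p_a                     # N = Sa / Pa


web = range(1000)  # a toy "web" of 1000 pages (assumption for illustration)

# Independent indexes: membership in A says nothing about membership in B.
a_indep = {p for p in web if p % 5 < 2}    # 400 pages
b_indep = {p for p in web if p % 2 == 0}   # 500 pages
independent_estimate = overlap_estimate(a_indep, b_indep)

# Correlated indexes: both engines favour the same low-numbered pages.
a_corr = set(range(400))                   # 400 pages
b_corr = set(range(500))                   # 500 pages
correlated_estimate = overlap_estimate(a_corr, b_corr)
```

With these toy sets the independent case recovers roughly the true size of 1000 pages, while the correlated case gives roughly 500. If real engines tend to index the same popular pages, the overlap is inflated and the method underestimates the size of the web.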

Size of the Indexable Web: Results

Comparison between pairs of search engines

Search Engines                  Indexable Web (millions of pages)
Lycos and Infoseek              90
Infoseek and Excite             220
Excite and Northern Light       230
Northern Light and AltaVista    230
AltaVista and HotBot            320

The paper selects the largest of these, 320 million pages, as its estimate for the size of the indexable web

Paper Summary

• Paper admits the size is an estimate; the actual figure is probably larger

• Query terms are based upon scientists' searching habits, not those of the general public

• This estimate suggests that previous estimates of as few as 75 million pages are incorrect

Current Technology

• Newcomers: Google, Yahoo, MSN and Ask Jeeves

• Size of the web has exploded in the last 5 years [1]

Size of the Web Today

• Up-to-date and accurate measurement is difficult, but current figures put the size of the web at around 11.5 billion pages [2]

• Search engines currently index about 9.4 billion of these pages [2]

• Google indexes 8 billion pages, but also takes searching further, indexing 880 million images [3]

• Does a bigger index mean better quality results?

• Larger index could hamper performance [4]

Specialized Search Engines

• With such big search engines providing general results, more specialized search engines have emerged.

The Future

• The Deep Web refers to the databases from which dynamic pages are generated

• Over 200,000 deep websites exist [5]

• Examples include eBay and Amazon

• Deep Web is 400 to 550 times larger than the “surface web” [5]

Conclusion

• Estimating the size of the web is difficult and, as yet, not possible to do accurately

• Paper does a good job of showing previous estimates are far too low (even if its own is low)

• The inclusion of deep web will only make the problem harder

References

• 1. Search Engine Sizes, D. Sullivan, January 2005, http://searchenginewatch.com/reports/article.php/2156481

• 2. The Indexable Web is More than 11.5 Billion Pages, A. Gulli and A. Signorini, 2005, http://citeseer.ist.psu.edu/gulli05indexable.html

• 3. Google Product Descriptions, http://www.google.co.uk/press/descriptions.html

• 4. Accessibility of Information on the Web, S. Lawrence and C. Giles, Nature, 400:107-109, 1999

• 5. The Deep Web: Surfacing Hidden Value, Michael K. Bergman, 2001, http://beta.brightplanet.com/deepcontent/turtorials/DeepWeb/index.asp