1 internet search engines: fluctuations in document accessibility wouter mettrop cwi, amsterdam, the...

31
1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium Hanneke Smulders Infomare Consultancy, The Netherlands http://www.cwi.nl/cwi/projects/IRT Presented at Internet Librarian International 2000 in London, England, March 2000

Upload: india-neblett

Post on 30-Mar-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

1Internet search engines:

Fluctuations in document accessibility

• Wouter Mettrop

CWI, Amsterdam, The Netherlands

• Paul Nieuwenhuysen

Vrije Universiteit Brussel, and Universitaire Instelling Antwerpen, Belgium

• Hanneke Smulders

Infomare Consultancy, The Netherlands

http://www.cwi.nl/cwi/projects/IRT

Presented at Internet Librarian International 2000

in London, England, March 2000

Page 2: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

3

WWW

WWW: growing number of WWW servers

0

1000000

2000000

3000000

4000000

5000000

6000000

7000000

1993 1994 1995 1996 1997 1998 1999 2000

Page 3: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

4

Internet based information sources: how many? how much?

In 2000:

• about 1 billion = 1000 million unique URLs in the total Internet

• about 10 terabyte (= 10 000 gigabyte) of text data

Page 4: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

5

Internet information retrieval systems in 2000

• Several types of systems exist to retrieve information:

»Directories of selected sources categorised by subject, made by humans, mainly for browsing.

»Search systems, based on databases with machine made indexes, for word-based searching!

»“Meta-search” or “multi-threaded” search systems.

• We have studied and compared several well-known international (and a few national) word-based Internet search engines.

Page 5: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

6

Internet information retrieval systems: evaluation criteria

• Many aspects/criteria can be considered in the evaluation of an Internet search engine, including

»coverage of documents present on WWW (studies exist)

»number of elements of a document, that are indexed to make them usable for retrieval

»fluctuations over time in the result sets offered by a search engine

• We started to study the depth of indexing and we were soon confronted with the fluctuations in the performance that do exist.

Page 6: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

7

Internet information retrieval systems: our research group

The following persons have been involved in the research: • Louise Beijer (Hogeschool van Amsterdam, The Netherlands)

• Hans de Bruin (Unilever Research Laboratorium, Vlaardingen, The Netherlands)

• Hans de Man (JdM Documentaire Informatie, Vlaardingen, The Netherlands)

• Rudy Dokter (PNO Consultants, Hengelo, The Netherlands)

• Marten Hofstede ( Rijksuniversiteit Leiden, The Netherlands)

• Wouter Mettrop (CWI, Amsterdam, The Netherlands)

• Paul Nieuwenhuysen (Vrije Universiteit Brussel, Belgium)

• Eric Sieverts (Hogeschool van Amsterdam, and RUU, The Netherlands)

• Hanneke Smulders (Infomare, Terneuzen, The Netherlands)

• Hans van der Laan (Consultant, Leiderdorp, The Netherlands)

• Ditmer Weertman (ADLIB, Utrecht, The Netherlands)

Page 7: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

8

Internet search engines: research on indexing functionality

• assessing the indexing functionality

»test document

»test method

• conclusions concerning indexing functionality

Page 8: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

9

0 8 16

Number of our test documents thatwere retrieved at least once during theinvestigation period

Number of our test documents that were retrieved

Page 9: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

10

Internet search engines: elements of test document studied

• title tag

• META-tags: keywords, description and author

• comment tag

• ALT tag

• text/URL of a link to a document

• H3 tag

• table header

• text of: an internal link, a reference anchor, a link to a sound file

• name of a sound file (au/wav/aiff/ra)

• text of a link to an image

• name of an image file (gif or jpg; inline or linked to)

• name of a Java applet (with or without extension class)

• terms after the first 100 lines in a document (200/…/700)

• the URL of a document

Page 10: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

11

Internet search engines: part of the test document source code

• <HTML> <HEAD>

• <TITLE>Test pagina</TITLE>

• <META NAME="keywords"

• CONTENT="een, twee, drie">

• <META NAME="description"

• CONTENT="This test page, containig a small part of the Secret Garden (by Frances Hodgson Burnett) is part of a larger site about the IRT project. vier, vijf, zes">

• <META NAME="Subject" CONTENT="zeven">

• <META NAME="Subject" CONTENT="acht">

• <META NAME="Subject" CONTENT="negen">

• <META NAME="Title” CONTENT="tien hoofdstukken uit The Secret Garden">

• <META NAME="Title:Subtitle" content="elf">

Page 11: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

12

0 5 10 15 20 25

Number of studieddocument elementsthat were indexedat least once duringthe observationperiod

Number of the studied document elements that were indexed

Page 12: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

13

Internet search engines : reachability

• 14 528 queries sent to 13 search engines

• 721 times unreachable

• The percentage of unreachability varies from nearly 0% to nearly 15%.

• The studied search engines were reachable for 95% of the queries.

Page 13: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

14

Search engine indexing functionality: conclusions

• Not “all of the web” is indexed.

»Not all of our test documents.

»Not all HTML elements of our test document.

• Some of the studied search engines showed changes in the indexing policy.

• No relation between the number of indexed test documents or HTML elements and the size of a search engine was found during our study.

Page 14: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

15

Internet search engines: fluctuations - definition

• A fluctuation appears when the result set of an observation

- i.e.

» one query or

» set of queries

misses documents with respect to a frame of reference

- i.e.

» other observations and

» knowledge about Web reality

Page 15: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

16

Internet search engines: detecting fluctuations

• Through time: comparing result sets of one observation, repeatedly performed

» Observation = one query or set of queries

» Frame of reference = other observations & web-knowledge

• One moment: consistency of result sets

» Observation = one query in set of queries

» Frame of reference = other observations

Page 16: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

17

Internet search engines: types of fluctuations

• Through time: comparing result sets of one observation repeatedly performed

» “Document fluctuations”

» “Indexing fluctuations”

• One moment: consistency of result sets

» “Element fluctuations”

Page 17: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

18

Page 18: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

19

Document fluctuations: example 1

TIME

Page 19: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

20

Document fluctuations: example 2

TIME

Page 20: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

21

0 10 20 30 40 50 60 70 80 90 100

AltaVistaEuroferret

Excite

HotBot

Ilse

Infoseek Lycos

MSNNorthernLight

Search.nl

Snap

VindexWebcrawler

Average percentage offorgotten documents perround

Percentage of roundswith one or moreforgotten documents

Document fluctuations: experimental results

Page 21: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

22

Page 22: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

23

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

Average percentageof missed documentsper result set =Percentage of resultsets with missingdocuments

Indexing fluctuations:experimental results

Page 23: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

24

Page 24: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

25

Element fluctuations: example

0

1

2

3

4

5

6

7

8

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Number of documents retrieved by HotBot in every query in round 23

Page 25: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

26

0 10 20 30 40 50 60 70 80 90 100

Average percentage ofmissed documents perresult-set

Percentage of result-setsthat were incomplete

Element fluctuations: experimental results

Page 26: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

27

0 10 20 30 40 50

Lost by elementfluctuations

Lost by documentfluctuations

Lost by indexingfluctuations

Percentage of documents missed due to fluctuations

Page 27: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

28

Internet search engines: fluctuations - quantitative conclusions

• Many element fluctuations many document and indexing fluctuations and many document elements indexed

• Many document fluctuations not always many element fluctuations

• Few document elements indexed few element fluctuations

Page 28: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

29

Fluctuations: remarks on “correctness”

• Fluctuations can be seen as “correct”, if they are reflections of alterations in:

»(web-) reality

— then document, indexing and element fluctuations are incorrect

»the indexed database of a search engine

— then only element fluctuations are incorrect

• Users do not care; they miss documents

Page 29: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

30

Fluctuations:remarks on “size”

• No relation document / element fluctuations < ===== > “size”

• Percentage missed documents determines (with other reducing effects, such as depth of indexing) the effective size of an engine

Page 30: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

31

Internet search engines: conclusions of our research

• Search engines differ in depth of indexing.

• Search engines show fluctuations in their result sets:

»They are subject to changes in indexing policy.(“indexing fluctuations”)

»They forget documents completely (“document fluctuations”)

»They miss documents in their result sets (“element fluctuations”).

Page 31: 1 Internet search engines: Fluctuations in document accessibility Wouter Mettrop CWI, Amsterdam, The Netherlands Paul Nieuwenhuysen Vrije Universiteit

32

Internet search engines: recommendations related to fluctuations

• Fluctuations are “normal”; do not be surprised; do not worry.

• Do not try to find a simple explanation to fully understand what happens.

• Known item searchers should repeat the search

»when using an engine with many element fluctuations; use other search terms;

»when using an engine with many document fluctuations: repeat later.

• Further research on effective size.