querylog-based assessment of retrievability bias in delpher

Post on 16-Jan-2017


Querylog-based Assessment of Retrievability Bias in Delpher

Myriam C. Traub, Thaer Samar, Jacco van Ossenbruggen, Jiyin He, Arjen de Vries, Lynda Hardman


Motivation

• Users want to be able
  • to access all (relevant) documents in Delpher
  • to get a fair overview of Delpher’s content
• However,
  • data collections are implicitly biased,
  • users are biased,
  • and technology induces even more bias(es)

… which I can deal with if the bias is made explicit. #toolcrit

Note: Bias is not necessarily a bad thing!

Research Questions

RQ1: Is the access to the digitized newspaper collection in Delpher influenced by a retrievability bias?

RQ2: Can we correlate the features of a document (such as document length, time of publishing, type of document, etc.) with its retrievability scores?

RQ3: To what extent are retrievability experiments using simulated queries representative for the search behavior of real users?


Retrievability

• Introduced by Azzopardi et al. [1] in 2008 in a study based on born-digital documents and simulated queries

• Measures the accessibility of all documents in a collection for a given set of queries

• Retrievability score r(d) measures how often a document d is retrieved by a given set of queries

• Gini coefficient and Lorenz curves can visualize and quantify inequality in the distribution of r(d) scores
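A minimal sketch of the cumulative r(d) count described above, assuming every query carries equal weight and that each ranked list contains a document at most once. The function name and the toy `rankings` dictionary are hypothetical and only serve as illustration.

```python
from collections import Counter


def retrievability(rankings, cutoff):
    """Count, per document, how many queries rank it within the cutoff.

    rankings: dict mapping a query string to its ranked list of document ids.
    cutoff:   the rank c up to which a retrieved document is counted.
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


# Hypothetical toy rankings for three queries.
rankings = {
    "amsterdam": ["d1", "d2", "d3"],
    "rotterdam haven": ["d2", "d1", "d4"],
    "koningin": ["d2", "d5", "d1"],
}
r = retrievability(rankings, cutoff=2)
print(r["d2"], r["d1"], r["d3"])
```

Documents that never appear within the cutoff keep r(d)=0, because a `Counter` returns 0 for missing keys.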


[1]  L. Azzopardi and V. Vinay. Retrievability: An evaluation measure for higher order information access tasks. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, pages 561–570, New York, NY, USA, 2008. ACM.

[Figure: Lorenz curve; x-axis: % of population, y-axis: % of income. Example distributions: (0, 0, 0, 0, 1), (1, 1, 1, 1, 1), (0, 0, 1, 1, 2)]

Lorenz Curve & Gini Coefficient

• Introduced by economists to visualize inequality in wealth distribution
• Ranges between 0 and 1:
  • perfect tyranny (G=0.8)
  • perfect communism (G=0)
  • in-between (G=0.5)
• There is no good or bad G.
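The three G values above can be reproduced with a short sketch, assuming the standard Gini formula over sorted values and the three five-element toy distributions shown with the Lorenz curve. The function names are illustrative, and returning 0.0 for an all-zero input is a convention chosen here. With only five values, the most unequal case reaches (n-1)/n = 0.8 rather than 1.

```python
def gini(values):
    """Gini coefficient of a list of non-negative numbers."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here: no mass means no inequality
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def lorenz_points(values):
    """(share of population, share of accumulated value) pairs for plotting."""
    xs = sorted(values)
    total = sum(xs)
    points = [(0.0, 0.0)]
    if total == 0:
        return points
    running = 0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / len(xs), running / total))
    return points


tyranny = [0, 0, 0, 0, 1]
communism = [1, 1, 1, 1, 1]
in_between = [0, 0, 1, 1, 2]
# prints 0.8 0.0 0.5
print(round(gini(tyranny), 2), round(gini(communism), 2), round(gini(in_between), 2))
```

For retrievability, the same functions would be applied to the r(d) scores of all documents, including those with r(d)=0.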

[Figure: the same Lorenz curve applied to retrievability; x-axis: % of documents, y-axis: % of accumulated r(d)]

Document Collection: KB Newspaper Archive

June 1618 - December 1995

Total Size         102,718,528
Vocabulary Size    353,086,358

Type              Share    Count
Articles          67%      69,237,655
Advertisements    29%      29,591,599
Notifications*    2%       1,918,375
Captions          2%       1,970,899

* Familiebericht (family announcement)

Simulated Queries

• Followed a strategy similar to that of previous studies

• Top 2 million single terms from the preprocessed corpus + top 2 million bigram terms

• No filtering for OCR errors
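The query simulation above could be sketched as follows, assuming "top" means most frequent in the corpus and that documents arrive already tokenized and preprocessed. The function name and the toy documents are hypothetical; the study used the top 2 million terms of each kind, whereas `top_n` is tiny here.

```python
from collections import Counter


def simulated_queries(documents, top_n):
    """Build single-term and bigram queries from the most frequent terms.

    documents: iterable of token lists (one list per preprocessed document).
    top_n:     how many unigrams and how many bigrams to keep.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in documents:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    single = [term for term, _ in unigrams.most_common(top_n)]
    double = [" ".join(pair) for pair, _ in bigrams.most_common(top_n)]
    return single + double


# Hypothetical toy corpus of two preprocessed documents.
docs = [
    ["koningin", "bezoekt", "haven", "amsterdam"],
    ["koningin", "bezoekt", "markt"],
]
queries = simulated_queries(docs, top_n=2)
print(queries)
```

Because nothing is filtered, misrecognized OCR tokens that occur often enough would end up as queries too, which matches the "no filtering" choice on the slide.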


Real Queries

• User logs collected between March and July 2015 on Delpher

• Extracted queries and view data related to newspaper archive

• Total of 957,239 unique queries


Experiment / Parameters

• Real queries, simulated queries

• Standard Information Retrieval models: TFIDF, LM1000, BM25

• Pre-processing: Stemming, stopword removal, operator removal

• Cutoff values: c=10, c=100, c=1000
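As a rough sketch, the parameters above span a grid of configurations. The enumeration below assumes a full cross product of query sets, models and cutoffs, which the slides do not state explicitly; the dictionary keys are illustrative names only.

```python
from itertools import product

QUERY_SETS = ["real", "simulated"]
MODELS = ["TFIDF", "LM1000", "BM25"]
CUTOFFS = [10, 100, 1000]


def experiment_grid():
    """Enumerate every (query set, model, cutoff) combination."""
    return [
        {"queries": q, "model": m, "cutoff": c}
        for q, m, c in product(QUERY_SETS, MODELS, CUTOFFS)
    ]


configs = experiment_grid()
print(len(configs))  # 2 query sets x 3 models x 3 cutoffs = 18
```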

Results of Quantifying Retrievability Bias

Inequality

Real queries, c=1000: G_BM25 = 0.76

Simulated queries, c=100: G_BM25 = 0.52

Inequality, c=10

Real queries: G_BM25 = 0.97

Simulated queries: G_BM25 = 0.85

A large fraction of documents scores r(d)=0

Limitations

• The Lorenz curves and Gini values
  • are strongly influenced by 0 values,
  • can indicate the degree of bias, but they tell us nothing about the type of bias.

Does it arise from the users’ interest / search behavior? Or from a technological bias towards a particular document feature?

Frequencies of r(d) values

• Real queries (top):
  • maximum r(d)=4319
  • tend to retrieve a few documents more often
• Simulated queries (bottom):
  • maximum r(d)=807
  • tend to retrieve a larger number of documents

[Figure: frequency distributions of r(d) values for real queries (top, r(d) from 0 to 4000) and simulated queries (bottom, r(d) from 0 to 800); x-axis: r(d), y-axis: counts on a logarithmic scale]

Are r(d) Values Meaningful?

• Created 4 subsets of documents according to their r(d) score and selected a set of target documents from each subset

• Generated queries from the selected documents, tailored to retrieve these specific documents

• Performed the search tasks and measured the ranks of the target documents

• Showed that documents with a lower r(d) score are actually harder to find

Subsets: hardly retrieved, retrieved a few times, often retrieved, very often retrieved

Document Features

Time of Publishing

[Figure: mean r(d) per bin (y-axis, 0 to 3) for 20 bins by publishing date (x-axis): 1618-1862, 1862-1891, 1891-1904, 1904-1913, 1913-1920, 1920-1926, 1926-1929, 1929-1932, 1932-1935, 1935-1939, 1939-1941, 1941-1948, 1948-1956, 1956-1963, 1963-1969, 1969-1974, 1974-1979, 1979-1984, 1984-1989, 1989-2011]

• Documents ordered by publishing date

• Split into 20 bins of equal size

• Mean r(d) per bin
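The binning steps above can be sketched as follows, assuming "equal size" means an equal number of documents per bin rather than an equal time span. The function name and the toy (year, r(d)) pairs are hypothetical.

```python
def mean_r_per_bin(pairs, n_bins):
    """Order documents by a feature, cut them into equally populated bins,
    and return (lowest feature, highest feature, mean r(d)) per bin.

    pairs: list of (feature_value, r_d) tuples, e.g. (publishing year, r(d)).
    """
    ordered = sorted(pairs, key=lambda p: p[0])
    n = len(ordered)
    result = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        chunk = ordered[start:end]
        if not chunk:
            continue  # more bins than documents
        mean_r = sum(r for _, r in chunk) / len(chunk)
        result.append((chunk[0][0], chunk[-1][0], mean_r))
    return result


# Hypothetical toy data: (publishing year, r(d)).
pairs = [(1995, 0), (1618, 1), (1900, 3), (1950, 2)]
print(mean_r_per_bin(pairs, n_bins=2))
```

The same helper works for any other orderable document feature, such as length or OCR confidence, by swapping the first element of each pair.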

OCR Confidence Scores

[Figure: mean r(d) per bin (y-axis, 0.5 to 2.0) against bins based on page confidence (PC) (x-axis, 0 to 5000)]

• Documents ordered by page confidence (PC)

• Split into bins according to PC value

• Mean r(d) per bin


Document Length

• Varies from 33 to 381,563 words (mean = 362)

• Documents ordered by length and split into bins of 20,000 documents each

• LM1000 (left): upward trend, longer documents more retrievable

• BM25 and TFIDF (right): seem to be better at retrieving documents of medium length

[Figure: mean r(d) per bin (y-axis) against bins based on document length (x-axis, 0 to 5000); left panel: LM1000, right panel: BM25 and TFIDF]

Newspaper Titles

• Number of articles per title ranges from 1 to 16,348,557 (mean 82,638, median 127)

• Subset of the 10 most prevalent newspaper titles

• Mean r(d)

• Top 3 titles are regional ones

Newspaper Title                                        Mean r(d)
Leeuwarder courant: hoofdblad van Friesland            0.15
Nieuwsblad van het Noorden                             0.14
Limburgsch dagblad                                     0.12
Het vrije volk: democratisch-socialistisch dagblad     0.10
De Tijd: godsdienstig-staatkundig dagblad              0.08
Het Vaderland: staat- en letterkundig nieuwsblad       0.07
Leeuwarder courant                                     0.07
Algemeen Handelsblad                                   0.06
De Telegraaf                                           0.06
Rotterdamsch nieuwsblad                                0.05

Document Types

• Hardly any differences for simulated queries

• In real queries, the official notifications stand out with a much higher score

Mean r(d) for BM25, c=100

                  Real    Simulated
Article           0.90    3.89
Advertisement     0.51    3.32
Notification*     4.80    3.22
Caption           0.84    3.06

* Familiebericht (family announcement)

Representativeness of our Study

Top retrieved article for real queries

Top retrieved article(s) for simulated queries

Differences between query sets

• Real queries:
  • Mean length: 2.32 terms
  • Unique terms: 253,637
  • 56 references to persons or locations in top 100 terms
• Simulated queries:
  • Mean length: 1.5 terms
  • Unique terms: 2,028,617
  • 5 references to persons or locations in top 100 terms
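Two of the statistics above, mean query length and number of unique terms, can be computed with a sketch like the one below. It assumes whitespace tokenization, which may differ from the tokenization used in the study; the function name and the toy queries are hypothetical.

```python
def query_set_stats(queries):
    """Mean query length (in terms) and number of unique terms in a query set."""
    token_lists = [q.split() for q in queries]
    n_terms = sum(len(tokens) for tokens in token_lists)
    unique_terms = {term for tokens in token_lists for term in tokens}
    return {
        "mean_length": n_terms / len(token_lists),
        "unique_terms": len(unique_terms),
    }


# Hypothetical toy query log.
stats = query_set_stats(
    ["koningin wilhelmina", "amsterdam", "haven rotterdam amsterdam"]
)
print(stats)
```

Counting references to persons or locations would additionally need a named-entity list or tagger, which is not sketched here.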

Document views

• 2.7M out of 102M documents were viewed by users

• Shape of the frequency distribution plot is very similar to the r(d) frequency plots

• Most documents were viewed only once; very few were viewed more often

[Figure: frequency distribution of document views; x-axis: number of views (5 to 700), y-axis: counts on a logarithmic scale]

Overlap with views

• How many documents were viewed by Delpher users, but not retrieved in our study?
• Many non-retrieved documents
  • were found using facets, operators
  • scored a rank just below the cutoff
• Use a smoother cost function based on the ranking
• Better representation of the real search engine, taking faceted search / operators into account
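One way to picture a smoother, rank-based cost function is to replace the hard cutoff with a discount that decays with rank. The sketch below uses a gravity-style weight 1/rank^β as an example; whether this is the variant the authors have in mind is not stated, and all names and the toy rankings are hypothetical.

```python
from collections import defaultdict


def retrievability(rankings, weight):
    """Sum a rank-dependent weight over all queries for each document."""
    scores = defaultdict(float)
    for ranked_docs in rankings.values():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += weight(rank)
    return dict(scores)


def hard_cutoff(c):
    """Weight 1 for ranks up to c, 0 below the cutoff."""
    return lambda rank: 1.0 if rank <= c else 0.0


def gravity(beta):
    """Weight that decays smoothly with rank: 1 / rank**beta."""
    return lambda rank: 1.0 / rank ** beta


# Hypothetical rankings in which d3 always lands just below a cutoff of 2.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d2", "d3"],
}
hard = retrievability(rankings, hard_cutoff(2))
smooth = retrievability(rankings, gravity(1.0))
print(hard["d3"], round(smooth["d3"], 3))
```

With the hard cutoff, d3 scores 0 although it is consistently ranked third; with the smooth weight it keeps a small non-zero score.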

[Figure: viewed documents split into retrieved and non-retrieved in the study (y-axis, 0 to 3), for c=10, c=100 and c=1000]

Document Types - revisited

                         r(d) Real    r(d) Simulated    Viewed
Article                  0.90         3.89              2.61%
Advertisement            0.51         3.32              2.07%
Official Notification    4.80         3.22              40.10%
Caption                  0.84         3.06              4.01%

Parameter Sets for Preprocessing

Parameters              Stemming    Stopwords    Operators
PS1 (as used by [1])    yes         removed      removed
PS2                     no          kept         removed
PS3 (only LM1000)       yes         removed      kept

Overlap of Retrieved and Viewed Documents

[Figure: overlap between retrieved and viewed documents (y-axis, 0 to 2,000,000) by retrieval model (BM25, TFIDF, LM1000) and parameter set (PS1, PS2, PS3), for c=10 and c=100]

Conclusions

• Real and simulated queries differ with regard to
  • composition of query sets
  • number of (unique) terms used
  • use of named entities
• Apart from document length and page confidence, we did not find strong evidence for technical bias
• Using real queries is important for realistic results
• Simulation strategies for queries need to be improved