
Page 1: Resource Selection in Distributed Information Retrieval – an Experimental Study

Resource Selection in Distributed Information Retrieval – an Experimental Study

Hans Friedrich Witschel
(formerly) University of Leipzig
(now) SAP Research CEC Karlsruhe

Page 2: Resource Selection in Distributed Information Retrieval – an Experimental Study


Overview

1. Motivation
2. Problem definition
3. Solutions to be explored
4. Experimental setup
5. Results
6. Conclusions

Page 3: Resource Selection in Distributed Information Retrieval – an Experimental Study


Motivation

Page 4: Resource Selection in Distributed Information Retrieval – an Experimental Study


Motivation: Resource selection

[Figure: four peers, each characterised by a term cloud:
treatment, surgery, radiation, oncology, diagnostic, bone marrow, urology;
statics, building, project, landscape, anti-seismic, design, cubature;
client, server, servent, p2p, terms, algorithm, ranking;
combine harvester, cattle, crops, tractor, agricultural, acres]

Whom could I ask about "information retrieval"?

Page 5: Resource Selection in Distributed Information Retrieval – an Experimental Study


Motivation: Resource selection

Reason for selecting only a subset of all available resources/peers: cost reduction

- Distributed IR (DIR): time and load on databases
- Peer-to-peer IR (P2PIR): number of messages

We will concentrate on P2PIR here.

Basic approach: treat peers/resources as giant documents, use existing (slightly modified) retrieval functions to rank them, and visit the top-ranked ones.

Page 6: Resource Selection in Distributed Information Retrieval – an Experimental Study


Problem definition

Page 7: Resource Selection in Distributed Information Retrieval – an Experimental Study


Problem definition: Assumptions

- Peers have profiles = lists of terms with weights (unigram language models)
- Two options:
  - Represent peers by what they have → extract terms from a peer's shared documents
  - Represent peers by the queries for which they provide relevant documents
- Profiles have to be compact in order to reduce communication overhead; the absolute size of a profile is dictated by the available (network) resources

Page 8: Resource Selection in Distributed Information Retrieval – an Experimental Study


Problem definition: Research questions

- How much will profile pruning degrade the quality of resource selection? That is, how many terms can we prune from a profile and still get acceptable results?
- What can be done to improve peer selection?
  - Improve queries → query expansion?
  - Improve profiles → profile adaptation?

Page 9: Resource Selection in Distributed Information Retrieval – an Experimental Study


Solutions to be explored

Page 10: Resource Selection in Distributed Information Retrieval – an Experimental Study


Solutions to be explored: Preliminaries

- Profiles: use CORI to weight the terms t in the collection of peer p; rank terms by P(t|p)
- Compression: apply simple thresholding
- Profile sizes: 10, 20, 40, 80, 160, 320, 640, unpruned
- Global term weights (the I component of CORI): use an external reference corpus to estimate idf values
- Local retrieval function at each peer: BM25, using the same idf estimates as above

⇒ document scores are comparable across all peers
⇒ we can concentrate on the resource selection process; results are not blurred by result-merging effects
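To make the weighting and pruning concrete, here is a minimal Python sketch. It uses the standard CORI belief formula with its usual default constants (b = 0.4 and the 50/150 smoothing terms); per the slide, the I (idf) component comes from an external reference corpus, so it is passed in as a precomputed map. All names are illustrative, not from the original.

```python
import math

DEFAULT_B = 0.4  # CORI's default belief parameter


def cori_weight(df, idf, cw, avg_cw, b=DEFAULT_B):
    """CORI belief P(t|p) for one term in peer p's collection.

    df:     frequency of the term in p's collection,
    idf:    global I component (here taken from an external reference corpus),
    cw:     total number of term occurrences in p's collection,
    avg_cw: average collection size over all peers."""
    t = df / (df + 50 + 150 * cw / avg_cw)  # T component (tf-like)
    return b + (1 - b) * t * idf


def build_profile(term_freqs, idf, cw, avg_cw, size=40):
    """Pruned profile: simple thresholding keeps only the `size`
    highest-weighted terms."""
    weights = {t: cori_weight(f, idf.get(t, 0.0), cw, avg_cw)
               for t, f in term_freqs.items()}
    top = sorted(weights.items(), key=lambda kv: -kv[1])[:size]
    return dict(top)
```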

Page 11: Resource Selection in Distributed Information Retrieval – an Experimental Study


Solutions to be explored: Baselines

- Random: rank peers in random order
- By-size: rank peers by the number of documents they hold, independent of the content they offer
- Base CORI: rank peers by the sum of the CORI weights of the terms contained in both the query and the peer's profile
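A minimal sketch of the three baselines, assuming profiles built as above (dicts mapping terms to CORI weights) and a hypothetical `n_docs` map from peer to collection size:

```python
import random


def rank_random(peers):
    """Random baseline: peers in random order."""
    order = list(peers)
    random.shuffle(order)
    return order


def rank_by_size(peers, n_docs):
    """By-size baseline: n_docs[p] = number of documents peer p holds."""
    return sorted(peers, key=lambda p: n_docs[p], reverse=True)


def rank_base_cori(peers, profiles, query_terms):
    """Base CORI: sum of CORI weights of terms that occur in both the
    query and the peer's (pruned) profile."""
    def score(p):
        return sum(profiles[p].get(t, 0.0) for t in query_terms)
    return sorted(peers, key=score, reverse=True)
```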

Page 12: Resource Selection in Distributed Information Retrieval – an Experimental Study


Solutions to be explored: Query expansion

All methods use local context analysis (LCA). Input passages are taken from:

- The web: snippets of the top 10 results returned by the Yahoo! API for the query
- Local documents: the 10 best documents returned by the highest-ranked peer (local pseudo-feedback)

For comparison ("upper QE baseline"): use a global view of the collection (global pseudo-feedback)
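Full local context analysis ranks expansion concepts with a specific co-occurrence formula; the sketch below is a deliberately simplified co-occurrence count in that spirit, not the exact LCA scoring:

```python
from collections import Counter


def expand_query(query_terms, passages, k=5):
    """Add the k terms that co-occur most often with the query terms in
    the feedback passages (web snippets or locally retrieved documents)."""
    qset = set(query_terms)
    scores = Counter()
    for passage in passages:
        tokens = passage.lower().split()
        if qset & set(tokens):  # passage mentions at least one query term
            scores.update(t for t in tokens if t not in qset)
    return list(query_terms) + [t for t, _ in scores.most_common(k)]
```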

Page 13: Resource Selection in Distributed Information Retrieval – an Experimental Study


Solutions to be explored: Profile adaptation

- Idea: boost the weight of term t in peer p's profile if p has successfully answered a query containing t
- Aim: the profile should allow the peer to answer popular queries for which it has many relevant documents
- Can be done using a query log
- Extensions: collaborative tagging approach, user interaction etc. (hard to evaluate)

Page 14: Resource Selection in Distributed Information Retrieval – an Experimental Study


Solutions to be explored: Profile adaptation

Update formula for term i in the profile of peer p (given as an equation on the slide), with:

- Dp = documents returned by p
- Do = documents returned by all peers contacted
- AVGRP = average relative precision (RP) over all peers the query has reached

The update is only executed if the ratio is > 1, i.e. if p's results are "better" than the average.

For evaluation purposes: split a query log into a training and a test set, and use the training set for updating profiles.
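The update formula itself appears only as an image on the slide and is not recoverable from this transcript. The sketch below is therefore an assumption: a multiplicative boost of the query terms, triggered only by the stated condition (p's relative precision exceeds AVGRP); the learning rate `lr` is hypothetical.

```python
def adapt_profile(profile, query_terms, rp_p, avg_rp, lr=0.1):
    """Boost query terms in peer p's profile after a successfully
    answered query.

    rp_p:   relative precision of p's own results for the query,
    avg_rp: AVGRP, the average RP over all peers the query reached.
    The multiplicative boost is an assumption; only the trigger
    condition (ratio > 1) is stated on the slide."""
    if avg_rp <= 0:
        return profile
    ratio = rp_p / avg_rp
    if ratio > 1:  # p's results beat the average
        for t in query_terms:
            if t in profile:
                profile[t] *= 1 + lr * (ratio - 1)
    return profile
```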

Page 15: Resource Selection in Distributed Information Retrieval – an Experimental Study


Experimental Setup

Page 16: Resource Selection in Distributed Information Retrieval – an Experimental Study


Experimental Setup: Simplifying it…

- Evaluate distributed IR only, instead of running a full P2PIR simulation
- Decouple query routing from other aspects (overlay topology etc.)
- This considerably reduces the number of free parameters
- Underlying assumption: a resource selection algorithm A that works better than algorithm B for DIR will also be better for P2PIR (i.e. when only a subset of all resources is visible)
- A DIR scenario corresponds to a fully connected P2P overlay (e.g. PlanetP)

Page 17: Resource Selection in Distributed Information Retrieval – an Experimental Study


Experimental Setup: Parameterising it…

DIR evaluation, but with parameters typical of P2PIR settings:

- Pruned profiles
- >> 1,000 peers
- Peer collections: small and semantically (relatively) homogeneous

All of this in contrast to typical DIR settings.

Page 18: Resource Selection in Distributed Information Retrieval – an Experimental Study


Experimental Setup: Applying it…

Basic evaluation procedure (sketched below):

1. Obtain a ranking R of all peers w.r.t. query q
2. Visit the top 100 peers in the order implied by R
3. After visiting each peer, merge the documents found so far into a ranking S; judge the quality of R by the quality of S, using e.g. relevance judgments for documents
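A minimal sketch of this loop; `retrieve`, `merge` and `quality` are placeholder callables (e.g. BM25 retrieval at each peer, score-based merging, and relative precision against a centralised reference run):

```python
def evaluate_resource_ranking(peer_ranking, query, retrieve, merge, quality,
                              top_n=100):
    """Visit the top_n peers in the order implied by ranking R; after each
    visit, merge the results found so far into ranking S and record its
    quality."""
    merged = []  # ranking S, grows as more peers are visited
    curve = []
    for peer in peer_ranking[:top_n]:
        merged = merge(merged, retrieve(peer, query))
        curve.append(quality(merged))
    return curve  # quality of S after visiting 1, 2, ..., top_n peers
```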

Page 19: Resource Selection in Distributed Information Retrieval – an Experimental Study


Experimental Setup: Test collections

a) Digital library scenario: peers = topics

- Ohsumed: medical abstracts, annotated with Medical Subject Headings (MeSH)
- GIRT: German sociology abstracts, annotated with terms from a thesaurus
- For both collections, queries and relevance judgments are available

b) Individuals sharing publications:

- CiteSeer abstracts, with peers = (co-)authors
- A query log is available, but no relevance judgments

Page 20: Resource Selection in Distributed Information Retrieval – an Experimental Study


Experimental Setup: Evaluation measures

- Missing relevance judgements: introduce a new measure, relative precision (RP)
- Idea: compare a given ranking D with the ranking C of a reference retrieval system (here: a centralised system)
- The probability of relevance of a document is estimated as the inverse of its rank in the reference ranking
- RP@k = average probability of relevance among the first k documents of ranking D

Example: C = [K, L, M, N, O, P], D = [L, M, O]. The documents L, M and O appear at ranks 2, 3 and 5 of C, so

RP@3 = (1/2 + 1/3 + 1/5) / 3 ≈ 0.344
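A minimal sketch of RP@k under this definition, assuming documents absent from the reference ranking get probability 0:

```python
def rp_at_k(d_ranking, c_ranking, k):
    """Relative precision at k: average estimated probability of relevance
    over the first k documents of D, where a document's probability of
    relevance is the inverse of its rank in the reference ranking C
    (assumed 0 if the document does not occur in C)."""
    ref_rank = {doc: r for r, doc in enumerate(c_ranking, start=1)}
    probs = [1.0 / ref_rank[d] if d in ref_rank else 0.0
             for d in d_ranking[:k]]
    return sum(probs) / k

# The slide's example:
# rp_at_k(["L", "M", "O"], ["K", "L", "M", "N", "O", "P"], 3)
# = (1/2 + 1/3 + 1/5) / 3 ≈ 0.344
```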

Page 21: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results

Page 22: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Profile pruning, CiteSeer

Page 23: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Profile pruning, GIRT

Page 24: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Profile pruning, space savings

Page 25: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Qualitative analysis

Page 26: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Query expansion

M = intervals where QE runs significantly better than the baseline; M′ = intervals where QE runs significantly worse.

Page 27: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Profile adaptation

Page 28: Resource Selection in Distributed Information Retrieval – an Experimental Study


Results: Profile adaptation, delayed updates

Page 29: Resource Selection in Distributed Information Retrieval – an Experimental Study


Conclusions

Page 30: Resource Selection in Distributed Information Retrieval – an Experimental Study


Conclusions

Profile pruning:
- Pruning profiles hurts performance less than expected
- Whether or not pruning to a predefined size hurts does not necessarily depend on the original profile size
- In the experiments, it was always safe to prune for (total) space savings of 90%

"Advanced" techniques:
- Query expansion: hurts performance more often than it improves it
- Profile adaptation:
  - Stable improvement of over 10% among the first 15 peers visited
  - Especially high improvement for the highest-ranked peer
  - Delayed updates do not hurt effectiveness (weak locality)

Page 31: Resource Selection in Distributed Information Retrieval – an Experimental Study


Questions?