from federated to aggregated search

From federated to aggregated search. Fernando Diaz, Mounia Lalmas and Milad Shokouhi

Upload: mounia-lalmas

Post on 10-May-2015


DESCRIPTION

SIGIR 2010 Tutorial, with Fernando Diaz & Milad Shokouhi

TRANSCRIPT

Page 1: From federated to aggregated search

From federated to aggregated search

Fernando Diaz, Mounia Lalmas and Milad Shokouhi


Page 2: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 3: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 4: From federated to aggregated search

Introduction

What is federated search?
What is aggregated search?

Motivations
Challenges
Relationships

Page 5: From federated to aggregated search

A classical example of federated search

www.theeuropeanlibrary.org

Collections to be searched

One query

Page 6: From federated to aggregated search

A classical example of federated search

www.theeuropeanlibrary.org

Merged list of results

Page 7: From federated to aggregated search

Motivation for federated search

Search a number of independent collections, with a focus on hidden web collections
Collections not easily crawlable (and often should not be)
Access to up-to-date information and data
Parallel search over several collections
Effective tool for enterprise and digital library environments

Page 8: From federated to aggregated search

Challenges for federated search

How to represent collections, so that we know what documents each contains?

How to select the collection(s) to be searched for relevant documents?

How to merge results retrieved from several collections, to return one list of results to the users?

Cooperative environment
Uncooperative environment

Page 9: From federated to aggregated search

From federated search to aggregated search

“Federated search on the web”

Peer-to-peer network connects distributed peers (usually for file sharing), where each peer can be both server and client

Metasearch engine combines the results of different search engines into a single result list

Vertical search – also known as aggregated search – adds the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results

Page 10: From federated to aggregated search

A classical example of aggregated search

News

Homepage

Wikipedia

Real-time results

Video

Twitter

Structured data

Page 11: From federated to aggregated search

Motivation for aggregated search

Increasingly different types of information are available, sought and relevant
e.g. news, image, wiki, video, audio, blog, map, tweet

Search engine allows accessing these through so-called verticals

Two “ways” to search
  Users can directly search the verticals
  Or rely on so-called aggregated search

Google universal search 2007: [ … ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ … ] will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results. http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html

Page 12: From federated to aggregated search

Motivation for aggregated search

(Arguello et al, 09)

25K editorially classified queries

Page 13: From federated to aggregated search

Motivation for aggregated search

Page 14: From federated to aggregated search

Motivation for aggregated search

Page 15: From federated to aggregated search

Challenges in aggregated search

Extremely heterogeneous collections

What is/are the vertical intent(s)?
  Handling ambiguous (query | vertical) intent
  Handling non-stationary intent (e.g. news, local)

How many results from each to return and where to position them in the result page?
  Slotting results
  Users looking at 1st result page

Page optimization and its evaluation

Page 16: From federated to aggregated search

Ambiguous non-stationary intent

Query - Travel - Mollusk - Paul

Vertical - Wikipedia - News - Image

Page 17: From federated to aggregated search

Recap – Introduction

                           federated search   aggregated search
heterogeneity              low                high
scale (documents, users)   small              large
user feedback              little             a lot

Page 18: From federated to aggregated search

Terminology

1. federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network

2. resource, vertical, database, collection, source, server, domain, genre

3. merging, blending, fusion, aggregation, slotted, tiled

Page 19: From federated to aggregated search

Problem definition

Present the “querier” with a summary of search results from one or more resources.

Page 20: From federated to aggregated search

General architecture

[Figure: the user sends a raw query to a search interface/portal/broker, which sends queries to several sources/servers/verticals]

Page 21: From federated to aggregated search

Peer-to-peer network

Peer Directory Server

Page 22: From federated to aggregated search

Peer-to-peer (P2P) networks

Broker-based
  Single centralized broker with document lists shared from peers (e.g. Napster, original version)
Decentralized
  Each peer acts as both client and server (e.g. Gnutella v0.4)
Structure-based
  Use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
Hierarchical
  Use local directory services for routing and merging (e.g. Swapper.NET)

Page 23: From federated to aggregated search

Federated search

[Figure: a query goes to the broker, which holds summaries (SumA to SumE) of Collections A to E, sends queries to the collections and returns merged results]

Page 24: From federated to aggregated search

Federated search

Also known as distributed information retrieval (DIR) system

Provides one portal for searching information from multiple sources
  corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage

Funnelback, Westlaw, FedStats, Cheshire, etc (see also http://federatedsearchblog.com/)

Page 25: From federated to aggregated search

http://funnelback.com/pdfs/brochures/enterprise.pdf

Page 26: From federated to aggregated search

Metasearch

[Figure: the user sends a raw query to a metasearch engine, which issues queries to several engines on the WWW]

Page 27: From federated to aggregated search

Metasearch

Search engine that queries several different search engines and either combines their results (blended) or displays them separately (non-blended)

Does not crawl the web but relies on data gathered by other search engines

Dogpile, Metacrawler, Search.com, etc (see http://www.cryer.co.uk/resources/searchengines/meta.htm)

Page 28: From federated to aggregated search

Aggregated search

[Figure: a user query (example: Angelina Jolie) is sent to several indexes, including the WWW (text) index, and the results are returned to the user]

Page 29: From federated to aggregated search

Aggregated search

Specific to a web search engine

“Increasingly” more than one type of information is relevant to an information need
  mostly web page + image, map, blog, etc

These types of information are indexed and ranked using dedicated approaches (verticals)

Presenting the results from verticals in an aggregated way is believed to be more useful

All major search engines are doing some level of aggregated search

Page 30: From federated to aggregated search

Data fusion (e.g. Voorhees et al, 95)

[Figure: a query is run over one document collection (GOV2) with different document representations (anchor only, title only) and different retrieval models (BM25, KL, Inquery); merging produces one ranked list of results]

Page 31: From federated to aggregated search

Data fusion

Search one collection

Document can be indexed in different ways
  Title index, abstract index, etc (poly-representation)
  Weighting scheme

Different retrieval models

Rankings generated by different retrieval models (or different document representations) merged to produce the final rank

Has often been shown to improve retrieval performance (TREC)

Page 32: From federated to aggregated search

Terminology - Resource

Source, Server, Database, Collection (federated search)

Server, Vertical (aggregated search)

Domain, Genre

Page 33: From federated to aggregated search

Terminology - Aggregation

Merging, Blending, Fusion

Slotted, Tiled

Page 34: From federated to aggregated search

Aggregated search (tiled)

http://au.alpha.yahoo.com/

Page 35: From federated to aggregated search

Aggregated search (tiled)

Naver.com

Page 36: From federated to aggregated search

Aggregated search (slotted)

Page 37: From federated to aggregated search

Others

Clustering
Faceted search
Multi-document summarization
Document generation
Entity search

(see special issue – in press – on “Current research in focused retrieval and result aggregation”, Journal of Information Retrieval (Trotman et al, 10))

Page 38: From federated to aggregated search

Yippy – Clustering search engine from Vivisimo

clusty.com

Page 39: From federated to aggregated search

Faceted search

Page 40: From federated to aggregated search

Multi-document summarization

http://newsblaster.cs.columbia.edu/

Page 41: From federated to aggregated search

“Fictitious” document generation

(Paris et al, 10)

Page 42: From federated to aggregated search

Entity search

http://sandbox.yahoo.com/Correlator

Page 43: From federated to aggregated search

Recap

Shown the relations between federated, aggregated search, and others

Exposed the various terminologies used

In the rest of the tutorial, we concentrate on federated search and aggregated search

Focus is on “effective search”

Page 44: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 45: From federated to aggregated search

Architecture: what are the general components of federated and aggregated search systems.

Page 46: From federated to aggregated search

Federated search architecture

Page 47: From federated to aggregated search

Aggregated search architecture

Pre-retrieval aggregation: decide verticals before seeing results

Post-retrieval aggregation: decide verticals after seeing results

Pre-web aggregation: decide verticals before seeing web results

Post-web aggregation: decide verticals after seeing web results

Page 48: From federated to aggregated search

Post-retrieval, pre-web

Page 49: From federated to aggregated search

Pre and post-retrieval, pre-web

Page 50: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 51: From federated to aggregated search

Resource representation: how to represent resources, so that we know what documents each contains.

Page 52: From federated to aggregated search

Resource representation in federated search
(Also known as resource summary/description)

Page 53: From federated to aggregated search

Resource representation

Cooperative environments
  Comprehensive term statistics
  Collection size information

Uncooperative environments
  Query-based sampling
  Collection size estimation

Page 54: From federated to aggregated search

Resource representation (cooperative environments)

STARTS Protocol (Gravano et al, 97)
  Source metadata
  Rich query language

Page 55: From federated to aggregated search

Resource representation (cooperative environments)

Different types of term statistics (Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97)

Anchor-text
  HARP (Hawking and Thomas, 05)

Page 56: From federated to aggregated search

Resource representation (uncooperative environments)

Query-based sampling (Callan and Connell, 01)
  Select a query, probe collection
  Download the top n documents
  Select the next query, repeat

[Figure: a query selector sends a query to the collection and receives sampled documents]
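The probe, download, repeat loop above can be sketched as follows. This is a minimal illustration, not the procedure of Callan and Connell as published: the `search` callable, the parameter defaults and the toy collection are all assumptions made for the example, and the next query is simply drawn at random from the terms learned so far.

```python
import random
from typing import Callable, List, Optional, Set


def query_based_sampling(
    search: Callable[[str, int], List[str]],
    seed_query: str,
    docs_per_query: int = 4,      # illustrative default, not from the slides
    max_queries: int = 50,
    target_size: int = 300,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Probe an uncooperative collection and return the sampled documents."""
    rng = rng or random.Random(0)
    sample: List[str] = []
    seen_docs: Set[str] = set()
    vocabulary: List[str] = []    # distinct terms learned from the sample so far
    seen_terms: Set[str] = set()
    query = seed_query
    for _ in range(max_queries):
        # Probe the collection and download the top documents.
        for doc in search(query, docs_per_query):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            sample.append(doc)
            for term in doc.lower().split():
                if term not in seen_terms:
                    seen_terms.add(term)
                    vocabulary.append(term)
        if len(sample) >= target_size or not vocabulary:
            break
        # Select the next query at random from the learned vocabulary.
        query = rng.choice(vocabulary)
    return sample


# Hypothetical stand-in for a remote search interface.
TOY_COLLECTION = [
    "federated search merges results",
    "aggregated search uses verticals",
    "query based sampling probes collections",
    "verticals include news and images",
]


def toy_search(query: str, n: int) -> List[str]:
    return [d for d in TOY_COLLECTION if query in d.split()][:n]


sample = query_based_sampling(toy_search, "search", docs_per_query=2, max_queries=20)
```

The sampled documents would then be indexed at the broker and used as the collection's summary.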

Page 57: From federated to aggregated search

Resource representation (uncooperative environments)

Query selector (Callan and Connell, 01)
  Other resource description (ord)
  Learned resource description (lrd)
  • Average tf, random, df, ctf

Query logs (Craswell, 00; Shokouhi et al, 07d)

Focused probing (Ipeirotis and Gravano, 02)

Page 58: From federated to aggregated search

Resource representation (uncooperative environments)

Adaptive sampling
  (Shokouhi et al, 06a): rate of visiting new vocabulary
  (Baillie et al, 06a): rate of sample quality improvement (reference query log)
  (Caverlee et al, 06): proportional document ratio (PD), proportional vocabulary ratio (PV), vocabulary growth (VG)

Page 59: From federated to aggregated search

Resource representation (uncooperative environments)

Improving incomplete samples
  Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04): topically related collections should share similar terms
  Q-pilot (Sugiura and Etzioni, 00): sampled documents + backlinks + front page

Page 60: From federated to aggregated search

Resource representation (collection size estimation)

Capture-recapture (Liu et al, 01)

Sample A (capture)

Sample B (recapture)

http://www.dorlingkindersley-uk.co.uk/static/cs/uk/11/clipart/nature/image_nature040.html
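The capture-recapture idea can be illustrated with the textbook Lincoln-Petersen estimator: if two independent samples A and B share some documents, the collection size is estimated as |A| x |B| / |A ∩ B|. The sketch below assumes that form; the estimator actually used by Liu et al may differ in its details, and the function name is made up for the example.

```python
from typing import Hashable, Iterable


def capture_recapture_size(sample_a: Iterable[Hashable],
                           sample_b: Iterable[Hashable]) -> float:
    """Lincoln-Petersen estimate of collection size from two samples of document ids."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)          # documents "recaptured" in the second sample
    if overlap == 0:
        raise ValueError("samples do not overlap; size cannot be estimated")
    return len(a) * len(b) / overlap


# Two samples of 100 document ids that share 50 ids.
estimate = capture_recapture_size(range(0, 100), range(50, 150))
```

The smaller the overlap between the two samples, the larger the estimated collection; very small overlaps make the estimate unstable.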

Page 61: From federated to aggregated search

Resource representation (collection size estimation)

Page 62: From federated to aggregated search

Resource representation (collection size estimation)

Multiple queries sampler (Thomas and Hawking, 07)

Random-walk sampler, and pool-based sampler (Bar-Yossef and Gurevich, 06)

Collection overlap estimation (Shokouhi and Zobel, 07)

Page 63: From federated to aggregated search

Resource representation (updating summaries)

(Ipeirotis et al, 05) (Shokouhi et al, 07a)

Page 64: From federated to aggregated search

Resource representation in aggregated search

Vertical content
  samples or access to vertical API
  represents content supply

Vertical query logs
  samples or access to historic vertical searches
  represents content demand

Page 65: From federated to aggregated search

Vertical content includes text

NEWS

Page 66: From federated to aggregated search

Vertical content includes structure

SPORTS

Page 67: From federated to aggregated search

Vertical content includes images

IMAGES

Page 68: From federated to aggregated search

Issues with vertical content

Dynamics
  some verticals become stale fast

Heterogeneous content
  heterogeneous ranking algorithms

Non-free text APIs
  affects query-based sampling

Page 69: From federated to aggregated search

Addressing content dynamics

sample most recently indexed documents (Diaz 09)

assumes users more likely to be interested in recent content

in practice, only need a fraction of the corpus to perform well

(Konig et al, 09)

Page 70: From federated to aggregated search

Addressing heterogeneous content

1. use text available with documents (e.g. captions)

2. manually map to surrogates (e.g. wikipedia pages)

(Arguello et al, 09)

performance of two different methods of dealing with heterogeneous content

Page 71: From federated to aggregated search

Vertical query logs

Queries issued directly to a vertical represent explicit vertical intent

Is similar to having a large body of labeled queries

Page 72: From federated to aggregated search

Issues with vertical query logs

Dynamics
  some verticals require temporally-sensitive sampling
  for example, we do not want to sample news query logs for a whole year

Non-free text APIs
  affects query modeling

Page 73: From federated to aggregated search

Hybrid approaches

Should only sample documents likely to be useful for vertical selection/merging
  e.g. a document which is never requested is not useful for representing a vertical

Suggests log-biased sampling (Shokouhi et al, 06; Arguello et al, 09)

Page 74: From federated to aggregated search

Recap – Resource representation

                              federated search               aggregated search
Representation completeness   low                            low-high
Representation generation     sampling/shared dictionaries   sampling, API
Freshness                     important                      critical

Page 75: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 76: From federated to aggregated search

Resource selection: how to select the resource(s) to be searched for relevant documents.

Page 77: From federated to aggregated search

Resource selection for federated search

[Figure: a query reaches the broker, which holds summaries (SumA to SumE) of Collections A to E and sends the query to some of the collections]

Page 78: From federated to aggregated search

Resource selection (lexicon-based methods)

“Big-document” bag of word summaries
  CORI (Callan et al, 95)
  GlOSS (Gravano et al, 94b)
  CVV (Yuwono and Lee, 97)

[Figure: Collections A, B and C are sampled, and the samples are held at the broker]

Page 79: From federated to aggregated search

Resource selection (lexicon-based methods)

CORI

GlOSS

Page 80: From federated to aggregated search

Resource selection (document-surrogate methods)

Sample documents with retained boundaries
  ReDDE (Si and Callan, 03a)
  CRCS (Shokouhi, 07a)
  SUSHI (Thomas and Shokouhi, 09)

[Figure: Collections A, B and C are sampled, and the samples are held at the broker]

Page 81: From federated to aggregated search

Resource selection (document-surrogate methods)

ReDDE
  ReDDE assumes that the top-ranked sampled documents are relevant.
  ReDDE estimates the size of collections by sample-resample
  Assuming that all collections have the same size we have: yellow > blue > red

CRCS is inspired by ReDDE but assigns different probability of relevance based on document position: red > yellow, blue

[Figure: the broker runs the query over the sampled documents and produces a ranking; colours mark the collection each sampled document came from]
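A rough sketch of ReDDE-style scoring, under stated assumptions: the broker has already ranked the sampled documents for the query, collection sizes have already been estimated (the sample-resample step is not shown), and the `ratio` default is illustrative rather than a recommended setting. Each sampled document stands for `estimated_size / sample_size` documents of its collection, and only sampled documents near the top of the estimated centralized ranking are counted as relevant.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def redde_scores(
    ranked_samples: List[Tuple[str, str]],   # (doc_id, collection), best first
    sample_sizes: Dict[str, int],
    estimated_sizes: Dict[str, float],
    ratio: float = 0.003,                    # illustrative cut-off, an assumption
) -> Dict[str, float]:
    """Estimate, per collection, how many documents fall in the top of a centralized ranking."""
    threshold = ratio * sum(estimated_sizes.values())
    scores: Dict[str, float] = defaultdict(float)
    covered = 0.0   # estimated number of documents ranked above the current one
    for _doc_id, collection in ranked_samples:
        if covered >= threshold:
            break
        # One sampled document represents this many documents in its collection.
        weight = estimated_sizes[collection] / sample_sizes[collection]
        scores[collection] += weight
        covered += weight
    return dict(scores)


ranking = [("d1", "yellow"), ("d2", "yellow"), ("d3", "blue"), ("d4", "red")]
scores = redde_scores(
    ranking,
    sample_sizes={"yellow": 2, "blue": 2, "red": 2},
    estimated_sizes={"yellow": 100.0, "blue": 100.0, "red": 100.0},
    ratio=0.5,
)
```

Collections are then selected in decreasing order of score. CRCS would replace the constant contribution per top-ranked document with one that decays with rank position.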

Page 82: From federated to aggregated search

SUSHI

Resource selection (document-surrogate methods)

http://www.monthly.se/nucleus/index.php?itemid=1464

Page 83: From federated to aggregated search

SUSHI

Resource selection (document-surrogate methods)

http://www.monthly.se/nucleus/index.php?itemid=1464

Page 84: From federated to aggregated search

SUSHI

Resource selection (document-surrogate methods)

Different regression functions for each collection and query

Scores are comparable (estimated over the same index)

http://www.monthly.se/nucleus/index.php?itemid=1464

Page 85: From federated to aggregated search

Resource selection (supervised methods)

Utility maximization techniques
  Model the search effectiveness
  DTF (Nottelmann and Fuhr, 03), UUM (Si and Callan, 04a), RUM (Si and Callan, 05b)

Classification-based methods
  Classify collections/queries for better selection
  Classification-aware server selection (Ipeirotis and Gravano, 08), classification-based resource selection (Arguello et al, 09a), learning from past queries (Cetintas et al, 09)

Page 86: From federated to aggregated search

Resource selection in aggregated search

Content-based predictors
  derived from (sampled) vertical content

Query string-based predictors
  derived from query text, independent of any resource associated with a vertical

Query log-based predictors
  derived from previous requests issued by users to the vertical portal

Page 87: From federated to aggregated search

Content-based predictors

Distributed information retrieval (DIR) predictors

Simple result set predictors
  numresults, score distributions, etc (Diaz 09; Konig et al, 09)

Complex result set predictors
  Clarity (Cronen-Townsend et al, 02)
  Autocorrelation (Diaz, 07)
  Many, many more (Hauff, 10)

Page 88: From federated to aggregated search

Issues with content-based predictors

DIR (usually) assumes homogeneous content types

performance predictors (usually) assume text corpora

assumes ranking function consistency
  between verticals
  between vertical selector machine and vertical ranker machine

verticals have different dynamics (e.g. news vs. image)

Page 89: From federated to aggregated search

String-based predictors

Dictionary lookups
  terms correlated with a vertical (e.g., movie titles)

Regular expressions
  patterns correlated with explicit vertical requests (e.g., obama news)

Named entities
  automatically-detected entity types (e.g., geographic entities)
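As a small illustration of the first two predictor types, the sketch below computes a dictionary-lookup feature and a regular-expression feature from the query text alone. The movie-title set and the news pattern are invented for the example; a real system would curate much larger lists. Named-entity detection is left out because it needs a trained tagger.

```python
import re
from typing import Dict

# Hypothetical, hand-curated resources for two verticals.
MOVIE_TITLES = {"inception", "toy story 3"}
NEWS_PATTERN = re.compile(r"\b(news|headlines)\b", re.IGNORECASE)


def string_predictors(query: str) -> Dict[str, bool]:
    """Compute query string-based predictors without touching any vertical."""
    normalized = query.lower().strip()
    return {
        # Dictionary lookup: does the query contain a known movie title?
        "movie_dictionary_hit": any(title in normalized for title in MOVIE_TITLES),
        # Regular expression: does the query explicitly ask for news?
        "explicit_news_request": bool(NEWS_PATTERN.search(normalized)),
    }


features = string_predictors("obama news")
```

Both features are cheap to compute, which is why they fit the pre-retrieval setting; their coverage is limited to whatever the lists and patterns contain.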

Page 90: From federated to aggregated search

String-based predictors

Issues
  curating lists and expressions (manual or automatic)
  terms included in dictionary manually vetted for relevance
  high precision/low recall

Page 91: From federated to aggregated search

Log-based predictors

Classification approaches (Beitzel et al, 07; Li et al, 08)

Language model approaches (Arguello et al, 09)

Issues
  verticals with structured queries (e.g. local)
  query logs with dynamics (e.g. news) (Diaz, 09)

Page 92: From federated to aggregated search

Comparing predictor performance

(Arguello et al, 09)

Page 93: From federated to aggregated search

Predictor cost

Pre-retrieval predictors
  computed without sending the query to the vertical
  no network cost

Post-retrieval predictors
  computed on the results from the vertical
  requires vertical support of web scale query traffic
  incurs network latency
  can be mitigated with vertical content caches

Page 94: From federated to aggregated search

Combining predictors

Use predictors as features for a machine-learned model

Training data
  1. editorial data
  2. behavioral data (e.g. clicks)
  3. other vertical data

(Diaz, 09; Arguello et al, 09; Konig et al, 09)
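To make "predictors as features" concrete, here is a minimal log-linear (logistic regression) model trained with stochastic gradient ascent on labelled feature vectors. It is only a sketch: the two features, the toy labels, the learning rate and the epoch count are assumptions for illustration, and the cited papers use richer feature sets and training procedures.

```python
import math
from typing import List, Sequence


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Probability that the vertical is relevant; the last weight is the bias."""
    z = weights[-1] + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_log_linear(
    examples: List[Sequence[float]],
    labels: List[int],
    learning_rate: float = 0.5,   # illustrative settings
    epochs: int = 500,
) -> List[float]:
    """Fit one weight per predictor plus a bias by stochastic gradient ascent."""
    weights = [0.0] * (len(examples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            error = y - predict(weights, x)
            for i, xi in enumerate(x):
                weights[i] += learning_rate * error * xi
            weights[-1] += learning_rate * error
    return weights


# Hypothetical predictor values for one vertical: [news-pattern score, movie-dictionary score]
examples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
labels = [1, 1, 0, 0]             # editorial labels: is the news vertical relevant?
weights = train_log_linear(examples, labels)
```

The same code works with click/skip labels in place of editorial ones; only the source of `labels` changes.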

Page 95: From federated to aggregated search

Editorial data

Data: <query,vertical,{+,-}>

Features: predictors based on f(query,vertical)

Models:
  log-linear (Arguello et al, 09)
  boosted decision trees (Arguello et al, 10)

Page 96: From federated to aggregated search

Combining predictors

(Arguello et al, 09)

Page 97: From federated to aggregated search

Click data

Data: <query,vertical,{click,skip}>, <query,vertical,click through rate>

Features: predictors based on f(query,vertical)

Models:
  log-linear (Diaz, 09)
  boosted decision trees (Konig et al, 09)

Page 98: From federated to aggregated search

Gathering click data

Exploration bucket: show suboptimal presentations in order to gather positive (and negative) click/skip data

Cold start problem: without a basic model, the best exploration is random

Random exploration results in poor user experience

Page 99: From federated to aggregated search

Gathering click data

Solutions
  reduce impact to small fraction of traffic/users
  train a basic high-precision non-click model (perhaps with editorial data)

Other issues
  Presentation bias: different verticals have different click-through rates a priori
  Position bias: different presentation positions have different click-through rates a priori

Page 100: From federated to aggregated search

Click precision and recall

(Konig et al, 09)

ability to predict queries using thresholded click-through-rate to infer relevance

Page 101: From federated to aggregated search

Non-target data

have training data no data

Page 102: From federated to aggregated search

Non-target data

Data: <query,source vertical,{+,-}>

Features: predictors based on f(query,target vertical)

Models:
  generic model + adaptation (Arguello et al, 10)

Page 103: From federated to aggregated search

Non-target data

(Arguello et al, 10)

Page 104: From federated to aggregated search

Generic model

Objective
  train a single model that performs well for all source verticals

Assumption
  if it performs well across all source verticals, it will perform well on the target vertical

(Arguello et al, 10)

Page 105: From federated to aggregated search

Non-target data

(Arguello et al, 10)

adapted model

Page 106: From federated to aggregated search

Adapted model

Objective
  learn non-generic relationship between features and the target vertical

Assumption
  can bootstrap from labels generated by the generic model

(Arguello et al, 10)

Page 107: From federated to aggregated search

Non-target query classification

(Arguello et al, 10)

average precision on target query classification; red (blue) indicates statistically significant improvements (degradations) compared to the single predictor

Page 108: From federated to aggregated search

Training set characteristics

What is the cost of generating training data?
  how much money?
  how much time?
  how many negative impressions as a result of exploration?

Are targets normalized?
  can we compare classifier output?

Page 109: From federated to aggregated search

Training set cost summary

Page 110: From federated to aggregated search

Online adaptation

Production vertical selection systems receive a variety of feedback signals
  clicks, skips
  reformulations

A machine-learned system can adjust predictions based on real time user feedback
  very important for dynamic verticals

(Diaz, 09; Diaz and Arguello, 09)

Page 111: From federated to aggregated search

Online adaptation

Passive feedback: adjust prediction/parameters in response to feedback
  allows recovery from false positives
  difficult to recover from false negatives

Active feedback/explore-exploit: opportunistically present suboptimal verticals for feedback
  allows recovery from both errors
  incurs exploration cost

(Diaz, 09; Diaz and Arguello, 09)
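The difference between passive and active feedback can be sketched with a toy selector that keeps a running click-through estimate for one (query, vertical) pair. This is not the method of the cited papers: the smoothed estimate, the display threshold and the epsilon-greedy exploration rule are assumptions chosen to keep the example short.

```python
import random
from typing import Optional


class OnlineVerticalSelector:
    """Toy selector for one (query, vertical) pair; all defaults are illustrative."""

    def __init__(self, prior_ctr: float, prior_weight: float = 10.0,
                 threshold: float = 0.1, epsilon: float = 0.05,
                 rng: Optional[random.Random] = None) -> None:
        # The offline model's prediction acts as pseudo-counts.
        self.clicks = prior_ctr * prior_weight
        self.impressions = prior_weight
        self.threshold = threshold
        self.epsilon = epsilon
        self.rng = rng or random.Random(0)

    def estimated_ctr(self) -> float:
        return self.clicks / self.impressions

    def should_show(self) -> bool:
        # Active feedback: occasionally show the vertical regardless of the estimate.
        if self.rng.random() < self.epsilon:
            return True
        return self.estimated_ctr() >= self.threshold

    def update(self, clicked: bool) -> None:
        # Passive feedback: only displayed verticals ever produce clicks or skips.
        self.impressions += 1
        self.clicks += 1 if clicked else 0


# epsilon=0 gives purely passive feedback: a false positive is corrected by skips.
passive = OnlineVerticalSelector(prior_ctr=0.5, prior_weight=2.0, threshold=0.3, epsilon=0.0)
shown_at_first = passive.should_show()
for _ in range(3):
    passive.update(clicked=False)
shown_after_skips = passive.should_show()
```

With `epsilon=0.0` a vertical that starts below the threshold is never shown and so never receives feedback, which is the false-negative problem noted above; a positive `epsilon` buys recovery at the price of some suboptimal impressions.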

Page 112: From federated to aggregated search

Online adaptation

Issues
  setting learning rate for dynamic intent verticals
  normalizing feedback signal across verticals
  resolving feedback and training signal (click≠relevance)

(Diaz, 09; Diaz and Arguello, 09)

Page 113: From federated to aggregated search

Recap – Resource selection

Page 114: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 115: From federated to aggregated search

Result presentation: how to return results retrieved from several resources to users.

Page 116: From federated to aggregated search

Result merging (metasearch engines)

Same source (web), different overlapped indexes
Document scores may not be available
Title, snippet, position and timestamps
  D-WISE (Yuwono and Lee, 96)
  Inquirus (Glover et al, 99)
  SavvySearch (Dreilinger and Howe, 97)

Page 117: From federated to aggregated search

Result merging (data fusion)

Same corpus
Different retrieval models
Document scores/positions available

Unsupervised techniques
  CombSUM, CombMNZ (Fox and Shaw, 93, 94)
  Borda fuse (Aslam and Montague, 01)

Supervised techniques
  Bayes-fuse, weighted Borda fuse (Aslam and Montague, 01)
  Segment-based fusion (Lillis et al 06, 08; Shokouhi 07b)
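The unsupervised fusion methods named above are simple enough to sketch directly. The versions below are simplified: scores are min-max normalized per run before summing (a common but not universal choice), CombMNZ multiplies by the number of runs that returned the document rather than the number of non-zero scores, and Borda fuse splits leftover points evenly among unranked documents.

```python
from collections import defaultdict
from typing import Dict, List


def min_max_normalize(run: Dict[str, float]) -> Dict[str, float]:
    """Scale one run's scores to [0, 1]."""
    if not run:
        return {}
    low, high = min(run.values()), max(run.values())
    if high == low:
        return {doc: 1.0 for doc in run}
    return {doc: (score - low) / (high - low) for doc, score in run.items()}


def comb_sum(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """CombSUM: sum of a document's normalized scores over all runs."""
    fused: Dict[str, float] = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    return dict(fused)


def comb_mnz(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """CombMNZ: CombSUM multiplied by the number of runs returning the document."""
    sums = comb_sum(runs)
    hits: Dict[str, int] = defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: sums[doc] * hits[doc] for doc in sums}


def borda_fuse(rankings: List[List[str]]) -> Dict[str, float]:
    """Borda count: n points for rank 1, n-1 for rank 2, and so on."""
    candidates = {doc for ranking in rankings for doc in ranking}
    n = len(candidates)
    points: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc in enumerate(ranking):
            points[doc] += n - position
        unranked = candidates - set(ranking)
        if unranked:
            # Leftover points are shared evenly among documents this run missed.
            leftover = sum(range(1, n - len(ranking) + 1))
            for doc in unranked:
                points[doc] += leftover / len(unranked)
    return dict(points)


runs = [{"a": 3.0, "b": 2.0, "c": 1.0}, {"b": 4.0, "a": 2.0}]
sum_scores = comb_sum(runs)
mnz_scores = comb_mnz(runs)
borda_scores = borda_fuse([["a", "b", "c"], ["b", "a"]])
```

In each case the final ranking sorts documents by the fused value in decreasing order.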

Page 118: From federated to aggregated search

Result merging in federated search

[Figure: the user's query goes to the broker, which holds summaries (SumA to SumE) of Collections A to E, queries the selected collections and returns merged results]

Page 119: From federated to aggregated search

Result merging

CORI (Callan et al, 95)
  Normalized collection score + normalized document score.
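The CORI merging heuristic is usually stated as D'' = (D' + 0.4 x D' x C') / 1.4, where C' and D' are the min-max normalized collection and document scores. The sketch below follows that form; it normalizes document scores using the observed minimum and maximum in each result list, which is a simplification of the original normalization, and the data structures are assumptions for the example.

```python
from typing import Dict, List, Tuple


def min_max(value: float, low: float, high: float) -> float:
    return 0.0 if high == low else (value - low) / (high - low)


def cori_merge(
    results: Dict[str, List[Tuple[str, float]]],   # collection -> [(doc_id, score)]
    collection_scores: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Merge per-collection result lists into one list sorted by the combined score."""
    c_low = min(collection_scores.values())
    c_high = max(collection_scores.values())
    merged: List[Tuple[str, float]] = []
    for collection, docs in results.items():
        if not docs:
            continue
        c_norm = min_max(collection_scores[collection], c_low, c_high)
        d_low = min(score for _, score in docs)
        d_high = max(score for _, score in docs)
        for doc_id, score in docs:
            d_norm = min_max(score, d_low, d_high)
            merged.append((doc_id, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged


merged = cori_merge(
    {"A": [("a1", 2.0), ("a2", 1.0)], "B": [("b1", 10.0), ("b2", 0.0)]},
    collection_scores={"A": 0.8, "B": 0.4},
)
```

Documents from highly ranked collections get a boost, so `a1` outranks `b1` here even though the raw score of `b1` is larger.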

Page 120: From federated to aggregated search

Result merging

SSL (Si and Callan, 2003b)

[Figure: the broker receives a query and a ranking from the selected resources; individual diagram labels are not recoverable]

Page 121: From federated to aggregated search

Result merging

http://upload.wikimedia.org/wikipedia/en/1/13/Linear_regression.png

[Figure axes: source-specific score (horizontal), broker score (vertical)]
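The regression idea behind SSL can be sketched as follows: documents that appear both in a source's result list and in the broker's sample index have two scores, so a line fitted on those overlap documents can map every source-specific score onto the broker's scale. This is a minimal least-squares sketch under that reading; the function names and the handling of small overlaps are assumptions, not the published algorithm.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("overlap documents all have the same source score")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def ssl_map_scores(
    source_scores: Dict[str, float],   # doc_id -> score returned by the source
    broker_scores: Dict[str, float],   # doc_id -> score in the broker's sample index
) -> Dict[str, float]:
    """Map source-specific scores to broker-comparable scores via overlap documents."""
    overlap = [doc for doc in source_scores if doc in broker_scores]
    if len(overlap) < 2:
        raise ValueError("need at least two overlap documents to fit a line")
    slope, intercept = fit_line(
        [source_scores[doc] for doc in overlap],
        [broker_scores[doc] for doc in overlap],
    )
    return {doc: slope * score + intercept for doc, score in source_scores.items()}


mapped = ssl_map_scores(
    {"d1": 1.0, "d2": 2.0, "d3": 3.0},
    {"d1": 0.2, "d3": 0.6},
)
```

One such mapping is fitted per selected source, after which all mapped scores can be sorted into a single list.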

Page 122: From federated to aggregated search

Result merging - Miscellaneous scenarios

Multi-lingual result merging
  SSL with logistic regression (Si and Callan, 05a; Si et al, 08)

Personalized metasearch (Thomas, 08)

Merging overlapped collections
  COSCO (Hernandez and Kambhampati 05): exact duplicates
  GHV (Bernstein et al, 06; Shokouhi et al, 07b): exact/near duplicates

Page 123: From federated to aggregated search

Slotted vs tiled result presentation

Images on top / Images in the middle / Images at the bottom
Images at top-right / Images on the left / Images at the bottom-right

3 verticals
3 positions
3 degrees of vertical intent (Sushmita et al, 10)

Page 124: From federated to aggregated search

Slotted vs tiled

Designers of aggregated search interfaces should account for the aggregation styles
  for both: vertical intent is key for deciding on position and type of “vertical” results
  slotted: accurate estimation of the best position of “vertical” result
  tiled: accurate selection of the type of “vertical” result

Page 125: From federated to aggregated search

Recap – Result presentation

                  federated search               aggregated search
Content type      homogeneous (text documents)   heterogeneous
Document scores   depends on environment         heterogeneous
Oracle            centralized index              none

Page 126: From federated to aggregated search

Outline: Introduction and Terminology; Architecture; Resource Representation; Resource Selection; Result Presentation; Evaluation; Open Problems; Bibliography

Page 127: From federated to aggregated search

Evaluation

Evaluation: how to measure the effectiveness of federated and aggregated search systems.

Page 128: From federated to aggregated search

Resource representation (summaries) evaluation – Federated search

CTF ratio (Callan and Connell, 01)

Spearman rank correlation coefficient (SRCC) (Callan and Connell, 01)

Kullback-Leibler divergence (KL) (Baillie et al, 06b; Ipeirotis et al, 05), topical KL (Baillie et al, 09)

Predictive likelihood (Baillie et al, 06a)
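Two of these measures are easy to illustrate. The CTF ratio is the share of term occurrences in the actual collection that are covered by the learned vocabulary, and the Spearman coefficient compares how the actual and learned summaries rank terms by frequency. The sketch below assumes frequency dictionaries as inputs and computes Spearman over the terms common to both summaries with average ranks for ties; the cited work may treat missing terms and ties differently.

```python
import math
from typing import Dict, List, Set


def ctf_ratio(actual_ctf: Dict[str, int], learned_vocabulary: Set[str]) -> float:
    """Fraction of the collection's term occurrences covered by the learned vocabulary."""
    total = sum(actual_ctf.values())
    covered = sum(count for term, count in actual_ctf.items() if term in learned_vocabulary)
    return covered / total if total else 0.0


def average_ranks(values: List[float]) -> List[float]:
    """Rank values in decreasing order (1 = largest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(actual_df: Dict[str, int], learned_df: Dict[str, int]) -> float:
    """Spearman rank correlation of term frequencies over the shared vocabulary."""
    terms = sorted(set(actual_df) & set(learned_df))
    if len(terms) < 2:
        raise ValueError("need at least two shared terms")
    ra = average_ranks([actual_df[t] for t in terms])
    rl = average_ranks([learned_df[t] for t in terms])
    mean_a, mean_l = sum(ra) / len(ra), sum(rl) / len(rl)
    cov = sum((a - mean_a) * (l - mean_l) for a, l in zip(ra, rl))
    var_a = sum((a - mean_a) ** 2 for a in ra)
    var_l = sum((l - mean_l) ** 2 for l in rl)
    if var_a == 0 or var_l == 0:
        raise ValueError("a constant ranking has no defined correlation")
    return cov / math.sqrt(var_a * var_l)


coverage = ctf_ratio({"search": 50, "query": 30, "vertical": 20}, {"search", "query"})
agreement = spearman({"a": 10, "b": 5, "c": 1}, {"a": 3, "b": 2, "c": 1})
```

A high CTF ratio with a low rank correlation would indicate a summary that has found the frequent terms but misjudges their relative importance.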

Page 129: From federated to aggregated search

Resource selection evaluation – Federated search

Page 130: From federated to aggregated search

Result merging evaluation – Federated search

Oracle
  Correct merging (centralized index ranking) (Hawking and Thistlewaite, 99)
  Perfect merging (ordered by relevance labels) (Hawking and Thistlewaite, 99)

Metrics
  Precision
  Correct matches (Chakravarthy and Haase, 95)

Page 131: From federated to aggregated search

Vertical Selection Evaluation – Aggregated search

Majority of publications focus on single vertical selection
  vertical accuracy, precision, recall

Evaluation data
  editorial data
  behavioral data

single vertical selection

Page 132: From federated to aggregated search

Editorial data

Guidelines
  judge relevance based on vertical results (implicit judging of retrieval/content quality)
  judge relevance based on vertical description (assumes idealized retrieval/content quality)

Evaluation metric derived from binary or graded relevance judgments

(Arguello et al, 09; Arguello et al, 10)

Page 133: From federated to aggregated search

Behavioral data

Infer relevance from behavioral data (e.g. click data)

Evaluation metric
  regression error on predicted CTR
  infer binary or graded relevance

(Diaz, 09; Konig et al, 09)

Page 134: From federated to aggregated search

Test collections (a la TREC)

quantity/media        text         image     video    total
size (G)              2125         41.1      445.5    2611.6
number of documents   86,186,315   670,439   1,253*   86,858,007

* There are on average more than 100 events/shots contained in each video clip (document)

Statistics on topics
number of topics                                      150
average rel docs per topic                            110.3
average rel verticals per topic                       1.75
ratio of “General Web” topics                         29.3%
ratio of topics with two vertical intents             66.7%
ratio of topics with more than two vertical intents   4.0%

(Zhou & Lalmas, 10)

Page 135: From federated to aggregated search

Test collections (a la TREC)

[Figure: existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, ...), each with topics, documents and relevance judgments (R/N), are mapped to (simulated) verticals (Blog, Reference (Encyclopedia), Image, General Web, Shopping, ...), giving for each topic the judged documents of every vertical V1 ... Vk]

Page 136: From federated to aggregated search

Recap – Evaluation

                  federated search               aggregated search
Editorial data    document relevance judgments   query labels
Behavioral data   none                           critical

Page 137: From federated to aggregated search

Outline
Introduction and Terminology
Architecture
Resource Representation
Resource Selection
Result Presentation
Evaluation
Open Problems
Bibliography

Page 138: From federated to aggregated search

Open problems in federated search

Beyond big document
  Classification-based server selection (Arguello et al, 09a)
  Topic modeling

Query expansion
  Previous techniques had little success (Ogilvie and Callan, 01; Shokouhi et al, 09)

Evaluating federated search
  Confounding factors

Federated search in other contexts
  Blog search (Elsas et al, 08; Seo and Croft, 08)

Effective merging
  Supervised techniques

Page 139: From federated to aggregated search

Open problems in aggregated search

Evaluation metrics
  slotted presentation
  tiled presentation
  metrics based on behavioral signals

Models for multiple verticals

Minimizing the cost for new verticals, markets

Page 140: From federated to aggregated search

Outline
Introduction and Terminology
Architecture
Resource Representation
Resource Selection
Result Presentation
Evaluation
Open Problems
Bibliography

Page 141: From federated to aggregated search

Bibliography

J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo, Sources of evidence for vertical selection. In SIGIR 2009 (2009).

J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceedings of the ACM CIKM, Pages 1277--1286, Hong Kong, China, 2009a.

J. Arguello, F. Diaz, J.-F. Paiement, Vertical Selection in the Presence of Unlabeled Verticals. In SIGIR 2010 (2010).

J. Aslam and M. Montague. Models for metasearch. In Proceedings of ACM SIGIR, pages 276--284, New Orleans, LA, 2001.

M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of distributed collections, In Proceedings of SPIRE, Pages 316--328, Glasgow, UK, 2006a.

M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of estimated resource description quality for distributed IR. In X. Jia, editor, Proceedings of the First International Conference on Scalable Information systems, page 41, Hong Kong, 2006b.

M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR, pages 485--496, Toulouse, France, 2009.

Page 142: From federated to aggregated search


Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. Proceedings of WWW, pages 367--376, Edinburgh, UK, 2006.

S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, and O. Frieder. Automatic classification of web queries using very large unlabeled query logs. ACM Trans. Inf. Syst. 25, 2 (2007), 9.

Y. Bernstein, M. Shokouhi, and J. Zobel. Compact features for detection of near-duplicates in distributed retrieval. Proceedings of SPIRE, Pages 110--121, Glasgow, UK, 2006.

J. Callan and M. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97--130, 2001.

J. Callan, Z. Lu, and B. Croft. Searching distributed collections with inference networks. In Proceedings of ACM SIGIR, pages 21--28, Seattle, WA, 1995.

J. Caverlee, L. Liu, and J. Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of ACM SIGIR, pages 340--347. Seattle, WA, 2006.

S. Cetintas, L. Si, and H. Yuan. Learning from past queries for resource selection. In Proceedings of ACM CIKM, pages 1867--1870, Hong Kong, China, 2009.

Page 143: From federated to aggregated search

B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic Combination of Multiple Ranked Retrieval Systems, ACM SIGIR, pp 173-181, 1994.

C. Baumgarten. A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval, ACM SIGIR, pp 246-253, 1999.

N. Craswell. Methods for Distributed Information Retrieval. PhD thesis, Australian National University, 2000.

S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. ACM SIGIR, pp 299–306, 2002.

A. Chakravarthy and K. Haase. NetSerf: using semantic knowledge to find internet information archives, ACM SIGIR, pp 4-11, Seattle, WA, 1995.

F. Diaz. Performance prediction using spatial autocorrelation. ACM SIGIR, pp. 583–590, 2007.

F. Diaz. Integration of news content into web results. ACM International Conference on Web Search and Data Mining, 2009.

F. Diaz and J. Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback, ACM SIGIR, 2009.

D. Dreilinger and A. Howe. Experiences with selecting search engines using metasearch. ACM Transactions on Information Systems, 15(3):195-222, 1997.

J. Elsas, J. Arguello, J. Callan, and J. Carbonell. Retrieval and feedback models for blog feed search, ACM SIGIR, pp 347-354, Singapore, 2008.


Page 144: From federated to aggregated search

E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user information needs, ACM CIKM, pp 210--216, 1999.

L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. Third International conference on Parallel and Distributed Information Systems, pp 103--106, Austin, TX, 1994a.

L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text database discovery problem. ACM SIGMOD, pp 126--137, Minneapolis, MN, 1994b.

L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for internet metasearching, ACM SIGMOD, pp 207--218, Tucson, AZ, 1997.

L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the internet, ACM Transactions on Database Systems, 24(2):229--264, 1999.

E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval Conference, pp 243-252, Gaithersburg, MD, 1993.

E. Fox and J. Shaw. Combination of multiple searches, Third Text REtrieval Conference, pp 105-108, Gaithersburg, MD, 1994.

J. French, and A. Powell. Metrics for evaluating database selection techniques, World Wide Web, 3(3):153--163, 2000.

C. Hauff. Predicting the Effectiveness of Queries and Retrieval Systems, PhD thesis, University of Twente, 2010.


Page 145: From federated to aggregated search

D. Hawking and P. Thomas. Server selection methods in hybrid portal search, ACM SIGIR, pp 75-82, Salvador, Brazil, 2005.

D. Hawking and P. Thistlewaite. Methods for information server selection, ACM Transactions on Information Systems, 17(1):40-76, 1999.

T. Hernandez and S. Kambhampati. Improving text collection selection with coverage and overlap statistics. WWW, pp 1128-1129, Chiba, Japan, 2005.

P. Ipeirotis and L. Gravano. When one sample is not enough: improving text database selection using shrinkage. ACM SIGMOD, pp 767-778, Paris, France, 2004.

P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB, pages 394-405, Hong Kong, China, 2002.

P. Ipeirotis and L. Gravano. Classification-aware hidden-web text database selection. ACM Transactions on Information Systems, 26(2):1-66, 2008.

P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano. Modeling and managing content changes in text databases, 21st International Conference on Data Engineering, pp 606-617, Tokyo, Japan, 2005.

A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries, ACM SIGIR, 2009.


Page 146: From federated to aggregated search

X. Li, Y.-Y. Wang, and A. Acero, Learning query intent from regularized click graphs, ACM SIGIR, pp. 339–346.

D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion, ACM SIGIR, pp 139-146, Seattle, WA, 2006.

K. Liu, C. Yu, and W. Meng. Discovering the representative of a search engine. ACM CIKM, pp 652-654, McLean, VA, 2002.

N. Liu, J. Yan, W. Fan, Q. Yang, and Z. Chen. Identifying Vertical Search Intention of Query through Social Tagging Propagation, WWW, Madrid, 2009.

W. Meng, Z. Wu, C. Yu, and Z. Li. A highly scalable and effective method for metasearch, ACM Transactions on Information Systems, 19(3):310-335, 2001.

W. Meng, C. Yu, and K. Liu. Building efficient and effective metasearch engines. ACM Computing Surveys, 34(1):48-89, 2002.

V. Murdock, and M. Lalmas. Workshop on aggregated search, SIGIR Forum 42(2): 80-83, 2008.

H. Nottelmann and N. Fuhr. Combining CORI and the decision-theoretic approach for advanced resource selection, ECIR, pp 138--153, Sunderland, UK, 2004.

P. Ogilvie and J. Callan. The effectiveness of query expansion for distributed information retrieval, ACM CIKM, pp 183--190, Atlanta, GA, 2001.

C. Paris, S. Wan and P. Thomas. Focused and aggregated search: a perspective from natural language generation, Journal of Information Retrieval, Special Issue, 2010.


Page 147: From federated to aggregated search

S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a major Korean search engine, Library & Information Science Research 31(2): 126-133, 2009.

F. Schumacher and R. Eschmeyer. The estimation of fish populations in lakes and ponds, Journal of the Tennessee Academy of Science, 18:228-249, 1943.

M. Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval, ECIR, pp 160-172, Rome, Italy, 2007a.

J. Seo and B. Croft. Blog site search using resource selection, ACM CIKM, pp 1053-1062, Napa Valley, CA, 2008.

M. Shokouhi. Segmentation of search engine results for effective data-fusion, ECIR, pp 185-197, Rome, Italy, 2007b.

M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates, ACM Transactions on Information Systems, 27(3):1-29, 2009.

M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped collections, ACM SIGIR, pp 495-502. Amsterdam, Netherlands, 2007.

M. Shokouhi, F. Scholer, and J. Zobel. Sample sizes for query probing in uncooperative distributed information retrieval, Eighth Asia Pacific Web Conference, pp 63--75, Harbin, China, 2006a.


Page 148: From federated to aggregated search

M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval, ACM SIGIR, pp 316-323, Seattle, WA, 2006b.

M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval, Information Processing and Management, 43(1):169-180, 2007d.

M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated search, ACM SIGIR, pp 427-434, Singapore, 2009.

L. Si and J. Callan. Unified utility maximization framework for resource selection, ACM CIKM, pages 32-41, Washington, DC, 2004a.

L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual ranked lists. Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria, 2005a. http://www.cs.purdue.edu/homes/lsi/publications.htm

L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments, Information Retrieval, 11(1):1--24, 2008.

L. Si and J. Callan. Relevant document distribution estimation method for resource selection, ACM SIGIR, pp 298-305, Toronto, Canada, 2003a.

L. Si and J. Callan. Modeling search engine effectiveness for federated search, ACM SIGIR, pp 83-90, Salvador, Brazil, 2005b.

L. Si and J. Callan. A semisupervised learning method to merge search engine results, ACM Transactions on Information Systems, 21(4):457-491, 2003b.


Page 149: From federated to aggregated search

A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and experiments, WWW, Pages 417-429, Amsterdam, Netherlands, 2000.

S. Sushmita, H. Joho and M. Lalmas. A Task-Based Evaluation of an Aggregated Search Interface, SPIRE, Saariselkä, Finland, 2009.

S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors Affecting Click-Through Behavior in Aggregated Search Interfaces, ACM CIKM, Toronto, Canada, 2010.

S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of Genre and Domain Intents, Technical Report, University of Glasgow 2010.

S. Sushmita, H. Joho, M. Lalmas and J.M. Jose. Understanding domain "relevance" in web search, WWW 2009 Workshop on Web Search Result Summarization and Presentation, Madrid, Spain, 2009.

P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections, ACM SIGIR, pp 503-510, Amsterdam, Netherlands, 2007.

P. Thomas. Server characterisation and selection for personal metasearch, PhD thesis, Australian National University, 2008.

P. Thomas and M. Shokouhi. SUSHI: scoring scaled samples for server selection, ACM SIGIR, pp 419-426, Singapore, Singapore, 2009.

A. Trotman, S. Geva, J. Kamps, M. Lalmas and V. Murdock (eds). Current research in focused retrieval and result aggregation, Special Issue in the Journal of Information Retrieval, Springer, 2010.


Page 150: From federated to aggregated search

T. Tsikrika and M. Lalmas. Merging Techniques for Performing Data Fusion on the Web, ACM CIKM, pp 181-189, Atlanta, Georgia, 2001.

Ellen M. Voorhees, Narendra Kumar Gupta, Ben Johnson-Laird. Learning Collection Fusion Strategies, ACM SIGIR, pp 172-179, 1995.

B. Yuwono and D. Lee. WISE: A world wide web resource database system. IEEE Transactions on Knowledge and Data Engineering, 8(4):548--554, 1996.

B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. Fifth International Conference on Database Systems for Advanced Applications, 6, pp 41-50, Melbourne, Australia, 1997.

J. Xu and J. Callan. Effective retrieval with distributed collections, ACM SIGIR, pp 112-120, Melbourne, Australia, 1998.

A. Zhou and M. Lalmas. Building a Test Collection for Aggregated Search, Technical Report, University of Glasgow 2010.

J. Zobel. Collection selection via lexicon inspection, Australian Document Computing Symposium, pp 74--80, Melbourne, Australia, 1997.
