Federated Search of Text Search Engines in Uncooperative Environments
Luo Si
Language Technology Institute, School of Computer Science, Carnegie Mellon University
Advisor: Jamie Callan (Carnegie Mellon University)
Outline
Outline:
Introduction: Introduction to federated search
Research Problems: the state-of-the-art and our contribution
Demo: Demo of a prototype system for a real-world application
Introduction
Visible Web vs. Hidden Web
• Visible Web: Information that can be copied (crawled) and indexed by conventional search engines like Google or AltaVista
• Hidden Web: Information hidden from conventional engines, which cannot index it (promptly) because:
- There is no arbitrary crawl of the data (e.g., ACM library)
- The data is updated too frequently to be crawled (e.g., buy.com)
• The Hidden Web is larger than the Visible Web (2-50 times), created by professionals, and valuable; it is searched by federated search
• On the Web, these are uncooperative information sources
Federated search is also offered as a feature to compete with Google by search engines like www.find.com
Introduction
Components of Federated Search System
[Diagram: Engine 1, Engine 2, Engine 3, Engine 4, …, Engine N are searched through three components: (1) Resource Representation, (2) Resource Selection, (3) Results Merging]
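A minimal sketch of how these three components fit together in a broker; the Engine class, the crude overlap-based selection, and the naive merge below are illustrative assumptions, not the methods proposed in this talk:

```python
# Hypothetical skeleton of a federated search broker (names are illustrative only).

class Engine:
    """Wrapper around one remote, uncooperative search engine."""
    def __init__(self, name, sample_docs):
        self.name = name
        self.sample_docs = sample_docs          # (1) resource representation: sampled doc texts

    def search(self, query, n):
        # In a real system this would call the remote engine's own search interface.
        return []                               # list of (doc_id, engine_score) pairs


def select_resources(query, engines, k):
    # (2) Resource selection: rank engines by a crude query/sample term overlap and
    # keep the top k. Real systems use CORI, ReDDE, etc.
    terms = set(query.lower().split())
    def overlap(engine):
        return sum(len(terms & set(doc.lower().split())) for doc in engine.sample_docs)
    return sorted(engines, key=overlap, reverse=True)[:k]


def merge_results(query, selected_engines, docs_per_engine=10):
    # (3) Results merging: pool the per-engine lists. Raw engine scores are naively
    # treated as comparable here, which is exactly what SSL merging later improves on.
    merged = []
    for engine in selected_engines:
        merged.extend(engine.search(query, docs_per_engine))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```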
Introduction
Modeling Federated Search
• Applications in the real world
- But: not enough relevance judgments, not enough control…
Requires thorough simulation
• TREC Testbeds with about 100 information sources
- Normal or moderately skewed size testbeds: Trec123 or Trec4_Kmeans
- Skewed: Representative (large source with the same relevant doc density), Relevant (large source with higher relevant doc density),
Nonrelevant (large source with lower relevant doc density)
• Multiple types of search engines to reflect the uncooperative environment
Modeling Federated Search in Research Environments
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and our contribution
Demo
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Resource Representation)
Previous Research on Resource Representation
• Resource descriptions: words and their occurrence counts
- Query-Based Sampling (Callan, 1999): send queries and collect the sampled docs that are returned (a minimal sketch follows below)
• Information source size estimation
- Capture-Recapture Model (Liu and Yu, 1999): requires a large number of interactions with the information sources
• Centralized sample database: collect the docs obtained by Query-Based Sampling (QBS)
- Used for query expansion (Ogilvie & Callan, 2001), not very successful
- Successfully utilized for other problems throughout our new research
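A minimal sketch of query-based sampling, assuming only that the source can be queried with single terms and returns document texts; the `search_engine` callable and its parameters are assumptions for illustration:

```python
import random

def query_based_sampling(search_engine, seed_terms, target_docs=300, docs_per_query=4):
    """Minimal sketch of query-based sampling (Callan, 1999).

    search_engine(term) is assumed to return a ranked list of (doc_id, text) pairs.
    """
    sampled = {}                                  # doc_id -> text
    candidate_terms = list(seed_terms)
    while len(sampled) < target_docs and candidate_terms:
        # pick one probe term and remove it so the loop always terminates
        term = candidate_terms.pop(random.randrange(len(candidate_terms)))
        for doc_id, text in search_engine(term)[:docs_per_query]:
            if doc_id not in sampled:
                sampled[doc_id] = text
                # grow the pool of probe terms from the newly sampled document
                candidate_terms.extend(w for w in text.lower().split() if w.isalpha())
    return sampled    # the resource description is built from these docs' words and counts
```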
Research Problems (Resource Representation)
New Information Source Size Estimation Algorithm
• Sample-Resample Model (Si and Callan, 2003)
- Estimate the df of a term in the sampled docs, get the total df of that term from the source by resubmitting the term as a query, then scale the number of sampled docs to estimate the source size
Experiments
Measure: absolute error ratio between the estimated size $\hat{N}$ and the actual size $N$:
  $AER = \frac{|\hat{N} - N|}{N}$

                     Trec123   Trec123-10Col
Capture-Recapture     0.729        0.943
Sample-Resample       0.232        0.299
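A minimal sketch of the Sample-Resample estimator, assuming the source reports the document frequency of a one-term query; the `resample_df` callable is a stand-in for that interface:

```python
def sample_resample_estimate(probe_terms, sampled_docs, resample_df):
    """Minimal sketch of the Sample-Resample size estimator (Si & Callan, 2003).

    probe_terms:  lowercase terms drawn from the sampled documents.
    sampled_docs: document texts obtained by query-based sampling.
    resample_df:  callable(term) -> document frequency reported by the source
                  for a one-term query (an assumed interface of the engine).
    """
    n_sampled = len(sampled_docs)
    estimates = []
    for term in probe_terms:
        df_sample = sum(1 for doc in sampled_docs if term in doc.lower().split())
        df_source = resample_df(term)
        if df_sample == 0:
            continue
        # Assume the term occurs in the same proportion of docs in the sample as in
        # the full source:  df_sample / n_sampled ~= df_source / N
        estimates.append(df_source * n_sampled / df_sample)
    return sum(estimates) / len(estimates) if estimates else None
```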
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and our contribution
Demo
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Resource Selection)
Previous Research on Resource Selection
Goal of resource selection for information source recommendation
- High-Recall: select the (few) information sources that have the most relevant documents
• "Big document" resource selection approach: treat information sources as big documents and rank them by their similarity to the user query
- Examples: CVV, CORI and KL-divergence
- They lose document boundaries and do not optimize the High-Recall goal
New approach: estimate the percentage of relevant docs in each source and rank the sources accordingly
RElevant Document Distribution Estimation (ReDDE) resource selection
"Relevant Document Distribution Estimation Method for Resource Selection" (Luo Si & Jamie Callan, SIGIR '03)
Research Problems (Resource Selection)
Relevant Doc Distribution Estimation (ReDDE) Algorithm
• Estimated number of relevant documents in source $i$ (sum over its sampled docs, scaled up to the full source):
  $Rel_Q(i) = \sum_{d \in db_i} P(rel|d)\,P(d|db_i)\,N_{db_i} \approx \sum_{d \in db_{i\_samp}} P(rel|d)\,SF_{db_i}$
• Source scale factor = estimated source size / number of sampled docs:
  $SF_{db_i} = \hat{N}_{db_i} / N_{db_{i\_samp}}$
• "Everything at the top is (equally) relevant":
  $P(rel|d) = C_Q$ if $Rank_{CCDB}(Q, d) < ratio \cdot \sum_i \hat{N}_{db_i}$, and $0$ otherwise
• The rank on the centralized complete DB is approximated from the ranking on the centralized sample DB
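A minimal sketch of ReDDE scoring under these definitions; the inputs (a centralized-sample-DB ranking, per-source sample sizes, and estimated source sizes) are assumed to be available, and the names are illustrative:

```python
def redde_scores(csdb_ranking, source_of, sample_size, est_source_size, ratio=0.003):
    """Minimal sketch of ReDDE resource ranking (Si & Callan, SIGIR '03).

    csdb_ranking:    doc ids from the centralized sample DB, ranked for the query.
    source_of:       dict mapping a sampled doc id to the source it came from.
    sample_size:     dict source -> number of docs sampled from that source.
    est_source_size: dict source -> estimated source size (e.g. from Sample-Resample).
    ratio:           fraction of the estimated complete collection treated as "the top",
                     where every document is assumed equally relevant.
    """
    scale = {s: est_source_size[s] / sample_size[s] for s in sample_size}
    threshold = ratio * sum(est_source_size.values())

    scores = {s: 0.0 for s in sample_size}
    rank_in_complete_db = 0.0
    for doc_id in csdb_ranking:
        src = source_of[doc_id]
        if rank_in_complete_db < threshold:
            # this sampled doc stands in for scale[src] docs of the complete DB
            scores[src] += scale[src]
        rank_in_complete_db += scale[src]

    total = sum(scores.values()) or 1.0
    return {s: scores[s] / total for s in scores}   # estimated distribution of relevant docs
```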
Research Problems (Resource Selection)
Experiments
Measure: recall of the evaluated source ranking at rank $k$, relative to the desired (relevance-based) ranking:
  $R_k = \frac{\sum_{i=1}^{k} E_i}{\sum_{i=1}^{k} B_i}$
where $E_i$ and $B_i$ are the numbers of relevant documents in the $i$-th ranked source of the evaluated ranking and of the desired ranking, respectively.
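The same measure as a small helper function (input names are illustrative):

```python
def recall_at_k(evaluated_ranking, desired_ranking, rel_counts, k):
    """R_k for resource selection: relevant docs covered by the top-k sources of the
    evaluated ranking, divided by those covered by the top-k sources of the desired
    (relevance-based) ranking. rel_counts maps source -> number of relevant docs."""
    covered   = sum(rel_counts[s] for s in evaluated_ranking[:k])
    best_case = sum(rel_counts[s] for s in desired_ranking[:k])
    return covered / best_case if best_case else 0.0
```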
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and our contribution
Future Research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Results Merging)
Goal of Results Merging
Make the different result lists comparable and merge them into a single list
Difficulties:
- Information sources may use different retrieval algorithms
- Information sources have different corpus statistics
Previous Research on Results Merging
• Some methods download all docs and calculate comparable scores, which incurs large communication and computation costs
• Some methods use a heuristic combination: the CORI merging method
Semi-Supervised Learning (SSL) Merging (Si & Callan, 2002, 2003)
Basic idea: approximate centralized document scores by linear regression; estimate the linear models from the overlap documents that appear both in the centralized sample DB and in the individual ranked lists
Research Problems (Results Merging)
In resource representation:
• Build representations by QBS, collapse sampled docs into centralized sample DB
In resource selection:
• Rank sources, calculate centralized scores for docs in centralized sample DB
In results merging:
• Find overlap docs, build linear models, estimate centralized scores for all docs
• SSL Results Merging (cont)
[Diagram: Engine 1, Engine 2, …, Engine N are sampled to build resource representations and the Centralized Sample DB; resource selection picks sources and the query is also run on the CSDB to get a CSDB ranking; overlap docs between each engine's result list and the CSDB ranking train the linear models used to produce the final merged results]
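A minimal sketch of SSL merging following these steps, assuming the centralized-sample-DB scores and the per-source result lists are already available (input names are illustrative):

```python
def ssl_merge(source_results, centralized_scores):
    """Minimal sketch of Semi-Supervised Learning (SSL) results merging (Si & Callan).

    source_results:     dict source -> ranked list of (doc_id, source_score).
    centralized_scores: dict doc_id -> score of that doc when the query is run on the
                        centralized sample DB with a single retrieval algorithm.
    Overlap documents (present in both) are the training data for a per-source
    linear model mapping source scores to comparable centralized scores.
    """
    merged = []
    for source, results in source_results.items():
        overlap = [(s, centralized_scores[d]) for d, s in results if d in centralized_scores]
        if len(overlap) >= 2:
            # least-squares fit of  centralized_score ~= a * source_score + b
            n = len(overlap)
            mean_s = sum(s for s, _ in overlap) / n
            mean_c = sum(c for _, c in overlap) / n
            var_s = sum((s - mean_s) ** 2 for s, _ in overlap)
            a = sum((s - mean_s) * (c - mean_c) for s, c in overlap) / var_s if var_s else 0.0
            b = mean_c - a * mean_s
        else:
            a, b = 1.0, 0.0   # not enough overlap: fall back to the raw scores
        merged.extend((a * s + b, doc_id, source) for doc_id, s in results)
    return sorted(merged, reverse=True)   # single list ordered by estimated centralized score
```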
Research Problems (Results Merging)
Experiments
[Results merging experiments on the Trec123 and Trec4-kmeans testbeds with 10 sources selected]
"Using Sampled Data and Regression to Merge Search Engine Results" (Luo Si & Jamie Callan, SIGIR '02)
"A Semi-Supervised Learning Method to Merge Search Engine Results" (Luo Si & Jamie Callan, TOIS '03)
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and preliminary research
Demo
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Unified Utility Framework)
Goal of the Unified Utility Maximization Framework
Integrate and adjust the individual components of federated search to get the globally desired results for different applications, rather than simply combining individually effective components
High-Recall vs. High-Precision
- High-Recall: select sources that contain as many relevant docs as possible, for information source recommendation
- High-Precision: select sources that return many relevant docs in the top part of the final ranked list, for federated document retrieval
The two goals are correlated but NOT identical; previous research does NOT distinguish them
Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Formalize federated search as a mathematical optimization problem with respect to the different goals of different applications
Example: for document retrieval with the High-Precision goal, maximize the expected number of relevant docs in the top part of the merged ranked list:
  $\{d_i^*\} = \arg\max_{\{d_i\}} \sum_i I(d_i) \sum_{j=1}^{d_i} \hat{R}(d_{ij})$
Subject to:
  $\sum_i I(d_i) = N_{sdb}$  (number of sources to select)
  $d_i = N_{rdoc}$ if $I(d_i) \neq 0$  (retrieve a fixed number of docs from each selected source)
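With a fixed number of documents per selected source, the optimum reduces to picking the sources whose top results carry the largest expected number of relevant documents. A minimal sketch under that observation; the input names are illustrative and the relevance estimates are assumed to be computed already:

```python
def select_sources_fixed(rel_probs, n_sources, n_docs_per_source):
    """Sketch of UUM high-precision selection when each selected source returns a
    fixed number of documents.

    rel_probs: dict source -> list of estimated relevance probabilities of its docs,
               in that source's rank order.
    """
    def expected_rel_in_top(source):
        return sum(rel_probs[source][:n_docs_per_source])
    # With d_i fixed, maximizing the objective means choosing the sources with the
    # largest expected number of relevant docs among their top n_docs_per_source results.
    return sorted(rel_probs, key=expected_rel_in_top, reverse=True)[:n_sources]
```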
Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Resource selection for federated document retrieval: a variant that selects a variable number of docs from the selected sources
  $\{d_i^*\} = \arg\max_{\{d_i\}} \sum_i I(d_i) \sum_{j=1}^{d_i} \hat{R}(d_{ij})$
Subject to:
  $\sum_i I(d_i) = N_{sdb}$  (number of sources to select)
  $\sum_i d_i = N_{Total\_rdoc}$  (total number of documents to select)
  $d_i = 10k,\ k \in \{0, 1, 2, \dots, 10\}$  (retrieve a variable number of docs from each source)
Solution: no simple closed-form solution; solved by dynamic programming
"Unified Utility Maximization Framework for Resource Selection" (Luo Si & Jamie Callan, CIKM '04)
Research Problems (Unified Utility Framework)
Experiments: resource selection for federated document retrieval
[Results on the Trec123 and Representative testbeds, with 3 and 10 sources selected; final lists produced with SSL merging]
Outline
Demo
FedStats Project:
Collaborative work with Jamie Callan, Thi Nhu Truong and Lawrence Yau
Outline
Demo
[Figure: Precision at ranks 0-60 for results merging on FedStats, comparing SSL and CORI]
Results merging experiments on FedStats for CORI and SSL
Future Research (Conclusion)
Conclusion
• Federated search has been a hot research topic in the last decade
- Most previous research is tied to the "big document" approach
• The new research advances the state of the art:
- a more theoretically solid foundation
- more empirically effective methods
- better modeling of real-world applications
A bridge from cool research to a practical tool