Federated Search of Text Search Engines in Uncooperative Environments. Luo Si, Language Technology Institute, School of Computer Science, Carnegie Mellon University. Advisor: Jamie Callan (Carnegie Mellon University)

TRANSCRIPT

Page 1: Federated Search of Text Search Engines in Uncooperative Environments

Federated Search of Text Search Engines in Uncooperative Environments

Luo Si
Language Technology Institute, School of Computer Science, Carnegie Mellon University

Advisor: Jamie Callan (Carnegie Mellon University)

Page 2: Federated Search of Text Search Engines in Uncooperative Environments

© Luo Si, July 2004

Outline

Outline:

Introduction: introduction to federated search

Research Problems: the state-of-the-art and our contribution

Demo: a prototype system for a real-world application


Page 4: Federated Search of Text Search Engines in Uncooperative Environments


Introduction

Visible Web vs. Hidden Web

• Visible Web: information that can be copied (crawled) and indexed by conventional search engines like Google or AltaVista.

• Hidden Web: information hidden from conventional engines, which cannot index it (promptly) because:
- no arbitrary crawl of the data is allowed (e.g., ACM library), or
- the data is updated too frequently to be crawled (e.g., buy.com).

The Hidden Web is larger than the Visible Web (2-50 times) and valuable, since much of it is created by professionals. On the Web it consists of uncooperative information sources, and it is searched by federated search; federated search is also a feature that search engines like www.find.com use to compete with Google.

Page 5: Federated Search of Text Search Engines in Uncooperative Environments


Introduction

Components of a Federated Search System

Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine N

(1) Resource Representation
(2) Resource Selection
(3) Results Merging

Page 6: Federated Search of Text Search Engines in Uncooperative Environments


Introduction

Modeling Federated Search

• Applications in the real world
- But: not enough relevance judgments, not enough control...
→ require thorough simulation

• TREC testbeds with about 100 information sources
- Normal or moderately skewed sizes: Trec123 or Trec4_Kmeans
- Skewed sizes: Representative (large sources with the same relevant-document density), Relevant (large sources with a higher relevant-document density), Nonrelevant (large sources with a lower relevant-document density)

• Multiple types of search engines, to reflect an uncooperative environment

→ Modeling federated search in research environments

Page 7: Federated Search of Text Search Engines in Uncooperative Environments


Outline

Outline:

Introduction

Research Problems: the state-of-the-art and our contribution
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework

Demo

Page 8: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Resource Representation)

Previous Research on Resource Representation

• Resource descriptions: words and their occurrence statistics
- Query-Based Sampling (Callan, 1999): send queries and collect the sampled docs

• Information source size estimation
- Capture-Recapture model (Liu and Yu, 1999): requires a large number of interactions with the information sources

• Centralized sample database: collect the docs from Query-Based Sampling (QBS)
- Used for query expansion (Ogilvie & Callan, 2001), not very successful
- Utilized successfully for other problems throughout our new research

Page 9: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Resource Representation)

New Information Source Size Estimation Algorithm

• Sample-Resample model (Si and Callan, 2003): estimate the df of a term among the sampled docs; get the term's total df from the source by resending it as a resample query; scale the number of sampled docs by the ratio of the two dfs to estimate the source size.

Experiments. Measure: absolute error ratio

AER = |N̂ - N| / N    (N̂ = estimated size, N = actual size)

                    Trec123   Trec123-10Col
Capture-Recapture    0.729        0.943
Sample-Resample      0.232        0.299
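The scaling step above fits in a few lines. The following is a minimal sketch, not the thesis's implementation; the function name and inputs are illustrative.

```python
def sample_resample_estimate(df_samp: int, n_samp: int, df_total: int) -> float:
    """Sample-Resample sketch: estimate an information source's size.

    df_samp:  document frequency of a probe term among the sampled docs
    n_samp:   number of docs obtained by query-based sampling
    df_total: df the source itself reports for the same term, obtained by
              resending it as a one-word "resample" query
    """
    # The sample suggests the term occurs in df_samp / n_samp of all docs,
    # so scaling the source-reported df by the inverse ratio estimates N.
    return df_total * n_samp / df_samp
```

For example, if 30 of 300 sampled docs contain the probe term and the source reports df = 5,000 for it, the estimated size is 50,000 docs; averaging over several probe terms makes the estimate more robust.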

Page 10: Federated Search of Text Search Engines in Uncooperative Environments


Outline

Outline:

Introduction

Research Problems: the state-of-the-art and our contribution
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework

Demo

Page 11: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Resource Selection)

Previous Research on Resource Selection

The goal of resource selection for information source recommendation is High-Recall: select the (few) information sources that have the most relevant documents.

• "Big document" resource selection approach: treat information sources as big documents and rank them by their similarity to the user query
- Examples: CVV, CORI and KL-divergence
- These lose document boundaries and do not optimize the High-Recall goal

New: RElevant Document Distribution Estimation (ReDDE) resource selection: estimate the percentage of relevant docs in each source and rank the sources accordingly.

"Relevant Document Distribution Estimation Method for Resource Selection" (Luo Si & Jamie Callan, SIGIR '03)

Page 12: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Resource Selection)

Relevant Document Distribution Estimation (ReDDE) Algorithm

Rel_Q(i) = Σ_{d ∈ db_i^samp} P(rel|d) × SF_{db_i}    (estimated number of relevant docs in source i)

SF_{db_i} = N̂_{db_i} / N_{db_i^samp}    (source scale factor: estimated source size / number of sampled docs)

P(rel|d) follows "everything at the top is (equally) relevant":

P(rel|d) = C_Q   if Rank_CCDB(Q, d) < ratio × N_CCDB
         = 0     otherwise

where Rank_CCDB(Q, d), the rank of d on the (imagined) centralized complete DB, is approximated from the ranking on the centralized sample DB: each sampled doc stands for SF_{db_i} docs of its source.
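The ReDDE score computation can be sketched as below, assuming the centralized sample DB has already ranked the sampled docs for the query; the data layout is hypothetical, not the thesis's code.

```python
def redde_scores(sample_ranking, est_size, n_sampled, ratio=0.003):
    """ReDDE sketch: score each source by its estimated number of
    relevant documents for the query.

    sample_ranking: source ids of the sampled docs, in the order the
                    centralized sample DB ranks them for the query
    est_size:       source id -> estimated source size (e.g. Sample-Resample)
    n_sampled:      source id -> number of sampled docs from that source
    ratio:          fraction of the imagined centralized complete DB
                    treated as relevant ("everything at the top is relevant")
    """
    # Source scale factor: each sampled doc stands for sf[s] docs of source s.
    sf = {s: est_size[s] / n_sampled[s] for s in est_size}
    # Docs projected above this rank on the centralized complete DB count.
    threshold = ratio * sum(est_size.values())
    scores = {s: 0.0 for s in est_size}
    ccdb_rank = 0.0
    for src in sample_ranking:
        if ccdb_rank < threshold:
            scores[src] += sf[src]   # this doc represents sf[src] relevant docs
        ccdb_rank += sf[src]         # and occupies sf[src] rank positions
    return scores                    # sort sources by score to select
```

Sorting sources by these scores and keeping the top few yields the recommendation ranking.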

Page 13: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Resource Selection)

Experiments. Measure:

ER_k = Σ_{i=1}^{k} E_i / Σ_{i=1}^{k} B_i

where E_i is the number of relevant documents in the i-th source of the evaluated ranking and B_i the number in the i-th source of the desired (best) ranking.
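Under the definition above, the measure is a one-liner; a small sketch with hypothetical per-source counts:

```python
def er_at_k(evaluated, desired, k):
    """ER_k: relevant docs accumulated by the evaluated source ranking,
    relative to the desired (best possible) ranking, down to rank k.

    evaluated[i]: relevant docs in the i-th source of the evaluated ranking
    desired[i]:   relevant docs in the i-th source of the desired ranking
    """
    return sum(evaluated[:k]) / sum(desired[:k])
```

A perfect resource-selection algorithm achieves ER_k = 1 at every k.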

Page 14: Federated Search of Text Search Engines in Uncooperative Environments


Outline

Outline:

Introduction

Research Problems: the state-of-the-art and our contribution
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework

Demo

Page 15: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Results Merging)

Goal of Results Merging

Make the different result lists comparable and merge them into a single list.

Difficulties: information sources may use different retrieval algorithms, and they have different corpus statistics.

Previous Research on Results Merging

• Some methods download all the docs and calculate comparable scores → large communication and computation costs
• Some methods use heuristic combination, e.g., the CORI merging method

Semi-Supervised Learning (SSL) merging (Si & Callan, 2002, 2003): the basic idea is to approximate each document's centralized score by linear regression, estimating the linear models from the overlap documents that appear both in the centralized sample DB and in the individual ranked lists.
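The regression-and-merge idea can be sketched as follows. This is a toy illustration, assuming each source's overlap with the centralized sample DB yields at least two (engine score, centralized score) pairs; all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (one source's linear model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def ssl_merge(result_lists, centralized):
    """SSL merging sketch.

    result_lists: source -> [(doc_id, engine_score), ...]
    centralized:  doc_id -> score in the centralized sample DB ranking
    """
    merged = []
    for src, docs in result_lists.items():
        # Overlap docs appear both in this source's result list and in the
        # centralized sample DB; they anchor the source-specific regression.
        pairs = [(s, centralized[d]) for d, s in docs if d in centralized]
        a, b = fit_line([x for x, _ in pairs], [y for _, y in pairs])
        # Map every returned doc onto the comparable centralized scale.
        merged.extend((a * s + b, d) for d, s in docs)
    merged.sort(reverse=True)
    return [d for _, d in merged]
```

Because every engine score is mapped onto the same (estimated) centralized scale, a single sort produces the final merged list without downloading the documents.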

Page 16: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Results Merging)

SSL Results Merging (cont.)

In resource representation:
• Build representations by QBS; collapse the sampled docs into a centralized sample DB

In resource selection:
• Rank the sources; calculate centralized scores for the docs in the centralized sample DB

In results merging:
• Find the overlap docs, build linear models, and estimate centralized scores for all docs

[Diagram: Engine 1 ... Engine N → resource representation → centralized sample DB → resource selection → CSDB ranking → overlap docs → final results]

Page 17: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Results Merging)

Experiments on Trec123 and Trec4-kmeans, with 10 sources selected.

"Using Sampled Data and Regression to Merge Search Engine Results" (Luo Si & Jamie Callan, SIGIR '02)

"A Semi-Supervised Learning Method to Merge Search Engine Results" (Luo Si & Jamie Callan, TOIS '03)

Page 18: Federated Search of Text Search Engines in Uncooperative Environments


Outline

Outline:

Introduction

Research Problems: the state-of-the-art and preliminary research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework

Demo

Page 19: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Unified Utility Framework)

Goal of the Unified Utility Maximization Framework

Integrate and adjust the individual components of federated search to achieve the globally desired results for different applications, rather than simply combining individually effective components.

High-Recall vs. High-Precision

High-Recall: select sources that contain as many relevant docs as possible, for information source recommendation.

High-Precision: select sources that return many relevant docs in the top part of the final ranked list, for federated document retrieval.

The two goals are correlated but NOT identical; previous research does NOT distinguish them.

Page 20: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Unified Utility Framework)

Unified Utility Maximization Framework (UUM)

• Formalize federated search as a mathematical optimization problem with respect to the different goals of different applications.

Example: document retrieval with the High-Precision goal. Choose the number of docs d_i to retrieve from each source:

(d_1*, ..., d_N*) = argmax Σ_i I(d_i) Σ_{j=1}^{d_i} R̂(d_ij)    (expected number of relevant docs in the top part of the ranked list)

subject to:
Σ_i I(d_i) = N_sdb            (number of sources to select)
d_i = N_rdoc if I(d_i) ≠ 0    (retrieve a fixed number of docs per selected source)

where R̂(d_ij) is the estimated probability of relevance of the j-th ranked doc of source i, and I(d_i) indicates whether source i is selected.

Page 21: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Unified Utility Framework)

Unified Utility Maximization Framework (UUM)

• Resource selection for federated document retrieval: a variant that selects a variable number of docs from each selected source:

(d_1*, ..., d_N*) = argmax Σ_i Σ_{j=1}^{d_i} R̂(d_ij)

subject to:
Σ_i I(d_i) = N_sdb                  (number of sources to select)
Σ_i d_i = N_Total_rdoc              (total number of documents to select)
d_i = 10k, k ∈ {0, 1, 2, ..., 10}   (retrieve a variable number of docs)

Solution: no simple closed form; solved by dynamic programming.

"Unified Utility Maximization Framework for Resource Selection" (Luo Si & Jamie Callan, CIKM '04)
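The variable-document variant admits a small dynamic program over (sources selected, documents allocated) states. The sketch below assumes the per-source expected-relevance gains have already been computed (in UUM they come from the centralized sample DB); the interface is illustrative, not the thesis's formulation.

```python
def uum_select(gains, n_sdb, n_total, step=10):
    """DP sketch for the variable-document UUM variant.

    gains[i][k]: expected number of relevant docs if step*k docs are
                 retrieved from source i (k = 0 means source i not selected;
                 gains should be non-decreasing in k)
    n_sdb:       maximum number of sources to select
    n_total:     total number of documents to retrieve (a multiple of step)
    Returns (expected_relevant, docs_per_source).
    """
    slots = n_total // step
    # State: (sources selected, doc slots used) -> (best value, allocation).
    dp = {(0, 0): (0.0, [])}
    for source_gains in gains:
        nxt = {}
        for (m, t), (value, alloc) in dp.items():
            for k, gain in enumerate(source_gains):
                m2, t2 = m + (k > 0), t + k
                if m2 > n_sdb or t2 > slots:
                    continue  # violates a selection constraint
                cand = (value + gain, alloc + [k * step])
                if cand > nxt.get((m2, t2), (float('-inf'), [])):
                    nxt[(m2, t2)] = cand
        dp = nxt
    # Best state that allocates exactly n_total documents.
    return max(va for (m, t), va in dp.items() if t == slots)
```

With, say, three sources and a 20-document budget, the DP trades off taking many docs from one strong source against splitting the budget across two sources.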

Page 22: Federated Search of Text Search Engines in Uncooperative Environments


Research Problems (Unified Utility Framework)

Experiments: resource selection for federated document retrieval on the Trec123 Representative testbed, with 3 sources selected and 10 sources selected, merged by SSL.

Page 23: Federated Search of Text Search Engines in Uncooperative Environments


Outline

Demo

FedStats Project:

Cooperative work with Jamie Callan, Thi Nhu Truong and Lawrence Yau

Page 24: Federated Search of Text Search Engines in Uncooperative Environments


Outline

Demo

[Figure: precision at ranks up to 60, SSL vs. CORI]

Results merging experiments on FedStats for CORI and SSL

Page 25: Federated Search of Text Search Engines in Uncooperative Environments


Future Research (Conclusion)

Conclusion

• Federated search has been a hot research area in the last decade
- Most previous research is tied to the "big document" approach

• The new research advances the state-of-the-art:
- a more theoretically solid foundation
- more empirical effectiveness
- better modeling of real-world applications

A bridge from cool research to a practical tool.