Federated Search of Text Search Engines in Uncooperative Environments
Luo Si
Language Technology Institute, School of Computer Science, Carnegie Mellon University
Advisor: Jamie Callan (Carnegie Mellon University)
Outline
Outline:
Introduction: Introduction to federated search
Research Problems: the state-of-the-art and our contribution
Demo: Demo of a prototype system for a real-world application
Introduction
Visible Web vs. Hidden Web
• Visible Web: Information that can be copied (crawled) and indexed by conventional search engines like Google or AltaVista
• Hidden Web: Information hidden from conventional engines, which cannot index it (promptly) because:
- There is no arbitrary crawl of the data (e.g., ACM library)
- The data is updated too frequently to be crawled (e.g., buy.com)
• The Hidden Web is larger than the Visible Web (2-50 times), created by professionals, and valuable; it is searched by federated search
• On the Web, these are uncooperative information sources
Federated search is also offered as a feature to compete with Google by search engines like www.find.com
Introduction
Components of Federated Search System
[Diagram: Engine 1, Engine 2, Engine 3, Engine 4, …, Engine N are searched through three components: (1) Resource Representation, (2) Resource Selection, (3) Results Merging]
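A minimal sketch of how these three components fit together in a broker; the Engine class, the crude overlap-based selection, and the naive merge below are illustrative assumptions, not the methods proposed in this talk:

```python
# Hypothetical skeleton of a federated search broker (names are illustrative only).

class Engine:
    """Wrapper around one remote, uncooperative search engine."""
    def __init__(self, name, sample_docs):
        self.name = name
        self.sample_docs = sample_docs          # (1) resource representation: sampled doc texts

    def search(self, query, n):
        # In a real system this would call the remote engine's own search interface.
        return []                               # list of (doc_id, engine_score) pairs


def select_resources(query, engines, k):
    # (2) Resource selection: rank engines by a crude query/sample term overlap and
    # keep the top k. Real systems use CORI, ReDDE, etc.
    terms = set(query.lower().split())
    def overlap(engine):
        return sum(len(terms & set(doc.lower().split())) for doc in engine.sample_docs)
    return sorted(engines, key=overlap, reverse=True)[:k]


def merge_results(query, selected_engines, docs_per_engine=10):
    # (3) Results merging: pool the per-engine lists. Raw engine scores are naively
    # treated as comparable here, which is exactly what SSL merging later improves on.
    merged = []
    for engine in selected_engines:
        merged.extend(engine.search(query, docs_per_engine))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```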
Introduction
Modeling Federated Search
• Applications in the real world
- But: not enough relevance judgments, not enough control…
Requires thorough simulation
• TREC Testbeds with about 100 information sources
- Normal or moderately skewed size testbeds: Trec123 or Trec4_Kmeans
- Skewed: Representative (large source with the same relevant doc density), Relevant (large source with higher relevant doc density),
Nonrelevant (large source with lower relevant doc density)
• Multiple types of search engines to reflect the uncooperative environment
Modeling Federated Search in Research Environments
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and our contribution
Demo
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Resource Representation)
Previous Research on Resource Representation
• Resource descriptions: words and their occurrence counts
- Query-Based Sampling (Callan, 1999): send queries and collect the sampled docs that are returned (a minimal sketch follows below)
• Information source size estimation
- Capture-Recapture Model (Liu and Yu, 1999): requires a large number of interactions with the information sources
• Centralized sample database: collect the docs obtained by Query-Based Sampling (QBS)
- Used for query expansion (Ogilvie & Callan, 2001), not very successful
- Successfully utilized for other problems throughout our new research
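A minimal sketch of query-based sampling, assuming only that the source can be queried with single terms and returns document texts; the `search_engine` callable and its parameters are assumptions for illustration:

```python
import random

def query_based_sampling(search_engine, seed_terms, target_docs=300, docs_per_query=4):
    """Minimal sketch of query-based sampling (Callan, 1999).

    search_engine(term) is assumed to return a ranked list of (doc_id, text) pairs.
    """
    sampled = {}                                  # doc_id -> text
    candidate_terms = list(seed_terms)
    while len(sampled) < target_docs and candidate_terms:
        # pick one probe term and remove it so the loop always terminates
        term = candidate_terms.pop(random.randrange(len(candidate_terms)))
        for doc_id, text in search_engine(term)[:docs_per_query]:
            if doc_id not in sampled:
                sampled[doc_id] = text
                # grow the pool of probe terms from the newly sampled document
                candidate_terms.extend(w for w in text.lower().split() if w.isalpha())
    return sampled    # the resource description is built from these docs' words and counts
```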
Research Problems (Resource Representation)
New Information Source Size Estimation Algorithm
• Sample-Resample Model (Si and Callan, 2003)
- Estimate the df of a term in the sampled docs, get the total df of that term from the source by resubmitting the term as a query, then scale the number of sampled docs to estimate the source size
Experiments
Measure: absolute error ratio between the estimated size $\hat{N}$ and the actual size $N$:
  $AER = \frac{|\hat{N} - N|}{N}$

                     Trec123   Trec123-10Col
Capture-Recapture     0.729        0.943
Sample-Resample       0.232        0.299
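A minimal sketch of the Sample-Resample estimator, assuming the source reports the document frequency of a one-term query; the `resample_df` callable is a stand-in for that interface:

```python
def sample_resample_estimate(probe_terms, sampled_docs, resample_df):
    """Minimal sketch of the Sample-Resample size estimator (Si & Callan, 2003).

    probe_terms:  lowercase terms drawn from the sampled documents.
    sampled_docs: document texts obtained by query-based sampling.
    resample_df:  callable(term) -> document frequency reported by the source
                  for a one-term query (an assumed interface of the engine).
    """
    n_sampled = len(sampled_docs)
    estimates = []
    for term in probe_terms:
        df_sample = sum(1 for doc in sampled_docs if term in doc.lower().split())
        df_source = resample_df(term)
        if df_sample == 0:
            continue
        # Assume the term occurs in the same proportion of docs in the sample as in
        # the full source:  df_sample / n_sampled ~= df_source / N
        estimates.append(df_source * n_sampled / df_sample)
    return sum(estimates) / len(estimates) if estimates else None
```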
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and our contribution
Demo
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Resource Selection)
Previous Research on Resource Selection
Goal of resource selection for information source recommendation
- High-Recall: select the (few) information sources that have the most relevant documents
• "Big document" resource selection approach: treat information sources as big documents and rank them by their similarity to the user query
- Examples: CVV, CORI and KL-divergence
- They lose document boundaries and do not optimize the High-Recall goal
New approach: estimate the percentage of relevant docs in each source and rank the sources accordingly
RElevant Document Distribution Estimation (ReDDE) resource selection
"Relevant Document Distribution Estimation Method for Resource Selection" (Luo Si & Jamie Callan, SIGIR '03)
Research Problems (Resource Selection)
Relevant Doc Distribution Estimation (ReDDE) Algorithm
• Estimated number of relevant documents in source $i$ (sum over its sampled docs, scaled up to the full source):
  $Rel_Q(i) = \sum_{d \in db_i} P(rel|d)\,P(d|db_i)\,N_{db_i} \approx \sum_{d \in db_{i\_samp}} P(rel|d)\,SF_{db_i}$
• Source scale factor = estimated source size / number of sampled docs:
  $SF_{db_i} = \hat{N}_{db_i} / N_{db_{i\_samp}}$
• "Everything at the top is (equally) relevant":
  $P(rel|d) = C_Q$ if $Rank_{CCDB}(Q, d) < ratio \cdot \sum_i \hat{N}_{db_i}$, and $0$ otherwise
• The rank on the centralized complete DB is approximated from the ranking on the centralized sample DB
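A minimal sketch of ReDDE scoring under these definitions; the inputs (a centralized-sample-DB ranking, per-source sample sizes, and estimated source sizes) are assumed to be available, and the names are illustrative:

```python
def redde_scores(csdb_ranking, source_of, sample_size, est_source_size, ratio=0.003):
    """Minimal sketch of ReDDE resource ranking (Si & Callan, SIGIR '03).

    csdb_ranking:    doc ids from the centralized sample DB, ranked for the query.
    source_of:       dict mapping a sampled doc id to the source it came from.
    sample_size:     dict source -> number of docs sampled from that source.
    est_source_size: dict source -> estimated source size (e.g. from Sample-Resample).
    ratio:           fraction of the estimated complete collection treated as "the top",
                     where every document is assumed equally relevant.
    """
    scale = {s: est_source_size[s] / sample_size[s] for s in sample_size}
    threshold = ratio * sum(est_source_size.values())

    scores = {s: 0.0 for s in sample_size}
    rank_in_complete_db = 0.0
    for doc_id in csdb_ranking:
        src = source_of[doc_id]
        if rank_in_complete_db < threshold:
            # this sampled doc stands in for scale[src] docs of the complete DB
            scores[src] += scale[src]
        rank_in_complete_db += scale[src]

    total = sum(scores.values()) or 1.0
    return {s: scores[s] / total for s in scores}   # estimated distribution of relevant docs
```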
Research Problems (Resource Selection)
Experiments
Measure: recall of the evaluated source ranking at rank $k$, relative to the desired (relevance-based) ranking:
  $R_k = \frac{\sum_{i=1}^{k} E_i}{\sum_{i=1}^{k} B_i}$
where $E_i$ and $B_i$ are the numbers of relevant documents in the $i$-th ranked source of the evaluated ranking and of the desired ranking, respectively.
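The same measure as a small helper function (input names are illustrative):

```python
def recall_at_k(evaluated_ranking, desired_ranking, rel_counts, k):
    """R_k for resource selection: relevant docs covered by the top-k sources of the
    evaluated ranking, divided by those covered by the top-k sources of the desired
    (relevance-based) ranking. rel_counts maps source -> number of relevant docs."""
    covered   = sum(rel_counts[s] for s in evaluated_ranking[:k])
    best_case = sum(rel_counts[s] for s in desired_ranking[:k])
    return covered / best_case if best_case else 0.0
```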
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and our contribution
Future Research
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Results Merging)
Goal of Results Merging
Make the different result lists comparable and merge them into a single list
Difficulties:
- Information sources may use different retrieval algorithms
- Information sources have different corpus statistics
Previous Research on Results Merging
• Some methods download all docs and calculate comparable scores, which incurs large communication and computation costs
• Some methods use a heuristic combination: the CORI merging method
Semi-Supervised Learning (SSL) Merging (Si & Callan, 2002, 2003)
Basic idea: approximate centralized document scores by linear regression; estimate the linear models from the overlap documents that appear both in the centralized sample DB and in the individual ranked lists
Research Problems (Results Merging)
In resource representation:
• Build representations by QBS, collapse sampled docs into centralized sample DB
In resource selection:
• Rank sources, calculate centralized scores for docs in centralized sample DB
In results merging:
• Find overlap docs, build linear models, estimate centralized scores for all docs
• SSL Results Merging (cont)
[Diagram: Engine 1, Engine 2, …, Engine N are sampled to build resource representations and the Centralized Sample DB; resource selection picks sources and the query is also run on the CSDB to get a CSDB ranking; overlap docs between each engine's result list and the CSDB ranking train the linear models used to produce the final merged results]
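A minimal sketch of SSL merging following these steps, assuming the centralized-sample-DB scores and the per-source result lists are already available (input names are illustrative):

```python
def ssl_merge(source_results, centralized_scores):
    """Minimal sketch of Semi-Supervised Learning (SSL) results merging (Si & Callan).

    source_results:     dict source -> ranked list of (doc_id, source_score).
    centralized_scores: dict doc_id -> score of that doc when the query is run on the
                        centralized sample DB with a single retrieval algorithm.
    Overlap documents (present in both) are the training data for a per-source
    linear model mapping source scores to comparable centralized scores.
    """
    merged = []
    for source, results in source_results.items():
        overlap = [(s, centralized_scores[d]) for d, s in results if d in centralized_scores]
        if len(overlap) >= 2:
            # least-squares fit of  centralized_score ~= a * source_score + b
            n = len(overlap)
            mean_s = sum(s for s, _ in overlap) / n
            mean_c = sum(c for _, c in overlap) / n
            var_s = sum((s - mean_s) ** 2 for s, _ in overlap)
            a = sum((s - mean_s) * (c - mean_c) for s, c in overlap) / var_s if var_s else 0.0
            b = mean_c - a * mean_s
        else:
            a, b = 1.0, 0.0   # not enough overlap: fall back to the raw scores
        merged.extend((a * s + b, doc_id, source) for doc_id, s in results)
    return sorted(merged, reverse=True)   # single list ordered by estimated centralized score
```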
Research Problems (Results Merging)
Experiments
[Results merging experiments on the Trec123 and Trec4-kmeans testbeds with 10 sources selected]
"Using Sampled Data and Regression to Merge Search Engine Results" (Luo Si & Jamie Callan, SIGIR '02)
"A Semi-Supervised Learning Method to Merge Search Engine Results" (Luo Si & Jamie Callan, TOIS '03)
Outline
Outline:
Introduction
Research Problems: the state-of-the-art and preliminary research
Demo
- Resource Representation
- Resource Selection
- Results Merging
- A Unified Framework
Research Problems (Unified Utility Framework)
Goal of the Unified Utility Maximization Framework
Integrate and adjust the individual components of federated search to get the globally desired results for different applications, rather than simply combining individually effective components
High-Recall vs. High-Precision
- High-Recall: select sources that contain as many relevant docs as possible, for information source recommendation
- High-Precision: select sources that return many relevant docs in the top part of the final ranked list, for federated document retrieval
The two goals are correlated but NOT identical; previous research does NOT distinguish them
Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Formalize federated search as a mathematical optimization problem with respect to the different goals of different applications
Example: for document retrieval with the High-Precision goal, maximize the expected number of relevant docs in the top part of the merged ranked list:
  $\{d_i^*\} = \arg\max_{\{d_i\}} \sum_i I(d_i) \sum_{j=1}^{d_i} \hat{R}(d_{ij})$
Subject to:
  $\sum_i I(d_i) = N_{sdb}$  (number of sources to select)
  $d_i = N_{rdoc}$ if $I(d_i) \neq 0$  (retrieve a fixed number of docs from each selected source)
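With a fixed number of documents per selected source, the optimum reduces to picking the sources whose top results carry the largest expected number of relevant documents. A minimal sketch under that observation; the input names are illustrative and the relevance estimates are assumed to be computed already:

```python
def select_sources_fixed(rel_probs, n_sources, n_docs_per_source):
    """Sketch of UUM high-precision selection when each selected source returns a
    fixed number of documents.

    rel_probs: dict source -> list of estimated relevance probabilities of its docs,
               in that source's rank order.
    """
    def expected_rel_in_top(source):
        return sum(rel_probs[source][:n_docs_per_source])
    # With d_i fixed, maximizing the objective means choosing the sources with the
    # largest expected number of relevant docs among their top n_docs_per_source results.
    return sorted(rel_probs, key=expected_rel_in_top, reverse=True)[:n_sources]
```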
Research Problems (Unified Utility Framework)
Unified Utility Maximization Framework (UUM)
• Resource selection for federated document retrieval: a variant that selects a variable number of docs from the selected sources
  $\{d_i^*\} = \arg\max_{\{d_i\}} \sum_i I(d_i) \sum_{j=1}^{d_i} \hat{R}(d_{ij})$
Subject to:
  $\sum_i I(d_i) = N_{sdb}$  (number of sources to select)
  $\sum_i d_i = N_{Total\_rdoc}$  (total number of documents to select)
  $d_i = 10k,\ k \in \{0, 1, 2, \dots, 10\}$  (retrieve a variable number of docs from each source)
Solution: no simple closed-form solution; solved by dynamic programming
"Unified Utility Maximization Framework for Resource Selection" (Luo Si & Jamie Callan, CIKM '04)
Research Problems (Unified Utility Framework)
Experiments: resource selection for federated document retrieval
[Results on the Trec123 and Representative testbeds, with 3 and 10 sources selected; final lists produced with SSL merging]
Outline
Demo
FedStats Project:
Collaborative work with Jamie Callan, Thi Nhu Truong and Lawrence Yau
Outline
Demo
[Figure: Precision at ranks 0-60 for results merging on FedStats, comparing SSL and CORI]
Results merging experiments on FedStats for CORI and SSL
Future Research (Conclusion)
Conclusion
• Federated search has been a hot research topic in the last decade
- Most previous research is tied to the "big document" approach
• The new research advances the state of the art:
- a more theoretically solid foundation
- more empirically effective methods
- better modeling of real-world applications
A bridge from cool research to a practical tool