VLDB'99 Tutorial: Metasearch Engines: Solutions and Challenges
VLDB'99 TUTORIAL
Metasearch Engines: Solutions and Challenges
Clement Yu, Dept. of EECS, U. of Illinois at Chicago, Chicago, IL 60607
Weiyi Meng, Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY 13902
[email protected]
The Problem
[Diagram: search engines 1..n, each searching its own text source 1..n]
How am I going to find the 5 best pages on “Internet Security”?
Metasearch Engine Solution
[Diagram: user and user interface; a query dispatcher sends the query to search engines 1..n over text sources 1..n; a result merger returns the merged result to the user]
Some Observations
- most sources are not useful for a given query
- sending a query to a useless source would:
  - incur unnecessary network traffic
  - waste local resources for evaluating the query
  - increase the cost of merging the results
- retrieving too many documents from a source is inefficient
A More Efficient Metasearch Engine
[Diagram: user and user interface; a database selector, document selector, query dispatcher and result merger sit between the user interface and search engines 1..n over text sources 1..n]
Tutorial Outline
1. Introduction to Text Retrieval (Vector Space Model only)
2. Search Engines on the Web
3. Introduction to Metasearch Engine
4. Database Selection
5. Document Selection
6. Result Merging
7. New Challenges
Introduction to Text Retrieval (1)
Document representation
- remove stopwords: of, the, ...
- stemming: stemming → stem
- d = (d1, ..., di, ..., dn)
- di: weight of the ith term in d
- tf*idf formula for computing di

Example: consider term t of document d in a database of N documents.
- tf weight of t in d (if tf > 0): 0.5 + 0.5*tf/max_tf
- idf weight of t: log(N/df)
- weight of t in d: (0.5 + 0.5*tf/max_tf)*log(N/df)
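As a rough illustration (not from the original slides; the function name and arguments are hypothetical), the tf*idf weight above can be computed as follows:

```python
import math

def tfidf_weight(tf, max_tf, df, num_docs):
    """Weight of term t in document d using the slide's formula:
    (0.5 + 0.5*tf/max_tf) * log(N/df); 0 if the term does not occur."""
    if tf == 0 or df == 0:
        return 0.0
    tf_part = 0.5 + 0.5 * tf / max_tf      # normalized term frequency
    idf_part = math.log(num_docs / df)     # inverse document frequency
    return tf_part * idf_part

# e.g., tf = 3, max_tf = 5, df = 10, N = 1000
print(tfidf_weight(3, 5, 10, 1000))
```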
Introduction to Text Retrieval (2)
Query representation
- q = (q1, ..., qi, ..., qn)
- qi: weight of the ith term in q
- computing qi: tf weight only
- alternative: use idf weight for query terms rather than document terms
- query expansion (e.g., add related terms)
Introduction to Text Retrieval (3)
Similarity Functions
- simple dot product: sim(q, d) = Σ_{i=1..n} qi * di
  - favors long documents
- Cosine function: sim(q, d) = (Σ_{i=1..n} qi * di) / (|q| * |d|)
- other similarity functions exist
- normalized similarities: [0, 1.0]
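A minimal sketch of the two similarity functions above, assuming dense term-weight vectors of equal length (names are illustrative):

```python
import math

def dot_product_sim(q, d):
    """Simple dot product; favors long documents."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine_sim(q, d):
    """Dot product normalized by the vector lengths; in [0, 1] for
    non-negative weights."""
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot_product_sim(q, d) / norm if norm > 0 else 0.0

print(cosine_sim([1.0, 0.5, 0.0], [0.2, 0.4, 0.1]))
```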
Introduction to Text Retrieval (4)
Retrieval Effectiveness
- relevant documents: documents useful to the user of the query
- recall: percentage of relevant documents that are retrieved
- precision: percentage of retrieved documents that are relevant

[Figure: recall-precision curve]
Search Engines on the Web (1)
Search engine as a document retrieval system
- no control over the web pages that can be searched
- web pages have rich structures and semantics
- web pages are extensively linked
- additional information for each page (time last modified, organization publishing it, etc.)
- databases are dynamic and can be very large
- few general-purpose search engines and numerous special-purpose search engines
Search Engines on the Web (2)
New indexing techniques
- partial-text indexing to improve scalability
- ignore and/or discount spamming terms
- use anchor terms to index linked pages
  - e.g.: WWWW [McBr94], Google [BrPa98], Webor [CSM97]

[Example: Page 1 contains the anchor text "... airplane ticket and hotel ..." linking to Page 2: http://travelocity.com/]
Search Engines on the Web (3)
New term weighting schemes
- higher weights to terms enclosed by special tags
  - title (SIBRIS [WaWJ89], Altavista, HotBot, Yahoo)
  - special fonts (Google [BrPa98])
  - special fonts & tags (LASER [BoFJ96])
- Webor [CSM97] approach
  - partition tags into disjoint classes (title, header, strong, anchor, list, plain text)
  - assign different importance factors to terms in different classes
  - determine optimal importance factors
Search Engines on the Web (4)
New document ranking methods
- Vector Spreading Activation [YuLe96]: add a fraction of the parents' similarities
- Example: Suppose for query q: sim(q, d1) = 0.4, sim(q, d2) = 0.2, sim(q, d3) = 0.2
  - final score of d3 = 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26

[Diagram: pages d1 and d2 link to d3]
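A small sketch of the vector spreading activation idea, assuming a hypothetical `parents` map from each page to the pages that link to it and a fraction of 0.1:

```python
def spread_activation(sims, parents, fraction=0.1):
    """Add a fraction of each parent's similarity to a page's own similarity."""
    return {page: sim + fraction * sum(sims.get(p, 0.0) for p in parents.get(page, []))
            for page, sim in sims.items()}

# Slide example: d1 and d2 both link to d3
sims = {"d1": 0.4, "d2": 0.2, "d3": 0.2}
parents = {"d3": ["d1", "d2"]}
print(spread_activation(sims, parents)["d3"])   # 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26
```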
Search Engines on the Web (5)
New document ranking methods
- combine similarity with rank
  - PageRank [PaBr98]: an important page is linked to by many pages and/or by important pages
- combine similarity with authority score
  - authority [Klei98]: an important content page is highly linked to among the initially retrieved pages and their neighbors
Introduction to Metasearch Engine (1)
An Example
- Query: Internet Security
- Databases: NYT ..., WP ..., DB ..., DB ...
- Retrieved results: t1, t2, ...; p1, p2, ...
- Merged results: p1, t1, ...
Introduction to Metasearch Engine (2)
Database Selection Problem
- Select potentially useful databases for a given query
- essential if the number of local databases is large
  - reduces network traffic
  - avoids wasting local resources
Introduction to Metasearch Engine (3)
- Potentially useful database: contains potentially useful documents
- Potentially useful documents:
  - global similarity above a threshold, or
  - global similarity among the m highest
- Need some knowledge about each database in advance in order to perform database selection → Database Representative
Introduction to Metasearch Engine (4)
Document Selection Problem
- Select potentially useful documents from each selected local database efficiently
Step 1: Retrieve all potentially useful documents while minimizing the retrieval of useless documents
- go from the global similarity threshold to the tightest local similarity threshold
- want all d with Gsim(q, d) > GT
- retrieve d from DBk with Lsim(q, d) > LTk
- LTk is the largest value such that Gsim(q, d) > GT implies Lsim(q, d) > LTk
Introduction to Metasearch Engine (5)
Efficient Document Selection
Step 2: Transmit all potentially useful documents to the result merger while minimizing the transmission of useless documents
- further filtering to reduce transmission cost and merging cost

[Example diagram: local DBk retrieves d1, ..., ds; filtering selects d2, d7, d10 for transmission]
Introduction to Metasearch Engine (6)
Result Merging Problem
- Objective: Merge returned documents from multiple sources into a single ranked list.
- Difficulty: Local document similarities may be incomparable or not available.
- Solutions: Generate "global similarities" for ranking.

[Diagram: DB1 returns d11, d12, ...; DBN returns dN1, dN2, ...; the merger outputs d12, d54, ...]
Introduction to Metasearch Engine (7)
An Ideal Metasearch Engine:
- Retrieval effectiveness: same as if all documents were in the same collection.
- Efficiency: optimize the retrieval process.
Implications: should aim at
- selecting only useful search engines
- retrieving and transmitting only useful documents
- ranking documents according to their degrees of relevance
Introduction to Metasearch Engine (8)
Main Sources of Difficulties: [MYL99]
- autonomy of local search engines
  - design autonomy
  - maintenance autonomy
- heterogeneities among local search engines
  - indexing method
  - document/query term weighting schemes
  - similarity/ranking function
  - document database
  - document version
  - result presentation
Introduction to Metasearch Engine (9)
Impact of Autonomy and Heterogeneities [MLY99]
- unwilling to provide database representatives, or provide different types of representatives
- difficult to find potentially useful documents
- difficult to merge documents from multiple sources
Database Selection: Basic Idea
Goal: Identify potentially useful databases for each user query.
General approach:
- use a representative to indicate approximately the content of each database
- use these representatives to select databases for each query
Diversity of solutions:
- different types of representatives
- different algorithms using the representatives
Solution Classification
- Naive Approach: select all databases (e.g., MetaCrawler, NCSTRL)
- Qualitative Approaches: estimate the quality of each local database
  - based on rough representatives
  - based on detailed representatives
- Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly
- Learning-based Approaches: database representatives are obtained through training or learning
Qualitative Approaches Using Rough Representatives
- typical representative: a few words or a few paragraphs in a certain format
- manual construction often needed
- can work well for special-purpose local search engines
- very scalable storage requirement
- selection can be inaccurate as the description is too rough
Qualitative Approaches Using Rough Representatives
Example 1: ALIWEB [Kost94]
- Representative has a fixed format, e.g., for a site containing files for the Perl language:
  Template-Type: DOCUMENT
  Title: Perl
  Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
  Keywords: perl, perl-faq, language
- user query can match against one or more fields
Qualitative Approaches Using Rough Representatives
Example 2: NetSerf [ChHa95]
- Representative has a WordNet-based structure, e.g., for a site of world facts listed by country:
  topic: country
    synset: [nation, nationality, land, country, a_people]
    synset: [state, nation, country, land, commonwealth, res_publica, body_politic]
    synset: [country, state, land, nation]
  info-type: facts
- user query is transformed into a similar structure before matching
Qualitative Approaches Using Detailed Representatives
- Use detailed statistical information for each term
- employ special measures to estimate the usefulness/quality of each search engine for each query
- the measures reflect the usefulness in a less direct/explicit way compared to those used in quantitative approaches
- scalability starts to become an issue
Qualitative Approaches Using Detailed Representatives
Example 1: gGlOSS [GrGa95]
- representative: (dfi, Wi) for term ti
  - dfi: document frequency of ti
  - Wi: the sum of the weights of ti over all documents
- database usefulness: sum of high similarities

  usefulness(q, D, T) = Σ_{d in D, sim(q, d) > T} sim(q, d)
gGlOSS (continued)
Suppose for query q , we have
D1 d11: 0.6, d12: 0.5
D2 d21: 0.3, d22: 0.3, d23: 0.2
D3 d31: 0.7, d32: 0.1, d33: 0.1
usefulness(q, D1, 0.3) = 1.1
usefulness(q, D2, 0.3) = 0.6
usefulness(q, D3, 0.3) = 0.7
gGlOSS (continued)
gGlOSS: usefulness is estimated for two cases
- high-correlation case: if dfi ≤ dfj, then every document having ti also has tj.

Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.

        actual weights        assumed (high-correlation) weights
        t1    t2    t3        t1    t2    t3
  d1    0.2   0.1   0.3       0.3   0.2   0.3
  d2    0.4   0.3   0.2       0.3   0.2   0.3
  d3    0     0.2   0.4       0     0.2   0.3
  d4    0     0     0.3       0     0     0.3

usefulness(q, D, 0.5) = W1 + W2 + df2*W3/df3 = 2.1
gGlOSS (continued)
- disjoint case: for any two query terms ti and tj, no document contains both ti and tj.

Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.

        actual weights        assumed (disjoint) weights
        t1    t2    t3        t1    t2    t3
  d1    0.2   0     0         0.25  0     0
  d2    0     0.2   0         0     0.2   0
  d3    0.3   0     0         0.25  0     0
  d4    0     0     0.4       0     0     0.4

usefulness(q, D, T) = Σ_{i : qi*Wi/dfi > T} qi*Wi
usefulness(q, D, 0.3) = W3 = 0.4
gGlOSS (continued)
Some observations
- usefulness depends on the threshold
- representative has two quantities per term
- strong assumptions are used
  - high-correlation tends to overestimate
  - disjoint tends to underestimate
  - the two estimates tend to form bounds on the sum of the similarities above T
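To make the two estimates concrete, here is a rough sketch (my own reconstruction from the (dfi, Wi) representative; it follows the worked examples above, including the threshold comparison used in the high-correlation example):

```python
def gloss_high_correlation(q, df, W, T):
    """High-correlation estimate: order terms by increasing df; a document
    containing a term is assumed to contain all terms with larger df.
    Sums the estimated similarities of documents reaching threshold T."""
    terms = sorted((i for i in range(len(q)) if df[i] > 0), key=lambda i: df[i])
    useful, prev_df = 0.0, 0
    for g, i in enumerate(terms):
        n_docs = df[i] - prev_df                               # documents in this group
        sim = sum(q[j] * W[j] / df[j] for j in terms[g:])      # estimated per-document similarity
        if n_docs > 0 and sim >= T - 1e-9:                     # small tolerance for float rounding
            useful += n_docs * sim
        prev_df = df[i]
    return useful

def gloss_disjoint(q, df, W, T):
    """Disjoint estimate: no document contains two query terms."""
    return sum(q[i] * W[i] for i in range(len(q)) if df[i] > 0 and q[i] * W[i] / df[i] > T)

# Examples from the slides
print(gloss_high_correlation([1, 1, 1], [2, 3, 4], [0.6, 0.6, 1.2], 0.5))  # ~2.1
print(gloss_disjoint([1, 1, 1], [2, 1, 1], [0.5, 0.2, 0.4], 0.3))          # 0.4
```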
Qualitative Approaches Using Detailed Representatives
Example 2: CORI Net [CaLC95]
- representative: (dfi, cfi) for term ti
  - dfi: document frequency of ti
  - cfi: collection frequency of ti (number of databases containing ti)
  - cfi can be shared by all databases
- database usefulness: usefulness(q, D) = sim(q, representative of D)
  - analogy: usefulness ↔ similarity, dfi ↔ tfi, cfi ↔ dfi
CORI Net (continued)
Some observations
- estimates are independent of the threshold
- representative has less than two quantities per term
- similarity is computed based on an inference network
- same method for ranking documents and ranking databases
Qualitative Approaches Using Detailed Representatives
Example 3: D-WISE [YuLe97]
- representative: dfi,j for term tj in database Di
- database usefulness: a measure of query term concentration in different databases

  usefulness(q, Di) = Σ_{j=1..k} CVVj * dfi,j

- k: number of query terms
- CVVj: cue validity variance of term tj across all databases; the larger CVVj is, the more useful tj is in distinguishing different databases
D-WISE (continued)
- ACVj: average cue validity of tj over all databases

  CVi,j = (dfi,j / ni) / (dfi,j / ni + (Σ_{k≠i} dfk,j) / (Σ_{k≠i} nk))

  ACVj = (1/N) * Σ_{i=1..N} CVi,j

  CVVj = (1/N) * Σ_{i=1..N} (CVi,j - ACVj)^2

- N: number of databases
- ni: number of documents in database Di

Observations:
- estimates independent of threshold
- representative has one quantity per term
- measure is difficult to understand
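An illustrative sketch of the CVV computation above (my own reconstruction; the df matrix and sizes n are hypothetical inputs):

```python
def cvv_usefulness(query_terms, df, n):
    """usefulness(q, Di) = sum_j CVVj * df[i][j], with CVVj the cue validity
    variance of term tj across databases.  df[i][j]: document frequency of
    term tj in database Di; n[i]: number of documents in Di."""
    N = len(n)
    cvv = {}
    for j in query_terms:
        cv = []
        for i in range(N):
            own = df[i][j] / n[i]
            others_df = sum(df[k][j] for k in range(N) if k != i)
            others_n = sum(n[k] for k in range(N) if k != i)
            denom = own + (others_df / others_n if others_n else 0.0)
            cv.append(own / denom if denom else 0.0)
        acv = sum(cv) / N                                # average cue validity
        cvv[j] = sum((c - acv) ** 2 for c in cv) / N     # cue validity variance
    return [sum(cvv[j] * df[i][j] for j in query_terms) for i in range(N)]

# Hypothetical example: 3 databases, query terms are term indices 0 and 1
print(cvv_usefulness([0, 1], df=[[5, 0], [1, 4], [2, 2]], n=[100, 100, 100]))
```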
Quantitative Approaches
Two types of quantities may be estimated with respect to query q:
- the number of documents in a database D with similarities higher than a threshold T:

  NoDoc(q, D, T) = |{ d : d ∈ D and sim(q, d) > T }|

- the global similarity of the most similar document in D:

  msim(q, D) = max_{d ∈ D} sim(q, d)

- can be used to rank databases in descending order of similarity (or any desirability measure)
Estimating NoDoc(q, D, T)
Basic Approach [MLYW98]
- representative: (pi, wi) for term ti
  - pi: probability that ti appears in a document
  - wi: average weight of ti among documents containing ti
- Example: normalized weights of ti in 10 documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6).
  pi = 0.6, wi = 0.4
Estimating NoDoc(q, D, T)
Basic Approach (continued)
Example: Consider query q = (1, 1). Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
A generating function:

  (0.2*X^2 + 0.8) * (0.4*X + 0.6)
  = 0.08*X^3 + 0.12*X^2 + 0.32*X + 0.48

a*X^b: a is the probability that a document in D has similarity b with q
NoDoc(q, D, 1) = 10*(0.08 + 0.12) = 2
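A sketch of the generating-function computation (assuming independent terms, as in the basic approach); the dictionary maps an exponent (a similarity value) to its coefficient (a probability):

```python
from collections import defaultdict

def generating_function(term_params):
    """term_params: list of (p_i, q_i*w_i) pairs, one per query term.
    Returns the coefficients of prod_i (p_i * X**(q_i*w_i) + (1 - p_i))."""
    poly = {0.0: 1.0}
    for p, w in term_params:
        new_poly = defaultdict(float)
        for exp, coef in poly.items():
            new_poly[exp + w] += coef * p          # term present in the document
            new_poly[exp] += coef * (1.0 - p)      # term absent
        poly = dict(new_poly)
    return poly

def estimate_nodoc(term_params, n_docs, threshold):
    """Estimated number of documents with similarity > threshold."""
    poly = generating_function(term_params)
    return n_docs * sum(c for exp, c in poly.items() if exp > threshold)

# Slide example: p1=0.2, w1=2, p2=0.4, w2=1, q=(1,1), 10 documents
print(estimate_nodoc([(0.2, 2), (0.4, 1)], 10, 1))   # 10*(0.08 + 0.12) = 2.0
```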
Estimating NoDoc(q, D, T)
Basic Approach (continued)
Consider query q = (q1, ..., qr).
Proposition. If the terms are independent and the weight of term ti whenever present in a document is wi (the average weight), 1 ≤ i ≤ r, then the coefficient of X^s in the following generating function is the probability that a document in D has similarity s with q:

  Π_{i=1..r} [ pi * X^(wi*qi) + (1 - pi) ]
Estimating NoDoc(q, D, T)
Subrange-based Approach [MLYW99]
- overcomes the uniform term weight assumption
- additional information for term ti:
  - σi: standard deviation of the weights of ti in all documents
  - mnwi: maximum normalized weight of ti
Estimating NoDoc(q, D, T)
Example: weights of term ti: 4, 4, 1, 1, 1, 1, 0, 0, 0, 0
- generating function (factor) using the average weight: 0.6*X^2 + 0.4
- a more accurate factor using subranges of weights: 0.2*X^4 + 0.4*X + 0.4
In general, weights are partitioned into k subranges:

  pi1*X^mi1 + ... + pik*X^mik + (1 - pi)

- Probability pij and median mij can be estimated using σi and the average weight of ti.
- A special implementation: use the maximum normalized weight as the first subrange by itself.
Estimating NoDoc(q, D, T)
Combined-term Approach [LYMW99]
- relaxes the term independence assumption
Example: Consider the query: Chinese medicine. Suppose the generating functions are:
- Chinese: 0.1*X^3 + 0.3*X + 0.6
- medicine: 0.2*X^2 + 0.4*X + 0.4
- Chinese medicine (under independence): 0.02*X^5 + 0.04*X^4 + 0.1*X^3 + ...
- "Chinese medicine" (as a combined term): 0.05*X^w + ...
Estimating NoDoc(q, D, T)
Criteria for combining "Chinese" and "medicine":
- The maximum normalized weight of the combined term is higher than the maximum normalized weight of each of the two individual terms (w > 3);
- The sum of the estimated probabilities of terms with exponent w under the term independence assumption is very different from 1/N, where N is the number of documents in the database;
- They are adjacent terms in previous queries.
Database Selection Using msim(q,D)
Optimal Ranking of Databases [YLWM99b]
- User: for query q, find the m most similar documents, or the documents with the m largest degrees of relevance
- Definition: Databases [D1, D2, ..., Dp] are optimally ranked with respect to q if there exists a k such that each of the databases D1, ..., Dk contains one of the m most similar documents, and all of these m documents are contained in these k databases.
Database Selection Using msim(q,D)
Optimal Ranking of Databases
Example: For a given query q:
- D1: d1: 0.8, d2: 0.5, d3: 0.2, ...
- D2: d9: 0.7, d2: 0.6, d10: 0.4, ...
- D3: d8: 0.9, d12: 0.3, ...
- other databases have documents with small similarities
When m = 5: pick D1, D2, D3
Database Selection Using msim(q,D)
Proposition: Databases [D1, D2, ..., Dp] are optimally ranked with respect to a query q if and only if msim(q, Di) ≥ msim(q, Dj) for i < j.
Example:
- D1: d1: 0.8, ...
- D2: d9: 0.7, ...
- D3: d8: 0.9, ...
Optimal rank: [D3, D1, D2, ...]
Estimating msim(q, D)
Use the subrange-based or combined-term method.
Example: Suppose there are 100 documents in a database. For query q, the generating function is:

  0.002*X^4 + 0.009*X^3 + ...

Since 100*(0.002 + 0.009) ≈ 1, the global similarity of the most similar document is estimated to be 3.
Weaknesses of this approach:
- requires large storage for the database representative
- exponential computation complexity
Estimating msim(q, D)
A more efficient method
- global database representative: global dfi of term ti
- local database representative:
  - anwi: average normalized weight of ti
  - mnwi: maximum normalized weight of ti
Example: term ti: d1 0.3, d2 0.4, d3 0, d4 0.7
  anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35
  mnwi = 0.7
Estimating msim(q, D)
A more efficient method (continued)
- term weighting scheme: query term: tf*gidf; document term: tf
- query q = (q1, q2):

  msim(q, D) = max { q1*gidf1*mnw1 + q2*gidf2*anw2 ,  q2*gidf2*mnw2 + q1*gidf1*anw1 }

- linear computation complexity
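A sketch of this estimate, generalizing the two-term formula above to k query terms (one term contributes its maximum normalized weight, the others their average normalized weights; names are illustrative):

```python
def estimate_msim(q, gidf, mnw, anw):
    """msim(q, D) ~ max over i of q[i]*gidf[i]*mnw[i] + sum_{j != i} q[j]*gidf[j]*anw[j]."""
    total_avg = sum(q[j] * gidf[j] * anw[j] for j in range(len(q)))
    best = 0.0
    for i in range(len(q)):
        best = max(best, q[i] * gidf[i] * mnw[i] + total_avg - q[i] * gidf[i] * anw[i])
    return best

# Two-term query: max{q1*gidf1*mnw1 + q2*gidf2*anw2, q2*gidf2*mnw2 + q1*gidf1*anw1}
print(estimate_msim([1.0, 1.0], [2.0, 1.5], [0.7, 0.6], [0.35, 0.2]))   # 1.7
```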
Estimating msim(q, D)
Combine terms to improve estimation accuracy.
Restrictions for combining terms ti and tj into tij:
- ti and tj are adjacent query terms
- mnwij > max { mnwi + anwj , mnwj + anwi }
Given a query having ti, tj and tk in this order, decide which terms to combine if they should be combined. Combine ti and tj if

  mnwij > max { mnwi + anwj , mnwj + anwi }

and

  mnwij - max { mnwi + anwj , mnwj + anwi } > mnwkj - max { mnwk + anwj , mnwj + anwk }
Learning-based Approaches
Use past retrieval experiences to determine usefulness
Assume no or little global database or local database statistics
Static learning : learning based on static training queries
Dynamic learning : learning based on evaluated user queries
Combined learning: learned knowledge based on training queries will be adjusted based on user queries
Static Learning
Example: MRDD (Modeling Relevant Document Distribution) [VoGJ95]
- record the result of each training query for each local database:
  - <r1, ..., rs>: ri indicates the minimum number of top-ranked documents to retrieve in order to obtain i relevant documents
  - <2, 5, ...>: need to retrieve 2 documents in order to obtain 1 relevant document, 5 to obtain 2, ...
MRDD (continued)
For a new query:
- identify the k most similar training queries
- obtain the average distribution vector over the k training queries for each database
- use these vectors to determine which databases to search and which documents to retrieve so as to maximize precision
Example: Suppose for query q, three average distribution vectors are obtained:
- D1: <1, 4, 6, 7, 10, 12, 17>
- D2: <1, 5, 7, 9, 15, 20>
- D3: <2, 3, 6, 9, 11, 16>
To retrieve two relevant documents: select D1 and D2 (one document from each).
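A rough greedy reconstruction of this allocation step (not necessarily the exact MRDD procedure): repeatedly take the next relevant document from whichever database needs the fewest additional retrievals for it.

```python
def mrdd_allocate(distributions, wanted_relevant):
    """distributions[db][i-1] = number of top-ranked documents to retrieve from
    db to obtain i relevant documents.  Returns {db: documents to retrieve}."""
    got = {db: 0 for db in distributions}        # relevant documents planned per database
    retrieve = {db: 0 for db in distributions}   # documents to retrieve per database
    for _ in range(wanted_relevant):
        costs = {db: r[got[db]] - retrieve[db]
                 for db, r in distributions.items() if got[db] < len(r)}
        if not costs:
            break
        best = min(costs, key=costs.get)
        got[best] += 1
        retrieve[best] = distributions[best][got[best] - 1]
    return retrieve

dists = {"D1": [1, 4, 6, 7, 10, 12, 17], "D2": [1, 5, 7, 9, 15, 20], "D3": [2, 3, 6, 9, 11, 16]}
print(mrdd_allocate(dists, 2))   # {'D1': 1, 'D2': 1, 'D3': 0} -- matches the slide
```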
Dynamic Learning
Example: SavvySearch [DrHo97]
- database representative: a weight wi and cfi for each term ti, and two penalty values ph and pr for each database D
  - wi: indicates how well D responds to query term ti
  - cfi: number of databases containing ti
  - ph: penalty if the average number of hits h returned for the most recent five queries is below a threshold Th
    ph = (Th - h)^2 / Th^2
  - pr: penalty if the average response time r for the most recent five queries exceeds a threshold Tr
    pr = (r - Tr)^2 / (45 - Tr)^2
SavvySearch (continued)
Update of wi
- initially zero
- reduce by 1/k if no document is retrieved for a k-term query containing ti
- increase by 1/k if some returned document is read
Compute the ranking score of database D for query q = (t1, ..., tk):

  r(q, D) = [ Σ_{i=1..k} wi * log(N/cfi) ] / [ Σ_{i=1..k} |wi| ] - (ph + pr)

(N: total number of databases)
Combined Learning
Example: ProFusion [FaGa99]
Phase 1: Static Learning
- 13 categories/concepts are utilized
- training queries in each category are selected
- relevance assessments for each query are used to compute the average score of each local database with respect to each category

  category   D1    D2    ...   Dn
  C1         0.3   0.1   ...   0.2
  ...        ...   ...   ...   ...
  C13        0     0.4   ...   0.1
ProFusion (continued)
Phase 2: Database Selection and Dynamic Learning
- Each user query is mapped to one or more categories
- Databases are selected based on accumulated scores over the involved categories
Example: Suppose query q is mapped to C1, C4, C5

  category     D1    D2    D3    D4
  C1           0.2   0     0.1   0.3
  C4           0.1   0.2   0     0
  C5           0     0.4   0.3   0.2
  total score  0.3   0.6   0.4   0.5
ProFusion (continued)
- Each retrieved document from all selected databases is re-ranked based on the product of the local similarity of the document and the score of its database.
- If the first document clicked by the user is not the top-ranked one:
  - increase the score of the database that produced the document in the related categories
  - decrease the scores of the other searched databases in the related categories
Other Database Selection Techniques
- incorporating ranks [YMLW99a]
- query expansion [XuCa98]
- use of lightweight queries [HaTh99]
  - shorter, and not evaluated like regular queries
- use of representative hierarchies [YMLW99b]
Document Selection
Goal: Select all globally most similar documents from a selected local search engine while minimizing the retrieval of useless documents.
General approaches:
- determine the number k of documents to retrieve from a local search engine, and then retrieve the k documents with the largest local similarities from the search engine
- determine a local threshold for the local database, and retrieve the documents whose local similarities exceed the threshold
* The two approaches are equivalent.
Solution Classification
Local Determination
- all locally retrieved documents will be returned
- Examples: NCSTRL, Search Broker [MaBi97]
User Determination
- the global user determines how many documents should be retrieved from each local database
- neither effective nor practical when the number of databases is large
- Examples: MetaCrawler [SeEt97], SavvySearch [DrHo97]
Solution Classification (continued)
Weighted Allocation
- retrieve proportionally more documents from local databases that are ranked higher
Learning-based Approaches
- use past retrieval experience for selection
Guaranteed Retrieval
- aimed at guaranteeing the retrieval of the globally most similar documents
Weighted Allocation
Suppose m documents are to be retrieved from N local databases.
Example 1: CORI net [CaLC95]
- Retrieve m * 2*(1 + N - i) / (N*(N + 1)) documents from the ith ranked local database.
Example 2: D-WISE [YuLe97]
- Let ri be the ranking score of local database Di.
- Retrieve m * ri / Σ_{k=1..N} rk documents from Di.
When retrieving k documents from local database Di, the k documents with the largest local similarities are retrieved from Di.
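A small sketch of the two allocation formulas above (rounding is only for illustration):

```python
def cori_allocation(m, N):
    """CORI net: m * 2*(1 + N - i) / (N*(N + 1)) documents from the i-th ranked database."""
    return [round(m * 2 * (1 + N - i) / (N * (N + 1))) for i in range(1, N + 1)]

def dwise_allocation(m, scores):
    """D-WISE: m * r_i / sum_k r_k documents from database D_i with ranking score r_i."""
    total = sum(scores)
    return [round(m * r / total) for r in scores]

print(cori_allocation(20, 4))                  # [8, 6, 4, 2]: more from higher-ranked databases
print(dwise_allocation(20, [0.5, 0.3, 0.2]))   # [10, 6, 4]
```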
Learning-based Approaches
Determine the number of documents to retrieve from a local database based on past retrieval experiences with the local database.
Example: MRDD [VoGJ95]
For query q, three average distribution vectors are obtained:
- D1: <1, 4, 6, 7, 10, 12, 17>
- D2: <1, 5, 7, 9, 15, 20>
- D3: <2, 3, 6, 9, 11, 16>
To retrieve four relevant documents: retrieve 1 document from D1, 1 from D2 and 3 from D3.
Guaranteed Retrieval
Aim at
- guaranteeing that all potentially useful documents with respect to a query are retrieved
- minimizing the retrieval of useless documents
Two cases:
- case 1: a global similarity threshold is known
- case 2: the number of globally desired documents is known
The two cases are mutually translatable.
Case 1: Global Similarity Threshold GT Is Known
Find all documents whose global similarities are ≥ GT.
Technique 1: Query modification [MLYW98]
- Modify q to q' such that Gsim(q, d) = Lsim(q', d)
- find all documents whose local similarities with q' are ≥ GT
Example: q = (q1, q2); d = (d1, d2);

  Gsim(q, d) = gidf1*q1*d1 + gidf2*q2*d2
  Lsim(q, d) = lidf1*q1*d1 + lidf2*q2*d2
  q' = (gidf1/lidf1 * q1, gidf2/lidf2 * q2)
  Lsim(q', d) = lidf1*(gidf1/lidf1)*q1*d1 + lidf2*(gidf2/lidf2)*q2*d2 = Gsim(q, d)
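A tiny sketch of Technique 1, assuming both similarity functions are idf-weighted dot products as in the example above (the values are made up):

```python
def modify_query(q, gidf, lidf):
    """Scale each query weight by gidf/lidf so that the local engine's similarity
    with q' equals the global similarity with q."""
    return [qi * g / l for qi, g, l in zip(q, gidf, lidf)]

q, gidf, lidf, d = [1.0, 2.0], [3.0, 1.5], [2.0, 3.0], [0.4, 0.2]
q_prime = modify_query(q, gidf, lidf)                                # [1.5, 1.0]
gsim = sum(g * qi * di for g, qi, di in zip(gidf, q, d))             # Gsim(q, d)
lsim = sum(l * qpi * di for l, qpi, di in zip(lidf, q_prime, d))     # Lsim(q', d)
print(gsim, lsim)   # equal by construction (1.8, 1.8)
```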
Case 1: Global Similarity Threshold GT Is Known
Technique 2 [MLYW98]: find the largest local threshold LT such that Gsim(q, d) ≥ GT implies Lsim(q, d) ≥ LT
- retrieve d such that Lsim(q, d) ≥ LT to form set S
- transmit d from S if Gsim(q, d) ≥ GT
Example:
        Gsim(q, d)   Lsim(q, d)
  d1    0.8          0.7
  d2    0.75         0.35
  ...
  d3    0.4          0.6
If d2 is desired, then LT can be no higher than 0.35. If GT = 0.6, d3 will not be transmitted.
Transmit m documents from each local database.
Case 1: Global Similarity Threshold GT Is Known
Define the tightest local threshold: LT = min_d { Lsim(q, d) | Gsim(q, d) ≥ GT }
Determining LT:
- if both Gsim and Lsim are linear functions, apply linear programming;
- otherwise, try the Lagrange multiplier method.
Case 1: Global Similarity Threshold GT Is Known
Example: Gsim(q, d) = Cosine(qG, d), Lsim(q, d) = Cosine(qL, d)

  LT = min_d { Cosine(qL, d) | Cosine(qG, d) ≥ GT }
     = Cosine(θ + θ1)   when qG, qL and d are in the same plane
     = GT * Cosine(θ1) - sin(θ) * sin(θ1)

where θ is the angle between qG and d with Cosine(θ) = GT, and θ1 is the angle between qG and qL.

[Figure: vectors qL, qG and d, with θ1 the angle between qL and qG]
Case 2: Number of Globally Desired Documents Is Known
Solution:
- rank the databases optimally for the given query q
- retrieve documents from the databases in the optimal order
Case 2: Number of Globally Desired Documents Is Known
Algorithm OptDocRetrv [YLWM99]
while fewer than m documents have been obtained do
  1. select the next database in the order
  2. compute the actual similarity of its most similar document
  3. find the minimum min_sim of the actual similarities of the most similar documents of the selected databases
  4. select documents from each selected database whose actual global similarities are ≥ min_sim
end loop
Sort the documents in descending order of similarity and present the top m to the user.
Case 2: Number of Globally Desired Documents Is Known
Example: Number of documents desired = 4. Databases are ranked in the order D1, D2, D3, D4, ...
- D1: d1: 0.53, d2: 0.48, d3: 0.39, ...
- D2: d10: 0.47, d21: 0.43, d52: 0.42, ...
- D3: d23: 0.54, d42: 0.49, ...
- D4: d33: 0.40, ...
select D1, min_sim = 0.53: result = { d1 }
select D2, min_sim = 0.47: result = { d1, d2, d10 }
select D3, min_sim = 0.47: result = { d1, d2, d10, d23, d42 }
result to user = { d1, d2, d23, d42 }
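A compact sketch of OptDocRetrv, assuming each selected database can report its documents with their actual global similarities in descending order:

```python
def opt_doc_retrv(ranked_dbs, m):
    """ranked_dbs: databases in (estimated) optimal order; each database is a
    list of (doc, actual global similarity) sorted in descending similarity."""
    result, min_sim, used = {}, None, []
    for db in ranked_dbs:
        if len(result) >= m:
            break
        used.append(db)
        top_sim = db[0][1]                          # most similar document of this database
        min_sim = top_sim if min_sim is None else min(min_sim, top_sim)
        result = {doc: s for selected in used for doc, s in selected if s >= min_sim}
    return sorted(result.items(), key=lambda x: -x[1])[:m]

# Slide example (m = 4)
D1 = [("d1", 0.53), ("d2", 0.48), ("d3", 0.39)]
D2 = [("d10", 0.47), ("d21", 0.43), ("d52", 0.42)]
D3 = [("d23", 0.54), ("d42", 0.49)]
D4 = [("d33", 0.40)]
print(opt_doc_retrv([D1, D2, D3, D4], 4))   # d23, d1, d42, d2
```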
Case 2: Number of Globally Desired Documents Is Known
Proposition: If databases are optimally ranked, then all the m globally most similar documents will be retrieved by algorithm OptDocRetrv.
Proposition: For any single-term query, all the
m globally most similar documents will be
retrieved by algorithm OptDocRetrv.
Result Merging
Goal: Merge the returned documents from multiple sources into a single ranked list.
Difficulties:
- local similarities are usually not comparable due to
  - different similarity functions
  - different term weighting schemes
  - different statistical values, e.g., global idf vs. local idf
- local similarities may be unavailable to the metasearch engine (only ranks are provided)
Ideal rank: in non-increasing order of global similarities
Solution Classification
- similarity normalization: normalize all local similarities into a common fixed range to improve comparability
- similarity adjustment: adjust local similarities/ranks based on the quality of the local databases
- global similarity computation: aim at obtaining the actual global similarities
Merge based on the normalized/adjusted/computed similarities.
Similarity Normalization
Example 1: MetaCrawler [SeEt97]
- map all local similarities into [0, 1000]
  - map the largest local similarity from each source to 1000
  - map the other local similarities proportionally
- add the normalized local similarities for documents retrieved from multiple sources

                      D1                    D2
                      d1    d2    d3        d1    d4    d5
  local similarity:   100   200   400       0.3   0.2   0.5
  normalized:         250   500   1000      600   400   1000
  final similarity:   d1: 850, d2: 500, d3: 1000, d4: 400, d5: 1000
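A rough sketch of this normalization-and-merge step (assuming each source returns a dict of document → local similarity):

```python
def merge_normalized(result_lists, scale=1000.0):
    """Scale the top local similarity of each source to `scale`, the rest
    proportionally, and add the scores of documents returned by several sources."""
    merged = {}
    for results in result_lists:
        top = max(results.values())
        for doc, sim in results.items():
            merged[doc] = merged.get(doc, 0.0) + scale * sim / top
    return sorted(merged.items(), key=lambda x: -x[1])

# Slide example
D1 = {"d1": 100, "d2": 200, "d3": 400}
D2 = {"d1": 0.3, "d4": 0.2, "d5": 0.5}
print(merge_normalized([D1, D2]))   # d3: 1000, d5: 1000, d1: 850, d2: 500, d4: 400
```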
Similarity Normalization
Example 2: SavvySearch [DrHo97]
- same as MetaCrawler except using the range [0, 1]
- documents with no local similarities are assigned 0.5
Retrieval based on Multiple Evidence
- a normalized similarity between 0 and 1 can be considered as a confidence that a document is useful
- let si be the confidence of source i that document d is useful to query q
- estimate the overall confidence that d is useful: S(d, q) = 1 - (1 - s1)*...*(1 - sk)
Example: s1 = 0.7, s2 = 0.8  →  S(d, q) = 1 - 0.3*0.2 = 0.94
Similarity Adjustment
Use the local similarity of d and the ranking score of its database to estimate the global similarity of d.
- database ranking score: the higher the better
Example: CORI net [CaLC95]
- assign the following weight to database D:

  w(D) = 1 + N * (r - r') / r'

  - r: ranking score of D with respect to q
  - r': average of the scores of the searched databases
  - N: number of local databases searched
- adjust the local similarity s of document d in D to s*w(D)
A similar approach is employed in ProFusion [GaWG96].
Similarity Adjustment
Use the local rank of d and the ranking score of its database to estimate the global similarity of d.
Example: D-WISE [YuLe97]

  Gsim(q, d) = 1 - (r - 1) * Rmin / (m * Ri)

- Ri: ranking score of database Di
- Rmin: lowest database ranking score
- r: local rank of document d from Di
- m: total number of documents desired
Observation: the top-ranked document from any database has the same global similarity.
D-WISE (continued)
Example: R1 = 0.3, R2 = 0.7, Rmin = 0.2, m = 4

  Gsim(q, d) = 1 - (r - 1) * 0.2 / (4 * Ri)

        D1                  D2
        r    Gsim           r    Gsim
  d1    1    1.0      d1'   1    1.0
  d2    2    0.83     d2'   2    0.93
  d3    3    0.67     d3'   3    0.86

More documents from databases with higher ranking scores have higher global similarities.
Global Similarity Computation
Technique 1: Document Fetching (e.g., E2RD2, ParaCrawler)
- fetch the documents to the metasearch engine
- collect the desired statistics (tf, idf, ...)
- compute global similarities
Problem: may not scale well.
Global Similarity Computation
Technique 2: Knowledge Discovery
- discover the similarity functions and term weighting schemes used in the different search engines
- use the discovered knowledge to determine
  - which local similarities are reasonably comparable
  - how to adjust local similarities to make them more comparable
  - how to compute/estimate global similarities
Knowledge Discovery (continued)
Example: All local search engines selected for a query employ the same methods for indexing local documents and computing local similarities.
- If idf information is not used: the local similarities are comparable.
- If idf information is used and q has a single term t:

  Lsim(q, d) = [tft(q) * lidft * tft(d)] / (|q|*|d|) = [lidft * tft(d)] / |d|
  Gsim(q, d) = [gidft * tft(d)] / |d|
  Gsim(q, d) = Lsim(q, d) * gidft / lidft
Knowledge Discovery (continued)
Example (continued)
If idf information is used and q has terms t1, ..., tk:

  Gsim(q, d) = [ Σ_{i=1..k} tfti(q) * gidfti * tfti(d) ] / (|q| * |d|)
             = Σ_{i=1..k} [tfti(q) / |q|] * gidfti * [tfti(d) / |d|]

  tfti(d) / |d| can be determined by using ti as a single-term query.
Knowledge Discovery (continued)
Submit ti as a single-term query q(ti) and let

  si = Lsim(q(ti), d) = [tfti(q(ti)) * lidfti * tfti(d)] / (|q(ti)| * |d|)

Then

  tfti(d) / |d| = si * |q(ti)| / (tfti(q(ti)) * lidfti)
New Challenges
Incorporate new search techniques into metasearch:
- document ranks in Google
- Kleinberg's hub and authority scores
- tag information in HTML documents
- implicit user feedback on previous retrievals
- pseudo relevance feedback on previous retrievals
- use of user profiles
Integrate local systems supporting different query types:
- less research on boolean queries, proximity queries and hierarchical queries
New Challenges (continued)
Develop techniques to discover knowledge (representatives, ranking algorithms) about local search engines more accurately and more efficiently.
- some search engines may be unwilling to provide the desired representatives, or may provide inaccurate representatives
- indexing techniques, term weighting schemes and similarity functions are typically proprietary
Develop a standard guideline on what information each search engine should provide to a metasearch engine (some efforts: STARTS, Dublin Core).
New Challenges (continued)
Distributed implementation of a metasearch engine:
- alternative ways to store local database representatives?
- how to perform database selection and document selection at multiple sites in parallel?
Scale to a million databases:
- storage of database representatives
- fast algorithms for database selection, document selection and result merging
- efficient network utilization
New Challenges (continued)
Standard testbed for evaluation:
- need a large number of local databases
- documents should have links for computing ranks, hub and authority scores
- a large number of typical Internet queries
- relevance assessments of documents for each query
Go beyond text databases:
- how to extend to databases containing text, images, video, audio, structured data?