VLDB'99 Tutorial: Metasearch Engines: Solutions and Challenges
VLDB'99 TUTORIAL
Metasearch Engines: Solutions and Challenges
Clement Yu, Dept. of EECS, U. of Illinois at Chicago, Chicago, IL 60607
Weiyi Meng, Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY 13902
[email protected]
The Problem
[Diagram: search engines 1..n, each searching its own text source 1..n]
How am I going to find the 5 best pages on “Internet Security”?
Metasearch Engine Solution
[Diagram: user and user interface; a query dispatcher sends the query to search engines 1..n over text sources 1..n; a result merger returns the merged result to the user]
Some Observations
- most sources are not useful for a given query
- sending a query to a useless source would:
  - incur unnecessary network traffic
  - waste local resources for evaluating the query
  - increase the cost of merging the results
- retrieving too many documents from a source is inefficient
A More Efficient Metasearch Engine
[Diagram: user and user interface; a database selector, document selector, query dispatcher and result merger sit between the user interface and search engines 1..n over text sources 1..n]
Tutorial Outline
1. Introduction to Text Retrieval (Vector Space Model only)
2. Search Engines on the Web
3. Introduction to Metasearch Engine
4. Database Selection
5. Document Selection
6. Result Merging
7. New Challenges
Introduction to Text Retrieval (1)
Document representation
- remove stopwords: of, the, ...
- stemming: stemming → stem
- d = (d1, ..., di, ..., dn)
- di: weight of the ith term in d
- tf*idf formula for computing di

Example: consider term t of document d in a database of N documents.
- tf weight of t in d (if tf > 0): 0.5 + 0.5*tf/max_tf
- idf weight of t: log(N/df)
- weight of t in d: (0.5 + 0.5*tf/max_tf)*log(N/df)
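As a rough illustration (not from the original slides; the function name and arguments are hypothetical), the tf*idf weight above can be computed as follows:

```python
import math

def tfidf_weight(tf, max_tf, df, num_docs):
    """Weight of term t in document d using the slide's formula:
    (0.5 + 0.5*tf/max_tf) * log(N/df); 0 if the term does not occur."""
    if tf == 0 or df == 0:
        return 0.0
    tf_part = 0.5 + 0.5 * tf / max_tf      # normalized term frequency
    idf_part = math.log(num_docs / df)     # inverse document frequency
    return tf_part * idf_part

# e.g., tf = 3, max_tf = 5, df = 10, N = 1000
print(tfidf_weight(3, 5, 10, 1000))
```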
Introduction to Text Retrieval (2)
Query representation
- q = (q1, ..., qi, ..., qn)
- qi: weight of the ith term in q
- computing qi: tf weight only
- alternative: use idf weight for query terms rather than document terms
- query expansion (e.g., add related terms)
Introduction to Text Retrieval (3)
Similarity Functions
- simple dot product: sim(q, d) = Σ_{i=1..n} qi * di
  - favors long documents
- Cosine function: sim(q, d) = (Σ_{i=1..n} qi * di) / (|q| * |d|)
- other similarity functions exist
- normalized similarities: [0, 1.0]
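A minimal sketch of the two similarity functions above, assuming dense term-weight vectors of equal length (names are illustrative):

```python
import math

def dot_product_sim(q, d):
    """Simple dot product; favors long documents."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine_sim(q, d):
    """Dot product normalized by the vector lengths; in [0, 1] for
    non-negative weights."""
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot_product_sim(q, d) / norm if norm > 0 else 0.0

print(cosine_sim([1.0, 0.5, 0.0], [0.2, 0.4, 0.1]))
```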
Introduction to Text Retrieval (4)
Retrieval Effectiveness
- relevant documents: documents useful to the user of the query
- recall: percentage of relevant documents that are retrieved
- precision: percentage of retrieved documents that are relevant

[Figure: recall-precision curve]
Search Engines on the Web (1)
Search engine as a document retrieval system
- no control over the web pages that can be searched
- web pages have rich structures and semantics
- web pages are extensively linked
- additional information for each page (time last modified, organization publishing it, etc.)
- databases are dynamic and can be very large
- few general-purpose search engines and numerous special-purpose search engines
Search Engines on the Web (2)
New indexing techniques
- partial-text indexing to improve scalability
- ignore and/or discount spamming terms
- use anchor terms to index linked pages
  - e.g.: WWWW [McBr94], Google [BrPa98], Webor [CSM97]

[Example: Page 1 contains the anchor text "... airplane ticket and hotel ..." linking to Page 2: http://travelocity.com/]
Search Engines on the Web (3)
New term weighting schemes
- higher weights to terms enclosed by special tags
  - title (SIBRIS [WaWJ89], Altavista, HotBot, Yahoo)
  - special fonts (Google [BrPa98])
  - special fonts & tags (LASER [BoFJ96])
- Webor [CSM97] approach
  - partition tags into disjoint classes (title, header, strong, anchor, list, plain text)
  - assign different importance factors to terms in different classes
  - determine optimal importance factors
Search Engines on the Web (4)
New document ranking methods
- Vector Spreading Activation [YuLe96]: add a fraction of the parents' similarities
- Example: Suppose for query q: sim(q, d1) = 0.4, sim(q, d2) = 0.2, sim(q, d3) = 0.2
  - final score of d3 = 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26

[Diagram: pages d1 and d2 link to d3]
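A small sketch of the vector spreading activation idea, assuming a hypothetical `parents` map from each page to the pages that link to it and a fraction of 0.1:

```python
def spread_activation(sims, parents, fraction=0.1):
    """Add a fraction of each parent's similarity to a page's own similarity."""
    return {page: sim + fraction * sum(sims.get(p, 0.0) for p in parents.get(page, []))
            for page, sim in sims.items()}

# Slide example: d1 and d2 both link to d3
sims = {"d1": 0.4, "d2": 0.2, "d3": 0.2}
parents = {"d3": ["d1", "d2"]}
print(spread_activation(sims, parents)["d3"])   # 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26
```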
Search Engines on the Web (5)
New document ranking methods
- combine similarity with rank
  - PageRank [PaBr98]: an important page is linked to by many pages and/or by important pages
- combine similarity with authority score
  - authority [Klei98]: an important content page is highly linked to among the initially retrieved pages and their neighbors
Introduction to Metasearch Engine (1)
An Example
- Query: Internet Security
- Databases: NYT ..., WP ..., DB ..., DB ...
- Retrieved results: t1, t2, ...; p1, p2, ...
- Merged results: p1, t1, ...
Introduction to Metasearch Engine (2)
Database Selection Problem
- Select potentially useful databases for a given query
- essential if the number of local databases is large
  - reduces network traffic
  - avoids wasting local resources
Introduction to Metasearch Engine (3)
- Potentially useful database: contains potentially useful documents
- Potentially useful documents:
  - global similarity above a threshold, or
  - global similarity among the m highest
- Need some knowledge about each database in advance in order to perform database selection → Database Representative
Introduction to Metasearch Engine (4)
Document Selection Problem
- Select potentially useful documents from each selected local database efficiently
Step 1: Retrieve all potentially useful documents while minimizing the retrieval of useless documents
- go from the global similarity threshold to the tightest local similarity threshold
- want all d with Gsim(q, d) > GT
- retrieve d from DBk with Lsim(q, d) > LTk
- LTk is the largest value such that Gsim(q, d) > GT implies Lsim(q, d) > LTk
Introduction to Metasearch Engine (5)
Efficient Document Selection
Step 2: Transmit all potentially useful documents to the result merger while minimizing the transmission of useless documents
- further filtering to reduce transmission cost and merging cost

[Example diagram: local DBk retrieves d1, ..., ds; filtering selects d2, d7, d10 for transmission]
Introduction to Metasearch Engine (6)
Result Merging Problem
- Objective: Merge returned documents from multiple sources into a single ranked list.
- Difficulty: Local document similarities may be incomparable or not available.
- Solutions: Generate "global similarities" for ranking.

[Diagram: DB1 returns d11, d12, ...; DBN returns dN1, dN2, ...; the merger outputs d12, d54, ...]
Introduction to Metasearch Engine (7)
An Ideal Metasearch Engine:
- Retrieval effectiveness: same as if all documents were in the same collection.
- Efficiency: optimize the retrieval process.
Implications: should aim at
- selecting only useful search engines
- retrieving and transmitting only useful documents
- ranking documents according to their degrees of relevance
Introduction to Metasearch Engine (8)
Main Sources of Difficulties: [MYL99]
- autonomy of local search engines
  - design autonomy
  - maintenance autonomy
- heterogeneities among local search engines
  - indexing method
  - document/query term weighting schemes
  - similarity/ranking function
  - document database
  - document version
  - result presentation
Introduction to Metasearch Engine (9)
Impact of Autonomy and Heterogeneities [MLY99]
- unwilling to provide database representatives, or provide different types of representatives
- difficult to find potentially useful documents
- difficult to merge documents from multiple sources
Database Selection: Basic Idea
Goal: Identify potentially useful databases for each user query.
General approach:
- use a representative to indicate approximately the content of each database
- use these representatives to select databases for each query
Diversity of solutions:
- different types of representatives
- different algorithms using the representatives
Solution Classification
- Naive Approach: select all databases (e.g., MetaCrawler, NCSTRL)
- Qualitative Approaches: estimate the quality of each local database
  - based on rough representatives
  - based on detailed representatives
- Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly
- Learning-based Approaches: database representatives are obtained through training or learning
Qualitative Approaches Using Rough Representatives
- typical representative: a few words or a few paragraphs in a certain format
- manual construction often needed
- can work well for special-purpose local search engines
- very scalable storage requirement
- selection can be inaccurate as the description is too rough
Qualitative Approaches Using Rough Representatives
Example 1: ALIWEB [Kost94]
- Representative has a fixed format, e.g., for a site containing files for the Perl language:
  Template-Type: DOCUMENT
  Title: Perl
  Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
  Keywords: perl, perl-faq, language
- user query can match against one or more fields
Qualitative Approaches Using Rough Representatives
Example 2: NetSerf [ChHa95]
- Representative has a WordNet-based structure, e.g., for a site of world facts listed by country:
  topic: country
    synset: [nation, nationality, land, country, a_people]
    synset: [state, nation, country, land, commonwealth, res_publica, body_politic]
    synset: [country, state, land, nation]
  info-type: facts
- user query is transformed into a similar structure before matching
Qualitative Approaches Using Detailed Representatives
- Use detailed statistical information for each term
- employ special measures to estimate the usefulness/quality of each search engine for each query
- the measures reflect the usefulness in a less direct/explicit way compared to those used in quantitative approaches
- scalability starts to become an issue
Qualitative Approaches Using Detailed Representatives
Example 1: gGlOSS [GrGa95]
- representative: (dfi, Wi) for term ti
  - dfi: document frequency of ti
  - Wi: the sum of the weights of ti over all documents
- database usefulness: sum of high similarities

  usefulness(q, D, T) = Σ_{d in D, sim(q, d) > T} sim(q, d)
gGlOSS (continued)
Suppose for query q , we have
D1 d11: 0.6, d12: 0.5
D2 d21: 0.3, d22: 0.3, d23: 0.2
D3 d31: 0.7, d32: 0.1, d33: 0.1
usefulness(q, D1, 0.3) = 1.1
usefulness(q, D2, 0.3) = 0.6
usefulness(q, D3, 0.3) = 0.7
gGlOSS (continued)
gGlOSS: usefulness is estimated for two cases
- high-correlation case: if dfi ≤ dfj, then every document having ti also has tj.

Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.

        actual weights        assumed (high-correlation) weights
        t1    t2    t3        t1    t2    t3
  d1    0.2   0.1   0.3       0.3   0.2   0.3
  d2    0.4   0.3   0.2       0.3   0.2   0.3
  d3    0     0.2   0.4       0     0.2   0.3
  d4    0     0     0.3       0     0     0.3

usefulness(q, D, 0.5) = W1 + W2 + df2*W3/df3 = 2.1
gGlOSS (continued)
- disjoint case: for any two query terms ti and tj, no document contains both ti and tj.

Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.

        actual weights        assumed (disjoint) weights
        t1    t2    t3        t1    t2    t3
  d1    0.2   0     0         0.25  0     0
  d2    0     0.2   0         0     0.2   0
  d3    0.3   0     0         0.25  0     0
  d4    0     0     0.4       0     0     0.4

usefulness(q, D, T) = Σ_{i : qi*Wi/dfi > T} qi*Wi
usefulness(q, D, 0.3) = W3 = 0.4
gGlOSS (continued)
Some observations
- usefulness depends on the threshold
- representative has two quantities per term
- strong assumptions are used
  - high-correlation tends to overestimate
  - disjoint tends to underestimate
  - the two estimates tend to form bounds on the sum of the similarities above T
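To make the two estimates concrete, here is a rough sketch (my own reconstruction from the (dfi, Wi) representative; it follows the worked examples above, including the threshold comparison used in the high-correlation example):

```python
def gloss_high_correlation(q, df, W, T):
    """High-correlation estimate: order terms by increasing df; a document
    containing a term is assumed to contain all terms with larger df.
    Sums the estimated similarities of documents reaching threshold T."""
    terms = sorted((i for i in range(len(q)) if df[i] > 0), key=lambda i: df[i])
    useful, prev_df = 0.0, 0
    for g, i in enumerate(terms):
        n_docs = df[i] - prev_df                               # documents in this group
        sim = sum(q[j] * W[j] / df[j] for j in terms[g:])      # estimated per-document similarity
        if n_docs > 0 and sim >= T - 1e-9:                     # small tolerance for float rounding
            useful += n_docs * sim
        prev_df = df[i]
    return useful

def gloss_disjoint(q, df, W, T):
    """Disjoint estimate: no document contains two query terms."""
    return sum(q[i] * W[i] for i in range(len(q)) if df[i] > 0 and q[i] * W[i] / df[i] > T)

# Examples from the slides
print(gloss_high_correlation([1, 1, 1], [2, 3, 4], [0.6, 0.6, 1.2], 0.5))  # ~2.1
print(gloss_disjoint([1, 1, 1], [2, 1, 1], [0.5, 0.2, 0.4], 0.3))          # 0.4
```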
Qualitative Approaches Using Detailed Representatives
Example 2: CORI Net [CaLC95]
- representative: (dfi, cfi) for term ti
  - dfi: document frequency of ti
  - cfi: collection frequency of ti (number of databases containing ti)
  - cfi can be shared by all databases
- database usefulness: usefulness(q, D) = sim(q, representative of D)
  - analogy: usefulness ↔ similarity, dfi ↔ tfi, cfi ↔ dfi
CORI Net (continued)
Some observations
- estimates are independent of the threshold
- representative has less than two quantities per term
- similarity is computed based on an inference network
- same method for ranking documents and ranking databases
Qualitative Approaches Using Detailed Representatives
Example 3: D-WISE [YuLe97]
- representative: dfi,j for term tj in database Di
- database usefulness: a measure of query term concentration in different databases

  usefulness(q, Di) = Σ_{j=1..k} CVVj * dfi,j

- k: number of query terms
- CVVj: cue validity variance of term tj across all databases; the larger CVVj is, the more useful tj is in distinguishing different databases
D-WISE (continued)
- ACVj: average cue validity of tj over all databases

  CVi,j = (dfi,j / ni) / (dfi,j / ni + (Σ_{k≠i} dfk,j) / (Σ_{k≠i} nk))

  ACVj = (1/N) * Σ_{i=1..N} CVi,j

  CVVj = (1/N) * Σ_{i=1..N} (CVi,j - ACVj)^2

- N: number of databases
- ni: number of documents in database Di

Observations:
- estimates independent of threshold
- representative has one quantity per term
- measure is difficult to understand
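An illustrative sketch of the CVV computation above (my own reconstruction; the df matrix and sizes n are hypothetical inputs):

```python
def cvv_usefulness(query_terms, df, n):
    """usefulness(q, Di) = sum_j CVVj * df[i][j], with CVVj the cue validity
    variance of term tj across databases.  df[i][j]: document frequency of
    term tj in database Di; n[i]: number of documents in Di."""
    N = len(n)
    cvv = {}
    for j in query_terms:
        cv = []
        for i in range(N):
            own = df[i][j] / n[i]
            others_df = sum(df[k][j] for k in range(N) if k != i)
            others_n = sum(n[k] for k in range(N) if k != i)
            denom = own + (others_df / others_n if others_n else 0.0)
            cv.append(own / denom if denom else 0.0)
        acv = sum(cv) / N                                # average cue validity
        cvv[j] = sum((c - acv) ** 2 for c in cv) / N     # cue validity variance
    return [sum(cvv[j] * df[i][j] for j in query_terms) for i in range(N)]

# Hypothetical example: 3 databases, query terms are term indices 0 and 1
print(cvv_usefulness([0, 1], df=[[5, 0], [1, 4], [2, 2]], n=[100, 100, 100]))
```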
Quantitative Approaches
Two types of quantities may be estimated with respect to query q:
- the number of documents in a database D with similarities higher than a threshold T:

  NoDoc(q, D, T) = |{ d : d ∈ D and sim(q, d) > T }|

- the global similarity of the most similar document in D:

  msim(q, D) = max_{d ∈ D} sim(q, d)

- can be used to rank databases in descending order of similarity (or any desirability measure)
Estimating NoDoc(q, D, T)
Basic Approach [MLYW98]
- representative: (pi, wi) for term ti
  - pi: probability that ti appears in a document
  - wi: average weight of ti among documents containing ti
- Example: normalized weights of ti in 10 documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6).
  pi = 0.6, wi = 0.4
Estimating NoDoc(q, D, T)
Basic Approach (continued)
Example: Consider query q = (1, 1). Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.
A generating function:

  (0.2*X^2 + 0.8) * (0.4*X + 0.6)
  = 0.08*X^3 + 0.12*X^2 + 0.32*X + 0.48

a*X^b: a is the probability that a document in D has similarity b with q
NoDoc(q, D, 1) = 10*(0.08 + 0.12) = 2
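A sketch of the generating-function computation (assuming independent terms, as in the basic approach); the dictionary maps an exponent (a similarity value) to its coefficient (a probability):

```python
from collections import defaultdict

def generating_function(term_params):
    """term_params: list of (p_i, q_i*w_i) pairs, one per query term.
    Returns the coefficients of prod_i (p_i * X**(q_i*w_i) + (1 - p_i))."""
    poly = {0.0: 1.0}
    for p, w in term_params:
        new_poly = defaultdict(float)
        for exp, coef in poly.items():
            new_poly[exp + w] += coef * p          # term present in the document
            new_poly[exp] += coef * (1.0 - p)      # term absent
        poly = dict(new_poly)
    return poly

def estimate_nodoc(term_params, n_docs, threshold):
    """Estimated number of documents with similarity > threshold."""
    poly = generating_function(term_params)
    return n_docs * sum(c for exp, c in poly.items() if exp > threshold)

# Slide example: p1=0.2, w1=2, p2=0.4, w2=1, q=(1,1), 10 documents
print(estimate_nodoc([(0.2, 2), (0.4, 1)], 10, 1))   # 10*(0.08 + 0.12) = 2.0
```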
Estimating NoDoc(q, D, T)
Basic Approach (continued)
Consider query q = (q1, ..., qr).
Proposition. If the terms are independent and the weight of term ti whenever present in a document is wi (the average weight), 1 ≤ i ≤ r, then the coefficient of X^s in the following generating function is the probability that a document in D has similarity s with q:

  Π_{i=1..r} [ pi * X^(wi*qi) + (1 - pi) ]
Estimating NoDoc(q, D, T)
Subrange-based Approach [MLYW99]
- overcomes the uniform term weight assumption
- additional information for term ti:
  - σi: standard deviation of the weights of ti in all documents
  - mnwi: maximum normalized weight of ti
Estimating NoDoc(q, D, T)
Example: weights of term ti: 4, 4, 1, 1, 1, 1, 0, 0, 0, 0
- generating function (factor) using the average weight: 0.6*X^2 + 0.4
- a more accurate factor using subranges of weights: 0.2*X^4 + 0.4*X + 0.4
In general, weights are partitioned into k subranges:

  pi1*X^mi1 + ... + pik*X^mik + (1 - pi)

- Probability pij and median mij can be estimated using σi and the average weight of ti.
- A special implementation: use the maximum normalized weight as the first subrange by itself.
Estimating NoDoc(q, D, T)
Combined-term Approach [LYMW99]
- relaxes the term independence assumption
Example: Consider the query: Chinese medicine. Suppose the generating functions are:
- Chinese: 0.1*X^3 + 0.3*X + 0.6
- medicine: 0.2*X^2 + 0.4*X + 0.4
- Chinese medicine (under independence): 0.02*X^5 + 0.04*X^4 + 0.1*X^3 + ...
- "Chinese medicine" (as a combined term): 0.05*X^w + ...
Estimating NoDoc(q, D, T)
Criteria for combining "Chinese" and "medicine":
- The maximum normalized weight of the combined term is higher than the maximum normalized weight of each of the two individual terms (w > 3);
- The sum of the estimated probabilities of terms with exponent w under the term independence assumption is very different from 1/N, where N is the number of documents in the database;
- They are adjacent terms in previous queries.
Database Selection Using msim(q,D)
Optimal Ranking of Databases [YLWM99b]
- User: for query q, find the m most similar documents, or the documents with the m largest degrees of relevance
- Definition: Databases [D1, D2, ..., Dp] are optimally ranked with respect to q if there exists a k such that each of the databases D1, ..., Dk contains one of the m most similar documents, and all of these m documents are contained in these k databases.
Database Selection Using msim(q,D)
Optimal Ranking of Databases
Example: For a given query q:
- D1: d1: 0.8, d2: 0.5, d3: 0.2, ...
- D2: d9: 0.7, d2: 0.6, d10: 0.4, ...
- D3: d8: 0.9, d12: 0.3, ...
- other databases have documents with small similarities
When m = 5: pick D1, D2, D3
Database Selection Using msim(q,D)
Proposition: Databases [D1, D2, ..., Dp] are optimally ranked with respect to a query q if and only if msim(q, Di) ≥ msim(q, Dj) for i < j.
Example:
- D1: d1: 0.8, ...
- D2: d9: 0.7, ...
- D3: d8: 0.9, ...
Optimal rank: [D3, D1, D2, ...]
Estimating msim(q, D)
Use the subrange-based or combined-term method.
Example: Suppose there are 100 documents in a database. For query q, the generating function is:

  0.002*X^4 + 0.009*X^3 + ...

Since 100*(0.002 + 0.009) ≈ 1, the global similarity of the most similar document is estimated to be 3.
Weaknesses of this approach:
- requires large storage for the database representative
- exponential computation complexity
Estimating msim(q, D)
A more efficient method
- global database representative: global dfi of term ti
- local database representative:
  - anwi: average normalized weight of ti
  - mnwi: maximum normalized weight of ti
Example: term ti: d1 0.3, d2 0.4, d3 0, d4 0.7
  anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35
  mnwi = 0.7
Estimating msim(q, D)
A more efficient method (continued)
- term weighting scheme: query term: tf*gidf; document term: tf
- query q = (q1, q2):

  msim(q, D) = max { q1*gidf1*mnw1 + q2*gidf2*anw2 ,  q2*gidf2*mnw2 + q1*gidf1*anw1 }

- linear computation complexity
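A sketch of this estimate, generalizing the two-term formula above to k query terms (one term contributes its maximum normalized weight, the others their average normalized weights; names are illustrative):

```python
def estimate_msim(q, gidf, mnw, anw):
    """msim(q, D) ~ max over i of q[i]*gidf[i]*mnw[i] + sum_{j != i} q[j]*gidf[j]*anw[j]."""
    total_avg = sum(q[j] * gidf[j] * anw[j] for j in range(len(q)))
    best = 0.0
    for i in range(len(q)):
        best = max(best, q[i] * gidf[i] * mnw[i] + total_avg - q[i] * gidf[i] * anw[i])
    return best

# Two-term query: max{q1*gidf1*mnw1 + q2*gidf2*anw2, q2*gidf2*mnw2 + q1*gidf1*anw1}
print(estimate_msim([1.0, 1.0], [2.0, 1.5], [0.7, 0.6], [0.35, 0.2]))   # 1.7
```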
Estimating msim(q, D)
Combine terms to improve estimation accuracy.
Restrictions for combining terms ti and tj into tij:
- ti and tj are adjacent query terms
- mnwij > max { mnwi + anwj , mnwj + anwi }
Given a query having ti, tj and tk in this order, decide which terms to combine if they should be combined. Combine ti and tj if

  mnwij > max { mnwi + anwj , mnwj + anwi }

and

  mnwij - max { mnwi + anwj , mnwj + anwi } > mnwkj - max { mnwk + anwj , mnwj + anwk }
Learning-based Approaches
Use past retrieval experiences to determine usefulness
Assume no or little global database or local database statistics
Static learning : learning based on static training queries
Dynamic learning : learning based on evaluated user queries
Combined learning: learned knowledge based on training queries will be adjusted based on user queries
Static Learning
Example: MRDD (Modeling Relevant Document Distribution) [VoGJ95]
- record the result of each training query for each local database:
  - <r1, ..., rs>: ri indicates the minimum number of top-ranked documents to retrieve in order to obtain i relevant documents
  - <2, 5, ...>: need to retrieve 2 documents in order to obtain 1 relevant document, 5 to obtain 2, ...
MRDD (continued)
For a new query:
- identify the k most similar training queries
- obtain the average distribution vector over the k training queries for each database
- use these vectors to determine which databases to search and which documents to retrieve so as to maximize precision
Example: Suppose for query q, three average distribution vectors are obtained:
- D1: <1, 4, 6, 7, 10, 12, 17>
- D2: <1, 5, 7, 9, 15, 20>
- D3: <2, 3, 6, 9, 11, 16>
To retrieve two relevant documents: select D1 and D2 (one document from each).
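A rough greedy reconstruction of this allocation step (not necessarily the exact MRDD procedure): repeatedly take the next relevant document from whichever database needs the fewest additional retrievals for it.

```python
def mrdd_allocate(distributions, wanted_relevant):
    """distributions[db][i-1] = number of top-ranked documents to retrieve from
    db to obtain i relevant documents.  Returns {db: documents to retrieve}."""
    got = {db: 0 for db in distributions}        # relevant documents planned per database
    retrieve = {db: 0 for db in distributions}   # documents to retrieve per database
    for _ in range(wanted_relevant):
        costs = {db: r[got[db]] - retrieve[db]
                 for db, r in distributions.items() if got[db] < len(r)}
        if not costs:
            break
        best = min(costs, key=costs.get)
        got[best] += 1
        retrieve[best] = distributions[best][got[best] - 1]
    return retrieve

dists = {"D1": [1, 4, 6, 7, 10, 12, 17], "D2": [1, 5, 7, 9, 15, 20], "D3": [2, 3, 6, 9, 11, 16]}
print(mrdd_allocate(dists, 2))   # {'D1': 1, 'D2': 1, 'D3': 0} -- matches the slide
```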
Dynamic Learning
Example: SavvySearch [DrHo97]
- database representative: a weight wi and cfi for each term ti, and two penalty values ph and pr for each database D
  - wi: indicates how well D responds to query term ti
  - cfi: number of databases containing ti
  - ph: penalty if the average number of hits h returned for the most recent five queries is below a threshold Th
    ph = (Th - h)^2 / Th^2
  - pr: penalty if the average response time r for the most recent five queries exceeds a threshold Tr
    pr = (r - Tr)^2 / (45 - Tr)^2
SavvySearch (continued)
Update of wi
- initially zero
- reduce by 1/k if no document is retrieved for a k-term query containing ti
- increase by 1/k if some returned document is read
Compute the ranking score of database D for query q = (t1, ..., tk):

  r(q, D) = [ Σ_{i=1..k} wi * log(N/cfi) ] / [ Σ_{i=1..k} |wi| ] - (ph + pr)

(N: total number of databases)
Combined Learning
Example: ProFusion [FaGa99]
Phase 1: Static Learning
- 13 categories/concepts are utilized
- training queries in each category are selected
- relevance assessments for each query are used to compute the average score of each local database with respect to each category

  category   D1    D2    ...   Dn
  C1         0.3   0.1   ...   0.2
  ...        ...   ...   ...   ...
  C13        0     0.4   ...   0.1
ProFusion (continued)
Phase 2: Database Selection and Dynamic Learning
- Each user query is mapped to one or more categories
- Databases are selected based on accumulated scores over the involved categories
Example: Suppose query q is mapped to C1, C4, C5

  category     D1    D2    D3    D4
  C1           0.2   0     0.1   0.3
  C4           0.1   0.2   0     0
  C5           0     0.4   0.3   0.2
  total score  0.3   0.6   0.4   0.5
ProFusion (continued)
- Each retrieved document from all selected databases is re-ranked based on the product of the local similarity of the document and the score of its database.
- If the first document clicked by the user is not the top-ranked one:
  - increase the score of the database that produced the document in the related categories
  - decrease the scores of the other searched databases in the related categories
Other Database Selection Techniques
- incorporating ranks [YMLW99a]
- query expansion [XuCa98]
- use of lightweight queries [HaTh99]
  - shorter, and not evaluated like regular queries
- use of representative hierarchies [YMLW99b]
Document Selection
Goal: Select all globally most similar documents from a selected local search engine while minimizing the retrieval of useless documents.
General approaches:
- determine the number k of documents to retrieve from a local search engine, and then retrieve the k documents with the largest local similarities from the search engine
- determine a local threshold for the local database, and retrieve the documents whose local similarities exceed the threshold
* The two approaches are equivalent.
Solution Classification
Local Determination
- all locally retrieved documents will be returned
- Examples: NCSTRL, Search Broker [MaBi97]
User Determination
- the global user determines how many documents should be retrieved from each local database
- neither effective nor practical when the number of databases is large
- Examples: MetaCrawler [SeEt97], SavvySearch [DrHo97]
Solution Classification (continued)
Weighted Allocation
- retrieve proportionally more documents from local databases that are ranked higher
Learning-based Approaches
- use past retrieval experience for selection
Guaranteed Retrieval
- aimed at guaranteeing the retrieval of the globally most similar documents
Weighted Allocation
Suppose m documents are to be retrieved from N local databases.
Example 1: CORI net [CaLC95]
- Retrieve m * 2*(1 + N - i) / (N*(N + 1)) documents from the ith ranked local database.
Example 2: D-WISE [YuLe97]
- Let ri be the ranking score of local database Di.
- Retrieve m * ri / Σ_{k=1..N} rk documents from Di.
When retrieving k documents from local database Di, the k documents with the largest local similarities are retrieved from Di.
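A small sketch of the two allocation formulas above (rounding is only for illustration):

```python
def cori_allocation(m, N):
    """CORI net: m * 2*(1 + N - i) / (N*(N + 1)) documents from the i-th ranked database."""
    return [round(m * 2 * (1 + N - i) / (N * (N + 1))) for i in range(1, N + 1)]

def dwise_allocation(m, scores):
    """D-WISE: m * r_i / sum_k r_k documents from database D_i with ranking score r_i."""
    total = sum(scores)
    return [round(m * r / total) for r in scores]

print(cori_allocation(20, 4))                  # [8, 6, 4, 2]: more from higher-ranked databases
print(dwise_allocation(20, [0.5, 0.3, 0.2]))   # [10, 6, 4]
```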
Learning-based Approaches
Determine the number of documents to retrieve from a local database based on past retrieval experiences with the local database.
Example: MRDD [VoGJ95]
For query q, three average distribution vectors are obtained:
- D1: <1, 4, 6, 7, 10, 12, 17>
- D2: <1, 5, 7, 9, 15, 20>
- D3: <2, 3, 6, 9, 11, 16>
To retrieve four relevant documents: retrieve 1 document from D1, 1 from D2 and 3 from D3.
Guaranteed Retrieval
Aim at
- guaranteeing that all potentially useful documents with respect to a query are retrieved
- minimizing the retrieval of useless documents
Two cases:
- case 1: a global similarity threshold is known
- case 2: the number of globally desired documents is known
The two cases are mutually translatable.
Case 1: Global Similarity Threshold GT Is Known
Find all documents whose global similarities are ≥ GT.
Technique 1: Query modification [MLYW98]
- Modify q to q' such that Gsim(q, d) = Lsim(q', d)
- find all documents whose local similarities with q' are ≥ GT
Example: q = (q1, q2); d = (d1, d2);

  Gsim(q, d) = gidf1*q1*d1 + gidf2*q2*d2
  Lsim(q, d) = lidf1*q1*d1 + lidf2*q2*d2
  q' = (gidf1/lidf1 * q1, gidf2/lidf2 * q2)
  Lsim(q', d) = lidf1*(gidf1/lidf1)*q1*d1 + lidf2*(gidf2/lidf2)*q2*d2 = Gsim(q, d)
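A tiny sketch of Technique 1, assuming both similarity functions are idf-weighted dot products as in the example above (the values are made up):

```python
def modify_query(q, gidf, lidf):
    """Scale each query weight by gidf/lidf so that the local engine's similarity
    with q' equals the global similarity with q."""
    return [qi * g / l for qi, g, l in zip(q, gidf, lidf)]

q, gidf, lidf, d = [1.0, 2.0], [3.0, 1.5], [2.0, 3.0], [0.4, 0.2]
q_prime = modify_query(q, gidf, lidf)                                # [1.5, 1.0]
gsim = sum(g * qi * di for g, qi, di in zip(gidf, q, d))             # Gsim(q, d)
lsim = sum(l * qpi * di for l, qpi, di in zip(lidf, q_prime, d))     # Lsim(q', d)
print(gsim, lsim)   # equal by construction (1.8, 1.8)
```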
Case 1: Global Similarity Threshold GT Is Known
Technique 2 [MLYW98]: find the largest local threshold LT such that Gsim(q, d) ≥ GT implies Lsim(q, d) ≥ LT
- retrieve d such that Lsim(q, d) ≥ LT to form set S
- transmit d from S if Gsim(q, d) ≥ GT
Example:
        Gsim(q, d)   Lsim(q, d)
  d1    0.8          0.7
  d2    0.75         0.35
  ...
  d3    0.4          0.6
If d2 is desired, then LT can be no higher than 0.35. If GT = 0.6, d3 will not be transmitted.
Transmit m documents from each local database.
Case 1: Global Similarity Threshold GT Is Known
Define the tightest local threshold: LT = min_d { Lsim(q, d) | Gsim(q, d) ≥ GT }
Determining LT:
- if both Gsim and Lsim are linear functions, apply linear programming;
- otherwise, try the Lagrange multiplier method.
Case 1: Global Similarity Threshold GT Is Known
Example: Gsim(q, d) = Cosine(qG, d), Lsim(q, d) = Cosine(qL, d)

  LT = min_d { Cosine(qL, d) | Cosine(qG, d) ≥ GT }
     = Cosine(θ + θ1)   when qG, qL and d are in the same plane
     = GT * Cosine(θ1) - sin(θ) * sin(θ1)

where θ is the angle between qG and d with Cosine(θ) = GT, and θ1 is the angle between qG and qL.

[Figure: vectors qL, qG and d, with θ1 the angle between qL and qG]
Case 2: Number of Globally Desired Documents Is Known
Solution:
- rank the databases optimally for the given query q
- retrieve documents from the databases in the optimal order
Case 2: Number of Globally Desired Documents Is Known
Algorithm OptDocRetrv [YLWM99]
while fewer than m documents have been obtained do
  1. select the next database in the order
  2. compute the actual similarity of its most similar document
  3. find the minimum min_sim of the actual similarities of the most similar documents of the selected databases
  4. select documents from each selected database whose actual global similarities are ≥ min_sim
end loop
Sort the documents in descending order of similarity and present the top m to the user.
Case 2: Number of Globally Desired Documents Is Known
Example: Number of documents desired = 4. Databases are ranked in the order D1, D2, D3, D4, ...
- D1: d1: 0.53, d2: 0.48, d3: 0.39, ...
- D2: d10: 0.47, d21: 0.43, d52: 0.42, ...
- D3: d23: 0.54, d42: 0.49, ...
- D4: d33: 0.40, ...
select D1, min_sim = 0.53: result = { d1 }
select D2, min_sim = 0.47: result = { d1, d2, d10 }
select D3, min_sim = 0.47: result = { d1, d2, d10, d23, d42 }
result to user = { d1, d2, d23, d42 }
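A compact sketch of OptDocRetrv, assuming each selected database can report its documents with their actual global similarities in descending order:

```python
def opt_doc_retrv(ranked_dbs, m):
    """ranked_dbs: databases in (estimated) optimal order; each database is a
    list of (doc, actual global similarity) sorted in descending similarity."""
    result, min_sim, used = {}, None, []
    for db in ranked_dbs:
        if len(result) >= m:
            break
        used.append(db)
        top_sim = db[0][1]                          # most similar document of this database
        min_sim = top_sim if min_sim is None else min(min_sim, top_sim)
        result = {doc: s for selected in used for doc, s in selected if s >= min_sim}
    return sorted(result.items(), key=lambda x: -x[1])[:m]

# Slide example (m = 4)
D1 = [("d1", 0.53), ("d2", 0.48), ("d3", 0.39)]
D2 = [("d10", 0.47), ("d21", 0.43), ("d52", 0.42)]
D3 = [("d23", 0.54), ("d42", 0.49)]
D4 = [("d33", 0.40)]
print(opt_doc_retrv([D1, D2, D3, D4], 4))   # d23, d1, d42, d2
```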
Case 2: Number of Globally Desired Documents Is Known
Proposition: If databases are optimally ranked, then all the m globally most similar documents will be retrieved by algorithm OptDocRetrv.
Proposition: For any single-term query, all the
m globally most similar documents will be
retrieved by algorithm OptDocRetrv.
Result Merging
Goal: Merge the returned documents from multiple sources into a single ranked list.
Difficulties:
- local similarities are usually not comparable due to
  - different similarity functions
  - different term weighting schemes
  - different statistical values, e.g., global idf vs. local idf
- local similarities may be unavailable to the metasearch engine (only ranks are provided)
Ideal rank: in non-increasing order of global similarities
Solution Classification
- similarity normalization: normalize all local similarities into a common fixed range to improve comparability
- similarity adjustment: adjust local similarities/ranks based on the quality of the local databases
- global similarity computation: aim at obtaining the actual global similarities
Merge based on the normalized/adjusted/computed similarities.
Similarity Normalization
Example 1: MetaCrawler [SeEt97]
- map all local similarities into [0, 1000]
  - map the largest local similarity from each source to 1000
  - map the other local similarities proportionally
- add the normalized local similarities for documents retrieved from multiple sources

                      D1                    D2
                      d1    d2    d3        d1    d4    d5
  local similarity:   100   200   400       0.3   0.2   0.5
  normalized:         250   500   1000      600   400   1000
  final similarity:   d1: 850, d2: 500, d3: 1000, d4: 400, d5: 1000
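A rough sketch of this normalization-and-merge step (assuming each source returns a dict of document → local similarity):

```python
def merge_normalized(result_lists, scale=1000.0):
    """Scale the top local similarity of each source to `scale`, the rest
    proportionally, and add the scores of documents returned by several sources."""
    merged = {}
    for results in result_lists:
        top = max(results.values())
        for doc, sim in results.items():
            merged[doc] = merged.get(doc, 0.0) + scale * sim / top
    return sorted(merged.items(), key=lambda x: -x[1])

# Slide example
D1 = {"d1": 100, "d2": 200, "d3": 400}
D2 = {"d1": 0.3, "d4": 0.2, "d5": 0.5}
print(merge_normalized([D1, D2]))   # d3: 1000, d5: 1000, d1: 850, d2: 500, d4: 400
```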
Similarity Normalization
Example 2: SavvySearch [DrHo97]
- same as MetaCrawler except using the range [0, 1]
- documents with no local similarities are assigned 0.5
Retrieval based on Multiple Evidence
- a normalized similarity between 0 and 1 can be considered as a confidence that a document is useful
- let si be the confidence of source i that document d is useful to query q
- estimate the overall confidence that d is useful: S(d, q) = 1 - (1 - s1)*...*(1 - sk)
Example: s1 = 0.7, s2 = 0.8  →  S(d, q) = 1 - 0.3*0.2 = 0.94
Similarity Adjustment
Use the local similarity of d and the ranking score of its database to estimate the global similarity of d.
- database ranking score: the higher the better
Example: CORI net [CaLC95]
- assign the following weight to database D:

  w(D) = 1 + N * (r - r') / r'

  - r: ranking score of D with respect to q
  - r': average of the scores of the searched databases
  - N: number of local databases searched
- adjust the local similarity s of document d in D to s*w(D)
A similar approach is employed in ProFusion [GaWG96].
Similarity Adjustment
Use the local rank of d and the ranking score of its database to estimate the global similarity of d.
Example: D-WISE [YuLe97]

  Gsim(q, d) = 1 - (r - 1) * Rmin / (m * Ri)

- Ri: ranking score of database Di
- Rmin: lowest database ranking score
- r: local rank of document d from Di
- m: total number of documents desired
Observation: the top-ranked document from any database has the same global similarity.
D-WISE (continued)
Example: R1 = 0.3, R2 = 0.7, Rmin = 0.2, m = 4

  Gsim(q, d) = 1 - (r - 1) * 0.2 / (4 * Ri)

        D1                  D2
        r    Gsim           r    Gsim
  d1    1    1.0      d1'   1    1.0
  d2    2    0.83     d2'   2    0.93
  d3    3    0.67     d3'   3    0.86

More documents from databases with higher ranking scores have higher global similarities.
Global Similarity Computation
Technique 1: Document Fetching (e.g., E2RD2, ParaCrawler)
- fetch the documents to the metasearch engine
- collect the desired statistics (tf, idf, ...)
- compute global similarities
Problem: may not scale well.
Global Similarity Computation
Technique 2: Knowledge Discovery
- discover the similarity functions and term weighting schemes used in the different search engines
- use the discovered knowledge to determine
  - which local similarities are reasonably comparable
  - how to adjust local similarities to make them more comparable
  - how to compute/estimate global similarities
Knowledge Discovery (continued)
Example: All local search engines selected for a query employ the same methods for indexing local documents and computing local similarities.
- If idf information is not used: the local similarities are comparable.
- If idf information is used and q has a single term t:

  Lsim(q, d) = [tft(q) * lidft * tft(d)] / (|q|*|d|) = [lidft * tft(d)] / |d|
  Gsim(q, d) = [gidft * tft(d)] / |d|
  Gsim(q, d) = Lsim(q, d) * gidft / lidft
Knowledge Discovery (continued)
Example (continued)
If idf information is used and q has terms t1, ..., tk:

  Gsim(q, d) = [ Σ_{i=1..k} tfti(q) * gidfti * tfti(d) ] / (|q| * |d|)
             = Σ_{i=1..k} [tfti(q) / |q|] * gidfti * [tfti(d) / |d|]

  tfti(d) / |d| can be determined by using ti as a single-term query.
Knowledge Discovery (continued)
Submit ti as a single-term query q(ti) and let

  si = Lsim(q(ti), d) = [tfti(q(ti)) * lidfti * tfti(d)] / (|q(ti)| * |d|)

Then

  tfti(d) / |d| = si * |q(ti)| / (tfti(q(ti)) * lidfti)
New Challenges
Incorporate new search techniques into metasearch:
- document ranks in Google
- Kleinberg's hub and authority scores
- tag information in HTML documents
- implicit user feedback on previous retrievals
- pseudo relevance feedback on previous retrievals
- use of user profiles
Integrate local systems supporting different query types:
- less research on boolean queries, proximity queries and hierarchical queries
New Challenges (continued)
Develop techniques to discover knowledge (representatives, ranking algorithms) about local search engines more accurately and more efficiently.
- some search engines may be unwilling to provide the desired representatives, or may provide inaccurate representatives
- indexing techniques, term weighting schemes and similarity functions are typically proprietary
Develop a standard guideline on what information each search engine should provide to a metasearch engine (some efforts: STARTS, Dublin Core).
New Challenges (continued)
Distributed implementation of a metasearch engine:
- alternative ways to store local database representatives?
- how to perform database selection and document selection at multiple sites in parallel?
Scale to a million databases:
- storage of database representatives
- fast algorithms for database selection, document selection and result merging
- efficient network utilization
New Challenges (continued)
Standard testbed for evaluation:
- need a large number of local databases
- documents should have links for computing ranks, hub and authority scores
- a large number of typical Internet queries
- relevance assessments of documents for each query
Go beyond text databases:
- how to extend to databases containing text, images, video, audio, structured data?