
Page 1: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

VLDB'99 TUTORIAL

Metasearch Engines: Solutions and Challenges

Clement Yu, Dept. of EECS, U. of Illinois at Chicago, Chicago, IL 60607, [email protected]

Weiyi Meng, Dept. of Computer Science, SUNY at Binghamton, Binghamton, NY 13902, [email protected]

Page 2: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

The Problem

[Figure: n independent search engines, search engine 1 through search engine n, each indexing its own text source]

How am I going to find the 5 best pages on “Internet Security”?

Page 3: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Metasearch Engine Solution

[Figure: metasearch engine architecture. The user submits a query to the user interface; the query dispatcher forwards it to search engines 1 through n, each over its own text source; the result merger combines the returned results and presents them to the user.]

Page 4: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Some Observations

most sources are not useful for a given query

sending a query to a useless source would
  incur unnecessary network traffic
  waste local resources for evaluating the query
  increase the cost of merging the results

retrieving too many documents from a source is inefficient

Page 5: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

A More Efficient Metasearch Engine

[Figure: a more efficient metasearch engine. The user interface passes the query to a database selector and a document selector before the query dispatcher, so only selected search engines receive the query; the result merger returns the merged result to the user.]

Page 6: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Tutorial Outline

1. Introduction to Text Retrieval (only the Vector Space Model is considered)
2. Search Engines on the Web
3. Introduction to Metasearch Engine
4. Database Selection
5. Document Selection
6. Result Merging
7. New Challenges

Page 7: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Text Retrieval (1)

Document representation
  remove stopwords: of, the, ...
  stemming: stemming -> stem
  d = (d1, ..., di, ..., dn)
  di: weight of the ith term in d

tf*idf formula for computing di

Example: consider term t of document d in a database of N documents.
  tf weight of t in d (if tf > 0): 0.5 + 0.5*tf/max_tf
  idf weight of t: log(N/df)
  weight of t in d: (0.5 + 0.5*tf/max_tf)*log(N/df)
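
The weighting formula above can be read directly as code. A minimal Python sketch (the function name and sample numbers are illustrative, not from the tutorial):

import math

def tfidf_weight(tf, max_tf, df, N):
    """tf*idf weight of a term in one document, per the formula above.

    tf: raw frequency of the term in the document
    max_tf: largest term frequency in that document
    df: number of documents containing the term
    N: number of documents in the database
    """
    if tf == 0 or df == 0:
        return 0.0
    tf_weight = 0.5 + 0.5 * tf / max_tf
    idf_weight = math.log(N / df)
    return tf_weight * idf_weight

# e.g. a term occurring 3 times in a document whose most frequent term occurs
# 6 times, and appearing in 100 of 10000 documents:
# tfidf_weight(3, 6, 100, 10000) -> 0.75 * log(100), about 3.45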

Page 8: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Text Retrieval (2)

Query representation q = (q1 , ..., qi , ..., qn)

qi : weight of ith term in q

compute qi : tf weight only

alternative: use idf weight for query terms not document terms query expansion (e.g., add related terms)

Page 9: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Text Retrieval (3)

Similarity Functions

simple dot product: sim(q, d) = q1*d1 + q2*d2 + ... + qn*dn
  favors long documents

Cosine function: sim(q, d) = (q1*d1 + ... + qn*dn) / (||q|| * ||d||),
  where ||q|| = sqrt(q1^2 + ... + qn^2) and ||d|| = sqrt(d1^2 + ... + dn^2)

other similarity functions exist
normalized similarities: [0, 1.0]
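
The two similarity functions above, as a small Python sketch (function names are mine; q and d are weight vectors over the same term order):

import math

def dot_sim(q, d):
    """Simple dot product similarity (favors long documents)."""
    return sum(qi * di for qi, di in zip(q, d))

def cosine_sim(q, d):
    """Cosine similarity; lies in [0, 1] for non-negative weights."""
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot_sim(q, d) / (norm_q * norm_d)

# e.g. cosine_sim([1, 1, 0], [0.75, 0.0, 0.3])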

Page 10: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Text Retrieval (4)

Retrieval Effectiveness
  relevant documents: documents useful to the user of the query
  recall: percentage of the relevant documents that are retrieved
  precision: percentage of the retrieved documents that are relevant

[Figure: typical precision vs. recall trade-off curve]

Page 11: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Search Engines on the Web (1)

Search engine as a document retrieval system
  no control over the web pages that can be searched
  web pages have rich structures and semantics
  web pages are extensively linked
  additional information is available for each page (time last modified, organization publishing it, etc.)
  databases are dynamic and can be very large
  few general-purpose search engines and numerous special-purpose search engines

Page 12: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Search Engines on the Web (2)

New indexing techniques
  partial-text indexing to improve scalability
  ignore and/or discount spamming terms
  use anchor terms to index linked pages
  e.g.: WWWW [McBr94], Google [BrPa98], Webor [CSM97]

[Figure: Page 1 contains the anchor text "... airplane ticket and hotel ..." linking to Page 2 (http://travelocity.com/); the anchor terms can be used to index the linked page]

Page 13: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Search Engines on the Web (3)

New term weighting schemes
  higher weights to terms enclosed by special tags
    title (SIBRIS [WaWJ89], AltaVista, HotBot, Yahoo)
    special fonts (Google [BrPa98])
    special fonts & tags (LASER [BoFJ96])

Webor [CSM97] approach
  partition tags into disjoint classes (title, header, strong, anchor, list, plain text)
  assign different importance factors to terms in different classes
  determine optimal importance factors

Page 14: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Search Engines on the Web (4)

New document ranking methods

Vector Spreading Activation [YuLe96]
  add a fraction of the parents' similarities

Example: Suppose for query q: sim(q, d1) = 0.4, sim(q, d2) = 0.2, sim(q, d3) = 0.2
  final score of d3 = 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26

[Figure: pages d1 and d2 both link to d3]
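
A small Python sketch of this re-scoring, reproducing the example above (the 0.1 spreading factor and the parent links come from the example; the function name is illustrative):

def spread_scores(sims, parents, alpha=0.1):
    """Add a fraction alpha of each parent page's similarity to a page's score.

    sims: {page: similarity to the query}
    parents: {page: list of pages that link to it}
    """
    return {p: sims[p] + alpha * sum(sims.get(par, 0.0) for par in parents.get(p, []))
            for p in sims}

sims = {"d1": 0.4, "d2": 0.2, "d3": 0.2}
parents = {"d3": ["d1", "d2"]}        # d1 and d2 link to d3
print(spread_scores(sims, parents))   # d3 becomes 0.2 + 0.1*0.4 + 0.1*0.2 = 0.26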

Page 15: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Search Engines on the Web (5)

New document ranking methods
  combine similarity with rank
    PageRank [PaBr98]: an important page is linked to by many pages and/or by important pages
  combine similarity with authority score
    authority [Klei98]: an important content page is highly linked to among the initially retrieved pages and their neighbors

Page 16: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (1)

An Example
  Query: Internet Security
  Databases: NYT, WP, and other DBs
  Retrieved results: t1, t2, ... from NYT; p1, p2, ... from WP
  Merged results: p1, t1, ...

Page 17: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (2)

Database Selection Problem
  Select potentially useful databases for a given query
    essential if the number of local databases is large
    reduce network traffic
    avoid wasting local resources

Page 18: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (3)

Potentially useful database: contains potentially useful documents

Potentially useful documents:
  global similarity above a threshold
  global similarity among the m highest

Need some knowledge about each database in advance in order to perform database selection: the Database Representative

Page 19: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (4)

Document Selection Problem
  Select potentially useful documents from each selected local database efficiently

Step 1: Retrieve all potentially useful documents while minimizing the retrieval of useless documents
  from the global similarity threshold to the tightest local similarity threshold
  want all d: Gsim(q, d) > GT
  retrieve d from DBk: Lsim(q, d) > LTk
  LTk is the largest value such that Gsim(q, d) > GT implies Lsim(q, d) > LTk

Page 20: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (5)

Efficient Document Selection

Step 2: Transmit all potentially useful documents to the result merger while minimizing the transmission of useless documents
  further filtering to reduce transmission cost and merge cost

Example:

[Figure: local database DBk retrieves d1, ..., ds; a filter keeps only, say, d2, d7, d10, which are transmitted to the merger]

Page 21: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (6)

Result Merging Problem

Objective: Merge returned documents from multiple sources into a single ranked list.

Difficulty: Local document similarities may be incomparable or not available.

Solutions: Generate "global similarities" for ranking.

[Figure: DB1 returns d11, d12, ...; DBN returns dN1, dN2, ...; the merger produces a single ranked list d12, d54, ...]

Page 22: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (7)

An Ideal Metasearch Engine:
  Retrieval effectiveness: the same as if all documents were in a single collection.
  Efficiency: optimize the retrieval process.

Implications: should aim at
  selecting only useful search engines
  retrieving and transmitting only useful documents
  ranking documents according to their degrees of relevance

Page 23: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (8)

Main Sources of Difficulties: [MYL99]
  autonomy of local search engines
    design autonomy
    maintenance autonomy
  heterogeneities among local search engines
    indexing method
    document/query term weighting schemes
    similarity/ranking function
    document database
    document version
    result presentation

Page 24: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Introduction to Metasearch Engine (9)

Impact of Autonomy and Heterogeneities [MLY99]
  unwilling to provide database representatives, or provide different types of representatives
  difficult to find potentially useful documents
  difficult to merge documents from multiple sources

Page 25: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Database Selection: Basic Idea

Goal: Identify potentially useful databases for each user query.

General approach:
  use a representative to indicate approximately the content of each database
  use these representatives to select databases for each query

Diversity of solutions
  different types of representatives
  different algorithms using the representatives

Page 26: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Solution Classification

Naive Approach: select all databases (e.g., MetaCrawler, NCSTRL)

Qualitative Approaches: estimate the quality of each local database
  based on rough representatives
  based on detailed representatives

Quantitative Approaches: estimate quantities that measure the quality of each local database more directly and explicitly

Learning-based Approaches: database representatives are obtained through training or learning

Page 27: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Rough Representatives

typical representative: a few words or a few paragraphs in a certain format
  manual construction often needed

can work well for special-purpose local search engines
very scalable storage requirement
selection can be inaccurate because the description is too rough

Page 28: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Rough Representatives

Example 1: ALIWEB [Kost94]

Representative has a fixed format. For a site containing files for the Perl language:
  Template-Type: DOCUMENT
  Title: Perl
  Description: Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
  Keywords: perl, perl-faq, language

A user query can match against one or more fields.

Page 29: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Rough Representatives

Example 2: NetSerf [ChHa95]

Representative has a WordNet-based structure. For a site with world facts listed by country:
  topic: country
    synset: [nation, nationality, land, country, a_people]
    synset: [state, nation, country, land, commonwealth, res_publica, body_politic]
    synset: [country, state, land, nation]
  info-type: facts

A user query is transformed into a similar structure before matching.

Page 30: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Detailed Representatives

Use detailed statistical information for each term

employ special measures to estimate the usefulness/quality of each search engine for each query

the measures reflect the usefulness in a less direct/explicit way compared to those used in quantitative approaches.

scalability starts to become an issue

Page 31: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Detailed Representatives

Example 1: gGlOSS [GrGa95]

representative: for each term ti
  dfi -- document frequency of ti
  Wi -- the sum of the weights of ti over all documents

database usefulness: sum of high similarities

  usefulness(q, D, T) = sum over { d in D : sim(q, d) > T } of sim(q, d)

Page 32: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

gGlOSS (continued)

Suppose for query q, we have
  D1: d11: 0.6, d12: 0.5
  D2: d21: 0.3, d22: 0.3, d23: 0.2
  D3: d31: 0.7, d32: 0.1, d33: 0.1

usefulness(q, D1, 0.3) = 0.6 + 0.5 = 1.1
usefulness(q, D2, 0.3) = 0.3 + 0.3 = 0.6
usefulness(q, D3, 0.3) = 0.7
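
The quantity gGlOSS tries to estimate can be stated directly in code. A Python sketch that computes the ideal usefulness from known document similarities, reproducing the example above (gGlOSS itself only estimates this value from the (dfi, Wi) representative):

def ggloss_usefulness(sims, T):
    """usefulness(q, D, T): sum of the document similarities that exceed T.

    sims: similarities of the documents in one database with the query.
    """
    return sum(s for s in sims if s > T)

print(ggloss_usefulness([0.6, 0.5], 0.3))        # D1 -> 1.1
print(ggloss_usefulness([0.3, 0.3, 0.2], 0.3))   # D2 -> 0.6
print(ggloss_usefulness([0.7, 0.1, 0.1], 0.3))   # D3 -> 0.7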

Page 33: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

gGlOSS (continued)

gGlOSS: usefulness is estimated for two cases

high-correlation case: if dfi <= dfj, then every document having ti also has tj.

Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 3, df3 = 4, W1 = 0.6, W2 = 0.6 and W3 = 1.2.

      actual weights         weights assumed by gGlOSS
      t1    t2    t3         t1    t2    t3
  d1  0.2   0.1   0.3        0.3   0.2   0.3
  d2  0.4   0.3   0.2        0.3   0.2   0.3
  d3  0     0.2   0.4        0     0.2   0.3
  d4  0     0     0.3        0     0     0.3

usefulness(q, D, 0.5) = W1 + W2 + df2*W3/df3 = 0.6 + 0.6 + 3*1.2/4 = 2.1

Page 34: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

gGlOSS (continued)

disjoint case: for any two query terms ti and tj, no document contains both ti and tj.

Example: Consider q = (1, 1, 1) with df1 = 2, df2 = 1, df3 = 1, W1 = 0.5, W2 = 0.2 and W3 = 0.4.

      actual weights         weights assumed by gGlOSS
      t1    t2    t3         t1     t2    t3
  d1  0.2   0     0          0.25   0     0
  d2  0     0.2   0          0      0.2   0
  d3  0.3   0     0          0.25   0     0
  d4  0     0     0.4        0      0     0.4

In the disjoint case, usefulness(q, D, T) = sum of qi*Wi over the query terms ti with qi*Wi/dfi > T

usefulness(q, D, 0.3) = W3 = 0.4

Page 35: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

gGlOSS (continued)

Some observations
  usefulness depends on the threshold
  the representative has two quantities per term
  strong assumptions are used
    the high-correlation case tends to overestimate
    the disjoint case tends to underestimate
    the two estimates tend to form bounds on the sum of the similarities above T

Page 36: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Detailed Representatives

Example 2: CORI Net [CaLC95]

representative: (dfi, cfi) for term ti
  dfi -- document frequency of ti
  cfi -- collection frequency of ti
  cfi can be shared by all databases

database usefulness
  usefulness(q, D) = sim(q, representative of D)
  a database is ranked like a document: usefulness plays the role of the similarity, dfi the role of tfi, and cfi the role of dfi

Page 37: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

CORI Net (continued)

Some observations estimates independent of threshold representative has less than two quantities

per term similarity is computed based on inference

network same method for ranking documents and

ranking databases

Page 38: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Qualitative Approaches Using Detailed Representatives

Example 3: D-WISE [YuLe97]

representative: dfi,j for term tj in database Di

database usefulness: a measure of query term concentration in different databases

  usefulness(q, Di) = sum over j = 1..k of CVVj * dfi,j

  k: number of query terms
  CVVj: cue validity variance of term tj across all databases; a larger CVVj means tj is more useful in distinguishing different databases

Page 39: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

D-WISE (continued)

ACVj: average cue validity of tj over all databases

  CVi,j = (dfi,j / ni) / ( dfi,j / ni + (sum over k != i of dfk,j) / (sum over k != i of nk) )

  ACVj = (1/N) * sum over i = 1..N of CVi,j

  CVVj = (1/N) * sum over i = 1..N of (CVi,j - ACVj)^2

  N: number of databases
  ni: number of documents in database Di

Observations:
  estimates are independent of the threshold
  the representative has one quantity per term
  the measure is difficult to understand
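
A Python sketch of the CVV-based ranking, following the formulas as reconstructed above (the exact D-WISE formulas should be checked against [YuLe97]; variable names and the sample data are illustrative):

def dwise_scores(df, n, query_terms):
    """Rank databases by sum_j CVVj * df[i][j] over the query terms.

    df: df[i][j] = document frequency of term j in database i
    n:  n[i] = number of documents in database i
    query_terms: indices of the query terms
    """
    N = len(df)
    scores = [0.0] * N
    for j in query_terms:
        cv = []
        for i in range(N):
            others_df = sum(df[k][j] for k in range(N) if k != i)
            others_n = sum(n[k] for k in range(N) if k != i)
            own = df[i][j] / n[i]
            denom = own + (others_df / others_n if others_n else 0.0)
            cv.append(own / denom if denom > 0 else 0.0)
        acv = sum(cv) / N
        cvv = sum((c - acv) ** 2 for c in cv) / N
        for i in range(N):
            scores[i] += cvv * df[i][j]
    return scores

# three databases, a two-term query (term indices 0 and 1)
print(dwise_scores(df=[[20, 1], [5, 10], [0, 2]], n=[100, 80, 50], query_terms=[0, 1]))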

Page 40: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Quantitative Approaches

Two types of quantities may be estimated w.r.t. query q:

the number of documents in a database D with similarities higher than a threshold T:
  NoDoc(q, D, T) = |{ d : d in D and sim(q, d) > T }|

the global similarity of the most similar document in D:
  msim(q, D) = max over d in D of sim(q, d)

can be used to rank databases in descending order of similarity (or any desirability measure)

Page 41: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Basic Approach [MLYW98]

representative: (pi , wi ) for term ti

pi : probability that ti appears in a document

wi : average weight of ti among documents

having ti

Example: normalized weights of ti in 10

documents are (0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6).

pi = 0.6, wi = 0.4

Page 42: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Basic Approach (continued)

Example: Consider query q = (1, 1). Suppose p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1.

A generating function:
  (0.2*X^2 + 0.8) * (0.4*X + 0.6) = 0.08*X^3 + 0.12*X^2 + 0.32*X + 0.48

A term a*X^b means: a is the probability that a document in D has similarity b with q.

If D contains 10 documents: NoDoc(q, D, 1) = 10*(0.08 + 0.12) = 2
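
A Python sketch of the generating-function estimate, reproducing the example above (the function name and data layout are mine):

from collections import defaultdict

def nodoc_estimate(terms, n_docs, T):
    """Estimate NoDoc(q, D, T) with the generating-function method.

    terms: list of (p_i, q_i * w_i) pairs for the query terms, where p_i is the
           probability a document contains term i and w_i its average weight.
    """
    poly = {0.0: 1.0}                       # maps similarity value -> probability
    for p, contrib in terms:
        new = defaultdict(float)
        for s, prob in poly.items():
            new[s + contrib] += prob * p    # term present in the document
            new[s] += prob * (1 - p)        # term absent
        poly = dict(new)
    prob_above = sum(prob for s, prob in poly.items() if s > T)
    return n_docs * prob_above

# the example above: q = (1, 1), p1 = 0.2, w1 = 2, p2 = 0.4, w2 = 1, 10 documents
print(nodoc_estimate([(0.2, 2.0), (0.4, 1.0)], n_docs=10, T=1))   # -> 2.0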

Page 43: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Basic Approach (continued)

Consider query q = (q1, ..., qr).

Proposition. If the terms are independent and the weight of term ti, whenever present in a document, is wi (the average weight), 1 <= i <= r, then the coefficient of X^s in the following generating function is the probability that a document in D has similarity s with q:

  product over i = 1..r of ( pi * X^(qi*wi) + (1 - pi) )

Page 44: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Subrange-based Approach [MLYW99]
  overcomes the uniform term weight assumption
  additional information for term ti:
    sigma_i: standard deviation of the weights of ti in all documents
    mnwi: maximum normalized weight of ti

Page 45: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Example: weights of term ti: 4, 4, 1, 1, 1, 1, 0, 0, 0, 0

generating function (factor) using the average weight:
  0.6*X^2 + 0.4

a more accurate factor using subranges of weights:
  0.2*X^4 + 0.4*X + 0.4

In general, the weights are partitioned into k subranges:
  pi1*X^mi1 + ... + pik*X^mik + (1 - pi)

The probability pij and the median mij can be estimated using sigma_i and the average of the weights of ti.

A special implementation: use the maximum normalized weight as the first subrange by itself.

Page 46: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Combined-term Approach [LYMW99] relieve the term independence assumption Example: Consider query : Chinese medicine . Suppose generating function for:

Chinese: 0.1X3 + 0.3X + 0.6

medicine: 0.2X2 + 0.4 X + 0.4

Chinese medicine: 0.02 X5 + 0.04 X4 + 0.1X3 + …

“Chinese medicine”: 0.05 Xw + ...

Page 47: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating NoDoc(q, D, T)

Criteria for combining "Chinese" and "medicine":
  The maximum normalized weight of the combined term is higher than the maximum normalized weight of each of the two individual terms (w > 3);
  The sum of the estimated probabilities of the terms with exponents w under the term independence assumption is very different from 1/N, where N is the number of documents in the database;
  They are adjacent terms in previous queries.

Page 48: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Database Selection Using msim(q,D)

Optimal Ranking of Databases [YLWM99b]

User: for query q, find the m most similar documents or with the m largest degrees of relevance

Definition: Databases [D1, D2, …, Dp] are optimally

ranked with respect to q if there exists a k such

that each of the databases D1, …, Dk contains

one of the m most similar documents, and all of these m documents are contained in these k databases.

Page 49: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Database Selection Using msim(q,D)

Optimal Ranking of Databases

Example: For a given query q:
  D1: d1: 0.8, d2: 0.5, d3: 0.2, ...
  D2: d9: 0.7, d2: 0.6, d10: 0.4, ...
  D3: d8: 0.9, d12: 0.3, ...
  other databases have only documents with small similarities

When m = 5: pick D1, D2, D3

Page 50: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Database Selection Using msim(q,D)

Proposition: Databases [D1, D2, ..., Dp] are optimally ranked with respect to a query q if and only if msim(q, Di) >= msim(q, Dj) whenever i < j.

Example:
  D1: d1: 0.8, ...   D2: d9: 0.7, ...   D3: d8: 0.9, ...
  Optimal rank: [D3, D1, D2, ...]

Page 51: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating msim(q, D)

Use the subrange-based or combined-term method.

Example: Suppose there are 100 documents in a database. For query q, the generating function is:
  0.002*X^4 + 0.009*X^3 + ...
Since 100*(0.002 + 0.009) is approximately 1, the global similarity of the most similar document is estimated to be 3.

Weaknesses of this approach:
  requires large storage for the database representative
  exponential computation complexity

Page 52: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating msim(q, D)

A more efficient method

global database representative: global dfi of term ti

local database representative:

anwi : average normalized weight of ti

mnwi : maximum normalized weight of ti

Example: term ti : d1 0.3, d2 0.4, d3 0, d4 0.74

anwi = (0.3 + 0.4 + 0 + 0.7)/4 = 0.35

mnwi = 0.74

Page 53: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating msim(q, D)

A more efficient method (continued)

term weighting scheme
  query term: tf*gidf
  document term: tf

For a two-term query q = (q1, q2):
  msim(q, D) = max { q1*gidf1*mnw1 + q2*gidf2*anw2,
                     q2*gidf2*mnw2 + q1*gidf1*anw1 }

linear computation complexity
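
A Python sketch of this estimate for an arbitrary number of query terms, assuming in turn that each term attains its maximum normalized weight while the others take their average normalized weights (names and sample numbers are illustrative):

def estimate_msim(query):
    """Estimate the largest global similarity in a database.

    query: list of (q_i, gidf_i, anw_i, mnw_i) tuples, one per query term.
    """
    best = 0.0
    for j, (qj, gidfj, _, mnwj) in enumerate(query):
        s = qj * gidfj * mnwj                      # this term at its maximum weight
        s += sum(qi * gidfi * anwi                 # the other terms at their averages
                 for i, (qi, gidfi, anwi, _) in enumerate(query) if i != j)
        best = max(best, s)
    return best

# two-term query, matching the two-term formula on the slide
print(estimate_msim([(1.0, 1.2, 0.35, 0.74), (1.0, 0.9, 0.20, 0.50)]))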

Page 54: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Estimating msim(q, D)

Combine terms to improve estimation accuracy Restrictions for combining terms ti and tj into tij :

ti and tj are adjacent query terms

mnwij > max { mnwi + anwj , mnwj + anwi }

Given a query having ti , tj and tk in this order, decide

which terms to combine if they should combine. Combine ti and tj if

mnwij > max { mnwi + anwj , mnwj + anwi }

and mnwij - max { mnwi + anwj , mnwj + anwi }

> mnwkj - max { mnwk+ anwj , mnwj + anwk }

Page 55: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Learning-based Approaches

Use past retrieval experiences to determine usefulness

Assume no or little global database or local database statistics

Static learning : learning based on static training queries

Dynamic learning : learning based on evaluated user queries

Combined learning: learned knowledge based on training queries will be adjusted based on user queries

Page 56: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Static Learning

Example: MRDD (Modeling Relevant Document Distribution) [VoGJ95]

record the result of each training query for each local database:

<r1, ..., rs>: ri indicates the minimum number of

top-ranked documents to retrieve in order to obtain i relevant documents <2, 5, … >: need to retrieve 2 documents in order to obtain 1 relevant document

Page 57: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

MRDD (continued)

For a new query:
  identify the k most similar training queries
  obtain the average distribution vector over the k training queries for each database
  use these vectors to determine which databases to search and which documents to retrieve so as to maximize precision

Example: Suppose for query q, three average distribution vectors are obtained:
  D1: <1, 4, 6, 7, 10, 12, 17>
  D2: <1, 5, 7, 9, 15, 20>
  D3: <2, 3, 6, 9, 11, 16>
To retrieve two relevant documents: select D1 and D2 (one document from each).
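
A Python sketch of this selection step: given the average distribution vectors, find the allocation that obtains the target number of relevant documents with the fewest retrieved documents, i.e. the highest precision (brute force, fine for a handful of databases; names are mine):

from itertools import product

def mrdd_allocate(dists, target):
    """dists: {db: [r1, r2, ...]} where r_i = documents that must be retrieved
    from that db to obtain i relevant ones.
    Returns ({db: documents to retrieve}, total documents retrieved)."""
    dbs = list(dists)
    best = None
    for combo in product(*[range(min(target, len(dists[db])) + 1) for db in dbs]):
        if sum(combo) != target:
            continue
        cost = sum(dists[db][k - 1] for db, k in zip(dbs, combo) if k > 0)
        if best is None or cost < best[1]:
            best = ({db: (dists[db][k - 1] if k else 0) for db, k in zip(dbs, combo)}, cost)
    return best

dists = {"D1": [1, 4, 6, 7, 10, 12, 17],
         "D2": [1, 5, 7, 9, 15, 20],
         "D3": [2, 3, 6, 9, 11, 16]}
print(mrdd_allocate(dists, 2))   # retrieve 1 document from D1 and 1 from D2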

Page 58: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Dynamic Learning

Example: SavvySearch [DrHo97]

database representative: weight wi and cfi for each term ti, and two penalty values ph and pr for each database D

  wi: indicates how well D responds to query term ti
  cfi: number of databases containing ti
  ph: penalty if the average number of hits h returned for the most recent five queries is below Th
      ph = (Th - h)^2 / Th^2
  pr: penalty if the average response time r for the most recent five queries exceeds Tr
      pr = (r - Tr)^2 / (45 - Tr)^2

Page 59: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

SavvySearch (continued)

Update of wi
  initially zero
  reduce by 1/k if no document is retrieved for a k-term query containing ti
  increase by 1/k if some returned document is read

Compute the ranking score of database D for query q = (t1, ..., tk):

  r = ( sum over i = 1..k of wi * log(N/cfi) ) / ( sum over i of |wi| ) - (ph + pr)

Page 60: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Combined Learning

Example: ProFusion [FaGa99]

Phase 1: Static Learning
  13 categories/concepts are utilized
  training queries in each category are selected
  relevance assessments for each query are used to compute the average score of each local database with respect to each category

  category   D1    D2    ...   Dn
  C1         0.3   0.1   ...   0.2
  ...        ...   ...   ...   ...
  C13        0     0.4   ...   0.1

Page 61: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

ProFusion (continued)

Phase 2: Database Selection and Dynamic Learning
  Each user query is mapped to one or more categories
  Databases are selected based on the scores accumulated over the involved categories

Example: Suppose query q is mapped to C1, C4, C5

  category      D1    D2    D3    D4
  C1            0.2   0     0.1   0.3
  C4            0.1   0.2   0     0
  C5            0     0.4   0.3   0.2
  total score   0.3   0.6   0.4   0.5

Page 62: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

ProFusion (continued)

Each retrieved document from all selected databases is re-ranked based on the product of local similarity of the document and the score of the database.

if the first clicked document by the user is not the top ranked increase the score of the database that

produced the document in related categories

decrease the score of other searched databases in related categories

Page 63: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Other Database Selection Techniques

incorporating ranks [YMLW99a]
query expansion [XuCa98]
use of lightweight queries [HaTh99]
  shorter queries that are not evaluated like regular queries
use of representative hierarchies [YMLW99b]

Page 64: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Document Selection

Goal: Select all globally most similar documents from a selected local search engine while minimizing the retrieval of useless documents.

General approaches determine the number k of documents to

retrieve from a local search engine and then retrieve the k documents with the largest local similarities from the search engine

determine a local threshold for the local database and retrieve documents whose local similarities exceed the threshold

* The two approaches are equivalent.

Page 65: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Solution Classification

Local Determination
  all locally retrieved documents will be returned
  Examples: NCSTRL, Search Broker [MaBi97]

User Determination
  the global user determines how many documents should be retrieved from each local database
  neither effective nor practical when the number of databases is large
  Examples: MetaCrawler [SeEt97], SavvySearch [DrHo97]

Page 66: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Solution Classification (continued)

Weighted Allocation retrieve proportionally more documents

from local databases that are ranked higher

Learning-based Approaches use past retrieval experience for

selection Guaranteed Retrieval

aimed at guaranteeing the retrieval of globally most similar documents

Page 67: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Weighted Allocation

Suppose m documents are to be retrieved from N local databases.

Example 1: CORI net [CaLC95]
  Retrieve m * 2 * (1 + N - i) / (N * (N + 1)) documents from the ith ranked local database.

Example 2: D-WISE [YuLe97]
  Let ri be the ranking score of local database Di.
  Retrieve m * ri / (r1 + ... + rN) documents from Di.

When retrieving k documents from local database Di, the k documents with the largest local similarities are retrieved from Di.
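
Both allocation rules are one-liners. A Python sketch (function names and sample numbers are illustrative):

def cori_allocation(m, N):
    """CORI net: documents to request from the i-th ranked database, i = 1..N."""
    return [m * 2 * (1 + N - i) / (N * (N + 1)) for i in range(1, N + 1)]

def dwise_allocation(m, scores):
    """D-WISE: allocate m documents in proportion to the database ranking scores."""
    total = sum(scores)
    return [m * r / total for r in scores]

print(cori_allocation(m=20, N=4))                        # [8.0, 6.0, 4.0, 2.0]
print(dwise_allocation(m=20, scores=[0.5, 0.3, 0.2]))    # [10.0, 6.0, 4.0]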

Page 68: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Learning-based Approaches

determine the number of documents to retrieve from a local database based on past retrieval experiences with the local database.

Example: MRDD [VoGJ95] For query q, three average distribution are

obtained: D1: <1, 4, 6, 7, 10, 12, 17> D2: <1, 5, 7, 9, 15, 20> D3: <2, 3, 6, 9, 11, 16> To retrieve four relevant documents: retrieve 1

document from D1, 1 from D2 and 3 from D3.

Page 69: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Guaranteed Retrieval

Aim at
  guaranteeing that all potentially useful documents with respect to a query are retrieved
  minimizing the retrieval of useless documents

Two cases
  case 1: a global similarity threshold is known
  case 2: the number of globally desired documents is known

The two cases are mutually translatable.

Page 70: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 1: Global Similarity Threshold GT Is Known

find all documents whose global similarities are >= GT

Technique 1: Query modification [MLYW98]
  Modify q to q' such that Gsim(q, d) = Lsim(q', d)
  find all documents whose local similarities with q' are >= GT

Example: q = (q1, q2); d = (d1, d2);
  Gsim(q, d) = gidf1*q1*d1 + gidf2*q2*d2
  Lsim(q, d) = lidf1*q1*d1 + lidf2*q2*d2
  q' = (gidf1/lidf1 * q1, gidf2/lidf2 * q2)
  Lsim(q', d) = lidf1*(gidf1/lidf1)*q1*d1 + lidf2*(gidf2/lidf2)*q2*d2 = Gsim(q, d)
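
A Python sketch of the query modification for the dot-product case above (names and sample numbers are illustrative):

def modify_query(q, gidf, lidf):
    """Rescale each query weight by gidf/lidf so that evaluating the modified
    query with local idf yields the global similarity (dot-product case)."""
    return [qi * g / l for qi, g, l in zip(q, gidf, lidf)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q, gidf, lidf, d = [1.0, 1.0], [2.0, 3.0], [4.0, 1.5], [0.2, 0.4]
q_prime = modify_query(q, gidf, lidf)
gsim = dot([qi * g for qi, g in zip(q, gidf)], d)                # global similarity of q
lsim_qprime = dot([qi * l for qi, l in zip(q_prime, lidf)], d)   # local similarity of q'
print(gsim, lsim_qprime)    # both 1.6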

Page 71: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 1: Global Similarity Threshold GT Is Known

Technique 2: find the largest local threshold LT such that Gsim(q, d) >= GT implies Lsim(q, d) >= LT [MLYW98]
  retrieve d such that Lsim(q, d) >= LT to form set S
  transmit d from S if Gsim(q, d) >= GT

Example:
        Gsim(q, d)   Lsim(q, d)
  d1    0.8          0.7
  d2    0.75         0.35
  d3    0.4          0.6
  ...

If d2 is desired, then LT can be no higher than 0.35. If GT = 0.6, d3 will not be transmitted. At most m documents need to be transmitted from each local database.

Page 72: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 1: Global Similarity Threshold GT Is Known

Define the tightest local threshold:
  LT = min over d { Lsim(q, d) | Gsim(q, d) >= GT }

Determining LT:
  if both Gsim and Lsim are linear functions, apply linear programming;
  otherwise, try the Lagrange multiplier method.

Page 73: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 1: Global Similarity Threshold GT Is Known

Example: Gsim(q, d) = Cosine(qG, d), Lsim(q, d) = Cosine(qL, d)

  LT = min over d { Cosine(qL, d) | Cosine(qG, d) >= GT }
     = Cosine(theta + theta1), when qG, qL and d lie in the same plane, where theta is the angle with Cosine(theta) = GT and theta1 is the angle between qG and qL
     = GT * Cosine(theta1) - sin(theta) * sin(theta1)

[Figure: vectors qG, qL and d, with angle theta between qG and d and angle theta1 between qG and qL]

Page 74: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 2: Number of Globally Desired Documents Is Known

Solution: rank databases optimally for a given query q retrieve documents from databases in the

optimal order

Page 75: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 2: Number of Globally Desired Documents Is Known

Algorithm OptDocRetrv [YLWM99]

while fewer than m documents have been obtained do
  1. select the next database in the order
  2. compute the actual similarity of its most similar document
  3. find the minimum, min_sim, of the actual similarities of the most similar documents of the selected databases
  4. select from each selected database the documents whose actual global similarities are >= min_sim
end loop

Sort the documents in descending order of similarity and present the top m to the user.

Page 76: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 2: Number of Globally Desired Documents Is Known

Example: Number of documents desired m = 4. Databases are ranked in the order D1, D2, D3, D4, ...
  D1: d1: 0.53, d2: 0.48, d3: 0.39, ...
  D2: d10: 0.47, d21: 0.43, d52: 0.42, ...
  D3: d23: 0.54, d42: 0.49, ...
  D4: d33: 0.40, ...

  select D1: min_sim = 0.53, result = { d1 }
  select D2: min_sim = 0.47, result = { d1, d2, d10 }
  select D3: min_sim = 0.47, result = { d1, d2, d10, d23, d42 }
  result to user = { d1, d2, d23, d42 }
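
A Python sketch of OptDocRetrv on the example above. Here the per-database document similarities are passed in directly; in a real metasearch engine they would be obtained by querying each selected database (names are mine):

def opt_doc_retrv(ranked_dbs, m):
    """ranked_dbs: list (in the optimal database order) of lists of
    (doc, global similarity) pairs, each sorted by descending similarity.
    Returns the top m documents over the selected databases."""
    selected, result, min_sim = [], {}, None
    for db in ranked_dbs:
        if len(result) >= m:
            break
        selected.append(db)
        top_sim = db[0][1]                           # most similar document of this db
        min_sim = top_sim if min_sim is None else min(min_sim, top_sim)
        result = {doc: sim for dbl in selected for doc, sim in dbl if sim >= min_sim}
    ranked = sorted(result.items(), key=lambda x: -x[1])
    return ranked[:m]

dbs = [[("d1", 0.53), ("d2", 0.48), ("d3", 0.39)],
       [("d10", 0.47), ("d21", 0.43), ("d52", 0.42)],
       [("d23", 0.54), ("d42", 0.49)],
       [("d33", 0.40)]]
print(opt_doc_retrv(dbs, 4))   # d23, d1, d42, d2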

Page 77: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Case 2: Number of Globally Desired Documents Is Known

Proposition: If databases are optimally ranked, then all the m globally most similar documents will be retrieved by algorithm OptDocRetrv.

Proposition: For any single-term query, all the

m globally most similar documents will be

retrieved by algorithm OptDocRetrv.

Page 78: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Result Merging

Goal: Merge returned documents from multiple sources into a single ranked list.

Difficulties
  local similarities are usually not comparable due to
    different similarity functions
    different term weighting schemes
    different statistical values, e.g., global idf vs. local idf
  local similarities may be unavailable to the metasearch engine (only ranks are provided)

Ideal rank: in non-increasing order of global similarities

Page 79: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Solution Classification

similarity normalization
  normalize all local similarities into a common fixed range to improve comparability
similarity adjustment
  adjust local similarities/ranks based on the quality of the local databases
global similarity computation
  aim at obtaining the actual global similarities

Merge based on the normalized/adjusted/computed similarities.

Page 80: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Similarity Normalization

Example 1: MetaCrawler [SeEt97]
  map all local similarities into [0, 1000]
  map the largest local similarity from each source to 1000
  map the other local similarities proportionally
  add the normalized local similarities of documents retrieved from multiple sources

                       D1                     D2
                       d1    d2    d3         d1    d4    d5
  local similarity:    100   200   400        0.3   0.2   0.5
  normalized:          250   500   1000       600   400   1000
  final similarity:    850   500   1000       400   1000
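
A Python sketch of this normalize-and-add merging, reproducing the example above (names are mine):

def normalize(local_sims, top=1000):
    """Map the largest local similarity to 'top' and the rest proportionally."""
    mx = max(local_sims.values())
    return {doc: top * s / mx for doc, s in local_sims.items()}

def metacrawler_merge(sources):
    """Sum the normalized similarities of documents returned by multiple sources."""
    merged = {}
    for local_sims in sources:
        for doc, s in normalize(local_sims).items():
            merged[doc] = merged.get(doc, 0.0) + s
    return sorted(merged.items(), key=lambda x: -x[1])

d1_sims = {"d1": 100, "d2": 200, "d3": 400}
d2_sims = {"d1": 0.3, "d4": 0.2, "d5": 0.5}
print(metacrawler_merge([d1_sims, d2_sims]))
# d3: 1000, d5: 1000, d1: 250 + 600 = 850, d2: 500, d4: 400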

Page 81: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Similarity Normalization

Example 2: SavvySearch [DrHo97]
  same as MetaCrawler except using the range [0, 1]
  documents with no local similarities are assigned 0.5

Retrieval based on Multiple Evidence
  a normalized similarity between 0 and 1 can be considered as the confidence that a document is useful
  let si be the confidence of source i that document d is useful to query q
  estimate the overall confidence that d is useful:
    S(d, q) = 1 - (1 - s1)*...*(1 - sk)

Example: s1 = 0.7, s2 = 0.8, so S(d, q) = 1 - 0.3*0.2 = 0.94
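
The multiple-evidence combination as a small Python sketch (the function name is mine):

def combined_confidence(confidences):
    """Overall confidence that a document is useful, given per-source
    confidences s_i in [0, 1]: 1 - prod(1 - s_i)."""
    result = 1.0
    for s in confidences:
        result *= (1.0 - s)
    return 1.0 - result

print(combined_confidence([0.7, 0.8]))   # 0.94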

Page 82: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Similarity Adjustment

Use local similarity of d and the ranking score of its database to estimate the global similarity of d. database ranking score: the higher the better

Example: CORI net [CaLC95] assign the following weight to database D

w(D) = 1 + N * (r - r') / r'

r : rank score of D wrt q r’ : avg of scores of searched databases N : number of local databases searched adjust local similarity s of document d in D to s*w(D)

Similar approach employed in ProFusion [GaWG96].

Page 83: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Similarity Adjustment

Use the local rank of d and the ranking score of its database to estimate the global similarity of d.

Example: D-WISE [YuLe97]
  Gsim(q, d) = 1 - (r - 1) * Rmin / (m * Ri)
    Ri: ranking score of database Di
    Rmin: lowest database ranking score
    r: local rank of document d from Di
    m: total number of documents desired

Observation: the top-ranked document from any database has the same global similarity

Page 84: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

D-WISE (continued)

Example: R1 = 0.3, R2 = 0.7, Rmin = 0.2, m = 4
  Gsim(q, d) = 1 - (r - 1) * 0.2 / (4 * Ri)

        D1                  D2
     doc   r   Gsim      doc   r   Gsim
     d1    1   1.0       d1'   1   1.0
     d2    2   0.83      d2'   2   0.93
     d3    3   0.67      d3'   3   0.86

more documents from databases with higher ranking scores have higher global similarities
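
A Python sketch of the rank-based conversion, reproducing the table above (names are mine):

def dwise_global_sim(rank, R_i, R_min, m):
    """Convert a local rank into an estimated global similarity:
    Gsim = 1 - (rank - 1) * R_min / (m * R_i)."""
    return 1.0 - (rank - 1) * R_min / (m * R_i)

# the example above: R1 = 0.3, R2 = 0.7, R_min = 0.2, m = 4
for rank in (1, 2, 3):
    print(round(dwise_global_sim(rank, 0.3, 0.2, 4), 2),   # D1: 1.0, 0.83, 0.67
          round(dwise_global_sim(rank, 0.7, 0.2, 4), 2))   # D2: 1.0, 0.93, 0.86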

Page 85: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Global Similarity Computation

Technique 1: Document Fetching (e.g.: E2RD2, ParaCrawler)

fetch documents to the metasearch engine collect desired statistics (tf, idf, ...) compute global similarities Problem: may not scale well.

Page 86: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Global Similarity Computation

Technique 2: Knowledge Discovery
  discover the similarity functions and term weighting schemes used in different search engines
  use the discovered knowledge to determine
    which local similarities are reasonably comparable
    how to adjust local similarities to make them more comparable
    how to compute/estimate global similarities

Page 87: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Knowledge Discovery (continued)

Example: All local search engines selected for a query employ the same methods for indexing local documents and computing local similarities.

  If idf information is not used:
    local similarities are comparable

  If idf information is used and q has a single term t:
    Lsim(q, d) = [tft(q) * lidft * tft(d)] / (|q| * |d|) = [lidft * tft(d)] / |d|
    Gsim(q, d) = [gidft * tft(d)] / |d|
    Gsim(q, d) = Lsim(q, d) * gidft / lidft

Page 88: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Knowledge Discovery (continued)

Example (continued)

idf information is used and q has terms t1, ..., tk:

  Gsim(q, d) = sum over i = 1..k of [ tfti(q) * gidfti * tfti(d) ] / (|q| * |d|)
             = sum over i = 1..k of [ tfti(q) / |q| ] * gidfti * [ tfti(d) / |d| ]

  tfti(d) / |d| can be determined by using ti as a single-term query.

Page 89: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

Knowledge Discovery (continued)

submit ti as a single-term query q(ti) and let

  si = Lsim(d, q(ti)) = [ tfti(q(ti)) * lidfti * tfti(d) ] / ( |q(ti)| * |d| )

then

  tfti(d) / |d| = si * |q(ti)| / ( tfti(q(ti)) * lidfti )

Page 90: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

New Challenges

Incorporate new search techniques into metasearch
  document ranks in Google
  Kleinberg's hub and authority scores
  tag information in HTML documents
  implicit user feedback on previous retrievals
  pseudo relevance feedback on previous retrievals
  use of user profiles

Integrate local systems supporting different query types
  little research so far on Boolean queries, proximity queries and hierarchical queries

Page 91: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

New Challenges (continued)

Develop techniques to discover knowledge (representatives, ranking algorithms) about local search engines more accurately and more efficiently.
  some search engines may be unwilling to provide the desired representatives, or may provide inaccurate representatives
  indexing techniques, term weighting schemes and similarity functions are typically proprietary

Develop a standard guideline on what information each search engine should provide to a metasearch engine (some efforts: STARTS, Dublin Core).

Page 92: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

New Challenges (continued)

Distributed implementation of the metasearch engine
  alternative ways to store local database representatives?
  how to perform database selection and document selection at multiple sites in parallel?

Scale to a million databases
  storage of database representatives
  fast algorithms for database selection, document selection and result merging
  efficient network utilization

Page 93: VLDB'99 TUTORIAL Metasearch Engines:  Solutions and Challenges

New Challenges (continued)

Standard testbed for evaluation
  need a large number of local databases
  documents should have links for computing ranks, hub and authority scores
  a large number of typical Internet queries
  relevance assessments of documents for each query

Go beyond text databases
  how to extend to databases containing text, images, video, audio and structured data?