MindReader: Querying Databases Through Multiple Examples
Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan)
Ravishankar Subramanya (Pittsburgh Supercomputing Center)
Christos Faloutsos (Carnegie Mellon University)
Outline
• Background & Introduction
  – Query by Example
  – Our Approach
  – Relevance Feedback
  – What's New in MindReader?
• Proposed Method
  – Problem Formulation
  – Theorems
• Experimental Results
• Discussion & Conclusion
Query-by-Example: an example
Searching for "mildly overweight" patients
[figure: scatter plot of patients in (Height, Weight) space; selected examples are marked "very good" and "good", with the implied query point q at their center]
• The doctor selects examples by browsing the patient database
• The examples have an "oblique" correlation
• We can "guess" the implied query
Query-by-Example: the question
Assume that:
• the user gives multiple examples
• the user optionally assigns scores to the examples
• the samples have spatial correlation
How can we "guess" the implied query?
Our Approach
Automatically derive a distance measure from the given examples.
Two important notions:
1. diagonal query: the isosurfaces of queries have ellipsoid shapes
2. multiple-level scores: the user can specify "goodness" scores on samples
Isosurfaces of Distance Functions
[figure: three isosurfaces around a query point q: a circle (Euclidean), an axis-aligned ellipse (weighted Euclidean), and a rotated ellipse (generalized ellipsoid distance)]
Distance Function Formulas
• Euclidean: D(x, q) = Σi (xi − qi)²
• Weighted Euclidean: D(x, q) = Σi mi (xi − qi)²
• Generalized ellipsoid distance: D(x, q) = (x − q)ᵀ M (x − q)
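The three formulas above can be sketched in a few lines of Python (the point and matrix values below are made up for illustration):

```python
# The three distance functions from the slide, in pure Python.
def euclidean(x, q):
    return sum((xi - qi) ** 2 for xi, qi in zip(x, q))

def weighted_euclidean(x, q, m):
    # m: one weight per feature (the diagonal of M)
    return sum(mi * (xi - qi) ** 2 for mi, xi, qi in zip(m, x, q))

def ellipsoid(x, q, M):
    # D(x, q) = (x - q)^T M (x - q) with a full symmetric matrix M
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

x, q = (3.0, 1.0), (1.0, 0.0)
print(euclidean(x, q))                            # 5.0
print(weighted_euclidean(x, q, (1.0, 4.0)))       # 8.0
print(ellipsoid(x, q, [[1.0, 0.5], [0.5, 4.0]]))  # 10.0
```

With the identity matrix, the generalized ellipsoid distance reduces to the Euclidean one; with a diagonal matrix, to the weighted Euclidean one.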
Relevance Feedback
• A popular method in IR
• The query is modified based on relevance judgments from the user
• Two major approaches:
  1. query-point movement
  2. re-weighting
Relevance Feedback — Query-point Movement —
The query point is moved towards the "good" examples (Rocchio's formula in IR).
[figure: Q0 = original query point; the retrieved data points; the user's relevance judgments; Q1 = new query point, shifted towards the relevant data]
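A minimal sketch of Rocchio-style query-point movement (the weights alpha/beta/gamma and the data are illustrative choices, not taken from the slide):

```python
# Rocchio-style update: move the query towards the centroid of relevant
# examples and away from the centroid of non-relevant ones.
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    n = len(q0)
    def mean(points):
        if not points:
            return [0.0] * n
        return [sum(p[i] for p in points) / len(points) for i in range(n)]
    r, nr = mean(relevant), mean(nonrelevant)
    return [alpha * q0[i] + beta * r[i] - gamma * nr[i] for i in range(n)]

# Q0 at the origin, two relevant examples, no non-relevant feedback.
q1 = rocchio([0.0, 0.0], relevant=[[1.0, 1.0], [3.0, 1.0]], nonrelevant=[])
print(q1)  # [1.5, 0.75]
```

Note that the update changes only the query *point*; the distance function itself stays Euclidean.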
Relevance Feedback — Re-weighting —
• The Standard Deviation Method in the MARS (UIUC) image retrieval system
• Assumption: if the deviation of a feature is high, the feature is not important
• For each feature i, the weight wi = 1/σi is assigned
• MARS didn't provide any justification for this formula
[figure: "good" examples in (f1, f2) feature space; the high-variance f1 is a "bad" feature, the low-variance f2 a "good" feature; the implied query is an axis-aligned ellipse]
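The standard deviation method can be sketched as follows (the sample data is made up; feature f1 is given a large spread and f2 a small one, matching the figure):

```python
import math

# MARS-style re-weighting: weight each feature by the inverse of its
# standard deviation over the "good" examples (w_i = 1 / sigma_i).
def std_weights(samples):
    n = len(samples[0])
    weights = []
    for i in range(n):
        vals = [s[i] for s in samples]
        mean = sum(vals) / len(vals)
        sigma = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        weights.append(1.0 / sigma)  # high deviation -> low weight
    return weights

# f1 varies a lot ("bad" feature); f2 hardly varies ("good" feature).
good = [[0.0, 1.0], [2.0, 1.1], [4.0, 0.9]]
w = std_weights(good)
print(w[0] < w[1])  # True: the stable feature f2 gets the larger weight
```

The resulting weights plug straight into the weighted Euclidean distance, which is why the implied query is always an axis-aligned ellipse.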
What's New in MindReader?
• MindReader does not use ad-hoc heuristics (cf. Rocchio's formula, re-weighting in MARS)
• It can handle multiple levels of scores
• It can derive generalized ellipsoid distances
[figure: a rotated ellipsoid isosurface around the query point q]
Isosurfaces of Distance Functions
[figure: Euclidean (circle) = Rocchio; weighted Euclidean (axis-aligned ellipse) = MARS; generalized ellipsoid distance (rotated ellipse) = MindReader; each centered at the query point q]
Method: distance function
Generalized ellipsoid distance function:
D(x, q) = (x − q)ᵀ M (x − q), or
D(x, q) = Σj Σk mjk (xj − qj)(xk − qk)
• q: query point vector
• x: data point vector
• M = [mjk]: symmetric distance matrix
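The two formulations are equal term by term; a quick numerical check (with made-up values) confirms it:

```python
# The matrix form (x - q)^T M (x - q) versus the explicit double sum
# over m_jk (x_j - q_j)(x_k - q_k): both give the same value.
def ellipsoid_matrix(x, q, M):
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(d[j] * sum(M[j][k] * d[k] for k in range(n)) for j in range(n))

def ellipsoid_sum(x, q, M):
    n = len(x)
    return sum(M[j][k] * (x[j] - q[j]) * (x[k] - q[k])
               for j in range(n) for k in range(n))

x, q = [2.0, -1.0, 0.5], [1.0, 0.0, 0.0]
M = [[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 3.0]]  # symmetric
print(abs(ellipsoid_matrix(x, q, M) - ellipsoid_sum(x, q, M)) < 1e-12)  # True
```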
Method: definitions
• N: number of samples
• n: number of dimensions (features)
• xi: n-d sample data vectors, xi = [xi1, …, xin]ᵀ
• X: N×n sample data matrix, X = [x1, …, xN]ᵀ
• v: N-d score vector, v = [v1, …, vN]
Method: problem formulation
Given:
• N sample n-d vectors
• multiple-level scores (optional)
Estimate:
• the optimal distance matrix M
• the optimal new query point q
Method: optimality
How do we measure "optimality"? By minimization of a "penalty".
What is the "penalty"? The score-weighted sum of distances between the query point and the sample vectors.
Therefore, minimize Σi vi (xi − q)ᵀ M (xi − q) under the constraint det(M) = 1
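The penalty being minimized can be written out directly (sample points, scores, and the matrix below are made-up illustrations):

```python
# Penalty = sum over samples of score_i * D(x_i, q) with the
# generalized ellipsoid distance D(x, q) = (x - q)^T M (x - q).
def penalty(samples, scores, q, M):
    n = len(q)
    total = 0.0
    for x, v in zip(samples, scores):
        d = [x[i] - q[i] for i in range(n)]
        total += v * sum(M[j][k] * d[j] * d[k]
                         for j in range(n) for k in range(n))
    return total

samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 1.0, 1.0]          # higher score = better example
q = [0.5, 0.5]
I = [[1.0, 0.0], [0.0, 1.0]]      # identity M -> plain Euclidean penalty
print(penalty(samples, scores, q, I))  # 2.0
```

The constraint det(M) = 1 rules out the trivial solution M = 0, which would otherwise make the penalty zero.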
Theorems: theorem 1
Solved with Lagrange multipliers.
Theorem 1 (optimal query point):
q = x̄ = [x̄1, …, x̄n]ᵀ = Xᵀv / Σi vi
The optimal query point is the weighted average of the sample data vectors.
Theorems: theorems 2 & 3
Theorem 2 (optimal distance matrix):
M = (det(C))^(1/n) C⁻¹
where C = [cjk] is the weighted covariance matrix, cjk = Σi vi (xij − x̄j)(xik − x̄k)
Theorem 3: if we restrict M to a diagonal matrix, our method equals the standard deviation method (MindReader includes MARS!)
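Theorems 1 and 2 together give a closed-form estimator. A 2-D sketch in pure Python (the sample matrix X and scores v are made up; the 2×2 inverse is written out by hand):

```python
# Theorem 1: q = weighted average of the samples.
# Theorem 2: M = det(C)^(1/n) * C^(-1) with C the weighted covariance.
def optimal_query_and_matrix(X, v):
    N, n = len(X), 2
    vs = sum(v)
    q = [sum(v[i] * X[i][j] for i in range(N)) / vs for j in range(n)]
    # Weighted covariance matrix C = [c_jk].
    c = [[sum(v[i] * (X[i][j] - q[j]) * (X[i][k] - q[k]) for i in range(N))
          for k in range(n)] for j in range(n)]
    det_c = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det_c, -c[0][1] / det_c],
           [-c[1][0] / det_c, c[0][0] / det_c]]
    s = det_c ** (1.0 / n)
    M = [[s * inv[j][k] for k in range(n)] for j in range(n)]
    return q, M

X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.5], [0.0, 0.5]]
v = [1.0, 2.0, 1.0, 1.0]
q, M = optimal_query_and_matrix(X, v)
det_m = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(abs(det_m - 1.0) < 1e-9)  # True: det(M) = 1 holds by construction
```

The scaling factor det(C)^(1/n) is exactly what enforces the constraint det(M) = 1 from the problem formulation.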
Experiments
1. Estimation of the optimal distance function
   • Can MindReader estimate the target distance matrix Mhidden appropriately?
   • Based on synthetic data
   • Comparison with the standard deviation method
2. Query-point movement
3. Application to real data sets (GIS data)
Experiment 1: target data
Two-dimensional normal distribution
Experiment 1: idea
• Assume the user has a "hidden" distance Mhidden in mind
• Simulate iterative query refinement
• Q: how fast can we discover the "hidden" distance?
• The query point is fixed to (0, 0)
Experiment 1: iteration steps
1. Make the initial samples: compute the k-NNs with the Euclidean distance
2. For each object x, calculate a score that reflects the hidden distance Mhidden
3. MindReader estimates the matrix M
4. Retrieve the k-NNs with the derived matrix M
5. If the result has improved, go to step 2
Experiment 1: scores
Calculation of scores in terms of the "hidden" distance function:
1. Calculate the distance from the query point q based on the hidden distance matrix Mhidden: d = D(x, q)
2. Translate the distance value d to a score: s = exp(−d²/2), v = log(s / (1 − s))
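The score translation is a two-step map from distance to weight; a small sketch (note that at d = 0 we get s = 1 and the log term is undefined, so the example uses d > 0):

```python
import math

# Distance d under the hidden matrix -> score s in (0, 1) -> weight v.
def score(d):
    s = math.exp(-d * d / 2.0)       # Gaussian-shaped score
    return math.log(s / (1.0 - s))   # logit; undefined at d = 0 (s = 1)

# Closer objects (smaller d) get larger scores.
print(score(0.5) > score(1.0) > score(2.0))  # True
```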
Experiment 1: evaluation measures
• Used to check whether the query result has improved or not
• CD-k measure (CD stands for "cumulative distance"): for the k-NNs retrieved by matrix M, compute the actual distance by matrix Mhidden, then take the sum
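The CD-k measure can be sketched with a brute-force k-NN search (the four data points below are made up; here M equals Mhidden, so CD-k is at its minimum by construction):

```python
# Generalized ellipsoid distance D(x, q) = (x - q)^T M (x - q).
def dist(x, q, M):
    d = [x[i] - q[i] for i in range(len(q))]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

# CD-k: retrieve the k-NNs with the current matrix M, then sum their
# *actual* distances under the hidden matrix M_hidden.
def cd_k(data, q, M, M_hidden, k):
    knn = sorted(data, key=lambda x: dist(x, q, M))[:k]
    return sum(dist(x, q, M_hidden) for x in knn)

data = [[0.1, 0.1], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
q = [0.0, 0.0]
I = [[1.0, 0.0], [0.0, 1.0]]
# With M = M_hidden the truly best points are retrieved, so any other
# estimate M can only match or exceed this CD-k value.
print(cd_k(data, q, I, I, k=2))  # sums the hidden distances of the 2 NNs
```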
Experiment 1: final k-NNs
[figure: the ellipse is the isosurface for Mhidden; red points are the final k-NNs obtained by the standard deviation method; green points are the final k-NNs obtained by MindReader]
Experiment 1: speed of convergence
[figure: x-axis = number of iterations; y-axis = CD-k value; red = standard deviation method; green = MindReader; blue = best possible CD-k value for the data set]
Experiment 1: changes of isosurfaces
[figures: estimated isosurfaces after the 0th and 2nd iterations, and after the 4th and 8th iterations]
Experiment 2: query-point movement
• Starts from the query point (0.5, 0.5)
• MindReader converges to Mhidden within five iterations
Experiment 3: real data set
• End-points of road segments from Montgomery County, MD
• The data is normalized to [−1, 1] × [−1, 1]
• The query specifies five points along route I-270
• Can we estimate a good distance function?
Experiment 3: isosurfaces
After 0th and 2nd iterations: fast convergence!
Discussion: efficiency
Don't worry about speed! Ellipsoid queries are supported by spatial access methods:
• Seidl & Kriegel [VLDB 97]
• Ankerst, Braunmüller, Kriegel, Seidl [VLDB 98]
For the derived distance, we can efficiently use a spatial index.
Conclusion
MindReader:
• automatically guesses diagonal queries from the given examples
• handles multiple levels of scores
• includes "Rocchio" and "MARS" (the standard deviation method) as special cases
• problem formulation & solution
• evaluation based on the experiments