MindReader: Querying databases through multiple examples Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan) Ravishankar Subramanya (Pittsburgh Supercomputing Center) Christos Faloutsos (Carnegie Mellon University)


Page 1: MindReader: Querying databases through multiple examples

MindReader: Querying databases through multiple examples

Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan)

Ravishankar Subramanya (Pittsburgh Supercomputing Center)

Christos Faloutsos (Carnegie Mellon University)

Page 2: MindReader: Querying databases through multiple examples

Outline

• Background & Introduction
  • Query by Example
  • Our Approach
  • Relevance Feedback
  • What’s New in MindReader?
• Proposed Method
  • Problem Formulation
  • Theorems
• Experimental Results
• Discussion & Conclusion

Page 9: MindReader: Querying databases through multiple examples

Query-by-Example: an example

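Searching “mildly overweight” patients

• The doctor selects examples by browsing the patient database
• The examples have “oblique” correlation
• We can “guess” the implied query

[Figure: scatter plot of patients with axes Height and Weight; examples are marked “very good” or “good”, and the implied query is centered at q]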

Page 11: MindReader: Querying databases through multiple examples

Query-by-Example: the question

Assume that
• the user gives multiple examples
• the user optionally assigns scores to the examples
• the samples have spatial correlation

How can we “guess” the implied query?

Page 13: MindReader: Querying databases through multiple examples

Our Approach

Automatically derive a distance measure from the given examples

Two important notions:
1. diagonal query: isosurfaces of queries have ellipsoid shapes
2. multiple-level scores: the user can specify “goodness scores” on samples

Page 14: MindReader: Querying databases through multiple examples

Isosurfaces of Distance Functions

[Figure: isosurfaces around the query point q for three distance functions: Euclidean, weighted Euclidean, and generalized ellipsoid distance]

Page 15: MindReader: Querying databases through multiple examples

Distance Function Formulas

Euclidean

$D(x, q) = (x - q)^2 = \sum_i (x_i - q_i)^2$

Weighted Euclidean

$D(x, q) = \sum_i m_i (x_i - q_i)^2$

Generalized ellipsoid distance

$D(x, q) = (x - q)^T M (x - q)$
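
The three formulas above can be compared side by side in a minimal Python sketch. The function names and the sample vectors are illustrative assumptions, not part of the slides; matrices are plain nested lists.

```python
def euclidean(x, q):
    """Squared Euclidean distance: sum_i (x_i - q_i)^2."""
    return sum((xi - qi) ** 2 for xi, qi in zip(x, q))


def weighted_euclidean(x, q, m):
    """Weighted Euclidean distance: sum_i m_i (x_i - q_i)^2."""
    return sum(mi * (xi - qi) ** 2 for mi, xi, qi in zip(m, x, q))


def ellipsoid(x, q, M):
    """Generalized ellipsoid distance: (x - q)^T M (x - q)."""
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))


# Hypothetical example values (not from the slides).
x, q = [2.0, 1.0], [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
diagonal = [[0.5, 0.0], [0.0, 2.0]]
oblique = [[1.0, -0.5], [-0.5, 1.0]]  # off-diagonal terms tilt the ellipse

print(euclidean(x, q))                       # 5.0
print(ellipsoid(x, q, identity))             # 5.0, same as Euclidean
print(weighted_euclidean(x, q, [0.5, 2.0]))  # 4.0
print(ellipsoid(x, q, diagonal))             # 4.0, same as weighted Euclidean
print(ellipsoid(x, q, oblique))              # 3.0
```

With an identity matrix the ellipsoid distance reduces to the Euclidean one, and with a diagonal matrix it reduces to the weighted Euclidean one; only a matrix with off-diagonal entries gives an isosurface that is not aligned with the axes.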

Page 17: MindReader: Querying databases through multiple examples

Relevance Feedback

• Popular method in IR
• The query is modified based on relevance judgments from the user
• Two major approaches
  1. query-point movement
  2. re-weighting

Page 21: MindReader: Querying databases through multiple examples

Relevance Feedback: Query-point Movement

The query point is moved towards “good” examples (Rocchio’s formula in IR)

[Figure: Q0 is the query point; the retrieved data receive relevance judgments; Q1 is the new query point, moved from Q0 towards the relevant items]
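
As a rough illustration of query-point movement, the sketch below uses the commonly cited form of Rocchio's formula, Q1 = alpha * Q0 + beta * mean(relevant) - gamma * mean(non-relevant). The slides do not give the coefficients, so `alpha`, `beta`, `gamma`, and the sample points are assumptions chosen for the example.

```python
def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(points)
    return [sum(p[j] for p in points) / count for j in range(len(points[0]))]


def rocchio(q0, relevant, non_relevant=(), alpha=0.5, beta=0.5, gamma=0.0):
    """Move the query point towards relevant items and away from non-relevant ones.

    alpha, beta, gamma are assumed tuning parameters, not values from the slides.
    """
    q1 = [alpha * c for c in q0]
    if relevant:
        centre = mean_vector(relevant)
        q1 = [c + beta * r for c, r in zip(q1, centre)]
    if non_relevant:
        centre = mean_vector(non_relevant)
        q1 = [c - gamma * r for c, r in zip(q1, centre)]
    return q1


# Hypothetical data: the user marks two retrieved points as relevant.
q0 = [0.0, 0.0]
good = [[2.0, 2.0], [4.0, 2.0]]
q1 = rocchio(q0, good)
print(q1)  # [1.5, 1.0], halfway between Q0 and the centroid of the good points
```

Note that this only moves the query point; the distance function itself stays Euclidean, which is the limitation the later slides address.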

Page 30: MindReader: Querying databases through multiple examples

Relevance Feedback: Re-weighting

• Standard Deviation Method in the MARS (UIUC) image retrieval system
• Assumption: if the deviation of a feature is high, the feature is not important
• For each feature, the weight $w_i = 1/\sigma_i$ is assigned
• MARS didn’t provide any justification for this formula

[Figure: “good” examples plotted on features f1 and f2; one feature is labeled “bad” and the other “good”, and the implied query region is outlined]
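
The standard deviation method can be sketched in a few lines of Python. This is a minimal illustration, assuming the population standard deviation over the “good” examples; the function name and data are hypothetical, and the actual MARS implementation may differ.

```python
from statistics import pstdev


def stddev_weights(samples):
    """Return one weight per feature, w_i = 1 / sigma_i.

    sigma_i is the standard deviation of feature i over the good examples.
    A feature with zero deviation would need special handling (division by zero).
    """
    n_features = len(samples[0])
    return [1.0 / pstdev([s[i] for s in samples]) for i in range(n_features)]


# Hypothetical good examples: feature 0 varies little, feature 1 varies a lot.
good = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
w = stddev_weights(good)
print(w)  # feature 0 gets the larger weight
```

Because each feature is weighted independently, the resulting isosurfaces are always axis-aligned ellipses; the method cannot express a correlation between f1 and f2.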

Page 32: MindReader: Querying databases through multiple examples

What’s New in MindReader?

MindReader
• does not use ad-hoc heuristics (cf. Rocchio’s expression, re-weighting in MARS)
• can handle multiple levels of scores
• can derive generalized ellipsoid distances

Page 33: MindReader: Querying databases through multiple examples

What’s New in MindReader?

MindReader can derive generalized ellipsoid distances

[Figure: generalized ellipsoid isosurface around the query point q]

Page 37: MindReader: Querying databases through multiple examples

Isosurfaces of Distance Functions

• Euclidean: Rocchio
• weighted Euclidean: MARS
• generalized ellipsoid distance: MindReader

[Figure: isosurfaces around the query point q for each of the three distance functions]

Page 39: MindReader: Querying databases through multiple examples

Method: distance function

Generalized ellipsoid distance function

$D(x, q) = (x - q)^T M (x - q)$, or

$D(x, q) = \sum_j \sum_k m_{jk} (x_j - q_j)(x_k - q_k)$

• q: query point vector
• x: data point vector
• $M = [m_{jk}]$: symmetric distance matrix

Page 40: MindReader: Querying databases through multiple examples

Method: definitions

• N: no. of samples
• n: no. of dimensions (features)
• $x_i$: n-d sample data vectors, $x_i = [x_{i1}, \ldots, x_{in}]^T$
• X: N×n sample data matrix, $X = [x_1, \ldots, x_N]^T$
• v: N-d score vector, $v = [v_1, \ldots, v_N]$

Page 41: MindReader: Querying databases through multiple examples

Method: problem formulation

Given
• N sample n-d vectors
• multiple-level scores (optional)

Estimate
• optimal distance matrix M
• optimal new query point q

Page 42: MindReader: Querying databases through multiple examples

Method: optimality

How do we measure “optimality”?
• minimization of “penalty”

What is the “penalty”?
• weighted sum of distances between the query point and the sample vectors

Therefore,
• minimize $\sum_i v_i (x_i - q)^T M (x_i - q)$
• under the constraint $\det(M) = 1$

Page 44: MindReader: Querying databases through multiple examples

Theorems: theorem 1

Solved with Lagrange multipliers

Theorem 1: optimal query point

$q = \bar{x} = [\bar{x}_1, \ldots, \bar{x}_n]^T = X^T v / \sum_i v_i$

• the optimal query point is the weighted average of the sample data vectors
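
Theorem 1 can be checked numerically with a small sketch: compute the score-weighted average of the samples and confirm that the penalty (the score-weighted sum of ellipsoid distances) is no larger there than at nearby points. The data, the matrix `M`, and the function names are assumptions made for the example; the scores are taken to be positive.

```python
def ellipsoid(x, q, M):
    """Generalized ellipsoid distance (x - q)^T M (x - q)."""
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))


def penalty(X, v, q, M):
    """Score-weighted sum of distances between q and the samples."""
    return sum(vi * ellipsoid(xi, q, M) for xi, vi in zip(X, v))


def weighted_mean(X, v):
    """Theorem 1: q = X^T v / sum_i v_i."""
    total = sum(v)
    return [sum(vi * xi[j] for xi, vi in zip(X, v)) / total
            for j in range(len(X[0]))]


# Hypothetical samples, scores, and a positive definite M with det(M) = 1.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
v = [1.0, 2.0, 1.0]
M = [[2.0, 1.0], [1.0, 1.0]]

q_opt = weighted_mean(X, v)
print(q_opt)  # [3.0, 4.75]

# Any small shift away from q_opt should increase the penalty.
base = penalty(X, v, q_opt, M)
for shift in ([0.1, 0.0], [0.0, -0.1], [-0.1, 0.1]):
    moved = [qc + s for qc, s in zip(q_opt, shift)]
    print(penalty(X, v, moved, M) > base)  # True
```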

Page 45: MindReader: Querying databases through multiple examples

Theorems: theorem 2 & 3

Theorem 2: optimal distance matrix

$M = (\det(C))^{1/n} C^{-1}$

• $C = [c_{jk}]$ is the weighted covariance matrix
• $c_{jk} = \sum_i v_i (x_{ik} - \bar{x}_k)(x_{ij} - \bar{x}_j)$

Theorem 3
• If we restrict M to a diagonal matrix, our method is equal to the standard deviation method
• MindReader includes MARS!
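
A minimal two-dimensional sketch of Theorems 2 and 3 follows. It builds the weighted covariance matrix C, forms M = det(C)^(1/n) C^(-1) with n = 2, and also computes the diagonal-only variant, in which each weight is inversely proportional to the weighted variance of that feature. The helper names and the data are assumptions for illustration; a non-singular C is required.

```python
import math


def weighted_mean(X, v):
    total = sum(v)
    return [sum(vi * xi[j] for xi, vi in zip(X, v)) / total
            for j in range(len(X[0]))]


def weighted_covariance(X, v):
    """c_jk = sum_i v_i (x_ij - xbar_j)(x_ik - xbar_k)."""
    xbar = weighted_mean(X, v)
    n = len(xbar)
    return [[sum(vi * (xi[j] - xbar[j]) * (xi[k] - xbar[k])
                 for xi, vi in zip(X, v))
             for k in range(n)] for j in range(n)]


def optimal_matrix_2d(C):
    """Theorem 2 for n = 2: M = sqrt(det C) * C^-1, so det(M) = 1."""
    (a, b), (c, d) = C
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("C must be non-singular (too few or collinear samples)")
    scale = math.sqrt(det)
    return [[scale * d / det, -scale * b / det],
            [-scale * c / det, scale * a / det]]


def optimal_diagonal_2d(C):
    """Diagonal restriction: m_jj proportional to 1 / c_jj, product of m_jj = 1."""
    scale = math.sqrt(C[0][0] * C[1][1])
    return [scale / C[0][0], scale / C[1][1]]


# Hypothetical positively correlated samples with equal scores.
X = [[1.0, 2.0], [2.0, 6.0], [3.0, 4.0], [4.0, 10.0], [5.0, 8.0]]
v = [1.0] * len(X)

C = weighted_covariance(X, v)
M = optimal_matrix_2d(C)
print(C)                       # [[10.0, 16.0], [16.0, 40.0]]
print(M)                       # negative off-diagonal: the ellipse follows the correlation
print(optimal_diagonal_2d(C))  # [2.0, 0.5]
```

In the diagonal case the feature with the larger spread receives the smaller weight, which is the behavior the standard deviation method assumes.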

Page 47: MindReader: Querying databases through multiple examples

Experiments

1. Estimation of optimal distance function
   • Can MindReader estimate the target distance matrix Mhidden appropriately?
   • Based on synthetic data
   • Comparison with standard deviation method
2. Query-point movement
3. Application to real data sets
   • GIS data

Page 48: MindReader: Querying databases through multiple examples

Experiment 1: target data

Two-dimensional normal distribution

Page 49: MindReader: Querying databases through multiple examples

Experiment 1: idea

Assume that the user has a “hidden” distance Mhidden in mind

Simulate iterative query refinement

Q: How fast can we discover the “hidden” distance?

The query point is fixed to (0, 0)

Page 50: MindReader: Querying databases through multiple examples

Experiment 1: iteration steps

1. Make initial samples: compute k-NNs with Euclidean distance

2. For each object x, calculate its score that reflects the hidden distance Mhidden

3. MindReader estimates the matrix M

4. Retrieve k-NNs with the derived matrix M

5. If the result is improved, go to step 2
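
The iteration steps above can be simulated roughly as follows for two-dimensional data. This is a sketch under several assumptions that are not taken from the slides: the score is a simple stand-in, exp(-D/2) under the hidden matrix; the query point stays fixed, so the matrix is estimated from score-weighted second moments about q rather than about the sample mean; and "improved" means a smaller cumulative hidden distance over the retrieved k-NNs. All names are hypothetical.

```python
import math
import random


def dist(x, q, M):
    """Generalized ellipsoid distance for a symmetric 2x2 matrix M."""
    dx, dy = x[0] - q[0], x[1] - q[1]
    return M[0][0] * dx * dx + 2.0 * M[0][1] * dx * dy + M[1][1] * dy * dy


def knn(data, q, M, k):
    return sorted(data, key=lambda x: dist(x, q, M))[:k]


def cumulative_hidden_distance(result, q, M_hidden):
    return sum(dist(x, q, M_hidden) for x in result)


def estimate_matrix(samples, scores, q):
    """M = sqrt(det C) * C^-1, with C the score-weighted scatter about q."""
    a = sum(v * (x[0] - q[0]) ** 2 for x, v in zip(samples, scores))
    b = sum(v * (x[0] - q[0]) * (x[1] - q[1]) for x, v in zip(samples, scores))
    c = sum(v * (x[1] - q[1]) ** 2 for x, v in zip(samples, scores))
    det = a * c - b * b
    if det <= 0.0:
        raise ValueError("degenerate samples")
    scale = math.sqrt(det)
    return [[scale * c / det, -scale * b / det],
            [-scale * b / det, scale * a / det]]


def refine(data, q, M_hidden, k=20, max_iter=10):
    M = [[1.0, 0.0], [0.0, 1.0]]                 # step 1: start with Euclidean
    result = knn(data, q, M, k)
    best = cumulative_hidden_distance(result, q, M_hidden)
    history = [best]
    for _ in range(max_iter):
        # step 2: simulated user scores (assumed form, see lead-in)
        scores = [math.exp(-dist(x, q, M_hidden) / 2.0) for x in result]
        M_new = estimate_matrix(result, scores, q)       # step 3
        new_result = knn(data, q, M_new, k)              # step 4
        value = cumulative_hidden_distance(new_result, q, M_hidden)
        if value >= best:                                # step 5: stop if not improved
            break
        M, result, best = M_new, new_result, value
        history.append(best)
    return M, history


random.seed(0)
data = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(500)]
M_hidden = [[2.0, 1.0], [1.0, 1.0]]              # hypothetical hidden matrix, det = 1
M_est, history = refine(data, (0.0, 0.0), M_hidden)
print(M_est)
print(history)   # cumulative hidden distance after each accepted iteration
```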

Page 51: MindReader: Querying databases through multiple examples

Experiment 1: scores

Calculation of scores in terms of the “hidden” distance function

1. Calculate the distance from the query point q based on the hidden distance matrix Mhidden

   $d = D(x, q)$

2. Translate the distance value d to a score v

   $s = \exp(-d^2/2)$, $v = \log s / (1 - s)$

Page 52: MindReader: Querying databases through multiple examples

Experiment 1: evaluation measures

Used to check whether the query result is improved or not

CD-k measure
• CD stands for “cumulative distance”
• for the k-NNs retrieved by matrix M, compute the actual distances by matrix Mhidden, then take the summation
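
The CD-k measure described above can be sketched as follows: retrieve the k nearest neighbors under the current matrix M, then sum their distances under Mhidden. The function names and the tiny data set are assumptions for illustration.

```python
def ellipsoid(x, q, M):
    """Generalized ellipsoid distance (x - q)^T M (x - q)."""
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))


def k_nearest(data, q, M, k):
    """k nearest neighbors of q under the distance defined by M."""
    return sorted(data, key=lambda x: ellipsoid(x, q, M))[:k]


def cd_k(data, q, M, M_hidden, k):
    """Cumulative hidden distance of the k-NNs retrieved with M (lower is better)."""
    retrieved = k_nearest(data, q, M, k)
    return sum(ellipsoid(x, q, M_hidden) for x in retrieved)


# Hypothetical data set and hidden matrix (det = 1).
data = [[1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [0.5, -2.0], [-1.5, 0.0]]
q = [0.0, 0.0]
euclid = [[1.0, 0.0], [0.0, 1.0]]
M_hidden = [[1.25, -0.75], [-0.75, 1.25]]

print(cd_k(data, q, euclid, M_hidden, k=2))    # 5.0
print(cd_k(data, q, M_hidden, M_hidden, k=2))  # 3.8125, the best possible CD-2 here
```

Retrieving with Mhidden itself gives the lower bound for the data set, which corresponds to the "best CD-k value possible" curve in the convergence plot.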

Page 53: MindReader: Querying databases through multiple examples

Experiment 1: final k-NNs

Ellipse: isosurface for Mhidden

Red points: final k-NNs obtained by standard deviation method

Green points: final k-NNs obtained by MindReader

Page 54: MindReader: Querying databases through multiple examples

Experiment 1: speed of convergence

x-axis: no. of iterations

y-axis: CD-k measure value

Red: standard deviation method

Green: MindReader

Blue: best CD-k value possible for the data set

Page 55: MindReader: Querying databases through multiple examples

Experiment 1: changes of isosurfaces

After 0th and 2nd iterations

Page 56: MindReader: Querying databases through multiple examples

Experiment 1: changes of isosurfaces

After 4th and 8th iterations

Page 57: MindReader: Querying databases through multiple examples

Experiment 2: query-point movement

Starts from query point (0.5, 0.5)

MindReader converges to Mhidden with five iterations

Page 58: MindReader: Querying databases through multiple examples

Experiment 3: real data set

End-points of road segments from Montgomery County, MD

Data is normalized to [-1, 1] × [-1, 1]

The query specifies five points along route I-270

Can we estimate a good distance function?

Page 59: MindReader: Querying databases through multiple examples

Experiment 3: isosurfaces

After 0th and 2nd iterations: fast convergence!

Page 60: MindReader: Querying databases through multiple examples

Discussion: efficiency

Don’t worry about speed!
• ellipsoid query support using spatial access methods:
  • Seidl & Kriegel [VLDB97]
  • Ankerst, Braunmüller, Kriegel, Seidl [VLDB98]
• for the derived distance, we can efficiently use a spatial index

Page 61: MindReader: Querying databases through multiple examples

Conclusion

MindReader
• automatically guesses diagonal queries from the given examples
• handles multiple levels of scores
• includes “Rocchio” and “MARS” (standard deviation method)

Contributions
• problem formulation & solution
• evaluation based on the experiments