MindReader: Querying Databases Through Multiple Examples
Yoshiharu Ishikawa (Nara Institute of Science and Technology, Japan)
Ravishankar Subramanya (Pittsburgh Supercomputing Center)
Christos Faloutsos (Carnegie Mellon University)
Outline
• Background & Introduction
  – Query by Example
  – Our Approach
  – Relevance Feedback
  – What's New in MindReader?
• Proposed Method
  – Problem Formulation
  – Theorems
• Experimental Results
• Discussion & Conclusion
Query-by-Example: an example
Searching for "mildly overweight" patients
[figure: scatter plot of patients in (Height, Weight) space; selected examples are marked "very good" and "good", with the implied query point q at their center]
• The doctor selects examples by browsing the patient database
• The examples have an "oblique" correlation
• We can "guess" the implied query
Query-by-Example: the question
Assume that:
• the user gives multiple examples
• the user optionally assigns scores to the examples
• the samples have spatial correlation
How can we "guess" the implied query?
Our Approach
Automatically derive a distance measure from the given examples.
Two important notions:
1. diagonal query: the isosurfaces of queries have ellipsoid shapes
2. multiple-level scores: the user can specify "goodness" scores on samples
Isosurfaces of Distance Functions
[figure: three isosurfaces around a query point q: a circle (Euclidean), an axis-aligned ellipse (weighted Euclidean), and a rotated ellipse (generalized ellipsoid distance)]
Distance Function Formulas
• Euclidean: D(x, q) = Σi (xi − qi)²
• Weighted Euclidean: D(x, q) = Σi mi (xi − qi)²
• Generalized ellipsoid distance: D(x, q) = (x − q)ᵀ M (x − q)
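The three formulas above can be sketched in a few lines of Python (the point and matrix values below are made up for illustration):

```python
# The three distance functions from the slide, in pure Python.
def euclidean(x, q):
    return sum((xi - qi) ** 2 for xi, qi in zip(x, q))

def weighted_euclidean(x, q, m):
    # m: one weight per feature (the diagonal of M)
    return sum(mi * (xi - qi) ** 2 for mi, xi, qi in zip(m, x, q))

def ellipsoid(x, q, M):
    # D(x, q) = (x - q)^T M (x - q) with a full symmetric matrix M
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

x, q = (3.0, 1.0), (1.0, 0.0)
print(euclidean(x, q))                            # 5.0
print(weighted_euclidean(x, q, (1.0, 4.0)))       # 8.0
print(ellipsoid(x, q, [[1.0, 0.5], [0.5, 4.0]]))  # 10.0
```

With the identity matrix, the generalized ellipsoid distance reduces to the Euclidean one; with a diagonal matrix, to the weighted Euclidean one.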
Relevance Feedback
• A popular method in IR
• The query is modified based on relevance judgments from the user
• Two major approaches:
  1. query-point movement
  2. re-weighting
Relevance Feedback — Query-point Movement —
The query point is moved towards the "good" examples (Rocchio's formula in IR).
[figure: Q0 = original query point; the retrieved data points; the user's relevance judgments; Q1 = new query point, shifted towards the relevant data]
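A minimal sketch of Rocchio-style query-point movement (the weights alpha/beta/gamma and the data are illustrative choices, not taken from the slide):

```python
# Rocchio-style update: move the query towards the centroid of relevant
# examples and away from the centroid of non-relevant ones.
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    n = len(q0)
    def mean(points):
        if not points:
            return [0.0] * n
        return [sum(p[i] for p in points) / len(points) for i in range(n)]
    r, nr = mean(relevant), mean(nonrelevant)
    return [alpha * q0[i] + beta * r[i] - gamma * nr[i] for i in range(n)]

# Q0 at the origin, two relevant examples, no non-relevant feedback.
q1 = rocchio([0.0, 0.0], relevant=[[1.0, 1.0], [3.0, 1.0]], nonrelevant=[])
print(q1)  # [1.5, 0.75]
```

Note that the update changes only the query *point*; the distance function itself stays Euclidean.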
Relevance Feedback — Re-weighting —
• The Standard Deviation Method in the MARS (UIUC) image retrieval system
• Assumption: if the deviation of a feature is high, the feature is not important
• For each feature i, the weight wi = 1/σi is assigned
• MARS didn't provide any justification for this formula
[figure: "good" examples in (f1, f2) feature space; the high-variance f1 is a "bad" feature, the low-variance f2 a "good" feature; the implied query is an axis-aligned ellipse]
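The standard deviation method can be sketched as follows (the sample data is made up; feature f1 is given a large spread and f2 a small one, matching the figure):

```python
import math

# MARS-style re-weighting: weight each feature by the inverse of its
# standard deviation over the "good" examples (w_i = 1 / sigma_i).
def std_weights(samples):
    n = len(samples[0])
    weights = []
    for i in range(n):
        vals = [s[i] for s in samples]
        mean = sum(vals) / len(vals)
        sigma = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        weights.append(1.0 / sigma)  # high deviation -> low weight
    return weights

# f1 varies a lot ("bad" feature); f2 hardly varies ("good" feature).
good = [[0.0, 1.0], [2.0, 1.1], [4.0, 0.9]]
w = std_weights(good)
print(w[0] < w[1])  # True: the stable feature f2 gets the larger weight
```

The resulting weights plug straight into the weighted Euclidean distance, which is why the implied query is always an axis-aligned ellipse.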
What's New in MindReader?
• MindReader does not use ad-hoc heuristics (cf. Rocchio's formula, re-weighting in MARS)
• It can handle multiple levels of scores
• It can derive generalized ellipsoid distances
[figure: a rotated ellipsoid isosurface around the query point q]
Isosurfaces of Distance Functions
[figure: Euclidean (circle) = Rocchio; weighted Euclidean (axis-aligned ellipse) = MARS; generalized ellipsoid distance (rotated ellipse) = MindReader; each centered at the query point q]
Method: distance function
Generalized ellipsoid distance function:
D(x, q) = (x − q)ᵀ M (x − q), or
D(x, q) = Σj Σk mjk (xj − qj)(xk − qk)
• q: query point vector
• x: data point vector
• M = [mjk]: symmetric distance matrix
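The two formulations are equal term by term; a quick numerical check (with made-up values) confirms it:

```python
# The matrix form (x - q)^T M (x - q) versus the explicit double sum
# over m_jk (x_j - q_j)(x_k - q_k): both give the same value.
def ellipsoid_matrix(x, q, M):
    d = [xi - qi for xi, qi in zip(x, q)]
    n = len(d)
    return sum(d[j] * sum(M[j][k] * d[k] for k in range(n)) for j in range(n))

def ellipsoid_sum(x, q, M):
    n = len(x)
    return sum(M[j][k] * (x[j] - q[j]) * (x[k] - q[k])
               for j in range(n) for k in range(n))

x, q = [2.0, -1.0, 0.5], [1.0, 0.0, 0.0]
M = [[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 3.0]]  # symmetric
print(abs(ellipsoid_matrix(x, q, M) - ellipsoid_sum(x, q, M)) < 1e-12)  # True
```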
Method: definitions
• N: number of samples
• n: number of dimensions (features)
• xi: n-d sample data vectors, xi = [xi1, …, xin]ᵀ
• X: N×n sample data matrix, X = [x1, …, xN]ᵀ
• v: N-d score vector, v = [v1, …, vN]
Method: problem formulation
Given:
• N sample n-d vectors
• multiple-level scores (optional)
Estimate:
• the optimal distance matrix M
• the optimal new query point q
Method: optimality
How do we measure "optimality"? By minimization of a "penalty".
What is the "penalty"? The score-weighted sum of distances between the query point and the sample vectors.
Therefore, minimize Σi vi (xi − q)ᵀ M (xi − q) under the constraint det(M) = 1
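The penalty being minimized can be written out directly (sample points, scores, and the matrix below are made-up illustrations):

```python
# Penalty = sum over samples of score_i * D(x_i, q) with the
# generalized ellipsoid distance D(x, q) = (x - q)^T M (x - q).
def penalty(samples, scores, q, M):
    n = len(q)
    total = 0.0
    for x, v in zip(samples, scores):
        d = [x[i] - q[i] for i in range(n)]
        total += v * sum(M[j][k] * d[j] * d[k]
                         for j in range(n) for k in range(n))
    return total

samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 1.0, 1.0]          # higher score = better example
q = [0.5, 0.5]
I = [[1.0, 0.0], [0.0, 1.0]]      # identity M -> plain Euclidean penalty
print(penalty(samples, scores, q, I))  # 2.0
```

The constraint det(M) = 1 rules out the trivial solution M = 0, which would otherwise make the penalty zero.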
Theorems: theorem 1
Solved with Lagrange multipliers.
Theorem 1 (optimal query point):
q = x̄ = [x̄1, …, x̄n]ᵀ = Xᵀv / Σi vi
The optimal query point is the weighted average of the sample data vectors.
Theorems: theorems 2 & 3
Theorem 2 (optimal distance matrix):
M = (det(C))^(1/n) C⁻¹
where C = [cjk] is the weighted covariance matrix, cjk = Σi vi (xij − x̄j)(xik − x̄k)
Theorem 3: if we restrict M to a diagonal matrix, our method equals the standard deviation method (MindReader includes MARS!)
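Theorems 1 and 2 together give a closed-form estimator. A 2-D sketch in pure Python (the sample matrix X and scores v are made up; the 2×2 inverse is written out by hand):

```python
# Theorem 1: q = weighted average of the samples.
# Theorem 2: M = det(C)^(1/n) * C^(-1) with C the weighted covariance.
def optimal_query_and_matrix(X, v):
    N, n = len(X), 2
    vs = sum(v)
    q = [sum(v[i] * X[i][j] for i in range(N)) / vs for j in range(n)]
    # Weighted covariance matrix C = [c_jk].
    c = [[sum(v[i] * (X[i][j] - q[j]) * (X[i][k] - q[k]) for i in range(N))
          for k in range(n)] for j in range(n)]
    det_c = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det_c, -c[0][1] / det_c],
           [-c[1][0] / det_c, c[0][0] / det_c]]
    s = det_c ** (1.0 / n)
    M = [[s * inv[j][k] for k in range(n)] for j in range(n)]
    return q, M

X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.5], [0.0, 0.5]]
v = [1.0, 2.0, 1.0, 1.0]
q, M = optimal_query_and_matrix(X, v)
det_m = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(abs(det_m - 1.0) < 1e-9)  # True: det(M) = 1 holds by construction
```

The scaling factor det(C)^(1/n) is exactly what enforces the constraint det(M) = 1 from the problem formulation.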
Experiments
1. Estimation of the optimal distance function
   • Can MindReader estimate the target distance matrix Mhidden appropriately?
   • Based on synthetic data
   • Comparison with the standard deviation method
2. Query-point movement
3. Application to real data sets (GIS data)
Experiment 1: target data
Two-dimensional normal distribution
Experiment 1: idea
• Assume the user has a "hidden" distance Mhidden in mind
• Simulate iterative query refinement
• Q: how fast can we discover the "hidden" distance?
• The query point is fixed to (0, 0)
Experiment 1: iteration steps
1. Make the initial samples: compute the k-NNs with the Euclidean distance
2. For each object x, calculate a score that reflects the hidden distance Mhidden
3. MindReader estimates the matrix M
4. Retrieve the k-NNs with the derived matrix M
5. If the result has improved, go to step 2
Experiment 1: scores
Calculation of scores in terms of the "hidden" distance function:
1. Calculate the distance from the query point q based on the hidden distance matrix Mhidden: d = D(x, q)
2. Translate the distance value d to a score: s = exp(−d²/2), v = log(s / (1 − s))
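The score translation is a two-step map from distance to weight; a small sketch (note that at d = 0 we get s = 1 and the log term is undefined, so the example uses d > 0):

```python
import math

# Distance d under the hidden matrix -> score s in (0, 1) -> weight v.
def score(d):
    s = math.exp(-d * d / 2.0)       # Gaussian-shaped score
    return math.log(s / (1.0 - s))   # logit; undefined at d = 0 (s = 1)

# Closer objects (smaller d) get larger scores.
print(score(0.5) > score(1.0) > score(2.0))  # True
```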
Experiment 1: evaluation measures
• Used to check whether the query result has improved or not
• CD-k measure (CD stands for "cumulative distance"): for the k-NNs retrieved by matrix M, compute the actual distance by matrix Mhidden, then take the sum
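The CD-k measure can be sketched with a brute-force k-NN search (the four data points below are made up; here M equals Mhidden, so CD-k is at its minimum by construction):

```python
# Generalized ellipsoid distance D(x, q) = (x - q)^T M (x - q).
def dist(x, q, M):
    d = [x[i] - q[i] for i in range(len(q))]
    n = len(d)
    return sum(M[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

# CD-k: retrieve the k-NNs with the current matrix M, then sum their
# *actual* distances under the hidden matrix M_hidden.
def cd_k(data, q, M, M_hidden, k):
    knn = sorted(data, key=lambda x: dist(x, q, M))[:k]
    return sum(dist(x, q, M_hidden) for x in knn)

data = [[0.1, 0.1], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
q = [0.0, 0.0]
I = [[1.0, 0.0], [0.0, 1.0]]
# With M = M_hidden the truly best points are retrieved, so any other
# estimate M can only match or exceed this CD-k value.
print(cd_k(data, q, I, I, k=2))  # sums the hidden distances of the 2 NNs
```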
Experiment 1: final k-NNs
[figure: the ellipse is the isosurface for Mhidden; red points are the final k-NNs obtained by the standard deviation method; green points are the final k-NNs obtained by MindReader]
Experiment 1: speed of convergence
[figure: x-axis = number of iterations; y-axis = CD-k value; red = standard deviation method; green = MindReader; blue = best possible CD-k value for the data set]
Experiment 1: changes of isosurfaces
[figures: estimated isosurfaces after the 0th and 2nd iterations, and after the 4th and 8th iterations]
Experiment 2: query-point movement
• Starts from the query point (0.5, 0.5)
• MindReader converges to Mhidden within five iterations
Experiment 3: real data set
• End-points of road segments from Montgomery County, MD
• The data is normalized to [−1, 1] × [−1, 1]
• The query specifies five points along route I-270
• Can we estimate a good distance function?
Experiment 3: isosurfaces
After 0th and 2nd iterations: fast convergence!
Discussion: efficiency
Don't worry about speed! Ellipsoid queries are supported by spatial access methods:
• Seidl & Kriegel [VLDB 97]
• Ankerst, Braunmüller, Kriegel, Seidl [VLDB 98]
For the derived distance, we can efficiently use a spatial index.
Conclusion
MindReader:
• automatically guesses diagonal queries from the given examples
• handles multiple levels of scores
• includes "Rocchio" and "MARS" (the standard deviation method) as special cases
• problem formulation & solution
• evaluation based on the experiments