tomáš skopal 1, benjamin bustos 2 1 charles university in prague, czech republic 2 university of...

On Index-free Similarity Search in Metric Spaces

Tomáš Skopal (1), Benjamin Bustos (2)

(1) Charles University in Prague, Czech Republic
(2) University of Chile, Santiago, Chile

DEXA 2009, Linz, Austria

Outline
- Metric approach to similarity search
- Motivation for index-free similarity search
- D-file (+ D-cache)
- Experiments
- Conclusion

Similarity search
- Multimedia databases, time series, bioinformatics, ...
- Content-based similarity search (query by example)

[Figure: query-by-example illustration; the result objects are labeled 0.1, 0.15, 0.3, 0.6, 0.8]
- k nearest neighbors query (give me the 3 most similar)
- range query (give me the very similar ones, over 80%)

Metric approach to similarity search
- the similarity d (actually a distance) is computationally expensive: often O(m^2), sometimes even O(2^m), w.r.t. the size m of a compared object
- querying by a sequential scan over the database of n objects is thus expensive
- the goal: minimizing the number of distance computations d(*,*) for a query
- the way: using metric distances (metric postulates)
  - this allows partitioning the data space
  - the search is then performed in just several partitions -> efficient search
- a cheap determination of a tight lower bound of d(*,*) provides a mechanism to quickly filter irrelevant objects from the search

- this filtering is used in various forms by metric access methods (below, X stands for a database object and P for a pivot object)

Using lower-bound distances for filtering database objects
[Figure: query ball with center Q and radius r, a pivot P, and a database object X]
- the task: check whether X is inside the query ball
- we know d(Q,P)
- we know d(P,X)
- we do not know d(Q,X)
- we do not have to compute d(Q,X): its lower bound d(Q,P) - d(X,P) is larger than r, so X surely cannot be in the query ball and is ignored

Index-based metric access methods
- all metric access methods (MAMs) are index-based, i.e., preprocessing of the database is always needed
- index construction usually takes between O(kn) and O(n^2)
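The lower-bound filtering rule from the query-ball slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `l1` and `range_query_with_pivot`, the single static pivot, and the toy 2D data are all assumptions made for the example. It uses the absolute value |d(Q,P) - d(P,X)|, which follows from the triangle inequality and covers both orderings of the two known distances.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def l1(a: Point, b: Point) -> float:
    """L1 (Manhattan) distance; it satisfies the metric postulates."""
    return sum(abs(x - y) for x, y in zip(a, b))


def range_query_with_pivot(
    query: Point,
    radius: float,
    database: List[Point],
    pivot: Point,
    pivot_dists: List[float],
) -> Tuple[List[Point], int]:
    """Return (objects within radius, number of d(Q, X) actually computed).

    pivot_dists[i] must hold the precomputed d(P, database[i]).
    """
    d_qp = l1(query, pivot)  # one extra distance per query
    hits: List[Point] = []
    computed = 0
    for obj, d_px in zip(database, pivot_dists):
        # Triangle inequality: d(Q, X) >= |d(Q, P) - d(P, X)|
        if abs(d_qp - d_px) > radius:
            continue  # X cannot lie inside the query ball
        computed += 1
        if l1(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


# Hypothetical toy data: four 2D points and one pivot.
database = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
pivot = (0.0, 0.0)
pivot_dists = [l1(pivot, x) for x in database]

hits, computed = range_query_with_pivot(
    (9.5, 9.5), 1.5, database, pivot, pivot_dists
)
print(hits, computed)
```

In this toy run two of the four objects are discarded using only the stored pivot distances, so d(Q,X) is evaluated twice instead of four times.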

[Figure: examples of index-based methods: M-tree, PM-tree, GNAT]

Motivation for index-free search
Indexing is not desirable (or even possible) if:
- we have a highly changeable database
  - more inserts/deletes/updates than searches, e.g., streaming databases, archives, logs, sensory databases, etc.
- we perform isolated searches
  - a database is created for a few queries and then discarded, e.g., in data mining tasks
- we switch between distances (changing similarity)
  - the distance function is tuned at query time, e.g., weighting of object features is applied dynamically

D-file
- just the original database, searched by a sequential scan, BUT
- it uses the D-cache
  - a memory-resident structure that maintains the distances computed during previous queries
  - it provides lower bounds of requested distances, which can be used to filter out some of the database objects when querying
  - O(1) complexity of a lower-bound retrieval
- no preprocessing (indexing) of the database

D-file range query
[Figure: simple sequential search vs. sequential search enhanced by D-cache filtering; query Q and database objects O_i]

D-cache
- every time the D-file computes a distance d(*,*), it is stored in the D-cache
- the D-cache can be viewed as a sparse matrix, where queries denote rows, database objects denote columns, and a cell contains the value of d(Q,O)

D-cache
The D-cache has two functionalities:
- it allows retrieving the exact distance d(Q,O), if it is there
- the main functionality: it provides a tight lower bound of d(Q,O)

How to obtain a lower bound?
- prior to a new query Q, determine some old queries DP_i^Q (acting as dynamic pivots) and compute the distances d(Q, DP_i^Q)
- when a lower bound of d(Q,O) is required, search the D-cache for the available distances d(DP_i^Q, O) and take max_i (d(DP_i^Q, O) - d(Q, DP_i^Q)); that is our tight lower-bound distance
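To make the interplay of sequential scan, cached distances and dynamic pivots concrete, here is a small sketch of a D-file style range query. It is a simplification under stated assumptions, not the authors' implementation: the class names `DCache` and `DFile`, the unbounded dictionary used as cache, the choice of the k most recent queries as pivots, and the 1D toy data are all hypothetical. The bound uses the absolute difference, which is also valid by the triangle inequality.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple


class DCache:
    """Toy D-cache: unbounded map (query_id, object_id) -> distance."""

    def __init__(self) -> None:
        self.cells: Dict[Tuple[Hashable, Hashable], float] = {}

    def store(self, qid: Hashable, oid: Hashable, d: float) -> None:
        self.cells[(qid, oid)] = d

    def get(self, qid: Hashable, oid: Hashable) -> Optional[float]:
        return self.cells.get((qid, oid))

    def lower_bound(self, pivot_dists: Dict[Hashable, float], oid: Hashable) -> float:
        """Best available bound on d(Q, O); pivot_dists maps pivot id -> d(Q, DP_i)."""
        best = 0.0  # a distance is never negative
        for pid, d_q_dp in pivot_dists.items():
            d_dp_o = self.get(pid, oid)
            if d_dp_o is not None:
                best = max(best, abs(d_dp_o - d_q_dp))
        return best


class DFile:
    """Sequential scan over the raw database, filtered by the D-cache."""

    def __init__(self, objects: dict, dist: Callable, k_pivots: int = 2) -> None:
        self.objects = objects
        self.dist = dist
        self.k = k_pivots
        self.cache = DCache()
        self.old_queries: List[Tuple[Hashable, object]] = []
        self.computations = 0  # total number of d(*,*) evaluations

    def _d(self, a, b) -> float:
        self.computations += 1
        return self.dist(a, b)

    def range_query(self, qid: Hashable, query, radius: float) -> list:
        # Assumed pivot choice: the k most recent old queries.
        pivots = self.old_queries[-self.k:]
        pivot_dists = {pid: self._d(query, pobj) for pid, pobj in pivots}
        hits = []
        for oid, obj in self.objects.items():
            if self.cache.lower_bound(pivot_dists, oid) > radius:
                continue  # filtered without computing d(Q, O)
            d = self._d(query, obj)
            self.cache.store(qid, oid, d)  # reused by later queries
            if d <= radius:
                hits.append(oid)
        self.old_queries.append((qid, query))
        return hits


# Hypothetical 1D example with d(x, y) = |x - y|.
objects = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}
dfile = DFile(objects, lambda x, y: abs(x - y))
first = dfile.range_query("q1", 0.5, 1.0)   # cold cache: plain sequential scan
second = dfile.range_query("q2", 0.6, 1.0)  # q1 now acts as a dynamic pivot
print(first, second, dfile.computations)
```

The first query costs four distance computations (one per object). The second costs three: one to the old query and two to the objects that the lower bound could not exclude.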

D-cache
How to choose the old queries (dynamic pivots)?
- Recent policy
  - simple: we just choose the k previous queries
  - motivation: the recently added distances are likely to still sit in the D-cache
- Internal policy
  - advanced: we select k of the previous queries that are probably close
  - we avoid computing any distance between new and old queries; we just estimate the distance using distances from the D-cache
  - motivation: a close query (pivot) produces tighter lower bounds

D-cache implementation
- Cell cache
  - a simple hash table
  - used to determine individual cell values, based on id1, id2
  - used for Recent pivot selection
- Row cache
  - an inverted list (list of objects belonging to old queries)
  - used to determine the mediators when using Internal pivot selection
- Replacement policies
  - because the size of the D-cache is limited, both the Cell cache and the Row cache apply the LRU distance replacement policy

Experiments: datasets
- a subset of Corel features: 65,615 32-dimensional vectors of color moments, with d = L1 distance
- a synthetic Polygons set: 500,000 randomly generated 2D polygons varying in the number of vertices from 10 to 15, with d = Hausdorff distance (maximum distance of a point set to the nearest point in the other set)
- a subset of GenBank file rel147: 50,000 protein sequences of lengths from 50 to 100, with d = edit distance

Experiments
- the D-file was compared with 3 metric access methods and the trivial sequential scan
  - M-tree
  - PM-tree
  - GNAT
  - seq. scan
- we have observed the number of distance computations spent on
  - indexing
  - querying
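Returning to the Cell cache from the implementation slide, a size-limited hash table with LRU replacement might look like the sketch below. The slides do not give the actual data layout, so this substitutes Python's `collections.OrderedDict` for the authors' hash table; the class name `CellCache`, the integer ids, and the symmetric key (which relies on d being a metric) are assumptions for illustration.

```python
from collections import OrderedDict
from typing import Optional, Tuple


class CellCache:
    """Bounded map (id1, id2) -> distance with least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._cells: "OrderedDict[Tuple[int, int], float]" = OrderedDict()

    @staticmethod
    def _key(id1: int, id2: int) -> Tuple[int, int]:
        # Assumption: d is symmetric, so (id1, id2) and (id2, id1) share a cell.
        return (id1, id2) if id1 <= id2 else (id2, id1)

    def put(self, id1: int, id2: int, distance: float) -> None:
        key = self._key(id1, id2)
        self._cells[key] = distance
        self._cells.move_to_end(key)  # mark as most recently used
        if len(self._cells) > self.capacity:
            self._cells.popitem(last=False)  # evict the least recently used cell

    def get(self, id1: int, id2: int) -> Optional[float]:
        key = self._key(id1, id2)
        if key not in self._cells:
            return None
        self._cells.move_to_end(key)  # a hit also refreshes the entry
        return self._cells[key]


# Hypothetical usage with capacity for two distances.
cache = CellCache(capacity=2)
cache.put(1, 10, 0.3)
cache.put(1, 11, 0.8)
refreshed = cache.get(10, 1)   # hit via the symmetric key; (1, 11) is now oldest
cache.put(2, 10, 0.5)          # exceeds capacity, so (1, 11) is evicted
print(refreshed, cache.get(1, 11), cache.get(2, 10))
```

Both lookups and insertions stay O(1) on average, which matches the constant-time retrieval the slides claim for the D-cache.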

Experiments: construction costs
[Figure not recovered]

Experiments: unknown queries (query objects outside the dataset)
[Figures not recovered; two slides]

Experiments: database queries (used when browsing, etc.)
[Figures not recovered; two slides]

Conclusion
- D-file: an index-free metric access method
  - requires no indexing
  - suitable for highly changeable databases, isolated searches, or when changing the similarity
- D-cache
  - a structure used by the D-file to cheaply determine lower-bound distances
  - uses distances computed and cached during the processing of previous queries

Future work
- we plan to include the D-cache also in index-based metric access methods to improve the efficiency of
  - index construction
  - simple queries (range and kNN)
  - advanced operations (similarity joins)
  - and so on...

Thank you for your attention!
