Permutation Search Methods are Efficient,Yet Faster Search is Possible
Bilegsaikhan Naidan, Leonid Boytsov, and Eric Nyberg
ProblemNearest neighbor (NN) search is afundamental operation in computer vision,machine learning, and natural language processing;
The problem is hard especially when the searchspace is non-metric.
Sample ApplicationsQuery by image content
Classification
Entity detection
Spell-checking
State of the ArtVP-trees
kNN-graphs
Multiprobe LSH (MPLSH)
Research Subject (Contestants)Permutation search methods:
Neighborhood APProximation Index (NAPP)
Brute-force filtering of permutations
Basics of Permutation MethodsFilter-and-refine using pivot-based projectionto the Euclidean space (L2):
Select pivots randomly;
Rank pivots by their distances to data points;
Filter by comparing pivot rankings;
Refine by comparing remaining points to thequery.
Research QuestionsHow well do permutation methods fare against
state of the art?
How good are permutation-based projections?
Selected Distance FunctionsEuclidean (L2)
Cosine similarity
Jensen-Shannon divergence (JS-div.)
Normalized Levenshtein
Signature-Quadratic Form Distance (SQFD)
Selected Data Sets
Name Distance Number Brute-force Dimens.function of points (sec.)
SIFT L2 5 · 106 0.3 128ImageNet SQFD 1 · 106 4.1 N/AWiki-sparse cosine sim. 4 · 106 1.9 105
Wiki-128 JS-Div. 2 · 106 4 128DNA norm. Leven. 1 · 106 3.5 N/A
Efficiency EvaluationImprovement in efficiency over brute-force search vs. accuracy. Higher and tothe right is better:
SIFT (L2) ImageNet (SQFD)
0.6 0.7 0.8 0.9 1
101
102
Recall
Impro
v.
ineffi
cien
cy(l
og.
scal
e)
VP-tree
MPLSH
kNN-graph (SW)
NAPP
0.75 0.8 0.85 0.9 0.95 1
101
102
Recall
Impro
v.
ineffi
cien
cy(l
og.
scal
e)
VP-tree
brute-force filt.
kNN-graph (SW)
NAPP
Wiki-sparse (cosine simil.) Normalized Levenshtein
0.7 0.8 0.9100
101
102
Recall
Impro
v.
ineffi
cien
cy(l
og.
scal
e) kNN-graph (SW)
NAPP
0.6 0.7 0.8 0.9 1
101
102
Recall
Impro
v.
ineffi
cien
cy(l
og.
scal
e) VP-tree
kNN-graph (NN-desc)
brute-force filt. bin.
NAPP
Indexing Time (min)
VP-tree NAPP MPLSH Brute-force filt. kNN graph
SIFT 0.4 5 18.4 52.2ImageNet 4.4 33 32.3 127.6Wiki-sparse 7.9 231.2Wiki-128 1.2 36.6 36.1DNA 0.9 15.9 15.6 88
Projection Quality EvaluationDistance in the original space vs. distance in the projected space. The closerto a monotonic mapping, the better:
0
100
200
300
0 200 400 600
0
50
100
150
200
250
0.0 0.2 0.4 0.6
Good projection (original distance: L2) So-so projection (original distance: JS-div.)
Results and ConclusionsPermutation methods beat state-of-the-art methods for some data sets:
NAPP beats MPLSH & VP-tree for SIFT, as well as VP-tree for Wiki-128;Brute force filtering beats all methods including kNN graph for DNA;Yet, kNN graph is the best for SIFT, Wiki-128, and Wiki-sparse.
The quality of permutation-based projection varies;
Permutations method are most useful when the distance function isexpensive (e.g., SQFD, Levenshtein, or data is on disk) and/or indexing costof kNN graphs is unacceptable.
http://www.idi.ntnu.no, http://scs.cmu.edu/ [email protected],{srchvrs,ehn}@cs.cmu.edu