permutation search methods are efficient, 9ptyet faster...

1
Permutation Search Methods are Efficient, Yet Faster Search is Possible Bilegsaikhan Naidan, Leonid Boytsov, and Eric Nyberg Problem Nearest neighbor (NN) search is a fundamental operation in computer vision, machine learning, and natural language processing; The problem is hard especially when the search space is non-metric. Sample Applications Query by image content Classification Entity detection Spell-checking State of the Art VP-trees kNN-graphs Multiprobe LSH (MPLSH) Research Subject (Contestants) Permutation search methods: Neighborhood APProximation Index (NAPP) Brute-force filtering of permutations Basics of Permutation Methods Filter-and-refine using pivot-based projection to the Euclidean space (L 2 ): Select pivots randomly; Rank pivots by their distances to data points; Filter by comparing pivot rankings; Refine by comparing remaining points to the query. Research Questions How well do permutation methods fare against state of the art? How good are permutation-based projections? Selected Distance Functions Euclidean (L 2 ) Cosine similarity Jensen-Shannon divergence (JS-div.) Normalized Levenshtein Signature-Quadratic Form Distance (SQFD) Selected Data Sets Name Distance Number Brute-force Dimens. function of points (sec.) SIFT L 2 5 · 10 6 0.3 128 ImageNet SQFD 1 · 10 6 4.1 N/A Wiki-sparse cosine sim. 4 · 10 6 1.9 10 5 Wiki-128 JS-Div. 2 · 10 6 4 128 DNA norm. Leven. 1 · 10 6 3.5 N/A Efficiency Evaluation Improvement in efficiency over brute-force search vs. accuracy. Higher and to the right is better : SIFT (L 2 ) ImageNet (SQFD) 0.6 0.7 0.8 0.9 1 10 1 10 2 Recall Improv. in efficiency (log. scale) VP-tree MPLSH kNN-graph (SW) NAPP 0.75 0.8 0.85 0.9 0.95 1 10 1 10 2 Recall Improv. in efficiency (log. scale) VP-tree brute-force filt. kNN-graph (SW) NAPP Wiki-sparse (cosine simil.) Normalized Levenshtein 0.7 0.8 0.9 10 0 10 1 10 2 Recall Improv. in efficiency (log. scale) kNN-graph (SW) NAPP 0.6 0.7 0.8 0.9 1 10 1 10 2 Recall Improv. in efficiency (log. scale) VP-tree kNN-graph (NN-desc) brute-force filt. bin. NAPP Indexing Time (min) VP-tree NAPP MPLSH Brute-force filt. kNN graph SIFT 0.4 5 18.4 52.2 ImageNet 4.4 33 32.3 127.6 Wiki-sparse 7.9 231.2 Wiki-128 1.2 36.6 36.1 DNA 0.9 15.9 15.6 88 Projection Quality Evaluation Distance in the original space vs. distance in the projected space. The closer to a monotonic mapping, the better : 0 100 200 300 0 200 400 600 0 50 100 150 200 250 0.0 0.2 0.4 0.6 Good projection (original distance: L 2 ) So-so projection (original distance: JS-div.) Results and Conclusions Permutation methods beat state-of-the-art methods for some data sets: NAPP beats MPLSH & VP-tree for SIFT, as well as VP-tree for Wiki-128; Brute force filtering beats all methods including kNN graph for DNA; Yet, kNN graph is the best for SIFT, Wiki-128, and Wiki-sparse. The quality of permutation-based projection varies; Permutations method are most useful when the distance function is expensive (e.g., SQFD, Levenshtein, or data is on disk) and/or indexing cost of kNN graphs is unacceptable. http://www.idi.ntnu.no, http://scs.cmu.edu/ [email protected],{srchvrs,ehn}@cs.cmu.edu

Upload: vandang

Post on 05-Jun-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Permutation Search Methods are Efficient, 9ptYet Faster ...acmsocc.github.io/2015/posters/socc15posters-final6.pdf · Permutation Search Methods are E cient, ... The problem is hard

Permutation Search Methods are Efficient,Yet Faster Search is Possible

Bilegsaikhan Naidan, Leonid Boytsov, and Eric Nyberg

ProblemNearest neighbor (NN) search is afundamental operation in computer vision,machine learning, and natural language processing;

The problem is hard especially when the searchspace is non-metric.

Sample ApplicationsQuery by image content

Classification

Entity detection

Spell-checking

State of the ArtVP-trees

kNN-graphs

Multiprobe LSH (MPLSH)

Research Subject (Contestants)Permutation search methods:

Neighborhood APProximation Index (NAPP)

Brute-force filtering of permutations

Basics of Permutation MethodsFilter-and-refine using pivot-based projectionto the Euclidean space (L2):

Select pivots randomly;

Rank pivots by their distances to data points;

Filter by comparing pivot rankings;

Refine by comparing remaining points to thequery.

Research QuestionsHow well do permutation methods fare against

state of the art?

How good are permutation-based projections?

Selected Distance FunctionsEuclidean (L2)

Cosine similarity

Jensen-Shannon divergence (JS-div.)

Normalized Levenshtein

Signature-Quadratic Form Distance (SQFD)

Selected Data Sets

Name Distance Number Brute-force Dimens.function of points (sec.)

SIFT L2 5 · 106 0.3 128ImageNet SQFD 1 · 106 4.1 N/AWiki-sparse cosine sim. 4 · 106 1.9 105

Wiki-128 JS-Div. 2 · 106 4 128DNA norm. Leven. 1 · 106 3.5 N/A

Efficiency EvaluationImprovement in efficiency over brute-force search vs. accuracy. Higher and tothe right is better:

SIFT (L2) ImageNet (SQFD)

0.6 0.7 0.8 0.9 1

101

102

Recall

Impro

v.

ineffi

cien

cy(l

og.

scal

e)

VP-tree

MPLSH

kNN-graph (SW)

NAPP

0.75 0.8 0.85 0.9 0.95 1

101

102

Recall

Impro

v.

ineffi

cien

cy(l

og.

scal

e)

VP-tree

brute-force filt.

kNN-graph (SW)

NAPP

Wiki-sparse (cosine simil.) Normalized Levenshtein

0.7 0.8 0.9100

101

102

Recall

Impro

v.

ineffi

cien

cy(l

og.

scal

e) kNN-graph (SW)

NAPP

0.6 0.7 0.8 0.9 1

101

102

Recall

Impro

v.

ineffi

cien

cy(l

og.

scal

e) VP-tree

kNN-graph (NN-desc)

brute-force filt. bin.

NAPP

Indexing Time (min)

VP-tree NAPP MPLSH Brute-force filt. kNN graph

SIFT 0.4 5 18.4 52.2ImageNet 4.4 33 32.3 127.6Wiki-sparse 7.9 231.2Wiki-128 1.2 36.6 36.1DNA 0.9 15.9 15.6 88

Projection Quality EvaluationDistance in the original space vs. distance in the projected space. The closerto a monotonic mapping, the better:

0

100

200

300

0 200 400 600

0

50

100

150

200

250

0.0 0.2 0.4 0.6

Good projection (original distance: L2) So-so projection (original distance: JS-div.)

Results and ConclusionsPermutation methods beat state-of-the-art methods for some data sets:

NAPP beats MPLSH & VP-tree for SIFT, as well as VP-tree for Wiki-128;Brute force filtering beats all methods including kNN graph for DNA;Yet, kNN graph is the best for SIFT, Wiki-128, and Wiki-sparse.

The quality of permutation-based projection varies;

Permutations method are most useful when the distance function isexpensive (e.g., SQFD, Levenshtein, or data is on disk) and/or indexing costof kNN graphs is unacceptable.

http://www.idi.ntnu.no, http://scs.cmu.edu/ [email protected],{srchvrs,ehn}@cs.cmu.edu