k-nearest neighbors search in high dimensions
TRANSCRIPT
8/9/2019 k-Nearest Neighbors Search in High Dimensions
http://slidepdf.com/reader/full/k-nearest-neighbors-search-in-high-dimensions 1/111
k-Nearest Neighbors Search
in High Dimensions
Tomer Peled
Dan Kushnir
Tell me who your neighbors are, and I'll know who you are
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Nearest Neighbor Search
Problem definition
• Given: a set P of n points in R^d, over some distance metric
• Find: the nearest neighbor p of q in P
q?
Applications
• Classification
• Clustering
• Segmentation
• Indexing
• Dimension reduction (e.g. LLE)
[Figure: sample points plotted by color vs. weight, with a query point q]
Naïve solution
• No preprocessing
• Given a query point q:
– Go over all n points
– Do comparison in R^d
• Query time = O(nd)
Keep in mind
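The naive scan can be sketched as follows (our own minimal illustration, not from the slides; point set and query are made up):

```python
import math

def nearest_neighbor(P, q):
    """Naive NN search: scan all n points, comparing in R^d -> O(nd) time."""
    best, best_dist = None, float("inf")
    for p in P:
        d = math.dist(p, q)          # Euclidean distance in R^d
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist

P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(P, (1.2, 0.9)))   # nearest point is (1.0, 1.0)
```

Every query touches all n points in all d coordinates, which is exactly the O(nd) cost the slide warns about.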
Common solution
•Use a data structure for acceleration
• Scalability with n and with d is important
When to use nearest neighbors
High-level view of algorithms, assuming no prior knowledge about the underlying probability structure:
• Parametric: probability distribution estimation
• Non-parametric: density estimation, or nearest neighbors
Nearest neighbors suit complex models, sparse data, and high dimensions.
Nearest Neighbor
Closest point: argmin over p_i ∈ P of dist(q, p_i)
q?
r, ε - Nearest Neighbor
dist(q, p1) ≤ r
dist(q, p2) ≥ (1 + ε)·r
r2 = (1 + ε)·r1
q?
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
The simplest solution
•Lion in the desert
Quadtree
Split the first dimension into 2.
Repeat iteratively.
Stop when each cell has no more than 1 data point.
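The recursive split above can be sketched as follows (our own 2D illustration; the cell representation and point set are made up, and distinct points are assumed so the recursion terminates):

```python
def build_quadtree(points, x0, y0, x1, y1, max_pts=1):
    """Split the cell in half along each dimension; recurse until each
    cell holds no more than max_pts data points (points assumed distinct)."""
    if len(points) <= max_pts:
        return {"box": (x0, y0, x1, y1), "points": points}
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quads = {"nw": [], "ne": [], "sw": [], "se": []}
    for (x, y) in points:
        key = ("n" if y >= ym else "s") + ("e" if x >= xm else "w")
        quads[key].append((x, y))
    boxes = {"nw": (x0, ym, xm, y1), "ne": (xm, ym, x1, y1),
             "sw": (x0, y0, xm, ym), "se": (xm, y0, x1, ym)}
    return {"box": (x0, y0, x1, y1),
            "children": {k: build_quadtree(quads[k], *boxes[k], max_pts)
                         for k in quads}}

tree = build_quadtree([(0.1, 0.1), (0.9, 0.9), (0.2, 0.8)], 0.0, 0.0, 1.0, 1.0)
```

Each level halves every dimension, which is the source of the 2^d blow-up discussed a few slides later.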
Quadtree - structure
Each node splits at a point (X1, Y1), giving four children:
(P < X1, P < Y1), (P < X1, P ≥ Y1), (P ≥ X1, P < Y1), (P ≥ X1, P ≥ Y1)
Quadtree - Query
Descend the tree to the cell containing q.
In many cases this works.
[Figure: query descending the four quadrants of the split at (X1, Y1)]
Quadtree – Pitfall 1
In some cases it doesn't: the nearest neighbor may lie in a neighboring cell.
[Figure: query near a cell boundary of the split at (X1, Y1)]
Quadtree – Pitfall 1
In some cases nothing works.
[Figure: an adversarial point configuration]
Quadtree – Pitfall 2
The query may have to visit O(2^d) cells, i.e. query time exponential in the number of dimensions.
Space partition based algorithms
Multidimensional access methods / Volker Gaede, O. Gunther
Could be improved
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Curse of dimensionality
• Query time or space: O(min(nd, n^d)) for the naive approaches
• For d > 10..20, worse than a sequential scan for most geometric distributions
• Techniques specific to high dimensions are needed
• Proved in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002
Curse of dimensionality
Some intuition: the number of cells grows exponentially with the dimension: 2, 2^2, 2^3, …, 2^d
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Preview
•General Solution –
Locality sensitive hashing
•Implementation for Hamming space
•Generalization to l1 & l2
Hash function
Hash function
Data_Item → Hash function → Key → Bin/Bucket
Hash function
Example: X modulo 3
X = a number in the range 0..n; the key, in 0..2, is the storage address in the data structure.
Usually we would like related data items to be stored in the same bin.
Recall: r, ε - Nearest Neighbor
dist(q, p1) ≤ r
dist(q, p2) ≥ (1 + ε)·r
r2 = (1 + ε)·r1
q?
Locality sensitive hashing
A hash family is (r, ε, p1, p2)-sensitive if:
• Pr[I(p) = I(q)] is "high" if p is "close" to q
• Pr[I(p) = I(q)] is "low" if p is "far" from q
r2 = (1 + ε)·r1
Preview
•General Solution –
Locality sensitive hashing
•Implementation for Hamming space
•Generalization to l1 & l2
Hamming Space
• Hamming space = the 2^N binary strings of length N
• Hamming distance = # of differing digits, a.k.a. signal distance (Richard Hamming)
Hamming Space (N = 12)
010100001111
010010000011    Distance = 4
• Hamming distance = SUM(X1 XOR X2)
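The XOR-and-sum rule above, sketched on the slide's own example strings:

```python
def hamming_distance(x1: str, x2: str) -> int:
    """Hamming distance = number of differing digits: SUM(X1 XOR X2)."""
    assert len(x1) == len(x2)
    return bin(int(x1, 2) ^ int(x2, 2)).count("1")

print(hamming_distance("010100001111", "010010000011"))  # -> 4
```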
L1 to Hamming Space Embedding
Encode each coordinate (in 0..C) in unary, with C = 11:
8 → 11111111000
2 → 11000000000
p = (8, 2) → 1111111100011000000000
d' = C·d
Hash function
G_j(p) = p|I_j — k bits sampled from p ∈ H^d' (here k = 3 digits), for j = 1..L
Store p into bucket p|I_j; there are 2^k buckets.
Example: the sampled bits of p give the key 101.
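The bit-sampling scheme G_j(p) = p|I_j can be sketched as follows (our own illustration; the example strings and parameters are made up):

```python
import random

def build_tables(points, d, k, L, seed=0):
    """Bit-sampling LSH: for each of the L tables, choose k random bit
    positions I_j; the key of a point p is p|I_j (its bits at those positions)."""
    rng = random.Random(seed)
    samplers = [rng.sample(range(d), k) for _ in range(L)]
    tables = [{} for _ in range(L)]
    for p in points:
        for I, table in zip(samplers, tables):
            key = "".join(p[i] for i in I)
            table.setdefault(key, []).append(p)
    return samplers, tables

def query(q, samplers, tables):
    """Collect candidates from the bucket q|I_j of every table."""
    cands = set()
    for I, table in zip(samplers, tables):
        cands.update(table.get("".join(q[i] for i in I), []))
    return cands

pts = ["0101", "0111", "1000"]
samplers, tables = build_tables(pts, d=4, k=2, L=3)
print(sorted(query("0101", samplers, tables)))
```

Candidates returned by `query` still have to be verified with exact distance computations.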
Construction
Insert each point p into its bucket in every one of the L hash tables (1, 2, …, L).
Query
Look up q's bucket in each of the L hash tables and collect the candidates.
Alternative intuition: random projections
Each sampled bit of the unary embedding (C = 11; 8 → 11111111000, 2 → 11000000000; d' = C·d) acts as a random axis-parallel cut, so the k bits behave like random projections of p.
Alternative intuition: random projections
The k sampled bits place each point into one of the 2^3 buckets: 000, 100, 110, 001, 101, 111, …
Example: p falls into bucket 101.
k samplings
Repeating
Repeating L times
Secondary hashing
Supports volume tuning: dataset size vs. storage volume.
The 2^k buckets (e.g. bucket 011, of size B) are hashed again into M buckets by simple hashing, with M·B = α·n, α = 2.
The above hashing is locality-sensitive
• Pr[p, q land in the same bucket] = (1 − Distance(q, p) / #dimensions)^k
• Larger k makes the collision-probability curve drop faster with distance (compare k = 1 and k = 2).
[Figure: Pr vs. Distance(q, p_i) for k = 1 and k = 2]
Adopted from Piotr Indyk's slides
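The collision probability (1 − Distance(q,p)/#dimensions)^k for bit sampling can be checked numerically; this is our own sketch with illustrative strings, sampling positions with replacement so the formula holds exactly in expectation:

```python
import random

def collision_prob(p, q, k, trials=20000, seed=1):
    """Empirical Pr[p, q share a bucket] when k bit positions are sampled
    uniformly at random (with replacement) from the d' dimensions."""
    rng = random.Random(seed)
    d = len(p)
    hits = 0
    for _ in range(trials):
        I = [rng.randrange(d) for _ in range(k)]
        hits += all(p[i] == q[i] for i in I)
    return hits / trials

p, q = "1111000011110000", "1111000011110011"   # Hamming distance 2, d' = 16
est = collision_prob(p, q, k=4)
exact = (1 - 2 / 16) ** 4
print(est, exact)
```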
Preview
•General Solution –
Locality sensitive hashing
•Implementation for Hamming space
•Generalization to l2
Direct L2 solution
• New hashing function
•Still based on sampling
• Using a mathematical trick
•P-stable distribution for Lp distance
•Gaussian distribution for L2 distance
Central limit theorem
v1·X1 + v2·X2 + … + vn·Xn = ?
Σ (weighted Gaussians) = a weighted Gaussian
Central limit theorem
v1..vn = real numbers
X1..Xn = independent identically distributed (i.i.d.)
v1·X1 + v2·X2 + … + vn·Xn = ?
Central limit theorem
Σ_i v_i·X_i = (Σ_i |v_i|²)^(1/2) · X
(dot product on the left, norm on the right)
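The 2-stability identity above can be checked numerically (our own sketch; v is made up, ||v||₂ = 5):

```python
import math, random

# 2-stability of the Gaussian: for i.i.d. standard normals X_i,
# sum_i v_i * X_i is distributed as (sum_i |v_i|^2)^(1/2) * X.
rng = random.Random(0)
v = [3.0, 4.0]                      # ||v||_2 = 5
samples = [sum(vi * rng.gauss(0, 1) for vi in v) for _ in range(50000)]
std = math.sqrt(sum(s * s for s in samples) / len(samples))
print(std)   # should be close to 5.0
```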
Norm ⇒ Distance
Σ_i u_i·X_i − Σ_i v_i·X_i = (Σ_i |u_i − v_i|²)^(1/2) · X
(features vector 1, features vector 2 → their distance)
The full Hashing
h_{a,b}(v) = ⌊(a·v + b) / w⌋
• v — features vector, e.g. [34 82 21 …]
• a — d random numbers (e.g. 1, 22, 77, 42)
• b — random phase in [0, w]
• w — discretization step
The full Hashing
h_{a,b}(v) = ⌊(a·v + b) / w⌋
Example: a·v = 7944, b = 34, w = 100; bucket boundaries fall at …7800, 7900, 8000, 8100, 8200…
The full Hashing
h_{a,b}(v) = ⌊(a·v + b) / w⌋
With a·v = 7944: b = 34 is the random phase in [0, w], and w = 100 is the discretization step.
The full Hashing
h_{a,b}(v) = ⌊(a·v + b) / w⌋
• v ∈ R^d — features vector
• a = (a1, …, ad) — i.i.d. from a p-stable distribution
• b — random phase in [0, w]
• w — discretization step
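A sketch of h_{a,b}(v) for the L2 case (our own illustration; dimensions, w, and the example vectors are made up, with a drawn from the 2-stable Gaussian):

```python
import math, random

def make_hash(d, w, seed=0):
    """h_{a,b}(v) = floor((a.v + b) / w), with a ~ N(0,1)^d (2-stable, for
    L2), b uniform in [0, w] (random phase), and w the discretization step."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(d)]
    b = rng.uniform(0, w)
    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
    return h

h = make_hash(d=3, w=4.0)
# nearby vectors usually land in the same bucket; far ones rarely do
print(h((1.0, 2.0, 3.0)), h((1.01, 2.0, 3.0)), h((10.0, -5.0, 7.0)))
```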
Generalization: P-Stable distribution
• Lp, p = eps..2: Generalized Central Limit Theorem → p-stable distribution (e.g. Cauchy for L1)
• L2: Central Limit Theorem → Gaussian (normal) distribution
P-Stable summary
• Works for r, ε - Nearest Neighbor
• Generalizes to 0 < p ≤ 2
• Improves query time: from O(d·n^(1/(1+ε))·log n) to O(d·n^(1/(1+ε)²)·log n)
(Latest results, reported by email by Alexander Andoni)
Parameters selection
• Aim for a 90% success probability with the best query-time performance
For Euclidean space
Parameters selection…
For Euclidean space:
• A single projection hits an ε - Nearest Neighbor with Pr = p1
• k projections hit an ε - Nearest Neighbor with Pr = p1^k
• All L hashings fail to collide with Pr = (1 − p1^k)^L
• To ensure collision (e.g. with probability 1 − δ ≥ 90%):
1 − (1 − p1^k)^L ≥ 1 − δ  ⇒  L ≥ log(δ) / log(1 − p1^k)
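The bound on L translates directly into code (our own sketch; p1, k, and δ are illustrative values):

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L with 1 - (1 - p1**k)**L >= 1 - delta, i.e.
    L >= log(delta) / log(1 - p1**k)."""
    return math.ceil(math.log(delta) / math.log(1 - p1 ** k))

print(tables_needed(p1=0.9, k=10, delta=0.1))  # -> 6
```

Note the tension the next slide plots: larger k shrinks buckets (fewer candidates to verify) but drives p1^k down, so more tables L are needed.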
…Parameters selection
[Figure: query time as a function of k, split into candidate extraction and candidate verification]
Pros & Cons
Pros:
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
Cons:
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
• Works best for Hamming distance (although it can be generalized to Euclidean space)
• In secondary storage, a linear scan is pretty much all we can do (for high dimensions)
• Requires the radius r to be fixed in advance
From Piotr Indyk's slides
Conclusion
• .. but at the end, everything depends on your data set
• Try it at home – visit: http://web.mit.edu/andoni/www/LSH/index.html
– Email Alex ([email protected])
– Test over your own data (C code under Red Hat Linux)
LSH - Applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun).
• Searching image databases (see the following).
• Image segmentation (see the following).
• Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani).
• Texture classification (see the following).
• Clustering (see the following).
• Embedding and manifold learning (LLE, and many others).
• Compression – vector quantization.
• Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan).
• Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler).
• In short: whenever k-Nearest Neighbors (KNN) are needed.
Motivation
• A variety of procedures in learning require KNN computation.
• KNN search is a computational bottleneck.
• LSH provides a fast approximate solution to the problem.
• LSH requires hash function construction and parameter tuning.
Outline
Fast Pose Estimation with Parameter Sensitive Hashing, G. Shakhnarovich, P. Viola, and T. Darrell.
• Finding sensitive hash functions.
Mean Shift Based Clustering in High Dimensions: A Texture Classification Example, B. Georgescu, I. Shimshoni, and P. Meer.
• Tuning LSH parameters.
• The LSH data structure is used for algorithm speedups.
Fast Pose Estimation with Parameter Sensitive Hashing
G. Shakhnarovich, P. Viola, and T. Darrell
The Problem:
Given an image x, what are the parameters θ in this image?
i.e. angles of joints, orientation of the body, etc.
Ingredients
• Input query image with unknown angles (parameters).
• Database of human poses with known angles.
• Image feature extractor – edge detector.
• Distance metric in feature space: d_x.
• Distance metric in angle space:
d_θ(θ1, θ2) = Σ_{i=1..m} (1 − cos(θ1_i − θ2_i))
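The angle-space metric translates directly into code (our own sketch; the joint-angle vectors are illustrative). Note the 1 − cos form makes the metric insensitive to the 0/2π wrap-around:

```python
import math

def angle_distance(theta1, theta2):
    """d_theta(t1, t2) = sum_i (1 - cos(t1_i - t2_i)); zero iff all
    joint angles agree (mod 2*pi)."""
    return sum(1 - math.cos(a - b) for a, b in zip(theta1, theta2))

print(angle_distance([0.0, math.pi / 2], [0.0, math.pi / 2]))  # -> 0.0
print(angle_distance([0.0], [2 * math.pi]))                    # ~ 0.0 (wrap-around)
```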
Example based learning
• Construct a database of example images with their known angles.
• Given a query image, run your favorite feature extractor.
• Compute the KNN from the database.
• Use these KNNs to compute the average angles of the query.
Input: query → find KNN in the database of examples → output: average angles of the KNN
The algorithm flow
Input query → features extraction → processed query → PSH (LSH) over a database of examples → LWR (regression) → output match
Feature Extraction | PSH | LWR
The image features
Image features are multi-scale edge histograms, binned over edge directions φ ∈ {0, π/4, π/2, 3π/4} and summed over image regions.
PSH: The basic assumption
There are two metric spaces here: the feature space (d_x) and the parameter space (d_θ).
We want similarity to be measured in the angle space, whereas LSH works on the feature space.
• Assumption: the feature space is closely related to the parameter space.
Insight: Manifolds
• A manifold is a space in which every point has a neighborhood resembling a Euclidean space.
• But the global structure may be complicated: curved.
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.
Is this Magic?
[Figure: a query q and its neighbors, mapped between the feature space and the parameter (angle) space]
Parameter Sensitive Hashing (PSH)
The trick:
Estimate the performance of different hash functions on examples, and select those sensitive to d_θ:
the hash functions are applied in feature space, but the KNN are valid in angle space.
PSH as a classification problem
1. Label pairs of examples with similar angles.
2. Define hash functions h on the feature space.
3. Predict the labeling of similar / non-similar examples by using h.
4. Compare the labelings: if the labeling by h is good, accept h; else change h.
A pair of examples ((x_i, θ_i), (x_j, θ_j)) is labeled:
y = +1 if d_θ(θ_i, θ_j) < r
y = −1 if d_θ(θ_i, θ_j) > (1 + ε)·r
(r = 0.25)
A binary hash function on features:
h_{φ,T}(x) = +1 if φ(x) > T, −1 otherwise
Predict the labels:
ŷ_h(x_i, x_j) = +1 if h_{T,φ}(x_i) = h_{T,φ}(x_j), −1 otherwise
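A toy sketch of scoring one candidate (φ, T) against angle-space labels (entirely our own illustration; `phi` as a scalar feature, the pair list, r, and ε are made up, not the paper's setup):

```python
def pair_label(d_theta, r=0.25, eps=1.0):
    """+1 if the pair is close in angle space, -1 if clearly far,
    0 (ignored) in the ambiguous band between r and (1+eps)*r."""
    if d_theta < r:
        return 1
    if d_theta > (1 + eps) * r:
        return -1
    return 0

def hash_agreement(pairs, phi, T):
    """Fraction of labeled pairs on which h_{phi,T} predicts the label:
    +1 exactly when both examples fall on the same side of the threshold."""
    ok = total = 0
    for (xi, xj, d_theta) in pairs:
        y = pair_label(d_theta)
        if y == 0:
            continue
        y_hat = 1 if (phi(xi) > T) == (phi(xj) > T) else -1
        ok += (y == y_hat)
        total += 1
    return ok / total

pairs = [(0.1, 0.2, 0.10), (0.1, 0.9, 1.00)]   # (x_i, x_j, d_theta)
print(hash_agreement(pairs, lambda x: x, T=0.5))  # -> 1.0
```

Candidate hash functions with high agreement are kept; the rest are discarded, per the accept/reject loop above.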
Find the best T* that predicts the true labeling subject to the probability constraints:
h_{φ,T} will place both examples in the same bin, or separate them.
Local Weighted Regression (LWR)
• Given a query image, PSH returns KNNs.
• LWR uses the KNN to compute a weighted average of the estimated angles of the query:
β* = argmin_β Σ_{x_i ∈ N(x0)} d_θ(g(x_i, β), θ_i) · K(d_X(x_i, x0))
where K(d_X(x_i, x0)) is the distance-based weight.
Results
Synthetic data were generated:
• 13 angles: 1 for rotation of the torso, 12 for joints.
• 150,000 images.
• Nuisance parameters added: clothing, illumination, face expression.
• 1,775,000 example pairs.
• Selected 137 out of 5,123 meaningful features (how??): 18-bit hash functions (k), 150 hash tables (l).
• Tested on 1,000 synthetic examples: PSH searched only 3.4% of the data per query.
• Without feature selection, 40 bits and 1,000 hash tables would have been needed.
Recall: p1 is the probability of a positive hash; p2 is the probability of a bad hash; B is the max number of points in a bucket.
Results – real data
• 800 images.
• Processed by a segmentation algorithm.
• 1.3% of the data were searched.
Results – real data
Interesting mismatches
Fast pose estimation – summary
• A fast way to compute the angles of a human body figure.
• Moving from one representation space to another.
• Training a sensitive hash function.
• KNN smart averaging.
Food for Thought
• The basic assumption may be problematic (distance metric, representations).
• The training set should be dense.
• Texture and clutter.
• In general: some features are more important than others and should be weighted.
Food for Thought: Point Location in Different Spheres (PLDS)
• Given: n spheres in R^d, centered at P = {p1, …, pn} with radii {r1, …, rn}.
• Goal: given a query q, preprocess the points in P to find a point p_i whose sphere 'covers' the query q.
Courtesy of Mohamad Hegaze
Mean Shift Based Clustering in High Dimensions: A Texture Classification Example
B. Georgescu, I. Shimshoni, and P. Meer
Motivation:
• Clustering high-dimensional data by using local density measurements (e.g. in feature space).
• Statistical curse of dimensionality: sparseness of the data.
• Computational curse of dimensionality: expensive range queries.
• LSH parameters should be adjusted for optimal performance.
Outline
• Mean-shift in a nutshell + examples.
Our scope:
• Mean-shift in high dimensions – using LSH.
• Speedups:
1. Finding optimal LSH parameters.
2. Data-driven partitions into buckets.
3. Additional speedup by using the LSH data structure.
Mean-shift | LSH: optimal k,l | LSH: data partition | LSH: data struct
Mean-Shift in a Nutshell
[Figure: a bandwidth window around the current point; the window mean shifts toward the local density maximum]
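One mean-shift trajectory can be sketched as follows (our own flat-kernel illustration; the data, query, and bandwidth are made up):

```python
import math

def mean_shift_point(x, data, bandwidth, iters=30):
    """Repeatedly move x to the mean of the data points that fall within
    the bandwidth window around it (flat kernel), approaching a mode."""
    for _ in range(iters):
        window = [p for p in data if math.dist(p, x) <= bandwidth]
        if not window:
            break
        x = tuple(sum(c) / len(window) for c in zip(*window))
    return x

data = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9)]
print(mean_shift_point((0.3, 0.2), data, bandwidth=1.0))
```

Every iteration needs a range query over the data, which is exactly the expensive step that LSH accelerates in high dimensions.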
KNN in mean-shift
The bandwidth should be inversely proportional to the density in the region:
• high density – small bandwidth
• low density – large bandwidth
The bandwidth is based on the distance to the k-th nearest neighbor of the point.
Adaptive mean-shift vs. non-adaptive.
Image segmentation algorithm
1. Input: data in 5D (3 color + 2 x,y) or 3D (1 gray + 2 x,y)
2. Resolution controlled by the bandwidths: hs (spatial), hr (color)
3. Apply filtering
Mean Shift: A Robust Approach Toward Feature Space Analysis, D. Comaniciu et al., TPAMI '02
Image segmentation algorithm
original → filtered → segmented
Filtering: each pixel takes the value of its nearest mode
[Figure: mean-shift trajectories]
Filtering examples
original squirrel → filtered
original baboon → filtered
Mean Shift: A Robust Approach Toward Feature Space Analysis, D. Comaniciu et al., TPAMI '02
Segmentation examples
Mean Shift: A Robust Approach Toward Feature Space Analysis, D. Comaniciu et al., TPAMI '02
Mean-shift in high dimensions
• Computational curse of dimensionality: expensive range queries → implemented with LSH
• Statistical curse of dimensionality: sparseness of the data → variable bandwidth
LSH-based data structure
• Choose L random partitions; each partition includes K pairs (d_k, v_k) of a coordinate index and a cut value.
• For each point x we check, for k = 1..K: x_{d_k} ≤ v_k
This partitions the data into cells.
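A sketch of one such partition (our own illustration; the (d_k, v_k) semantics of a coordinate index plus a cut value follow the slide, but the data range and parameters are made up):

```python
import random

def make_partition(d, K, lo=0.0, hi=1.0, seed=0):
    """One random partition: K pairs (d_k, v_k); a point's cell is the
    bit pattern of the K tests x[d_k] <= v_k."""
    rng = random.Random(seed)
    return [(rng.randrange(d), rng.uniform(lo, hi)) for _ in range(K)]

def cell_key(x, partition):
    return tuple(x[dk] <= vk for dk, vk in partition)

part = make_partition(d=3, K=4)
print(cell_key((0.2, 0.5, 0.9), part))
```

Points sharing a key in any of the L partitions become candidate neighbors, as in the Hamming-space construction earlier.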
Choosing the optimal K and L
• For a query q, compute distances only to the points in its buckets; aim for the smallest number of such distance computations.
Choosing optimal K and L
• Large K ⇒ a smaller number of points in a cell.
• If L is too small, points might be missed; if L is too big, C∪ might include extra points.
• As L increases, C∪ increases but C∩ decreases.
• C∩ determines the resolution of the data structure.
Choosing optimal K and L
• Determine accurately the KNN for m randomly selected data points; their k-NN distance is the bandwidth.
• Choose an error threshold ε.
• The optimal K and L should satisfy the constraint on the approximate distance.
Choosing optimal K and L
• For each K, estimate the error; in one run over all L's, find the minimal L satisfying the constraint, L(K).
• Minimize the running time t(K, L(K)) to find the minimum.
[Figure: approximation error for K, L; L(K) for ε = 0.05; running time t[K, L(K)]]
Data driven partitions
• In the original LSH, cut values are random in the range of the data.
• Suggestion: randomly select a point from the data and use one of its coordinates as the cut value.
[Figure: uniform vs. data-driven distribution of points per bucket]
Additional speedup
Assume that all points in C∩ will converge to the same mode (C∩ acts like a type of aggregate).
Speedup results
65,536 points, 1,638 points sampled, k = 100
Food for thought
[Figure: low dimension vs. high dimension]
A thought for food…
• Choose K, L by sample learning, or take the traditional values.
• Can one estimate K, L without sampling?
• A thought for food: does it help to know the data dimensionality or the data manifold?
• Intuitively: the dimensionality implies the number of hash functions needed.
• The catch: efficient dimensionality learning requires KNN.
15:30 cookies…
Summary
• LSH suggests a compromise on accuracy for a gain in complexity.
• Applications that involve massive data in high dimension require LSH's fast performance.
• Extension of LSH to different spaces (PSH).
• Learning the LSH parameters and hash functions for different applications.
Conclusion
• ..but at the end, everything depends on your data set
• Try it at home – visit: http://web.mit.edu/andoni/www/LSH/index.html
– Email Alex Andoni ([email protected])
– Test over your own data (C code under Red Hat Linux)