
DSH: Data Sensitive Hashing for High-Dimensional k-NN Search

Choi¹, Myung², Lim¹, Song²

¹DataKnow. Lab. ²D&C Lab. Korea Univ.

Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi (SIGMOD '14)

Motivation

App: Large Scale Image Search in Database
• Find similar images in a large database (e.g., Google image search)

Kristen Grauman et al.

Slide: Yunchao Gong, UNC Chapel Hill

Feature Vector? High Dimension?
• Feature vector: example
  • Nudity detection algorithm based on a neural network, by Choi
  • Image file (png) -> 8 x 8 vector (0, 0, 0, …, 0.3241, 0.00441, …)
• In practice, feature vectors with many more dimensions are used

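Image Search and kNN
• Consider a set of d-dimensional feature vectors, each representing an image
• For two vectors in the set:
  • if the distance between them is small, regard the two images as similar
  • if the distance between them is large, regard the two images as dissimilar
• Represent the query image Q as a point q in the same space
• Finding the k images most similar to Q then becomes a k-NN problem
  • Return the k points nearest to q

Can R-tree-based kNN search solve this problem?

No: curse of dimensionality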
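As a baseline for the k-NN formulation above, an exact answer can always be obtained by scanning every vector. The sketch below is a minimal illustration, assuming Euclidean distance and in-memory `double[]` vectors; the names `LinearScanKnn` and `kNearest` are hypothetical and do not come from the slides.

```java
import java.util.Arrays;
import java.util.Comparator;

class LinearScanKnn {
    // Euclidean distance between two vectors of equal length (assumed metric).
    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the indices of the k vectors in data closest to q, nearest first.
    static int[] kNearest(double[][] data, double[] q, int k) {
        Integer[] ids = new Integer[data.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        // Sort all ids by distance to q: every vector is touched once per query.
        Arrays.sort(ids, Comparator.comparingDouble((Integer id) -> dist(data[id], q)));
        int[] out = new int[Math.min(k, ids.length)];
        for (int j = 0; j < out.length; j++) out[j] = ids[j];
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {5, 5}, {0.5, 0}};
        System.out.println(Arrays.toString(kNearest(data, new double[]{0, 0}, 2)));
    }
}
```

The scan is exact (recall 1) but its cost grows with the full data set size on every query, which is the cost the hashing-based methods try to avoid.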

Reality Check
• Curse of dimensionality
  • [Qin Lv et al., Image Similarity Search with Compact Data Structures, CIKM '04]
  • Poor performance when the number of dimensions is high

Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, VLDB '98

Data Sensitive Hashing
• A solution to the approximate k-NN problem in high-dimensional space
  • k-NN problem
  • Recall:

Method                     | Curse of dimensionality | Recall | Affected by data distribution | Underlying technique
Scan                       | X (none)                | 1      | X                             | N/A
R-tree-based solution      | O (strong)              | 1      | △                             | Index: tree
Locality Sensitive Hashing | △ (less)                |        | O                             | Hashing + Mathematics
Data Sensitive Hashing     | △ (less)                |        | △                             | Hashing + Machine Learning

[Figure: query, kNN objects, result set]

Related Work: LSH

[Figure: from the generated hash functions, m functions are randomly extracted per table: h11, h12, …, h1m → g1; …; hl1, hl2, …, hlm → gl]

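Locality Sensitive Hashing
• We need to solve the kNN problem in a 100-dimensional real space (R^100).
• What if!?
  • What if there were a hash function under which
    • similar points collide with each other, and
    • dissimilar points do not collide?

[Figure: query point]

However, no such ideal function exists.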

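Formally,
• Informally,
  a. (If two points are similar) the probability that their hash values are equal is (should be) high.
  b. (If two points are dissimilar) the probability that their hash values are equal is (should be) low.
• Intuitively,
  • What if we could build a hash function that satisfies both perfectly?
  • However, no such ideal function exists.
• Challenging
  • Deriving such a family of hash functions is itself mathematically hard!
  • Even when one is derived, the collision probability for similar points is usually low and that for dissimilar points is high.

Problem 1: a family can be derived, but the collision probability for dissimilar points is too high (it should be low!)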

Random projection (see backup slide)
• Formally

Solution 1: bundle several hash functions (m of them) and use them together!

[Figure: random projection assigning hash values 0 and 1]
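The formal definition on this slide did not survive, so the following is only a sketch of one common random-projection hash, not necessarily the one on the backup slide. It assumes a binary hash that reports which side of a random hyperplane through the origin a point falls on; the class name `RandomProjectionHash`, the Gaussian direction vector, and the `>= 0` tie rule are all assumptions.

```java
import java.util.Random;

class RandomProjectionHash {
    // Random direction; Gaussian components are an assumption of this sketch.
    private final double[] a;

    RandomProjectionHash(int dim, Random rng) {
        a = new double[dim];
        for (int i = 0; i < dim; i++) a[i] = rng.nextGaussian();
    }

    // h(x) = 1 if x lies on the non-negative side of the hyperplane a.x = 0, else 0.
    int hash(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) dot += a[i] * x[i];
        return dot >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        RandomProjectionHash h = new RandomProjectionHash(3, new Random(42));
        System.out.println(h.hash(new double[]{1, 2, 3}));
        System.out.println(h.hash(new double[]{-1, -2, -3}));
    }
}
```

A single function of this kind yields only one bit, so many dissimilar points still collide, which is the motivation for bundling m of them.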

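m-concatenation
• Let g(x) = (h1(x), h2(x), …, hm(x))
• For two distant points q and p, g(q) = g(p) only if all m hash values agree

[Figure: space partitioned by several random projections into cells with bit codes such as 101, 001, 100, 111; credit: Fergus et al.]

Bundle them and use them together!

Effect: fewer false positives, since for two dissimilar points the collision probability of g is at most p2^m (p2: collision probability of a single h for dissimilar points)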

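Random projection
• Even if p1 is as high as 0.8 (p1: collision probability of a single h for similar points),
• if m = 5, then
  • for two similar points, Pr[g(q) = g(p)] >= p1^m = 0.8^5 ≈ 0.33
  • That is, if a hash table is built with a single g:
    • if 100 points are very similar to the query point q,
    • only about 33 or more of them are guaranteed to be found
    • so recall is low!
• If l = 5, then 1 - (1 - p1^m)^l = 1 - (1 - 0.33)^5 ≈ 0.86, so
  • on average 86 or more of them can be found

We bundled them!

Effect: fewer false positives (for two dissimilar points)

Side effect: more false negatives (for two similar points)

Problem 2: because the collision probability for dissimilar points dropped, the collision probability for similar points dropped as well (it should be high!)

Solution 2: use several g functions (l of them), then find the k-NN among their candidates!

Effect: high recall; that is, a recall of 1 - (1 - p1^m)^l can be achieved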
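The trade-off between m and l can be checked numerically. The sketch below recomputes the slide's figures under the usual assumption that the individual hash functions and the tables behave independently; the names `LshRecall`, `tableCollision`, and `recall` are hypothetical.

```java
class LshRecall {
    // Probability that one concatenated function g (m hash functions) collides,
    // given a per-function collision probability p.
    static double tableCollision(double p, int m) {
        return Math.pow(p, m);
    }

    // Probability that at least one of l tables collides for a similar pair.
    static double recall(double p1, int m, int l) {
        return 1.0 - Math.pow(1.0 - tableCollision(p1, m), l);
    }

    public static void main(String[] args) {
        // Slide values: p1 = 0.8, m = 5 gives about 0.33 for a single table,
        // and l = 5 tables raise it to about 0.86.
        System.out.printf("single table: %.3f%n", tableCollision(0.8, 5));
        System.out.printf("five tables:  %.3f%n", recall(0.8, 5, 5));
    }
}
```

Raising m lowers both false positives and single-table recall; raising l buys the recall back at the cost of more tables and more candidates to check.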

Structure
• LSH
  • A set of hash tables
  • Hash function
  • For example,

[Figure: hash tables H1, H2, …, H26, built from g1, g2, …, g26; each table maps keys 000000, 000001, …, 111111 to buckets]

Processing: diagram
• Query point q
• Processing
  • Step 1. Candidate set
  • Step 2. Return k_Nearest_Neighbors(q) in the candidate set
    • linear search

[Figure: query q is hashed by g1, g2, …, g26 into hash tables H1, H2, …, H26 (keys 000000 … 111111)]

Formally,

[Figure: from the generated hash functions, m functions are randomly extracted per table: h11, h12, …, h1m → g1; …; hl1, hl2, …, hlm → gl]

Traditional LSH technique:
① Derive the hash family mathematically
② Prove that a. and b. hold for an arbitrary input w.r.t. the given parameters
③ Randomly extract functions and build the hash tables

DSH (Data Sensitive Hashing):
① Learn a hash function using adaptive boosting
② If the learned family is not sufficient to guarantee that a. and b. hold, go to ①
③ Randomly extract functions and build the hash tables

LSH VS DSH

Method                     | Considers data distribution | Underlying technique
Locality Sensitive Hashing | X (a uniform distribution is assumed from the start, for step ②) | Hashing + Mathematics
Data Sensitive Hashing     | O (h is extracted directly from the distribution of the target data) | Hashing + Machine Learning

LSH VS DSH 2

Method                     | Sensitive to | Underlying technique
Locality Sensitive Hashing | Hashing sensitive to distance | Hashing + Mathematics
Data Sensitive Hashing     | Hashing sensitive to data (k-NN and non-ck-NN) | Hashing + Machine Learning

DSH: demonstration

Example: Data Set
• 100-dimensional data set
• 10 clusters

[Figure: scatter plot of the data set; both axes range from 0 to 1000]

Build DSH for D
• DSH dsh = new DSH(10, 1.1, 0.7, 0.6, 0.4, querySet, dataSet);

Parameter               | Value
k (k-NN)                | 10
(learning rate)         | 1.1
(lower bound of recall) | 70%
                        | 0.6
                        | 0.4
Query Set               | D
Data Set                | D

Structure
• DSH
  • A set of hash tables
  • Hash function
  • For example,

[Figure: hash tables H1, H2, …, H26, built from g1, g2, …, g26; each table maps keys 000000, 000001, …, 111111 to buckets]

Query Example
• res = dsh.k_Nearest_Neighbor(q = new Point(984.29, 946.23, ..., 848.21));
  • Returns the 10-aNN (approximate nearest neighbor) objects for the given point q
• DSH's property:
  • The result set must include at least 70% of the exact 10-NN objects
• Result: query point q: (984.29, 946.23, ..., 848.21)

[Figure: scatter plot (both axes 0 to 1000) showing the 10-aNN of q (recall: 100%)]

Processing: dsh.k_Nearest_Neighbor(q)
• Query point q
• Processing
  • Step 1. Candidate set
  • Step 2. Return k_Nearest_Neighbors(q) in the candidate set
    • linear search

[Figure: query q is hashed by g1, g2, …, g26 into hash tables H1, H2, …, H26 (keys 000000 … 111111)]

H_i.get(q)
• Query point q

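Processing: dsh.k_Nearest_Neighbor(q)
• For each hash table H_i (i = 1, …, 26):
  • H_1.get(q) = {Data(id=93~98, 100~102)}
  • H_2.get(q) = ...
  • ...
  • H_26.get(q) = ...
• Candidate set: the union of the buckets retrieved above
• dsh.k_Nearest_Neighbor(q)
  • = k_Nearest_Neighbors(q) in the candidate set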

Processing: dsh.k_Nearest_Neighbor(q)
• Candidate set
  • result <- Find k-NN(q) in the candidate set
• dsh.k_Nearest_Neighbor(q)
  • return result
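The two processing steps (collect the candidate set from every table, then linearly search it) can be sketched as follows. This is an illustration under stated assumptions, not the presenters' implementation: keys are plain strings, buckets hold data ids, distance is Euclidean, and the names `LshQuerySketch` and `kNearestNeighbors` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class LshQuerySketch {
    // Euclidean distance (assumed metric).
    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // tables.get(i) maps the key g_i(o) to the ids of the data objects in that bucket.
    static List<Integer> kNearestNeighbors(double[] q, int k, double[][] data,
                                           List<Function<double[], String>> gs,
                                           List<Map<String, List<Integer>>> tables) {
        // Step 1: candidate set = union of the buckets H_i.get(g_i(q)).
        Set<Integer> candidates = new HashSet<>();
        for (int i = 0; i < tables.size(); i++) {
            String key = gs.get(i).apply(q);
            candidates.addAll(tables.get(i).getOrDefault(key, Collections.emptyList()));
        }
        // Step 2: linear search for the k nearest objects inside the candidate set only.
        List<Integer> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Integer id) -> dist(data[id], q)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Because only the candidate set is searched, the result is approximate: a true neighbor that shares no bucket with q in any table is missed.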

How to build DSH for D?

Build DSH for D
• Step 1. Generate the Data Sensitive Hashing family (Chapters 3-4)
• Step 2. Generate each hash function g_i by randomly extracting m hash functions from the family

[Figure: generating the hashing family (Chapters 3 and 4), then randomly extracting functions: h11, h12, …, h1m → g1; …; hl1, hl2, …, hlm → gl]

• Step 3. For each g_i:
  • Initialize hash table H_i (<key, value> = <Integer array, Data>)
  • For each data object o in D: H_i.put(g_i(o), o)
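Steps 2 and 3 above can be sketched as below. The sketch makes several assumptions that are not in the slides: each hash function h is represented by a direction vector and returns a sign bit (a stand-in for the learned functions of Step 1), functions are drawn with replacement, and the key is a `List<Integer>` rather than an `int[]` because Java arrays do not have value-based `equals`/`hashCode`. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class HashTableBuildSketch {
    // Stand-in binary hash: 1 if a.x >= 0, else 0 (assumed form of h).
    static int h(double[] a, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) dot += a[i] * x[i];
        return dot >= 0 ? 1 : 0;
    }

    // g(x) = (h_1(x), ..., h_m(x)), used as the hash table key.
    static List<Integer> g(double[][] hs, double[] x) {
        List<Integer> key = new ArrayList<>(hs.length);
        for (double[] a : hs) key.add(h(a, x));
        return key;
    }

    // Step 2: randomly extract m functions from the family (with replacement here).
    static double[][] extract(double[][] family, int m, Random rng) {
        double[][] hs = new double[m][];
        for (int j = 0; j < m; j++) hs[j] = family[rng.nextInt(family.length)];
        return hs;
    }

    // Step 3: build one table per g_i, mapping g_i(o) to the ids of the objects o.
    static List<Map<List<Integer>, List<Integer>>> build(double[][] data, List<double[][]> gs) {
        List<Map<List<Integer>, List<Integer>>> tables = new ArrayList<>();
        for (double[][] hs : gs) {
            Map<List<Integer>, List<Integer>> table = new HashMap<>();
            for (int id = 0; id < data.length; id++) {
                table.computeIfAbsent(g(hs, data[id]), key -> new ArrayList<>()).add(id);
            }
            tables.add(table);
        }
        return tables;
    }

    public static void main(String[] args) {
        double[][] family = {{1, 0}, {0, 1}, {1, 1}, {1, -1}};
        double[][] data = {{1, 1}, {-1, 1}, {2, 3}};
        Random rng = new Random(7);
        List<double[][]> gs = new ArrayList<>();
        for (int i = 0; i < 3; i++) gs.add(extract(family, 2, rng)); // l = 3 tables, m = 2
        System.out.println(build(data, gs).size());
    }
}
```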

Adaptive Boosting: Principle

A Weak Classifier
• A weak classifier is a function
  • Input: <query, data> pair
  • Desired output:

Pair type     | Output | Result
kNN pair      | 0      | correct
non-ckNN pair | 0      | incorrect
kNN pair      | 1      | incorrect
non-ckNN pair | 1      | correct

Note: a weak classifier may produce many incorrect results

Adaptive Boosting
• Build a strong classifier by combining several weak classifiers

[Figure: the weak classifier trainer is run on the 1st query-data pair set to produce Weak Classifier 1; after testing, the well classified and badly classified pairs are fed back to form the 2nd query-data pair set (Weak Classifier 2) and then the 3rd query-data pair set (Weak Classifier 3)]
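The feedback loop in the diagram can be illustrated with a generic boosting-style reweighting step: pairs that the current weak classifier got wrong receive more weight before the next classifier is trained. This is a textbook-style sketch, not the exact update used by DSH; the multiplicative `factor` and the names `BoostingReweightSketch` and `reweight` are assumptions.

```java
class BoostingReweightSketch {
    // desired[i] and predicted[i] are the 0/1 outputs for pair i.
    // Misclassified pairs are scaled by factor (> 1), then all weights are
    // normalized so they sum to 1.
    static double[] reweight(double[] weights, int[] desired, int[] predicted, double factor) {
        double[] out = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            out[i] = (desired[i] == predicted[i]) ? weights[i] : weights[i] * factor;
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};
        int[] desired = {0, 0, 1, 1};
        int[] predicted = {0, 1, 1, 1}; // the second pair is badly classified
        System.out.println(java.util.Arrays.toString(reweight(w, desired, predicted, 2.0)));
    }
}
```

Training the next weak classifier on the reweighted pairs makes it concentrate on the pairs the previous one misclassified.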

A Strong Classifier
• Build a strong classifier by combining several weak classifiers

[Figure: Weak Classifiers 1, 2, and 3, applied to the query-data pair set, are combined into a strong classifier]


Single Hash Function Optimization

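Notation
• Query set
• Data set
• Weight matrix W

Example (k = 2, c = 3/2, sampling rate = 1): 2 queries (rows) and 4 data objects (columns)

          data 1  data 2  data 3  data 4
query 1      1       1       0      -1
query 2     -1       0       1       1

An entry is 1 for a kNN pair, -1 for a non-ckNN pair, and 0 otherwise (here ck = 3).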
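The example weight matrix can be reproduced from a query-to-data distance matrix. The sketch below assumes c = 3/2 (so ck = 3), a sampling rate of 1 (every pair is kept), and ranks taken by sorting distances in ascending order; the names `WeightMatrixSketch` and `weights` are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

class WeightMatrixSketch {
    // dist[i][j] is the distance from query i to data object j.
    // W[i][j] = 1 if j is among the k nearest of query i,
    //          -1 if j is outside the ck nearest, and 0 otherwise.
    static int[][] weights(double[][] dist, int k, double c) {
        int ck = (int) Math.floor(c * k);
        int[][] w = new int[dist.length][];
        for (int i = 0; i < dist.length; i++) {
            final double[] row = dist[i];
            Integer[] order = new Integer[row.length];
            for (int n = 0; n < order.length; n++) order[n] = n;
            // Data ids sorted by increasing distance from query i.
            Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> row[idx]));
            w[i] = new int[row.length];
            for (int rank = 0; rank < order.length; rank++) {
                int j = order[rank];
                if (rank < k) w[i][j] = 1;          // kNN pair
                else if (rank >= ck) w[i][j] = -1;  // non-ckNN pair
                else w[i][j] = 0;                   // in between: ignored
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] dist = {{1, 2, 3, 4}, {4, 3, 2, 1}};
        System.out.println(Arrays.deepToString(weights(dist, 2, 1.5)));
    }
}
```

With the distances in `main`, the output rows are (1, 1, 0, -1) and (-1, 0, 1, 1), matching the matrix in the example.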


Objective

• =
