Efficient Processing of k Nearest Neighbor Joins Using MapReduce

INTRODUCTION

• k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it.

• As a combination of the k nearest neighbor (kNN) query and the join operation, kNN join is an expensive operation.

• Most existing work relies on a centralized indexing structure such as the B+-tree or the R-tree, which cannot be accommodated directly in a distributed and parallel environment.
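To make the definition concrete, the following is a minimal single-machine sketch of a brute-force kNN join. It assumes objects are coordinate tuples compared by Euclidean distance; the function name `knn_join` and the sample points are illustrative only. It is the expensive baseline, not the distributed method discussed in these slides.

```python
import heapq
import math

def knn_join(R, S, k):
    """Pair every r in R with its k closest objects in S (Euclidean distance)."""
    result = {}
    for r in R:
        # nsmallest scans all of S, so the whole join needs |R| * |S| distance computations.
        result[r] = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
    return result

# Illustrative data, not taken from the slides.
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
joined = knn_join(R, S, k=2)
```

Every object of R is compared with every object of S, which is why the join is costly and why partitioning and pruning matter.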

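AN OVERVIEW OF KNN JOIN USING MAPREDUCE

• Basic strategy: $R = \bigcup_{1 \le i \le N} R_i$, where $R_i \cap R_j = \emptyset$ for $i \ne j$; each subset $R_i$ is distributed to a reducer. $S$ has to be sent to every reducer to be joined with $R_i$; finally the kNN join is $R \propto S = \bigcup_{1 \le i \le N} R_i \propto S$. Cost: $|R| + N \cdot |S|$.

• H-BRJ: splits both $R$ and $S$ into $\sqrt{n}$ subsets, $R = \bigcup_{1 \le i \le \sqrt{n}} R_i$ and $S = \bigcup_{1 \le i \le \sqrt{n}} S_i$.

• Better strategy: $R_i \propto S = R_i \propto S_i$ and $R \propto S = \bigcup_{1 \le i \le N} R_i \propto S_i$. Cost: $|R| + \alpha \cdot |S|$.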

• In summary, to minimize the join cost, we need to

1. find a good partitioning of $R$;
2. find the minimal set $S_i$ for each subset $R_i$ of $R$, given a partitioning of $R$.

※ The minimum set is $S_i = \bigcup_{1 \le j \le |R_i|} KNN(r_j, S)$ with $r_j \in R_i$. However, it is impossible to find the k nearest neighbors of all $r_j$ a priori.
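The two cost expressions quoted for the basic and the better strategy can be compared with a small calculation. This sketch assumes that α is the average number of reducers each object of S is sent to, i.e. α = Σ|S_i| / |S|; the slides do not define α, and all sizes below are made up.

```python
def basic_shuffle_cost(r_size, s_size, num_reducers):
    """|R| + N*|S|: R is split once, S is copied to each of the N reducers."""
    return r_size + num_reducers * s_size

def replicated_shuffle_cost(r_size, s_subset_sizes):
    """|R| + sum(|S_i|), which equals |R| + alpha*|S| under the assumed alpha."""
    return r_size + sum(s_subset_sizes)

def replication_factor(s_size, s_subset_sizes):
    """Assumed reading of alpha: average number of copies per object of S."""
    return sum(s_subset_sizes) / s_size

# Made-up sizes for illustration only.
r_size, s_size = 1000, 1000
s_subset_sizes = [300, 250, 400, 250]                        # |S_1| .. |S_4|
basic = basic_shuffle_cost(r_size, s_size, num_reducers=4)   # 5000
better = replicated_shuffle_cost(r_size, s_subset_sizes)     # 2200
alpha = replication_factor(s_size, s_subset_sizes)           # 1.2
```

The smaller each S_i can be made, the closer α gets to 1 and the further the cost drops below the basic strategy.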

HANDLING KNN JOIN USING MAPREDUCE

DATA PREPROCESSING

• A good partitioning of R for optimizing kNN join should cluster objects based on their proximity.

• Random Selection

• Farthest Selection

• k-means Selection

※ It is not easy to find pivots.
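Two of the selection strategies listed above can be sketched as follows, on the assumption that they are used to choose pivots. The slides give no details, so the farthest rule below (repeatedly take the object with the largest summed distance to the pivots chosen so far) is one plausible reading, and both function names are hypothetical.

```python
import math
import random

def random_selection(objects, num_pivots, seed=0):
    """Pick pivots uniformly at random."""
    return random.Random(seed).sample(objects, num_pivots)

def farthest_selection(objects, num_pivots, seed=0):
    """Greedy pick: start anywhere, then keep adding the object farthest from the chosen pivots.

    Assumes num_pivots does not exceed the number of distinct objects.
    """
    rng = random.Random(seed)
    pivots = [rng.choice(objects)]
    while len(pivots) < num_pivots:
        candidates = [o for o in objects if o not in pivots]
        # "Farthest" is measured as the sum of distances to all current pivots (an assumption).
        pivots.append(max(candidates, key=lambda o: sum(math.dist(o, p) for p in pivots)))
    return pivots

points = [(0.0,), (1.0,), (10.0,)]
pivots = farthest_selection(points, 2)
```

Whatever the starting point, the outlier at 10.0 ends up among the two pivots here, which illustrates how farthest selection tends to favor extreme objects.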

First MapReduce Job

• Perform data partitioning and collect some statistics for each partition.
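A single-process sketch of what this job could compute is shown below. It assumes each object goes to the partition of its closest pivot and that the statistics are the object count plus the minimum and maximum pivot distance (the L and U values used by the bounds on later slides). The slide does not list the statistics, so this choice is an assumption.

```python
import math
from collections import defaultdict

def partition_with_stats(objects, pivots):
    """Assign each object to its closest pivot and summarize every partition."""
    partitions = defaultdict(list)
    for o in objects:
        dists = [math.dist(o, p) for p in pivots]
        i = min(range(len(pivots)), key=dists.__getitem__)  # index of the closest pivot
        partitions[i].append((o, dists[i]))                 # keep the pivot distance for later pruning
    stats = {}
    for i, members in partitions.items():
        d = [dist for _, dist in members]
        stats[i] = {"count": len(d), "L": min(d), "U": max(d)}
    return dict(partitions), stats

pivots = [(0.0, 0.0), (10.0, 0.0)]
objects = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0)]
partitions, stats = partition_with_stats(objects, pivots)
```

In a real MapReduce job the assignment would run in the mappers and the summaries would be merged afterwards; the sketch only shows the per-object logic.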

Second MapReduce Job

• Distance bound of kNN:

$ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|$

$\theta_i = \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$ ①
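The bound in ① can be evaluated from pivot distances alone. The sketch below assumes that KNN(P_i^R, S) stands for the k objects of S with the smallest upper bound ub, and that the pivot distances |p_j, s| are available per partition of S; the function and parameter names are illustrative.

```python
import heapq

def knn_distance_bound(u_r, pivot_dists, s_dists, k):
    """Return theta_i for one partition P_i^R.

    u_r            -- U(P_i^R), the largest distance from p_i to an object of P_i^R
    pivot_dists[j] -- |p_i, p_j|
    s_dists[j]     -- distances |p_j, s| of the objects in P_j^S (at least one object overall)
    """
    upper_bounds = (
        u_r + pivot_dists[j] + d          # ub(s, P_i^R)
        for j, dists in s_dists.items()
        for d in dists
    )
    # The k smallest upper bounds play the role of KNN(P_i^R, S) under the stated assumption.
    return max(heapq.nsmallest(k, upper_bounds))

theta = knn_distance_bound(
    u_r=1.0,
    pivot_dists={0: 0.0, 1: 4.0},
    s_dists={0: [0.5, 2.0], 1: [0.5]},
    k=2,
)  # upper bounds are 1.5, 3.0 and 5.5, so theta is 3.0
```

Since k objects of S are guaranteed to lie within θ_i of every object in P_i^R, no true k-th neighbor distance in that partition can exceed θ_i.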

Second MapReduce Job

• Finding $S_i$ for $R_i$:

$lb(s, P_i^R) = \max\{0, |p_i, p_j| - U(P_i^R) - |s, p_j|\}$ ②

if $lb(s, P_i^R) > \theta_i$ ③, then $s \notin KNN(P_i^R, S)$

$LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - \theta_i$

if $|s, p_j| \ge LB(P_j^S, P_i^R)$, then $s$ may belong to $KNN(P_i^R, S)$ and is assigned to $S_i$

$|s, p_j| \in [LB(P_j^S, P_i^R), U(P_j^S)]$
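The filter on |s, p_j| can be sketched as below. It assumes each object of P_j^S is stored together with its pivot distance, as a first job might emit it; the names are hypothetical.

```python
def lower_bound(pivot_dist, u_r, theta):
    """LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - theta_i."""
    return pivot_dist - u_r - theta

def select_candidates(s_partition, pivot_dist, u_r, theta):
    """Keep the objects of P_j^S that may still be a k nearest neighbor of some r in P_i^R.

    s_partition -- list of (s, |s, p_j|) pairs
    """
    lb = lower_bound(pivot_dist, u_r, theta)
    return [s for s, d in s_partition if d >= lb]

# Illustrative numbers: |p_i, p_j| = 10, U(P_i^R) = 2, theta_i = 3, hence LB = 5.
s_partition = [("a", 1.0), ("b", 5.0), ("c", 7.5)]
candidates = select_candidates(s_partition, pivot_dist=10.0, u_r=2.0, theta=3.0)
```

Object "a" sits close to its own pivot and therefore far from P_i^R, so it is dropped; "b" and "c" are kept. Only one comparison per object is needed, with no distance computed between s and any r.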

Second MapReduce Job

• In this way, objects in each partition of $R$ and their potential k nearest neighbors are sent to the same reducer. By parsing the key-value pair $(k_2, v_2)$, the reducer can derive the partition $P_i^R$ and the subset $S_i$ that consists of $P_{j_1}^S, \ldots, P_{j_M}^S$.

• $\forall r \in P_i^R$, to reduce the number of distance computations, we first sort the partitions of $S_i$ by the distances from their pivots to pivot $p_i$ in ascending order.

※ Compute $\theta_i \leftarrow \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$.

※ $\theta_i$ can be refined, but I think it is useless.
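The reducer-side search for one object r might look like the sketch below. It sorts the partitions of S_i by pivot distance as described, but the pruning test is a plain triangle-inequality check at partition level, swapped in for brevity; it is not the refinement used in the slides. The input layout `(p_j, U(P_j^S), members)` is an assumption.

```python
import heapq
import math

def reducer_knn(r, p_i, s_partitions, k):
    """Find the k nearest neighbors of r among the partitions of S_i.

    s_partitions -- list of (p_j, u_j, members), with u_j = U(P_j^S)
    """
    # Visit partitions whose pivot is close to p_i first, so good candidates show up early.
    ordered = sorted(s_partitions, key=lambda part: math.dist(p_i, part[0]))
    best = []  # max-heap on distance (stored negated), at most k entries
    for p_j, u_j, members in ordered:
        # Every s in P_j^S satisfies |r, s| >= |r, p_j| - U(P_j^S); skip if that already loses.
        if len(best) == k and math.dist(r, p_j) - u_j > -best[0][0]:
            continue
        for s in members:
            d = math.dist(r, s)
            if len(best) < k:
                heapq.heappush(best, (-d, s))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, s))
    return [s for _, s in sorted(best, reverse=True)]  # closest first

s_partitions = [
    ((10.0, 0.0), 1.0, [(10.0, 1.0), (9.0, 0.0)]),
    ((0.0, 0.0), 2.0, [(0.0, 1.0), (2.0, 0.0)]),
]
neighbors = reducer_knn((1.0, 0.0), (0.0, 0.0), s_partitions, k=2)
```

In this example the partition around (10, 0) is skipped without computing any distance to its members, because the two candidates found in the nearer partition are already closer than anything it could contain.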

Second MapReduce Job

• Define $d(o, HP(p_i, p_j)) = \dfrac{|o, p_i|^2 - |o, p_j|^2}{2 \cdot |p_i, p_j|}$

if $d(o, HP(p_i, p_j)) > \theta$, then $\forall q \in P_i^R$, $|o, q| > \theta$

$|q, o| \le \theta$ only if $\max\{L(P_i^S), |p_i, q| - \theta\} \le |p_i, o| \le \min\{U(P_i^S), |p_i, q| + \theta\}$
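Both pruning rules on this slide need only pivot distances. The sketch below assumes Euclidean space, takes HP(p_i, p_j) to be the hyperplane of points equidistant from the two pivots, and reads the range condition as a necessary condition for |q, o| ≤ θ. The function names are illustrative.

```python
import math

def hyperplane_distance(o, p_i, p_j):
    """Distance from o to the hyperplane between p_i and p_j (positive when o is closer to p_j)."""
    return (math.dist(o, p_i) ** 2 - math.dist(o, p_j) ** 2) / (2 * math.dist(p_i, p_j))

def may_be_within(q, o, p_i, l_i, u_i, theta):
    """Necessary condition for |q, o| <= theta, where o lies in the partition of pivot p_i.

    l_i, u_i -- smallest and largest pivot distance in that partition (L and U)
    """
    d_q = math.dist(p_i, q)
    d_o = math.dist(p_i, o)
    return max(l_i, d_q - theta) <= d_o <= min(u_i, d_q + theta)

# o = (8, 0) lies 3 units beyond the hyperplane x = 5 that separates the two pivots.
d = hyperplane_distance((8.0, 0.0), (0.0, 0.0), (10.0, 0.0))  # 3.0

keep = may_be_within((3.0, 0.0), (4.0, 0.0), (0.0, 0.0), l_i=0.0, u_i=5.0, theta=2.0)
drop = may_be_within((3.0, 0.0), (0.0, 0.5), (0.0, 0.0), l_i=0.0, u_i=5.0, theta=2.0)
```

When `may_be_within` returns False, the object can be discarded without computing |q, o|. When it returns True, the actual distance still has to be checked.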

MINIMIZING REPLICATION OF S

• $s$ is replicated when $|s, p_j| \ge LB(P_j^S, P_i^R)$ => we want a large $LB(P_j^S, P_i^R)$ while keeping $|s, p_j|$ small

=> split the dataset at a finer granularity, so that the bound on the kNN distances of all objects in each partition of $R$ becomes tighter.

• $R = \bigcup_{1 \le i \le N} G_i$, $G_i \cap G_j = \emptyset$, $i \ne j$.

$s$ is assigned to $S_i$ only if $|s, p_j| \ge LB(P_j^S, G_i)$,

where $LB(P_j^S, G_i) = \min_{\forall P_i^R \in G_i} LB(P_j^S, P_i^R)$

$RP(S) = \sum_{\forall G_i} \sum_{\forall P_j^S} |\{s \mid s \in P_j^S \wedge |s, p_j| \ge LB(P_j^S, G_i)\}|$
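The replication count RP(S) can be evaluated directly from the per-partition lower bounds. In this sketch the bounds LB(P_j^S, P_i^R) are assumed to be precomputed and stored in a dictionary keyed by `(j, i)`; the data layout and all numbers are illustrative.

```python
def group_lower_bound(j, group, lb):
    """LB(P_j^S, G_i): the smallest LB(P_j^S, P_i^R) over the partitions P_i^R in the group."""
    return min(lb[(j, i)] for i in group)

def replication(groups, s_dists, lb):
    """RP(S): total number of copies of objects of S sent to all groups.

    groups     -- list of groups, each a list of R-partition ids
    s_dists[j] -- distances |s, p_j| of the objects in P_j^S
    lb[(j, i)] -- LB(P_j^S, P_i^R)
    """
    total = 0
    for group in groups:
        for j, dists in s_dists.items():
            bound = group_lower_bound(j, group, lb)
            total += sum(1 for d in dists if d >= bound)
    return total

s_dists = {0: [1.0, 3.0], 1: [2.0]}
lb = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 5.0, (1, 1): 1.0}
separate = replication([[0], [1]], s_dists, lb)  # 4
merged = replication([[0, 1]], s_dists, lb)      # 3
```

Merging the two partitions of R into one group lowers the count from 4 to 3 in this toy case, because an object needed by both is shipped only once.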

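MINIMIZING REPLICATION OF S

• Geometric Grouping

• Greedy Grouping: minimize the size of $RP(S, G_i \cup \{P_j^R\}) - RP(S, G_i)$,

but computing it exactly is rather costly, so approximate: if $\exists s \in P_j^S$ with $|s, p_j| \ge LB(P_j^S, G_i)$, count the whole partition $P_j^S$:

$RP(S, G_i) \approx \bigcup_{\forall P_j^S \subset S} \{P_j^S \mid LB(P_j^S, G_i) \le U(P_j^S)\}$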
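A simplified greedy grouping based on the approximate RP could be sketched as follows. It assumes one seed partition per group and assigns every remaining partition of R to the group whose approximate replication grows least. It ignores load balancing between groups, which a full implementation would likely need; all names and numbers are hypothetical.

```python
def approx_rp(group, lb, u_s, size_s):
    """Approximate RP(S, G): count a whole partition P_j^S whenever LB(P_j^S, G) <= U(P_j^S)."""
    total = 0
    for j in u_s:
        bound = min(lb[(j, i)] for i in group)  # LB(P_j^S, G)
        if bound <= u_s[j]:
            total += size_s[j]
    return total

def greedy_grouping(r_partitions, seeds, lb, u_s, size_s):
    """Grow one group per seed, always adding a partition where it costs the fewest extra replicas."""
    groups = [[seed] for seed in seeds]
    for i in r_partitions:
        if i in seeds:
            continue

        def increase(group):
            # RP(S, G ∪ {P_i^R}) - RP(S, G), both approximated
            return approx_rp(group + [i], lb, u_s, size_s) - approx_rp(group, lb, u_s, size_s)

        min(groups, key=increase).append(i)
    return groups

u_s = {0: 2.0, 1: 2.0}        # U(P_j^S)
size_s = {0: 10, 1: 20}       # |P_j^S|
lb = {(0, 0): 1.0, (1, 0): 5.0,
      (0, 1): 5.0, (1, 1): 1.0,
      (0, 2): 1.5, (1, 2): 6.0}
groups = greedy_grouping([0, 1, 2], seeds=[0, 1], lb=lb, u_s=u_s, size_s=size_s)
```

Partition 2 needs the same partition of S as partition 0, so joining that group adds no replicas, whereas joining the other group would add 10.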

EXPERIMENTAL EVALUATION

The End! Thanks
