Efficient Processing of k Nearest Neighbor Joins using MapReduce


Upload: gertrude-foster

Post on 22-Dec-2015


Page 1: Efficient Processing of k Nearest Neighbor Joins using MapReduce

Efficient Processing of k Nearest Neighbor Joins using MapReduce

Page 2: Efficient Processing of k Nearest Neighbor Joins using MapReduce

INTRODUCTION

• k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it.

• As a combination of the k nearest neighbor (kNN) query and the join operation, kNN join is an expensive operation.

• Most existing work relies on a centralized indexing structure such as the B+-tree or the R-tree, which cannot be used directly in a distributed and parallel environment such as MapReduce.
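To make the definition in the first bullet concrete, here is a minimal brute-force sketch of a kNN join on a single machine. The function name `knn_join`, the sample points and the use of Euclidean distance are assumptions made for illustration; they are not taken from the slides.

```python
import heapq
import math

def knn_join(R, S, k):
    """Brute force: pair every r in R with the k objects of S closest to it."""
    # Assumption: objects are coordinate tuples compared by Euclidean distance.
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (20.0, 20.0)]
result = knn_join(R, S, k=2)
```

This sketch computes |R|·|S| distances, which illustrates why the second bullet calls the kNN join an expensive operation.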

Page 3: Efficient Processing of k Nearest Neighbor Joins using MapReduce

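AN OVERVIEW OF KNN JOIN USING MAPREDUCE

• Basic strategy: $R = \bigcup_{1 \le i \le N} R_i$, where $R_i \cap R_j = \emptyset$ for $i \ne j$; each subset $R_i$ is distributed to a reducer. $S$ has to be sent to every reducer to be joined with $R_i$; finally $R \propto S = \bigcup_{1 \le i \le N} R_i \propto S$. Shuffling cost: $|R| + N \cdot |S|$.

• H-BRJ: splits both $R$ and $S$ into $\sqrt{n}$ subsets: $R = \bigcup_{1 \le i \le \sqrt{n}} R_i$, $S = \bigcup_{1 \le i \le \sqrt{n}} S_i$.

• Better strategy: send each reducer only a subset $S_i$ of $S$ such that $R_i \propto S = R_i \propto S_i$, so that $R \propto S = \bigcup_{1 \le i \le N} R_i \propto S_i$. Shuffling cost: $|R| + \alpha \cdot |S|$, where $\alpha$ is the average number of replicas of an object in $S$.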
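The two cost expressions on this slide, |R| + N·|S| and |R| + α·|S|, can be compared with a little arithmetic. The function names and the dataset sizes below are hypothetical, and the value α = 2.5 is an assumed replication factor chosen only for illustration.

```python
def basic_cost(r_size, s_size, n_reducers):
    """Objects shuffled when all of S is sent to each of the N reducers."""
    return r_size + n_reducers * s_size

def replicated_cost(r_size, s_size, alpha):
    """Objects shuffled when each s is replicated alpha times on average."""
    return r_size + alpha * s_size

# Hypothetical sizes: one million objects in each dataset, 16 reducers.
basic = basic_cost(1_000_000, 1_000_000, 16)
better = replicated_cost(1_000_000, 1_000_000, 2.5)
```

Under these assumed numbers the basic strategy shuffles 17,000,000 objects and the better strategy 3,500,000, so the gain depends entirely on keeping α well below N.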

Page 4: Efficient Processing of k Nearest Neighbor Joins using MapReduce

AN OVERVIEW OF KNN JOIN USING MAPREDUCE

• In summary, to minimize the join cost, we need to

1. find a good partitioning of $R$;
2. find the minimal set $S_i$ for each $R_i \in R$, given a partitioning of $R$.

※ The minimum set is $S_i = \bigcup_{1 \le j \le |R_i|} KNN(r_j, S)$. However, it is impossible to know the k nearest neighbors of every $r_j$ a priori.

Page 5: Efficient Processing of k Nearest Neighbor Joins using MapReduce

HANDLING KNN JOIN USING MAPREDUCE

Page 6: Efficient Processing of k Nearest Neighbor Joins using MapReduce

DATA PREPROCESSING

• A good partitioning of R for optimizing kNN join should cluster objects based on their proximity.

• Random Selection

• Farthest Selection

• k-means Selection

※ It is not easy to find pivots.
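The slide only names the pivot selection strategies, so the following is one possible reading of Farthest Selection rather than the authors' exact procedure. It assumes that the first pivot is simply the first object of a sample and that each further pivot is the object with the largest sum of distances to the pivots chosen so far. The name `farthest_selection` is hypothetical.

```python
import math

def farthest_selection(sample, num_pivots):
    """Greedy pivot choice; requires num_pivots <= len(sample)."""
    pivots = [sample[0]]  # assumption: seed with the first sampled object
    while len(pivots) < num_pivots:
        candidates = [o for o in sample if o not in pivots]
        # Pick the object farthest (in summed distance) from all chosen pivots.
        pivots.append(max(candidates,
                          key=lambda o: sum(math.dist(o, p) for p in pivots)))
    return pivots

sample = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
pivots = farthest_selection(sample, 3)
```

With this sample the pivots spread to the extremes of the data, which also shows why outliers can be picked as pivots under this strategy.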

Page 7: Efficient Processing of k Nearest Neighbor Joins using MapReduce

First MapReduce Job

• Perform data partitioning and collect some statistics for each partition.
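A single-machine sketch of what this job could compute: each object goes to the partition of its closest pivot, and each partition records summary statistics. The slides do not list the statistics; the choice of object count plus the smallest and largest distance to the pivot (the L and U values used on later slides) is an assumption, as is the name `partition_with_stats`.

```python
import math
from collections import defaultdict

def partition_with_stats(objects, pivots):
    """Assign each object to its closest pivot and summarize each partition."""
    partitions = defaultdict(list)
    for o in objects:
        i = min(range(len(pivots)), key=lambda idx: math.dist(o, pivots[idx]))
        partitions[i].append((o, math.dist(o, pivots[i])))
    stats = {
        i: {"count": len(members),
            "L": min(d for _, d in members),   # closest object to the pivot
            "U": max(d for _, d in members)}   # farthest object from the pivot
        for i, members in partitions.items()
    }
    return dict(partitions), stats

pivots = [(0.0, 0.0), (10.0, 0.0)]
objects = [(1.0, 0.0), (3.0, 0.0), (9.0, 0.0), (10.0, 2.0)]
partitions, stats = partition_with_stats(objects, pivots)
```

In a real MapReduce job the assignment would happen in the mappers and the statistics would be merged afterwards; this sketch keeps both in one function for readability.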

Page 8: Efficient Processing of k Nearest Neighbor Joins using MapReduce

Second MapReduce Job

• Distance Bound of kNN

$ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|$

$\theta_i = \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$ ①
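The bound θi from equation ① can be sketched as follows: compute the upper bound ub for objects of S from their distances to their own pivot, keep the k smallest bounds, and take the largest of those. The function name and the input layout (`U_R` mapping a partition index of R to U(P_i^R), `s_pivot_dists` mapping a partition index j of S to the distances |s, p_j| of its objects) are assumptions for illustration.

```python
import heapq
import math

def knn_distance_bound(i, pivots, U_R, s_pivot_dists, k):
    """theta_i: largest of the k smallest upper bounds ub(s, P_i^R)."""
    ubs = []
    for j, dists in s_pivot_dists.items():
        gap = math.dist(pivots[i], pivots[j])          # |p_i, p_j|
        ubs.extend(U_R[i] + gap + d for d in dists)    # ub(s, P_i^R)
    return max(heapq.nsmallest(k, ubs))

pivots = [(0.0, 0.0), (10.0, 0.0)]
U_R = {0: 3.0}
s_pivot_dists = {0: [1.0, 4.0], 1: [2.0]}
theta_0 = knn_distance_bound(0, pivots, U_R, s_pivot_dists, k=2)
```

Because only distances to pivots are needed, this bound can be derived from partition summaries without touching the objects themselves.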

Page 9: Efficient Processing of k Nearest Neighbor Joins using MapReduce

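Second MapReduce Job

• Finding $S_i$ for $R_i$

$lb(s, P_i^R) = \max\{0,\ |p_i, p_j| - U(P_i^R) - |s, p_j|\}$ ②

if $lb(s, P_i^R) > \theta_i$ ③, then $s \notin KNN(P_i^R, S)$

$LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - \theta_i$

if $|s, p_j| \ge LB(P_j^S, P_i^R)$, then $s$ may belong to $KNN(P_i^R, S)$ and is assigned to $S_i$

that is, only objects of $P_j^S$ with $|s, p_j| \in [LB(P_j^S, P_i^R),\ U(P_j^S)]$ are sent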
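The filtering rule on this slide can be sketched as a small function: an object s of partition P_j^S is kept for S_i only when its distance to its own pivot is at least LB(P_j^S, P_i^R). The names `lower_bound` and `candidates_for`, the dictionary layout and the sample values (including θ) are assumptions for illustration.

```python
import math

def lower_bound(i, j, pivots, U_R, theta):
    """LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - theta_i."""
    return math.dist(pivots[i], pivots[j]) - U_R[i] - theta[i]

def candidates_for(i, pivots, U_R, theta, s_partitions):
    """Collect S_i: objects that cannot be ruled out as neighbors of P_i^R."""
    S_i = []
    for j, members in s_partitions.items():
        lb = lower_bound(i, j, pivots, U_R, theta)
        S_i.extend(s for s in members if math.dist(s, pivots[j]) >= lb)
    return S_i

pivots = [(0.0, 0.0), (10.0, 0.0)]
U_R = {0: 3.0}
theta = {0: 4.0}  # assumed bound on the kNN distance for partition 0 of R
s_partitions = {0: [(1.0, 0.0)], 1: [(8.0, 0.0), (12.0, 0.0), (6.0, 0.0)]}
S_0 = candidates_for(0, pivots, U_R, theta, s_partitions)
```

Here (8, 0) and (12, 0) are dropped because every object of the R partition lies within 3 of pivot (0, 0), so both are farther than θ = 4 from all of them, while (6, 0) can be as close as 3 and must be kept.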

Page 10: Efficient Processing of k Nearest Neighbor Joins using MapReduce

Second MapReduce Job

• In this way, objects in each partition of R and their potential k nearest neighbors will be sent to the same reducer. By parsing the key-value pair (k2, v2), the reducer can derive the partition $P_i^R$ and the subset $S_i$ that consists of $P_{j_1}^S, \ldots, P_{j_M}^S$.

• $\forall r \in P_i^R$, in order to reduce the number of distance computations, we first sort the partitions of $S_i$ by the distances from their pivots to pivot $p_i$ in ascending order.

※ Compute $\theta_i \leftarrow \max_{\forall s \in KNN(P_i^R, S)} |ub(s, P_i^R)|$

※ Refine $\theta_i$ (but I think this step is useless).
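A minimal sketch of the reducer-side scan for one object r: the partitions of S_i are visited in ascending order of pivot distance to p_i, and a heap keeps the k best neighbors seen so far. The function name `reducer_knn` and the input layout are hypothetical, and the pruning rules are deliberately left out here.

```python
import heapq
import math

def reducer_knn(r, p_i, s_partitions, k):
    """s_partitions: list of (pivot p_j, objects of P_j^S) that make up S_i."""
    # Visit partitions whose pivots are closest to p_i first.
    ordered = sorted(s_partitions, key=lambda part: math.dist(part[0], p_i))
    best = []  # max-heap via negated distances: (-distance, object)
    for _, members in ordered:
        for s in members:
            d = math.dist(r, s)
            if len(best) < k:
                heapq.heappush(best, (-d, s))
            elif d < -best[0][0]:          # closer than the current k-th neighbor
                heapq.heapreplace(best, (-d, s))
    return [s for _, s in sorted(best, reverse=True)]  # nearest first

s_partitions = [((10.0, 0.0), [(8.0, 0.0), (6.0, 0.0)]),
                ((0.0, 0.0), [(1.5, 0.0), (-2.0, 0.0)])]
neighbors = reducer_knn((1.0, 0.0), (0.0, 0.0), s_partitions, k=2)
```

Visiting nearby partitions first tends to shrink the current k-th distance early, which is what lets a bound such as θ prune later partitions.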

Page 11: Efficient Processing of k Nearest Neighbor Joins using MapReduce

Second MapReduce Job

• Define $d(o, HP(p_i, p_j)) = \dfrac{|o, p_i|^2 - |o, p_j|^2}{2 \cdot |p_i, p_j|}$

if $d(o, HP(p_i, p_j)) > \theta$, then $\forall q \in P_i^R$: $|o, q| > \theta$

for $o \in P_i^O$, $|q, o| \le \theta$ only if $\max\{L(P_i^O),\ |p_i, q| - \theta\} \le |p_i, o| \le \min\{U(P_i^O),\ |p_i, q| + \theta\}$
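The two pruning tests on this slide can be written as small helper functions. The names `hyperplane_distance` and `may_be_within` are hypothetical, and Euclidean distance is assumed. The second test is only a necessary condition: a `True` result means the exact distance still has to be computed.

```python
import math

def hyperplane_distance(o, p_i, p_j):
    """d(o, HP(p_i, p_j)) for an object o that lies in the partition of p_j."""
    return (math.dist(o, p_i) ** 2 - math.dist(o, p_j) ** 2) / (2 * math.dist(p_i, p_j))

def may_be_within(o, q, p_i, L_i, U_i, theta):
    """False means |q, o| > theta is certain; o belongs to the partition of p_i."""
    d_po = math.dist(p_i, o)
    d_pq = math.dist(p_i, q)
    return max(L_i, d_pq - theta) <= d_po <= min(U_i, d_pq + theta)

# o = (9, 0) belongs to pivot (10, 0); the hyperplane between the pivots is x = 5.
d_hp = hyperplane_distance((9.0, 0.0), (0.0, 0.0), (10.0, 0.0))

# o = (3, 0) belongs to pivot (0, 0), whose partition has L = 1 and U = 4.
far = may_be_within((3.0, 0.0), (0.0, 8.0), (0.0, 0.0), 1.0, 4.0, theta=2.0)
near = may_be_within((3.0, 0.0), (4.0, 0.0), (0.0, 0.0), 1.0, 4.0, theta=2.0)
```

Both tests only need distances to pivots, so they can discard an object before its distance to the query object is computed.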

Page 12: Efficient Processing of k Nearest Neighbor Joins using MapReduce

MINIMIZING REPLICATION OF S

• $s$ is replicated only if $|s, p_j| \ge LB(P_j^S, P_i^R)$ ⇒ to replicate fewer objects, we want a large $LB(P_j^S, P_i^R)$ while keeping $|s, p_j|$ small

⇒ split the dataset at a finer granularity, so that the bound on the kNN distances of all objects in each partition of R becomes tighter.

• $R = \bigcup_{1 \le i \le N} G_i$, $G_i \cap G_j = \emptyset$, $i \ne j$.

$s$ is assigned to $S_i$ only if $|s, p_j| \ge LB(P_j^S, G_i)$,

where $LB(P_j^S, G_i) = \min_{\forall P_i^R \in G_i} LB(P_j^S, P_i^R)$

$RP(S) = \sum_{\forall G_i} \sum_{\forall P_j^S} |\{s \mid s \in P_j^S \wedge |s, p_j| \ge LB(P_j^S, G_i)\}|$
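The replication count RP(S) defined on this slide can be evaluated directly once the per-partition quantities are known. In this sketch a group is a list of partition indices of R, and `s_pivot_dists` maps each partition index j of S to the distances |s, p_j| of its objects; these names, the layout and the numbers are assumptions for illustration.

```python
import math

def group_lower_bound(j, group, pivots, U_R, theta):
    """LB(P_j^S, G_i): the smallest LB(P_j^S, P_i^R) over partitions in the group."""
    return min(math.dist(pivots[i], pivots[j]) - U_R[i] - theta[i] for i in group)

def replication(groups, pivots, U_R, theta, s_pivot_dists):
    """RP(S): total number of (object, group) assignments."""
    total = 0
    for group in groups:
        for j, dists in s_pivot_dists.items():
            lb = group_lower_bound(j, group, pivots, U_R, theta)
            total += sum(1 for d in dists if d >= lb)
    return total

pivots = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
U_R = {0: 1.0, 1: 1.0, 2: 1.0}
theta = {0: 2.0, 1: 2.0, 2: 2.0}
s_pivot_dists = {0: [0.5, 3.0], 1: [1.0], 2: [2.0, 8.0]}
rp = replication([[0, 1], [2]], pivots, U_R, theta, s_pivot_dists)
```

Because the group bound is a minimum over its members, adding a partition to a group can only lower LB and therefore can only increase the number of replicated objects.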

Page 13: Efficient Processing of k Nearest Neighbor Joins using MapReduce

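MINIMIZING REPLICATION OF S

• Geometric Grouping

• Greedy Grouping: add to $G_i$ the partition $P_j^R$ that minimizes $RP(S, G_i \cup \{P_j^R\}) - RP(S, G_i)$

but computing this exactly is rather costly, so it is approximated at partition granularity: if $\exists s \in P_j^S$ with $|s, p_j| \ge LB(P_j^S, G_i)$, the whole partition $P_j^S$ is counted as replicated

$RP(S, G_i) \approx \bigcup_{\forall P_j^S \subset S} \{P_j^S \mid LB(P_j^S, G_i) \le U(P_j^S)\}$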
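One greedy step under the partition-level approximation can be sketched as follows: for each unassigned partition of R, estimate how many extra objects of S would be replicated if it joined the group, then pick the cheapest. The function names, the input dictionaries (`U_S` for U(P_j^S), `size_S` for partition sizes) and the numbers are hypothetical; the slide does not say how groups are seeded or balanced, so that part is omitted.

```python
import math

def approx_replicated(group, pivots, U_R, theta, U_S):
    """Indices j of S partitions counted as replicated: LB(P_j^S, G) <= U(P_j^S)."""
    out = set()
    for j in U_S:
        lb = min(math.dist(pivots[i], pivots[j]) - U_R[i] - theta[i] for i in group)
        if lb <= U_S[j]:
            out.add(j)
    return out

def greedy_pick(group, unassigned, pivots, U_R, theta, U_S, size_S):
    """Choose the R partition whose addition adds the fewest replicated objects."""
    before = approx_replicated(group, pivots, U_R, theta, U_S)
    def cost(candidate):
        after = approx_replicated(group + [candidate], pivots, U_R, theta, U_S)
        return sum(size_S[j] for j in after - before)
    return min(unassigned, key=cost)

pivots = [(0.0, 0.0), (4.0, 0.0), (50.0, 0.0)]
U_R = {0: 1.0, 1: 1.0, 2: 1.0}
theta = {0: 2.0, 1: 2.0, 2: 2.0}
U_S = {0: 2.0, 1: 2.0, 2: 2.0}
size_S = {0: 100, 1: 100, 2: 100}
choice = greedy_pick([0], [1, 2], pivots, U_R, theta, U_S, size_S)
```

In this example the nearby partition 1 adds no new S partitions to the group, while the distant partition 2 would pull in 100 more objects, so the greedy step prefers partition 1.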

Page 14: Efficient Processing of k Nearest Neighbor Joins using MapReduce

EXPERIMENTAL EVALUATION

Page 15: Efficient Processing of k Nearest Neighbor Joins using MapReduce

EXPERIMENTAL EVALUATION

Page 16: Efficient Processing of k Nearest Neighbor Joins using MapReduce

EXPERIMENTAL EVALUATION

Page 17: Efficient Processing of k Nearest Neighbor Joins using MapReduce

EXPERIMENTAL EVALUATION

Page 18: Efficient Processing of k Nearest Neighbor Joins using MapReduce

EXPERIMENTAL EVALUATION

Page 19: Efficient Processing of k Nearest Neighbor Joins using MapReduce

The End! Thanks