
Intelligent Similarity Joins for Big Data Integration

Mian Wang, Tiezheng Nie, Derong Shen, Yue Kou, Ge Yu

College of Information Science and Engineering Northeastern University

Shenyang, China [email protected], {shenderong, nietiezheng, kouyue, yuge}@ise.neu.edu.cn

Abstract—With the increasing amount of data, record linkage has become a challenge for big data integration. Similarity join is an efficient approach to record linkage, but it can hardly be achieved in a single-node environment. In this paper, we propose a framework based on MapReduce for set similarity join. The techniques of the framework improve efficiency from two aspects: reducing candidate pairs and load balancing. To reduce candidate pairs, we propose algorithms that combine multiple filtering principles, including the length filter, prefix filter, and position filter. The load-balancing techniques address skewed data and decrease the replication transfer volume. Experimental results on real datasets show that our approaches achieve a speed-up over previous algorithms on big data.

Keywords—similarity join; MapReduce; prefix filter; load balance

I. INTRODUCTION

Similarity join is one of the effective approaches for linking different records that express the same entity, from the same or different data sources. It can be applied to data integration and many other applications, such as plagiarism checking, duplicate webpage detection, document clustering, and social network user identification. However, with the increasing scale of data in applications, traditional centralized approaches have limitations in memory and processing capability.

In similarity joins, the similarities between records are determined by similarity functions. Different applications require different similarity functions depending on the data features, and the filter algorithms should be designed based on the similarity algorithms. This paper focuses on set similarity join problems for textual similarity with token-based algorithms. In existing studies, the process of similarity join mainly contains two phases: generating candidate pairs and verifying candidate pairs. In the candidate generation phase, the goal is to filter out record pairs that cannot match, which minimizes the number of candidate record pairs whose similarities must be calculated. Therefore, current works rely on effectively filtering the candidate pairs [2,3,4,5] to improve the performance of similarity join. The verification phase then calculates the similarity of each candidate pair and compares it with the given threshold. However, to handle large-scale data, similarity joins need to be executed in parallel. Recently, several techniques have been proposed for similarity join based on the MapReduce framework, which provides the ability to process massive data in a distributed manner. Existing approaches

for similarity join focus on algorithms that filter record pairs, separate them into different blocks, and distribute them to multiple nodes for processing. But the performance always suffers from data skew, which decreases the benefit of distributed processing.

In this paper, we propose a framework based on MapReduce to process set similarity joins efficiently. The contributions of this paper are as follows:

• We extend the prefix filter and propose a new prefix-based algorithm that can efficiently filter candidate pairs and be applied in the MapReduce framework.

• To address data skew, we propose a task-estimation-based load-balancing model, which reduces the large variance in single-task execution times caused by imbalanced task distribution in the candidate validation phase, and reduces the transmission of replications.

• Extensive experimental results on real datasets demonstrate the efficiency of our proposed approaches.

The rest of the paper is organized as follows: Section 2 introduces related works. Section 3 presents the preliminaries. Sections 4 and 5 present our proposed algorithms, which integrate the improved prefix filter with other filtering principles, and our load-balancing algorithms. The implementation of the framework based on MapReduce is presented in Section 6. We present our experimental results in Section 7. Section 8 concludes the paper.

II. RELATED WORKS

Set Similarity Join (SSJoin) is defined as follows: Given two datasets R and S, for any r from R and any s from S, if the overlap of R.r and S.s is higher than the given threshold, then the tuple (r, s) is returned as a result [3,4].

For textual data, token-based algorithms are widely used for calculating the similarity of records, since a string can be mapped into a set of tokens by using the q-gram or tokenization method. Thus existing works on similarity join focus on algorithms based on the overlap of two token sets [4], such as the Jaccard algorithm.

There are two popular methods for solving set similarity problems: Locality Sensitive Hashing (LSH) and prefix filtering. LSH is appropriate for joins with a lower threshold [5,6], while prefix filtering is superior at higher thresholds in the exact case. Considering that few pairs among all candidate pairs satisfy a higher threshold, recent research works [3,4,7,8,9] focus on decreasing the number of candidate pairs to reduce the overall computational complexity.


The length filter and prefix hash algorithm proposed in [4] are effective in reducing the large number of candidate pairs. In addition, the prefix position information introduced in [8] filters out false candidate pairs that have passed the length filter. [9] proposed an optimization that further reduces the number of candidate pairs to be verified by utilizing min-prefix information. In [10], prefix lengths are selected adaptively for different types of strings, which is highly general but takes much time in selecting the prefix length. Besides, to reduce network traffic, [3] mapped records into vectors using hash signatures. This method avoids statistical analysis of the data and decreases the number of candidate pairs, but it does not take advantage of position information.

The MapReduce framework has been applied to improve the runtime efficiency of record linkage on big data. A MapReduce implementation of set similarity join proposed in [11] includes three stages: computing data statistics, outputting RID pairs of similar records, and generating the actual pairs of joined records. For large documents, [12] investigated the virtual node method. Two load-balancing methods have been put forward for entity resolution, of which block split is simpler than pair range, but both consider the Cartesian product [14]. [13] presented a balanced cost model and grouped data according to it.

III. PRELIMINARIES

A. The MapReduce Model

Figure 1. The data flow of the MapReduce model (input splits on datanodes → map tasks → intermediate files on local disks → reduce tasks → output files on datanodes; scheduling by the JobTracker and Namenode)

MapReduce [1,2] is a framework for parallel computing over large-scale data. It provides users with two intuitive functions, map and reduce, to handle complicated computation. Users specify a Map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.

Figure 1 shows the data flow of the MapReduce model. First, the input splits from the DFS are assigned to the map tasks. The Map function takes an input pair and produces a set of intermediate key-value pairs. The MapReduce library then groups all intermediate values associated with the same key I and passes them to the Reduce function. The Reduce function accepts an intermediate key and a set of values for that key, merges these values into a smaller set, iterates over the set, and produces key-value pairs as the output. Finally, MapReduce writes the generated key-value pairs back to the file system.
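To make the model concrete, the following Python sketch (our own illustration, not the paper's code) emulates a single MapReduce job for token counting; run_job stands in for the framework's shuffle and scheduling.

    from collections import defaultdict

    def map_fn(record_id, text):
        # Map: emit an intermediate (token, 1) pair for every token.
        for token in text.split():
            yield (token, 1)

    def reduce_fn(token, counts):
        # Reduce: merge all values that share the same intermediate key.
        yield (token, sum(counts))

    def run_job(records, map_fn, reduce_fn):
        # Emulates the framework: map phase, shuffle (group by key), reduce phase.
        groups = defaultdict(list)
        for rid, text in records:
            for key, value in map_fn(rid, text):
                groups[key].append(value)
        output = []
        for key, values in groups.items():
            output.extend(reduce_fn(key, values))
        return output

    print(run_job([(1, "a b a"), (2, "b c")], map_fn, reduce_fn))
    # -> [('a', 2), ('b', 2), ('c', 1)]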

B. Positional Filter

Let S and R be the token sets of two strings s and r. The Jaccard similarity of s and r is computed by the following formula:

SimJ(s, r) = |S ∩ R| / |S ∪ R|    (1)

The positional filters are proposed based on the Jaccard similarity. Suppose Ls and Lr are the sizes of records s and r respectively, θ is the similarity threshold, and prefix(s) denotes the prefix of s. Note that Ls < Lr and that prefix(s) consists of tokens 1 to ⌊(1−θ)·Ls⌋+1 of S. According to the prefix filter principle, for an arbitrary pair (s, r), if SimJ(s, r) ≥ θ, then the prefixes of s and r must share at least one common token. Set S is mapped into different buckets based on the token x at position i of the prefix, as (x, i, length); the same is done for R. This ensures that pairs with common tokens are mapped into at least one common bucket.
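As a small illustration of these definitions (a sketch under our own naming, assuming record tokens are already sorted by global token frequency):

    import math

    def jaccard(s, r):
        # Formula (1): |S ∩ R| / |S ∪ R| over the two token sets.
        s, r = set(s), set(r)
        return len(s & r) / len(s | r)

    def prefix(tokens, theta):
        # The prefix consists of the first floor((1 - theta) * Ls) + 1 tokens.
        return tokens[: int(math.floor((1 - theta) * len(tokens))) + 1]

    s = ["x", "a", "b", "c", "d"]   # already in global token order
    print(prefix(s, 0.8))           # ['x', 'a']: the tokens that key the buckets
    print(jaccard(["a", "b", "c"], ["a", "b", "d"]))   # 0.5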

Figure 2. The principle of the PPjoin and PPjoin+ filters (common token at position i in s and position j in r; suffix lengths p and q)

The prefix position filter [8], named PPjoin, was proposed for more efficient similarity joins. As shown in Figure 2, according to Formula 1, the size of the union of the two sets is at least Ls + j − 1, which means there are at least j − 1 tokens in r that are not in s. The size of the intersection is at most Ls − i + 1, since that is the number of tokens of s that could possibly be in r. Since the similarity must be at least θ, the pair must satisfy (Ls − i + 1)/(Ls + j − 1) ≥ θ.

Suppose r has suffix length p and s has suffix length q. When p ≥ q, the maximum size of the intersection is Ls − i + 1 − (p − q). Since Ls = i + p, this expression equals q + 1. The minimum size of the union is Ls + j − 1, so the pair must satisfy Formula 2. When p < q, in the same way, it must satisfy Formula 3.

(q + 1)/(Ls + j − 1) ≥ θ    (2)

(Ls − i + 1)/(i + j + q − 1) ≥ θ    (3)

The above facts show that transforming the inequality SimJ(s, r) ≥ θ is effective in pruning the candidate pairs. Therefore we propose a way to find an estimated expression that is closer to the real value and thereby obtain better filter performance.
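As a concrete check, the basic positional condition above can be sketched as follows (our own illustration; variable names follow the text):

    def positional_filter(ls, i, j, theta):
        # Upper bound on the intersection: Ls - i + 1 tokens of s can still match;
        # lower bound on the union: Ls + j - 1. Keep the pair only if the
        # optimistic ratio can still reach the threshold.
        return (ls - i + 1) / (ls + j - 1) >= theta

    # A pair whose common prefix token sits late in both prefixes is pruned:
    print(positional_filter(ls=10, i=3, j=4, theta=0.8))   # False: 8/13 < 0.8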

IV. IMPROVED SIMILARITY JOIN ALGORITHM BASED ON PREFIX FILTERING

A. The All Prefix Filter (appjoin)

Considering that any two strings may have more than one common prefix token, an efficient strategy is to compute each pair in only one bucket, based on one common token, and filter it out of the other buckets. As shown in Figure 3, prefix(s) and prefix(r) have two common tokens A and B, and the position of B is ahead of the position of A in both s and r. Therefore the computation for this pair must already have been handled in the B bucket, so the computation in the A bucket is redundant.

Figure 3. Multiple common symbols in prefixes

To filter candidate pairs, existing works have proposed the position filter principle, in which indexed position information is used to prune the candidate pairs to be verified. However, it does not know whether a common token is the first one in the prefixes, and its estimate of the intersection size of (s, r) is not close to the real value, so it leaves a larger number of candidate pairs to be verified. In this paper, we use the indexed symbol information to filter candidate pairs.

Figure 4. The appjoin filter principle (common token at positions i and j; prefix remainders sc and rc; suffixes p and q)

Figure 4 shows that s and r have the common token A in their prefixes, at position i in s and position j in r. If the positions (i, j) are not the first common key in the prefixes, we filter the pair out. Obviously, the more accurately we estimate the ratio of the intersection to the union of the two strings, the better the filter performs. So we propose the estimated similarity, denoted SimE(s, r), which has the property SimE(s, r) ≥ SimJ(s, r). Therefore, if SimE(s, r) < θ, then SimJ must be lower than θ and the pair (s, r) is filtered out. Different from the position filter, we make a full scan of the prefixes of each candidate pair to calculate the number c of common keys in the two prefixes, prefix(s) and prefix(r). Assume p is the suffix length of r and q is the suffix length of s. The maximum size of the intersection is c + min(p, q) and the minimum size of the union is Ls + Lr − min(p, q) − c. If p < q, the filtering condition is (c + p)/(Lr + |prefix(s)| − c) ≥ θ, and if p ≥ q, the filtering condition is (c + q)/(Ls + |prefix(r)| − c) ≥ θ. So we define SimE(s, r) as Formula 4:

SimE(s, r) = (c + Min(p, q)) / (Max(Ls, Lr) + Min(|prefix(s)|, |prefix(r)|) − c)    (4)

It is obvious that SimE(s, r) is not smaller than SimJ(s, r). We define c = |sc ∩ rc| as the number of common symbols of the two prefixes. So if the candidate pair is above the threshold, c + p is not bigger than sc + p and c + q is not bigger than rc + q. In the same way, Ls + |prefix(r)| − c is not smaller than Ls + j − 1, since |prefix(r)| − c ≥ j − 1, and Lr + |prefix(s)| − c ≥ i + Lr. That means SimP(s, r) ≥ SimE(s, r) ≥ SimJ(s, r), in which SimP(s, r) is the estimated similarity of the position filter. Our SimE(s, r) is closer to the real similarity value SimJ(s, r), so the appjoin filter principle is more efficient than the ppjoin and ppjoin+ filters.
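A minimal sketch of the appjoin check under these definitions (our own code; p and q are the suffix lengths of r and s, and prefixes are token lists):

    def sim_e(prefix_s, prefix_r, ls, lr, p, q):
        # c: common tokens found by a full scan of both prefixes.
        c = len(set(prefix_s) & set(prefix_r))
        # Formula (4): optimistic intersection c + min(p, q) over the
        # pessimistic union max(Ls, Lr) + min(|prefix(s)|, |prefix(r)|) - c.
        return (c + min(p, q)) / (max(ls, lr) + min(len(prefix_s), len(prefix_r)) - c)

    def appjoin_keep(prefix_s, prefix_r, ls, lr, p, q, theta):
        # Since SimE(s, r) >= SimJ(s, r), SimE < theta safely prunes the pair.
        return sim_e(prefix_s, prefix_r, ls, lr, p, q) >= theta

    print(appjoin_keep(["x", "a"], ["x", "b"], ls=5, lr=6, p=4, q=3, theta=0.8))
    # False: SimE = (1 + 3) / (6 + 2 - 1) ≈ 0.57 < 0.8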

B. N+ Prefix Filter

Figure 5. The N+ prefix filter (prefix extended by N additional tokens)

Given the threshold θ, from the prefix filtering principle the prefix of set s has length t = Ls − ⌈Ls·θ⌉ + 1. In Figure 5, suppose there is only one common token A in the prefixes; then the suffix p in s and the suffix q in r must be entirely identical if the pair's similarity is at least θ. If we regard n additional tokens as an extension of the prefix, the number of common prefix tokens must be at least n + 1 by the pigeonhole principle. But extending the prefix would lead to more replication of records. We therefore propose the N+ prefix filter (nppjoin), in which we keep the original prefix length but record another n keys in a bitmap. Like the appjoin filter, it scans the number of common prefix tokens of the candidate pair (denoted c). We then intersect the two bitmaps and obtain the number of common ones among the n keys (denoted t). Comparing c + t with n + 1, we can filter the pair out if c + t is smaller than n + 1.

For the threshold θ and similarity SimJ, if SimJ is smaller than θ, we have n·SimJ/θ < n. This expression shows that the larger n is, the less likely a candidate pair whose similarity is below θ is to have more than n + 1 common tokens. So a proper n is required to balance the cost of computation against the filtering effect.
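The bitmap test can be sketched like this (a hypothetical implementation using Python integers as bit sets; the hash and bitmap size are illustrative, not from the paper):

    def signature(extension_tokens, bits=64):
        # Hash each of the n extension tokens into one bitmap (an int as a bit set).
        bm = 0
        for t in extension_tokens:
            bm |= 1 << (hash(t) % bits)
        return bm

    def nppjoin_keep(c, sig_s, sig_r, n):
        # t: number of common set bits among the n extension tokens.
        t = bin(sig_s & sig_r).count("1")
        # Pigeonhole argument: a true match over the extended prefix needs
        # at least n + 1 common tokens, so c + t < n + 1 prunes the pair.
        return c + t >= n + 1

    sig_s, sig_r = signature(["u", "v"]), signature(["u", "w"])
    print(nppjoin_keep(c=1, sig_s=sig_s, sig_r=sig_r, n=2))   # usually False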

C. The Pipeline Filter Framework

As is known, the probability that an arbitrary element of one set also exists in another set is equal to the similarity of the two sets. Given the threshold θ, if SimJ is smaller than θ, we have n·SimJ/θ < n, which again suggests selecting a proper N that balances the cost of computation against the filtering effect.

Our pipeline filter framework mainly focuses on computational complexity. It integrates the length filter, the ppjoin filter, the ppjoin+ filter, and the appjoin and nppjoin filters, in ascending order of computational complexity. As soon as a candidate pair fails one of the filters, the pair is filtered out; a sketch follows below.

Here we hash the symbols of the prefix and the additional N symbols into one bitmap whose size is 5 to 10 times the prefix length, to avoid hash collisions.
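A sketch of the pipeline itself (our own illustration; only the cheap length filter is spelled out, and the other predicates would be plugged in as shown in the previous sections):

    def pipeline_filter(pair, filters):
        # Filters run in ascending order of computational cost; all()
        # short-circuits, so the first failing filter discards the pair.
        return all(check(pair) for check in filters)

    def length_ok(pair):
        s, r, theta = pair
        # Length filter: a much shorter set can never reach the threshold.
        return min(len(s), len(r)) >= theta * max(len(s), len(r))

    # The positional, appjoin, and nppjoin checks would be appended here
    # in the same ascending-cost order.
    pair = (["a", "b"], ["a", "b", "c", "d", "e"], 0.8)
    print(pipeline_filter(pair, [length_ok]))   # False: 2 < 0.8 * 5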

V. ALGORITHM FOR LOAD BALANCE

The performance of MapReduce jobs depends on two factors: the amount of message transmission and the data skew. With our filters, the transmission of replications can be reduced. This section presents a load-balancing technique that partitions skewed data based on estimated task costs. The cost model of load balancing is given as follows:

The cost of a given task in generating candidate pairs is defined as T, where k is the number of keys and v is the number of values sharing the same key. Suppose the number of reducers is m, the numbers of keys received by the reducers are k1 to km, and the number of values corresponding to key k is x_k. Then the computation cost in reducer i is T_i = Σ_{k=1}^{k_i} x_k², and thus T = Σ_{i=1}^{m} T_i.

Based on the above expressions, the difference in computation cost between two reducers is:

T_i − T_j = Σ_{k=1}^{k_i} x_k² − Σ_{k=1}^{k_j} x_k²    (5)

Figure 6. Example of task schedule balancing

Obviously, the cost difference will be significantly large if all the high-frequency keys are hashed to the same reducer. Figure 6 displays an example in which the tasks in the reducers are skewed. So we give a load-balancing technique that uses a knapsack-style formulation to estimate the potential cost.

It is hard to guarantee that tasks are uniformly distributed to reducers. Since the overall cost depends on the largest task, this problem is similar to the task-scheduling problem, which aims to minimize the completion time of all tasks executed in parallel.

If the task amount received by each reducer is almost the same, the reducers are balanced. The algorithm first sorts the task list in descending order of cost and then initializes the set of task slots. Iterating over the sorted task list, we always take the unassigned task with the largest amount and put it into the task slot with the smallest amount, as sketched below. In this way, all the prefix keys are grouped to m reducers with almost the same task amount.
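A sketch of this greedy assignment (the classic longest-processing-time heuristic; task costs here are assumed to be the squared value counts from the cost model):

    import heapq

    def assign_tasks(task_costs, m):
        # task_costs: {prefix_key: estimated cost, e.g. x_k ** 2}.
        # Sort tasks by descending cost, then repeatedly drop the next task
        # into the currently least-loaded of the m reducer slots (a min-heap).
        slots = [(0, i, []) for i in range(m)]   # (load, slot id, assigned keys)
        heapq.heapify(slots)
        for key, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
            load, i, keys = heapq.heappop(slots)
            keys.append(key)
            heapq.heappush(slots, (load + cost, i, keys))
        return {key: i for _, i, keys in slots for key in keys}

    print(assign_tasks({"a": 9, "b": 4, "c": 4, "d": 1}, m=2))
    # -> {'a': 0, 'b': 1, 'c': 1, 'd': 1}: loads 9 and 9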

VI. IMPLEMENTATION OF FRAMEWORK

In this section, we present the implementation of our framework based on MapReduce. Our framework includes four stages: information statistics, regrouping the hash keys, generating the verified pairs, and linking records.

A. Information Statistics

The information collection stage is responsible for token frequency statistics. It uses two MapReduce phases. The first phase computes the frequency of each token and the second phase sorts the tokens by frequency. The first map uses a split function to divide each record into tokens and produce (token, 1) pairs. The reducer shuffles the records and counts the occurrences of each token. In the next MapReduce phase, the map function swaps key and value, and the reduce function shuffles the keys in increasing order and swaps key and value again.
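In local Python terms (a sketch of what the two Hadoop phases compute, not the job code itself):

    from collections import Counter

    records = ["efficient similarity join", "similarity join on big data"]

    # Phase 1: split each record into tokens and sum the counts per token.
    freq = Counter(tok for rec in records for tok in rec.split())

    # Phase 2: swap (token, count) to (count, token), sort by ascending count,
    # and swap back; the result is the global token order used for prefixes.
    token_order = [tok for cnt, tok in sorted((c, t) for t, c in freq.items())]
    print(token_order)   # rarest tokens first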

B. Regroup the Hash Keys

Here we use one MapReduce phase to regroup the hash keys for balancing the skewed data. In the setup function, we load the token frequency table into a sorted set.

Map: read all the items and transform each into an ordered token set, which is saved as the input of the next stage that generates verified pairs. Iterate over the prefix and produce a tuple (token, 1) as the output. Here we also produce a (bitmap, record set) pair as the signature of the record's prefix, so that appjoin can further reduce the computation cost.

Reduce: count the number of records that contain the prefix key k; the task amount is the square of this count, as in the cost model above. There are two methods to generate the regroup number of each key. If the prefixes of all the records fit in memory, we use only one reducer and record each key and its task amount in a hash table. The other method uses c reducers: if we ensure that the tasks in the m buckets of each reducer are balanced, then after merging all reducers' m buckets into the shared m buckets, the total task amounts of the buckets differ little.

In the close function, we sort the keys by task amount in descending order. Based on the task assignment algorithm above, we output (key, H(key)) for each key in the hash table, in which H() is the hash function. In the next step of the similarity join, we load this mapping as the partition function to conduct the map-side partitioning.
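The (key, H(key)) table then drives the map-side partitioning in the join job; a hypothetical sketch under our own naming:

    class GroupPartitioner:
        def __init__(self, key_to_group, num_reducers):
            # key_to_group: the (key, H(key)) mapping produced by the regroup job.
            self.key_to_group = key_to_group
            self.num_reducers = num_reducers

        def partition(self, key):
            # Balanced keys go to their precomputed group; unseen (rare) keys
            # fall back to an ordinary hash partition.
            group = self.key_to_group.get(key)
            if group is not None:
                return group % self.num_reducers
            return hash(key) % self.num_reducers

    p = GroupPartitioner({"the": 0, "of": 1}, num_reducers=4)
    print(p.partition("the"), p.partition("zebra"))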

The above work guarantees the validity of the data grouping; meanwhile, it regroups the keys and balances the reduce tasks.

C. Generate the Verified Pairs

As shown in Figure 7, the token frequency table used in the Map stage transforms the original records into ordered sets. We can partition records with the same prefix into one Reduce task by extracting the prefix of the ordered set and hashing the records by prefix. Any two elements of the hash result in each Reduce task may be an eligible candidate pair. The Reduce task generates candidate pairs by invoking the pipeline filtering framework, and outputs verified candidate pairs and their similarities in the form (r, s, SimJ(r, s)).

Taking load balance into account, the map's partitioning is based not on the key but on the key's group, and the total number of groups is the assigned reducer number.
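The reduce side can be sketched as pairwise verification within each prefix bucket (our own illustration; the filters argument stands for the pipeline of Section IV):

    from itertools import combinations

    def jaccard(s, r):
        s, r = set(s), set(r)
        return len(s & r) / len(s | r)

    def reduce_bucket(records, theta, filters=()):
        # Every pair of records hashed to the same prefix bucket is a potential
        # candidate; the filter pipeline prunes most pairs before the exact
        # similarity is computed and compared with the threshold.
        for r, s in combinations(records, 2):
            if all(check((r, s, theta)) for check in filters):
                sim = jaccard(r, s)
                if sim >= theta:
                    yield (r, s, sim)   # output form: (r, s, SimJ(r, s))

    bucket = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c", "d"]]
    for out in reduce_bucket(bucket, theta=0.7):
        print(out)   # two pairs with similarity 0.75 survive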

Figure 7. Candidate generation and verification of pairs (map → reduce → output)

D. Link Records

We take the merged results (r, s, SimJ(r, s)) with the threshold θ, together with the original records, as the input of this part. It outputs all pairs of records whose similarity is larger than θ. This work adopts the two methods proposed in [4].

VII. EXPERIMENTS

A. Experimental Setup

The experimental evaluation runs on a variable-size Hadoop cluster containing 20 slave nodes and one master node. Each node is configured with 8 GB of memory and a 3.1 GHz four-core Core i3-2100 processor, runs Red Hat Enterprise Linux 6.1 (Santiago), and is connected by Gigabit Ethernet.

We use DBLP and CiteSeerX data as the experimental datasets, consisting of almost 1.2 million records from DBLP and 1.5 million records from CiteSeerX. We also create larger datasets with benchmark methods to enlarge the scale of the original CiteSeerX dataset.

B. Experimental Results

1) Varying the data size: In this section, we use various sizes of experimental data to compare the performance of our methods with ppjoin. First, we compare the time cost of ppjoin without load balance and ppjoin with our load-balancing algorithm on datasets of varying size. We set the similarity threshold to 0.8. Figure 8 shows the time costs on the different data sizes: our load-balancing method achieves an average 20% speed-up for similarity joins, and it continues to perform well as the data size increases.

Figure 8. Time costs of ppjoin with and without load balance (x-axis: data size in millions; y-axis: time cost in seconds)

Figure 9. Time of the balance stage and the generate-verified-pairs stage (x-axis: data size in millions; y-axis: cost time in seconds)

Figure 10. Cost time comparison among different algorithms (imbalance, balance, impro; x-axis: data size in millions; y-axis: elapsed time in seconds)

Figure 11. Number of filtered pairs among different algorithms

We also compare the time costs of the different stages of the similarity join, separating the process into a load balance stage and a generate-verified-pairs stage. Figure 9 shows the experimental results: as the data size increases, the balance stage accounts for a decreasing share of the total cost time, and the growth of its cost is far less than that of the generate-pairs stage.

Then we evaluate the performance of our appjoin filter. Figure 10 compares three methods: ppjoin without load balance (imbalance), ppjoin with load balance (balance), and appjoin (impro). The elapsed time is used to evaluate the performance of these methods on the same data set, with the threshold set to 0.8. From the results, we find that our appjoin method performs better on large data sets. Figure 11 illustrates why: a large data set generates many more pairs, and our filter method is more effective at producing fewer candidate pairs. So our method saves more time on large data sets.

2) Varying the threshold: We also evaluate the performance of the different methods under different similarity thresholds. We vary the threshold from 0.65 to 0.95 with a fixed data size of 7 million records. The experiment includes the basic ppjoin (imbalance), the balanced ppjoin (balance), and our appjoin method (impro) on the same data set. Figure 12 shows the cost times of the three methods at different thresholds. These experimental results show that our appjoin method performs better than both the basic ppjoin and the ppjoin with load balance, especially at lower thresholds. The differences between the three methods grow more significant as the threshold decreases. Our appjoin method performs well at low thresholds because it can filter more candidate pairs than ppjoin.

Figure 12. Cost time comparison at different thresholds (imbalance, balance, impro; x-axis: threshold from 0.6 to 1; y-axis: elapsed time)

Figure 13. Cost of the N+ prefix selection algorithm (left panel: threshold = 0.8; right panel: threshold = 0.65; x-axis: N; y-axis: cost time)

3) N+ prefix: In this experiment, we evaluate the performance of our nppjoin method and look for a good N to improve its performance. We set the threshold to 0.65 and 0.8 on a data set of 5 million records, and vary N from 0 to 3 to test the elapsed time of nppjoin; N = 0 means the basic appjoin. Figure 13 shows the experimental results, in which there is an inflection point at N = 2. The results indicate that setting N = 2 achieves the lowest cost time, although the best value also depends on the threshold. This shows that our nppjoin method scales well and performs better than the other methods.

VIII. CONCLUSION

In this paper, we propose similarity join techniques for big data integration. Based on these techniques, we construct a filter framework that combines multiple filter algorithms and uses MapReduce to process similarity join tasks in parallel. Moreover, an estimation-based load-balancing algorithm is proposed for the problem of data skew and is also integrated into our framework. The experimental results show that our methods enhance the efficiency of candidate filtering and reduce both network traffic and task cost time.

In the future, we will further study other similarity functions, as well as the conditions suited to lower thresholds and large documents in the MapReduce framework.

ACKNOWLEDGMENT

This research was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007, 61003060, the National Research Foundation for the Doctoral Program of Higher Education of China Grant No. 20120042110028, the MOE-Intel Special Fund of Information Technology under Grant No. MOE-INTEL-2012-06, the Fundamental Research Funds for the Central Universities of China under Grant No. N100704001.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. OSDI, 2004, pp. 137-150.
[3] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," in Proc. VLDB, 2006, pp. 918-929.
[4] R. J. Bayardo, Y. Ma, and R. Srikant, "Scaling up all pairs similarity search," in Proc. WWW, 2007, pp. 131-140.
[5] A. Andoni and P. Indyk, "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions," in Proc. FOCS, 2006, pp. 459-468.
[6] H. Lee, R. T. Ng, and K. Shim, "Similarity Join Size Estimation using Locality Sensitive Hashing," presented at CoRR, 2011.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik, "A Primitive Operator for Similarity Joins in Data Cleaning," in Proc. ICDE, 2006.
[8] C. Xiao, W. Wang, X. Lin, and J. X. Yu, "Efficient similarity joins for near duplicate detection," in Proc. WWW, 2008, pp. 131-140.
[9] L. Ribeiro and T. Härder, "Efficient Set Similarity Joins Using Min-prefixes," in Proc. ADBIS, 2009, pp. 88-102.
[10] J. Wang, G. Li, and J. Feng, "Can we beat the prefix filtering?: an adaptive framework for similarity join and search," in Proc. SIGMOD Conference, 2012, pp. 85-96.
[11] R. Vernica, M. J. Carey, and C. Li, "Efficient parallel set-similarity joins using MapReduce," in Proc. SIGMOD Conference, 2010, pp. 495-506.
[12] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, "MapDupReducer: detecting near duplicates over massive datasets," in Proc. SIGMOD Conference, 2010, pp. 1119-1122.
[13] B. Gufler, N. Augsten, A. Reiser, and A. Kemper, "Handling Data Skew in MapReduce," in Proc. CLOSER, 2011, pp. 574-583.
[14] L. Kolb, A. Thor, and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," in Proc. ICDE, 2012, pp. 618-629.
