
When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

Aneesh Sharma, Twitter, Inc., [email protected]
C. Seshadhri, University of California, Santa Cruz, CA, [email protected]
Ashish Goel∗, Stanford University, [email protected]

ABSTRACT
Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user’s network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold τ. In contrast to previous work where τ is assumed to be quite close to 1, we focus on recommendation applications where τ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small τ. To the best of our knowledge, there is no practical solution for computing all user pairs with, say τ = 0.2, on large social networks, even using the power of distributed algorithms.

Our work directly addresses this challenge by introducing a new algorithm — WHIMP — that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the “wedge-sampling” approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP’s scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.

∗Research supported in part by NSF Award 1447697.

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License. WWW 2017, April 3–7, 2017, Perth, Australia. ACM 978-1-4503-4913-0/17/04. http://dx.doi.org/10.1145/3038912.3052633


Keywords
Similarity search, nearest neighbor search, matrix multiplication, wedge sampling

1. INTRODUCTION
Similarity search among a collection of objects is one of the oldest and most fundamental operations in social networks, web mining, data analysis and machine learning. It is hard to overstate the importance of this problem: it is a basic building block of personalization and recommendation systems [12, 21], link prediction [1, 30], and is found to be immensely useful in many personalization and mining tasks on social networks and databases [41, 33, 42, 35]. Indeed, the list of applications is so broad that we do not attempt to survey them here and instead refer to recommender systems and data mining textbooks that cover applications in diverse areas such as collaborative filtering [35, 29].

Given the vast amount of literature on similarity search, many forms of the problem have been studied in various applications. In this work we focus on the social and information networks setting where we can define pairwise similarity among users on the network based on having common connections. This definition of similarity is particularly relevant in the context of information networks where users generate and consume content (Twitter, blogging networks, web networks, etc.). In particular, the directionality of these information networks provides a natural measure that is sometimes called “production” similarity: two users are defined to be similar to each other if they are followed by a common set of users. Thus, “closeness” is based on common followers, indicating that users who consume content from one of these users may be interested in the other “producer” as well.¹ The most common measure of closeness or similarity here is cosine similarity. This notion of cosine similarity is widely used for applications [1, 26, 5, 30] and is in particular a fundamental component of the Who To Follow recommendation system at Twitter [20, 21].

Our focus in this work is on the computational aspect of this important and well studied problem. In particular, despite the large amount of attention given to the problem, there remain significant scalability challenges with computing all-pairs similarity on massive information networks. A unique aspect of this problem on these large networks is that cosine similarity values that are traditionally considered “small” can be quite meaningful for social and information network applications — it may be quite useful and indicative to find users sharing a cosine similarity value of 0.2, as we will illustrate in our experimental results. With this particular note in mind, we move on to describing our problem formally and discuss the challenges involved in solving it at scale.

¹One can also define “consumption” similarity, where users are similar if they follow the same set of users.

1.1 Problem Statement
As mentioned earlier, the similarity search problem is relevant to a wide variety of areas, and hence there are several languages for describing similarity: on sets, on a graph, and also on matrix columns. We’ll attempt to provide the different views where possible, but we largely stick to the matrix notation in our work. Given two sets S and T, their cosine similarity is $|S \cap T| / \sqrt{|S| \cdot |T|}$, which is a normalized intersection size. It is instructive to define this geometrically, by representing a set as an incidence vector. Given two (typically non-negative) vectors $\vec{v}_1$ and $\vec{v}_2$, the cosine similarity is $\vec{v}_1 \cdot \vec{v}_2 / (\|\vec{v}_1\|_2 \|\vec{v}_2\|_2)$. This is the cosine of the angle between the vectors; hence the name.

In our context, the corresponding ~v for some user is the incidence vector of followers. In other words, the ith coordinate of ~v is 1 if user i follows the user, and 0 otherwise. Abusing notation, let us denote the users by their corresponding vectors, and we use the terms “user” and “vector” interchangeably. Thus, we can define our problem as follows.

Problem 1.1. Given a set of vectors ~v1, ~v2, . . . , ~vm in (R⁺)^d, and threshold τ > 0: determine all pairs (i, j) such that ~vi · ~vj ≥ τ.

Equivalently, call ~vj τ-similar to ~vi if ~vi · ~vj ≥ τ. For every vector ~vi, find all vectors τ-similar to ~vi.

In terms of (approximate) information retrieval, the latter formulation represents a more stringent criterion. Instead of good accuracy in finding similar pairs overall, we demand high accuracy for most (if not all) users. This is crucial for any recommendation system, since we need good results for most users. More generally, we want good results at all “scales”, meaning accurate results for users with small followings as well as big followings. Observe that the sparsity of ~v is inversely related to the indegree (following size) of the user, and represents their popularity. Recommendation needs to be of high quality both for newer users (high sparsity ~v) and celebrities (low sparsity ~v).

We can mathematically express Problem 1.1 in matrix terms as follows. Let A be the d × n matrix where the ith column is ~vi/‖~vi‖2. We wish to find all large entries in the Gramian matrix AᵀA (the matrix of all similarities). It is convenient to think of the input as A. Note that the non-zeros of A correspond exactly to the underlying social network edges.
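To make the matrix formulation concrete, here is a small toy sketch (our own illustration with made-up data and dense numpy arrays, not the paper’s code): build A from follower incidence vectors, normalize the columns, and read the τ-similar pairs off the Gramian AᵀA. This brute-force product is exactly what becomes infeasible at billions of non-zeros.

```python
import numpy as np

# Toy follower incidence matrix: rows index potential followers (dimension d),
# columns index users. Column j is the follower vector of user j.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

# Normalize each column to unit L2 norm so that (A^T A)[i, j] is the cosine similarity.
A_hat = A / np.linalg.norm(A, axis=0)

gram = A_hat.T @ A_hat                              # all pairwise cosine similarities
tau = 0.2
i_idx, j_idx = np.where(np.triu(gram, k=1) >= tau)  # pairs (i, j) with i < j above the threshold
for i, j in zip(i_idx, j_idx):
    print(f"users ({i}, {j}) are {tau}-similar: cosine = {gram[i, j]:.3f}")
```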

1.2 Challenges
Scale: The most obvious challenge for practical applications is the sheer size of the matrix A. For example, the Twitter recommendation systems deal with a matrix with hundreds of millions of dimensions, and the number of non-zeros is in many tens of billions. Partitioning techniques become extremely challenging for these sizes and clearly we need distributed algorithms for Problem 1.1.

The similarity value τ: An equally important (but less discussed) problem is the relevant setting of threshold τ in Problem 1.1. In large similarity search applications, a cosine similarity (between users) of, say, 0.2 is highly significant. Roughly speaking, if user u is 0.2-similar to v, then 20% of u’s followers also follow v. For recommendation, this is an immensely strong signal. But for many similarity techniques based on hashing/projection, this is too small [24, 3, 37, 39, 38, 4]. Techniques based on LSH and projection usually detect similarities above 0.8 or higher. Mathematically, these methods have storage complexities that scale as 1/τ², and are simply infeasible when τ is (say) 0.2.

We stress that this point does not receive much attention. But in our view, it is the primary bottleneck behind the lack of methods to solve Problem 1.1 for many real applications.

The practical challenge: This leads us to the main impetus behind our work.

For the matrix A corresponding to the Twitter network with O(100B) edges, find (as many as possible) entries in AᵀA above 0.2. For a majority of users, reliably find many 0.2-similar users.

1.3 Why previous approaches fail
The challenge described above exemplifies where big data forces an algorithmic rethink. Matrix multiplication and variants thereof have been well-studied in the literature, but no solution works for such a large matrix. If a matrix A has 100 billion non-zeroes, it takes upwards of 1TB just to store the entries. This is more than an order of magnitude larger than the storage of a commodity machine in a cluster. Any approach of partitioning A into submatrices cannot scale.

There are highly tuned libraries like Intel MKL’s BLAS [25] and CSparse [13]. But any sparse matrix multiplication routine [22, 2] will generate all triples (i, i′, j) such that Ai,jAi′,j ≠ 0. In our example, this turns out to be more than 100 trillion triples. This is infeasible even for a large industrial-strength cluster.

Starting from the work of Drineas, Kannan, and Mahoney, there is a rich line of results on approximate matrix multiplication by subsampling rows of the matrix [15, 17, 16, 36, 7, 32, 34, 23]. These methods generate approximate products according to Frobenius norm using outer products of columns. This would result in dense matrices, which is clearly infeasible at our scale. In any case, the large entries (of interest) contribute to a small part of the output.

Why communication matters: There are upper bounds on the total communication even in industrial-strength Hadoop clusters, and in this work we consider our upper bound to be about 100TB². A promising approach for Problem 1.1 is the wedge sampling method of Cohen-Lewis [11], which was further developed in the diamond sampling work of Ballard et al. [6]. The idea is to set up a linear-sized data structure that can sample indices of entries proportional to value (or values squared in [6]). One then generates many samples, and picks the index pairs that occur most frequently. These samples can be generated in a distributed manner, as shown by Zadeh and Goel [43].

The problem is in the final communication. The sampling calculations show that about $10\tau^{-1}\sum_{i,j} \vec{a}_i \cdot \vec{a}_j$ samples are required to get all entries above τ with high probability. These samples must be collected/shuffled to actually find the large entries. In our setting, this is upwards of 1000TB of communication.

²Note that if each reducer were to process 5GB of data, processing 100TB would require 20,000 reducers.


Figure 1: Results of WHIMP for τ = 0.2: the left plot shows the precision-recall curves for finding all entries in AᵀA above 0.2 (with respect to a sampled evaluation set). The other plots give the cumulative distribution, over all sampled users, of the minimum of precision and recall. We observe that for an overwhelming majority of users, WHIMP reliably finds more than 70% of 0.2-similar users.

Locality Sensitive Hashing: In the normalized setting, maximizing dot product is equivalent to minimizing distance. Thus, Problem 1.1 can be cast in terms of finding all pairs of points within some distance threshold. A powerful technique for this problem is Locality Sensitive Hashing (LSH) [24, 18, 3]. Recent results by Shrivastava and Li use LSH ideas for the MIPS problem [37, 39, 38]. This essentially involves carefully chosen low dimensional projections with a reverse index for fast lookup. It is well-known that LSH requires building hashes that are a few orders of magnitude more than the data size. Furthermore, in our setting, we need to make hundreds of millions of queries, which involve constructing all the hashes, and shuffling them to find the near neighbors. Again, this hits the communication bottleneck.

1.4 Results
We design WHIMP, a distributed algorithm to solve Problem 1.1. We specifically describe and implement WHIMP in the MapReduce model, since it is the most appropriate for our applications.

Theoretical analysis: WHIMP is a novel combination of wedge sampling ideas from Cohen-Lewis [11] with random projection-based hashes first described by Charikar [10]. We give a detailed theoretical analysis of WHIMP and prove that it has near optimal communication/shuffle cost, with a computation cost comparable to the state-of-the-art. To the best of our knowledge, it is the first algorithm to have such strong guarantees on the communication cost. WHIMP has a provable precision and recall guarantee, in that it outputs all large entries, and does not output small entries.

Empirical demonstration: We implement WHIMP on Hadoop and test it on a collection of large networks. Our largest network is flock, the Twitter network with tens of billions of non-zeroes. We present results in Fig. 1. For evaluation, we compute ground truth for a stratified sample of users (details in §7). All empirical results are with respect to this evaluation. Observe the high quality of precision and recall for τ = 0.2. For all instances other than flock (all have non-zeros between 1B and 100B), the accuracy is near perfect. For flock, WHIMP dominates a precision-recall point of (0.7, 0.7), a significant advance for Problem 1.1 at this scale.

Even more impressive is the distribution of precision-recall values. For each user in the evaluation sample (and for a specific setting of parameters in WHIMP), we compute the precision and recall of 0.2-similar vectors. We plot the cumulative histogram of the minimum of the precision and recall (a lower bound on any F-score) for two of the largest datasets, eu (a web network) and flock. For more than 75% of the users, we get a precision and recall of more than 0.7 (for eu, the results are even better). Thus, we are able to meet our challenge of getting accurate results on an overwhelming majority of users. (We note that in recent advances in using hashing techniques [37, 39, 38], precision-recall curves rarely dominate the point (0.4, 0.4).)

2. PROBLEM FORMULATION
Recall that the problem of finding similar users is a special case of Problem 1.1. Since our results extend to the more general setting, in our presentation we focus on the AᵀB formulation for given matrices A and B. The set of columns of A is the index set [m], denoted by CA. Similarly, the set of columns of B, indexed by [n], is denoted by CB. The dimensions of the underlying space are indexed by D = [d]. We use a1, . . . to denote columns of A, b1, . . . for columns in B, and r1, r2, . . . for dimensions. For convenience, we assume wlog that n ≥ m.

We denote rows and columns of A by Ad,∗ and A∗,a respectively, and similar notation is used for B. We also use nnz(·) to denote the number of non-zeroes in a matrix. For any matrix M and σ ∈ R, the thresholded matrix [M]≥σ keeps all values in M that are at least σ. In other words, ([M]≥σ)i,j = Mi,j if Mi,j ≥ σ and zero otherwise. We use ‖M‖1 to denote the entrywise 1-norm. We will assume that ‖AᵀB‖1 ≥ 1. This is a minor technical assumption, and one that always holds in matrix products of interest.

We can naturally represent A as a (weighted) bipartite graph GA = (CA, D, EA), where an edge (a, d) is present iff Ad,a ≠ 0. Analogously, we can define the bipartite graph GB. Their union GA ∪ GB is a tripartite graph denoted by GA,B. For any vertex v in GA,B, we use N(v) for the neighborhood of v.

Finally, we will assume the existence of a Gaussian random number generator g. Given a binary string x as input, g(x) ∼ N(0, 1). We assume that all values of g are independent.
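In practice, such a generator can be realized by seeding a pseudorandom generator with a hash of the input string, so that every machine evaluating g on the same string obtains the same draw. A minimal sketch (our own illustration; the hashing scheme and function name are assumptions, not the paper’s implementation):

```python
import hashlib
import numpy as np

def g(x: str) -> float:
    """Deterministic 'random' standard Gaussian keyed by the string x.

    Every call with the same x returns the same N(0, 1) draw, so distributed
    workers agree on g(<r, i>) without sharing any state.
    """
    seed = int.from_bytes(hashlib.sha256(x.encode("utf-8")).digest()[:8], "big")
    return float(np.random.default_rng(seed).standard_normal())

# Example: the projection coefficient for dimension r = 7 and hash index i = 3.
print(g("7,3"), g("7,3"))  # identical values on every machine
```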

The computational model: While our implementation (and focus) is on MapReduce, it is convenient to think of an abstract distributed computational model that is also a close proxy for MapReduce in our setting [19]. This allows for a transparent explanation of the computation and communication cost.

Let each vertex in GA,B be associated with a different processor. Communication only occurs along edges of GA,B, and occurs synchronously. Each round of communication involves a single communication over all edges of GA,B.

3. HIGH LEVEL DESCRIPTION
The starting point of our WHIMP algorithm (Wedges and Hashes in Matrix Product) is the wedge sampling method of Cohen-Lewis. A distributed MapReduce implementation of wedge sampling (for the special case of A = B) was given by Zadeh-Goel [43]. In effect, the main distributed step in wedge sampling is the following. For each dimension r ∈ [d] (independently and in parallel), we construct two distributions on the index sets of vectors in A and B. We then choose a set of independent samples for each of these distributions, to get pairs (a, b), where a indexes a vector in A, and b indexes a vector in B. These are the “candidates” for high similarity. If enough candidates are generated, we are guaranteed that the candidates that occur with high enough frequency are exactly the large entries of AᵀB.
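As a concrete (single-machine, non-distributed) illustration of this step, the sketch below draws, for every dimension r, wedge samples whose endpoints are chosen proportional to the rth row of A and of B, and counts co-occurring pairs; it is our own simplification in numpy rather than the DISCO/MapReduce implementation.

```python
import numpy as np
from collections import Counter

def wedge_sample_candidates(A, B, s, rng):
    """Candidate pairs (a, b) for large entries of A^T B via wedge sampling.

    For each dimension r, roughly s * ||A_r||_1 * ||B_r||_1 pairs are drawn, with
    a ~ A[r, :] / ||A_r||_1 and b ~ B[r, :] / ||B_r||_1 independently. A pair's
    expected count is s * (A^T B)[a, b], so frequent pairs are the large entries.
    Assumes non-negative matrices, as in the follower-incidence setting.
    """
    counts = Counter()
    for r in range(A.shape[0]):
        wa, wb = A[r, :], B[r, :]
        na, nb = wa.sum(), wb.sum()
        if na == 0 or nb == 0:
            continue
        k = int(round(s * na * nb))      # number of wedge samples at dimension r
        if k == 0:
            continue
        a_smp = rng.choice(len(wa), size=k, p=wa / na)
        b_smp = rng.choice(len(wb), size=k, p=wb / nb)
        counts.update(zip(a_smp.tolist(), b_smp.tolist()))
    return counts

rng = np.random.default_rng(0)
A = (rng.random((50, 20)) < 0.2).astype(float)   # toy sparse non-negative matrix
counts = wedge_sample_candidates(A, A, s=50, rng=rng)
print(counts.most_common(5))                     # heavy hitters ~ largest entries of A^T A
```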

The primary bottleneck with this approach is that the vast majority of pairs generated occur infrequently, but dominate the total shuffle cost. In particular, most non-zero entries in AᵀB are very small, but in total, these entries account for most of ‖AᵀB‖1. Thus, these low similarity value pairs dominate the output of wedge sampling.

Our main idea is to construct an efficient, local “approximate oracle” for deciding if A∗,a · B∗,b ≥ τ. This is achieved by adapting the well-known SimHash projection scheme of Charikar [10]. For every vector ~v in our input, we construct a compact logarithmic sized hash h(~v). By the properties of SimHash, it is (approximately) possible to determine if ~u · ~v ≥ τ given only the hashes h(~u) and h(~v). These hashes can be constructed by random projections using near-linear communication. Now, each machine that processes dimension r (of the wedge sampling algorithm) collects every hash h(A∗,a) for each a such that Ar,a ≠ 0 (similarly for B). This adds an extra near-linear communication step, but all these hashes can now be stored locally in the machine computing wedge samples for dimension r. This machine runs the same wedge sampling procedure as before, but now when it generates a candidate (a, b), it first checks if A∗,a · B∗,b ≥ τ using the SimHash oracle. The pair is emitted iff this check passes. Thus, the communication of this step is just the desired output, since very few low similarity pairs are emitted. The total CPU/computation cost remains the same as for the Cohen-Lewis algorithm.
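The SimHash oracle itself is easy to sketch (again an illustration in plain numpy with made-up vectors, not the production code): an ℓ-bit signature of signs of random projections per column, plus the column norms, suffices to estimate a dot product from a Hamming distance.

```python
import numpy as np

def simhash_signatures(M, ell, rng):
    """ell-bit SimHash signatures for the columns of M: signs of random projections."""
    R = rng.standard_normal((ell, M.shape[0]))   # ell random hyperplanes
    return (R @ M) >= 0                          # boolean array, shape (ell, num_columns)

def estimate_dot(h_u, h_v, norm_u, norm_v, ell):
    """Estimate u . v as ||u|| ||v|| cos(pi * Hamming(h_u, h_v) / ell)."""
    hamming = np.count_nonzero(h_u != h_v)
    return norm_u * norm_v * np.cos(np.pi * hamming / ell)

rng = np.random.default_rng(1)
d, ell = 500, 2048
u = rng.random(d) * (rng.random(d) < 0.1)        # sparse toy vectors
v = 0.5 * u + rng.random(d) * (rng.random(d) < 0.1)
H = simhash_signatures(np.stack([u, v], axis=1), ell, rng)
est = estimate_dot(H[:, 0], H[:, 1], np.linalg.norm(u), np.linalg.norm(v), ell)
print(f"true u.v = {u @ v:.3f}, SimHash estimate = {est:.3f}")
```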

4. THE SIGNIFICANCE OF THE MAIN THEOREM
Before describing the actual algorithm, we state the main theorem and briefly describe its significance.

Theorem 4.1. Given input matrices A, B and threshold τ, denote the set of index pairs output by the WHIMP algorithm by S. Then, fixing parameters $\ell = \lceil c\tau^{-2}\log n \rceil$, $s = c(\log n)/\tau$, and $\sigma = \tau/2$ for a sufficiently large constant c, the WHIMP algorithm has the following properties with probability at least $1 - 1/n^2$:
• [Recall:] If $(A^TB)_{a,b} \geq \tau$, then $(a, b)$ is output.
• [Precision:] If $(a, b)$ is output, then $(A^TB)_{a,b} \geq \tau/4$.
• The total computation cost is $O(\tau^{-1}\|A^TB\|_1 \log n + \tau^{-2}(\mathrm{nnz}(A) + \mathrm{nnz}(B))\log n)$.
• The total communication cost is $O((\tau^{-1}\log n)\|[A^TB]_{\geq \tau/4}\|_1 + \mathrm{nnz}(A) + \mathrm{nnz}(B) + \tau^{-2}(m + n)\log n)$.

As labeled above, the first two items are recall and precision. The first term in the total computation cost is exactly that of vanilla wedge sampling, $\tau^{-1}\|A^TB\|_1\log n$, while the second is an extra near-linear term. The total communication of wedge sampling is also $\tau^{-1}\|A^TB\|_1\log n$. Note that WHIMP has a communication of $\tau^{-1}\|[A^TB]_{\geq\tau/4}\|_1\log n$. Since all entries in AᵀB are at most 1, $\|[A^TB]_{\geq\tau/4}\|_1 \leq \mathrm{nnz}([A^TB]_{\geq\tau/4})$. Thus, the communication of WHIMP is at most $(\tau^{-1}\log n)\,\mathrm{nnz}([A^TB]_{\geq\tau/4})$ plus an additional linear term. The former is (up to the $\tau^{-1}\log n$ term) simply the size of the output, and must be paid by any algorithm that outputs all entries above τ/4. Finally, we emphasize that the constant of 4 is merely a matter of convenience, and can be replaced with any constant (1 + δ).

In summary, Theorem 4.1 asserts that WHIMP has (barring additional near-linear terms) the same computation cost as wedge sampling, with nearly optimal communication cost.

5. THE WHIMP ALGORITHM
The WHIMP algorithm goes through three rounds of communication, each of which is described in detail in Figure 2. The output of WHIMP is a list of triples ((a, b), esta,b), where esta,b is an estimate for (AᵀB)a,b. Abusing notation, we say a pair (a, b) is output if it is part of some triple that is output.

In each round, we have a “Gather” step. The last round has an output operation. These are the communication operations. All other steps are compute operations that are local to the processor involved.

Lemma 5.1. With probability at least $1 - 1/n^6$ over the randomness of WHIMP, for all pairs (a, b), $|\mathrm{est}_{a,b} - A_{*,a}\cdot B_{*,b}| \leq \tau/4$.

Proof. First fix a pair (a, b). We have $\mathrm{est}_{a,b} = \|A_{*,a}\|_2\|B_{*,b}\|_2\cos(\pi\Delta/\ell)$, where $\Delta$ is the Hamming distance between $h_a$ and $h_b$. Note that $h_a[i] = \mathrm{sgn}(\sum_{r\in[d]} g(\langle r, i\rangle)A_{r,a})$. Let $\vec{v}$ be the d-dimensional unit vector with rth entry proportional to $g(\langle r, i\rangle)$. Thus, the rth component is a random (scaled) Gaussian, and $\vec{v}$ is a uniform (Gaussian) random vector on the unit sphere. We can write $h_a[i] = \mathrm{sgn}(\vec{v}\cdot A_{*,a})$ and $h_b[i] = \mathrm{sgn}(\vec{v}\cdot B_{*,b})$. The probability that $h_a[i] \neq h_b[i]$ is exactly the probability that the vectors $A_{*,a}$ and $B_{*,b}$ are on different sides of a randomly chosen hyperplane. By a standard geometric argument [10], if $\theta_{a,b}$ is the angle between the vectors $A_{*,a}$ and $B_{*,b}$, then this probability is $\theta_{a,b}/\pi$.

Define $X_i$ to be the indicator random variable for $h_a[i] \neq h_b[i]$. Note that the Hamming distance $\Delta = \sum_{i\leq\ell} X_i$ and $\mathbf{E}[\Delta] = \ell\theta_{a,b}/\pi$. Applying Hoeffding's inequality,

$$\Pr[|\Delta - \mathbf{E}[\Delta]| \geq \ell\tau/(4\pi\|A_{*,a}\|_2\|B_{*,b}\|_2)] < \exp[-(\ell^2\tau^2/16\pi^2\|A_{*,a}\|_2^2\|B_{*,b}\|_2^2)/2\ell] = \exp(-(c/\tau^2)(\log n)\tau^2/(32\pi^2\|A_{*,a}\|_2^2\|B_{*,b}\|_2^2)) < n^{-8}.$$

Thus, with probability $> 1 - n^{-8}$, $|\pi\Delta/\ell - \theta_{a,b}| \leq \tau/(4\|A_{*,a}\|_2\|B_{*,b}\|_2)$. By the Mean Value Theorem, $|\cos(\pi\Delta/\ell) - \cos(\theta_{a,b})| \leq \tau/(4\|A_{*,a}\|_2\|B_{*,b}\|_2)$. Multiplying by $\|A_{*,a}\|_2\|B_{*,b}\|_2$, we get $|\mathrm{est}_{a,b} - A_{*,a}\cdot B_{*,b}| \leq \tau/4$. We take a union bound over all $\Theta(mn)$ pairs (a, b) to complete the proof.


WHIMP Round 1 (Hash Computation):
1. For each a ∈ CA:
   (a) Gather column A∗,a.
   (b) Compute ‖A∗,a‖2.
   (c) Compute the bit array ha of length ℓ as follows: ha[i] = sgn(∑_{r∈[d]} g(⟨r, i⟩)Ar,a).
2. Perform all the above operations for all b ∈ CB.

WHIMP Round 2 (Weight Computation):
1. For all r ∈ [d]:
   (a) Gather rows Ar,∗ and Br,∗.
   (b) Compute ‖Ar,∗‖1 and construct a data structure that samples a ∈ CA proportional to Ar,a/‖Ar,∗‖1. Call this distribution Ar.
   (c) Similarly compute ‖Br,∗‖1 and the sampling data structure for Br.

WHIMP Round 3 (Candidate Generation):
1. For all r ∈ [d]:
   (a) Gather: for all a, b ∈ N(r), the hashes ha, hb and the norms ‖A∗,a‖2, ‖B∗,b‖2.
   (b) Repeat s‖Ar,∗‖1‖Br,∗‖1 times (with s set to c(log n)/τ):
      i. Generate a ∼ Ar.
      ii. Generate b ∼ Br.
      iii. Denote the Hamming distance between the bit arrays ha and hb by ∆.
      iv. Compute esta,b = ‖A∗,a‖2‖B∗,b‖2 cos(π∆/ℓ).
      v. If esta,b ≥ σ, emit ((a, b), esta,b).

Figure 2: The WHIMP (Wedges And Hashes In Matrix Product) algorithm
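A condensed single-machine sketch of Round 3 for the case A = B (illustrative only; plain numpy stands in for the Scalding/MapReduce implementation, and the signatures and column norms are assumed to have been produced as in Round 1): wedge samples are drawn per dimension exactly as in vanilla wedge sampling, but a pair is emitted only if its SimHash estimate clears the filter σ.

```python
import numpy as np

def whimp_round3(A, h, col_norms, s, sigma, ell, rng):
    """Candidate generation with the SimHash filter (Step 1(b) of Round 3, A = B).

    h: ell-bit signatures of the columns of A (boolean array, ell x n);
    col_norms: L2 norms of the columns, both as Round 1 would produce them.
    Only pairs whose estimated dot product clears sigma are emitted, so the
    final output (rather than every wedge sample) dominates the communication.
    """
    emitted = {}
    for r in range(A.shape[0]):
        w = A[r, :]
        total = w.sum()
        if total == 0:
            continue
        k = int(round(s * total * total))        # wedge samples at dimension r
        if k == 0:
            continue
        a_smp = rng.choice(len(w), size=k, p=w / total)
        b_smp = rng.choice(len(w), size=k, p=w / total)
        for a, b in zip(a_smp.tolist(), b_smp.tolist()):
            hamming = np.count_nonzero(h[:, a] != h[:, b])
            est = col_norms[a] * col_norms[b] * np.cos(np.pi * hamming / ell)
            if est >= sigma:                     # Step 1(b)v: the SimHash filter
                emitted[(a, b)] = est
    return emitted

# Toy run with unit-norm columns, so the estimates approximate cosine similarities.
rng = np.random.default_rng(2)
raw = (rng.random((200, 30)) < 0.1).astype(float)
raw[0, :] = 1.0                                  # avoid all-zero columns in the toy data
A = raw / np.linalg.norm(raw, axis=0)
ell = 1024
h = (rng.standard_normal((ell, A.shape[0])) @ A) >= 0
tau = 0.2
pairs = whimp_round3(A, h, np.linalg.norm(A, axis=0), s=10 / tau, sigma=tau, ell=ell, rng=rng)
print(sorted(pairs.items(), key=lambda kv: -kv[1])[:5])
```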


We denote a pair (a, b) as generated if it is generated in Steps 1(b)i and 1(b)ii during some iteration. Note that such a pair is actually output iff esta,b is sufficiently large.

Lemma 5.2. With probability at least $1 - 1/n^3$ over the randomness of WHIMP, the following hold. The total number of triples output is $O((\tau^{-1}\log n)\max(\|[A^TB]_{\geq\tau/4}\|_1, 1))$. Furthermore, if $A_{*,a}\cdot B_{*,b} \geq \tau$, (a, b) is output.

Proof. Let $X_{a,b,r,i}$ be the indicator random variable for (a, b) being generated in the ith iteration for dimension r. The total number of times that (a, b) is generated is exactly $X_{a,b} = \sum_{r,i} X_{a,b,r,i}$. By the definition of the distributions $A_r$ and $B_r$, $\mathbf{E}[X_{a,b,r,i}] = \frac{A_{r,a}}{\|A_{r,*}\|_1}\cdot\frac{B_{r,b}}{\|B_{r,*}\|_1}$. Denote by $k_r = c(\log n)\|A_{r,*}\|_1\|B_{r,*}\|_1/\tau$ the number of samples at dimension r. By linearity of expectation,

$$\mathbf{E}[X_{a,b}] = \sum_{r\leq d}\sum_{i\leq k_r}\frac{A_{r,a}B_{r,b}}{\|A_{r,*}\|_1\|B_{r,*}\|_1} = \sum_{r\leq d}\frac{c(\log n)\|A_{r,*}\|_1\|B_{r,*}\|_1}{\tau}\cdot\frac{A_{r,a}B_{r,b}}{\|A_{r,*}\|_1\|B_{r,*}\|_1} = c\tau^{-1}\log n\sum_{r\leq d}A_{r,a}B_{r,b} = c\,A_{*,a}\cdot B_{*,b}\,\tau^{-1}\log n.$$

Note that the random choices in creating the hashes are independent of those generating the candidates. By Lemma 5.1, with probability $> 1 - n^{-6}$, the following event (call it $\mathcal{E}$) holds: for all (a, b), $|\mathrm{est}_{a,b} - A_{*,a}\cdot B_{*,b}| \leq \tau/4$. Conditioned on $\mathcal{E}$, if $A_{*,a}\cdot B_{*,b} < \tau/4$, then $\mathrm{est}_{a,b} < \tau/2$ and (a, b) is not output. Let $S = \{(a, b) \mid A_{*,a}\cdot B_{*,b} \geq \tau/4\}$. Let the number of triples output be Y. Conditioned on $\mathcal{E}$, $Y \leq \sum_{(a,b)\in S} X_{a,b}$. Denote the latter random variable by Z. By linearity of expectation and independence of $X_{a,b}$ from $\mathcal{E}$,

$$\mathbf{E}_{\mathcal{E}}[Z] = \sum_{(a,b)\in S}\mathbf{E}_{\mathcal{E}}[X_{a,b}] = c\tau^{-1}\log n\sum_{(a,b)\in S}A_{*,a}\cdot B_{*,b} = c\tau^{-1}\log n\,\|[A^TB]_{\geq\tau/4}\|_1.$$

Furthermore, Z is the sum of Bernoulli random variables. Thus, we can apply a standard upper-Chernoff bound to the sum above, and deduce that

$$\Pr_{\mathcal{E}}[Z \geq 4c\tau^{-1}\log n\max(\|[A^TB]_{\geq\tau/4}\|_1, 1)] \leq \exp(-4c\tau^{-1}\log n) \leq n^{-10}.$$

Thus, conditioned on $\mathcal{E}$, the probability that Y is greater than $4c\tau^{-1}\log n\max(\|[A^TB]_{\geq\tau/4}\|_1, 1)$ is at most $n^{-10}$. Since $\Pr[\bar{\mathcal{E}}] \leq n^{-6}$, with probability at least $1 - n^{-5}$, the number of triples output is $O((\tau^{-1}\log n)\max(\|[A^TB]_{\geq\tau/4}\|_1, 1))$. This proves the first part.

Now for the second part. Fix a pair (a, b) such that $A_{*,a}\cdot B_{*,b} \geq \tau$. We have $\mathbf{E}[X_{a,b}] \geq c\log n$. By a standard lower tail Chernoff bound, $\Pr[X_{a,b} \leq (c/2)\log n] \leq n^{-10}$. Thus, (a, b) is guaranteed to be generated. If event $\mathcal{E}$ happens, then $\mathrm{est}_{a,b} \geq 3\tau/4$. By a union bound over the complement events, with probability at least $1 - n^{-5}$, (a, b) will be generated and output. We complete the proof by taking a union bound over all mn pairs (a, b).

The first two statements of Theorem 4.1 hold by Lemma 5.1 and Lemma 5.2, and the remaining two statements follow by a straightforward calculation. Hence we skip the remainder of the proof.

6. IMPLEMENTING WHIMP
We implement and deploy WHIMP in Hadoop [40], which is an open source implementation of MapReduce [14]. Our experiments were run on Twitter’s production Hadoop cluster, aspects of which have been described before in [31, 27, 20]. In this section, we discuss our WHIMP parameter choices and some engineering details. As explained earlier, all our experiments have A = B.

It is helpful to discuss the quality measures. Suppose we wish to find all entries above some threshold τ > 0. Typical choices are in the range [0.1, 0.5] (cosine values are rarely higher in our applications). The support of [AᵀA]≥τ is denoted by Hτ, and this is the set of pairs that we wish to find. Let the output of WHIMP be S. The natural aim is to maximize both precision and recall.
• Precision: the fraction of output that is “correct”, |Hτ ∩ S|/|S|.
• Recall: the fraction of Hτ that is output, |Hτ ∩ S|/|Hτ|.
There are three parameter choices in WHIMP, as described in Theorem 4.1. We show practical settings for these parameters.


Figure 3: Precision-recall curves

ℓ, the sketch length: This appears in Step 1c of Round 1. Larger ℓ implies better accuracy for the SimHash sketch, and thereby leads to higher precision and recall. On the other hand, the communication in Round 3 requires emitting all sketches, and is thus linear in ℓ.

A rough rule of thumb is as follows: we wish to distinguish A∗,a · A∗,b = 0 from A∗,a · A∗,b > τ. (Of course, we wish for more, but this argument suffices to give reasonable values for ℓ.) Consider a single bit of SimHash. In the former case, Pr[h(A∗,a) = h(A∗,b)] = 1/2, while in the latter case Pr[h(A∗,a) = h(A∗,b)] = 1 − θa,b/π = 1 − cos⁻¹(A∗,a · A∗,b)/π ≥ 1 − cos⁻¹(τ)/π. It is convenient to express the latter as Pr[h(A∗,a) = h(A∗,b)] ≥ 1/2 + δ, where δ = 1/2 − cos⁻¹(τ)/π.

Standard binomial tail bounds tell us that 1/δ² independent SimHash bits are necessary to distinguish the two cases. For convergence, at least one order of magnitude more samples are required, so ℓ should be around 10/δ². Plugging in some values: for τ = 0.1, δ = 0.03, and ℓ should be 11,000; for τ = 0.2, we get ℓ to be 2,400. In general, the resulting sketch size is around 1 kilobyte.
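The rule of thumb is easy to reproduce numerically (a small check of our own; the constant 10 is the heuristic oversampling factor from the text):

```python
import math

def sketch_length(tau, oversample=10):
    """Rule-of-thumb SimHash sketch length: about oversample / delta^2 bits, where
    delta = 1/2 - arccos(tau)/pi is the per-bit gap in collision probability."""
    delta = 0.5 - math.acos(tau) / math.pi
    return delta, oversample / delta ** 2

for tau in (0.1, 0.2):
    delta, ell = sketch_length(tau)
    print(f"tau = {tau}: delta ~ {delta:.3f}, ell ~ {ell:,.0f} bits")
# Rounding delta to 0.03 and 0.06 recovers the ballpark figures of ~11,000 and
# ~2,400 bits quoted above.
```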

s, the oversampling factor: This parameter appears in Step 1b of Round 3, and determines the number of wedge samples generated. The easiest way to think about s is in terms of vanilla wedge sampling. Going through the calculations, the total number of wedge samples (over the entire procedure) is exactly $s\sum_r \|A_{r,*}\|_1\|A_{r,*}\|_1 = s\|A^TA\|_1$. Fix a pair (a, b) ∈ Hτ, with dot product exactly τ. The probability that a single wedge sample produces (a, b) is $A_{*,a}\cdot A_{*,b}/\|A^TA\|_1 = \tau/\|A^TA\|_1$. Thus, WHIMP generates this pair an expected $\tau/\|A^TA\|_1 \times s\|A^TA\|_1 = \tau s$ times.

The more samples we choose, the higher the likelihood of finding a pair (a, b) ∈ Hτ. On the other hand, observe that pairs in Hτ are generated τs times, and increasing s increases the communication in Round 3. Thus, we require s to be at least 1/τ, and our rule of thumb is 10/τ to get convergence.

σ, the filtering value: This is used in the final operation, Step 1(b)v, and decides which pairs are actually output. The effect of σ is coupled with the accuracy of the SimHash sketch. If the SimHash estimate were perfect, then σ should just be τ. In practice, we modify σ to account for SimHash error. Higher σ imposes a stricter filter and improves precision at the cost of recall. The opposite happens for lower σ. In most runs, we simply set σ = τ. We vary σ to generate precision-recall curves.

7. EXPERIMENTAL SETUP
As mentioned earlier, we run all experiments on Twitter’s Hadoop cluster. All the code for this work was written in Scalding, which is Twitter’s Scala API to Cascading, an open-source framework for building dataflows that can be executed on Hadoop. These are all mature production systems, aspects of which have been discussed in detail elsewhere [31, 27, 20].

Datasets: We choose four large datasets. Two of them, clueweb and eu, are webgraphs. The dataset friendster is a social network, and is available from the Stanford Large Network Dataset Collection [28]. The two webgraphs were obtained from the LAW graph repository [8, 9]. Apart from these public datasets, we also report results on our proprietary dataset, flock, which is the Twitter follow graph.

We interpret the graph as vectors in the following way. For each vertex, we take the incidence vector of the in-neighborhood. Thus, two vertices are similar if they are followed by a similar set of other vertices. This is an extremely important signal for Twitter’s recommendation system [21], our main motivating problem. For consistency, we apply the same viewpoint to all the datasets.

We apply a standard cleaning procedure (for similarity) and remove high out-degrees. In other words, if some vertex v has more than 10K followers (outdegree > 10K), we remove all these edges. (We do not remove the vertex, but rather only its out-edges.) Intuitively, the fact that two vertices are followed by v is not a useful signal for similarity. In flock and friendster, such vertices are typically spammers and should be ignored. For webgraphs, a page linking to more than 10K other pages is probably not useful for similarity measurement.

Dataset      Dimensions (n = d)   Size (nnz)   ‖AᵀA‖₁
friendster   65M                  1.6B         7.2E9
clueweb      978M                 42B          6.8E10
eu           1.1B                 84B          1.9E11
flock        -                    O(100B)      5.1E12

Table 1: Details on Datasets

We give the size of the datasets in Tab. 1. (This is after cleaning, which removes at most 5% of the edges. Exact sizes for flock cannot be revealed but we do report aggregate results where possible.) Since the underlying matrix A is square, n = d. All instances have at least a billion non-zeros. To give a sense of scale, the raw storage of 40B non-zeros (as a list of edges/pairs, each of which is two longs) is roughly half a terabyte. This is beyond the memory of most commodity machines or nodes in a small cluster, underscoring the challenge in designing distributed algorithms.

Figure 4: Per-user precision-recall histograms for τ = 0.4

Parameters: We set the parameters of WHIMP as follows. Our focus is typically on τ > 0.1, though we shall present results for varying τ ∈ [0.1, 0.5]. The sketch length ℓ is 8192 (1KB sketch size); the oversampling factor s is 150; σ is just τ. For getting precision-recall curves, we vary σ, as discussed in §6.

Evaluation: Computing AᵀA exactly is infeasible at these sizes. A natural evaluation would be to pick a random sample of vertices and determine all similar vertices for each vertex in the sample. (In terms of matrices, this involves sampling columns of A to get a thinner matrix B, and then computing AᵀB explicitly.) Then, we look at the output of WHIMP and measure the number of similar pairs (among this sample) it found. An issue with pure uniform sampling is that most vertices tend to be low degree (the columns have high sparsity). In recommendation applications, we care about accurate behavior at all scales.

We perform a stratified sampling of columns to generate ground truth. For integer i, we create a bucket with all vertices whose indegree (vector sparsity) is in the range [10^i, 10^(i+1)). We then uniformly sample 1000 vertices from each bucket to get a stratified sample of vertices/columns. All evaluation is performed with respect to the exact results for this stratified sample.
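A sketch of this stratified sampling (our own illustration with a hypothetical indegree map; the paper does not publish this tooling):

```python
import math
import random
from collections import defaultdict

def stratified_sample(indegrees, per_bucket=1000, seed=0):
    """Sample up to per_bucket vertices from each indegree bucket [10^i, 10^(i+1))."""
    buckets = defaultdict(list)
    for vertex, deg in indegrees.items():
        if deg > 0:
            buckets[int(math.log10(deg))].append(vertex)
    rng = random.Random(seed)
    sample = []
    for _, vertices in sorted(buckets.items()):
        rng.shuffle(vertices)
        sample.extend(vertices[:per_bucket])
    return sample

# Toy usage with a made-up indegree map.
indegrees = {v: 1 + int(10 ** (random.Random(v).random() * 5)) for v in range(100_000)}
evaluation_columns = stratified_sample(indegrees)
print(len(evaluation_columns))
```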

8. EXPERIMENTAL RESULTS
Precision-recall curves: We use thresholds τ of 0.2, 0.4, and 0.6. We compute precision-recall curves for WHIMP on all the datasets, and present the results in Fig. 3. Observe the high quality results on clueweb, eu, and friendster: for τ ≥ 0.4, the results are near perfect. The worst behavior is that of flock, which still dominates a precision and recall of 0.7 in all cases. Thus, WHIMP is near perfect when A has substantially fewer than 100B non-zero entries (as our theory predicts). The extreme size of flock probably requires even larger parameter settings to get near perfect results.

Per-vertex results: In recommendation applications, global precision/recall is less relevant than per-user results. Can we find similar neighbors for most users, or alternately, for how many users can we provide accurate results? This is a more stringent quality metric than just the number of entries in [AᵀA]≥τ obtained.

Figure 5: Split-up of shuffle over various rounds for WHIMP

Dataset      WHIMP (TB)   DISCO est. (TB)   ‖AᵀA‖₁
friendster   4.9          26.2              7.2e+09
clueweb      90.1         247.4             6.8e+10
eu           225.0        691.2             1.9e+11
flock        287.0        18553.7           5.1e+12

Table 2: Total communication/shuffle cost of WHIMP

In the following experiment, we simply set the filtering value σ to be τ. We vary τ in 0.2, 0.4, etc. For each dataset and each vertex in the evaluation sample (generation described in §7), we compute the precision and recall for WHIMP just for the similar vertices of the sample vertex. We focus on the minimum of the precision and recall (this is a lower bound on any Fβ score, and is a conservative measure). The cumulative (over the sample) histogram of the minimum of the precision and recall is plotted in Fig. 4.

Just for clarity, we give an equivalent description in terms of matrices. We compute the (minimum of) precision and recall of entries above τ in a specific (sampled) column of AᵀA. We plot the cumulative histogram over sampled columns.

For space reasons, we only show the results for τ = 0.4 and ignore the smallest dataset, friendster. The results for clueweb and eu are incredibly accurate: for more than 90% of the sample, both precision and recall are above 0.8, regardless of τ. The results for flock are extremely good, but not nearly as accurate. WHIMP gets a precision and recall above 0.7 for at least 75% of the sample. We stress the low values of cosine similarities here: a similarity of 0.2 is well below the values studied in recent LSH-based results [37, 39, 38]. It is well-known that low similarity values are harder to detect, yet WHIMP gets accurate results for an overwhelming majority of the vertices/users.


Shuffle cost of WHIMP: The main impetus behind WHIMP was to get an algorithm with low shuffle cost. Rounds 1 and 2 only shuffle the input data (and a small factor over it), and do not pose a bottleneck. Round 3 has two major shuffling steps.
• Shuffling the sketches: In Step 1a, the sketches are communicated. The total cost is the sum of the sizes of all sketches, which is ℓ·nnz(A).
• Shuffling the candidates that are output: In Step 1(b)v, the candidate large entries are output. There is an important point here that is irrelevant in the theoretical description. We perform a deduplication step to output entries only once. This requires a shuffle step after which the final output is generated.
We split communication into three parts: the sketch shuffle, the candidate shuffle, and the final (deduped) output. The total of all these is presented in Tab. 2. (We stress that this is not shuffled together.) The split-up between the various parts is shown in Fig. 5. Observe that the sketch and candidate shuffle are roughly equal. For friendster and flock, the (deduped) output is itself more than 10% of the total shuffle. This (weakly) justifies the optimality of Theorem 4.1 in these cases, since the total communication is at most an order of magnitude more than the desired output. For the other cases, the output is between 3-5% of the total shuffle.

Comparisons with existing art: No other algorithm works at this scale, and we were not able to deploy anything else for such large datasets. Nonetheless, given the parameters of the datasets, we can mathematically argue against other approaches.
• Wedge sampling of Cohen-Lewis [11], DISCO [43]: Distributed versions of wedge sampling were given by Zadeh and Goel in their DISCO algorithm [43]. But it cannot scale to these sizes. DISCO is equivalent to using Round 2 of WHIMP to set up weights, and then running Round 3 without any filtering step (Step 1(b)v). Then, we would look for all pairs (a, b) that have been emitted sufficiently many times, and make those the final output. In this case, CPU and shuffle costs are basically identical, since any candidate generated is emitted. Consider (a, b) such that A∗,a · A∗,b = τ. By the wedge sampling calculations, s‖AᵀA‖1 wedge samples would generate (a, b) an expected sτ times. We would need to ensure that this is concentrated well, since we finally output pairs generated often enough. In our experience, setting s = 50/τ is the bare minimum to get precision/recall more than 0.8. Note that WHIMP only needs to generate such a wedge sample once, since Step 1(b)v is then guaranteed to output it (assuming SimHash is accurate). But vanilla wedge sampling must generate (a, b) with a frequency close to its expectation. Thus, WHIMP can set s closer to (say) 10/τ, but this is not enough for the convergence of wedge sampling. But all the wedges have to be shuffled, and this leads to 50‖AᵀA‖1/τ wedges being shuffled. Each wedge is two longs (using standard representations), and that gives a ballpark estimate of 800‖AᵀA‖1/τ bytes. We definitely care about τ = 0.2, and WHIMP generates results for this setting (Fig. 3). We compute this value for the various datasets in Tab. 2, and present it as the estimated shuffle cost for DISCO. Observe that it is significantly larger than the total shuffle cost of WHIMP, which is actually split roughly equally into two parts (Fig. 5). The wedge shuffles discussed above are most naturally done in a single round. To shuffle more than 200TB would require a more complex algorithm that splits the wedge samples into various rounds. For eu and flock, the numbers are more than 1000TB, and completely beyond the possibility of engineering. We note that friendster can probably be handled by the DISCO algorithm.

• Locality-Sensitive Hashing [24, 37]: LSH is an important method for nearest neighbor search. Unfortunately, it does not perform well when similarities are low but still significant (say τ = 0.2). Furthermore, it is well-known to require large memory overhead. The basic idea is to hash every vector into a “bucket” using, say, a small (like 8-bit) SimHash sketch. The similarity is explicitly computed on all pairs of vectors in a bucket, i.e. those with the same sketch value. This process is repeated with a large number of hash functions to ensure that most similar pairs are found. Using SimHash, the mathematics says roughly the following. (We refer the reader to important LSH papers for more details [24, 18, 3, 37].) Let the probability of two similar vectors (with cosine similarity above 0.2) having the same SimHash value be denoted P1. Let the corresponding probability for two vectors with similarity zero be P2. By the SimHash calculations of §6, P1 = 1 − cos⁻¹(0.2)/π ≈ 0.56, while P2 = 0.5. This difference measures the “gap” obtained by the SimHash function. The LSH formula basically tells us that the total storage of all the hashes is (at least) $n^{1 + (\log P_1)/(\log P_2)}$ bytes. This comes out to be $n^{1.83}$. Assuming that n is around 1 billion, the total storage is 26K TB. This is astronomically large, and even reducing this by a factor of a hundred is insufficient for feasibility.
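The quoted storage estimate is a quick back-of-the-envelope computation (our own check of the numbers above):

```python
import math

tau = 0.2
p1 = 1 - math.acos(tau) / math.pi   # SimHash collision probability at cosine similarity tau
p2 = 0.5                            # collision probability at similarity zero
rho = math.log(p1) / math.log(p2)   # LSH exponent: storage scales like n^(1 + rho)

n = 1e9
print(f"rho = {rho:.2f}, n^(1 + rho) ~ {n ** (1 + rho) / 1e12:,.0f} TB")
# With n around a billion this is n^1.83, i.e. tens of thousands of terabytes of hashes.
```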

Table 3: Top similar results for a few Twitter accounts, generated from WHIMP on flock.

Users similar to @www2016ca
Rank  Twitter @handle   Score
1     @WSDMSocial       0.268
2     @WWWfirenze       0.213
3     @SIGIR2016        0.190
4     @ecir2016         0.175
5     @WSDM2015         0.155

Users similar to @duncanjwatts
Rank  Twitter @handle   Score
1     @ladamic          0.287
2     @davidlazer       0.286
3     @barabasi         0.284
4     @jure             0.218
5     @net_science      0.200

Users similar to @POTUS
Rank  Twitter @handle   Score
1     @FLOTUS           0.387
2     @HillaryClinton   0.368
3     @billclinton      0.308
4     @BernieSanders    0.280
5     @WhiteHouse       0.267

Case Study: In addition to the demonstration of the algorithm’s performance in terms of raw precision and recall, we also showcase some examples to illustrate the practical effectiveness of the approach. Some of these results are presented in Table 3. First, note that the cosine score values that generate the results are around 0.2, which provides justification for our focus on generating results with these values. Furthermore, note that even at these values, the results are quite interpretable and clearly find similar users: for the @www2016ca account, it finds accounts for other related social network and data mining conferences.


For @duncanjwatts, who is a network science researcher, the algorithm finds other network science researchers. And finally, an example of a very popular user is @POTUS, for whom the algorithm finds clearly very related accounts.

9. REFERENCES
[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.
[2] R. R. Amossen and R. Pagh. Faster join-projects and sparse matrix multiplications. In ICDT '09: Proc. 12th Intl. Conf. on Database Theory, pages 121–126, 2009.
[3] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Comm. of the ACM, 1:117–122, 2008.
[4] A. Andoni, P. Indyk, T. Laarhoven, I. P. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1225–1233, 2015.
[5] F. Angiulli and C. Pizzuti. An approximate algorithm for top-k closest pairs join query in large high dimensional data. Data & Knowledge Engineering, 53(3):263–281, June 2005.
[6] G. Ballard, T. G. Kolda, A. Pinar, and C. Seshadhri. Diamond sampling for approximate maximum all-pairs dot-product (MAD) search. In International Conference on Data Mining, pages 11–20, 2015.
[7] M.-A. Belabbas and P. Wolfe. On sparse representations of linear operators and the approximation of matrix products. In CISS 2008: 42nd Annual Conf. on Information Sciences and Systems, pages 258–263, Mar. 2008.
[8] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, editors, Conference on World Wide Web (WWW), pages 587–596. ACM Press, 2011.
[9] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Conference on World Wide Web, pages 595–601, Manhattan, USA, 2004. ACM Press.
[10] M. Charikar. Similarity estimation techniques from rounding algorithms. In Symposium on Theory of Computing, pages 380–388, 2002.
[11] E. Cohen and D. D. Lewis. Approximating matrix multiplication for pattern recognition tasks. J. Algorithms, 30(2):211–252, 1999.
[12] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of World Wide Web, pages 271–280, 2007.
[13] T. Davis. Direct Methods for Sparse Linear Systems. SIAM, 2006.
[14] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[15] P. Drineas and R. Kannan. Fast Monte-Carlo algorithms for approximate matrix multiplication. In FOCS '01: Proc. 42nd IEEE Symposium on Foundations of Computer Science, pages 452–459, Oct. 2001.

[16] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, Jan. 2006.
[17] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, Dec. 2005.
[18] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of VLDB, pages 518–529, 1999.
[19] A. Goel and K. Munagala. Complexity measures for map-reduce, and comparison to parallel computing. arXiv preprint arXiv:1211.6526, 2012.
[20] A. Goel, A. Sharma, D. Wang, and Z. Yin. Discovering similar users on Twitter. In 11th Workshop on Mining and Learning with Graphs, 2013.
[21] P. Gupta, A. Goel, J. J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: the who to follow service at Twitter. In Conference on World Wide Web, pages 505–514, 2013.
[22] F. G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software, 4(3):250–269, Sept. 1978.
[23] J. T. Holodnak and I. C. F. Ipsen. Randomized approximation of the Gram matrix: Exact computation and probabilistic bounds. SIAM Journal on Matrix Analysis and Applications, 36(1):110–137, 2015.
[24] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of STOC, pages 604–613, 1998.
[25] Intel. Math Kernel Library reference manual, 2014. Version 11.2.
[26] D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SDM '05: Proc. 2005 SIAM Intl. Conf. on Data Mining, pages 262–273, Apr. 2005.
[27] G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. Proceedings of the VLDB Endowment, 5(12):1771–1780, 2012.
[28] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[29] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
[30] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[31] J. Lin and D. Ryaboy. Scaling big data mining infrastructure: the Twitter experience. ACM SIGKDD Explorations Newsletter, 14(2):6–19, 2013.
[32] A. Magen and A. Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proc. Symposium on Discrete Algorithms (SODA), pages 1422–1436, 2011.

[33] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, pages 415–444, 2001.
[34] R. Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory (TOCT), 5(3):1–17, Aug. 2013.
[35] F. Ricci, L. Rokach, and B. Shapira. Introduction to Recommender Systems Handbook. Springer, 2011.
[36] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of Foundations of Computer Science, pages 143–152, Oct. 2006.
[37] A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS 2014: Advances in Neural Information Processing Systems 27, pages 2321–2329, 2014.
[38] A. Shrivastava and P. Li. Asymmetric minwise hashing for indexing binary inner products and set containment. In Conference on World Wide Web (WWW), pages 981–991, 2015.
[39] A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Conference on Uncertainty in Artificial Intelligence (UAI), pages 812–821, 2015.
[40] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10. IEEE, 2010.
[41] R. Xiang, J. Neville, and M. Rogati. Modeling relationship strength in online social networks. In Proceedings of the 19th International Conference on World Wide Web, pages 981–990. ACM, 2010.
[42] X. Yan, P. S. Yu, and J. Han. Substructure similarity search in graph databases. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 766–777. ACM, 2005.
[43] R. B. Zadeh and A. Goel. Dimension independent similarity computation. Journal of Machine Learning Research, 14(1):1605–1626, 2013.
