

B490 Mining the Big Data

Qin Zhang

§1 Finding Similar Items


Motivations

Finding similar documents/webpages/images

• (Approximate) mirror sites. Application: we don't want to show both in Google search results.

• Plagiarism, large quotations. Application: I am sure you know.

• Similar topics/articles from various places. Application: cluster articles by "same story".

• Google Images

Social network analysis

Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.


How can we compare similarity?

Convert the data (homework, webpages, images) into an object in an abstract space in which we know how to measure distance, and how to do so efficiently.

What abstract space?

For example, the Euclidean space over R^d, the L1 space, the L∞ space, . . .

Three essential techniques

• k-gram: convert documents to sets.

• Min-hashing: convert large sets to short signatures, while preserving similarity.

• Locality-sensitive hashing (LSH): map the sets to some space so that similar sets will be close in distance.

The usual way of finding similar items:

Docs --(k-gram)--> a bag of strings of length k from the doc --(Min-hashing)--> a "signature" representing the sets --(LSH)--> "points" in some space

Jaccard Distance and Min-hashing

Convert documents to sets

k-gram: a sequence of k characters that appears in the doc.

E.g., doc = abcab, k = 2; then the set of all 2-grams is {ab, bc, ca}.

Question: What is a good k?

Question: Shall we count replicas?

Question: How do we deal with white space and stop words (e.g., a, you, for, the, to, and, that)?

Question: Capitalization? Punctuation?

Question: Characters vs. words?

Question: How about different languages?
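As a concrete illustration, here is a minimal Python sketch of k-gram extraction (the function name kgrams is ours; returning a set means replicas are not counted):

```python
def kgrams(doc, k):
    """Return the set of all k-grams (length-k substrings) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(kgrams("abcab", 2))  # {'ab', 'bc', 'ca'}, as in the slide's example
```

Lower-casing, stripping punctuation, or shingling on words instead of characters would all be small variations of this loop.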

Jaccard similarity

Example: for two sets with intersection size 3 and union size 8, the Jaccard similarity is 3/8.

Formally, the Jaccard similarity of two sets C1, C2 is

JS(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.

It can be turned into a distance function if we define d_JS(C1, C2) = 1 − JS(C1, C2). Actually, any similarity measure can be converted to a distance function.
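A one-line Python sketch of the definition, using the same two sets as the clustered example on the next slide (which realize the intersection-3/union-8 picture above):

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two finite sets."""
    return len(a & b) / len(a | b)

# Intersection {0, 2, 5} has size 3; the union has size 8.
print(jaccard({0, 1, 2, 5, 6}, {0, 2, 3, 5, 7, 9}))  # 0.375 = 3/8
```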

Jaccard similarity with clusters

Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B?

With clusters: we may have some items which basically represent the same thing. We group objects into clusters, e.g.,

C1 = {0, 1, 2}, C2 = {3, 4}, C3 = {5, 6}, C4 = {7, 8, 9}

For instance: C1 may represent action movies, C2 may represent comedies, C3 may represent documentaries, C4 may represent horror movies.

Now we can represent A_clu = {C1, C3}, B_clu = {C1, C2, C3, C4}, so

JS_clu(A, B) = JS(A_clu, B_clu) = |{C1, C3}| / |{C1, C2, C3, C4}| = 0.5
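A small sketch of the clustered variant (the CLUSTER mapping mirrors the slide's C1-C4; the function name is ours):

```python
CLUSTER = {0: "C1", 1: "C1", 2: "C1", 3: "C2", 4: "C2",
           5: "C3", 6: "C3", 7: "C4", 8: "C4", 9: "C4"}

def jaccard_clustered(a, b, cluster=CLUSTER):
    """Map each item to its cluster, then take the Jaccard similarity
    of the resulting cluster sets."""
    a_clu = {cluster[x] for x in a}
    b_clu = {cluster[x] for x in b}
    return len(a_clu & b_clu) / len(a_clu | b_clu)

print(jaccard_clustered({0, 1, 2, 5, 6}, {0, 2, 3, 5, 7, 9}))  # 0.5
```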

Min-hashing

Given sets S1 = {1, 2, 5}, S2 = {3}, S3 = {2, 3, 4, 6}, S4 = {1, 4, 6}, we can represent them as a matrix:

Item  S1  S2  S3  S4
  1    1   0   0   1
  2    1   0   1   0
  3    0   1   1   0
  4    0   0   1   1
  5    1   0   0   0
  6    0   0   1   1

Step 1: Randomly permute the rows:

Item  S1  S2  S3  S4
  2    1   0   1   0
  5    1   0   0   0
  6    0   0   1   1
  1    1   0   0   1
  4    0   0   1   1
  3    0   1   1   0

Step 2: Record the first 1 in each column: m(S1) = 2, m(S2) = 3, m(S3) = 2, m(S4) = 6.

Step 3: Estimate the Jaccard similarity JS(Si, Sj) by

X = 1 if m(Si) = m(Sj), and 0 otherwise.
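A minimal Python sketch of Steps 1-2 (one random permutation; the helper names are ours):

```python
import random

def minhash_once(sets, universe):
    """Randomly permute the item universe (Step 1) and record, for each
    set, its earliest member under that order (Step 2)."""
    perm = list(universe)
    random.shuffle(perm)
    rank = {item: pos for pos, item in enumerate(perm)}
    return {name: min(s, key=rank.__getitem__) for name, s in sets.items()}

sets = {"S1": {1, 2, 5}, "S2": {3}, "S3": {2, 3, 4, 6}, "S4": {1, 4, 6}}
m = minhash_once(sets, range(1, 7))
print(m)                        # e.g. {'S1': 2, 'S2': 3, 'S3': 2, 'S4': 6}
print(int(m["S1"] == m["S3"]))  # the estimator X from Step 3
```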

Min-hashing (cont.)

Lemma: Pr[m(Si) = m(Sj)] = E[X] = JS(Si, Sj)

(proof on the board)

To approximate the Jaccard similarity within ε error with probability at least 1 − δ, repeat k = 1/ε^2 · log(1/δ) times and then take the average.

(Chernoff bound on board)

Min-hashing (cont.)

How to implement Min-hashing efficiently? Make one pass over the data.

Maintain k random hash functions {h1, h2, . . . , hk}, with each hi : [N] → [N] chosen at random (more on the next slide).

Maintain k counters {c1, . . . , ck}, with each ci (1 ≤ i ≤ k) initialized to ∞.

Algorithm Min-hashing
    For each i ∈ S do
        for j = 1 to k do
            if hj(i) < cj then
                cj = hj(i)
    Output mj(S) = cj for each j ∈ [k]

Then estimate

JS_k(Si, Si') = (1/k) Σ_{j=1}^k 1(mj(Si) = mj(Si'))

Collisions? Use the Birthday Paradox.
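A runnable Python sketch of this one-pass algorithm; the hash family is the prime-based one from the next slide, and all helper names are ours:

```python
import random

def make_hashes(k, N, p=2_147_483_647):
    """k random hash functions of the form h(x) = ((a*x + b) mod p) mod N."""
    params = [(random.randrange(1, p), random.randrange(p)) for _ in range(k)]
    return [lambda x, a=a, b=b: ((a * x + b) % p) % N for a, b in params]

def minhash_signature(S, hashes):
    """One pass over S, keeping the minimum hash value per function."""
    sig = [float("inf")] * len(hashes)
    for i in S:
        for j, h in enumerate(hashes):
            if h(i) < sig[j]:
                sig[j] = h(i)
    return sig

def estimate_js(sig1, sig2):
    """JS_k = fraction of signature coordinates on which the sets agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

hashes = make_hashes(k=200, N=10**6)
sig1 = minhash_signature({1, 2, 5}, hashes)
sig3 = minhash_signature({2, 3, 4, 6}, hashes)
print(estimate_js(sig1, sig3))  # concentrates around JS(S1, S3) = 1/6
```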

Random hashing functions

A random hash function h ∈ H maps a set A to a set B so that, over the random choice of h ∈ H:

1. The location h(x) ∈ B is equally likely for every x ∈ A.

2. (Stronger) For any two x ≠ x' ∈ A, h(x) and h(x') are independent over the random choice of h ∈ H.

Even stronger: "any two" → any disjoint subset. Good, but hard to achieve in theory.

Some practical hash functions h : Σ^k → [m]:

• Modular hashing: h(x) = x mod m

• Multiplicative hashing: h_a(x) = ⌊m · frac(x · a)⌋

• Using primes: let p ∈ [m, 2m] be a prime and pick a, b ∈_R [p]. Then h(x) = ((ax + b) mod p) mod m

• Tabulation hashing: h(x) = H1[x1] ⊕ . . . ⊕ Hk[xk], where ⊕ is bitwise exclusive-or and each Hi : Σ → [m] is a random mapping table (stored in a fast cache)
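A sketch of the prime-based family (we fix a large Mersenne prime for convenience rather than choosing p ∈ [m, 2m] as the slide suggests; the construction is otherwise the same):

```python
import random

def prime_hash(m, p=2_147_483_647):
    """Draw h(x) = ((a*x + b) mod p) mod m with random a, b in [p]."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

h = prime_hash(m=100)
print([h(x) for x in range(10)])  # ten pseudorandom buckets in [0, 100)
```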

Finding similar items

Given a set of n = 1,000,000 items, we ask two questions:

Which items are similar? We don't want to check all n^2 pairs.

Given a query item, which others are similar to that item? We don't want to check all n items.

What if the set is n points in the plane R^2?

Choice 1: Hierarchical indexing trees (e.g., range tree, kd-tree, B-tree). Problem: they do not scale to high dimensions.

Choice 2: Lay down a (random) grid. This is similar to the locality-sensitive hashing that we will explore.


Locality Sensitive Hashing

Locality Sensitive Hashing (LSH)

Definition: h is (ℓ, u, p_ℓ, p_u)-sensitive with respect to a distance function d if

• Pr[h(a) = h(b)] > p_ℓ if d(a, b) < ℓ.

• Pr[h(a) = h(b)] < p_u if d(a, b) > u.

For this definition to make sense, we need p_u < p_ℓ for u > ℓ.

Ideally, we want p_ℓ − p_u to be large even when u − ℓ is small. Then we can repeat and further amplify this effect (next slide).

Define d(a, b) = 1 − JS(a, b). The family of Min-hashing functions is (d1, d2, 1 − d1, 1 − d2)-sensitive for any 0 ≤ d1 < d2 ≤ 1.

Not good enough: d2 − d1 = (1 − d1) − (1 − d2). Can we improve it?

Partition into bands

Docs  D1  D2  D3  D4
h1     1   2   4   0
h2     2   0   1   3
h3     5   3   3   0
h4     4   0   1   1
. . .
ht     1   2   3   0

⇒ partition the t rows into b bands with r = t/b rows per band.

Candidate pairs are those that hash to the same bucket for at least 1 band.

Tune b and r to catch most similar pairs, but few nonsimilar pairs.
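A sketch of banding in Python (the toy signatures below are ours, not the slide's table; with b = 2 bands of r = 2 rows, only D3 and D4 agree on a full band):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """Split each length-(b*r) signature into b bands of r rows; docs whose
    slices match in at least one band become candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

sigs = {"D1": [1, 2, 5, 4], "D2": [2, 0, 1, 3],
        "D3": [4, 1, 3, 0], "D4": [0, 3, 3, 0]}
print(candidate_pairs(sigs, b=2, r=2))  # {('D3', 'D4')}
```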

A bite of theory for LSH (using min-hashing)

Take the signature matrix and banding scheme above (t hash rows, b bands, r = t/b rows per band), and let s = JS(D1, D2); s is the probability that D1 and D2 have a hash collision. Then:

s^r = probability that all hashes collide in 1 band

(1 − s^r) = probability that not all collide in 1 band

(1 − s^r)^b = probability that in no band do all hashes collide

f(s) = 1 − (1 − s^r)^b = probability that all hashes collide in at least 1 band

A bite of theory for LSH (cont.)

E.g., if we choose t = 15, r = 3 and b = 5, then f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48, f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) = 0.998.

So dissimilar pairs are almost never reported, while highly similar pairs almost surely are.
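These values are easy to reproduce with a two-line sketch:

```python
def f(s, r=3, b=5):
    """Probability of becoming a candidate pair: all r rows collide
    in at least one of the b bands."""
    return 1 - (1 - s**r) ** b

for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"f({s}) = {f(s):.3f}")  # matches the slide's t = 15 example
```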

A bite of theory for LSH (cont.)

Choice of r and b: usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up between r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^{1/r}. So given a similarity s that we want to use as a cut-off, we can solve for r = t/b in s = (1/b)^{1/r} to yield r ≈ log_s(1/t).

If there is no budget on r and b, then as they increase the S-curve gets sharper.
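To see how the split of a fixed budget t moves the cut-off, a quick sketch using the slide's τ ≈ (1/b)^{1/r} heuristic:

```python
def threshold(r, b):
    """Approximate similarity cut-off of the S-curve: tau ~ (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

# Same budget t = 15, different (r, b) splits:
for r, b in [(3, 5), (5, 3), (15, 1)]:
    print(r, b, round(threshold(r, b), 3))  # deeper bands push the cut-off up
```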

LSH using Euclidean distance

Besides the min-hashing "distance function", let's use the Euclidean distance function for LSH:

d_E(u, v) = ‖u − v‖_2 = (Σ_{i=1}^k (v_i − u_i)^2)^{1/2}

Hashing function h

1. First take a random unit vector u ∈ R^d. A unit vector u satisfies ‖u‖_2 = 1, that is, d_E(u, 0) = 1. We will see how to generate a random unit vector later.

2. Project a, b ∈ P ⊂ R^k onto u: a_u = ⟨a, u⟩ = Σ_{i=1}^k a_i · u_i. This is contractive: |a_u − b_u| ≤ ‖a − b‖_2.

3. Create bins of size γ on u (that is, on R^1), better with a random shift z ∈ [0, γ). h(a) = index of the bin that a falls into.
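A Python sketch of this hash, using the standard Gaussian trick for the random unit vector (the detail the slide defers):

```python
import math
import random

def random_unit_vector(d):
    """Normalize a vector of Gaussian coordinates to length 1."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean_lsh(d, gamma):
    """h(a) = index of the size-gamma bin that <a, u> + z falls into,
    for a random unit vector u and a random shift z in [0, gamma)."""
    u = random_unit_vector(d)
    z = random.uniform(0.0, gamma)
    return lambda a: math.floor((sum(x * y for x, y in zip(a, u)) + z) / gamma)

h = euclidean_lsh(d=2, gamma=1.0)
print(h([0.0, 0.1]), h([0.0, 0.15]), h([5.0, 5.0]))  # close points usually share a bin
```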

Property of LSH using Euclidean distance

The hashing function h of the previous slide is (γ/2, 2γ, 1/2, 1/3)-sensitive:

• If ‖a − b‖_2 < γ/2, then Pr[h(a) = h(b)] > 1/2.

• If ‖a − b‖_2 > 2γ, then Pr[h(a) = h(b)] < 1/3.

Entropy LSH

Problem with the basic LSH: we need many hash functions h1, h2, . . . , ht.

• They consume a lot of space.

• They are hard to implement in the distributed computation model.

Entropy LSH solves the first issue.

– When querying h(q), additionally query several "offsets" q + δ_i (1 ≤ i ≤ L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r.

– Intuition: the data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it.

This needs far fewer hash functions.


Distances

Distance functions

A distance d : X × X → R+ is a metric if it satisfies:

(M1) (non-negativity) d(a, b) ≥ 0

(M2) (identity) d(a, b) = 0 if and only if a = b

(M3) (symmetry) d(a, b) = d(b, a)

(M4) (triangle inequality) d(a, b) ≤ d(a, c) + d(c, b)

A distance that satisfies (M1), (M3) and (M4) is called a pseudometric.

A distance that satisfies (M1), (M2) and (M4) is called a quasimetric.

Distance functions (cont.)

Lp distances on two vectors a = (a1, . . . , ad), b = (b1, . . . , bd):

d_p(a, b) = ‖a − b‖_p = (Σ_{i=1}^d |a_i − b_i|^p)^{1/p}

L2: Euclidean distance.

L0: d_0(a, b) = d − Σ_{i=1}^d 1(a_i = b_i), the number of coordinates on which a and b differ. If a_i, b_i ∈ {0, 1}, then this is called the Hamming distance.

L1: also known as Manhattan distance. d_1(a, b) = Σ_{i=1}^d |a_i − b_i|.

L∞: d_∞(a, b) = max_{i=1}^d |a_i − b_i|
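A small sketch covering the whole family (passing p = float('inf') gives the max):

```python
def lp_distance(a, b, p):
    """d_p(a, b) = (sum_i |a_i - b_i|^p)^(1/p), with p = inf as the max."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

a, b = [0, 3, 1], [4, 0, 1]
print(lp_distance(a, b, 1))             # 7.0 (Manhattan)
print(lp_distance(a, b, 2))             # 5.0 (Euclidean)
print(lp_distance(a, b, float("inf")))  # 4   (largest coordinate gap)
```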

Distance functions (cont.)

Jaccard distance:

d_J(A, B) = 1 − JS(A, B) = 1 − |A ∩ B| / |A ∪ B|

It is a metric. Proof (on board).

Distance functions (cont.)

Cosine distance:

d_cos(a, b) = 1 − ⟨a, b⟩ / (‖a‖_2 ‖b‖_2) = 1 − (Σ_{i=1}^d a_i b_i) / (‖a‖_2 ‖b‖_2)

A pseudometric.

We can also develop an LSH function h for d_cos(·, ·), as follows. Choose a random vector v ∈ R^d. Then let

h_v(a) = +1 if ⟨v, a⟩ > 0, and −1 otherwise.

It is (γ, φ, (π − γ)/π, φ/π)-sensitive for any γ < φ ∈ [0, π].
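A sketch of this sign-of-random-projection hash (often called SimHash; the helper name is ours):

```python
import random

def cosine_lsh(d):
    """h_v(a) = +1 if <v, a> > 0 else -1, for a random Gaussian v."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    return lambda a: 1 if sum(x * y for x, y in zip(v, a)) > 0 else -1

h = cosine_lsh(3)
print(h([1.0, 2.0, 0.5]), h([1.1, 1.9, 0.4]))  # near-parallel vectors usually agree
```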

Distance functions (cont.)

Edit distance. Consider two strings a, b ∈ Σ*, and define

d_ed(a, b) = # operations to make a = b,

where an operation is inserting a letter, deleting a letter, or substituting a letter.

Application: Google's auto-correct.

It is expensive to compute.
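The textbook dynamic program runs in O(|a| · |b|) time, which is what makes edit distance expensive on long strings; a minimal sketch:

```python
def edit_distance(a, b):
    """Classic dynamic program: dp[i][j] = edits to turn a[:i] into b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                                  # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                                  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3
```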

Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v1, . . . , vn} and E = {e1, . . . , em}, the graph distance d_G between any pair vi, vj ∈ V is defined as the length of the shortest path between vi and vj.

Distance functions (cont.)

Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p1, . . . , pd} and Q = {q1, . . . , qd},

d_KL(P, Q) = Σ_{i=1}^d p_i ln(p_i / q_i)

A distance that is NOT a metric.

It can be written as H(P, Q) − H(P), i.e., cross-entropy minus entropy.
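A sketch that also shows why it is not a metric (it is not even symmetric; we use the convention 0 · ln 0 = 0):

```python
import math

def kl_divergence(p, q):
    """d_KL(P, Q) = sum_i p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.511
print(kl_divergence(q, p))  # ~0.368 -- asymmetric, so (M3) fails
```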

Distance functions (cont.)

Given two multisets A, B in the grid [Δ]^2 with |A| = |B| = N, the Earth-Mover Distance (EMD) is defined as the minimum cost of a perfect matching between points in A and B, that is,

EMD(A, B) = min_{π : A → B} Σ_{a ∈ A} ‖a − π(a)‖_1

Good for comparing images.
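Since EMD here is a min-cost perfect matching, it can be computed with an assignment solver; a sketch using SciPy's Hungarian-algorithm routine (fine for small N, though specialized flow algorithms scale better):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(A, B):
    """Min-cost perfect matching between equal-size point multisets,
    with L1 ground distance."""
    A, B = np.asarray(A), np.asarray(B)
    cost = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)  # pairwise L1
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

print(emd([(0, 0), (1, 2)], [(0, 1), (2, 2)]))  # 1 + 1 = 2
```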

Thank you!

Some slides are based on the MMDS book
http://infolab.stanford.edu/~ullman/mmds.html
and Jeff Phillips' lecture notes
http://www.cs.utah.edu/~jeffp/teaching/cs5955.html