

Rasmus Pagh IT University of Copenhagen

Google Research

BARC

WADS, Edmonton, August 5, 2019


SCALABLE SIMILARITY SEARCH

Set Similarity – a Survey


Set of Q&A


Before we start…

Let’s consider three internet technologies launched around 20 years ago


Recommendations


Advanced search


Wildcard operator

Edm?nt?n map


Before we start…

What happened to wildcard search and to boolean expressions?


It’s all about sets

[Figure: set diagrams with the following labels.]

• Set of shopping carts with Canadian Train Ride; set of shopping carts with Trans-American Train Ride.

• Position/character pairs (1,E) (2,d) (3,m) (5,n) (6,t) (8,n), with (4,o) and (7,o) shown separately.

• Web pages containing “ballroom”; web pages containing “dance”; web pages containing “salsa”.


Setting of this talk

• We are given a collection of sets $S_1, \dots, S_n \subseteq U$ that we are allowed to preprocess.

• Seek answer to queries such as:
  - Given $i, j$, what is the size of $S_i \cup S_j$? Of $S_i \cap S_j$? (Similarity computation)
  - Given a set $Q$, is there an $i$ such that $Q \subseteq S_i$? $Q \supseteq S_i$? (Similarity search)
  - Given a set $Q$ and an integer $t$, is there an $i$ such that $|Q \cap S_i| \ge t$? (Similarity search)
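
As a point of reference, the three exact queries can be answered naively by scanning the collection. The following is a minimal Python sketch; the function names and the toy collection are illustrative assumptions, not part of the talk.

```python
from typing import List, Optional, Set, Tuple


def union_and_intersection_size(sets: List[Set[int]], i: int, j: int) -> Tuple[int, int]:
    """Similarity computation: return (|S_i ∪ S_j|, |S_i ∩ S_j|)."""
    inter = len(sets[i] & sets[j])
    return len(sets[i]) + len(sets[j]) - inter, inter


def find_superset(sets: List[Set[int]], q: Set[int]) -> Optional[int]:
    """Similarity search: an index i with Q ⊆ S_i, or None if there is none."""
    return next((i for i, s in enumerate(sets) if q <= s), None)


def find_subset(sets: List[Set[int]], q: Set[int]) -> Optional[int]:
    """Similarity search: an index i with Q ⊇ S_i, or None if there is none."""
    return next((i for i, s in enumerate(sets) if q >= s), None)


def find_overlap(sets: List[Set[int]], q: Set[int], t: int) -> Optional[int]:
    """Similarity search: an index i with |Q ∩ S_i| >= t, or None."""
    return next((i for i, s in enumerate(sets) if len(q & s) >= t), None)


# Hypothetical toy collection S_1, S_2, S_3 (0-indexed in the code).
collection = [{1, 2, 3}, {2, 3, 4, 5}, {7}]
```

Each search query here costs time linear in the total size of the collection, which is the baseline that the rest of the talk tries to improve on.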


              Similarity computation    Similarity search
Bad news      1                         2
Good news     3                         4


Bad news (similarity computation)

• Query: Given $i, j$, what is the size of $S_i \cap S_j$?

• [Pǎtraşcu ’10], [Kopelowitz et al. ’14]:
  - Assume we can preprocess sets $S_1, \dots, S_n \subseteq [n]$, each of size $\sqrt{n}$, in time $O(n^{1.99})$ such that it is possible to determine if $S_i \cap S_j = \emptyset$ in time $O(n^{0.49})$.
  - Then integer 3SUM can be solved in time $O(n^{1.991})$.

Suggests that polylog$(n)$ query time is not possible without essentially precomputing all answers.


Bad news (similarity search)

• Given a set $Q$, is there an $i$ such that $Q \subseteq S_i$?

• [Williams ’04], [Alman & Williams ’15]:
  - Assume we can preprocess sets $S_1, \dots, S_n \subseteq [n^{0.01}]$ in time poly$(n)$ such that it is possible to determine if $\exists i : Q \subseteq S_i$ in time $O(n^{0.99})$.
  - Then $\exists c < 2$ such that k-SAT with $n$ variables can be solved in time $c^n$.

Under the strong exponential time hypothesis, this is not possible!


The good news…

• We can now explain why nearly no progress on basic set processing problems has been made since the 1970s.

• More constructively, it justifies looking at $c$-approximate versions of these problems:
  - Given $i, j$, what is the approximate size of $S_i \cup S_j$ and $S_i \cap S_j$, up to a multiplicative error $c > 1$?
  - Given a set $Q$ and an integer $t$, is there an $i$ such that $|Q \cap S_i| \ge t$, or is $|Q \cap S_i| \le t/c$ for all $i$?
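
To make the approximate search question concrete, here is a small Python sketch that classifies an instance by brute force. The function name and the three labels are assumptions chosen for illustration; the point is only that instances in the gap between $t/c$ and $t$ may be answered either way.

```python
from typing import List, Set


def classify_gap_instance(sets: List[Set[int]], q: Set[int], t: int, c: float) -> str:
    """Brute-force reference for the c-approximate overlap query."""
    best = max((len(q & s) for s in sets), default=0)
    if best >= t:
        return "yes"      # some i has |Q ∩ S_i| >= t
    if best <= t / c:
        return "no"       # every i has |Q ∩ S_i| <= t/c
    return "either"       # inside the gap: both answers are acceptable


# Hypothetical toy data: the best overlap with the query is 3.
toy_sets = [{1, 2, 3, 4}, {5, 6}]
toy_query = {1, 2, 3, 9}
```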


              Similarity computation    Similarity search
Bad news      1                         2
Good news     3                         4

Next: part 3 (good news, similarity computation).


Similarity estimation attempt 2: Coordinated sampling [Brewer et al. ’72]

• Sample $U' \subseteq U$ where $x \in U'$ independently with probability $\alpha$, and let $S'_i = S_i \cap U'$.

• Observe that $\mu = E[|S'_i \cap S'_j|] = \alpha |S_i \cap S_j|$, and by Chernoff bounds $|S'_i \cap S'_j| \approx \mu$ with probability $1 - e^{-\Omega(\mu)}$.

• Can estimate $|S_i \cap S_j| \approx |S'_i \cap S'_j| / \alpha$ if sampling rate $\alpha \gg 1 / |S_i \cap S_j|$.

• Time to compute estimate is $|S'_i| + |S'_j| \approx \alpha(|S_i| + |S_j|)$.

Drawbacks: need to store the set $U'$; sample size is variable.
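
A minimal Python sketch of coordinated sampling follows. The function names, the seed, and the toy sets are assumptions made for illustration; in a real setting the samples $S'_i$ would be computed once during preprocessing rather than at query time.

```python
import random
from typing import Iterable, Set


def sample_universe(universe: Iterable[int], alpha: float, seed: int = 0) -> Set[int]:
    """Return U': each x in U is kept independently with probability alpha."""
    rng = random.Random(seed)
    return {x for x in universe if rng.random() < alpha}


def estimate_intersection(s_i: Set[int], s_j: Set[int], u_prime: Set[int], alpha: float) -> float:
    """Estimate |S_i ∩ S_j| as |S'_i ∩ S'_j| / alpha, using the shared sample U'."""
    sampled_i = s_i & u_prime   # S'_i
    sampled_j = s_j & u_prime   # S'_j
    return len(sampled_i & sampled_j) / alpha


# Hypothetical example: the true intersection size is 30,000.
universe = range(100_000)
s_i = set(range(0, 60_000))
s_j = set(range(30_000, 90_000))
alpha = 0.05
u_prime = sample_universe(universe, alpha, seed=1)
estimate = estimate_intersection(s_i, s_j, u_prime, alpha)
```

Because both sets are restricted to the same $U'$, an element of the intersection either survives in both samples or in neither, which is what makes the samples "coordinated".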


A mystery of alpine flowers

[Figure: first page of a scanned paper, translated here from French.]

Bulletin de la Société Vaudoise des Sciences Naturelles, Vol. XXXVII, No. 140, 1901

DISTRIBUTION OF THE ALPINE FLORA IN THE DRANSES BASIN AND IN SOME NEIGHBOURING REGIONS
(original title: Distribution de la flore alpine dans le Bassin des Dranses et dans quelques régions voisines)

By Dr Paul JACCARD, professor.

I

In a previous memoir [1], the comparison of the alpine flora of three regions (Trient, Bagnes, Wildhorn) led me to conclude that the richness in species, and above all the proportion of species peculiar to each of the regions compared, is roughly proportional to the variety of their biological conditions. To what extent is this conclusion general?

That is what I propose to establish in the present memoir, dealing first with an apparent exception to the conclusion I have just recalled.

It concerns the Grand Saint-Bernard and the val d'Entremont.

[1] This work is the continuation of a memoir published in last year's Bulletin de la Soc. vaudoise, vol. XXXVI, entitled: Contribution au problème de l'immigration de la flore alpine. It reproduces and develops the two notes that appeared in the Archives des Sc. phys. et nat. de Genève, t. X, October 1900: L'immigration post-glaciaire et la distribution actuelle de la flore alpine dans quelques régions des Alpes, and in the Comptes rendus du Congrès international de botanique de Paris, 1900, p. 31-38, Méthode de détermination de la distribution de la flore alpine.

1901-1996: 41 citations (Google Scholar)

1997-2019: ~2800 citations (Google Scholar)


Min-wise hashing (aka. minhash) [Broder ’97]

• Pick random hash function $h : U \to [n^{10}]$ and define: $\mathrm{minhash}_h(S_i) = \arg\min_{x \in S_i} h(x)$

• $\Pr[\mathrm{minhash}_h(S_i) = \mathrm{minhash}_h(S_j)] \approx |S_i \cap S_j| / |S_i \cup S_j|$

• Repeat $s$ times to get sample of size $s$. Advantages:
  - Coordinated samples without storing a set $U'$.
  - Storage requirement is fixed.
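
One way to see where the collision probability comes from is the short argument below. It assumes that $h$ is fully random and has no collisions on $S_i \cup S_j$; with collisions the equality only holds approximately, which is why the slide writes $\approx$.

```latex
\begin{align*}
x^* &:= \arg\min_{x \in S_i \cup S_j} h(x)
  && \text{(uniformly distributed over } S_i \cup S_j\text{)} \\
\mathrm{minhash}_h(S_i) = \mathrm{minhash}_h(S_j)
  &\iff x^* \in S_i \cap S_j \\
\Pr[\mathrm{minhash}_h(S_i) = \mathrm{minhash}_h(S_j)]
  &= \Pr[x^* \in S_i \cap S_j]
   = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}
\end{align*}
```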


Minhash estimation [Broder ’97]

• Pick random hash functions $h_t : U \to [n^{10}]$, $t = 1, \dots, s$.

• Create sketch vectors $v(S_i)$, where $v(S_i)_t = \mathrm{minhash}_{h_t}(S_i)$.

• Estimator: $X = \frac{1}{s} \sum_t \mathbf{1}_{v(S_i)_t = v(S_j)_t}$

• $E[X] \approx \frac{|S_i \cap S_j|}{|S_i \cup S_j|} = J(S_i, S_j)$

• $\mathrm{Var}[X] = \frac{J(1 - J)}{s}$

[Figure: the sketches $v(S_i)$ and $v(S_j)$ are compared coordinate by coordinate.]
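
The sketch-and-compare procedure can be illustrated in Python as follows. This is a sketch under assumptions: salted BLAKE2b from the standard library stands in for the random hash functions $h_t$, and the function names and toy sets are hypothetical.

```python
import hashlib
from typing import Hashable, List, Set


def h(t: int, x: Hashable) -> int:
    """Stand-in for the t-th random hash function h_t (assumption: salted BLAKE2b)."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=t.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(members: Set[int], s: int) -> List[int]:
    """v(S): for each t, the element of S that minimises h_t."""
    return [min(members, key=lambda x: h(t, x)) for t in range(s)]


def estimate_jaccard(v_i: List[int], v_j: List[int]) -> float:
    """X: the fraction of coordinates on which the two sketches agree."""
    return sum(a == b for a, b in zip(v_i, v_j)) / len(v_i)


# Hypothetical example with |S_i ∩ S_j| = 100 and |S_i ∪ S_j| = 200, so J = 0.5.
s_i = set(range(0, 150))
s_j = set(range(50, 200))
estimate = estimate_jaccard(sketch(s_i, 400), sketch(s_j, 400))
```

With $s = 400$ the standard deviation of the estimator is $\sqrt{J(1 - J)/s} = 0.025$ at $J = 0.5$, so the estimate should land close to 0.5.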


1-bit minhash [Li and König ’09]

• Idea: Compress the vector $v(S_i) \in U^s$ to $v'(S_i) \in \{0,1\}^s$.

• Use hash functions $g_t : U \to \{0,1\}$, and define: $v'(S_i)_t = g_t(v(S_i)_t)$

• Estimator for Jaccard sim.: $X' = \frac{2}{s} \left( \sum_t \mathbf{1}_{v'(S_i)_t = v'(S_j)_t} \right) - 1$.

• $\mathrm{Var}[X'] = \frac{(1 + J)(1 - J)}{s}$, a factor $(1 + J)/J$ larger than minhash.
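
The estimator and its variance can be checked with a short calculation. It assumes that the $g_t$ are fully random and independent, and that two minhash values agree with probability exactly $J$; unequal minhash values then receive the same bit with probability $1/2$.

```latex
\begin{align*}
p &:= \Pr[v'(S_i)_t = v'(S_j)_t] = J + \tfrac{1}{2}(1 - J) = \tfrac{1 + J}{2} \\
E[X'] &= 2p - 1 = J \\
\mathrm{Var}[X'] &= \frac{4}{s}\, p\,(1 - p) = \frac{(1 + J)(1 - J)}{s} \\
\frac{\mathrm{Var}[X']}{\mathrm{Var}[X]}
  &= \frac{(1 + J)(1 - J)/s}{J(1 - J)/s} = \frac{1 + J}{J}
\end{align*}
```

The ratio compares sketches with the same number of coordinates $s$; the 1-bit sketch uses far fewer bits per coordinate.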


Optimality of 1-bit minhash

• [P.-Stöckel-Woodruff ’14]: The variance of any estimator for Jaccard similarity based on $s$-bit summaries must be $\Omega_\varepsilon(1/s)$ for $J \in (\varepsilon, 1 - \varepsilon)$.

• What happens when $J$ is close to zero or one?
  - Not much seems to be known about $J \approx 0$.
  - Experiments in [Li and König ’09] suggest that using $b$-bit minwise hashing is better for $J \approx 0$.


Lower variance for low similarities [Christiani ’18]

Assume for simplicity that all sets have size $w$.

“CP hash”:

• Choose $I_t \subseteq U$, where $\Pr[k \in I_t] = p$ indep.

• Parameter $p$ is chosen s.t. $\Pr[S \cap I_t = \emptyset] = \frac{1}{2}$.

• Define $v'''(S)_t = \mathbf{1}_{S \cap I_t \neq \emptyset}$.

• Variance improves by factor almost 2 for small $J$.

[Figure: Hamming distance / $s$ (vertical axis, 0.1 to 0.5) plotted against $|S_i \cap S_j| / w$ (horizontal axis, 0.2 to 1.0), with curves labelled CP hash / 1-bit CP and 1-bit minwise.]
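
The construction can be sketched in Python as below. The names, the explicit sampling of each $I_t$ over a small integer universe, and in particular the inversion formula in `estimate_overlap_fraction` are assumptions: the formula is derived here from the slide's definitions and is not claimed to be the estimator used in [Christiani ’18]. With $(1 - p)^w = 1/2$ and $a = |S_i \cap S_j|$, both bits are 0 with probability $(1 - p)^{2w - a} = 2^{a/w - 2}$, so the bits differ with probability $1 - 2^{a/w - 1}$.

```python
import math
import random
from typing import List, Set


def cp_parameter(w: int) -> float:
    """The p that solves (1 - p)^w = 1/2 for sets of size w."""
    return 1.0 - 2.0 ** (-1.0 / w)


def sample_index_sets(universe_size: int, w: int, s: int, seed: int = 0) -> List[Set[int]]:
    """I_1, ..., I_s: each k in {0, ..., universe_size - 1} joins I_t independently w.p. p."""
    rng = random.Random(seed)
    p = cp_parameter(w)
    return [{k for k in range(universe_size) if rng.random() < p} for _ in range(s)]


def cp_sketch(members: Set[int], index_sets: List[Set[int]]) -> List[int]:
    """v'''(S)_t = 1 if S intersects I_t, else 0."""
    return [0 if members.isdisjoint(i_t) else 1 for i_t in index_sets]


def estimate_overlap_fraction(v_a: List[int], v_b: List[int]) -> float:
    """Invert E[Hamming distance / s] = 1 - 2^(a/w - 1) to estimate a/w.

    Only meaningful when the observed distance fraction is below 1.
    """
    d = sum(x != y for x, y in zip(v_a, v_b)) / len(v_a)
    return 1.0 + math.log2(1.0 - d)


# Hypothetical example: two sets of size w = 100 sharing 50 elements, so a/w = 0.5.
w = 100
set_a = set(range(0, 100))
set_b = set(range(50, 150))
index_sets = sample_index_sets(universe_size=300, w=w, s=2000, seed=1)
overlap_fraction = estimate_overlap_fraction(
    cp_sketch(set_a, index_sets), cp_sketch(set_b, index_sets)
)
```

Since all sets have size $w$, the Jaccard similarity follows from the overlap as $J = a / (2w - a)$.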


Lower variance for high similarities [Mitzenmacher, P., Pham ’14]

• Start with minhash $v(S_i) \in U^{\alpha s}$.

• 1-bit minhash: $v'(S_i)_t = g_t(v(S_i)_t)$

• Alternative binarization, “odd sketch”: Use hash function $g : U \to \{1, \dots, s\}$, define $v''(S)_t = \sum_j \mathbf{1}_{g(v(S)_j) = t} \bmod 2$.

• Can estimate $|S_i \triangle S_j| = |S_i \setminus S_j| + |S_j \setminus S_i|$ from $v''(S_i) \oplus v''(S_j)$, error proportional to $1 - J(S_i, S_j)$.

[Figure: elements $x$ and $x'$ are mapped by $g$ to bit positions $g(x)$ and $g(x')$.]
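
The odd sketch itself is easy to illustrate in Python. In this sketch the items are hashed directly, whereas in the pipeline above they would be the minhash values; BLAKE2b stands in for $g$, bit positions are 0-indexed, and the names are assumptions. The estimator inverts $\Pr[\text{bit} = 1] = (1 - (1 - 2/s)^m)/2$ for a symmetric difference of size $m$, using $\ln(1 - 2/s) \approx -2/s$.

```python
import hashlib
import math
from typing import Hashable, Iterable, List


def g(x: Hashable, s: int) -> int:
    """Stand-in for the hash function g (assumption: BLAKE2b reduced modulo s)."""
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % s


def odd_sketch(items: Iterable[Hashable], s: int) -> List[int]:
    """Bit t is the parity of the number of items hashed to position t."""
    bits = [0] * s
    for x in items:
        bits[g(x, s)] ^= 1
    return bits


def estimate_symmetric_difference(sk_a: List[int], sk_b: List[int]) -> float:
    """Estimate the symmetric difference size from the XOR of two odd sketches.

    Only valid while fewer than half of the XOR bits are set.
    """
    s = len(sk_a)
    z = sum(a ^ b for a, b in zip(sk_a, sk_b))
    return -(s / 2.0) * math.log(1.0 - 2.0 * z / s)


# Hypothetical example: the symmetric difference has exactly 40 elements.
set_a = set(range(0, 1000))
set_b = set(range(20, 1020))
sk_a = odd_sketch(set_a, 1024)
sk_b = odd_sketch(set_b, 1024)
estimate = estimate_symmetric_difference(sk_a, sk_b)
```

Shared elements flip the same bit in both sketches and cancel in the XOR, so the XOR equals the odd sketch of the symmetric difference. This is why the error depends on the difference rather than on the set sizes.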


              Similarity estimation     Similarity search
Bad news      1                         2
Good news     3                         4

Next: part 4 (good news, similarity search).


Minhash for searching [Indyk & Motwani ’98]

• Fix reals $1 > j_1 > j_2 > 0$.

• Query: Given $Q \subseteq U$, find $i$ such that $J(Q, S_i) \ge j_1$, assuming that for all $j \ne i$ we have $J(Q, S_j) \le j_2$.

• Data structure: Choose $s$ such that $j_2^s \approx 1/n$. For each set $S_i$ store $v(S_i)$ in a hash table, with pointer to $S_i$.

• Query: Look up $v(Q)$ in hash table, inspect linked set(s).

• Analysis:
  - Expected number of matching sets, $E\left[\sum_i \mathbf{1}_{v(Q) = v(S_i)}\right] \le 2$.
  - Success probability $j_1^s \approx n^{-\log(j_1)/\log(j_2)}$; repeat until success.
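
The data structure and query can be sketched in Python as follows. Assumptions: salted BLAKE2b stands in for the random hash functions, "repeat until success" is approximated by a fixed number of independent tables (`repetitions`), all sets are nonempty, and the class and function names are hypothetical.

```python
import hashlib
import math
import random
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    return len(a & b) / len(a | b)


def h(r: int, t: int, x: int) -> int:
    """Stand-in for hash function t of repetition r (assumption: salted BLAKE2b)."""
    salt = r.to_bytes(8, "little") + t.to_bytes(8, "little")
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class MinhashIndex:
    def __init__(self, sets: List[Set[int]], j2: float, repetitions: int) -> None:
        self.sets = sets
        # Choose s such that j2^s is roughly 1/n.
        self.s = max(1, math.ceil(math.log(len(sets)) / math.log(1.0 / j2)))
        self.tables: List[Dict[Tuple[int, ...], List[int]]] = []
        for r in range(repetitions):
            table: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
            for i, s_i in enumerate(sets):
                table[self.key(r, s_i)].append(i)   # store v(S_i) with a pointer to S_i
            self.tables.append(table)

    def key(self, r: int, members: Set[int]) -> Tuple[int, ...]:
        """The sketch vector v(S) for repetition r, used as the hash table key."""
        return tuple(min(members, key=lambda x: h(r, t, x)) for t in range(self.s))

    def query(self, q: Set[int], j1: float) -> Optional[int]:
        """Look up v(Q) in each table and verify the linked sets."""
        for r, table in enumerate(self.tables):
            for i in table.get(self.key(r, q), []):
                if jaccard(q, self.sets[i]) >= j1:
                    return i
        return None


# Hypothetical data: 200 random sets, and a query that shares 18 of 20 elements with set 17.
rng = random.Random(0)
data = [set(rng.sample(range(10_000), 20)) for _ in range(200)]
index = MinhashIndex(data, j2=0.1, repetitions=20)
query_set = set(sorted(data[17])[:18]) | {10_000, 10_001}
result = index.query(query_set, j1=0.6)
```

Here $s = 3$ and the query has similarity $18/22 \approx 0.82$ to its near neighbour, so a single table succeeds with probability about $0.82^3 \approx 0.55$ and twenty tables fail only with negligible probability.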


Is min-hash search optimal?

Assume for simplicity that all sets have size $w$.

Can we hope to beat $O(n^{\log(j_1)/\log(j_2)})$?

• [Christiani-P. ’17], [Ahle ’19]: Improvement of the exponent is possible!

• [Chen-Williams ’19], [Stausholm-P.-Thorup ’19]: Assuming the Strong Exponential Time Hypothesis, time $n^{1 - \Omega(1)}$ requires that $\log(j_1)/\log(j_2) < 1 - \Omega(1)$.


ChosenPath algorithm [Christiani-P. ’17]

• Start from the full collection $X = \{S_i \mid i = 1, \dots, n\}$.

• Choose $I \subseteq U$, where $\Pr[k \in I] = \frac{1 + j_1}{2 j_1 w}$.

• Create recursive data structures for sets $X_k = \{S_i \mid k \in S_i\}$ for $k \in I$ until recursion depth $\left\lceil \log(n) / \log\left(\frac{1 + j_2}{2 j_2}\right) \right\rceil$.

• Queries: For each $k \in Q$, recurse in subtree $X_k$ (if it exists), perform exhaustive search at leaves.
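
The recursion can be illustrated with the simplified Python sketch below. It is not the implementation from [Christiani-P. ’17]: it samples a fresh $I$ explicitly at every node of a small integer universe, and it runs a fixed number of independent trees. All names and the toy data are assumptions.

```python
import math
import random
from typing import Dict, List, Optional, Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    return len(a & b) / len(a | b)


class Node:
    def __init__(self, ids: List[int], sets: List[Set[int]], universe: Sequence[int],
                 p: float, depth: int, rng: random.Random) -> None:
        self.ids = ids                       # indices of the sets stored in this subtree
        self.is_leaf = depth == 0
        self.children: Dict[int, "Node"] = {}
        if self.is_leaf:
            return
        for k in universe:
            if rng.random() < p:             # k is placed in I
                x_k = [i for i in ids if k in sets[i]]   # X_k, restricted to this subtree
                if x_k:
                    self.children[k] = Node(x_k, sets, universe, p, depth - 1, rng)

    def query(self, q: Set[int], sets: List[Set[int]], j1: float) -> Optional[int]:
        if self.is_leaf:                     # exhaustive search at leaves
            return next((i for i in self.ids if jaccard(q, sets[i]) >= j1), None)
        for k in q:                          # recurse into X_k for each k in Q, if it exists
            child = self.children.get(k)
            if child is not None:
                hit = child.query(q, sets, j1)
                if hit is not None:
                    return hit
        return None


def build(sets: List[Set[int]], universe: Sequence[int], w: int, j1: float, j2: float,
          repetitions: int, seed: int = 0) -> List[Node]:
    p = (1 + j1) / (2 * j1 * w)
    depth = math.ceil(math.log(len(sets)) / math.log((1 + j2) / (2 * j2)))
    rng = random.Random(seed)
    return [Node(list(range(len(sets))), sets, universe, p, depth, rng)
            for _ in range(repetitions)]


def search(trees: List[Node], q: Set[int], sets: List[Set[int]], j1: float) -> Optional[int]:
    for tree in trees:
        hit = tree.query(q, sets, j1)
        if hit is not None:
            return hit
    return None


# Hypothetical data: 50 random sets of size w = 20; the query shares 18 elements with set 7.
data_rng = random.Random(1)
data = [set(data_rng.sample(range(200), 20)) for _ in range(50)]
trees = build(data, range(200), w=20, j1=0.6, j2=0.1, repetitions=30)
query_set = set(sorted(data[7])[:18]) | {200, 201}
result = search(trees, query_set, data, 0.6)
```

Every reported index is verified at a leaf, so the structure never returns a false positive; randomness only affects whether a near set is found.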


ChosenPath analysis

• Suppose $|S_i \cap Q| / |S_i \cup Q| \ge j_1$. Then the set of “good” recursive calls $k \in I \cap Q \cap S_i$ has expected size at least 1.

• In branching process terminology: expected number of offspring is at least 1 at each level of the recursion.

• Theory of branching processes [Agresti ’74] implies success probability $1/(\lambda + 1)$ at level $\lambda$.

• Repeat $\lambda$ times for constant success probability.

[Figure: sets $S_i$, $S_j$ and $Q$ with elements $x$, $x'$ and $y$.]
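
The first bullet follows from the sampling probability $\Pr[k \in I] = \frac{1 + j_1}{2 j_1 w}$ used by ChosenPath. A short derivation, under the simplifying assumption that all sets (including $Q$) have size $w$:

```latex
\begin{align*}
|S_i \cup Q| &= 2w - |S_i \cap Q|
  && \text{(all sets have size } w\text{)} \\
\frac{|S_i \cap Q|}{2w - |S_i \cap Q|} \ge j_1
  &\iff |S_i \cap Q| \ge \frac{2 j_1 w}{1 + j_1} \\
E\big[\,|I \cap Q \cap S_i|\,\big]
  &= |S_i \cap Q| \cdot \frac{1 + j_1}{2 j_1 w}
   \;\ge\; \frac{2 j_1 w}{1 + j_1} \cdot \frac{1 + j_1}{2 j_1 w} = 1
\end{align*}
```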


Partial match

• Ahle’s supermajorities algorithm [Ahle ’19] combines ChosenPath with an idea of “supermajorities” inspired by angular LSH to get improved results for asymmetric sets, $|Q| \ne |S_i|$.

• Special case is “partial match” queries, $|Q| = j_1 |S_i|$.


Supermajorities for partial match

Consider case where minhash leads to search time $n$.


Beyond set similarity

In many research communities: Hashing = mapping to $\{0,1\}^s$.


Some open problems (similarity estimation)

1. Is there a single sketch that is simultaneously space/variance optimal for low and high Jaccard similarity?

2. Known $s$-bit sketches and estimators for Jaccard similarity are symmetric. Can asymmetry improve precision?

3. How many bits are needed to estimate Jaccard similarity up to factor $1 + \varepsilon$ when $J \to 0$?

bit.ly/2T3laP0


More open problems (similarity search)

4. We wish to choose $h$ from an explicit family of functions such that $\Pr[\mathrm{minhash}_h(S_i) = \mathrm{minhash}_h(S_j)] = (1 \pm \varepsilon) \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$. Is there an explicit such family of size $O(\mathrm{poly}(1/\varepsilon) \log |U|)$?

5. Similarity search in Euclidean/Hamming space can be made faster using data dependent LSH. What kind of speedup can be achieved for set similarity (maybe via embedding)?

6. Is the performance of Ahle’s supermajorities algorithm the best possible for LSH-based partial match?

bit.ly/2T3laP0


That’s not all Folks!

Timothy Chan, Saladi Rahul and Jie Xue. Range closest-pair search in higher dimensions

Boris Aronov, Omrit Filtser, Michael Horton, Matthew Katz and Khadijeh Sheikhan. Efficient Nearest-Neighbor Query and Clustering of Planar Curves

Timothy M. Chan, Yakov Nekrich and Michiel Smid. Orthogonal Range Reporting and Rectangle Stabbing for Fat Rectangles

Matteo Ceccarello, Anne Driemel and Francesco Silvestri. FRESH: Fréchet Similarity with Hashing