Improving minhashing: De Bruijn sequences and primitive roots for counting trailing zeroes
Mark Manasse
Frank McSherry
Kunal Talwar
Microsoft Research
Why things you didn’t think you cared about are actually practical means of bit-twiddling
Minhashing
This all comes out of a perennial quest to make minhashing faster
Minhashing is a technique for sampling an element from a stream which is:
◦ Uniformly random (every element equally likely)
◦ Consistent (similar stream ↔ similar sample)
P(S(A) = S(B)) = |A ∩ B| / |A ∪ B|
Locality sensitive hashing
View documents, photos, music, etc. as a set of features
View the feature set as a high-dimensional vector
Find closely-matching vectors
◦ Most of the time
◦ Proportionally close
In L2, this leads us to cosine similarity (Indyk, Motwani)
◦ A hash function whose bits match proportionally to the cosine of the angle between two vectors
◦ Allows off-line computation of hashes, and faster comparison of hashes
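A rough sketch of the random-hyperplane construction behind such a hash (DIM, NBITS, the planes array, and the name cosine_hash are illustrative choices for this sketch, not from the talk):

#include <stdint.h>

#define DIM   128   /* feature-vector dimension; illustrative */
#define NBITS 64    /* number of hash bits; illustrative */

/* planes[b] is a random hyperplane; in a real system these would come
   from a seeded generator so the hash is repeatable and computable
   off-line. */
static double planes[NBITS][DIM];

/* Bit b records which side of hyperplane b the vector falls on; the
   chance that two vectors agree on a bit depends only on the angle
   between them, which is what makes comparing hash bits a proxy for
   cosine similarity. */
static uint64_t cosine_hash(const double *v) {
    uint64_t h = 0;
    for (int b = 0; b < NBITS; b++) {
        double dot = 0.0;
        for (int d = 0; d < DIM; d++)
            dot += planes[b][d] * v[d];
        if (dot >= 0.0)
            h |= (uint64_t)1 << b;
    }
    return h;
}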
Working in L1: Jaccard similarity
Given two sets A and B, define sim(A, B) = |A ∩ B| / |A ∪ B|: the cardinality of the intersection divided by the cardinality of the union
Proven useful, when applied to the set of phrases in a web page, for testing for near-duplicates
[Venn diagram: sets A and B, with overlap A ∩ B]
Basic idea, and old speedup
Pick ~100 such samples, by picking ~100 random one-to-one functions mapping to a well-ordered range
For each function, select the pre-image of the smallest image under that function
Naively, takes 100 such evaluations per input item (one for each function)
Improve by a factor of almost 8, by:
◦ Choosing a 64-bit function
◦ Carving each 64-bit image into 8 pieces, giving the lead 8 bits of 8 images
◦ Computing more bits, but only when needed
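A minimal sketch of the naive baseline (hash64 here is a generic splitmix-style mixer standing in for the random one-to-one functions, not the authors’ choice; the factor-of-8 speedup is not shown):

#include <stdint.h>

#define NSAMPLES 100   /* ~100 independent samples, as in the talk */

/* A keyed 64-bit mixer standing in for the random one-to-one functions;
   for a fixed seed it is a bijection on 64-bit inputs. */
static uint64_t hash64(uint64_t item, uint64_t seed) {
    uint64_t z = item + seed * 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Naive minhash: for each of the NSAMPLES functions, keep the item whose
   image is smallest -- 100 hash evaluations per input item. */
static void minhash(const uint64_t *items, int nitems,
                    uint64_t sample[NSAMPLES]) {
    uint64_t best[NSAMPLES];
    for (int s = 0; s < NSAMPLES; s++)
        best[s] = UINT64_MAX;
    for (int i = 0; i < nitems; i++)
        for (int s = 0; s < NSAMPLES; s++) {
            uint64_t h = hash64(items[i], (uint64_t)s);
            if (h < best[s]) { best[s] = h; sample[s] = items[i]; }
        }
}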
Why this approximates Jaccard
Assume a uniformly chosen random function, mapping injectively to infinite binary sequences.
Order sequences lexicographically.
Given sets A and B, and a random function f, the argmin of f over A ∪ B is certain to be an element of A or B, and lies in the intersection with probability exactly the Jaccard coefficient.
Sampling requires uniformity and consistency:
◦ Uniformity, so that probability mass is spread evenly
◦ Consistency, so that small perturbations don’t matter
New idea to speed up and reduce collisions
Carve 64 bits into an expected 32 pieces, by dividing at 1’s into runs of the form 10…0
◦ 32, because we expect half of the bits to be 1
◦ Better yet, the number of maximal-length runs is bounded by 2, independent of the length of the input
But, how do we efficiently divide at 1’s?
Dividing at 1’s
Could go bit at a time, shifting and testing
◦ Lots of missed branch predictions, lots of tests
Could look at low-order byte; if zero, shift by 8, if not do table look-up
Almost certainly good enough, in practice
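A sketch of that byte-at-a-time table approach (names are illustrative):

#include <stdint.h>

/* byte_tz[b] = number of trailing zeroes in the byte b, for b > 0 */
static unsigned char byte_tz[256];

static void init_byte_tz(void) {
    for (int b = 1; b < 256; b++) {
        unsigned char tz = 0;
        while (((b >> tz) & 1) == 0)
            tz++;
        byte_tz[b] = tz;
    }
}

/* Trailing zeroes of a nonzero x: skip whole zero bytes, then finish with
   one 256-entry table lookup. */
static unsigned trailing_zeroes_bytewise(uint64_t x) {
    unsigned shift = 0;
    while ((x & 0xff) == 0) {   /* at most 7 iterations when x != 0 */
        x >>= 8;
        shift += 8;
    }
    return shift + byte_tz[x & 0xff];
}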
But we can do mathematically better….
Dividing at 1’s smoothly
Reduce to a simpler problem: taking the logarithm base 2 of a power of 2
Given x in two’s complement, x & -x is the smallest power of 2 in the binary expansion of x
◦ Works because -x = ~x + 1, and ~x is 01…1 below the smallest power of 2 in x
◦ x & (x-1) removes the least power of 2 (not useful here)
◦ x ^ -x is all ones above the least power of 2
◦ x ^ (x-1) is all ones at and below the least power of 2
So all 64-bit numbers can be reduced to only 65 possibilities, depending on least power of 2
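These identities are easy to sanity-check on a small example (the value 0xb4 is arbitrary):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x = 0xb4;  /* 1011 0100: the least power of 2 is 4 */
    printf("%llx\n", (unsigned long long)(x & (~x + 1))); /* 4: isolates the least power of 2 */
    printf("%llx\n", (unsigned long long)(x & (x - 1)));  /* b0: removes the least power of 2 */
    printf("%llx\n", (unsigned long long)(x ^ (~x + 1))); /* fffffffffffffff8: all ones above it */
    printf("%llx\n", (unsigned long long)(x ^ (x - 1)));  /* 7: all ones at and below it */
    return 0;
}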
How do we figure out which one?
Naïve: binary search in a sorted table
Better: perfect hashing
Using x & -x, all 65 possible values are powers of 2, or 0
◦ 2 is a primitive root modulo 67 (kudos to Peter Montgomery for noting this)
◦ So, the powers of 2 generate the multiplicative group (1, 2, …, 66) modulo 67
◦ That is, the first 66 powers of 2 are distinct mod 67
So, take (x & -x) % 67, and look in a table
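A sketch of that perfect-hash table (names are illustrative; note the % 67 costs an integer division):

#include <stdint.h>

static unsigned char mod67_table[67];

/* 2 is a primitive root mod 67, so 2^i % 67 is distinct for i = 0..63;
   index 0 can only come from x == 0, handled by convention. */
static void init_mod67(void) {
    mod67_table[0] = 64;                          /* convention for x == 0 */
    for (int i = 0; i < 64; i++)
        mod67_table[((uint64_t)1 << i) % 67] = (unsigned char)i;
}

/* Trailing zeroes of x via the mod-67 perfect hash. */
static unsigned trailing_zeroes_mod67(uint64_t x) {
    return mod67_table[(x & (~x + 1)) % 67];
}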
Perfect, maybe. But optimal?
Leiserson, Prokop, and Randall noticed that De Bruijn sequences are even better
Like Gray codes, only folded
◦ De Bruijn sequences are vaguely like Gray codes
◦ A Hamiltonian circuit of the hypercube is a Gray code; a Hamiltonian circuit of the De Bruijn graph is ….
De Bruijn sequences allow candy necklaces where any sequence of k candies occurs at exactly one starting point in clockwise order
Always exist (even generalized to higher dimension, but we don’t need that)
(00011101)* is such a sequence for 3-bit binary
More de Bruijn
Any rotation of a De Bruijn sequence is De Bruijn
The reversal of a De Bruijn sequence is De Bruijn
Canonicalize sequences by rotating to the least
Three canonical sequences for binary sequences of length 6; one is its own reversal (6 is even)
Starting with 6 zeroes, the first five bits needed in rotation are zero, so shift is good enough
Just look at the high-order 6 bits after multiplying by the constant DB = 0x0218a392cd3d5dbfUL
Doesn’t handle 0, just powers of 2
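Distilled to the single lookup (the full loop on the next slide uses exactly this step; function names are illustrative):

#include <stdint.h>

#define DB 0x0218a392cd3d5dbfULL

static unsigned short dblookup[64];

/* For each power of 2, the product with DB puts a distinct 6-bit pattern
   in the top bits; record which power produced which pattern. */
static void init_dblookup(void) {
    for (int i = 0; i < 64; i++)
        dblookup[(DB << i) >> 58] = (unsigned short)i;
}

/* Trailing zeroes of a nonzero x: isolate the least power of 2, multiply
   by DB, look at the high-order 6 bits.  Doesn't handle x == 0. */
static unsigned trailing_zeroes_db(uint64_t x) {
    return dblookup[((x & (~x + 1)) * DB) >> 58];
}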
Few branch misses
#define DB 0x0218a392cd3d5dbfUL
#define NRES 100

unsigned short dblookup[64];      // initialized to dblookup[(DB << i) >> 58] = i
unsigned result[NRES + 64];       // answers plus spill-over space
unsigned n = 0, rb = 0, elog = 0; // quantity produced, remaining bits of randomness, left-over zeroes
unsigned long long cur;

while (n < NRES) {
    cur = newrandom(key);         // fresh 64 random bits
    elog += rb;                   // all-zero bits left over from the previous word carry forward
    rb = 64;
    while (cur != 0) {
        unsigned short log = dblookup[((cur & (1 + ~cur)) * DB) >> 58];
        cur >>= log + 1;          // drop the trailing zeroes and the 1 that ends the run
        rb -= log + 1;
        result[n++] = log + elog;
        elog = 0;
    }
}
Selecting randomly and repeatably from weighted distributions
Mark Manasse
Frank McSherry
Kunal Talwar
Microsoft Research
Jaccard, extended to multisets
Jaccard, as defined, doesn’t work when the number of copies of an element is a distinguishing factor
If we generalize a little, we get the sum of the lesser numbers of occurrences divided by the sum of the greater
◦ Still works even for non-integral counts
◦ Allows for weighting of elements by importance
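Written out as code, assuming the two multisets are given as aligned count vectors (an assumption of this sketch, not the talk):

/* Generalized Jaccard for two multisets given as aligned count vectors:
   sum of the lesser counts over the sum of the greater.  Counts may be
   fractional, so importance weights work the same way. */
static double multiset_jaccard(const double *a, const double *b, int nfeatures) {
    double lesser = 0.0, greater = 0.0;
    for (int i = 0; i < nfeatures; i++) {
        lesser  += a[i] < b[i] ? a[i] : b[i];
        greater += a[i] > b[i] ? a[i] : b[i];
    }
    return greater > 0.0 ? lesser / greater : 1.0;
}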
Same as before, for integer counts, if we replace items with item×instance
◦ <cat, cat, dog> → {cat1, cat2, dog1}
Sample is an (item, instance) pair, not just an item
Sampling, instead of pairwise computation, for sets
To allow for faster computation, we estimate similarity by sampling
Pick some number of samples, where for any sets A and B, each sample agrees with probability equal to sim(A,B)
Count the average number of matching samples
To get a good sample, pick a random one-to-one mapping of set elements to a well-ordering, and pick preimage of the (unique!) smallest.
Multiset Jaccard, one implementation
Given a good way to approximate Jaccard for sets, we can convert a multiset (but not a distribution) into a set by replacing 100 occurrences of “cat” by “cat1”, “cat2”, …, “cat100”.
Requires (if streaming) remembering how many prior occurrences of elements have been seen.
Multiset reduction considered inefficient
Previous technique is linear in input size, if input is {cat, cat, cat, dog}
Exponential in input size if input is (cat, 100)
Probability to the rescue!
◦ If our random map is to a real number between 0 and 1, we don’t need to generate 100 random values to find the smallest
◦ For a single uniform value, P(X > x) = 1 - x
◦ For the minimum of k values, P(min_k > x) = (1 - x)^k
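So one uniform draw suffices to sample the minimum of k draws, by inverting that CDF (drand48 is a stand-in for the keyed, repeatable generator a real implementation needs):

#include <math.h>
#include <stdlib.h>

/* Sample the minimum of k i.i.d. Uniform(0,1) values with one draw, by
   inverting P(min <= x) = 1 - (1-x)^k.  k need not be an integer. */
static double min_of_k_uniforms(double k) {
    double u = drand48();
    return 1.0 - pow(1.0 - u, 1.0 / k);
}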
Not so fast!
That’s the probability, but not one that lets us pick samples to test for agreement
Not good enough to pick an element, have to pick an instance of the element
◦ (cat, 100) and (cat, 200) are only 0.5 similar
Has to be repeatable
◦ If cat7 is chosen from (cat, 100), we mustn’t choose cat73 from (cat, 200) (but cat104 would be OK)
Properties for repeatable sampling
A sampling process must pick an element of the input
◦ For discrete things, an integer occurrence at most equal to the number of occurrences
◦ For non-negative real-valued things, a real value at most equal to the input
Must pick uniformly by weight
Must pick the same sample from any subset containing the sample
Uniformity
To be uniform in a column, we have to pick a number smaller than a given number
Variant of reservoir sampling suffices
◦ Given n, pick a random number below n uniformly
◦ Given that number, pick a random number below that, and repeat
◦ To make this repeatable & expected constant time, break into powers of 2
◦ Given n, round up to the next higher power of 2, repeat the downward process until below half
[Figure: power-of-2 levels 0.5, 1.0, 2.0, 4.0, 8.0, with chain samples S(8,1) = 7.632…, S(8,2) = 4.918…, S(4,1) = 3.054…, S(1,1) = 0.783…, and n = 3.724…]
For this choice of n, S(4,1) is the downward selection; for slightly smaller n, S(1,1) would be selected
Scaling up
Same process, but we have to first round up, by finding smallest chosen number above n
First check the power of 2 range containing n
The next level up contains something if the first selected number below 2^(k+1) is > 2^k
If this happens, take smallest number in range
Otherwise, repeat at next up power of 2
[Figure: the same power-of-2 levels and chain samples as before, with n = 3.724…]
For this choice of n, S(8,2) is the upward selection
Picking a column
Given a column scaled up to n, we need to construct the right distribution for the second-smallest number (assuming the nth is the smallest)
In the discrete case (if we consider only integers as valid choices), the density of the second smallest at x is proportional to (1-x)·x^(n-1), so CDF = (n+1)x^n - nx^(n+1) = x^n(1 - n(x-1))
In the continuous case (which we can get by scaling the discrete case), CDF = x^n(1 - n·ln x)
Pick a random luckiness factor for a column, p, solve for x in CDF = p by iteration
Pick column with smallest x value
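One way to do that iteration is bisection, since this CDF is increasing on (0, 1]; the talk doesn’t fix a method, so this is just a sketch:

#include <math.h>

/* Solve CDF(x) = x^n * (1 - n*ln x) = p for x in (0,1].  The derivative
   is -n^2 * x^(n-1) * ln x >= 0 there, so the CDF is increasing and
   plain bisection converges. */
static double invert_column_cdf(double n, double p) {
    double lo = 0.0, hi = 1.0;
    for (int i = 0; i < 64; i++) {
        double mid = 0.5 * (lo + hi);
        if (pow(mid, n) * (1.0 - n * log(mid)) < p)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}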
Reducing randomness and improving numerical accuracy
We can just use one bit to decide if a power of 2 range has any selected values
So use a single random value to decide which of 64 powers of 2 are useful
◦ Either by computing 64 samples in parallel, or
◦ Computing 64 intervals at once
Use logarithms of CDF rather than CDF to keep things reasonable; look at 1-x instead of x to keep log away from 0
Partially evaluate convergence to save time, and compare preimages to CDF when possible