Latent Semantic Indexing (mapping onto a smaller space of latent concepts)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 18
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances? Now: O(nm). Then: O(km + kn), where k << n, m.
Two methods: “Latent semantic indexing” and random projection.
A sketch
LSI is data-dependent: create a k-dim subspace by eliminating redundant axes, pulling together “related” axes – hopefully car and automobile.
Random projection is data-independent: choose a k-dim subspace that guarantees good stretching properties, with high probability, between each pair of points.
What about polysemy?
Notions from linear algebra
Matrix A, vector v. Matrix transpose (Aᵗ). Matrix product. Rank. Eigenvalues and eigenvectors v: Av = λv.
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n.
Define the term-term correlation matrix T = AAᵗ. T is a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
Define the doc-doc correlation matrix D = AᵗA. D is a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.
A’s decomposition
There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product).
It turns out that A = P Σ Rᵗ, where Σ is a diagonal matrix with the singular values of A (the square roots of the eigenvalues of T = AAᵗ) in decreasing order.
A = P Σ Rᵗ
(m × n) = (m × r)(r × r)(r × n)
For some k << r, zero out all but the k biggest entries in Σ [the choice of k is crucial]. Denote by Σk this new version of Σ, having rank k. Typically k is about 100, while r (A’s rank) is > 10,000.
Ak = P Σk Rᵗ

Dimensionality reduction

In Ak = P Σk Rᵗ, the columns of P beyond the k-th and the rows of Rᵗ beyond the k-th are useless, due to the 0-columns/0-rows of Σk. Hence Ak can be computed as the product of an m × k matrix and a k × n matrix.
Guarantee
Ak is a pretty good approximation to A: relative distances are (approximately) preserved.
Of all m × n matrices of rank k, Ak is the best approximation to A wrt the following measures:
min_{B: rank(B)=k} ||A − B||₂ = ||A − Ak||₂ = σ_{k+1}
min_{B: rank(B)=k} ||A − B||F² = ||A − Ak||F² = σ_{k+1}² + σ_{k+2}² + … + σ_r²
where the σ_i are the singular values, and the Frobenius norm is ||A||F² = σ₁² + σ₂² + … + σ_r².
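These optimality bounds can be checked numerically. A small sketch with NumPy (the toy matrix sizes and the choice k = 2 are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))              # toy term-doc matrix, m=8, n=6

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# spectral error equals the (k+1)-th singular value
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])
# squared Frobenius error equals the sum of the trailing squared singular values
assert np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, np.sum(s[k:] ** 2))
```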
Reduction
Xk = Σk Rᵗ is the doc-matrix, k × n: docs reduced to k dims.
Take the doc-correlation matrix: it is D = AᵗA = (P Σ Rᵗ)ᵗ (P Σ Rᵗ) = (Σ Rᵗ)ᵗ (Σ Rᵗ). Approximating Σ with Σk, we get AᵗA ≈ Xkᵗ Xk (both are n × n matrices).
We use Xk to define how to project A and q: from Xk = Σk Rᵗ, substituting Rᵗ = Σ⁻¹ Pᵗ A we get Xk = Σk Σ⁻¹ Pᵗ A = Pkᵗ A. In fact, Σk Σ⁻¹ Pᵗ = Pkᵗ, which is a k × m matrix.
This means that to reduce a doc/query vector it is enough to multiply it by Pkᵗ.
Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn).
(R, P are formed by orthonormal eigenvectors of the matrices D, T.)
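The whole reduction can be sketched with NumPy (the toy matrix, k = 3, and the variable names are illustrative): docs and the query are projected with Pkᵗ, and cosine similarity is computed in the k-dim concept space.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((10, 5))            # toy matrix: m=10 terms x n=5 docs
q = rng.random(10)                 # a query vector in term space

P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 3
Pk = P[:, :k]                      # m x k: top-k left singular vectors

Xk = Pk.T @ A                      # k x n: docs projected onto k concepts
q_proj = Pk.T @ q                  # k-dim query

# cosine similarity in the reduced space: O(kn) per query
sims = (Xk.T @ q_proj) / (np.linalg.norm(Xk, axis=0) * np.linalg.norm(q_proj))
best_doc = int(np.argmax(sims))    # index of the most similar document
```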
Which are the concepts?
The c-th concept = the c-th row of Pkᵗ (which is k × m). Denote it by Pkᵗ[c], whose size is m = #terms.
Pkᵗ[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d’j = Pkᵗ dj ; d’j[c] = strength of concept c in dj.
Projected query: q’ = Pkᵗ q ; q’[c] = strength of concept c in q.
Random Projections
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
An interesting math result
Lemma (Johnson-Lindenstrauss, ’82)
Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IRᵏ such that for every pair of points u, v in P it holds:
(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²
where k = O(ε⁻² log n).
f() is called a JL-embedding. Setting v = 0, we also get a bound on f(u)’s stretching!
What about the cosine distance? [The slide bounds it by applying the stretching guarantees to f(u) and f(v) and substituting the formula above.]
How to compute a JL-embedding?
Set R = (r_{i,j}) to be a random m × k matrix, where the components are independent random variables with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g., standard Gaussians, or ±1 each with probability 1/2).
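A quick numerical sketch of such an embedding with NumPy, using standard Gaussian entries and the usual 1/√k scaling (the dimensions m, k are illustrative choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 10_000, 400                 # original and reduced dimensionality
u = rng.standard_normal(m)
v = rng.standard_normal(m)

# entries with E[r_ij] = 0 and Var[r_ij] = 1 (standard Gaussians here)
R = rng.standard_normal((k, m))
f = lambda x: (R @ x) / np.sqrt(k) # the JL-style embedding into k dims

d_orig = np.linalg.norm(u - v) ** 2
d_proj = np.linalg.norm(f(u) - f(v)) ** 2
ratio = d_proj / d_orig            # concentrates around 1 for large k
```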
Finally...
Random projections hide large constants: k ≈ (1/ε)² · log n, so it may be large… But they are simple and fast to compute.
LSI is intuitive, may scale to any k, and is optimal under various metrics; but it is costly to compute.
Document duplication (exact or approximate)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents
The web is full of duplicated content: few exact duplicates, many cases of near-duplicates. E.g., the last-modified date is the only difference between two copies of a page.
Sec. 19.6
Natural Approaches
Fingerprinting: only works for exact matches, and can be slow. Checksum – no worst-case collision-probability guarantees. MD5 – cryptographically-secure string hashes, but relatively slow.
Edit-distance metric for approximate string matching: expensive even for one pair of strings; impossible for 10³² web documents.
Random sampling: sample substrings (phrases, sentences, etc.); the hope is that similar documents yield similar samples. But even samples of the same document will differ.
Karp-Rabin’s scheme: a rolling hash splits the doc into many pieces; an algebraic technique – arithmetic on primes; efficient, with other nice properties…
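The rolling-hash idea can be sketched as follows (the base and prime are illustrative choices, not the lecture’s constants): each window’s hash is derived from the previous one in O(1).

```python
# a minimal Karp-Rabin-style rolling hash over byte strings
B, P = 256, (1 << 61) - 1               # base and a large Mersenne prime

def rolling_hashes(data: bytes, w: int):
    """Hashes of every length-w window, each in O(1) after the first."""
    h = 0
    for b in data[:w]:                  # hash of the first window
        h = (h * B + b) % P
    out = [h]
    top = pow(B, w - 1, P)              # weight of the outgoing byte
    for i in range(w, len(data)):
        h = ((h - data[i - w] * top) * B + data[i]) % P
        out.append(h)
    return out

hs = rolling_hashes(b"abracadabra", 4)
# identical windows get identical hashes: "abra" occurs at positions 0 and 7
assert hs[0] == hs[7]
```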
Exact-Duplicate Detection
Near-Duplicate Detection
Problem: given a large collection of documents, identify the near-duplicate documents.
Web search engines face a proliferation of near-duplicate documents: legitimate – mirrors, local copies, updates, …; malicious – spam, spider-traps, dynamic URLs, …; mistaken – spider errors.
30% of web-pages are near-duplicates [1997].
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible.
Stream processing: once the sketch is computed, the source is unavailable.
Error guarantees: at this problem scale, small biases have a large impact; we need formal guarantees – heuristics will not do.
Basic Idea [Broder 1997]
Shingling: dissect each document into q-grams (shingles), represent documents by their shingle-sets, and reduce the problem to set intersection.
[Jaccard] They are near-duplicates if their shingle-sets intersect enough.
Similarity of Documents
Doc A → shingle-set SA ; Doc B → shingle-set SB
• Jaccard measure – similarity of SA, SB:
sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
• Claim: A & B are near-duplicates if sim(SA, SB) is high
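A tiny sketch of shingling and the Jaccard measure (q = 4 and the example strings are made up for illustration):

```python
def shingles(text: str, q: int = 4) -> set:
    """Set of character q-grams (shingles) of a document."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(sa: set, sb: set) -> float:
    """Jaccard similarity |sa & sb| / |sa | sb| of two shingle-sets."""
    return len(sa & sb) / len(sa | sb)

a = shingles("the cat sat on the mat")
b = shingles("the cat sat on the hat")
c = shingles("completely different text")
assert jaccard(a, b) > jaccard(a, c)    # near-duplicates score higher
```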
We need to cope with “Set Intersection”: fingerprints of shingles (for space/time efficiency), and min-hash to estimate intersection sizes (for further time and space efficiency).
Doc → shingling → Multiset of Shingles → fingerprint → Multiset of Fingerprints
(Documents become sets of 64-bit fingerprints.)
Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• Fingerprint space [0, …, U−1]
• In practice, use 64-bit fingerprints, i.e., U = 2⁶⁴
• Prob[collision] ≈ (8q)/2⁶⁴ << 1
This reduces the space for storing the multi-sets, and the time to intersect them, but…
Speeding-up: Sketch of a document
Intersecting shingle-sets is too costly.
Create a “sketch vector” (of size ~200) for each document, from its shingle-set.
Documents that share ≥ t (say 80%) corresponding vector elements are near-duplicates.
Sketching by Min-Hashing
Consider SA, SB ⊆ P.
Pick a random permutation π of P (such as x → ax + b mod |P|).
Define α = π⁻¹(min{π(SA)}) and β = π⁻¹(min{π(SB)}): the minimal elements of SA and SB under the permutation π.
Lemma: P[α = β] = |SA ∩ SB| / |SA ∪ SB|
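The lemma suggests an estimator: repeat with many random permutations and count how often the two minima agree. A sketch that approximates permutations with random linear hash functions and compares min hash values directly (all parameters and the toy sets are illustrative):

```python
import random

def minhash_sig(s: set, perms: list) -> list:
    """Min-hash signature: the min of each random linear hash over the set.
    (Approximates true permutations; hash collisions are negligible here.)"""
    U = (1 << 61) - 1                   # a large prime as hash modulus
    return [min((a * x + b) % U for x in s) for a, b in perms]

random.seed(0)
perms = [(random.randrange(1, 1 << 60), random.randrange(1 << 60))
         for _ in range(200)]

sa = set(range(0, 80))                  # toy fingerprint sets:
sb = set(range(40, 120))                # true Jaccard = 40/120 = 1/3

siga = minhash_sig(sa, perms)
sigb = minhash_sig(sb, perms)
# fraction of agreeing coordinates estimates the Jaccard similarity
est = sum(x == y for x, y in zip(siga, sigb)) / len(perms)
```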
Strengthening it…
Similarity sketch sk(A) = the k minimal elements under π(SA).
(Is k fixed, or a fixed ratio of |SA|, |SB|? We might also take k permutations and the min of each.)
Similarity sketches sk(A): a succinct representation of the fingerprint sets SA, allowing efficient estimation of sim(SA, SB). The basic idea is to use min-hash of the fingerprints. Note: we can reduce the variance by using a larger k.
Computing Sketch[i] for Doc1
For Document 1: start with the 64-bit fingerprints f(shingles) on the number line [0, 2⁶⁴); permute the number line with πᵢ; pick the min value.
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Place the 64-bit fingerprints of Document 1 and Document 2 on the number line [0, 2⁶⁴) and ask: are their minima under πᵢ equal? Test this for 200 random permutations π₁, π₂, …, π₂₀₀.
However…
α = β iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., it lies in the intersection).
Claim: this happens with probability Size_of_intersection / Size_of_union.
Sum up…
Brute-force: compare sk(A) vs. sk(B) for all pairs of documents A and B.
Locality-sensitive hashing (LSH): compute sk(A) for each document A, then use LSH over all sketches; briefly:
Take h elements of sk(A) as an ID (may induce false positives)
Create t IDs (to reduce the false negatives)
If one ID matches another (wrt the same h-selection), then the corresponding docs are probably near-duplicates; hence compare them.
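The banding step above can be sketched as follows (the toy signatures and the values h = 3, t = 2 are illustrative): each band of h signature elements becomes an ID, and documents sharing an ID in the same band become candidate pairs.

```python
from collections import defaultdict

def lsh_candidates(sigs: dict, h: int, t: int):
    """Split each length-(h*t) signature into t bands of h elements;
    docs sharing a band ID become candidate near-duplicate pairs."""
    buckets = defaultdict(set)
    for doc, sig in sigs.items():
        for band in range(t):
            key = (band, tuple(sig[band * h:(band + 1) * h]))
            buckets[key].add(doc)
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs

sigs = {"A": [1, 2, 3, 4, 5, 6],
        "B": [1, 2, 3, 9, 9, 9],        # shares its first band with A
        "C": [7, 7, 7, 7, 7, 7]}
cands = lsh_candidates(sigs, h=3, t=2)
assert ("A", "B") in cands and ("A", "C") not in cands
```

Candidate pairs (which may include false positives) are then verified by comparing the full sketches.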
Search Engines
“Semantic” searches?
“Diego Maradona won against Mexico”
Dictionary of terms: against, Diego, Maradona, Mexico, won
Term vector: (2.2, 5.1, 9.1, 1.0, 0.1)
Similarity(v, w) ≈ cos(α), the angle between the vectors v and w in the space of terms t1, t2, t3.
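The cosine measure can be sketched directly (the weights below are illustrative, not taken from the slides):

```python
import numpy as np

def cos_sim(v, w):
    """Cosine of the angle between two term vectors."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# illustrative tf-idf-like weights over a 5-term dictionary
v = np.array([2.2, 5.1, 9.1, 1.0, 0.1])
w = np.array([2.0, 4.8, 9.0, 1.2, 0.0])
assert cos_sim(v, w) > 0.99             # nearly parallel vectors
```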
Vector Space model
Classical approach, mainly term-based: polysemy and synonymy issues.
A new approach: Massive graphs of entities and relations
May 2012
A typical issue: polysemy
“the paparazzi photographed the star”
“the astronomer photographed the star”
Another issue: synonymy
“He is using Microsoft’s browser”
“He is a fan of Internet Explorer”
http://tagme.di.unipi.it
“Diego Maradona won against Mexico”
won → Korean won, win-loss record, only won, …
Diego Maradona → Diego A. Maradona, Diego Maradona jr., Maradona Stadium, Maradona Film, …
Mexico → Mexico nation, Mexico state, Mexico football team, Mexico baseball team, …
TAGME pipeline: PARSING → DISAMBIGUATION (by a voting scheme) → PRUNING (2 simple features, ρ score; low-score mentions get no annotation).
“obama asks iran for RQ-170 sentinel drone back” → Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel
“us president issues Ahmadinejad ultimatum” → President of the United States, Mahmoud Ahmadinejad, Ultimatum
Why is it more powerful?
Text as a sub-graph of topics
Mahmoud Ahmadinejad
Ultimatum
RQ-170 drone
President of the United States
Barack Obama
Iran
Any relatedness measure over the graph can be applied, e.g. [Milne & Witten, 2008].
Graph analysis allows us to find similarities between texts and entities even if they do not match syntactically (i.e., at the concept level).
Search Results Clustering
Jaguar Cars Panthera Onca Mac OS X Atari Jaguar Jacksonville Jags Fender Jaguar …
TOPICS
Paper at ACM WSDM 2012
Paper at ECIR 2012
Paper at IEEE Software 2012
Please design your killer app...
http://acube.di.unipi.it/tagme
Releasing open-source…