Latent Semantic Indexing (mapping onto a smaller space of latent concepts)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 18
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances? Now: O(nm). Then: O(km + kn), where k << n, m.
Two methods: “Latent semantic indexing” and random projection.
A sketch
LSI is data-dependent: create a k-dim subspace by eliminating redundant axes, pulling together “related” axes – hopefully car and automobile.
Random projection is data-independent: choose a k-dim subspace that guarantees good stretching properties, with high probability, between each pair of points.
What about polysemy?
Notions from linear algebra
Matrix A, vector v. Matrix transpose (Aᵗ). Matrix product. Rank. Eigenvalues and eigenvectors v: Av = λv.
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n.
Define the term-term correlation matrix T = AAᵗ. T is a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
Define the doc-doc correlation matrix D = AᵗA. D is a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.
A’s decomposition
There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product).
It turns out that A = P Σ Rᵗ, where Σ is a diagonal matrix with the singular values of A (the square roots of the eigenvalues of T = AAᵗ) in decreasing order.
A = P Σ Rᵗ
(m × n) = (m × r)(r × r)(r × n)
For some k << r, zero out all but the k biggest entries in Σ [the choice of k is crucial]. Denote by Σk this new version of Σ, having rank k. Typically k is about 100, while r (A’s rank) is > 10,000.
Ak = P Σk Rᵗ

Dimensionality reduction

In Ak = P Σk Rᵗ, the columns of P beyond the k-th and the rows of Rᵗ beyond the k-th are useless, due to the 0-columns/0-rows of Σk. Hence Ak can be computed as the product of an m × k matrix and a k × n matrix.
Guarantee
Ak is a pretty good approximation to A: relative distances are (approximately) preserved.
Of all m × n matrices of rank k, Ak is the best approximation to A wrt the following measures:
min_{B: rank(B)=k} ||A − B||₂ = ||A − Ak||₂ = σ_{k+1}
min_{B: rank(B)=k} ||A − B||F² = ||A − Ak||F² = σ_{k+1}² + σ_{k+2}² + … + σ_r²
where the σ_i are the singular values, and the Frobenius norm is ||A||F² = σ₁² + σ₂² + … + σ_r².
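These optimality bounds can be checked numerically. A small sketch with NumPy (the toy matrix sizes and the choice k = 2 are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))              # toy term-doc matrix, m=8, n=6

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# spectral error equals the (k+1)-th singular value
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])
# squared Frobenius error equals the sum of the trailing squared singular values
assert np.isclose(np.linalg.norm(A - Ak, 'fro') ** 2, np.sum(s[k:] ** 2))
```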
Reduction
Xk = Σk Rᵗ is the doc-matrix, k × n: docs reduced to k dims.
Take the doc-correlation matrix: it is D = AᵗA = (P Σ Rᵗ)ᵗ (P Σ Rᵗ) = (Σ Rᵗ)ᵗ (Σ Rᵗ). Approximating Σ with Σk, we get AᵗA ≈ Xkᵗ Xk (both are n × n matrices).
We use Xk to define how to project A and q: from Xk = Σk Rᵗ, substituting Rᵗ = Σ⁻¹ Pᵗ A we get Xk = Σk Σ⁻¹ Pᵗ A = Pkᵗ A. In fact, Σk Σ⁻¹ Pᵗ = Pkᵗ, which is a k × m matrix.
This means that to reduce a doc/query vector it is enough to multiply it by Pkᵗ.
Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn).
(R, P are formed by orthonormal eigenvectors of the matrices D, T.)
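The whole reduction can be sketched with NumPy (the toy matrix, k = 3, and the variable names are illustrative): docs and the query are projected with Pkᵗ, and cosine similarity is computed in the k-dim concept space.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((10, 5))            # toy matrix: m=10 terms x n=5 docs
q = rng.random(10)                 # a query vector in term space

P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 3
Pk = P[:, :k]                      # m x k: top-k left singular vectors

Xk = Pk.T @ A                      # k x n: docs projected onto k concepts
q_proj = Pk.T @ q                  # k-dim query

# cosine similarity in the reduced space: O(kn) per query
sims = (Xk.T @ q_proj) / (np.linalg.norm(Xk, axis=0) * np.linalg.norm(q_proj))
best_doc = int(np.argmax(sims))    # index of the most similar document
```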
Which are the concepts?
The c-th concept = the c-th row of Pkᵗ (which is k × m). Denote it by Pkᵗ[c], whose size is m = #terms.
Pkᵗ[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d’j = Pkᵗ dj ; d’j[c] = strength of concept c in dj.
Projected query: q’ = Pkᵗ q ; q’[c] = strength of concept c in q.
Random Projections
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
An interesting math result
Lemma (Johnson-Lindenstrauss, ’82)
Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IRᵏ such that for every pair of points u, v in P it holds:
(1 − ε) ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) ||u − v||²
where k = O(ε⁻² log n).
f() is called a JL-embedding. Setting v = 0, we also get a bound on f(u)’s stretching!
What about the cosine distance? [The slide bounds it by applying the stretching guarantees to f(u) and f(v) and substituting the formula above.]
How to compute a JL-embedding?
Set R = (r_{i,j}) to be a random m × k matrix, where the components are independent random variables with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g., standard Gaussians, or ±1 each with probability 1/2).
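A quick numerical sketch of such an embedding with NumPy, using standard Gaussian entries and the usual 1/√k scaling (the dimensions m, k are illustrative choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 10_000, 400                 # original and reduced dimensionality
u = rng.standard_normal(m)
v = rng.standard_normal(m)

# entries with E[r_ij] = 0 and Var[r_ij] = 1 (standard Gaussians here)
R = rng.standard_normal((k, m))
f = lambda x: (R @ x) / np.sqrt(k) # the JL-style embedding into k dims

d_orig = np.linalg.norm(u - v) ** 2
d_proj = np.linalg.norm(f(u) - f(v)) ** 2
ratio = d_proj / d_orig            # concentrates around 1 for large k
```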
Finally...
Random projections hide large constants: k ≈ (1/ε)² · log n, so it may be large… But they are simple and fast to compute.
LSI is intuitive, may scale to any k, and is optimal under various metrics; but it is costly to compute.
Document duplication (exact or approximate)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents
The web is full of duplicated content: few exact duplicates, many cases of near-duplicates. E.g., the last-modified date is the only difference between two copies of a page.
Sec. 19.6
Natural Approaches
Fingerprinting: only works for exact matches, and can be slow. Checksum – no worst-case collision-probability guarantees. MD5 – cryptographically-secure string hashes, but relatively slow.
Edit-distance metric for approximate string matching: expensive even for one pair of strings; impossible for 10³² web documents.
Random sampling: sample substrings (phrases, sentences, etc.); the hope is that similar documents yield similar samples. But even samples of the same document will differ.
Karp-Rabin’s scheme: a rolling hash splits the doc into many pieces; an algebraic technique – arithmetic on primes; efficient, with other nice properties…
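The rolling-hash idea can be sketched as follows (the base and prime are illustrative choices, not the lecture’s constants): each window’s hash is derived from the previous one in O(1).

```python
# a minimal Karp-Rabin-style rolling hash over byte strings
B, P = 256, (1 << 61) - 1               # base and a large Mersenne prime

def rolling_hashes(data: bytes, w: int):
    """Hashes of every length-w window, each in O(1) after the first."""
    h = 0
    for b in data[:w]:                  # hash of the first window
        h = (h * B + b) % P
    out = [h]
    top = pow(B, w - 1, P)              # weight of the outgoing byte
    for i in range(w, len(data)):
        h = ((h - data[i - w] * top) * B + data[i]) % P
        out.append(h)
    return out

hs = rolling_hashes(b"abracadabra", 4)
# identical windows get identical hashes: "abra" occurs at positions 0 and 7
assert hs[0] == hs[7]
```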
Exact-Duplicate Detection
Near-Duplicate Detection
Problem: given a large collection of documents, identify the near-duplicate documents.
Web search engines face a proliferation of near-duplicate documents: legitimate – mirrors, local copies, updates, …; malicious – spam, spider-traps, dynamic URLs, …; mistaken – spider errors.
30% of web-pages are near-duplicates [1997].
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible.
Stream processing: once the sketch is computed, the source is unavailable.
Error guarantees: at this problem scale, small biases have a large impact; we need formal guarantees – heuristics will not do.
Basic Idea [Broder 1997]
Shingling: dissect each document into q-grams (shingles), represent documents by their shingle-sets, and reduce the problem to set intersection.
[Jaccard] They are near-duplicates if their shingle-sets intersect enough.
Similarity of Documents
Doc A → shingle-set SA ; Doc B → shingle-set SB
• Jaccard measure – similarity of SA, SB:
sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
• Claim: A & B are near-duplicates if sim(SA, SB) is high
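A tiny sketch of shingling and the Jaccard measure (q = 4 and the example strings are made up for illustration):

```python
def shingles(text: str, q: int = 4) -> set:
    """Set of character q-grams (shingles) of a document."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def jaccard(sa: set, sb: set) -> float:
    """Jaccard similarity |sa & sb| / |sa | sb| of two shingle-sets."""
    return len(sa & sb) / len(sa | sb)

a = shingles("the cat sat on the mat")
b = shingles("the cat sat on the hat")
c = shingles("completely different text")
assert jaccard(a, b) > jaccard(a, c)    # near-duplicates score higher
```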
We need to cope with “Set Intersection”: fingerprints of shingles (for space/time efficiency), and min-hash to estimate intersection sizes (for further time and space efficiency).
Doc → shingling → Multiset of Shingles → fingerprint → Multiset of Fingerprints
(Documents become sets of 64-bit fingerprints.)
Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• Fingerprint space [0, …, U−1]
• In practice, use 64-bit fingerprints, i.e., U = 2⁶⁴
• Prob[collision] ≈ (8q)/2⁶⁴ << 1
This reduces the space for storing the multi-sets, and the time to intersect them, but…
Speeding-up: Sketch of a document
Intersecting shingle-sets is too costly.
Create a “sketch vector” (of size ~200) for each document, from its shingle-set.
Documents that share ≥ t (say 80%) corresponding vector elements are near-duplicates.
Sketching by Min-Hashing
Consider SA, SB ⊆ P.
Pick a random permutation π of P (such as x → ax + b mod |P|).
Define α = π⁻¹(min{π(SA)}) and β = π⁻¹(min{π(SB)}): the minimal elements of SA and SB under the permutation π.
Lemma: P[α = β] = |SA ∩ SB| / |SA ∪ SB|
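The lemma suggests an estimator: repeat with many random permutations and count how often the two minima agree. A sketch that approximates permutations with random linear hash functions and compares min hash values directly (all parameters and the toy sets are illustrative):

```python
import random

def minhash_sig(s: set, perms: list) -> list:
    """Min-hash signature: the min of each random linear hash over the set.
    (Approximates true permutations; hash collisions are negligible here.)"""
    U = (1 << 61) - 1                   # a large prime as hash modulus
    return [min((a * x + b) % U for x in s) for a, b in perms]

random.seed(0)
perms = [(random.randrange(1, 1 << 60), random.randrange(1 << 60))
         for _ in range(200)]

sa = set(range(0, 80))                  # toy fingerprint sets:
sb = set(range(40, 120))                # true Jaccard = 40/120 = 1/3

siga = minhash_sig(sa, perms)
sigb = minhash_sig(sb, perms)
# fraction of agreeing coordinates estimates the Jaccard similarity
est = sum(x == y for x, y in zip(siga, sigb)) / len(perms)
```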
Strengthening it…
Similarity sketch sk(A) = the k minimal elements under π(SA).
(Is k fixed, or a fixed ratio of |SA|, |SB|? We might also take k permutations and the min of each.)
Similarity sketches sk(A): a succinct representation of the fingerprint sets SA, allowing efficient estimation of sim(SA, SB). The basic idea is to use min-hash of the fingerprints. Note: we can reduce the variance by using a larger k.
Computing Sketch[i] for Doc1
For Document 1: start with the 64-bit fingerprints f(shingles) on the number line [0, 2⁶⁴); permute the number line with πᵢ; pick the min value.
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Place the 64-bit fingerprints of Document 1 and Document 2 on the number line [0, 2⁶⁴) and ask: are their minima under πᵢ equal? Test this for 200 random permutations π₁, π₂, …, π₂₀₀.
However…
α = β iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., it lies in the intersection).
Claim: this happens with probability Size_of_intersection / Size_of_union.
Sum up…
Brute-force: compare sk(A) vs. sk(B) for all pairs of documents A and B.
Locality-sensitive hashing (LSH): compute sk(A) for each document A, then use LSH over all sketches; briefly:
Take h elements of sk(A) as an ID (may induce false positives)
Create t IDs (to reduce the false negatives)
If one ID matches another (wrt the same h-selection), then the corresponding docs are probably near-duplicates; hence compare them.
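The banding step above can be sketched as follows (the toy signatures and the values h = 3, t = 2 are illustrative): each band of h signature elements becomes an ID, and documents sharing an ID in the same band become candidate pairs.

```python
from collections import defaultdict

def lsh_candidates(sigs: dict, h: int, t: int):
    """Split each length-(h*t) signature into t bands of h elements;
    docs sharing a band ID become candidate near-duplicate pairs."""
    buckets = defaultdict(set)
    for doc, sig in sigs.items():
        for band in range(t):
            key = (band, tuple(sig[band * h:(band + 1) * h]))
            buckets[key].add(doc)
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs

sigs = {"A": [1, 2, 3, 4, 5, 6],
        "B": [1, 2, 3, 9, 9, 9],        # shares its first band with A
        "C": [7, 7, 7, 7, 7, 7]}
cands = lsh_candidates(sigs, h=3, t=2)
assert ("A", "B") in cands and ("A", "C") not in cands
```

Candidate pairs (which may include false positives) are then verified by comparing the full sketches.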
Search Engines
“Semantic” searches?
“Diego Maradona won against Mexico”
Dictionary of terms: against, Diego, Maradona, Mexico, won
Term vector: (2.2, 5.1, 9.1, 1.0, 0.1)
Similarity(v, w) ≈ cos(α), the angle between the vectors v and w in the space of terms t1, t2, t3.
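The cosine measure can be sketched directly (the weights below are illustrative, not taken from the slides):

```python
import numpy as np

def cos_sim(v, w):
    """Cosine of the angle between two term vectors."""
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

# illustrative tf-idf-like weights over a 5-term dictionary
v = np.array([2.2, 5.1, 9.1, 1.0, 0.1])
w = np.array([2.0, 4.8, 9.0, 1.2, 0.0])
assert cos_sim(v, w) > 0.99             # nearly parallel vectors
```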
Vector Space model
Classical approach, mainly term-based: polysemy and synonymy issues.
A new approach: Massive graphs of entities and relations
May 2012
A typical issue: polysemy
“the paparazzi photographed the star”
“the astronomer photographed the star”
Another issue: synonymy
“He is using Microsoft’s browser”
“He is a fan of Internet Explorer”
http://tagme.di.unipi.it
“Diego Maradona won against Mexico”
won → Korean won, win-loss record, only won, …
Diego Maradona → Diego A. Maradona, Diego Maradona jr., Maradona Stadium, Maradona Film, …
Mexico → Mexico nation, Mexico state, Mexico football team, Mexico baseball team, …
TAGME pipeline: PARSING → DISAMBIGUATION (by a voting scheme) → PRUNING (2 simple features, ρ score; low-score mentions get no annotation).
“obama asks iran for RQ-170 sentinel drone back” → Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel
“us president issues Ahmadinejad ultimatum” → President of the United States, Mahmoud Ahmadinejad, Ultimatum
Why is it more powerful?
Text as a sub-graph of topics
Mahmoud Ahmadinejad
Ultimatum
RQ-170 drone
President of the United States
Barack Obama
Iran
Any relatedness measure over the graph can be applied, e.g. [Milne & Witten, 2008].
Graph analysis allows us to find similarities between texts and entities even if they do not match syntactically (i.e., at the concept level).
Search Results Clustering
Jaguar Cars Panthera Onca Mac OS X Atari Jaguar Jacksonville Jags Fender Jaguar …
TOPICS
Paper at ACM WSDM 2012
Paper at ECIR 2012
Paper at IEEE Software 2012
Please design your killer app...
http://acube.di.unipi.it/tagme
Releasing open-source…