textual spatial cosine similarity giancarlo crocetti pace university seidenberg school of csis

12
Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Upload: elvin-james

Post on 04-Jan-2016

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Textual Spatial Cosine SimilarityGiancarlo CrocettiPace University Seidenberg School of CSIS

Page 2: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Introductin• Similarity is a quantifiable measure of how similar

two objects are

• We have many document similarity measures today

• Cosine Similarity is widely used and is considered a standard in search engines.

• Cosine Similarity has a serious drawback: does not consider word placement

Page 3: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

An Example• Compare “John loves Mary” with “Mary loves

John”

simcosine(“John loves Mary”, ”Mary loves John”) = 1.0

• Definitely similar, but they are not the same

• Methods based on NLP exists, but computationally intensive

I will introduce a Textual Space Similaritythat provides Semantic-Quality results

without the overhead of semantic approaches.

Page 4: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Textual Space Similarity• Different from other positional-based methods.

Given two documents i and j, let’s define the Spatial Difference for the term t the quantity:

• Each term can be 0 when in the same position

• 1 when the first instance of a term is at the beginning of one document.

• Otherwise 0<<1

Page 5: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Textual Space Similarity (continued)Finally, we define the Textual Space Similarity of two documents di and dj the quantity:

With l the number of matching terms in the two documents.

• Numerator is the summation of quantities [0,1] appearing no more than l times, therefore TSS [0,1]

• In order for TSS to have the same direction of other document similarities:

[0,1]

𝑇 𝑆𝑆 (𝑑𝑖 ,𝑑 𝑗 )=1−∑

𝑡

𝑠𝑑𝑖 , 𝑗(𝑡 )

Page 6: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Back to the Example

TSS(“John loves Mary”, “Mary loves John”) =

This result is quite different from the cosine similarity of 1.0

=

Page 7: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Textual Spatial Cosine Similarity

In order to both spatial and term features in documents we combine the cosine similarity with TSS deriving the Textual Spatial Cosine Similarity defined as:

• a=1: = sim

• a=0: = TSS

• In general a=0.5 works well in most cases

= * sim+(1- )*TSS

Page 8: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

TSCS and Corpus Size

• Similarity is a value [0,1] and it is not clear what is the threshold to use to assert two documents are “similar”

• Cosine similarity varies with changes in corpus size

• We ran an experiment to see how the similarity of two seeded document varies with changes in corpus size (a=0.5)

Size of Corpus

Similarity of Set #1

Similarity of Set #2

4 0.89 0.52 5 0.89 0.5310 0.90 0.5415 0.90 0.5420 0.91 0.5630 0.92 0.5740 0.92 0.59

Size of Corpus

Similarity of Set #1

Similarity of Set #2

4 0.85 0.48 5 0.85 0.5010 0.87 0.5115 0.86 0.5220 0.89 0.4430 0.90 0.5740 0.91 0.60

Similarity variations with different corpus sizes using TSCS Similarity variations with different corpus sizes using Cosine

Page 9: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

TSCS and Paraphrasing• The dataset consisted of 734 English pairs drawn from

publicly available datasets:– Microsoft Research Paraphrase Corpus

– Microsoft Research Video Description Corpus

– WMT2008 development dataset

• We analyzed the TSCS performance in detecting paraphrases, by using different values of alpha

Page 10: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

TSCS and Paraphrasing (continued)• Number of correct detection maximized with a=0

• TSCS recognized a total of 649 paraphrases

• TSCS (in its degenerate case of a=0) achieved an accuracy of 649/734 = 0.8842

• TSCSa=0 = TSS can be adopted in the detection of paraphrasing

Page 11: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Conclusions• Textual Space Cosine Similarity (TSCS) adds a spatial

dimension without computational intensive, semantic approaches

• TSCS is minimally sensitive to changes in the corpus size

• In its degenerative case can be used as a model for paraphrasing detection with accuracy levels close to 90%.

• TSCS can be used by search engines in:– Detection of plagiarism

– Content recommendation

– Content Discovery

Page 12: Textual Spatial Cosine Similarity Giancarlo Crocetti Pace University Seidenberg School of CSIS

Thank You

Giancarlo Crocetti – [email protected]