SC Spectra: A New Soft Cardinality Approximation for Text Comparison

SC Spectra: A Linear-Time Soft Cardinality Approximation for Text Comparison Sergio Jimenez Universidad Nacional de Colombia Alexander Gelbukh CIC-IPN Doctoral Consortium

Upload: sergio-jimenez

Post on 31-Jul-2015


TRANSCRIPT

Page 1: SC spectra: A new soft cardinality approximation for text comparison

SC Spectra: A Linear-Time Soft Cardinality Approximation for Text Comparison

Sergio Jimenez, Universidad Nacional de Colombia

Alexander Gelbukh, CIC-IPN

Doctoral Consortium

Page 2: SC spectra: A new soft cardinality approximation for text comparison

Do sets model human similarity judgements better than vectors?

• Yes… Tversky: similarity statements, images, sounds, symbols, etc.

• Useful in tasks where …. (e.g. Entity Resolution, Name Matching, …)


Page 3: SC spectra: A new soft cardinality approximation for text comparison

Set Similarity Measures

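Cardinality-based

Jaccard: $\frac{|A \cap B|}{|A \cup B|}$

Dice: $\frac{|A \cap B|}{0.5\,(|A| + |B|)}$

Cosine: $\frac{|A \cap B|}{\sqrt{|A| \cdot |B|}}$

Overlap: $\frac{|A \cap B|}{\min(|A|, |B|)}$

Generalized Mean: $\frac{|A \cap B|}{\left(0.5\,(|A|^p + |B|^p)\right)^{1/p}}$

p = 1: Dice; p → 0: Cosine; p → −∞: Overlap

Tversky's Ratio Model: $\frac{|A \cap B|}{\alpha\,|A \setminus B| + \beta\,|B \setminus A| + |A \cap B|}, \quad \alpha, \beta \ge 0$

α = β = 1: Jaccard; α = β = 0.5: Dice

[Figure labels: commonalities; referent]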

You only need to count the number of elements in a set!
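
As a rough illustration of this point, every measure on this slide can be computed from set sizes alone. The following is a minimal Python sketch, not code from the talk: the function names and the two sample sets are made up for the example, and both sets are assumed to be non-empty.

```python
import math

def generalized_mean(x, y, p):
    """Generalized (power) mean of two positive numbers."""
    if p == 0:                 # the limit p -> 0 is the geometric mean
        return math.sqrt(x * y)
    if p == -math.inf:         # the limit p -> -inf is the minimum
        return min(x, y)
    if p == math.inf:          # the limit p -> +inf is the maximum
        return max(x, y)
    return (0.5 * (x ** p + y ** p)) ** (1.0 / p)

def mean_similarity(a, b, p):
    """|A n B| divided by the generalized mean of |A| and |B|."""
    return len(a & b) / generalized_mean(len(a), len(b), p)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def tversky(a, b, alpha, beta):
    """Tversky's ratio model; alpha and beta weight the two set differences."""
    common = len(a & b)
    return common / (alpha * len(a - b) + beta * len(b - a) + common)

# Hypothetical sample sets: |A| = 4, |B| = 9, |A n B| = 2.
A = set("abcd")
B = set("cdefghijk")

print("Jaccard:", jaccard(A, B))
print("Dice   :", mean_similarity(A, B, 1))
print("Cosine :", mean_similarity(A, B, 0))
print("Overlap:", mean_similarity(A, B, -math.inf))
print("Tversky(1, 1)    :", tversky(A, B, 1, 1))      # same value as Jaccard
print("Tversky(0.5, 0.5):", tversky(A, B, 0.5, 0.5))  # same value as Dice
```

Only `len` of a set, an intersection, a union, or a difference is ever needed, which is why replacing the cardinality function is enough to change the behaviour of all of these measures at once.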

Page 4: SC spectra: A new soft cardinality approximation for text comparison

But cardinality-based measures are crisp ….

[Figure: two example sets A and B]

Dice coefficient: $\frac{|A \cap B|}{0.5\,(|A| + |B|)}$

… because classic set cardinality doesn't take into account similarities between elements.

Page 5: SC spectra: A new soft cardinality approximation for text comparison

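So, we proposed soft cardinality …

… a cardinality function that counts elements in a softened way, denoted $|\cdot|'$.

For a set $A = \{a_1, a_2, \ldots, a_n\}$ the soft cardinality of $A$ is

$|A|' = \left| \bigcup_{i=1}^{n} a_i \right|$

[Figure: two example sets, A with $|A|' = 2.9$ and B with $|B|' = 1.2$]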

Page 6: SC spectra: A new soft cardinality approximation for text comparison

Soft Cardinality for Name Matching

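A = "Sergio Gonzalo Jiménez"

B = "Cergio G. Jimenes"

A = {Sergio, Gonzalo, Jiménez}

B = {Cergio, G., Jimenes}

[Figure: the word sets of A and B drawn as overlapping regions, annotated with the soft cardinalities $|A|'$ and $|B|'$ and a combined soft cardinality of A and B]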

Page 7: SC spectra: A new soft cardinality approximation for text comparison

But …

the soft cardinality definition requires computing the cardinality of the union of n sets, which (by inclusion-exclusion) takes $2^n - 1$ terms and intersections of up to n sets.

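e.g., for n = 3:

$|A|' = \left| \bigcup_{i=1}^{n} a_i \right| = |a_1 \cup a_2 \cup a_3|$

$|a_1 \cup a_2 \cup a_3| = |a_1| + |a_2| + |a_3| - |a_1 \cap a_2| - |a_2 \cap a_3| - |a_1 \cap a_3| + |a_1 \cap a_2 \cap a_3|$

That is 7 terms: three single sets, three binary intersections, and one 3-ary intersection.

[Figure: Venn diagram of a1, a2 and a3 with its 7 regions numbered 1 to 7]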
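
To make the term count concrete, here is a small Python sketch of inclusion-exclusion over ordinary (crisp) sets. It is only an illustration: the function name and the three sample sets are hypothetical, and crisp sets stand in for the softened elements the slides talk about.

```python
from itertools import combinations

def union_size_inclusion_exclusion(sets):
    """Return (|union|, number of terms) using the inclusion-exclusion formula."""
    total = 0
    terms = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1          # add odd-sized groups, subtract even-sized ones
        for group in combinations(sets, k):     # every k-ary intersection
            common = set(group[0]).intersection(*group[1:])
            total += sign * len(common)
            terms += 1
    return total, terms

# Hypothetical sample sets.
a1, a2, a3 = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}

size, terms = union_size_inclusion_exclusion([a1, a2, a3])
print(size, terms)  # 7 7: the union has 7 elements, and n = 3 needs 2**3 - 1 = 7 terms
```

The number of terms doubles with every additional set, so this direct evaluation stops being usable well before n reaches the size of a typical sentence.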

Page 8: SC spectra: A new soft cardinality approximation for text comparison

So, we proposed an approximation …

… using only:
1. Pair-wise intersections (binary similarity function)
2. $n^2$ terms for a set of $n$ elements

$|A|' \simeq \sum_{i=1}^{n} \frac{1}{\sum_{j=1}^{n} \mathrm{sim}(a_i, a_j)}$

$|a_i \cap a_j| \simeq \mathrm{sim}(a_i, a_j) \le 1$

But (again), $O(n^2)$ is fine for short texts (e.g. name matching) but still impractical for longer texts.
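
The pairwise approximation on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the slides do not fix a particular sim function, so `difflib.SequenceMatcher.ratio()` from the standard library is used here purely as a stand-in word similarity in [0, 1], and the function names are made up.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Stand-in similarity in [0, 1] (an assumption, not the authors' choice)."""
    return SequenceMatcher(None, a, b).ratio()

def soft_cardinality(elements, sim=string_sim):
    """Sum over i of 1 / (sum over j of sim(a_i, a_j)); needs n*n sim evaluations."""
    return sum(1.0 / sum(sim(a, b) for b in elements) for a in elements)

def crisp_sim(a, b):
    """With a 0/1 similarity the approximation falls back to classic counting."""
    return 1.0 if a == b else 0.0

A = ["Sergio", "Gonzalo", "Jiménez"]
B = ["Cergio", "G.", "Jimenes"]

print(soft_cardinality(A, crisp_sim))  # 3.0: three distinct words, counted crisply
print(soft_cardinality(A))             # between 1 and 3, since the words share some characters
print(soft_cardinality(A + B))         # well below 6, because similar names are counted softly
```

The nested sums make the quadratic cost visible: every element is compared against every other element, which is acceptable for a handful of name tokens and not for long documents.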

Page 9: SC spectra: A new soft cardinality approximation for text comparison

Approximating Soft Cardinality using character q-grams (n-grams)

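An example using 2-grams: $A = \{\text{"Gonzalo"}, \text{"Gonzalez"}\}$

$a_1 = \text{"Gonzalo"}$

$a_2 = \text{"Gonzalez"}$

$a_1^{[2]} = \{\text{◄G}, \text{Go}, \text{on}, \text{nz}, \text{za}, \text{al}, \text{lo}, \text{o►}\}$

$a_2^{[2]} = \{\text{◄G}, \text{Go}, \text{on}, \text{nz}, \text{za}, \text{al}, \text{le}, \text{ez}, \text{z►}\}$

$A^{[2]} = a_1^{[2]} \cup a_2^{[2]} = \{\text{◄G}, \text{Go}, \text{on}, \text{nz}, \text{za}, \text{al}, \text{lo}, \text{o►}, \text{le}, \text{ez}, \text{z►}\}$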

[Figure: bar chart with one bar per 2-gram (◄G, Go, on, nz, za, al, lo, o►, le, ez, z►); vertical axis ticks from 0.100 to 0.130; legend: 2-grams in both, 2-grams only in a1, 2-grams only in a2]
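
The 2-gram sets in this example can be reproduced with a short Python sketch. It covers only the q-gram extraction and the grouping used in the figure legend, not the SC spectra computation itself; the function name `qgrams` is hypothetical, and the padding characters ◄ and ► follow the labels shown on the slide.

```python
def qgrams(word, q=2, left="◄", right="►"):
    """Return the character q-grams of a word, padded with q-1 marker characters per side."""
    padded = left * (q - 1) + word + right * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

a1 = set(qgrams("Gonzalo"))
a2 = set(qgrams("Gonzalez"))

print(qgrams("Gonzalo"))   # ['◄G', 'Go', 'on', 'nz', 'za', 'al', 'lo', 'o►']
print(qgrams("Gonzalez"))  # ['◄G', 'Go', 'on', 'nz', 'za', 'al', 'le', 'ez', 'z►']

both = a1 & a2             # 2-grams in both words
only_a1 = a1 - a2          # 2-grams only in a1
only_a2 = a2 - a1          # 2-grams only in a2
print(len(a1 | a2), len(both), len(only_a1), len(only_a2))  # 11 6 2 3
```

Each word is scanned once, so collecting the q-grams takes time proportional to the length of the text.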