SC Spectra: A Linear-Time Soft Cardinality Approximation for Text Comparison
Sergio Jimenez, Universidad Nacional de Colombia
Alexander Gelbukh, CIC-IPN
Doctoral Consortium
Do sets model human similarity judgements better than vectors?
• Yes… Tversky: similarity statements, images, sounds, symbols, etc.
• Useful at tasks where … (e.g. Entity Resolution, Name Matching, …)
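Set Similarity Measures
Cardinality-based
Jaccard: $\dfrac{|A \cap B|}{|A \cup B|}$
Dice: $\dfrac{|A \cap B|}{0.5\,(|A| + |B|)}$
Cosine: $\dfrac{|A \cap B|}{\sqrt{|A| \cdot |B|}}$
Overlap: $\dfrac{|A \cap B|}{\min(|A|, |B|)}$
Generalized Mean: $\dfrac{|A \cap B|}{\left(0.5\,(|A|^p + |B|^p)\right)^{1/p}}$
• $p = 1$: Dice
• $p \to 0$: Cosine
• $p \to -\infty$: Overlap
Tversky's Ratio Model: $\dfrac{|A \cap B|}{|A \cap B| + \alpha\,|A \setminus B| + \beta\,|B \setminus A|}$, with $\alpha, \beta \ge 0$
• $\alpha = \beta = 1$: Jaccard
• $\alpha = \beta = 0.5$: Dice
(Diagram: sets A and B, labelled "commonalities" and "referent".)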
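The measures on this slide can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions; the function names and the toy sets `A` and `B` are assumptions made for the example, and both sets are assumed to be non-empty.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def dice(a: set, b: set) -> float:
    return len(a & b) / (0.5 * (len(a) + len(b)))


def cosine(a: set, b: set) -> float:
    return len(a & b) / sqrt(len(a) * len(b))


def overlap(a: set, b: set) -> float:
    return len(a & b) / min(len(a), len(b))


def generalized_mean(a: set, b: set, p: float) -> float:
    # p must be non-zero; p = 1 gives Dice, p near 0 approaches Cosine,
    # and large negative p approaches Overlap.
    return len(a & b) / (0.5 * (len(a) ** p + len(b) ** p)) ** (1.0 / p)


def tversky(a: set, b: set, alpha: float, beta: float) -> float:
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))


# Toy sets (hypothetical): |A| = 4, |B| = 3, |A n B| = 2, |A u B| = 5.
A = {"a", "b", "c", "d"}
B = {"c", "d", "e"}

print(jaccard(A, B), tversky(A, B, 1.0, 1.0))        # both are 2/5
print(dice(A, B), generalized_mean(A, B, 1.0))       # both are 2/3.5
```

Every function needs only `len()` of a set, a set intersection, or a set difference, which is the sense in which these measures are cardinality-based.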
You only need to count the number of elements in a set!
But cardinality-based measures are crisp …
… because classic set cardinality doesn't take into account similarities between elements.
Dice coefficient:
$$\mathrm{Dice}(A, B) = \frac{|A \cap B|}{0.5\,(|A| + |B|)}$$
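So, we proposed soft cardinality …
… a cardinality function that counts elements in a softened way, denoted $|\cdot|'$.
For a set $A = \{a_1, a_2, \ldots, a_n\}$ the soft cardinality of $A$ is
$$|A|' = \left|\bigcup_{i=1}^{n} a_i\right|$$
(Diagram: two example sets, with $|A|' = 2.9$ and $|B|' = 1.2$.)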
Soft Cardinality for Name Matching
Sergio Gonzalo Jiménez
Cergio G. Jimenes
$A$ = {Sergio, Gonzalo, Jiménez}
$B$ = {Cergio, G., Jimenes}
(Diagram: the tokens of both names, with $|A \cup B|'$, $|A|'$ and $|B|'$ marked.)
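A short sketch of why this name pair motivates a softened count: with classic cardinality, the crisp Dice coefficient of the two token sets is zero, because no token is spelled identically in both names. Splitting on whitespace is an assumption made for this illustration.

```python
def dice(a: set, b: set) -> float:
    """Crisp Dice coefficient based on classic set cardinality."""
    return len(a & b) / (0.5 * (len(a) + len(b)))


# Whitespace tokenization is an assumption of this sketch.
A = set("Sergio Gonzalo Jiménez".split())
B = set("Cergio G. Jimenes".split())

print(A & B)       # set()
print(dice(A, B))  # 0.0
```

The two names clearly refer to the same person, yet the crisp measure reports no similarity at all.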
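But …
the soft cardinality definition requires computing the cardinality of the union of $n$ sets, which takes $2^n - 1$ terms and $n$-ary intersections.
e.g., with $|A|' = \left|\bigcup_{i=1}^{n} a_i\right|$ and $n = 3$:
$$|a_1 \cup a_2 \cup a_3| = |a_1| + |a_2| + |a_3| - |a_1 \cap a_2| - |a_2 \cap a_3| - |a_1 \cap a_3| + |a_1 \cap a_2 \cap a_3|$$
That is 7 terms: three single sets, three binary intersections and one 3-ary intersection.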
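The term count can be checked with a small inclusion-exclusion sketch. Here each element is modelled as an ordinary Python set of integers, which is purely an assumption for illustration; the function name `union_cardinality` is hypothetical.

```python
from itertools import combinations


def union_cardinality(sets):
    """Cardinality of a union by inclusion-exclusion.

    Returns (cardinality, number_of_terms). The number of terms is
    2**n - 1 for n sets, one per non-empty subset of the input.
    """
    total = 0
    terms = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1  # add odd-sized, subtract even-sized intersections
        for combo in combinations(sets, k):
            total += sign * len(set.intersection(*combo))
            terms += 1
    return total, terms


# Toy elements (hypothetical data).
a1, a2, a3 = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}
print(union_cardinality([a1, a2, a3]))  # (5, 7)
```

Adding one more set doubles the number of terms (plus one), which is what makes the exact definition impractical beyond very small sets.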
So, we proposed an approximation …
… using only:
1. Pair-wise intersections (binary similarity function)
2. $n^2$ terms for a set of $n$ elements
$$|A|' \simeq \sum_{i=1}^{n} \frac{1}{\sum_{j=1}^{n} sim(a_i, a_j)}$$
with $sim(a_i, a_j) = 1$ when $a_i = a_j$.
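A minimal sketch of the quadratic approximation, reading the slide's formula as the sum over $i$ of $1 / \sum_j sim(a_i, a_j)$. Two things here are assumptions and not taken from the slides: the similarity function (Python's `difflib.SequenceMatcher` ratio, used only as a stand-in with values in [0, 1]) and the soft intersection, taken as $|A|' + |B|' - |A \cup B|'$.

```python
from difflib import SequenceMatcher


def sim(x: str, y: str) -> float:
    """Stand-in similarity in [0, 1] with sim(x, x) == 1 (an assumption)."""
    return SequenceMatcher(None, x, y).ratio()


def soft_cardinality(elements, sim):
    """Quadratic approximation: sum_i 1 / sum_j sim(a_i, a_j)."""
    return sum(
        1.0 / sum(sim(a_i, a_j) for a_j in elements)
        for a_i in elements
    )


def soft_dice(a, b, sim):
    """Dice coefficient computed with soft cardinalities.

    The soft intersection is taken as |A|' + |B|' - |A u B|',
    which is an assumption of this sketch.
    """
    card_a = soft_cardinality(a, sim)
    card_b = soft_cardinality(b, sim)
    union = list(dict.fromkeys(a + b))  # ordered union without duplicates
    card_union = soft_cardinality(union, sim)
    return (card_a + card_b - card_union) / (0.5 * (card_a + card_b))


A = ["Sergio", "Gonzalo", "Jiménez"]
B = ["Cergio", "G.", "Jimenes"]
print(soft_cardinality(A, sim), soft_cardinality(B, sim), soft_dice(A, B, sim))
```

With a crisp similarity (1 for equal elements, 0 otherwise) the approximation returns the classic cardinality, and with a similarity that is always 1 it collapses to 1, so it behaves as a softened count between those two extremes.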
But (again) $O(n^2)$ is fine for short texts (e.g. name matching) but still impractical for longer texts.
Approximating Soft Cardinality using character q-grams (n-grams)
An example using 2-grams: $A$ = {"Gonzalo", "Gonzalez"}
$a_1$ = "Gonzalo"
$a_2$ = "Gonzalez"
$a_1^{[2]}$ = {◄G, Go, on, nz, za, al, lo, o►}
$a_2^{[2]}$ = {◄G, Go, on, nz, za, al, le, ez, z►}
$A^{[2]} = a_1^{[2]} \cup a_2^{[2]}$ = {◄G, Go, on, nz, za, al, lo, o►, le, ez, z►}
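The 2-gram decomposition above can be reproduced with a short sketch. The function name `qgrams` and the rule of padding with `q - 1` marker characters on each side are assumptions; for `q = 2` this gives the single ◄ and ► markers used in the example.

```python
def qgrams(word: str, q: int = 2, left: str = "◄", right: str = "►") -> set:
    """Set of character q-grams of a word, padded with boundary markers."""
    padded = left * (q - 1) + word + right * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


a1 = qgrams("Gonzalo")
a2 = qgrams("Gonzalez")

print(sorted(a1 & a2))  # 2-grams in both words
print(sorted(a1 - a2))  # 2-grams only in a1
print(sorted(a2 - a1))  # 2-grams only in a2
print(len(a1), len(a2), len(a1 | a2))  # 8 9 11
```

The two words share six of their 2-grams, so the union has 11 elements instead of 17.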
(Figure: bar chart over the 2-grams ◄G, Go, on, nz, za, al, lo, o►, le, ez, z►, with values between 0.100 and 0.130. Legend: 2-grams in both, 2-grams only in $a_1$, 2-grams only in $a_2$.)