
SOFTCARDINALITY: Learning to Identify Directional Cross-Lingual

Entailment from Cardinalities and SMT

Sergio Jimenez and Claudia Becerra

Alexander Gelbukh
Instituto Politécnico Nacional, Mexico

(a participating system in the Cross-lingual Textual Entailment task, CLTE, Task 8)

Soft Cardinality

[Illustration: two collections A and B, each containing three pictured items. Classical (integer) cardinality: |A| = 3, |B| = 3. Soft (real) cardinality: |A|' ≈ 2.9, |B|' ≈ 1.3 — the more similar the elements are to each other, the lower the soft cardinality.]

Cardinality: the number of different elements in a collection, i.e. the set definition.

[A collection C of identical items: |C| = 1 and |C|' = 1.0.]

Soft Cardinality

$$|A|' = \sum_{i=1}^{|A|} w_i \left( \sum_{j=1}^{|A|} \mathrm{sim}(a_i, a_j)^{p} \right)^{-1}$$

where $\mathrm{sim}(a_i, a_j)$ is the inter-element similarity, $w_i$ are the element weights, and $p$ controls the "softness".

When the elements are words, $\mathrm{sim}$ is a word-to-word similarity function and $w_i$ is the idf term weight.
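To make the formula concrete, here is a minimal Python sketch of soft cardinality under the definition above; the function and parameter names are illustrative, not taken from the original system.

```python
from typing import Callable, Optional, Sequence

def soft_cardinality(items: Sequence[str],
                     sim: Callable[[str, str], float],
                     weights: Optional[Sequence[float]] = None,
                     p: float = 1.0) -> float:
    """|A|' = sum_i w_i * (sum_j sim(a_i, a_j)^p)^(-1)."""
    if weights is None:
        weights = [1.0] * len(items)  # uniform weights; the slides use idf weighting
    total = 0.0
    for a_i, w_i in zip(items, weights):
        # The inner sum runs over all elements, including a_i itself (sim(a_i, a_i) = 1).
        inner = sum(sim(a_i, a_j) ** p for a_j in items)
        total += w_i / inner
    return total
```

With a crisp 0/1 similarity and uniform weights this reduces to the classical cardinality; for the collection C of identical elements above, every inner sum equals the number of copies and |C|' comes out as 1.0.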

Word-to-word similarity functions

• Character q-grams: $\mathrm{sim}(t_i, t_j) = \dfrac{|t_i \cap t_j| - \mathit{bias}}{\alpha \max(|t_i|, |t_j|) + (1-\alpha)\min(|t_i|, |t_j|)}$

• Edit distance: $\mathrm{sim}(t_i, t_j) = 1 - \dfrac{\mathrm{EditDistance}(t_i, t_j)}{\max[\mathrm{len}(t_i), \mathrm{len}(t_j)]}$

• Jaro-Winkler: $\mathrm{sim}(t_i, t_j) = \dfrac{1}{3}\left(\dfrac{c}{\mathrm{len}(t_i)} + \dfrac{c}{\mathrm{len}(t_j)} + \dfrac{c - m}{c}\right)$

where c is the number of characters in common within a sliding window, and m is the number of order mismatches between the common characters.
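As an illustration of the first of these functions, the sketch below computes the character q-gram similarity; q, alpha and bias are shown with placeholder default values, not the tuned values used by the system.

```python
def qgrams(term: str, q: int = 3) -> set:
    """Set of character q-grams of a term (simple version, no padding)."""
    if len(term) < q:
        return {term}
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def qgram_similarity(t_i: str, t_j: str, q: int = 3,
                     alpha: float = 0.5, bias: float = 0.0) -> float:
    """sim(t_i, t_j) = (|t_i ∩ t_j| - bias) / (alpha*max + (1 - alpha)*min)."""
    a, b = qgrams(t_i, q), qgrams(t_j, q)
    denom = alpha * max(len(a), len(b)) + (1 - alpha) * min(len(a), len(b))
    return max(0.0, (len(a & b) - bias) / denom) if denom else 0.0
```

With bias = 0 identical terms score close to 1, and alpha shifts the weight between the larger and the smaller q-gram set in the denominator.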

Features for Text Pairs T1, T2

Language-pair model pipeline:

• Translate: T1 (EN) and T2 (FR) are translated with SMT, producing T1t (FR) and T2t (EN).
• Text pre-processing: tokenizing, stemming, stop-words removal, idf term weighting.
• Feature extraction over the resulting comparable text pairs (a sketch follows below).
• 4-way classification model: an SVM trained on the gold standard.
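The slides do not enumerate the 14 features computed per comparable text pair; the following is only a plausible sketch of how cardinality-based features could be derived, reusing the soft_cardinality function sketched earlier (the concrete feature choices here are assumptions).

```python
def pair_features(tokens1, tokens2, sim, idf):
    """Illustrative soft-cardinality features for one same-language text pair."""
    w1 = [idf.get(t, 1.0) for t in tokens1]
    w2 = [idf.get(t, 1.0) for t in tokens2]
    union = list(tokens1) + [t for t in tokens2 if t not in tokens1]
    wu = [idf.get(t, 1.0) for t in union]
    c1 = soft_cardinality(tokens1, sim, w1)
    c2 = soft_cardinality(tokens2, sim, w2)
    cu = soft_cardinality(union, sim, wu)
    ci = c1 + c2 - cu  # soft |T1 ∩ T2| by inclusion-exclusion
    return [c1, c2, cu, ci,
            ci / c1 if c1 else 0.0,   # coverage of T1 (asymmetric)
            ci / c2 if c2 else 0.0]   # coverage of T2 (asymmetric)
```

Asymmetric ratios like the two coverage values are the kind of signal a classifier can use to separate the direction of entailment.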

Submitted Systems

• RUN1: 4 language-pair models (es-en, fr-en, it-en, de-en), each one trained with 1,000 text pairs; SVM using C = 1.0.
• RUN2: same as RUN1, but optimizing C for maximum accuracy.
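A hedged sketch of the C tuning in RUN2, using cross-validated grid search from scikit-learn; the library, the search grid and the CV setup are assumptions — only C = 1.0 for RUN1 and the accuracy objective come from the slides.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_C(X_train, y_train):
    """Pick the SVM cost parameter C that maximizes cross-validated accuracy."""
    search = GridSearchCV(SVC(kernel="linear"),
                          param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                          scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_["C"]
```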

Official Results

Circular Pivoting Translations

Each text is repeatedly translated back and forth with SMT: T1 (EN) → T1t (FR) → T1tt (EN) → T1ttt (FR), and T2 (FR) → T2t (EN) → T2tt (FR) → T2ttt (EN), producing additional comparable text pairs.

• Original feature set: 2 comparable text pairs × 14 features = 28 features
• Extended feature set: (2 + 2) comparable text pairs × 14 features = 56 features
• Extended feature set: (2 + 2 + 4) comparable text pairs × 14 features = 112 features
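One plausible reading of the pivoting diagram, as a sketch: each text is translated around the language pair repeatedly, and every same-language combination of a T1-side and a T2-side text becomes a comparable pair (the translate function and the exact pairing used in the extended feature sets are assumptions).

```python
def pivot_chain(text, src, tgt, translate, rounds=3):
    """Alternating SMT translations: returns [(T, src), (Tt, tgt), (Ttt, src), (Tttt, tgt)]."""
    chain, lang = [(text, src)], src
    for _ in range(rounds):
        nxt = tgt if lang == src else src
        text = translate(text, lang, nxt)
        chain.append((text, nxt))
        lang = nxt
    return chain

def comparable_pairs(t1, t2, translate, l1="EN", l2="FR", rounds=3):
    """All same-language (T1-side, T2-side) combinations produced by pivoting."""
    c1 = pivot_chain(t1, l1, l2, translate, rounds)
    c2 = pivot_chain(t2, l2, l1, translate, rounds)
    return [(a, b) for a, la in c1 for b, lb in c2 if la == lb]
```

Under this reading, three pivoting rounds yield eight same-language pairs, which matches the 2 + 2 + 4 count of the largest extended feature set.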

Single Multilingual Model

The 1,000 feature vectors from each language pair (en-de, en-es, it-en, fr-en) are pooled into a single training data set of 4,000 feature vectors, which is used to train one 4-way classification model.
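A minimal sketch of the single multilingual model: the four per-language-pair training sets are pooled and one 4-way SVM is trained on all 4,000 feature vectors (scikit-learn and the linear kernel are assumptions; C = 1.0 is taken from RUN1).

```python
import numpy as np
from sklearn.svm import SVC

def train_multilingual_model(per_pair_data):
    """per_pair_data: {"es-en": (X, y), "fr-en": (X, y), "it-en": (X, y), "de-en": (X, y)}."""
    X = np.vstack([X_lp for X_lp, _ in per_pair_data.values()])       # 4,000 x n_features
    y = np.concatenate([y_lp for _, y_lp in per_pair_data.values()])  # 4-way entailment labels
    model = SVC(C=1.0, kernel="linear")
    model.fit(X, y)
    return model
```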

Single Multilingual Model Results

[Results chart comparing the single multilingual model with the best official results and the baseline: 4.6%, 5.3%, 1.3% and 6.0% better than the best official result, and 4.4% below it in one case.]

Conclusions

• Soft Cardinality + SMT + SVM seems to be a good combination for CLTE.

• A single multilingual model produced better results than the language-pair models.

• Additional circular pivoting translations produced slight but consistent improvements.

• Character q-grams seem to perform better than Edit-distance and Jaro-Winkler.

Soft Cardinality at *SEM and SemEval

• STS-2012: official 3rd out of 89 systems
• STS-2013 CORE task: 18th out of 90 systems (4th unofficial)
• STS-2013 TYPED task: top system (UNITOR team)
• CLTE-2012: 3rd out of 29 systems (1st unofficial)
• CLTE-2013: among the top-2 systems
• SRA-2013: among the top-2 systems
