An integrated platform for high-accuracy word alignment

Dan Tufis, Alexandru Ceausu, Radu Ion, Dan Stefanescu

RACAI – Research Institute for Artificial Intelligence, Bucharest

TRANSCRIPT

Page 1: An integrated platform for high-accuracy word alignment

An integrated platform for high-accuracy word alignment

Dan Tufis, Alexandru Ceausu, Radu Ion, Dan Stefanescu

RACAI – Research Institute for Artificial Intelligence, Bucharest

Page 2: An integrated platform for high-accuracy word alignment

Arona, 26.09.2005 Exploiting parallel corpora in up to 20 languages 2

COWAL

The main task of COWAL is to combine the output of two or more comparable word aligners.

In order to achieve this task, COWAL is also an integrated platform with modules for: tokenization, POS-tagging, lemmatization, collocation detection, dependency annotation, chunking and word sense disambiguation.

Page 3: An integrated platform for high-accuracy word alignment

Word alignment algorithms (YAWA)

YAWA starts with all plausible links (those with a log-likelihood score higher than 11).

Then, using a competitive linking strategy, it retains the links that maximize the sentence translation equivalence score while minimizing the number of crossing links.

In this way, it generates only 1-1 alignments. N-M alignments are possible only when chunking and/or dependency linking is available.
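The competitive linking strategy can be sketched as a greedy 1-1 selection over scored candidate links; the function name, token positions and scores below are illustrative, and the crossing-link criterion of the full algorithm is omitted for brevity.

```python
# Sketch of competitive (greedy 1-1) linking over scored candidate
# links; data and names are invented, and the crossing-link criterion
# of the full algorithm is omitted.

def competitive_linking(candidates):
    """Keep the highest-scoring links so that each source and each
    target token participates in at most one link (1-1 alignments)."""
    used_src, used_tgt, links = set(), set(), []
    for (src, tgt), score in sorted(candidates.items(),
                                    key=lambda kv: -kv[1]):
        if src not in used_src and tgt not in used_tgt:
            links.append((src, tgt, score))
            used_src.add(src)
            used_tgt.add(tgt)
    return links

# Token position pairs with log-likelihood scores (invented values):
candidates = {(0, 0): 25.0, (0, 1): 14.0, (1, 1): 18.0, (2, 2): 12.5}
print(competitive_linking(candidates))
```

The weaker competitor (0, 1) is discarded because token 0 is already consumed by the stronger link (0, 0).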

Page 4: An integrated platform for high-accuracy word alignment

Word alignment algorithms (MEBA)

MEBA iterates several times over each pair of aligned sentences, at each iteration adding only the highest-scoring links.

The links already established in previous iterations give support or create restrictions for the links to be added in a subsequent iteration.

MEBA uses different weights and different significance thresholds on each feature and iteration step.
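A much simplified sketch of this iteration scheme follows; the per-pass weights and thresholds are invented, and the support/restriction interaction between existing and new links is not modelled.

```python
# Simplified sketch of MEBA-style iterative linking: each pass adds only
# the links whose weighted feature score clears that pass's threshold;
# links from earlier passes stay fixed. Weights and thresholds are
# invented, and the support/restriction effect of existing links on
# candidates is omitted.

def iterative_align(candidates, weights_per_pass, thresholds):
    """candidates maps (src, tgt) pairs to {feature: value} dicts."""
    links = set()
    for weights, threshold in zip(weights_per_pass, thresholds):
        for pair, feats in candidates.items():
            if pair in links:
                continue
            score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
            if score >= threshold:
                links.add(pair)
    return links

candidates = {(0, 0): {"te": 0.9, "cog": 1.0}, (1, 1): {"te": 0.4}}
# Pass 1 is strict, pass 2 relaxes the threshold and reweights:
links = iterative_align(candidates,
                        [{"te": 0.7, "cog": 0.3}, {"te": 1.0}],
                        [0.9, 0.3])
```

Link (0, 0) clears the strict first pass; (1, 1) is only admitted by the relaxed second pass.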

Page 5: An integrated platform for high-accuracy word alignment

Features characterizing a link

A link <Token1 Token2> is characterized by a set of features, the values of which are real numbers in the [0,1] interval.

context-independent features (CIF): they refer only to the tokens of the current link

context-dependent features (CDF): they refer to the properties of the current link with respect to the rest of the links in a bi-text

Page 6: An integrated platform for high-accuracy word alignment

Context independent features

Translation equivalents (lemma and/or wordform)
Translation equivalents entropy (lemma)
Part-of-speech affinity
Cognates

Page 7: An integrated platform for high-accuracy word alignment

Translation equivalents (TE)

YAWA and TREQ-AL use competitive linking based on log-likelihood scores, plus the Ro-En aligned wordnets.

MEBA uses GIZA++ generated candidates filtered with a log-likelihood threshold (11).

The TE candidates search space is limited by lemmatization and POS meta-classes (e.g. meta-class 1 includes only N, V, Aj and Adv; meta-class 8 includes only proper names).

For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link <TOKEN1 TOKEN2> is 1/2 (PTR(TOKEN1, TOKEN2) + PTR(TOKEN2, TOKEN1)).
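The symmetric TE value follows directly from that formula; the directional probability tables below are invented placeholders for the GIZA++/competitive-linking output.

```python
def te_feature(p_fwd, p_rev, t1, t2):
    """TE(<TOKEN1, TOKEN2>) = 1/2 (PTR(TOKEN1, TOKEN2) + PTR(TOKEN2, TOKEN1)):
    the average of the two directional translation probabilities."""
    return 0.5 * (p_fwd.get((t1, t2), 0.0) + p_rev.get((t2, t1), 0.0))

# Invented Ro-En directional probabilities for illustration:
p_fwd = {("casa", "house"): 0.8}
p_rev = {("house", "casa"): 0.6}
print(te_feature(p_fwd, p_rev, "casa", "house"))  # 0.5 * (0.8 + 0.6)
```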

Page 8: An integrated platform for high-accuracy word alignment

Entropy Score (ES)

The entropy of a word's translation equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links).

Skewed distributions are favored over uniform ones.

For a link <A B>, the link feature value is 0.5(ES(A)+ES(B))

ES(W) = 1 + (1/log N) * Σ_{i=1..N} p(W, TR_i) * log p(W, TR_i)

where TR_1 ... TR_N are the translation equivalents of W.
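A small sketch of the entropy score (one plus the normalized negative entropy, as reconstructed from the slide): a perfectly uniform distribution scores 0, a skewed one scores close to 1. The probability vectors are invented.

```python
import math

def entropy_score(probs):
    """ES(W) = 1 + (1/log N) * sum_i p(W, TR_i) * log p(W, TR_i),
    where probs holds the distribution of W over its N translation
    equivalents (formula reconstructed from the slide)."""
    n = len(probs)
    if n <= 1:
        return 1.0
    return 1.0 + sum(p * math.log(p) for p in probs if p > 0) / math.log(n)

print(entropy_score([0.25, 0.25, 0.25, 0.25]))  # uniform -> 0.0
print(entropy_score([0.97, 0.01, 0.01, 0.01]))  # skewed -> close to 1
```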

Page 9: An integrated platform for high-accuracy word alignment

Part-of-speech affinity (PA)

An important clue in word alignment is that translated words tend to keep their part of speech, and when they have different POSes this is not arbitrary.

We tried to use GIZA++ (replacing tokens with their respective POSes), but there was too much noise!

The information was computed from a gold standard (the revised NAACL 2003 data), in both directions (source-target and target-source).

For a link <A,B>: PA = 0.5*(P(cat(A)|cat(B)) + P(cat(B)|cat(A)))
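The PA value is a direct average of the two conditional probabilities; the table below is an invented stand-in for the counts estimated from the gold standard.

```python
def pos_affinity(p_cond, cat_a, cat_b):
    """PA = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A))).
    p_cond[(x, y)] stands for P(x|y); the values here are invented."""
    return 0.5 * (p_cond.get((cat_a, cat_b), 0.0) +
                  p_cond.get((cat_b, cat_a), 0.0))

p_cond = {("N", "N"): 0.9, ("N", "V"): 0.05, ("V", "N"): 0.07}
print(pos_affinity(p_cond, "N", "V"))  # 0.5 * (0.05 + 0.07)
```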

Page 10: An integrated platform for high-accuracy word alignment

Cognates (COG)

The cognates feature assigns a string similarity (using Levenshtein distance) to the tokens of a candidate link.

We estimated the probability that a pair of orthographically similar words appearing in aligned sentences are cognates, for different string similarity thresholds. For the threshold 0.6 we did not find any exception. Therefore, the value of this feature is either 1 (if the similarity score is above the threshold) or 0 otherwise.

Before computing the string similarity score, the words are normalized (duplicate letters are removed, diacritics are removed, some suffixes are discarded).
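A rough sketch of the normalization and thresholding: difflib's similarity ratio stands in for the Levenshtein-based score, only a few Romanian diacritics are handled, and the suffix stripping is skipped.

```python
import re
from difflib import SequenceMatcher

def normalize(word):
    """Drop a few Romanian diacritics and collapse duplicate letters
    (the suffix stripping mentioned on the slide is omitted)."""
    word = word.lower().translate(str.maketrans("ăâîşţ", "aaist"))
    return re.sub(r"(.)\1+", r"\1", word)

def cognate_feature(w1, w2, threshold=0.6):
    """1 if the normalized string similarity clears the 0.6 threshold,
    0 otherwise; difflib's ratio stands in for Levenshtein distance."""
    sim = SequenceMatcher(None, normalize(w1), normalize(w2)).ratio()
    return 1 if sim >= threshold else 0

print(cognate_feature("platformă", "platform"))  # similar -> 1
print(cognate_feature("casă", "house"))          # dissimilar -> 0
```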

Page 11: An integrated platform for high-accuracy word alignment

Context dependent features

Locality
Crossed links
Relative position/Distortion
Collocation/Fertility
Coherence

Page 12: An integrated platform for high-accuracy word alignment

Collocation

Bi-gram lists (content words only) were built from each monolingual part of the training corpus, using the log-likelihood score (threshold of 10) and a minimal occurrence frequency (3) for candidate filtering. Collocation probabilities are estimated for each surviving bi-gram.

If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the link value of this feature is 0. Otherwise the value is the maximum of the collocation probabilities of the link's tokens. Competing links (starting or finishing in the same token) are licensed if and only if at least one of them has a non-null collocation score.
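The bi-gram filtering step can be sketched as follows. The log-likelihood scorer is passed in as a function, since computing it needs corpus-level counts; everything here is illustrative except the two thresholds from the slide.

```python
from collections import Counter

def collocation_candidates(tokens, ll_score, ll_threshold=10.0, min_freq=3):
    """Keep bi-grams occurring at least min_freq times whose
    log-likelihood score clears ll_threshold (both values from the
    slide), then estimate a probability for each surviving bi-gram."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    kept = {bg: n for bg, n in bigrams.items()
            if n >= min_freq and ll_score(bg) >= ll_threshold}
    total = sum(kept.values())
    return {bg: n / total for bg, n in kept.items()} if total else {}

# Toy corpus and a constant scorer, for illustration only:
tokens = ["word", "alignment"] * 3 + ["noise", "token"]
probs = collocation_candidates(tokens, ll_score=lambda bg: 20.0)
```

Only ("word", "alignment") survives: the other bi-grams fall below the minimum frequency of 3.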

Page 13: An integrated platform for high-accuracy word alignment

Distortion/Relative position

Each token on both sides of a bi-text is characterized by a position index, computed as the ratio between its relative position in the sentence and the length of the sentence. The absolute value of the difference between the tokens' position indexes gives the link's "obliqueness".

The distortion feature of a link is its obliqueness: D(link) = OBL(SWi, TWj)

OBL(SW_i, TW_j) = | i/length(Sent_S) - j/length(Sent_T) |
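The obliqueness computation is a one-liner; positions and sentence lengths below are invented.

```python
def obliqueness(i, j, src_len, tgt_len):
    """OBL(SW_i, TW_j) = |i/length(Sent_S) - j/length(Sent_T)|: the
    distortion feature of a link between source position i and target
    position j."""
    return abs(i / src_len - j / tgt_len)

print(obliqueness(2, 4, 10, 20))  # same relative position -> 0.0
```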

Page 14: An integrated platform for high-accuracy word alignment

Localization

This feature is relevant with or without the chunking or dependency parsing modules. It accounts for the degree of cohesion of the links.

When the chunking module is available, and the chunks are aligned via the linking of their respective heads, the links starting in one chunk should finish in the aligned chunk.

When chunking information is not available, the link localization is judged against a window, the span of which depends on the length of the aligned sentences.

Maximum localization (1) is reached when all the tokens in the source window are linked to tokens in the target window.

Page 15: An integrated platform for high-accuracy word alignment

Crossed links

The crossed links feature counts (for a window size depending on the categories of the candidates and the sentence lengths) the links that would be crossed.

The normalization factor (the maximum number of crossable links) is set empirically, based on the categories of the link's tokens.

CrossFeature = 1 - (No. of crossed links) / MAX(max_number(C_S, C_T), No. of crossed links)
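A sketch of the crossed-links feature (one minus the ratio of crossed links to the normalization factor). The empirically set normalization factor is passed in as a parameter; the link data is invented.

```python
def cross_feature(existing_links, candidate, max_crossable):
    """1 - crossed / MAX(max_crossable, crossed). Two links (i1, j1)
    and (i2, j2) cross when their source and target orders disagree."""
    i2, j2 = candidate
    crossed = sum(1 for i1, j1 in existing_links
                  if (i1 - i2) * (j1 - j2) < 0)
    denom = max(max_crossable, crossed)
    return 1.0 if denom == 0 else 1.0 - crossed / denom

links = [(0, 0), (1, 2), (2, 1)]
print(cross_feature(links, (0, 3), max_crossable=4))  # crosses 2 links
```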

Page 16: An integrated platform for high-accuracy word alignment

EVALUATION: Official ranking

[Table of the official shared-task ranking comparing U.RACAI.Combined and L.ISI.Run5.vocab.grow; the figures did not survive the transcript]

Pages 17-20: evaluation charts (content not recoverable from the transcript)

Word alignment combiners

The COWAL (ACL 2005) combiner is fine-tuned for the language pair concerned (rule-based).

The SVM filter is a language-independent combiner (trainable on positive and negative examples).

There is a trade-off between human introspection and performance.

Page 21: An integrated platform for high-accuracy word alignment

SVM filter

Combining word alignments requires the ability to distinguish between the correct and incorrect links of the two or more merged alignments. SVM technology is particularly adequate for this task:

The SVM combiner is a classifier trained on both positive and negative examples.
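The filter can be sketched as a binary classifier over per-link feature vectors. To keep the example dependency-free, a simple perceptron stands in for the SVM, and the feature vectors (translation-equivalence and cognate scores) and labels are invented.

```python
# Stand-in for the SVM link filter: a perceptron trained on feature
# vectors of links labelled correct (+1) or incorrect (-1). A real
# implementation would use an SVM library; all data here is invented.

def train_link_filter(examples, labels, epochs=50, lr=0.1):
    w, b = [0.0] * len(examples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
            if pred != y:  # misclassified: nudge the decision boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def keep_link(w, b, x):
    """Accept a link whose feature vector falls on the positive side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b >= 0

# Feature vectors: (translation-equivalence score, cognate score)
X = [[0.9, 1.0], [0.8, 0.0], [0.1, 0.0], [0.2, 0.0]]
y = [1, 1, -1, -1]  # correct vs. incorrect links
w, b = train_link_filter(X, y)
```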

Page 22: An integrated platform for high-accuracy word alignment

SVM filter evaluation

            MEBA     COWAL    MEBA filtered   YAWA & MEBA filtered
Precision   0.9122   0.8795   0.9315          0.8830
Recall      0.6976   0.7775   0.6712          0.7713
F-measure   0.7924   0.8254   0.7802          0.8234

SVM filtering results. The SVM model was trained on the NAACL 2003 gold standard.

Page 23: An integrated platform for high-accuracy word alignment

Romanian Acquis

The available Romanian documents were downloaded from CCVISTA (over 12000 Microsoft Word documents).

We kept only 11228 files (some of the others were different versions of the same document).

The remaining documents were converted into the same XML format as the ACQUIS corpus.

Of the 11228 Romanian files, only 6256 are also available in English in the JRC distribution.

Page 24: An integrated platform for high-accuracy word alignment

Romanian Acquis

•Tokenization

•Sentence splitting

•POS-tagging

•Lemmatization

•Chunking

•Sentence aligning

Page 25: An integrated platform for high-accuracy word alignment
