DESCRIPTION

Presentation at SLATE 2013 - http://slate-conf/

TRANSCRIPT

Page 1: Dictionary Alignment by Rewrite-based Entry Translation

Dictionary Alignment by Rewrite-based Entry Translation

Alberto Simões (1), Xavier Gómez Guinovart (2)

(1) Centro de Estudos Humanísticos, Universidade do Minho, Campus de Gualtar, Braga, Portugal

[email protected]

(2) Galician Language Technologies and Applications (TALG Group), Universidade de Vigo, Galiza, Spain

[email protected]

SLATE 2013

Alberto Simões, Xavier Gómez Guinovart: Dictionary Alignment by Rewrite-based Entry Translation

Page 2: Dictionary Alignment by Rewrite-based Entry Translation

Motivation

We have a running project, Dicionário-Aberto, that allows the user to consult a Portuguese dictionary;

Dicionário-Aberto (DA) is also available in TEI and DB formats;

Within the GALNET project, a Galician Synonyms Dictionary (GSD) was converted from a WYSIWYG format to a rich TEI format;

Would it be possible to integrate the GSD into DA?

Page 3: Dictionary Alignment by Rewrite-based Entry Translation

Problem

Dicionário-Aberto has more than a hundred thousand entries!

The Galician Synonyms Dictionary is smaller, with some tens of thousands of entries.

Problem: how to align entries from both dictionaries?

The two languages are very close, which helps with concept alignment!

There are too many words that differ between the two languages;

There is a fair number of false friends;

There isn't a free translation dictionary that is big enough.

Page 5: Dictionary Alignment by Rewrite-based Entry Translation

Inspiration (part 1)

In the first year, “s” will be used instead of the soft “c.” Sertainly, sivil servants will resieve this news with joy. Also, the hard “c” will be replaced with “k”. Not only will this klear up konfusion, but typewriters kan have one less letter.

There will be growing publik enthusiasm in the sekond year, when the troublesome “ph” will be replaced by “f”. This will make words like “fotograf” 20 persent shorter.

In the third year, publik akseptanse of the new spelling kan be expekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of silent “e”s in the languag is disgrasful, and they would go.

Page 6: Dictionary Alignment by Rewrite-based Entry Translation

Inspiration (part 2)

By the fourth year, peopl wil be reseptiv to steps such as replasing “th” by “z” and “w” by “v”.

During ze fifz year, ze unesesary “o” kan be dropd from vords kontaining “ou”, and similar changes vud of kors be aplid to ozer kombinations of leters.

After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubls or difikultis and evrivun vil find it ezi tu understand ech ozer. Ze drem vil finali kum tru!!

Page 7: Dictionary Alignment by Rewrite-based Entry Translation

Approach

Define a translation function based on a set or sequence of text transformations (mainly substitutions) that convert (translate) Portuguese words into Galician words.

The translation function is defined as

$T(L_{gl}, w_{pt}) = w_{gl}$

$L_{gl}$ is the target Galician lexicon, obtained from the words present in the Galician Synonyms Dictionary;

$w_{pt}$ is the Portuguese word being translated;

$w_{gl}$ is the Galician translation.

Page 9: Dictionary Alignment by Rewrite-based Entry Translation

Translation Function

Substitutions can be simple, as:

ss > s: passo > paso

j > x: sujeito > suxeito, injectar > inxectar

z > c before e or i (plain or accented): bronze > bronce

Substitutions can over-generate:

-ção > -ción, -zón: adivinhação > adiviñación, coração > corazón

-velmente > -belmente, -blemente: previsivelmente > previsibelmente, previsiblemente

rv > rv, rb: preservação > preservación, estorvar > estorbar
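One way to picture an over-generating substitution is as a regular expression paired with a list of replacements. The sketch below is only an illustration under that assumption: the `RULES` list and the `apply_rule` helper are hypothetical names, not taken from the authors' implementation, and only three of the rules are encoded.

```python
import re

# Each rule pairs a regex over the Portuguese word with one or more Galician
# replacements; more than one replacement means the rule over-generates.
# (Hypothetical encoding, chosen for illustration.)
RULES = [
    (r"ss", ["s"]),               # simple rule: passo > paso
    (r"ção$", ["ción", "zón"]),   # suffix rule with two possible outputs
    (r"rv", ["rv", "rb"]),        # keeps the original form as one alternative
]

def apply_rule(word, pattern, replacements):
    """Return every distinct form one rule produces, in replacement order."""
    forms = []
    for replacement in replacements:
        form = re.sub(pattern, replacement, word)
        if form not in forms:
            forms.append(form)
    return forms

print(apply_rule("passo", *RULES[0]))     # ['paso']
print(apply_rule("coração", *RULES[1]))   # ['coración', 'corazón']
print(apply_rule("estorvar", *RULES[2]))  # ['estorvar', 'estorbar']
```

Only one of the forms produced for coração is a real Galician word, which is why the candidates still have to be checked against the target lexicon.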

Page 11: Dictionary Alignment by Rewrite-based Entry Translation

Translation Function

A word without substitutions can be a valid translation;

Substitutions can be inter-dependent (for example, -ção > -ción should be applied before ç > z);

Substitutions are applied from more generic to more specific (unless there is interdependence);

Substitutions can generate more than one possible translation;

The function returns the first candidate translation that exists in the target lexicon.
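The properties above can be sketched as a small pipeline: apply the rules in order, keep every form they generate, and return the first form found in the lexicon. This is a minimal sketch under stated assumptions, not the authors' code: the names `RULES`, `candidates`, `translate` and `LEXICON_GL` are hypothetical, only five rules are encoded, the toy lexicon is invented for the example, and returning `None` when no candidate is in the lexicon is an assumption the slides do not specify.

```python
import re

# Ordered rules (hypothetical encoding). "-ção" precedes "ç" because the two
# substitutions are inter-dependent: "ç > z" would destroy the "-ção" suffix.
RULES = [
    (r"ss", ["s"]),
    (r"j", ["x"]),
    (r"ção$", ["ción", "zón"]),
    (r"ç", ["z"]),
    (r"nh", ["ñ"]),
]

def candidates(word_pt, rules=RULES):
    """Apply every rule in order, expanding the list of candidate forms."""
    forms = [word_pt]
    for pattern, replacements in rules:
        expanded = []
        for form in forms:
            for replacement in replacements:
                new_form = re.sub(pattern, replacement, form)
                if new_form not in expanded:
                    expanded.append(new_form)
        forms = expanded
    return forms

def translate(lexicon_gl, word_pt, rules=RULES):
    """Return the first candidate present in the Galician lexicon, else None."""
    for form in candidates(word_pt, rules):
        if form in lexicon_gl:
            return form
    return None  # assumption: no lexicon match means no translation

# Toy lexicon, invented for the example.
LEXICON_GL = {"paso", "suxeito", "corazón", "adiviñación", "casa"}

print(translate(LEXICON_GL, "coração"))      # corazón
print(translate(LEXICON_GL, "adivinhação"))  # adiviñación
print(translate(LEXICON_GL, "casa"))         # casa (no substitution needed)
print(translate(LEXICON_GL, "talho"))        # None
```

A word that no rule matches passes through unchanged, so an identical Portuguese and Galician spelling is still found in the lexicon.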

Page 12: Dictionary Alignment by Rewrite-based Entry Translation

Translation Function

Id.   Substitution
ID    (none)
A     ss > s
B     j > x
C     -ção > -ción, -zón
D     ç > z
E     nh > ñ
F     -dizer > -dicir
G     z > c before e or i (plain or accented)
H     lh > ll
I     vr > br
J     -agem > -axe
K     g > x before e or i (plain or accented)
L     -ável > -ábel, -able
M     -ível > -íbel, -ible
N     -velmente > -belmente, -blemente
O     -eio > -eo
P     -ância > -ancia
Q     -ência > -encia
R     -aria > -ería, -aría
S     -ário > -ario
T     -óri[oa] > -ori[oa]
U     -são > -sión, -són
V     -rão > -rón, -rán
W     -mão > -món, -mán
X     -ião > -ión, -ián
Y     -ício > -icio
Z     -óide > -oide
AA    -ídio > -idio
AB    -ânico > -ánico
AC    -édia > -edia
AD    -cimento > -cemento
AE    -m > -n
AF    -crever > -cribir
AG    -u > -u, -o
AH    -var > -bar
AI    im- > im-, inm-
AJ    qua- > cua-, ca-
AK    qua > cua
AL    -xão > -xón, -xión
AM    rv > rv, rb
AN    -iver > -ivir

Page 13: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 1

Given a small (about 9K pairs) hand-curated translation dictionary, classify each result of the translation function as in a Type I/II error analysis:

$T(L_{gl}, w_{pt}) = w_{gl}$          Correct   Incorrect
$w_{gl}$ is a Galician word           TP        FP
$w_{gl}$ is not a Galician word       TN        FN

TP: True Positive, a correct translation;

FP: False Positive, a wrong translation, but the obtained word is present in the Galician lexicon;

TN: True Negative, a correct translation, but the translation is not in the Galician lexicon (always 0);

FN: False Negative, a wrong translation, and the obtained word is not in the Galician lexicon.
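The four outcomes can be read as a two-way test on each evaluated pair: is the produced word the expected one, and is it in the Galician lexicon? The sketch below follows the definitions above. The `classify` function, the toy lexicon and the three (predicted, gold) pairs are hypothetical and exist only to illustrate the bookkeeping.

```python
from collections import Counter

def classify(predicted, gold, lexicon_gl):
    """Label one evaluation pair as TP, FP, TN or FN."""
    in_lexicon = predicted in lexicon_gl
    if predicted == gold:
        # A correct word outside the lexicon would be a TN; since the
        # function only returns lexicon words, this never happens.
        return "TP" if in_lexicon else "TN"
    return "FP" if in_lexicon else "FN"

# Toy data, invented for the example: (word produced, word expected).
lexicon_gl = {"paso", "corazón", "presunto"}
results = [
    ("paso", "paso"),       # correct and in the lexicon
    ("presunto", "xamón"),  # wrong, but a real Galician word (false friend)
    ("talho", "tallo"),     # wrong and not a Galician word
]

counts = Counter(classify(p, g, lexicon_gl) for p, g in results)
print(counts["TP"], counts["FP"], counts["TN"], counts["FN"])  # 1 1 0 1
```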

Page 14: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 1 — Measures

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

$$\text{precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (3)$$

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (4)$$
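As a worked instance of measures (1) to (4), the sketch below computes them from raw counts. TP = 5390 is the "Correct" count of the ID row in the first results table, and TN = 0 follows from the definition of true negatives. FP = 25 and FN = 3810 are not reported in the slides: they are back-computed here from the rounded precision and recall of that row, so treat them as an assumption. The function name `scores` is hypothetical.

```python
def scores(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# fp and fn are assumed values, back-computed from the rounded ID-row figures.
id_row = scores(tp=5390, tn=0, fp=25, fn=3810)
for name, value in id_row.items():
    print(f"{name}: {value:.4f}")
```

With these counts the four values round to the figures reported for the ID row (precision 0.9954, recall 0.5859, F1 0.7376, accuracy 0.5843).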

Page 15: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 1 — Results

Id.   Precision   Recall   F1       Accuracy   Correct   Δ
ID    0.9954      0.5859   0.7376   0.5843     5390      5390
A     0.9952      0.6038   0.7516   0.6020     5553      163
B     0.9951      0.6158   0.7608   0.6139     5663      110
C     0.9952      0.6567   0.7912   0.6546     6038      375
D     0.9951      0.6687   0.7999   0.6665     6148      110
E     0.9952      0.6782   0.8066   0.6760     6235      87
F     0.9952      0.6786   0.8070   0.6764     6239      4
G     0.9953      0.6838   0.8107   0.6816     6287      48
H     0.9953      0.6927   0.8169   0.6905     6369      82
I     0.9953      0.6934   0.8174   0.6911     6375      6
J     0.9953      0.6964   0.8195   0.6942     6403      28
K     0.9955      0.7210   0.8363   0.7187     6629      226
L     0.9955      0.7256   0.8394   0.7232     6671      42
M     0.9955      0.7284   0.8413   0.7260     6697      26
N     0.9957      0.7482   0.8544   0.7458     6879      182

Page 16: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 1 — Results

Id.   Precision   Recall   F1       Accuracy   Correct   Δ
O     0.9957      0.7496   0.8553   0.7472     6892      13
P     0.9957      0.7515   0.8565   0.7490     6909      17
Q     0.9957      0.7588   0.8612   0.7563     6976      67
R     0.9957      0.7602   0.8621   0.7577     6989      13
S     0.9958      0.7680   0.8672   0.7655     7061      72
T     0.9958      0.7703   0.8686   0.7678     7082      21
U     0.9958      0.7772   0.8731   0.7747     7146      64
V     0.9958      0.7780   0.8735   0.7755     7153      7
W     0.9958      0.7783   0.8737   0.7758     7156      3
X     0.9958      0.7796   0.8746   0.7771     7168      12
Y     0.9958      0.7806   0.8752   0.7781     7177      9
Z     0.9958      0.7807   0.8753   0.7782     7178      1
AA    0.9958      0.7813   0.8756   0.7787     7183      5

Page 17: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 1 — Results

Id.   Precision   Recall   F1       Accuracy   Correct   Δ
AB    0.9958      0.7818   0.8759   0.7793     7188      5
AC    0.9958      0.7822   0.8762   0.7797     7192      4
AD    0.9959      0.7836   0.8770   0.7810     7204      12
AE    0.9959      0.7855   0.8783   0.7830     7222      18
AF    0.9959      0.7863   0.8787   0.7837     7229      7
AG    0.9957      0.7876   0.8795   0.7849     7240      11
AH    0.9957      0.7882   0.8799   0.7856     7246      6
AI    0.9958      0.7903   0.8812   0.7876     7265      19
AJ    0.9956      0.7928   0.8827   0.7900     7287      22
AK    0.9956      0.7940   0.8834   0.7912     7298      11
AL    0.9956      0.7947   0.8839   0.7920     7305      7
AM    0.9956      0.7951   0.8842   0.7924     7309      4
AN    0.9956      0.7955   0.8844   0.7927     7312      3

Page 18: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 2

Triangulating a bigger dictionary for evaluation purposes:

PT–SP (12 340) ◦ SP–GL (7 581) ⇒ PT–GL (5 045 pairs)

both from the Apertium translation software

PT–SP ◦ SP–EN (24 912) ◦ EN–GL (17 626) ⇒ PT–GL (6 644 pairs)

PT–SP and SP–EN from Apertium, EN–GL from CLUVI

PT–EN (14 600) ◦ EN–GL ⇒ PT–GL (8 589 pairs)

PT–EN from a merchandising app, EN–GL from CLUVI

Merging the three resulting dictionaries yields 14 492 pairs.
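The ◦ operator above is composition through a shared pivot language: a PT word is linked to every GL word reachable through one of its pivot translations. The sketch below shows the idea on toy data. The `compose` and `merge` helpers and all dictionary entries are hypothetical, and the dictionaries are modelled as plain Python mappings from a word to a set of translations rather than in the format of any of the resources named above.

```python
def compose(first, second):
    """Compose two bilingual dictionaries through their shared pivot language."""
    result = {}
    for source, pivots in first.items():
        targets = set()
        for pivot in pivots:
            targets |= second.get(pivot, set())
        if targets:  # drop source words with no path to the target language
            result[source] = targets
    return result

def merge(*dictionaries):
    """Union several dictionaries that share source and target languages."""
    merged = {}
    for dictionary in dictionaries:
        for source, targets in dictionary.items():
            merged.setdefault(source, set()).update(targets)
    return merged

# Toy entries, invented for the example.
pt_sp = {"coração": {"corazón"}, "passo": {"paso"}}
sp_gl = {"corazón": {"corazón"}, "paso": {"paso"}}
pt_en = {"coração": {"heart"}, "cão": {"dog"}}
en_gl = {"heart": {"corazón"}, "dog": {"can"}}

pt_gl = merge(compose(pt_sp, sp_gl), compose(pt_en, en_gl))
pair_count = sum(len(targets) for targets in pt_gl.values())
print(pair_count)  # 3
```

Pairs reached through more than one route are counted once, which is consistent with the merged total being smaller than the sum of the three triangulated dictionaries.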

Page 19: Dictionary Alignment by Rewrite-based Entry Translation

Evaluation 2 – Results

Id.   Precision   Recall   F1       Accuracy   Correct   Δ
ID    0.9668      0.5022   0.6611   0.4937     7155      7155
A     0.9664      0.5176   0.6741   0.5084     7368      213
B     0.9663      0.5275   0.6824   0.5179     7506      138
C     0.9668      0.5646   0.7129   0.5538     8026      520
D     0.9661      0.5746   0.7206   0.5633     8163      137
E     0.9658      0.5831   0.7272   0.5713     8279      116
...
AH    0.9660      0.6819   0.7994   0.6659     9650      7
AI    0.9661      0.6841   0.8010   0.6681     9682      32
AJ    0.9660      0.6863   0.8025   0.6701     9711      29
AK    0.9660      0.6873   0.8032   0.6711     9726      15
AL    0.9661      0.6881   0.8037   0.6718     9736      10
AM    0.9660      0.6884   0.8039   0.6721     9740      4
AN    0.9660      0.6887   0.8041   0.6724     9744      4

Page 20: Dictionary Alignment by Rewrite-based Entry Translation

Dictionary Alignment — Results

                Portuguese Words         Galician Words
Substitution    Count     Percentage     Count     Percentage
ID              12711     15.3502%       12711     33.7475%
A               13082     15.7982%       13065     34.6874%
B               13447     16.2390%       13421     35.6326%
C               14348     17.3270%       14321     38.0220%
D               14764     17.8294%       14728     39.1026%
E               15174     18.3245%       15138     40.1912%
...
AI              17712     21.3895%       17627     46.7994%
AJ              17740     21.4233%       17648     46.8552%
AK              17765     21.4535%       17673     46.9215%
AL              17784     21.4764%       17693     46.9746%
AM              17813     21.5115%       17718     47.0410%
AN              17817     21.5163%       17722     47.0516%
DIC             20084     24.2540%       19989     53.0705%

Page 21: Dictionary Alignment by Rewrite-based Entry Translation

Final Remarks

An approach to translate Portuguese words in a dictionary into Galician words using a set of string substitutions;

The approach is unable to translate all words;

A considerable number of words in Dicionário-Aberto use pre-1930 orthography, which was not dealt with;

We deliberately ignored a relevant problem: false friends.

Two words that share only a subset of their meanings. For instance, talho (PT) and tallo (GL) share the majority of their senses, but some senses are specific to Portuguese (for example, the place where meat is sold);

Two words that have completely different meanings. An example is the word presunto (written the same way in the two languages), which means ham in Portuguese (a noun) but alleged in Galician (an adjective).
