Alejandro Curado, Martín Garay

University of Extremadura, Spain

Variation-influenced quality for MT: General vs. specialised corpora

Post on 29-Jul-2015

Page 2

Theoretical background / method: Context-based MT

>Data retrieved from a massive corpus: a large amount of data against which to compare n-grams (the more context, the better the correspondences)

>No need for a parallel corpus (unlike SMT, which translates from aligned / parallel corpora)

>Top scores on the BLEU (Bilingual Evaluation Understudy) scale for MT: Carbonell, 2006 (“Meaningful Machines”) = 0.66 in a “blind test”, against 0.7 for human translation, with 53 GB of target text

>General translation (vs. Specialised translation??)

Page 4

Resources:

English dictionary table with 200,000 entries (single and compound words / idioms).

Spanish dictionary table with more than 5,000,000 entries.

Large general numerical corpus that may reach up to 100 GB (end of July 2010): indexed by text, sentence and word.

Page 5

Improve / increase resources:

Dictionary:

Web pages have been developed to:

1. Add all the missing word units (with their equivalents).
2. Add word meanings that are not yet in the dictionaries (WordReference).

Page 6

Improve / increase resources:

The large corpus.

Indexing:

1. Books on the web.
2. Wikipedia.
3. Sketch Engine-retrieved texts (seed keywords).

Page 7

Types of Spanish corpora used for the translation tests:

The large corpus (late May 2010)

Nearly 73 million words, 11,490 texts, 3,900,000 sentences (+ 1,256 news texts indexed in June = + 4 million words)

Experiment corpus (1) with apartment / housing ads (March 2010)

70 texts, 5,455 sentences, 87,353 words

Experiment corpus (2) with international news (June 2010)

286 texts, 2,791 sentences, 125,936 words

Page 8

Translation procedure.

1st step. Inserting the sentence or text.

The nice big house is located near the sea.

Page 9

2nd step. Dividing the text into phrases / sentences.

The segmentation is carried out by using the following punctuation symbols:

. ; : ¿? ¡!

In our case:

The nice big house is located near the sea.
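
The segmentation step could be sketched as follows. This is only an illustration under assumptions: the function name `split_segments` is hypothetical, and the sketch simply drops the delimiters, which the slides do not confirm.

```python
import re

# Delimiter set taken from the slide: . ; : ¿ ? ¡ !
SEGMENT_DELIMITERS = ".;:¿?¡!"

def split_segments(text):
    """Split text into segments at any run of delimiter characters."""
    pattern = "[" + re.escape(SEGMENT_DELIMITERS) + "]+"
    pieces = re.split(pattern, text)
    # Discard empty pieces and surrounding whitespace.
    return [piece.strip() for piece in pieces if piece.strip()]

# Both sentences come from the example slides.
segments = split_segments(
    "The nice big house is located near the sea. "
    "The white house has an old gate that is broken and ugly."
)
```

Each element of `segments` would then be translated as a separate unit.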

Page 10

3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.

The nice big house is located near the sea

44634 30497 6962 22817 3456 27139 30255 44634 39064
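
A minimal sketch of this lookup, assuming a plain in-memory table. The numbers are the ones shown on the slide; the name `encode`, the lower-casing and the handling of unknown words are assumptions.

```python
# Word-to-number table copied from the slide's example. The real English
# dictionary table is described as having 200,000 entries.
ENGLISH_IDS = {
    "the": 44634, "nice": 30497, "big": 6962, "house": 22817,
    "is": 3456, "located": 27139, "near": 30255, "sea": 39064,
}

def encode(sentence):
    """Map each word to its dictionary number (None if the word is unknown)."""
    words = sentence.lower().rstrip(".").split()
    return [ENGLISH_IDS.get(word) for word in words]

codes = encode("The nice big house is located near the sea.")
```

Working with numbers rather than strings is presumably what allows the corpus to be indexed by text, sentence and word.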

Page 11

4th step. We remove the function / nexus words (the words that occur most frequently in the language) from the sentence and store them in a separate table.

Final phrase: nice big house located sea.

Function words removed: The, is, near, the
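
The filtering could look roughly like this. The stop list below contains only the function words from the example, and recording each word's original position is an assumption about what the separate table holds.

```python
# Hypothetical stop list: only the function words seen in the example.
FUNCTION_WORDS = {"the", "is", "near"}

def split_function_words(words):
    """Separate content words from function words.

    Each removed word is stored with its original position so that it
    can be put back in a later step.
    """
    content, removed = [], []
    for position, word in enumerate(words):
        if word.lower() in FUNCTION_WORDS:
            removed.append((position, word))
        else:
            content.append(word)
    return content, removed

content, removed = split_function_words(
    "The nice big house is located near the sea".split()
)
```

In a real system the stop list would be derived from word frequencies in the language rather than written by hand.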

Page 12

5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.

Note: in the tests, each word was restricted to only two equivalents in Spanish.
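
As a sketch, the dictionary query with the two-equivalent restriction might be written as below. The table is hypothetical: "agradable" is an invented extra entry added to show the cut-off, while the other forms appear on the example slides.

```python
MAX_EQUIVALENTS = 2  # restriction used in the tests

# Hypothetical bilingual table. "agradable" is invented for illustration;
# the remaining forms appear on the example slides.
EQUIVALENTS = {
    "nice": ["bonito", "bonita", "agradable"],
    "big": ["gran", "grande"],
    "house": ["casa"],
    "located": ["situado", "situada"],
    "sea": ["mar"],
}

def lookup_equivalents(word, limit=MAX_EQUIVALENTS):
    """Return at most `limit` translation equivalents for a content word."""
    return EQUIVALENTS.get(word.lower(), [])[:limit]

content_words = ["nice", "big", "house", "located", "sea"]
candidates = {word: lookup_equivalents(word) for word in content_words}
```

Capping the list keeps the number of combinations generated in the next step small.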

Page 13

6th step. Each n-gram is divided into sub-n-grams (different combinations of the correspondences), which are then sent to the corpus.

Nice big house located sea.

1st: bonito gran casa situado = 1043795 284672 839170 1098037

2nd: bonito gran casa situada = 1043795 284672 839170 1098063

...

nth: bonita gran casa situada = 1043794 284672 839170 1098063
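
The combinations can be generated as a Cartesian product of the candidate equivalents. The sketch below uses the words and Spanish dictionary numbers shown on the slide; the function name `sub_ngrams` is hypothetical, and the slides do not say how the system itself enumerates the combinations.

```python
from itertools import product

# Candidate equivalents for each content word, as shown on the slide.
CANDIDATES = [
    ["bonito", "bonita"],    # nice
    ["gran"],                # big
    ["casa"],                # house
    ["situado", "situada"],  # located
]

# Spanish dictionary numbers copied from the slide.
SPANISH_IDS = {
    "bonito": 1043795, "bonita": 1043794, "gran": 284672,
    "casa": 839170, "situado": 1098037, "situada": 1098063,
}

def sub_ngrams(candidates):
    """Return every combination that takes one equivalent per word."""
    return [list(combo) for combo in product(*candidates)]

combos = sub_ngrams(CANDIDATES)
encoded = [[SPANISH_IDS[word] for word in combo] for combo in combos]
```

With two equivalents for two of the four words, this yields four sub-n-grams, each of which is sent to the corpus as a sequence of numbers.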

Page 14

7th step. A score is given to each result obtained; thus, each sub-n-gram receives a final score and is ranked according to that score.

- Parameters that decide the score given to each sub-n-gram:

• Number of needed words found in the sentence.
• Distance found between the words.
• Number of needed words found together inside the sentence.

Sub-n-gram 1: bonito gran casa situado. Score: 2.5

Sub-n-gram 2: bonito gran casa situada. Score: 3.1

Sub-n-gram n: bonita gran casa situada. Score: 7
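
A toy scoring function built on the three parameters above might look like this. The weights and the two corpus sentences are invented for illustration, and the resulting numbers are not meant to reproduce the scores on the slide.

```python
def score_sub_ngram(sub_ngram, sentence):
    """Score one corpus sentence against a sub-n-gram (invented weights)."""
    positions = sorted(sentence.index(w) for w in sub_ngram if w in sentence)
    found = len(positions)  # number of needed words found in the sentence
    if found == 0:
        return 0.0
    pairs = list(zip(positions, positions[1:]))
    # Distance: other words sitting between consecutive matches.
    gaps = sum(later - earlier - 1 for earlier, later in pairs)
    # Togetherness: matches that directly follow another match.
    adjacent = sum(1 for earlier, later in pairs if later - earlier == 1)
    return found + adjacent - 0.5 * gaps

sub_ngram = ["bonita", "gran", "casa", "situada"]
# Invented corpus sentences, for illustration only.
close_match = "la bonita gran casa situada cerca del mar".split()
loose_match = "la casa es muy bonita y está situada lejos".split()

close_score = score_sub_ngram(sub_ngram, close_match)
loose_score = score_sub_ngram(sub_ngram, loose_match)
```

A sentence containing all the words next to one another scores higher than one where fewer words are found and they are scattered.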

Page 15

8th step. Scoring the n-gram in relation to the best scores obtained by the sub-n-grams.

N-gram: Nice big house located. Score: 30

Sub-n-gram 1: bonito gran casa situado. Score: 2.5

Sub-n-gram 2: bonito gran casa situada. Score: 3.1

Sub-n-gram n: bonita gran casa situada. Score: 7

N-gram: big house located sea. Score: 50
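
The slides do not state how the n-gram scores (30 and 50) are derived from the sub-n-gram scores, so the sketch below does not attempt that. It only shows two pieces that can be read off the example: overlapping n-grams taken as sliding windows over the content words (the window size of 4 is inferred from the two n-grams shown), and the selection of the best-scoring sub-n-gram. The function names are hypothetical.

```python
def sliding_ngrams(words, n):
    """Return every run of n consecutive words."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def best_sub_ngram(scores):
    """Return the (sub-n-gram, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])

content_words = ["nice", "big", "house", "located", "sea"]
windows = sliding_ngrams(content_words, 4)

# Sub-n-gram scores copied from the slide.
sub_scores = {
    "bonito gran casa situado": 2.5,
    "bonito gran casa situada": 3.1,
    "bonita gran casa situada": 7,
}
best = best_sub_ngram(sub_scores)
```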

Page 16

9th step. Combining the n-grams integrated in the sentence / text.

Nice big house located sea.

Parameters for the combination / overlapping

• Scoring the texts that repeat for the n-grams

• Scoring the sentences that repeat for the n-grams
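
One way to picture the combination is shown below, as a sketch under assumptions. `overlap_bonus` rewards corpus texts and sentences shared by the hits of two n-grams, and `merge_overlapping` joins two translated n-grams on their common words. The weights, the hit sets and the second translated n-gram are invented for illustration.

```python
def overlap_bonus(hits_a, hits_b, text_weight=1.0, sentence_weight=2.0):
    """Reward texts and sentences in which both n-grams were found.

    Each hit is a (text_id, sentence_id) pair; the weights are invented.
    """
    shared_sentences = hits_a & hits_b
    shared_texts = {text for text, _ in hits_a} & {text for text, _ in hits_b}
    return text_weight * len(shared_texts) + sentence_weight * len(shared_sentences)

def merge_overlapping(left, right):
    """Join two word lists on the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return left + right

# Invented corpus hits for two n-grams.
bonus = overlap_bonus({(12, 3), (12, 9), (40, 1)}, {(12, 3), (40, 7), (55, 2)})

merged = merge_overlapping(
    ["bonita", "gran", "casa", "situada"],
    ["gran", "casa", "situada", "mar"],  # assumed translation of the second n-gram
)
```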

Page 17

10th step. We add the function words previously removed from the sentence. We search for these words in the best sub-n-grams used.

Content words: nice big house located sea.

Function words: The is near the
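
The bookkeeping side of this step can be sketched as follows, assuming the function words were stored with their original positions (an assumption, as is the function name). The sketch works on the English side only; it does not model the search in the best sub-n-grams that selects the Spanish function words.

```python
def restore_function_words(content, removed):
    """Re-insert (position, word) pairs into a list of content words."""
    restored = list(content)
    # Ascending order keeps each stored position valid as the list grows.
    for position, word in sorted(removed):
        restored.insert(position, word)
    return restored

restored = restore_function_words(
    ["nice", "big", "house", "located", "sea"],
    [(0, "The"), (4, "is"), (6, "near"), (7, "the")],
)
sentence = " ".join(restored)
```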

Page 18

11th step. Obtaining the translated sentence

The nice big house is located near the sea.

La gran y bonita casa está situada cerca del mar

Page 19

In the housing ads (first specialised corpus):

the nice big house is located near the sea.

La gran y bonita casa está situada cerca del mar.

the white house has an old gate that is broken and ugly.

La casa blanca tiene una vieja verja averiada y fea.

Time used by the system: 0.98 seconds / 1.2 seconds

Page 20

In the large corpus (end of May):

the nice big house is located near the sea.

La casa grande está cerca del mar.

the white house has an old gate that is broken and ugly.

La casa blanca tiene una valla vieja que se rompe y es fea

Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds

Page 21

Other problems in the large corpus (June: + news):

The director checked the mail and said he had no new mail

Nuevos directores comprobaban correo y la dijo no hay correo

The salesperson decided to stop doing business with them

El vendedor decidió parar a hacer negocios con ellos

Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds

Page 22

Some linguistic / technical conclusions:

>Data retrieved from the massive corpus: important for obtaining more common phrases / familiar expressions / overlapping connectors

>Data retrieved from the specialised corpus: important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections

<Problems: verb agreement in indirect clauses? / Lower probabilities for open content-word combinations (e.g., new + mail)

<Ever-important need to improve dictionary entries for the general corpus

<Scores according to context: texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)