Post on 29-Jul-2015

Alejandro Curado, Martín Garay

University of Extremadura, Spain

Variation-influenced quality for MT: General vs. specialised corpora

Theoretical background / method: Context-based MT

>Data retrieved from a massive corpus: a lot of data to compare n-grams (the more context, the better the correspondences)

>No need to use a parallel corpus (unlike SMT, which relies on aligned / parallel corpora)

>Optimal scores on the BLEU (Bilingual Evaluation Understudy) scale for MT: Carbonell, 2006 = “Meaningful Machines” = 0.66 in a “blind test” out of 0.7 (human), with 53 GB of target text

>General translation (vs. Specialised translation??)
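
For reference, BLEU in its standard formulation is a brevity-penalised geometric mean of modified n-gram precisions. The symbols below follow that general definition and are not taken from the slides:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the modified n-gram precision of the candidate translation, w_n are the n-gram weights (commonly 1/N with N = 4), c is the candidate length and r the reference length. The score lies between 0 and 1, which is the scale on which the 0.66 and 0.7 figures are quoted.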

Resources:

English Dictionary table with 200,000 entries (single and compound words / idioms).

Spanish dictionary table with more than 5,000,000 entries

Large general numerical corpus that may reach up to 100 GB (end of July 2010): Indexed by text, sentence, word

Improve / increase resources:

Dictionary:

Web pages have been developed to:

1. Add all those word units missing (with equivalents)

2. Increase word meanings if not in the dictionaries (wordreference)

Improve / increase resources:

The large corpus.

Indexing :

1. Books on the web.

2. Wikipedia.

3. Sketch Engine-retrieved texts (seed keywords).

Types of Spanish corpora used for the translation tests:

The large corpus (late May 2010)

Nearly 73 million words, 11,490 texts, 3,900,000 sentences (+ 1,256 news texts indexed in June = +4 million words)

Experiment corpus (1) with apartment / housing ads (March 2010)

70 texts, 5,455 sentences, 87,353 words

Experiment corpus (2) with international news (June 2010)

286 texts, 2,791 sentences, 125,936 words

Translation procedure.

1st step. Inserting the sentence or text.

The nice big house is located near the sea.

2nd step. Dividing the text into phrases / sentences.

The segmentation is carried out by using the following punctuation symbols:

. ; : ¿? ¡!

In our case:

The nice big house is located near the sea.
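
The segmentation step can be sketched as follows. This is a minimal illustration, assuming that every symbol in the list acts as a hard segment boundary and that empty segments are discarded; the name `segment` is hypothetical and not part of the system described.

```python
import re

# Delimiters listed on the slide; treating each one as a boundary is an assumption.
DELIMITERS = ".;:¿?¡!"


def segment(text):
    """Split text at any delimiter and drop empty or whitespace-only pieces."""
    pattern = "[" + re.escape(DELIMITERS) + "]"
    pieces = re.split(pattern, text)
    return [piece.strip() for piece in pieces if piece.strip()]


segments = segment("The nice big house is located near the sea.")
```

With the example sentence, the only delimiter is the final full stop, so the whole sentence is returned as a single segment.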

3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.

The nice big house is located near the sea

44634 30497 6962 22817 3456 27139 30255 44634 39064
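
A minimal sketch of this lookup, assuming a plain in-memory mapping from lower-cased words to IDs. The IDs are the ones shown for the example sentence; the function name and the handling of unknown words (returning `None`) are assumptions, and multi-word units and idioms are not handled here.

```python
# IDs copied from the example sentence; a real table would hold the full
# English dictionary (200,000 entries according to the resources slide).
ENGLISH_IDS = {
    "the": 44634, "nice": 30497, "big": 6962, "house": 22817, "is": 3456,
    "located": 27139, "near": 30255, "sea": 39064,
}


def words_to_ids(segment):
    """Map each whitespace-separated word to its dictionary ID (None if missing)."""
    return [ENGLISH_IDS.get(word.lower()) for word in segment.split()]


ids = words_to_ids("The nice big house is located near the sea")
```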

4th step. We remove the function / nexus words (the statistically most frequent words in the language) from the sentence and store them in a separate table.

Final phrase: nice big house located sea.

Removed function words: The is near the
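
The separation of content and function words can be sketched like this. The function-word set below is a hypothetical stand-in containing only the words removed in the example; the system described derives such words from frequency statistics, which is not reproduced here. Recording each removed word's position is also an assumption.

```python
# Hypothetical stop list: just the function words removed in the example.
FUNCTION_WORDS = {"the", "is", "near"}


def split_function_words(tokens):
    """Return (content_words, removed), where removed holds (position, word) pairs."""
    content, removed = [], []
    for position, token in enumerate(tokens):
        if token.lower() in FUNCTION_WORDS:
            removed.append((position, token))
        else:
            content.append(token)
    return content, removed


tokens = "The nice big house is located near the sea".split()
content, removed = split_function_words(tokens)
```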

5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.

Note: in the tests, the lookup is restricted to only two Spanish equivalents per word.
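
A minimal sketch of the equivalent lookup with the two-equivalent cap. The bilingual entries are a hypothetical toy table: the forms bonito / bonita, gran, grande, casa, situado / situada and mar appear in the example translations, while "agradable" is added only to show the cap taking effect.

```python
# Toy bilingual table (assumption); real entries come from the dictionary tables.
EQUIVALENTS = {
    "nice": ["bonito", "bonita", "agradable"],
    "big": ["gran", "grande"],
    "house": ["casa"],
    "located": ["situado", "situada"],
    "sea": ["mar"],
}
MAX_EQUIVALENTS = 2  # restriction applied in the tests


def translation_candidates(content_words, limit=MAX_EQUIVALENTS):
    """Return at most `limit` Spanish equivalents for each content word."""
    return {
        word: EQUIVALENTS.get(word.lower(), [])[:limit]
        for word in content_words
    }


candidates = translation_candidates(["nice", "big", "house", "located", "sea"])
```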

6th step. Each n-gram is divided into subn-grams (different combinations of the correspondences), which are then sent to the corpus.

Nice big house located sea.

1º bonito gran casa situado: 1043795 284672 839170 1098037

2º bonito gran casa situada: 1043795 284672 839170 1098063

…

nº bonita gran casa situada: 1043794 284672 839170 1098063
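
Generating the combinations can be sketched as a Cartesian product over the equivalents of each word. The per-word lists below are assumptions chosen to match the forms shown in the example; the function name is hypothetical.

```python
from itertools import product


def subngrams(equivalents_per_word):
    """Build every combination that takes one equivalent per source word."""
    return [" ".join(combo) for combo in product(*equivalents_per_word)]


combinations = subngrams([
    ["bonito", "bonita"],    # nice
    ["gran"],                # big
    ["casa"],                # house
    ["situado", "situada"],  # located
])
```

With two options for two of the four words, this yields four combinations, starting with "bonito gran casa situado" and ending with "bonita gran casa situada".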

7th step. A score is given to each result obtained; thus, each subn-gram receives a final score and is ranked according to it.

- Parameters that decide the score given to each subn-gram:

Number of needed words found in the sentence.

Distance found between the words.

Number of needed words found together inside the sentence.

SCORE

Subn-gram 1. bonito gran casa situado 2.5

Subn-gram 2. bonito gran casa situada 3.1

Subn-gram n. bonita gran casa situada 7
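
The slides name the three scoring parameters but not how they are combined. The sketch below is therefore only one possible combination, with hypothetical weights: one point per needed word found, one point per adjacent pair found in order, and a penalty for gaps between the found words. It is not the system's actual formula and is not meant to reproduce the scores shown.

```python
def score_subngram(candidate, sentence, gap_penalty=0.5):
    """Score a candidate (list of words) against one corpus sentence (list of words)."""
    positions = [sentence.index(word) for word in candidate if word in sentence]
    if not positions:
        return 0.0
    found = len(positions)  # needed words found in the sentence
    together = sum(         # consecutive candidate words found side by side
        1
        for first, second in zip(candidate, candidate[1:])
        if first in sentence and second in sentence
        and sentence.index(second) - sentence.index(first) == 1
    )
    # Extra distance between the outermost found words beyond a contiguous run.
    gaps = (max(positions) - min(positions)) - (found - 1)
    return found + together - gap_penalty * max(gaps, 0)


candidate = ["bonita", "gran", "casa", "situada"]
close = score_subngram(candidate, "la bonita gran casa situada cerca del mar".split())
far = score_subngram(candidate, "la casa situada en la costa es bonita".split())
```

A sentence that contains all the needed words in a contiguous run scores higher than one where only some of them appear and are spread apart.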

8th step. Scoring the n-gram in relation to the best scores obtained by the subn-grams.

Nice big house located

SCORE

Subn-gram 1. bonito gran casa situado 2.5

Subn-gram 2. bonito gran casa situada 3.1

Subn-gram n. bonita gran casa situada 7

N-gram SCORE

Nice big house located 30

big house located sea. 50
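
Ranking the subn-grams and picking the best one can be sketched as below, using the three subn-gram scores listed. How the n-gram scores (30 and 50) are derived from the best subn-gram scores is not stated in the slides, so this sketch stops at the ranking; the function name is hypothetical.

```python
# Subn-gram scores as listed in the example.
subngram_scores = {
    "bonito gran casa situado": 2.5,
    "bonito gran casa situada": 3.1,
    "bonita gran casa situada": 7.0,
}


def rank_subngrams(scores):
    """Return (subn-gram, score) pairs sorted from best to worst."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_subngrams(subngram_scores)
best_subngram, best_score = ranking[0]
```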

9th step. Combining the n-grams integrated in the sentence / text.

Nice big house located sea.

Parameters for the combination / overlapping

• Scoring the texts that repeat for the n-grams

• Scoring the sentences that repeat for the n-grams
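
The overlapping part of this step can be illustrated by joining two n-grams on their shared words. This sketch assumes a simple longest suffix / prefix match and leaves out the text and sentence repetition scores listed above, which the slides do not specify in detail; the function name is hypothetical.

```python
def merge_overlapping(left, right):
    """Join two word lists on the longest suffix of `left` that is a prefix of `right`."""
    max_overlap = min(len(left), len(right))
    for size in range(max_overlap, 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return left + right  # no overlap: plain concatenation


merged = merge_overlapping(
    "nice big house located".split(),
    "big house located sea".split(),
)
```

Here the two n-grams share "big house located", so the merged sequence is "nice big house located sea".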

10th step. We add the function words previously removed from the sentence. We search for these words in the best subn-grams used.

nice big house located sea.

The is near the

11th step. Obtaining the translated sentence

The nice big house is located near the sea.

La gran y bonita casa está situada cerca del mar

the nice big house is located near the sea.

La gran y bonita casa está situada cerca del mar.

the white house has an old gate that is broken and ugly.

La casa blanca tiene una vieja verja averiada y fea.

Time used by the system: 0.98 seconds / 1.2 seconds

In the housing ads (first specialised corpus):

the nice big house is located near the sea.

La casa grande está cerca del mar.

the white house has an old gate that is broken and ugly.

La casa blanca tiene una valla vieja que se rompe y es fea

Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds

In the large corpus (end of May):

The director checked the mail and said he had no new mail

Nuevos directores comprobaban correo y la dijo no hay correo

The salesperson decided to stop doing business with them

El vendedor decidió parar a hacer negocios con ellos

Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds

Other problems in the large corpus (June: + news):

Some linguistic / technical conclusions:

>Data retrieved from the massive corpus: important for obtaining more common phrases / familiar expressions / overlapping connectors

>Data retrieved from the specialised corpus: important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections

<Problems: verb agreement in indirect clauses? / Lower probabilities for open content combinations (e.g., new + mail)

<Ever-important need to improve dictionary entries for the general corpus

<Scores according to context: texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)
