Post on 29-Jul-2015
Alejandro Curado, Martín Garay
University of Extremadura, Spain
Variation-influenced quality for MT: General vs. specialised corpora
Theoretical background / method: Context-based MT
>Data retrieved from a massive corpus: a large amount of data for comparing n-grams (the more context, the better the correspondences)
>No need for a parallel corpus (unlike SMT, which translates from aligned / parallel corpora)
>Optimal scores on the BLEU (Bilingual Evaluation Understudy) scale for MT: Carbonell, 2006 (“Meaningful Machines”) = 0.66 in a “blind test”, against 0.7 (human), with 53 GB of target text
>General translation (vs. Specialised translation??)
Resources:
English Dictionary table with 200,000 entries (single and compound words / idioms).
Spanish dictionary table with more than 5,000,000 entries
Large general numerical corpus that may reach up to 100 GB (end of July 2010): Indexed by text, sentence, word
Improve / increase resources:
Dictionary:
Web pages have been developed to:
1. Add all missing word units (with their equivalents)
2. Add word meanings that are not yet in the dictionaries (wordreference)
Improve / increase resources:
The large corpus.
Indexing:
1. Books on the web.
2. Wikipedia.
3. Sketch Engine-retrieved texts (seed keywords).
Types of Spanish corpora used for the translation tests:
The large corpus (late May 2010)
Nearly 73 million words, 11,490 texts, 3,900,000 sentences (+ 1,256 news texts indexed in June = +4 million words)
Experiment corpus (1) with apartment / housing ads (March 2010)
70 texts, 5,455 sentences, 87,353 words
Experiment corpus (2) with international news (June 2010)
286 texts, 2,791 sentences, 125,936 words
Translation procedure.
1st step. Inserting the sentence or text.
The nice big house is located near the sea.
2nd step. Dividing the text into phrases / sentences.
The segmentation is carried out by using the following punctuation symbols:
. ; : ¿? ¡!
In our case:
The nice big house is located near the sea.
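The segmentation step can be sketched in a few lines of Python. This is only an illustration: the function name `split_segments` is hypothetical, and dropping the delimiters from the output is an assumption, since the slides do not say whether the system keeps them.

```python
import re

# Delimiters listed on the slide: . ; : ¿ ? ¡ !
DELIMITERS = ".;:¿?¡!"

def split_segments(text: str) -> list[str]:
    """Split text at the listed punctuation symbols and drop empty pieces.

    Assumption: delimiters are discarded; the real system may keep them.
    """
    pattern = "[" + re.escape(DELIMITERS) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

segments = split_segments("The nice big house is located near the sea.")
print(segments)
```

For the example sentence, the only delimiter is the final full stop, so a single segment comes back.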
3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.
The    nice   big   house  is    located  near   the    sea
44634  30497  6962  22817  3456  27139    30255  44634  39064
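A minimal sketch of this lookup, assuming the dictionary table can be modelled as an in-memory mapping. The numbers are the ones shown on the slide; the names `ENGLISH_IDS` and `to_ids` are hypothetical, and compound words / idioms are not handled here.

```python
# Word numbers copied from the slide's example sentence.
# The real English dictionary table is reported to hold 200,000 entries.
ENGLISH_IDS = {
    "the": 44634, "nice": 30497, "big": 6962, "house": 22817, "is": 3456,
    "located": 27139, "near": 30255, "sea": 39064,
}

def to_ids(sentence: str) -> list[int]:
    """Map each word to its dictionary number.

    Assumption: lookup is case-insensitive and works on single words only.
    """
    return [ENGLISH_IDS[word.lower()] for word in sentence.split()]

ids = to_ids("The nice big house is located near the sea")
print(ids)
```

Both occurrences of "the" map to the same number, as on the slide.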
4th step. We remove the function / nexus words (the statistically most frequent words in the language) from the sentence and store them in a separate table.
Final phrase: nice big house located sea.
Function words stored separately: The, is, near, the
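The split into content words and function words might look like the sketch below. The function-word set contains only the words removed in this one example, and recording each removed word's position is an assumption made here so that the words could be put back later; the slides do not describe the layout of the separate table.

```python
# Assumption: a tiny function-word set taken from this single example.
# The real system uses the statistically most frequent words of the language.
FUNCTION_WORDS = {"the", "is", "near"}

def split_function_words(words):
    """Return (content words, removed function words with their positions)."""
    content, removed = [], []
    for position, word in enumerate(words):
        if word.lower() in FUNCTION_WORDS:
            removed.append((position, word))  # hypothetical storage format
        else:
            content.append(word)
    return content, removed

content, removed = split_function_words(
    "The nice big house is located near the sea".split()
)
print(content)
print(removed)
```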
5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.
Note: in the tests, retrieval was restricted to only two equivalents in Spanish.
6th step. Each n-gram is divided into sub-n-grams (different combinations of the correspondences), which are then sent to the corpus.
Nice big house located sea.
1st  bonito gran casa situado   1043795 284672 839170 1098037
2nd  bonito gran casa situada   1043795 284672 839170 1098063
...
nth  bonita gran casa situada   1043794 284672 839170 1098063
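One way to read "different combinations of the correspondences" is as a Cartesian product over each word's equivalents. The sketch below makes that assumption; the forms and numbers come from the slide, while `EQUIVALENTS` and `sub_ngrams` are hypothetical names.

```python
from itertools import product

# Spanish equivalents (form, dictionary number) as shown on the slide,
# with at most two equivalents per English word, as in the tests.
EQUIVALENTS = {
    "nice": [("bonito", 1043795), ("bonita", 1043794)],
    "big": [("gran", 284672)],
    "house": [("casa", 839170)],
    "located": [("situado", 1098037), ("situada", 1098063)],
}

def sub_ngrams(ngram):
    """Build every combination that picks one equivalent per word."""
    options = [EQUIVALENTS[word] for word in ngram]
    return list(product(*options))

candidates = sub_ngrams(["nice", "big", "house", "located"])
for candidate in candidates:
    forms = " ".join(form for form, _ in candidate)
    numbers = [number for _, number in candidate]
    print(forms, numbers)
```

With two options for "nice" and two for "located", the four-word n-gram yields four candidates, which is why restricting the number of equivalents keeps the search manageable.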
7th step. A score is given to each result obtained; each sub-n-gram thus receives a final score and is ranked according to it.
- Parameters that decide the score given to each sub-n-gram:
Number of needed words found in the sentence.
Distance found between the words.
Number of needed words found together inside the sentence.
Sub-n-gram  Candidate                  Score
1           bonito gran casa situado   2.5
2           bonito gran casa situada   3.1
n           bonita gran casa situada   7
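The slides do not give the scoring formula, so the sketch below only computes the three parameters for one corpus sentence and leaves their weighting open. The corpus sentence is made up for illustration, `match_features` is a hypothetical name, and measuring "distance" as the span between the first and last match is an assumption.

```python
def match_features(needed, sentence):
    """Return (words found, span between first and last match, longest adjacent run).

    needed:   set of word forms in the sub-n-gram
    sentence: list of tokens from one corpus sentence
    """
    positions = [i for i, token in enumerate(sentence) if token in needed]
    found = len({sentence[i] for i in positions})
    span = positions[-1] - positions[0] if positions else 0
    longest = run = 0
    previous = None
    for i in positions:
        # Extend the run when this match directly follows the previous one.
        run = run + 1 if previous is not None and i == previous + 1 else 1
        longest = max(longest, run)
        previous = i
    return found, span, longest

# Invented corpus sentence, used only to exercise the function.
corpus_sentence = "la bonita y gran casa está situada cerca del mar".split()
features = match_features({"bonita", "gran", "casa", "situada"}, corpus_sentence)
print(features)
```

A real scorer would combine these values into one number per sub-n-gram and sort the candidates by it.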
8th step. Scoring the n-gram in relation to the best scores obtained by the sub-n-grams.
Sub-n-gram  Candidate                  Score
1           bonito gran casa situado   2.5
2           bonito gran casa situada   3.1
n           bonita gran casa situada   7
N-gram                   Score
Nice big house located   30
big house located sea.   50
9th step. Combining the n-grams integrated in the sentence / text.
Nice big house located sea.
Parameters for the combination / overlapping
• Scoring the texts that repeat for the n-grams
• Scoring the sentences that repeat for the n-grams
10th step. We add back the function words previously removed from the sentence. We search for these words in the best sub-n-grams used.
Content words: nice big house located sea.
Function words: The is near the
11th step. Obtaining the translated sentence
The nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar
the nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar.
the white house has an old gate that is broken and ugly.
La casa blanca tiene una vieja verja averiada y fea.
Time used by the system: 0.98 seconds / 1.2 seconds
In the housing ads (first specialised corpus):
the nice big house is located near the sea.
La casa grande está cerca del mar.
the white house has an old gate that is broken and ugly.
La casa blanca tiene una valla vieja que se rompe y es fea
Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds
In the large corpus (end of May):
The director checked the mail and said he had no new mail
Nuevos directores comprobaban correo y la dijo no hay correo
The salesperson decided to stop doing business with them
El vendedor decidió parar a hacer negocios con ellos
Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds
Other problems in the large corpus (June: + news):
Some linguistic / technical conclusions:
>Data retrieved from the massive corpus: important for obtaining more common phrases / familiar expressions / overlapping connectors
>Data retrieved from the specialised corpus: important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections
<Problems: verb agreement in indirect clauses? / Lower probabilities for open content combinations (e.g., new + mail)
<Ever-important need to improve dictionary entries for the general corpus
<Scores according to context: texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)