Computing Science, University of Aberdeen 1
CS4025: Machine Translation
Background, how languages differ
MT Techniques
Controlled languages
For more info: J&M, chap. 21 in 1st ed., chap. 25 in 2nd ed. Also extra notes.
Machine Translation
Automatically translate texts between languages (e.g., English to Japanese)
» Or assist human translators?
One of the oldest dreams of NLP, AI, and CS (first system in 1954).
Varieties of Machine Translation
Translating from a source language to a target language.
(FA)MT – (Fully Automatic) Machine Translation
HAMT – Human-Aided MT (aid before or after)
MAHT – Machine-Aided Human Translation
Brief History of MT
Serious but naïve work in the 1950s
1966 ALPAC report (speed, cost, accuracy) terminated most research funding
“Underground” MT systems developed into products (e.g. SYSTRAN) in the 1970s
More MT products emerged in the 1980s and 1990s, though still relatively simple
MT now in everyday widespread use (e.g. for web pages), in spite of its problems
Translation is Hard: Language differences
Lexical
Meanings assigned to a word
» to know a person
» to know a fact
Boundaries on a scale
» friend vs acquaintance
Preferences
» sibling vs brother vs elder brother
Gaps
» Japanese has no word for privacy
Overlaps between word senses (Eng/Fr)
Syntactic differences
Morphology vs word order
» English: John saw Jane
» Russian: John[+subject] saw Jane[+object]
Which word orders
» English: a cheap car
» French: a car cheap
Argument order (e.g. VSO/SVO/SOV languages)
» English: John likes apples
» Spanish: apples gustar John
Pragmatic differences
Zero pronouns
» Bake [] for 20 minutes
Extra distinctions
» Relative-status markers in Japanese
Cultural knowledge
» mu -> curtains of her bed, not just curtains
Translating from Chinese to English…
dai yu zi zai chuang shang gan nian bao chai you ting jian chuang wai zhu shao xiang ye zhe shang, yu sheng xi li, qing han tou mu, bu jue you di xia lei lai.
Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come
As she lay there alone, Dai-yu’s thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.
Perfect Translation needs World Knowledge
Example: Translating “it” into a language which associates grammatical gender with nouns requires identifying the antecedent:
» A hollow cylinder … rests on a surface … and an object is suspended so that it …
English    German     Gender      Pronoun
Surface    Flaeche    Feminine    sie
Cylinder   Zylinder   Masculine   er
Object     Objekt     Neuter      es
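The table can be read as a two-step lookup: once the antecedent of "it" is known, its German gender fixes the pronoun. A minimal sketch, assuming the antecedent has already been identified (which is the hard, knowledge-dependent part); the names GERMAN_NOUNS, PRONOUN_BY_GENDER and translate_it are hypothetical:

```python
# Toy lexicon taken from the table on this slide: English noun -> (German noun, gender).
GERMAN_NOUNS = {
    "surface": ("Flaeche", "feminine"),
    "cylinder": ("Zylinder", "masculine"),
    "object": ("Objekt", "neuter"),
}

# German third-person singular pronoun for each grammatical gender.
PRONOUN_BY_GENDER = {"feminine": "sie", "masculine": "er", "neuter": "es"}


def translate_it(antecedent):
    """Return the German pronoun for 'it', given its (already resolved) antecedent."""
    _german_noun, gender = GERMAN_NOUNS[antecedent.lower()]
    return PRONOUN_BY_GENDER[gender]


for noun in ("surface", "cylinder", "object"):
    print(noun, "->", translate_it(noun))
```

The lookup itself is trivial; deciding which of the three nouns "it" refers to is what requires world knowledge.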
Approaches to MT
Direct Translation
No intermediate representation.
Possibly morphological analysis and simple reordering principles
Input: [Japanese text]
After word-by-word translation
» I give PAST pen on desk John to
After word-order, det rewrite rules
» I give PAST the pen on the desk to John
After morphology
» I gave the pen on the desk to John
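The last two stages of this example can be sketched as simple token rewrites. This is an illustrative toy, not how any real system is built: it starts from the word-by-word output (the Japanese input is not given on the slide), and the word lists and function names are assumptions chosen to cover this one sentence.

```python
# Hypothetical, hand-picked word lists that cover only the example sentence.
POSTPOSITIONS = {"to"}          # words that follow their noun in the word-by-word output
COMMON_NOUNS = {"pen", "desk"}  # nouns that need a determiner in English
PAST_FORMS = {"give": "gave"}   # irregular past-tense forms


def reorder_and_add_determiners(tokens):
    """Move postpositions in front of their noun and insert 'the' before common nouns."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt in POSTPOSITIONS:
            out.extend([nxt, tok])      # "John to" -> "to John"
            i += 2
        elif tok in COMMON_NOUNS:
            out.extend(["the", tok])    # "pen" -> "the pen"
            i += 1
        else:
            out.append(tok)
            i += 1
    return out


def apply_morphology(tokens):
    """Fold a PAST marker into the preceding verb."""
    out = []
    for tok in tokens:
        if tok == "PAST" and out:
            verb = out.pop()
            out.append(PAST_FORMS.get(verb, verb + "ed"))
        else:
            out.append(tok)
    return out


word_by_word = "I give PAST pen on desk John to".split()
reordered = reorder_and_add_determiners(word_by_word)
print(" ".join(reordered))
print(" ".join(apply_morphology(reordered)))
```

Every rule here is specific to the language pair, which is why a direct system has to be rebuilt for each new pair.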
Direct Translation - Issues
Completely tied to a language pair
» Complete new system for each pair
Problems dealing with ambiguity: Example (Russian-English)
» My trebuem mira
» We require world (direct translation)
» We want peace (correct translation)
Don’t need complex NLP
» used in cheap translators
Useful as a “default translation” if more complex techniques fail
Structural Transfer
Three steps
» parse input text (reusable)
» rewrite parse tree into parse tree of new language (specific to language pair)
– English NP -> Det Adj N becomes
– French NP -> Det N Adj
» generate output text (reusable)
More in next lecture
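The middle step (tree rewriting) can be sketched on a toy parse tree. The tree encoding as (label, children) tuples and the function names are assumptions for illustration; parsing and word translation are left out, so only the reordering rule NP -> Det Adj N => NP -> Det N Adj is shown.

```python
def transfer_np(tree):
    """Recursively rewrite NP -> Det Adj N into NP -> Det N Adj."""
    label, children = tree
    if isinstance(children, str):
        return tree  # leaf node: (part of speech, word)
    children = [transfer_np(child) for child in children]
    child_labels = [child[0] for child in children]
    if label == "NP" and child_labels == ["Det", "Adj", "N"]:
        det, adj, noun = children
        children = [det, noun, adj]
    return (label, children)


def leaves(tree):
    """Read the words off a tree, left to right."""
    _label, children = tree
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]


english_np = ("NP", [("Det", "a"), ("Adj", "cheap"), ("N", "car")])
print(" ".join(leaves(transfer_np(english_np))))  # a car cheap
```

Because the rule is stated over categories rather than words, one rule covers every noun phrase of that shape.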
Structural Transfer - Issues
Most popular approach (?)
» Used in Systran (Altavista translator)
n*(n-1) transfer components needed for translation between n languages
Good for syntax, less good for words, pragmatics
» supplement with other techniques, such as statistical translation of individual words?
Interlingua Approach
Two steps
» full analysis of input text, into a meaning (interlingua)
– e.g., know into KnowFact or KnowPerson
» full generation of output text, from meaning
Can’t be done except in a small domain
Preserving ambiguity
» if target language uses same word for KnowFact or KnowPerson, no need to disambiguate know
Interlingua Approach - Issues
Interlingua must contain all aspects of meaning needed for all the languages (e.g. gender for Spanish cats)
Interlingua must reflect all the different views on how the world is made up (e.g. Japanese “yasai” refers mostly to vegetables, but also mint but not carrots)
For this to work, the domain must be restricted and the languages similar
Translation between n languages only needs n analysis components and n generation components
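The scaling argument can be checked with a little arithmetic: structural transfer needs n*(n-1) transfer components (one per ordered language pair), while an interlingua needs n analysis plus n generation components. A small sketch; the function names are hypothetical:

```python
def transfer_components(n):
    """Transfer modules needed for n languages: one per ordered language pair."""
    return n * (n - 1)


def interlingua_components(n):
    """Interlingua modules needed for n languages: n analysers plus n generators."""
    return 2 * n


for n in (2, 3, 10):
    print(n, transfer_components(n), interlingua_components(n))
```

For 10 languages this gives 90 transfer components against 20 interlingua components; below four languages the interlingua has no advantage in component count.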
Statistical Approach
Noisy channel model for speech rec: look for Sentence that maximises P(Sig|Sent)*P(Sent)
MT: look for translation Sent that maximises P(Input|Sent)*P(Sent)
» faithfulness*fluency??
» P(Sent) - estimated using bigrams/trigrams
» P(Input|Sent) - estimated by analysing a corpus of human-translated texts
– e.g., how often is know translated as savoir (know fact) and how often as connaitre (know person)
– Also model reordering, insertions, deletions
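The argmax over P(Input|Sent)*P(Sent) can be sketched on a toy example. Everything numeric below is invented for illustration (no corpus was analysed), candidates are assumed to be word-for-word aligned with the input, and reordering, insertions and deletions are ignored; the table and function names are hypothetical.

```python
import math

# Made-up word translation probabilities P(english_word | french_word).
TRANSLATION = {
    ("I", "je"): 1.0,
    ("know", "sais"): 0.9,      # savoir: know a fact
    ("know", "connais"): 0.9,   # connaitre: know a person
    ("Marie", "Marie"): 1.0,
}

# Made-up French bigram probabilities P(word | previous word).
BIGRAM = {
    ("<s>", "je"): 0.5,
    ("je", "sais"): 0.1,
    ("je", "connais"): 0.1,
    ("sais", "Marie"): 0.01,
    ("connais", "Marie"): 0.2,
}

FLOOR = 1e-6  # probability assigned to anything missing from the toy tables


def log_faithfulness(source, candidate):
    """log P(Input|Sent) under a one-to-one word alignment."""
    return sum(math.log(TRANSLATION.get((s, t), FLOOR)) for s, t in zip(source, candidate))


def log_fluency(candidate):
    """log P(Sent) under the bigram model."""
    words = ["<s>"] + candidate
    return sum(math.log(BIGRAM.get(pair, FLOOR)) for pair in zip(words, words[1:]))


def best_translation(source, candidates):
    """Pick the candidate maximising P(Input|Sent) * P(Sent), computed in log space."""
    return max(candidates, key=lambda c: log_faithfulness(source, c) + log_fluency(c))


source = ["I", "know", "Marie"]
candidates = [["je", "sais", "Marie"], ["je", "connais", "Marie"]]
print(" ".join(best_translation(source, candidates)))  # je connais Marie
```

With these numbers both candidates are equally faithful, so the language model term decides between savoir and connaitre.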
Statistical Approach - Issues
P(Input|Sent)
» Very hard to model situations where translation reorders material, even if this has a simple syntactic description
» How “faithful” is a proposed output sentence to the original input text?
» Less clear what this means once we go beyond translating individual words
» Combine with direct techniques?
MT Performance
Translating 100 sentences is trivial; the problems are all in the scaling-up.
» Good dictionaries are key.
Three uses
» Fully automatic rough translation
– like Altavista/Systran Babelfish
» Draft translations which a human post-edits (humans can post-edit quickly as long as less than 20% of words need to be changed)
» Tools for translators (MAHT)
Another approach to HAMT: Controlled Languages
A controlled (simplified, basic) English is a subset of full English.
» Limited vocabulary: repair but not fix
» Limited syntax: I ate but not I have eaten
Mainly used for technical documents
Originally intended to make manuals easier for non-native speakers
MT works much better if input is Controlled English
AECMA Simplified English
(Emerging) standard for commercial aerospace industry.
Designed by academic linguists as well as practitioners (technical authors).
AECMA: vocabulary
Fixed vocabulary (2000 words?) with additions limited to specific areas (eg, company names).
Goal is “each word means only one thing”, and “each concept is expressed by only one word”. No ambiguity, no synonyms.
Example words
Above: only use to indicate physical position
» Legal: The wing is above the wheel
» Illegal: The engine temperature is above normal
» Legal: The engine temperature is more than normal
Test: use as noun only
» Legal: the system test
» Illegal: Test the circuit.
» Legal: Do a test on the circuit.
AECMA: Syntax
Rule: Forbid “unusual” English syntax
Ex: only simple past, present, future tenses
» Illegal: Any other information is to be ignored
» Legal: Ignore any other information
Ex: No gerunds
» Illegal: Changing the light is dangerous.
» Legal: It is dangerous to change the light.
AECMA: Syntax Examples (2)
Only two noun-noun modifiers
» Illegal: The aircraft door attachment bolt
» Legal: The attachment bolt of the aircraft door
Verbs and det. must be included
» Illegal: Rotary switch to INPUT
» Legal: Set the rotary switch to INPUT
AECMA: Stylistic Rules
Sentences should be 20 words or less
Paragraphs should be 6 sentences or less.
Start warnings with a command
» Illegal: The oil used in the engine contains toxic additives which may be absorbed through the skin.
» Legal: Do not get the oil on your skin. It is poisonous.
Controlled-Language MT
Much easier
» No problems disambiguating words
» Hard syntax is forbidden
» May also prohibit/restrict pronouns
Authors must write in CE
» CE conformance checkers
Lot of commercial interest
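A CE conformance checker can be sketched for the rules that are purely mechanical: the sentence and paragraph length limits quoted earlier, and a disallowed-word list. This is a rough sketch, not an actual AECMA checker: the sentence splitter is naive, the word list contains only the repair/fix example from these slides, and all names are hypothetical. Syntactic rules (tenses, gerunds, noun-noun modifiers) would need a parser and are not attempted.

```python
import re

MAX_WORDS_PER_SENTENCE = 20
MAX_SENTENCES_PER_PARAGRAPH = 6
# Disallowed word -> approved alternative (illustrative; only the slide's example).
DISALLOWED_WORDS = {"fix": "repair"}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def check_paragraph(paragraph):
    """Return a list of human-readable rule violations (empty if none were found)."""
    problems = []
    sentences = split_sentences(paragraph)
    if len(sentences) > MAX_SENTENCES_PER_PARAGRAPH:
        problems.append(
            f"paragraph has {len(sentences)} sentences (limit {MAX_SENTENCES_PER_PARAGRAPH})"
        )
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        if len(words) > MAX_WORDS_PER_SENTENCE:
            problems.append(
                f"sentence has {len(words)} words (limit {MAX_WORDS_PER_SENTENCE}): {sentence}"
            )
        for word in words:
            alternative = DISALLOWED_WORDS.get(word.lower())
            if alternative:
                problems.append(f"word '{word}' is not allowed; use '{alternative}'")
    return problems


print(check_paragraph("Do not get the oil on your skin. It is poisonous."))
print(check_paragraph("Fix the rotary switch."))
```

An empty list only means that none of these three checks fired, not that the text is valid Controlled English.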