
Machine Translation: Breaking the Communication Barrier
By Dr. Gurpreet S. Josan, Punjabi University, Patiala

TRANSCRIPT

  • Slide 1
  • By Dr. Gurpreet S. Josan, Punjabi University, Patiala
  • Slide 2
  • Communication is the activity of conveying meaningful information. Communication requires a sender, a message, and an intended recipient. The communication process is complete once the receiver has understood the sender.
  • Slide 3
  • Nonverbal communication: includes gestures, body language or posture, and facial expressions. Visual communication: includes signs, typography, drawing, colours, etc. Oral communication: spoken verbal communication. Written communication: includes alphabets, symbols, grammar, etc.
  • Slide 4
  • Slide 5
  • Language is a barrier to information dissemination. All the major sources of information and discoveries are in English, so we are unable to reach the masses in rural areas who do not know English. You are a scientist who has just hit upon a revolutionary new idea: how do you find out whether a scientist anywhere in the world has already filed a patent on a similar idea in their native language?
  • Slide 6
  • A translator can be manual or machine. Manual: too slow, limited availability, costly, but accurate. Machine: fast and economical, but not accurate.
  • Slide 7
  • kJfmmfj mmmvvv nnnffn333 Uj iheale eleee mnster vensi credur Baboi oi cestnitze Coovoel2^ ekk; ldsllk lkdf vnnjfj? Fgmflmllk mlfm kfre xnnn!
  • Slide 8
  • Computers lack knowledge! Computers see text in English the same way you saw the previous text. People have no trouble understanding language: they have common-sense knowledge, reasoning capacity and experience. Computers have no common-sense knowledge and no reasoning capacity.
  • Slide 9
  • Which ones are going to be difficult for computers to deal with: grammar or lexicon? Grammar (rules for putting words together into sentences): how many rules are there? 100, 1,000, 10,000, more? Do we have all the rules written down somewhere? Lexicon (dictionary): how many words do we need to know? 1,000, 10,000, 100,000?
  • Slide 10
  • The dog ate my homework. Who did what to whom? 1. Identify the part of speech (POS): dog = noun; ate = verb; homework = noun. English POS tagging: 95% accuracy. Try to tag this text manually: I can, can the can. 2. Identify collocations: mother-in-law, hot dog.
  • Slide 11
  • Seemingly similar sentences may differ radically in meaning: The CEO was fired up about his new role. The CEO was fired from his new role. Seemingly different sentences can have the same meaning: IBM's PC division was acquired by Lenovo. Lenovo bought the PC division of IBM.
  • Slide 12
  • Ambiguity. Structural ambiguity: I saw the man with the telescope. Word-level ambiguity.
  • Slide 13
  • Various meanings of a word in Punjabi.
  • Slide 14
  • If more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence explodes: it is the product of the numbers of possible meanings of the individual words. Consider a sentence in which only {va} and {pa} are ambiguous, each with 4 senses; this brings the number of possible interpretations to 16.
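A minimal sketch of this combinatorics (the sentence length and the sense counts below are illustrative, not taken from the slides):

```python
from math import prod

# Hypothetical sense counts for each token of a sentence; unambiguous words
# contribute a factor of 1, the two ambiguous words contribute 4 senses each.
sense_counts = [1, 4, 1, 4, 1]

interpretations = prod(sense_counts)
print(interpretations)   # 16 = 4 * 4
```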
  • Slide 15
  • Imagine what happens if there are more senses to be taken into account or if the sentence gets longer. Example glosses of the possible senses: (Sukhbir), (has), (twine), (thigh), (crease), (footpath), (sultriness), (destroy), (door-leaf), (silk).
  • Slide 16
  • Anaphora resolution: The dog ate the bird and it died. Gender conversion. Idioms and phrases.
  • Slide 17
  • Named entity recognition: Dr. Plant Singh vs. Dr. Buta Singh. Foreign words vs. spelling variation, etc.
  • Slide 18
  • Rhyming reduplication. Other issues in Indian languages: there is no fixed font, so the same word can be written in several different ways, with different character combinations producing the same visual form.
  • Slide 19
  • Levels of machine translation: the source text is analysed from word structure (morphological analysis) to syntactic structure (syntactic analysis) to semantic structure (semantic analysis), up to an interlingua (semantic composition); the target text is produced by the mirror-image steps (semantic decomposition, semantic generation, syntactic generation, morphological generation). Translation can take place at any level: direct (word structure), syntactic transfer, semantic transfer, or interlingua.
  • Slide 20
  • Pros: fast, simple, inexpensive, robust, no translation rules hidden in the lexicon. Cons: unreliable, not powerful, rule proliferation, requires too much context, major restructuring after lexical substitution.
  • Slide 21
  • Pros: no need to find a language-neutral representation; relatively fast. Cons: large number of transfer rules, difficult to extend; proliferation of language-specific rules in lexicon and syntax.
  • Slide 22
  • Pros: portable; lexical rules and structural transformations are stated more simply on a normalized representation; explanatory adequacy. Cons: difficult to deal with terms; deciding what should be added is difficult; what will the universal knowledge format be, and how do we encode it? Concepts must be decomposed and reassembled.
  • Slide 23
  • Corpus-based approaches. Statistical Machine Translation (SMT): every target-language string T is a possible translation of a source-language string S. Every string is given a number called its probability, and we select the string with the maximum probability: T* = argmax_T [ Pr(T) · Pr(S | T) ], where S is the source-language string and T is the target-language string. These give rise to the language modelling problem, the translation modelling problem, and the search problem.
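A toy sketch of this decision rule; the candidate strings and probabilities below are invented for illustration, whereas a real SMT system searches a huge candidate space using a language model and a translation model estimated from parallel corpora:

```python
import math

# Noisy-channel decision rule: T* = argmax_T Pr(T) * Pr(S | T).
candidates = {
    "T1": {"lm": 0.020, "tm": 0.30},   # Pr(T1), Pr(S | T1)
    "T2": {"lm": 0.050, "tm": 0.10},
    "T3": {"lm": 0.001, "tm": 0.90},
}

def score(p):
    # Log space avoids underflow on long sentences.
    return math.log(p["lm"]) + math.log(p["tm"])

best = max(candidates, key=lambda t: score(candidates[t]))
print(best)   # "T1": 0.020 * 0.30 is the largest product
```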
  • Slide 24
  • Corpus-based approaches. Example-Based Machine Translation (EBMT): translation by analogy. The system is given a set of sentences in the source language and their corresponding translations in the target language, and uses those examples to translate other, similar source-language sentences. Hybrid methods: combinations of rule-based and statistical methods.
  • Slide 25
  • The Punjabi-to-Hindi machine translation system is a direct translation system based on various lexical resources and a rule base. The system is modular, with a clear separation of data from process. The central idea is to select words from the source language and do only the minimal analysis required, such as extracting the root word, the lexical category and contextual information, i.e. the tokens to the left and right of the current token.
  • Slide 26
  • The word sense disambiguation module is called for ambiguous words. Equivalents of the source tokens in the target language are found in the lexicon and substituted to obtain the target-language text. Rules are then applied to the output to make it appropriate for the target language.
  • Slide 27
  • System architecture. Pre-processing: the source text is normalized, tokenized, and passed through named entity recognition and repetitive-construct handling. Translation engine: each token is looked up in the lexicon (root-word and inflectional-form databases); on a hit, ambiguous tokens are resolved using the ambiguous-word, bigram and trigram databases, while misses are transliterated; the result is appended to the output and the next token is retrieved. Post-processing: post-editing rules produce the target text.
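A minimal, runnable sketch of this pipeline; the tiny lexicon entries, the helper stubs and the example tokens are abstract placeholders for the real resources and modules, not the actual implementation:

```python
# Toy stand-ins for the root DB, the inflectional-form DB and the 'amb' flag;
# the tokens w1..w3 are abstract placeholders, not real Punjabi words.
ROOT_DB = {"w1": "h1"}            # Punjabi root -> Hindi equivalent
INFLECTION_DB = {"w2": "h2"}      # inflected form -> Hindi equivalent
AMBIGUOUS = {"w3"}                # words whose lexicon entry is the flag 'amb'

def disambiguate(token, context):
    return "h3"                   # stand-in for the bigram/trigram WSD module

def transliterate(token):
    return token                  # stand-in for the transliteration module

def translate(tokens):
    output = []
    for token in tokens:                          # pre-processing assumed done
        if token in AMBIGUOUS:
            output.append(disambiguate(token, output))
        elif token in ROOT_DB:
            output.append(ROOT_DB[token])
        elif token in INFLECTION_DB:
            output.append(INFLECTION_DB[token])
        else:
            output.append(transliterate(token))   # out-of-vocabulary word
    return " ".join(output)                       # post-editing rules would follow

print(translate(["w1", "w3", "oov"]))             # -> "h1 h3 oov"
```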
  • Slide 28
  • For a given language pair and text type, what kind of system is required is largely an empirical and practical question. General requirements on MT systems, such as modularity, separation of data from processes, reusability of resources and modules, robustness, corpus-based derivation of data and so on, do not provide conclusive arguments for any one of the models. The available resources are one of the key factors in deciding the approach.
  • Slide 29
  • In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing. Keeping in view the similarity of the Punjabi-Hindi language pair, a simpler, direct model is the obvious choice for a Punjabi-to-Hindi machine translation system.
  • Slide 30
  • The lexicon contains information about the primary component of a language, i.e. words. Most NLP applications use dictionaries: morphological analyzers use a lexicon containing morphemes, tagging systems use probability data, parsers use lexical/semantic or co-occurrence information, and MT systems use translation memories and transfer dictionaries.
  • Slide 31
  • The bilingual dictionary prepared by the LTRC department of IIIT Hyderabad in ISCII format, containing about 22,000 entries, was adopted and extended for our system and converted into Unicode format. The entries were extended to about 33,000, covering almost all the root words of the Punjabi language. Root Table fields: PW (Text), gnp (Text), cat (Text), HW (Text).
  • Slide 32
  • Inflectional Form Table: a table of all the inflected forms of Punjabi root words, along with their roots. The corresponding Hindi words are entered manually. It comprises about 65,000 entries. Fields: PW (Text), ROOT (Text), HW (Text), where ROOT is one of the entries from the Root table.
  • Slide 33
  • For all the ambiguous words in the root table as well as the inflectional form table, the entry for the target word contains the symbol 'amb', which triggers the disambiguation process for the given word. A table of ambiguous words is prepared for this purpose; it contains the most frequent meaning followed by all other possible meanings of a given word. Ambiguous Word Table fields: PW (Text), s1 (Text), s2 (Text).
  • Slide 34
  • To help the disambiguation module, bigram and trigram tables are created. They contain the context of ambiguous words along with their meaning in that context and the frequency obtained from a corpus of about 30 lakh (3 million) words. Bigram Table fields: PREV1 (Text), PW (Text), HW (Text), COUNT (Number). Trigram Table fields: PREV2 (Text), PREV1 (Text), PW (Text), HW (Text), COUNT (Number).
  • Slide 35
  • The lexicon also contains a rule base, which holds all the rules for handling the grammatical dissimilarities between the two languages at post-processing. Replacement Table fields: orgtext (Text), reptxt (Text).
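A sketch of how these lexical resources could be laid out as database tables, following the field lists on the preceding slides; SQLite and the table names are assumptions, since the slides do not name a storage engine:

```python
import sqlite3

# Column names follow the slides; the engine and table names are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE root              (PW TEXT, gnp TEXT, cat TEXT, HW TEXT);
CREATE TABLE inflectional_form (PW TEXT, ROOT TEXT, HW TEXT);
CREATE TABLE ambiguous_word    (PW TEXT, s1 TEXT, s2 TEXT);
CREATE TABLE bigram            (PREV1 TEXT, PW TEXT, HW TEXT, COUNT INTEGER);
CREATE TABLE trigram           (PREV2 TEXT, PREV1 TEXT, PW TEXT, HW TEXT, COUNT INTEGER);
CREATE TABLE replacement       (orgtext TEXT, reptxt TEXT);
""")
print([row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")])
```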
  • Slide 36
  • The text should be in a normalized form, i.e. there should be only one way to represent a syllable. Having several identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult. For example, the same Latin character A is rendered as one Gurmukhi glyph under the AnmolLipi font and as a different glyph under the DrChatrikWeb font, which causes a problem while scanning a text.
  • Slide 37
  • So the source text is normalized by converting it into Unicode format. This gives a three-fold advantage: first, it reduces the text-scanning complexity; second, it helps in internationalizing the system; third, it eases the transliteration task.
  • Slide 38
  • Spelling normalization: a word may be present in the database but with different spellings, like {prkhia} [examination]; only one variant may appear in the database and the other may not. The purpose of spelling normalization is to find the missing variant. A Soundex technique is used for spelling normalization.
  • Slide 39
  • In this technique, a unique number is assigned to each character of the alphabet, and similar-sounding letters get the same number. Codes are then generated for each string; all strings with the same code are spelling variations of one and the same string.
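A minimal sketch of this Soundex-style coding; the letter groups below are illustrative Latin stand-ins, not the actual Gurmukhi code table of slide 40, and the point is only that similar-sounding letters share a code while vowels drop out, so spelling variants collapse to the same key:

```python
# Illustrative consonant groups (assumed, not the system's real table).
GROUPS = {"bp": "1", "dt": "2", "kgq": "3", "sz": "4", "vw": "5", "rl": "6", "mn": "7"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}

def soundex_key(word):
    digits = [CODE.get(ch, "") for ch in word.lower()]   # vowels/unknowns drop out
    key = []
    for d in digits:
        if d and (not key or key[-1] != d):              # collapse repeated codes
            key.append(d)
    return "".join(key)

# Two spelling variants of the same word map to the same key, so the variant
# stored in the database can be recovered.
print(soundex_key("parikhia"), soundex_key("parkhia"))   # both "163"
```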
  • Slide 40
  • Letter-to-code table for the Gurmukhi alphabet: each letter is assigned a code (c1 to c41), with similar-sounding letters sharing a code; the halant gets no code.
  • Slide 41
  • With this table, the code for the word comes out to be c31c37sc13sc4, enabling the system to detect the variant present in the database. For example, if the database contains one spelling variant as the Punjabi word, the code c31c37sc13sc4 is stored against it. If a user enters another variant that is not present in the database, its code is generated on the fly and checked in the database; if the code appears there, the corresponding Punjabi word is selected as the spelling variant.
  • Slide 42
  • In order to achieve this, we make use of the information contained in the context, much as humans do, with a standalone word sense disambiguation module that performs its work without any outside help. To start with, all we have is a raw corpus of Punjabi text, so the statistical approach is the obvious choice for us.
  • Slide 43
  • We use the words surrounding the ambiguous word to build a statistical language model. This model is then used to determine the meaning of that ambiguous word in new contexts. The basic idea of statistical methodologies is that, given a sentence with ambiguous words, it is possible to determine the most likely sense for each word. One such statistical model is the n-gram model.
  • Slide 44
  • An n-gram is simply a sequence of n successive words along with its count, i.e. the number of occurrences in the training data. An n-gram of size 2 is a bigram, size 3 is a trigram, and size 4 or more is simply called an n-gram, or an (n-1)-order Markov model. N-grams are used as probability estimators, estimating how likely a word is to follow a certain point in a document. What is the optimum value of n?
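A minimal sketch of collecting bigram and trigram counts from a toy corpus; the sentence is an illustrative English stand-in for the Punjabi training data:

```python
from collections import Counter

tokens = "the can is in the can".split()

bigrams = Counter(zip(tokens, tokens[1:]))               # n = 2
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))  # n = 3

print(bigrams[("the", "can")])         # 2 occurrences
print(trigrams[("in", "the", "can")])  # 1 occurrence
```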
  • Slide 45
  • Consider predicting the same word at the end of three sentences of increasing length. In sentence (1) the prediction can be done with a bigram (2-gram) language model (n = 2), but (2) requires n = 4 and (3) requires n > 9.
  • Slide 46
  • The number of words to be considered, i.e. the value of n, is important. The factors of concern: the larger the value of n, the higher the probability of getting the correct word sense, i.e. for a general domain more training data will always improve the result. On the other hand, most higher-order n-grams do not occur in the training data at all; this is the problem of data sparseness.
  • Slide 47
  • As the training data size increases, the size of the model also increases, which can lead to models that are too large for practical use. The total number of potential n-grams scales exponentially with n, so a large n requires a huge amount of memory and time. Does the model get much better if we use a longer word history? Do we have enough data to estimate the probabilities for the longer history?
  • Slide 48
  • An experiment to find the optimum value of n for Punjabi was performed. Different n-gram models were generated, with n ranging from 1 to 6. It was observed that as the value of n increases, its ability to disambiguate a word decreases, due to the sparseness of data.
  • Slide 49
  • Another interesting observation: instead of building and using a higher-order n-gram model, we can improve the efficiency of the system tremendously by using the lower-order models jointly. We use the trigram model first to disambiguate a word; if it fails, we move to the lower-order model, i.e. the bigram model; if that also fails, we use the unigram model. With this technique only 7.96% of words are incorrectly disambiguated. This approach is adopted for the word sense disambiguation module.
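A minimal sketch of this trigram-to-bigram-to-default back-off; the table contents mirror the bigram/trigram fields (PREV*, PW, HW, COUNT) but the entries are abstract placeholders, not the system's real data:

```python
TRIGRAM = {("w1", "w2", "amb_word"): {"sense_rare": 3}}
BIGRAM = {("w2", "amb_word"): {"sense_rare": 5, "sense_freq": 2}}
DEFAULT = {"amb_word": "sense_freq"}   # most frequent (unigram/default) meaning

def disambiguate(prev2, prev1, word):
    for table, key in ((TRIGRAM, (prev2, prev1, word)), (BIGRAM, (prev1, word))):
        senses = table.get(key)
        if senses:                               # context found: pick the sense
            return max(senses, key=senses.get)   # with the higher count
    return DEFAULT[word]                         # back off to the default meaning

print(disambiguate("w1", "w2", "amb_word"))   # trigram hit -> "sense_rare"
print(disambiguate("x", "y", "amb_word"))     # no context  -> "sense_freq"
```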
  • Slide 50
  • Three models, viz. unigram, bigram and trigram, of the ambiguous words are created to capture the words in the context of any ambiguous word, from a corpus of about 3 million words built from different types of articles such as essays, stories, editorials, news, novels, office letters, court orders, etc. In order to reduce the size of the n-gram tables, we retain only those contexts which lead to the less frequent meanings of ambiguous words.
  • Slide 51
  • The idea is to check the contextual information for the least frequent meaning; if that fails to disambiguate, we use the most frequent meaning by default. For example, {d} in Punjabi can be used as a postposition as well as a verb, but its usage as a verb is much less frequent. So we place in the database all those bigrams and trigrams that lead to the disambiguation of the less frequent meaning, i.e. {d} as a verb.
  • Slide 52
  • The bigram table thus contains all those entries for which the word has its less frequent meaning; all such meanings are entered manually in the trigram and bigram models. If a word cannot be disambiguated by the bigram and trigram, the most frequent meaning is selected by default. There are cases where the previous words in the context lead to one sense but the following words produce the other sense; in such cases, again, the sense with the higher probability is selected.
  • Slide 53
  • Transliteration is a solution for out-of-vocabulary words. It is a process wherein an input string in one alphabet is converted to a string in another alphabet, usually based on the phonetics of the original word. If the target language contains all the phonemes used in the source language, the transliteration is straightforward; e.g. the Hindi transliteration of the Punjabi word for 'room' is essentially pronounced in the same way.
  • Slide 54
  • Missing or extra sounds in the target language are generally mapped to the most phonetically similar letter; e.g. Hindi has letters with a double sound associated with them (a combination of two sounds), whereas Punjabi generally uses a single letter to denote such sounds, as in the word for 'alphabet'. A single foreign word can have many different transliterations; e.g. 'Mehfooz' can be transliterated in several ways.
  • Slide 55
  • Approaches: direct mapping, rule-based, Soundex-based.
  • Slide 56
  • Vowel mapping: Punjabi contains 10 vowel symbols and nine dependent vowel sounds. Hindi has one-to-one representations of all Punjabi vowel symbols and sounds. (The Gurmukhi-to-Devanagari vowel mapping table is shown on the slide.)
  • Slide 57
  • Consonant mapping: the Gurmukhi-to-Devanagari consonant mapping is shown in the table on the slide. No letter in Punjabi exists for some Hindi letters, which means those letters can never be produced by a letter-to-letter approach; the same holds for some double-sound letters.
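A minimal sketch of the direct (letter-to-letter) Gurmukhi-to-Devanagari mapping; only a handful of characters are listed here for illustration, not the full vowel, consonant and sub-join tables on the slides:

```python
# A few Gurmukhi -> Devanagari pairs; unmapped characters pass through unchanged.
G2D = {
    "ਕ": "क", "ਮ": "म", "ਰ": "र", "ਤ": "त", "ਨ": "न", "ਸ": "स",
    "ਾ": "ा", "ਿ": "ि", "ੀ": "ी", "ੁ": "ु", " ": " ",
}

def direct_map(text):
    return "".join(G2D.get(ch, ch) for ch in text)

print(direct_map("ਕਮਰਾ"))   # -> "कमरा" ('room', as in the example of slide 53)
```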
  • Slide 58
  • Sub-join mapping: there are three sub-joins (pairin) in Gurmukhi, namely haahaa, vaavaa and raaraa (examples glossed on the slide as 'lips', 'sequentially' and 'self-respect'). In Punjabi they are represented by the virama (or halant) character before the consonant; a similar virama character is also present in Hindi, indicating that the inherent vowel is omitted (or 'killed'). Pairin haahaa and vaavaa are replaced with full consonants while the preceding consonant is shown in half form, whereas pairin raaraa takes the position below the preceding consonant in Hindi, similar to Punjabi.
  • Slide 59
  • Other symbols: the adhak is used to duplicate the sound of a consonant in Punjabi. No such character is present in Hindi; sound duplication is represented by the half form of the consonant. Punctuation marks and digits are the same in both scripts. A special character called the visarga is present in Hindi but not in Punjabi, so it will never be produced by a letter-to-letter scheme. Besides this, Gurmukhi has two separate nasal characters, bindi and tippi, while Hindi also has two nasal characters, bindi and chandrabindu; both nasal characters of Punjabi are mapped to a single nasal character in Hindi.
  • Slide 60
  • Letter-to-letter mapping produces quite good results, but they can be improved, bringing the output closer to the target language in terms of spelling and choice of letters, by applying a set of rules. Rule-based approach: letters whose mapping is not available in Hindi are replaced by their most phonetically equivalent characters. The adhak character in Punjabi shows stress on the next character; no Hindi letter represents it, so the purpose is served by placing a half form of the stressed character before it. There are exceptions to this rule: for certain following characters, a different half character is placed instead (the specific characters and examples are shown on the slide).
  • Slide 61
  • Rule for tippi: when followed by certain consonants, the tippi is replaced by the corresponding half nasal character. Rule for a character that is transliterated differently when followed or preceded by another: in one case it is omitted from the transliterated text, and in the other it is mapped to a different character. Miscellaneous rules: certain letter combinations at the last position of a word are mapped differently, and some characters at the last or second-last position are replaced by others (the specific characters and examples are shown on the slide).
  • Slide 62
  • The Soundex concept is extended to search for the correct spelling variant of a given transliteration. The transliterations are produced by the methods discussed earlier. We have developed a unigram table from a corpus of about 10 million words, and codes are generated for each word in the unigram table. For the comparison, the letters of a word are converted into a phonetic code.
  • Slide 63
  • This code is then looked up in the unigram table, and the entry with the maximum frequency is selected as the correct variant of the given input. For example, consider the word {arpha} ('draft') written in Punjabi. The string {arpha} is produced by the baseline module and the code 2541483623 is generated for it. This code is looked up in the unigram database, which contains two entries against it, with frequencies 12 and 8; the string with the higher frequency is selected as the correct output for this input.
  • Slide 64
  • Example: the word frweIvr under direct mapping, rule-based enhancement and Soundex-based enhancement (the three outputs are shown on the slide).
  • Slide 65
  • Rules are applied to cover the minor grammatical differences between the languages. The general structure of the rules is context-dependent replacement: a phrase or a word in a given context is replaced by the corresponding phrase or word. The ordering of the rules does not matter.
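A minimal sketch of such context-dependent replacement rules at the post-editing stage; the rules below are illustrative English stand-ins for entries of the replacement table (orgtext -> reptxt), not the real Hindi rules:

```python
import re

RULES = [
    (r"\bgo-ed\b", "went"),        # fix a wrong form produced by plain substitution
    (r"\ba ([aeiou])", r"an \1"),  # a context-dependent replacement
]

def post_edit(text):
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

print(post_edit("he go-ed to a office"))   # -> "he went to an office"
```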
  • Slide 66
  • Example input and output sentences (shown on the slide).
  • Slide 67
  • Working of the target word generator: select the next word and look it up in the root database; if not found there, look in the inflectional DB; if still not found, check the vicinity for direct transliteration and transliterate if needed. If the word is found and is ambiguous, look into the trigram table, then the bigram table, and otherwise use the default meaning. Put the Hindi word into the output and repeat until the sentence is complete. (The flowchart walks through an example sentence, with the selected word highlighted in red.)
  • Slide 68
  • (The same flowchart, applied to the next word of the example.)
  • Slide 69
  • Three trigrams are possible, as shown on the slide. Only the 2nd and 3rd trigrams can resolve the ambiguity; the meaning with the higher count is selected, and in this case both trigrams produce the same result.
  • Slide 70
  • (The same flowchart, applied to the next word of the example.)
  • Slide 71
  • Slide 72
  • Slide 73
  • Slide 74
  • Slide 75
  • The ultimate goal of the MT fraternity is to develop MT systems and integrate them in order to break the language barrier, so that any person can get the required information in his or her own native language and share his or her knowledge across the world without learning other languages.
  • Slide 76
  • Slide 77
  • THANKS!!