ITRU Symposium Presentation 2014
Real-time Direct Translation System for Sinhala and Tamil Languages
Authors: Rajpirathap S, Sheeyam S, Umasuthan K, Amalraj Chelvarajah
AITM15 / FedCSIS15
Machine Translation
Automatic translation from one language to another using computing devices and algorithms.
Machine Translation Approaches
Transfer-based approach
Interlingua approach
Direct approach
Example-based approach
Statistical approach
Why Statistical Approach?
Proven to be more efficient
Shorter development time
Many standard algorithms exist
Supporting tools are available
Effective for large text translation
Few linguistic assumptions
Goal
To develop a real-time machine translation system that enables effective communication between Sinhala and Tamil people, reducing the language barrier in Sri Lanka.
Problem & Solution
Problem
People face language barriers when communicating in their native languages, and translation systems are unavailable, especially for Tamil and Sinhala.
No real-time communication system translates automatically for this language pair, especially in the informal domain.
Solution
Build our own instant communication system that enables effective communication between Sinhala and Tamil people.
Objective
Develop a bi-directional translation system for Sinhala and Tamil that can be used for communication.
Scope
Translate Sinhala text to Tamil text and vice versa
Translation output depends on the type of language corpora used to build the system
Accessible to the public
Others' Work
Sinhala WordNet Project
The BEES project
Example-Based Machine Translation for English-Sinhala translation system
TransFire iPhone application
UCSC projects on Statistical Machine Translation
Translators supported by ICTA
Comparison of Existing Machine Translation Approaches
Projects compared: WordNet Project, Example-Based MT, TransFire Application, UCSC SMT Research, Our SMT Research
Features compared: Sinhala to Tamil translation, Tamil to Sinhala translation, chat feature, transliteration, handling of large text translations
[Table cell values not preserved in the transcript]
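Concept of SMT
A Sinhala sentence s can be translated into candidate Tamil sentences t1, t2, t3, t4, ..., tn, with probabilities p1, p2, p3, ..., pn.
Assumption: each candidate ti is a translation of s with probability pi.
Select the Tamil sentence that has the maximum probability.
If p3 > p1, p2, ..., pn, then sentence t3 is the translation of sentence s.
The notation is: $\hat{t} = \arg\max_{t} p(t \mid s)$
Using Bayes' theorem: $p(t \mid s) = \frac{p(s \mid t)\, p(t)}{p(s)}$
As s is fixed for a given input, p(s) does not affect the maximization and can be removed:
$\hat{t} = \arg\max_{t} p(t)\, p(s \mid t)$
Language model: $p(t)$
Translation model: $p(s \mid t)$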
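The selection rule on these slides (choose the Tamil candidate with the highest combined language-model and translation-model score) can be sketched in a few lines. This is only an illustration: the four candidates and all scores below are invented for the example, and `best_translation` is a hypothetical helper, not part of the presented system.

```python
import math

def best_translation(candidates):
    """Return the candidate maximising p(t) * p(s | t), compared in log space."""
    return max(candidates, key=lambda c: math.log(c["lm"]) + math.log(c["tm"]))

# Hypothetical scores for four candidate Tamil sentences of one Sinhala sentence s.
# "lm" stands for p(t), "tm" stands for p(s | t).
candidates = [
    {"name": "t1", "lm": 0.020, "tm": 0.10},
    {"name": "t2", "lm": 0.010, "tm": 0.30},
    {"name": "t3", "lm": 0.030, "tm": 0.25},
    {"name": "t4", "lm": 0.005, "tm": 0.40},
]

best = best_translation(candidates)
print(best["name"])  # t3 has the largest product of the two scores
```

Working in log space avoids numerical underflow when the probabilities of long sentences become very small.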
Components of an SMT System
Parallel corpus (data preparation)
Language model
Translation model
Decoder
Data Preparation
We used over 6,000 phrases from each language to train the system: more than 12,000 sentences and more than 120,000 words in total.
Data Preparation
Discussions of various ministry affairs
Discussions on road development affairs
Discussions on financial developments and issues
Discussions on general public issues
Administrative data
1,500+ parallel texts in informal language
Data Preparation
Data Sets
          Training Set     Tuning Set      Testing Set
          Sin     Tam      Sin    Tam      Sin    Tam
Words     99k     78k      3425   3078     3110   3204
Phrases   5887    5887     200    200      200    200
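Language Model
Standard n-gram language model
A probability value is assigned to every sentence.
A conditional distribution predicts the i-th word in a sequence, given the identities of all previous words.
Consider a sentence s as: $s = \{ w_1, w_2, \ldots, w_n \}$
$p(s) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})$
N-gram approximation: $p(w_i \mid w_1, \ldots, w_{i-1}) \approx p(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$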
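As a rough illustration of the n-gram idea, the sketch below estimates a bigram model (N = 2) by maximum likelihood and scores a sentence as a product of conditional word probabilities. The two-sentence corpus is a toy built from the romanized Tamil words of the word-alignment example and is an assumption for illustration only. The function names are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams, with <s> and </s> as sentence boundaries."""
    history, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        history.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return history, bigrams

def sentence_prob(sentence, history, bigrams):
    """p(s) under the bigram approximation, using unsmoothed relative frequencies."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if bigrams[(prev, word)] == 0:
            return 0.0  # an unseen bigram gets zero probability without smoothing
        prob *= bigrams[(prev, word)] / history[prev]
    return prob

# Toy corpus (assumed for illustration).
corpus = ["naan paadasaalaikku pokiren", "naan pokiren"]
history, bigrams = train_bigram(corpus)
print(sentence_prob("naan pokiren", history, bigrams))
print(sentence_prob("pokiren naan", history, bigrams))
```

The second call returns zero because the word order never occurs in the corpus, which is the problem smoothing is meant to address.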
Smoothing
Without smoothing, only word sequences seen in the corpus are assigned non-zero probabilities, and all unseen word sequences are assigned zero.
Smoothing allocates some probability mass to unseen word sequences by discounting the probabilities of seen word sequences.
Smoothing Algorithms
Add smoothing
Witten-Bell discounting
Natural discounting
Ney's absolute discounting
Kneser-Ney discounting
Good-Turing discounting
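The simplest technique in this list, add smoothing, can be sketched as follows. The example assumes a one-sentence toy corpus and a hypothetical helper `add_k_prob`. It shows only the additive variant, not the discounting methods the system was actually tuned with.

```python
from collections import Counter

def add_k_prob(prev, word, history, bigrams, vocab_size, k=1.0):
    """Add-k estimate: (count(prev, word) + k) / (count(prev) + k * |V|)."""
    return (bigrams[(prev, word)] + k) / (history[prev] + k * vocab_size)

# Toy one-sentence corpus (assumed for illustration).
tokens = "<s> naan paadasaalaikku pokiren </s>".split()
history = Counter(tokens[:-1])
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = set(tokens) - {"<s>"}  # words that can be predicted, including </s>

seen = add_k_prob("naan", "paadasaalaikku", history, bigrams, len(vocab))
unseen = add_k_prob("naan", "pokiren", history, bigrams, len(vocab))
print(seen, unseen)  # the unseen bigram now gets a small non-zero probability
```

The seen bigram gives up part of its probability so that every unseen continuation of "naan" receives a share, and the distribution over the vocabulary still sums to one.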
Translation Model
Used to check whether a target-language sentence is a proper translation of a source-language sentence
P(S | T): probability of the source sentence (s) given the target sentence (t)
Translation Models
IBM Model 1
IBM Model 2
IBM Model 3
IBM Model 4
IBM Model 5
Translation Modeling Process
Calculate lexical translation probabilities
Generate phrase extraction file
Score extracted phrases
Lexical weighting
Word penalty
Phrase penalty
Build re-ordering model
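To make the first step (lexical translation probabilities) more concrete, here is a minimal sketch of EM training for IBM Model 1, the simplest of the IBM models listed. It omits the NULL word and uses a tiny German-English toy corpus that is common in textbook treatments. Both the corpus and the function name `train_ibm1` are assumptions for illustration, not the training pipeline used in the project.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM estimation of t(src_word | tgt_word) for IBM Model 1 (no NULL word)."""
    src_vocab = {w for src, _ in pairs for w in src}
    # Start from a uniform distribution over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (src_word, tgt_word)
        total = defaultdict(float)  # expected count of tgt_word
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    frac = t[(s, w)] / norm  # E-step: soft alignment weight
                    count[(s, w)] += frac
                    total[w] += frac
        # M-step: renormalise expected counts per target word.
        t = defaultdict(float, {(s, w): c / total[w] for (s, w), c in count.items()})
    return t

# Toy parallel corpus (assumed for illustration).
pairs = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
probs = train_ibm1(pairs)
print(round(probs[("haus", "house")], 3), round(probs[("das", "house")], 3))
```

Because "das" co-occurs with "the" in two sentences, EM gradually explains "das" by "the", which leaves "haus" as the most likely counterpart of "house".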
Word Alignment
Sinhala: Mama Paasalata Yanawa
Tamil: Naan Paadasaalaikku Pokiren
Word alignments: 1-1, 2-2, 3-3
Word Alignment Algorithms
Union
Intersection
Grow
Grow-Diagonal
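The two simplest symmetrization strategies, intersection and union, can be illustrated with plain set operations on the alignment points produced in each translation direction. The directional alignments below are hypothetical and `symmetrize` is an illustrative helper. The grow heuristics, which start from the intersection and add neighbouring points from the union, are not implemented here.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional alignments given as sets of (src_idx, tgt_idx) pairs."""
    return {
        "intersection": src_to_tgt & tgt_to_src,  # high precision, fewer points
        "union": src_to_tgt | tgt_to_src,         # high recall, more points
    }

# Hypothetical directional alignments for a three-word sentence pair (1-based indices).
sin_to_tam = {(1, 1), (2, 2), (3, 3)}
tam_to_sin = {(1, 1), (2, 2), (2, 3)}

combined = symmetrize(sin_to_tam, tam_to_sin)
print(sorted(combined["intersection"]))
print(sorted(combined["union"]))
```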
Decoder
Efficient searching
Given the language and translation models, the decoder searches for the most probable target sentence for a given source sentence.
Decoder
Beam search
Minimum Bayes Risk decoding
Lattice MBR
Consensus decoding
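Beam search keeps only the best few partial translations at each step instead of exploring every combination. The sketch below is a heavily simplified, monotone, word-by-word version: it has no phrases and no re-ordering, and every probability in `options` and `lm` is invented for the example. It is meant to show the pruning idea only, not the decoder used in the project.

```python
import heapq
import math

def beam_search(source, options, lm, beam_size=2):
    """Monotone word-by-word beam search.

    options[src_word] is a list of (tgt_word, p(src_word | tgt_word)) pairs.
    lm[(prev, word)] is a bigram probability p(word | prev).
    """
    beam = [(0.0, ["<s>"])]  # (log score, partial translation)
    for src_word in source:
        expanded = []
        for score, hyp in beam:
            for tgt_word, tm_prob in options[src_word]:
                lm_prob = lm.get((hyp[-1], tgt_word), 1e-6)  # floor for unseen bigrams
                new_score = score + math.log(tm_prob) + math.log(lm_prob)
                expanded.append((new_score, hyp + [tgt_word]))
        # Prune: keep only the beam_size best hypotheses.
        beam = heapq.nlargest(beam_size, expanded, key=lambda item: item[0])
    best_score, best_hyp = beam[0]
    return best_hyp[1:], best_score

# Hypothetical model scores built around the slide's example sentence.
options = {
    "mama": [("naan", 0.8), ("pokiren", 0.2)],
    "paasalata": [("paadasaalaikku", 0.7), ("naan", 0.3)],
    "yanawa": [("pokiren", 0.9), ("paadasaalaikku", 0.1)],
}
lm = {
    ("<s>", "naan"): 0.9,
    ("naan", "paadasaalaikku"): 0.6,
    ("paadasaalaikku", "pokiren"): 0.8,
}
translation, score = beam_search(["mama", "paasalata", "yanawa"], options, lm)
print(" ".join(translation))
```

A larger `beam_size` explores more hypotheses at the cost of speed, which is the trade-off a real-time system has to balance.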
Our SMT Project Outline
[Diagram labels: Client 1 (Sinhala), Client 2 (Tamil), translation model in both directions, Sinhala text, Tamil text, training corpus, training, corrected Tamil output, corrected Sinhala output]
Design
Architecture Design
Implementation Design
Evaluation Design
Architecture Design (SMT)
Data preparation
Language modeling
Translation modeling
MERT tuning
Decoding
Evaluation (BLEU & NIST)
[Diagram labels: users, input, application interface, formatted input, output, corrections]
Implementation Design (SMT)
[Diagram labels: Client Application 1 with history file, Client Application 2 with history file, training corpus (Sinhala sentences, Tamil sentences), SMT system (LM, TM, decoder, tuner, evaluator)]
Implementation (SMT)
We have developed a bi-directional translation system that translates between Sinhala and Tamil.
Developed a chat application that supports Sinhala and Tamil translation.
Technologies such as Java, C++ and Perl are used.
Language modeling and translation modeling algorithms are integrated.
The decoder is built by integrating decoding algorithms.
A parallel corpus for the chat domain was created (2,000 parallel lines).
Parallel corpora of Parliament order papers are used to train the LM and TM.
Our contributions as developers
Trainer application interface implementation
Chat application implementation
Parameter optimizations (n-gram order, discounting, word alignment, lexical re-ordering)
Creation of corpus
Decoder implementation
Tokenizer improvements and implementation
Transliteration feature implementation
Evaluation Design/Strategy
Language Model Evaluation Strategy
Order adjustment (2, 3, 4)
Smoothing/discounting: add smoothing, Ney's absolute, Kneser-Ney, natural discounting, Witten-Bell
Interpolation or no interpolation
Translation Model Evaluation Strategy
Resulting language models
Word alignment: Intersect, Grow-Diagonal, Grow-Diag-Final, Union
Re-ordering: MSD-bidirectional, MSD, Monotonicity-bidirectional, Monotonicity
Selected combinations of LMs and TMs
Decoder Evaluation Strategy
Decoding algorithm: beam search, Minimum Bayes Risk, Lattice MBR, Consensus
Limit on distortion (-1, 0, 6, 10, 20)
Best system configurations
Evaluation
Evaluation to find optimal system parameters
User evaluation
Manual evaluation
Corpus evaluation
Evaluation to find optimal system parameters for the translation system
N-gram order (3 parameters)
Smoothing (6 techniques)
Word alignment & re-ordering (16 combinations)
Decoding (4 algorithms)
Language modeling (18 experiments, select the best 2)
Translation modeling (32 experiments, select the best 2)
Decoding (8 experiments, select the best 1)
(~60 experiments per system) * 2 = 120 experiments
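One way to read these experiment counts is as a staged grid search, where each stage is crossed with the two best configurations kept from the previous stage. That decomposition is an assumption (the slides state only the totals), and the option labels below are illustrative, but the arithmetic matches the numbers given.

```python
from itertools import product

# Option labels are illustrative; only the counts come from the slides.
orders = [2, 3, 4]
smoothing = ["add", "witten-bell", "natural", "absolute", "kneser-ney", "good-turing"]
alignments = ["intersect", "grow-diag", "grow-diag-final", "union"]
reordering = ["msd-bidirectional", "msd", "monotonicity-bidirectional", "monotonicity"]
decoders = ["beam", "mbr", "lattice-mbr", "consensus"]
best_kept = range(2)  # assumed: the 2 best results of the previous stage are carried over

lm_runs = list(product(orders, smoothing))                  # 3 * 6 = 18
tm_runs = list(product(alignments, reordering, best_kept))  # 16 * 2 = 32
dec_runs = list(product(decoders, best_kept))               # 4 * 2 = 8

per_system = len(lm_runs) + len(tm_runs) + len(dec_runs)
print(per_system, per_system * 2)  # roughly 60 per direction, about 120 in total
```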
Optimal System Evaluation Scores
Systems   Sinhala - Tamil   Tamil - Sinhala
BLEU      0.5957            0.6693
NIST      4.4182            4.8563
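For readers unfamiliar with BLEU, the sketch below shows the core of the metric for one candidate and one reference: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is a simplified sentence-level version with a hypothetical function name. The scores reported in the table would have come from a standard corpus-level evaluation tool, not from this code.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "naan paadasaalaikku pokiren".split()
print(bleu(reference, reference, max_n=2))               # identical output
print(bleu("naan pokiren".split(), reference, max_n=2))  # no matching bigram
```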
Manual Evaluation (SMT)
Systems            Sinhala - Tamil   Tamil - Sinhala
Source Words       8                 12
Translated Words   6                 10
Missed Words       2                 2
%                  75                83.33
Evaluated the systems with 15 translations each
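The percentage row appears to be the share of source words that were translated. That reading is an assumption, but it reproduces both figures in the table, as this small check shows (the helper name is hypothetical).

```python
def translated_percent(source_words, translated_words):
    """Percentage of source words that received a translation."""
    return round(100.0 * translated_words / source_words, 2)

sin_to_tam = translated_percent(8, 6)
tam_to_sin = translated_percent(12, 10)
print(sin_to_tam, tam_to_sin)
```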
User Evaluation (SMT)
Questionnaires were distributed to evaluate the final system.
Demo videos were prepared for use in the evaluation.
IT and Engineering students were our participants.
User feedback was considered and corrective actions were taken.
Achieved an overall rating of 3.8 out of 5.
Corpus Evaluation (SMT)
Evaluated the systems with different numbers of training phrases
Number of phrases   BLEU score (Sin - Tam)   BLEU score (Tam - Sin)
510                 0.136982                 0.235689
1100                0.269845                 0.296534
1650                0.264782                 0.356484
2125                0.386532                 0.498975
3060                0.402356                 0.543265
4002                0.453256                 0.576325
5020                0.492356                 0.592356
5697                0.523542                 0.623564
6697                0.549303                 0.642535
MERT Tuning
MERT: Minimum Error Rate Training
Can adjust many parameters at once
Addresses the loose relation between maximum-likelihood training and the final translation quality on unseen text
Tuned System Evaluation
Systems   Sinhala - Tamil   Tamil - Sinhala
BLEU      0.5957            0.6693
NIST      4.4182            4.8563
Final Conclusions
Word alignment [best: Grow-Diag, worst: Union]
Re-ordering [best: MSD]
Decoding algorithm [Beam Search, Lattice MBR (TAM to SIN) and Consensus (TAM to SIN)]
Final outputs
Sinhala to Tamil
Final outputs
Tamil to Sinhala
Improvements over past/existing research
Improved BLEU and NIST scores
Automatic training of the system
Highly accurate translations
Everyday conversations used in training to support the chat domain
Improved performance and translation times
Used up-to-date tools and techniques
System supports large data sets
Improved data preparation techniques
Future Work
Cloud hosting
Develop an API so that developers can use this system as a service
Enable public contribution to data preparation
Improve translation quality
Adopt new techniques and algorithms
Extend to a wider range of domains
Use newly available evaluation metrics
Demonstration
Translation Enabled Chat Application
Thank you