Current Status of Machine Translation
Research in Vietnam
Towards Asian wide multi language machine translation project
2
Content
Status of machine translation research
Dictionaries and corpora
Activities in organization and experts
Others
3
Overview: Main machine translation groups
Group: Experience
G1. National Center for Technology Progress (Dr. LE Khanh Hung): Rule-based approach to English-Vietnamese MT systems. These are the only commercial MT systems in Vietnam (EVTRAN 3.0, VETRAN 3.0).
G2. Univ. of Natural Sciences, VNUHCM (Dr. DINH Dien): Transfer-based MT using BTL (Bitext Transfer Learning) for an English-Vietnamese MT system. Experience in building dictionaries and bilingual corpora.
G3. HCM Univ. of Technology, VNUHCM (Prof. PHAN Thi Tuoi): Active since 1989 with various trials. Statistical approach to Vietnamese-English translation (since 2002); phrase-based approach to English-Vietnamese translation and phrase extraction from the Penn Treebank (since 2003).
G4. JAIST (Mr. LE Anh Cuong): Previously: rule-based approach to an English-Vietnamese MT system. The system is complete but not yet published. Currently: focus on statistical MT, and on improving the rule-based MT system using statistical techniques.
4
G1 (Nacentech): About the group
People: 12 members. Leader: Dr. LE Khanh Hung; 2 Ph.D. candidates in NLP; 3 masters; 6 other engineers and B.A. holders.
Institution: National Center for Technology Progress, MOST, C6 Thanh Xuan Bac, Hanoi, Email: [email protected]
5
G1 (Nacentech): Approach
[Figure: G1 translation pipeline. Input: Source Text; output: Target Language. Processing stages shown: Morphological Analysis, Phrase Analysis, Semantic Linking, Translation, Phrase Synthesis, Morphological Synthesis. Intermediate representations: Lexical Lattice, Semantic Lattice, Dependency Trees. Resources: Lexical Rules, Grammar Rules.]
6
G1 (Nacentech): Current status
MT research group was established in 1989, 1990 starting with an English-to-Vietnamese MT system
- Transfer technology
- Dictionary with 12,000 entries, 500 grammar rules
1997: EVTRAN 1.0
- 2,000 grammar rules, 60,000 entries
1999: EVTRAN 2.0
- 3,000 grammar rules, 250,000 entries
- Commercial software in Vietnam
- Listed in Compendium of Translation Software (EAMT)
2005: EVTRAN 3.0
- Automatic source language identification
- 10,000 grammar rules, 530,000 entries
7
Phrase-sensitive grammar and dependency trees
Γ = (Σ, ℵ, S, E, ℘), where S = {S1, …, Sn} is the set of start symbols; E is the set of semantic lattices defined in (Σ ∪ ℵ)*, each element of E being a list of phrases; and ℘ is the rule set defined in (E × E).
8
G1 (Nacentech): Interlingual MT (plan)
[Figure: language families connected through interlinguas. Austro-Asiatic > Mon-Khmer > Viet-Muong > Vietnamese, Muong. Germanic > West Germanic > English. Japanese.]
9
G2 (UNS-VNUHCM): About the group
People:
- Dr. Dinh Dien, PhD in CS and in Linguistics (leader)
- Dr. Ho Bao Quoc, Prof. Dong Thi Bich Thuy
- 5 MS students in CS + 1 Ph.D. student
Institution: Department of Information Technology, Univ. of Natural Sciences, Vietnam National Univ. in HCMC, 227 Nguyen Van Cu, HCMC. Email: [email protected]
10
G2 (UNS-VNUHCM): Approach
BTL model (Bitext Transfer Learning) for English-Vietnamese MT: starting from the annotated EVC (bitext), a learning algorithm (fTBL) automatically extracts "transfer rules" (lexical and structural); those rules are then applied to tag the target language (the Vietnamese sentence).
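The learning step can be illustrated with a toy transformation-based learner. This is only a sketch under assumptions: fTBL presumably refers to a fast variant of transformation-based learning, while the code below is the plain greedy form on a single tag sequence. The rule template (change a tag depending on the previous tag), the tag names, and the example data are all hypothetical and are not taken from the group's system.

```python
from typing import List, Optional, Tuple

# A rule reads: change from_tag to to_tag when the previous tag is prev_tag.
# This single rule template is an assumption made for the sketch.
Rule = Tuple[str, str, str]


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Return a copy of tags with the rule applied at every matching position."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the input tags, not the partly updated copy.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(tags: List[str], gold: List[str]) -> int:
    """Number of positions where the current tags disagree with the annotation."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(baseline: List[str], gold: List[str], max_rules: int = 10) -> List[Rule]:
    """Greedily learn rules that turn the baseline tagging into the annotated one."""
    current = list(baseline)
    learned: List[Rule] = []
    for _ in range(max_rules):
        # Propose one candidate rule per remaining error.
        candidates = {
            (current[i], gold[i], current[i - 1])
            for i in range(1, len(current))
            if current[i] != gold[i]
        }
        best_rule: Optional[Rule] = None
        best_gain = 0
        for rule in sorted(candidates):
            # Net gain = errors fixed minus correct tags broken.
            gain = count_errors(current, gold) - count_errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no candidate reduces the error count any further
        learned.append(best_rule)
        current = apply_rule(current, best_rule)
    return learned


# Hypothetical baseline tagging and annotated (gold) tagging of one sentence.
BASELINE = ["PRON", "NOUN", "DET", "NOUN", "PRON", "NOUN"]
GOLD = ["PRON", "VERB", "DET", "NOUN", "PRON", "VERB"]
RULES = learn_rules(BASELINE, GOLD)
```

In this toy case a single learned rule repairs both baseline errors. In a bitext setting the same loop would run over the annotated parallel corpus, with richer rule templates than the one assumed here.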
11
Group 2: BTL-based MT
[Figure: architecture of the BTL-based system. Labels shown: EVT; English; Morphology, Grammar, Semantics; Linguistic Annotating; Annotated parallel corpus; Unannotated parallel corpus; Word Align; Baseline Tagging; KFTBL; Transformation Rules; Transfer; Generation; Post-editing; VNese.]
12
Group 2: Current status
The group has developed two machine translation systems: EVT 1.0 and VCLEVT 2.0
EVT 1.0
- A rule-based MT system
- Evaluated by PC World Vietnam Magazine in 1998: 65% for simple sentences; 50% for normal sentences; and 35% for complex sentences.
VCLEVT 2.0
- Uses the BTL model
- Learns automatically from a bilingual corpus
- Gains better translation quality on informatics documents
13
G3 (HCMUT): About the group
People: Prof. PHAN Thi Tuoi (leader), Prof. CAO Hoang Tru; 5 Ph.D. candidates and master students
Institution: Ho Chi Minh City University of Technology, VNUHCM, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam. Email: [email protected]
14
G3 (HCMUT): Research activities
Syntax-based English-Vietnamese translation for simple sentences (1989)
Vietnamese word segmentation using corpus and statistical models (2002)
Vietnamese POS tagging by context and style
Text alignment and statistical models (2004)
Statistical model for Vietnamese-English MT (since 2002)
English-Vietnamese MT based on lexicon and phrase extraction from treebank (since 2003)
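As an illustration of the word segmentation item above: written Vietnamese separates syllables with spaces, so a segmenter has to decide which adjacent syllables form one word. The sketch below uses a standard unigram model with dynamic programming, which is a swapped-in textbook technique and not necessarily the group's method. The counts, the smoothing, and the three-syllable limit are invented for the example.

```python
import math
from typing import Dict, List

# Hypothetical word counts, as if collected from a segmented corpus.
COUNTS: Dict[str, int] = {
    "máy": 5, "tính": 5, "máy tính": 20,
    "hiển": 1, "thị": 1, "hiển thị": 15,
    "đường": 10,
}
TOTAL = sum(COUNTS.values())
MAX_SYLLABLES = 3  # assumed upper bound on word length, in syllables


def log_prob(word: str) -> float:
    """Add-one smoothed unigram log-probability, so unseen syllables stay possible."""
    return math.log((COUNTS.get(word, 0) + 1) / (TOTAL + len(COUNTS) + 1))


def segment(sentence: str) -> List[str]:
    """Split a space-separated syllable string into the most probable word sequence."""
    syllables = sentence.split()
    n = len(syllables)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for the first i syllables
    back = [0] * (n + 1)            # back[i]: start index of the last word in that path
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_SYLLABLES), end):
            word = " ".join(syllables[start:end])
            # Multi-syllable candidates must be known words; single syllables always pass.
            if end - start > 1 and word not in COUNTS:
                continue
            score = best[start] + log_prob(word)
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the sentence.
    words: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(" ".join(syllables[start:end]))
        end = start
    return words[::-1]


SEGMENTED = segment("máy tính hiển thị đường")
```

With these invented counts the two-syllable words "máy tính" and "hiển thị" are kept together because each scores higher as one word than as two separate syllables.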
15
G4 (JAIST): About the group
People: Prof. Ho Tu Bao, Le Anh Cuong (leader), Nguyen Phuong Thai, Dr. Nguyen Le Minh, Phan Xuan Hieu, Nguyen Van Vinh
Institution: JAIST (Japan Advanced Institute of Science and Technology), 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan. Email: [email protected], [email protected]
16
G4 (JAIST): Status
History:
1999-2003: Developed an English-Vietnamese MT system at an information company in Vietnam. The system is based on the transfer approach.
2004-present: Research on modern technologies in MT: Example-Based, SMT, Phrase-Based SMT
– Le Anh Cuong: word sense disambiguation
– Nguyen Phuong Thai: syntactic parsing
– Nguyen Le Minh: example-based approach
– Phan Xuan Hieu: part-of-speech tagging, chunking
– Nguyen Van Vinh: dictionary
17
G4 (JAIST): Current Translation System
[Figure: architecture of the current translation system. Modules shown: Format Processing, Tokenizer, Morphological Analyzer, POS Tagger, Named Entity Recognizer, Parser, Word Sense Disambiguation, Transfer and Synthesis, Format Processing. Resources: Grammar Rules; Dictionaries (Common Dictionary, User Dictionary, Domain Dictionary).]
18
G4 (JAIST): MT improvement direction
- Develop a new MT system that combines the advantages of rule-based, example-based, and statistical machine translation
- Apply advances in English processing to improve the current MT system
- Build powerful and intuitive tools that support users in modifying and editing the dictionary
19
Content
Status of machine translation research
Dictionaries and corpora
Activities in organization and experts
Others
20
Overview
Dictionaries and corpora have been developed by each group according to its needs and abilities.
Dictionary
- E-V dictionaries are well done, V-E dictionaries are in debate
- Model for Japanese EDR-based dictionary
- No J-V, V-J dictionaries on computer
Corpora
- Some work in the past
- New plan for corpora
21
Japanese EDR-based Dictionary (JAIST)
Model for such a dictionary (in NLP project 2001-2003)
Can benefit from Japanese EDR
- English word dictionary
- Concept dictionary with concept primary illustration and concept explication in Vietnamese
- English co-occurrence dictionary
- EDR Corpus (English Corpus)
Components to be newly done
- Vietnamese word dictionary
- English co-occurrence dictionary
- Bilingual dictionary English-Vietnamese, Vietnamese-English
- EDR Corpus (Vietnamese Corpus)
22
A Vietnamese MRD (UNS-VNUHCM)
Word Morp POS GRM SEM English Freq Field
máy tính C Ns cnt ART computer 2.221 cpt
hiển thị C Vt Vcom display 1.956 cpt
đường W Ns cnt LIN line 2.087
đường W Nm uncnt CHM sugar 1.987
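A short sketch of how entries like these could be stored and queried. The field names follow the table header. Treating Freq as a number where higher means more frequent is an assumption, as are the class and function names. Only the three rows whose columns are unambiguous in the table are used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MrdEntry:
    """One dictionary row; attribute names mirror the table header."""
    word: str
    morph: str
    pos: str
    grm: str
    sem: str
    english: str
    freq: float      # assumed: higher value means more frequent
    field: str = ""  # empty when the table gives no field


ENTRIES: List[MrdEntry] = [
    MrdEntry("máy tính", "C", "Ns", "cnt", "ART", "computer", 2.221, "cpt"),
    MrdEntry("đường", "W", "Ns", "cnt", "LIN", "line", 2.087),
    MrdEntry("đường", "W", "Nm", "uncnt", "CHM", "sugar", 1.987),
]


def build_index(entries: List[MrdEntry]) -> Dict[str, List[MrdEntry]]:
    """Group entries by headword so that homographs stay together."""
    index: Dict[str, List[MrdEntry]] = {}
    for entry in entries:
        index.setdefault(entry.word, []).append(entry)
    return index


def lookup(index: Dict[str, List[MrdEntry]], word: str, sem: Optional[str] = None) -> Optional[str]:
    """Return an English gloss: filter by semantic tag if given, else take the highest Freq."""
    candidates = index.get(word, [])
    if sem is not None:
        candidates = [e for e in candidates if e.sem == sem]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.freq).english


INDEX = build_index(ENTRIES)
```

The two rows for "đường" show why the semantic tag matters: without it the lookup falls back to the entry with the higher Freq value, and with `sem="CHM"` it selects "sugar".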
23
Dictionary for machine translation (Example from JAIST group)
<word> take
  <grammar>verb_i
    <semantic-category> [none] </semantic-category>
    <translation-default> có hiệu lực </translation-default>
    <translation>
      cắn câu
      <constraint> subj: {fish} </constraint>
    </translation>
    <translation-pattern>
      $VP[inf]:=take off for $Obj
      <translation-default> "vội vàng đến" $Obj </translation-default>
      <translation>
        "cất cánh đi" $Obj
        <constraint> subj: {plane, aeroplane, airplane, aircraft} </constraint>
      </translation>
    </translation-pattern>
  </grammar>
  . . .
</word>
Entries: 95,000 words, 15,000 phrases, 18,000 translation patterns (lexical rules)
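The entry for "take" suggests a simple selection scheme: use a constrained translation when the subject satisfies its constraint, otherwise fall back to the default. A minimal sketch of that reading follows. The class names, the `choose` function, and the idea of matching on a single subject head word are assumptions made for illustration, not the JAIST system's actual lookup.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ConstrainedTranslation:
    text: str
    subj: FrozenSet[str]  # subject head words that license this translation


@dataclass(frozen=True)
class LexicalRule:
    default: str
    constrained: Tuple[ConstrainedTranslation, ...] = ()


# Values copied from the dictionary example; the Python structure is hypothetical.
TAKE = LexicalRule(
    default="có hiệu lực",
    constrained=(ConstrainedTranslation("cắn câu", frozenset({"fish"})),),
)
TAKE_OFF_FOR = LexicalRule(
    default="vội vàng đến",
    constrained=(
        ConstrainedTranslation(
            "cất cánh đi",
            frozenset({"plane", "aeroplane", "airplane", "aircraft"}),
        ),
    ),
)


def choose(rule: LexicalRule, subject: Optional[str] = None) -> str:
    """Pick the first translation whose subject constraint holds, else the default."""
    for option in rule.constrained:
        if subject in option.subj:
            return option.text
    return rule.default
```

For example, `choose(TAKE_OFF_FOR, "plane")` selects "cất cánh đi", while any other subject falls back to "vội vàng đến". In the pattern entry the chosen text would still be followed by the translated $Obj.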
24
JAIST’s group MT system on the Web
25
Some corpora
Monolingual corpora: VLC (Vietnam Lexicography Centre), UNS-VNUHCM, etc. for Vietnamese
Bilingual corpora: The EVC corpus (UNS-VNUHCM) consists of 400,000 pairs of E-V sentences (approx. 5,500,000 words) in the fields of Science and Technology (Computer, Electronics, etc.). The EVC has been partially annotated, semi-automatically, with morphology (word boundaries, lemmatization), POS, and sense tags.
26
VLC: Development of supporting tools
[Figure: Vietnamese Corpus Tools. Components shown: Raw text, Sentence determination, Spelling check, Word segmentation, POS tagging, Chunking & Parsing, Vietnamese corpus (Treebank), Dictionaries (Monolingual dictionary, Bilingual dictionary).]
27
VLC: Capacity and realization
Available tools
- Word segmentation, POS tagging, Deep Parsing in TAG formalism, Syllable list and morpho-syntactic lexicon, Editor for segmentation and tagging revision
- Some utilities for corpus exploration
Ongoing work
- Improvement of available tools
- Improvement of the tagset for POS tagging
- Building syntactic lexicon based on morpho-syntactic lexicon.
- Collection of a balanced corpus following the above criteria.
Human resources
- About 10 persons working on the project.
28
Content
Status of machine translation research
Dictionary and corpora
Activities in organization and experts
Others
29
VLSP national project 2006-2010
National project with the participation of more than ten research groups (all active groups on VLSP)
Leaders: Prof. Ho Tu Bao (JAIST & IOIT) and Assoc. Prof. Luong Chi Mai (IOIT, VAST)
Objectives:
1. Build and develop several typical products for VLSP for public end-users.
2. Build and develop indispensable resources and tools for the VLSP development
30
Content: Basic research (1/3)
Basic research on methods for processing Vietnamese language and speech.
Applied research to adapt methods and technologies for processing other languages or advanced techniques to Vietnamese language and speech.
Computation methods for VLSP
Typical products for the end-users
Resources and tools for VLSP
31
Content: Products for end-user (2/3)
P1: VnVoice system for VN synthesis
P2: Embedded speech synthesis and recognition system
P3: Large lexicon based speech recognizer
P4: Domain-specific English-Vietnamese translation system
P5: IREST system for information retrieval, extraction, summarization, and translation
P6: Vietnamese spelling checker
32
Content: Resources and tools (3/3)
P7: Basic resources for speech
- Corpus for speech synthesis and recognition
P8: Three basic resources for language
- P81: Vietnamese MRD
- P82: Annotated corpora (mono, multi)
- P83: Entities (KB) (Rules of VN grammar)
P9: Five basic tools for language
- P91: Spelling checker
- P92: Vietnamese word segmentation
- P93: Vietnamese POS tagger
- P94: Vietnamese chunker
- P95: Vietnamese syntax analyzer
33
Recent events
- VLSP workshop, 29 March 2005, Hanoi
- VLSP workshop, 21 May 2005, Hanoi
- VLSP workshop, July 2005
- VLSP meeting, 21-25 Nov. 2005, JAIST
34
Content
Status of machine translation research
Dictionary and corpora
Activities in organization and experts
Others
35
Current and future demands
- Current need for development: tourism, economy, communication, etc.
- Demand is increasing for both human translation and automatic translation, especially translation on the Internet.
- Lack of translation experts, especially for foreign languages other than English that are important for Vietnam, such as Japanese and Chinese.
- Demand for translation will increase in the future because of growing world integration.