current status of machine translation research in vietnambao/talks/machinetranslationinvn.pdf ·...

Current Status of Machine Translation Research in Vietnam

Towards Asian wide multi language machine translation project


Content

Status of machine translation research

Dictionaries and corpora

Activities in organization and experts

Others

Overview: Main machine translation groups

Group: Experience

G1. National Center for Technology Progress (Dr. LE Khanh Hung): Rule-based approach to English-Vietnamese MT systems. These are the only commercial MT systems in Vietnam (EVTRAN 3.0, VETRAN 3.0).

G2. Univ. of Natural Sciences, VNUHCM (Dr. DINH Dien): Transfer-based MT using BTL (Bitext Transfer Learning) for an English-Vietnamese MT system. Experience in building dictionaries and bilingual corpora.

G3. HCM Univ. of Technology, VNUHCM (Prof. PHAN Thi Tuoi): Active since 1989, with various trials. Statistical approach to Vietnamese-English translation (since 2002); phrase-based approach to English-Vietnamese translation and phrase extraction from the Penn Treebank (since 2003).

G4. JAIST (Mr. LE Anh Cuong): Previously: rule-based approach to an English-Vietnamese MT system; the system is complete but not yet published. Currently: focus on statistical MT and on improving the rule-based MT system with statistical techniques.

G1 (Nacentech): About the group

People: 12 members; leader: Dr. LE Khanh Hung; 2 Ph.D. candidates in NLP; 3 masters; 6 other engineers and B.A. holders

Institution: National Center for Technology Progress, MOST, C6 Thanh Xuan Bac, Hanoi, Email: [email protected]

G1 (Nacentech): Approach

[Figure: translation pipeline from Source Text to Target Language. Processing steps: Morphological Analysis, Phrase Analysis, Semantic Linking, Translation, Phrase Synthesis, Morphological Synthesis. Intermediate structures: Lexical Lattice, Dependency Trees, Semantic Lattice. Knowledge sources: Lexical Rules, Grammar Rules.]

G1 (Nacentech): Current status

The MT research group was established in 1989.

1990: Started with an English-to-Vietnamese MT system; transfer technology; dictionary with 12,000 entries, 500 grammar rules

1997: EVTRAN 1.0; 2,000 grammar rules, 60,000 entries

1999: EVTRAN 2.0; 3,000 grammar rules, 250,000 entries; commercial software in Vietnam; listed in the Compendium of Translation Software (EAMT)

2005: EVTRAN 3.0; automatic source language identification; 10,000 grammar rules, 530,000 entries

Phrase-sensitive grammar and dependency trees

Γ = (Σ, ℵ, S, E, ℘), where S = {S1, …, Sn} is the set of start symbols; E is the set of semantic lattices defined over (Σ ∪ ℵ)*, each element of E being a list of phrases; and ℘ is the rule set defined over E × E.
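
To make the shape of this tuple concrete, here is a minimal Python sketch of it as a data structure. It is only an illustration, not the group's implementation: the slide does not define Σ and ℵ, so reading them as terminal and non-terminal symbols is an assumption, and the names `Grammar`, `in_E`, `rules_well_typed`, and the toy symbols are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Phrase = Tuple[str, ...]         # a phrase: a sequence of symbols from Σ ∪ ℵ
Lattice = Tuple[Phrase, ...]     # an element of E: a list of phrases
Rule = Tuple[Lattice, Lattice]   # an element of ℘, a pair drawn from E × E


@dataclass(frozen=True)
class Grammar:
    sigma: FrozenSet[str]    # Σ (assumed here: terminal symbols)
    aleph: FrozenSet[str]    # ℵ (assumed here: non-terminal symbols)
    starts: FrozenSet[str]   # S = {S1, ..., Sn}
    rules: Tuple[Rule, ...]  # ℘

    def in_E(self, lattice: Lattice) -> bool:
        """True if every phrase in the lattice is a string over Σ ∪ ℵ."""
        alphabet = self.sigma | self.aleph
        return all(symbol in alphabet for phrase in lattice for symbol in phrase)

    def rules_well_typed(self) -> bool:
        """True if both sides of every rule are elements of E."""
        return all(self.in_E(lhs) and self.in_E(rhs) for lhs, rhs in self.rules)


# Toy rule (hypothetical): the one-phrase lattice [NP] rewrites to [DET N].
lhs: Lattice = (("NP",),)
rhs: Lattice = (("DET", "N"),)
toy = Grammar(
    sigma=frozenset({"the", "cat"}),
    aleph=frozenset({"NP", "DET", "N"}),
    starts=frozenset({"NP"}),
    rules=((lhs, rhs),),
)
print(toy.rules_well_typed())
```

Representing lattices as tuples of tuples keeps them hashable, so rules could later be indexed by their left-hand side.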


G1 (Nacentech): Interlingual MT (plan)

[Figure: languages linked through interlinguas. Labels: Austro-Asiatic, Mon-Khmer, Viet-Muong, Vietnamese, Muong; Germanic, West Germanic, English; Japanese; Interlinguas.]

G2 (UNS-VNUHCM): About the group

People: Dr. Dinh Dien, Ph.D. in CS and in Linguistics (leader); Dr. Ho Bao Quoc; Prof. Dong Thi Bich Thuy; 5 M.S. students in CS and 1 Ph.D. student

Institution: Department of Information Technology, Univ. of Natural Sciences, Vietnam National Univ. in HCMC, 227 Nguyen Van Cu, HCMC. Email: [email protected]

G2 (UNS-VNUHCM): Approach

BTL (Bitext Transfer Learning) model for English-Vietnamese MT: starting from the annotated EVC (bitext), "transfer rules" (lexical and structural) are extracted automatically by a learning algorithm (fTBL), and those rules are then applied to tag the target language (the Vietnamese sentence).
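
The learn-then-apply loop described above can be illustrated with a generic transformation-based learning (TBL) sketch in Python. This is not the group's fTBL implementation: the single rule template (retag `old` as `new` when the previous tag is `prev`), the toy tags, and the function names are all assumptions chosen for brevity. Starting from a baseline tagging, the learner greedily picks the rule that most reduces errors against the annotated data, applies it, and repeats until no rule helps.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply (prev, old, new): retag `old` as `new` when the previous tag is `prev`.

    Conditions are checked against the tags as they were before this rule fired.
    """
    prev, old, new = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == old and tags[i - 1] == prev:
            out[i] = new
    return out


def errors(tags, gold):
    """Number of positions where the current tagging disagrees with the annotation."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_tbl(baseline, gold, max_rules=10):
    """Greedy TBL: repeatedly keep the single rule that removes the most errors."""
    tagset = sorted(set(baseline) | set(gold))
    current, learned = list(baseline), []
    for _ in range(max_rules):
        best_rule, best_err = None, errors(current, gold)
        for rule in product(tagset, repeat=3):
            if rule[1] == rule[2]:
                continue  # a rule that rewrites a tag to itself does nothing
            err = errors(apply_rule(current, rule), gold)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:
            break  # no candidate rule improves the tagging any further
        learned.append(best_rule)
        current = apply_rule(current, best_rule)
    return learned, current


# Toy data (hypothetical): the baseline mislabels two nouns as verbs.
gold = ["DET", "N", "V", "DET", "N"]
baseline = ["DET", "V", "V", "DET", "V"]
rules, tagged = learn_tbl(baseline, gold)
print(rules, tagged)
```

The learned rules form an ordered list, so applying them to new text means running `apply_rule` once per rule, in the order they were learned.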


Group 2: BTL-based MT

[Figure: BTL-based MT (EVT). Labels: English; Morphology; Grammar; Semantics; Transfer; Generation; VNese; Post-editing; Linguistic Annotating; Annotated parallel corpus; Unannotated parallel corpus; Word Align; Baseline Tagging; KFTBL; Transformation Rules.]

Group 2: Current status

The group has developed two machine translation systems: EVT 1.0 and VCLEVT 2.0.

EVT 1.0: A rule-based MT system. Evaluated by PC World Vietnam magazine in 1998: 65% for simple sentences, 50% for normal sentences, and 35% for complex sentences.

VCLEVT 2.0: Uses the BTL model; learns automatically from a bilingual corpus; achieves better translation quality on informatics documents.

G3 (HCMUT): About the group

People: Prof. PHAN Thi Tuoi (leader), Prof. CAO Hoang Tru; 5 Ph.D. candidates and master's students

Institution: Ho Chi Minh City University of Technology, VNUHCM, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam. Email: [email protected]

G3 (HCMUT): Research activities

Syntax-based English-Vietnamese translation for simple sentences (1989)

Vietnamese word segmentation using corpus and statistical models (2002)

Vietnamese POS tagging by context and style

Text alignment and statistical models (2004)

Statistical model for Vietnamese-English MT (since 2002)

English-Vietnamese MT based on lexicon and phrase extraction from treebank (since 2003)


G4 (JAIST): About the group

People: Prof. Ho Tu Bao, Le Anh Cuong (leader), Nguyen Phuong Thai, Dr. Nguyen Le Minh, Phan Xuan Hieu, Nguyen Van Vinh

Institution: JAIST (Japan Advanced Institute of Science and Technology), 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan. Email: [email protected], [email protected]

G4 (JAIST): Status

History

1999-2003: Developed an English-Vietnamese MT system at an information company in Vietnam. The system was based on the transfer approach.

2004-present: Research on modern MT technologies: example-based MT, SMT, phrase-based SMT

- Le Anh Cuong: word sense disambiguation
- Nguyen Phuong Thai: syntactic parsing
- Nguyen Le Minh: example-based approach
- Phan Xuan Hieu: part-of-speech tagging, chunking
- Nguyen Van Vinh: dictionary

G4 (JAIST): Current Translation System

[Figure: components of the current translation system. Processing modules: Format Processing, Tokenizer, Morphological Analyzer, POS Tagger, Named Entity Recognizer, Parser, Word Sense Disambiguation, Transfer and Synthesis, Format Processing. Resources: Grammar Rules; Dictionaries (Common Dictionary, User Dictionary, Domain Dictionary).]

G4 (JAIST): MT improvement direction

Develop a new MT system that combines the advantages of rule-based, example-based, and statistical machine translation

Apply advances in English processing to improve the current MT system

Build powerful and intuitive tools that help users modify and edit the dictionary

Content


Dictionaries and corpora



Overview

Dictionaries and corpora have been developed by each group according to its own needs and capabilities.

Dictionaries
- E-V dictionaries are well developed; V-E dictionaries are still under debate
- Model for a Japanese EDR-based dictionary
- No J-V or V-J dictionaries available in machine-readable form

Corpora
- Some work in the past
- New plan for corpora

Japanese EDR-based Dictionary (JAIST)

Model for such a dictionary (in the NLP project, 2001-2003)

Can benefit from the Japanese EDR:
- English word dictionary
- Concept dictionary, with concept primary illustration and concept explication in Vietnamese
- English co-occurrence dictionary
- EDR Corpus (English corpus)

Components to be newly built:
- Vietnamese word dictionary
- English co-occurrence dictionary
- Bilingual dictionary (English-Vietnamese, Vietnamese-English)
- EDR Corpus (Vietnamese corpus)

A Vietnamese MRD (UNS-VNUHCM)

Word      Morp  POS  GRM    SEM   English   Freq   Field
máy tính  C     Ns   cnt    ART   computer  2.221  cpt
hiển thị  C     Vt   -      Vcom  display   1.956  cpt
đường     W     Ns   cnt    LIN   line      2.087  -
đường     W     Nm   uncnt  CHM   sugar     1.987  -
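
As a rough illustration of how entries like these could be stored and queried, the following Python sketch models three of the rows as records and looks up the senses of the ambiguous word đường. The class name `MrdEntry`, the field names, and the `lookup` helper are hypothetical, and treating Freq as a number to sort by is an assumption; the slide does not describe the actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MrdEntry:
    word: str
    morp: str
    pos: str
    grm: str
    sem: str
    english: str
    freq: float
    field: str = ""  # left empty when the table gives no field


# Rows copied from the table; only rows with unambiguous columns are used.
ENTRIES = [
    MrdEntry("máy tính", "C", "Ns", "cnt", "ART", "computer", 2.221, "cpt"),
    MrdEntry("đường", "W", "Ns", "cnt", "LIN", "line", 2.087),
    MrdEntry("đường", "W", "Nm", "uncnt", "CHM", "sugar", 1.987),
]


def lookup(word, entries=ENTRIES):
    """Return every sense of `word`, highest Freq value first."""
    senses = [entry for entry in entries if entry.word == word]
    return sorted(senses, key=lambda entry: entry.freq, reverse=True)


for sense in lookup("đường"):
    print(sense.english, sense.pos, sense.sem)
```

Keeping one record per sense, rather than one per word, mirrors the table, where đường appears twice with different POS, GRM, and SEM values.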


Dictionary for machine translation (example from the JAIST group)

<word> take
  <grammar> verb_i
    <semantic-category> [none] </semantic-category>
    <translation-default> có hiệu lực </translation-default>
    <translation>
      cắn câu
      <constraint> subj: {fish} </constraint>
    </translation>
    <translation-pattern>
      $VP[inf] := take off for $Obj
      <translation-default> "vội vàng đến" $Obj </translation-default>
      <translation>
        "cất cánh đi" $Obj
        <constraint> subj: {plane, aeroplane, airplane, aircraft} </constraint>
      </translation>
    </translation-pattern>
  </grammar>
  . . .
</word>

Entries: 95,000 words, 15,000 phrases, 18,000 translation patterns (lexical rules)
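
The entry above pairs each default translation with alternatives that apply only when a constraint on the subject holds. A minimal Python sketch of that selection logic follows; it is an illustration under assumptions, not the JAIST system's code. The names `Translation`, `Entry`, and `choose` are hypothetical, and matching the subject by exact head word is a simplification.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class Translation:
    text: str
    subjects: FrozenSet[str]  # subject head words that satisfy the constraint


@dataclass
class Entry:
    default: str  # corresponds to <translation-default>
    constrained: List[Translation] = field(default_factory=list)


def choose(entry, subject):
    """Return the first translation whose subject constraint matches, else the default."""
    for candidate in entry.constrained:
        if subject in candidate.subjects:
            return candidate.text
    return entry.default


# Values copied from the example entry for "take" (verb_i).
take = Entry("có hiệu lực", [Translation("cắn câu", frozenset({"fish"}))])
take_off_for = Entry(
    "vội vàng đến",
    [Translation("cất cánh đi", frozenset({"plane", "aeroplane", "airplane", "aircraft"}))],
)

print(choose(take, "fish"))
print(choose(take_off_for, "manager"))
```

Checking constrained translations before the default means the most specific reading wins whenever its constraint is met.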

JAIST group's MT system on the Web

Some corpora

Monolingual corpora: VLC (Vietnam Lexicography Centre), UNS-VNUHCM, etc. for Vietnamese

Bilingual corpora: The EVC corpus (UNS-VNUHCM) consists of 400,000 pairs of E-V sentences (approx. 5,500,000 words) in the fields of science and technology (computing, electronics, etc.). The EVC has been partially annotated, semi-automatically, with morphology (word boundaries, lemmas), POS tags, and sense tags.

VLC: Development of supporting tools

[Figure: Vietnamese corpus tools. Labels: raw text; spelling check; sentence determination; word segmentation; POS tagging; chunking and parsing; Vietnamese treebank; dictionaries (monolingual dictionary, bilingual dictionary).]

VLC: Capacity and realization

Available tools
- Word segmentation, POS tagging, deep parsing in the TAG formalism, syllable list and morpho-syntactic lexicon, editor for revising segmentation and tagging
- Some utilities for corpus exploration

Ongoing work
- Improvement of available tools
- Improvement of the tagset for POS tagging
- Building a syntactic lexicon based on the morpho-syntactic lexicon
- Collection of a balanced corpus following the above criteria

Human resources
- About 10 people working on the project


VLSP national project 2006-2010

National project with the participation of more than ten research groups (all groups active in Vietnamese language and speech processing, VLSP)

Leaders: Prof. Ho Tu Bao (JAIST & IOIT) and Assoc. Prof. Luong Chi Mai (IOIT, VAST)

Objectives:

1. Build and develop several typical VLSP products for public end users.

2. Build and develop indispensable resources and tools for VLSP development.

Content: Basic research (1/3)

Basic research on methods for processing Vietnamese language and speech.

Applied research to adapt methods and technologies developed for other languages, as well as advanced techniques, to Vietnamese language and speech.

[Figure: three components of the VLSP project: computation methods for VLSP; typical products for end users; resources and tools for VLSP.]

Content: Products for end-user (2/3)

P1: VnVoice system for VN synthesis

P2: Embedded speech synthesis and recognition system

P3: Large lexicon based speech recognizer

P4: Domain-specific English-Vietnamese translation system

P5: IREST system for information retrieval, extraction, summarization, and translation

P6: Vietnamese spelling checker





Content: Resources and tools (3/3)

P7: Basic resources for speech
- Corpus for speech synthesis and recognition

P8: Three basic resources for language
- P81: Vietnamese MRD
- P82: Annotated corpora (mono, multi)
- P83: Entities (KB) (Rules of VN grammar)

P9: Five basic tools for language
- P91: Spelling checker
- P92: Vietnamese word segmentation
- P93: Vietnamese POS tagger
- P94: Vietnamese chunker
- P95: Vietnamese syntax analyzer

Recent events

VLSP workshop, 29 March 2005, Hanoi

VLSP workshop, 21 May 2005, Hanoi

VLSP workshop, July 2005

VLSP meeting, 21-25 Nov. 2005, JAIST


Current and future demands

Current needs driven by development: tourism, economy, communication, etc.

Demand is increasing for both human and automatic translation, especially translation on the Internet.

There is a lack of translation experts, especially for foreign languages other than English that are important for Vietnam, such as Japanese and Chinese.

Demand for translation will increase in the future as global integration grows.
