the current status of chinese-english ebmt research -where are we now joy, ralf brown, robert...
Post on 15-Jan-2016
218 views
TRANSCRIPT
![Page 1: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/1.jpg)
The current status of Chinese-English EBMT research
-where are we now
Joy, Ralf Brown, Robert Frederking, Erik Peterson
Aug 2001
![Page 2: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/2.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
2
Overview of Ch-En EBMT• Adapting EBMT to Chinese
– Segmentation of Chinese
• Corpus used– Hong Kong legal code (from LDC)– Hong Kong news articles (from LDC)
• In this project:– Robert Frederking, Ralf Brown, Joy, Erik Peterson,
Stephan Vogel, Alon Lavie, Lori Levin,
![Page 3: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/3.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
3
Corpus Statistics• Hong Kong Legal Code:
Chinese: 23 MB
English: 37.8 MB
• Hong Kong News (After cleaning): 7622 DocumentsDev-test: Size: 1,331,915 byte , 4,992 sentence pairs
Final-test: Size: 1,329,764 byte, 4,866 sentence pairs
Training: Size: 25,720,755 byte, 95,752 sentence pairs
• Corpus Cleaning– Converted from Big5 to GB
– Divided into Training set (90%), Dev-test (5%) and test set (5%)
– Sentence level alignment, using Church & Gale Method (by ISI)
– Cleaned
– Convert two-byte Chinese characters to their cognates
![Page 4: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/4.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
4
Chinese Segmentation• There are no spaces between Chinese words in written Chinese.
• The segmentation problem: Given a sentence with no spaces, break it into words. Definition of Chinese word is vague.
![Page 5: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/5.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
5
Our Definition of Words/Phrases/Terms
• Chinese Characters– The smallest unit in written Chinese is a character, which is represented
by 2 bytes in GB-2312 code.
• Chinese Words– A word in natural language is the smallest reusable unit which can be used
in isolation.
• Chinese Phrases– We define a Chinese phrase as a sequence of Chinese words. For each
word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself.
• Terms– A term is a meaningful constituent. It can be either a word or a phrase.
![Page 6: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/6.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
6
Complicated Constructions• Transliterated foreign words and names
• Abbreviations
• Chinese Names
• Chinese Numbers
![Page 7: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/7.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
7
Segmenter
• Approaches– Statistical approaches:
• Idea: Building collocation models for Chinese characters, such as first-order HMM. Place the space at the place where two characters rarely co-occur.
• Cons:– Data sparseness
– Cross boundary
![Page 8: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/8.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
8
Segmenter (2)– Dictionary-based approaches
• Idea: Use a dictionary to find the words in the sentence
• Forward maximum match / backward maximum match/ or both direction
• Cons:– The size and quality of the dictionary used are of great
importance: New words, Named-entity
– Maximum (greedy) match may cause mis-segmentations
![Page 9: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/9.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
9
Segmenter (3)
– A combination of dictionary and linguistic knowledge
• Ideas: Using morphology, POS, grammar and heuristics to aid disambiguation
• Pros: high accuracy (possible)
• Cons: – Require a dictionary with POS and word-frequency
– Computationally expensive
![Page 10: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/10.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
10
Segmenter (4)
• We first used LDC’s segmenter
• Currently we are using a forward/backward maximum match segmenter for baseline. The word frequency dictionary is from LDC
• The word frequency dictionary from LDC: 43,959 entries
• For HLT 2001, we augmented the frequency dictionary with new words found from the corpus by statistical method
![Page 11: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/11.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
11
Two-threshold method
• Two-threshold for tokenization (finding new words from the corpus) : for MT Summit VIII
![Page 12: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/12.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
12
For PI Meeting• Baseline System
– Using LDC’s frequency word dictionary
• Full System– Tokenize new words from the pre-segmented corpus using two-
threshold method, augment the frequency dictionary with new words to re-segment the corpus
– Bracket English– Using feedback from statDict to adjust segmentation/bracketing
• Baseline + Named-Entity– Named-entity tagger by Erik Peterson
• Multi-corpora System– Cluster the documents into sub-corpora according to their topics
![Page 13: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/13.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
13
Evaluation Issues
• Automatic Measures– EBMT Source Match– EBMT Source Coverage– EBMT Target Coverage– MEMT (EBMT+DICT) Unigram Coverage– MEMT (EBMT+DICT) PER
![Page 14: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/14.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
14
Evaluation Issues (2)• Human Evaluations
– 4-5 graders each time
– 6 categories
![Page 15: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/15.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
15
After PI Meeting (0)• Study of results reported in PI meeting(http://pizza.is.cs.cmu.edu/research/internal/ebmt/tokenLen/index.htm)
– The quality of Named-Entity (Cleaned by Erik)
– Performance difference of EBMT while changing the average length of Chinese word token (by changing segmentation)
– How to evaluate the performance of the system
• Experiment of G-EBMT– Word clustering
![Page 16: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/16.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
16
After PI Meeting (1)
• Changing the average length of Chinese token– No bracket on English– Use a subset of LDC’s frequency dictionary for
segmentation– Study the performance of EBMT system on
different average Chinese token length
![Page 17: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/17.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
17
After PI Meeting (2)• Avg. Token Len. Vs. PER
![Page 18: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/18.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
18
After PI Meeting (3)• Type-Token curve of Chinese and English
![Page 19: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/19.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
19
Future Research Plan
• Generalized EBMT– Word-clustering– Grammar Induction
• Using Machine Learning to optimize the parameters used in MEMT
• Better Alignment Model: Integrating segmentation, brackting and alignment
![Page 20: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/20.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
20
New Alignment Model (1)• Using both monolingual and bilingual collocation information to
segment and align corpus
![Page 21: The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649d565503460f94a34265/html5/thumbnails/21.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
21
References
• Tom Emerson, “Segmentation of Chinese Text”. In #38 Volume 12 Issue2 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc.
• Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of Human Language Technology Conference 2001 (HLT-2001).
• Ying Zhang, Ralf D. Brown, Robert E. Frederking and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted in MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001)