introduction to machine transliteration
DESCRIPTION
Explains the two tasks of machine transliteration: transliteration mining and transliteration generation. I used two excellent open-source tools, m2m-aligner and DirecTL+, developed by Jiampojamarn.
TRANSCRIPT
Introduction to Machine Transliteration
Yoh Okuno / @nokuno
#TokyoNLP
About me
• Name: Yoh Okuno / @nokuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, and English.
• Website: http://yoh.okuno.name/
What is transliteration?
[Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration
What is transliteration?
• Transliteration is defined as the phonetic translation of names across languages
• Similar to Letter-to-Phoneme (L2P) conversion and pronunciation inference
• The reverse operation of transliteration is called back-transliteration
Examples of Transliteration
• The shared task supports 14 language pairs
All language pairs at NEWS 2012
Two types of transliteration
1. Transliteration mining
– Given source-target language pairs with noise, find the correct transliterations among them
2. Transliteration generation
– Given source-language characters, generate a ranked list of target-language characters
1. Transliteration mining
[Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion
Character alignment
• Align pairs of Kana and Kanji characters monotonically, and detect alignment failures
• Uses techniques from statistical machine translation
• Used m2m-aligner because of its functionality
http://code.google.com/p/m2m-aligner/
四季多彩 しきたさい → 四|季|多|彩| し|き|た|さい|
西都原 さいとばる → 西|都|原| さい|と|ばる|
iPhone あいふぉん → i|Ph|o|n|e| あい|ふ|ぉ|ん|_|
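The monotonic many-to-many alignment above can be sketched as a Viterbi-style dynamic program. This is a toy illustration, not m2m-aligner itself: the substring-pair probabilities below are hand-made for the example (m2m-aligner learns them with EM), and `max_x`/`max_y` mirror its maximum-substring-length limits.

```python
import math

def align(src, tgt, prob, max_x=3, max_y=3, floor=1e-6):
    """Best monotonic segmentation of src/tgt into substring pairs.

    DP over positions (i, j): each step consumes 1..max_x source characters
    and 1..max_y target characters, scored by the log probability of the
    substring pair (unknown pairs get a small floor probability).
    Assumes the pair is alignable within the length limits.
    """
    n, m = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}          # (i, j) -> (score, backpointer)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            s0, _ = best[(i, j)]
            for dx in range(1, max_x + 1):
                for dy in range(1, max_y + 1):
                    if i + dx > n or j + dy > m:
                        continue
                    pair = (src[i:i+dx], tgt[j:j+dy])
                    s = s0 + math.log(prob.get(pair, floor))
                    key = (i + dx, j + dy)
                    if key not in best or s > best[key][0]:
                        best[key] = (s, (i, j, pair))
    # backtrace from the end of both strings
    pairs, pos = [], (n, m)
    while pos != (0, 0):
        i, j, pair = best[pos][1]
        pairs.append(pair)
        pos = (i, j)
    return pairs[::-1]

pairs = align("kyoto", "きょうと", {("kyo", "きょう"): 0.8, ("to", "と"): 0.9})
# → [('kyo', 'きょう'), ('to', 'と')]
```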
Training m2m-aligner
• Trained on 3 datasets
– Mozc's dictionary (1.5M words)
– UniDic (230k words)
– alt-cannadic (400k words) → most suitable
• Just run 2 commands
Training results
• Three files are generated:
Alignment:
Error:
Model:
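The pipe-delimited alignment lines shown earlier (e.g. i|Ph|o|n|e| あい|ふ|ぉ|ん|_|) can be read back with a few lines of Python. A sketch only: I'm assuming the two sides are tab-separated and that "_" marks an empty (deleted) side, rather than describing the exact m2m-aligner file format.

```python
def parse_alignment(line, null="_"):
    """Parse one aligned entry in the pipe-delimited layout shown earlier.

    Assumes the source and target sides are separated by a tab, clusters
    are separated by '|' (with a trailing '|'), and `null` marks a deletion.
    Returns a list of (source_cluster, target_cluster) pairs.
    """
    src_field, tgt_field = line.rstrip("\n").split("\t")
    src = [c for c in src_field.split("|") if c]  # drop trailing empty piece
    tgt = [c for c in tgt_field.split("|") if c]
    return [("" if s == null else s, "" if t == null else t)
            for s, t in zip(src, tgt)]

print(parse_alignment("i|Ph|o|n|e|\tあい|ふ|ぉ|ん|_|"))
# [('i', 'あい'), ('Ph', 'ふ'), ('o', 'ぉ'), ('n', 'ん'), ('e', '')]
```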
Applying m2m-aligner
• Applied to 6 datasets
– Social IME shared dictionary (93k words)
– Mined from Wikipedia (169k words)
– Crawled MS-IME dictionary (18k words)
– Manually corrected MS-IME dictionary (92k words)
– Hatena keyword (315k words)
– Mined from Aozora Bunko (225k words)
What is Social IME?
• The most popular "cloud-based" Japanese input method (230k unique users per month)
http://www.social-ime.com/
Shared Dictionary of Social IME
• Shared with all users
• Noisy & crazy → needs cleaning!
Mining words from Wikipedia
grep with a pattern like “[一-龠]+([ぁ-んヴー]+)”
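The grep pattern above translates into a small Python miner. A sketch under one assumption: readings are taken to appear in full-width parentheses （…）, as is common on Japanese Wikipedia; the exact delimiters on the slide are ambiguous.

```python
import re

# A kanji word followed by a hiragana reading in full-width parentheses,
# e.g. "東京（とうきょう）". The character classes follow the grep pattern
# on the slide; the full-width （） delimiters are an assumption.
PAIR = re.compile(r"([一-龠]+)（([ぁ-んヴー]+)）")

def mine_pairs(text):
    """Return (word, reading) pairs found in raw Wikipedia text."""
    return PAIR.findall(text)

print(mine_pairs("東京（とうきょう）は日本の首都。"))  # [('東京', 'とうきょう')]
```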
Crawling MS-IME user dictionary
Hatena keyword http://developer.hatena.ne.jp/ja/documents/keyword/misc/catalog
Mining Aozora Bunko http://satomacoto.blogspot.com/2012/01/blog-post.html
Applied results
• Run:
• Results:

Dataset     Size   Align   Error
Social IME  93k    48k     45k
Wikipedia   169k   137k    32k
MS-IME      18k    16k     2k
MS-IME2     97k    86k     10k
Hatena      314k   235k    78k
Aozora      255k   114k    110k
Alignment examples
• Not perfect, but practical precision
From Social IME:
From Wikipedia:
• “ゃ, ゅ, ょ, っ” should be combined with the previous character
Error examples (from Social IME)
• Error analysis is the most interesting part!
Abbreviations:
Emoticons (顔文字):
Personal information:
Error examples (from Hatena)
Length limit (16 chars):
Chinese / Korean / old Japanese words:
Semantic translation:
Error examples (from Aozora)
• Many old Japanese words cannot be aligned
• Many semantic translations in old Japanese
Aligning Mozc dictionary
• Aligned the Mozc dictionary with the alt-cannadic model

Data   Input   Alignment   Error
Size   1488k   1424k       64k

• Error examples:
ぎんごう 銀行
かくちょうだかいだかい 格調高い
あくせられーた 一方通行
{こうたろう/ひろたろう} 廣太郎
Conclusion
• Described how to clean the Social IME / Wikipedia / MS-IME dictionaries using m2m-aligner
• Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on
2. Transliteration Generation
[Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion
DirecTL+: String Transduction Model
• Training and decoding tool developed by the same author as m2m-aligner (now a Googler)
• Uses structured perceptron and MIRA
• Requires an aligned corpus (m2m-aligner format)
• http://code.google.com/p/directl-p/
Adopted joint model
• The joint model performs better than a pipeline approach
Structured Perceptron
Features for transliteration
• Target 1-gram and 2-gram features, and their combinations
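The training loop can be sketched as a structured perceptron over the feature templates above (source segment with target unigram, target bigram, and their combination). A toy illustration only, not DirecTL+ itself: exhaustive enumeration stands in for its real Viterbi/beam decoding, and the feature names, candidate table, and data are invented.

```python
from collections import defaultdict
from itertools import product

def features(src_segs, tgt_segs):
    """Feature counts: source-segment/target-unigram, target bigram,
    and their combination (a simplified version of the slide's templates)."""
    f = defaultdict(int)
    prev = "<s>"
    for s, t in zip(src_segs, tgt_segs):
        f[("emit", s, t)] += 1         # source segment with target 1-gram
        f[("trans", prev, t)] += 1     # target 2-gram
        f[("joint", s, prev, t)] += 1  # combination
        prev = t
    return f

def score(w, src_segs, tgt_segs):
    return sum(w[k] * v for k, v in features(src_segs, tgt_segs).items())

def decode(w, src_segs, options):
    """Argmax over per-segment candidate targets by exhaustive enumeration
    (a Viterbi search would replace this for real-sized inputs)."""
    best, best_score = None, float("-inf")
    for tgt in product(*(options[s] for s in src_segs)):
        sc = score(w, src_segs, list(tgt))
        if sc > best_score:
            best, best_score = list(tgt), sc
    return best

def train(data, options, epochs=5):
    """Structured perceptron: reward gold features, penalize predicted ones."""
    w = defaultdict(float)
    for _ in range(epochs):
        for src_segs, gold in data:
            pred = decode(w, src_segs, options)
            if pred != gold:
                for k, v in features(src_segs, gold).items():
                    w[k] += v
                for k, v in features(src_segs, pred).items():
                    w[k] -= v
    return w

data = [(["ka", "n", "a"], ["か", "ん", "あ"])]
options = {"ka": ["き", "か"], "n": ["ん"], "a": ["あ"]}
w = train(data, options)
print(decode(w, ["ka", "n", "a"], options))  # ['か', 'ん', 'あ']
```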
Evaluation Metrics
• Word Accuracy: Top-1 accuracy
• Mean F-score: character-based accuracy
• MRR: Top-k ranking metric based on the position of the first correct candidate
• MAP: Top-k ranking metric using all correct candidates
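Word accuracy and MRR as defined above fit in a few lines. A sketch under my own conventions (not the official NEWS scorer): `golds` is a list of sets of acceptable references, `predictions` a ranked candidate list per word.

```python
def word_accuracy(golds, predictions):
    """Top-1 accuracy: the best-ranked candidate must be a correct reference."""
    return sum(preds[0] in gold
               for gold, preds in zip(golds, predictions)) / len(golds)

def mrr(golds, predictions):
    """Mean reciprocal rank of the first correct candidate (0 if none found)."""
    total = 0.0
    for gold, preds in zip(golds, predictions):
        for rank, cand in enumerate(preds, start=1):
            if cand in gold:
                total += 1.0 / rank
                break
    return total / len(golds)

golds = [{"とうきょう"}, {"おおさか"}]
preds = [["とうきょう", "ときょう"], ["おさか", "おおさか"]]
print(word_accuracy(golds, preds), mrr(golds, preds))  # 0.5 0.75
```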
Experiments
• MIRA outperformed the perceptron and other methods
Conclusion
• Proposed a joint model for transliteration / letter-to-phoneme conversion
• MIRA outperformed the structured perceptron
• Features including unigrams and linear-chain features perform well
References
• [Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration
• [Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion
• [Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion
Any Questions?