introduction to machine transliteration
DESCRIPTION
Explains the two tasks of machine transliteration: transliteration mining and transliteration generation. I used two excellent open-source tools, m2m-aligner and DirecTL+, developed by Jiampojamarn.
TRANSCRIPT
Introduction to Machine Transliteration
Yoh Okuno / @nokuno
#TokyoNLP
About me
• Name: Yoh Okuno / @nokuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, and English.
• Website: http://yoh.okuno.name/
What is transliteration?
[Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration
What is transliteration?
• Transliteration is defined as the phonetic translation of names across languages
• Similar to Letter-to-Phoneme (L2P) conversion and pronunciation inference
• The reverse operation of transliteration is called back-transliteration
Examples of Transliteration
• The shared task supports 14 language pairs
All language pairs at NEWS 2012
Two types of transliteration
1. Transliteration mining
– Given source-target language pairs with noise, find the correct transliterations among them
2. Transliteration generation
– Given source-language characters, generate a ranked list of target-language characters
1. Transliteration mining
[Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion
Character alignment
• Align pairs of Kana and Kanji characters monotonically, and detect alignment failures
• Uses techniques from statistical machine translation
• Used m2m-aligner because of its functionality
http://code.google.com/p/m2m-aligner/
四季多彩 しきたさい → 四|季|多|彩| し|き|た|さい|
西都原 さいとばる → 西|都|原| さい|と|ばる|
iPhone あいふぉん → i|Ph|o|n|e| あい|ふ|ぉ|ん|_|
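The monotonic many-to-many alignment above can be sketched as a Viterbi-style dynamic program. This is a toy illustration, not m2m-aligner itself: the substring-pair probabilities below are hand-made for the example (m2m-aligner learns them with EM), and `max_x`/`max_y` mirror its maximum-substring-length limits.

```python
import math

def align(src, tgt, prob, max_x=3, max_y=3, floor=1e-6):
    """Best monotonic segmentation of src/tgt into substring pairs.

    DP over positions (i, j): each step consumes 1..max_x source characters
    and 1..max_y target characters, scored by the log probability of the
    substring pair (unknown pairs get a small floor probability).
    Assumes the pair is alignable within the length limits.
    """
    n, m = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}          # (i, j) -> (score, backpointer)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            s0, _ = best[(i, j)]
            for dx in range(1, max_x + 1):
                for dy in range(1, max_y + 1):
                    if i + dx > n or j + dy > m:
                        continue
                    pair = (src[i:i+dx], tgt[j:j+dy])
                    s = s0 + math.log(prob.get(pair, floor))
                    key = (i + dx, j + dy)
                    if key not in best or s > best[key][0]:
                        best[key] = (s, (i, j, pair))
    # backtrace from the end of both strings
    pairs, pos = [], (n, m)
    while pos != (0, 0):
        i, j, pair = best[pos][1]
        pairs.append(pair)
        pos = (i, j)
    return pairs[::-1]

pairs = align("kyoto", "きょうと", {("kyo", "きょう"): 0.8, ("to", "と"): 0.9})
# → [('kyo', 'きょう'), ('to', 'と')]
```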
Training m2m-aligner
• Trained on 3 datasets
– Mozc's dictionary (1.5M words)
– UniDic (230k words)
– alt-cannadic (400k words) → most suitable
• Just run 2 commands
Training results
• Three files are generated:
Alignment:
Error:
Model:
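The pipe-delimited alignment lines shown earlier (e.g. i|Ph|o|n|e| あい|ふ|ぉ|ん|_|) can be read back with a few lines of Python. A sketch only: I'm assuming the two sides are tab-separated and that "_" marks an empty (deleted) side, rather than describing the exact m2m-aligner file format.

```python
def parse_alignment(line, null="_"):
    """Parse one aligned entry in the pipe-delimited layout shown earlier.

    Assumes the source and target sides are separated by a tab, clusters
    are separated by '|' (with a trailing '|'), and `null` marks a deletion.
    Returns a list of (source_cluster, target_cluster) pairs.
    """
    src_field, tgt_field = line.rstrip("\n").split("\t")
    src = [c for c in src_field.split("|") if c]  # drop trailing empty piece
    tgt = [c for c in tgt_field.split("|") if c]
    return [("" if s == null else s, "" if t == null else t)
            for s, t in zip(src, tgt)]

print(parse_alignment("i|Ph|o|n|e|\tあい|ふ|ぉ|ん|_|"))
# [('i', 'あい'), ('Ph', 'ふ'), ('o', 'ぉ'), ('n', 'ん'), ('e', '')]
```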
Applying m2m-aligner
• Applied to 6 datasets
– Social IME shared dictionary (93k words)
– Mined from Wikipedia (169k words)
– Crawled MS-IME dictionary (18k words)
– Manually corrected MS-IME dictionary (92k words)
– Hatena keyword (315k words)
– Mined from Aozora Bunko (225k words)
What is Social IME?
• The most popular "cloud-based" Japanese input method (230k unique users per month)
http://www.social-ime.com/
Shared Dictionary of Social IME
• Shared with all users
• Noisy & crazy → needs cleaning!
Mining words from Wikipedia
grep with a pattern like “[一-龠]+([ぁ-んヴー]+)”
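The grep pattern above translates into a small Python miner. A sketch under one assumption: readings are taken to appear in full-width parentheses （…）, as is common on Japanese Wikipedia; the exact delimiters on the slide are ambiguous.

```python
import re

# A kanji word followed by a hiragana reading in full-width parentheses,
# e.g. "東京（とうきょう）". The character classes follow the grep pattern
# on the slide; the full-width （） delimiters are an assumption.
PAIR = re.compile(r"([一-龠]+)（([ぁ-んヴー]+)）")

def mine_pairs(text):
    """Return (word, reading) pairs found in raw Wikipedia text."""
    return PAIR.findall(text)

print(mine_pairs("東京（とうきょう）は日本の首都。"))  # [('東京', 'とうきょう')]
```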
Crawling MS-IME user dictionary
Hatena keyword http://developer.hatena.ne.jp/ja/documents/keyword/misc/catalog
Mining Aozora Bunko http://satomacoto.blogspot.com/2012/01/blog-post.html
Applied results
• Run:
• Results:

Dataset     Size   Align   Error
Social IME  93k    48k     45k
Wikipedia   169k   137k    32k
MS-IME      18k    16k     2k
MS-IME2     97k    86k     10k
Hatena      314k   235k    78k
Aozora      255k   114k    110k
Alignment examples
• Not perfect, but practical precision
From Social IME:
From Wikipedia:
• “ゃ, ゅ, ょ, っ” should be combined with the previous character
Error examples (from Social IME)
• Error analysis is the most interesting part!
Abbreviations:
Emoticons (顔文字):
Personal information:
Error examples (from Hatena)
Length limit (16 chars):
Chinese / Korean / old Japanese words:
Semantic translation:
Error examples (from Aozora)
• Many old Japanese words cannot be aligned
• Many semantic translations in old Japanese
Aligning Mozc dictionary
• Aligned the Mozc dictionary with the alt-cannadic model

Data   Input   Alignment   Error
Size   1488k   1424k       64k

• Error examples:
ぎんごう 銀行
かくちょうだかいだかい 格調高い
あくせられーた 一方通行
{こうたろう/ひろたろう} 廣太郎
Conclusion
• Described how to clean the Social IME / Wikipedia / MS-IME dictionaries using m2m-aligner
• Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on
2. Transliteration Generation
[Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion
DirecTL+: String Transduction Model
• Training and decoding tool developed by the same author as m2m-aligner (now a Googler)
• Uses structured perceptron and MIRA
• Requires an aligned corpus (m2m-aligner format)
• http://code.google.com/p/directl-p/
Adopted joint model
• The joint model performs better than a pipeline approach
Structured Perceptron
Features for transliteration
• Target 1-gram and 2-gram features, and their combinations
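The training loop can be sketched as a structured perceptron over the feature templates above (source segment with target unigram, target bigram, and their combination). A toy illustration only, not DirecTL+ itself: exhaustive enumeration stands in for its real Viterbi/beam decoding, and the feature names, candidate table, and data are invented.

```python
from collections import defaultdict
from itertools import product

def features(src_segs, tgt_segs):
    """Feature counts: source-segment/target-unigram, target bigram,
    and their combination (a simplified version of the slide's templates)."""
    f = defaultdict(int)
    prev = "<s>"
    for s, t in zip(src_segs, tgt_segs):
        f[("emit", s, t)] += 1         # source segment with target 1-gram
        f[("trans", prev, t)] += 1     # target 2-gram
        f[("joint", s, prev, t)] += 1  # combination
        prev = t
    return f

def score(w, src_segs, tgt_segs):
    return sum(w[k] * v for k, v in features(src_segs, tgt_segs).items())

def decode(w, src_segs, options):
    """Argmax over per-segment candidate targets by exhaustive enumeration
    (a Viterbi search would replace this for real-sized inputs)."""
    best, best_score = None, float("-inf")
    for tgt in product(*(options[s] for s in src_segs)):
        sc = score(w, src_segs, list(tgt))
        if sc > best_score:
            best, best_score = list(tgt), sc
    return best

def train(data, options, epochs=5):
    """Structured perceptron: reward gold features, penalize predicted ones."""
    w = defaultdict(float)
    for _ in range(epochs):
        for src_segs, gold in data:
            pred = decode(w, src_segs, options)
            if pred != gold:
                for k, v in features(src_segs, gold).items():
                    w[k] += v
                for k, v in features(src_segs, pred).items():
                    w[k] -= v
    return w

data = [(["ka", "n", "a"], ["か", "ん", "あ"])]
options = {"ka": ["き", "か"], "n": ["ん"], "a": ["あ"]}
w = train(data, options)
print(decode(w, ["ka", "n", "a"], options))  # ['か', 'ん', 'あ']
```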
Evaluation Metrics
• Word Accuracy: Top-1 accuracy
• Mean F-score: character-based accuracy
• MRR: Top-k ranking metric based on the position of the first correct candidate
• MAP: Top-k ranking metric using all correct candidates
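Word accuracy and MRR as defined above fit in a few lines. A sketch under my own conventions (not the official NEWS scorer): `golds` is a list of sets of acceptable references, `predictions` a ranked candidate list per word.

```python
def word_accuracy(golds, predictions):
    """Top-1 accuracy: the best-ranked candidate must be a correct reference."""
    return sum(preds[0] in gold
               for gold, preds in zip(golds, predictions)) / len(golds)

def mrr(golds, predictions):
    """Mean reciprocal rank of the first correct candidate (0 if none found)."""
    total = 0.0
    for gold, preds in zip(golds, predictions):
        for rank, cand in enumerate(preds, start=1):
            if cand in gold:
                total += 1.0 / rank
                break
    return total / len(golds)

golds = [{"とうきょう"}, {"おおさか"}]
preds = [["とうきょう", "ときょう"], ["おさか", "おおさか"]]
print(word_accuracy(golds, preds), mrr(golds, preds))  # 0.5 0.75
```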
Experiments
• MIRA outperformed the perceptron and other methods
Conclusion
• Proposed a joint model for transliteration / letter-to-phoneme conversion
• MIRA outperformed the structured perceptron
• Features including unigrams and linear-chain features perform well
References
• [Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration
• [Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion
• [Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion
Any Questions?