Introduction to Machine Transliteration

Yoh Okuno / @nokuno, #TokyoNLP

Uploaded by yoh-okuno on 11-Jun-2015

Category: Technology

DESCRIPTION

This talk explains the two tasks of machine transliteration: transliteration mining and transliteration generation. It uses two excellent open-source tools developed by Jiampojamarn: m2m-aligner and DirecTL+.

TRANSCRIPT

Page 1: Introduction to Machine Transliteration

Introduction to Machine Transliteration

Yoh Okuno / @nokuno

#TokyoNLP

Page 2: Introduction to Machine Transliteration

About me

• Name: Yoh Okuno / @nokuno

• Software Engineer at Yahoo! Japan

• Interests: NLP, Machine Learning, Data Mining

• Skills: C/C++, Python, Hadoop, and English

• Website: http://yoh.okuno.name/

Page 3: Introduction to Machine Transliteration

What is transliteration?

[Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration

Page 4: Introduction to Machine Transliteration

What is transliteration?

• Transliteration is defined as the phonetic translation of names across languages

• Similar to Letter-to-Phoneme (L2P) conversion and Pronunciation Inference

• The reverse operation of transliteration is called back-transliteration

Page 5: Introduction to Machine Transliteration

Examples of Transliteration

• The shared task supports 14 language pairs

Page 6: Introduction to Machine Transliteration

All language pairs at NEWS 2012

Page 7: Introduction to Machine Transliteration

Two types of transliteration

1. Transliteration mining

– Given source-target language pairs with noise, find correct transliterations among them

2. Transliteration generation

– Given source language characters, generate a ranked list of target language characters

Page 8: Introduction to Machine Transliteration

1. Transliteration mining

[Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion

Page 9: Introduction to Machine Transliteration

Character alignment

• Align pairs of Kana and Kanji characters monotonically and detect alignment failures

• Techniques from statistical machine translation

• Used m2m-aligner because of its features

http://code.google.com/p/m2m-aligner/

四季多彩  しきたさい  西都原  さいとばる  iPhone  あいふぉん

四|季|多|彩|  し|き|た|さい|  西|都|原|  さい|と|ばる|  i|Ph|o|n|e|  あい|ふ|ぉ|ん|_|
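The monotonic many-to-many alignment above can be found with a simple dynamic program once pair scores are available. A minimal sketch of the idea (this illustrates the search only, not m2m-aligner's actual EM training; the `scores` table, the `-100.0` unseen-pair penalty, and the `max_x`/`max_y` limits are my assumptions):

```python
def align(src, tgt, scores, max_x=2, max_y=2):
    """Best monotonic many-to-many alignment of src and tgt.

    scores maps (src_substring, tgt_substring) pairs to log-probability-like
    values; unseen pairs get a large penalty. Returns the aligned pairs.
    """
    n, m = len(src), len(tgt)
    # best[(i, j)] = (score, backpointer) for aligning src[:i] with tgt[:j]
    best = {(0, 0): (0.0, None)}
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            s, _ = best[(i, j)]
            # Extend the alignment by one substring pair of bounded length
            for dx in range(1, max_x + 1):
                for dy in range(1, max_y + 1):
                    if i + dx > n or j + dy > m:
                        continue
                    pair = (src[i:i + dx], tgt[j:j + dy])
                    step = scores.get(pair, -100.0)  # penalty for unseen pairs
                    cand = (s + step, (i, j, pair))
                    key = (i + dx, j + dy)
                    if key not in best or cand[0] > best[key][0]:
                        best[key] = cand
    # Walk back from the full alignment to recover the pair sequence
    pairs, cell = [], (n, m)
    while best[cell][1] is not None:
        i, j, pair = best[cell][1]
        pairs.append(pair)
        cell = (i, j)
    return list(reversed(pairs))

# Toy score table for the slide's 四季多彩 / しきたさい example (hypothetical values)
scores = {("四", "し"): 0.0, ("季", "き"): 0.0, ("多", "た"): 0.0, ("彩", "さい"): 0.0}
print(align("四季多彩", "しきたさい", scores))
# → [('四', 'し'), ('季', 'き'), ('多', 'た'), ('彩', 'さい')]
```

The bounded substring lengths (`max_x`, `max_y`) are what make the alignment "many-to-many" rather than character-to-character, which is how 彩 can pair with the two-character さい.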

Page 10: Introduction to Machine Transliteration

Training m2m-aligner

• Trained on 3 datasets

– Mozc's dictionary (1.5M words)

– UniDic (230k words)

– alt-cannadic (400k words) → most suitable

• Just run 2 commands

Page 11: Introduction to Machine Transliteration

Trained results

• Three files are generated: alignment, error, and model

Page 12: Introduction to Machine Transliteration

Applying m2m-aligner

• Applied to 6 datasets

– Social IME shared dictionary (93k words)

– Mined from Wikipedia (169k words)

– Crawled MS-IME dictionary (18k words)

– Manually corrected MS-IME dictionary (92k words)

– Hatena keyword (315k words)

– Mined from Aozora Bunko (225k words)

Page 13: Introduction to Machine Transliteration

What is Social IME?

• The most popular "cloud-based" Japanese input method (230k unique users per month)

http://www.social-ime.com/

Page 14: Introduction to Machine Transliteration

Shared Dictionary of Social IME

• Shared with all users

• Noisy & crazy → needs cleaning!

Page 15: Introduction to Machine Transliteration

Mining words from Wikipedia

grep with a pattern like "[一-龠]+([ぁ-んヴー]+)"
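The grep pattern above (a kanji word followed by its kana reading in parentheses) can be reproduced in Python. A minimal sketch; the sample text and the handling of both ASCII and full-width parentheses are my assumptions, not from the slides:

```python
import re

# Kanji word followed by a kana reading in parentheses, mirroring the
# slide's grep pattern "[一-龠]+([ぁ-んヴー]+)". Accepting both ASCII
# and full-width parentheses is an assumption about Wikipedia text.
PAIR = re.compile(r"([一-龠]+)[（(]([ぁ-んヴー]+)[)）]")

def mine_pairs(text):
    """Extract (kanji, reading) candidate pairs from raw article text."""
    return PAIR.findall(text)

sample = "東京(とうきょう)は日本(にほん)の首都。"
print(mine_pairs(sample))
# → [('東京', 'とうきょう'), ('日本', 'にほん')]
```

These extracted pairs are only candidates: semantic glosses and non-reading parentheticals also match, which is why the alignment step afterwards is needed to filter the noise.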

Page 16: Introduction to Machine Transliteration

Crawling MS-IME user dictionary

Page 17: Introduction to Machine Transliteration

Hatena keyword

http://developer.hatena.ne.jp/ja/documents/keyword/misc/catalog

Page 18: Introduction to Machine Transliteration

Mining Aozora Bunko

http://satomacoto.blogspot.com/2012/01/blog-post.html

Page 19: Introduction to Machine Transliteration

Applied results

• Run:

• Results:

Dataset   Social IME   Wikipedia   MS-IME   MS-IME2   Hatena   Aozora
Size      93k          169k        18k      97k       314k     255k
Align     48k          137k        16k      86k       235k     114k
Error     45k          32k         2k       10k       78k      110k

Page 20: Introduction to Machine Transliteration

Alignment examples

• Not perfect, but practical precision

From Social IME:

From Wikipedia:

• "ゃ, ゅ, ょ, っ" should be combined with the previous character
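Merging the small kana into the preceding unit can be handled as a post-processing pass over the aligned segments. A minimal sketch (the function name and list-of-segments format are mine, not from the slides):

```python
# Small kana that form one phonetic unit with the preceding character
SMALL_KANA = set("ゃゅょっ")

def merge_small_kana(segments):
    """Merge any segment starting with ゃ/ゅ/ょ/っ into the previous segment."""
    merged = []
    for seg in segments:
        if merged and seg and seg[0] in SMALL_KANA:
            merged[-1] += seg  # attach to the preceding unit
        else:
            merged.append(seg)
    return merged

print(merge_small_kana(["き", "ょ", "う"]))
# → ['きょ', 'う']
```

Running this over the aligner's output would turn segmentations like き|ょ|う into きょ|う, matching the observation above.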

Page 21: Introduction to Machine Transliteration

Error examples (from Social IME)

• Error analysis is most interesting!

Abbreviations:

Emoticons (顔文字):

Personal information:

Page 22: Introduction to Machine Transliteration

Error examples (from Hatena)

Length limit (16 chars):

Chinese / Korean / old Japanese words:

Semantic translation:

Page 23: Introduction to Machine Transliteration

Error examples (from Aozora)

• Many old Japanese words cannot be aligned

• Many semantic translations in old Japanese

Page 24: Introduction to Machine Transliteration

Aligning Mozc dictionary

• Aligned the Mozc dictionary with the alt-cannadic model

• Error examples:

Data   Input   Alignment   Error
Size   1488k   1424k       64k

– ぎんごう / 銀行
– かくちょうだかいだかい / 格調高い
– あくせられーた / 一方通行
– {こうたろう/ひろたろう} / 廣太郎

Page 25: Introduction to Machine Transliteration

Conclusion

• Described how to clean the Social IME / Wikipedia / MS-IME dictionaries using m2m-aligner

• Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on

Page 26: Introduction to Machine Transliteration

2. Transliteration Generation

[Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion

Page 27: Introduction to Machine Transliteration

DirecTL+: String Transduction Model

• Training and decoding tool developed by the same author as m2m-aligner (now at Google)

• Uses structured perceptron and MIRA

• Requires an aligned corpus (m2m-aligner format)

• http://code.google.com/p/directl-p/

Page 28: Introduction to Machine Transliteration

Adopted joint model

• The joint model is better than the pipeline

Page 29: Introduction to Machine Transliteration

Structured  Perceptron
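The structured perceptron update is simple: decode with the current weights, and on a mistake add the gold structure's features and subtract the prediction's. A toy sketch that scores whole candidate outputs from a fixed list rather than running DirecTL+'s real search (all names, the feature templates, and the tiny example are mine):

```python
from collections import defaultdict

def features(x, y):
    """Toy feature map: input/output symbol pairs plus output bigrams."""
    feats = defaultdict(int)
    for xi, yi in zip(x, y):
        feats[("emit", xi, yi)] += 1
    for a, b in zip(y, y[1:]):
        feats[("trans", a, b)] += 1
    return feats

def decode(w, x, candidates):
    """Pick the candidate output with the highest weighted feature score."""
    return max(candidates,
               key=lambda y: sum(w[f] * c for f, c in features(x, y).items()))

def train(data, candidates, epochs=5):
    """Structured perceptron: on a mistake, move weights toward the gold
    structure's features and away from the predicted structure's."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, gold in data:
            pred = decode(w, x, candidates)
            if pred != gold:
                for f, c in features(x, gold).items():
                    w[f] += c
                for f, c in features(x, pred).items():
                    w[f] -= c
    return w

data = [(("t", "o"), ("ト", "オ"))]          # one toy training pair
candidates = [("タ", "ア"), ("ト", "オ")]     # hypothetical output space
w = train(data, candidates)
print(decode(w, ("t", "o"), candidates))
# → ('ト', 'オ')
```

The real system replaces the fixed candidate list with a dynamic-programming search over substring operations, but the update rule is exactly this.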

Page 30: Introduction to Machine Transliteration

Features for transliteration

• Target 1-gram, 2-gram, and combination features

Page 31: Introduction to Machine Transliteration

Evaluation Metrics

• Word Accuracy: top-1 accuracy

• Mean F-score: character-based accuracy

• MRR: top-k ranking metric using the position of the first correct candidate

• MAP: top-k ranking metric using all correct candidates
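The two ranking-style metrics can be made concrete in a few lines. A sketch under the usual definitions (word accuracy = exact top-1 match, MRR = mean of 1/rank of the first correct candidate); the sample data is invented for illustration:

```python
def word_accuracy(golds, predictions):
    """Fraction of words whose top-1 prediction matches a reference."""
    hits = sum(1 for gold, preds in zip(golds, predictions)
               if preds and preds[0] in gold)
    return hits / len(golds)

def mrr(golds, predictions):
    """Mean Reciprocal Rank: average of 1/rank of the first correct candidate."""
    total = 0.0
    for gold, preds in zip(golds, predictions):
        for rank, p in enumerate(preds, start=1):
            if p in gold:
                total += 1.0 / rank
                break
    return total / len(golds)

golds = [{"スミス"}, {"トム"}]      # reference transliterations (sets allow variants)
preds = [["スミス", "スミズ"],      # correct at rank 1
         ["タム", "トム"]]          # correct at rank 2
print(word_accuracy(golds, preds))  # → 0.5
print(mrr(golds, preds))            # → 0.75
```

MAP extends MRR by averaging precision over every correct candidate in the list rather than stopping at the first, which matters when a name has several acceptable transliterations.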

Page 32: Introduction to Machine Transliteration

Experiments

• MIRA outperformed the perceptron and other methods

Page 33: Introduction to Machine Transliteration

Conclusion

• Proposed a joint model for transliteration / letter-to-phoneme conversion

• MIRA outperformed the structured perceptron

• Features including unigrams and linear-chain features perform well

Page 34: Introduction to Machine Transliteration

References

• [Zhang+ 12] Whitepaper of NEWS 2012 Shared Task on Machine Transliteration

• [Jiampojamarn+ 07] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion

• [Jiampojamarn+ 08] Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion

Page 35: Introduction to Machine Transliteration

Any Questions?