improving machine transliteration performance by using multiple transliteration models

12
Improving Machine Transliteration Performance by Using Multiple Transliteration Models Jong-Hoon Oh 1 , Key-Sun Choi 2 , and Hitoshi Isahara 1 1 Computational Linguistics Group, National Institute of Information and Communications Technology (NICT), 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289 Japan {rovellia, isahara}@nict.go.jp 2 Computer Science Division, EECS, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701 Republic of Korea [email protected] Abstract. Machine transliteration has received significant attention as a supporting tool for machine translation and cross-language informa- tion retrieval. During the last decade, four kinds of transliteration model have been studied — grapheme-based model, phoneme-based model, hy- brid model, and correspondence-based model. These models are classi- fied in terms of the information sources for transliteration or the units to be transliterated — source graphemes, source phonemes, both source graphemes and source phonemes, and the correspondence between source graphemes and phonemes, respectively. Although each transliteration model has shown relatively good performance, one model alone has lim- itations on handling complex transliteration behaviors. To address the problem, we combined different transliteration models with a “generat- ing transliterations followed by their validation ” strategy. The strategy makes it possible to consider complex transliteration behaviors using the strengths of each model and to improve transliteration performance by validating transliterations. Our method makes use of web-based and transliteration model-based validation for transliteration validation. Ex- periments showed that our method outperforms both the individual transliteration models and previous work. 1 Introduction Machine transliteration has received significant attention as a supporting tool for machine translation (MT) [1,2] and cross-language information retrieval (CLIR) [3,4]. During the last decade, several transliteration models – grapheme 1 - based transliteration model (GTM) [5,6,7,8], phoneme 2 -based transliteration model (PTM) [1,9,10], hybrid transliteration model (HTM) [2,11], and correspondence- based transliteration model (CTM) [12,13,14] – have been proposed. These models 1 Graphemes refer to the basic units (or the smallest contrastive units) of a written language: for example, English has 26 graphemes or letters. 2 Phonemes are the simplest significant unit of sound. We used ARPAbet symbols to represent source phonemes (http://www.cs.cmu.edu/ ~ laura/pages/arpabet.ps ). Y. Matsumoto et al. (Eds.): ICCPOL 2006, LNAI 4285, pp. 85–96, 2006. c Springer-Verlag Berlin Heidelberg 2006

Upload: independent

Post on 15-Nov-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Improving Machine Transliteration Performance

by Using Multiple Transliteration Models

Jong-Hoon Oh1 Key-Sun Choi2 and Hitoshi Isahara1

1 Computational Linguistics Group National Institute of Information andCommunications Technology (NICT) 3-5 Hikaridai Seika-cho Soraku-gun Kyoto

619-0289 Japanrovellia isaharanictgojp

2 Computer Science Division EECS KAIST 373-1 Guseong-dong Yuseong-guDaejeon 305-701 Republic of Korea

kschoicskaistackr

Abstract Machine transliteration has received significant attention asa supporting tool for machine translation and cross-language informa-tion retrieval During the last decade four kinds of transliteration modelhave been studied mdash grapheme-based model phoneme-based model hy-brid model and correspondence-based model These models are classi-fied in terms of the information sources for transliteration or the unitsto be transliterated mdash source graphemes source phonemes both sourcegraphemes and source phonemes and the correspondence between sourcegraphemes and phonemes respectively Although each transliterationmodel has shown relatively good performance one model alone has lim-itations on handling complex transliteration behaviors To address theproblem we combined different transliteration models with a ldquogenerat-ing transliterations followed by their validationrdquo strategy The strategymakes it possible to consider complex transliteration behaviors usingthe strengths of each model and to improve transliteration performanceby validating transliterations Our method makes use of web-based andtransliteration model-based validation for transliteration validation Ex-periments showed that our method outperforms both the individualtransliteration models and previous work

1 Introduction

Machine transliteration has received significant attention as a supporting toolfor machine translation (MT) [12] and cross-language information retrieval(CLIR) [34] During the last decade several transliteration models ndash grapheme1-based transliteration model (GTM) [5678] phoneme2-based transliteration model(PTM) [1910] hybrid transliteration model (HTM) [211] and correspondence-based transliteration model (CTM) [121314] ndash have been proposed These models1 Graphemes refer to the basic units (or the smallest contrastive units) of a written

language for example English has 26 graphemes or letters2 Phonemes are the simplest significant unit of sound We used ARPAbet symbols to

represent source phonemes (httpwwwcscmuedu~laurapagesarpabetps)

Y Matsumoto et al (Eds) ICCPOL 2006 LNAI 4285 pp 85ndash96 2006ccopy Springer-Verlag Berlin Heidelberg 2006

86 J-H Oh K-S Choi and H Isahara

are classified in terms of the information sources for transliteration or the units tobe transliterated GTM PTM HTM and CTM make use of source graphemessource phonemes both source graphemes and source phonemes and the cor-respondence between source graphemes and phonemes respectively Althougheach transliteration model has shown relatively good performance it often pro-duced transliterations with errors The errors are mainly caused by complextransliteration behaviors meaning that a transliteration process dynamicallyuses both source graphemes and source phonemes Sometimes either sourcegraphemes or source phonemes contribute to the transliteration process whilesometimes both contribute Therefore it is hard to consider the complex translit-eration behaviors depending on one transliteration model because one modeljust concentrates on only one of the complex transliteration behaviors To ad-dress this problem we combined the different transliteration models with aldquogenerating transliterations followed by their validationrdquo strategy as shown inFig 1 First we generate transliteration candidates (or a list of translitera-tions) using GTM PTM HTM and CTM Then we validate the candidatesusing two measures mdash a transliteration model-based measure and a web-basedmeasure

Fig 1 System architecture

This paper is organized as follows In section 2 we review previous work basedon the four transliteration models In section 3 we describe the framework ofdifferent transliteration models and in section 4 we describe the translitera-tion validation In section 5 we describe our experiments and results We thenconclude in section 6

Improving Machine Transliteration Performance 87

2 Previous Work

21 Grapheme-Based Transliteration Model

The grapheme-based transliteration model (GTM) is conceptually a direct or-thographical mapping model from source graphemes to target graphemes Sev-eral different transliteration methods have been proposed within this frameworkKang amp Choi [5] proposed a decision tree-based transliteration method Deci-sion trees which transform each source grapheme into target graphemes arelearned and then they are directly applied to machine transliteration Kang ampKim [6] and Goto et al [7] proposed a method based on a transliteration networkThe transliteration network is composed of nodes and arcs A node represents achunk of source graphemes and its corresponding target grapheme An arc rep-resents a possible link between nodes and it has a weight showing its strengthLi et al [8] used a joint source-channel model to simultaneously model boththe source language and the target language contexts (bigram and trigram) formachine transliteration Its main advantage is the use of bilingual contexts

The main drawback of GTM is that it does not consider any phonetic aspectof transliteration

22 Phoneme-Based Transliteration Model

Basically the phoneme-based transliteration model (PTM) is composed of sourcegrapheme-to-source phoneme transformation and source phoneme-to-targetgrapheme transformation Knight amp Graehl [1] modeled Japanese-to-Englishtransliteration with weighted finite state transducers (WFSTs) by combiningseveral parameters such as romaji-to-phoneme phoneme-to-English Englishword probability models and so on Meng et al [10] proposed an English-to-Chinese transliteration model It was based on English grapheme-to-phonemeconversion cross-lingual phonological rules and mapping rules between Eng-lish and Chinese phonemes and Chinese syllable-based and character-based lan-guage models Jung et al [9] modeled English-to-Korean transliteration withextended Markov window First they transformed an English word into Englishpronunciation by using a pronunciation dictionary Then they segmented theEnglish phonemes into chunk of English phonemes which corresponds to oneKorean grapheme by using predefined handcrafted rules Finally they automat-ically transformed each chunk of English phoneme into Korean graphemes byusing extended Markov window

The main drawback of PTM is error propagation caused by its two-step pro-cedure ndash errors in source grapheme-to-source phoneme transformation make itdifficult to generate correct transliterations in the next step

23 Hybrid Transliteration Model and Correspondence-BasedTransliteration Model

There have been attempts to use both source graphemes and source phonemesin machine transliteration Such research falls into two categories the

88 J-H Oh K-S Choi and H Isahara

correspondence-based transliteration model (CTM) [121314] and the hybridtransliteration model (HTM) [211] The CTM makes use of the correspondencebetween a source grapheme and a source phoneme when it produces targetlanguage graphemes the HTM just combines GTM and PTM through linearinterpolation The hybrid transliteration model requires the grapheme-basedtransliteration probability (Pr(GTM)) and phoneme-based transliteration prob-ability (Pr(PTM)) and then it combines the two probabilities through linearinterpolation

Oh amp Choi [12] considered the contexts of a source grapheme and its cor-responding source phoneme for English-to-Korean transliteration It is basedon semi-automatically constructed context-sensitive rewrite rules in a formAXB rarr y meaning that X is rewritten as target grapheme y in the con-text A and B Note that X A and B represent correspondence between Englishgrapheme and phoneme like ldquor |R|rdquo ndash English grapheme r corresponding toEnglish phoneme |R| Oh amp Choi [1314] trained a generative model representingtransliteration rules by using the correspondence between source grapheme andsource phoneme and machine learning algorithms The correspondence makesit possible to model machine transliteration in a more sophisticated manner

Several researchers [211] have proposed hybrid model-based transliterationmethods They modeled GTM and PTM with WFSTs or a source-channelmodel Then they combined GTM and PTM through linear interpolation Intheir PTM several parameters are considered such as the source grapheme-to-source phoneme probability source phoneme-to-target grapheme probabilitytarget language word probability and so on In their GTM the source grapheme-to-target grapheme probability is mainly considered

3 Framework of Different Transliteration Models

Let SW be a source word PSW be the pronunciation of SW TSW be a targetword corresponding to SW and CSW be a correspondence between SW and PSW PSW and TSW can be segmented into a series of sub-strings each of which corre-sponds to a source grapheme Then we can write SW = s1 middot middot middot sn = sn

1 PSW =p1 middot middot middot pn = pn

1 TSW = t1 middot middot middot tn = tn1 and CSW = c1 middot middot middot cn = cn1 where si pi

ti and ci = lt si pi gt represent the ith source grapheme source phonemes corre-sponding to si target graphemes corresponding to si and pi and the correspon-dence between si and pi respectively With this definition GTM PTM CTMand HTM can be represented as Eqs (1) (2) (3) and (4) respectively

Prg(TSW |SW ) = Pr(tn1 |sn1 ) asymp

prod

i

Pr(ti|timinus1iminusk si+k

iminusk) (1)

Prp(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |pn1 ) (2)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk pi+k

iminusk)

Prc(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |cn1 ) (3)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk ci+k

iminusk)

Improving Machine Transliteration Performance 89

Prh(TSW |SW ) = α times Prp(TSW |SW ) + (1 minus α) times Prg(TSW |SW ) (4)

With the assumption that each transliteration model depends on the size ofthe contexts k Eqs (1) (2) (3) and (4) can be simplified To estimate theprobabilities in Eqs (1) (2) (3) and (4) we used the maximum entropy modelwhich can effectively incorporate heterogeneous information [15] In the maxi-mum entropy model event ev is composed of a target event (te) and a historyevent (he) and it is represented by a bundle of feature functions (fi(he te))which represent the existence of certain characteristics in the event ev The fea-ture function enables a model based on the maximum entropy model to estimateprobability [15] Therefore designing the feature functions which effectively sup-port certain decisions made by the model is important Our basic philosophyfor the feature function design for each transliteration model is that the con-text information collocated with the unit of interest is important With thisphilosophy we designed the feature functions with all possible combinations of(si+k

iminusk pi+kiminusk ci+k

iminusk and timinus1iminusk) Generally a conditional maximum entropy model

is an exponential log-linear model that gives the conditional probability of eventev =lt te he gt as described in Eq (5) where λi is a parameter to be estimatedand Z(he) is the normalizing factor [15]

Pr(te|he) =1

Z(he)exp(

sum

i

λifi(he te)) (5)

Z(he) =sum

te

exp(sum

i

λifi(he te))

With Eq (5) and feature functions conditional probabilities can be estimatedin Eqs (1) (2) (3) and (4) For example we can write Pr(ti|timinus1

iminusk ci+kiminusk) =

Pr(teCTM |heCTM ) because we can represent target events (teCTM ) and historyevents (heCTM ) of CTM as ti and tuples lt timinus1

iminusk ci+kiminusk gt respectively In the same

way Pr(ti|timinus1iminusk si+k

iminusk) Pr(ti|timinus1iminusk pi+k

iminusk) and Pr(pi|piminus1iminusk si+k

iminusk) can be representedas Pr(te|he) with their target events and history events We used a maximumentropy modeling tool [16] to estimate Eqs (1) (2) (3) and (4)

4 Transliteration Validation

We validated transliterations by using web-based validation Sweb(s tci) andtransliteration model-based validation Stm(s tci) like in Eq (6) Using Eq (6)we can validate transliterations in a more correct and robust manner becauseSweb(s tci) reflects real-world usage of the transliterations in web data andStm(s tci) ranks the transliterations independent of the web data

STV (s tci) = Stm(s tci) times Sweb(s tci) (6)

41 Transliteration Model-Based Validation Stm

Our transliteration model-based validation Stm uses the rank assigned by eachtransliteration model For a given source word (s) each transliteration modelgenerates transliterations (tci in TC) and ranks them using the probability

90 J-H Oh K-S Choi and H Isahara

described in Eqs (1) (2) (3) and (4) The underlying assumption in Stm isthat the rank of the correct transliterations tends to be higher on average thanthe wrong ones With this assumption we represented Stm(s tci) as Eq (7)where Rankg(tci) Rankp(tci) Rankh(tci) and Rankc(tci) represent the rankof tci assigned by GTM PTM HTM and CTM respectively

Stm(s tci) =14times (

1Rankg(tci)

+1

Rankp(tci)+

1Rankh(tci)

+1

Rankc(tci)) (7)

42 Web-Based Validation Sweb

Korean or Japanese web pages are usually composed of rich texts in a mixtureof Korean or Japanese (main language) and English (auxiliary language) Lets and t be a source language word and a target language word respectivelyWe observed that s and t tend to be near each other in the text of Korean orJapanese web pages when the authors of the web pages describe s as translationof t or vice versa We retrieved such web pages for transliteration validation

There have been several web-based validation methods for translation valida-tion [1718] or transliteration validation [219] They usually rely on the web fre-quency (the number of web pages) derived from ldquoBilingual Keyword Search(BKS)rdquo [21718] or ldquoMonolingual Keyword Search (MKS)rdquo [219] BKS re-trieves web pages by using a query composed of two keywords s and t while MKSretrieves web pages by using a query composed of t Qu amp Grefenstette [17] andWang et al [18] proposed BKS-based translation validation methods such as rel-ative web frequency and chi-square (χ2) test Al-Onaizan amp Knight [2] used bothMKS and BKS and Grefenstette et al [19] used only MKS for validating translit-erations However web pages retrieved by MKS tend to show whether t is used intarget language texts rather than whether t is a translation of s BKS frequentlyretrieves web pages where s and t have little relation to each other because it doesnot consider distance between s and t in the web pages To address these prob-lems we developed a validation method based on ldquoBilingual Phrasal Search(BPS)rdquo where a phrase composed of s and t is used as a query for a search engineLet lsquo[s t]rsquo or lsquo[t s]rsquo lsquos And trsquo and lsquotrsquo respectively be queries for BPS BKS andMKS The difference among BPS BKS and MKS is shown in Fig 2 In Fig 2 lsquo[st]rsquo or lsquo[t s]rsquo retrieves web pages where lsquo[s t]rsquo or lsquo[t s]rsquo exists as phrases while lsquos Andtrsquo retrieves web pages where s and t simply exist in the same document Thereforethe number of web pages retrieved by BPS is more reliable for validating translit-erations because s and t usually have high co-relation in the web pages retrievedby BPS For example web pages retrieved by BPS in Fig 3 usually contain cor-rect Korean and Japanese transliterations and their corresponding English wordamylase as translation pairs in parentheses expression For these reasons BPS ismore suitable for our transliteration validation

Let TC be a set of transliterations (or transliteration candidates) producedby different transliteration models tci be the ith transliteration candidate inTC s be the source language word resulting in TC and WF (s tci) be the web

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

86 J-H Oh K-S Choi and H Isahara

are classified in terms of the information sources for transliteration or the units tobe transliterated GTM PTM HTM and CTM make use of source graphemessource phonemes both source graphemes and source phonemes and the cor-respondence between source graphemes and phonemes respectively Althougheach transliteration model has shown relatively good performance it often pro-duced transliterations with errors The errors are mainly caused by complextransliteration behaviors meaning that a transliteration process dynamicallyuses both source graphemes and source phonemes Sometimes either sourcegraphemes or source phonemes contribute to the transliteration process whilesometimes both contribute Therefore it is hard to consider the complex translit-eration behaviors depending on one transliteration model because one modeljust concentrates on only one of the complex transliteration behaviors To ad-dress this problem we combined the different transliteration models with aldquogenerating transliterations followed by their validationrdquo strategy as shown inFig 1 First we generate transliteration candidates (or a list of translitera-tions) using GTM PTM HTM and CTM Then we validate the candidatesusing two measures mdash a transliteration model-based measure and a web-basedmeasure

Fig 1 System architecture

This paper is organized as follows In section 2 we review previous work basedon the four transliteration models In section 3 we describe the framework ofdifferent transliteration models and in section 4 we describe the translitera-tion validation In section 5 we describe our experiments and results We thenconclude in section 6

Improving Machine Transliteration Performance 87

2 Previous Work

21 Grapheme-Based Transliteration Model

The grapheme-based transliteration model (GTM) is conceptually a direct or-thographical mapping model from source graphemes to target graphemes Sev-eral different transliteration methods have been proposed within this frameworkKang amp Choi [5] proposed a decision tree-based transliteration method Deci-sion trees which transform each source grapheme into target graphemes arelearned and then they are directly applied to machine transliteration Kang ampKim [6] and Goto et al [7] proposed a method based on a transliteration networkThe transliteration network is composed of nodes and arcs A node represents achunk of source graphemes and its corresponding target grapheme An arc rep-resents a possible link between nodes and it has a weight showing its strengthLi et al [8] used a joint source-channel model to simultaneously model boththe source language and the target language contexts (bigram and trigram) formachine transliteration Its main advantage is the use of bilingual contexts

The main drawback of GTM is that it does not consider any phonetic aspectof transliteration

22 Phoneme-Based Transliteration Model

Basically the phoneme-based transliteration model (PTM) is composed of sourcegrapheme-to-source phoneme transformation and source phoneme-to-targetgrapheme transformation Knight amp Graehl [1] modeled Japanese-to-Englishtransliteration with weighted finite state transducers (WFSTs) by combiningseveral parameters such as romaji-to-phoneme phoneme-to-English Englishword probability models and so on Meng et al [10] proposed an English-to-Chinese transliteration model It was based on English grapheme-to-phonemeconversion cross-lingual phonological rules and mapping rules between Eng-lish and Chinese phonemes and Chinese syllable-based and character-based lan-guage models Jung et al [9] modeled English-to-Korean transliteration withextended Markov window First they transformed an English word into Englishpronunciation by using a pronunciation dictionary Then they segmented theEnglish phonemes into chunk of English phonemes which corresponds to oneKorean grapheme by using predefined handcrafted rules Finally they automat-ically transformed each chunk of English phoneme into Korean graphemes byusing extended Markov window

The main drawback of PTM is error propagation caused by its two-step pro-cedure ndash errors in source grapheme-to-source phoneme transformation make itdifficult to generate correct transliterations in the next step

23 Hybrid Transliteration Model and Correspondence-BasedTransliteration Model

There have been attempts to use both source graphemes and source phonemesin machine transliteration Such research falls into two categories the

88 J-H Oh K-S Choi and H Isahara

correspondence-based transliteration model (CTM) [121314] and the hybridtransliteration model (HTM) [211] The CTM makes use of the correspondencebetween a source grapheme and a source phoneme when it produces targetlanguage graphemes the HTM just combines GTM and PTM through linearinterpolation The hybrid transliteration model requires the grapheme-basedtransliteration probability (Pr(GTM)) and phoneme-based transliteration prob-ability (Pr(PTM)) and then it combines the two probabilities through linearinterpolation

Oh amp Choi [12] considered the contexts of a source grapheme and its cor-responding source phoneme for English-to-Korean transliteration It is basedon semi-automatically constructed context-sensitive rewrite rules in a formAXB rarr y meaning that X is rewritten as target grapheme y in the con-text A and B Note that X A and B represent correspondence between Englishgrapheme and phoneme like ldquor |R|rdquo ndash English grapheme r corresponding toEnglish phoneme |R| Oh amp Choi [1314] trained a generative model representingtransliteration rules by using the correspondence between source grapheme andsource phoneme and machine learning algorithms The correspondence makesit possible to model machine transliteration in a more sophisticated manner

Several researchers [211] have proposed hybrid model-based transliterationmethods They modeled GTM and PTM with WFSTs or a source-channelmodel Then they combined GTM and PTM through linear interpolation Intheir PTM several parameters are considered such as the source grapheme-to-source phoneme probability source phoneme-to-target grapheme probabilitytarget language word probability and so on In their GTM the source grapheme-to-target grapheme probability is mainly considered

3 Framework of Different Transliteration Models

Let SW be a source word PSW be the pronunciation of SW TSW be a targetword corresponding to SW and CSW be a correspondence between SW and PSW PSW and TSW can be segmented into a series of sub-strings each of which corre-sponds to a source grapheme Then we can write SW = s1 middot middot middot sn = sn

1 PSW =p1 middot middot middot pn = pn

1 TSW = t1 middot middot middot tn = tn1 and CSW = c1 middot middot middot cn = cn1 where si pi

ti and ci = lt si pi gt represent the ith source grapheme source phonemes corre-sponding to si target graphemes corresponding to si and pi and the correspon-dence between si and pi respectively With this definition GTM PTM CTMand HTM can be represented as Eqs (1) (2) (3) and (4) respectively

Prg(TSW |SW ) = Pr(tn1 |sn1 ) asymp

prod

i

Pr(ti|timinus1iminusk si+k

iminusk) (1)

Prp(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |pn1 ) (2)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk pi+k

iminusk)

Prc(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |cn1 ) (3)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk ci+k

iminusk)

Improving Machine Transliteration Performance 89

Prh(TSW |SW ) = α times Prp(TSW |SW ) + (1 minus α) times Prg(TSW |SW ) (4)

With the assumption that each transliteration model depends on the size ofthe contexts k Eqs (1) (2) (3) and (4) can be simplified To estimate theprobabilities in Eqs (1) (2) (3) and (4) we used the maximum entropy modelwhich can effectively incorporate heterogeneous information [15] In the maxi-mum entropy model event ev is composed of a target event (te) and a historyevent (he) and it is represented by a bundle of feature functions (fi(he te))which represent the existence of certain characteristics in the event ev The fea-ture function enables a model based on the maximum entropy model to estimateprobability [15] Therefore designing the feature functions which effectively sup-port certain decisions made by the model is important Our basic philosophyfor the feature function design for each transliteration model is that the con-text information collocated with the unit of interest is important With thisphilosophy we designed the feature functions with all possible combinations of(si+k

iminusk pi+kiminusk ci+k

iminusk and timinus1iminusk) Generally a conditional maximum entropy model

is an exponential log-linear model that gives the conditional probability of eventev =lt te he gt as described in Eq (5) where λi is a parameter to be estimatedand Z(he) is the normalizing factor [15]

Pr(te|he) =1

Z(he)exp(

sum

i

λifi(he te)) (5)

Z(he) =sum

te

exp(sum

i

λifi(he te))

With Eq (5) and feature functions conditional probabilities can be estimatedin Eqs (1) (2) (3) and (4) For example we can write Pr(ti|timinus1

iminusk ci+kiminusk) =

Pr(teCTM |heCTM ) because we can represent target events (teCTM ) and historyevents (heCTM ) of CTM as ti and tuples lt timinus1

iminusk ci+kiminusk gt respectively In the same

way Pr(ti|timinus1iminusk si+k

iminusk) Pr(ti|timinus1iminusk pi+k

iminusk) and Pr(pi|piminus1iminusk si+k

iminusk) can be representedas Pr(te|he) with their target events and history events We used a maximumentropy modeling tool [16] to estimate Eqs (1) (2) (3) and (4)

4 Transliteration Validation

We validated transliterations by using web-based validation Sweb(s tci) andtransliteration model-based validation Stm(s tci) like in Eq (6) Using Eq (6)we can validate transliterations in a more correct and robust manner becauseSweb(s tci) reflects real-world usage of the transliterations in web data andStm(s tci) ranks the transliterations independent of the web data

STV (s tci) = Stm(s tci) times Sweb(s tci) (6)

41 Transliteration Model-Based Validation Stm

Our transliteration model-based validation Stm uses the rank assigned by eachtransliteration model For a given source word (s) each transliteration modelgenerates transliterations (tci in TC) and ranks them using the probability

90 J-H Oh K-S Choi and H Isahara

described in Eqs (1) (2) (3) and (4) The underlying assumption in Stm isthat the rank of the correct transliterations tends to be higher on average thanthe wrong ones With this assumption we represented Stm(s tci) as Eq (7)where Rankg(tci) Rankp(tci) Rankh(tci) and Rankc(tci) represent the rankof tci assigned by GTM PTM HTM and CTM respectively

Stm(s tci) =14times (

1Rankg(tci)

+1

Rankp(tci)+

1Rankh(tci)

+1

Rankc(tci)) (7)

42 Web-Based Validation Sweb

Korean or Japanese web pages are usually composed of rich texts in a mixtureof Korean or Japanese (main language) and English (auxiliary language) Lets and t be a source language word and a target language word respectivelyWe observed that s and t tend to be near each other in the text of Korean orJapanese web pages when the authors of the web pages describe s as translationof t or vice versa We retrieved such web pages for transliteration validation

There have been several web-based validation methods for translation valida-tion [1718] or transliteration validation [219] They usually rely on the web fre-quency (the number of web pages) derived from ldquoBilingual Keyword Search(BKS)rdquo [21718] or ldquoMonolingual Keyword Search (MKS)rdquo [219] BKS re-trieves web pages by using a query composed of two keywords s and t while MKSretrieves web pages by using a query composed of t Qu amp Grefenstette [17] andWang et al [18] proposed BKS-based translation validation methods such as rel-ative web frequency and chi-square (χ2) test Al-Onaizan amp Knight [2] used bothMKS and BKS and Grefenstette et al [19] used only MKS for validating translit-erations However web pages retrieved by MKS tend to show whether t is used intarget language texts rather than whether t is a translation of s BKS frequentlyretrieves web pages where s and t have little relation to each other because it doesnot consider distance between s and t in the web pages To address these prob-lems we developed a validation method based on ldquoBilingual Phrasal Search(BPS)rdquo where a phrase composed of s and t is used as a query for a search engineLet lsquo[s t]rsquo or lsquo[t s]rsquo lsquos And trsquo and lsquotrsquo respectively be queries for BPS BKS andMKS The difference among BPS BKS and MKS is shown in Fig 2 In Fig 2 lsquo[st]rsquo or lsquo[t s]rsquo retrieves web pages where lsquo[s t]rsquo or lsquo[t s]rsquo exists as phrases while lsquos Andtrsquo retrieves web pages where s and t simply exist in the same document Thereforethe number of web pages retrieved by BPS is more reliable for validating translit-erations because s and t usually have high co-relation in the web pages retrievedby BPS For example web pages retrieved by BPS in Fig 3 usually contain cor-rect Korean and Japanese transliterations and their corresponding English wordamylase as translation pairs in parentheses expression For these reasons BPS ismore suitable for our transliteration validation

Let TC be a set of transliterations (or transliteration candidates) producedby different transliteration models tci be the ith transliteration candidate inTC s be the source language word resulting in TC and WF (s tci) be the web

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

Improving Machine Transliteration Performance 87

2 Previous Work

21 Grapheme-Based Transliteration Model

The grapheme-based transliteration model (GTM) is conceptually a direct or-thographical mapping model from source graphemes to target graphemes Sev-eral different transliteration methods have been proposed within this frameworkKang amp Choi [5] proposed a decision tree-based transliteration method Deci-sion trees which transform each source grapheme into target graphemes arelearned and then they are directly applied to machine transliteration Kang ampKim [6] and Goto et al [7] proposed a method based on a transliteration networkThe transliteration network is composed of nodes and arcs A node represents achunk of source graphemes and its corresponding target grapheme An arc rep-resents a possible link between nodes and it has a weight showing its strengthLi et al [8] used a joint source-channel model to simultaneously model boththe source language and the target language contexts (bigram and trigram) formachine transliteration Its main advantage is the use of bilingual contexts

The main drawback of GTM is that it does not consider any phonetic aspectof transliteration

22 Phoneme-Based Transliteration Model

Basically the phoneme-based transliteration model (PTM) is composed of sourcegrapheme-to-source phoneme transformation and source phoneme-to-targetgrapheme transformation Knight amp Graehl [1] modeled Japanese-to-Englishtransliteration with weighted finite state transducers (WFSTs) by combiningseveral parameters such as romaji-to-phoneme phoneme-to-English Englishword probability models and so on Meng et al [10] proposed an English-to-Chinese transliteration model It was based on English grapheme-to-phonemeconversion cross-lingual phonological rules and mapping rules between Eng-lish and Chinese phonemes and Chinese syllable-based and character-based lan-guage models Jung et al [9] modeled English-to-Korean transliteration withextended Markov window First they transformed an English word into Englishpronunciation by using a pronunciation dictionary Then they segmented theEnglish phonemes into chunk of English phonemes which corresponds to oneKorean grapheme by using predefined handcrafted rules Finally they automat-ically transformed each chunk of English phoneme into Korean graphemes byusing extended Markov window

The main drawback of PTM is error propagation caused by its two-step pro-cedure ndash errors in source grapheme-to-source phoneme transformation make itdifficult to generate correct transliterations in the next step

23 Hybrid Transliteration Model and Correspondence-BasedTransliteration Model

There have been attempts to use both source graphemes and source phonemesin machine transliteration Such research falls into two categories the

88 J-H Oh K-S Choi and H Isahara

correspondence-based transliteration model (CTM) [121314] and the hybridtransliteration model (HTM) [211] The CTM makes use of the correspondencebetween a source grapheme and a source phoneme when it produces targetlanguage graphemes the HTM just combines GTM and PTM through linearinterpolation The hybrid transliteration model requires the grapheme-basedtransliteration probability (Pr(GTM)) and phoneme-based transliteration prob-ability (Pr(PTM)) and then it combines the two probabilities through linearinterpolation

Oh amp Choi [12] considered the contexts of a source grapheme and its cor-responding source phoneme for English-to-Korean transliteration It is basedon semi-automatically constructed context-sensitive rewrite rules in a formAXB rarr y meaning that X is rewritten as target grapheme y in the con-text A and B Note that X A and B represent correspondence between Englishgrapheme and phoneme like ldquor |R|rdquo ndash English grapheme r corresponding toEnglish phoneme |R| Oh amp Choi [1314] trained a generative model representingtransliteration rules by using the correspondence between source grapheme andsource phoneme and machine learning algorithms The correspondence makesit possible to model machine transliteration in a more sophisticated manner

Several researchers [211] have proposed hybrid model-based transliterationmethods They modeled GTM and PTM with WFSTs or a source-channelmodel Then they combined GTM and PTM through linear interpolation Intheir PTM several parameters are considered such as the source grapheme-to-source phoneme probability source phoneme-to-target grapheme probabilitytarget language word probability and so on In their GTM the source grapheme-to-target grapheme probability is mainly considered

3 Framework of Different Transliteration Models

Let SW be a source word PSW be the pronunciation of SW TSW be a targetword corresponding to SW and CSW be a correspondence between SW and PSW PSW and TSW can be segmented into a series of sub-strings each of which corre-sponds to a source grapheme Then we can write SW = s1 middot middot middot sn = sn

1 PSW =p1 middot middot middot pn = pn

1 TSW = t1 middot middot middot tn = tn1 and CSW = c1 middot middot middot cn = cn1 where si pi

ti and ci = lt si pi gt represent the ith source grapheme source phonemes corre-sponding to si target graphemes corresponding to si and pi and the correspon-dence between si and pi respectively With this definition GTM PTM CTMand HTM can be represented as Eqs (1) (2) (3) and (4) respectively

Prg(TSW |SW ) = Pr(tn1 |sn1 ) asymp

prod

i

Pr(ti|timinus1iminusk si+k

iminusk) (1)

Prp(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |pn1 ) (2)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk pi+k

iminusk)

Prc(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |cn1 ) (3)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk ci+k

iminusk)

Improving Machine Transliteration Performance 89

Prh(TSW |SW ) = α times Prp(TSW |SW ) + (1 minus α) times Prg(TSW |SW ) (4)

With the assumption that each transliteration model depends on the size ofthe contexts k Eqs (1) (2) (3) and (4) can be simplified To estimate theprobabilities in Eqs (1) (2) (3) and (4) we used the maximum entropy modelwhich can effectively incorporate heterogeneous information [15] In the maxi-mum entropy model event ev is composed of a target event (te) and a historyevent (he) and it is represented by a bundle of feature functions (fi(he te))which represent the existence of certain characteristics in the event ev The fea-ture function enables a model based on the maximum entropy model to estimateprobability [15] Therefore designing the feature functions which effectively sup-port certain decisions made by the model is important Our basic philosophyfor the feature function design for each transliteration model is that the con-text information collocated with the unit of interest is important With thisphilosophy we designed the feature functions with all possible combinations of(si+k

iminusk pi+kiminusk ci+k

iminusk and timinus1iminusk) Generally a conditional maximum entropy model

is an exponential log-linear model that gives the conditional probability of eventev =lt te he gt as described in Eq (5) where λi is a parameter to be estimatedand Z(he) is the normalizing factor [15]

Pr(te|he) =1

Z(he)exp(

sum

i

λifi(he te)) (5)

Z(he) =sum

te

exp(sum

i

λifi(he te))

With Eq (5) and feature functions conditional probabilities can be estimatedin Eqs (1) (2) (3) and (4) For example we can write Pr(ti|timinus1

iminusk ci+kiminusk) =

Pr(teCTM |heCTM ) because we can represent target events (teCTM ) and historyevents (heCTM ) of CTM as ti and tuples lt timinus1

iminusk ci+kiminusk gt respectively In the same

way Pr(ti|timinus1iminusk si+k

iminusk) Pr(ti|timinus1iminusk pi+k

iminusk) and Pr(pi|piminus1iminusk si+k

iminusk) can be representedas Pr(te|he) with their target events and history events We used a maximumentropy modeling tool [16] to estimate Eqs (1) (2) (3) and (4)

4 Transliteration Validation

We validated transliterations by using web-based validation Sweb(s tci) andtransliteration model-based validation Stm(s tci) like in Eq (6) Using Eq (6)we can validate transliterations in a more correct and robust manner becauseSweb(s tci) reflects real-world usage of the transliterations in web data andStm(s tci) ranks the transliterations independent of the web data

STV (s tci) = Stm(s tci) times Sweb(s tci) (6)

41 Transliteration Model-Based Validation Stm

Our transliteration model-based validation Stm uses the rank assigned by eachtransliteration model For a given source word (s) each transliteration modelgenerates transliterations (tci in TC) and ranks them using the probability

90 J-H Oh K-S Choi and H Isahara

described in Eqs (1) (2) (3) and (4) The underlying assumption in Stm isthat the rank of the correct transliterations tends to be higher on average thanthe wrong ones With this assumption we represented Stm(s tci) as Eq (7)where Rankg(tci) Rankp(tci) Rankh(tci) and Rankc(tci) represent the rankof tci assigned by GTM PTM HTM and CTM respectively

Stm(s tci) =14times (

1Rankg(tci)

+1

Rankp(tci)+

1Rankh(tci)

+1

Rankc(tci)) (7)

42 Web-Based Validation Sweb

Korean or Japanese web pages are usually composed of rich texts in a mixtureof Korean or Japanese (main language) and English (auxiliary language) Lets and t be a source language word and a target language word respectivelyWe observed that s and t tend to be near each other in the text of Korean orJapanese web pages when the authors of the web pages describe s as translationof t or vice versa We retrieved such web pages for transliteration validation

There have been several web-based validation methods for translation valida-tion [1718] or transliteration validation [219] They usually rely on the web fre-quency (the number of web pages) derived from ldquoBilingual Keyword Search(BKS)rdquo [21718] or ldquoMonolingual Keyword Search (MKS)rdquo [219] BKS re-trieves web pages by using a query composed of two keywords s and t while MKSretrieves web pages by using a query composed of t Qu amp Grefenstette [17] andWang et al [18] proposed BKS-based translation validation methods such as rel-ative web frequency and chi-square (χ2) test Al-Onaizan amp Knight [2] used bothMKS and BKS and Grefenstette et al [19] used only MKS for validating translit-erations However web pages retrieved by MKS tend to show whether t is used intarget language texts rather than whether t is a translation of s BKS frequentlyretrieves web pages where s and t have little relation to each other because it doesnot consider distance between s and t in the web pages To address these prob-lems we developed a validation method based on ldquoBilingual Phrasal Search(BPS)rdquo where a phrase composed of s and t is used as a query for a search engineLet lsquo[s t]rsquo or lsquo[t s]rsquo lsquos And trsquo and lsquotrsquo respectively be queries for BPS BKS andMKS The difference among BPS BKS and MKS is shown in Fig 2 In Fig 2 lsquo[st]rsquo or lsquo[t s]rsquo retrieves web pages where lsquo[s t]rsquo or lsquo[t s]rsquo exists as phrases while lsquos Andtrsquo retrieves web pages where s and t simply exist in the same document Thereforethe number of web pages retrieved by BPS is more reliable for validating translit-erations because s and t usually have high co-relation in the web pages retrievedby BPS For example web pages retrieved by BPS in Fig 3 usually contain cor-rect Korean and Japanese transliterations and their corresponding English wordamylase as translation pairs in parentheses expression For these reasons BPS ismore suitable for our transliteration validation

Let TC be a set of transliterations (or transliteration candidates) producedby different transliteration models tci be the ith transliteration candidate inTC s be the source language word resulting in TC and WF (s tci) be the web

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

88 J-H Oh K-S Choi and H Isahara

correspondence-based transliteration model (CTM) [121314] and the hybridtransliteration model (HTM) [211] The CTM makes use of the correspondencebetween a source grapheme and a source phoneme when it produces targetlanguage graphemes the HTM just combines GTM and PTM through linearinterpolation The hybrid transliteration model requires the grapheme-basedtransliteration probability (Pr(GTM)) and phoneme-based transliteration prob-ability (Pr(PTM)) and then it combines the two probabilities through linearinterpolation

Oh amp Choi [12] considered the contexts of a source grapheme and its cor-responding source phoneme for English-to-Korean transliteration It is basedon semi-automatically constructed context-sensitive rewrite rules in a formAXB rarr y meaning that X is rewritten as target grapheme y in the con-text A and B Note that X A and B represent correspondence between Englishgrapheme and phoneme like ldquor |R|rdquo ndash English grapheme r corresponding toEnglish phoneme |R| Oh amp Choi [1314] trained a generative model representingtransliteration rules by using the correspondence between source grapheme andsource phoneme and machine learning algorithms The correspondence makesit possible to model machine transliteration in a more sophisticated manner

Several researchers [211] have proposed hybrid model-based transliterationmethods They modeled GTM and PTM with WFSTs or a source-channelmodel Then they combined GTM and PTM through linear interpolation Intheir PTM several parameters are considered such as the source grapheme-to-source phoneme probability source phoneme-to-target grapheme probabilitytarget language word probability and so on In their GTM the source grapheme-to-target grapheme probability is mainly considered

3 Framework of Different Transliteration Models

Let SW be a source word PSW be the pronunciation of SW TSW be a targetword corresponding to SW and CSW be a correspondence between SW and PSW PSW and TSW can be segmented into a series of sub-strings each of which corre-sponds to a source grapheme Then we can write SW = s1 middot middot middot sn = sn

1 PSW =p1 middot middot middot pn = pn

1 TSW = t1 middot middot middot tn = tn1 and CSW = c1 middot middot middot cn = cn1 where si pi

ti and ci = lt si pi gt represent the ith source grapheme source phonemes corre-sponding to si target graphemes corresponding to si and pi and the correspon-dence between si and pi respectively With this definition GTM PTM CTMand HTM can be represented as Eqs (1) (2) (3) and (4) respectively

Prg(TSW |SW ) = Pr(tn1 |sn1 ) asymp

prod

i

Pr(ti|timinus1iminusk si+k

iminusk) (1)

Prp(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |pn1 ) (2)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk pi+k

iminusk)

Prc(TSW |SW ) = Pr(pn1 |sn

1 ) times Pr(tn1 |cn1 ) (3)

asympprod

i

Pr(pi|piminus1iminusk si+k

iminusk) times Pr(ti|timinus1iminusk ci+k

iminusk)

Improving Machine Transliteration Performance 89

Prh(TSW |SW ) = α times Prp(TSW |SW ) + (1 minus α) times Prg(TSW |SW ) (4)

With the assumption that each transliteration model depends on the size ofthe contexts k Eqs (1) (2) (3) and (4) can be simplified To estimate theprobabilities in Eqs (1) (2) (3) and (4) we used the maximum entropy modelwhich can effectively incorporate heterogeneous information [15] In the maxi-mum entropy model event ev is composed of a target event (te) and a historyevent (he) and it is represented by a bundle of feature functions (fi(he te))which represent the existence of certain characteristics in the event ev The fea-ture function enables a model based on the maximum entropy model to estimateprobability [15] Therefore designing the feature functions which effectively sup-port certain decisions made by the model is important Our basic philosophyfor the feature function design for each transliteration model is that the con-text information collocated with the unit of interest is important With thisphilosophy we designed the feature functions with all possible combinations of(si+k

iminusk pi+kiminusk ci+k

iminusk and timinus1iminusk) Generally a conditional maximum entropy model

is an exponential log-linear model that gives the conditional probability of eventev =lt te he gt as described in Eq (5) where λi is a parameter to be estimatedand Z(he) is the normalizing factor [15]

Pr(te|he) =1

Z(he)exp(

sum

i

λifi(he te)) (5)

Z(he) =sum

te

exp(sum

i

λifi(he te))

With Eq (5) and feature functions conditional probabilities can be estimatedin Eqs (1) (2) (3) and (4) For example we can write Pr(ti|timinus1

iminusk ci+kiminusk) =

Pr(teCTM |heCTM ) because we can represent target events (teCTM ) and historyevents (heCTM ) of CTM as ti and tuples lt timinus1

iminusk ci+kiminusk gt respectively In the same

way Pr(ti|timinus1iminusk si+k

iminusk) Pr(ti|timinus1iminusk pi+k

iminusk) and Pr(pi|piminus1iminusk si+k

iminusk) can be representedas Pr(te|he) with their target events and history events We used a maximumentropy modeling tool [16] to estimate Eqs (1) (2) (3) and (4)

4 Transliteration Validation

We validated transliterations by using web-based validation Sweb(s tci) andtransliteration model-based validation Stm(s tci) like in Eq (6) Using Eq (6)we can validate transliterations in a more correct and robust manner becauseSweb(s tci) reflects real-world usage of the transliterations in web data andStm(s tci) ranks the transliterations independent of the web data

STV (s tci) = Stm(s tci) times Sweb(s tci) (6)

41 Transliteration Model-Based Validation Stm

Our transliteration model-based validation Stm uses the rank assigned by eachtransliteration model For a given source word (s) each transliteration modelgenerates transliterations (tci in TC) and ranks them using the probability

90 J-H Oh K-S Choi and H Isahara

described in Eqs (1) (2) (3) and (4) The underlying assumption in Stm isthat the rank of the correct transliterations tends to be higher on average thanthe wrong ones With this assumption we represented Stm(s tci) as Eq (7)where Rankg(tci) Rankp(tci) Rankh(tci) and Rankc(tci) represent the rankof tci assigned by GTM PTM HTM and CTM respectively

Stm(s tci) =14times (

1Rankg(tci)

+1

Rankp(tci)+

1Rankh(tci)

+1

Rankc(tci)) (7)

42 Web-Based Validation Sweb

Korean or Japanese web pages are usually composed of rich texts in a mixtureof Korean or Japanese (main language) and English (auxiliary language) Lets and t be a source language word and a target language word respectivelyWe observed that s and t tend to be near each other in the text of Korean orJapanese web pages when the authors of the web pages describe s as translationof t or vice versa We retrieved such web pages for transliteration validation

There have been several web-based validation methods for translation valida-tion [1718] or transliteration validation [219] They usually rely on the web fre-quency (the number of web pages) derived from ldquoBilingual Keyword Search(BKS)rdquo [21718] or ldquoMonolingual Keyword Search (MKS)rdquo [219] BKS re-trieves web pages by using a query composed of two keywords s and t while MKSretrieves web pages by using a query composed of t Qu amp Grefenstette [17] andWang et al [18] proposed BKS-based translation validation methods such as rel-ative web frequency and chi-square (χ2) test Al-Onaizan amp Knight [2] used bothMKS and BKS and Grefenstette et al [19] used only MKS for validating translit-erations However web pages retrieved by MKS tend to show whether t is used intarget language texts rather than whether t is a translation of s BKS frequentlyretrieves web pages where s and t have little relation to each other because it doesnot consider distance between s and t in the web pages To address these prob-lems we developed a validation method based on ldquoBilingual Phrasal Search(BPS)rdquo where a phrase composed of s and t is used as a query for a search engineLet lsquo[s t]rsquo or lsquo[t s]rsquo lsquos And trsquo and lsquotrsquo respectively be queries for BPS BKS andMKS The difference among BPS BKS and MKS is shown in Fig 2 In Fig 2 lsquo[st]rsquo or lsquo[t s]rsquo retrieves web pages where lsquo[s t]rsquo or lsquo[t s]rsquo exists as phrases while lsquos Andtrsquo retrieves web pages where s and t simply exist in the same document Thereforethe number of web pages retrieved by BPS is more reliable for validating translit-erations because s and t usually have high co-relation in the web pages retrievedby BPS For example web pages retrieved by BPS in Fig 3 usually contain cor-rect Korean and Japanese transliterations and their corresponding English wordamylase as translation pairs in parentheses expression For these reasons BPS ismore suitable for our transliteration validation

Let TC be a set of transliterations (or transliteration candidates) producedby different transliteration models tci be the ith transliteration candidate inTC s be the source language word resulting in TC and WF (s tci) be the web

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

Improving Machine Transliteration Performance 89

Prh(TSW |SW ) = α times Prp(TSW |SW ) + (1 minus α) times Prg(TSW |SW ) (4)

With the assumption that each transliteration model depends on the size ofthe contexts k Eqs (1) (2) (3) and (4) can be simplified To estimate theprobabilities in Eqs (1) (2) (3) and (4) we used the maximum entropy modelwhich can effectively incorporate heterogeneous information [15] In the maxi-mum entropy model event ev is composed of a target event (te) and a historyevent (he) and it is represented by a bundle of feature functions (fi(he te))which represent the existence of certain characteristics in the event ev The fea-ture function enables a model based on the maximum entropy model to estimateprobability [15] Therefore designing the feature functions which effectively sup-port certain decisions made by the model is important Our basic philosophyfor the feature function design for each transliteration model is that the con-text information collocated with the unit of interest is important With thisphilosophy we designed the feature functions with all possible combinations of(si+k

iminusk pi+kiminusk ci+k

iminusk and timinus1iminusk) Generally a conditional maximum entropy model

is an exponential log-linear model that gives the conditional probability of eventev =lt te he gt as described in Eq (5) where λi is a parameter to be estimatedand Z(he) is the normalizing factor [15]

Pr(te|he) =1

Z(he)exp(

sum

i

λifi(he te)) (5)

Z(he) =sum

te

exp(sum

i

λifi(he te))

With Eq (5) and feature functions conditional probabilities can be estimatedin Eqs (1) (2) (3) and (4) For example we can write Pr(ti|timinus1

iminusk ci+kiminusk) =

Pr(teCTM |heCTM ) because we can represent target events (teCTM ) and historyevents (heCTM ) of CTM as ti and tuples lt timinus1

iminusk ci+kiminusk gt respectively In the same

way Pr(ti|timinus1iminusk si+k

iminusk) Pr(ti|timinus1iminusk pi+k

iminusk) and Pr(pi|piminus1iminusk si+k

iminusk) can be representedas Pr(te|he) with their target events and history events We used a maximumentropy modeling tool [16] to estimate Eqs (1) (2) (3) and (4)

4 Transliteration Validation

We validated transliterations by using web-based validation Sweb(s tci) andtransliteration model-based validation Stm(s tci) like in Eq (6) Using Eq (6)we can validate transliterations in a more correct and robust manner becauseSweb(s tci) reflects real-world usage of the transliterations in web data andStm(s tci) ranks the transliterations independent of the web data

STV (s tci) = Stm(s tci) times Sweb(s tci) (6)

41 Transliteration Model-Based Validation Stm

Our transliteration model-based validation Stm uses the rank assigned by eachtransliteration model For a given source word (s) each transliteration modelgenerates transliterations (tci in TC) and ranks them using the probability

90 J-H Oh K-S Choi and H Isahara

described in Eqs (1) (2) (3) and (4) The underlying assumption in Stm isthat the rank of the correct transliterations tends to be higher on average thanthe wrong ones With this assumption we represented Stm(s tci) as Eq (7)where Rankg(tci) Rankp(tci) Rankh(tci) and Rankc(tci) represent the rankof tci assigned by GTM PTM HTM and CTM respectively

Stm(s tci) =14times (

1Rankg(tci)

+1

Rankp(tci)+

1Rankh(tci)

+1

Rankc(tci)) (7)

42 Web-Based Validation Sweb

Korean or Japanese web pages are usually composed of rich texts in a mixtureof Korean or Japanese (main language) and English (auxiliary language) Lets and t be a source language word and a target language word respectivelyWe observed that s and t tend to be near each other in the text of Korean orJapanese web pages when the authors of the web pages describe s as translationof t or vice versa We retrieved such web pages for transliteration validation

There have been several web-based validation methods for translation valida-tion [1718] or transliteration validation [219] They usually rely on the web fre-quency (the number of web pages) derived from ldquoBilingual Keyword Search(BKS)rdquo [21718] or ldquoMonolingual Keyword Search (MKS)rdquo [219] BKS re-trieves web pages by using a query composed of two keywords s and t while MKSretrieves web pages by using a query composed of t Qu amp Grefenstette [17] andWang et al [18] proposed BKS-based translation validation methods such as rel-ative web frequency and chi-square (χ2) test Al-Onaizan amp Knight [2] used bothMKS and BKS and Grefenstette et al [19] used only MKS for validating translit-erations However web pages retrieved by MKS tend to show whether t is used intarget language texts rather than whether t is a translation of s BKS frequentlyretrieves web pages where s and t have little relation to each other because it doesnot consider distance between s and t in the web pages To address these prob-lems we developed a validation method based on ldquoBilingual Phrasal Search(BPS)rdquo where a phrase composed of s and t is used as a query for a search engineLet lsquo[s t]rsquo or lsquo[t s]rsquo lsquos And trsquo and lsquotrsquo respectively be queries for BPS BKS andMKS The difference among BPS BKS and MKS is shown in Fig 2 In Fig 2 lsquo[st]rsquo or lsquo[t s]rsquo retrieves web pages where lsquo[s t]rsquo or lsquo[t s]rsquo exists as phrases while lsquos Andtrsquo retrieves web pages where s and t simply exist in the same document Thereforethe number of web pages retrieved by BPS is more reliable for validating translit-erations because s and t usually have high co-relation in the web pages retrievedby BPS For example web pages retrieved by BPS in Fig 3 usually contain cor-rect Korean and Japanese transliterations and their corresponding English wordamylase as translation pairs in parentheses expression For these reasons BPS ismore suitable for our transliteration validation

Let TC be a set of transliterations (or transliteration candidates) producedby different transliteration models tci be the ith transliteration candidate inTC s be the source language word resulting in TC and WF (s tci) be the web

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

90 J-H Oh K-S Choi and H Isahara

described in Eqs (1) (2) (3) and (4) The underlying assumption in Stm isthat the rank of the correct transliterations tends to be higher on average thanthe wrong ones With this assumption we represented Stm(s tci) as Eq (7)where Rankg(tci) Rankp(tci) Rankh(tci) and Rankc(tci) represent the rankof tci assigned by GTM PTM HTM and CTM respectively

Stm(s tci) =14times (

1Rankg(tci)

+1

Rankp(tci)+

1Rankh(tci)

+1

Rankc(tci)) (7)

42 Web-Based Validation Sweb

Korean or Japanese web pages are usually composed of rich texts in a mixtureof Korean or Japanese (main language) and English (auxiliary language) Lets and t be a source language word and a target language word respectivelyWe observed that s and t tend to be near each other in the text of Korean orJapanese web pages when the authors of the web pages describe s as translationof t or vice versa We retrieved such web pages for transliteration validation

There have been several web-based validation methods for translation valida-tion [1718] or transliteration validation [219] They usually rely on the web fre-quency (the number of web pages) derived from ldquoBilingual Keyword Search(BKS)rdquo [21718] or ldquoMonolingual Keyword Search (MKS)rdquo [219] BKS re-trieves web pages by using a query composed of two keywords s and t while MKSretrieves web pages by using a query composed of t Qu amp Grefenstette [17] andWang et al [18] proposed BKS-based translation validation methods such as rel-ative web frequency and chi-square (χ2) test Al-Onaizan amp Knight [2] used bothMKS and BKS and Grefenstette et al [19] used only MKS for validating translit-erations However web pages retrieved by MKS tend to show whether t is used intarget language texts rather than whether t is a translation of s BKS frequentlyretrieves web pages where s and t have little relation to each other because it doesnot consider distance between s and t in the web pages To address these prob-lems we developed a validation method based on ldquoBilingual Phrasal Search(BPS)rdquo where a phrase composed of s and t is used as a query for a search engineLet lsquo[s t]rsquo or lsquo[t s]rsquo lsquos And trsquo and lsquotrsquo respectively be queries for BPS BKS andMKS The difference among BPS BKS and MKS is shown in Fig 2 In Fig 2 lsquo[st]rsquo or lsquo[t s]rsquo retrieves web pages where lsquo[s t]rsquo or lsquo[t s]rsquo exists as phrases while lsquos Andtrsquo retrieves web pages where s and t simply exist in the same document Thereforethe number of web pages retrieved by BPS is more reliable for validating translit-erations because s and t usually have high co-relation in the web pages retrievedby BPS For example web pages retrieved by BPS in Fig 3 usually contain cor-rect Korean and Japanese transliterations and their corresponding English wordamylase as translation pairs in parentheses expression For these reasons BPS ismore suitable for our transliteration validation

Let TC be a set of transliterations (or transliteration candidates) producedby different transliteration models tci be the ith transliteration candidate inTC s be the source language word resulting in TC and WF (s tci) be the web

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

Improving Machine Transliteration Performance 91

Fig 2 Difference among BPS BKS and MKS

Fig 3 Web pages retrieved by BPS

frequency for [s tci] Our web-based validation method Sweb can be representedas Eq (8) which is the relative web frequency derived from BPS

Sweb(s tci) =WF (s tci) + WF (tci s)sum

tckisinTC(WF (s tck) + WF (tck s))(8)

Let TC for s = data be tc1= tc2= tc3= and WF (s tc1) +WF (tc1 s) WF (s tc2) + WF (tc2 s) and WF (s tc3) + WF (tc3 s) be 9410067800 and 54 respectively Then Sweb for each tci can be calculated as follows

ndash Sweb(s tc1) = 94100161954 = 05811ndash Sweb(s tc2) = 67800161954 = 04186ndash Sweb(s tc3) = 54161954 = 00003

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

92 J-H Oh K-S Choi and H Isahara

5 Experiments

Our experiments were done for English-to-Korean and English-to-Japanesetransliteration The test set for the English-to-Koreantransliteration (EKSet) [20]consisted of 7172 English-Korean pairs ndash the number of training data was about6000 and the number of blind test data was about 1000 The test set for theEnglish-to-Japanese transliteration (EJSet) which consisted of English-katakanapairs from EDICT [21] consisted of 10417 pairs mdash the number of training datawas about 9000 and the number of blind test data was about 1000 EJSet con-tained one or more than one correct transliteration for one English word likeltmicroכgt and ltmicroכgt the average number of Japanesetransliterations for an English word was 115 EKSet and EJSet covered propernames technical terms and general terms Evaluation was done in terms of theword accuracy (WA) in Eq (9) In the evaluation we used k-fold cross-validation(k = 7 for the EKSet and k = 10 for the EJSet) The test set was divided into ksubsets Each one was used for testing while the remainder was used for trainingThen the average WA across all the k trials was computed Through the cross-validation we set α (04 for the EKSet and 05 for the EJSet) for HTM in Eq (4)

WA =the number of correct transliterations output by the system

the number of transliterations in the blind test data(9)

51 Experimental Results

Summaries of our experimental results conducted on EKSet and EJSet are shownin Table 1 In the table GTM PTM HTM and CTM represent the individualtransliteration models used for generating transliterations Stm Sweb and STV

Table 1 Summary of Results ()

Methods EKSet EJSetTop-1 Top-3 Top-5 Top-10 Top-1 Top-3 Top-5 Top-10

GTM 568 769 825 880 516 766 846 941

PTM 496 683 751 825 476 728 816 926

HTM 606 785 835 886 557 776 851 940

CTM 608 796 847 896 582 811 879 965

GPC [6] 551 NA NA NA 532 NA NA NA

GMEM [7] 559 NA NA NA 562 NA NA NA

HWFST [11] 583 NA NA NA 625 NA NA NA

Stm 710 812 854 891 669 813 875 932

MKS 505 759 843 916 499 751 830 930Sweb BKS 744 888 912 921 761 942 960 968

BPS 836 915 919 921 813 956 969 978

MKS 595 813 878 921 567 773 835 928STV BKS 795 905 917 921 796 947 962 968

BPS 844 917 920 921 820 956 969 981

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

Improving Machine Transliteration Performance 93

represent experimental results validated by Eqs (7) (8) and (6) respectivelyMoreover we tested Sweb and STV according to web search methods (BPS BKSand MKS) to show the effect of BPS on transliteration validation We comparedour proposed method with the previous work GPC [6] GMEM [7] and HWFST[11]3 Note that only Top-1 was considered in the previous work because theyexcept for HWFST focused only on the Top-1 The Top-n considers whether thecorrect transliteration is in the Top-n ranked transliterations4

Compared to individual transliteration models and previous work [6711] StmSweb(BPS) and STV (BPS) are more effective especially in the Top-15 AlthoughStm by itself showed higher performance than individual transliteration modelsand previous work [6711] STV (BPS) (the combination of Stm and Sweb(BPS))shows much better performance The Top-1 of STV (BPS) has the best perfor-mance6 The powerful transliteration validation ability of STV (BPS) enables ourmethod to achieve the best result in the Top-1 More specifically Sweb(BPS)contributes highly to the performance improvement This indicates that the webdata used as the knowledge source for transliteration validation is very usefulAlthough Stm makes a small contribution to the performance improvement ofSTV (BPS) because Sweb(BPS) correctly validates transliterations whenever Stm

does the errors of Sweb(BKS) and Sweb(MKS) are well compensated for by Stm inSTV (BKS) and STV (MKS) For example Korean and Japanese transliterationsfor the English words methoxyl and netware were validated by each validationmethod as shown in Tables 2 and 3 Note that the value coupled with each translit-eration was assigned by each validation method In Tables 2 and 3 Stm causes therank of correct transliterations to be higher in STV (BKS) and STV (MKS) thanin Sweb(BKS) and Sweb(MKS)

When comparing BPS with BKS and MKS BPS is the most effective websearch method for transliteration validation Sweb based on MKS has the worstperformance because it tends to validate whether tci is used in a target languagerather than whether it is used as a translation of s Actually Sweb based on BKSis effective because BKS considers both s and tci while it retrieves web pagesHowever the more powerful retrieval ability of BPS protects Sweb based on BPSfrom errors that Sweb based on BKS causes as shown in Tables 2 and 3 So higherperformance can be had with Sweb(BPS) than with Sweb(BKS) ndash about 12 im-provement in Top-1 of EKSet and about 7 improvement in Top-1 of EJSet7 The

3 We implemented the three previous methods [6711] and then trained and testedthem using the same data as our proposed method

4 For one English word there are one or more than one correct transliterations in EJSetbut there is only one correct transliteration in EKSet Therefore we had higher TOP-1 accuracies but lower TOP-10 accuracies in EKSet than in EJSet

5 A one-tail paired t-test showed that the results of Stm Sweb(BPS) and STV (BPS)were always significantly better than those of individual transliteration models andprevious work (level of significance = 0001)

6 A one-tail paired t-test showed that the results of STV (BPS) were always significantlybetter than those of the others (level of significance = 0001)

7 A one-tail paired t-test showed that the results of Sweb(BPS) were always significantlybetter than those of Sweb(BKS) (level of significance = 0001)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

94 J-H Oh K-S Choi and H Isahara

Table 2 Korean transliterations for English word methoxyl and their validation (Theunderlined is the correct Korean transliteration)

Stm (0772) (0081) 13 (0253)

Sweb(MKS) 13 (0482) (0334) (0148) (0029)

Sweb(BKS) (0564) (0413) 13 (0011)

Sweb(BPS) (0947) 13 (0053)

STV (MKS) 13 (0122) (0022) (00003)

STV (BKS) (0319) (0045) 13 (0003)

STV (BPS) (0742) 13 (0013)

Table 3 Japanese transliterations for English word netware and their validation (Theunderlined is the correct Japanese transliteration)

Stm (0875) (0458) (0319)

Sweb(MKS) 13ndash (0988) (0009) (0002)

Sweb(BKS) 13ndash (0626) (0274) (0100)

Sweb(BPS) (0860) 13ndash (0079) (0061)

STV (MKS) 13ndash (0189) (0004) (0002)

STV (BKS) (0240) 13ndash (0120) (0032)

STV (BPS) (0752) (0019) 13ndash (0015)

higher performance of Sweb(BPS) positively effects STV (BPS) thus the perfor-mance of STV (BPS) is higher than that of STV (BKS) Both Sweb and STV basedon BPS outperform those based on BKS or MKS

The experimental results can be summarized as follows

ndash Stm Sweb(BPS) and STV (BPS) are more effective than individual transliter-ation models and previous work [6711] especially in the Top-1

ndash STV (BPS) shows the best performancendash Sweb(BPS) mainly contributes the high performance of STV (BPS)ndash BPS is the most effective web search method for transliteration validation

among BPS BKS and MKS

6 Conclusion

We proposed a novel approach for improving machine transliteration performanceby combining multiple transliteration models We applied a ldquogeneratingtransliterations followed by their validationrdquo strategy We generated transliter-ation candidates using four different transliteration models and validated themusing web-based validation and transliteration model-based validation Experi-ments showed that combining multiple transliteration models was one way forconsidering complex transliteration behaviors and that transliteration validationwas very important for improving machine transliteration performance Our two

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

Improving Machine Transliteration Performance 95

transliteration validation methods were effective The web-based validationmethod effectively filtered out wrong transliterations by using web data whichreflects real-world usage of transliterations and the transliteration model-basedvalidation method as a web-independent validation measure complemented theweb-based validation method Moreover we showed that a web search methodsignificantly affects the performance of the web-based validation method Exper-iments showed that our ldquoBilingual Phrasal Search (BPS)rdquo is more suitablethan ldquoBilingual Keyword Search (BKS)rdquo and ldquoMonolingual KeywordSearch (MKS)rdquo in transliteration validation

References

1 Knight K Graehl J Machine transliteration In Proc of the 35th Annual Meet-ings of the Association for Computational Linguistics (1997) pp128ndash135

2 Al-Onaizan Y Knight K Translating named entities using monolingual and bilin-gual resources In Proc of ACL 2002 (2002) 400ndash408

3 Fujii A Tetsuya I JapaneseEnglish cross-language information retrieval Ex-ploration of query translation and transliteration Computers and the Humanities35 (2001) 389ndash420

4 Lin WH Chen HH Backward machine transliteration by learning phonetic sim-ilarity In Proc of the Sixth Conference on Natural Language Learning (CoNLL)(2002) 139ndash145

5 Kang BJ Choi KS Automatic transliteration and back-transliteration by de-cision tree learning In Proc of the 2nd International Conference on LanguageResources and Evaluation (2000) 1135ndash1411

6 Kang IH Kim GC English-to-Korean transliteration using multiple unboundedoverlapping phoneme chunks In Proc of the 18th International Conference onComputational Linguistics (2000) 418ndash424

7 Goto I Kato N Uratani N Ehara T Transliteration considering context in-formation based on the maximum entropy method In Proc of MT-Summit IX(2003) 125ndash132

8 Li H Zhang M Su J A joint source-channel model for machine transliterationIn Proc of ACL 2004 (2004) 160ndash167

9 Jung SY Hong S Paek E An English to Korean transliteration model of ex-tended markov window In Proc of the 18th conference on Computational linguis-tics (2000) 383ndash389

10 Meng H Lo WK Chen B Tang K Generating phonetic cognates to handlenamed entities in English-Chinese cross-language spoken document retrieval InProc of Automatic Speech Recognition and Understanding 2001 ASRU rsquo01 (2001)311ndash314

11 Bilac S Tanaka H Improving back-transliteration by combining informationsources In Proc of IJCNLP2004 (2004) 542ndash547

12 Oh JH Choi KS An English-Korean transliteration model using pronunciationand contextual rules In Proc of COLING2002 (2002) 758ndash764

13 Oh JH Choi KS An ensemble of grapheme and phoneme for machine translit-eration In Proc of IJCNLP05 (2005) 450ndash461

14 Oh JH Choi KS Machine learning based English-to-Korean transliterationusing grapheme and phoneme information IEICE Transaction on Information ampSystems E88-D (2005) 1737ndash1748

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)

96 J-H Oh K-S Choi and H Isahara

15 Berger AL Pietra SD Pietra VJD A maximum entropy approach to naturallanguage processing Computational Linguistics 22 (1996) 39ndash71

16 Zhang L Maximum entropy modeling toolkit for python and C++ http

homepagesinfedacuks0450736softwaremaxentmanualpdf (2004)17 Qu Y Grefenstette G Finding ideographic representations of Japanese names

written in Latin script via language identification and corpus validation In ACL(2004) 183ndash190

18 Wang JH Teng JW Lu WH Chien LF Exploiting the web as the multi-lingual corpus for unknown query translation Journal of the American Society forInformation Science and Technology 57 (2006) 660ndash670

19 Grefenstette G Qu Y Evans DA Mining the web to create a language modelfor mapping between English names and phrases and Japanese In Proc of WebIntelligence (2004) 110ndash116

20 Nam YS Foreign dictionary Sung An Dang (1997)21 Breen J EDICT JapaneseEnglish dictionary le The Electronic Dictionary Re-

search and Development Group Monash University httpwwwcssemonash

eduau~jwbedicthtml (2003)