advisor : dr. hsu graduate : chun kai chen author : keita tsuji

27
Intelligent Database Systems Lab 國國國國國國國國 National Yunlin University of Science and T echnology Automatic Extraction of Translation al Japanese-KATAKANA and English Word Pairs from Bilingu al Corpora Advisor Dr. Hsu Graduate Chun Kai Chen Author Keita Tsuji ICCPOL (2001) 245-250

Upload: pilar

Post on 22-Jan-2016

32 views

Category:

Documents


0 download

DESCRIPTION

Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs from Bilingual Corpora. Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji. ICCPOL (2001) 245-250. Outline. Motivation Objective Introduction Extraction of Traslational Word Pairs - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

國立雲林科技大學National Yunlin University of Science and Technology

Automatic Extraction of Translational Japanese-KATAKANAand English Word Pairs from Bilingual Corpora

Advisor : Dr. Hsu

Graduate : Chun Kai Chen

Author : Keita Tsuji

ICCPOL (2001) 245-250

Page 2: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Outline

Motivation Objective Introduction Extraction of Traslational Word Pairs Experimental Results Conclusions Personal Opinion

Page 3: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Motivation

The bilingual lexicon is in demand in many fields ─ cross-language information retrieval─ machine translation

The bilingual corpora ─ become widely available reflecting the change in

publishing activities

Page 4: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Objective

Propose a method ─ automatically extract translational Japanese-KATAKA

NA and English word pairs from bilingual corpora based on transliteration rules

─ < グラフ , graph>

Page 5: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Introduction(1/2)

Many of the methods proposed ─ depend heavily on word frequency in the corpora─ cannot treat low-frequency words properly

The low-frequency words ─ include newly-coined words which are especially in de

mand in many fields

Page 6: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Introduction(2/2)

Against these background─ we propose a method to extract translational KATAKA

NA-English word pairs from bilingual corpora The method

─ applies all the existing transliteration rules to each mora unit in a KATAKANA word

─ extract English word which matched or partially-matched to one of these transliteration candidates as translation

Page 7: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Extraction of Traslational Word Pairs

Pick up J Decompose J

Generate candidates

Pick up E Identify LCS P(J, E)=maxi Dice()

Transliteration rules

グラフ ‘グ’ ,’ラ’ ,’フ’

Ti(J)=graf

Ti(J)=graph

graph P(グラフ , graph)=1P(グラフ , library)=0.36

S(‘guraf’, ‘graph’) = ‘gra’)

Construction of Transliteration Rule

Measure for Matching

Page 8: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Construction of Transliteration Rule(1/2)

1. Decompose KATAKANA word ─ ‘ディスパッチャー’─ ‘ディ’ , ‘ ス’ , ‘ パッ’ and ‘ チャー’

2. Extract transliteration rule for each unit manually─ ‘ディスパッチャー’ and ‘dispatcher’─ ‘ディ’ = ‘di’─ ‘ス’ = ‘s’─ ‘パッ’ = ‘pat’─ ‘チャー’ = ‘cher’

Page 9: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Construction of Transliteration Rule(2/2)

3. Repeat (1) and (2) for all the word pairs in the list

─ count the frequency of each rule─ rank them for each KATAKANA unit

4. Add Hepburn transliteration rules into the above rules

─ If the source list is large enough, this process might not be necessary

Page 10: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Assume

When Japanese borrows an English word─ the word is transliterated on the basis of Japanese mora unit ─ their correspondences are stable

These transliterations are free from context ─ have no relation to the preceding─ following mora units

The number of transliteration rules is small They do not vary drastically from time to time nor fro

m domain to domain

Page 11: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Extraction of Traslational Word Pairs

Pick up J and decompose it into units ─ according to the same framework we used at TR construction

Using all the transliteration rules in TR─ generate all the possible transliteration candidates for J─ Henceforth we represent the i-th transliteration candidate of J as Ti(J)

Pick up E which co-occurred with J ─ occurred in the same aligned segment in the corpus─ identify the longest common subsequence with each Ti(J)

If the following P(J, E) exceeds certain threshold─ extract pair J and E as translation

P(J, E)=maxi Dice(L(S(Ti(J), E)), L(Ti(J)), L(E))

Page 12: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Extraction of Traslational Word Pairs

Pick up J Decompose J

Generate candidates

Pick up E Identify LCS P(J, E)=maxi Dice()

Transliteration rules

グラフ ‘グ’ ,’ラ’ ,’フ’

Ti(J)=graf

Ti(J)=graph

graph P(グラフ , graph)=1P(グラフ , library)=0.36

S(‘guraf’, ‘graph’) = ‘gra’)

Construction of Transliteration Rule

Measure for Matching

Page 13: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Dice Measure J: KATAKANA word in the corpus E: English word in the corpus L(w): number of characters of word w T(J): transliteration candidate of word J S(w1, w2): longest common subsequence of w1 and w2

─ (e.g. S(‘guraf’, ‘graph’) = ‘gra’)

Dice(k,m,n)=k*2/(m+n)

Dice(L(S(Ti(J), E)), L(Ti(J)), L(E))

Dice(L(S(graf, graph)), L(graf), L(graph))

Dice(3,4,5)=3*2/(4+5)=0.67

Ti(J)=graf

Ti(J)=graph

Page 14: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Example P( グラフ , graph)

= max( Dice(L(S(graf, graph)), L(graf), L(graph)), Dice(L(S(graph, graph)), L(graph), L(graph)), Dice(L(S(graff, graph)), L(graff), L(graph)), … Dice(L(S(gulerfe, graph)), L(gulerfe), L(graph))) = max(0.67, 1.00, 0.60, …, 0.33, 0.33) = 1.00

P( グラフ , library)=max(0.36, 0.33, ..., 0.29, 0.29)=0.36

Page 15: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Device for Time-saving(1/2)

Procedure requires much computational time when applied to the actual data─ using all rules in TR often leads to the combinatory exp

losion of the number of transliteration candidates─ identifying the longest common subsequence often req

uires much time

Page 16: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Device for Time-saving(2/2)

Apply less transliteration rules to longer KATAKANA word─ at TR construction, we have ranked transliteration rules according

to their frequencies─ applied top 12/(the number of units in J)+1 rules to each unit of J

3*6*4 candidates

12/3+1=5

3*5*4 candidates

3 6 4

Page 17: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Other Alternatives(1/2)

Transliteration Rule (TR)─ comparing the effectiveness of TR and HR (Hepburn tr

ansliteration rule)─ HR is an well-known rule for transliterating Japanese t

o alphabet strings and is easily available─ HR gives each KATAKANA unit a unique alphabet str

ing TR usually generates many T(J) HR generates only one T(J). If HR alone can produce good result, we can save computation

al time

Page 18: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Other Alternatives(2/2)

Measure for Matching─ Our method uses the combination of Dice and NPT_sc

ore─ Mdic was used on behalf of Dice:

Mdic(k, m, n)=(1+logk)*k*2/(m+n)─ Bgrm was used on behalf of the combination of Dice a

nd NPT_score:

Bgrm(T(J), E)=|NT(J) ∩ NE|/|NT(J) ∪ NE|

Page 19: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiment Data(1/2)

Source Data for TR─ from 3,742 transliterational term pairs in dictionary of

artificial intelligence─ added Hepburn transliteration rules to them

Bilingual Corpora─ eight bilingual corpora, Japanese-English parallel abstr

acts/titles of academic papers,─ domains of these corpora are artificial intelligence (AI),

forestry (FR), information processing (IP) and architecture (AC)

Page 20: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Experiment Data(2/2)

The purpose of using abstract and title corpora─ examine the influence of size and the degree of parallelism of eac

h segment to the extraction results─ abstracts are larger and noisier than titles

The purpose of using four domains ─ examine the influence of domain difference to the results─ results of the other three domains do not significantly differ from

that of artificial intelligence─ need not be so nervous about the domain of source data for TR co

nstruction

Page 21: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Results(1/3)

TR is much more effective than simple HR(Hepburn transliteration rules)─ TR_Dice achieved 93% precision at 75% recall─ HR_Dice at 75% recall remained 4%

Dice is more effective than Mdic─ many word pairs which are long and morphologically-r

elated but not translational─ Mdic between these word pairs tend to become higher t

han those of short translational pairs, which leads to the decrease of precision

Page 22: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Results(2/3) The combination of Dice and NPT_score is more effe

ctive than Bgrm Our method is more effective than using combination

of NPT and Match─ in [1] tends to generate strings which contain incorrect translation

s. ‘グラフ’ is transliterated into ‘ghurlaoffphu’ contains not only correct translation ‘graph’, but also incorrect one ‘

hop’.─ Match depends only on the length of English words and their mat

ched parts─ Therefore, it tends to evaluate short English words

Match(‘ghurlaoffphu’, ‘hop’)=3/3=1)

Page 23: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Results(3/3)

TR which was constructed based on artificial intelligence term pairs also performed well against the other three domains─ This indicates the domain-independence of TR ─ we do not have to construct TR for each domain

Our method achieved 96-100% precision at 75% recall against title corpora

Our method performed well against abstract corpora (83-93% precision at 75% recall)

Page 24: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Error Analysis(1/2)

Most of the translational word pairs not extracted were those containing transliteration units which were not listed in TR─ For instance

‘アーキテクチャ’ and ‘architecture’ was not extracted because ‘ キ’ = ‘chi’ ,‘チャ’ = ‘ture’ were not listed in TR

─ We can solve this problem by enriching TR

Page 25: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Error Analysis(2/2)

A few of the translational word pairs which were not extracted were acronyms and their original forms─ cannot be extracted based on transliterations by nature

Many of the non-translational word pairs wrongly extracted were morphologically related pairs. ─ For instance

‘プログラマ’ (programmer) and ‘program’ , ‘クラス’ (class) and ‘subclass’

─ Emphasizing the first and the last unit matching might be effective

Page 26: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Conclusion

We proposed a method─ for extracting translational KATAKANA-English word

pairs from bilingual corpora

The experiment─ shows our method is highly effective

By enriching TR and introducing some heuristics─ the performance will become higher

Page 27: Advisor   : Dr. Hsu Graduate : Chun Kai Chen Author : Keita Tsuji

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.Personal Opinion

Advantage─ Extraction of Traslational Word Pairs─ Error Analysis

Disadvantage─ Construction of Transliteration Rule

─ Assume and limit