Intelligent Database Systems Lab

National Yunlin University of Science and Technology

Learning Formulation and Transformation Rules for Multilingual Named Entities

Advisor: Dr. Hsu

Reporter: Chun Kai Chen

Authors: Hsin-Hsi Chen, Changhua Yang and Ying Lin

Proceedings of the ACL 2003

N.Y.U.S.T. I. M.

Outline

─ Motivation
─ Objective
─ Introduction
─ Multilingual Named Entity Corpora
─ Rule Mining
─ Experimental Results
─ Conclusions
─ Personal Opinion

Motivation

Past work on multilingual named entities has emphasized transliteration issues.

However, the transformation between named entities in different languages is not transliteration only:
─ Victoria Fall - 維多利亞瀑布
─ Little Rocky Mountains - 小落磯山脈
─ Kenmare - 康美爾
─ East Chicago - 東芝加哥

Objective

Propose a method to extract
─ formulation rules of named entities for individual languages
─ transformation rules for mapping among languages

Apply the results to cross-language information retrieval (CLIR)

Introduction (1/3)

In the past, named entity extraction
─ mainly focused on general domains
─ was employed in various applications such as information retrieval and question answering

Introduction (2/3)

Most of the previous approaches
─ dealt with monolingual named entity extraction
─ Chen et al. (1998) extended it to cross-language information retrieval (CLIR)

A grapheme-based (letter-based) model was
─ proposed to compute the similarity between a Chinese transliterated name and an English name

Lin and Chen (2000) further classified the works into two directions
─ forward transliteration (Wan and Verspoor, 1998)
─ backward transliteration (Chen et al., 1998; Knight and Graehl, 1998)
─ and proposed a phoneme-based model

Introduction (3/3)

This paper studies
─ how languages and named entity types affect the choice between translation and transliteration
─ We focus on the three more challenging named entity types only, i.e., named people, named locations, and named organizations

Multilingual Named Entity Corpora

NICT location name corpus
─ developed by the Ministry of Education of Taiwan in 1995
─ each entry consists of three parts: foreign location name, Chinese transliteration/translation name, and country name
─ e.g., (Victoria Fall, “維多利亞瀑布” (wei duo li ya pu bu), South Africa)

CNA personal name and organization corpora
─ used by news reporters to unify name transliteration/translation in news stories

Rule Mining

─ Frequency-Based Approach with a Bilingual Dictionary
─ Keyword Extraction without a Bilingual Dictionary
─ Extraction of Transformation Rules
─ Extraction of Keywords at a Distance

Learning Formulation and Transformation Rules

[Figure: overview of the rule-mining process, with one example per step]

Frequency-Based with a Bilingual Dictionary
─ Victoria Fall: Victoria, “維多利亞”; Fall, “瀑布”
─ Dictionary: “Mountain” ⇔ “山”

Keyword Extraction without a Bilingual Dictionary
─ World Taiwanese Association, “世台會”
─ Generate candidates: decompose E, e.g., (s6) {Catalan Mountain, 卡太蘭山} and (s7) {Aletschhorn Mountain, 阿利奇赫恩山}, into segment pairs {e1, …} and {e2, …}
─ Count the frequency (TFIDF)
─ {Mountain, “山” (shan)}

Extraction of Transformation Rules
─ (s6’) γ mountain ⇔ δ 山
─ (s7’) γ mountain ⇔ δ 山
─ (s8’) γ Strait ⇔ δ 海峽
─ (s9’) γ, Strait of ⇔ δ 海峽

Extraction of Keywords at a Distance
─ “American Civil Liberties Union”
─ “American ∆ Liberties Union”
─ “American Civil ∆ Union”
─ “American ∆ Union”

Frequency-Based Approach with a Bilingual Dictionary

We postulate that
─ a transliterated term is usually an unknown word and is not listed in a lexicon
─ a translated term often appears in a lexicon

Under this postulation
─ a translated term occurs more often in a corpus, e.g., Fall, “瀑布”
─ a transliterated term appears only a few times, e.g., Victoria, “維多利亞”

Frequency-based method (1/2)

A simple frequency-based method computes term frequencies and uses them to tell apart the transliterated and translated parts of a named entity
─ Compute the frequency of each word in the foreign name list
─ Keep the words that
  appear more often than a threshold
  appear in a common foreign dictionary
  these words form the candidate simple keywords, e.g., Mountain
─ Examine the foreign word list again
─ Cluster the Chinese name list based on the foreign keywords
  here a bilingual dictionary may be consulted, e.g., “Mountain” ⇔ “山”
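
The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not the authors' implementation: the function names, the toy dictionary, and the threshold value are assumptions made for the example, and the four name pairs are examples (s6)-(s9) from these slides.

```python
from collections import Counter

def simple_keywords(foreign_names, dictionary, threshold):
    """Words that occur more than `threshold` times and are listed in `dictionary`."""
    counts = Counter(
        word.strip(",").lower()
        for name in foreign_names
        for word in name.split()
    )
    return {w for w, n in counts.items() if n > threshold and w in dictionary}

def cluster_by_keyword(pairs, keywords):
    """Group the Chinese names by the foreign keyword found in their foreign name."""
    clusters = {}
    for foreign, chinese in pairs:
        for word in foreign.split():
            key = word.strip(",").lower()
            if key in keywords:
                clusters.setdefault(key, []).append(chinese)
    return clusters

# Name pairs taken from examples (s6)-(s9) in these slides.
pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
# Hypothetical stand-in for a common foreign dictionary.
toy_dictionary = {"mountain", "strait", "of"}

# threshold=1 is an arbitrary value chosen for this tiny sample.
keywords = simple_keywords([f for f, _ in pairs], toy_dictionary, threshold=1)
clusters = cluster_by_keyword(pairs, keywords)
print(sorted(keywords))
print(clusters)
```

In each cluster the Chinese names share a suffix, which is where a bilingual dictionary entry such as “Mountain” ⇔ “山” can be checked.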

Frequency-based method (2/2)

NICT location name corpus
─ River (河, he), Island (島, dao), Lake (湖, hu), Mountain (山, shan), Bay (灣, wan), Mountain (峰, feng), Peak (峰, feng)
─ “Mountain” ⇔ “山” (shan) and “峰” (feng)
─ “峰” (feng) ⇔ “Mountain” and “Peak”

CNA organization name corpus
─ Suffix: Association (協會, xie hui), University (大學, da xue)
─ Prefix: International (國際, guo ji), World (世界, shi jie), American (美國, mei guo)
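
The many-to-many nature of the location keywords can be made explicit with two lookup tables. The sketch below is illustrative only; the variable names are assumptions, and the data are the location keyword pairs listed on this slide.

```python
from collections import defaultdict

# Location keyword pairs as listed on the slide.
keyword_pairs = [
    ("River", "河"), ("Island", "島"), ("Lake", "湖"), ("Mountain", "山"),
    ("Bay", "灣"), ("Mountain", "峰"), ("Peak", "峰"),
]

to_chinese = defaultdict(list)  # foreign keyword -> Chinese keywords
to_foreign = defaultdict(list)  # Chinese keyword -> foreign keywords
for foreign, chinese in keyword_pairs:
    to_chinese[foreign].append(chinese)
    to_foreign[chinese].append(foreign)

# Keywords with more than one counterpart in the other language.
ambiguous_foreign = {e: cs for e, cs in to_chinese.items() if len(cs) > 1}
ambiguous_chinese = {c: es for c, es in to_foreign.items() if len(es) > 1}
print(ambiguous_foreign)
print(ambiguous_chinese)
```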

Keyword Extraction without a Bilingual Dictionary (problem)

Abbreviation is commonly adopted in translation, and a dictionary-based approach can hardly capture this phenomenon
─ (World Taiwanese Association, “世台會”)

Hence another approach that needs no dictionary is proposed

Keyword Extraction without a Bilingual Dictionary (process)

(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
─ {e1, s1 s2 … st} with e1 = Aletschhorn, paired with every Chinese segment
  1 syllable: 阿, 利, 奇, 赫, 恩, 山
  2 syllables: 阿利, 利奇, 奇赫, 赫恩, 恩山
  3 syllables: 阿利奇, 利奇赫, 奇赫恩, 赫恩山
  4 syllables: 阿利奇赫, 利奇赫恩, 奇赫恩山
  5 syllables: 阿利奇赫恩, 利奇赫恩山
  6 syllables: 阿利奇赫恩山
─ {e2, s1 s2 … st} with e2 = Mountain, paired with the same Chinese segments

(s7) Catalan Mountain ⇔ 卡太蘭山
─ {e1, s1 s2 … st} with e1 = Catalan, paired with every Chinese segment
  1 syllable: 卡, 太, 蘭, 山
  2 syllables: 卡太, 太蘭, 蘭山
  3 syllables: 卡太蘭, 太蘭山
  4 syllables: 卡太蘭山
─ {e2, s1 s2 … st} with e2 = Mountain, paired with the same Chinese segments

─ {e, c} whose frequency > 2 are kept
─ {Mountain, “山” (shan)}

Keyword Extraction without a Bilingual Dictionary (algorithm)

Input: name pairs {E_j, C_j}
─ E_j is a foreign named entity
─ C_j is a Chinese named entity

Decompose the named entities
─ E_j comprises m words w_1·w_2…w_m; a candidate segment e_{p,q} is defined as w_p … w_q
─ C_j has n syllables s_1·s_2…s_n; a candidate segment c_{x,y} is defined as s_x … s_y
─ we can get pairs of {e_{p,q}, c_{x,y}} from {E_j, C_j}

Group and count
─ the pairs are collected from the multilingual named entity list
─ count the frequency of each pair over all its occurrences
─ pairs with higher frequency denote significant segment pairs
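
The decompose, group, and count steps can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are made up, one Chinese character is assumed to be one syllable, and each segment pair is counted at most once per name pair. The four name pairs are examples (s6)-(s9) from these slides.

```python
from collections import Counter

def segments(units):
    """All contiguous runs units[i:j], i.e. the candidate segments."""
    n = len(units)
    return [tuple(units[i:j]) for i in range(n) for j in range(i + 1, n + 1)]

def count_segment_pairs(name_pairs):
    """Count how many name pairs yield each {foreign segment, Chinese segment}."""
    counts = Counter()
    for foreign, chinese in name_pairs:
        words = [w.strip(",") for w in foreign.split()]
        syllables = list(chinese)  # assumption: one character = one syllable
        pairs_here = {
            (" ".join(e), "".join(c))
            for e in segments(words)
            for c in segments(syllables)
        }
        counts.update(pairs_here)  # each pair counted once per name pair
    return counts

name_pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
counts = count_segment_pairs(name_pairs)
print(counts[("Mountain", "山")], counts[("Strait", "海峽")])
```

On this small sample, sub-segments such as {Strait, “海”} are exactly as frequent as {Strait, “海峽”}, so raw counts alone cannot separate them.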

Keyword Extraction without a Bilingual Dictionary (example)

Example
(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
(s7) Catalan Mountain ⇔ 卡太蘭山
(s8) Cook Strait ⇔ 科克海峽
(s9) Dover, Strait of ⇔ 多佛海峽

─ All the pairs {e, c} whose frequency > 2 are kept
─ {Mountain, “山” (shan)} and {Strait, “海峽” (hai xia)} each appear twice in these four examples

Keyword Extraction without a Bilingual Dictionary (problem)

Two issues have to be addressed
─ redundancy that may exist in the pairs of segments should be eliminated carefully
─ e may be translated into more than one synonym, e.g., “Association” ⇔ “協會” (xie hui) and “聯誼會” (lian yi hui)

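A metric to deal with the above issues is proposed

  score(\{e, c_i\}) = f(\{e, c_i\}) \times idf(c_i) \times \log_2(|c_i| + 1)

  f(\{e, c_i\}) = \frac{tf(\{e, c_i\})}{\max_j tf(\{e, c_j\})}

  idf(c_i) = \log_2 \frac{N}{\ldots}   (denominator not legible in the slide text)

  c = \arg\max_{c_i} score(\{e, c_i\})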

Extraction of Transformation Rules

Chinese location name keyword
─ tends to be located in the rightmost position
─ the remaining part is a transliterated name

Foreign location name keyword
─ tends to be located either in the rightmost position or permuted with prepositions, a comma, and the transliterated part

(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
(s7) Catalan Mountain ⇔ 卡太蘭山
(s8) Cook Strait ⇔ 科克海峽
(s9) Dover, Strait of ⇔ 多佛海峽

Replacing the transliterated parts with γ (foreign side) and δ (Chinese side) gives the rules
(s6’) γ Mountain ⇔ δ 山
(s7’) γ Mountain ⇔ δ 山
(s8’) γ Strait ⇔ δ 海峽
(s9’) γ, Strait of ⇔ δ 海峽
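
A rough sketch of how such rules could be derived from a name pair and a known keyword pair is shown below. It is an illustration under stated assumptions, not the paper's procedure: the function names and the `FUNCTION_WORDS` stop list are hypothetical, only single-word foreign keywords are handled, and the Chinese keyword is matched at its rightmost occurrence.

```python
import re

# Assumed stop list: tokens kept literally in the foreign pattern.
FUNCTION_WORDS = {"of"}

def foreign_pattern(name, keyword):
    """Replace each run of non-keyword content words with the placeholder γ."""
    tokens = re.findall(r"[^\s,]+|,", name)
    out = []
    for tok in tokens:
        if tok == keyword or tok == "," or tok.lower() in FUNCTION_WORDS:
            out.append(tok)
        elif not out or out[-1] != "γ":
            out.append("γ")
    return " ".join(out).replace(" ,", ",")

def chinese_pattern(name, keyword):
    """Replace everything around the rightmost keyword occurrence with δ."""
    before, found, after = name.rpartition(keyword)
    if not found:
        raise ValueError(f"{keyword!r} does not occur in {name!r}")
    parts = ["δ" if before else "", keyword, "δ" if after else ""]
    return " ".join(p for p in parts if p)

def transformation_rule(foreign, chinese, foreign_keyword, chinese_keyword):
    left = foreign_pattern(foreign, foreign_keyword)
    right = chinese_pattern(chinese, chinese_keyword)
    return f"{left} ⇔ {right}"

print(transformation_rule("Cook Strait", "科克海峽", "Strait", "海峽"))
print(transformation_rule("Dover, Strait of", "多佛海峽", "Strait", "海峽"))
```

Identical rule strings produced by different name pairs can then be counted to see which formulation patterns dominate for a keyword pair.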

Extraction of Keywords at a Distance

(s12) and (s13)
─ the English compound keyword is separated, and so is its Chinese counterpart

(s14) and (s15)
─ the English compound keyword is connected
─ but the corresponding Chinese translation is separated

(s12) American Podiatric Medical Association ⇔ 美國足病醫療學會
(s13) American Public Health Association ⇔ 美國公共衛生學會
(s14) American Society for Industrial Security ⇔ 美國工業安全協會
(s15) American Society of Newspaper Editors ⇔ 美國報紙編輯人協會

Extraction of Keywords at a Distance

Introduce a symbol ∆ to cope with the distance issue
─ “American Civil Liberties Union”
─ “American ∆ Liberties Union”
─ “American Civil ∆ Union”
─ “American ∆ Union”

Experimental Analysis (corpus)

NICT location corpus
─ 122 keyword pairs are identified in total
─ 230 transformation rules in total
─ on average, a keyword pair corresponds to 1.89 transformation rules

CNA personal names
─ names composed of more than one word: 100 / 50,586
─ only a few keywords are extracted: De ⇔ 戴 (dai), La ⇔ 拉 (la), De La ⇔ 戴拉 (dai la), Du ⇔ 杜 (du), David ⇔ 大衛 (da wei)

CNA organization names
─ names composed of more than one word: 12,885 / 14,658
─ 5,229 keyword pairs are extracted
─ most of the keyword pairs are meaning translations

Experimental Analysis (classify)

We classify these keyword pairs into the following types
─ (1) Meaning translation
  common location keywords: Bir ⇔ 井 (jing), Ain ⇔ 泉 (quan), Bahr ⇔ 河 (he), Cerro ⇔ 山 (shan)
  direction: Central ⇔ 中 (zhong), East ⇔ 東 (dong), etc.
  size (e.g., Big ⇔ 大 (da)), length (e.g., Long ⇔ 長 (chang)), color (e.g., Black ⇔ 黑 (hei), Blue ⇔ 藍 (lan), etc.)
  the specificity of a place or area: Crystal ⇔ 結晶 (jie jing), Diamond ⇔ 鑽石 (zuan shi)
─ (2) Phoneme transliteration keywords
  Dera ⇔ 德拉 (de la), Monte ⇔ 蒙特 (meng te), Los ⇔ 洛斯 (luo si), Elizabeth ⇔ 伊利莎白 (yi li sha bai), Edward ⇔ 愛德華 (ai de hua)
  39 terms belong to this type in total (31.97%)
─ (3) Some keywords in type (1) are transliterated
  Bay ⇔ 貝 (bei), Beach ⇔ 比奇 (bi qi)
  14 keywords (11.48%) are extracted in total
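
As a quick consistency check, the reported percentages match the 122 keyword pairs of the NICT location corpus. The short calculation below only re-derives figures quoted in these slides; the variable names are illustrative.

```python
keyword_pairs = 122           # NICT location corpus
transformation_rules = 230    # NICT location corpus
phoneme_transliteration = 39  # phoneme transliteration keywords
transliterated_type1 = 14     # type (1) keywords that are transliterated

rules_per_pair = round(transformation_rules / keyword_pairs, 2)
share_phoneme = round(100 * phoneme_transliteration / keyword_pairs, 2)
share_type1 = round(100 * transliterated_type1 / keyword_pairs, 2)
print(rules_per_pair, share_phoneme, share_type1)
```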

Experimental Results

NICT location corpus
─ 122 keyword pairs are identified in total
─ 230 transformation rules in total
─ on average, a keyword pair corresponds to 1.89 transformation rules

Keyword pair Mountain ⇔ 山 (shan)
─ four transformation rules
  (1) γα ⇔ δβ (234)
  (2) γ, α ⇔ δβ (45)
  (3) γ, αγ ⇔ δβ (1)
  (4) γαγ ⇔ δβ (1)

Application on CLIR

Conclusion and Remarks

This paper proposes corpus-based approaches
─ to extract the formulation rules and the translation/transliteration rules among multilingual named entities

Two types of evaluation
─ partitioning the corpora into two parts, one for training and the other for testing
─ integrating our method into a cross-language information retrieval system

Further applications
─ will be explored in the future, and the methodology will be extended to other types of named entities

Personal Opinion

Drawback
─ lacks an analysis of time complexity

Application
─ construct Chinese-English rules and apply them to IR

Future Work
─ address the transliterated / translated term issue
