post -al: part of speech tagger for ainu...



We presented POST-AL, the first POS tagger for the Ainu language. The system performs three main tasks: tokenization, part-of-speech tagging and token translation. The best-performing methods reached balanced F1 scores of around 97-98%. The output of POST-AL can be presented in one of three POS standards for the Ainu language, in either a vertical or a horizontal view.

In the near future we plan to:
• Compare different tokenization approaches (e.g., Huang et al., 2007).
• Add other dictionaries (Nakagawa, 1995; Tamura, 1998).
• Add English translations (Batchelor, 1905) to make the tool usable for researchers who do not speak Japanese.
• Perform a robust evaluation of the annotations with the help of several experts and Ainu native speakers.
• Bootstrap the system for even better performance.
• Apply POST-AL to machine translation.

POST-AL: Part-of-Speech Tagger for Ainu Language

Michal Ptaszynski 1) and Yoshio Momouchi 2)

1) JSPS Research Fellow / Hokkai-Gakuen University, High-Tech Research Center
2) Hokkai-Gakuen University, Department of Electronics and Information Engineering

We present POST-AL, a part-of-speech tagger for the Ainu language. The system uses a hand-crafted dictionary and performs three tasks: tokenization, part-of-speech tagging, and token translation (to Japanese). The system is evaluated on 13 Ainu stories called “yukar”. The system could be useful in a number of tasks related to research on the Ainu language, such as content analysis or translation, which until now have been done mostly manually.

Abstract

Tokenization
• DL-LSM (Dictionary Lookup with Longest String Matching): based on the Longest Match Principle (LMP).
• DL-P-LSM (Dictionary Lookup with Partial-LSM): based on the LMP with caesurae.
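The longest-match idea behind DL-LSM can be illustrated with a minimal sketch. It assumes the input is an unsegmented string and that the lexicon is a plain set of word forms; the function name, the toy lexicon and the fallback for unknown characters are assumptions made for illustration, not details taken from POST-AL. DL-P-LSM is not covered here.

```python
def tokenize_longest_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that matches (Longest Match Principle)."""
    tokens = []
    max_len = max(map(len, lexicon))
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Assumption: an unknown character becomes its own token.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon for illustration only (not the POST-AL dictionary).
TOY_LEXICON = {"pon", "po", "kamuy", "ka", "cise"}
print(tokenize_longest_match("ponkamuycise", TOY_LEXICON))
```

Because the search always starts from the longest window, "pon" wins over "po" and "kamuy" over "ka" in this toy run.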

POS Tagging
• S-POST (Statistical Part-of-Speech Tagging): all words of the same form are treated as one list, and the POS with the highest occurrence is chosen.
• CON-POST (Contextual Part-of-Speech Tagging): based on a higher-order HMM trained on dictionary examples (here, HMM = bigrams; higher-order HMM = bigrams, trigrams and longer n-grams).
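The difference between the two tagging strategies can be sketched as follows. This is a simplified illustration under stated assumptions: the dictionary is a list of (form, POS) pairs, the contextual model is reduced to greedy bigram counts over tagged examples (a stand-in for the higher-order HMM, not a reimplementation of it), and all function names, POS labels and entries are invented toy data.

```python
from collections import Counter, defaultdict


def s_post(token, entries):
    """Most frequent POS among dictionary entries sharing this form."""
    counts = Counter(pos for form, pos in entries if form == token)
    return counts.most_common(1)[0][0] if counts else "UNK"


def train_bigrams(tagged_examples):
    """Count how often (token, pos) follows a given previous POS."""
    bigrams = defaultdict(Counter)
    for sentence in tagged_examples:
        prev = "<s>"
        for token, pos in sentence:
            bigrams[prev][(token, pos)] += 1
            prev = pos
    return bigrams


def con_post(tokens, entries, bigrams):
    """Greedy contextual tagging; falls back to s_post without evidence."""
    tags, prev = [], "<s>"
    for token in tokens:
        candidates = sorted({pos for form, pos in entries if form == token})
        best = max(candidates, key=lambda p: bigrams[prev][(token, p)], default=None)
        if best is None or bigrams[prev][(token, best)] == 0:
            best = s_post(token, entries)
        tags.append(best)
        prev = best
    return tags


# Invented toy data, for illustration only.
ENTRIES = [("kor", "verb"), ("kor", "verb"), ("kor", "particle"), ("cise", "noun")]
EXAMPLES = [[("cise", "noun"), ("kor", "particle")]]
model = train_bigrams(EXAMPLES)
print(s_post("kor", ENTRIES), con_post(["cise", "kor"], ENTRIES, model))
```

In this toy run the frequency-only tagger always picks the majority tag for "kor", while the contextual tagger can pick the minority tag when the preceding tag supports it.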

Token Translation
• RAN-ToT (Random Token Translation): the translation is selected randomly from the list of words of the same POS (an extension of S-POST).
• CON-ToT (Contextual Token Translation): the translation is selected specifically for the word selected in CON-POST.
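A minimal sketch of the two translation strategies, assuming each dictionary entry carries an identifier, a form, a POS and one translation. The entry layout, the identifiers and the placeholder glosses are hypothetical and serve only to show the contrast between a random pick and a pick tied to the entry chosen during contextual tagging.

```python
import random

# Hypothetical entries; the glosses are placeholders, not dictionary data.
TOY_ENTRIES = [
    {"id": 1, "form": "kor", "pos": "verb", "meaning": "gloss-A"},
    {"id": 2, "form": "kor", "pos": "verb", "meaning": "gloss-B"},
    {"id": 3, "form": "kor", "pos": "particle", "meaning": "gloss-C"},
]


def ran_tot(form, pos, entries, rng=random):
    """Pick any translation among entries with the same form and POS."""
    candidates = [e["meaning"] for e in entries
                  if e["form"] == form and e["pos"] == pos]
    return rng.choice(candidates) if candidates else None


def con_tot(entry_id, entries):
    """Return the translation of the exact entry chosen by the tagger."""
    for entry in entries:
        if entry["id"] == entry_id:
            return entry["meaning"]
    return None


print(ran_tot("kor", "verb", TOY_ENTRIES, random.Random(0)))
print(con_tot(2, TOY_ENTRIES))
```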

System Description

© Michał Ptaszyński 2012

Output Options

Linguistic Studies:
• collections of Ainu epic stories and myths (Chiri, 1978; Kayano, 1998; Piłsudski and Majewicz, 2004)
• dictionaries and lexicons (Hattori, 1964; Chiri, 1975-1976; Nakagawa, 1995; Kayano, 1996; Tamura, 1998; Kirikae, 2003)
• grammar descriptions (Chiri, 1974; Murasaki, 1979; Refsing, 1986; Kindaichi, 1993; Sato, 2008)

NLP-related Studies:
• an attempt to transform an Ainu language dictionary into an online database (Bugaeva, 2010)
• automatic gathering of word translations from texts (Echizen-ya et al., 2004; 2005)
• analysis and retrieval of hierarchical Ainu-Japanese translations (Azumi and Momouchi, 2009a,b)
• annotation of Ainu “yukar” stories for a machine translation system (Momouchi et al., 2008)
• a system for the translation of Ainu place names (Momouchi and Kobayashi, 2010)

Previous Research on Ainu language

• 13 Ainu stories (yukar) from Ainu shin-yoshu (Ainu Songs of Gods), gathered by Chiri (1978).
• All stories are tokenized (by Kirikae, 2003).
• One yukar is annotated with POS and translations (by Momouchi et al., 2008).

Yukar 10: Pon Okikirmuy yayeyukar “kutnisa kutunkutun” (The “Kutnisa kutunkutun” story told by Small Okikirmuy himself)

Evaluation Dataset Description

Conclusions and Future Work

References
1. Skye Hohmann. 2008. The Ainu’s modern struggle. In Worlds Watch, Vol. 21, No. 6.
2. Christopher Moseley (ed.). 2010. Atlas of the World’s Languages in Danger, 3rd ed. Paris, UNESCO Publishing. Online version: http://www.unesco.org/culture/languages-atlas/
3. Yukie Chiri. 1978. Ainu shin-yoshu. Tokyo, Iwanami Shoten.
4. Shigeru Kayano. 1998. Kayano no ainu shinwa shuusei [A collection of Ainu myths by Kayano], vol. 1-10. Tokyo, Heibonsha.
5. Bronisław Piłsudski (author), Alfred F. Majewicz (editor). 2004. The Collected Works of Bronisław Piłsudski: Materials for the Study of the Ainu Language and Folklore, v. 3, Pt. 2: Materials for the Study of the Ainu (Trends in Linguistics: Documentation). Mouton de Gruyter.
6. Shiroo Hattori (ed.). 1964. An Ainu dialect dictionary. Tokyo, Iwanami Shoten.
7. Mashiho Chiri. 1975-1976. Bunrui ainugo jiten [A classificational dictionary of the Ainu language], vol. 1-3. Tokyo, Heibonsha. Reprint of works from 1953, 1954 and 1962.
8. Hiroshi Nakagawa. 1995. Ainugo Chitose Hogen Jiten: The Ainu-Japanese Dictionary: Chitose Dialect [In Japanese]. Sofukan.
9. Shigeru Kayano. 1996. Kayano Shigeru no ainugo jiten [An Ainu dictionary by Kayano Shigeru]. Tokyo, Sanseido.
10. Suzuko Tamura. 1998. Ainugo Saru Hogen Jiten: The Ainu-Japanese Dictionary: Saru Dialect [In Japanese]. Sofukan.
11. Hideo Kirikae. 2003. Ainu shin-yoshu jiten: tekisuto bumpo kaisetsu tsuki (Lexicon to Yukie Chiri’s Ainu Shin-yosyu (Ainu Songs of Gods) with Text and Grammatical Notes) [In Japanese]. Sapporo: Hokkaido Daigaku Bungakubu Gengogaku.
12. Mashiho Chiri. 1974. Ainu goho gaisetu (An outline of Ainu grammar). In Chiri Mashiho chosakushuu (Collection of works by Mashiho Chiri) [In Japanese], vol. 4, 3-197. Tokyo, Heibonsha. Reprint from 1936.
13. Kyoko Murasaki. 1979. Karafuto ainugo. Bunpo-hen (Sakhalin Ainu. Grammar volume) [In Japanese]. Tokyo, Kokushokan-kokai.
14. Kirsten Refsing. 1986. The Ainu language. The morphology and syntax of the Shizunai dialect. Aarhus, Aarhus University Press.
15. Kyosuke Kindaichi. 1993. Ainu yukara goho tekiyo (An outline grammar of Ainu epic stories) [In Japanese]. In Ainugogaku kogi 2 (Lectures on Ainu studies 2). Kindaichi Kyosuke zenshu. Ainugo I, v. 5, 145-366. Tokyo, Sanseido. Reprint from Ainu jojishi yukara no kenkyu (Research on Ainu epic stories) 2, 1-233, Tokyo: Toyo Bunko, 1931.
16. Tomomi Sato. 2008. Ainugo bunpo no kiso (The basics of Ainu grammar) [In Japanese]. Tokyo, Daigakushorin.
17. Anna Bugaeva. 2010. Internet Applications for Endangered Languages: A Talking Dictionary of Ainu. Waseda Institute for Advanced Study Research Bulletin, No. 3, pp. 73-81.
18. Hiroshi Echizen-ya, Kenji Araki, Yoshio Momouchi and Koji Tochinai. 2004. Acquisition of Word Translations Using Local Focus-Based Learning in Ainu-Japanese Parallel Corpora. Lecture Notes in Computer Science, Springer-Verlag, Vol. 2945, pp. 300-304.
19. Hiroshi Echizen-ya, Kenji Araki and Yoshio Momouchi. 2005. Learning Method for Automatic Acquisition of Translation Knowledge. Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, Heidelberg, New York, Vol. 3682, pp. 1347-1353.
20. Yasunori Azumi and Yoshio Momouchi. 2009a. Development of Analysis Tool for Hierarchical Ainu-Japanese Translation Data [In Japanese]. Bulletin of the Faculty of Engineering at Hokkai-Gakuen University, No. 36, pp. 175-193.
21. Yasunori Azumi and Yoshio Momouchi. 2009b. Development of Tools for retrieving and analyzing Ainu-Japanese translation data and their applications to Ainu-Japanese machine translation system [In Japanese]. Engineering Research: The Bulletin of Graduate School of Engineering at Hokkai-Gakuen University, No. 9, pp. 37-58.
22. Yoshio Momouchi, Yasunori Azumi and Yukio Kadoya. 2008. Research Note: Construction and Utilization of Electronic Data for “Ainu Shin-yosyu” [In Japanese]. Bulletin of the Faculty of Engineering at Hokkai-Gakuen University, No. 35, pp. 159-171.
23. Yoshio Momouchi and Ryosuke Kobayashi. 2010. Dictionaries and Analysis Tools for the Componential Analysis of Ainu Place Names [In Japanese]. Engineering Research: The Bulletin of Graduate School of Engineering at Hokkai-Gakuen University, No. 10, pp. 39-49.
24. C. Huang, P. Simon, S. Hsieh and L. Prevot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word Break Identification. In Proceedings of the ACL 2007 Demo and Poster Sessions, pp. 69-72.
25. John Batchelor. 1905. An Ainu-English-Japanese dictionary (including a grammar of the Ainu language). Tokyo, Methodist Pub. House.

POST-AL: Morphological Analysis System for the Ainu Language (POST-AL:アイヌ語形態素解析システム)

• Base dictionary for POST-AL: Ainu shin-yoshu jiten (Lexicon to Yukie Chiri’s Ainu Shin-yosyu (Ainu Songs of Gods)) by Kirikae (2003).
• The dictionary information was transformed into an XML database with the following fields:
1. token (word, morpheme, etc.)
2. part of speech
3. meaning (in Japanese)
4. usage examples (not for all cases)
5. reference to the story it appears in (not for all cases)
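The five fields listed above could be stored and read back roughly as in the sketch below. The element names (`entry`, `token`, `pos`, `meaning`, `example`, `story`) and the sample entry are assumptions for illustration; the actual schema of the POST-AL database is not given here. The optional fields are simply absent when an entry has no example or story reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema and sample entry, for illustration only.
SAMPLE_XML = """<dictionary>
  <entry>
    <token>cise</token>
    <pos>noun</pos>
    <meaning>家</meaning>
  </entry>
</dictionary>"""


def load_entries(xml_text):
    """Parse dictionary entries into plain dicts; optional fields may be empty."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter("entry"):
        entries.append({
            "token": node.findtext("token"),
            "pos": node.findtext("pos"),
            "meaning": node.findtext("meaning"),
            # Usage examples and story reference are not present for all entries.
            "examples": [e.text for e in node.findall("example")],
            "story": node.findtext("story"),
        })
    return entries


print(load_entries(SAMPLE_XML))
```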

Dictionary Construction

Score Calculation

Scores are calculated as the balanced F-score (F1) for all parts of POST-AL.
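For reference, the balanced F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name, the argument names and the example counts are assumptions chosen for illustration and are not taken from the POST-AL evaluation.

```python
def f1_score(true_positives, predicted_total, gold_total):
    """Balanced F-score: harmonic mean of precision and recall."""
    if true_positives == 0 or predicted_total == 0 or gold_total == 0:
        return 0.0
    precision = true_positives / predicted_total
    recall = true_positives / gold_total
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 9 correct items out of 10 predicted, 12 in the gold standard.
print(round(f1_score(9, 10, 12), 4))
```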

Results

• Tokenization: DL-LSM was slightly better (98.46%).
• POS tagging: contextual POS tagging was much better (96.96%) than statistical (90.11%).
• Token translation: contextual token translation was much better (98.36%) than statistical (90.11%).

Evaluation

• Ainu is the language of the Ainu people, who mostly live in northern Japan.
• The Ainu population is about 23 thousand people.
• There are fewer than one hundred native speakers (Hohmann, 2008).
• The Ainu language is critically endangered (Moseley, 2010).

Purpose of this research:
↓ Contribute to the process of reviving the Ainu language.
↓ Help in linguistic and language anthropology research.
Create a part-of-speech tagger for the Ainu language (useful in any kind of language-related research).

Introduction

input → tokenization (DL-LSM | DL-P-LSM) → POS tagging (S-POST | CON-POST) → token translation (RAN-ToT | CON-ToT) → output

1. POS standard selection
• Nakagawa (1995): compact
• Tamura (1998): sophisticated
• Kirikae (2003): balanced
2. View selection
• Vertical (typical for POS taggers)
• Horizontal (familiar to language anthropologists)
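The two views can be pictured with a small sketch. It assumes the tagger yields (token, POS, translation) triples and guesses at the layouts: one token per line for the vertical view, and an aligned three-line interlinear block for the horizontal view. The exact formats produced by POST-AL are not specified here, and the function name, POS labels and glosses are illustrative placeholders.

```python
def render(tagged, view="vertical"):
    """Format (token, pos, translation) triples in one of two views."""
    if view == "vertical":
        # One token per line, tab-separated columns.
        return "\n".join("\t".join(row) for row in tagged)
    if view == "horizontal":
        # Interlinear layout: tokens, POS and translations on aligned lines.
        widths = [max(len(cell) for cell in row) for row in tagged]
        lines = []
        for layer in range(3):
            cells = [row[layer].ljust(w) for row, w in zip(tagged, widths)]
            lines.append(" ".join(cells).rstrip())
        return "\n".join(lines)
    raise ValueError(f"unknown view: {view}")


# Placeholder analysis, for illustration only.
TAGGED = [("cise", "noun", "house"), ("kor", "verb", "have")]
print(render(TAGGED, "vertical"))
print(render(TAGGED, "horizontal"))
```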

Image source: http://en.wikipedia.org/wiki/Ainu_people

Image source: http://www.amazon.co.jp