faculty of computer science & information technology named-entity... · 2015-07-20 · generic...

Faculty of Computer Science & Information Technology GENERIC NAMED-ENTITY RECOGNITION FOR INDIGENOUS LANGUAGES OF SARAWAK (NERSIL) YONG SOO FONG Master of Computer Science 2013

Post on 29-Jan-2020

Page 1: Faculty of Computer Science & Information Technology Named-Entity... · 2015-07-20 · GENERIC NAMED-ENTITY RECOGNITION FOR INDIGENOUS LANGUAGES OF SARAWAK (NERSIL) YONG SOO FONG

Faculty of Computer Science & Information Technology

GENERIC NAMED-ENTITY RECOGNITION FOR INDIGENOUS

LANGUAGES OF SARAWAK (NERSIL)

YONG SOO FONG

Master of Computer Science

2013

GENERIC NAMED-ENTITY RECOGNITION FOR INDIGENOUS LANGUAGES OF

SARAWAK (NERSIL)

YONG SOO FONG

A thesis submitted in

fulfillment of the requirements for the degree of Master of Computer Science

Faculty of Computer Science and Information Technology

UNIVERSITI MALAYSIA SARAWAK

2013

Declaration

No portion of the work referred to in this report has been submitted in support of an

application for another degree or qualification of this or any other university or institution

of higher learning.

………………………………….

YONG SOO FONG 24th September 2013

Acknowledgements

At the end of my thesis, I would like to thank everyone who made this thesis a success

and an unforgettable experience for me.

First and foremost, I would like to express my sincerest gratitude to my supervisor,

Assoc. Prof. Dr. Alvin Yeo Wee, for his constructive comments, and his strong support

throughout this work.

Secondly, I am extremely grateful to Assoc. Prof. Dr. Bali Ranaivo-Malançon. I

thank her for her guidance and great effort in training me in the field of computational

linguistics. I attribute my Master's degree to her encouragement and effort; without her, this

thesis would not have been completed.

I am thankful to my best friend, Amy Chong, for her selfless support,

encouragement and also for her grammatical editing of my thesis.

I would like to acknowledge the financial, academic and technical support of

Universiti Malaysia Sarawak, particularly the award of the Vice Chancellor's Research

Scholarship, which provided the necessary financial support for this research.

Finally, I take this opportunity to express my profound and deepest gratitude to

my beloved parents and my siblings for their love and continuous support, both

spiritually and financially.

Abstract

The aim of this research is to create the first Named Entity Recognition (NER) system for

the Sarawak Indigenous Languages (SILs), hereinafter called NERSIL. The main goal

of NERSIL is to achieve good accuracy in the identification and

classification of named entities (NEs). The NEs considered in this research are Person,

Location, Organisation, Date, Time, Monetary and Percentage. Generally, all these NEs

carry important information about the text itself. Thus, they are targets for extraction.

NER approaches can be categorised broadly as rule-based, machine learning-based,

and hybrid. The rule-based approach relies on hand-crafted

linguistic grammars. The machine learning-based approach needs a large amount of annotated

training data, which is unavailable for SILs. The hybrid approach combines the rule-based

and machine learning-based approaches. NERSIL requires special attention because the

existing NER approaches cannot be applied to SILs directly.

In this thesis, an NER system built by extending and modifying the existing NER

approaches is presented. There are three main processes: the non-modified ANNIE (A

Nearly-New IE system) NER, ANNIE adapted to SILs, and finally the context

investigation. Firstly, the input texts are submitted to an English NER system, in this case

ANNIE, on the assumption that some NEs that appear in English texts will also occur in

SIL texts. At that stage, the rules for unrecognised NEs are distinguished from the rules for

recognised NEs. Next, new rules for the unrecognised NEs are written and new

gazetteers for SILs are built in order to identify more NEs. However, the first two

processes are not enough to provide good accuracy in recognising all NEs. Thus,

context investigation is needed. Context investigation includes frequency analysis,

triggered words filtering, and concordance analysis. The context of an NE (the words to its

left or right) is investigated.

Finally, an NER system designed for SILs will be an advancement of world knowledge.

Besides, the design can be improved by incorporating machine translation and WordNet,

and by adding more noise filtering (e.g. context filtering and morphological filtering). With

more research and future studies, this NER system will eventually reach a high level of

performance like that of English NER systems.

Abstrak

Tujuan kerja tesis Sarjana ini adalah untuk menghasilkan sebuah sistem Named Entity

Recognition yang pertama (NER) untuk bahasa pribumi Sarawak, yang dipanggil

NERSIL. Matlamat utama NERSIL ialah untuk mendapatkan ketepatan yang baik

berhubung dengan pengenalpastian dan pengelasan entiti-entiti yang dinamakan (NEs).

NEs yang dipertimbangkan dalam kajian ini adalah Orang, Tempat, Pertubuhan, Tarikh,

Masa, Kewangan dan Peratus. Secara umumnya, semua NEs ini membawa maklumat

penting tentang teks sendiri. Oleh itu, terdapat sasaran untuk pengekstrakan.

Pendekatan NER boleh dikategorikan secara meluas sebagai pendekatan berdasarkan

peraturan, pendekatan berdasarkan pembelajaran mesin, dan pendekatan berdasarkan

hibrid. Pendekatan berdasarkan peraturan bergantung kepada tatabahasa linguistik.

Pendekatan berdasarkan pembelajaran mesin memerlukan sejumlah besar data latihan

beranotasi, yang buat masa ini tidak wujud untuk bahasa pribumi Sarawak. Pendekatan

berdasarkan hibrid adalah gabungan pendekatan berdasarkan peraturan dan pendekatan

berdasarkan pembelajaran mesin. NERSIL memerlukan pemerhatian khusus kerana ia

adalah mustahil untuk menggunakan terus dari pendekatan NER yang sedia ada.

Sistem NER yang dibangunkan dengan melanjutkan dan mengubahsuai pendekatan NER

yang wujud dibentangkan di dalam tesis ini. Terdapat tiga proses utama: ANNIE (A

Nearly-New IE system) yang tidak diubahsuai, ANNIE disesuaikan dengan bahasa

pribumi Sarawak dan akhirnya kajian konteks. Pertama, teks input telah diserahkan

kepada English NER, dari kes ini ANNIE dengan andaian bahawa sesetengah NEs

muncul dalam teks bahasa Inggeris juga akan berlaku dalam bahasa pribumi Sarawak.

Pada peringkat itu, peraturan untuk NEs tidak dikenali dibezakan dari peraturan NEs

yang diiktiraf. Seterusnya, peraturan baru untuk NEs tidak diiktiraf telah disenaraikan

dan gazetteer baru dibina untuk bahasa pribumi Sarawak supaya mengenalpasti lebih

banyak NEs. Bagaimanapun, dua proses pertama tidak cukup untuk memberikan

ketepatan yang baik dalam pengiktirafan semua NEs. Oleh itu, kajian konteks diperlukan.

Kajian konteks termasuk analisis kekerapan, penapisan perkataan dicetuskan, dan analisa

konkordans. Konteks NE (sebelah kiri atau kanan NE) akan dikaji.

Akhir sekali, bahasa sistem NER yang direka untuk bahasa pribumi Sarawak adalah satu

kemajuan bagi pengetahuan seluruh dunia. Selain itu, rekaan boleh diperbaiki dengan

menggunakan penterjemahan mesin, WordNet, dan menambah lebih banyak penapisan

(seperti penapisan konteks, dan penapisan morfologi). Dengan lebih banyak penyelidikan

dan kajian masa hadapan, sistem NER ini akan mencapai satu tahap prestasi yang tinggi

seperti English NER pada suatu masa nanti.

Table of Contents

Declaration
Acknowledgements
Abstract
Abstrak
Table of Contents
List of Published Papers
List of Figures
List of Tables
List of Abbreviations
Chapter 1 INTRODUCTION
1.1 Definitions: Named Entity (NE) and Named Entity Recognition (NER)
1.2 Background of SILs
1.3 Problem Statement
1.4 Objectives of the Study
1.5 Scope of the Study
1.6 Significance of the Study
1.7 Organisation of the Thesis
Chapter 2 LITERATURE REVIEW
2.1 Introduction
2.2 Named Entity Recognition
2.2.1 Named Entity (NE) Types
2.2.2 Problems in NEs
2.2.3 Applications of NER
2.3 Features of NEs
2.3.1 Word-level Features
2.3.2 List Lookup Features
2.3.3 Document and Corpus Features
2.4 NER Approaches
2.4.1 Rule-based Approach
2.4.2 Machine Learning-based Approach
2.4.3 Hybrid-based Approach
2.4.4 Summary of the Three Major NER Approaches
2.4.5 NER via Machine Translation
2.5 Some Existing NER Systems
2.5.1 ANNIE
2.5.2 Freeling
2.5.3 TextPro
2.5.4 ClearForest
2.5.5 Summary of the Existing NER Systems
2.6 Summary of the Literature Review of NER
Chapter 3 METHODOLOGY
3.1 Introduction
3.2 Define the Research Problems
3.3 Review the Literature
3.4 Propose a Solution to the Problems
3.4.1 NERSIL Overall Framework
3.4.2 Requirements
3.5 Collect data
3.6 Implement and Iteratively Improve
3.7 Evaluation and Discussion
3.8 Summary
Chapter 4 EXPERIMENTS, RESULTS ANALYSIS AND DISCUSSION
4.1 Introduction
4.2 Experiments Description and Setup
4.2.1 Data Set
4.2.2 Evaluation Metrics
4.3 Result Analysis on Iban Corpus
4.3.1 Results from Non-modified ANNIE NER
4.3.2 Results from Adapted ANNIE for Iban
4.3.3 Context Investigation: Results from Frequency Analysis
4.3.4 Context Investigation: Results from Triggered Words Filtering
4.3.5 Context Investigation: Results from Concordance Analysis
4.3.6 Performance of NERSIL
4.4 Results Analysis on Bau Bidayuh Corpus
4.4.1 Results from Non-modified ANNIE NER
4.4.2 Results from Adapted ANNIE for Bau Bidayuh
4.4.3 Context Investigation: Results from Frequency Analysis
4.4.4 Context Investigation: Results from Triggered Words Filtering
4.4.5 Context Investigation: Results from Concordance Analysis
4.4.6 Performance of NERSIL
4.5 Summary of the Results
4.6 Discussion
Chapter 5 CONCLUSION AND FUTURE WORK
5.1 Introduction
5.2 Research Contributions
5.3 Limitations
5.4 Future Works
5.5 Summary
References
Appendix A: The Most Frequent Top 30 Words in Iban Corpus
Appendix B: The Context of Iban Language
Appendix C: The Most Frequent Top 30 Words in Bau Bidayuh Corpus
Appendix D: The Context of Bau Bidayuh Language

List of Published Papers

1. Yong Soo Fong, Bali Ranaivo-Malançon, & Alvin Yeo Wee. “NERSIL – the Named-

Entity Recognition System for Iban Language”. The 25th Pacific Asia Conference on

Language, Information and Computation (PACLIC 25), Singapore, 16-18 December

2011.

2. Yong Soo Fong, Bali Ranaivo-Malançon, & Alvin Yeo Wee. “Discovering Triggered

Word for Iban-Entity Recogniser”. Proceedings of the Sixth International Workshop

on Malay and Indonesian Language Engineering (MALINDO 2012), Universiti

Malaysia Sarawak, 21 June 2012.

List of Figures

Figure 1.1: Organisation of the Thesis
Figure 2.1: 200 Extended Named Entity (ENE) Categories (Sekine & Nobata, 2004)
Figure 2.2: NER in Newswire Domain (Institute for InfoComm Research, 2004)
Figure 2.3: NER in Biomedical Domain (Institute for InfoComm Research, 2004)
Figure 2.4: NER Approaches
Figure 2.5: Tool Features of NER (Marrero et al., 2009)
Figure 2.6: Screenshot of ANNIE (Cunningham et al., 2012)
Figure 2.7: Screenshot of Freeling 3.0
Figure 2.8: Screenshot of TextPro
Figure 2.9: Screenshot of ClearForest
Figure 2.10: F-measure in Entity Identification and Classification (Marrero et al., 2009)
Figure 3.1: Research Methodology Process
Figure 3.2: Conceptual Design of the Proposed Framework
Figure 3.3: ABBYY FineReader's Process
Figure 3.4: Output of After Performing OCR and After Correction
Figure 3.5: ANNIE Works with the Set of Core PRs (Maynard, 2004)
Figure 3.6: Results of non-modified ANNIE NER on Iban Text
Figure 3.7: Adapted ANNIE to Iban
Figure 3.8: Context Investigation
Figure 3.9: Input and Output of Concordance Analysis
Figure 3.10: Screenshot of ABBYY FineReader
Figure 3.11: Screenshot of GATE's Framework (Cunningham et al., 2012)
Figure 3.12: Screenshot of VIM Editor
Figure 3.13: Screenshot of AntConc
Figure 4.1: NE Distribution in the Iban Corpus
Figure 4.2: Annotation Tool Using GATE's ANNIE System
Figure 4.3: NE Distribution in the Bau Bidayuh Corpus
Figure 4.4: Annotation Diff Tool
Figure 4.5: NEs recognised by Non-modified ANNIE NER (Iban Corpus)
Figure 4.6: Results from Adapted ANNIE for Iban (Iban Corpus)
Figure 4.7: The Relationship between the Frequency and Ranking (Iban Corpus)
Figure 4.8: Class of Triggered Word (Iban Corpus)
Figure 4.9: NEs recognised by Non-modified ANNIE NER (Bau Bidayuh Corpus)
Figure 4.10: NEs recognised by Adapted ANNIE for Bau Bidayuh (Bau Bidayuh Corpus)

List of Tables

Table 2.1: Output of Machine Translation (using NER) (Ishak et al., 2008)
Table 2.2: Word-level Features of NEs (Nadeau & Sekine, 2007)
Table 2.3: List Lookup Features of NEs (Nadeau & Sekine, 2007)
Table 2.4: Document and Corpus Features of NEs (Nadeau & Sekine, 2007)
Table 2.5: Results NER for Indonesian Language (Budi et al., 2005)
Table 2.6: Strengths and Weaknesses of Each Approach
Table 2.7: Results by Entity Type (Marrero et al., 2009)
Table 2.8: Default Resources of ANNIE (Cunningham et al., 2012)
Table 2.9: Analysis Services Available for Each Language (Padró & Stanilovsky, 2012)
Table 2.10: Summary of the Literature Review
Table 3.1: List of Contexts Features for Iban NEs
Table 3.2: List of Word-Level Features
Table 3.3: Examples of JAPE Rules for Iban Person
Table 3.4: Structure of JAPE Rules (Maynard, 2004)
Table 3.5: Hardware Requirements for Each of the Software
Table 3.6: Running Time Evaluation for ANNIE
Table 3.7: Summary of the Framework
Table 4.1: Total No. of Word Types, Total No. of Word Token, Size of Data Set
Table 4.2: Differences between Iban Text and English Text
Table 4.3: Number of Iban JAPE Rules
Table 4.4: Number of JAPE Rules which Reused and Created for Iban
Table 4.5: Gazetteers which Created for Iban
Table 4.6: The Most Frequently Occurring Words (Top Ten) (Iban Corpus)
Table 4.7: Results from Triggered Word Filtering (Iban Corpus)
Table 4.8: Type of Word for the Most Frequently Occurring Word (Top Ten) (Iban Corpus)
Table 4.9: The Most (top ten) frequently occurring in Five Different Sets of Data
Table 4.10: Probability of a Word at Rank r
Table 4.11: Probability of Triggered Word in Each Category of NEs Class (Iban Corpus)
Table 4.12: Performance of NERSIL (Iban Corpus)
Table 4.13: The Most Frequently Occurring Words (top ten) (Bau Bidayuh Corpus)
Table 4.14: Results from Triggered Words Filtering (Bau Bidayuh Corpus)
Table 4.15: Results from Native Speakers
Table 4.16: Performance of NERSIL (Bau Bidayuh Corpus)
Table 4.17: Summary of the Results

List of Abbreviations

The following is a list of abbreviations used in this thesis:

IE: Information Extraction

MUC: Message Understanding Conference

NER: Named Entity Recognition

GATE: General Architecture for Text Engineering

ANNIE: A Nearly-New Information Extraction System

NERSIL: Named Entity Recognition for Sarawak Indigenous Languages

NLP: Natural Language Processing

IR: Information Retrieval

HMM: Hidden Markov Models

MEMM: Maximum Entropy Markov Model

SILs: Sarawak Indigenous Languages

SVM: Support Vector Machines

Chapter 1 INTRODUCTION

1.1 Definitions: Named Entity (NE) and Named Entity Recognition (NER)

The number of online electronic documents is growing exponentially, with more

important information continuing to become available as text. As a result, it is very difficult to

identify the relevant information quickly and accurately. Because this identification task is

complex, it should be supported by computational tools. Many

technologies have been developed to deal with this tremendous amount of

information, such as Information Extraction (IE). Named Entity Recognition

(NER) is one of the important sub-tasks of IE. The NER process is divided into two

successive parts. The first part consists of identifying proper names in a given text. The

second part concerns the classification of these proper names into semantic classes such as

Person, Organisation, Location, Date, Time, Monetary and Percentage. Currently, much

work has been done in NER for English and other languages that are deemed “big” languages. This

has generated much interest among researchers in finding ways to develop NER for

Sarawak Indigenous Languages (SILs). The background of SILs will be described in

detail in Section 1.2.
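
The two successive parts of the NER process can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the NERSIL implementation: the capitalisation pattern, the toy gazetteer entries and the function names identify, classify and recognise are all introduced here for the example, whereas the thesis itself builds on ANNIE.

```python
import re

# Toy gazetteer: entries are illustrative assumptions, not the thesis's lists.
GAZETTEER = {
    "Sarawak": "Location",
    "Universiti Malaysia Sarawak": "Organisation",
    "Alvin Yeo Wee": "Person",
}

# A candidate proper name: one or more consecutive capitalised words.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def identify(text):
    """Part 1: return the candidate proper names found in the text."""
    return CANDIDATE.findall(text)


def classify(candidate):
    """Part 2: map a candidate to a semantic class, or 'Unknown'."""
    return GAZETTEER.get(candidate, "Unknown")


def recognise(text):
    """Run both parts and pair each candidate with its class."""
    return [(name, classify(name)) for name in identify(text)]


if __name__ == "__main__":
    print(recognise("the campus of Universiti Malaysia Sarawak is near Kuching"))
```

A candidate that is missing from the gazetteer, such as "Kuching" here, comes back as "Unknown". A sketch this simple would also pick up any capitalised sentence-initial word as a candidate, which is one reason capitalisation alone is not enough.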

The identification and classification of rigid designators such as name expressions,

numeral expressions and temporal expressions from raw texts are very important in

numerous Natural Language Processing (NLP) applications. According to

Collinsdictionary.com (2012), a rigid designator is an expression that identifies the

same individual in every possible world. For example, “Shakespeare” is a rigid

designator. This can be seen in the following sentence: “It is possible that Shakespeare might

not have been a playwright, but not that he might not have been Shakespeare” (Collinsdictionary.com,

2012). These rigid designators are called Named Entities (NEs), as defined by Kripke

(1980). Generally speaking, NEs are proper nouns. NEs are often used to name sports

and adventure activities, and as terms for biological species and substances. In addition,

different lists of NE types exist, such as the Message Understanding Conference (MUC)

list (Grishman & Sundheim, 1996), the Conference on Computational Natural Language

Learning (CoNLL) list (Sang & Meulder, 2003), and Sekine’s list (Sekine & Nobata,

2004). These differing lists can be confusing for researchers. Thus, NE types are explained

in more detail in Chapter 2.

NER approaches can be broadly divided into three main types: the rule-based approach, the machine learning-based approach, and the hybrid approach. The rule-based approach relies on hand-crafted linguistic grammars. The machine learning-based approach needs a huge amount of annotated training data, which is often unavailable for SILs. The hybrid approach combines the two in order to overcome the weaknesses of each. In general, the rule-based approach provides better results than the other two approaches when annotated training data are scarce. NERSIL requires special attention because the existing NER approaches cannot be applied directly.
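To make the rule-based approach more concrete, the following is a minimal sketch in Python. It is not the NERSIL implementation: the gazetteer entries, the two regular-expression patterns, and the function name `recognise` are all hypothetical and chosen only for illustration. It shows the two parts of the NER process together: a span is identified by a gazetteer look-up or a hand-crafted pattern, and the source that matched it supplies the semantic class.

```python
import re

# Hypothetical gazetteer: known names mapped to their NE class.
# These entries are illustrative only, not the thesis's gazetteers.
GAZETTEER = {
    "Kuching": "Location",
    "Sarawak": "Location",
    "Universiti Malaysia Sarawak": "Organisation",
}

# Hypothetical hand-crafted patterns for numeric expressions.
PATTERNS = [
    ("Percentage", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("Monetary", re.compile(r"\bRM\s?\d+(?:\.\d{2})?\b")),
]


def recognise(text):
    """Return non-overlapping (start, end, surface, label) tuples in text order."""
    found = []
    # Gazetteer look-up: an exact match both identifies and classifies a name.
    for name, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), m.end(), m.group(), label))
    # Pattern rules: each pattern is tied to one NE class.
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), m.group(), label))
    # Sort by start offset; at equal start, longer spans come first.
    found.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    result, last_end = [], -1
    for span in found:
        # Drop any span that overlaps one already accepted.
        if span[0] >= last_end:
            result.append(span)
            last_end = span[1]
    return result


if __name__ == "__main__":
    sample = "Universiti Malaysia Sarawak is in Kuching and 44% paid RM50."
    for start, end, surface, label in recognise(sample):
        print(f"{label}: {surface} [{start}:{end}]")
```

In this sketch, the overlap step keeps “Universiti Malaysia Sarawak” as an Organisation and discards the shorter “Sarawak” match inside it. A real rule set would also need context rules, which this example leaves out.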

In conclusion, it is possible to build a strong NER system for SILs by using a conventional rule-based approach and by extending and modifying the existing NER approaches.

1.2 Background of SILs

According to Dewan Bahasa Dan Pustaka (Malay for The Institute of Language and Literature), there are 63 indigenous languages in Sarawak, an East Malaysian state. According to the 2009 edition of Ethnologue, an encyclopaedia of the languages of the world, the number of individual languages listed for Malaysia (Sarawak) is 46 (Lewis, 2009). Of these 46 languages, 44 are living languages and 2 have no known speakers (Lewis, 2009). Examples of SILs are Iban, Bidayuh (Bau-Jagoi), and Melanau (Matu-Daro). These SILs have received relatively little research attention. Zahid (2008) attributed the lack of research on SILs to reliance on researchers from outside Sarawak, such as those from Peninsular Malaysia and other parts of the world. So far, the research that has been conducted has mostly been confined to collecting basic word lists, with the aim of gathering structural characteristics in terms of phonology, morphology and syntax.

Among the indigenous languages of Sarawak, Iban has received considerable research attention. The Iban are the largest ethnic group, making up about 44% of the population of Sarawak (Berita Publishing Group, 1994). The Iban language is the vernacular of the Iban people. The online Theborneopost.com (2011) reported that the Sarawak government continues to promote the Iban language as an international lingua franca. In 2008, the Iban language was introduced as one of the subjects for the fifth-year secondary school examination in Malaysia. In comparison with the other indigenous languages in Sarawak, Iban already has its own orthography system along with a few dictionaries and grammar books (Suhaila et al., 2008). Apart from the Malaysian government initiative, the Sarawak Language Technology (SaLT) research group has also embarked on research on the application of information and communication technology (ICT) in the preservation of SILs. Nevertheless, the Iban language is still considered an under-resourced language, although it has a few NLP tools such as a morphological analyzer and generator, a syntactic parser, a part-of-speech tagger, and a spell checker, and some of these tools are still work in progress.

Moreover, researchers face a number of challenges in developing NER for SILs because of certain restrictions. The restrictions stated by Zahid (2008) are as follows:

“SILs have not yet been systematically romanised”

“SILs are written in the Roman script likes many of its neighbouring languages such as Iban, Bidayuh, and Malay”

“SILs do not boast an extensive corpus that is able to provide a reliable resources about the language’s syntax, morphology and phonology”

“Have limited word list which fails to reveal the phonological, syntactic, and lexical variations of the language”

1.3 Problem Statement

NER systems are now available for European languages (English and French) and even for East Asian languages (Chinese, Japanese, Korean, and Vietnamese). However, for under-resourced languages such as the SILs, the problem of NER is still far from being solved. Although many insights can be gained from the methods used for English, there are still many issues that need to be considered. One significant issue is that researchers do not have deep linguistic knowledge of the SILs. In addition, linguistic resources are scarce. There are also the problems of non-standard spelling and spelling variation. Finally, no NER system for SILs exists. Thus, an approach for developing the first NER system for SILs is proposed.

To summarise, the problems are as follows:

i. Lack of standardisation of spelling, and variation in spelling

ii. No existing NER system for SILs. An NER system is an important component of many NLP applications such as information extraction, machine translation, and question answering. Thus, building NER for SILs (NERSIL) will open the possibility of creating many NLP applications for SILs.

iii. Researchers do not have deep linguistic knowledge of SILs

1.4 Objectives of the Study

With the problem description as a basis, the present research has the following objectives. The main objective of this research is to design and develop a generic NER system for SILs (NERSIL).

The specific objectives are as follows:

i. To define a framework for developing NERSIL

ii. To design generic rules and build gazetteers

iii. To investigate the contexts of NEs for SILs

iv. To automatically evaluate the accuracy of NERSIL
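The last objective can be illustrated with a short sketch. This section does not state how accuracy is measured, so the example assumes the precision, recall and F1 measures that are conventionally used for NER, computed over exact matches of span and label. The function name `score` and the sample annotations are hypothetical.

```python
def score(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) annotations.

    Assumption: an annotation counts as correct only when its boundaries and
    its label both agree with the gold standard.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    gold = [(0, 4, "Person"), (10, 17, "Location"), (20, 23, "Percentage")]
    predicted = [(0, 4, "Person"), (10, 17, "Organisation")]
    p, r, f = score(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this exact-match assumption, a span with the right boundaries but the wrong class counts as an error for both precision and recall.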

1.5 Scope of the Study

The scope of this study is limited to the NER field. Three types of NEs are considered: name expressions, numeric expressions, and time expressions. Name expressions include Person, Organisation, and Location. Numeric expressions include Monetary and Percentage. Time expressions include Time and Date. The target languages are two SILs: Iban and Bau Bidayuh.
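The NE types in scope can be written down as a small data structure. The sketch below is only one possible representation; the names `NE_TYPES` and `expression_group` are hypothetical, while the groups and classes are those listed in the scope above.

```python
# The three expression groups and seven NE classes named in the scope.
NE_TYPES = {
    "name": ("Person", "Organisation", "Location"),
    "numeric": ("Monetary", "Percentage"),
    "time": ("Time", "Date"),
}


def expression_group(label):
    """Return the expression group that an NE class belongs to."""
    for group, labels in NE_TYPES.items():
        if label in labels:
            return group
    raise ValueError(f"{label!r} is outside the scope of this study")


if __name__ == "__main__":
    for label in ("Person", "Percentage", "Date"):
        print(label, "->", expression_group(label))
```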

1.6 Significance of the Study

The significance of this study lies in proposing a solution for developing NER for SILs. The proposed solution may also apply to other under-resourced languages. This study will also give researchers and people who are interested in the local culture access to indigenous languages. This should encourage further work and, in turn, help to preserve the culture.

1.7 Organisation of the Thesis

This thesis is divided into five chapters: the introduction, the literature review, the methodology, the results analysis and discussion, and the conclusion and future work.

Chapter One provides an overview of this dissertation by providing a definition of NE and NER, the background of SILs, the problem statement, the objectives of the present study, and the scope as well as the significance of the research.

Chapter Two presents the background of NER, NEs, and NER applications. A review of NER approaches and existing NER systems is also presented. At the end of the chapter, a summary of the literature review is outlined.

Chapter Three lays out the research methodology. Each step in the proposed framework is discussed in detail. This chapter also describes the environment required to implement the proposed framework, such as the software and hardware requirements.

Chapter Four covers the setup of the experiments. The results and analysis are shown through several graphs and tables, followed by a discussion.

Chapter Five concludes this dissertation by restating the contributions, the limitations, and the ideas retained for future work.

Figure 1.1 presents the big picture of this research: what, why, and how. The figure also summarises the background information, so that readers can understand the relevance of the work and become familiar with the NER area.

[Figure 1.1, summary of chapters: a what/why/how diagram covering Chapters 1 to 5. Its labels include Natural Language Processing, Information Extraction and Named Entity Recognition; named entity types (Person, Location, Organization, Time, Date, Monetary, Percentage); NER applications (Machine Translation, Question Answering, Automatic Text Summarization); NER approaches (rule-based with patterns and gazetteers; machine learning with HMM, CRF, SVM, MEM; hybrid); existing NER systems (ANNIE); the proposed framework (pre-processing, gazetteer building, rule building, context investigation, frequency analysis, concordance analysis, trigger-word filtering, ANNIE adapted to Iban/Bau Bidayuh); result analysis and discussion; and conclusion and future work.]

Figure 1.1: Organisation of the Thesis

Chapter 2 LITERATURE REVIEW

2.1 Introduction

In the previous chapter, the definitions of NE and NER and the background of the SILs were briefly introduced. In addition, the problem statement, objectives, scope and significance of the study were identified. In this chapter, NER is studied in more detail. The chapter covers NE types, problems in NEs, and applications of NER. Moreover, the achievements and limitations of recent work on NER approaches, as well as some existing NER systems, are reviewed.

2.2 Named Entity Recognition

Nowadays, most knowledge is stored and communicated as natural language text. Many resources are freely available on the Internet. To make this knowledge available in a structured form for deeper analysis, technologies from the field of IE are necessary. NER is a fundamental task in IE.

In the 1990s, NER was successfully applied to English following evaluation conferences such as the Message Understanding Conference (MUC). The reason for this success was that English has very rich tagged corpora. In addition, researchers obtained good linguistic insights into the use of NEs. Thus, English is the most popular