semantic frame based automatic extraction of typological ...1371627/fulltext01.pdf · the automatic...

Semantic Frame Based Automatic Extraction ofTypological Information from Descriptive

GrammarsInstitutionen för informationsteknologiExamensarbete i datavetenskap 30hp

Avancerad nivå

Muhammad Irfan Aslam

November 19, 2019

Abstract

This thesis project addresses the machine learning (ML) modelling aspects of the problemof automatically extracting typological linguistic information of natural languages spokenin South Asia from annotated descriptive grammars. Without getting stuck into the theoryand methods of Natural Language Processing (NLP), the focus has been to develop andtest a machine learning (ML) model dedicated to the information extraction part. Start-ing with the existing state-of-the-art frameworks to get labelled training data through thestructured representation of the descriptive grammars, the problem has been modelled asa supervised ML classification task where the annotated text is provided as input and theobjective is to classify the input to one of the pre-learned labels. The approach has been tosystematically explore the data to develop understanding of the problem domain and thenevaluate a set of four potential ML algorithms using predetermined performance metricsnamely: accuracy, recall, precision and f-score. It turned out that the problem splits upinto two independent classification tasks: binary classification task and multiclass classi-fication task. The four selected algorithms: Decision Trees, Naïve Bayes, Support VectorMachines, and Logistic Regression belonging to both linear and non-linear families ofML models are independently trained and compared for both classification tasks. Usingstratified 10-fold cross validation performance metrics are measured and the candidate al-gorithms are compared. Logistic Regression provided overall best results with DecisionTree as the close follow up. Finally, the Logistic Regression model was selected for furtherfine tuning and used in a web demo for typological information extraction tool developedto show the usability of the ML model in the field.Keywords: Automatic Information Extraction, Spoken Languages, Typological LinguisticInformation, Logistic Regression, Classification

Dedication

I dedicate this thesis project to my father Malik Muhamamd Aslam and mother ShahnazAkhter. Although my father is no longer in this world, his memories continue to regulatemy life. I should also mention my wife and children who have supported me throughoutthe process.

Acknowledgements

First and foremost, I thank Allah Almighty for letting me live to see this thesis through.I must acknowledge and extend my special thanks to my best friend Muhammad Azambeing there for me throughout for valuable discussions. I am forever grateful to my su-pervisor Shafqat Mumtaz Virk, for his unwavering support, encouragements and patiencethrough this process. I am very grateful to Mikael Berndtsson, Ronnie Johanson, JonasMellin, Juhee Bae for their time and ideas; Thank you for your support and helpful sug-gestions, I will be forever thankful to you.

Not least of all, I owe so much to my whole family for their undying support, theirunwavering belief that I can achieve so much. Unfortunately, I cannot thank everyone byname because it would take a lifetime but, I just want you all to know that you count somuch. Had it not been for all your prayers and benedictions; were it not for your sincerelove and help, I would never have completed this thesis. Thank you all!

Contents

1 Introduction 6

2 Background and Related Work 82.1 Automatic Information Extraction . . . . . . . . . . . . . . . . . . . . . 82.2 What is Linguistic Typology? . . . . . . . . . . . . . . . . . . . . . . . . 92.3 What are Descriptive Grammars? . . . . . . . . . . . . . . . . . . . . . . 92.4 A brief description of Frame Semantics . . . . . . . . . . . . . . . . . . 92.5 FrameNet: A lexico-semantic Resource . . . . . . . . . . . . . . . . . . 102.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Problem 123.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.3 Scope and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Method 194.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1.1 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.1.2 Generating Training Data . . . . . . . . . . . . . . . . . . . . . . 20

4.2 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.2.1 Frame-element Identification Dataset . . . . . . . . . . . . . . . 214.2.2 Frame-element Classification Dataset . . . . . . . . . . . . . . . 234.2.3 Bivariate Relationship Analysis . . . . . . . . . . . . . . . . . . 23

4.3 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.3.1 Label Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.3.2 One Hot Encoding . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.4 Machine Learning Modelling . . . . . . . . . . . . . . . . . . . . . . . . 294.4.1 Formulating the Machine Learning Task . . . . . . . . . . . . . . 29

4

4.4.2 Selection of the Evaluation Metrics . . . . . . . . . . . . . . . . 304.4.3 Choosing Appropriate Machine Learning Algorithms . . . . . . . 334.4.4 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Discussion of Results and Future Work 365.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.1.1 Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375.1.2 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.2 Developing a Machine Learning Model . . . . . . . . . . . . . . . . . . 385.2.1 Summary of the Findings . . . . . . . . . . . . . . . . . . . . . . 405.2.2 Optimizing and Tuning the Best Model . . . . . . . . . . . . . . 40

5.3 Web Demo: Typological Feature Extraction System . . . . . . . . . . . . 435.4 Concluding Remarks and Future Work . . . . . . . . . . . . . . . . . . . 45

5

Chapter 1

Introduction

Natural languages of the world are large and complex in their form and structure. Thisdiversity of these languages is of interest for linguistics and social scientists who are inter-ested in finding relations between language and society. A deeper understanding of theirstructure and diversity can potentially provide answers to many of the philosophical andsocio-cultural unanswered questions. In order to develop this understanding, on the onehand, it is required to explore them for their richness and versatility. While on the otherhand, it is also important to figure out their connections to each other and to the society.Thus, instead of studying them individually, that is, just one language at a time, it is desir-able to compare several languages thoroughly. In this way, it becomes possible to trace thehistory of human generations and to understand the processing machinery of their brains[4]. However, to count, study and compare the languages at deeper and wider scales is nottrivial. Manual study by humans can not reach very far in achieving this goal. Even, it isarguably beyond traditional computational techniques.

However, this does not mean that there is no progress in this direction. The restlessnature of humans keep them motivated to adapt different ways to solve their problems,and this domain is no exception. There has been continuous attempts to count and keepa record of the natural languages spoken around the world. One such recent successfulattempt is Ethnologue1, which is a well-known inventory. Together with the counts ofworld’s living and dead languages, Ethnologue also records some basic information onthose languages. There are also some other open-access databases which contain typolog-ical (i.e., related to the type of language based on structural and functional attributes of thelanguage) information. Examples include World Atlas of Language Structures (WALS),2

1https://www.ethnologue.com/2wals.info

6

wals.info

the Atlas of Pidgin and Creole Language Structures (APiCS),3 the South American In-digenous Language Structures (SAILS)4 and the Phonetics Information base and Lexicon(PH)5.

Historically, the development of typological databases such as the ones mentionedabove has involved manual reading and extraction of information from descriptive gram-mars. As can be imagined, this makes it very labor and time consuming task. Further, theextracted information is limited by human brain processing capabilities. Consequently,most of detailed information about languages still remains in countless number of docu-ments describing the languages, and a wider scale comparison of languages is still waiting.

One can exploit the advancements in technology and devise tools and methodologiesto automatize the whole process of information extraction, and to do a large-scale system-atic comparison of world’s languages. The story has already begun, and as a first step anincreasing number of non-digitally born descriptive grammars and other historical doc-uments on languages are being digitized. In addition to this, a huge amount of data onlanguages is being digitally produced every passing day. The next step is to exploit com-putational powers and methodologies to explore and analyze the data to find answers tomany of the typological linguistic questions. This is exactly, what is the overall majoraim of the research in this direction and this thesis project can be considered a step to-wards achieving those goals. Starting with existing state-of-the-art frameworks to get thestructured representation of the descriptive grammars, the main objective is to focus onthe automatic information extraction from these representation. Without getting stuck intothe theory and methods of Natural Language Processing (NLP), the goal is to develop amachine learning model dedicated to the information extraction part.

Chapter 2 of this thesis presents the description of automatic information extraction,and provides basic knowledge about typological linguistic and descriptive grammars. Italso introduces the frameworks necessary to get the structured representation of the de-scriptive grammars. In Chapter 3, a detailed description of the problem is provided andformally the aim, scope and objective for this thesis project are defined. Then, data, relatedresources, a systematic exploration of the data and choice of evaluation metrics togetherwith the machine learning models selection is presented in Chapter 4. Finally, Chapter 5concludes the results by summarizing the findings and some pointers to extend this work.

3apics.org4sails.clld.org5phoible.org

7

apics.org

sails.clld.org

phoible.org

Chapter 2

Background and Related Work

In this chapter we start by briefing the process of automatic information extraction andhighlight the concepts particularly related to this thesis project within the broader scopeof automatic extraction of typological and other linguistic information from descriptivegrammars of natural languages. Next, it is natural to continue with the definitions oflinguistic typology and descriptive grammars. Then, we provide simple details of the NLPframework used to achieve structured representation of the natural languages used in thisthesis project.

2.1 Automatic Information ExtractionAutomatic Information Extraction (IE) is the process of starting with machine readabledocuments and then automatically extracting required information in a structured formatfrom them. The input documents are sometimes completely unstructured, however, mostlysome preprocessing steps are carried out to introduce some preliminary structure in thedocuments so that at the later stages, one can take full advantage of information extractionprocess. In the area of open information extraction, the objective is to extract whateverinformation can be extracted from the document, while in the other extreme case, infor-mation about a well-defined facts over a narrow domain are to extracted.

In a special case, to be targeted in this thesis, the target documents are about humanlanguages, and the objective is to create a structured view of the information present inthese documents. Starting with the natural language processing techniques to introduce theinitial structure in the documents, the end goal, which is the main focus of this thesis work,is to apply machine learning for automatic information extraction of specific features of alanguage. An example could be to extract information about order of words in a language.The sentences in a language are composed of words of various morphological categories

8

and grammatical categories. The most obvious categories are of object (O), subject (S)and verb (V). In a sentence, the order could be either SOV, SVO, VSO, VOS, OVS, orOSV. One could target to automatically extract information about the specific word orderof a target language from the descriptive grammars.

2.2 What is Linguistic Typology?There exist more than 7000 natural languages in the world. For various purposes, theselanguages have been grouped into different families and branches. Their classification intovarious groups/families is not random rather is based on certain attributes of the languages.The area of linguistics which deals with comparison and classification of languages basedon their structural features is known as linguistic typology. The objectives of this area areto find commonalities between, and to explore diversity across the world’s languages andto explain these in historical and/or universal terms. [2]. The area has a long history of itsexistence and is closely related to areas of genetic and areal linguistics.

2.3 What are Descriptive Grammars?The descriptive grammars are plain text descriptive description of various phonological,morphological, and grammatical attributes of a given natural language. These descriptionsare written by linguists based on their investigations of a particular language of its lin-guistic characteristics at the phonological, morphological, syntactic, and semantic levels[1]. One can find an uncountable number of books, articles, thesis, monograms writtenan languages of the world. These descriptions contains very valuable knowledge aboutlanguages of the world which can be exploited to compare languages of the world to findsimilarities/differences between world’s languages and their connections to each other, andto the species that speak them.

2.4 A brief description of Frame SemanticsFrame semantics is a theory of meaning in language introduced by Charles Filmore andhis colleagues [7, 8, 9]. The theory is based on the notion that meanings of words canbe best understood when studied in connection with the situations to which they belong,and/or in which they may occur.

The backbone of the theory is a conceptual structure called a semantic frame, whichis a script-like description of a prototypical situation, an event, an object, or a relation.

9

As an example, consider a real life scenario of robbery – a situation in which someone (aperpetrator) wrongs a victim by taking something (Goods) from them. In Frame semanticsworld this situation could be represented in a structured way and the resulting structure iscalled a semantic frame. The participants of the situation (i.e. the perpetrator, the victim,and the goods) are called as frame-elements. In addition to these core elements, things likethe place where the robbery took place, the manner in which it took place could also bemade part of the semantic frame. Now, with the availability of a structured representationof the robbery situation, the words like hold up, mug, ransack, rifle, rob, stick up can bebetter understood and analyzed.

2.5 FrameNet: A lexico-semantic ResourceThe development of a lexico-semantic resource – FrameNet [10] – based on the theoryof frame semantics was initiated in 1998 for English. In this lexical resource, generallyreferred to as simply FrameNet or Berkeley FrameNet (BFN), each of the semantic frameshas a set of associated words (or triggers) which can evoke that particular semantic frame.The linguistic expressions for participants, props, and other characteristic elements of thesituations (called frame elements) are also identified for each frame. In addition, each se-mantic frame is accompanied by example sentences taken from naturally occurring naturallanguage text, annotated with triggers, frame elements and other linguistic information.The frames are also linked to each other based on a set of conceptual relations makingthem a network of connected frames, hence the name FrameNet.

Because of their usefulness, framenets have also been developed for a number of otherlanguages (Chinese, French, German, Hebrew, Korean, Italian, Japanese, Portuguese,Spanish, and Swedish), using the BFN model. This long standing effort has contributedextensively to the investigation of various semantic characteristics of many languages at in-dividual levels, even though most cross-linguistic and universal aspects of the BFN modeland its theoretical basis still remain to be explored.

In the context of deploying it in NLP applications, BFN and other framenets have of-ten been criticized for their limited coverage. A solution to this problem this is to developdomain-specific (sub-language) framenets to complement the corresponding general-languageframenets for particular NLP tasks. In the literature we find such initiatives covering vari-ous domains, e.g.:

1. a framenet to cover medical terminology [16];

2. Kicktionary,1 a soccer language framenet;

1http://www.kicktionary.de/

10

http://www.kicktionary.de/

3. the Copa 2014 project, covering the domains of soccer, tourism and the World Cupin Brazilian Portuguese, English and Spanish [17].

On the applications side, it has been used in a number of natural language processing(NLP) tasks such question answering [11], coreference resolution [12], paraphrase extrac-tion [13], machine translation [14], and information extraction [15].

2.6 Related WorkThe area of automatic extraction of linguistic information from descriptive grammars isstill in its early stages and to the best of our knowledge the only work reported in thisdirection is [2, 3, 5, 16]. Among these, in [16], the authors have shared their experimentswith pattern and syntactic parsing based methods. The approach and the results presentedtherein are promising. However, such methods seem quite restricted and cannot be ex-tended beyond certain limits.

On the other hand, the areas of frame-semantics [7, 8, 9] and frame semantic parsing[19, 20, 21] , on which the data generation work for this thesis project is based, are wellmatured. Frame semantics was introduced by Charles J. Fillmore and colleagues back inearly 70’s, which later became basis for the development of a lexico-semantic resource –FrameNet. The criticism regarding the limited coverage of framenets paved way for thedevelopment of domain specific framenets such as those listed in the previous section.

11

Chapter 3

Problem

3.1 DescriptionThough there is a lot of research which resulted in the development of methodologies aswell as tools for automatic extraction of information from textual data, there has not beena rich literature available in the area of automatic linguistic information extraction. Thebasic reason being that the area is very young.

Språkbanken (the Swedish Language Bank) is a nationally and internationally ac-knowledged research unit at the Department of Swedish, University of Gothenburg, Swe-den. To address the limitations of the current state-of-the-art, researchers at Språkbankenare building resources and methodologies for automatic extraction of typological informa-tion from descriptive grammars as a part of two projects:

• An European Union project DReaM1

• A Swedish Research council project LSI2.

For the purpose of structured representation of the descriptive grammars and automaticinformation extraction, the project is relying on frame-semantics and its associated lexico-semantic resource i.e., FrameNet (see Chapter 2 for some details on frame-semantics andthe FrameNet).

Without going into any development details, below is a list of major modules thatneeds to be completed in order to develop a system for automatic extraction of linguisticinformation from descriptive grammars.

1https://spraakbanken.gu.se/eng/dream2https://spraakbanken.gu.se/eng/research/lsi

12

1. Part-I: Linguistic Frames Development (LingFN) Build a set of semantic framesfor the linguistic domain. This part has been completed independent of this thesis. Anumber of frames have already been developed and the work has been reported in [18].

2. Part-II: Data Annotation and Training Data Generation Annotate manually a set ofdescriptive grammars with the frames developed in Part-I. These annotations form thebasis for the training data generation for modelling purposes. Each annotated sentenceis parsed with the Stanford parser [22] resulting in parse trees (one such tree is shownin Figure 3.2). Each node of such a tree then becomes a training instance for whicha set of features (shown in Table 3.1) is computed. Each feature vector is labelledwhether it is a frame-element, and if so, also with the class(type) of the correspondingframe-element.

3. Part-III: Machine Learning Modelling Using labelled feature vectors from Part-II,machine learning models are trained on those features. This involves learning the dataencoding, evaluating different models, and then selecting the best performing model,which will later be used to automatically annotate the un-annotated data. These auto-matic annotations are later to be used for typological information extraction (Part-IV).

4. Part-IV: Typological Information Extraction The automatic annotation of new datameans that one can retrieve the list of linguistic frames and their frame-elements fromthe descriptive grammars. This information then can be used to formulate the answersto the typological questions of interest. A web-demo is planned for this part.

13

Figure 3.1: System Architecture

Figure 3.1 provides a complete overview of how the tasks defined in the above four mod-ules are interconnected to fulfill the overall aim and objectives.

For better understanding, let us take an example and see how with the help of devel-oped semantic-frames and machine learning model, the project aim is envisioned to beachieved. Consider the following sentence taken from a descriptive grammar of a particu-lar language:

14

Figure 3.2: Example Parse Tree

The genitive sometimes agrees with the qualified noun in gender, as is also the case inGondi.

Now, suppose we have developed a frame called ’Agreement’ with the following Structure,and the lemma ’agree’ as one of its lexical units.

AgreementParticipant_1Participant_2Grammatical_CategoryDegreeFrequencyLanguage_VarietyReference_LanguageCondition

With this structure in hand, the above given sentence can be hand-annotated as below:

[The genitive]Participant_1 [sometimes]Frequency [agrees]LU with [the qualified noun]Participant_2[in gender]Grammatical_Category , as is also the case in [Gondi]Reference_Language .

15

What we have achieved by developing the ‘Agreement’ semantic frame (Part-I), andthe above given annotation (Part-II) is that we have represented the knowledge within thesentence in a structured way. Collecting several similar annotations forms the basis for themachine learning modelling part. That is, given we have enough training examples, wecan train models which will learn the annotations, and later the learned models can be usedfor automatic annotations of the unannotated text (Part-III). Those annotations can thenbe used to retrieve a list of linguist frames and their frame-elements which later can beused to extract the typological and other linguistic information from descriptive grammars(Part-IV).

3.2 AimThe latest attempts in formulating a structured representation of the descriptive grammarsfollowed by the manual annotations, as described in the previous section, has provided theresearchers with an enormous amount of data ready to be used in modern machine learningbased modelling. This is a critical step in the process of automatic information extractionfrom description grammars, since the success of the aforementioned bigger projects atSpråkbanken is subject to the successful implementation of a machine learning model.Therefore, the aim of this thesis project is:

• to systematically explore the annotated data of the languages spoken in South Asiaand compare machine learning algorithms trained on this data to construct a modelwhich effectively performs identification and classification of frame-elements with-out using explicit instructions, relying just on the patterns learned from the data.

This will also result in the creation of useful annotated data-sources which will also bevery useful for other researchers working in the same or similar areas of natural languageprocessing.

3.3 Scope and ObjectivesObviously, the overall scope and objectives of the two above mentioned funded projects atSpråkbanken are bigger. Here we describe the focus of this thesis project with referenceto the major modules outlined under Section 3.1.

Since achieving the aim is solely a data driven task, the scope must be defined by theappropriate data source to generate training data (Part-II). The main data source selectedfor this thesis project is Grierson’s [6] classical Linguistic Survey of India (LSI) covering

16

the languages spoken in South Asia. A set of around 70 selected descriptive grammarsfrom the LSI data were annotated (Part-II) using the frames developed as part of the workdone in Part-I. Further details about data and related resources are provided in Section 4.1.

This thesis project primarily covers Part-III, together with limited contributions inParts-II and IV. Part-III involves frame-element identification and frame-element classi-fication using the training data. These are the necessary components of a frame-semanticparser as reported in previous literature [21]. The frame-element identification and classi-fication steps form the core steps to achieve the aim of this project and the Part-III is themain module which deals with these.

For the sake of clarity, following objectives are to be met for successfully achievingthe aim in this thesis project.

1. To develop understanding of the data and generate training instances by

i. Annotating a set of descriptive grammars using the existing as well as newly de-veloped linguistic frames to generate training data.

ii. Exploring the data and summarizing relationships as well as descriptive statistics.

2. Developing a machine learning model trained using the generated data to performframe-element identification and classification tasks by

i. Performing data encoding and preparing the data for the modelling.

ii. Comparing of a set of representative set of machine learning algorithms for mod-elling and comparing their performance using a selection of relevant metrics.

iii. Optimize and tune the best algorithm to train a final model.

3. Evaluating the model in a web demo.

i. Using the trained model to perform predictions on the unseen data: Frame-elementidentification and classification.

17

# Feature Explanation Example Feature Value1 target_lemma Lemmatized form of the tar-

get wordagree

2 target_pos Part of speech (POS) tag ofthe target_lemma

VBP

3 arg_word The head word of the argu-ment node

nouns

4 arg_word_pos POS tag of the arg_word NNS5 right_word The right most dependent

word of the argument nodethe

6 right_word_pos POS tag of the right_word DT7 left_word The left most dependent word

of the argument nodeNA

8 left_word_pos POS tag of the left_word NA9 parent_word Head word of the parent node

of the targetagree

10 parent_word_pos POS tag of the parent_word VBP11 c_subcat Subcategorization frame

corresponding to the phrasestructure rule used to expandthe phrase around the target

VP->VBP PP

12 phrase_type Phrase type of the argumentnode

NP

13 position Position of the argument wrttarget word

14 fes_list List of frame-elements of thetriggered frame

(Participant_1, Participant_2,Grammatical_Category,Degree, Frequency, Lan-guage_Variety, Refer-ence_Language, Condition)

15 gov_cat The governing category eitherS or VP

VP

Table 3.1: Feature Set

18

Chapter 4

Method

In order to achieve the objectives specified in Section 3.3, here in this chapter, we iden-tify the relevant methods. Starting with data selection and training instances generationafter annotation, a careful selection of the relevant methods for every objective is cov-ered separately with necessary description and motivations thereby providing a systematicendeavour to address the issues of the corresponding objective.

4.1 DataAs briefly mentioned under the scope, the major data source for the project is Grierson’s[6] classical Linguistic Survey of India (LSI). The LSI presents a comprehensive survey ofthe languages spoken in South Asia conducted in the late nineteenth and the early twenti-eth century by the British government. Under the supervision of George A. Grierson, thesurvey resulted into a detailed report comprising 19 books comprising around 9500 pagesin total. The survey covered 723 linguistic varieties representing major language fami-lies and some unclassified languages, of almost the whole of nineteenth-century British-controlled India (modern Pakistan, India, Bangladesh, and parts of Burma). Språkbanekenresearchers have scanned, OCRed, and stored major parts of the survey accessible throughthe Korp1 – a corpus infrastructure.

1https://spraakbanken.gu.se/korp/

19

4.1.1 AnnotationFor the annotation purposes, an annotation tool from the Brazilian FrameNet project2 isused, a copy of which was installed on Språkbanken servers and access had been grantedfor data generation tasks.

A total of 70 descriptive grammars from the LSI data were annotated to generate thetraining data. This annotation process is a collective work and a number of data annotatorswere involved in this step. Each data annotator was responsible for 6 documents. Thelength of each document is between 90 to 155 sentences.

As part of this thesis project, 6 documents with a total length of 579 sentences wereannotated. Since, the project aim is mainly focused on the machine learning part, a smallscale participation in the annotation process was carried out to develop understanding ofthe underlying process of the data annotation and the training dataset generation.

4.1.2 Generating Training DataIn Section 3.1, Part-II already outlines, at a high level, the process of generating trainingdata after performing annotations. To add to that, it should be stated that following thestandard practice, this step was carried out using the already developed state-of-the-arttools [22] which do not depend on any machine learning modeling expertise, rather, adeep linguistic understanding of the problem domain is required. For this thesis project,this has been provided by the domain experts (supervisor and other researchers in thegroup). Therefore, here we do not dig deeper into the dataset generation process whichis more on the linguistic side, and keep our focus towards the data exploration and themachine learning modelling parts which are the core of the aim of this thesis.

Table 3.1 provides the list of 15 features which together describe every sample in thedataset. All examples given in the last column of the table are with respect to the parsetree shown in Figure 3.2 with ’agree’ as the target word and the NP node referring to ’thequalified nouns’ (the one enclosed within the dotted area) as the argument node.

The sentence by sentence annotation, of the entire set of documents chosen for the gen-eration of training data, resulted in a labelled dataset organized in a tabular form with theextracted features as the independent variables (predictors) and the label (target/dependentvariable) defining whether it is a frame-element (Y/N), and if so, also providing the classof the corresponding frame-element. This defines two versions of the dataset. One whichwill be used for the frame-element identification task while the other for classifying theframe-elements into their corresponding types(classes).

2http://www.ufjf.br/framenetbr-eng/

20

A random sample of 5 records from the dataset is presented in Table 4.1. The firstrow lists all possible features followed by the last column label which defines whether thefeature values correspond to a frame-element using the letters Y or N. This dataset cor-responds to the frame-element identification task and contains 92,479 training instances.Those samples which are labelled as Y are later separately tagged with the class of thecorresponding frame-element, thus forming another dataset corresponding to the frame-element classification task. Now, the label column provides the class of the correspondingframe-element as shown in a sample of 5 records in Table 4.2. In total, there are 6,344samples in the training data for the classification task.

target_lemma target_pos arg_word arg_word_pos right_word right_word_pos left_word left_word_pos parent_word parent_word_pos c_subcat phrase_type position fes_list gov_cat labelverb NNS also RB Default Default about RB walked VBD NP->JJNNS ADVP R fe_language_variety#and#fe_data VP Nplural NN is VBZ Default Default was VBD past NN NP->NNNN SBAR L fe_subclass#and#fe_data#and#fe_data_translation ROOT Yplural NN past NN . . The DT ROOT ROOT NP->NNNN ROOT O fe_subclass#and#fe_data#and#fe_data_translation ROOT Noblique JJ ag NN Default Default twai-na NN twai NN ADJP->JJJJ NP R fe_sublass#and#fe_data#and#fe_data_translation VP Ydecline VBD like IN Default Default Default Default like IN VP->VBDPP IN R fe_inflectional_scheme#and#fe_form VP N

Table 4.1: A Sample from the Frame-element identification dataset

target_lemma target_pos arg_word arg_word_pos right_word right_word_pos left_word left_word_pos parent_word parent_word_pos c_subcat phrase_type position fes_list gov_cat labelverb VB tong NN Default Default Default Default tong NN VP->VBSBAR NN R fe_data#and#fe_data_translation#and#fe_subclass VP datapronoun NNS Relative JJ Default Default Default Default pronouns NNS NP->DTJJNNS JJ L fe_language_variety#and#fe_data VP sublassprefix NNS towards IN Default Default signifying VBG Hon NNP NP->DTVBGNNS UCP R fe_subclass#and#fe_data#and#fe_language_variety VP fe_Data_Translationsuffix NN yo NN Default Default Default Default ya NN NP->DTNN NP L fe_subclass#and#fe_language_variety VP datapronoun NN who WP Default Default Default Default who WP NP->JJNNNN WP R fe_language_variety#and#fe_data#and#fe_data_tr ROOT data_translation

Table 4.2: A sample from the frame-element classification dataset

4.2 Data ExplorationWe explore both the datasets in detail. In the following instance, record, sample and rowmean the same.

4.2.1 Frame-element Identification DatasetFor the frame-element identification dataset, we found that out of the 92,479 samples,86,136 (93.14%) are labelled as N, and the remaining 6,343 (6.86%) are labelled Y. Furtherexploration revealed that the dataset has 10,601 (11.46%) duplicate rows. We consulted thedomain experts who advised to remove all the duplicates. After cleaning, i.e., removingthe duplicates, 81,878 instances were left, which is 88.54% of 92,479 and the updateddistribution of the labels in the dataset became 76,036 (92.86%) cases for N, and 5,842(7.14%) cases for Y. Table 4.3 provides a summary of the cleaned data. There are nomissing values for any of the variables in the dataset. All of the variables are categorical;any variable that is not quantitative is called a categorical variable (sometimes also called

21

variable n_missing type distinct_values mode frequency percentarg_word 0 CAT 4776 , 3970 4.85parent_word 0 CAT 3941 is 7021 8.57left_word 0 CAT 2569 Default 49558 60.53right_word 0 CAT 1152 Default 72765 88.87c_subcat 0 CAT 252 NP-DTJJNN 10684 13.05target_lemma 0 CAT 91 suffix 13171 16.09phrase_type 0 CAT 70 NP 17796 21.73left_word_pos 0 CAT 45 Default 49558 60.53arg_word_pos 0 CAT 44 NN 18967 23.16parent_word_pos 0 CAT 40 NN 23517 28.72right_word_pos 0 CAT 37 Default 72764 88.87fes_list 0 CAT 30 fe_subclass, fe_data, fe_language_variety 14974 18.29target_pos 0 CAT 12 NN 35895 43.84gov_cat 0 CAT 5 VP 48326 59.02position 0 CAT 3 R 44751 54.66label 0 CAT 2 N 76036 92.86

Table 4.3: Summary of the Frame-element Identification Dataset

variable n_missing type distinct_values mode frequency percentarg_word 0 CAT 1936 of 97 1.66parent_word 0 CAT 1761 is 195 3.33left_word 0 CAT 377 Default 5086 86.87c_subcat 0 CAT 241 NP-DTNN 751 12.83target_lemma 0 CAT 91 suffix 1067 18.22label 0 CAT 49 data 2672 45.64phrase_type 0 CAT 47 NP 1682 28.73right_word 0 CAT 39 Default 5807 99.18parent_word_pos 0 CAT 35 NN 2906 49.63arg_word_pos 0 CAT 33 NN 2707 46.23fes_list 0 CAT 30 fe_subclass, fe_data, fe_language_variety 1150 19.64left_word_pos 0 CAT 21 Default 5086 86.87right_word_pos 0 CAT 12 Default 5807 99.18target_pos 0 CAT 12 NN 2414 41.23gov_cat 0 CAT 4 VP 3489 59.59position 0 CAT 3 R 4363 74.52

Table 4.4: Summary of the frame-element classification dataset

a nominal variable). Categorical variables take a value that is one of several possiblevalues. However, in our case, some of the variables have a large number of possiblevalues. Table 4.3 also provides the most frequent value (mode) for every variable, togetherwith number of times. This has appeared in the dataset (frequency) which is also showedin terms of the percentage of the total records in the dataset. We also tested for missingdata and found that there are no missing values for any of the variables.

22

variable n_missing type distinct_values mode frequency percentarg_word 0 CAT 1911 of 96 1.67parent_word 0 CAT 1742 is 185 3.22left_word 0 CAT 364 Default 5005 87.06c_subcat 0 CAT 236 NP-DTNN 737 12.82target_lemma 0 CAT 89 suffix 1061 18.46phrase_type 0 CAT 46 NP 1653 28.75right_word 0 CAT 39 Default 5701 99.17parent_word_pos 0 CAT 35 NN 2874 49.99arg_word_pos 0 CAT 32 NN 2669 46.43fes_list 0 CAT 29 fe_subclass, fe_data, fe_language_variety 1150 20.00label 0 CAT 25 data 2672 46.48left_word_pos 0 CAT 21 Default 5005 87.06right_word_pos 0 CAT 12 Default 5701 99.17target_pos 0 CAT 12 NN 2391 41.59gov_cat 0 CAT 4 VP 3415 59.40position 0 CAT 3 R 4300 74.80

Table 4.5: Summary of the frame-element classification dataset after removing low fre-quency classes

4.2.2 Frame-element Classification DatasetSimilar to the identification task, we first checked for duplicate instances. After removingduplicates, 5,855 cases were left, which is 92.29% of the total 6,344 instances in thisdataset covering 49 different classes of frame-elements. Tables 4.6 and 4.7 provide list ofthe classes as well as their distribution (frequency) in the entire dataset. After discussionswith the domain experts, it was decided to discard all of the instances which correspondto the classes which have very little (≤10 instances) representation in the dataset. Themain reason being that before including these classes into the classification task, moredocuments corresponding to these classes should be annotated to bring in proper structureof these classes into the dataset. It turned out that 24 out of 49 classes fulfill this criteria.All of the low frequency classes (Table 4.7) together correspond to only 106 samples ofthe dataset. After discarding these, the resulting final dataset consist of 5,749 instances, asummary of which is shown in Table 4.5.

4.2.3 Bivariate Relationship AnalysisWe also quantitatively measured the relationship between all pairs of the variables. Incase of numerical variables, among others, Pearson correlation coefficient is widely usedto determine a linear relationship between two variables. However, Pearson’s correlationcoefficient is not defined when the data is categorical. Since all of our variables are categor-ical, we need a measure of association between two categorical variables. Cramér’s V is

23

label frequency percentagedata 2672 45.636data_translation 836 14.278subclass 764 13.049language_variety 303 5.175language 230 3.928sublass 186 3.177fe_Data_Translation 182 3.108location 108 1.845formed_entity 86 1.469process 60 1.025degree 39 0.666language_family 37 0.632participant_2 27 0.461affix 26 0.444position 22 0.376spoken_by 21 0.359participant_1 21 0.359reference_language 20 0.342word 20 0.342anthropomorphic_entity 20 0.342language_subfamily 16 0.273language_group 15 0.256formed_from 14 0.239verb 13 0.222condition 11 0.188

Table 4.6: Distribution of the Classes Kept in the Frame-element Classification Dataset

one such measure which quantifies the association between a pair of categorical variables.Cramér’s V varies from 0 to 1. The lowest value, i.e., 0 corresponds to no associationbetween the variables. Whereas, when the two variables are equal to each other, meaningthere exists a complete association, a value of 1 is achieved.

Figure 4.1 shows a heatmap plot of the measure of the Cramér’s V for frame-elementidentification dataset. The bottom row in this plot shows the association value betweenour target variable label and the 15 features. The most notable features are arg_word ,parent_word, phrase_type and arg_word_pos achieving high values 0.53, 0.41, 0.28 and0.27 respectively. On the other extreme, right_word and gov_cat have apparently very low

24

label frequency percentagelexeme 10 0.171grammatical_category 9 0.154result 9 0.154frequency 8 0.137manner 8 0.137script 8 0.137derivational_morpheme 7 0.120analogical_form 6 0.102range 6 0.102name 5 0.085stem 4 0.068certainty 3 0.051means 3 0.051form 3 0.051linguistic_example 2 0.034morpheme 2 0.034certainity 2 0.034inflectional_scheme 2 0.034argument 2 0.034aspirant 2 0.034mood_of 2 0.034purpose 1 0.017example_pointer 1 0.017base 1 0.017

Table 4.7: Distribution of the Classes Removed from the Frame-element ClassificationDataset

or almost no association with the target. However, it must be kept in mind that this is ameasure of one to one linear relationship, and those showing low association individuallymight together with other features exhibit a better multivariate relationship with the targetvariable.

25

Figure 4.1: Bivariate Relationship Analysis for Frame-element Identification Dataset

The other notable high correlations such as between pairs arg_word and arg_word_pos,right_word and right_word_pos, left_word and left_word_pos, parent_word and parent_word_posare found to be already expected after discussion with the domain experts. However,the remaining strong relationship pairs namely: target_lemma and fe_list, target_pos andc_subset, arg_word_pos and phrase_type are somewhat special and the domain expertswanted to explore them systematically to help in improving their future developments offrames for annotation and subsequently enhance the quality of the generated data.

26

Figure 4.2: Bivariate Relationship Analysis for Frame-element Classification Dataset

In Figure 4.2, a similar heatmap plot for frame-element classification dataset is shown.The findings mentioned in the previous paragraph are also consistent in this dataset. Thebig difference is that now, as evident from the last row, except right_word and right_word_pos,all other features exhibit comparatively better values of the measure compared to theframe-element dataset.

The insights gained through this bivariate analysis is vital for making choices for themachine learning algorithms discussed later in this chapter.

27

4.3 Data RepresentationAll of the variables in both of the datasets are same, except that they differ on the possibleset of values for the target variable, label. However, in either case, together with the target,all of the variables are categorical (see Table 4.1, 4.5).

In order to achieve best performance while performing machine learning modelling,the right choice of data representation technique for categorical data is very important. Themain reason is that there are a limited number of machine learning algorithms that can bedirectly applied to categorical data. On the other hand, if we can turn them into numericalvariables, starting from basic Decision Trees, Naïve Bayes, Support Vector Machines,Logistic Regression, Random Forest, to Multi-layer Perceptron (Deep Learning), almostall of the machine learning algorithms can be applied.

There are plenty of techniques to transform categorical values to numerical data. Inthe following we describe two such techniques that are widely used and we have chosento apply to our datasets.

4.3.1 Label EncodingThe first approach that we have used for encoding categorical values is a technique calledlabel encoding. It is applied by simply converting each value in a categorical variable to anumber. For example, the variable gov_cat has four different levels (e.g., Table 4.1). Wecould choose to encode it as follows:

• VP⇒ 0

• S⇒ 1

• ROOT ⇒ 2

• SINV ⇒ 3

Label encoding has the advantage that it is straightforward to apply the transformationof categorical data to numerical values. However, this simplicity comes with a disadvan-tage. That is, the numeric values can be misinterpreted in the machine learning models,because the ordering of the numerical vales may be taken into account in by the learningalgorithm, although this order may not be relevant for the dataset at hand.

Because label encoding is the most basic way of enabling categorical data to be treatedas numeric data, we have implemented it as a baseline for evaluating the performance ofthe algorithms using other encoding schemes.

28

4.3.2 One Hot EncodingA common alternative approach is called one hot encoding. The basic strategy is to converteach category level (value) of the categorical variable into a new variable, and assign thevalue 1 to this new variable wherever the corresponding categorical variable equals thislevel, and 0 otherwise. This is done for all category levels of the variable being encodedexcept one, which will be redundant (applies when all other associated variables equalzero) and can be any category level. The key is to always create one fewer binary variablesthan the number of categories. The new binary variables together replace the originalcategorical variable. The new variables are sometimes termed dummy variables, and theapproach is also called Dummy Variables Encoding. This encoding has the benefit of notweighting a value improperly, but does have the downside of adding more variables to thedataset.

To exemplify, let us again consider the variable gov_cat. Applying one hot encodingwould turn this variable into a set of three binary variables named VP, S, and ROOT, wherethe category value SINV is redundant and will be encoded implicitly by setting the othervariables to 0.

For our datasets, we applied both data encoding schemes using Python (version 3.5)via Pandas (version 0.22.0) and Scikit–Learn (version 0.20.3)

4.4 Machine Learning ModellingSuccessful encoding makes the dataset ready to be used for applying machine learningalgorithms. However, before going into the model selection and training, we must clearlydefine the learning task and the evaluation criteria.

4.4.1 Formulating the Machine Learning TaskMachine learning can be categorized into two broad learning tasks: Supervised and Unsu-pervised. In simple words, the task is called supervised learning when all data is labeledand the algorithm learns to predict the output from the input data. On the other hand, inunsupervised learning, all data is unlabeled and the algorithm learns the inherent structuresfrom the input data. Without diving further into the details, we can already establish thatsince we have labelled data, we are dealing with supervised learning task.

The block diagram in Figure 4.3 3 provides an overview of machine learning tasksand algorithms. Under supervised learning, Regression and Classification are the main

3https://www.guru99.com/machine-learning-tutorial.html

29

Figure 4.3: Machine learning Algorithms and where they are used?

categories. The main difference between them is that the target variable in regression isnumerical (or continuous) while that for classification is categorical (or discrete). Thus,both frame-element identification and classification tasks addressed in this thesis projectare clearly classification tasks within the domain of supervised learning.

Within the current setup, as explored in Sections 4.2.1 – 2 , we have two independentvariants of the training data obtained from annotated descriptive grammars together withtheir labels. The frame-element identification training dataset has two possible valuesfor the target variable, therefore, it becomes a Binary Classification Task. On the otherhand, frame-element classification task assumes that the target variable can have one ofthe possible 25 class labels. Such a problem of classifying instances into three or moreclasses is known as Multiclass Classification Task. In this thesis project, both tasks havebeen treated and tacked independent of each other, as two separate learning tasks.

4.4.2 Selection of the Evaluation MetricsAccuracy is one of the widely used metrics for evaluating the performance of a classifi-cation model (or "classifier"). Informally, accuracy is the fraction of correct predictionsover the total number of predictions made by a model. We can define accuracy using thefollowing:

Accuracy =Number of correct predictionsTotal number of predictions

(4.1)

30

Most real life classification datasets do not have exactly equal number of instancesin each class, but a small difference often does not matter and accuracy provides a fairestimate of the performance of the classification algorithm.

However, the situation is different in our case. Data exploration has already revealedthat for both of the tasks, we have highly imbalanced data. For the identification task,distribution of the labels in the dataset are 92.86% cases for N, and 7.14% cases for Y.

Similarly, for the multiclass classification task, the most frequent class namely data,alone occupies more than 45% of the instances in the dataset. Together with the next twomost frequent classes data_translation and subclass, they already cover almost 73% of thetraining data. Thus, the rest of the 22 possible class labels together cover only 17% of thedata, out of which there are 15 classes each of which is having less than 1% representationin the dataset. All these facts are summarized using the calculations provided in Table 4.5.

Imbalance datasets call for special consideration when choosing for appropriate evalu-ation metric(s). A learning algorithm can simply learn to always predict the majority classto quickly achieve a higher accuracy score, e.g., in our frame-element identification task,it can be as high as 0.9286 due to the percentage of the cases for the class label N. Thesituation is termed as the accuracy paradox. Therefore, to deal with imbalance datasets, itis advised that together with accuracy, other performance measures should also be consid-ered to achieve better insight into the performance of the classifier than just the traditionalclassification accuracy.

We begin by describing the widely used concept called Confusion Matrix and its re-lated terms to formally define accuracy as well as other metrics of interest namely Preci-sion, Recall and F-Score which together help to truly access the performance of classifica-tion models specially for imbalanced datasets. The following discussion is very important,since the purpose is not to present the book definitions, rather, the objective is to unroll themeanings of these metrics in the context of the problem addressed in this thesis project.

Figure 4.4: Confusion Matrix for Binary Classification Task

Confusion Matrix: A confusion matrix is a table that is often used to describe the per-formance of a classifier on a set of given data samples in supervised learning setting (weknow the true class labels). It provides a breakdown of the correct predictions and the

31

types of incorrect predictions made (what classes the incorrect predictions were assignedto).

It is convenient to describe a confusion matrix for the case of binary classification tasksas shown in Figure 4.4. The terms positive and negative refer to the classifier’s prediction,and the terms true and false refer to whether the prediction corresponds to the expected(actual) value.

Let us define the four basic quantities that constitute a confusion matrix. These arewhole numbers and form the basis for explaining the contents of confusion matrix as wellas the other metrics mentioned above. We are referring to our binary classification task(frame-element identification) to make the description convenient to follow.

True Positive (TP): These are cases in which a frame-element is predicted as Y and thecorrect (positive) label was also Y.

True negative (TN): The classifier predicted N, and got it right, i.e., it was a negativeinstance.

False Positive (FP): The prediction is Y, but actually the correct label is N, meaning that,it is wrong to say it a positive instance.

False Negative (FN): The classifier predicted the instance to be N but actually it is apositive (correct label = Y) instance. Thus, it is false to say it a negative instance.

Now, in the following we use the four above mentioned quantities to formally define ac-curacy as well as other relevant metrics. All of these metrics are fractions, therefore, thevalues they take range from 0 to 1.0 (worst to the perfect).

• Accuracy: What fraction of the instances did the model label correctly?

Accuracy =T P+T N

T P+T N +FP+FN=

T P+T NTotal Population (predictions)

(4.2)

• Precision: Out of all the examples the classifier thought were positive, how oftenwere the examples actually positive? It is a measure of a classifier’s exactness, thatis, how believable the model is when it says an instance is positive?

Precision =T P

T P+FP=

T PTotal Predicted Positive

(4.3)

32

• Recall: When presented a positive example, how often was the model able to clas-sify it as positive? It is a measure of a classifier’s completeness.

Recall =T P

T P+FN=

T PTotal Actual Positive

(4.4)

• F1 Score (or F-score): A perfect precision score of 1.0 would mean every frame-element which is predicted as Y by the classifier is indeed a true label for that frame-element, however, it says nothing about whether all of the Y-labeled frame-elementswere actually predicted correctly. Whereas, a perfect recall score of 1.0 would meanall Y-labeled frame-elements have been predicted correctly, but at the same time, itsays nothing about how many of the N-labeled frame-elements have been predictedas Y.

Actually, precision and recall are inversely proportional to each other, that is, in-creasing one decreases the other and vice versa. Understanding this effect is im-portant in building an efficient classification model. The solution lies in choosing ameasure that is a combination of both precision and recall. F-score is one such mea-sure whose value is defined as a weighted harmonic mean of precision and recall.

F-score = 2precision x recallprecision + recall

=2T P

2T P+FP+FN(4.5)

As clear from the definition of F-score, its values always lies between precision andrecall. However, to account for class imbalance datasets, it is advised to computeda weighted F-score. For this, we first compute the metrics (precision and recall) foreach class label separately and then find their average weighted by the number oftrue instances for each class label. In this way, it allows us to account for imbalancein the dataset. It should be noted that weighted F-score value may not be betweenprecision and recall, but being a fraction, still remains less than one.

Equations 4.2–4.5 complete our list of the metrics to evaluate and compare the per-formance of machine learning models. The definitions and the comments providedfor these metrics apply equally to both variants of the datasets.

Standard functions from the Scikit–Learn (version 0.20.3) have been used to com-pute these metrics.

4.4.3 Choosing Appropriate Machine Learning AlgorithmsResults of the bivariate analysis suggest that there is strong evidence of linearity of therelationship between features and the target variable. However, at the same time, a possi-

33

bility of a non-linear relationship can not be ruled out. Therefore, we should include bothlinear as well as non-linear algorithms as candidate models for the model comparison andselection. Since, our datasets correspond to both binary and multiclass classification tasks,we should select the algorithms which are applicable to both scenarios. Keeping in mindthe golden rule, try simple models first, we proceed as follows.

Together with the fact that the features are all categorical variables, the most naturalcandidate from non-linear models is a Decision Tree Classifier. A decision tree classifiercan inherently handle both binary as well as multiclass classification tasks. Our secondchoice is Logistic Regression. One should not be confused by the word regression in thename, it is a well known classification model from the family of linear models primar-ily used for binary classification tasks. However, a variant of logistic regression calledmultinomial logistic is applicable to multiclass classification tasks as well.

Naïve Bayes based classifiers are also capable to handle both binary and multiclassclassification and it is also a reasonable option in the presence of categorical variables.Naïve Bayes classifier belong to the family of simple probabilistic classifiers. It is basedon Bayes’ Theorem with an assumption of independence among predictor variables, thatis, the presence of a particular feature in a class is unrelated to the presence of any otherfeature. Even if they are interdependent to a certain extent, as revealed for few of thefeatures in our bivariate analysis, Naïve Bayes may still work reasonably well. So it isworth trying out.

SVM or Support Vector Machine, in its standard form, is a linear model applicable toboth binary as well as multiclass classification tasks. It can solve linear and non-linearproblems and has been successfully applied to many practical problems. Therefore, it isgood to include both, Naïve Bayes and SVM in our list of the selected algorithms.

Another import aspect of the problem is figuring out which encoding is the best, sincethat may greatly influence any of the algorithms mentioned above. We have already cho-sen two types of variable encoding schemes as mentioned under the Data Representationsection. Separately applying both encoding to our datasets, all four of the algorithms men-tioned above are implemented and compared using the performance evaluation metricsdiscussed in the previous section using Python (version 3.5) via Pandas (version 0.22.0)and Scikit–Learn (version 0.20.3).

4.4.4 Cross ValidationCross validation is an essential step in the process of model selection as well as it allows usto utilize our data better. While accessing the performance of a machine learning model,one would like to validate the model on the unseen data. The classic approach is to do asimple 80%-20% split, sometimes with different values like 70%-30% or 90%-10%. In

34

cross validation, we do more than one split. We can do 3, 5, 10 or any K number of splits.Those splits called Folds. Usually, splitting is performed at random. However, since ourdatasets are imbalanced we need to be careful. Random splits into k-folds is not a goodidea, e.g., in case of frame-element identification dataset, some folds might get only theinstances with class label N. When we split our data into folds, we want to make sure thateach fold contains a good representation of all possible class labels in the data, that is, wewant the same proportion of different classes in each fold. For this purpose, a strategycalled Stratified K-fold is useful and provides the splits which are representative of thedistribution of the datasets. We have used the function Stratified K-fold from Scikit–Learn(version 0.20.3) to achieve this for our both dataset.

35

Chapter 5

Discussion of Results and Future Work

The main focus of this thesis project has been to address the ML modelling aspects ofthe problem of automatically extracting typological linguistic information of natural lan-guages spoken in South Asia from annotated descriptive grammars. An overall goal (aim)and a set of milestones (objectives) which define how to achieve this goal has been iden-tified and clearly formulated in Chapter 3. There has been three main objectives whichwere further divided into sub-tasks. Here in this chapter, the purpose is to discuss andreflect on each of these objectives, the achieved results, and make a fair assessment of howwell these objectives have been met in order to access the overall success in achieving theaim of this thesis project. Towards the end of the chapter, we conclude the discussion bypointing out some directions for future work.

5.1 Training DataTo develop an ML model requires training data which is provided to the ML algorithmto learn from. That is, to pursue any research query using the ML modelling, it beginswith the identification of the relevant data type(s) and source(s). Thus, one of the first andobvious tasks to fulfill the aim in this thesis project was to figure out the appropriate datasource of natural languages spoken in South Asia (scope of this thesis project), selectionof the appropriate tools for the annotation of the descriptive grammars, and finally turningthe annotations into training instances. Next, it was also necessary, and mentioned aspart of the first objective, that a systematic exploration must be carried out to develop anunderstating of the data. In the following subsections we discuss and reflect on the datageneration and exploration aspect of this thesis project.

36

5.1.1 GenerationSection 4.1 and the subsections therein, provide detailed discussion as well as motivationof the selection of the data source, standard tools used for the annotation process and fi-nally the generation of the training data through parsing tools. Since data generation forour particular problem depends upon the linguistic understanding of the problem domain,throughout this process, the supervisor and other researchers at Språkbanken were con-sulted and their feedback was taken into account to select the right data source and usestate-of-the-art practices for turning the annotated data into training instances. In this way,all of the steps related to training data generation are thus validated through presentationand discussion with the domain experts.

Each sentence in the set of documents chosen for the generation of training data re-sulted in one or more training instances with the 15 extracted features as mentioned inTable 3.1 (For details see Part-I and Part-II in Section 3.1). These features representedthe predictors in our dataset and the label defining the target whether it is a frame-element(Y/N), and if so, also providing the class of the corresponding frame-element. Discussionwith the domain experts at this point turned out that our task of linguistic informationextraction splits up into two independent classification tasks: binary classification andmulticlass classification. Thus, we defined two versions of the dataset: one which corre-sponded to the frame-element identification task (binary classification) while the other forclassifying the frame-elements into their corresponding types (multiclass classification).

At this point, setting up two separate versions of the dataset meant that steps involvedin the data exploration as well as remainder of the objectives must consider both datasetsand any analysis and the modelling results obtained should be prepared separately for bothdatasets.

This concludes the discussion on the first objective, the details of the methods.

5.1.2 ExplorationData exploration is vital to get insights about the data at hand and be used for makingdecisions in the ML modelling purpose. The second part of the first objective in this thesisproject explicitly mentioned the need to figure out any relationships within the features aswell as generating the descriptive statistics so that it can aid in the proper preprocessing ofthe data and also in the selection of ML algorithms for the modelling part.

For both variants of the data, Section 4.2: its subsections, the tables and the figurestherein cover in depth analysis and exploration choices and their results. A purposefuldiscussion and the motivation for each step has already been added in there. To reflectupon, here is a brief account of the exploration. First of all, as suggested by the domainexperts, duplicated instances in the training data were removed and continuing with the

37

exploration, following aspects were particularly studied: individual features (variables)analysis for type, missing and distinct values and frequency (see Tables 4.3 and 4.2).

Also, a class distribution analysis was conducted for both versions of the dataset, andspecifically for the frame-element classification dataset, it was found that for 24 out of49 classes have very little (≤10 instances each) representation in the dataset. All of thelow frequency classes (Table 4.7) together added up to only 106 samples out of 5,855total samples of the dataset. Such a dataset could lead to unexpected results in the MLmodelling part. Thus, these were discarded and the resulting final dataset consisted of5,749 instances covering a set of 25 classes, a summary of which is shown in Table 4.5. Itwas noted that before including the low frequency classes into the classification task, moredocuments corresponding to these classes should be annotated to bring in proper structureof these classes into the dataset.

As a last step of the data exploration, we quantitatively measured the relationship be-tween all pairs of the variables in both the datasets. Motivation for the choice of themethod along with the discussion of the results was covered in detail in Section 4.2.3. Anotable finding of the bivariate relationship analysis was that we found out strong evidenceof linear relationship between the features and the target labels for both the dataset. Lateron, in the ML modelling part, these findings were used to make an informed selection ofcandidate ML algorithms for both the datasets.

5.2 Developing a Machine Learning ModelThe second objective which is also the core part of the aim for this thesis project is todevelop an ML model. As described in Section 3.3, this objective was divided into threesub-tasks namely: (1) performing data encoding, (2) comparing a set of representativemachine learning algorithms, and finally (3) optimizing and tuning the best algorithm totrain a final model. Section 4.3 covers (1) by exploring two state-of-the-art data represen-tation techniques namely: Label Encoding and One Hot Encoding. Whereas for (2) andits related topics such as formulating the machine learning task, metrics to be used in theevaluation criteria, and a validation approach employed to access the generalization of theresults, are discussed in Section 4.4 providing great details along with the proper moti-vations. Therefore, for details of these first two sub-tasks, please refer to the mentionedsections.

The outcome of the work carried out for these two sub-tasks was the set up of a super-vised learning based ML modelling approach where a set of four algorithms namely: De-cision Trees, Naïve Bayes, Support Vector Machines, and Logistic Regression were chosento compare against each other. It is worth mentioning that these algorithms are carefully

38

selected such that we have algorithms belonging to both linear and non-linear familiesof ML models. The experiments were designed to independently train and compare thesealgorithms for both classification tasks and evaluation was conducted based on the prede-termined performance metrics namely: accuracy, recall, precision and f-score.

A comparison of the results of the selected machine learning algorithms applied to thebinary classification (frame-element identification) and multiclass classification (frame-element classification) tasks for our datasets was performed. Tables 5.1–5.4 summarize theresults of using both data representation techniques (one hot and label encoding) togetherwith the selected metrics for evaluation and reported values correspond to an average scoreof 10-fold cross validation.

Model Accuracy Precision Recall F_scoreDecision Tree 0.900 0.621 0.620 0.900Logistic Regression 0.928 0.554 0.508 0.896Naïve Bayes 0.620 0.550 0.681 0.710Support Vector Machine 0.928 0.644 0.507 0.896

Table 5.1: Model Comparison for Frame-element Identification dataset using Label En-coding


Table 5.2: Model Comparison for Frame-element Identification dataset using One HotEncoding


Table 5.3: Model Comparison for Frame-element Classification dataset using Label En-coding

39


Table 5.4: Model Comparison for Frame-element Classification dataset using One HotEncoding

5.2.1 Summary of the FindingsThe part of the objective is to compare the performance of the candidate algorithms andselect an appropriate one to further optimize and tune for a final model. For this purpose,in the following we analyze the results and draw conclusions supported with the argumentsbased on the findings in the results Tables 5.1–5.4.

• Overall, all of the algorithms performed better for the frame-element identifica-tion dataset (binary classification task) compared to the frame-element classificationdataset (multiclass classification task).

• Using one hot encoding is clearly advantageous over label encoding for both thetasks as all the models perform better over all the metrics, except for SVM.

• Within the best results of one hot encoding, Decision tree and Logistic Regressionhave consistently shown overall best results among the four models in both the tasks.

• If we focus only on accuracy and the F-score, and consider one hot encoding results,the only winner that pops up is Logistic Regression which is a linear model. It alsosupports that the hypotheses about adding linear models supported by the bivariateanalysis was correct.

5.2.2 Optimizing and Tuning the Best ModelThe last sub-task of the second objective asks to optimize and tune the best algorithm totrain a final model. Logistic Regression has been selected for further optimizing to train afinal model to be integrated into a demonstration of the overall system as depicted in thesystem architecture diagram in Figure 3.1.

Looking closer into result Tables 5.2, and 5.4, we realized that particularly for theframe-element identification task, the true positive rate (recall) is low for the LogisticRegression model: a value of 0.609 compared to the values 0.664 and 0.686 for Decision

40

Tree and Naïve Bayes respectively. A better recall for our Logistic Regression model canprovide gain in the overall performance on unseen data.

Improving recall can be achieved by explicitly telling the Logistic Regression algo-rithm to penalize mistakes in samples of the desired class (e.g., Y labels in frame-elementidentification task) more than the other class(s). In this way, for class imbalanced dataset,more emphasis can be put on minority class(s). In Scikit–Learn implementation of Logis-tic Regression, this can be achieved by setting the parameter class_weight = ’balanced’.Secondly, we learned about a variant of the standard accuracy measure called balancedaccuracy which is well suited for our tasks. The balanced accuracy can be computed forboth binary and multiclass classification problems to combat with imbalanced datasets. Itis defined as the average of recall obtained on each class. Since this is also essentially anaverage of fractional numbers, its value lies between 0 (worst) and 1 (best).

Figure 5.1: Confusion Matrix for Frame-element Identification Dataset using LogisticRegression

With the above parameter and metric choice, we learned, separately for both datasetwith one hot encoding, an optimal Logistic Regression model using 10–fold cross valida-tion and then the model was applied to get confusion matrices shown in Figures 5.1and 5.2for both the datasets. For frame-element identification task, a balanced accuracy of 0.91was achieved. Surprisingly, for the frame-element classification task, a drastic improve-ment is achieved as evident in the Figure 5.2. The accuracy value of 0.817 for Logistic

41

Regression as in Table 5.4 has reached to a new value of 0.997 and at the same time,achieving an excellent score of 0.9988 for balanced accuracy. This concludes the discus-sion on the main focus of this thesis project.

Figure 5.2: Confusion Matrix for Frame-element Classification Dataset using LogisticRegression

42

5.3 Web Demo: Typological Feature Extraction SystemIn the following, we discuss and reflect on the completion of the last objective which asksfor the integration of the best tuned model into a web demo for typological informationextraction system. A preliminary version of a web demo has been developed to showthe working of the complete system. Figure 3.1 shows the complete architecture of thetypological feature extraction system. As shown (the middle part within dotted area),the system takes a descriptive grammar in raw form and annotate it with LingFN framesusing the pre-trained models both for the frame-element identification and frame-elementclassification tasks (the pre-training of models using the annotated data is shown in the partabove the dotted area). The annotated data is further processed with a simple rule basedmodule to convert those annotations to typological feature values (i.e., the part below thedotted area). Let us take an example to explain this part (i.e., the rule based part) inparticular, and the overall purpose of such a system in general.

Suppose we are interested in finding an answer to the question “What is the order ofadjective and noun in the noun phrase” for the Siyin 1 language. The LSI data set containsa grammatical description of this language, and one of the sentences in that description isThe adjectives follow the noun they qualify. Automatic parsing of this sentence using thedeveloped LingFN parser will result into the annotations shown in Figure 5.3 (a screenshotfrom the web demo of the parser).

This parse contains the answer to the above asked question. However, the typologicaldatabases often record answers in a specific format. For example, the answer to the abovequestion could be required to be of one of these values ‘NA’, ‘AN’, or ‘Both’ meaning thatthe order is ‘Noun-Adjective’, ‘Adjective-Noun’, or ‘Both’ respectively. If required, theabove given parse information can be converted into specific feature values using a simplerule-based module such as mentioned in Algorithm 1 (only a part of the full module isshown). The module simply checks the contents of different frame elements to formulatethe feature value.

Using the same sort of procedure and the frames in LingFN, we have targeted to extractand formulate values for some of the typological features given in the Grambank 2 andother typological databases. A few of these features are given below.

• Can an adnominal property word agree with the noun in gender/noun class?

• Can an article agree with the noun in gender/noun class?

• Can an article agree with the noun in number?1A Tibeto-Burman language spoken in southern Tedim township, Chin State, Burma. Also known as

Siyin Chin and Sizang Chin, ISO 639-3: csy2A typological database: https://github.com/clld/grambank.

43

https://github.com/clld/grambank

Figure 5.3: Automatic Frame Annotation

• Can the relative clause precede the noun?

• Can the relative clause follow the noun?

• Order of Adjective and Noun.

• Order of Subject, Object and Verb.

• Order of Numeral and Noun.

• Order of Relative Clause and Noun.

It is worth mentioning that the same methodology can be used to extract values for variousother typological features from the descriptive grammars. This will require designing suit-able frames, annotating the data and re-training models, which we leave as a future work.

44

Algorithm 1 Extract adjective noun order0: procedure EXTRACTADJECTIVENOUNORDER(parse)0: for <every frame in parse> do0: if f rame = SEQUENCE then0: NA← False0: AN← False0: Both← False0: if ′ad jective′ ∈ Entity_1∧′ noun′ ∈ Entity_2 then0: if Frequency ∈ [sometimes,usually,mostly,o f ten] then0: Both← True0: else if order = f ollow then0: AN← True0: else if order = precede then0: NA← True0: end if0: end if0: end if0: end for0: end procedure=0

Further, the methodology can be extended to descriptive grammars written in languagesother than English.

5.4 Concluding Remarks and Future WorkWe have presented a novel system for automatic extraction of typological features fromdescriptive grammars. Based on the theory of frame semantics and frame-semantic pars-ing, we have presented the methodology, set up the machinery and architecture, conductedthe machine learning modelling, and shown the working of this machinery for extractionof feature values of an example typological feature.

Regarding the machine learning results, here are some points to consider for betterresults in a future work.

• A natural and quick extension to improve the results of this work could be to tryout ensemble of the best performing models, such as Decision Tree and LogisticRegression, or just ensembles of tree only.

• Among ensemble learning using trees, Random Forest and Gradient-Boosting Trees

45

are worth trying out. Random forest generates many times simple decision trees anduses the majority vote method to decide on which label to return. In this way, usuallya much better results can be achieved. Gradient-boosting trees is also a state-of-the-art technique tree based ensemble. It also works with many decision tree focusingon the error committed by the previous trees and tries to correct it.

• Since, the best performance on the frame-element classification dataset is around 0.8in both accuracy and F-score, it hints that complex non-linear relationships mightexist and a significant improvement can be achieved by systematic evaluation ofnon-linear models such as Neural Networks.

Now, some remarks about the generalization and applicability of the proposed systemin the wider context. First of all, the methodology is scalable and can easily be extendednot only to other features but also to the descriptive grammars written in other naturallanguages. This is required because there are many grammatical descriptions written inlanguages other than English (German, French, Spanish, and Russian are among them).

Secondly, the system we report is expected to be a useful assistance for the develop-ment of typological databases, which otherwise are built manually. Manual curation oftypological databases is very time and labor consuming, as well as cognitively taxing, thusmaking the scope of studies based on such databases very limited. We hope with the au-tomatic extraction of typological databases, the scope of studies in typological and otherrelated areas can be broaden further.

Now, some observations about the limitations. In this regard, specifically we want tomention that the current version of LingFN provides a very limited number of eventfulframes restricting us to target only a few typological features. There are 195 typologicalfeatures listed in Grambank. In the future, we would like to build more frames, annotatemore grammars, and automatically extract values for as many as possible features of theGrambank.

In conclusion, the current study can be considered as a rigorous proof of concept.Thus, in the future, we plan to extend the system both on ML and LingFN fronts andevaluate it against existing manually curated typological databases to compute measuressuch as precision and recall among others. Further, the extraction of typological featuresis just a case study, the automatically annotated grammars are envisioned to be equallyuseful in other linguistic subdisciplines, in particular the related areas of genetic and areallinguistics. In the future, we also have plans to show the usefulness of the annotateddescriptions in these and other related areas.

46

Bibliography

[1] Per Malm , Shafqat Mumtaz Virk, Lars Borin, and Anju Saxena, LingFN: Towards aFramenet for the Linguistics Domain, In 11th edition of the Language Resources andEvaluation Conference, May 7, 2018 - May 12 2018, Miyazaki (Japan), pp. 37-43.2018.

[2] Shafqat Mumtaz Virk, Lars Borin, Anju Saxena and Harald Hammarström, AutomaticExtraction of Typological Linguistic Features from Descriptive Grammars, 20th In-ternational Conference on Text, Speech and Dialogue (TSD) Aug 27, 2017 - Aug 31,2017 Prague.

[3] Lars Borin, Shafqat Mumtaz Virk and Anju Saxena. Language Technology for DigitalLinguistics: Turning the Linguistic Survey of India Into a Rich Source of LinguisticInformation, 18th International Conference on Computational Linguistics and Intelli-gent Text Processing (CICLing), April 17 to 23, 2017, Budapest, Hungary.

[4] Evans, Nicholas and Stephen Levinson. The Myth of Language Universals: Languagediversity and its importance for cognitive science. Behavioral and Brain Sciences,2009, 32(5). 429–492.

[5] Hammarstrom, Harald. 2013. Three Approaches to Prefix and Suffix Statis- tics in theLanguages of the World. Paper presented at the Workshop on Corpus-based Quantita-tive Typology (CoQuaT 2013).

[6] Grierson, G.A.: A Linguistic Survey of India, vol. I–XI. Government of India, CentralPub- lication Branch, Calcutta (1903–1927).

[7] Fillmore, C. J. (1976). Frame semantics and the nature of language. Annals of the NewYork Academy of Sciences, 280(1):20–32.

[8] Fillmore, C. J., (1977). Scenes-and-frames semantics. Number 59 in FundamentalStudies in Computer Sci- ence. North Holland Publishing, Amsterdam.

47

[9] Fillmore, C. J. (1982). Frame semantics. In Linguistic So- ciety of Korea, editor, Lin-guistics in the Morning Calm, pages 111–137. Hanshin Publishing Co., Seoul.

[10] Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). The Berkeley FrameNet project.In Proceedings of ACL/COLING 1998, pages 86–90, Montreal. ACL.

[11] Shen, D. and Lapata, M. (2007). Using semantic roles to improve question answer-ing. In Proceedings of EMNLP- CoNLL 2007, pages 12–21, Prague. ACL.

[12] Ponzetto, S. P. and Strube, M. (2006). Exploiting semantic role labeling, word-net and wikipedia for coreference resolution. In Proceedings of HLT 2006, pages192–199, New York. ACL.

[13] Hasegawa, Y., Lee-Goldman, R., Kong, A., and Akita, K. (2011). Framenet as aresource for paraphrase research. Constructions and Frames, 3(1):104–127.

[14] Wu, D. and Fung, P. (2009). Semantic roles for SMT: A hybrid two-pass model. InProceeding

[15] Surdeanu, M., Harabagiu, S., Williams, J., and Aarseth, P. (2003). Using predicate-argument structures for information extraction. In Proceedings of ACL 2003, pages8–15, Sapporo. ACL.

[16] Borin, L., Toporowska Gronostaj, M., and Kokkinakis, D. (2007). Medical framesas target and tool. In FRAME 2007: Building Frame Semantics resources for Scandi-navian and Baltic languages. (Nodalida 2007 workshop proceedings), pages 11–18,Tartu. NEALT.

[17] Torrent, T. T., Salomao, M. M. M., Matos, E. E. d. S., Ga-monal, M. A., Goncalves,J., de Souza, B. P., Gomes, D. S., and Pero Correa, S. R. (2014). Multilingual lex-icographic annotation for domain-specific electronic dictionaries: The Copa 2014FrameNet Brasil project. Constructions and Frames, 6(1):73–91.

[18] Shafqat, V., Borin, L., Malm, Per., Saxena, A., LingFN: A Framenet for the Linguis-tic Domain. Cicling 2019, April 7 to 13, 2019, La Rochelle, France.

[19] Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A.Smith. 2014. Frame-semantic parsing. Comput. Linguist. 40, 1 (March 2014), 9-56.

[20] Das, D., Schneider, N., Chen, D., Smith, N.A. (2010). Probabilistic Frame-SemanticParsing. HLT-NAACL.

48

[21] Richard Johansson and Pierre Nugues. 2008. The effect of syntactic representationon semantic role labeling. In Proceedings of the 22nd International Conference onComputational Linguistics - Volume 1 (COLING ’08), Vol. 1. Association for Com-putational Linguistics, Stroudsburg, PA, USA, 393-400.

[22] Manning,C.D.,Surdeanu,M.,Bauer,J.,Finkel,J.,Bethard,S.J.,McClosky,D.:TheStanfordCoreNLP natural language processing toolkit. In: ACL System Demonstrations. pp.55–60. ACL, Portland (2014), http://www.aclweb.org/anthology/P/P14/P14-5010

49

semantic frame based automatic extraction of typological ...1371627/fulltext01.pdf · the automatic...

Documents