



Starved neural learning

Morpheme segmentation using low amounts of data

Peter Persson

Institutionen för lingvistik (Department of Linguistics)

Computational Linguistics

Bachelor's degree project, 15 credits

Bachelor's Programme in Linguistics, 180 credits

Autumn term 2018

Supervisor: Robert Östling

Examiner: Mats Wirén

Expert reviewer: Mats Wirén

English title: Starved neural learning


Starved neural learning

Morpheme segmentation using low amounts of data

Abstract

Automatic morpheme segmentation as a field has been dominated by unsupervised methods since its inception, partly due to theoretical motivations, but also due to resource constraints. Given the success neural network methods have shown in a wide variety of fields in recent years, it seems compelling to apply these methods to the morpheme segmentation task. This study explores the efficacy of modern neural networks, specifically convolutional neural networks and bi-directional LSTM networks, on the morpheme segmentation task in a low-resource setting, to determine their viability as contenders with previous unsupervised, minimally supervised, and semi-supervised systems in the field. One architecture of each type is implemented and trained on a new gold standard data set, and the results are compared to previously established methods. A qualitative error analysis of the architectures' segmentations is also performed. The study demonstrates that a BLSTM system can be trained with minimal effort to produce a proof-of-concept solution at low levels of training data, and suggests that BLSTM methods may be a fruitful direction for further research in this field.

Keywords

morpheme segmentation, machine learning, convolutional neural network, LSTM, neural networks


Morpheme segmentation with neural networks using small amounts of data

Summary

The field of automatic morpheme segmentation has from the very beginning been dominated by unsupervised machine learning methods. This has partly been motivated by theoretical considerations, but also to a large extent by the scarcity of data. Given the success that neural network methods have shown across various fields within NLP, it is tempting to explore whether these methods can be applied to automatic morpheme segmentation. This study investigates the efficacy of modern neural network methods, specifically convolutional networks and bi-directional LSTM architectures, on automatic morpheme segmentation in a resource-poor context, in order to compare their capability against the unsupervised and minimally supervised systems previously used in the field. One architecture of each type is implemented and trained on a new gold standard, and the results are compared with previously published research in the field. In addition, a qualitative error analysis is performed on each architecture's segmentations. The study shows that a BLSTM system achieving a proof-of-concept solution can be constructed with little effort and small amounts of training data. Furthermore, the study suggests that BLSTM systems may be a fruitful area for future research in the field.

Keywords

morpheme segmentation, machine learning, convolutional neural networks, LSTM, neural networks


Contents

1 Introduction
2 Background
  2.1 Some basic terms
    2.1.1 Model, architecture, system
    2.1.2 Unsupervised, minimally supervised, semi-supervised
    2.1.3 Epoch
    2.1.4 Validation set
    2.1.5 Parameters and hyper parameters
  2.2 Morphology and natural language processing
    2.2.1 Morphological markings
    2.2.2 Overview of English morphology
    2.2.3 Limitations of NLP
    2.2.4 Segmentation as a tagging problem
  2.3 Neural networks
    2.3.1 Basic structure
    2.3.2 Learning and back propagation
    2.3.3 Regularization and overfitting
    2.3.4 Dropout
    2.3.5 LSTM networks
    2.3.6 Bi-directional networks
    2.3.7 Convolutional layers
    2.3.8 Residual connections
    2.3.9 Preprocessing and batch normalization
    2.3.10 Embedding layers
  2.4 Evaluation
  2.5 Previous research
3 Purpose and research questions
  3.1 Purpose
  3.2 Research questions
4 Method
  4.1 System overview
  4.2 Data
    4.2.1 Evaluation data
    4.2.2 Training data
  4.3 Model architecture
    4.3.1 Input and embedding layer
    4.3.2 BLSTM architecture
    4.3.3 Convolutional architecture
    4.3.4 Output layer
  4.4 Implementation
  4.5 Training
  4.6 Experiment design
5 Results
  5.1 Results by research questions
  5.2 Quantitative results
    5.2.1 Relationship between training set size and F1
    5.2.2 Comparison with earlier research
  5.3 Error analysis
    5.3.1 Convolutional architecture
    5.3.2 BLSTM architecture
6 Discussion
  6.1 Discussion of method
    6.1.1 Discussion of network models
    6.1.2 Discussion of data
    6.1.3 Discussion of annotation
    6.1.4 Discussion of tagset
    6.1.5 Discussion of experimental setup
  6.2 Discussion of results
    6.2.1 Discussion of quantitative results
    6.2.2 Discussion of error analysis
    6.2.3 Discussion of comparison with previous works
7 Conclusions
  7.1 Conclusions by question
    7.1.1 Future research
    7.1.2 Main contributions


1 Introduction

Automatic morpheme segmentation is not a new concept; it has been around since the work of Harris in 1955. In short, the aim of the task is to detect and sometimes classify morphemes in raw text data without the need for manual labour. For example, if we give a segmentation system the word 'disservice', we would want the system to provide the following analysis: 'dis-serv-ice'. This is contrasted with automatic morphological analysis, which attempts, given a word-form, to find an analysis of its semantic morphemic properties, such as part-of-speech, tense-aspect value, or number. Various motivations have been used to justify research in morpheme segmentation – some theoretical, some practical – and as a consequence the exact details of the task often differ. In a practical setting, automatic segmentation has been utilized as a preprocessing step for word-based NLP tasks such as machine translation and information retrieval.

Traditionally the field has been dominated by unsupervised machine learning methods. Influential among these is the work of Goldsmith (2001) and Creutz & Lagus (2005, 2007), both utilizing patterns in letter and word frequencies to detect boundaries between morphemes, from which the system builds generative models which it successively attempts to optimize. Another interesting approach is taken by Ruokolainen et al. (2014), where the problem of segmentation is recast as a tagging problem. Utilizing a minimal amount of annotated data, Ruokolainen et al. implement a conditional random fields-based system to tackle this tagging problem. While technically a supervised machine learning method, systems that utilize such small amounts of data have been called minimally supervised (Ruokolainen et al. 2016). This system and several of the earlier unsupervised methods have later been expanded to a semi-supervised learning setting by allowing them to leverage a small amount of annotated data during training (Kohonen et al. 2010, Goldwater et al. 2009, Ruokolainen et al. 2014).

The present study aims to explore the efficacy of modern neural network methods in the low-resource learning environment of morpheme segmentation. This is done by implementing two neural network architectures: a convolutional network and a bi-directional LSTM network. These two architectures are then trained on a new annotated gold standard made for this study, on training sets of three gradually increasing sizes. The smallest one corresponds to the unsupervised, minimally supervised, and semi-supervised learning environments outlined by Ruokolainen et al. (2016).


2 Background

2.1 Some basic terms

2.1.1 Model, architecture, system

This thesis uses the terms model, architecture, and system with very specific meanings. System refers to a program which performs a certain task, e.g. morpheme segmentation. A system can use any method, or several methods, to solve this task. Architecture refers to the details of a neural network's construction. A model is a trained neural network which follows a certain architecture in its construction.

2.1.2 Unsupervised, minimally supervised, semi-supervised

In machine learning, a system which only uses raw unannotated data in its learning process is called an unsupervised system. This is contrasted with a supervised system, which utilizes some form of annotation or curated selection of its data in the learning process. A semi-supervised system is a hybrid of the two, where a small portion of the training data is annotated in some way in order to ease the learning process. Minimally supervised systems refer to systems that are fully supervised, yet utilize only a minimal amount of data. This minimal amount of data is often the same as, or comparable to, the amount of supervised data a semi-supervised system has access to. These four categories are called learning environments in this study.

2.1.3 Epoch

During training the system will process parts of the training data at a time, updating parameters as it goes along, and continue until it has gone through a full pass of all of the available training data. This is called an epoch. Many systems keep training for many epochs, i.e. they go over all of the available training data many times.

2.1.4 Validation set

A small portion of the training data is commonly set aside before training, called the validation set. This portion is used to get an estimate of how far the learning process has gone by having the system evaluate against the validation set after fixed training intervals. Most commonly this evaluation happens at the end of an epoch.

2.1.5 Parameters and hyper parameters

Neural networks consist of large amounts of vectors that are combined in various ways during calculation. The values of the vectors which are fine-tuned during training are called the parameters of the network. Simply put, the parameters of a network are those numbers the system alters in order to learn. Hyper parameters represent numbers and settings which are fixed during training but still greatly affect learning. Hyper parameters are fine-tuned by the programmers and researchers before training to get the best version of the architecture for the task. The number of units in a neural network layer, the probability p used for dropout, or the magnitude of a max norm constraint are examples of hyper parameters.


2.2 Morphology and natural language processing

The study of morphology concerns itself largely with the concept of a morpheme, commonly defined as the smallest meaning-bearing unit of words. In particular, morphology studies the relationship between morphemes and the word-forms which are said to contain or carry them. This is challenging to an NLP approach, which traditionally operates on written texts, partly due to the difficulty of deriving meaning from surface forms of words and partly due to the ambiguities surrounding where one morpheme begins and another ends.

This section goes over how morphemes are marked on word-forms from a typological perspective, as well as outlining specific strategies as used in the English language. Then, this section briefly goes over the difficulties of modelling morphemes in an NLP environment.

2.2.1 Morphological markings

There are many ways in which languages mark the presence of a morpheme in a word-form. A broad categorization would divide these strategies into concatenative and non-concatenative types, a distinction which is useful in the context of this study.

Concatenative strategies traditionally refer to affixation and compounding. These marking strategies have the convenient property that the affix, or roots in the case of compounding, consist of clearly delineable phonemic material that can be associated with the morpheme(s) in question. This allows for a conveniently simple and concrete model of what a morpheme is: a chunk of extra phonemic material added to a base or root to add meaning. Examples 1 through 3 show examples of concatenative morphology in Swedish. In example 1 we see inflectional suffixation, in 2 derivational prefixation, and in 3 we see compounding.

(1) stol 'chair', stol-ar 'chair-PL', stol-ar-na 'chair-PL-DEF', stol-ar-na-s 'chair-PL-DEF-GEN'

(2) leda 'lead', av-leda 'divert', för-leda 'mislead'

(3) ben 'leg', stolsben 'chair leg', bordsben 'table leg'

Non-concatenative strategies do not necessarily share this easily delineable property of concatenative strategies. A common class of non-concatenative marking strategies is that of base alteration, which includes strategies such as vowel alternations, stress shift, palatalization, or gemination. Another case is that of reduplication, where the entirety or a part of a base or stem is repeated somewhere in the word-form. Finally there is also conversion, a strategy which does not explicitly mark one or more morphemes, e.g. in English noun/verb alternations such as 'work' (noun) or 'work' (verb). In these cases morphemes are either represented by a replacement or modification of a part of another word-form (base alterations), not represented at all (conversion), or represented by sound changes which are not isolable from other parts of the word-form (stress shifts and tonal shifts). Example 4 shows tense expressed through vowel alternations in English. In example 5 we see derivation through consonant lengthening in Modern Standard Arabic. Example 6 shows reduplication of the initial CV syllable of the base in Ponapean (Rehg & Sohl 1981, p. 78). Notice how the same morpheme is expressed with differing phonemic material in the two cases.

(4) sing 'sing.PRS', sang 'sing.PST', sung 'sing.PRF'


(5) (a) darasa 'learn', darrasa 'teach'; (b) damara 'perish', dammara 'annihilate'

(6) (a) duhp 'dive', du-duhp 'be diving'; (b) wehk 'confess', we-wehk 'be confessing'

2.2.2 Overview of English morphology

The English language has a limited set of inflectional morphemes with highly predictable allomorphy, coupled with an extensive and unpredictable system of derivational morphology. The primary marking strategy is affixation (especially suffixation, and some prefixation), with a significant minority of lexemes employing base alternation or suppletion. It is also common for affixes to co-occur with stress shift and vowel lengthening or vowel shortening (Bauer et al. 2013).

Among the inflectional morphemes we find a few tense-aspect morphemes as well as 3SG agreement in the present tense on verbs, plural and genitive or possessive markers for nouns, and comparative (or degree) markings for adjectives and adverbs. The verbal morphemes are primarily expressed through suffixation, though tense-aspect morphemes are sometimes expressed through vowel changes or through suppletion. The verb 'sing' inflects largely through vowel alternation as seen in example 4, but also has the inflected forms 'sing-s' and 'sing-ing'. Contrast this with the verb 'talk', which only inflects using suffixation. The genitive/possessive morpheme and degree markings can alternatively be expressed through suffixation or through periphrasis, e.g. "the student's" but also 'of the student', or 'thick', 'thicker' contrasted with 'affordable' and 'more affordable'. Periphrastic phrases such as these are not treated by this thesis.

The plural morpheme on nouns is marked through suffixation except for some irregular lexemes. These irregular nouns come in two types: older Germanic roots that use suppletion or vowel change, and roots of classical origin where the plural marking of the source language was borrowed into English. The former category is small, and follows a few different non-productive patterns. The latter category is highly unpredictable and employs suffixation, haplology, and vowel lengthening depending on the source of the root in question and the knowledge and attitude of the speaker towards the classical languages. In addition, there are several functional auxiliary verbs, such as 'be' and 'have', which almost exclusively inflect using suppletion.

Derivation predominantly utilizes suffixation and prefixation, together with some highly productive conversion (or zero-derivation) strategies. Many affixes occur together with base allomorphy such as final vowel deletions, consonant lenition, stress shifts, or vowel shifts. There is also significant allomorphy among the suffixes themselves, some of which appear to be in stylistic (or free) variation and some of which have complementary distributions through selectional restrictions on bases. These selectional restrictions of affixes in English frequently relate to the part-of-speech of the base, together with stress pattern and sometimes the final consonant of the base.

2.2.3 Limitations of NLP

Traditional morphology looks at both the form and meaning of words simultaneously and seeks to understand the relationships between similarity of form and similarity of meaning. Haspelmath & Sims (2013, p. 2) summarize this in the following definition: "Morphology is the study of systematic covariation in the form and meaning of words". Computational morphological processing falls short of this


ideal in a number of ways. To begin with, computational approaches require large quantities of data, which most often are only available as text. Relying on textual sources as opposed to spoken language renders important information regarding a word's form, such as tone, vowel length, or stress patterns, largely unavailable. This severely limits computational systems in their ability to generalize on the form of words. Furthermore, the difficulty of extracting detailed information regarding semantics from pure text prompts many researchers to avoid meaning entirely when discussing morphology from a computational perspective. In this regard, computational morphology bears little resemblance to traditional morphology as defined above.

Instead, computational morphology often looks at distributional patterns of contiguous orthographic sequences and studies which sequences are highly over-represented and which are frequently combined. Typically 'combined' is taken to mean concatenation between strings (i.e. 'abc' + 'def' = 'abcdef'), and morphological marking strategies that cannot be described as a concatenation of a sequence of characters at the front or end of a base are commonly not treated in the literature, though there are exceptions (e.g. Habash et al. 2012, and Pasha et al. 2014, on morphological analysis of Arabic). These systems clearly do not handle non-concatenative morphology very well, and many systems also lack good ways of handling infixes and circumfixes, since these strategies create disconnected sequences which correspond to the same morpheme.

2.2.4 Segmentation as a tagging problem

Morpheme segmentation, from a computational perspective, is the task of, given a source string, finding the roots, suffixes, prefixes, etc. that are clearly delineable and outputting the source string with the boundaries between these marked by some delimiter character. E.g., if we have the string internationalists', the correct output could be inter-nation-al-ist-s-'. There is some variation as to the exact definition in the literature (Hammarström & Borin 2011), especially in relation to the neighbouring task of word segmentation, where sentences, and sometimes entire paragraphs or texts, are segmented into word-forms. In some cases these systems also segment word-forms on a morpheme level, rendering the distinction between the two tasks blurry. Here we define morpheme segmentation as performed on the type level, meaning that the boundaries between surface forms (or morphs) of underlying morphemes – those that are clearly delineable – are marked within individual types. A precise definition is given in Definition 1.

Definition 1. Type-level morpheme segmentation is taken to mean partitioning of the orthographic material of types into contiguous, non-overlapping substrings which correspond to surface forms of component morphemes.

Traditionally many morpheme segmentation methods have focused on detecting affixes and delineating these in word-forms when detected, sometimes by generating large lexicons over potential affixes (e.g. Goldsmith 2001, 2006, Creutz & Lagus 2005, 2007). Another approach is to view the task as a binary classification problem: by assigning each character in the word-form to one of the classes 'is followed by a morpheme boundary' or 'is not followed by a morpheme boundary', we can treat the task as a tagging problem. This allows the use of conventional and well-studied sequence tagging methods such as hidden Markov models or recurrent neural networks, as in this study. Such a system would use an intermediate output step that marks the class for each character, which can then trivially be decoded into the desired segmentations. With the 'internationalists' example above, where 0 represents 'not followed by a boundary' and 1 represents 'followed by a boundary', we get:

(7) i n t e r n a t i o n a l i s t s '
    0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 1

    inter-nation-al-ist-s-'
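To make the tagging encoding concrete, the following short Python sketch (a hypothetical helper, not the code used in this thesis) converts a gold segmentation such as the one in example 7 into the 0/1 tag sequence described above, and decodes such a tag sequence back into a segmented string:

    def segmentation_to_tags(segmented):
        # turn e.g. "inter-nation-al-ist-s-'" into characters and 0/1 boundary tags
        chars, tags = [], []
        for segment in segmented.split("-"):
            for i, ch in enumerate(segment):
                chars.append(ch)
                # 1 = this character is followed by a morpheme boundary
                tags.append(1 if i == len(segment) - 1 else 0)
        return chars, tags

    def tags_to_segmentation(chars, tags):
        # invert the encoding: insert "-" after every character tagged 1, except the last
        out = []
        for i, (ch, tag) in enumerate(zip(chars, tags)):
            out.append(ch)
            if tag == 1 and i < len(chars) - 1:
                out.append("-")
        return "".join(out)

    chars, tags = segmentation_to_tags("inter-nation-al-ist-s-'")
    print(tags)                               # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]
    print(tags_to_segmentation(chars, tags))  # inter-nation-al-ist-s-'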


Figure 1: A feed-forward neural network. Each unit in a layer is connected to all units in the previous layer.

2.3 Neural networks

Words such as AI and artificial neural networks conjure up an image of advanced technologies that border on the fantastical. While it may be exciting to draw parallels between the human brain and the structure of neural network models, the reality is that neural network models have little in common with the processing that goes on in a human mind. Instead, these models are convenient computational tools that allow for efficient parallel computations – and they have proven to be very flexible in adapting to a wide array of different processing and labelling tasks in numerous fields.

This section goes over the basic structure of neural network models, outlines the two architectures used in this study, and briefly explains the most important technical details of their construction.

2.3.1 Basic structure

The basic structure of a neural network model is fairly simple. It is built around the core concept of a unit, sometimes called a neuron. Units are building blocks that are organized in layers. Each unit can be viewed as a function that takes a little bit of data as input, performs a mathematical operation on it, and then feeds the result to the next layer in the model. What distinguishes different neural architectures from each other is how these units are connected to each other, and precisely what type of mathematical function is performed in each unit. The power of the neural network comes from the fact that each unit has a vector of parameters that can be adjusted through a learning algorithm. That way each unit can learn to react differently to different properties in the data, which can then be used in following layers to learn abstractions about the data.

Formally speaking, a unit is nothing more complicated than a weighted sum with an activation function – which in theory can be any non-linear function. The outputs, or activations, from the units in the previous layer, x, are weighted by an equally large vector of weights w which is unique to the unit. The sum of these weighted inputs is then forwarded to the activation function f, together with a single number b called the bias. The operation of a single unit can be summarized as f(∑_i w_i x_i + b). The learnable parameters of the unit are simply the weight vector w.
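As a worked example of the unit operation f(∑_i w_i x_i + b), the following NumPy snippet (with arbitrary illustrative numbers, not values from any trained model) computes the activation of one unit using a ReLU activation function:

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])   # activations from the previous layer
    w = np.array([0.8, 0.1, -0.3])   # the unit's weight vector (learnable)
    b = 0.2                          # the unit's bias

    def relu(z):
        # a commonly used non-linear activation function
        return max(z, 0.0)

    activation = relu(np.dot(w, x) + b)  # f(sum_i w_i * x_i + b)
    print(activation)                    # 0.4 - 0.1 - 0.6 + 0.2 = -0.1, so ReLU gives 0.0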


By convention the layers of a neural network are called input layers, hidden layers, and output layers depending on their order. The input layer is the first layer in the model, and can be thought of as a restriction on the size and dimensionality of the input to the network. E.g. a network supposed to accept 32 × 32 pixel images with three colour channels will have an input layer with a size of 32 × 32 × 3. The input layer itself does not perform any computation and is sometimes called a virtual layer. The output layer is the last layer in the model. Its size is determined by the desired size of the output. E.g. in a classification task the output layer will have a unit per class, whose activations will represent the respective class scores. Hidden layers are all layers between the input layer and the output layer, and these perform the majority of all computation in the network.

When discussing neural network architectures it is common to refer to the depth of a network, as well as the size of the hidden layers. The depth of a network refers simply to the number of hidden layers used in the architecture, and when a description refers to something being placed at a certain depth in an architecture, that refers to the organizational slot that is equivalent to that position among the hidden layers. E.g. depth two of a four-layer deep network is the same place as the second hidden layer would be. The size of a layer is the number of units in that layer. In addition, the basic neural network structure shown in Figure 1 is known as a feed-forward network, and will be used as a reference point when describing the other architectures used in this study.

2.3.2 Learning and back propagation

A set of training data with a gold standard annotation forms the basis from which the network learns. It processes the training set to produce its own solutions to the data and then compares these solutions to the gold standard. The system measures how incorrect it was through a loss function. There are many different types of loss functions designed for different tasks and different system architectures. The common property of all loss functions is that they accept as input the two analyses, from the system and from the gold standard, and output a single number which is higher the more the two analyses differ and lower the more they agree. Once the system has this number it adapts the weights of its component units using an algorithm known as backpropagation. This updating of parameters is done regularly as the system works its way through the training set.

After each completed epoch, the system processes and produces attempted solutions for the validation set (similarly to how it normally operates while training, but without updating any parameters) and saves a copy of the current model. It then scores its own solutions on the validation set using the loss function, similarly to comparing with the gold standard during regular training. If the loss value obtained from testing against the validation set shows a sufficient drop in performance (i.e. an increase in loss) compared to a previous check against the validation set, training ends and the last best-performing version of the model is used.
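Purely as an illustration of how such validation-based early stopping is often set up (assuming a Keras implementation; the toy data, layer sizes, and patience value below are invented for this example, not the thesis configuration):

    import numpy as np
    from tensorflow.keras import layers, models
    from tensorflow.keras.callbacks import EarlyStopping

    # toy stand-ins for real training data: 200 padded index sequences with 0/1 tags
    x_train = np.random.randint(1, 30, size=(200, 10))
    y_train = np.random.randint(0, 2, size=(200, 10, 1)).astype("float32")

    model = models.Sequential([
        layers.Embedding(input_dim=30, output_dim=8),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # evaluate on the held-out validation split after every epoch; stop once the
    # validation loss stops improving and fall back to the best weights seen so far
    early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
    model.fit(x_train, y_train, validation_split=0.1, epochs=50, callbacks=[early_stop])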

2.3.3 Regularization and overfitting

When a network is very large in terms of raw number of parameters, the representational power of the network can lead to the network tailoring its solution to pick up noisy patterns in the data that are not generalizable. This leads to the system being hyperspecialized to correctly solve its intended task for the training data while being less accurate on data from other sources. This problem is called overfitting, and the methods used to combat it are called regularization. The core idea behind regularization methods is to force the network to prioritize simpler solutions without the hyperspecialization that leads to overfitting.


Figure 2: An 'unrolled' schematic of an RNN network. The same layer A processes all the steps in the input while taking the activations from the previous step into account.

2.3.4 Dropout

The regularization method used in all architectures in this study is called dropout. Srivastava et al. (2014) introduce the method as a computationally efficient alternative to using a large number of different models for the same task and averaging the results. During the training process, a network using dropout will temporarily disable each unit in the network with a probability p. Then, during inference with the trained model, each weight is weighted with 1/p, which is approximately the same as averaging over all of the possible different models that can be produced through randomly dropping units in the network.
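A minimal sketch of how a dropout layer is typically inserted into a layer stack, assuming a Keras implementation (the sizes and rate are invented); note that Keras applies the compensating rescaling to the surviving activations during training, so no extra weighting is needed at inference time:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(64, activation="relu"),
        # each unit's output is zeroed with probability 0.5, during training only
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.summary()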

2.3.5 LSTM networks

A key concept in understanding LSTM networks is understanding a recurrent neural network, or RNN. An RNN is, in simple terms, a neural network with loops. RNNs are applied to sequences of data, and each step in the sequence is processed completely before the next step. The key point is that the RNN has a modular structure that is slightly more complicated than the feed-forward network in Figure 1. Each module in the RNN is a layer that, in addition to working as a conventional hidden layer, is also connected to the module at the same depth in the network from the previous step in the sequence. Thus, when processing the i-th step in the input sequence, an RNN module will access the activations from the previous layer at step i as well as the activations from the same module at step i − 1, provided that i > 0. This loop-like structure allows the network to take past context into account when processing a sequence, which allows for more accuracy in certain tasks. A graphical explanation of this is found in Figure 2.


An LSTM, or Long Short-Term Memory network, is a form of RNN developed in order to more easily learn dependencies over larger distances in the input sequence (Hochreiter & Schmidhuber 1997). Like other RNNs, the LSTM network allows its modules to access the information from the same depth at an earlier step. In addition to this, the LSTM architecture uses a sort of memory called the cell state, which spans all the modules at the same depth throughout a sequence in a more direct way. The LSTM module uses four inner layers of units in order to gate information to and from this cell state. This allows the network to learn what to add to the cell state, what to remove from it, and what parts of it to take into account when producing activations – using the learnable parameters of these four inner layers.

2.3.6 Bi-directional networks

The original RNN architectures are designed around the idea that a sequence has a beginning and an end, and that it has a right direction which goes from the beginning to the end. In many applications this seems straightforward: in speech recognition the network needs to determine what is being said at a point in time using only the information available up until then. The system is allowed to glance at the past context, since this is information that is available, whereas looking at future context would be cheating. In many applications, however, the notion of time is not as clear as in e.g. speech recognition, and in such applications glancing at the 'future' context is not only acceptable but in some cases also practical.

A bi-directional RNN is an adaptation of the RNN architecture that allows for taking into account both the past context and the future context when processing a sequence. Each module of the bi-directional RNN (or BRNN) consists of two independent RNN modules. Both of these process the same input sequence, one in the traditional front-to-back order, and the other in the reverse order. The outputs of these two RNN modules are then combined to form the input to the following layer in the model. This allows the next layer to consider both an analysis that has access to past context, and an analysis that has access to future context.
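As an illustration of the wrapper pattern just described (not the exact architecture used in this study, which is specified in section 4.3), a bi-directional LSTM layer over a character sequence can be written in Keras along these lines:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(30,)),                  # a padded sequence of 30 character indexes
        layers.Embedding(input_dim=50, output_dim=16),
        # two LSTMs read the sequence in opposite directions; their outputs are
        # concatenated at every step before being passed to the next layer
        layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
        layers.Dense(1, activation="sigmoid"),      # per-character boundary score
    ])
    model.summary()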

2.3.7 Convolutional Layers

Convolutional layers, originally introduced under the name Time Delay Neural Network by Waibel et al. (1990), are defined largely by how units are connected from one layer to another through the use of filters. This differs from a normal feed-forward network in that the units of the convolutional layer look at small parts of the input volume that are anchored locally, with the aim of forming a greater picture from smaller windows in the input. This is visually demonstrated in Figure 3. Intuitively, we can understand each filter as scanning the input data in small chunks to look for a feature of some kind, the nature of which is learned during the training process. In actuality, each filter consists of a set of units that all share the same weights and are locally connected to some chunk, or window, in the input data. In that sense each unit in the filter outputs the judgement of that filter at a specific point in the input. Most convolutional neural networks apply a series of filters per layer, which can learn to react differently to different potentially relevant features in the data. By stacking multiple convolutional layers the network can model more and more abstract features in its filters.

Formally speaking, a convolutional layer takes an input volume and returns an output volume. The input volume is normally either a 1D sequence or a 2D image, together with an extra dimension which is the channel depth. E.g. a sequence of words or characters would commonly be fed into the network as a sequence S of length l consisting of vector embeddings of length d. A filter operating on a 1D sequence would connect windows of a certain odd length k from the input sequence to units u with identical weights, such that each unit u_i is connected to the embedding vectors s_j where i − (k−1)/2 ≤ j ≤ i + (k−1)/2. This means that each filter will take into account the entire depth of each step in a window.


Figure 3: A visualisation of convolutional filters operating on a 1D sequence.

The output volume then becomes as wide as the number of steps needed to cover the entire input sequence, with a depth equal to the number of filters used. When an output volume from a convolutional layer is then fed to another convolutional layer as input, the result from each filter at step i will be taken as the channel depth at step i. Each unit performs a weighted sum of the data points it is connected to in the input volume, similarly to the operation in the basic structure outlined above. The main difference lies in the connectivity of the units, as well as the fact that each unit in the same filter uses the same set of weights as opposed to individual weights. This means that each filter is in actuality a convolution of the input volume, which is where the name convolutional layer originates.
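A sketch of a 1D convolutional layer over an embedded character sequence, assuming Keras (the filter count and window size are invented), matching the description above of filters scanning fixed-width windows over the full channel depth:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(30,)),                      # padded character index sequence, l = 30
        layers.Embedding(input_dim=50, output_dim=16),  # embedding depth d = 16 per step
        # 64 filters, each looking at windows of k = 5 consecutive character embeddings;
        # 'same' padding keeps the output volume as long as the input sequence
        layers.Conv1D(filters=64, kernel_size=5, padding="same", activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.summary()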

2.3.8 Residual connections

When a network using convolutional layers becomes too deep, it begins to be difficult to properly train and optimize the network using standard learning algorithms. One method of combating this is using residual connections inside structures called ResNet modules. ResNet modules are simply modules of two convolutional layers back to back that run in parallel with a shortcut, or skip connection. In essence, after every two convolutional layers in a ResNet architecture, the output of the second of the two layers is combined, most commonly through addition, with the input to the first of the two layers. This allows information to propagate through the network unhindered, and it is this property of a ResNet which is called the residual connection. The gain of designing a network with these skip connections is that it allows for easier optimization of extremely deep network architectures without significant adaptation of the learning algorithms (He et al. 2016b).
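A minimal functional-style sketch of such a ResNet module, assuming Keras (the filter count and kernel size are arbitrary); the combination is done by addition, as in the text, so the block's input must already have the same number of channels as its convolutions produce:

    from tensorflow.keras import layers

    def residual_block(x, filters=64, kernel_size=5):
        # two convolutional layers back to back...
        y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
        y = layers.Conv1D(filters, kernel_size, padding="same")(y)
        # ...plus the skip connection: the block's input is added to its output,
        # letting information propagate past the convolutions unhindered
        return layers.Add()([x, y])

    inputs = layers.Input(shape=(30, 64))   # a sequence of length 30 with 64 channels
    outputs = residual_block(inputs)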


2.3.9 Preprocessing and batch normalization

Neural networks learn faster and more accurately if the input values are of fairly uniform magnitude. In order to ensure this, it is common to apply various preprocessing techniques to normalize the input magnitude without losing the desired information. A common normalization is (X_i − X̄) / s, where X_i is each individual input value, X̄ is the mean of the input values, and s is an estimate of the standard deviation. Batch normalization is a process where this normalization is also done inside the network between layers, with the mean and standard deviation calculated over batches.
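A worked NumPy example of the standardization (X_i − X̄) / s on a toy set of input values:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0])
    normalized = (x - x.mean()) / x.std()  # subtract the mean, divide by the standard deviation
    print(normalized)                      # [-1.342, -0.447, 0.447, 1.342]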

2.3.10 Embedding layers

An embedding layer maps integer-represented data to a continuous vector space. The input to such a layer would be a vector of integer indexes, which is then transformed into a matrix where each integer is represented as a column in that matrix. This is often done instead of using one-hot encoded vector representations, which otherwise function identically from a purely theoretical standpoint. The key gain in using embeddings in this case is a more efficient implementation. In other tasks the continuous vector representations are useful for capturing semantic similarities and relationships between words (Mikolov et al. 2013).
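A small sketch of an embedding lookup, assuming Keras (vocabulary size and dimensionality are arbitrary): a batch of integer character indexes is mapped to one dense vector per index.

    import numpy as np
    from tensorflow.keras import layers

    embedding = layers.Embedding(input_dim=50, output_dim=8)  # 50 symbols, 8-dimensional vectors
    indexes = np.array([[3, 17, 17, 4, 0]])                   # one padded sequence of 5 indexes
    vectors = embedding(indexes)
    print(vectors.shape)                                      # (1, 5, 8): one vector per index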

2.4 Evaluation

Both architectures (the convolutional network and the BLSTM) will be evaluated in a supervised environment against a gold standard segmentation. Recall, precision and F1 will be used, following the definitions given in Goldwater et al. (2009). Goldwater et al. define these measurements in the context of a word segmentation task, which is analogous to a morpheme segmentation task in the sense that both tasks require the system to partition a string of characters into substrings. Since the difference between the two tasks lies in the granularity of the intended analysis rather than in the structure of the analysis, the definitions given in Goldwater et al.'s work are applicable here as well.

Let's say we have a gold standard segmentation of the word "internationalists'" to which we want to compare the analysis from our segmentation system. Let's furthermore assume the segmentations are as in example 8. In this case the gold standard contains seven boundaries, which yields six partitions of the word, or segments. Conversely, the analysis hypothesized by the system contains eight boundaries, or seven segments.

(8) System analysis: *in-ter-nation-al-is-ts-'
    Gold standard:    inter-nation-al-ist-s-'

A segment in the system analysis is considered correct if it corresponds to one of these six segments in the gold standard, with both start and end boundaries detected and no further division of the segment proposed by the system. Thus only a completely identified segment is considered valid. Recall is then defined as the number of correct segments in the system analysis divided by the number of segments in the gold standard. In the current example three segments were correctly identified – the segments 'nation', 'al', and "'" – yielding a recall of 3/6 or 0.5. Precision is defined as the number of correct segments in the system analysis divided by the total number of segments proposed by the system. The precision in the current example is then 3/7 or roughly 0.43. F1 is defined as the harmonic mean of precision and recall, or (2 × precision × recall) / (precision + recall).
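The definitions above can be captured in a short evaluation function. The following sketch (a hypothetical helper, not the evaluation script used in the thesis) counts a segment as correct only when both of its boundaries match the gold standard, and reproduces the 3/6 recall and 3/7 precision of example 8:

    def segment_spans(segmented):
        # return the (start, end) character offsets of each segment in e.g. "inter-nation-al"
        spans, start = [], 0
        for segment in segmented.split("-"):
            spans.append((start, start + len(segment)))
            start += len(segment)
        return spans

    def evaluate(gold, hypothesis):
        gold_spans = set(segment_spans(gold))
        hyp_spans = set(segment_spans(hypothesis))
        correct = len(gold_spans & hyp_spans)      # segments with both boundaries right
        recall = correct / len(gold_spans)
        precision = correct / len(hyp_spans)
        f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f1

    print(evaluate("inter-nation-al-ist-s-'", "in-ter-nation-al-is-ts-'"))
    # (0.428..., 0.5, 0.461...)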


2.5 Previous research

Morpheme segmentation is a diverse field with numerous systems and methods that have been explored by many contributors. This section will begin by briefly outlining a few of these systems, which will be used for comparison in the results section. All three of the systems outlined were tested on the same data and with the same evaluation techniques by Ruokolainen et al. (2016). That study reports results both for the unsupervised versions (or supervised with minimal data for the CRF system) and for semi-supervised versions of all three systems.

The first system is the semi-supervised extension of Morfessor by Kohonen et al. (2010). Like all Morfessor systems it defines a probabilistic generative model where word-forms are assumed to have a single correct analysis composed of string representations of morphs concatenated together to form the surface word-form. The model defines a lexicon of morphs together with probabilities of their occurrence. The system then attempts to strike a balance between encoding observed word-forms concisely and maintaining a minimal morph lexicon, through successive small changes utilizing a minimum description length approach (Barron et al. 1998).

Sirts & Goldwater (2013) instead utilize a non-parametric Bayesian modelling framework called adaptor grammars. The system accepts an abstract grammar specification together with the training data, learns to detect structures that match this grammar, and assigns surface forms derived from the observed data to the structures in the grammar. Sirts & Goldwater argue that this allows for experimenting with various structural constraints on the resulting morphology in a simpler way.

Ruokolainen et al. (2013, 2014) view morpheme segmentation as a tagging problem on a string of characters a_1 through a_n. By assigning the tags 'B' for the beginning of a morpheme, 'M' for the 'middle' of a morpheme, and 'S' for a single-letter morpheme, the system provides a segmentation analysis for the original string.

(9) s t o l a r n a s
    B M M M B M B M S

This is a similar approach to the one taken in this study, albeit with a more complex tag set. Ruokolainen et al. (2014) utilize a conditional random fields method for the tagging problem, as opposed to neural networks. This means that the method must specify features on which probability estimation models can be built, and these are normally designed specifically for the task at hand. The CRF system of Ruokolainen et al. utilizes the left and right contexts of a character t in a word x when determining which class to assign to the current character t. The maximum lengths of these contexts are a hyper parameter δ. The semi-supervised version from Ruokolainen et al. (2014) additionally uses character frequency measures based on the work of Harris (1955).

There are more systems which, while not used for direct comparison, are still pertinent to mention in the context of the present study. Östling (2016) utilizes a convolutional neural network for the morphological reinflection shared task of SIGMORPHON 2016. While the task of reinflection differs from the task of segmentation, the demands placed on a system by morphological reinflection are similar to those required for segmentation – particularly the ability to detect patterns in surface forms between word-forms on a character level. The architecture design and code of this project are built on top of the work of Östling.

Lastly, Goldwater has done a substantial amount of work on modelling human performance on word segmentation of contiguous streams of phonemes using Bayesian methods (Goldwater et al. 2009, Frank et al. 2010). While this research has a markedly different end goal than the current study, word segmentation is sufficiently close to morpheme segmentation to make the distinction between the two tasks unclear. In fact, the F1 metric used in this study is defined as in Goldwater et al. (2009).


3 Purpose and research questions

3.1 Purpose

The purpose of the present study is to implement and test a convolutional neural network (CNN) architecture and a bi-directional LSTM (BLSTM) architecture on the morpheme segmentation task to determine their efficacy with low amounts of training data.

3.2 Research questions

Question 1: What is the relationship between training data size and efficacy in F-score for the CNN and the BLSTM architectures respectively?

Hypothesis 1a: Both architectures will perform better in terms of F-score on larger training sets.
Motivation: More data yields more examples to learn from, assumed to result in higher generalizability.

Hypothesis 1b: The two architectures will achieve similar F-scores with the same training set sizes.
Motivation: Since there is no theoretical motivation as to why one architecture would outperform the other, empirical evidence is required to support a difference.

Question 2: How do these models compare to previous results from unsupervised methods?

Hypothesis 2a: The BLSTM models will have a slightly lower F-score compared to earlier methods.
Motivation: Previous research utilizes models that are designed around low-resource environments, or that leverage unannotated data.

Hypothesis 2b: The CNN models will have a slightly lower F-score compared to earlier methods.
Motivation: Previous research utilizes models that are designed around low-resource environments, or that leverage unannotated data.

Question 3: What morphological patterns can the system detect, or not detect, and is there any difference between the two architectures in this regard?

Hypothesis 3: Both architectures will have the same strengths and weaknesses regarding specific morphological patterns.
Motivation: Since there is no theoretical motivation as to why one architecture would outperform the other, empirical evidence is required to support a difference.


4 Method

4.1 System overview

The system accepts word-forms encoded as strings, and outputs these strings with delimiter characters interposed between the morphemes of the input word-forms. This is done in three steps: translation, processing, and a final (trivial) decoding.

Translation is done by assigning each character found in the input an index above 0, and translating each character in all input strings to its corresponding index. Uppercase letters receive the same index as their lowercase counterpart. All strings are also padded with trailing zeros up to a maximum length. This maximum length is set to four characters longer than the longest type found in the training data.

After translation the index representation is processed by the neural network to produce an equally long sequence of numbers between 0 and 1. These numbers correspond to the network's confidence in either the class 0, not followed by a boundary, or the class 1, followed by a boundary. Decoding is then trivially done by rounding the class scores to 0s and 1s, and comparing the original string with the resulting tags to determine which characters should be followed by a delimiter character. Example 10 shows the relationship between a decoded sequence and the final segmented output.

(10) u n e x p e c t e d
     0 1 0 0 0 0 0 1 0 1

     un-expect-ed
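To make the three steps concrete, the following sketch (hypothetical helper functions, not the thesis implementation) translates a word into padded indexes and decodes a vector of per-character class scores back into a segmented string:

    def build_index(words):
        # assign each character an index above 0; uppercase shares the lowercase index
        chars = sorted({ch for word in words for ch in word.lower()})
        return {ch: i + 1 for i, ch in enumerate(chars)}

    def translate(word, char_index, max_len):
        # map characters to indexes and pad with trailing zeros up to max_len
        indexes = [char_index[ch.lower()] for ch in word]
        return indexes + [0] * (max_len - len(indexes))

    def decode(word, scores):
        # round per-character scores and insert "-" after every character tagged 1
        out = []
        for i, ch in enumerate(word):
            out.append(ch)
            if round(scores[i]) == 1 and i < len(word) - 1:
                out.append("-")
        return "".join(out)

    char_index = build_index(["unexpected"])
    print(translate("unexpected", char_index, max_len=14))
    print(decode("unexpected", [0.1, 0.9, 0.2, 0.1, 0.0, 0.3, 0.2, 0.8, 0.1, 0.7]))  # un-expect-ed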

4.2 Data

The present study concerns itself with data sets of types, as opposed to the more common token corpora. In addition, all data files follow the same simple format: each line contains one type in plain text, with a delimiter character used to mark morpheme boundaries. Type-initial and type-final boundaries are omitted.

The scarcity of available annotated data for morpheme segmentation prompted this study to use two different data sources: a readily available gold standard, which was used for evaluation, and a selection from a larger corpus, which was manually annotated by the author as a part of this study. Only English data was used, due to the author's lack of familiarity with the other languages available in the gold standard.

4.2.1 Evaluation data

The data used for evaluation was the gold standard segmentations provided for the Morpho Challenge 2010 competition in Unsupervised Morpheme Analysis (Kurimo et al. 2010). Morphological labelling and lemma forms were stripped from the gold standard through a simple script to conform to the data format outlined above. The final result was then randomly divided into a test set of 1000 types and a development set of 686 types.

4.2.2 Training data

All training data used was sourced from the Universal Dependencies English corpus (Nivre et al. 2017). This choice was motivated by familiarity with the POS tag set used in the corpus, as well as by its ready availability. Presumably, most modern text corpora would be equally suitable.


Table 1: Number of types in each training set.

  Set   Types
  1k      989
  2k     1979
  4k     3873

A selection of 4000 types was extracted from the corpus and then manually cleaned from noise and annotated with morpheme boundaries. Three different sized data sets were then prepared from this selection: one of 1000 types, one of 2000 types, and the full selection. The final number of annotated and cleaned types in each set is slightly lower; the counts are given in Table 1.

4.2.2.1 Selection

Six categories of types were used in the selection process: nouns, verbs, adjectives, adverbs, pronouns, and modal verbs with enclitic constructions such as "would've" or "can't". These categories were selected to give a good spread of realizations of the inflectional categories in English. Derivational morphology follows naturally from biasing the selection towards lexical categories, since lexical items are more likely to be morphologically derived from other word-forms. Each category was given a quota of the 4000-type target size, which was filled by randomly selecting types from the corpus matching the category criteria. The pronoun and modal verb categories yielded too few matches to fill their quotas, and the remaining spots were distributed equally among the verb and adverb categories.

Due to the low amount of matches, the pronoun and modal categories were used in full in all three training sets. The remaining slots in each data set were then randomly selected from each of the four remaining categories so that the relative sizes of each category in each data set remained constant. When compiling the three data sets, a number of overlaps were noted where two types with identical surface forms had been selected for different categories. This, together with a final spelling homogenization, meant that a small number of duplicates had to be removed from each data set after the selection process. Thus the number of selected types reported below is slightly higher than what is used in the actual data sets.

The categories' criteria were defined in terms of POS tags. The Universal Dependencies corpus provides two tag sets for the English corpus: the universal tag set and a language-specific set based on the original Brown corpus tag set. In this study the latter was used, due to its higher granularity. In the case of pronouns and modals, a bigram of tags was used to catch cases such as "I" + "'ve" as "I've" or "ca" + "n't" as "can't". The host word and the following enclitic are treated as one type in all cases where no whitespace is used between them in standard orthography.
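A rough sketch of this quota-based selection is given below; the category labels, quotas, and candidate lists are illustrative stand-ins, since the actual extraction worked directly on the POS-tagged corpus.

# Illustrative quota-based selection with case-insensitive deduplication.
import random

def select_types(candidates_by_category, quotas):
    """Fill each category's quota by random sampling, then deduplicate."""
    selected = []
    for category, quota in quotas.items():
        pool = candidates_by_category[category]
        selected += random.sample(pool, min(quota, len(pool)))
    seen, unique = set(), []
    for t in selected:
        key = t.lower()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

candidates = {"noun": ["cats", "dogs", "houses"], "verb": ["walked", "walks"]}
print(select_types(candidates, {"noun": 2, "verb": 1}))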

Adjectives

For adjectives, the tags used in the language-specific set are JJ, JJR, and JJS. All of these were included in the category criterion for adjectives. These tags correspond to lemma, comparative, and superlative forms. A total of 994 unique adjectives were selected.

Adverbs

The POS tags used in the language-specific set for adverbs are RB, RBR, and RBS. All of these were included in the category criterion. These tags correspond to lemma, comparative, and superlative forms.


A total of 850 unique adverbs were selected.

Modals

Only modals followed by an enclitic such as "n't", "'ve", or "'d" were included in the selection, due to the low morphological compositionality of modal verbs outside of such constructions. A total of 17 modals were selected, since these were all that matched the criterion.

Pronouns

The pronouns included in the training data were reflexive pronouns, possessive pronouns, and pronoun + modal enclitic constructions. A total of 25 unique pronouns were selected, since these were all that matched the criterion.

Nouns

Four morphological value sets for nouns were used: plural, singular, plural and genitive, and singular and genitive. The NN and NNS tags were used to detect nouns, which had the desired side effect of excluding proper nouns. A total of 999 unique nouns were selected.

Verbs

Verbs show the largest inflectional morphological complexity among the categories. The language-specific POS annotation distinguishes between these using a diverse tag set representing lemma form, present, present participle, past participle, past tense, and 3rd person agreement. All of these categories of verbs were included in equal number in the selection. A total of 1042 unique verbs were selected.

4.2.2.2 Annotation

All manual annotation of the training data was performed by the author. To the greatest extent possible, the guidelines outlined here aim at producing analyses identical to the original gold standard segmentations provided for the Morpho Challenge in 2010. The annotation includes inflectional and derivational morpheme boundaries. Enclitic items such as trailing genitive markers and reduced auxiliary verbs were treated as morphemes attached to their respective hosts. No further morphological information was included beyond morpheme boundaries.

Primary considerations for deciding upon a hypothesized morpheme were the following: Does a version of the word-form exist without the morpheme, and is it etymologically or semantically related to the version with the morpheme? Is the morpheme systemic, i.e. are there other known available hosts for the morpheme? Is this morpheme a known inflectional morpheme?

In uncertain cases additional etymological information was used, and hypothesized word-forms were double-checked to be attested in a dictionary. Only readily segmentable morphemes were treated in the annotation process, and stem-final vowel alternations were considered part of the stem as opposed to belonging to a variant of the morpheme. In all cases where the location of the boundary between the stem and the following morpheme was uncertain, the analysis in which the morpheme maintained the most consistent form was selected.

4.3 Model architecture

Two different architectures are used in the present study, one based on a bi-directional LSTM architecture (BLSTM) and one utilizing convolutional layers in a ResNet structure (CNN). The BLSTM architecture is motivated by the research by Wang et al. (2016) on a wide variety of tagging problems, though the specifications differ greatly. The convolutional architecture is derived from work by Östling (2016) for morphological reinflection, with minor modifications to allow for the current task.



4.3.1 Input and embedding layer

Both architectures accept the same input: integer vectors of a fixed length representing strings of characters. These input vectors are first fed to an embedding layer, which transforms each integer into a 128-dimensional embedding vector. The constraint on character sequence length is determined from the training data and is set to 4 characters longer than the longest detected type.

In the convolutional architecture the entire sequence of embedding vectors is processed simultaneously. The BLSTM architecture processes each character index in succession, first passing it through the embedding layer to obtain an embedding vector and then through the rest of the network.

4.3.2 BLSTM architecture

The bi-directional LSTM architecture, or BLSTM architecture, utilizes two stacked BLSTM modules. The first BLSTM module accepts an embedding vector from the embedding layer as input. The second BLSTM module utilizes the activations from both the forward and the backward pass of the previous BLSTM module as input. The dimensionality of the output of each LSTM in a module is set to 200. This constrains the dimensionality of the cell state, as well as the size of the gates, in units, to the same 200. Dropout with a probability of 0.5 is applied after each BLSTM module.
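A minimal Keras sketch of a model along these lines is shown below. It is written against the Keras 2 API (the study used an earlier Keras version, so exact layer and argument names may differ), the vocabulary size and maximum length are placeholders, and the per-position sigmoid output described in section 4.3.4 is included so that the model is complete.

# Sketch of the BLSTM architecture: embedding, two stacked bi-directional
# LSTM modules with 200 units per direction, dropout 0.5 after each module,
# and a single sigmoid unit applied at every position.
from keras.models import Sequential
from keras.layers import (Embedding, Bidirectional, LSTM, Dropout,
                          TimeDistributed, Dense)

vocab_size = 40   # number of character indices plus padding (placeholder)
max_len = 30      # longest training type plus four (placeholder)

blstm_model = Sequential()
blstm_model.add(Embedding(vocab_size, 128, input_length=max_len))
blstm_model.add(Bidirectional(LSTM(200, return_sequences=True)))
blstm_model.add(Dropout(0.5))
blstm_model.add(Bidirectional(LSTM(200, return_sequences=True)))
blstm_model.add(Dropout(0.5))
blstm_model.add(TimeDistributed(Dense(1, activation="sigmoid")))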

4.3.3 Convolutional architecture

The convolutional architecture utilizes 6 ResNet modules, which amounts to a total of 12 convolutional layers. Each of these layers uses 128 filters with a kernel size of 5. The modules use a pre-activation scheme, where the activation function is applied to the input before each layer. The ReLU function is used for activations, given by f(x) = max(0, x). In addition, batch normalization is applied immediately before each set of activations, and dropout with a probability of 0.5 is applied after the activations, immediately before each convolutional layer.
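The sketch below shows one way to express such a pre-activation module with the Keras 2 functional API; the hyperparameters (128 filters, kernel size 5, dropout 0.5, six modules) follow the text, while the padding scheme, vocabulary size, and maximum length are assumptions, and the position-wise sigmoid output from section 4.3.4 is added so that the model is runnable.

# Sketch of the convolutional architecture: 6 pre-activation ResNet modules
# (12 convolutional layers in total), each convolution preceded by batch
# normalization, ReLU, and dropout, with a residual connection around every
# module.
from keras.models import Model
from keras.layers import (Input, Embedding, Conv1D, BatchNormalization,
                          Activation, Dropout, add)

def resnet_module(x):
    shortcut = x
    for _ in range(2):   # two convolutional layers per module
        x = BatchNormalization()(x)
        x = Activation("relu")(x)
        x = Dropout(0.5)(x)
        x = Conv1D(128, 5, padding="same")(x)
    return add([x, shortcut])

inputs = Input(shape=(30,))        # max_len = 30 (placeholder)
x = Embedding(40, 128)(inputs)     # vocabulary size 40 (placeholder)
for _ in range(6):
    x = resnet_module(x)
outputs = Conv1D(1, 1, activation="sigmoid")(x)   # per-position boundary score
cnn_model = Model(inputs, outputs)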

4.3.4 Output layer

The output from either architecture is a sequence of numbers between zero and one of the same length as the input sequence. This is accomplished by having an output layer, with an effective size equal to the input sequence length, consisting of units with a sigmoid activation function. The activation function ensures that the output is kept between 0 and 1.

In the convolutional architecture the entire input sequence is processed at once, so the output layer is a layer with one unit for each index in the input sequence. These units are connected to the previous layer so that an output unit o_i only takes into account units u_{i,j}, where i is the position in the input sequence and j is the filter number. In other words, an output unit at a certain position only takes into account the units in the previous layer that correspond to applications of a filter at that position.

The BLSTM architecture instead uses only a single unit with a sigmoid activation. Since each position in the input sequence is processed one by one, the single unit will output a classification score at each position in the sequence before the network begins to process the next position. In other words, the same unit is reused to classify all points in the sequence one after another.


4.4 Implementation

The system was implemented in Python 3.4.2 with the Keras neural network library (Chollet et al. 2015) using the Theano backend (Theano Development Team 2016). A majority of the code was built on top of work by Robert Östling for the SIGMORPHON 2016 challenge (Östling 2016).

4.5 Training

The BLSTM architectures were trained using the Adam optimization algorithm, and the convolutional architectures used RMSprop. The loss function used was in both cases the Keras implementation of binary cross-entropy. In addition, the embedding layers enforced a maximum limit of 2 on the magnitude of the weight vectors of each unit, sometimes referred to as a maxnorm constraint of 2. Dropout remained fixed at p = 0.5 for all experimental setups.
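The training configuration corresponds roughly to the sketch below, again written against the Keras 2 API with a reduced toy model and random toy data; only the optimizers, the loss, and the maxnorm constraint of 2 are taken from the text, while the batch size and number of epochs are placeholders.

# Illustrative training setup: Adam (RMSprop for the CNN), binary
# cross-entropy, and a maxnorm constraint of 2 on the embedding weights.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras.constraints import max_norm

model = Sequential([
    Embedding(40, 128, input_length=30, embeddings_constraint=max_norm(2)),
    Bidirectional(LSTM(200, return_sequences=True)),
    TimeDistributed(Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy")   # "rmsprop" for the CNN

X_train = np.random.randint(1, 40, size=(8, 30))     # toy index vectors
y_train = np.random.randint(0, 2, size=(8, 30, 1))   # toy per-position tags
model.fit(X_train, y_train, batch_size=8, epochs=1)  # placeholder values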

4.6 Experiment design

This study uses an experimental paradigm of two independent variables in a 2×3 design. The first variable is model architecture, which is either BLSTM-based or convolution-based. The second variable is the size of the training set. There are three training sets used, referred to as the 1000 set, the 2000 set, and the 4000 set. This yields a total of six experimental setups, all of which will be evaluated against the same test set. The dependent variable is F1, though precision and recall will be reported as well. The specific definitions used are given in section 2.4. F1 scores will be compared to three previously published systems (Sirts & Goldwater 2013, Ruokolainen et al. 2014, Kohonen et al. 2010), all three evaluated on the same data set by Ruokolainen et al. (2016). In addition, qualitative error analysis will be performed on each setup.
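For reference, boundary precision, recall, and F1 can be computed along the lines of the sketch below; this is one common formulation and is meant only as an illustration, since the definitions actually used are those given in section 2.4.

# Illustrative boundary precision, recall, and F1 over segmented types.

def boundary_positions(segmented, delimiter="-"):
    """Return the character offsets that are followed by a boundary."""
    positions, offset = set(), 0
    for ch in segmented:
        if ch == delimiter:
            positions.add(offset)
        else:
            offset += 1
    return positions

def boundary_scores(gold, predicted):
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_b, p_b = boundary_positions(g), boundary_positions(p)
        tp += len(g_b & p_b)
        fp += len(p_b - g_b)
        fn += len(g_b - p_b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(boundary_scores(["un-expect-ed"], ["un-expected"]))   # (1.0, 0.5, 0.666...)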


Table 2: Mean values for 30 trained models at each training size. *Due to highly skewed distributions, medians are reported for these setups.

            F1                 Precision           Recall
  Size   CNN       BLSTM    CNN      BLSTM    CNN       BLSTM
  1k     0.0730*   0.635    0.173*   0.599    0.0462*   0.677
  2k     0.601     0.674    0.577    0.641    0.633     0.711
  4k     0.601     0.705    0.572    0.677    0.640     0.735

5 Results

5.1 Results by research questions

1: The BLSTM architecture displayed a significant (p < 0.001 at each step) monotonic increase in F1 relative to training set size. The convolutional architecture failed to evince any significant differences in F1 between the 2k set and the 4k set, and failed to stably train models on the 1k set. Furthermore, the BLSTM architecture achieved a higher F1 than the convolutional architecture for each training set, with a significance of p < 0.001 at each point of comparison.

2: The Conditional Random Fields system by Ruokolainen et al. outperformed the best performing experimental setup by 16 percentage points in F1, using a data set comparable to the smallest data set used in this study. The AG and Morfessor systems outperformed the same experimental setup by a margin of 1–2 percentage points; however, these systems use a large unannotated data set. In addition, systems using the semi-supervised learning environment outperformed all other versions and learning environments.

3: All models showed their best performance on inflectional suffixes and highly frequent derivational suffixes. In addition, all models detected hyphenated compounds to some degree. The BLSTM models detected prefixes and non-hyphenated compounds to a varying degree, as well as a larger number of derivational suffixes. All models struggled with compounds, and derivational morphology of classical origin posed difficulties. In addition, no model detected vowel shifts or stress shifts, due to a limitation in the method used.

5.2 Quantitative results

30 models of each architecture were trained on each of the three data sets: the 1k, 2k, and 4k sets. The mean values of F1, precision, and recall are presented for each setup in Table 2. The convolutional architecture failed to achieve any non-trivial learning in 28 out of 30 cases on the 1k set. Due to this highly skewed distribution of results, the values reported for this architecture and training size setup are the medians of the respective populations rather than the means.

Since early inspection of the data revealed the samples from each of the experimental setups to be non-normally distributed, the significance test used at each point of comparison was a Wilcoxon two-sample test. For each training set the two architectures were compared, and a two-sample comparison was performed pairwise between each step from a smaller training set to a larger training set within each architecture.
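Such a comparison can be carried out with the rank-sum implementation of the Wilcoxon two-sample test in SciPy, as in the sketch below; the score lists are made-up toy values rather than results from this study.

# Illustrative Wilcoxon two-sample (rank-sum) test between two samples of F1 scores.
from scipy.stats import ranksums

blstm_f1 = [0.70, 0.71, 0.69, 0.72, 0.70]   # toy values
cnn_f1 = [0.60, 0.61, 0.59, 0.60, 0.62]     # toy values
statistic, p_value = ranksums(blstm_f1, cnn_f1)
print(statistic, p_value)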


Table 3: Comparison in F1 score with earlier research. *Ruokolainen et al. (2013, 2016) use a different training set than this study. †The Morf. and AG systems used 385k unannotated data tokens. ††The semi-supervised versions of the systems from Ruokolainen et al. (2016) used a small annotated data set and a large unannotated data set.

  Size           CNN      BLSTM    CRF      AG       Morf.
  1k             0.0730   0.635
  1k*                              0.861
  2k             0.601    0.674
  4k             0.601    0.705
  385k†                                     0.717    0.763
  1k + 385k††                      0.881    0.775    0.841


5.2.1 Relationship between training set size and F1

The BLSTM architecture scored higher in F1, recall, and precision than the convolutional architecture in all size categories. These differences are all significant with p < 0.001. With the 1k training set the difference is extreme. On the 2k set the BLSTM scores 7 percentage points higher in F1, and on the 4k set the BLSTM scores 10 percentage points above the convolutional architecture.

A monotonic increase in all three measures is found in the BLSTM results relative to the size of the training set. All of these differences are significant with p < 0.001. This is not true for the convolutional architecture, which reaches its peak precision and F1 at the 2k training set. Note that the difference between the performance of the convolutional architecture on the 2k and 4k sets is not statistically significant for any of the three measurements. The differences between the performance of the convolutional architecture on the 1k set and on the 2k and 4k sets are trivially significant, and a test confirmed this with p < 0.001.

On all training set sizes the BLSTM architecture achieves a higher recall than precision. This pattern is the same for the convolutional architecture on the 2k and 4k data sets.

5.2.2 Comparison with earlier research

No significance testing was performed on the data from previously published research, due to only having the published numbers available. Table 3 summarizes the differences between the systems developed for this study and those mentioned in Ruokolainen et al. (2016). The reported scores for the Morfessor, Adaptor Grammar, and Conditional Random Fields systems are all higher than the F1 for the convolutional and BLSTM architectures at all levels. This difference is even larger in the semi-supervised learning environment.

The highest F1 result from this study is the BLSTM models trained on the 4k data set, which achieved a mean of 0.705. Notably, the CRF system achieved an F1 increase of 16 percentage points over this result, using a data set which is similar in size to the smallest training set in this study. The AG system performed one percentage point above the 4k BLSTM models, and the Morfessor system performed roughly one and a half percentage points above the same BLSTM 4k result.


Both the AG system and the Morfessor system utilized a large unannotated data set to achieve these results.

5.3 Error analysis

The error analysis was performed separately from the quantitative analysis. A single high-performing model out of five from each experimental setup was taken as a representative for each architecture on each training set. These models are not among the 30 trained for the quantitative analysis, but were trained using the same specifications. In addition, the error analysis focused on what types of errors the different setups produced, as opposed to the quantity of each type. References to comparisons of error type frequencies between models are to be interpreted as rough subjective estimations.

5.3.1 Convolutional architecture

1k data set

The convolutional model trained with the 1k training set performed far below satisfaction. In a majority of cases it simply posited a morpheme break after each character, and it found no relevant linguistic patterns in the data whatsoever.

2k data set

Using the 2k training set, the convolutional architecture achieved a mixture of oversegmentation and undersegmentation. The model fails to detect compounds in almost every case, not labelling the two roots as separate morphemes. The only exception to this is where a compound is hyphenated, which the system detects to a very high degree. It also consistently fails to identify prefixes of any kind. The system does manage to find several of the most frequently occurring suffixes to a high degree, specifically the plural and genitive suffixes, along with the present participle -ing, de-adjectival -ly, and de-verbal -er and -ion. A common error the system makes is that part of a stem is interpreted as belonging to one of these suffixes. In addition, the system frequently posits the suffix -ss instead of the more proper -ness.

Furthermore, the system struggles with longer chains of classical suffixes. Most commonly these suffixes are simply not segmented, and in other cases the system posits morpheme boundaries in an incorrect location. There are also cases where a part of a stem is interpreted as one of the more common classical suffixes. In addition to these errors there is also a class of inexplicable oversegmentation. All of these cases involve parts of a stem being interpreted as an arbitrary suffix, or a suffix being split into two.

4k data set

The convolutional model trained on the full training set of 4000 types performed similarly to the 2k model. It successfully locates many instances of a few common suffixes (-ing, -ed, -s, -' and -'s) and posits several other suffixes with varying degrees of success. One of the biggest sources of oversegmentation is positing one of the common morphemes incorrectly. Especially the suffixes -er, -s, and -ing frequently appear in the system's analysis as false positives. One common category of this error is when the system segments a compound word that has been lexicalized from derived words. Another common case is when an -es plural is falsely tagged as containing an -s plural.

The model also frequently undersegments towards the beginning and middle of a type. It locates no prefixes, and only finds compounds if they are hyphenated. In addition, when more than one or two suffixes are present in a word the model is exceedingly likely to miss at least one of the inner suffixes. This is especially true in cases where multiple classical suffixes are stacked together towards the end of a word.


Furthermore, many common suffixes, such as de-adjectival -ly and -ness, are simply never found by the system. Another source of error is when the system posits a dubious suffix. These appear in many different forms, often in conjunction with having posited a more common suffix incorrectly.

Compared to the 2k model, the 4k model has a larger inventory of suffixes it uses in its analyses. It also has a stronger tendency to falsely analyse a type as containing one of these suffixes, particularly the more common -s, -er, -ing, and -ed.

5.3.2 BLSTM architecture

1k data set

The BLSTM model trained on the 1k training set performed slightly better than the convolutional models trained on the 2k set and the 4k set respectively in terms of raw numbers. However, the specific segmentations the two systems make show some marked differences. The BLSTM model detects and correctly segments the roots of compounds in many cases where the convolutional models did not. This is particularly clear in non-hyphenated compounds, which the convolutional models entirely ignored. The drawback is that the BLSTM model also frequently misplaces the boundary between two root morphemes and occasionally hypothesises a boundary in a longer type which is not a compound at all. In some cases these extra boundaries appear immediately following a homograph of an actual stem the system might recognize. In other cases it is harder to find a pattern in why these errors appear. The BLSTM model also fares better at detecting prefixes than its convolutional counterparts. It successfully detects the prefixes sub-, in-, and un- in a few cases; however, most prefixes still go undetected. In this case too, the system overgeneralizes the pattern found by positing boundaries after the string "sub" even where it is part of a stem as opposed to a prefix.

The range of suffixes that the BLSTM model finds is large compared to the convolutional models. In addition to the most common plural, genitive, past tense, and present participle endings, it also detects a larger variety of derivational suffixes, such as -ful, -able, -ance, -ize, -ion, and -ous. Similarly to other areas discussed above, the model overgeneralizes these suffixes to types where they do not belong. One common example is positing the -ing suffix in a compound that was lexicalized in that form. The model also struggles with long chains of classical suffixes, especially around word endings such as -ology, -ological, and -ologically. Frequently, the model oversegments these endings as ic-al-ly where an ical-ly or ic-ally is present in the gold standard. It also occasionally finds segments inside chains of suffixes which do not correspond to any known morpheme.

Finally, the model produces two classes of errors which are harder to describe in linguistic terms. The first class is one of spurious suffixes, such as a stem-final -ia. The second consists of curious segments posited before a common inflectional suffix which are clearly neither a suffix nor a stem of any sort.

2k data set

The BLSTM model trained on the 2k data set notably outperformed the BLSTM trained on the 1k set as well as all of the convolutional models. Similar to the 1k model, it detects the common regular inflectional suffixes (genitives, plurals, tense) and a large number of derivational suffixes. It does detect and mark a few suffixes that the 1k model does not, most notably -ness and -less, and it outperforms all previous models in terms of detecting prefixes. Similar to other models, it also overgeneralizes these suffixes to parts of certain stems. The model detects and segments the roots of compounds, similar to the 1k model. The precision of these segmentations is higher than that of the 1k model, with very few boundaries being placed a character off. In addition, the system produces fewer novel but spurious suffixes such as those proposed by the 1k model. However, both of these error types are still present in the model output.


Chains of classical suffixes are something this model also struggles with. Largely this occurs in cases where the combinatorial potential of the suffixes is somewhat irregular. An example would be the type 'computation': short of having prior lexical knowledge or having seen all forms in the training data, there is no way to know that 'compute' and 'computation' exist without the intermediate form '*computate' existing. With the required knowledge, we can conclude that the correct segmentation is 'comput-ation', as opposed to 'comput-at-ion' as would be true if the intermediate form '*computate' did exist. Related to this problem is an error type where the model correctly analyses and finds the first boundary separating the stem from the first suffix, but fails to assign boundaries between the suffixes. Occasionally the system will produce a segmentation of this type where the first boundary is in fact in the middle of a stem.

4k data set

The BLSTM model trained on the 4k training set achieved a roughly 2–3 percentage points higher score in all three metrics compared to the model trained on the 2k set. The differences between the respective models' segmentations are minor. A common source of error for the 4k model is, similarly to the other models, overgeneralizing common inflectional suffixes. Frequently this is done with the -er, -en, or -ing suffixes. Sometimes the same type of error happens with frequent derivational suffixes, such as -ic or -ist. A specific case of this error which the other two BLSTM models exhibited seems to be almost entirely absent, namely segmenting a plural -es as -e-s.

Compound words pose a challenge to this model too, though it appears to be equally equipped to handle these as the 2k model. Frequently, the errors consist of not positing a boundary where one ought to be, or of attempting to find two roots where there is only one. The system also in a few cases places a root-root boundary one or two characters off, though this error is not as frequent as in the 1k model. Similar to the other BLSTM models, these errors almost exclusively occur in non-hyphenated compounds.

The model has trouble with longer chains of classical suffixes, similarly to all the other models. The 4k BLSTM version performs some very good segmentations in this category, but also completely oversegments some words, and sometimes heavily undersegments. There does not appear to be a clear pattern as to when it oversegments or undersegments. There is also the more specific case of this error class where the model cannot determine the exact segmentation of some more complicated derivations.

Lastly, the 4k model appears slightly less proficient in detecting prefixes than the 2k model. However, on average the 4k model segments more than the 2k model, and according to the statistics also more accurately.


6 Discussion

6.1 Discussion of method

6.1.1 Discussion of network models

Both BLSTM networks and convolutional networks are standard and well-studied network architectures for various NLP tasks. Convolutional architectures are best known for image recognition (e.g. He et al. 2015, 2016a) but have also been utilized in morphological processing in Östling (2016). BLSTM architectures are commonly used for sequence labelling tasks (Wang et al. 2015) and have also been applied to morpheme segmentation with a slightly different approach in Wang et al. (2016).

6.1.2 Discussion of data

Data scarcity is a significant problem in the field of morphological segmentation. This scarcity mandated manual annotation of at least part of the data used in the study, which limited the language choice to English. A resulting effect of this is that the inherent problems in utilizing a concatenative approach to morphological segmentation go largely unnoticed in the present study. If a language which uses a significant amount of non-concatenative morphology had been opted for instead, either the system would have had to change considerably or a significant portion of the morphology of that language would have had to be ignored.

In addition, the lack of available data means that the source domain for this study's training data is in fact a different one than that of the gold standard. Here, we derived our data by randomly sampling types from a general corpus to fit specific POS and morphological value combinations. The gold standard is instead based on the Hutmegs corpus (Creutz et al. 2005), which derives its types from lexicons. It is, however, unclear what functional difference there is between types collected from corpus data that has been cleaned from non-standardized spellings and data collected from a lexicon.

6.1.3 Discussion of annotation

There was no documentation on the annotation process used by the gold standard available during the annotation of the training data. Instead, the annotation scheme had to be inferred from the available segmentations. Undoubtedly, some inconsistencies between the two annotations, the one on the gold standard and the one on the training data, are bound to have resulted from this. Furthermore, the annotator for this study is not an L1 speaker of English, but rather an advanced L2 speaker. This may also have impacted the quality of the annotations.

A consequence of using a concatenative model of morphology is that ablaut and umlaut alternations (e.g. sing, sung, sang) are not captured by the annotation. Essentially, these morphologically related word-forms are treated as unrelated bases, with no explicit marking linking them together. While this is a severe limitation on concatenative models, it is by no means unique to this study. A majority of the systems used in research on English have the same limitation.

6.1.4 Discussion of tagset

There are several tagsets applicable to the current problem that are in use in the literature, e.g. various BIO/IOB-2 related tagsets and the BMS tagset in Ruokolainen et al. (2014, 2013). The approach taken in this study uses a more simplistic tagset than all of these established methods.


Presumably, performance could be increased by adopting one of the other, more fine-grained sets. However, since we cannot guarantee that the neural network will output a legal tag sequence in one of these advanced tagsets, a decoding algorithm (e.g. a hidden Markov model) would be required to interpret the output sequence into a segmentation. The simpler tagset was chosen to avoid this extra complication.

6.1.5 Discussion of experimental setup

Testing the system on low amounts of data, as in this study, allows for easier comparison with minimally supervised systems. While the amount of data used is considerably less than many unsupervised systems use, supervised data is harder to come by, meaning that a comparison with a supervised system using the same amount of data as an unsupervised system would be meaningless. In this way, limiting the study to small amounts of data makes the comparisons with the unsupervised systems more insightful as well.

A potential issue with the current experimental setup is that the limited number of data points on the training-size axis in the experiment leaves little room to detect trends in how training size affects F-score. In particular, it becomes impossible to say anything about how the architectures may compare on larger amounts of data. Ideally, a setup which also tests them at, say, 8k, 10k, or 15k types would be preferable in this context. This was not done because annotating data is very time-consuming, and 4k types was deemed the largest amount that was feasible.

Another potential issue lies in the comparison done with the numbers from Ruokolainen et al. (2016). This study uses the training data from that study as the gold standard, with the newly annotated data as training data. While it is unlikely that the training data from the 2010 Morpho Challenge is particularly different from the evaluation data, it is important to note that this discrepancy exists.

6.2 Discussion of results

6.2.1 Discussion of quantitative results

The poor performance of the convolutional architecture using 1k training data is certainly due to a lack of data. Several attempts were made to find a model trained on 1k data that did in fact learn, and 28 out of 30 failed. This argument is strongly supported by the results of this study: the convolutional architecture learned far more aptly and consistently on 2k and 4k data. This suggests a minimum data requirement for this architecture to learn, which would be somewhere between 1000 types and 2000 types.

The difference in F1 between the two architectures is difficult to explain with any certainty given the evidence in this study. The fact that the convolutional architecture failed to stably train models on the 1k data set suggests that it is more data dependent than the BLSTM architecture; however, this is not supported by the fact that the convolutional architecture showed no significant difference between the 2k and 4k data sets. It is highly possible that the convolutional architecture requires significant fine-tuning of hyperparameters in order to perform well on morpheme segmentation, whereas the BLSTM architecture may be less sensitive to specific hyperparameter settings in this particular task.

6.2.2 Discussion of error analysis

The types of errors performed by the various models are fairly similar: missing compounds, over- and under-positing suffixes, and erroneously analysed chains of suffixes. The system's weakness in identifying compounds is understandable: unlike suffixation, compounding is done with lexical items, which are far less frequent than grammatical items such as suffixes. This means that examples of compounding with any given root are far less likely to appear in the training data than examples of a given suffix, which is likely to be represented in a number of word-forms and contexts.


It is possible that compound boundary detection would be improved with a larger data set (even more so than suffix detection), due to there being more roots represented in the data which can later be detected in a compound.

The difference in performance between the two architectures on compound detection, however, is harder to explain. The BLSTM models correctly identified more compounds, but also incorrectly posited boundaries in the middle of longer roots as though they were compounds. This would suggest that the BLSTM models have a tendency to posit more boundaries than the convolutional models. Alternatively, the BLSTM models could simply have learnt that longer roots may sometimes be split in two, as opposed to having learnt specific roots which can be detected in a compound.

Another interesting error type is when the system incorrectly posits an affix that occurs frequently elsewhere, e.g. the plural '-s' or a derivational 'in-'. The BLSTM models had a higher rate of this error type, together with a broader range of correctly identified suffixes. In line with the argument above regarding compound detection, this is likely to be a matter of a higher tendency to posit a morpheme boundary compared to the convolutional models. Simply put: if the model requires less evidence to posit a suffix, it is going to find a broader variety of suffixes as well as posit these suffixes in cases where they do not belong to a higher degree. It is possible that biasing the data to contain a higher quota of nouns could alleviate the problem by giving the model access to a larger variety of stems to learn from. This would increase the likelihood that the model has seen a given stem before, and hypothetically decrease the likelihood that the system incorrectly divides a given stem into parts. Alternatively, a larger amount of data in general could have a similar effect.

Word-forms and suffixes of classical Greek or Latin origin present a significant challenge to all of the models in this study. This is to be expected: in English there is a large amount of inconsistency in how these morphemes are realized and applied, depending on various factors such as when the word-form appeared in the language, individual levels of lexicalization, and which language it has been borrowed from. Based on the surface form alone it is often unclear whether a word-form has its origins in medieval Latin, Norman French, later medical usage of Latin roots and terminology, or sometimes Greek loans which use morphology borrowed from Latin. Depending on when and where the loan happened, word-forms may have been borrowed wholesale as lexical items, or have been morphologically derived from another root which was borrowed much earlier. This means that word-forms which appear to have the same morphological structure on the surface may in fact have different correct analyses, depending on whether a root exists with or without a certain possible derivational suffix at the end of the word-form. For example, the word-form 'computation' is easy to segment as 'comput-ation' with the knowledge that a word-form 'compute' exists, but there is no such thing as *'comput-ate', whereas 'administration' is correctly analysed as 'ad-ministr-at-ion', as opposed to *'ad-ministr-ation', due to the existence of the word-forms 'administrate' and 'administer'. Bauer et al. (2013, p. 181) analyse this using a concept of extenders: a class of semantically empty morphs which are in specific cases required to precede certain suffixes in order to make them compatible with a base. Under this analysis, the 'ation' suffix in the 'computation' example is analysed as the same '-ion' suffix used in the 'administration' example, with the extender 'at' added to the base 'compute'. While this provides a linguistic description of the current state of these suffixes in the English language, it does not change the ambiguity in surface forms ending in substrings such as 'ation'.

6.2.3 Discussion of comparison with previous works

Surprisingly, the best performing BLSTM model achieved an F1 which is similar to that of the Adaptor Grammars system of Sirts & Goldwater (2013) in the unsupervised learning environment. Both Morfessor and Ruokolainen's CRF system outperformed all BLSTM models, with the CRF system managing a 15+ percentage point increase in F1 using a similarly small amount of annotated training data.


The superiority of the CRF system over the neural network methods may be explained by the fine-tuned feature set it uses, which was designed specifically for morpheme segmentation. Neural networks, in contrast, must learn their own features from the data. This difference is especially true for the convolutional architecture. The BLSTM architecture, however, is designed around taking left context and right context into account, which is reminiscent of the character context features of the CRF system.

All three of the convolutional models achieved a lower F1 than all three of the reference systems. The difference is particularly striking at the 1k training set size, where the CRF system managed an F1 of 0.861 and the convolutional architecture failed to learn any patterns. One possible explanation for these differences would be the methods' inherent reliance on data. It is entirely possible that the neural network methods could outperform the other methods on large amounts of data, given the large-scale success of neural network systems in other fields.

All three systems which had access to the semi-supervised learning environment outperformed the models trained in this study. This is highly expected, since this learning environment gains 'the best of both worlds' in terms of available data. The unannotated data set provides large numbers of examples of stems, roots, and affixes, and the smaller supervised data set provides accurate boundary information. The unsupervised or minimally supervised systems gain only one of these two resources.


7 Conclusions

The purpose of the present study was to evaluate the efficacy of CNN and BLSTM methods on the task of morpheme segmentation and compare these methods to previously published work. This was done by annotating a new gold standard to use as training data, implementing a CNN architecture and a BLSTM architecture, and evaluating these networks in F1 on a gold standard which had previously been used in a comparative study of other segmentation systems. In addition, a qualitative error analysis was performed to identify which types of morphological patterns the system could detect.

7.1 Conclusions by question

Question 1: What is the relationship between training data size and efficacy in F-score for the convolutional and the BLSTM architectures respectively?

There were two hypotheses regarding this question: 1a, both architectures will achieve higher F1 on larger data sets; 1b, the architectures will achieve similar F1 with the same amount of training data.

Hypothesis 1a is only partially supported by the results of this study. The BLSTM architecture did show a monotonic increase in F1 with training size, but the CNN architecture did not. Thus, the hypothesis is accepted only for the BLSTM architecture. Hypothesis 1b is not supported by the results of this study: the CNN models achieved a lower mean F1 than the BLSTM models in all experimental setups.

Question 2: How do these models compare to previous results from unsupervised methods?

There were two hypotheses regarding this question: 2a, the BLSTM models will achieve a lower F1 compared to previous research; 2b, the CNN models will achieve a lower F1 than previous research.

Hypothesis 2a is considered accepted. The best set of BLSTM models achieved an F1 mean of 0.705, which is similar to the Adaptor Grammar system of Sirts & Goldwater (2013) at 0.711, yet lower than the best performing system, which achieved an F1 of 0.861. Hypothesis 2b is also accepted; the CNN architecture achieved a maximum F1 mean of 0.601.

Question 3: What morphological patterns can the system detect, or not detect, and is there any difference between the two architectures in this regard?

The hypothesis regarding this question was as follows: 3, both architectures will have the same strengths and weaknesses regarding specific morphological patterns.

Hypothesis 3 is rejected based on the results from the qualitative error analysis. There were several problem areas which both architectures shared at all training sizes. All models struggled with overgeneralizing common suffixes while failing to detect less common suffixes. Compound boundaries were also difficult for both architectures, as was correctly segmenting suffixes of classical origin. However, the convolutional architecture performed significantly worse in regards to compounding, and only attempted to segment hyphenated compounds. The convolutional models also failed to detect prefixes. The BLSTM models outperformed the convolutional models in both compound detection and prefix detection, but came up with more spurious suffixes which do not exist in the English language, such as a word-final '-ia'. What both architectures did excel at was detecting and segmenting common inflectional suffixes, as well as highly frequent derivational ones. Here too, the BLSTM models outperformed the convolutional models, particularly on derivational suffixes.


7.1.1 Future research

In addition to the hypotheses discussed above, this study concludes that the BLSTM method is a viable alternative to explore for morpheme segmentation, at least for languages with a largely concatenative morphology. This restriction is not seen as a detriment to this study compared to others, since it is a limitation on all of the systems this study has been compared to as well. Further research into how a convolutional network would fare on a larger data set would be required to determine how suited CNN methods are to the current task. The architecture tested here did not perform very well, and showed severe inconsistencies in training on low amounts of data. It is, however, possible that a larger amount of training data and an optimization of the architecture design could render a convolutional network approach fruitful for morpheme segmentation.

7.1.2 Main contributions

The main contributions of this study are as follows:

• A new annotated gold standard of 3873 morpheme-segmented types.
• A proof-of-concept implementation of a BLSTM architecture for morpheme segmentation, and an adaptation of Östling's CNN system.
• An exploration of what types of morphological structure these methods capture.
• A comparison of these systems, trained on low amounts of data, with other systems using similar data sets or unannotated data.


References

Barron, A., Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743–2760.

Bauer, L., Lieber, R. & Plag, I. (2013), The Oxford reference guide to English morphology, Oxford University Press.

Chollet, F. et al. (2015), 'Keras', https://github.com/fchollet/keras.

Creutz, M. & Lagus, K. (2005), Inducing the morphological lexicon of a natural language from unannotated text, in 'Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR05)', Vol. 1, pp. 51–59. URL: http://nlp.cs.swarthmore.edu/msim/papers/creutz2005-inducing.pdf

Creutz, M. & Lagus, K. (2007), 'Unsupervised models for morpheme segmentation and morphology learning', ACM Transactions on Speech and Language Processing 4(1), 1–34. URL: http://portal.acm.org/citation.cfm?doid=1187415.1187418

Creutz, M., Lagus, K., Lindén, K. & Virpioja, S. (2005), Morfessor and Hutmegs: Unsupervised morpheme segmentation for highly-inflecting and compounding languages, in 'Proceedings of the Second Baltic Conference on Human Language Technologies', pp. 107–112. URL: http://www.ling.helsinki.fi/~klinden/pubs/Creutz05hlt.pdf

Frank, M. C., Goldwater, S., Griffiths, T. L. & Tenenbaum, J. B. (2010), 'Modeling human performance in statistical word segmentation', Cognition 117(2), 107–125.

Goldsmith, J. (2001), 'Unsupervised Learning of the Morphology of a Natural Language', Computational Linguistics 27(2), 153–198. URL: http://dx.doi.org/10.1162/089120101750300490

Goldsmith, J. (2006), 'An algorithm for the unsupervised learning of morphology', Natural Language Engineering 12(4), 353–371. URL: http://journals.cambridge.org/abstract_S1351324905004055

Goldwater, S., Griffiths, T. L. & Johnson, M. (2009), 'A Bayesian framework for word segmentation: Exploring the effects of context', Cognition 112(1), 21–54.

Habash, N., Eskander, R. & Hawwari, A. (2012), A morphological analyzer for Egyptian Arabic, in 'Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology', Association for Computational Linguistics, pp. 1–9.

Hammarström, H. & Borin, L. (2011), 'Unsupervised learning of morphology', Computational Linguistics 37(2), 309–350. URL: http://dl.acm.org/citation.cfm?id=2000519

Harris, Z. S. (1955), 'From morpheme to phoneme', Language 31(2), 190–222.

Haspelmath, M. & Sims, A. (2013), Understanding morphology, Routledge.


He, K., Zhang, X., Ren, S. & Sun, J. (2015), Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 'Proceedings of the IEEE International Conference on Computer Vision', pp. 1026–1034.

He, K., Zhang, X., Ren, S. & Sun, J. (2016a), Deep residual learning for image recognition, in 'Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition', pp. 770–778.

He, K., Zhang, X., Ren, S. & Sun, J. (2016b), Identity mappings in deep residual networks, in B. Leibe, J. Matas, N. Sebe & M. Welling, eds, 'Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV', Springer International Publishing, Cham, pp. 630–645. URL: https://doi.org/10.1007/978-3-319-46493-0_38

Hochreiter, S. & Schmidhuber, J. (1997), 'Long short-term memory', Neural Computation 9(8), 1735–1780.

Kohonen, O., Virpioja, S. & Lagus, K. (2010), Semi-supervised learning of concatenative morphology, in 'Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology', Association for Computational Linguistics, pp. 78–86.

Kurimo, M., Virpioja, S., Turunen, V. & Lagus, K. (2010), Morpho Challenge competition 2005–2010: Evaluations and results, in 'Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology', Association for Computational Linguistics, pp. 87–95.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. (2013), Distributed representations of words and phrases and their compositionality, in 'Advances in Neural Information Processing Systems', pp. 3111–3119.

Nivre, J., Agic, Ž., Ahrenberg, L., Aranzabe, M. J., Asahara, M., Atutxa, A., Ballesteros, M., Bauer, J., Bengoetxea, K., Bhat, R. A., Bick, E., Bosco, C., Bouma, G., Bowman, S., Candito, M., Cebirolu Eryiit, G., Celano, G. G. A., Chalub, F., Choi, J., Çöltekin, Ç., Connor, M., Davidson, E., de Marneffe, M.-C., de Paiva, V., Diaz de Ilarraza, A., Dobrovoljc, K., Dozat, T., Droganova, K., Dwivedi, P., Eli, M., Erjavec, T., Farkas, R., Foster, J., Freitas, C., Gajdošová, K., Galbraith, D., Garcia, M., Ginter, F., Goenaga, I., Gojenola, K., Gökrmak, M., Goldberg, Y., Gómez Guinovart, X., Gonzáles Saavedra, B., Grioni, M., Gruzitis, N., Guillaume, B., Habash, N., Hajic, J., Hà M, L., Haug, D., Hladká, B., Hohle, P., Ion, R., Irimia, E., Johannsen, A., Jørgensen, F., Kaskara, H., Kanayama, H., Kanerva, J., Kotsyba, N., Krek, S., Laippala, V., Lê Hng, P., Lenci, A., Ljubešic, N., Lyashevskaya, O., Lynn, T., Makazhanov, A., Manning, C., Maranduc, C., Marecek, D., Martínez Alonso, H., Martins, A., Mašek, J., Matsumoto, Y., McDonald, R., Missilä, A., Mititelu, V., Miyao, Y., Montemagni, S., More, A., Mori, S., Moskalevskyi, B., Muischnek, K., Mustafina, N., Müürisep, K., Nguyn Th, L., Nguyn Th Minh, H., Nikolaev, V., Nurmi, H., Ojala, S., Osenova, P., Øvrelid, L., Pascual, E., Passarotti, M., Perez, C.-A., Perrier, G., Petrov, S., Piitulainen, J., Plank, B., Popel, M., Pretkalnia, L., Prokopidis, P., Puolakainen, T., Pyysalo, S., Rademaker, A., Ramasamy, L., Real, L., Rituma, L., Rosa, R., Saleh, S., Sanguinetti, M., Saulite, B., Schuster, S., Seddah, D., Seeker, W., Seraji, M., Shakurova, L., Shen, M., Sichinava, D., Silveira, N., Simi, M., Simionescu, R., Simkó, K., Šimková, M., Simov, K., Smith, A., Suhr, A., Sulubacak, U., Szántó, Z., Taji, D., Tanaka, T., Tsarfaty, R., Tyers, F., Uematsu, S., Uria, L., van Noord, G., Varga, V., Vincze, V., Washington, J. N., Žabokrtský, Z., Zeldes, A., Zeman, D. & Zhu, H. (2017), 'Universal dependencies 2.0'. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University. URL: http://hdl.handle.net/11234/1-1983


Östling, R. (2016), Morphological reinflection with convolutional neural networks, in 'Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology'. URL: http://aclweb.org/anthology/W/W16/W16-2003.pdf

Pasha, A., Al-Badrashiny, M., Diab, M. T., El Kholy, A., Eskander, R., Habash, N., Pooleery, M., Rambow, O. & Roth, R. (2014), MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic, in 'LREC', Vol. 14, pp. 1094–1101.

Rehg, K. L. & Sohl, D. G. (1981), Ponapean Reference Grammar, University of Hawaii Press.

Ruokolainen, T., Kohonen, O., Sirts, K., Grönroos, S.-A., Kurimo, M. & Virpioja, S. (2016), 'A Comparative Study of Minimally Supervised Morphological Segmentation', Computational Linguistics 42(1), 91–120. URL: http://www.mitpressjournals.org/doi/10.1162/COLI_a_00243

Ruokolainen, T., Kohonen, O., Virpioja, S. & Kurimo, M. (2013), Supervised morphological segmentation in a low-resource learning setting using conditional random fields, in 'Proceedings of the Seventeenth Conference on Computational Natural Language Learning', pp. 29–37.

Ruokolainen, T., Kohonen, O., Virpioja, S. et al. (2014), Painless semi-supervised morphological segmentation using conditional random fields, in 'Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers', pp. 84–89.

Sirts, K. & Goldwater, S. (2013), 'Minimally-supervised morphological segmentation using adaptor grammars', Transactions of the Association for Computational Linguistics 1, 255–266. URL: https://www.transacl.org/ojs/index.php/tacl/article/view/20

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014), 'Dropout: A simple way to prevent neural networks from overfitting', Journal of Machine Learning Research 15(1), 1929–1958. URL: http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf

Theano Development Team (2016), 'Theano: A Python framework for fast computation of mathematical expressions', arXiv e-prints abs/1605.02688. URL: http://arxiv.org/abs/1605.02688

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. & Lang, K. J. (1990), Phoneme recognition using time-delay neural networks, in 'Readings in Speech Recognition', Elsevier, pp. 393–404.

Wang, L., Cao, Z., Xia, Y. & de Melo, G. (2016), Morphological Segmentation with Window LSTM Neural Networks, in 'AAAI', pp. 2842–2848. URL: http://iiis.tsinghua.edu.cn/~weblt/papers/window-lstm-morph-segmentation.pdf

Wang, P., Qian, Y., Soong, F. K., He, L. & Zhao, H. (2015), 'A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding', arXiv preprint arXiv:1511.00215. URL: https://arxiv.org/abs/1511.00215

