
SIGMORPHON 2021

18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Proceedings of the Workshop

August 5, 2021
Bangkok, Thailand (online)

©2021 The Association for Computational Linguistics and The Asian Federation of Natural Language Processing

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-954085-62-6


Preface

Welcome to the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, to be held on August 5, 2021 as part of a virtual ACL. The workshop aims to bring together researchers interested in applying computational techniques to problems in morphology, phonology, and phonetics. Our program this year highlights the ongoing investigations into how neural models process phonology and morphology, as well as the development of finite-state models for low-resource languages with complex morphology.

We received 25 submissions, and after a competitive reviewing process, we accepted 14.

The workshop is privileged to present four invited talks this year, all from highly respected members of the SIGMORPHON community. Reut Tsarfaty, Kenny Smith, Kristine Yu, and Ekaterina Vylomova all presented talks at this year’s workshop.

This year also marks the sixth iteration of the SIGMORPHON Shared Task. Following the success of last year’s multiple tasks, we again hosted three shared tasks:

Task 0:

SIGMORPHON’s sixth installment of its inflection generation shared task is divided into two parts: generalization and cognitive plausibility.

In the first part, participants designed a model that learned to generate morphological inflections from a lemma and a set of morphosyntactic features of the target form, similar to previous years’ tasks. This year, participants learned morphological tendencies on a set of development languages, and then generalized these findings to new languages, without much time to adapt their models to new phenomena.

The second part asks participants to inflect nonce words in the past tense, which are then judged for plausibility by native speakers. This task aims to investigate whether state-of-the-art inflectors are learning in a way that mimics human learners.

Task 1:

The second SIGMORPHON shared task on grapheme-to-phoneme conversion expands on the task from last year, recategorizing data as belonging to one of three different classes: low-resource, medium-resource, and high-resource.

The task saw 23 submissions from 9 participants.

Task 2:

Task 2 continues the effort from the 2020 shared task in unsupervised morphology. Unlike last year’s task, which asked participants to implement a complete unsupervised morphology induction pipeline, this year’s task concentrates on a single aspect of morphology discovery: paradigm induction. This task asks participants to cluster words into inflectional paradigms, given no more than raw text.

The task saw 14 submissions from 4 teams.

We are grateful to the program committee for their careful and thoughtful reviews of the papers submitted this year. Likewise, we are thankful to the shared task organizers for their hard work in preparing the shared tasks. We are looking forward to a workshop covering a wide range of topics, and we hope for lively discussions.

Garrett Nicolai
Kyle Gorman
Ryan Cotterell


Organizing Committee

Garrett Nicolai (University of British Columbia, Canada)
Kyle Gorman (City University of New York, USA)
Ryan Cotterell (ETH Zürich, Switzerland)

Program Committee

Damián Blasi (Harvard University)
Grzegorz Chrupała (Tilburg University)
Jane Chandlee (Haverford College)
Çağrı Çöltekin (University of Tübingen)
Daniel Dakota (Indiana University)
Colin de la Higuera (University of Nantes)
Micha Elsner (The Ohio State University)
Nizar Habash (NYU Abu Dhabi)
Jeffrey Heinz (University of Delaware)
Mans Hulden (University of Colorado)
Adam Jardine (Rutgers University)
Christo Kirov (Google AI)
Greg Kobele (Universität Leipzig)
Grzegorz Kondrak (University of Alberta)
Sandra Kübler (Indiana University)
Adam Lamont (University of Massachusetts Amherst)
Kevin McMullin (University of Ottawa)
Kemal Oflazer (CMU Qatar)
Jeff Parker (Brigham Young University)
Gerald Penn (University of Toronto)
Jelena Prokić (Universiteit Leiden)
Miikka Silfverberg (University of British Columbia)
Kairit Sirts (University of Tartu)
Kenneth Steimel (Indiana University)
Reut Tsarfaty (Bar-Ilan University)
Francis Tyers (Indiana University)
Ekaterina Vylomova (University of Melbourne)
Adina Williams (Facebook AI Research)
Anssi Yli-Jyrä (University of Helsinki)
Kristine Yu (University of Massachusetts)

Task 0 Organizing Committee

Tiago Pimentel (University of Cambridge)
Brian Leonard (Brian Leonard Consulting)
Maria Ryskina (Carnegie Mellon University)
Sabrina Mielke (Johns Hopkins University)
Coleman Haley (Johns Hopkins University)
Eleanor Chodroff (University of York)
Johann-Mattis List (Max Planck Institute)
Adina Williams (Facebook AI Research)
Ryan Cotterell (ETH Zürich)
Ekaterina Vylomova (University of Melbourne)
Ben Ambridge (University of Liverpool)

Task 1 Organizing Committee

To come

Task 2 Organizing Committee

Adam Wiemerslage (University of Colorado Boulder)
Arya McCarthy (Johns Hopkins University)
Alexander Erdmann (Ohio State University)
Manex Agirrezabal (University of Copenhagen)
Garrett Nicolai (University of British Columbia)
Miikka Silfverberg (University of British Columbia)
Mans Hulden (University of Colorado Boulder)
Katharina Kann (University of Colorado Boulder)


Table of Contents

Towards Detection and Remediation of Phonemic Confusion
Francois Roewer-Despres, Arnold Yeung and Ilan Kogan . . . . . . . . . . 1

Recursive prosody is not finite-state
Hossep Dolatian, Aniello De Santo and Thomas Graf . . . . . . . . . . 11

The Match-Extend serialization algorithm in Multiprecedence
Maxime Papillon . . . . . . . . . . 23

Incorporating tone in the calculation of phonotactic probability
James Kirby . . . . . . . . . . 32

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology
Khuyagbaatar Batsuren, Gábor Bella and Fausto Giunchiglia . . . . . . . . . . 39

A Study of Morphological Robustness of Neural Machine Translation
Sai Muralidhar Jayanthi and Adithya Pratapa . . . . . . . . . . 49

Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems
Saujas Vaduguru, Aalok Sathe, Monojit Choudhury and Dipti Sharma . . . . . . . . . . 60

Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden and Katharina Kann . . . . . . . . . . 72

Adaptor Grammars for Unsupervised Paradigm Clustering
Kate McCurdy, Sharon Goldwater and Adam Lopez . . . . . . . . . . 82

Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering
E. Margaret Perkoff, Josh Daniels and Alexis Palmer . . . . . . . . . . 90

Unsupervised Paradigm Clustering Using Transformation Rules
Changbing Yang, Garrett Nicolai and Miikka Silfverberg . . . . . . . . . . 98

Paradigm Clustering with Weighted Edit Distance
Andrew Gerlach, Adam Wiemerslage and Katharina Kann . . . . . . . . . . 107

Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion
Lucas F.E. Ashby, Travis M. Bartley, Simon Clematide, Luca Del Signore, Cameron Gibson, Kyle Gorman, Yeonju Lee-Sikka, Peter Makarov, Aidan Malanoski, Sean Miller, Omar Ortiz, Reuben Raff, Arundhati Sengupta, Bora Seo, Yulia Spektor and Winnie Yan . . . . . . . . . . 115

Data augmentation for low-resource grapheme-to-phoneme mapping
Michael Hammond . . . . . . . . . . 126

Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion
Roger Yu-Hsiang Lo and Garrett Nicolai . . . . . . . . . . 131

Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction
Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Frederic Mailhot, Shreekantha Nadig, Riqiang Wang and Nathan Zhang . . . . . . . . . . 141

CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Variations on a Baseline
Simon Clematide and Peter Makarov . . . . . . . . . . 148

What transfers in morphological inflection? Experiments with analogical models
Micha Elsner . . . . . . . . . . 154

Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gradient descent
Huteng Dai and Richard Futrell . . . . . . . . . . 167

Recognizing Reduplicated Forms: Finite-State Buffered Machines
Yang Wang . . . . . . . . . . 177

An FST morphological analyzer for the Gitksan language
Clarissa Forbes, Garrett Nicolai and Miikka Silfverberg . . . . . . . . . . 188

Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
Maria Ryskina, Eduard Hovy, Taylor Berg-Kirkpatrick and Matthew R. Gormley . . . . . . . . . . 198

Finite-state Model of Shupamem Reduplication
Magdalena Markowska, Jeffrey Heinz and Owen Rambow . . . . . . . . . . 212

Improved pronunciation prediction accuracy using morphology
Dravyansh Sharma, Saumya Sahai, Neha Chaudhari and Antoine Bruguier . . . . . . . . . . 222


Workshop Program

Due to the ongoing pandemic, and the virtual nature of the workshop, the papers will be presented asynchronously, with designated question periods.

Towards Detection and Remediation of Phonemic Confusion
Francois Roewer-Despres, Arnold Yeung and Ilan Kogan

Recursive prosody is not finite-state
Hossep Dolatian, Aniello De Santo and Thomas Graf

The Match-Extend serialization algorithm in Multiprecedence
Maxime Papillon

What transfers in morphological inflection? Experiments with analogical models
Micha Elsner

Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gradient descent
Huteng Dai and Richard Futrell

Incorporating tone in the calculation of phonotactic probability
James Kirby

Recognizing Reduplicated Forms: Finite-State Buffered Machines
Yang Wang

An FST morphological analyzer for the Gitksan language
Clarissa Forbes, Garrett Nicolai and Miikka Silfverberg

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology
Khuyagbaatar Batsuren, Gábor Bella and Fausto Giunchiglia

Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
Maria Ryskina, Eduard Hovy, Taylor Berg-Kirkpatrick and Matthew R. Gormley

Finite-state Model of Shupamem Reduplication
Magdalena Markowska, Jeffrey Heinz and Owen Rambow

A Study of Morphological Robustness of Neural Machine Translation
Sai Muralidhar Jayanthi and Adithya Pratapa



Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems
Saujas Vaduguru, Aalok Sathe, Monojit Choudhury and Dipti Sharma

Improved pronunciation prediction accuracy using morphology
Dravyansh Sharma, Saumya Sahai, Neha Chaudhari and Antoine Bruguier

Data augmentation for low-resource grapheme-to-phoneme mapping
Michael Hammond

Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion
Roger Yu-Hsiang Lo and Garrett Nicolai

Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction
Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Frederic Mailhot, Shreekantha Nadig, Riqiang Wang and Nathan Zhang

CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Variations on a Baseline
Simon Clematide and Peter Makarov

Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering
Adam Wiemerslage, Arya D. McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden and Katharina Kann

Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering
E. Margaret Perkoff, Josh Daniels and Alexis Palmer

Unsupervised Paradigm Clustering Using Transformation Rules
Changbing Yang, Garrett Nicolai and Miikka Silfverberg

Paradigm Clustering with Weighted Edit Distance
Andrew Gerlach, Adam Wiemerslage and Katharina Kann

Adaptor Grammars for Unsupervised Paradigm Clustering
Kate McCurdy, Sharon Goldwater and Adam Lopez


Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–10, August 5, 2021. ©2021 Association for Computational Linguistics

Towards Detection and Remediation of Phonemic Confusion

Francois Roewer-Despres¹* Arnold YS Yeung¹* Ilan Kogan²*
¹Department of Computer Science ²Department of Statistics
University of Toronto
francoisrd, [email protected]
[email protected]

Abstract

Reducing communication breakdown is critical to success in interactive NLP applications, such as dialogue systems. To this end, we propose a confusion-mitigation framework for the detection and remediation of communication breakdown. In this work, as a first step towards implementing this framework, we focus on detecting phonemic sources of confusion. As a proof-of-concept, we evaluate two neural architectures in predicting the probability that a listener will misunderstand phonemes in an utterance. We show that both neural models outperform a weighted n-gram baseline, showing early promise for the broader framework.

1 Introduction

Ensuring that interactive NLP applications, such as dialogue systems, communicate clearly and effectively is critical to their long-term success and viability, especially in high-stakes domains, such as healthcare. Successful systems should thus seek to reduce communication breakdown. One aspect of successful communication is the degree to which each party understands the other. For example, properly diagnosing a patient may necessitate asking logically complex questions, but these questions should be phrased as clearly as possible to promote understanding and mitigate confusion.

To reduce confusion-related communication breakdown, we propose that generative NLP systems integrate a novel confusion-mitigation framework into their natural language generation (NLG) processes. In brief, this framework ensures that such systems avoid transmitting utterances with high predicted probabilities of confusion. In the simplest and most decoupled formulation, an existing NLG component simply produces alternatives to any rejected utterances without additional guiding information. In more advanced and coupled formulations, the NLG and confusion prediction components can be closely integrated to better determine precisely how to avoid confusion. This process can also be conditioned on models of the current listener or task to achieve personalized or context-dependent results. Figure 1 shows the simplest variant of the framework.

∗Equal contribution.

Figure 1: A simplified variant of our proposed confusion-mitigation framework, which enables generative NLP systems to detect and remediate confusion-related communication breakdown. The confusion prediction component predicts the confusion probability of candidate utterances, which are rejected if this probability is above a decision threshold, φ.

As a first step towards implementing this framework, we work towards developing its central confusion prediction component, which predicts the confusion probability of an utterance. In this work, we specifically target phonemic confusion, that is, the misidentification of heard phonemes by a listener. We consider two potential neural architectures for this purpose: a fixed-context, feed-forward network and a residual, bidirectional LSTM network. We train these models using a novel proxy data set derived from audiobook recordings, and compare their performance to that of a weighted n-gram baseline.


2 Background and Related Work

Prior work focused on identifying confusion in natural language, rather than proactively altering it to help reduce communication breakdown, as our framework proposes. For example, Batliner et al. (2003) showed that certain features of recorded speech (e.g., repetition, hyperarticulation, strong emphasis) can be used to identify communication breakdown. The authors relied primarily on prosodic properties of recorded phrases, rather than the underlying phonemes, words, or semantics, for identifying communication breakdown. On the other hand, conversational repair is a turn-level process in which conversational partners first identify and then remediate communication breakdown as part of a trouble-source repair (TSR) sequence (Sacks et al., 1974). Using this approach, Orange et al. (1996) identified differences in TSR patterns amongst people with no, early-stage, and middle-stage Alzheimer’s, highlighting the usefulness of communication breakdown detection. However, such work does not directly address the issue of proactive confusion mitigation and remediation, which the more advanced formulation of our framework aims to address through listener and task conditioning. Our focus is on the simpler formulation in this preliminary work.

Rothwell (2010) identified four types of noise that may cause confusion: physical noise (e.g., a loud highway), physiological noise (e.g., hearing impairment), psychological noise (e.g., attentiveness of listener), and semantic noise (e.g., word choice). We postulate that mitigating confusion resulting from each type of noise may be possible, at least to some extent, given sufficient context to make an informed compensatory decision. For example, given a particularly physically noisy environment, speaking loudly would seem appropriate. Unfortunately, such contextual information is often lacking from existing data sets. In particular, the physiological and psychological states of listeners are rarely recorded. Even when such information is recorded (e.g., in Alzheimer’s speech studies; Orange et al., 1996), the information is very coarse (e.g., broad Alzheimer’s categories such as none, early-stage, and middle-stage).

We leave these non-trivial data gathering challenges as future work, instead focusing on phonemic confusion, which is significantly easier to operationalize. In practice, confusion at the phoneme level may arise from any category of Rothwell noise. It may also arise from the natural similarities between phonemes (discussed next). While many of these will not be represented in the text-based phonemic transcriptions data set used in this preliminary work, our approach can be extended to include them.

Researchers in speech processing have studied the prediction of phonemic confusion but, to our knowledge, this work has not been adapted to utterance generation. Instead, tasks such as preventing sound-alike medication errors (i.e., naming medications so that no two medications sound identical) are common (Lambert, 1997). Zgank and Kacic (2012) showed that the potential confusability of a word can be estimated by calculating the Levenshtein distance (Levenshtein, 1966) of its phonemic transcription to that of all others in the vocabulary. We take inspiration from Zgank and Kacic (2012) and employ a phoneme-level Levenshtein distance approach in this work.

In the basic definition of the Levenshtein distance, all errors are equally weighted. In practice, however, words that share many similar or identical phonemes are more likely to be confused for one another. Given this, Sabourin and Fabiani (2000) developed a weighted phoneme-level Levenshtein distance, where weights are determined by a human expert or a learned model, such as a hidden Markov model. Unfortunately, while these weights are meant to represent phonemic similarity, selecting an appropriate distance metric in phoneme space is non-trivial. The classical results of Miller (1954) and Miller and Nicely (1955) group phonemes experimentally based on the noise level at which they become indiscernible. The authors identify voicing, nasality, affrication, duration, and place of articulation as sub-phoneme features that predict a phoneme’s sensitivity to distortion, and therefore measure its proximity to others. Unfortunately, later work showed that these controlled conditions do not map cleanly to the real world (Batliner et al., 2003). In addition, Wickelgren (1965) found alternative phonemic distance features that could be adapted into a distance metric.
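To make the weighted-distance idea concrete, the following is a minimal sketch of a weighted phoneme-level Levenshtein distance in the spirit of Sabourin and Fabiani (2000); the substitution weights shown are hypothetical illustrations, not values from any of the cited papers.

```python
# A sketch of a weighted phoneme-level Levenshtein distance: substitution
# costs come from a pluggable weight function, so perceptually similar
# phonemes are cheaper to confuse. Not the authors' implementation.
def weighted_levenshtein(ref, hyp, sub_cost, ins_cost=1.0, del_cost=1.0):
    """ref, hyp: sequences of phoneme symbols (e.g. ARPAbet strings)."""
    m, n = len(ref), len(hyp)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,          # delete ref[i-1]
                d[i][j - 1] + ins_cost,          # insert hyp[j-1]
                d[i - 1][j - 1]
                + (0.0 if ref[i - 1] == hyp[j - 1]
                   else sub_cost(ref[i - 1], hyp[j - 1])),
            )
    return d[m][n]

# Hypothetical weights: voiced/voiceless stop pairs are "closer" (cost 0.5).
SIMILAR = {frozenset({"P", "B"}), frozenset({"T", "D"}), frozenset({"K", "G"})}
def sub_cost(a, b):
    return 0.5 if frozenset({a, b}) in SIMILAR else 1.0

# "pat" /P AE T/ vs "bad" /B AE D/: two similar substitutions -> 1.0
print(weighted_levenshtein(["P", "AE", "T"], ["B", "AE", "D"], sub_cost))
```

With uniform weights this reduces to the standard Levenshtein distance used by Zgank and Kacic (2012).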

While this prior research sought to directly define a distance metric between phonemes based on sub-phoneme features, since no method has emerged as clearly superior, researchers now favour direct, empirical measures of confusability (Bailey and Hahn, 2005). Likewise, our work assumes that these classical feature-engineering approaches to predicting phoneme confusability can be improved upon with neural approaches, just as automatic speech recognition (ASR) systems have been improved through the use of similar methods (e.g., Seide et al., 2011; Zeyer et al., 2019; Kumar et al., 2020). In addition, these classical approaches do not account for context (i.e., other phonemes surrounding the phoneme of interest), whereas our approach conditions on such context to refine the confusion estimate.

3 Data

3.1 Data Gathering Process

To predict the phonemic confusability of utterances, we would ideally use a data set in which each utterance is annotated with the speaker’s phonemic transcription (the reference transcription), as well as the listener-perceived phonemic transcription (the hypothesis transcription). We could then compare these transcriptions to identify phonemic confusion.

To the best of our knowledge, a data set of this type does not exist. The English Consistent Confusion Corpus contains a collection of individual words spoken against a noisy background, with human listener transcriptions (Marxer et al., 2016). This is similar to our ideal data set; however, the words are spoken in isolation, and thus without any utterance context. The same issue arises in the Diagnostic Rhyme Test and its derivative data sets (Voiers et al., 1975; Greenspan et al., 1998). Other corpora, such as the BioScope Corpus (Vincze et al., 2008) and the AMI Corpus (Carletta et al., 2005), contain annotations of dialogue acts, which represent the intention of the speaker in producing each utterance (e.g., asking a question is labeled with the dialogue act elicit information). However, dialogue acts relating to confusion only appear when a listener explicitly requests clarification from the speaker. This does not provide fine-grained information regarding which phonemes caused the confusion, nor does it capture any instances of confusion in which the listener does not explicitly vocalize their confusion.

We thus create a new data set for this work (Figure 2). The Parallel Audiobook Corpus contains 121 hours of recorded speech data across 59 speakers (Ribeiro, 2018). We use four of its audiobooks: Adventures of Huckleberry Finn, Emma, Treasure Island, and The Adventures of Sherlock Holmes. Crucially, the audio recordings in this corpus are aligned with the text being read, which allows us to create aligned reference and hypothesis transcriptions. For each text-audio pair, the text simply becomes the reference transcriptions, while a transcriber converts the audio into hypothesis transcriptions. Given the preliminary nature of this work, we create a proxy data set in which we use Google Cloud’s publicly-available ASR system as a proxy for human transcribers (Cloud, 2019). We then process these transcriptions to identify phonemic confusion events (as described in Section 3.2). The final data set contains 84,253 parallel transcriptions. We split these into 63,189 training, 10,532 validation, and 10,532 test transcriptions (a 75%-12.5%-12.5% split). The average reference and hypothesis transcription lengths are 65.2 and 62.3 phonemes, respectively. The transcription error rate (i.e., the proportion of phonemes that are mis-transcribed) is only 8%, so there is significant imbalance in the data set.

Figure 2: We create a new data set with parallel reference and hypothesis transcriptions from audiobook data with parallel text and audio recordings. The text simply becomes the reference transcriptions. A transcriber converts the audio recordings into hypothesis transcriptions. In this preliminary work, we use an ASR system as a proxy for human transcribers.

For the purposes of this preliminary work, the Google Cloud ASR system (Cloud, 2019) is an acceptable proxy for human transcription ability under the reasonable assumption that, for any particular transcriber, the distribution of error rates across different phoneme sequences is nonuniform (i.e., within-transcriber variation is present). This assumption holds in all practical cases, and is reasonable since the confusion-mitigation framework we propose can be conditioned on different transcribers to control for inter-transcriber variation as future work.

3.2 Transcription Error Labeling

We post-process our aligned reference-hypothesis transcription data set in two steps. First, each transcription must be converted from the word level to the phoneme level. For this, we use the CMU Pronouncing Dictionary (Weide, 1998), which is based on the ARPAbet symbol set. For any words with multiple phonemic conversions, we simply default to the first conversion returned by the API.

Figure 3: Illustration of our transcription error labeling process (using letters instead of phonemes for readability). Given aligned reference (x) and hypothesis (y) vectors, we use the Levenshtein algorithm to ensure they have the same length. Because y is not available at test time, we then “collapse” consecutive insertion tokens to force the vectors to have the original length of x. Finally, we replace y with the binary vector d, which has 1’s wherever x and y don’t match.

Second, we label each resulting phoneme in each reference transcription as either correctly or incorrectly transcribed. This is nontrivial, because the numbers of phonemes in the reference and hypothesis transcriptions are rarely equal, thus requiring phoneme-level alignment. For this purpose, we use a variant of the phoneme-level Levenshtein distance that returns the actual alignment, rather than the final distance score (Figure 3).

Formally, let x ∈ K^a be a vector of reference phonemes and y ∈ K^b be a vector of hypothesis phonemes from the data set. K refers to the set {1, 2, 3, . . . , k, <INS>, <DEL>, <SOS>, <EOS>}, where k is the number of unique phonemes in the language being considered (e.g., in English, k ≈ 40, depending on the dialect). In general, a ≠ b, but we can manipulate the vectors by incorporating insertion, deletion, and substitution tokens (as done in the Levenshtein distance algorithm). In general, this yields two vectors of the same length, x̄, ȳ ∈ K^c, where c = max(a, b). While this manipulation can be performed at training time because y and b are known, such information is unavailable at test time. Therefore, we modify the alignment at training time to ensure x̄ ≡ x and c ≡ a. To achieve this, we “collapse” consecutive insertion tokens into a single instance of the insertion token, which ensures that |ȳ| = a.

Additionally, we assume that each hypothesis phoneme, yi ∈ y, is conditionally independent of the others. That is, P(yi = xi | x, y≠i) = P(yi = xi | x).¹ We hypothesize that this assumption, similar to the conditional independence assumption of Naïve Bayes (Zhang, 2004), will still yield directionally-correct results, while drastically increasing the tractability of the computation.

This assumption also allows us to simplify the output space of the problem. Specifically, since we only care to predict P(y ≠ x), with this assumption we now only need to consider, for each i, whether yi = xi, rather than dealing with the much harder problem of predicting the exact value of yi. To achieve this, we use an element-wise Kronecker delta function to replace y with a binary vector, d, such that di ← (yi ≠ xi). Thus, the binary vector d records the position of each transcription error, that is, the position of each phoneme in x that was confused.
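The labeling procedure of Figure 3 (align, collapse, then emit the binary vector d) can be sketched as follows. This is a simplified reading, not the authors' code; in particular, folding an inserted phoneme onto the adjacent reference position is our assumption about how the "collapse" step behaves.

```python
# A simplified sketch of the error-labeling step: align reference x and
# hypothesis y by Levenshtein backtracking, fold insertions onto a
# neighboring reference position (our reading of the paper's "collapse"),
# and emit a binary error vector d with len(d) == len(x).
def label_errors(x, y):
    m, n = len(x), len(y)
    # DP table of edit distances.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    # Backtrace, marking each reference position that was substituted,
    # deleted, or adjacent to an insertion.
    d = [0] * m
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and D[i][j] == D[i - 1][j - 1] + (x[i - 1] != y[j - 1])):
            d[i - 1] |= int(x[i - 1] != y[j - 1])  # match / substitution
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            d[i - 1] = 1                           # deletion of x[i-1]
            i -= 1
        else:
            d[max(i - 1, 0)] = 1                   # insertion: collapse onto x
            j -= 1
    return d

# "CAT" transcribed as "CUT" -> error at the vowel only.
print(label_errors(list("CAT"), list("CUT")))  # -> [0, 1, 0]
```

In the real pipeline the symbols would be ARPAbet phonemes rather than letters, exactly as in the readability simplification of Figure 3.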

With the x’s as inputs and the d’s as ground-truth labels, we can train models to predict P(di | x) for each i. As a post-processing step, we can then combine these individual probabilities to estimate the utterance-level probability of phonemic confusion, P(y ≠ x), which is the output of the central confusion prediction component in Figure 1.
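Under the conditional-independence assumption, one natural way to combine the per-phoneme estimates into the utterance-level probability is the complement of "no errors anywhere"; the paper does not commit to a specific combination formula, so this particular form is our assumption.

```python
# Combine per-phoneme error probabilities P(d_i | x) into an
# utterance-level confusion probability under independence:
#   P(y != x) = 1 - prod_i (1 - P(d_i | x))
# Illustrative only; the paper leaves the combination rule unspecified.
import math

def utterance_confusion(per_phoneme_probs):
    return 1.0 - math.prod(1.0 - p for p in per_phoneme_probs)

print(round(utterance_confusion([0.1, 0.2, 0.0]), 3))  # -> 0.28
```

Note that this estimate grows with utterance length, which is consistent with longer utterances offering more opportunities for confusion.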

This formulation is general in the sense that any xi can affect the predicted probability of any di. In practice, however, and especially for long utterances, this is overly conservative, as only nearby phonemes are likely to have a significant effect. In Section 4, we describe any additional conditional independence assumptions that each architecture makes to further simplify its probability estimate.

4 Model Architectures and Baseline

With recent advances, various neural architectures have been applied to NLP tasks. Early work includes n-gram-based, fully-connected architectures for language modeling tasks (Bengio et al., 2003; Mikolov et al., 2013). Recurrent neural network (RNN) architectures were then shown to be successful for applications such as language modeling, speech recognition, and phoneme recognition (Graves and Schmidhuber, 2005; Mikolov et al., 2011). RNN architectures such as the LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2015) variants have been successful in many NLP applications, such as machine translation and phoneme classification (Sundermeyer et al., 2012; Graves et al., 2013; Graves and Schmidhuber, 2005). Recently, the transformer architecture (Vaswani et al., 2017), which uses attention instead of recurrence to form dependencies between inputs, has shown state-of-the-art results in many areas of NLP, including syllable-based tasks (e.g., Zhou et al., 2018).

¹y≠i is every element in y except the one at position i.

In this work, we propose a fixed-context-window architecture and a residual bi-LSTM architecture for the central component of our confusion-mitigation framework. While similar architectures have already been applied to phoneme-based applications, such as phoneme recognition and classification (Graves and Schmidhuber, 2005; Weninger et al., 2015; Graves et al., 2013; Li et al., 2017), to our knowledge, our study is the first to apply these architectures to identify phonemes related to confusion for listeners. In our opinion, these architectures strike an acceptable balance between compute and capability for the current work, unlike the more advanced transformer architectures, which require significantly more resources to train.2

Since the data set is imbalanced (see Section 3.1), early experiments showed that, without sample weighting, both architectures never identified any phonemes as likely to be mis-transcribed (i.e., high specificity, low sensitivity). Accordingly, since the imbalance ratio is approximately 1:10, transcription errors are given 10 times more weight than properly transcribed phonemes in our binary cross-entropy loss function.
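The 1:10 weighting can be written as a weighted binary cross-entropy. The following pure-Python sketch (ours, not the released training code; averaging over the utterance is our assumption) makes the weighting explicit:

```python
import math

def weighted_bce(d, p, pos_weight=10.0):
    """1:10 weighted binary cross-entropy, averaged over the utterance.

    d: ground-truth error indicators (1 = transcription error).
    p: predicted error probabilities.
    Errors (d_i = 1) are weighted pos_weight times more heavily than
    properly transcribed phonemes (d_i = 0).
    """
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for di, pi in zip(d, p):
        pi = min(max(pi, eps), 1 - eps)
        w = pos_weight if di == 1 else 1.0
        total += -w * (di * math.log(pi) + (1 - di) * math.log(1 - pi))
    return total / len(d)
```

At p = 0.5 the loss for a missed error is ten times the loss for a false alarm, which counteracts the roughly 1:10 class imbalance.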

4.1 Fixed-Context Network

The fixed-context network takes as input the current phoneme, x_i, and the 4 phonemes before and after it as a fixed window of context (Figure 4a). This results in the additional conditional independence assumption that P(d_i | x) = P(d_i | x_{i−4:i+4}). That is, only phonemes within the fixed context window of size 4 can affect the predicted probability of d_i.

These 9 phonemes are first embedded in a 15-dimensional embedding space. The embedding layer is followed by a sequence of seven fully-connected hidden layers with 512, 256, 256, 128, 128, 64, and 64 neurons, respectively. Each layer is separated by Rectified Linear Unit (ReLU) non-linearities (Nair and Hinton, 2010; He et al., 2016). Finally, an output with a sigmoid activation function predicts the probability of a transcription error. We train with minibatches of size 32, using

2Link to code: https://github.com/francois-rd/phonemic-confusion

the Adam optimizer with parameters α = 0.001, β_1 = 0.9, and β_2 = 0.999 (Kingma and Ba, 2014) to optimize a 1:10 weighted binary cross-entropy loss function. We explored alternative parameter settings, in particular a larger number of neurons, but found this architecture to be the most stable and highest-performing of all variants tested, given the nature and relatively small size of the data set.
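The window x_{i−4:i+4} fed to the network can be sketched as follows. The padding symbol at utterance edges is our assumption; the paper does not specify how edges are handled.

```python
def context_window(x, i, k=4, pad="<pad>"):
    """Return the 2k+1 phonemes x[i-k : i+k+1] centered on x[i],
    padding with a sentinel symbol at utterance edges."""
    padded = [pad] * k + list(x) + [pad] * k
    return padded[i : i + 2 * k + 1]  # position i in x is i+k in padded

x = ["DH", "AH", "K", "AE", "T"]
window = context_window(x, 0)  # 4 pads, then all 5 phonemes
```

The center element window[4] is always x_i itself, so the network sees the current phoneme plus 4 phonemes of left and right context.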

4.2 LSTM Network

The LSTM network receives the entire reference transcription, x, as input and predicts the entire binarized hypothesis transcription, d, as output (Figure 4b). Since the LSTM is bidirectional, we do not introduce any additional conditional independence assumptions. Each input phoneme is passed through an embedding layer of dimension 42 (equal to |K|), followed by a bidirectional LSTM layer and two residual linear blocks with ReLU activations (He et al., 2016). An output residual linear block with a sigmoid activation predicts the probability of a transcription error. These skip connections are added since residual layers tend to outperform simpler alternatives (He et al., 2016). Passing the embedded input via skip connections ensures that the original input is accessible at all depths of the network, and also helps mitigate any vanishing gradients that may arise in the LSTM.

We use the following output dimensions for each layer: 50 for the LSTM hidden and cell states, 40 for the first residual linear block, and 10 for the second. We train with minibatches of size 256, using the Adam optimizer with parameters α = 0.00005, β_1 = 0.9, and β_2 = 0.999 (Kingma and Ba, 2014) to optimize a 1:10 weighted binary cross-entropy loss function.

4.3 Weighted n-Gram Baseline

We compare our neural models to a weighted n-gram baseline model. That is, d_i depends only on the n previous phonemes in x (an order-n Markov assumption). Formally, we make the conditional independence assumption that P(d_i | x) = P(d_i | x_{i−n+1:i}). Extending this baseline model to include future phonemes would violate the order-n Markov assumption that is standard in n-gram approaches. In this preliminary work, we opt to keep the baseline as standard as possible.

(a) The fixed-context network uses a fixed window of context of size 4. These 9 phonemes are embedded using a shared embedding layer, concatenated, and then passed through 7 linear layers with ReLU activations, followed by an output layer with a sigmoid activation.

(b) Unrolled architecture of the LSTM network. The architecture consists of one bidirectional LSTM layer, two residual linear blocks with ReLU activations, and an output residual linear block with a sigmoid activation. Additional skip connections are added throughout.

Figure 4: Architectural variants of the confusion prediction component of our confusion-mitigation framework.

A weighted n-gram model is computed using an algorithm similar to the standard maximum likelihood estimation (MLE) n-gram counting algorithm, but with the introduction of a weighting scheme to deal with the class imbalance issue. The weighting is necessary for a fair comparison to the weighted loss function used in the neural network models. This approach generalizes the standard MLE n-gram counting algorithm, which implicitly uses a weight of 1.

Formally, let W > 0 be the selected weight, and define c_i ≡ x_{i−n+1:i} to simplify the notation. Also, let C(d_i | c_i) be the count of all incorrect phoneme transcriptions in the context c_i in the entire data set, and similarly, C(1 − d_i | c_i) for correct transcriptions.3 The weighted n-gram is then computed as follows:

P(d_i | c_i) = W × C(d_i | c_i) / (C(1 − d_i | c_i) + W × C(d_i | c_i))

Empirically, we find that a weighted 3-gram model works best; larger contexts are too sparse given the size of the data set, and smaller contexts lack expressive capacity. We do not use any n-gram smoothing methods. Instead, any missing contexts encountered at test time are simply marked as incorrect predictions. For this particular data set, such missing contexts are vanishingly rare (occurring only 0.003% of the time), which justifies our approach.

3 We slightly abuse the notation here. Recall that d_i ← (y_i ≠ x_i), so we notate y_i = x_i as 1 − d_i.
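The weighted counting scheme above can be sketched directly from the formula. This is our illustration, not the authors' implementation; in particular, left-padding contexts at utterance edges with a sentinel symbol is our assumption.

```python
from collections import Counter

def train_weighted_ngram(corpus, n=3, W=10.0):
    """Weighted n-gram model.

    For each context c_i = x_{i-n+1:i} (the current phoneme plus the
    n-1 preceding ones), count error (d_i = 1) and correct (d_i = 0)
    transcriptions, then estimate
        P(d_i | c_i) = W*C(d|c) / (C(1-d|c) + W*C(d|c)).
    corpus is a list of (x, d) pairs of equal-length sequences.
    """
    err, cor = Counter(), Counter()
    pad = "<s>"  # assumed sentinel for utterance-initial contexts
    for x, d in corpus:
        xs = [pad] * (n - 1) + list(x)
        for i, di in enumerate(d):
            c = tuple(xs[i : i + n])  # x_{i-n+1:i}, left-padded at edges
            (err if di else cor)[c] += 1
    return {c: W * err[c] / (cor[c] + W * err[c])
            for c in set(err) | set(cor)}

# Toy corpus: one context seen once as an error and once as correct,
# so P = (10*1) / (1 + 10*1) = 10/11.
corpus = [(["B"], [1]), (["B"], [0])]
model = train_weighted_ngram(corpus)
```

With W = 1 this reduces to the standard MLE estimate C(d|c) / (C(d|c) + C(1−d|c)), as the paper notes.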

5 Results and Discussion

5.1 Quantitative Analysis

We report receiver operating characteristic (ROC) curves for all models (Figure 5). To facilitate fair comparison, all models are trained with the same random ordering of training data in each epoch. Both neural network architectures outperform the weighted n-gram baseline by a small margin, with the fixed-context network appearing to perform slightly better overall. While no individual model exhibits any significant performance gain over the others, all models perform significantly better than random chance. This shows the promise of our framework, which is precisely the objective of this work. We next speculate as to the causes of the slight gaps that are observed.
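An ROC curve of the kind in Figure 5 is traced by sweeping a decision threshold over the predicted error probabilities. A minimal sketch of that computation (our illustration, not the paper's evaluation code):

```python
def roc_points(d, p, thresholds):
    """True/false positive rates over a sweep of decision thresholds.

    d: ground-truth error indicators, p: predicted error probabilities.
    A phoneme is flagged as an error when its probability >= threshold.
    Returns a list of (FPR%, TPR%) points.
    """
    P = sum(d)          # actual transcription errors (positives)
    N = len(d) - P      # properly transcribed phonemes (negatives)
    points = []
    for t in thresholds:
        tp = sum(1 for di, pi in zip(d, p) if pi >= t and di == 1)
        fp = sum(1 for di, pi in zip(d, p) if pi >= t and di == 0)
        points.append((100.0 * fp / N, 100.0 * tp / P))
    return points
```

A random classifier traces the diagonal TPR = FPR; the margin above that diagonal is what separates the models from chance in Figure 5.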

The neural network models likely outperform the weighted n-gram baseline for multiple reasons. First, both neural network models condition on a context that includes both past and future phonemes (i.e., bidirectional), whereas the baseline only conditions on past phonemes (i.e., unidirectional). Utilizing future phonemes as context is useful, since both humans and most state-of-the-art ASR systems use this information to revise their predictions. Second, the neural networks can learn sub-contextual patterns that the baseline cannot. For example, the contexts A B C and A B D have the sub-context A B in common. Whereas the weighted n-gram treats these as completely dif-


Ground Truth Phrase                                 Transcription of Audio Recording
... for they say every body is in love once ...     ... for they say everybody is in love once ...
... his grave looks shewed that she was not ...     ... his grave look showed that she was not ...
... shall use the carriage to night ...             ... shall use the carriage tonight ...
... making him understand I warn't dead ...         ... making him understand I warrant dead ...
... shore at that place so we warn't afraid ...     ... sure at that place so we weren't afraid ...
... read Elton's letter as I was shewn in ...       ... read Elton's letters I was shown in ...
... sacrifice my poor hair to night and ...         ... sacrifice my poor head tonight and ...
... we warn't feeling just right ...                ... we weren't feeling just right ...
... that there was no want of taste ...             ... that there was no on toothpaste ...
... knew that Arthur had discovered ...             ... knew was it also have discovered ...

Table 1: Randomly selected phrases from amongst the top 100 phonemes predicted to be incorrectly transcribed by the fixed-context model (transcription error probability > 0.999). Bold text denotes ASR transcription errors.

[Figure 5 plots true positive rate (%) against false positive rate (%) for the Fixed-Context, LSTM, Baseline, and Random models.]

Figure 5: ROC curves for our model variants.

ferent contexts, the neural networks may be able to exploit the similarity between them. This kind of parameter sharing is more data efficient, which can lead to lower-variance estimates (less overfitting) in the small data set setting we are considering.

The simpler fixed-context network slightly outperforms the more complex LSTM alternative. While RNN architectures have been shown to outperform feed-forward networks in language processing tasks (Sundermeyer et al., 2012), other research has shown that simpler architectures are still able to process phonemic data effectively (Ba and Caruana, 2014). The lack of an additional conditional independence assumption for the LSTM model may have resulted in worse data efficiency, since the model needs to expend parameters on all reference phonemes, even those very far away that may have little impact on the current one. In addition, the smaller number of parameters to estimate may have led to lower variance in the fixed-context model. Given this, our avoidance of more advanced or deeper models, such as transformers, seems justified for this preliminary work. We hypothesize that such models could outperform all the models considered here given a significantly larger data set.

5.2 Qualitative Analysis

5.2.1 Description

We perform qualitative error analysis on randomly selected phonemes from amongst those that are most (Table 1) and least (Table 2) likely to contain transcription errors according to the fixed-context model. This offers some qualitative insights regarding phonemic confusion. We sample from the fixed-context model due to its slightly superior performance, and show small phrases centered around the phoneme most (least) likely to cause confusion, rather than full transcriptions, for clarity.

In addition, to improve readability, we show words rather than the underlying phonemes. As a result, some of the errors appear to be orthographic in nature even if they are not. For example, "every body" becomes "everybody" in the first example of Table 1. However, the phonemes that constitute "every body" and "everybody" are indeed different: "EH V ER IY B AA D IY" versus "EH V R IY B AA D IY". As per our definitions in Section 3.2, these cases do represent transcription errors. However, it may be argued that such errors introduce unwanted noise in the data set, which we hope to correct in future work.

5.2.2 Analysis

First, we note that every sample in Table 1 does indeed have a transcription error, while few sam-


Ground Truth Phrase                                   Transcription of Audio Recording
... the exquisite feelings of delight and ...         ... the exquisite feelings of delight and ...
... gone Mister Knightley called ...                  ... gone Mister Knightley called ...
... has been exceptionally ...                        ... has been exceptionally ...
... not afraid of your seeing ...                     ... not afraid if you're saying ...
... the sale of Randalls was long ...                 ... the sale of Randalls was long ...
... her very kind reception of himself ...            ... her very kind reception to himself ...
... for the purpose of preparatory inspection ...     ... for the purpose of preparatory inspection ...
... you would not be happy until you ...              ... you would not be happy until you ...
... with the exception of this little blot ...        ... with the exception of this little blot ...
... night we were in a great bustle getting ...       ... night we were in a great bustle getting ...

Table 2: Randomly selected phrases from amongst the top 100 phonemes predicted to be correctly transcribed by the fixed-context model (transcription error probability < 0.03). Bold text denotes ASR transcription errors.

ples have errors in Table 2. It therefore seems as though, when the fixed-context model is very certain about the presence or absence of errors, it is usually correct.

Second, many of the transcription errors in Table 1 are seemingly caused by the archaic or idiosyncratic writing present in the books used to create the data set. While this can be seen as a source of unwanted noise (we used an ASR system trained on standard modern English), we argue that, as per Rothwell's model of communication (Section 2), familiarity with the vocabulary is, in fact, a very legitimate source of semantic noise. Indeed, phrases using more modern and standard vernacular are seemingly less likely to be confusing, according to the fixed-context model.

Third, many of the errors not related to archaism involve stop words, homonyms, or near-homophones, which intuitively makes sense. Additionally, hard consonant sounds between words (and stress at the beginning rather than at the end of words) appear more common in the set of correctly transcribed phrases as compared to the set of incorrectly transcribed ones. These findings suggest the fixed-context model has picked up on some underlying patterns governing phonemic confusion, which is promising for our confusion-mitigation framework as a whole.

5.3 Future Work

This work uses a relatively small data set. Creating and using a significantly larger corpus using human subjects rather than an ASR proxy would likely yield more directly relevant results. We postulate that, with a larger and higher-quality data set, a deeper and more advanced neural network architecture, such as the transformer, may produce stronger results. Future work can also investigate the differences in human phonemic confusability on 'natural' versus semantically-unpredictable sentences.

A major aspect of our confusion-mitigation framework, which we have not explored in this work, is the generation of alternative, clearer utterances that retain the initial meaning. Constructively enumerating these alternatives is non-trivial, as is identifying the neighbourhood beyond which their meaning differs too significantly from the original. Conditioning on a specific listener's priors as an additional mechanism to reduce communication breakdown is another major aspect we leave to future work.

Perhaps most significantly, we have limited the scope of our confusion assessment drastically in this preliminary work, primarily to simplify the data-gathering process. While our results are promising, communication breakdown is a nuanced and multi-faceted phenomenon of which phonemic confusion is but one small component. Modeling these larger and more complex processes remains an important open challenge.

6 Conclusion

Reducing communication breakdown is critical to successful interaction in dialogue systems and other generative NLP systems. In this work, we proposed a novel confusion-mitigation framework that such systems could employ to help minimize the probability of human confusion during an interaction. As a first step towards implementing this framework, we evaluated two potential neural architectures (a fixed-context network and an LSTM network) for its central component, which predicts the confusion probability of a candidate utterance. These neural architectures outperformed a weighted n-gram baseline (with the fixed-context network performing best overall) when trained using a proxy data set derived from audiobook recordings. In addition, qualitative analyses suggest that the fixed-context model has uncovered some of the more intuitive causes of phonemic confusion, including stop words, homonyms, near-homophones, and familiarity with the vocabulary. These preliminary results show the promise of our confusion-mitigation framework. Given this early success, further investigation and refinement is warranted.

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/partners).

References

Jimmy Ba and Rich Caruana. 2014. Do Deep Nets Really Need to be Deep? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2654–2662. Curran Associates, Inc.

Todd M. Bailey and Ulrike Hahn. 2005. Phoneme similarity and confusability. Journal of Memory and Language, 52(3):339–362.

Anton Batliner, Kerstin Fischer, Richard Huber, Jorg Spilker, and Elmar Noth. 2003. How to find trouble in communication. Speech Communication, 40(1-2):117–143.

Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al. 2005. The AMI Meeting Corpus: A Pre-Announcement. In International Workshop on Machine Learning for Multimodal Interaction, pages 28–39. Springer.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. In International Conference on Machine Learning, pages 2067–2075.

Google Cloud. 2019. Speech-to-Text Client Libraries.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.

Alex Graves and Jurgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Steven L. Greenspan, Raymond W. Bennett, and Ann K. Syrdal. 1998. An evaluation of the diagnostic rhyme test. International Journal of Speech Technology, 2(3):201–214.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kshitiz Kumar, Chaojun Liu, Yifan Gong, and Jian Wu. 2020. 1-D Row-Convolution LSTM: Fast Streaming ASR at Accuracy Parity with LC-BLSTM. Proc. Interspeech 2020, pages 2107–2111.

Bruce L. Lambert. 1997. Predicting look-alike and sound-alike medication errors. American Journal of Health-System Pharmacy, 54(10):1161–1171.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Kun Li, Xiaojun Qian, and Helen Meng. 2017. Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1):193–207.

Ricard Marxer, Jon Barker, Martin Cooke, and Maria Luisa Garcia Lecumberri. 2016. A corpus of noise-induced word misperceptions for English. The Journal of the Acoustical Society of America, 140(5):EL458–EL463.

Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531. IEEE.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

George A. Miller. 1954. An Analysis of the Confusion among English Consonants Heard in the Presence of Random Noise. Journal of the Acoustical Society of America, 26.

George A. Miller and Patricia E. Nicely. 1955. An analysis of perceptual confusions among some English consonants. The Journal of the Acoustical Society of America, 27(2):338–352.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML.

John B. Orange, Rosemary B. Lubinski, and D. Jeffery Higginbotham. 1996. Conversational Repair by Individuals with Dementia of the Alzheimer's Type. Journal of Speech, Language, and Hearing Research, 39(4):881–895.

Manuel Sam Ribeiro. 2018. Parallel Audiobook Corpus. University of Edinburgh School of Informatics.

J. Dan Rothwell. 2010. In the Company of Others: An Introduction to Communication. New York: Oxford University Press.

Michael Sabourin and Marc Fabiani. 2000. Predicting auditory confusions using a weighted Levinstein distance. US Patent 6,073,099.

Harvey Sacks, Emanuel Schegloff, and Gail Jefferson. 1974. A Simple Systematic for the Organisation of Turn-Taking in Conversation. Language, 50:696–735.

Frank Seide, Gang Li, and Dong Yu. 2011. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. In Twelfth Annual Conference of the International Speech Communication Association.

Martin Sundermeyer, Ralf Schluter, and Hermann Ney. 2012. LSTM Neural Networks for Language Modeling. In Thirteenth Annual Conference of the International Speech Communication Association.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Veronika Vincze, Gyorgy Szarvas, Richard Farkas, Gyorgy Mora, and Janos Csirik. 2008. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics, 9(11):S9.

William D. Voiers, Alan D. Sharpley, and Carl J. Hehmsoth. 1975. Research on Diagnostic Evaluation of Speech Intelligibility. Research Report AFCRL-72-0694, Air Force Cambridge Research Laboratories, Bedford, Massachusetts.

Robert L. Weide. 1998. The CMU pronouncing dictionary. The Speech Group.

Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Bjorn Schuller. 2015. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pages 91–99. Springer.

Wayne A. Wickelgren. 1965. Acoustic similarity and intrusion errors in short-term memory. Journal of Experimental Psychology, 70(1):102.

Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schluter, and Hermann Ney. 2019. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 8–15. IEEE.

Andrej Zgank and Zdravko Kacic. 2012. Predicting the Acoustic Confusability between Words for a Speech Recognition System using Levenshtein Distance. Elektronika ir Elektrotechnika, 18(8):81–84.

Harry Zhang. 2004. The optimality of naive Bayes. American Association for Artificial Intelligence, 1(2):3.

Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. 2018. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese. arXiv preprint arXiv:1804.10752.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 11–22, August 5, 2021. ©2021 Association for Computational Linguistics

Recursive prosody is not finite-state

Hossep Dolatian
Department of Linguistics
Stony Brook University
Stony Brook, NY, USA
[email protected]

Aniello De Santo
Department of Linguistics
University of Utah
Salt Lake City, Utah, USA
[email protected]

Thomas Graf
Department of Linguistics
Stony Brook University
Stony Brook, NY, USA
[email protected]

Abstract

This paper investigates bounds on the generative capacity of prosodic processes, by focusing on the complexity of recursive prosody in coordination contexts in English (Wagner, 2010). Although all phonological processes and most prosodic processes are computationally regular string languages, we show that recursive prosody is not. The output string language is instead parallel multiple context-free (Seki et al., 1991). We evaluate the complexity of the pattern over strings, and then move on to a characterization over trees that requires the expressivity of multi bottom-up tree transducers. In doing so, we provide a foundation for future mathematically grounded investigations of the syntax-prosody interface.

1 Introduction

At the level of words, all attested processes in phonology form regular string languages and can be generated via finite-state acceptors (FSAs) and transducers (FSTs) (Johnson, 1972; Kaplan and Kay, 1994; Heinz, 2018). However, not much attention has been given to the generative capacity of prosodic processes at the phrasal or sentential level (but see Yu, 2019). The little work that exists in this respect has shown that many attested intonational processes are finite-state and regular (Pierrehumbert, 1980). It is thus a common hypothesis in the literature that the cross-linguistic typology of prosodic phonology should also be regular.

In this paper, we falsify this hypothesis by providing a mathematically grounded characterization of a pattern of recursive prosody in English coordination, as empirically documented by Wagner (2010). Specifically, we show that when converting a syntactic representation into a prosodic representation, the string language that is generated by this prosodic process is neither a regular nor a context-free language, and thus cannot be generated by string-based FSAs. As a tree-to-tree function, the pattern can be captured by a class of

bottom-up tree transducers whose outputs correspond to parallel multiple context-free string languages.

This paper is organized as follows. In §2, we provide a literature review of phonology and prosodic phonology, with emphasis on the general tendency for regular computation. In §3, we describe the recursive prosody of coordination structures, and why it cannot be generated with an FST over string inputs. In §4, we show how a multi bottom-up tree transducer can generate the prosodic patterns. We discuss our results in §5, and conclude in §6.

2 Computation of prosody

Within computational prosody, there are two strands of work. One focuses on the generation of prosodic structure at or below the word level. The other operates above the word level.

At the word level, there is a plethora of work on generating prosodic constituents, all of which require finite-state or regular computation, whether for syllables (Kiraz and Mobius, 1998; Yap, 2006; Hulden, 2006; Idsardi, 2009), feet (van Oostendorp, 1993; Idsardi, 2009; Yu, 2017), or prosodic words (Coleman, 1995; Chew, 2003).1 In fact, most word-level prosody seems to require at most subregular computation (Strother-Garcia, 2018, 2019; Hao, 2020; Dolatian, 2020; Dolatian et al., 2021; Koser, in prep).

However, there is a dearth of formal results for phrasal or intonational prosody. Early work in generative phonology treated the prosodic representations as directly generated from the syntax, with any deviations caused by readjustment rules (Chomsky and Halle, 1968). Notoriously, syntactic representations are at least context-free (Chomsky, 1956; Chomsky and Schutzenberger, 1959). Because sentential prosody interacts with the syntactic level in non-trivial ways, it might seem sensible to assume that 1) the transformation from syntax to prosody is not finite-state definable (= definable with finite-state transducers), and that 2) the string language of prosodic representations is a supra-regular language, not a regular language. Importantly though, this assumption is not trivially true. In fact, early work has shown that even if syntax is context-free, the corresponding prosodic structures can be a regular string language. For instance, Reich (1969) argued that the prosodic structures in SPE can be generated via finite-state devices (see also Langendoen, 1975), while Pierrehumbert (1980) modeled English intonation using a simple finite-state acceptor.

1 For syllables and feet, there is a large literature of formalization within Declarative Phonology (Scobbie et al., 1996). This work tends to employ formal representations that are similar to context-free grammars (Klein, 1991; Walther, 1993, 1995; Dirksen, 1993; Coleman, 1991, 1992, 1993, 1996, 2000, 1998; Coleman and Pierrehumbert, 1997; Chew, 2003). But these representations can be restricted enough to be equivalent to regular languages (see earlier such restrictions in Church, 1983).

When analyzed over string languages, this mismatch between supra-regular syntax and regular prosody was not explored much in the subsequent literature. In fact, it seems that current research on computational prosody uses the premise that prosodic structures are at most regular (Gibbon, 2001). Crucially, this premise is confounded by the general lack of explicit mathematical formalizations of prosodic systems. For example, there are algorithms for Dutch intonation that capture surface intonational contours and other acoustic cues (t'Hart and Cohen, 1973; t'Hart and Collier, 1975). These algorithms, however, do not themselves provide sufficient mathematical detail to show that the prosodic phenomenon in question is a regular string language. Instead, one has to deduce that Dutch intonation is regular because the algorithm does not utilize counting or unbounded look-ahead (t'Hart et al., 2006, pg. 114).

As a reflection of this mismatch, early work in prosodic phonology assumed something known as the strict layer hypothesis (SLH; Nespor and Vogel, 1986; Selkirk, 1986). The SLH assumed that prosodic trees cannot be recursive — i.e., a prosodic phrase cannot dominate another prosodic phrase — thus ensuring that a prosodic tree will have fixed depth. Subsequent work in prosodic phonology weakened the SLH: prosodic recursion at the phrase or sentence level is now accepted as empirically robust (Ladd 1986, 2008, ch. 8; Selkirk 2011; Ito and Mester 2012, 2013). But empirically, it is difficult to find cases of unbounded prosodic recursion (Van der Hulst, 2010). Consider a language that uses only bounded prosodic recursion — e.g., there can be at most two recursive levels of prosodic phrases. The prosodic tree will have fixed depth, and the computation of the corresponding string language is regular. It is then possible to create a computational network that uses a supra-regular grammar for the syntax which interacts with a finite-state grammar for the prosody (Yu and Stabler, 2017; Yu, 2019). To summarize, it seems that the implicit consensus in computational prosody is that 1) syntax can be supra-regular, but the corresponding prosody is regular; and 2) prosodic recursion is bounded.

However, as we elaborate in the next section, coordination data from Wagner (2005) is a case where syntactic recursion generates potentially unbounded-recursive prosodic structure. The rest of the paper is then dedicated to exploring the consequences of this construction for the expressivity of sentential prosody.

3 Prosodic recursion in coordination

To our knowledge, Wagner (2005, 2010) is the clearest case where syntactic recursion gets mapped to recursive prosody, such that the recursion is unboundedly deep for the prosody. In this section, we go over the data and generalizations (§3.1), we sketch Wagner's cyclic analysis (§3.2), and we discuss issues with finiteness (§3.3). Finally, we show that this construction does not correspond to a regular string language (§3.4).

3.1 Unbounded recursive prosody

Wagner documents unbounded prosodic recursion in the coordination of nouns, in contrast to earlier results which reported flat non-recursive prosody (Langendoen, 1987, 1998). Based on experimental and acoustic studies, Wagner reports that recursive coordination creates recursively strong prosodic boundaries. Syntactic edges have a prosodic strength that incrementally depends on their distance from the bottom-most constituents.

When three items are coordinated with two non-identical operators, then two syntactic parses are possible. Each syntactic parse has an analogous prosodic parse. The prosodic parse is based on the relative strength of a prosodic boundary, with | being weaker than ||. The boundary is placed before the operator.

Table 1: Prosody of three items with non-identical operators

    Syntactic grouping    Prosodic grouping
    [A and [B or C]]      A || and B | or C
    [[A and B] or C]      A | and B || or C

When the two operators are identical, then three syntactic and prosodic parses are possible. The difference between the parses is determined by semantic associativity. For example, a sentence like I saw [[A and B] and C] means that I saw A and B together, and I saw C separately.

Table 2: Prosody of three items with identical operators

    Syntactic grouping    Prosodic grouping
    [A and [B and C]]     A || and B | and C
    [[A and B] and C]     A | and B || and C
    [A and B and C]       A | and B | and C

When four items are coordinated, then at most 11 parses are possible. The maximum is reached when the three operators are identical. We can have three levels of prosodic boundaries, ranging from the weakest | to the strongest |||.

Table 3: Prosody of four items with identical operators

    Syntactic grouping           Prosodic grouping
    [A and B and C and D]        A | and B | and C | and D
    [A and B and [C and D]]      A || and B || and C | and D
    [A and [B and C] and D]      A || and B | and C || and D
    [[A and B] and C and D]      A | and B || and C || and D
    [A and [B and C and D]]      A || and B | and C | and D
    [[A and B and C] and D]      A | and B | and C || and D
    [[A and B] and [C and D]]    A | and B || and C | and D
    [A and [B and [C and D]]]    A ||| and B || and C | and D
    [A and [[B and C] and D]]    A ||| and B | and C || and D
    [[A and [B and C]] and D]    A || and B | and C ||| and D
    [[[A and B] and C] and D]    A | and B || and C ||| and D

We can extract the following generalizations from the data above. First, the depth of a constituent directly affects the prosodic strength of its edges. At a syntactic edge, the strength of the prosodic boundary depends on the distance between that edge and the most embedded element: for instance, in (1a) the left-bracket between A-B is mapped to a prosodic boundary of strength three |||, because A is above two layers of coordination. The deepest constituent C-D gets the weakest boundary |. Second, when there is associativity, the prosodic strength percolates to other positions within this associative span. For example, in (1b) the boundary of strength || is percolated to A-B from B-C.

1. Generalizations on coordination

   (a) Strength is long-distantly calculated
       [A and [B and [C and D]]] is mapped to A ||| and B || and C | and D

   (b) Strength percolates when associative
       [A and B and [C and D]] is mapped to A || and B || and C | and D

3.2 Wagner’s cyclic analysis

In order to generate the above forms, Wagner devised a cyclic procedure which we summarize with the algorithm below.

2. Wagner’s cyclic algorithm

(a) Base case: Let X be a constituent that contains a set of unprosodified nouns (terminal nodes) that are in an associative coordination. Place a boundary of strength | between each noun.

(b) Recursive case: Consider a constituent Y. Let S be a set of constituents (terminals or non-terminals) that is properly contained in Y, such that at least one constituent in S is prosodified. Let |^k be the strongest prosodic boundary inside Y. Place the boundary |^(k+1) between each constituent in Y.

The algorithm is generalized to coordination of any depth. It takes as input a syntactic tree, and its output is a prosodically marked string. We illustrate this below, with the input tree represented as a bracketed string.

3. Illustrating Wagner's algorithm
   Input:          [A and B and [C and D]]
   Base case:      C | and D
   Recursive case: A || and B || and C | and D
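Wagner's cyclic procedure can be sketched in code. The snippet below is our own illustrative Python rendering, not part of Wagner's formulation: trees are nested lists whose leaves are noun strings, and a flat list with three or more conjuncts stands in for an associative coordination.

```python
def prosodify(tree):
    """Sketch of the cyclic prosodification procedure.

    `tree` is a nested list like ['A', 'and', ['B', 'or', 'C']]; a flat
    list with more than two conjuncts encodes an associative coordination.
    Returns (tokens, k): output tokens plus the strongest boundary
    strength k generated inside this constituent.
    """
    if isinstance(tree, str):               # terminal noun, not yet prosodified
        return [tree], 0
    parts, max_k = [], 0
    for child in tree:
        if child in ('and', 'or'):          # operator: a boundary goes before it
            parts.append(child)
        else:
            parts.append(prosodify(child))
            max_k = max(max_k, parts[-1][1])
    k = max_k + 1                           # cyclic step: strongest inner + 1
    out = []
    for p in parts:
        if isinstance(p, str):              # an operator
            out += ['|' * k, p]
        else:                               # an already-prosodified conjunct
            out += p[0]
    return out, k

def render(tree):
    return ' '.join(prosodify(tree)[0])
```

On [A and B and [C and D]] this reproduces the derivation in (3): the base case yields C | and D, and the recursive case yields A || and B || and C | and D.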

3.3 Issues of finiteness

Because Wagner's study used noun phrases with at most three or four items, the resulting language of prosodic parses is a finite language. Thus, the relevant syntax-to-prosody function is bounded. It is difficult to elicit coordination of 5 items, likely due to processing reasons (Wagner, 2010, 194).

If the primary culprit is performance, though, then syntactic competence may in fact allow for coordination constructions of unbounded depth with any number of items. Wagner's algorithm generates a prosodic structure for any such sentence, such as for (4). For the rest of this paper, we abstract away the finite bounds on coordination size in order to analyze the generative capacity of the underlying system (see Savitch, 1993, for mathematical arguments in support of factoring out finite bounds).

4. Hypothetical prosody for large coordination
   [A and B and [C and [D and E]]] is mapped to A ||| and B ||| and C || and D | and E


3.4 Computing recursive prosody over strings

The choice of representation plays an important role in determining the generative capacity of the prosodic mapping. We start by treating the mapping as a string-to-string function. We show that the mapping is not regular.

Let the input language be a bracketed string language, such that the input alphabet is a set of nouns {A, ..., Z}, coordinators, and brackets. The output language replaces the brackets with substrings of |*. For illustration, assume that the input language is guaranteed to consist of well-bracketed strings. At a syntactic boundary, we have to calculate the number of intervening boundaries between it and the deepest node. But this requires unbounded memory. For instance, to parse the example below, we incrementally increase the prosodic strength of each boundary as we read the input left-to-right.

5. Linearly parsing the prosody:
   [[[A and B] and C] and D] is mapped to A | and B || and C ||| and D, where
   Input alphabet Σ = {A, ..., Z, and, or, [, ]}
   Output alphabet Δ = {A, ..., Z, and, or, |}
   Input language: the well-bracketed strings in Σ∗

Given the above string with only left-branching syntax, the leftmost prosodic boundary will have a juncture of strength |. Every subsequent prosodic boundary will have incrementally larger strength. Over a string, this means we have to memorize the number x of prosodic junctures that were generated at any point in order to then generate x+1 junctures at the next point. A 1-way FST cannot memorize an unbounded amount of information. Thus, this function is not a rational function and cannot be defined by a 1-way FST. To prove this, we can look at this function in terms of the size of the input and output strings.

6. Illustrating growth size of recursive prosody
   [^n A0 and A1] and A2] and ... and An]
   is mapped to
   A0 | and A1 || and A2 ||| and ... |^n and An

Abstractly, for a left-branching input string with n left-brackets [, the output string has a monotonically increasing number of prosodic junctures: | ··· || ··· ||| ··· |^n. The total number of prosodic junctures is the triangular number n(n+1)/2. We thus derive the following lemma.

Lemma 1. For generating coordination prosody as a string-to-string function, the size of the output string grows at a rate of at least O(n^2), where n is the size of the input string.

Such a function is neither rational nor regular. Rational functions are computed by 1-way FSTs, and regular functions by 2-way FSTs (Engelfriet and Hoogeboom, 2001).^2 They share the following property in terms of growth rates (Lhote, 2020).

Theorem 1. Given an input string of size n, the size of the output string of a regular function grows at most linearly as c·n, where c is a constant.

Thus, this string-to-string function is not regular. It could be a more expressive polyregular function (Engelfriet and Maneth, 2002; Engelfriet, 2015; Bojanczyk, 2018; Bojanczyk et al., 2019), a question that we leave for future work.
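The quadratic growth behind Lemma 1 can be checked directly for the left-branching family in (6). The helper below is our own illustrative code: it builds the expected output string for n left-brackets and counts its junctures.

```python
def left_branching_prosody(n):
    """Expected prosodic output for the left-branching input with n
    left-brackets: A0 | and A1 || and A2 ||| ... |^n and An."""
    toks = ['A0']
    for i in range(1, n + 1):
        toks += ['|' * i, 'and', f'A{i}']
    return ' '.join(toks)

def juncture_count(n):
    # Each embedding level adds a constant number of input symbols,
    # but the junctures accumulate as the triangular number n(n+1)/2.
    return left_branching_prosody(n).count('|')
```

Since the input size is linear in n while the juncture count is quadratic, the growth bound of Theorem 1 is violated for any 1-way or 2-way FST.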

The discussion in this section focused on generating the output prosodic string when the input syntax is a bracketed string. Importantly though, Lemma 1 entails that no matter how one chooses their string encoding of syntactic structure, prosody cannot be modeled as a rational transduction unless there is an upper bound on the minimum number of output symbols that a single syntactic boundary must be rewritten as. To the best of our knowledge, there is no syntactic string encoding that guarantees such a bound. In the next section, we will discuss how to compute prosodic strength starting from a tree.

4 Computing recursive prosody over trees

Wagner (2010)'s treatment of recursive prosody assumes an algorithm that maps a syntactic tree to a prosodic string. It is thus valuable to understand the complexity of processes at the syntax-prosody interface starting from the tree representation of a sentence. Assuming we start from trees, there is one more choice to be made, namely whether the prosodic information (in the output) is present within a string or a tree. Notably, every tree-to-string transduction can be regarded as a tree-to-tree transduction plus a string yield mapping. As the tree-to-tree case subsumes the tree-to-string one, it makes sense to consider only the former. For a tree-to-tree mapping, the goal is to obtain a tree representation that already contains the correct prosodic information (Ladd, 1986; Selkirk, 2011). This is the focus of the rest of this paper.

4.1 Dependency trees

When working over syntactic structures explicitly, it is important to commit to a specific tree representation.

^2 This equivalence only holds for functions and deterministic FSTs. Non-deterministic FSTs can also compute relations.


In what follows, we adopt a type of dependency trees, where the head of a phrase is treated as the mother of the subtree that contains its arguments. For example, the coordinated noun phrase Pearl and Garnet is represented as the following dependency tree.

    and(Pearl, Garnet)

Dependency trees have a rich tradition in descriptive, theoretical, and computational approaches to language, and their properties have been defined across a variety of grammar formalisms (Tesniere, 1965; Nivre, 2005; Boston et al., 2009; Kuhlmann, 2013; Debusmann and Kuhlmann, 2010; De Marneffe and Nivre, 2019; Graf and De Santo, 2019; Shafiei and Graf, 2020, a.o.). Dependency trees keep the relation between heads and arguments local, and they maximally simplify the readability of our mapping rules. Hence, they allow us to focus our discussion on issues that are directly related to the connection of coordinated embeddings and prosodic strength, without having to commit to a particular analysis of coordinate structure.

Importantly, this choice does not impact the generalizability of the solution. It is fairly straightforward to convert basic dependency trees into phrase structure trees. Similarly, although it is possible to adopt n-ary branching structures, we chose to limit ourselves to binary trees (in the input). This turns out to be the most conservative assumption, as it forces us to explicitly deal with associativity and flat prosody.

4.2 Encoding prosodic strength over trees

We are interested in the complexity of mapping a “plain” syntactic tree to a tree representation which contains the correct prosodic information. Because of this, we encode prosodic strength over trees in the form of strength boundaries at each level of embedding. Each embedding level in our final tree representation will thus have a prosodic strength branch. The tree below shows how the syntactic tree for Pearl and Garnet is enriched with prosodic information, according to our encoding choices. For readability, we use $ to mark prosodic boundaries in trees instead of |, since the latter could be confused with a unary tree branch.

    and(Pearl, $, Garnet)

As the tree below shows, the depth of the prosody branch at each embedding level corresponds to the number of prosodic boundaries needed at that level.

    and4(Pearl1, $2($3), and7(Garnet5, $6, Rose8))

Finally, the prosodic tree is fed to a yield function to generate an output prosodified string. In particular, the correct tree-to-string mapping can be obtained by a modified version of a recursive-descent yield, which enumerates nodes left-to-right, depth first, and only enumerates the mother node of each level after the boundary branch. This strategy is depicted by the numerical subscripts in the tree above, which reconstruct how the yield of the prosodically annotated tree produces the string: Pearl || and Garnet | and Rose. The rest of this section will focus on how to obtain the correct tree encoding of prosodic information, starting from a plain dependency tree.
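The modified recursive-descent yield can be stated compactly. In the sketch below (our own encoding, not the paper's formal definition), a prosodified subtree is a tuple (label, left, depth, right), where the integer depth abbreviates the unary chain of $ nodes at that level; the mother label is emitted only after the boundary branch.

```python
def yield_prosody(node):
    """Modified yield: left daughter first, then the boundary branch,
    then the mother label, then the right daughter (depth-first)."""
    if isinstance(node, str):                      # leaf: a noun
        return [node]
    label, left, depth, right = node
    return yield_prosody(left) + ['|' * depth, label] + yield_prosody(right)
```

For the Pearl and Garnet and Rose tree above, the encoding ('and', 'Pearl', 2, ('and', 'Garnet', 1, 'Rose')) yields Pearl || and Garnet | and Rose, matching the subscript order in the tree.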

4.3 Mathematical preliminaries

For a natural number n, we let [n] = {1, ..., n}. A ranked alphabet Σ is a finite set of symbols, each one of which has a rank assigned by the function r : Σ → N. We write Σ(n) to denote {σ ∈ Σ | r(σ) = n}, and σ(n) indicates that σ has rank n.

Given a ranked alphabet Σ and a set A, TΣ(A) is the set of all trees over Σ indexed by A. The symbols in Σ are possible labels for nodes in the tree, indexed by elements in A. The set TΣ of Σ-trees contains all σ ∈ Σ(0) and all terms σ(n)(t1, ..., tn) (n ≥ 0) such that t1, ..., tn ∈ TΣ. Given a term m(n)(s1, ..., sn) where each si is a subtree with root di, we call m the mother of the daughters d1, ..., dn (1 ≤ i ≤ n). If two distinct nodes have the same mother, they are siblings. Essentially, the rank of a symbol denotes the finite number of daughters that it can take. Elements of A are considered as additional symbols of rank 0.

Example 1. Given Σ := {a(0), b(0), c(2), d(2)}, TΣ is an infinite set. The symbol a(0) means that a is a terminal node without daughters, while c(2) is a non-terminal node with two daughters. For example, the tree with root d, left subtree c(b, b), and right subtree d(b, a) corresponds to the term d(c(b,b), d(b,a)), contained in TΣ.
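As a concrete illustration of Example 1, terms over a ranked alphabet can be represented as nested tuples together with a rank check. This is our own illustrative representation choice, not part of the formal development.

```python
RANKS = {'a': 0, 'b': 0, 'c': 2, 'd': 2}   # the ranked alphabet of Example 1

def well_ranked(t):
    """A term is well-formed iff leaf symbols have rank 0 and every
    internal symbol has exactly as many daughters as its rank."""
    if isinstance(t, str):
        return RANKS[t] == 0
    sym, *kids = t
    return RANKS[sym] == len(kids) and all(well_ranked(k) for k in kids)
```

The term d(c(b,b), d(b,a)) is accepted as ('d', ('c', 'b', 'b'), ('d', 'b', 'a')), while a c with only one daughter is rejected, since c has rank 2.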


As is standard in defining meta-rules, we introduce X as a countably infinite set of variable symbols (X ∩ Σ = ∅) to be used as place-holders in the definitions of transduction rules over trees.

4.4 Multi bottom-up tree transducers

We assume that the starting point of the prosodic process is a plain syntactic tree. Thus, in order to derive the correct prosodic encoding, we need to propagate information about levels of coordination embedding and about associativity. We adopt a bottom-up approach, and characterize this process in terms of multi bottom-up tree transducers (MBOT; Engelfriet et al., 1980; Lilin, 1981; Maletti, 2011). Essentially, MBOTs generalize traditional bottom-up tree transducers in that they allow states to pass more than one output subtree up to subsequent transducer operations (Gildea, 2012). In other words, each MBOT rule potentially specifies several parts of the output tree. This is highlighted by the fact that the transducer states (q ∈ Q) can have rank greater than one — i.e. they can have more than one daughter, where the additional daughters are used to hold subtrees in memory. We follow Fulop et al. (2004) in presenting the semantics of MBOTs.

Definition 1 (MBOT). A multi bottom-up tree transducer (MBOT) is a tuple M = (Q, Σ, Δ, root, qf, R), where Q, Σ ∪ Δ, {root}, {qf} are pairwise disjoint, such that:

• Q is a ranked alphabet with Q(0) = ∅, called the set of states

• Σ and Δ are ranked input and output alphabets, respectively

• root is a unary symbol, called the root symbol

• qf is a unary symbol, called the final state

• R is a finite set of rules of two forms:

  – σ(q1(x1,1, ..., x1,n1), ..., qk(xk,1, ..., xk,nk)) → q0(t1, ..., tn0), where k ≥ 0, σ ∈ Σ(k), for every i ∈ [k] ∪ {0}, qi ∈ Q(ni) for some ni ≥ 1, and for every j ∈ [n0], tj ∈ TΔ({xi,j | i ∈ [k], j ∈ [ni]}).

  – root(q(x1, ..., xn)) → qf(t), where n ≥ 1, q ∈ Q(n), and t ∈ TΔ(Xn).

The derivational relation induced by M is a binary relation ⇒M over the set TΣ∪Δ∪Q∪{root,qf}, defined as follows. For every ϕ, ψ ∈ TΣ∪Δ∪Q∪{root,qf}, ϕ ⇒M ψ iff there is a tree β ∈ TΣ∪Δ∪Q∪{root,qf}({x1}) such that x1 occurs exactly once in β, and either there is a rule

• σ(q1(x1,1, ..., x1,n1), ..., qk(xk,1, ..., xk,nk)) → r in R

and there are trees ti,j ∈ TΣ for every i ∈ [k] and j ∈ [ni], such that ϕ = β[σ(q1(t1,1, ..., t1,n1), ..., qk(tk,1, ..., tk,nk))] and ψ = β[r[xi,j ← ti,j | i ∈ [k], j ∈ [ni]]]; or there is a rule

• root(q(x1, ..., xn)) → qf(t) in R

and there are trees ti ∈ TΔ for every i ∈ [n] such that ϕ = β[root(q(t1, ..., tn))] and ψ = β[qf(t[t1, ..., tn])]. The tree transformation computed by M is the relation:

τM = {(s, t) ∈ TΣ × TΔ | root(s) ⇒*M qf(t)}

Intuitively, tree transductions are performed by rewriting a local tree fragment as specified by one of the rules in R. For instance, a rule can replace a subtree, or copy it to a different position. Rules apply bottom-up from the leaves of the input tree, and terminate in an accepting state qf.

4.5 MBOT for recursive prosody

We want a transducer which captures Wagner (2010)'s bottom-up cyclic procedure. Consider now the MBOT Mpros = (Q, Σ, Δ, root, qf, R), with Q = {q∗, qc}, σc ∈ {and, or} ⊆ Σ, σ ∈ Σ − {and, or}, and Σ = Δ. We use qc to indicate that Mpros has verified that a branch contains a coordination (headed by some σc), with q∗ assigned to any other branch. As mentioned, we use $ to mark prosodic boundaries in the trees instead of |. The set of rules R is as follows.

Rule 1 rewrites a terminal symbol σ as itself. The MBOT for that branch transitions to q∗(σ).

σ → q∗(σ)   (1)

Rule 2 applies to a subtree headed by σc ∈ {and, or}, with only terminal symbols as daughters: σc(q∗(x), q∗(y)). It inserts a prosodic boundary $ between the daughters x, y. The boundary $ is also copied as a daughter of the mother qc, as a record of the fact that we have seen one coordination level.

σc(q∗(x), q∗(y)) → qc(σc(x, $, y), $)   (2)

We illustrate this in Figure 1 with a coordination of two items, representing the mapping [B and A] → B | and A. We also assume that sentence-initial boundaries are vacuously interpreted.

We now consider cases where a coordination is the mother not just of terminal nodes, but of other coordinated phrases. Rule 3 handles the case in which


Figure 1: Example of the application of rules (1) and (2). The numerical label on the arrow indicates which rule was applied in order to rewrite the tree on the left as the tree on the right.

the right sibling of the mother was also headed by a coordination (as encoded by σc having qc as one of its daughters). Here, qc is the result of a previous rule application (e.g. rule 2) and it has two subtrees itself: qc(w, y). Although we do not have access to the internal labels of x, y, and w, by the format of the previous rules we know that the right daughter of qc (i.e. y) is the one that contains the strength information. Then, rule 3 has three things to do. It increments y by one boundary: $(y). It places $(y) in between the two subtrees x and w. And, it copies $(y) as the daughter of the new qc state in order to propagate $(y) to the next embedding level (see Figure 2).

σc(q∗(x), qc(w, y)) → qc(σc(x, $(y), w), $(y))   (3)

Figure 2: Example of the application of rule (3). For ease of readability, we omit q∗ states over terminal nodes.

Rule 4 applies once all coordinate phrases up to the root have been rewritten. It simply rewrites the root as the final accepting state. It gets rid of the daughter of qc that contains the strength markers, since there is no need to propagate them any further.

root(qc(x, y)) → qf(x)   (4)

As the examples so far should have clarified, Mpros as currently defined readily handles cases where the embedding of the coordination is strictly right-branching, with the bulk of the work done via rule 3. However, while these rules work well for instances in which a coordination is always the right daughter of a node, they cannot deal with cases in which the coordination branches left, or alternates between the two. This is easily fixed by introducing variants to rule 3, which consider the position of the coordination as marked by qc. Importantly, the position of the copy of the boundary branch is not altered, and it is always kept as the rightmost sibling of qc. What changes is the relative position of the w and x subbranches in the output (see Figure 3).

σc(qc(w, y), q∗(x)) → qc(σc(w, $(y), x), $(y))   (5)

Figure 3: Left-branching example as in rule (5).

Following the same logic, rule 6 handles cases like [[A and B] and [C and D]], in which both daughters of a coordination are headed by a coordination themselves (see Figure 4).

σc(qc(x, z), qc(y, w)) → qc(σc(x, $(z), y), $(z))   (6)

Figure 4: Example of the application of rule (6).

Finally, we need to take care of the flat prosody or associativity issue. The MBOT Mpros as outlined so far increases the depth of the boundary branch at each level of embedding. Because we are adopting binary branching trees, the current set of rules is trivially unable to encode cases like [A and B and C]. We follow Wagner's assumption that semantic information on the syntactic tree guides the prosody cycles. Representationally, we mark this by using specific labels on the internal nodes of the tree. We assume that the flat constituent interpretation is


Figure 5: Walk-through of the transduction defined by Mpros (stages: input; apply rule (2); apply rule (3); apply rule (3); apply rule (4)). For ease of readability, and to highlight how qc propagates embedding information about the coordination, q∗ and qf states are omitted.

obtained by marking internal nodes as non-cyclic, introducing the alphabet symbol σn:

σn(q∗(x), qc(w, y)) → qc(σn(x, y, w), y)   (7)

Essentially, rule 7 tells us that when a coordination node is marked as σn, Mpros just propagates the level of prosodic strength that it currently has registered (in y), without increments (see Figure 6). This rule can be trivially adjusted to deal with branching differences, as done for rules 3 and 5.

Figure 6: Application of rule (7) for flat prosody.

A full, step-by-step Mpros transduction is shown in Figure 5. Taken together, the recursive prosodic patterns are fully characterized by Mpros when it is adjusted with a set of rules to deal with alternating branching and flat associativity. The tree transducer generates tree representations where each level of embedding is marked by a branch, which carries information about the prosodic strength for that level. As outlined in Section 4.2, this final representation may then be fed to a modified string yield function for dependency tree languages.
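The behavior of the rule system can be simulated with a bottom-up recursion over binary trees. The sketch below is our own illustrative condensation, not the transducer itself: an integer depth plays the role of qc's boundary daughter $^k, a hypothetical label suffix _n marks non-cyclic (associative) nodes as in rule (7), and rules (3), (5), and (6) collapse into taking the strongest daughter boundary plus one.

```python
def mpros(tree):
    """Bottom-up pass over binary trees. Leaves are strings; internal
    nodes are (label, left, right), with 'and'/'or' for cyclic nodes and
    'and_n'/'or_n' for non-cyclic ones. Returns (output_tree, depth),
    where depth stands in for the boundary branch that qc would carry."""
    if isinstance(tree, str):
        return tree, 0                      # rule (1): q* over a terminal
    label, left, right = tree
    lt, lk = mpros(left)
    rt, rk = mpros(right)
    if label.endswith('_n'):                # rule (7): propagate, no increment
        k = max(lk, rk)
        label = label[:-2]
    else:                                   # rules (2)/(3)/(5)/(6): increment
        k = max(lk, rk) + 1
    return (label, lt, k, rt), k

def yield_str(node):
    """Modified yield: left daughter, boundary, mother label, right daughter."""
    if isinstance(node, str):
        return [node]
    label, left, k, right = node
    return yield_str(left) + ['|' * k, label] + yield_str(right)

def transduce(tree):
    out, _ = mpros(tree)                    # rule (4): drop the final boundary branch
    return ' '.join(yield_str(out))
```

For example, transduce(('and', ('and', 'A', 'B'), 'C')) returns A | and B || and C, the left-branching pattern handled by rule (5), and marking the top node as 'and_n' over [A and [B and C]] yields the flat pattern A | and B | and C.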

Dependency trees allowed us to present a transducer with rules that are relatively easy to read. But, as mentioned before, this choice does not affect our general result. Under the standard assumption that the distance between the head of a phrase and its maximal projection is bounded, Mpros can be extended to phrase structure trees, by virtue of the bottom-up strategy being intrinsically equipped with finite look-ahead. A switch to phrase structure trees may prove useful for future work on the interaction of prosody and movement.

5 Generating recursive prosody

The previous section characterized recursive prosody over trees with a non-linear, deterministic MBOT. This is a nice result, as MBOTs are generally well-understood in terms of their algorithmic properties. Moreover, this result is in line with past work exploring the connections of MBOTs, tree languages, and the complexity of movement and copying operations in syntax (Kobele, 2006; Kobele et al., 2007, a.o.).

We can now ask what the complexity of this approach is. MBOTs generate output string languages that are potentially parallel multiple context-free languages (PMCFL; Seki et al., 1991, 1993; Gildea, 2012; Maletti, 2014; Fulop et al., 2005). Since this class of string languages is more powerful than context-free, the corresponding tree language is not a regular tree language (Gecseg and Steinby, 1997). This is not surprising, as MBOTs can be understood as an extension of synchronous tree substitution grammars (Maletti, 2014).

Notably, independently of our specific MBOT solution, prosody as defined in this paper generates at least some output string languages that lack the constant growth property — hence, that are PMCFLs. Consider as input a regular tree language of left-branching coordinate phrases, where each level is simply of the form and(X, Mary). The n-th level of embedding from the top extends the string yield by n+2 symbols. This immediately implies no constant growth, and thus no semi-linearity (Weir, 1988; Joshi et al., 1990).

Interestingly though, the prosody MBOT developed here is fairly limited in its expressivity, as the transducer states themselves do almost no work, and most of the transduction rules in Mpros rely on the ability to store the prosodic strength branch. Hence, the specific MBOT in this paper might turn out to belong to a relatively weak subclass of tree transductions with copying, perhaps a variant of input strictly local tree transductions (cf. Ikawa et al., 2020; Ji and Heinz, 2020), or a transducer variant of sensing tree automata (cf. Fulop et al., 2004; Kobele et al., 2007; Maletti, 2011, 2014; Graf and De Santo, 2019). Since all of those have recently been used in the formal study of syntax, they are natural candidates for a computational model of prosody, and their sensitivity to minor representational differences might also illuminate what aspects of syntactic representation affect the complexity of prosodic processes.

Finally, one might worry that the mathematical complexity is a confound of the representation we use, rather than a genuine property of the phenomenon. However, a representation of prosodic strength is necessary and cannot be reduced further, for two reasons. First, strength cannot be reduced to syntactic boundaries, because a single prosodic edge ( may correspond to |^k for any k ≥ 1. As discussed in depth by Wagner (2005, 2010), one cannot simply convert a syntactic tree into a prosodic tree by replacing the labels of nonterminal nodes. Second, strength also cannot be reduced to different categories of prosodic constituents — e.g. assuming that | is a prosodic phrase while || is an intonational phrase. As argued in depth by Wagner (2005, 2010), these different constituent types do not map neatly to prosodic strength. Instead, these boundaries all encode relative strengths of prosodic phrase boundaries.

6 Conclusion

This paper formalizes the computation of unbounded recursive prosodic structures in coordination. Their computation cannot be done by string-based finite-state transducers. They instead need more expressive grammars. To our knowledge, this paper is one of the few (if not the only) formal results on how prosodic phonology at the sentence level is computationally more expressive than phonology at the word level.

As discussed above, recent work in prosodic phonology relies on the assumption that prosodic structure can be recursive. However, because such work usually assumes bounded recursion, such phenomena are computationally regular. Departing from this stance, this paper focused on the prosodic phenomena reported in Wagner (2005) as a core case study,

because of the following fundamental properties:

• The syntax has unbounded recursion.
• The prosody has unbounded recursion.
• All recursive prosodic constituents have the same prosodic label (= a prosodic phrase).
• The recursive prosodic constituents have acoustic cues marking different strengths.
• There is an algorithm which explicitly assigns the recursive prosodic constituents to these different strengths.

In this paper, we focused on explicitly generating the prosodic strengths at each recursive prosodic level, putting aside the mathematically simpler task of converting a recursive syntactic tree into a recursive prosodic tree (Elfner, 2015; Bennett and Elfner, 2019) — which is a process essentially analogous to a relabeling of the nonterminal nodes of the syntactic tree, without care for the prosodic strength. The mapping studied in this paper has been conjectured in the past to be computationally more expressive than regular languages or functions (Yu and Stabler, 2017). Here, we formally verified that hypothesis.

An open question then is to find other empirical phenomena which also have the above properties. One potential area of investigation is the assignment of relative prominence relations in English compound prosody (Chomsky and Halle, 1968). However, English compound prosody is a highly controversial area. It is unclear what the current consensus is on an exact algorithm for these compounds, especially one that utilizes recursion and is not based on impressionistic judgments (Liberman and Prince, 1977; Gussenhoven, 2011). In this sense, the mathematical results in this paper highlight the importance of representational commitments and of explicit assumptions in the study of prosodic expressivity. Our paper might then help identify crucial issues in future theoretical and empirical investigations of the syntax-prosody interface.

Acknowledgements

We are grateful to our anonymous reviewers, Jon Rawski, and Kristine Yu. Thomas Graf is supported by the National Science Foundation under Grant No. BCS-1845344.

References

Ryan Bennett and Emily Elfner. 2019. The syntax–prosody interface. Annual Review of Linguistics, 5:151–171.


Mikołaj Bojanczyk. 2018. Polyregular functions. arXiv preprint arXiv:1810.08760.

Mikołaj Bojanczyk, Sandra Kiefer, and Nathan Lhote. 2019. String-to-string interpretations with polynomial-size output. In 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, Patras, Greece (LIPIcs), volume 132, pages 106:1–106:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.

Marisa Ferrara Boston, John T. Hale, and Marco Kuhlmann. 2009. Dependency structures derived from minimalist grammars. In The Mathematics of Language, pages 1–12. Springer.

Peter Chew. 2003. A computational phonology of Russian. Universal-Publishers, Parkland, FL.

Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.

Noam Chomsky and Morris Halle. 1968. The sound pattern of English. MIT Press, Cambridge, MA.

Noam Chomsky and Marcel P. Schutzenberger. 1959. The algebraic theory of context-free languages. In Studies in Logic and the Foundations of Mathematics, volume 26, pages 118–161. Elsevier.

Kenneth Ward Church. 1983. Phrase-structure parsing: A method for taking advantage of allophonic constraints. Ph.D. thesis, Massachusetts Institute of Technology.

John Coleman. 1992. The phonetic interpretation of headed phonological structures containing overlapping constituents. Phonology, 9(1):1–44.

John Coleman. 1993. English word-stress in unification-based grammar. In T. Mark Ellison and James Scobbie, editors, Computational Phonology, pages 97–106. Centre for Cognitive Science, University of Edinburgh.

John Coleman. 1995. Declarative lexical phonology. In Jacques Durand and Francis Katamba, editors, Frontiers of phonology: Atoms, structures, derivations, pages 333–383. Longman, London.

John Coleman. 1996. Declarative syllabification in Tashlhit Berber. In Jacques Durand and Bernard Laks, editors, Current trends in phonology: Models and methods, volume 1, pages 175–216. European Studies Research Institute, University of Salford, Salford.

John Coleman. 1998. Phonological representations: Their names, forms and powers. Cambridge University Press, Cambridge.

John Coleman. 2000. Candidate selection. The Linguistic Review, 17(2-4):167–180.

John Coleman and Janet Pierrehumbert. 1997. Stochastic phonological grammars and acceptability. In Third Meeting of the ACL Special Interest Group in Computational Phonology: Proceedings of the Workshop, pages 49–56, East Stroudsburg, PA. Association for Computational Linguistics.

John S. Coleman. 1991. Prosodic structure, parameter-setting and ID/LP grammar. In Steven Bird, editor, Declarative perspectives on phonology, pages 65–78. Centre for Cognitive Science, University of Edinburgh.

Marie-Catherine De Marneffe and Joakim Nivre. 2019.Dependency grammar. Annual Review of Linguistics,5:197–218.

Ralph Debusmann and Marco Kuhlmann. 2010. Depen-dency grammar: Classification and exploration. InResource-adaptive cognitive processes, pages 365–388.Springer.

Arthur Dirksen. 1993. Phrase structure phonology. InT. Mark Ellison and James Scobbie, editors, Compu-tational Phonology, page 81–96. Centre for CognitiveScience, University of Edinburgh.

Hossep Dolatian. 2020. Computational locality of cyclicphonology in Armenian. Ph.D. thesis, Stony BrookUniversity.

Hossep Dolatian, Nate Koser, Kristina Strother-Garcia,and Jonathan Rawski. 2021. Computational restric-tions on iterative prosodic processes. In Proceedingsof the 2019 Annual Meeting on Phonology. LinguisticSociety of America.

Emily Elfner. 2015. Recursion in prosodic phrasing:Evidence from Connemara Irish. Natural Language &Linguistic Theory, 33(4):1169–1208.

Joost Engelfriet. 2015. Two-way pebble transducersfor partial functions and their composition. ActaInformatica, 52(7-8):559–571.

Joost Engelfriet and Hendrik Jan Hoogeboom. 2001.MSO definable string transductions and two-way finite-state transducers. Transactions of the Association forComputational Linguistics, 2(2):216–254.

Joost Engelfriet and Sebastian Maneth. 2002. Two-wayfinite state transducers with nested pebbles. In Inter-national Symposium on Mathematical Foundations ofComputer Science, pages 234–244. Springer.

Joost Engelfriet, Grzegorz Rozenberg, and Giora Slutzki.1980. Tree transducers, l systems, and two-waymachines. Journal of Computer and System Sciences,20(2):150–202.

Zoltan Fulop, Armin Kuhnemann, and Heiko Vogler.2004. A bottom-up characterization of deterministictop-down tree transducers with regular look-ahead.Information Processing Letters, 91(2):57–67.

Zoltan Fulop, Armin Kuhnemann, and Heiko Vogler. 2005.Linear deterministic multi bottom-up tree transducers.Theoretical computer science, 347(1-2):276–287.

Ferenc Gecseg and Magnus Steinby. 1997. Tree lan-guages. In Handbook of formal languages, pages 1–68.Springer.

20

Dafydd Gibbon. 2001. Finite state prosodic analysis ofAfrican corpus resources. In EUROSPEECH 2001Scandinavia, 7th European Conference on SpeechCommunication and Technology, 2nd INTERSPEECHEvent, Aalborg, Denmark, September 3-7, 2001, pages83–86. ISCA.

Daniel Gildea. 2012. On the string translations producedby multi bottom–up tree transducers. ComputationalLinguistics, 38(3):673–693.

Thomas Graf and Aniello De Santo. 2019. Sensing treeautomata as a model of syntactic dependencies. InProceedings of the 16th Meeting on the Mathematics ofLanguage, pages 12–26, Toronto, Canada. Associationfor Computational Linguistics.

Carlos Gussenhoven. 2011. Sentential prominence inEnglish. In Marc van Oostendorp, Colin Ewen, Eliz-abeth Hume, and Keren Rice, editors, The Blackwellcompanion to phonology, volume 5, pages 1–29.Wiley-Blackwell, Malden, MA.

Yiding Hao. 2020. Metrical grids and generalized tier pro-jection. In Proceedings of the Society for Computationin Linguistics, volume 3.

Jeffrey Heinz. 2018. The computational nature ofphonological generalizations. In Larry Hyman andFrans Plank, editors, Phonological Typology, Phoneticsand Phonology, chapter 5, pages 126–195. Mouton deGruyter, Berlin.

Mans Hulden. 2006. Finite-state syllabification. In AnssiYli-Jyra, Lauri Karttunen, and Juhani Karhumaki,editors, Finite-State Methods and Natural LanguageProcessing. FSMNLP 2005. Lecture Notes in ComputerScience, volume 4002. Springer, Berlin/Heidelberg.

Harry Van der Hulst. 2010. A note on recursion inphonology recursion. In Harry Van der Hulst, editor,Recursion and human language, pages 301–342.Mouton de Gruyter, Berlin & New York.

William J Idsardi. 2009. Calculating metrical structure.In Eric Raimy and Charles E. Cairns, editors, Contem-porary views on architecture and representations inphonology, number 48 in Current Studies in Linguistics,pages 191–211. MIT Press, Cambridge, MA.

Shiori Ikawa, Akane Ohtaka, and Adam Jardine. 2020.Quantifier-free tree transductions. Proceedings of theSociety for Computation in Linguistics, 3(1):455–458.

Junko Ito and Armin Mester. 2012. Recursive prosodicphrasing in Japanese. In Toni Borowsky, ShigetoKawahara, Shinya Takahito, and Mariko Sugahara,editors, Prosody matters: Essays in honor of ElisabethSelkirk, pages 280–303. Equinox Publishing, London.

Junko Ito and Armin Mester. 2013. Prosodic subcate-gories in Japanese. Lingua, 124:20–40.

Jing Ji and Jeffrey Heinz. 2020. Input strictly localtree transducers. In International Conference onLanguage and Automata Theory and Applications,pages 369–381. Springer.

C Douglas Johnson. 1972. Formal aspects of phonologi-cal description. Mouton, The Hague.

Aravind K Joshi, K Vijay Shanker, and David Weir. 1990.The convergence of mildly context-sensitive grammarformalisms. Technical Reports (CIS), page 539.

Ronald M. Kaplan and Martin Kay. 1994. Regularmodels of phonological rule systems. Computationallinguistics, 20(3):331–378.

George Anton Kiraz and Bernd Mobius. 1998. Mul-tilingual syllabification using weighted finite-statetransducers. In The third ESCA/COCOSDA workshop(ETRW) on speech synthesis.

Ewan Klein. 1991. Phonological data types. In StevenBird, editor, Declarative perspectives on phonol-ogy, pages 127–138. Centre for Cognitive Science,University of Edinburgh.

Gregory M. Kobele, Christian Retore, and Sylvain Salvati.2007. An automata-theoretic approach to minimalism.Model theoretic syntax at 10, pages 71–80.

Gregory Michael Kobele. 2006. Generating Copies: Aninvestigation into structural identity in language andgrammar. Ph.D. thesis, University of California, LosAngeles.

Nate Koser. in prep. The computational nature of stressassignment. Ph.D. thesis, Rutgers University.

Marco Kuhlmann. 2013. Mildly non-projective de-pendency grammar. Computational Linguistics,39(2):355–387.

D. Robert Ladd. 1986. Intonational phrasing: The case forrecursive prosodic structure. Phonology, 3:311–340.

D. Robert Ladd. 2008. Intonational phonology. Cam-bridge University Press, Cambridge.

D. Terence Langendoen. 1975. Finite-state parsingof phrase-structure languages and the status ofreadjustment rules in grammar. Linguistic Inquiry,6(4):533–554.

D. Terence Langendoen. 1987. On the phrasing ofcoordinate compound structures. In Brian Joseph andArnold Zwicky, editors, A festschrift for Ilse Lehiste,page 186–196. Ohio State University, Ohio.

D. Terence Langendoen. 1998. Limitations on embeddingin coordinate structures. Journal of PsycholinguisticResearch, 27(2):235–259.

Nathan Lhote. 2020. Pebble minimization of polyreg-ular functions. In Proceedings of the 35th AnnualACM/IEEE Symposium on Logic in Computer Science,pages 703–712.

Mark Liberman and Alan Prince. 1977. On stress andlinguistic rhythm. Linguistic inquiry, 8(2):249–336.

21

Eric Lilin. 1981. Proprietes de cloture d’une extension detransducteurs d’arbres deterministes. In Colloquiumon Trees in Algebra and Programming, pages 280–289.Springer.

Andreas Maletti. 2011. How to train your multi bottom-up tree transducer. In Proceedings of the 49th AnnualMeeting of the Association for Computational Linguis-tics: Human Language Technologies, pages 825–834.

Andreas Maletti. 2014. The power of regularity-preserving multi bottom-up tree transducers. InInternational Conference on Implementation andApplication of Automata, pages 278–289. Springer.

Marina Nespor and Irene Vogel. 1986. Prosodicphonology. Foris, Dordrecht.

Joakim Nivre. 2005. Dependency grammar and depen-dency parsing. MSI report, 5133.1959:1–32.

Marc van Oostendorp. 1993. Formal properties ofmetrical structure. In Sixth Conference of the Euro-pean Chapter of the Association for ComputationalLinguistics, pages 322–331, Utrecht. ACL.

Janet Breckenridge Pierrehumbert. 1980. The phonologyand phonetics of English intonation. Ph.D. thesis,Massachusetts Institute of Technology.

Peter. A. Reich. 1969. The finiteness of natural language.Language, 45:831–843.

Walter J Savitch. 1993. Why it might pay to assume thatlanguages are infinite. Annals of Mathematics andArtificial Intelligence, 8(1-2):17–25.

James M. Scobbie, John S. Coleman, and Steven Bird.1996. Key aspects of declarative phonology. In JacquesDurand and Bernard Laks, editors, Current Trends inPhonology: Models and Methods, volume 2. EuropeanStudies Research Institute, Salford, Manchester.

Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, andTadao Kasami. 1991. On multiple context-free gram-mars. Theoretical Computer Science, 88(2):191–229.

Hiroyuki Seki, Ryuichi Nakanishi, Yuichi Kaji, SachikoAndo, and Tadao Kasami. 1993. Parallel multiplecontext-free grammars, finite-state translation systems,and polynomial-time recognizable subclasses oflexical-functional grammars. In Proceedings of the31st annual meeting on Association for Computa-tional Linguistics, pages 130–139. Association forComputational Linguistics.

Elisabeth Selkirk. 1986. On derived domains in sentencephonology. Phonology Yearbook, 3(1):371–405.

Elisabeth Selkirk. 2011. The syntax-phonology interface.In John Goldsmith, Jason Riggle, and Alan C. L. Yu,editors, The Handbook of Phonological Theory, 2edition, pages 435–483. Blackwell, Oxford.

Nazila Shafiei and Thomas Graf. 2020. The subregularcomplexity of syntactic islands. In Proceedings of theSociety for Computation in Linguistics, volume 3.

Kristina Strother-Garcia. 2018. Imdlawn Tashlhiyt Berbersyllabification is quantifier-free. In Proceedings ofthe Society for Computation in Linguistics, volume 1,pages 145–153.

Kristina Strother-Garcia. 2019. Using model theoryin phonology: a novel characterization of syllablestructure and syllabification. Ph.D. thesis, Universityof Delaware.

Lucien Tesniere. 1965. Elements de syntaxe structurale,1959. Paris, Klincksieck.

Johan t’Hart and Antonie Cohen. 1973. Intonation by rule:a perceptual quest. Journal of Phonetics, 1(4):309–327.

Johan t’Hart and Rene Collier. 1975. Integrating differentlevels of intonation analysis. Journal of Phonetics,3(4):235–255.

Johan t’Hart, Rene Collier, and Antonie Cohen. 2006.A perceptual study of intonation: An experimental-phonetic approach to speech melody. CambridgeUniversity Press.

Michael Wagner. 2005. Prosody and recursion. Ph.D.thesis, Massachusetts Institute of Technology.

Michael Wagner. 2010. Prosody and recursion in coor-dinate structures and beyond. Natural Language &Linguistic Theory, 28(1):183–237.

Markus Walther. 1993. Declarative syllabification withapplications to German. In T. Mark Ellison and JamesScobbie, editors, Computational Phonology, pages55–79. Centre for Cognitive Science, University ofEdinburgh.

Markus Walther. 1995. A strictly lexicalized approachto phonology. In Proceedings of DGfS/CL’95, page108–113, Dusseldorf. Deutsche Gesellschaft furSprachwissenschaft, Sektion Computerlinguistik.

David Jeremy Weir. 1988. Characterizing mildly context-sensitive grammar formalisms. Ph.D. thesis, Universityof Pennsylvania.

Ngee Thai Yap. 2006. Modeling syllable theory with finite-state transducers. Ph.D. thesis, University of Delaware.

Kristine M. Yu. 2017. Advantages of constituency:Computational perspectives on Samoan word prosody.In International Conference on Formal Grammar 2017,page 105–124, Berlin. Spring.

Kristine M. Yu. 2019. Parsing with minimalist grammarsand prosodic trees. In Robert C. Berwick and Edward P.Stabler, editors, Minimalist Parsing, pages 69–109.Oxford University Press, London.

Kristine M. Yu and Edward P. Stabler. 2017. (In)variability in the Samoan syntax/prosody interfaceand consequences for syntactic parsing. LaboratoryPhonology: Journal of the Association for LaboratoryPhonology, 8(1):1–44.

22

Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 23–31, August 5, 2021. ©2021 Association for Computational Linguistics

The Match-Extend Serialization Algorithm in Multiprecedence

Maxime Papillon
Classics, Modern Languages and Linguistics, Concordia University

[email protected]

Abstract

Raimy (1999; 2000a; 2000b) proposed a graphical formalism for modeling reduplication, originally mostly focused on phonological overapplication in a derivational framework. This framework is now known as Precedence-based phonology or Multiprecedence phonology. Raimy's idea is that the segments at the input to the phonology are not totally ordered by precedence. This paper tackles a challenge that arose with Raimy's work: the development of a deterministic serialization algorithm as part of the derivation of surface forms. The Match-Extend algorithm introduced here requires fewer assumptions and hews closer to the attested typology. The algorithm also contains no parameters or constraints specific to individual graphs or topologies, unlike previous proposals. Match-Extend requires nothing except knowing the last added set of links.

1 Introduction

This paper provides a general serialization algorithm for all morphological structures in all languages. The challenge of converting non-linear structures of linguistic representation into a format ready to be handled in production is one that matters to both morphosyntax and morphophonology. Reduplication is a phenomenon at the frontier of morphology and phonology that has drawn a lot of attention in the last few decades. Reduplication's non-concatenative nature and the fact that it manifests long-distance dependencies among segments set it apart from the 'standard' word-formation that most theories are designed to handle. These properties have often pushed theoreticians to propose expansive systems such as copying procedures on top of traditional linear segmental phonology to make the system powerful

enough to handle these dependencies. The multiprecedence model expanded upon here builds on properties that are already implicit in all approaches to phonological representation, and actually gets rid of some standard assumptions. It accounts for attested patterns and predicts an unattested reduplication pattern to be impossible.

2 Multiprecedence

The theory of Multiprecedence seeks to account for reduplication representationally via loops in a graph. Eschewing correspondence statements and copying procedures, Multiprecedence treats reduplication as fundamentally a structural property created by the addition of an affix, whose serialization has the effect of pronouncing all or part of the form twice.

Consider a string like Fig. 1a, the standard way of representing the segments that constitute a phonological representation. An alternative way to encode the same information is as a set of immediate precedence statements like Fig. 1b. For legibility, the set of pairs in Fig. 1b can be represented in the form of a graph. Adding the convention of using # and % for the START and END symbols respectively, we get the picture in Fig. 1c. In general I will refer to this as the graph representation.

a. kæt
b. 〈START, k〉, 〈k, æ〉, 〈æ, t〉, 〈t, END〉
c. # → k → æ → t → %

Figure 1: Phonological representations of the word cat as a string (a), ordered pairs (b), and a graph (c).

The graph representation should highlight an important detail. There is no a priori logical reason in this representation why forms should be linear, with one segment following another in a chain. This is only an assumption that we impose on the structure when assuming strings. This assumption is what Multiprecedence abandons. Multiprecedence proposes that asymmetry and irreflexivity are not relevant to phonology. A segment can precede or follow multiple segments, two segments can transitively precede each other, and a segment can precede itself. A valid multiprecedence graph is not restricted by topology, a term I will use for the pattern of the graph independent from the content of the nodes.

Using this view of precedence, affixation is the process of combining the graph representations of different morphemes. A word is a graph consisting of the edges and vertices (precedence relations and segments) of one or more morphemes. An example of the suffixation of the English plural is shown in Fig. 2a, and the infixation of the Atayal animate focus morpheme is given in Fig. 2b. Full root reduplication, which expresses the plural of nouns in Indonesian, is shown in Fig. 2c. There are two things to notice in Fig. 2c. First, a precedence arrow is added without any segmental material: the reduplicative morpheme consists of just that arrow. Second, although Fig. 2a and Fig. 2b each offer two paths from the START to the END of the graph, Fig. 2c contains a loop that offers an infinite number of paths from START to END. The representation itself does not enforce how many times the arrow added by the plural morpheme should be traversed. All three of these structures have to be handled by a serialization algorithm in order to be actualized by the phonetic motor system, which selects a path through the graph to be sent to the articulators. A correct serialization algorithm must be able to select the correct one of the two paths in Fig. 2a and Fig. 2b and the path going through the back loop only once in Fig. 2c.

I will assume here that these forms are constructed by the attachment of an affix morpheme onto a stem as in Fig. 3. English speakers have a graph as a lexical item for the plural as in Fig. 3a and a lexical item for CAT as in Fig. 3b, which combine as in Fig. 3c. The moniker "last segment" is an informal way to refer to that part of the affix that is responsible for attaching it to the stem in the right location. This piece of the plural affix will attach onto the last segment, the one preceding the end of the word %, of what it combined with, and onto %, yielding Fig. 3c. Similarly, the Atayal form in Fig. 2 is built from a root #hNuP% 'soak' and an infix -m- marking the animate actor focus and attaching between the first and the second segment. For details on the mechanics of attachment see Raimy (2000a, §3.2), Samuels (2009, pp. 177–87), and Papillon (2020, §2.2). It suffices here to say that at vocabulary insertion an affix can target certain segments of the stem for attachment. Raimy (2000a) shows how this representation can generate the reduplicative patterns of numerous languages as well as account for such phenomena as over- and under-application of phonological processes in reduplication.

a. # → k → æ → t → % plus the suffix arcs t → z and z → % ⟹ # k æ t z %
b. # → h → N → u → P → % plus the infix arcs h → m and m → N ⟹ # h m N u P %
c. # → k → @ → r → a → % plus a back arc a → k ⟹ # k @ r a k @ r a %

Figure 2: Affixation in Multiprecedence. Suffixation (a), infixation (b), reduplication (c).

a. [last segment] → z → %
b. # → k → æ → t → %
c. # → k → æ → t → % plus the arcs t → z and z → %

Figure 3: Affix (a) and root (b) combined in (c).

Given the assumption that a non-linear graph cannot be pronounced, phonology requires an algorithm capable of converting graph representations into strings as in Fig. 2. Two main families of algorithms have been proposed. Raimy (1999) proposed a stack-based algorithm which was expanded upon by Idsardi and Shorey (2007) and McClory and Raimy (2007). This algorithm traverses the graph from # to % by accessing the stack. This idea suffers the problem of requiring parameters on individual arcs. Every morphologically-added precedence link must be parametrized as to its priority, determining whether it goes to the top or the bottom of the stack. This is necessary in this system because when a given arc is traversed is not predictable on the basis of when it is encountered in a traversal. This parametrization radically explodes the range of patterns predicted to be possible, far beyond what is attested. Fitzpatrick and Nevins (2002; 2004) proposed a different constraint-based algorithm which globally compares paths through the graph for completeness and economy, but it suffers the problem of requiring ad hoc constraints targeting individual types of graphs, lacking generality. In the rest of this article I will present a new algorithm which lacks any parameters and whose two operations are generic and not geared towards any specific configuration.

3 The Match-Extend algorithm

This section will present the Match-Extend algorithm and follow up with a demonstration of its operation on various attested Multiprecedence topologies.

The input to the algorithm is the set of pairs of segments in immediate precedence relation without the affix, e.g. #k, kæ, æt, t% for the English stem kæt, and the set of pairs of segments corresponding to the precedence links added by the affix, e.g. tz, z% when the plural is added.

Intuitively, the algorithm starts from the morphologically added links and extends outwards by following the precedence links in the StemSet, the set of all precedence links in the stem to which the morpheme is being added. If there is more than one morphologically added link, they all extend in parallel and collapse together if one string ends in one or more segments and the other begins with the same segment or segments. A working version of this algorithm coded in Python will be included as supplementary material.

3.1 Match-Extend in action

Consider first total reduplication as in Fig. 2c above. Fig. 4 shows the full derivation of k@ra-k@ra with total reduplication. As there is only one morphologically-added link, no Match step will happen.

1. The precedence links of the stem begin in a set StemSet.
2. The morphologically added links begin in a set WorkSpace.
3. Whenever two strings in the WorkSpace match such that the end of one string is identical to the beginning of the other, the operation Match collapses the two into one string such that the shared part appears once, e.g. abcd and cdef to abcdef. A Match along multiple characters is done first.
4. When there is no match within the WorkSpace, the operation Extend simultaneously lengthens all strings in the WorkSpace to the right and left using matching precedence links of the stem. StemSet remains unchanged.
5. Steps 3 and 4 are repeated until # and % have been reached by Extend and there is a single string in the WorkSpace.

Algorithm 1: The Match-Extend Algorithm (informal version).

StemSet: #k, k@, @r, ra, a%
WorkSpace: ak
Extend: rak@
Extend: @rak@r
Extend: k@rak@ra
Extend: #k@rak@ra%

Figure 4: Match-Extend derivation of k@ra-k@ra.

Let us turn to more complex graphs discussed in the literature. Raimy discusses a process of CV reduplication in Tohono O'odham involving reduplicated patterns such as babaD to ba-b-baD and cipkan to ci-cpkan, requiring graphs as in Raimy (2000a, p. 114). Although there are multiple plausible paths through this graph, only one is attested, and this path requires traversing the graph by following the back link before the front link, even though the front link would be encountered first in a traversal.
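The paper states that a working Python version is included as supplementary material; the sketch below is an independent minimal reconstruction of Algorithm 1, not the author's code. It assumes one character per segment (so a digraph like gw would be written with a single placeholder character) and encodes each precedence link as a two-character string; the function name `match_extend` is mine.

```python
def match_extend(stem_links, affix_links):
    """Serialize a multiprecedence graph with Match-Extend (sketch).

    stem_links: the StemSet, precedence pairs of the stem as
        two-character strings, e.g. {"#k", "k@", "@r", "ra", "a%"}.
    affix_links: the initial WorkSpace, the morphologically added
        links, e.g. ["ak"] for total reduplication of k@ra.
    """
    workspace = list(affix_links)
    while not (len(workspace) == 1
               and workspace[0].startswith("#")
               and workspace[0].endswith("%")):
        # Match: collapse two strings when the end of one is identical
        # to the beginning of the other, preferring the longest overlap
        # ("a Match along multiple characters is done first").
        best = None
        for i, s in enumerate(workspace):
            for j, t in enumerate(workspace):
                if i == j:
                    continue
                for k in range(min(len(s), len(t)), 0, -1):
                    if s.endswith(t[:k]):
                        if best is None or k > best[0]:
                            best = (k, i, j)
                        break
        if best is not None:
            k, i, j = best
            merged = workspace[i] + workspace[j][k:]
            workspace = [w for n, w in enumerate(workspace)
                         if n not in (i, j)] + [merged]
            continue
        # Extend: lengthen every string one step leftward and rightward
        # using the stem's precedence links; StemSet is unchanged.
        extended = []
        for s in workspace:
            for link in stem_links:
                if link[1] == s[0]:       # link ends where s begins
                    s = link[0] + s
                    break
            for link in stem_links:
                if link[0] == s[-1]:      # link starts where s ends
                    s = s + link[1]
                    break
            extended.append(s)
        workspace = extended
    return workspace[0]

# Total reduplication of k@ra (Fig. 4):
# match_extend({"#k", "k@", "@r", "ra", "a%"}, ["ak"]) -> "#k@rak@ra%"
```

Because the stem is a linear string in which each segment appears once, every string has at most one left and one right extension, so Extend is deterministic. The same function reproduces the Tohono O'odham and Nancowry derivations shown below.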

# → c → i → p → k → a → n → % plus the added arcs i → c and c → p

Figure 5: Tohono O'odham ci-cpkan.

The Match-Extend algorithm will correctly derive the attested form, as shown in Fig. 6. Right away the strings ic and cp match, as one starts with the node c and the other ends with the same node. The two are collapsed as icp and then keep extending.

StemSet: #c, ci, ip, pk, ka, an, n%
WorkSpace: cp, ic
Match: icp
Extend: cicpk
Extend: #cicpka
Extend: #cicpkan
Extend: #cicpkan%

Figure 6: Match-Extend derivation of Tohono O'odham cicpkan.

A similarly complex graph is needed in Nancowry. Raimy (2000a, p. 81) discusses examples like Nancowry reduplication of the last consonant toward the beginning of the word, e.g. sut 'to rub' to Pit-sut, which requires a graph as in Fig. 7. However, here the opposite order of traversal must be followed, not skipping the first forward link. I assume here, like Raimy, that the glottal stop is epenthetic and added after serialization. Here, not taking the first link would result in the wrong output [*sutsut]. So this form requires the first morphologically-added link to be taken to produce the correct form.

# → s → u → t → % plus the added arcs # → i, i → t, and t → s

Figure 7: Nancowry Pit-sut.

Again, Match-Extend will serialize Fig. 7 without any further parameter, as in Fig. 8. The three strings #i, it, and ts can match right away into a single string #its, which will keep extending.

As these examples illustrate, Match-Extend does not need to be specified with look-ahead, global considerations, or graph-by-graph specifications of serialization to derive the attested serialization of graphs like Fig. 5 or Fig. 7. The serialization starts in parallel from the two added links, which extend until they reach each other in the middle, and this will work regardless of the order in which 'backward' and 'forward' arcs are located. They will meet in one direction and serialize in this order.

Another interesting topology is found in the analysis of Lushotseed. Fitzpatrick & Nevins

StemSet: #s, su, ut, t%
WorkSpace: #i, it, ts
Match: #it, ts
Match: #its
Extend: #itsu
Extend: #itsut
Extend: #itsut%

Figure 8: Derivation of Nancowry Pitsut.

(2002; 2004) observed that in cases where multiple reduplication processes of different sizes happen to the same form, with multiple morphologically added arrows forking away from the same segment, these graphs are seemingly universally serialized such that they follow the shorter arc first. They discuss Lushotseed forms with both distributive and Out-Of-Control (OOC) reduplication. They argue, on the basis of the fact that in either scope order the form is serialized in the same way, that the two reduplications are serialized simultaneously. This implies forms like gwad 'talk', surfacing in the distributive OOC or the OOC distributive as gwad-ad-gwad, requiring a graph like Fig. 9.

# → gw → a → d → % plus the added arcs d → a and d → gw

Figure 9: Lushotseed gwad-ad-gwad.

Fitzpatrick & Nevins (2002; 2004) proposed an ad hoc constraint to handle this type of scenario, the constraint SHORTEST, enforcing serializations that follow the shorter arrow first. But Match-Extend derives the attested pattern without any further assumptions. Consider the derivation of the Lushotseed form in Fig. 9. After one Extend step, the two strings adgwa and adad match along the nodes ad. You might notice that the two strings also match in the other order along the node a, so we must assume the reasonable principle that in case of multiple matches the best match, meaning the match along more nodes, is chosen. From that point on adadgwa extends into the desired form.

It is somewhat intuitive to see why this works: because Match-Extend applies one step of Extend at a time and must Match and collapse separate


StemSet: #gw, gwa, ad, d%
WorkSpace: dgw, da
Extend: adgwa, adad
Match: adadgwa
Extend: gwadadgwad
Extend: #gwadadgwad%

Figure 10: Derivation of Lushotseed gwadadgwad.

strings from the WorkSpace immediately when a Match is found, two arcs added by the morphology will necessarily match in the direction in which they are the closest. The end of the d→a arc is closer to the beginning of the d→gw one than vice versa, and hence the two will join in this direction and therefore surface in this order. This can be generalized as Fig. 11.

• If the graph contains two morphologically added links α → β and γ → δ, and
  – there is a unique path X from β to γ not going through α → β or γ → δ, and
  – there is a unique path Y from δ to α not going through α → β or γ → δ,
• then the Match-Extend algorithm will output a string containing:
  – ...αβ...γδ... if X is shorter than Y
  – ...γδ...αβ... if Y is shorter than X

Figure 11: Closest Attachment in Match-Extend.

Note that this is not a new assumption: this is a theorem of the model, derivable from the way Match and Extend interact with multiple morphologically added arcs. This can allow us to work out some serializations without having to do the whole derivation.
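Closest Attachment can indeed be applied without running a full derivation. The helper below (`predict_order` is a hypothetical name, not from the paper) assumes a linear stem written as a string with one character per segment, so the path lengths X and Y of Fig. 11 reduce to positional distances in the stem:

```python
def predict_order(stem, link1, link2):
    """Apply Closest Attachment (Fig. 11) to two morphologically
    added links over a linear stem such as "#Gad%" (one character
    per segment; G stands in for gw).  Each link is an
    (alpha, beta) pair of stem symbols.  Returns the link whose
    pair surfaces first in the serialized output."""
    (alpha, beta), (gamma, delta) = link1, link2
    # Path X runs from beta to gamma, path Y from delta to alpha;
    # on a linear stem these are simple positional distances.
    x = stem.index(gamma) - stem.index(beta)
    y = stem.index(alpha) - stem.index(delta)
    return link1 if x < y else link2
```

For Lushotseed gwad the added links are d → a and d → gw; X (from a to d) is shorter than Y (from gw to d), so d → a surfaces first, giving gwad-ad-gwad. For Nlaka'pamuctsin sil the links l → s and i → s yield sil-si-sil the same way.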

Consider for instance the Nlaka'pamuctsin distributive+diminutive double reduplication, e.g. sil 'calico' to sil-si-sil (Broselow, 1983). This pattern requires the Multiprecedence graph to look as in Fig. 12.

# → s → i → l → % plus the added arcs l → s and i → s

Figure 12: Nlaka'pamuctsin sil-si-sil.

The graph in Fig. 12 is simply the transpose of a graph where SHORTEST would apply, like Fig. 9, but it does not actually fit the pattern of SHORTEST, as its two 'backward' arrows do not start from the same node. In fact, if anything, SHORTEST would predict the wrong surface form, as *si-sil-sil would result if the shorter path were taken first. In Match-Extend and Closest Attachment (Fig. 11) the prediction is clear: the graph is predicted to serialize as sil-si-sil because the path from l→s to i→s is shorter than the path from i→s to l→s, thus deriving the correct string.

Fitzpatrick and Nevins (2002) report some forms with graphs like Fig. 12 which must be linearized in ways that would contradict Match-Extend, such as saxw to sa-saxw-saxw in Lushotseed Diminutive+Distributive forms. But contrary to the Distributive+OOC forms discussed earlier, there is no independent evidence here for the two reduplications being serialized together. I therefore assume that those instances consist of two separate cycles, serialized one at a time: saxw to saxw-saxw to sa-saxw-saxw. Match-Extend therefore relies on cyclicity, with the graph built up through affixation and serialized multiple times over the course of the derivation.

3.2 Non-Edge Fixed Segmentism

Fixed segmentism refers to cases of reduplication where a segment of one copy is overridden by one or more fixed segments. A well-known English example is schm-reduplication, like table to table-schmable, where schm- replaces the initial onset. I will call Non-Edge Fixed Segmentism (NEFS) the special case of fixed segmentism where the fixed segment is not at the edge of one of the copies. These are the cases where the graph needed is like Fig. 13 or Fig. 14.

# → a → b → c → d → e → % plus a back arc e → a and the fixed-segment arcs a → x and x → c

Figure 13: NEFS 'early' in the copy.

# → a → b → c → d → e → % plus a back arc e → a and the fixed-segment arcs c → x and x → e

Figure 14: NEFS 'late' in the copy.


Closest Attachment in Match-Extend predicts that if a fixed segment is added towards the beginning of the form, it should surface in the second copy, and if it is added towards the end of the form, it should surface in the first copy. In other words, the fixed segment will always occur in the copy that places it closer to the juncture of the two copies. The graph in Fig. 13 will serialize as abcde-axcde and the graph in Fig. 14 will serialize as abcxe-abcde. This follows from the properties of Match and Extend: as the precedence pairs of the overwriting segment and the precedence pair of the backward link extend outward, the growing string will reach either the left or the right side first, and this determines the order in which the copies appear in the final serialized form.

This prediction is borne out by many examples of productive patterns of reduplication with NEFS, such as Marathi saman-suman (Alderete et al., 1999, citing Apte 1968), Bengali sajôa-sujôa (Khan, 2006, p. 14), and Kinnauri migo-m@go (Chang, 2007).

Apparent counterexamples exist, but have other plausible analyses. A major one worth discussing briefly is the previous multiprecedence analysis of the Javanese Habitual-Repetitive as described by Yip (1995; 1998). Most forms surface with a fixed /a/ in the first copy, as in elaN-eliN 'remember'. This requires a graph such as Fig. 15, which serializes in conformity with Match-Extend.

# → e → l → i → N → % plus a back arc N → e and the fixed-segment arcs l → a and a → N

Figure 15: Javanese elaN-eliN.

However, when the first copy already contains /a/ as the second vowel, the form is realized with /e/ in the second copy, as in udan-uden 'rain'. Idsardi and Shorey (2007) and McClory and Raimy (2007) have analyzed this as a phonologically-conditioned allomorph with fixed segment /e/ that must be serialized differently from the /a/ allomorph, with the overwriting vowel in the second copy, i.e. a graph such as Fig. 16 that does not serialize in conformity with Match-Extend. Idsardi and Shorey (2007) and McClory and Raimy (2007) use this example to argue for a system of stacks that serialization must read from the top down. Precedence arcs in turn can be lexically parametrized as to whether they are added on top or at the bottom of the stack upon affixation, thus deriving elaNeliN from the /a/ allomorph being on top of the stack and traversed early, and udanuden from the /e/ allomorph being at the bottom of the stack and traversed late. This freedom of lexical specification grants their system the power to enforce any order needed, including the capacity to handle the 'look-ahead' and 'shortest' cases above in terms of full lexical specification. They could also easily handle languages with the equivalent of a LONGEST constraint. This model is less predictive while also being more complex.

Figure 16: Javanese udan-uden according to Idsardi and Shorey (2007) and McClory and Raimy (2007).

But this complexity is unneeded if we instead adopt a dissimilation analysis closer in spirit to Yip’s original Optimality-Theoretic analysis. We can say that the /a/ of the first copy is an overwritten /a/ in both elaN-eliN and udan-uden, and that a phonological process causes dissimilation of the root /a/ in the presence of the added /a/. In Optimality Theory this requires an appeal to the Obligatory Contour Principle operating between the two copies, but in Multiprecedence the dissimilation is even simpler to state, because the two /a/’s are very local in the graph. We simply need a rule to the effect of raising a stem /a/ in the context of a morphologically-added /a/ that precedes the same segment, as in Fig. 17.

Figure 17: Dissimilation rule (stem a ⇒ e in the context of a morphologically-added /a/ preceding the same segment X).

Figure 18: Derivation of udan-uden (# u d a n % with added /a/ ⇒ # u d e n % with added /a/).

There is therefore no need to abandon Match-Extend on the basis of Javanese.

Consider another apparent counterexample to the prediction: the Palauan root /rEb@th/ forms its distributive with CVCV reduplication and the verbal prefix m@-, forming m@-r@b@-rEb@th (Zuraw, 2003). At first blush, one may be tempted to see the first schwa of the first copy as overwriting the root’s /E/. But the presence of this schwa actually follows from the independently-motivated phonology of Palauan, in which all non-stressed vowels go to [@]. This is thus the result of a phonological rule applying after serialization, about which Match-Extend has nothing to say.

Relatedly, other apparent issues may be caused by interactions with phonology. D’souza (1991, p. 294) describes how echo-formation in some Munda languages is accomplished by replacing all the vowels in the second copy with a fixed vowel, e.g., Gorum bubuP ‘snake’ > bubuP-bibiP. Fixed segmentism of each vowel individually may not be the best analysis of these forms; there may instead be a single fixed segment and a separate process of vowel harmony or something along those lines. This type of complex interaction of non-local phonology with reduplication has been investigated before in Multiprecedence, e.g., the analyses of Tuvan vowel harmony in reduplicated forms in Harrison and Raimy (2004) and Papillon (2020, §7.1), but these analyses make extra assumptions about possible Multiprecedence structures that go far beyond the basics explored here. The subject requires further exploration, but it appears to be more an issue of phonology and representation than of serialization per se.

Apparent counterexamples will have to be approached on a case-by-case basis, but I have not identified many problematic examples so far that did not turn out to be errors of analysis.1

1One such apparent counterexample is worth briefly commenting on here, due to its being mentioned in well-known surveys of reduplication. This alleged reduplication is from Macdonald and Darjowidjojo (1967, p. 54) and is repeated in Rubino (2005, p. 16): Indonesian belat ‘screen’ to belat-belit ‘underhanded’. If correct, this example would be a counterexample to Match-Extend, as a fixed /i/ must surface in the second copy. However, this pair seems to be misidentified. The English-Indonesian bilingual dictionary by Stevens and Schmidgall-Tellings (2004) lists a word belit meaning ‘crooked, cunning, deceitful, dishonest, underhanded’, which semantically seems like a more plausible source for the reduplicated form belat-belit and fits the predictions of Match-Extend. The same dictionary’s entry under belat lists some screen-related entries and then belat-belit as meaning ‘crooked, devious, artful, cunning, insincere’, cross-referencing belit as the base. I conclude that this example

We have therefore seen that Match-Extend can straightforwardly account for a number of attested complex reduplicative patterns without any special stipulations. More interestingly, Match-Extend makes strong novel predictions about the location of fixed segments. I have not been able to locate many examples of NEFS in the literature. For example, the typology of fixed segmentism in Alderete et al. (1999) does not contain any example of NEFS. This will require further empirical research.

4 One limitation of Match-Extend: overly symmetrical graphs

There is a gap in the predictions of Fig. 11: Closest Attachment predicts that morphologically-added edges will attach in the order in which they are closest, which relies on an asymmetry in the form such that morphologically-added links are closer in one order than the other. This leaves the problem of symmetrical forms like Fig. 19. The former of these was posited in the analysis of Semai continuative reduplication by Raimy (2000a, p. 146-47) for forms like dNOh ‘appearance of nodding’ to dh-dNOh; the latter would be needed in various languages reduplicating CVC forms with vowel changes, such as the Takelma aorist described in Sapir (1922, p. 58), like t’eu ‘play shinny’ to t’eut’au.

Figure 19: Two structures overly symmetrical for Match-Extend.

These are the forms which, in the course of Match-Extend, will come to a point where Match is indeterminate, because two strings could match equally well in either direction. For example, the WorkSpace of the first of these structures will start with ac and ca, which can match either as aca or cac. The former would extend into #acabc% and the latter into #abcac%. Match-Extend as stated so far is therefore indeterminate with regard to these symmetrical forms.

This is not an insurmountable problem for Match-Extend. To the contrary, this is a problem

was misidentified by previous authors and is unproblematic for Match-Extend.

of having too many solutions without a way to decide between them, none of which requires adding parametrization to Match-Extend. Maybe symmetrical forms crash the derivation, and all apparent instances in the literature must contain some hidden asymmetry. It is worth noting that the pattern in Fig. 19 attested in Semai has a close cognate in Temiar, but in this language the symmetrical structure is obtained only for simple onsets: kOw ‘call’ gives kw-kOw, but slOg ‘sleep with’ gives s-g-lOg (Raimy, 2000a, p. 146). This asymmetry resolves the Match-Extend derivation. It may simply be the case that the forms that look symmetrical have a hidden asymmetry in the form of silent segments, for example if the root has an X at the start, as in Fig. 21. This is obviously very ad hoc and powerful, so minimally we should seek language-internal evidence for such a segment before jumping to conclusions.

Figure 20: Temiar s-g-lOg.

Figure 21: Semai kw-kOw with a hidden asymmetry in the form of a segment X without a phonetic correlate, which breaks the symmetry.

Alternatively, it could be that symmetrical forms lead to both options being constructed, and this optionality is resolved in extra-grammatical ways. I will leave this hole in the theory open, as a problem to be resolved through further research.

5 Conclusion

This article presents an invariant serialization algorithm for all morphological patterns in Multiprecedence.

The Multiprecedence research program has been fruitful in bringing various non-concatenative phenomena other than reduplication within the scope of a derivational item-and-arrangement model of morphology, including, e.g., subtractive morphology (Gagnon and Piche, 2007), Semitic templatic morphology (Raimy, 2007), and vowel harmony, word tone, and allomorphy (Papillon, 2020). A serialization algorithm capable of handling these structures is crucial for the completeness of the theory.

As pointed out by a reviewer, it is crucial to develop a typology of the possible attested graphical input structures to the algorithm, so as to properly characterize and formalize the algorithm needed. In every form discussed here, the root is implicitly assumed to be underlyingly linear, and affixes alone add some topological variety to the graphs, as is mostly the case in all the forms from Raimy (1999, 2000a). Elsewhere I have challenged this idea by positing parallel structures both underlyingly and in the output of phonology (Papillon, 2020). If these structures are allowed in Multiprecedence Phonology, then Match-Extend will need to be amended or enhanced to handle more varied structures.

In this paper I proposed a model that departs from previous ones in being framed as patching a path from the morphologically-added links toward # and %, from the inside out, as opposed to the existing models, which seek to give a set of instructions to correctly traverse the graph from # to %, from beginning to end.

References

John Alderete, Jill Beckman, Laura Benua, Amalia Gnanadesikan, John McCarthy, and Suzanne Urbanczyk. 1999. Reduplication with fixed segmentism. Linguistic Inquiry, 30(3):327–364.

Ellen I. Broselow. 1983. Salish double reduplications: subjacency in morphology. Natural Language & Linguistic Theory, 1(3):317–346.

Charles B. Chang. 2007. Accounting for the phonology of reduplication. Presentation at LASSO XXXVI. Available at https://cbchang.com/curriculum-vitae/presentations/.

Jean D’souza. 1991. Echos in a sociolinguistic area. Language Sciences, 13(2):289–299.

Justin Fitzpatrick and Andrew Nevins. 2002. Phonological occurrences: Relations and copying. In Second North American Phonology Conference, Montreal.

Justin Fitzpatrick and Andrew Nevins. 2004. Linearization of nested and overlapping precedence in multiple reduplication.

Michael Gagnon and Maxime Piche. 2007. Principles of linearization & subtractive morphology. In CUNY Phonology Forum Conference on Precedence Relations.


K. David Harrison and Eric Raimy. 2004. Reduplication in Tuvan: Exponence, readjustment and phonology. In Proceedings of the Workshop in Altaic Formal Linguistics, volume 1. Citeseer.

William Idsardi and Rachel Shorey. 2007. Unwinding morphology. In CUNY Phonology Forum Workshop on Precedence Relations.

S. D. Khan. 2006. Similarity Avoidance in Bengali Fixed-Segment Reduplication. Ph.D. thesis, University of California.

Roderick Ross Macdonald and Soenjono Darjowidjojo. 1967. A student’s reference grammar of modern formal Indonesian. Georgetown University Press.

Daniel McClory and Eric Raimy. 2007. Enhanced edges: morphological influence on linearization. Poster presented at the 81st Annual Meeting of the Linguistic Society of America, Anaheim, CA.

Maxime Papillon. 2020. Precedence and the Lack Thereof: Precedence-Relation-Oriented Phonology. Ph.D. thesis. https://drum.lib.umd.edu/handle/1903/26391.

Eric Raimy. 1999. Representing reduplication. Ph.D. thesis, University of Delaware.

Eric Raimy. 2000a. The phonology and morphology of reduplication, volume 52. Walter de Gruyter.

Eric Raimy. 2000b. Remarks on backcopying. Linguistic Inquiry, 31(3):541–552.

Eric Raimy. 2007. Precedence theory, root and template morphology, priming effects and the structure of the lexicon. CUNY Phonology Symposium Precedence Conference, January.

Carl Rubino. 2005. Reduplication: Form, function and distribution. Studies on Reduplication, pages 11–29.

Bridget Samuels. 2009. The structure of phonological theory. Harvard University.

Edward Sapir. 1922. Takelma. In Franz Boas, editor, Handbook of American Indian Languages. Smithsonian Institution, Bureau of American Ethnology.

Alan M. Stevens and A. Ed. Schmidgall-Tellings. 2004. A comprehensive Indonesian-English dictionary. PT Mizan Publika.

Moira Yip. 1995. Repetition and its avoidance: The case in Javanese.

Moira Yip. 1998. Identity avoidance in phonology and morphology. In Morphology and its Relation to Phonology and Syntax, pages 216–246. CSLI Publications.

Kie Zuraw. 2003. Vowel reduction in Palauan reduplicants. In Proceedings of the 8th Annual Meeting of the Austronesian Formal Linguistics Association [AFLA 8], pages 385–398.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 32–38. August 5, 2021. ©2021 Association for Computational Linguistics

Incorporating tone in the calculation of phonotactic probability

James P. Kirby
University of Edinburgh
School of Philosophy, Psychology, and Language Sciences
[email protected]

Abstract

This paper investigates how the ordering of tone relative to the segmental string influences the calculation of phonotactic probability. Trigram and recurrent neural network models were trained on syllable lexicons of four Asian syllable-tone languages (Mandarin, Thai, Vietnamese, and Cantonese) in which tone was treated as a segment occurring in different positions in the string. For trigram models, the optimal permutation interacted with language, while neural network models were relatively unaffected by tone position in all languages. In addition to providing a baseline for future evaluation, these results suggest that phonotactic probability is robust to choices of how tone is ordered with respect to other elements in the syllable.

1 Introduction

The phonotactic probability of a string is an important quantity in several areas of linguistic research, including language acquisition, wordlikeness, word segmentation, and speech production and perception (Bailey and Hahn, 2001; Daland and Pierrehumbert, 2011; Storkel and Lee, 2011; Vitevitch and Luce, 1999). When the language of interest is a tone language, the question arises of how tone should be incorporated into the probability calculation. As phonotactic probability is frequently computed based on some type of n-gram model, this means deciding on which segment(s) the probability of a tone should be conditioned. For instance, using a bigram model, one might compute the probability of the Mandarin syllable fāng as P(a|f) × P(N|a) × P(tone 1|N), but could just as well consider P(tone 1|f) × P(a|1) × P(N|a), or any other conceivable permutation of tone and segments.
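To make the choice concrete, here is a toy sketch (my own illustration over an invented four-syllable mini-lexicon, not the paper's data or code) of scoring one syllable under two tone orderings with a maximum-likelihood bigram model:

```python
import math
from collections import Counter

# Toy illustration (invented mini-lexicon, not the paper's data): train an
# MLE bigram model with "#" and "%" as edge markers, then score a syllable.
def train_bigrams(lexicon):
    uni, bi = Counter(), Counter()
    for syll in lexicon:
        seq = ["#", *syll, "%"]
        uni.update(seq[:-1])               # context counts
        bi.update(zip(seq, seq[1:]))       # bigram counts
    return uni, bi

def logprob(syll, uni, bi):
    seq = ["#", *syll, "%"]
    return sum(math.log(bi[p] / uni[p[0]]) for p in zip(seq, seq[1:]))

# Digits stand in for tones.  T|C places tone after the coda; T|# before the onset.
tone_final = ["faN1", "fan2", "maN1", "ma2"]
tone_first = [s[-1] + s[:-1] for s in tone_final]

u1, b1 = train_bigrams(tone_final)
u2, b2 = train_bigrams(tone_first)
print(logprob("faN1", u1, b1))   # tone predictable from the coda
print(logprob("1faN", u2, b2))   # initial tone adds uncertainty at the left edge
```

In this tiny lexicon the tone-final ordering assigns the syllable a probability of 1/4 versus 1/8 under the tone-initial ordering, because after the coda /N/ the tone is fully determined; real-lexicon analogues of such asymmetries are what the experiments below quantify.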

While this issue is occasionally remarked on (e.g., Newman et al., 2011: 246), there remains no widespread consensus in practice. The choice of ordering is sometimes justified based on segment-tone co-occurrence restrictions in the language under study (Myers and Tsay, 2005), but is often presented without justification (Kirby and Yu, 2007; Yang et al., 2018), and in some cases tone is simply ignored (Gong, 2017). When the space of possibilities is considered, researchers generally select the permutation which maximizes model fit to some external data, such as participant judgments of phonological distance (Do and Lai, 2021a) or wordlikeness (Do and Lai, 2021b).

Although extrinsic evaluation is in some sense a gold standard, intrinsic metrics of model fit can also be informative, in part because extrinsic metrics are not always robust across data sets. For instance, participant wordlikeness judgments can vary considerably based on the particulars of the experimental design (Myers and Tsay, 2005; Shademan, 2006; Vitevitch and Luce, 1999), so the treatment of tone that produces a best-fit model for one dataset may not do so for another. The lexicon of a given language is much more internally stable in terms of how segments and tones are distributed, so intrinsic evaluation may provide a useful baseline for reasoning about the treatment of tone relative to segments both within and across languages.

This short paper considers a simple information-theoretic motivation for selecting a permutation: all else being equal, we should prefer a model that maximizes the probability of the lexicon (i.e., minimizes the cross-entropy loss), because this will be the model that by definition does the best job of capturing the phonotactic regularities of the lexicon (Cherry et al., 1953; Goldsmith, 2002; Pimentel et al., 2020). By treating tone as another phone in the segmental string, we can see whether and to what degree this choice has an effect on the overall entropy of the lexicon.

Intuitively, any model that can take into account phonotactic constraints will result in a reduction in entropy. Thus, even an n-gram model with a sufficiently large context window should in principle be able to model segment-tone co-occurrences at the syllable level. However, tone languages differ with respect to tone-segment co-occurrence restrictions (see Sec. 2). If a relevant constraint primarily targets syllable onsets, for instance, placing the tonal “segment” in immediate proximity to the onset will increase the probability of the string, even relative to a model capable of capturing the dependency at a longer distance.

2 Languages

Four syllable-tone languages were selected for this study: Mandarin Chinese, Cantonese, Vietnamese, and Thai. They are partially a convenience sample, in that the necessary lexical resources were readily available, but they also have some useful similarities: all share a similar syllable structure template and have five or six tones. However, the four languages vary in terms of their segment-tone co-occurrence restrictions, as detailed below.

In all cases, the lexicon was defined as the set of unique syllable shapes in each language. For consistency, the syllable template in all four languages is considered to be (C1)(C2)V(C)T, with variable positioning of T. Offglides were treated as codas in all languages. The syllable lexicons for all four languages are provided in the supplementary materials (http://doi.org/10.17605/OSF.IO/NA5FB).

Mandarin (cmn) The Mandarin syllabary consists of 1,226 syllables, based on a list of attested readings of the 13,060 BIG5 characters from Tsai (2000), phonetized using the phonological system of Duanmu (2007). This representation encodes 22 onsets, 3 medials (/j ɥ w/), 6 nuclei, 4 codas, and 5 tones (including the neutral tone). In Mandarin, unaspirated obstruent onsets rarely appear with the mid-rising tone (MC yang ping), and sonorant onsets rarely occur with the high-level tone (MC yin ping). Obstruents never occur as codas.

Thai (tha) A Thai lexicon of 4,133 unique syllables was created based on the dictionary of Haas (1964), which contains around 19,000 entries and 47,000 syllables. The phonemic representation encodes 20 onsets, 3 medials (/w l r/), 21 nuclei (vowel length being contrastive in Thai), 8 codas, and 5 tones. In Thai, high tone is rare/unattested following unaspirated and voiced onsets, but there is also statistical evidence for a restriction on rising tones with these onsets (Perkins, 2013). In syllables with an obstruent coda (/p t k/), only high, low, or falling tones occur, depending on the length of the nuclear vowel (Morén and Zsiga, 2006).

Vietnamese (vie) The Vietnamese lexicon of 8,128 syllables was derived from a freely available dictionary of around 74,000 words (Đức, 2004), phonetized using a spelling pronunciation (Kirby, 2008). The resulting representation encodes 24 onsets, 1 medial (/w/), 14 nuclei, 8 codas, and 6 tones. Vietnamese syllables ending in obstruents /p t k/ are restricted to just one of two tones.

Cantonese (yue) The Cantonese syllabary consists of the 1,884 unique syllables in the Chinese Character Database (Kwan et al., 2003), encoded using the Jyutping system. This representation distinguishes 22 onsets, 1 medial (/w/), 11 nuclei, 5 codas, and 6 tones. In Cantonese, unaspirated initials do not occur in syllables with the low-falling tone, and aspirated initials do not occur with the low tone. Syllables ending with /p t k/ are restricted to one of the three “entering” tones (Yue-Hashimoto, 1972).

3 Methods

Two classes of character-level language models (LMs) were considered: simple n-gram models and recurrent neural networks (Mikolov et al., 2010). In an n-gram model, the probability of a string is proportional to the conditional probabilities of the component n-grams:

P(x_i | x_1^{i-1}) ≈ P(x_i | x_{i-n+1}^{i-1})   (1)

The degree of context taken into account is thus determined by the value chosen for n.

In a recurrent neural network (RNN), the next character in a sequence is predicted using the current character and the previous hidden state. At each step t, the network retrieves an embedding for the current input x_t and combines it with the hidden layer from the previous step to compute a new hidden layer h_t:

h_t = g(U h_{t-1} + W x_t)   (2)

where W is the weight matrix for the current time step, U the weight matrix for the previous time step, and g is an appropriate non-linear activation function. This hidden layer h_t is then used to generate an output layer y_t, which is passed through a softmax layer to generate a probability distribution over the entire vocabulary. The probability of a sequence x_1, x_2, …, x_z is then just the product of the probabilities of each character in the sequence:

P(x_1, x_2, …, x_z) = ∏_{i=1}^{z} y_i   (3)

The incorporation of the recurrent connection as part of the hidden layer allows RNNs to avoid the problem of limited context inherent in n-gram models, because the hidden state embodies (some type of) information about all of the preceding characters in the string. Although RNNs cannot capture arbitrarily long-distance dependencies, this is unlikely to make a difference for the relatively short distances involved in phonotactic modeling.

Trigram models were built using the SRILM toolkit (Stolcke, 2002), with maximum likelihood estimates smoothed using interpolated Witten-Bell discounting (Witten and Bell, 1991). RNN LMs were built using PyTorch (Paszke et al., 2019), based on an implementation by Mayer and Nelson (2020). The results reported here make use of simple recurrent networks (Elman, 1990), but similar results were obtained using an LSTM layer (Hochreiter and Schmidhuber, 1997).

3.1 Procedure

The syllables in each lexicon were arranged in 5 distinct permutations: tone following the coda (T|C), nucleus (T|N), medial (T|M), or onset (T|O), and with tone as the initial segment in the syllable (T|#). As many syllables in these languages lack onsets, medials, and/or codas, a sizable number of the resulting strings were identical across permutations. Both smoothed trigram and simple RNN LMs were then fit to each permuted lexicon 10 times, with random 80/20 train/dev splits (other splits produced similar results). For each run, the perplexity of the language model on the dev set D = x_1 x_2 … x_N (i.e., the exponentiated cross-entropy1) was recorded:

PPL(D) = b^{H(D)}   (4)
       = b^{-(1/N) log_b P(x_1 x_2 … x_N)}   (5)
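The procedure can be sketched as follows (hypothetical helper names, not the paper's code): one function re-inserts the tone symbol at each of the five positions studied, and another converts summed log probabilities into perplexity per equations (4)-(5), here with b = 2:

```python
# Hypothetical sketch of the procedure (invented helper names, not the
# paper's code).  A syllable is parsed into onset/medial/nucleus/coda plus a
# tone symbol, and the tone is re-inserted at one of the five positions.
def permute(onset, medial, nucleus, coda, tone, order):
    parts = {
        "T|C": [onset, medial, nucleus, coda, tone],
        "T|N": [onset, medial, nucleus, tone, coda],
        "T|M": [onset, medial, tone, nucleus, coda],
        "T|O": [onset, tone, medial, nucleus, coda],
        "T|#": [tone, onset, medial, nucleus, coda],
    }
    return "".join(p for p in parts[order] if p)  # drop empty slots

def perplexity(log2_probs, n_symbols):
    """PPL(D) = 2^(-(1/N) * log2 P(D)): the exponentiated cross-entropy."""
    return 2.0 ** (-sum(log2_probs) / n_symbols)

# Mandarin fang1: onset /f/, no medial, nucleus /a/, coda /N/, tone 1.
print(permute("f", "", "a", "N", "1", "T|O"))   # -> f1aN
print(permute("f", "", "a", "N", "1", "T|#"))   # -> 1faN
```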

4 Results

For brevity, only the main findings are summarized here; the full results are available as part of the online supplementary materials (http://doi.org/10.17605/OSF.IO/NA5FB).

Table 1 shows the orderings which minimized perplexity for each method and language, averaged over 10 runs. Table 2 shows the average perplexity over all permutations for a given language and method.

method   lexicon   order   PPL
3-gram   cmn       T|C     4.91 (0.06)
3-gram   tha       T|M     7.34 (0.12)
3-gram   vie       T|C     7.35 (0.03)
3-gram   yue       T|M     5.84 (0.09)
RNN      cmn       T|M     4.01 (0.08)
RNN      tha       T|M     5.20 (0.04)
RNN      vie       T|M     5.16 (0.02)
RNN      yue       T|#     4.37 (0.05)

Table 1: Orders which produced the lowest perplexities, averaged over 10 runs (means and standard deviations).

Differences between orderings were then assessed visually, aided by simple analyses of variance. For the trigram LMs, perplexity was lowest in Mandarin when tones followed codas, while differences in perplexity between other orderings were negligible. For Thai, Vietnamese, and Cantonese, all orderings were roughly comparable, except for when tone was ordered as the first segment in the syllable (T|#), which increased perplexity by up to 1 over the mean of the other orderings. For Thai, the ordering T|M resulted in significantly lower perplexities compared to all other

1Equivalently, we may think of PPL(D) as the inverse probability of the set of syllables D, normalized for the number of phonemes.


          cmn           tha           vie           yue
3-gram    5.15 (0.17)   7.76 (0.40)   7.49 (0.27)   5.98 (0.18)
RNN       4.01 (0.07)   5.28 (0.05)   5.18 (0.03)   4.42 (0.07)

Table 2: Mean and standard deviation of perplexity across all permutations, by lexicon and language model.

permutations. For the RNN LMs, although T|M was the numerically optimal ordering for three out of the four languages, in practical terms permutation had no effect on perplexity, with numerical differences of no greater than 0.1 (see Table 2).

5 Discussion

Consistent with other recent work in computational phonotactics (e.g., Mayer and Nelson, 2020; Mirea and Bicknell, 2019; Pimentel et al., 2020), the neural network models outperformed the trigram baselines by a considerable margin (a reduction in average perplexity of up to 2.5, depending on language). Neural network models were also much less sensitive to the linear position of tone relative to other elements in the segmental string (cf. Do and Lai, 2021b), no doubt due to the fact that the ability of the RNNs to model co-occurrence tendencies within the syllable is not constrained by context in the way that n-gram models are.

Perhaps as a result, however, the RNN models reveal little about the nature of segment-tone co-occurrence restrictions in any of the languages investigated. In this regard, the trigram models, while clearly less optimal in a global sense, are still informative. The fact that the ordering T|# was significantly worse under the trigram model for Cantonese, Vietnamese, and Thai, but not Mandarin, can be explained (or predicted) by the fact that, of the four languages, only Mandarin does not permit obstruent codas, and consequently has no coda-tone co-occurrence restrictions (indeed, the four primary tones of Mandarin occur with more or less equal type frequency). In the other three languages, syllables with obstruent codas can only bear a restricted set of tones, and in a trigram model this dependency is not modeled when tone is prepended to the syllable, since this means it will frequently, though not always, fall outside the window visible to the language model. Even a model with a large enough context window to capture such dependencies will assign the lexicon a higher perplexity when structured in this way.

The finding that the T|M ordering is always optimal in Thai (and by a larger margin than in the other languages) is presumably due to the fact that the distribution of the medials /w l r/ is severely restricted in this language, occurring only after /p ph t th k kh f/. The distribution of tones after onset-medial clusters is inherently more constrained and therefore more predictable. A similar restriction holds in Cantonese, albeit to a lesser degree (the medial /w/ only occurs with the onsets /k/ and /kh/).

5.1 Shortcomings and extensions

This work did not explore representations based on phonological features, given that their incorporation has failed to provide evaluative improvements in other studies of computational phonotactics (Mayer and Nelson, 2020; Mirea and Bicknell, 2019; Pimentel et al., 2020). However, feature-based approaches can be both theoretically insightful and may even prove necessary for other quantifications, such as the measure of phonological distance where tone is involved (Do and Lai, 2021a).

The present study has focused on a small sample of structurally and typologically similar languages. All have relatively simple syllable structures in which one and only one tone is associated with each syllable. Not all tone languages share these properties, however. In so-called “word-tone” languages, such as Japanese or Shanghainese, the surface tone with which a given syllable is realized is frequently not lexically specified. In other languages, such as Yoloxóchitl Mixtec (DiCanio et al., 2014), tonal specification may be tied to sub-syllabic units, such as the mora. Finally, data from many other languages, such as Kukuya (Hyman, 1987), make it clear that in at least some cases tones can only be treated in terms of abstract melodies, which do not have a consistent association to syllables, moras, or vowels (Goldsmith, 1976). In these and many other cases, careful consideration of the theoretical motivations justifying a particular representation is required before it makes sense to consider ordering effects.

However, to the extent that it is possible to generate a segmental representation of a tone language in which surface tones are indicated, what the present work suggests is that the precise ordering of the tonal symbols with respect to other symbols in the string is unlikely to have a significant impact on phonotactic probability. This follows from two assumptions (or constraints): first, that the set of symbols used to indicate tones is distinct from those used to indicate the vowels and consonants; and second, that one and only one such tone symbol appears per string domain (here, the syllable). If these two constraints hold, the complexity of the syllable template should in general have a greater impact on the entropy of the string set than the position of the tone symbol, although the number of unique tone symbols relative to the number of segmental symbols may also have an effect. According to Maddieson (2013) and Easterday (2019), languages with complex syllable structures (defined as those permitting fairly free combinations of two or more consonants in the position before a vowel, and/or two or more consonants in the position after the vowel) rarely have complex tone systems, or indeed tone systems at all, so this is unlikely to be an issue for most tone languages.

One possibility the present work did not address is whether it is even necessary, or desirable, to include tone in phonotactic probability calculations in the first place. The probability of the lexicon of a tonal language would surely change if tone were ignored, but whether listeners’ judgments of a sequence as well- or ill-formed are better predicted by a model that takes tone into account vs. one that does not is an empirical question (but see Kirby and Yu, 2007; Do and Lai, 2021b for some evidence that it may not). Similarly, for research questions focused on tone sandhi, or on the distributions of the tonal sequences themselves (tonotactics), the relevant computations will be restricted to the tonal tier in the first instance, and ordering with respect to segments may simply not be relevant (but see Goldsmith and Riggle, 2012).

Finally, the present study has focused on the lexical representation of tone, but in many languages tone primarily serves a morphological function. The SIGMORPHON 2020 Task 0 shared challenge (Vylomova et al., 2020) included inflection data from several tonal Oto-Manguean languages in which tone was orthographically encoded in different ways via string diacritics. While the authors noted the existence of these differences, it is unclear whether and to what extent the different representations of tones affected system performance. Similarly, the potential impact of tone ordering relative to other elements in the string has yet to be systematically investigated in this setting.

6 Conclusion

This paper has assessed how different permutations of tone and segments affect the perplexity of the lexicon in four syllable-tone languages, using two types of phonotactic language models: an interpolated trigram model and a simple recurrent neural network. The perplexities assigned by the neural network models were essentially unaffected by different choices of ordering; while the trigram model was more sensitive to permutations of tone and segments, the effects on perplexity remained minimal. In addition to providing a baseline for future evaluation, these results suggest that the phonotactic probability of a syllable is relatively robust to the choice of how tone is ordered with respect to other elements in the string, especially when using a model capable of encoding dependencies across the entire syllable.

Acknowledgments

This work was supported in part by the ERC EVOTONE project (grant no. 758605).

References

Todd Bailey and Ulrike Hahn. 2001. Determinants of wordlikeness: phonotactics or lexical neighborhoods? Journal of Memory and Language, 44:568–591.

E. Colin Cherry, Morris Halle, and Roman Jakobson. 1953. Toward the logical description of languages in their phonemic aspect. Language, 29(1):34–46.

Robert Daland and Janet B. Pierrehumbert. 2011. Learning diphone-based segmentation. Cognitive Science, 35(1):119–155.

Christian DiCanio, Jonathan D. Amith, and Rey Castillo García. 2014. The phonetics of moraic alignment in Yoloxóchitl Mixtec. In Proceedings of the 4th International Symposium on Tonal Aspects of Languages (TAL-2014), pages 203–210.

Youngah Do and Ryan Ka Yau Lai. 2021a. Accounting for lexical tones when modeling phonological distance. Language, 97(1):e39–e67.

Youngah Do and Ryan Ka Yau Lai. 2021b. Incorporating tone in the modelling of wordlikeness judgements. Phonology, 37:577–615.

San Duanmu. 2007. The phonology of standard Chinese, 2nd edition. Oxford University Press, Oxford.

Shelece Easterday. 2019. Highly complex syllable structure: A typological and diachronic study. Studies in Laboratory Phonology. Language Science Press.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

John Goldsmith. 1976. Autosegmental Phonology. Ph.D. thesis, MIT. [Published by Garland Press, New York, 1979.]

John Goldsmith. 2002. Phonology as information minimization. Phonological Studies, 5:21–46.

John Goldsmith and Jason Riggle. 2012. Information theoretic approaches to phonological structure: the case of Finnish vowel harmony. Natural Language and Linguistic Theory, 30:859–896.

Donald Shuxiao Gong. 2017. Grammaticality and lexical statistics in Chinese unnatural phonotactics. UCL Working Papers in Linguistics, 17:1–23.

Mary R. Haas. 1964. Thai-English student's dictionary. Stanford University Press, Stanford.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Larry M. Hyman. 1987. Prosodic domains in Kukuya. Natural Language & Linguistic Theory, 5(3):311–333.

James P. Kirby. 2008. vPhon: a Vietnamese phonetizer (version 2.1.1) [computer program]. https://github.com/kirbyj/vPhon.

James P. Kirby and Alan C. L. Yu. 2007. Lexical and phonotactic effects on wordlikeness judgments in Cantonese. In Proceedings of the 16th International Congress of the Phonetic Sciences, pages 1389–1392, Saarbrücken.

Tze-Wan Kwan, Wai-Sang Tang, Tze-Ming Chiu, Lei-Yin Wong, Denise Wong, and Li Zhong. 2003. Chinese character database with word-formations phonologically disambiguated according to the Cantonese dialect. http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/. Accessed 9 February 2007.

Ian Maddieson. 2013. Tone. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology.

Connor Mayer and Max Nelson. 2020. Phonotactic learning with neural language models. Proceedings of the Society for Computation in Linguistics, 3:16.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. INTERSPEECH 2010, pages 1045–1048.

Nicole Mirea and Klinton Bicknell. 2019. Using LSTMs to assess the obligatoriness of phonological distinctive features for phonotactic learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1595–1605. Association for Computational Linguistics.

Bruce Morén and Elizabeth Zsiga. 2006. The lexical and post-lexical phonology of Thai tones. Natural Language and Linguistic Theory, 24(1):113–178.

James Myers and Jane Tsay. 2005. The processing of phonological acceptability judgements. In Proc. Symposium on 90-92 NSC Projects, Taipei.

Ellen Hamilton Newman, Twila Tardif, Jingyuan Huang, and Hua Shu. 2011. Phonemes matter: The role of phoneme-level awareness in emergent Chinese readers. Journal of Experimental Child Psychology, 108(2):242–259.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.


Jeremy Perkins. 2013. Consonant-tone interaction in Thai. Ph.D. thesis, Rutgers, The State University of New Jersey.

Tiago Pimentel, Brian Roark, and Ryan Cotterell. 2020. Phonotactic complexity and its trade-offs. Transactions of the Association for Computational Linguistics, 8:1–18.

Shabnam Shademan. 2006. Is phonotactic knowledge grammatical knowledge? In Proceedings of the 25th West Coast Conference on Formal Linguistics, pages 371–379. Cascadilla Proceedings Project.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing Vol. 2, pages 901–904, Denver.

Holly L. Storkel and Su-Yeon Lee. 2011. The independent effects of phonotactic probability and neighbourhood density on lexical acquisition by preschool children. Language and Cognitive Processes, 26(2):191–211.

Chih-Hao Tsai. 2000. Mandarin syllable frequency counts for Chinese characters. http://technology.chtsai.org/syllable/. Accessed 10 March 2021.

Michael S. Vitevitch and Paul A. Luce. 1999. Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40:374–408.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39. Association for Computational Linguistics.

Ian H. Witten and Timothy C. Bell. 1991. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Shiying Yang, Chelsea Sanker, and Uriel Cohen Priva. 2018. The organization of lexicons: a cross-linguistic analysis of monosyllabic words. In Proceedings of the Society for Computation in Linguistics (SCiL) 2018, pages 164–173.

Anne O. Yue-Hashimoto. 1972. Studies in Yue Dialects 1: Phonology of Cantonese. Cambridge University Press.

Hồ Ngọc Đức. 2004. Vietnamese word list. http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html. Accessed 24 February 2021.


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 39–48, August 5, 2021. ©2021 Association for Computational Linguistics

MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology

Khuyagbaatar Batsuren1, Gábor Bella2, and Fausto Giunchiglia2,3

1 National University of Mongolia, Mongolia
2 DISI, University of Trento, Italy
3 College of Computer Science and Technology, Jilin University, China

Abstract

Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g., to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilingual morphological databases, we announce the release of MorphyNet, a high-quality resource covering 15 languages, with 519k derivational and 10.1M inflectional entries and a rich set of morphological features. MorphyNet was extracted from Wiktionary using both hand-crafted and automated methods, and was manually evaluated to have a precision higher than 98%. Both the resource generation logic and the resulting database are made freely available1,2 and are reusable as stand-alone tools or in combination with existing resources.

1 Introduction

Despite repeated paradigm shifts in computational linguistics and natural language processing, morphological analysis and its related tasks, such as lemmatization, stemming, or compound splitting, have always remained essential components within language processing systems. Recently, in the context of language models based on subword embeddings, a morphologically meaningful splitting of words has been shown to improve the efficiency of downstream tasks (Devlin et al., 2019; Sennrich et al., 2016; Bojanowski et al., 2017; Provilkov et al., 2020). In particular, the reintroduction of linguistically motivated approaches and high-quality linguistic resources into deep learning architectures has been crucial for dealing with morphologically rich—highly inflecting,

1 http://ukc.disi.unitn.it/index.php/MorphyNet

2 http://github.com/kbatsuren/MorphyNet

agglutinative—languages more efficiently (Pinnis et al., 2017; Vylomova et al., 2017; Ataman and Federico, 2018; Gerz et al., 2018).

In response to such needs, and as simple and convenient substitutes for monolingual morphological analyzers, multilingual morphological databases have been developed, indicating for each word form entry one or more corresponding root or dictionary entries, as well as analyses (features) (Kirov et al., 2018; Metheniti and Neumann, 2020; Vidra et al., 2019). The precision and recall of these resources vary wildly, and there is still a lot of ground to cover with respect to the support of new languages, the modelling of the inflectional and derivational complexity of each language, and the richness of the information (features, affixes, parts of speech, etc.) provided.

As a further step towards extending online morphological data, we introduce MorphyNet, a new database that addresses both derivational and inflectional morphology. Its current version covers 15 languages and has 519k derivational and 10.1M inflectional entries, as well as a rich set of features (lemma, parts of speech, morphological tags, affixes, etc.). Similarly to certain existing databases, MorphyNet was built from Wiktionary data; however, our extraction logic allows for a more exhaustive coverage of both derivational and inflectional cases.

The contributions of this paper are the freely available MorphyNet resource, the description of the data extraction logic and tool, also made freely accessible, as well as its evaluation and comparison to state-of-the-art multilingual morphological databases. Due to the limited overlap between the contents of these resources and MorphyNet, we consider it complementary and therefore usable in combination with them.

Section 2 of the paper presents the state of the art.

Section 3 gives details on our method for generating MorphyNet data. Section 4 presents the resulting resource, and Section 5 evaluates it. Section 6 concludes the paper.

Figure 1: The MorphyNet generation process and the input datasets used.

2 State of the Art

Ever since the early days of computational linguistics, morphological analysis and its related tasks—such as stemming and lemmatization—have been part of NLP systems. Earlier grammar-based systems used finite-state transducers or affix-stripping techniques, and some of them were already multilingual and capable of tackling morphologically complex languages (Beesley and Karttunen, 2003; Trón et al., 2005; Inxight, 2005). However, due to the costliness of producing the grammar rules that drove them, many of these systems were only commercially available.

More recently, several projects have followed the approach of formalizing and/or integrating existing morphological data for multiple languages. UDer (Universal Derivations) (Kyjánek et al., 2020) integrates 27 derivational morphology resources in 20 languages. UniMorph (Kirov et al., 2016, 2018) and the Wikinflection Corpus (Metheniti and Neumann, 2020) rely mostly on Wiktionary, from which they extract inflectional information. Beyond the data source, however, these two projects have little in common: UniMorph is by far more precise and complete, and is used as a gold standard by the NLP community (Cotterell et al., 2017, 2018), recently covering 133 languages (McCarthy et al., 2020), while Wikinflection follows a naïve, linguistically uninformed approach of merely concatenating affixes, generating an abundance of ungrammatical word forms (e.g., for Hungarian or Finnish).

MorphyNet is also based on extracting morphological information from Wiktionary, extending the scope of UniMorph with new extraction rules and logic. The first version of MorphyNet covers 15 languages, and it is distinct from other resources in three aspects: (1) it includes both inflectional and derivational data; (2) it extracts a significantly higher number of inflections from Wiktionary; and (3) it provides a wider range of morphological information. While MorphyNet can be considered a superset of UniMorph for the languages it covers, the latter supports more languages. With UDer, as we show in Section 4, the overlap is minor for all languages. For these reasons, we consider MorphyNet complementary to these databases, considerably enriching their coverage of the 15 supported languages but not replacing them.

3 MorphyNet Generation

MorphyNet is generated mainly from Wiktionary, through the following steps.

1. Filtering returns XML-based Wiktionary content from specific sections of relevant lexical entries: headword lines, etymology sections, and inflectional tables are returned for nouns, verbs, and adjectives.

2. Extraction obtains raw morphological data by parsing the sections above.

3. Enrichment algorithmically extends the coverage of derivations and inflections obtained from Wiktionary, through entirely distinct methods for inflection and derivation.

4. Resource generation, finally, outputs MorphyNet data.

Below we explain the non-trivial Wiktionary extraction and enrichment steps, while Section 4 provides details on the generated resource itself.


3.1 Wiktionary Extraction

We extract inflectional and derivational data through hand-crafted extraction rules that target recurrent patterns in Wiktionary content, both in source markdown and in HTML-rendered form. Compared to UniMorph, which takes a similar approach and scrapes tables that provide inflectional paradigms, the scope of extraction is considerably extended, also including headword lines and etymology sections. This allows us to obtain new derivations, inflections, and features not covered by UniMorph, such as gender information or noun and adjective declensions for Catalan, French, Italian, Spanish, Russian, English, or Serbo-Croatian. Our rules target nouns, adjectives, and verbs in all languages covered.

Inflection extraction rules target two types of Wiktionary content: inflectional tables and headword lines. Inflectional tables provide conjugation and declension paradigms for a subset of verbs, nouns, and adjectives in Wiktionary. On tables, our extraction method was similar to that of UniMorph as described in (Kirov et al., 2016, 2018), with one major difference. UniMorph also extracted a large number of separate entries with modifier and auxiliary words, such as Spanish negative imperatives (no comas, no coma, no comamos, etc.) or Finnish negative indicatives (en puhu, et puhu, eivät puhu, etc.). MorphyNet, on the other hand, has a single entry for each distinct word form, regardless of the modifier word used. This policy had a particular impact on the size of the Finnish vocabulary.

As inflectional tables are only provided by Wiktionary for 62.5% of nouns, verbs, and adjectives (computed over the 15 languages covered by MorphyNet), we extended the scope of extraction to headword lines, such as

banca f (plural banche)

From this headword line, we extract two entries: one stating that banca is feminine singular and a second that banche is feminine plural. We created specific parsing rules for nouns, verbs, and adjectives because each part of speech is described through a different set of morphological features. For example, valency (transitive or reflexive) and aspect (perfective or imperfective) are essential for verbs, while gender (masculine or feminine) and number (singular or plural) pertain to nouns and adjectives.

Derivation extraction rules were applied to the etymology sections of Wiktionary entries to collect the Morphology template usages, such as for the English accusation:

Equivalent to accuse + -ation.

where we have a morphology entry suffix|en|accuse|-ation from the Wiktionary XML dump. After collecting all morphology entries, we applied the enrichment method to increase its coverage.

Figure 2: Derivation enrichment example: inference of the derivation of the Portuguese word acusação.
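The headword-line extraction described above can be illustrated with a toy parser. This is an illustrative sketch only: the regular expression and field names are hypothetical, cover just the banca-style noun pattern, and do not reflect MorphyNet's actual rules.

```python
import re

# Hypothetical pattern for Italian-style noun headword lines such as
# "banca f (plural banche)"; real Wiktionary markup is far more varied.
HEADWORD = re.compile(
    r"^(?P<lemma>\S+)\s+(?P<gender>[mf])\s+\(plural\s+(?P<plural>\S+)\)$"
)

def parse_noun_headword(line):
    """Turn one headword line into two inflectional entries (SG and PL)."""
    m = HEADWORD.match(line.strip())
    if not m:
        return []
    gender = {"m": "MASC", "f": "FEM"}[m.group("gender")]
    return [
        {"form": m.group("lemma"), "features": f"N;{gender};SG"},
        {"form": m.group("plural"), "features": f"N;{gender};PL"},
    ]
```

In the real pipeline, a separate rule set of this kind exists per part of speech, since verbs carry valency and aspect while nouns and adjectives carry gender and number.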

3.2 Derivation Enrichment

Derivation enrichment is based on a linguistically informed, cross-lingual generalization of derivational patterns observed in Wiktionary data, in order to extend the coverage of derivational data.

In the example shown in Figure 2, Wiktionary contains the Portuguese derivation competir (to compete) → competição (competition) but not acusar (to accuse) → acusação (accusation). An indiscriminate application of the suffix -ção to all verbs would, of course, generate lots of false positives, such as chegar (to arrive) ↛ *chegação. Even when the target word does exist, the inferred derivation is often false, as in the case of corar (to blush) ↛ coração (heart). A counter-example from English could be jewel + -ery → jewellery but gal + -ery ↛ gallery.

Table 1: Structure of MorphyNet inflectional data and its comparison to UniMorph. Data provided only by MorphyNet is highlighted in bold. The rest is provided by both resources in a nearly identical format.

Language   base_word  trg_word  features                     src_word  morpheme
Hungarian  ház        házak     N;NOM;PL                     ház       -ak
Hungarian  ház        házat     N;ACC;SG                     ház       -at
Hungarian  ház        házakat   N;ACC;PL                     házak     -at
Russian    играть     играть    V;NFIN;IPFV;ACT              играть
Russian    играть     играют    V;IND;PRS;3;PL;IPFV;ACT;FIN  играть    -ают
Russian    играть     играющий  V;V.PTCP;ACT;PRS             играют    -щий

Table 2: Structure of MorphyNet derivational data and its comparison to UDer. Data provided only by MorphyNet is highlighted in bold. The rest is provided by both resources in a nearly identical format.

Language  src_word    trg_word        src_pos  trg_pos         morpheme
English   time        timeless        noun     adjective       -less
English   soda        sodium          noun     substance.noun  -ium
English   zoo         zoophobia       noun     state.noun      -phobia
Finnish   kirjoittaa  kirjoittaminen  verb     noun            -minen
Finnish   lyödä       lyöjä           verb     person.noun     -jä

For this reason, we use stronger cross-lingual derivational evidence to induce the applicability of the affix. In the example above, the existence of the English derivation accuse → accusation, where the meanings of the English and the corresponding Portuguese words are the same, serves as a strong hint for the applicability of the Portuguese pattern.

This intuition is formalized in MorphyNet as follows: if in language A a derivation from source word w_s^A to target word w_t^A through the affix a^A is not explicitly asserted (e.g., by Wiktionary) but it is asserted for the corresponding cognates in at least one language B, then we infer its existence:

cog(w_s^A, w_s^B) ∧ cog(w_t^A, w_t^B) ∧ cog(a^A, a^B) ∧ der(w_s^B, a^B) = w_t^B ⇒ der(w_s^A, a^A) = w_t^A

where cog(x, y) means that the words x and y are cognates and der(b, a) = d that word d is derived from base word b and affix a. In our example, A = Portuguese, B = English, w_s^A = acusar, w_s^B = accuse, w_t^A = acusação, w_t^B = accusation, a^A = -ção, and a^B = -tion.

As shown in Figure 1, we exploited a cognate

database, CogNet (Batsuren et al., 2019, 2021), which has 8.1M cognate pairs, for evidence on cognacy: cog(w^A, w^B) = True is asserted by the presence of the word pair in CogNet.

The result of enrichment was a total increase of 25.6% in the number of derivations in MorphyNet. Efficiency varies among languages, essentially depending on the completeness of the Wiktionary coverage: it was lowest for English, with 3%, and highest for Spanish, with 57%.
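The inference rule above can be sketched in a few lines of code. This is illustrative code, not the released MorphyNet tool: the data structures are hypothetical, and the lexicon check stands in for verifying that the inferred target word is actually attested.

```python
def infer_derivations(asserted, cognates, lexicon):
    """Sketch of the cross-lingual derivation enrichment rule.

    asserted: set of (lang, src, affix, trg) derivations, e.g. from Wiktionary.
    cognates: dict mapping (lang, word) -> set of (lang', word') cognates;
              affix correspondences are stored the same way,
              e.g. ('en', '-ation') ~ ('pt', '-ção').
    lexicon:  set of (lang, word) attested word forms.
    Returns newly inferred (lang, src, affix, trg) derivations.
    """
    inferred = set()
    for (lang_b, src_b, affix_b, trg_b) in asserted:
        for (lang_a, src_a) in cognates.get((lang_b, src_b), ()):
            for (la2, affix_a) in cognates.get((lang_b, affix_b), ()):
                for (la3, trg_a) in cognates.get((lang_b, trg_b), ()):
                    # all three correspondences must be in the same language A,
                    # and the target word must exist in A's lexicon
                    if lang_a == la2 == la3 and (lang_a, trg_a) in lexicon:
                        cand = (lang_a, src_a, affix_a, trg_a)
                        if cand not in asserted:
                            inferred.add(cand)
    return inferred
```

Note how the lexicon check blocks false positives like *chegação, while the cognate requirement on the target word blocks accidental matches like coração.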

4 http://github.com/kbatsuren/CogNet

3.3 Inflection Enrichment

The enrichment of inflectional data is based on the simple observation that Wiktionary does not provide the root word for all inflected forms. For example, for the Hungarian múltjával (with his/her/its past), Wiktionary provides the inflection múltja → múltjával (his/her/its past + instrumental). For múltja, in turn, it provides múlt → múltja (past + possessive). It does not, however, directly provide the combination of the two inflections: múlt → múltjával (past + possessive + instrumental). Inflection enrichment consists of inferring such missing rules from the existing data.

The case above is formalized as follows: if, after

the Wiktionary extraction phase, the MorphyNet data contains the inflections wr → w1 (with feature set F1) as well as w1 → w2 (with feature set F2), then we create the new inflection wr → w2 with feature set F1 ∪ F2.

The application of this logic increased the inflectional coverage of MorphyNet by 10.8% and its recall (with respect to the ground truth data presented in Section 5) by 8.2% on average.
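The composition rule just described can be sketched as follows (a hypothetical re-implementation under invented data structures, not the released code):

```python
from collections import defaultdict

def enrich_inflections(inflections):
    """One-step composition of inflection edges:
    from wr -> w1 (features F1) and w1 -> w2 (features F2),
    infer wr -> w2 with feature set F1 | F2.
    inflections: list of (src, trg, frozenset_of_features) triples.
    """
    by_src = defaultdict(list)
    for src, trg, feats in inflections:
        by_src[src].append((trg, feats))
    seen = {(s, t) for s, t, _ in inflections}
    new = []
    for src, trg, f1 in inflections:
        for trg2, f2 in by_src.get(trg, []):
            # skip edges we already have and trivial self-loops
            if (src, trg2) not in seen and src != trg2:
                new.append((src, trg2, f1 | f2))
    return new
```

Applied to the Hungarian example, the two attested edges múlt → múltja and múltja → múltjával yield the missing múlt → múltjával with the union of the two feature sets.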

4 The MorphyNet Resource

MorphyNet is freely available for download, both as text files containing the data and as the source code of the Wiktionary extractor.5 Two text files are provided per language: one for inflections and one for derivations. The structure of the two types of files is illustrated in Tables 1 and 2, respectively. As shown, MorphyNet covers all data fields provided by UniMorph for inflections and by UDer for derivations. In addition, it extends UniMorph by indicating the affix and the immediate source word that produced the inflection. Such information is useful, for example, to NLP applications that rely on subword information for understanding out-of-vocabulary words. MorphyNet also extends the UDer structure by indicating the affix and the semantic category of the target word when it can be inferred from the morpheme. Such information is again useful for subword regularization of derivationally rich languages, such as English.

5 http://github.com/kbatsuren/WiktConv

Table 3 provides per-language statistics on MorphyNet data. The present version of the resource contains 10.6 million entries, of which 95% are inflections. Highly inflecting and agglutinative languages dominate the resource, as 55% of all entries belong to Finnish, Hungarian, Russian, and Serbo-Croatian. Language coverage above all depends on the completeness of Wiktionary, the main source of our data.

Table 3: MorphyNet dataset statistics

                    Inflectional morphology          Derivational morphology
#   Language        words     entries     morphemes  words    entries  morphemes  Total
1   Finnish         65,402    1,617,751   1,139      18,142   37,199   446        1,654,950
2   Serbo-Croatian  68,757    1,760,095   263        8,553    20,008   429        1,780,103
3   Italian         75,089    748,321     104        22,650   42,149   749        790,470
4   Hungarian       38,067    1,034,317   428        14,566   37,940   832        1,072,257
5   Russian         67,695    1,343,760   252        21,922   36,922   575        1,380,682
6   Spanish         67,796    677,423     145        16,268   27,633   490        705,056
7   French          44,729    453,229     98         15,473   37,203   636        490,432
8   Portuguese      30,969    329,861     161        10,504   15,974   387        345,835
9   Polish          36,940    663,545     251        9,518    18,404   405        681,949
10  German          35,086    214,401     243        13,070   23,867   465        238,268
11  Czech           9,781     298,888     112        4,875    9,660    318        307,935
12  English         149,265   652,487     8          67,412   200,365  2,445      852,852
13  Catalan         16,404    168,462     91         3,244    4,083    220        172,545
14  Swedish         14,485    131,693     32         3,190    5,810    217        137,503
15  Mongolian       2,085     14,592      35         1,410    1,940    229        16,532
    Total           722,550   10,108,825  3,362      230,797  519,157  8,843      10,627,369

Table 4: UniMorph and MorphyNet data sizes compared to Universal Dependencies content.

Language        UniMorph   MorphyNet  Univ. Dep.
Catalan         81,576     168,462    25,443
Czech           134,528    298,888    151,838
English         115,523    652,487    17,296
French          367,733    453,229    28,921
Finnish         2,490,377  1,617,751  47,813
Hungarian       552,950    1,034,317  3,685
Italian         509,575    748,321    24,002
Serbo-Croatian  840,799    1,760,095  35,936
Spanish         382,955    677,423    32,571
Swedish         78,411     131,693    15,030
Russian         473,482    1,343,760  18,774
Total           5,893,381  8,886,426  401,309
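For readers who want to consume the released files programmatically, a minimal loader might look like the sketch below. It assumes a tab-separated layout mirroring Table 1 (base_word, trg_word, features, src_word, morpheme); the actual column order should be checked against the released files.

```python
import csv

def load_inflections(path):
    """Read a MorphyNet-style inflection file into a list of dicts.
    Assumes tab-separated columns as in Table 1; features are
    semicolon-separated UniMorph-style tags such as N;NOM;PL."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 3:
                continue  # skip blank or malformed lines
            entries.append({
                "base": row[0],
                "form": row[1],
                "features": row[2].split(";"),
                "src": row[3] if len(row) > 3 else None,
                "morpheme": row[4] if len(row) > 4 else None,
            })
    return entries
```

A loader like this makes it straightforward to, e.g., index all forms by lemma or filter entries by feature tag.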

5 Evaluation

We evaluated MorphyNet through two different methods: (1) through comparison to ground truth and (2) through manual validation by experts.

Comparison to ground truth. Evaluating the quality of a morphological database is challenging due to the many idiosyncratic morphological properties of the languages evaluated (Gorman et al., 2019). As ground truth on inflections we used the Universal Dependencies6 dataset (Nivre et al., 2016, 2017), which (among others) provides morphological analyses of inflected words over a multilingual corpus of hand-annotated sentences. McCarthy et al. (2018) built a Python tool7 to convert these treebanks into the UniMorph schema (Sylak-Glassman, 2016). We evaluated both UniMorph 2.0 and MorphyNet against this data (performing the necessary mapping of feature tags beforehand) over the 11 languages in the intersection of the two resources: Hungarian (Vincze et al., 2010), Catalan, Spanish (Taulé et al., 2008), Czech (Bejček et al., 2013), Finnish (Pyysalo et al., 2015), Russian (Lyashevskaya et al., 2016), Serbo-Croatian (De Melo, 2014), French (Guillaume et al., 2019), Italian (Bosco et al., 2013), Swedish (Nivre and Megyesi, 2007), and English (Silveira et al., 2014). Table 5 contains evaluation results over nouns, verbs, and adjectives separately, as well as totals per language. Missing data points (e.g., for Catalan nouns) indicate that UniMorph did not have any corresponding inflections. For languages and parts of speech where both resources provide data, MorphyNet always provides higher recall. The exception is Finnish, because of our policy of not extracting conjugations with auxiliary and modifier words as separate entries (see Section 3.1). Overall, as

6 https://universaldependencies.org/
7 https://github.com/unimorph/ud-compatibility


Table 5: Inflectional morphology evaluation of MorphyNet against UniMorph on Universal Dependencies

Language        Resource   Noun (R/P/F1)    Verb (R/P/F1)    Adjective (R/P/F1)  Total (R/P/F1)
Catalan         UniMorph   -/-/-            71.9/99.3/83.4   -/-/-               21.3/99.3/35.1
Catalan         MorphyNet  66.0/98.4/79.0   73.8/99.1/84.6   48.2/99.6/65.0      64.3/98.8/77.9
Czech           UniMorph   28.2/99.1/43.9   9.5/18.1/12.5    17.6/44.8/25.3      21.0/72.7/32.6
Czech           MorphyNet  33.2/98.9/49.7   28.2/93.8/43.4   36.1/98.1/52.8      34.2/98.0/50.7
English         UniMorph   -/-/-            96.1/90.9/93.4   -/-/-               28.3/90.9/43.2
English         MorphyNet  81.5/99.1/89.4   97.1/96.8/96.9   85.3/99.7/91.9      83.2/98.8/90.3
French          UniMorph   -/-/-            70.6/98.5/82.2   -/-/-               20.6/98.5/34.1
French          MorphyNet  80.2/98.6/88.5   94.4/98.5/96.4   60.1/94.6/73.5      79.7/97.9/87.9
Finnish         UniMorph   45.5/99.5/62.4   50.5/88.4/64.3   61.4/81.7/70.1      49.1/93.5/64.4
Finnish         MorphyNet  49.8/99.4/66.4   53.8/89.5/67.2   67.2/98.1/79.8      54.5/96.7/69.7
Hungarian       UniMorph   45.3/99.0/62.2   31.9/97.8/48.1   -/-/-               30.8/98.8/47.0
Hungarian       MorphyNet  55.2/99.1/70.9   77.2/96.9/85.9   43.1/95.9/59.5      56.3/97.9/71.5
Italian         UniMorph   -/-/-            66.1/91.6/76.8   -/-/-               22.8/91.6/36.5
Italian         MorphyNet  86.7/99.0/92.4   88.8/96.9/92.7   84.9/98.9/91.4      87.0/98.2/92.3
Serbo-Croatian  UniMorph   0.0/0.0/0.0      0.0/0.0/0.0      49.4/47.4/48.4      18.5/47.4/26.6
Serbo-Croatian  MorphyNet  69.5/88.4/77.8   69.1/98.1/81.1   54.9/98.6/70.5      63.9/93.3/75.9
Spanish         UniMorph   -/-/-            93.0/99.8/96.3   -/-/-               32.1/99.8/48.6
Spanish         MorphyNet  88.3/99.2/93.4   97.0/99.5/98.2   81.9/99.2/89.7      89.7/99.3/94.3
Swedish         UniMorph   15.1/98.4/26.2   59.7/84.8/70.1   34.1/94.8/50.2      27.1/92.0/41.9
Swedish         MorphyNet  36.8/99.4/53.7   78.0/98.1/86.9   38.1/99.6/55.1      44.6/99.1/61.5
Russian         UniMorph   0.0/0.0/0.0      0.0/0.0/0.0      52.8/97.4/68.5      10.8/97.4/19.4
Russian         MorphyNet  56.5/95.1/70.9   67.7/92.9/78.3   64.5/99.0/78.1      61.5/95.2/74.7

seen from Table 4, MorphyNet contains about 47% more entries over the 11 languages where it overlaps with UniMorph. In terms of precision, the two resources are comparable, except for Finnish (adjectives) and Swedish (adjectives and verbs), where MorphyNet appears to be significantly more precise.
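The comparison against Universal Dependencies reduces to set-based precision and recall over (form, feature-set) analyses. A minimal sketch (illustrative, not the authors' evaluation script):

```python
def prf(predicted, gold):
    """Precision, recall, and F1 over two sets of (form, features) analyses,
    e.g. a database's entries vs. ground-truth annotations."""
    tp = len(predicted & gold)  # analyses found in both
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Before such a comparison, feature tags from both sources must be mapped into a common schema, as the paper notes was done with the UniMorph conversion tool.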

UDer (Kyjánek et al., 2020) is a collection of individual monolingual resources of derivational morphology. Most of them have been carefully evaluated against their own datasets and offer high quality. We evaluated MorphyNet derivational data against UDer over the nine languages covered by both resources: French (Hathout and Namer, 2014), Portuguese (de Paiva et al., 2014), Czech (Vidra et al., 2019), German (Zeller et al., 2013), Russian (Vodolazsky, 2020), Italian (Talamo et al., 2016), Finnish (Lindén and Carlson, 2010; Lindén et al., 2012), Latin (Litta et al., 2016), and English (Habash and Dorr, 2003). Statistics and results are shown in Table 6. First of all, the overlap between MorphyNet and UDer is small, which is visible from our recall values relative to UDer, which vary between 0.6% (Czech) and 59.5% (Italian). Among the languages evaluated, six were better covered by MorphyNet and the remaining three (Czech, German, and Russian) by UDer. The agreement between the two resources, computed

as Cohen's Kappa, was 0.85 overall, varying between 0.74 (Finnish) and 0.97 (Portuguese). If we consider UDer as the gold standard, we obtain precision figures between 87% and 99%.

Manual evaluation was carried out by language experts over sample data from five languages: English, Italian, French, Hungarian, and Mongolian. The sample consisted of 1,000 randomly selected entries per language, half of them inflectional and the other half derivational. The experts were asked to validate the correctness of source–target word pairs and morphemes, as well as of inflectional features and parts of speech (the latter for derivations). Table 7 shows detailed results. The overall precision is 98.9%, with per-language values varying between 98.2% (Hungarian) and 99.5% (English). The good results are proof both of the high quality of Wiktionary data and of the general correctness of the data extraction and enrichment logic of MorphyNet. A manual check of the incorrect entries revealed that most of them were due to the failure of extraction rules caused by occasional deviations in Wiktionary from its own conventions.
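An agreement figure of this kind can be computed with a standard Cohen's kappa over the two resources' judgments on shared candidate entries (a generic sketch, not the authors' script):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical judgments,
    given as two equal-length lists of labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement from marginal label frequencies
    cats = set(labels_a) | set(labels_b)
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

Kappa corrects raw agreement for chance, which is why it is a stricter statistic than simple percent agreement between two resources.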

6 Conclusions and Future Work

We consider the resource released and described here as an initial work-in-progress version that we plan to extend and improve. We are currently


Table 6: Derivational morphology evaluation of MorphyNet against Universal Derivations (UDer)

#  Language    MorphyNet  UDer resource  UDer size  UDer ∩ MorphyNet  Recall  Precision  Kappa
1  French      37,203     Démonette      13,272     2,558             18.5    95.5       0.91
2  Portuguese  15,974     NomLex-PT      3,420      1,235             35.8    98.9       0.97
3  Czech       9,660      Derinet        804,011    5,347             0.6     94.1       0.88
4  German      23,867     DerivBase      35,528     5,878             15.6    93.5       0.87
5  Russian     36,922     DerivBase.RU   118,374    6,370             12.3    88.1       0.76
6  Italian     42,149     DerIvaTario    1,548      958               59.5    90.7       0.81
7  Finnish     37,199     FinnWordnet    8,337      2,664             30.6    87.0       0.74
8  Latin       9,191      WFL            2,792      4,037             14.0    93.7       0.87
9  English     200,365    CatVar         16,185     7,397             45.7    91.9       0.83
   Total       412,530                   1,003,467  36,444            25.8    92.6       0.85

Table 7: Manual validation by language experts of MorphyNet

              Inflectional morphology         Derivational morphology
#  Language   word pair  features  morphemes  trg words  POS   morphemes  Total
1  English    99.2       100.0     99.0       100.0      99.0  100.0      99.5
2  French     99.8       98.0      100.0      100.0      96.8  100.0      99.1
3  Hungarian  97.0       95.0      100.0      98.6       99.1  99.2       98.2
4  Italian    100.0      100.0     99.4       98.0       97.4  99.0       99.0
5  Mongolian  98.2       100.0     99.2       98.4       98.1  98.6       98.8
   Average    98.8       98.6      99.5       99.0       98.1  99.4       98.9

working on increasing the coverage to 20 languages. We also plan to extend MorphyNet data with additional features and the semantic categories of words (e.g., animate or inanimate object, action) inferred from derivations. We are planning to conduct a more in-depth study of our evaluation results, especially with respect to UDer, where it is not yet clear whether the occasional lower precision figures (87% for Finnish, 88% for Russian) are due to mistakes in MorphyNet or in the UDer resources, or are caused by other factors.

A major piece of ongoing work concerns the representation of MorphyNet derivational data as a lexico-semantic graph, as is done in wordnets (Miller, 1998; Giunchiglia et al., 2017), where derivationally related word senses are interconnected by associative relationships. This effort, which justifies the -Net in the name of our resource, will allow us to address completeness issues in existing wordnets by extending them with morphological relations and derived words.

We are happy to offer the MorphyNet extraction logic for reuse by the community. As extending the tool with new Wiktionary extraction rules is straightforward, we hope that the availability of the tool will allow language coverage to grow even further. We also hope that the MorphyNet data and the extraction logic can serve existing high-quality projects such as UniMorph and UDer.

References

Duygu Ataman and Marcello Federico. 2018. Compositional representation of morphologically-rich input for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 305–311.

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. 2021. A large and evolving cognate database. Language Resources and Evaluation.

Khuyagbaatar Batsuren, Gábor Bella, and Fausto Giunchiglia. 2019. CogNet: a large-scale cognate database. In Proceedings of ACL 2019, Florence, Italy.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.

Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová, Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, et al. 2013. Prague Dependency Treebank 3.0.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69. The Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Gerard de Melo. 2014. Etymological Wordnet: Tracing the history of words. In LREC, pages 1148–1154.

Valeria de Paiva, Livy Real, Alexandre Rademaker, and Gerard de Melo. 2014. NomLex-PT: A lexicon of Portuguese nominalizations. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2851–2858.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, and Anna Korhonen. 2018. Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics, 6:451–465.

Fausto Giunchiglia, Khuyagbaatar Batsuren, and Gabor Bella. 2017. Understanding and exploiting language diversity. In IJCAI, pages 4009–4017.

Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, and Magdalena Markowska. 2019. Weird inflects but OK: Making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 140–151.

Bruno Guillaume, Marie-Catherine de Marneffe, and Guy Perrier. 2019. Conversion et améliorations de corpus du français annotés en Universal Dependencies. Traitement Automatique des Langues, 60(2):71–95.

Nizar Habash and Bonnie Dorr. 2003. CatVar: A database of categorial variations for English. In Proceedings of the MT Summit, pages 471–474.

Nabil Hathout and Fiammetta Namer. 2014. Démonette, a French derivational morpho-semantic network. In Linguistic Issues in Language Technology, Volume 11, 2014: Theoretical and Computational Morphology: New Trends and Synergies.

Inxight. 2005. LinguistX natural language processing platform.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke, Arya D. McCarthy, Sandra Kübler, et al. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Christo Kirov, John Sylak-Glassman, Roger Que, and David Yarowsky. 2016. Very-large scale parsing and normalization of Wiktionary morphological paradigms. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3121–3126.


Lukáš Kyjánek, Zdeněk Žabokrtský, Magda Ševčíková, and Jonáš Vidra. 2020. Universal Derivations 1.0, a growing collection of harmonised word-formation resources. The Prague Bulletin of Mathematical Linguistics, (115):5–30.

Krister Lindén and Lauri Carlson. 2010. FinnWordNet: Finnish WordNet by translation. LexicoNordica, Nordic Journal of Lexicography, 17:119–140.

Krister Lindén, Jyrki Niemi, and Mirka Hyvärinen. 2012. Extending and updating the Finnish WordNet. In Shall We Play the Festschrift Game?, pages 67–98. Springer.

Eleonora Litta, Marco Passarotti, and Chris Culy. 2016. Formatio formosa est. Building a word formation lexicon for Latin. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), pages 185–189.

Olga Lyashevskaya, Kira Droganova, Daniel Zeman, Maria Alexeeva, Tatiana Gavrilova, Nina Mustafina, Elena Shakurova, et al. 2016. Universal Dependencies for Russian: A new syntactic dependencies tagset. Higher School of Economics Research Paper No. WP BRP 44.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2020. UniMorph 3.0: Universal morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931.

Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2018. Marrying Universal Dependencies and Universal Morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 91–101.

Eleni Metheniti and Günter Neumann. 2020. Wikinflection corpus: A (better) multilingual, morpheme-annotated inflectional corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3905–3912.

George A. Miller. 1998. WordNet: An electronic lexical database. MIT Press.

Joakim Nivre, Željko Agić, Lars Ahrenberg, Lene Antonsen, Maria Jesus Aranzabe, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, et al. 2017. Universal Dependencies 2.1.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Joakim Nivre and Beata Megyesi. 2007. Bootstrapping a Swedish treebank using cross-corpus harmonization and annotation projection. In Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories, pages 97–102.

Mārcis Pinnis, Rihards Krišlauks, Daiga Deksne, and Toms Miks. 2017. Neural machine translation for morphologically rich languages with improved sub-word units and synthetic data. In International Conference on Text, Speech, and Dialogue, pages 237–245. Springer.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892.

Sampo Pyysalo, Jenna Kanerva, Anna Missilä, Veronika Laippala, and Filip Ginter. 2015. Universal Dependencies for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NoDaLiDa 2015), pages 163–172.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

John Sylak-Glassman. 2016. The composition and use of the universal morphological feature schema (UniMorph schema). Johns Hopkins University.

Luigi Talamo, Chiara Celata, and Pier Marco Bertinetto. 2016. DerIvaTario: An annotated lexicon of Italian derivatives. Word Structure, 9(1):72–102.

Mariona Taulé, Maria Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In LREC.

Viktor Trón, György Gyepesi, Péter Halácsy, András Kornai, László Németh, and Dániel Varga. 2005. Hunmorph: open source word analysis. In Proceedings of the Workshop on Software, pages 77–85.

Jonáš Vidra, Zdeněk Žabokrtský, Magda Ševčíková, and Lukáš Kyjánek. 2019. DeriNet 2.0: towards an all-in-one word-formation resource. In Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology, pages 81–89.

Veronika Vincze, Dóra Szauter, Attila Almási, György Móra, Zoltán Alexin, and János Csirik. 2010. Hungarian Dependency Treebank.

Daniil Vodolazsky. 2020. DerivBase.Ru: A derivational morphology resource for Russian. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3937–3943.

Ekaterina Vylomova, Trevor Cohn, Xuanli He, and Gholamreza Haffari. 2017. Word representation models for morphologically rich languages in neural machine translation. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 103–108.

Britta Zeller, Jan Šnajder, and Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1201–1211.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 49–59
August 5, 2021. ©2021 Association for Computational Linguistics

A Study of Morphological Robustness of Neural Machine Translation

Sai Muralidhar Jayanthi,* Adithya Pratapa*

Language Technologies Institute
Carnegie Mellon University

sjayanth,[email protected]

Abstract

In this work, we analyze the robustness of neural machine translation systems towards grammatical perturbations in the source. In particular, we focus on morphological inflection related perturbations. While this has been recently studied for English→French translation (MORPHEUS) (Tan et al., 2020), it is unclear how this extends to Any→English translation systems. We propose MORPHEUS-MULTILINGUAL, which utilizes UniMorph dictionaries to identify morphological perturbations to the source that adversely affect translation models. Along with an analysis of state-of-the-art pretrained MT systems, we train and analyze systems for 11 language pairs using the multilingual TED corpus (Qi et al., 2018). We also compare this to actual errors of non-native speakers using Grammatical Error Correction datasets. Finally, we present a qualitative and quantitative analysis of the robustness of Any→English translation systems. Code for our work is publicly available.1

1 Introduction

Multilingual machine translation is commonplace, with high-quality commercial systems available in over 100 languages (Johnson et al., 2017). However, translation from and into low-resource languages remains a challenge (Arivazhagan et al., 2019). Additionally, translation from morphologically-rich languages to English (and vice-versa) presents new challenges due to the wide differences in the morphosyntactic phenomena of the source and target languages. In this work, we study the effect of noisy inputs to neural machine translation (NMT) systems. A concrete practical application for this is the translation of text from non-native speakers. While the brittleness of NMT systems to input noise is well-studied (Belinkov and Bisk, 2018), most prior work has focused on translation from English (English→X) (Anastasopoulos et al., 2019; Alam and Anastasopoulos, 2020).

* Equal contribution.
1https://github.com/murali1996/morpheus_multilingual

With over 800 million second-language (L2) speakers of English, it is imperative that translation models be robust to any potential errors in the source English text. A recent work (Tan et al., 2020) has shown that English→X translation systems are not robust to inflectional perturbations in the source. Inspired by this work, we aim to quantify the impact of inflectional perturbations for X→English translation systems. We hypothesize that inflectional perturbations to source tokens should not adversely affect the translation quality. However, morphologically-rich languages tend to have freer word order compared to English, and small perturbations in the word inflections can lead to significant changes to the overall meaning of the sentence. This is a challenge to our analysis.

We build upon Tan et al. (2020) to induce inflectional perturbations to source tokens using the unimorph_inflect tool (Anastasopoulos and Neubig, 2019) along with UniMorph dictionaries (McCarthy et al., 2020) (§2). We then present a comprehensive evaluation of the robustness of MT systems for languages from different language families (§3). To understand the impact of the size of the parallel corpora available for training, we experiment on a spectrum of high-, medium- and low-resource languages. Furthermore, to understand the impact in real settings, we run our adversarial perturbation algorithm on learners' text from Grammatical Error Correction datasets for German and Russian (§3.3).

2 Methodology

To evaluate the robustness of X→English NMT systems, we generate inflectional perturbations to the tokens in source language text. In our methodology,


we aim to identify adversarial examples that lead to maximum degradation in the translation quality. We build upon the recently proposed MORPHEUS toolkit (Tan et al., 2020), which evaluated the robustness of NMT systems translating from English→X. For a given source English text, MORPHEUS works by greedily looking for inflectional perturbations, sequentially iterating through the tokens in the input text. For each token, it identifies inflectional edits that lead to the maximum drop in BLEU score.

We extend this approach to test X→English translation systems. Since their toolkit2 is limited to perturbations in English only, in this work we develop our own inflection methodology that relies on UniMorph (McCarthy et al., 2020).

2.1 Reinflection

The UniMorph project3 provides morphological data for numerous languages under a universal schema. The project supports over 100 languages and provides morphological inflection dictionaries for up to three part-of-speech tags: nouns (N), adjectives (ADJ) and verbs (V). While some UniMorph dictionaries include a large number of types (or paradigms) (German (≈15k), Russian (≈28k)), many dictionaries are relatively small (Turkish (≈3.5k), Estonian (<1k)). This puts a limit on the number of tokens we can perturb via UniMorph dictionary look-up. To overcome this limitation, we use the unimorph_inflect toolkit4, which takes as input a lemma and a morphosyntactic description (MSD) and returns a reinflected word form. This tool was trained using UniMorph dictionaries and generalizes to unseen types. An illustration of our inflectional perturbation methodology is given in Table 1.

2.2 MORPHEUS-MULTILINGUAL

Given an input sentence, our proposed method, MORPHEUS-MULTILINGUAL, identifies adversarial inflectional perturbations to the input tokens that lead to maximum degradation in the performance of the machine translation system. We first iterate through the sentence to extract all possible inflectional forms for each of the constituent tokens. Since we are relying on UniMorph dictionaries, we are limited to perturbing only nouns, adjectives and

2https://github.com/salesforce/morpheus
3https://unimorph.github.io/
4https://github.com/antonisa/unimorph_inflect

verbs.5 Then, to construct a perturbed sentence, we iterate through each token and uniformly sample one inflectional form from the candidate inflections. We repeat this process N (=50) times and compile our pool of perturbed sentences.6
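The sampling procedure above can be sketched in a few lines of Python. The toy inflection table below is hypothetical and merely stands in for the UniMorph/unimorph_inflect look-ups the paper actually uses:

```python
import random

# Hypothetical toy reinflection table standing in for UniMorph look-ups:
# token -> alternative inflected forms (including the original form).
INFLECTIONS = {
    "robbers": ["robber", "robbers"],
    "come": ["come", "comes", "came", "coming"],
}

def candidate_sentences(tokens, inflections, n=50, seed=0):
    """Uniformly sample n perturbed sentences (duplicates collapsed)."""
    rng = random.Random(seed)
    pool = set()
    for _ in range(n):
        perturbed = [rng.choice(inflections.get(tok, [tok]))  # unknown tokens kept as-is
                     for tok in tokens]
        pool.add(" ".join(perturbed))
    return sorted(pool)

cands = candidate_sentences("you know when robbers come".split(), INFLECTIONS)
```

With only two perturbable tokens here, at most 2 × 4 = 8 distinct sentences can appear in the pool; the paper draws N = 50 samples per sentence.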

To identify the adversarial sentence, we compute the chrF score (Popović, 2017) using the sacrebleu toolkit (Post, 2018) and select the sentence that results in the maximum drop in chrF score (if any). In our preliminary experiments, we found chrF to be more reliable than BLEU (Papineni et al., 2002) for identifying adversarial candidates. While BLEU uses word n-grams to compare the translation output with the reference, chrF uses character n-grams instead, which helps with matching morphological variants of words.
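The selection step can be illustrated with a simplified character n-gram F-score. Note this is only an illustrative stand-in for sacrebleu's chrF (which the paper actually uses); the scoring details (n-gram range, beta, whitespace handling) are simplified here:

```python
from collections import Counter

def char_ngram_f(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF-like score: character n-gram F-beta averaged over
    n = 1..max_n, with whitespace stripped. Illustrative only."""
    hyp, ref = hyp.replace(" ", ""), ref.replace(" ", "")
    scores = []
    for n in range(1, max_n + 1):
        h = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not h or not r:
            continue  # strings shorter than n contribute nothing
        overlap = sum((h & r).values())  # multiset intersection of n-grams
        prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0

def most_adversarial(reference, candidate_translations):
    """Pick the candidate translation scoring lowest against the reference."""
    return min(candidate_translations, key=lambda t: char_ngram_f(t, reference))
```

Identical strings score 1.0, and the candidate whose translation diverges most from the original translation is selected as the adversary.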

The original MORPHEUS toolkit follows a slightly different algorithm to identify adversaries. Similar to our approach, it first extracts all possible inflectional forms for each of the constituent tokens. It then sequentially iterates through the tokens in the sentence and, for each token, selects the inflectional form that results in the worst BLEU score. Once an adversarial form is identified, it directly replaces the form in the original sentence and continues to the next token. While a similar approach is possible in our setup, we found this algorithm to be computationally expensive as it prevents efficient batching.

It is important to note that neither MORPHEUS-MULTILINGUAL nor the original MORPHEUS exhaustively searches over all possible sentences, due to memory and time constraints. However, our approach in MORPHEUS-MULTILINGUAL can be efficiently implemented and reduces the inference time by almost a factor of 20. We experiment on 11 different language pairs; therefore, the run time and computational costs are critical to our experiments.

3 Experiments

In this section, we present a comprehensive evaluation of the robustness of X→English machine translation systems. Since it is natural for NMT models to be more robust when trained on large amounts of parallel data, we experiment with two

5Some dictionaries contain fewer POS tags; for example, in German we are restricted to just nouns and verbs.

6N is a hyperparameter; in our preliminary experiments, we find N = 50 to be sufficiently high to generate many uniquely perturbed sentences while keeping the process computationally tractable.


POS:      PRON         VERB          PART     PUNCT  ADV   NOUN           VERB       AUX
          Sie          wissen        nicht    ,      wann  Räuber         kommen     können
gloss:    you-NOM.3PL  know-PRS.3PL  not-NEG  ,      when  robber-NOM.PL  come-NFIN  can-PRS.3PL

(*) Sie wissten nicht , wann Räuber kommen können
(*) Sie wissen nicht , wann Räuber kommen könne
(*) Sie wisse nicht , wann Räuber kommen könnte

Table 1: Example inflectional perturbations on a German text.

sets of translation systems. First, we use state-of-the-art pre-trained models for Russian→English and German→English from fairseq (Ott et al., 2019).7 Second, we use the multilingual TED corpus (Qi et al., 2018) to train transformer-based translation systems from scratch.8 Using the TED corpus allows us to expand our evaluation to a larger pool of language pairs.

3.1 WMT19 Pretrained Models

We evaluate the robustness of the best-performing systems from the WMT19 news translation shared task (Barrault et al., 2019), specifically for Russian→English and German→English (Ott et al., 2019). Following the original work, we use newstest2018 as our test set for adversarial evaluation. Using the procedure described in §2.2, we create adversarial versions of newstest2018 for both language pairs. In Table 2, we present the baseline and adversarial results using BLEU and chrF metrics. For both language pairs, we notice significant drops on both metrics. Before diving further into the qualitative analysis of these MT systems, we first present a broader evaluation of MT systems trained on the multilingual TED corpus.

         Baseline        Adversarial
lg   BLEU   chrF     BLEU   chrF   NR
rus  38.33  0.63     18.50  0.47   0.81
deu  48.40  0.70     33.43  0.59   1.00

Table 2: Baseline and adversarial results on newstest2018 using fairseq's pre-trained models. NR denotes the Target-Source Noise Ratio (Eq. 2).

7Due to resource constraints, we only experiment with the single models and leave the evaluation of ensemble models for future work.
8For the selected languages, we train an MT model with the 'transformer_iwslt_de_en' architecture from fairseq. We use a sentence-piece vocab size of 8000 and train for up to 80 epochs with the Adam optimizer (see A.2 in the Appendix for more details).

3.2 TED corpus

The multilingual TED corpus (Qi et al., 2018) provides parallel data for over 50 language pairs, but in our experiments we only use a subset of these language pairs. We selected our test language pairs (X→English) to maximize the diversity in language families, as well as the resources available for training MT systems. Since we rely on UniMorph and unimorph_inflect for generating perturbations, we only select languages that have reasonably high accuracy in unimorph_inflect (>80%). Table 3 presents an overview of the chosen source languages, along with information on language family and training resources.

We also quantify the morphological richness of the languages listed in Table 3. As we are not aware of any standard metric to gauge the morphological richness of a language, we use the reinflection dictionaries to define such a metric. We compute the morphological richness using the Type-Token Ratio (TTR) as follows,

TTR_lg = N_types(lg) / N_tokens(lg) = N_paradigms(lg) / N_forms(lg)    (1)

In Table 3, we report the TTR_lg scores measured on UniMorph dictionaries as well as on the UniMorph-style dictionaries constructed from the TED dev splits using the unimorph_inflect tool. Note that TTR_lg, as defined here, differs slightly from the widely known Type-Token Ratio used for measuring the lexical diversity (or richness) of a corpus.
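Equation 1 can be computed directly from a reinflection dictionary. A minimal sketch over a hypothetical miniature lemma-to-forms dictionary (the real computation runs over full UniMorph dictionaries):

```python
def ttr(paradigms):
    """Type-Token Ratio as in Eq. 1: number of paradigms (lemmas)
    divided by the total number of inflected forms."""
    n_paradigms = len(paradigms)
    n_forms = sum(len(forms) for forms in paradigms.values())
    return n_paradigms / n_forms

# Hypothetical miniature dictionary: lemma -> set of inflected forms.
toy = {
    "walk": {"walk", "walks", "walked", "walking"},
    "cat": {"cat", "cats"},
}
score = ttr(toy)  # 2 paradigms / 6 forms
```

A lower score indicates more forms per paradigm, i.e., richer inflectional morphology under this definition.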

We run MORPHEUS-MULTILINGUAL to generate adversarial sentences for the validation splits of the TED corpus. We term a sentence adversarial if it leads to the maximum drop in chrF score. Note that it is possible to have perturbed sentences that do not lead to any drop in chrF scores. In Figure 1, we plot the fraction of perturbed sentences along with the adversarial fraction for each of the source languages. We see considerable perturbations for most languages, with the exception of Swedish, Lithuanian, Ukrainian, and Estonian.


lg   Family      Resource  TTR
heb  Semitic     High      (0.044, 0.191)
rus  Slavic      High      (0.080, 0.107)
tur  Turkic      High      (0.016, 0.048)
deu  Germanic    High      (0.210, 0.321)
ukr  Slavic      High      (0.103, 0.143)
ces  Slavic      High      (0.071, 0.082)
swe  Germanic    Medium    (0.156, 0.281)
lit  Baltic      Medium    (0.051, 0.084)
slv  Slavic      Low       (0.109, 0.087)
kat  Kartvelian  Low       (0.057, ——)
est  Uralic      Low       (0.026, 0.056)

Table 3: List of languages chosen from the multilingual TED corpus. For each language, the table presents the language family, resource level, and the Type-Token Ratio (TTR_lg). We measure the ratio using the types and tokens present in the reinflection dictionaries (UniMorph, lexicon from TED dev).

Figure 1: Perturbation statistics for selected TED lan-guages

In preparing our adversarial set, we retain the original source sentence if we fail to create any perturbation or if none of the identified perturbations leads to a drop in chrF score. This ensures the adversarial set has the same number of sentences as the original validation set. In Table 4, we present the baseline and adversarial MT results. We notice a considerable drop in performance for Hebrew, Russian, Turkish and Georgian. As expected, the % drops are correlated with the perturbation statistics from Figure 1.

3.3 Translating Learners' Text

In the previous sections (§3.1, §3.2), we have seen the impact of noisy inputs on MT systems. While these results indicate a need for improving the robustness of MT systems, the adversarial sets constructed above are nevertheless synthetic. In this section, we evaluate the impact of morphological inflection related errors directly on learners' text.

Figure 2: Schematic of the preliminary evaluation on learners' language text. This is similar to the methodology used in Anastasopoulos (2019).

To this end, we utilize two grammatical error correction (GEC) datasets: German Falko-MERLIN-GEC (Boyd, 2018) and Russian RULEC-GEC (Rozovskaya and Roth, 2019). Both of these datasets contain labeled error types relating to word morphology. Evaluating robustness on these datasets gives us a better understanding of performance on actual text produced by second language (L2) speakers.

Unfortunately, we do not have gold English translations for the grammatically incorrect (or corrected) text from the GEC datasets. While there is related prior work (Anastasopoulos et al., 2019) that annotated Spanish translations for English GEC data, we are not aware of any prior work that provides gold English translations for grammatically incorrect data in non-English languages. Therefore, we propose a pseudo-evaluation methodology that allows for measuring the robustness of MT systems. A schematic overview of our methodology is presented in Figure 2. We take the ungrammatical text and use the gold GEC annotations to correct all errors except the morphology-related ones. We then have ungrammatical text that contains only morphology-related errors, similar to the perturbed outputs from MORPHEUS-MULTILINGUAL. Since we do not have gold translations for the input Russian/German sentences, we use the machine translation output of the fully grammatical text as the reference and the translation output of the partially-corrected text as the hypothesis. In Table 5, we present the results on both Russian and German learners' text.
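The partial-correction step can be sketched as follows. The edit representation (token position, correction, error type) is a simplification we assume for illustration; the actual Falko-MERLIN and RULEC annotation schemes differ:

```python
def partially_correct(tokens, edits, keep_types=("morphology",)):
    """Apply gold GEC edits except those whose error type should be kept,
    yielding text that is grammatical apart from morphology-related errors."""
    out = list(tokens)
    for pos, correction, err_type in edits:
        if err_type not in keep_types:  # fix everything else
            out[pos] = correction
    return out

# Toy example: fix the spelling error, keep the morphology error.
result = partially_correct(
    ["he", "go", "too", "school"],
    [(1, "goes", "morphology"), (2, "to", "spelling")],
)
```

The fully corrected text (all edits applied) is translated to produce the pseudo-reference; the partially corrected text produces the hypothesis.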

Overall, we find that the pre-trained MT models from fairseq are quite robust to noise in learners' text. We manually inspected some examples and found the MT systems to be sufficiently robust to morphological perturbations; the changes in the output translation (if any) are mostly warranted.


                         Baseline        Adversarial
X→English   Code  # train  BLEU   chrF    BLEU          chrF          NR
Hebrew      heb   211K     40.06  0.5898  33.94 (-15%)  0.5354 (-9%)  1.56
Russian     rus   208K     25.64  0.4784  11.70 (-54%)  0.3475 (-27%) 1.03
Turkish     tur   182K     27.77  0.5006  18.90 (-32%)  0.4087 (-18%) 1.43
German      deu   168K     34.15  0.5606  31.29 (-8%)   0.5373 (-4%)  1.82
Ukrainian   ukr   108K     25.83  0.4726  25.66 (-1%)   0.4702 (-1%)  2.96
Czech       ces   103K     29.35  0.5147  26.58 (-9%)   0.4889 (-5%)  2.11
Swedish     swe   56K      36.93  0.5654  36.84 (-0%)   0.5646 (-0%)  3.48
Lithuanian  lit   41K      18.88  0.3959  18.82 (-0%)   0.3948 (-0%)  3.42
Slovenian   slv   19K      11.53  0.3259  10.48 (-9%)   0.3100 (-5%)  3.23
Georgian    kat   13K      5.83   0.2462  4.92 (-16%)   0.2146 (-13%) 2.49
Estonian    est   10K      6.68   0.2606  6.53 (-2%)    0.2546 (-2%)  4.72

Table 4: Results on the multilingual TED corpus (Qi et al., 2018)

Dataset      f-BLEU  f-chrF
Russian GEC  85.77   91.56
German GEC   89.60   93.95

Table 5: Translation results on the Russian and German GEC corpora. An oracle (i.e., fully robust) MT system would give a perfect score. We adopt the faux-BLEU terminology from Anastasopoulos (2019); f-BLEU is identical to BLEU, except that it is computed against a pseudo-reference instead of the true reference.

Viewing these results in combination with the results on the TED corpus, we believe that X→English systems are robust to morphological perturbations at the source as long as they are trained on sufficiently large parallel corpora.

4 Analysis

To better understand what makes a given MT system robust to morphology-related grammatical perturbations in the source, we present a thorough analysis of our results and also highlight a few limitations of our adversarial methodology.

Adversarial Dimensions: To quantify the impact of each inflectional perturbation, we perform a fine-grained analysis on the adversarial sentences obtained from the multilingual TED corpus. For each perturbed token in an adversarial sentence, we identify the part-of-speech (POS) and the feature dimension(s) (dim) perturbed in the token. We uniformly distribute the % drop in sentence-level chrF score over the (POS, dim) perturbations in the adversarial sentence. This allows us to quantitatively

compare the impact of each perturbation type (POS, dim) on the overall performance of the MT model. Additionally, as seen in Figure 1, not all inflectional perturbations cause a drop in chrF (or BLEU) scores. The adversarial sentences only capture the worst-case drop in chrF. Therefore, to analyze the overall impact of each perturbation (POS, dim), we also compute the impact score on the entire set of perturbed sentences explored by MORPHEUS-MULTILINGUAL.
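The impact-score aggregation can be sketched as follows, assuming each analyzed sentence comes with its chrF drop and the list of its (POS, dim) perturbations (the data structure is our illustrative assumption):

```python
from collections import defaultdict

def impact_scores(analyzed_sentences):
    """Distribute each sentence-level chrF drop uniformly over that
    sentence's (POS, dim) perturbations and aggregate per type."""
    impact = defaultdict(float)
    for drop, perturbations in analyzed_sentences:
        if not perturbations:
            continue
        share = drop / len(perturbations)  # uniform distribution of the drop
        for key in perturbations:
            impact[key] += share
    return dict(impact)

scores = impact_scores([
    (0.10, [("N", "NUM"), ("V", "TENSE")]),  # drop split 0.05 / 0.05
    (0.06, [("V", "TENSE")]),
])
```

The same aggregation can be run either over adversarial sentences only or over the full pool of explored perturbations, as done in the analysis.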

Table 8 (in the Appendix) presents the results for all the TED languages. The trend for adversarial perturbations is quite similar to that for all explored perturbations. This indicates that the adversarial impact of a perturbation is not determined by the perturbation type (POS, dim) alone but is lexically dependent.

Evaluation Metrics: In the results presented in §3, we reported performance using the BLEU and chrF metrics (following prior work (Tan et al., 2020)). We noticed significant drops on these metrics, even for high-resource languages like Russian, Turkish and Hebrew, including with the state-of-the-art fairseq models. To better understand these drops, we inspected the output translations of adversarial source sentences. We found a number of cases where the new translation is semantically valid but both metrics incorrectly score it low (see S2 in Table 6). This is a limitation of using surface-level metrics like BLEU/chrF.

Additionally, we require the new translation to be as close as possible to the original translation, but this can be a strict requirement on many occasions. For instance, if we change a noun in the


Figure 3: Correlation between Noise Ratio (NR) and # train. The results indicate that the larger the training data, the more robust the models are towards source perturbations (NR ≈ 1).

source from its singular to plural form, it is natural to expect a robust translation system to reflect that change in the output translation. To account for this behavior, we compute the Target-Source Noise Ratio (NR) metric from Anastasopoulos (2019), where s′ is the perturbed source and t′ its translation. NR is computed as follows,

NR(s, t, s′, t′) = (100 − BLEU(t, t′)) / (100 − BLEU(s, s′))    (2)

The ideal NR is ∼1, where a change in the source (s → s′) results in a proportional change in the target (t → t′). For the adversarial experiments on the TED corpus, we compute the NR metric for each language pair; the results are presented in Table 4. Interestingly, while Russian sees a major drop in BLEU/chrF score, its noise ratio is close to 1. This indicates that the Russian MT system is actually quite robust to morphological perturbations. Furthermore, in Figure 3, we present a correlation analysis between the size of the parallel corpus available for training and the noise ratio metric. We see a very strong negative correlation, indicating that high-resource MT systems (e.g., heb, rus, tur) are quite robust to inflectional perturbations, in spite of the large drops in BLEU/chrF scores. Additionally, we noticed that the morphological richness of the source language (measured via TTR in Table 3) does not play any significant role in MT performance under adversarial settings (e.g., rus, tur vs. deu). A scatter plot between TTR and NR for the TED translation task is presented in Figure 4.
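Equation 2 is straightforward to compute once the two BLEU scores are known; a small sketch (BLEU on the usual 0-100 scale):

```python
def noise_ratio(bleu_target, bleu_source):
    """Target-Source Noise Ratio (Eq. 2): change in the translation
    relative to the change in the source. bleu_source = BLEU(s, s'),
    bleu_target = BLEU(t, t'), both in [0, 100]."""
    return (100.0 - bleu_target) / (100.0 - bleu_source)

nr = noise_ratio(bleu_target=80.0, bleu_source=80.0)  # 1.0: proportional change
```

Values well above 1 indicate the translation changed far more than the source perturbation warranted.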

Morphological Richness: To analyze the impact of the morphological richness of the source, we look deeper into the Slavic language family. We experimented with four languages within the Slavic family: Czech, Ukrainian, Russian and Slovenian. All except Slovenian are high-resource. These languages differ significantly in their morphological richness (TTR), with TTRces < TTRslv << TTRrus << TTRukr.9 As we have already seen in the above analysis (see Figure 4), morphological richness isn't indicative of the noise ratio (NR), and this behavior also holds for the Slavic languages. We now ask whether morphological richness determines the drop in BLEU/chrF scores. We find that this is also not the case. We see a larger % drop for rus as compared to slv or ukr. We instead notice that the % drop in BLEU/chrF depends on the % edits we make to the validation set. The % edits we were able to make follows the order δrus >> δces > δslv >> δukr (see Figure 1).

Figure 4: Correlation between Target-Source Noise Ratio (NR) on TED machine translation and Type-Token Ratio (TTRlg) of the source language (from UniMorph). The results indicate that the morphological richness of the source language doesn't necessarily correlate with NMT robustness.
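Type-token ratio, used above as a proxy for morphological richness, can be sketched as follows. The toy token list is illustrative; the paper measures TTR on lexicons from TED dev splits.

```python
# Sketch: type-token ratio (TTR) as a rough proxy for morphological
# richness. Morphologically rich languages inflect words heavily, so
# they tend to show more distinct types per token.

def type_token_ratio(tokens):
    """Number of distinct word types divided by total token count."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

tokens = "the coach supported the player".split()
print(type_token_ratio(tokens))  # 4 types / 5 tokens = 0.8
```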

While NR is driven by the size of the training set, the % drop in BLEU is driven by the % edits to the validation set. The % edits in turn depend on the size of the UniMorph dictionaries and not on the morphological richness of the language. Therefore, we conclude that both metrics, % drop in BLEU/chrF and NR, depend on the resource size (parallel data and UniMorph dictionaries) and not on the morphological richness of the language.

Semantic Change: In our adversarial attacks, we aim to create an ungrammatical source via inflectional edits and evaluate the robustness of systems to these edits. While these adversarial attacks can help us discover significant biases in the translation systems, they can often lead to unintended consequences. Consider the example Russian sentence S1 (s) from Table 6. The sentence is grammatically correct, with the subject Тренер ('Coach') and the object игрока ('player') in NOM and ACC cases respectively. If we perturb this sentence to A-S1 (ŝ), the new words Тренера ('Coach') and игрок ('player') are now in ACC and NOM cases respectively. Due to the case assignment phenomenon in Russian, this perturbation (s → ŝ) has essentially swapped the subject and object roles in the Russian sentence. As we can see in the example, the English translation t̂ (A-T1) does in fact correctly capture this change. This indicates that our attacks can sometimes lead to significant changes in the semantics of the source sentence. Handling such cases would require a deeper understanding of each language's grammar, and we leave this for future work.

9 TTRlg measured on lexicons from TED dev splits.

Figure 5: Elasticity score for TED languages

Elasticity: As we have seen in the discussion on noise ratio, it is natural for MT systems to transfer changes in the source to the target. However, inspired by Anastasopoulos (2019), we wanted to understand how this behavior changes as we increase the number of edits in the source sentence. For this purpose, we first bucket all the explored perturbed sentences based on the number of edits (or perturbations) from the original source. Within each bucket, we compute the fraction of perturbed source sentences that result in the same translation as the original source. We define this fraction as the elasticity score, i.e. whether the translation remains the same under changes in the source. Figure 5 presents the results, and we find the elasticity score dropping quickly to zero as the # edits increases. Notably, ukr drops quickly to zero, while rus retains a reasonable elasticity score for higher numbers of edits.
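The bucketing and per-bucket fraction described above can be sketched as a small function. The data layout (tuples of edit count, perturbed translation, original translation) is hypothetical.

```python
# Sketch of the elasticity computation: bucket perturbed sentences by
# the number of edits from the original source, then measure the
# fraction per bucket whose translation is unchanged.
from collections import defaultdict

def elasticity_by_edits(examples):
    """examples: iterable of (num_edits, perturbed_translation,
    original_translation). Returns {num_edits: elasticity score}."""
    same = defaultdict(int)
    total = defaultdict(int)
    for n_edits, perturbed_t, original_t in examples:
        total[n_edits] += 1
        if perturbed_t == original_t:
            same[n_edits] += 1
    return {n: same[n] / total[n] for n in total}

examples = [
    (1, "the coach won", "the coach won"),  # translation unchanged
    (1, "a coach won", "the coach won"),    # translation changed
    (2, "coaches win", "the coach won"),
]
print(elasticity_by_edits(examples))  # {1: 0.5, 2: 0.0}
```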

Figure 6: Boxplots for the distribution of # edits per sentence in the adversarial TED validation set.

Aggressive edits: Our algorithm doesn't put any restrictions on the number of tokens that can be perturbed in a given sentence. This can lead to aggressive edits, especially in languages like Russian that are morphologically rich and whose reinflection lexicons are sufficiently large. As we illustrate in Figure 6, the median number of edits per sentence in rus is 5, significantly higher than the next language (tur at 1). Such aggressive edits in Russian can lead to unrealistic sentences, far from our intended simulation of learners' text. We leave the idea of thresholding the # edits to future work.

Adversarial Training: In an attempt to improve the robustness of NMT systems against morphological perturbations, we propose training NMT models by augmenting the training data with adversarially perturbed sentences. Due to computational constraints, we evaluate this setting only for slv. We follow the strategy outlined in Section 2 to obtain adversarial perturbations for the TED corpus training data. We observe that the adversarially trained model performs marginally worse (BLEU 10.30, down from 10.48 when trained without data augmentation). We hypothesize that this could be due to the small training data, and believe that this training setting can better benefit models with already high BLEU scores. We leave extensive evaluation and further analysis of adversarial training to future work.

5 Conclusion

In this work, we propose MORPHEUS-MULTILINGUAL, a tool to analyze the robustness of X→English NMT systems under morphological perturbations. Using this tool, we experiment with 11 different languages selected from diverse language families with varied training resources.


rus
S1    Source (s)  Тренер полностью поддержал игрока.
T1    Target (t)  The coach fully supported the player.
A-S1  Source (ŝ)  Тренера полностью поддержал игрок.
A-T1  Target (t̂)  The coach was fully supported by the player.

deu
S2    Source (s)  Dinosaurier benutzte Tarnung, um seinen Feinden auszuweichen
T2    Target (t)  Dinosaur used camouflage to evade its enemies (1.000)
A-S2  Source (ŝ)  Dinosaurier benutze Tarnung, um seinen Feindes auszuweichen
A-T2  Target (t̂)  Dinosaur Use camouflage to dodge his enemy (0.512)

rus
S3    Source (s)  У нас вообще телесные наказания не редкость.
T3    Target (t)  In general, corporal punishment is not uncommon. (0.885)
A-S3  Source (ŝ)  У нас вообще телесных наказании не редкостях.
A-T3  Target (t̂)  We don’t have corporal punishment at all. (0.405)

rus
S4    Source (s)  Вот телесные наказания - спасибо, не надо.
T4    Target (t)  That’s corporal punishment - thank you, you don’t have to. (0.458)
A-S4  Source (ŝ)  Вот телесных наказаний - спасибах, не надо.
A-T4  Target (t̂)  That’s why I’m here. (0.047)

deu
S5    Source (s)  Die Schießereien haben nicht aufgehört.
T5    Target (t)  The shootings have not stopped. (0.852)
A-S5  Source (ŝ)  Die Schießereien habe nicht aufgehört.
A-T5  Target (t̂)  The shootings did not stop, he said. (0.513)

rus
S6    Source (s)  Всякое бывает.
T6    Target (t)  Anything happens. (0.587)
A-S6  Source (ŝ)  Всякое будете бывать.
A-T6  Target (t̂)  You’ll be everywhere. (0.037)

kat
S7    Source (s)
T7    Target (t)  It ’s a real school. (0.821)
A-S7  Source (ŝ)
A-T7  Target (t̂)  There ’s a man who ’s friend. (0.107)

est
S8    Source (s)  Ning meie laste tuleviku varastamine saab ühel päeval kuriteoks.
T8    Target (t)  And our children ’s going to be the future of our own day. (0.446)
A-S8  Source (ŝ)  Ning meie laptegs tuleviku varastamine saab ühel päeval kuriteoks.
A-T8  Target (t̂)  And our future is about the future of the future. (0.227)

est
S9    Source (s)  Nad pagevad üle piiride nagu see.
T9    Target (t)  They like that overdights like this. (0.318)
A-S9  Source (ŝ)  Nad pagevad üle piirete nagu see.
A-T9  Target (t̂)  They dress it as well as well. (0.141)

rus
S10    Source (s)  Мой дедушка был необычайным человеком того времени.
T10    Target (t)  My grandfather was an extraordinary man at that time. (0.802)
A-S10  Source (ŝ)  Мой дедушка будё необычайна человеков того времи.
A-T10  Target (t̂)  My grandfather is incredibly harmful. (0.335)

Table 6: Qualitative analysis. (1) semantic change, (2) issues with evaluation metrics, (3,4,5,6,7,10) good examples for attacks, (8) poor attacks, (9) poor translation quality (s → t).


We evaluate NMT models trained on the TED corpus as well as pretrained models readily available as part of the fairseq library. We observe a wide range of 0-50% drops in performance under the adversarial setting. We further supplement our experiments with an analysis on a GEC learner corpus for Russian and German. We qualitatively and quantitatively analyze the perturbations created by our methodology and present its strengths as well as limitations, outlining some avenues for future research towards building more robust NMT systems.

References

Md Mahfuz Ibn Alam and Antonios Anastasopoulos. 2020. Fine-tuning MT systems for robustness to second-language speaker variations. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 149–158, Online. Association for Computational Linguistics.

Antonios Anastasopoulos. 2019. An analysis of source-side grammatical errors in NMT. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 213–223, Florence, Italy. Association for Computational Linguistics.

Antonios Anastasopoulos, Alison Lui, Toan Q. Nguyen, and David Chiang. 2019. Neural machine translation of text from non-native speakers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3070–3080, Minneapolis, Minnesota. Association for Computational Linguistics.

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 984–996, Hong Kong, China. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.

Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernstreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920–2935, Online. Association for Computational Linguistics.

A Appendix

A.1 UniMorph Example

An example from the German UniMorph dictionary is presented in Table 7.

Paradigm Form MSD

abspielen (‘play’) abgespielt (‘played’) V.PTCP;PST

abspielen (‘play’) abspielend (‘playing’) V.PTCP;PRS

abspielen (‘play’) abspielen (‘play’) V;NFIN

Table 7: Example inflections for the German verb abspielen (‘play’) from the UniMorph dictionary.

A.2 MT training

For all the languages in the TED corpus, we train Any→English models using the fairseq toolkit. Specifically, we use the ‘transformer_iwslt_de_en’ architecture, and train the model using the Adam optimizer. We use an inverse square root learning rate scheduler with 4000 warm-up update steps. In the linear warm-up phase, we use an initial learning rate of 1e-7 until the configured rate of 2e-4. We use the cross entropy criterion with label smoothing of 0.1.
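The hyperparameters above roughly correspond to a fairseq-train invocation of the following shape. This is a hedged sketch, not the authors' released command; the data path and the batching flag (--max-tokens) are assumptions.

```shell
# Hypothetical fairseq-train command matching the described setup.
fairseq-train data-bin/ted-x-eng \
    --arch transformer_iwslt_de_en \
    --optimizer adam \
    --lr 2e-4 --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096
```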

A.3 Dimension Analysis


Dimension        ces          deu          est          heb          kat           lit          rus          slv           swe         tur          ukr
ADJ.Animacy      -            -            -            -            -             -            3.51(0.89)   -             -           -            -
ADJ.Case         4.31(0.81)   -            -            -            10.67(2.59)   -            4.78(0.91)   -             5.05(5.05)  -            6.04(1.10)
ADJ.Comparison   -            -            -            -            -             7.99(0.46)   -            -             -           -            -
ADJ.Gender       3.83(0.78)   -            -            -            -             6.81(-1.35)  5.30(1.00)   -             -           -            -
ADJ.Number       4.07(0.78)   -            -            -            13.90(1.52)   6.31(-2.26)  4.67(0.94)   -             5.05(5.05)  7.92(2.23)   6.25(1.29)
ADJ.Person       -            -            -            -            -             -            -            -             -           8.89(2.43)   -
N.Animacy        -            -            -            -            -             -            6.53(1.19)   -             -           -            -
N.Case           6.94(0.81)   6.39(1.26)   12.35(1.50)  -            15.38(0.98)   -            6.65(1.20)   -             4.29(1.05)  14.39(2.37)  10.28(7.66)
N.Definiteness   -            -            -            -            -             -            -            -             8.36(1.61)  -            -
N.Number         5.44(0.77)   5.70(1.27)   8.10(1.33)   16.22(5.92)  14.46(0.66)   -            6.12(1.22)   -             4.30(1.52)  13.08(2.31)  21.20(15.96)
N.Possession     -            -            -            12.63(4.31)  -             -            -            -             -           -            -
V.Aspect         -            -            -            -            14.17(-0.38)  -            -            -             -           -            -
V.Gender         -            -            -            -            -             -            6.52(1.51)   -             -           -            -
V.Mood           13.17(2.78)  15.89(2.77)  -            -            11.11(0.58)   -            -            21.49(3.73)   -           -            -
V.Number         8.23(2.72)   32.86(8.12)  -            13.78(4.60)  9.02(1.33)    -            6.23(1.44)   21.47(-9.47)  -           -            -
V.Person         6.58(2.69)   6.22(1.50)   -            10.86(4.99)  12.37(1.33)   -            6.10(1.29)   -             -           -            -
V.Tense          -            -            -            17.52(7.13)  13.09(1.05)   -            6.59(1.61)   -             -           -            -
V.CVB.Tense      -            -            -            -            -             -            6.70(0.87)   -             -           -            9.09(2.62)
V.MSDR.Aspect    -            -            -            -            14.39(4.68)   -            -            -             -           -            -
V.PTCP.Gender    10.28(2.75)  -            -            -            -             -            -            -             -           -            -
V.PTCP.Number    9.31(2.51)   -            -            -            -             -            -            -             -           -            -

Table 8: Fine-grained analysis of X→English translation performance w.r.t. the perturbation type (POS, morphological feature dimension). The numbers reported in this table indicate the average % drop in sentence-level chrF for an adversarial perturbation on a token with the given POS along the given dimension. The numbers in parentheses indicate the average % drop for all the tested perturbations, including the adversarial perturbations.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 60–71, August 5, 2021. ©2021 Association for Computational Linguistics

Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems

Saujas Vaduguru1 Aalok Sathe2 Monojit Choudhury3 Dipti Misra Sharma1

1 IIIT Hyderabad 2 MIT BCS∗3 Microsoft Research India


Abstract

Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.

1 Introduction

In the last few years, the application of deep neural models has allowed rapid progress in NLP. Tasks in phonology and morphology have been no exception to this, with neural encoder-decoder models achieving strong results in recent shared tasks in phonology (Gorman et al., 2020) and morphology (Vylomova et al., 2020). However, the neural models that perform well on these tasks make use of hundreds, if not thousands, of training examples for each language. Additionally, the patterns that neural models identify are not interpretable. In this paper, we explore the problem of learning interpretable phonological and morphological rules from only a small number of examples, a task that humans are able to perform.

Consider the example of verb forms in the language Mandar presented in Table 1. How would a neural model tasked with filling the two blank cells do? The data comes from a language that

∗Work done while at the University of Richmond

to V        to be Ved
mappasuN    dipasuN
mattunu     ditunu
?           ditimbe
?           dipande

Table 1: Verb forms in Mandar (McCoy, 2018)

is not represented in large-scale text datasets that could allow the model to harness pretraining, and the number of samples presented here is likely not sufficient for the neural model to learn the task.

However, a human would fare much better at this task even if they didn't know Mandar. Identifying rules and patterns in a different language is a principal concern of a descriptive linguist (Brown and Ogilvie, 2010). Even people who aren't trained in linguistics would be able to solve such a task, as evidenced by contestants in the Linguistics Olympiads1, and general-audience puzzle books (Bellos, 2020). In addition to being able to solve the task, humans would be able to express their solution explicitly in terms of rules, that is to say, a program that maps inputs to outputs.

Program synthesis (Gulwani et al., 2017) is a method that can be used to learn programs that map an input to an output in a domain-specific language (DSL). It has been shown to be a highly sample-efficient technique to learn interpretable rules by specifying the assumptions of the task in the DSL (Gulwani, 2011).

This raises the questions: (i) Can program synthesis be used to learn linguistic rules from only a few examples? (ii) If so, what kind of rules can be learnt? (iii) What kind of operations need to be explicitly defined in the DSL to allow it to model linguistic rules? (iv) What knowledge must be implicitly provided with these operations to allow the model to choose rules that generalize well?

1 https://www.ioling.org/

In this work, we use program synthesis to learn phonological rules for solving Linguistics Olympiad problems, where only the minimal number of examples necessary to generalize are given (Sahin et al., 2020). We present a program synthesis model and a DSL for learning phonological rules, and curate a set of Linguistics Olympiad problems for evaluation.

We perform experiments and comparisons to baselines, and find that program synthesis does significantly better than our baseline approaches. We also present some observations about the ability of our system to find rules that generalize well, and discuss examples of where it fails.

2 Program synthesis

Program synthesis is “the task of automatically finding programs from the underlying programming language that satisfy (user) intent expressed in some form of constraints” (Gulwani et al., 2017). This method allows us to specify domain-specific assumptions as a language, and use generic synthesis approaches like FlashMeta (Polozov and Gulwani, 2015) to synthesize programs.

The ability to explicitly encode domain-specific assumptions gives program synthesis broad applicability to various tasks. In this paper, we explore applying it to the task of learning phonological rules. Whereas previous work on rule-learning has focused on learning rules of a specific type (Brill, 1992; Johnson, 1984), the DSL in program synthesis allows learning rules of different types, and in different rule formalisms.

In this work, we explore learning rules similar to rewrite rules (Chomsky and Halle, 1968) that are used extensively to describe phonology. Sequences of rules are learnt using a noisy disjunctive synthesis algorithm, NDSyn (Iyer et al., 2019), extended to learn stateful multi-pass rules (Sarthi et al., 2021).

2.1 Phonological rules as programs

The synthesis task we solve is to learn a program in a domain-specific language (DSL) for string transduction, that is, to transform a given sequence of input tokens i ∈ I∗ to a sequence of output tokens o ∈ O∗, where I is the set of input tokens and O is the set of output tokens. Each token is a symbol accompanied by a feature set, a set of key-value pairs that maps feature names to boolean values.

We learn programs for token-level examples, which transform an input token in its context to output tokens. The program is a sequence of rules which are applied to each token in an input string to produce the output string. The rules learnt are similar to rewrite rules, of the form

φ−l · · · φ−2 φ−1 X φ1 φ2 · · · φr → T

where (i) X : I → B is a boolean predicate that determines the input tokens to which the rule is applied, (ii) φi : I → B is a boolean predicate applied to the token at offset i relative to X, and the predicates φ collectively determine the context in which the rule is applied, and (iii) T : I → O∗ is a function that maps an input token to a sequence of output tokens. X and the φ belong to a set of predicates P, and T is a function belonging to a set of transformation functions T. P and T are specified by the DSL.

We allow the model to synthesize programs that apply multiple rules to a single token by synthesizing rules in passes and maintaining state from one pass to the next. This allows the system to learn stateful multi-pass rules (Sarthi et al., 2021).
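Applying one rule of the form above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the predicates X and φ and the transformation T are plain Python callables, and the toy rule at the end is entirely invented.

```python
# Minimal sketch of applying one rewrite-style rule
# phi_-l ... X ... phi_r -> T at the token level.

def apply_rule(tokens, X, context, T):
    """tokens: list of input tokens.
    X: predicate on the focus token.
    context: dict mapping relative offset i -> predicate phi_i.
    T: function mapping one token to a list of output tokens.
    Tokens matching X in the given context are transformed; all
    other tokens are copied through unchanged.
    """
    out = []
    for i, tok in enumerate(tokens):
        ok = X(tok)
        for off, phi in context.items():
            j = i + off
            ok = ok and 0 <= j < len(tokens) and phi(tokens[j])
        out.extend(T(tok) if ok else [tok])
    return out

# Invented toy rule: 't' after a vowel -> 'd'.
vowels = set("aeiou")
result = apply_rule(
    list("bat"),
    X=lambda t: t == "t",
    context={-1: lambda t: t in vowels},
    T=lambda t: ["d"],
)
print("".join(result))  # "bad"
```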

2.2 Domain-specific language

The domain-specific language (DSL) is the declarative language which defines the allowable string transformation operations. The DSL is defined by a set of operators, a grammar which determines how they can be combined, and a semantics which determines what each operator does. By defining operators to capture domain-specific phenomena, we can reduce the space of programs to be searched to include those programs that capture distinctions relevant to the domain. This allows us to explicitly encode knowledge of the domain into the system.

Each operator in the DSL also has an associated score that allows for setting domain-specific preferences for certain kinds of programs. We can combine the scores for each operator in a program to compute a ranking score that we can use to identify the most preferred program among candidates. The ranking score can capture implicit preferences like shorter programs, more or less general programs, certain classes of transformations, etc.

The DSL defines the predicates P and the set of transformations T that can be applied to a particular token. The predicates and transformations in the DSL we use, along with descriptions of their semantics, can be found in Tables 2 and 3.


Predicate

IsToken(w, s, i) Is x equal to the token s? This allows us to evaluate matches with specific tokens.

Is(w, f, i) Is f true for x? This allows us to generalize beyond single tokens and use features that apply to multiple tokens.

TransformationApplied(w, t, i) Has the transformation t been applied to x in a previous pass? This allows us to reference previous passes in learning rules for the current pass.

Not(p) Negates the predicate p.

Table 2: Predicates that are used for synthesis. The predicates are applied to a token x that is at an offset i from the current token in the word w. The offset may be positive to refer to tokens after the current token, zero to refer to the current token, or negative to refer to tokens before the current token.

Transformation

ReplaceBy(x, s1, s2) If x is s1, it is replaced with s2. This allows the system to learn conditional substitutions.

ReplaceAnyBy(x, s) x is replaced with s. This allows the system to learn unconditional substitutions.

Insert(x, S) This inserts a sequence of tokens S after x at the end of the pass. It allows for the insertion of variable-length affixes.

Delete(x) This deletes x from the word at the end of the pass.

CopyReplace(x, i)
CopyInsert(x, i) These are analogues of the ReplaceBy and Insert transformations where the token which is added is the same as the token at an offset i from x. They allow the system to learn phonological transformations such as assimilation and gemination.

Identity(x) This returns x unchanged. It allows the system to handle cases where a transformation applies under certain conditions but does not under others.

Table 3: Transformations that are used for synthesis. The transformations are applied to a token x in the word w. The offset i for the Copy transformations may be positive to refer to tokens after the current token, zero to refer to the current token, or negative to refer to tokens before the current token.

output      := Map(disjunction, input_tokens)
disjunction := Else(rule, disjunction)
rule        := transformation
             | IfThen(predicate, rule)

Figure 1: IfThen-Else statements in the DSL

Sequences of rules are learnt as disjunctions of IfThen operators, and are applied to each token of the input using a Map operator (Figure 1). The conjunction of predicates X and φ that define the context is learnt by nesting IfThen operators.

A transformation produces a token that is tagged with the transformation that was applied. This allows for maintaining state across passes.

The operators in our DSL are quite generic and can be applied to other string transformations as well. In addition to designing our DSL for string transformation tasks, we allow for phonological information to be specified as features, which are a set of key-value pairs that map attributes to boolean values. While we restrict our investigation to features based only on the symbols in the input, more complex features based on meaning and linguistic categories can be provided to a system that works on learning rules for more complex domains like morphology or syntax. We leave this investigation for future work.

2.3 Synthesis algorithm

We use an extension (Sarthi et al., 2021) of the NDSyn algorithm (Iyer et al., 2019) that can synthesize stateful multi-pass rules. Iyer et al. (2019) describe an algorithm for selecting disjunctions of rules, and use the FlashMeta algorithm as the rule synthesis component. Sarthi et al. (2021) extend the approach proposed by Iyer et al. (2019) for disjunctive synthesis to the task of grapheme-to-phoneme (G2P) conversion in Hindi and Tamil. They propose the idea of learning transformations on token-aligned examples, and use language-specific predicates and transformations to learn rules for G2P conversion. We use a similar approach, and use a different set of predicates and


[Figure 2 diagram: input word pairs (kæt → kæts, dOg → dOgz) are aligned into token-level examples (e.g., t → t, → s; g → g, → z), from which FlashMeta (FM) generates candidate rules:
#1 IfThen(IsToken(w,"$",1), IfThen(Is(w,"voice",0), Insert(x,"z")))
#2 IfThen(IsToken(w,"t",0), IfThen(IsToken(w,"$",1), Insert(x,"s")))
#3 IfThen(IsToken(w,"$",1), Insert(x,"s"))
#4 IfThen(IsToken(w,"g",0), IfThen(IsToken(w,"$",1), Insert(x,"z")))
#5 IfThen(Not(IsToken(w,"$",1)), Identity(x))
NDSyn then combines the selected rules into the program Else(#1, Else(#3, Else(#5))), with multi-pass synthesis feeding unsolved examples back in.]

Figure 2: An illustration of the synthesis algorithm. FM is FlashMeta, which synthesizes rules which are combined into a disjunction of rules by NDSyn. Here, rule #1 is chosen over #4 since it uses the more general concept of the voice feature as opposed to a specific token, and thus has a higher ranking score.

transformations that are language-agnostic. Figure 2 sketches the working of the algorithm.

The NDSyn algorithm is an algorithm for learning disjunctions of rules, of the form shown in Figure 1. Given a set of examples, it first generates a set of candidate rules using the FlashMeta synthesis algorithm (Polozov and Gulwani, 2015). This algorithm searches for a program in the DSL that satisfies a set of examples by recursively breaking down the search problem into smaller subproblems. Given an operator, and the input-output constraints it must satisfy, it infers constraints on each of the arguments to the operator, allowing it to recursively search for programs that satisfy these constraints on each of the arguments. For example, given the Is predicate and a set of examples where the predicate is true or false, the algorithm infers constraints on its arguments, the token s and the offset i, such that the set of examples is satisfied. The working of FlashMeta is illustrated with an example in Figure 3. We use the implementation of the FlashMeta algorithm available as part of the PROSE2 framework.

From the set of candidate rules, NDSyn selects a subset of rules with a high ranking score that correctly answers the most examples while incorrectly answering the fewest3. Additional details about the algorithm are provided in Appendix A.
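The selection objective just described can be caricatured as a greedy loop: repeatedly pick the candidate whose correct answers most exceed its incorrect ones on the still-unsolved examples. This is a hedged simplification (the actual NDSyn algorithm also incorporates the ranking score and is more involved); all names here are illustrative.

```python
# Greedy sketch of disjunctive rule selection: maximize correctly
# answered examples minus incorrectly answered ones per step.

def select_rules(candidates, examples, apply_fn):
    """candidates: list of rules; examples: list of (inp, out) pairs;
    apply_fn(rule, inp) returns a prediction, or None if the rule's
    context does not match (the rule abstains). Returns chosen rules."""
    chosen, unsolved = [], list(examples)
    while unsolved:
        def gain(rule):
            correct = sum(apply_fn(rule, i) == o for i, o in unsolved)
            wrong = sum(apply_fn(rule, i) not in (o, None)
                        for i, o in unsolved)
            return correct - wrong
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break  # no remaining rule helps more than it hurts
        chosen.append(best)
        unsolved = [(i, o) for i, o in unsolved if apply_fn(best, i) != o]
    return chosen

# Toy rules as dicts from input to output; missing key = abstain.
candidates = [{"a": "b", "c": "d"}, {"e": "f"}, {"a": "x"}]
examples = [("a", "b"), ("c", "d"), ("e", "f")]
print(select_rules(candidates, examples, lambda r, i: r.get(i)))
# [{'a': 'b', 'c': 'd'}, {'e': 'f'}]
```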

The synthesis of multi-pass rules proceeds in passes. In each pass, a set of token-aligned examples is provided as input to the NDSyn algorithm. The resulting rules are then applied to all the examples, and those that are not solved are passed as the set of examples to NDSyn in the next pass. This proceeds until all the examples are solved, or for a maximum number of passes.

2 https://www.microsoft.com/en-us/research/group/prose/

3 A rule will not produce any answer for examples that don't satisfy the context constraints of the rule.
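The pass structure of multi-pass synthesis can be sketched as a simple driver loop. The `synthesize` and `apply_rules` callables stand in for NDSyn and rule application; the toy instantiations below are purely illustrative.

```python
# Sketch of the multi-pass synthesis loop: synthesize rules for the
# current examples, drop the examples they solve, and re-feed the
# rest until everything is solved or the pass budget runs out.

def multi_pass_synthesis(examples, synthesize, apply_rules, max_passes=3):
    all_rules, remaining = [], list(examples)
    for _ in range(max_passes):
        if not remaining:
            break
        rules = synthesize(remaining)          # one pass of rule learning
        all_rules.append(rules)
        remaining = [(inp, out) for inp, out in remaining
                     if apply_rules(rules, inp) != out]  # still unsolved
    return all_rules, remaining

# Toy stand-ins: "learn" one input-output mapping per pass.
rules_per_pass, unsolved = multi_pass_synthesis(
    [("a", "b"), ("c", "d")],
    synthesize=lambda exs: dict(exs[:1]),
    apply_rules=lambda rules, inp: rules.get(inp),
)
print(rules_per_pass, unsolved)  # [{'a': 'b'}, {'c': 'd'}] []
```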

3 Dataset

To test the ability of our program synthesis system to learn linguistic rules from only a few examples, we require a task with a small number of training examples, and a number of test examples which measure how well the model generalises to unseen data. Additionally, to ensure a fair evaluation, the test examples should be chosen such that the samples in the training data provide sufficient evidence to correctly solve the test examples.

To this end, we use problems from the Linguistics Olympiad. The Linguistics Olympiad is an umbrella term describing contests for high school students across the globe. Students are tasked with solving linguistics problems, a genre of composition that presents linguistic facts and phenomena in enigmatic form (Derzhanski and Payne, 2010). These problems typically have two parts: the data and the assignments.

The data consists of examples where the solver is presented with the application of rules to some linguistic forms (words, phrases, sentences) and the forms derived by applying the rules to these forms. The data typically consists of 20-50 forms; only the minimal number of examples required to infer the correct rules is presented (Sahin et al., 2020).

The assignments provide other linguistic forms, and the solver is tasked with applying the rules inferred from the data to these forms. The forms in the assignments are carefully selected by the



Figure 3: An illustration of the search performed by the FlashMeta algorithm. The blue boxes show the specification that an operator must satisfy in terms of input-output examples, with the input token underlined in the context of the word. The Inverse Semantics of an operator is a function that is used to infer the specification for each argument of the operator based on the semantics of the operator. This may be a single specification (as for predicate) or a disjunction of specifications (as for token and offset). The algorithm then recursively searches for programs to satisfy the specification for each argument, and combines the results of the search to obtain a program. The search for the rule in an IfThen statement proceeds similarly to the search for a predicate. Examples of programs that are inferred from a specification are indicated with =⇒. A dashed line between inferred specifications indicates that the specifications are inferred jointly.

designer to test whether the solver has correctly inferred the rules, including making generalizations to unseen data. This allows us to see how much of the intended solution has been learnt by the solver by examining responses to the assignments.

The small number of training examples (data) tests the generalization ability and sample efficiency of the system, and presents a challenging learning problem for the system. The careful selection of test examples (assignment) lets us use them to measure how well the model learns these generalizations.

We present a dataset of 34 linguistics problems, collected from various publicly accessible sources. These problems are based on phonology, and some aspects of the morphology of languages, as well as the orthographic properties of languages. These problems are chosen such that the underlying rules depend only on the given word forms, and not on inherent properties of the word like grammatical gender or animacy. The problems involve (1) inferring phonological rules in morphological inflection (Table 4a), (2) inferring phonological changes between multiple related languages (Table 4b), (3) converting between the orthographic form of a language and the corresponding phonological form (Table 4c), and (4) marking the phonological stress on a given word (Table 4d). We refer to each of these categories of problems as morphophonology, multilingual, transliteration, and stress respectively. We further describe the dataset in Appendix B.4

3.1 Structure of the problems

Each problem is presented in the form of a matrix M. Each row of the matrix contains data pertaining to a single word/linguistic form, and each column contains the same form of different words, i.e., an inflectional or derivational paradigm, the word form in a particular language, the word in a particular script, or the stress values for each phoneme in a word. A test sample in this case is presented as a particular cell M_ij in the table that has to be filled. The model has to use the data from other words in the same row (M_i:) and the words in the same column (M_:j) to predict the form of the word in M_ij.

In addition to the data in the table, each problem contains some additional information about the symbols used to represent the words. This additional information is meant to help the solver understand the meaning of a symbol they may not have seen before. We manually encode this information in the feature set associated with each token for synthesis. Where applicable, we also add consonant/vowel distinctions in the given features, since this is a basic distinction assumed in the solutions to many Olympiad problems.

4The dataset is available here.

base form       negative form
joy             kas joya:ya'
bi:law          kas bika'law
tipoysu:da      ?
?               kas wurula:la'

(a) Movima negation

Turkish     Tatar
bandIr      mandIr
yelken      cilkan
?           osta
bilezik     ?

(b) Turkish and Tatar

Listuguj     Pronunciation
g'p'ta'q     g@b@da:x
epsaqtejg    epsaxteck
emtoqwatg    ?
?            @mtesk@m

(c) Micmac orthography

Aleut        Stress
tatul        01000
n@tG@lqin    000010000
sawat        ?
qalpuqal     00001000

(d) Aleut stress

Table 4: A few examples from different types of Linguistics Olympiad problems. '?' represents a cell in the table that is part of the test set.

We use the assignments that accompany every problem as the test set, ensuring that the correct answer can be inferred based on the given data.

3.2 Dataset statistics

The dataset we present is highly multilingual. The 34 problems contain samples from 38 languages, drawn from across 19 language families. There are 15 morphophonology problems, 7 multilingual problems, 6 stress, and 6 transliteration problems. The set contains 1452 training words with an average of 43 words per problem, and 319 test words with an average of 9 per problem. Each problem has a matrix that has between 7 and 43 rows, with an average of 23. The number of columns ranges from 2 to 6, with most problems having 2.

4 Experiments

4.1 Baselines

Given that we model our task as string transduction, we compare with the following transduction models used as baselines in shared tasks on G2P conversion (Gorman et al., 2020) and morphological reinflection (Vylomova et al., 2020).

Neural: We use LSTM-based sequence-to-sequence models with attention as well as Transformer models as implemented by Wu (2020). For each problem, we train a single neural model that takes the source and target column numbers, and the source word, and predicts the target word.

WFST: We use models similar to the pair n-gram models (Novak et al., 2016), with the implementation similar to that used by Lee et al. (2020). We train a model for each pair of columns in a problem. For each test example M_ij, we find the column with the smallest index j' such that M_ij' is non-empty and use M_ij' as the source string to infer M_ij.

Additional details of the baselines are provided in Appendix C.

4.2 Program synthesis experiments

As discussed in Section 3.1, the examples in a problem are in a matrix, and we synthesize programs to transform entries in one column to entries in another. Given a problem matrix M, we refer to a program to transform an entry in column i to an entry in column j as M_:i → M_:j. To obtain token-level examples, we use the Smith-Waterman alignment algorithm (Smith et al., 1981), which favours contiguous sequences in aligned strings.
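For reference, a minimal character-level Smith-Waterman aligner looks like the following. The scoring parameters are our own illustrative choices, not those used in the paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two token sequences (here: strings).
    Returns the best-scoring pair of aligned substrings, with '-' for gaps."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_ij = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            if H[i][j] > best:
                best, best_ij = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    out_a, out_b = [], []
    i, j = best_ij
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i, j = i - 1, j - 1
        elif H[i][j] == H[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

On the Turkish/Tatar pair from Table 4b, the aligner keeps the shared contiguous suffix and drops the mismatched initial consonant.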

We train three variants of our synthesis system with different scores for the Is and IsToken operators. The first one, NOFEATURE, does not use features, or the Is predicate. The second one, TOKEN, assigns a higher score to IsToken and prefers more specific rules that reference tokens. The third one, FEATURE, assigns a higher score to Is and prefers more general rules that reference features instead of tokens. All other aspects of the model remain the same across variants.

Morphophonology and multilingual problems: For every pair of columns (s, t) in the problem matrix M, we synthesize the program M_:s → M_:t. To predict the form of a test sample M_ij, we find a column k such that the program M_:k → M_:j has the best ranking score, and evaluate it on M_ik.

Transliteration problems: Given a problem matrix M, we construct a new matrix M' for each pair of columns (s, t) such that all entries in M' are in the same script. We align word pairs (M_is, M_it) using the Phonetisaurus many-to-many alignment tool (Jiampojamarn et al., 2007), and build a simple mapping f for each source token to the target token with which it is most frequently aligned. We fill in M'_is by applying f to each token of M_is and


Model        All            Morphophonology  Multilingual    Transliteration  Stress
             EXACT   CHRF   EXACT   CHRF     EXACT   CHRF    EXACT   CHRF     EXACT

NOFEATURE    26.8%   0.64   30.1%   0.72     42.1%   0.59    12.0%   0.51     15.4%
TOKEN        32.7%   0.63   37.5%   0.68     45.3%   0.60    16.4%   0.52     22.2%
FEATURE      30.9%   0.51   38.6%   0.56     39.9%   0.42    9.5%    0.49     23.0%

LSTM         8.2%    0.44   9.2%    0.49     5.7%    0.45    2.1%    0.31     15.0%
Transformer  5.4%    0.42   2.3%    0.39     9.2%    0.50    1.7%    0.42     12.6%
WFST         20.9%   0.56   16.3%   0.47     38.7%   0.63    29.7%   0.71     2.8%

Table 5: Metrics for all problems, and for problems of each type. The CHRF score for stress problems is not calculated, and is not used to determine the overall CHRF score.

M'_it = M_it. We then find a program M'_:s → M'_:t.

Stress problems: For these problems, we do not perform any alignment, since the training pairs are already token-aligned. The synthesis system learns to transform the source string to the sequence of stress values.
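The frequency-based mapping f used for transliteration problems can be sketched as follows, assuming alignment links have already been extracted (e.g., from Phonetisaurus output) as (source token, target token) pairs:

```python
from collections import Counter, defaultdict

def build_token_map(aligned_pairs):
    """Map each source token to the target token it is most frequently
    aligned with, given (source_token, target_token) alignment links."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}
```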

4.3 Metrics

We calculate two metrics: exact match accuracy, and CHRF score (Popovic, 2015). The exact match accuracy measures the fraction of examples the synthesis system gets fully correct.

EXACT = (#correctly predicted test samples) / (#test samples)

The CHRF score is calculated only at the token level, and measures the n-gram overlaps between the predicted answer and the true answer, and allows us to measure partially correct answers. We do not calculate the CHRF score for stress problems, as n-gram overlap is not a meaningful measure of performance for these problems.

4.4 Results

Table 5 summarizes the results of our experiments. We report the average of each metric across problems, for all problems and by category.

We find that neural models that don't have specific inductive biases for the kind of tasks we present here are not able to perform well with this amount of data. The synthesis models do better than the WFST baseline overall, and on all types of problems except transliteration. This could be due to the simple map computed from alignments before program synthesis causing errors that the rule learning process cannot correct.

5 Analysis

We examine two aspects of the program synthesis models we propose. The first is the way they use the explicit knowledge in the DSL and the implicit knowledge provided as the ranking score to generalize. We then consider specific examples of problems, and show examples of where our models succeed and fail in learning different types of patterns.

Model        100%   ≥ 75%   ≥ 50%
NOFEATURE    3      5       7
TOKEN        3      6       10
FEATURE      3      6       11
WFST         1      2       7

Table 6: Number of problems where the model achieves different thresholds of the EXACT score.

5.1 Features aid generalization

Since the test examples are chosen to test specific rules, solving more test examples correctly is indicative of the number of rules inferred correctly. In Table 6, we see that providing the model with features allows it to infer more general rules, solving a greater fraction of more problems. We see that allowing the model to use features increases its performance, and having it prefer more general rules involving features lets it do even better.

5.2 Correct programs are short

In Figure 4 we see that the number of rules in a problem5 tends to be higher when the model gets the problem wrong, than when it gets it right. This indicates that when the model finds many specific rules, it overfits to the training data, and fails to generalize well. This holds true for all the variants, as seen in the downward slope of the lines.

We also find that allowing and encouraging a model to use features leads to shorter programs. The average length of a program synthesized by NOFEATURE is 30.5 rules, while it is 25.8 for TOKEN, and 20.7 for FEATURE. This suggests that explicit access to features, and an implicit preference for them, leads to fewer, more general rules.

5To account for some problems having more columns than others (and hence more rules), we find the average number of rules for each pair of columns.

Figure 4: Number of rules plotted against EXACT score

5.3 Using features

Some problems provide additional information about certain sounds. For example, a problem based on the alternation of retroflexes in Warlpiri words (Laughren, 2011) explicitly identifies retroflex sounds in the problem statement. In this case, a program produced by our FEATURE system is able to use these features, and isolate the focus of the problem by learning rules such as

IfThen(Not(Is(w, "retroflex", 0)), Identity(x))

The system learns a concise solution, and is able to generalize using features rather than learning separate rules for individual sounds.

In the case of inflecting a Mandar verb (McCoy, 2018), the FEATURE system uses a feature to find a rule that appears more general than is actually the case. To capture the rule that the prefix di- changes to mas- when the root starts with s, the model synthesizes

IfThen(Is(w, "fricative", 1), ReplaceBy(x, "i", "s"))

However, since s is the only fricative in the data, this rule is equivalent to a rule specific to s. This rule also covers examples where the root starts with s, and causes the model to miss the more general rule that a voiceless sound at the beginning of the root is copied to the end of the prefix. It identifies this rule only for roots starting with p, as

IfThen(IsToken(w, "p", 1), CopyReplace(x, w, 1))

The TOKEN system does not synthesize these rules based on features, and instead chooses rules specific to each initial character in the root.

Since the DSL allows for substituting one token with one other, or inserting multiple tokens, the system has to use multiple rules to substitute one token with multiple tokens. In the case of Mandar, we see one way it does this, by performing multiple substitutions (to transform di- to mas- it replaces d and i with a and s respectively, and then inserts m).
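A toy version of this composition of single-token rules, using our own simplified rule representation rather than the DSL itself (the root "ita" below is hypothetical, chosen only to show the mechanics):

```python
def apply_token_rules(word, rules):
    """Apply a sequence of (kind, position, argument) rules to a word."""
    tokens = list(word)
    for kind, pos, arg in rules:
        if kind == "replace":
            tokens[pos] = arg
        elif kind == "insert":
            tokens.insert(pos, arg)
    return ''.join(tokens)

# d -> a, i -> s, then insert m at the front: the prefix di- becomes mas-.
rules = [("replace", 0, "a"), ("replace", 1, "s"), ("insert", 0, "m")]
```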

5.4 Multi-pass rules

In a problem on Somali verb forms (Somers, 2016), we see a different way of handling multi-token substitutions: using multi-pass rules to create a complex rule from simpler elements. The problem requires being able to convert verbs from 1st person to 3rd person singular. The solution includes a rule where a single token (l) is replaced with two tokens (sh). The learned program uses two passes to capture this rule through the sequential application of two rules: first ReplaceBy(x, "l", "h"), followed by

IfThen(TransformationApplied(w, "ReplaceBy , h", 1), Insert(x, "s"))
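The effect of this two-pass rule can be sketched as follows, with a hypothetical word ending in l; pass 1 tags the transformed positions so that pass 2 can condition on them:

```python
def pass_one(word):
    """Replace l with h, tagging each position that was transformed."""
    return [('h', True) if t == 'l' else (t, False) for t in word]

def pass_two(tokens):
    """Insert s before every token tagged as transformed in pass 1."""
    out = []
    for tok, transformed in tokens:
        if transformed:
            out.append('s')
        out.append(tok)
    return ''.join(out)
```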

5.5 Selecting spans of the input

In a problem involving reduplication in Tarangan (Warner, 2019), all variants fail to capture any synthesis rules. Reduplication in Tarangan involves copying one or two syllables in the source word to produce the target word. However, the DSL we use does not have any predicates or transformations that allow the system to reference a span of multiple tokens (which would form a syllable) in the input. Therefore, it fails to model reduplication.

5.6 Global constraints

Since we provide the synthesis model with token-level examples, it does not have access to word-level information. This results in poor performance on stress problems, as stress depends on the entire word. Consider the example of Chickasaw stress (Vaduguru, 2019). The model correctly learns the rule

IfThen(Is(w, "long", 0), ReplaceAnyBy(x, "1"))

that stresses any long vowel in the word. However, since it cannot check if the word has a long vowel that has already been stressed, it is not able to correctly model the case when the word doesn't have a long vowel. This results in some samples being marked with stress at two locations, one where


the rule for long vowels applies, and one where the rule for words without long vowels applies.

6 Related work

Gildea and Jurafsky (1996) also study the problem of learning phonological rules from data, and explicitly controlling generalization behaviour. We pursue a similar goal, but in a few-shot setting.

Barke et al. (2019) and Ellis et al. (2015) study program synthesis applied to linguistic rule learning. They make much stronger assumptions about the data (the existence of an underlying form, and the availability of additional information like IPA features). We take a different approach, and study program synthesis models that can work only on the tokens in the word (like NOFEATURE), and also explore the effect of providing features in these cases. We also test our approach on a more varied set of problems that involves aspects of morphology, transliteration, multilinguality, and stress.

Sahin et al. (2020) also present a set of Linguistics Olympiad problems as a test of the metalinguistic reasoning abilities of NLP models. While problems in their set involve finding phonological rules, they also require knowledge of syntax and semantics that is out of the scope of our study. We present a set of problems that only requires reasoning about surface word forms, without requiring their meanings.

7 Conclusion

In this paper, we explore the problem of learning linguistic rules from only a few training examples. We approach this using program synthesis, and demonstrate that it is a powerful and flexible technique for learning phonological rules in Olympiad problems. These problems are designed to be challenging tasks that require learning rules from a minimal number of examples. These problems also allow us to specifically test for generalization.

We compare our approach to various baselines, and find that it is capable of learning phonological rules that generalize much better than existing approaches. We show that using the DSL, we can explicitly control the structure of rules, and using the ranking score, we can provide the model with implicit preferences for certain kinds of rules.

Having demonstrated the potential of program synthesis as a learning technique that can work with very little data and provide human-readable models, we hope to apply it to learning more complex types of linguistic rules in the future.

In addition to being a way to learn rules from data, the ability to explicitly control the generalization behaviour of the model allows for the use of program synthesis to understand the kinds of learning biases and operations that are required to model various linguistic processes. We leave this exploration to future work.

Acknowledgements

We would like to thank Partho Sarthi for invaluable help with PROSE and NDSyn. We would also like to thank the authors of the ProLinguist paper for their assistance. Finally, we would like to thank the anonymous reviewers for their feedback.

References

Shraddha Barke, Rose Kunkel, Nadia Polikarpova, Eric Meinhardt, Eric Bakovic, and Leon Bergen. 2019. Constraint-based learning of phonological processes. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6176–6186, Hong Kong, China. Association for Computational Linguistics.

A. Bellos. 2020. The Language Lover's Puzzle Book: Lexical perplexities and cracking conundrums from across the globe. Guardian Faber Publishing.

Eric Brill. 1992. A simple rule-based part of speech tagger. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992.

Keith Brown and Sarah Ogilvie. 2010. Concise encyclopedia of languages of the world. Elsevier.

Noam Chomsky and Morris Halle. 1968. The sound pattern of English.

Ivan Derzhanski and Thomas Payne. 2010. The Linguistic Olympiads: academic competitions in linguistics for secondary school students, pages 213–226. Cambridge University Press.

Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. 2015. Unsupervised learning by program synthesis. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 973–981. Curran Associates, Inc.

Daniel Gildea and Daniel Jurafsky. 1996. Learning bias and phonological-rule induction. Computational Linguistics, 22(4):497–530.

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50, Online. Association for Computational Linguistics.

Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330.

Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119.

Arun Iyer, Manohar Jonnalagedda, Suresh Parthasarathy, Arjun Radhakrishna, and Sriram K Rajamani. 2019. Synthesis and machine learning for heterogeneous extraction. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 301–315.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372–379, Rochester, New York. Association for Computational Linguistics.

Mark Johnson. 1984. A discovery procedure for certain phonological rules. In 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, pages 344–347, Stanford, California, USA. Association for Computational Linguistics.

Mary Laughren. 2011. Stopping and flapping in Warlpiri. In Dragomir Radev and Patrick Littell, editors, North American Computational Linguistics Olympiad 2011: Invitational Round. North American Computational Linguistics Olympiad.

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation modeling with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223–4228, Marseille, France. European Language Resources Association.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Tom McCoy. 2018. Better left unsaid. In Patrick Littell, Tom McCoy, Dragomir Radev, and Ali Sharman, editors, North American Computational Linguistics Olympiad 2018: Invitational Round. North American Computational Linguistics Olympiad.

Josef Robert Novak, Nobuaki Minematsu, and Keikichi Hirose. 2016. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 22(6):907–938.

Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A framework for inductive program synthesis. SIGPLAN Not., 50(10):107–126.

Maja Popovic. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Gozde Gul Sahin, Yova Kementchedjhieva, Phillip Rust, and Iryna Gurevych. 2020. PuzzLing Machines: A Challenge on Learning From Small Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1241–1254, Online. Association for Computational Linguistics.

Partho Sarthi, Monojit Choudhury, Arun Iyer, Suresh Parthasarathy, Arjun Radhakrishna, and Sriram Rajamani. 2021. ProLinguist: Program Synthesis for Linguistics and NLP. IJCAI Workshop on Neuro-Symbolic Natural Language Inference.

Temple F. Smith, Michael S. Waterman, et al. 1981. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197.

Harold Somers. 2016. Changing the subject. In Andrew Lamont and Dragomir Radev, editors, North American Computational Linguistics Olympiad 2016: Invitational Round. North American Computational Linguistics Olympiad.

Saujas Vaduguru. 2019. Chickasaw stress. In Shardul Chiplunkar and Saujas Vaduguru, editors, Panini Linguistics Olympiad 2019. Panini Linguistics Olympiad.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online. Association for Computational Linguistics.

Elysia Warner. 2019. Tarangan. In Samuel Ahmed, Bozhidar Bozhanov, Ivan Derzhanski (technical editor), Hugh Dobbs, Dmitry Gerasimov, Shinjini Ghosh, Ksenia Gilyarova, Stanislav Gurevich, Gabrijela Hladnik, Boris Iomdin, Bruno L'Astorina, Tae Hun Lee (editor-in-chief), Tom McCoy, Andre Nikulin, Miina Norvik, Tung-Le Pan, Aleksejs Pegusevs, Alexander Piperski, Maria Rubinstein, Daniel Rucki, Arturs Semenuks, Nathan Somers, Milena Veneva, and Elysia Warner, editors, International Linguistics Olympiad 2019. International Linguistics Olympiad.

Shijie Wu. 2020. Neural transducer. https://github.com/shijie-wu/neural-transducer/.

A NDSyn algorithm

We use the NDSyn algorithm to learn disjunctions of rules. We apply NDSyn in multiple passes to allow the model to learn multi-pass rules.

At each pass, the algorithm learns rules to perform token-level transformations that are applied to each element of the input sequence. The token-level examples are passed to NDSyn, which learns the if-then-else statements that constitute a set of rules. This is done by first generating a set of candidate rules by randomly sampling a token-level example and synthesizing a set of rules that satisfy the example. Then, rules are selected to cover the token-level examples.

Rules that satisfy a randomly sampled example are learnt using the FlashMeta program synthesis algorithm (Polozov and Gulwani, 2015). The synthesis task is given by the DSL operator P and the specification of constraints X that the synthesized program must satisfy. In our application, this specification is in the form of token-level examples, and the DSL operators are the predicates and transformations defined in the paper. The algorithm recursively decomposes the synthesis problem (P, X) into smaller tasks (P_i, X_i) for each argument P_i to the operator. X_i is inferred using the inverse semantics of the operator P_i, which is encoded as a witness function. The inverse semantics provides the possible values for the arguments of an operator, given the output of the operator. We refer the reader to the paper by Polozov and Gulwani (2015) for a full description of the synthesis algorithm.

After the candidates are generated, they are ranked according to a ranking score of each program. The ranking score for an operator in a program is computed as a function of the scores of its arguments. The arguments may be other operators, offsets, or other constants (like tokens or features). The score for an operator in the argument is computed recursively. The score for an offset favours smaller numbers and local rules by decreasing the score for larger offsets. The score for other constants is chosen to be a small negative constant. The scores for the arguments are added up, along with a small negative penalty to favour shorter programs, to obtain the final score for the operator.
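A schematic version of this recursive scoring might look like the following. The constants and the tuple-based program representation are our assumptions for illustration, not the actual implementation.

```python
LENGTH_PENALTY = -0.1   # per-operator penalty: favours shorter programs
CONSTANT_SCORE = -0.05  # small negative score for token/feature constants
OFFSET_WEIGHT = -0.2    # larger offsets score worse: favours local rules

def ranking_score(node):
    """Recursively score a program represented as nested tuples,
    e.g. ("IfThen", ("IsToken", "w", "a", -1), ("Identity", "x"))."""
    if isinstance(node, int):   # an offset argument
        return OFFSET_WEIGHT * abs(node)
    if isinstance(node, str):   # a token or feature constant
        return CONSTANT_SCORE
    # An operator: score its arguments recursively, plus a length penalty.
    _op, *args = node
    return LENGTH_PENALTY + sum(ranking_score(a) for a in args)
```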

This ranking score selects for programs that are shorter, and favours either more general rules, by giving the Is predicate a higher score (FEATURE), or more specific rules, by giving the IsToken predicate a higher score (TOKEN). The top k programs according to the ranking function are chosen as candidates for the next step.

To choose the final set of rules from the candidates generated using the FlashMeta algorithm, we use a set covering algorithm that chooses the rules that correctly answer the most examples while also incorrectly answering the fewest. These rules are applied to each example, and the output tokens are tagged with the transformation that is applied. These outputs are then the input to the next pass of the algorithm.

B Dataset

We select problems from various Linguistics Olympiads to create our dataset. We include publicly available problems that have appeared in Olympiads before. We choose problems that only involve rules based on the symbols in the data, and not based on knowledge of notions such as gender, tense, case, or semantic role. These problems are based on the phonology of a particular language, and include aspects of morphology and orthography, and maybe also the phonology of a different language. In some cases where a single Olympiad problem involves multiple components that can be solved independently of each other, we include them as separate problems in our dataset.

We put the data and assignments in a matrix, as described in Section 3.1. We separate tokens in a word by a space while transcribing the problems from their source PDFs. We do not separate diacritics as different tokens, and include them as part of the same token. For each token in the Roman script, we add the boolean features vowel and consonant, and manually tag the tokens according to whether


they are a vowel or consonant.

We store the problems in JSON files with details about the languages, the families to which the languages belong, the data matrix, the notes used to create the features, and the feature sets for each token.
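A hypothetical sketch of what one such JSON problem file might contain; the actual field names and notation in the released dataset may differ.

```json
{
  "languages": ["Movima"],
  "families": ["(language isolate)"],
  "data": [["j o y", "k a s   j o y a: y a'"],
           ["t i p o y s u: d a", "?"]],
  "notes": "illustrative transcription notes for the symbols used",
  "features": {"j": ["consonant"], "o": ["vowel"]}
}
```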

C Baselines

C.1 Neural

Following Sahin et al. (2020), we use small neural models for sequence-to-sequence tasks. We train a single neural model for each task, and provide the column numbers as tags in addition to the source sequence. We find that the single model approach works better than training a model for each pair of columns.

LSTM: We use LSTM models with soft attention (Luong et al., 2015), with embeddings of size 64, hidden layers of size 128, a 2-layer encoder and a single-layer decoder. We apply a dropout of 0.3 for all layers. We train the model for 100 epochs using the Adam optimizer with a learning rate of 10^-3, learning rate reduction on plateau, and a batch size of 2. We clip the gradient norm to 5.

Transformer: We use Transformer models (Vaswani et al., 2017) with embeddings of size 128, hidden layers of size 256, a 2-layer encoder and a 2-layer decoder. We apply a dropout of 0.3 for all layers. We train the model for 2000 steps using the Adam optimizer with a learning rate of 10^-3, warmup of 400 steps, learning rate reduction on plateau, and a batch size of 2. We use a label smoothing value of 0.1, and clip the gradient norm to 1.

We use the implementations provided at https://github.com/shijie-wu/neural-transducer/ for all neural models.

C.2 WFST

We use the implementation of the WFST models available at https://github.com/sigmorphon/2020/tree/master/task1/baselines/fst. We train a model for each pair of columns. We report the results for models of order 5, which were found to perform the best on the test data (highest EXACT score) among models of order 3 to 9.


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 72–81, August 5, 2021. ©2021 Association for Computational Linguistics

Findings of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering

Adam Wiemerslage† Arya McCarthy‡ Alexander Erdmann∇

Garrett Nicolaiψ Manex Agirrezabalφ Miikka Silfverbergψ Mans Hulden† Katharina Kann†

†University of Colorado Boulder ‡Johns Hopkins University ∇Ohio State University
φUniversity of Copenhagen ψUniversity of British Columbia
adam.wiemerslage,[email protected]

Abstract

We describe the second SIGMORPHON shared task on unsupervised morphology: the goal of the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering is to cluster word types from a raw text corpus into paradigms. To this end, we release corpora for 5 development and 9 test languages, as well as gold partial paradigms for evaluation. We receive 14 submissions from 4 teams that follow different strategies, and the best performing system is based on adaptor grammars. Results vary significantly across languages. However, all systems are outperformed by a supervised lemmatizer, implying that there is still room for improvement.

1 Introduction

In recent years, most research in the area of computational morphology has focused on the application of supervised machine learning methods to word inflection: generating the inflected forms of a word, often a lemma, in order to express certain grammatical properties. For example, a supervised inflection system for Spanish might be provided with the lemma disfrutar (English: to enjoy) and morphological features such as indicative, present tense & 1st person singular, and generate the corresponding inflected form disfruto as output.

However, a supervised machine learning setup is quite different from a human first language (L1) acquisition setting. Young children must learn to segment a continuous speech signal into discrete words and perform unsupervised classification, decoding, and eventually, inference with incomplete feedback on this noisy input. The task of unsupervised paradigm clustering aims to replicate one of the steps in this process, namely, the grouping of word forms belonging to the same lexeme into inflectional paradigms. In this unsupervised task, a system does not know about lemmas. Furthermore,

[Figure 1 shows raw text ("...disfruta de la vida lo más que puedas! Si disfrutamos de la vida, ...") with the Spanish forms disfruto, disfrutas, disfruta, disfrutamos, disfrutáis, and disfrutan clustered under the lemma disfrutar and aligned with the slots V;IND;PRS;{1,2,3};{SG,PL}.]

Figure 1: Unsupervised morphological paradigm clustering consists of clustering word forms from raw text into paradigms.

neither does it know (a) the features for which a lemma typically inflects, nor (b) the number of distinct inflected forms which constitute the paradigm.

A successful unsupervised paradigm clustering system leverages common patterns in the language's inflectional morphology while simultaneously ignoring regular circumstantial similarities along with derivational patterns. For example, an accurate unsupervised system must recognize that disfrutamos (English: we enjoy) and disfruta (English: he/she/it enjoys) are inflected variants of the same paradigm, but that the orthographically similar disparamos (English: we shoot) belongs to a separate paradigm. Likewise, a successful system for English will recognize that walk and walked belong to the same verbal paradigm but walker is a derived form belonging to a distinct nominal paradigm. Such fine-grained distinctions are difficult to learn in an unsupervised manner.

This paper describes the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. Participants are asked to submit systems which cluster words from the Bible into inflectional paradigms.1 Participants are not allowed to use any external resources. Four teams submit at least one system for the shared task and

1 Bible translations for five development and nine test languages were obtained from the Johns Hopkins University Bible Corpus introduced by McCarthy et al. (2020b).


all teams also submit a system description paper.

The shared task systems can be grouped into

two broad categories: similarity-based systems experiment with different combinations of orthographic and embedding-based similarity metrics for word forms combined with clustering methods like k-means or agglomerative clustering. Grammar-based methods instead learn grammars or rules from the data and either apply these to clustering directly, or first segment words into stems and affixes and then cluster forms which share a stem into paradigms. Our official baseline, described in Section 2.3, is based on grouping together word forms sharing a common substring of length ≥ k, where k is a hyperparameter. Grammar-based systems obtain higher average F1 scores (see Section 2.2 for details on evaluation) across the nine test languages than the baseline. The Edinburgh system has the best overall performance: it outperforms the baseline by 34.61% F1 and the second best system by 1.84% F1.

The rest of the paper is organized as follows: Section 2 describes the task of unsupervised morphological paradigm clustering in detail, including the official baseline and all provided datasets. Section 3 gives an overview of the participating systems. Section 4 describes the official results, and Section 5 presents an analysis. Finally, Section 6 contains a discussion of where the task can move in future iterations and concludes the paper.

2 Task Description

Unsupervised morphological paradigm clustering consists of, given a raw text corpus, grouping words from that corpus into their paradigms without any additional information. Recent work in unsupervised morphology has attempted to induce full paradigms from corpora with only a subset of all types. Kann et al. (2020) and Erdmann et al. (2020) explore initial approaches to this task, which is called unsupervised morphological paradigm completion, but find it to be challenging. Building upon the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion (Kann et al., 2020), our shared task is focused on a subset of the overall problem: sorting words into paradigms. This can be seen as an initial step to paradigm completion, as unobserved types do not need to be induced, and the inflectional categories of paradigm slots do not need to be considered.

2.1 Data

Languages The SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering features 5 development languages: Maltese, Persian, Portuguese, Russian, and Swedish. The final evaluation is done on 9 test languages: Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.

Our languages span 4 writing systems, and represent fusional, agglutinative, templatic, and polysynthetic morphologies. The languages in the development set are mostly suffixing, except for Maltese, which is a templatic language. While most of the test languages are also predominantly suffixing, Navajo employs prefixes and Basque uses both prefixes and suffixes.

Text Corpora We provide corpora from the Johns Hopkins University Bible Corpus (JHUBC) (McCarthy et al., 2020b) for all development and test languages. This is the only resource that systems are allowed to use.

Gold Partial Paradigms Along with the Bibles, we also release a set of gold partial paradigms for the development languages to be used for system development. Gold data sets are also compiled for the test languages, but these test sets are withheld until the completion of the shared task.

In order to produce gold partial paradigms, we first take the set of all paradigms Π for each language from UniMorph (McCarthy et al., 2020a). We then obtain gold partial paradigms ΠG = Π ∩ Σ, where Σ is the set of types attested in the Bible corpus. Finally, we sample up to 1000 of the resulting gold partial paradigms for each language, resulting in the set ΠG, according to the following steps:

1. Group gold paradigms in ΠG by size, resulting in the set G, where gk ∈ G is the group of paradigms with k forms in it.

2. Continually loop over all gk ∈ G and randomly sample one paradigm from gk until we have 1000 paradigms.

Because not every token in the Bible corpora is in UniMorph, we can only evaluate on the subset of paradigms that exist in the UniMorph database. In practice, this means that for several languages, we are not able to sample 1000 paradigms, cf. Tables 1 and 2. Notably, for Basque, we can only provide 12 paradigms.
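The two sampling steps can be sketched as follows. This is a minimal illustrative re-implementation; the exact shuffling, tie-breaking, and iteration order of the official sampling script are assumptions.

```python
import random

def sample_gold_paradigms(paradigms, n=1000, seed=0):
    """Stratified sampling sketch: group paradigms by size, then
    round-robin over the size groups, drawing one random paradigm
    per group until n paradigms are collected (or none remain)."""
    rng = random.Random(seed)
    groups = {}
    for p in paradigms:
        groups.setdefault(len(p), []).append(p)
    for g in groups.values():
        rng.shuffle(g)  # randomize which paradigm is drawn from each group
    sampled = []
    while len(sampled) < n and any(groups.values()):
        for k in sorted(groups):
            if groups[k] and len(sampled) < n:
                sampled.append(groups[k].pop())
    return sampled
```

A first pass over the size groups thus yields one paradigm of each attested size, which keeps the sample spread across small and large paradigms.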


                       Maltese  Persian  Portuguese  Russian  Swedish
# Lines                   7761     7931       31167    31102    31168
# Tokens                193257   227584      828861   727630   871707
# Types                  16017    11877       31446    46202    25913
TTR                       .083     .052        .038     .063      .03
# Paradigms                 76       64        1000     1000     1000
# Forms in paradigms       572      446       11430     6216     3596
Largest paradigm size       14       20          47       17        9

Table 1: Statistics for the development Bible corpora and the dev gold partial paradigms. TTR is the type-token ratio in the corpus. The statistics for the paradigms reflect only those words in our partial paradigms, not the full paradigms from UniMorph.

                       English  Navajo  Spanish  Finnish  Bulgarian  Basque  Kannada  German  Turkish
# Lines                   7728    5058     7337    31087      31101    7958     7863   31102    30182
# Tokens                236465  104631   251581   685699     801657  195459   193213  826119   616418
# Types                   7144   18799     9755    54635      37048   18376    28561   22584    59458
TTR                        .03     .18     .039      .08       .046    .094     .148    .027     .096
# Paradigms               1000      88      990     1000       1000      12       92    1000     1000
# Forms in paradigms      2475     214     5154     8509       5086      63      933    3628     9204
Largest paradigm size        7      13       34       31         27      25       44      15       49

Table 2: Statistics for the test Bible corpora and the test gold partial paradigms.

[Figure 2 shows a gold paradigm G1 (dependemos, dependem, dependerá, dependesse, depende, dependiam) and two predicted paradigms: P1 containing the same six forms plus desfrutarem, and P2 containing dependesse, depende, and desonrares.]

Figure 2: An example matching of predicted paradigms in blue, and a gold paradigm in green. Words in red do not exist in the gold set, and thus cannot be evaluated.

2.2 Evaluation

As our task is entirely unsupervised, evaluation is not straightforward: as in Kann et al. (2020), our evaluation requires a mapping from predicted paradigms to gold paradigms. Because our set of gold partial paradigms does not cover all words in the corpus, in practice we only evaluate against a subset of the clusters predicted by systems.

For these reasons, we want an evaluation that assesses the best matching paradigms, ignoring predicted forms that do not occur in the gold set, but still punishing spurious predictions that are in the gold set. For example, Figure 2 shows two candidate matches for a gold partial paradigm. Each one contains a word that does not exist in the set of gold paradigms, and thus cannot be judged: these words are ignored and do not affect evaluation. In this example, the predicted P1 is a better match, resulting in a perfect F1 score. However, our evaluation punishes systems for predicting a second paradigm, P2, with words from G1, reducing the overall precision score of this submission.

Building upon BMAcc (Jin et al., 2020), we use best-match F1 score for evaluation. We define a paradigm as a set of word forms f ∈ π. Duplicate forms within π (syncretism) are discarded. Given a set of gold partial paradigms πg ∈ ΠG, a set of predicted paradigms πp ∈ ΠP, a gold vocabulary Σg = ⋃ πg, and a predicted vocabulary Σp = ⋃ πp, it works according to the following steps:

1. Redefine each predicted paradigm, removing the words that we cannot evaluate: πp′ = πp ∩ Σg, to form a set of pruned paradigms Π′P.

2. Build a complete bipartite graph over Π′P and ΠG, where the edge weight between πgi and πp′j is the number of true positives |πgi ∩ πp′j|.

3. Compute the maximum-weight full matching using Karp (1980), in order to find the optimal alignment between Π′P and ΠG.

4. Assign all predicted words Σp′ and all gold words Σg a label corresponding to the gold paradigm, according to the matching found in step 3. Any unmatched wp′i ∈ Σp′ is assigned a label corresponding to a spurious paradigm.

5. Compute the F1 score between the sets of labeled words in Σp′ and Σg.
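The evaluation procedure can be sketched as a toy re-implementation. Brute-force search over permutations stands in for Karp's maximum-weight full matching, so it is only suitable for tiny examples, and the precision/recall bookkeeping reflects our reading of the description rather than the official scorer.

```python
from itertools import permutations

def best_match_f1(gold, pred):
    """Best-match F1 sketch. gold and pred are lists of sets of word forms."""
    gold_vocab = set().union(*gold)
    # Step 1: prune predicted paradigms to words we can evaluate.
    pruned = [p & gold_vocab for p in pred]
    # Steps 2-3: maximum-weight full matching between pruned predictions
    # and gold paradigms, padded with empty sets to equal length.
    n = max(len(gold), len(pruned))
    g = gold + [set()] * (n - len(gold))
    p = pruned + [set()] * (n - len(pruned))
    best = max(permutations(range(n)),
               key=lambda perm: sum(len(g[i] & p[perm[i]]) for i in range(n)))
    # Steps 4-5: words in a matched pair with the same gold label are true
    # positives; everything else in either vocabulary counts against P or R.
    tp = sum(len(g[i] & p[best[i]]) for i in range(n))
    n_pred = sum(len(x) for x in pruned)
    n_gold = sum(len(x) for x in gold)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

In the Figure 2 spirit, a prediction that splits one gold paradigm across two clusters keeps its unjudgeable words out of the score but still loses precision on the split.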

2.3 Baseline System

We provide a straightforward baseline that constructs paradigms based on substring overlap between words. We construct paradigms out of words that share a substring of length ≥ k. Since words can share multiple substrings, it is possible that multiple identical, redundant paradigms are created. We reduce these to a single paradigm. Words that do not belong to a cluster are assigned a singleton paradigm, that is, a paradigm that consists of only that word.

We tune k on the development sets and find that k = 5 works best on average. This means that a word of less than 5 characters can only ever be in one, singleton, paradigm.
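A sketch of this baseline, under one assumed reading of the description: shared substrings are treated as a transitive relation via union-find, which may merge slightly more aggressively than the official deduplication of redundant paradigms.

```python
def baseline_paradigms(words, k=5):
    """Group words sharing any substring of length >= k; leftover
    words become singleton paradigms."""
    parent = {w: w for w in words}

    def find(w):
        # union-find root lookup with path halving
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    index = {}  # first word seen for each k-length substring
    for w in words:
        for i in range(len(w) - k + 1):
            sub = w[i:i + k]
            if sub in index:
                parent[find(w)] = find(index[sub])
            else:
                index[sub] = w
    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())
```

Words shorter than k never enter the substring index, so they fall out as singleton paradigms, matching the remark about k = 5 above.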

3 Submitted Systems

The Boulder-Perkoff-Daniels-Palmer team (Boulder-PDP; Perkoff et al., 2021) participates with four submissions, resulting from experiments with two different systems. Both systems apply k-means clustering to vector representations of input words. They differ in the type of vector representations used: either orthographic or semantic representations. Semantic skip-gram representations are generated using word2vec (Mikolov et al., 2013). For the orthographic representations, each word is encoded into a vector of fixed dimensionality equaling the word length |wmax| of the longest word wmax in the input corpus. They associate each character c ∈ Σ in the alphabet of the input corpus with a real number r ∈ [0, 1] and assign vi := r if the ith character of the input word w is c. If |w| < |wmax|, the remaining entries are set to 0.

The number of clusters is a hyperparameter of the k-means clustering algorithm. In order to set this hyperparameter, Perkoff et al. (2021) experiment with a graph-based method. The word types in the corpus form the nodes of a graph, where the neighborhood of a word w consists of all words sharing a maximal substring with w. The graph is split into highly connected subgraphs (HCS) containing n nodes, where the number of edges that need to be cut in order to split the graph into two

disconnected components is > n/2 (Hartuv and Shamir, 2000). The number of HCSs is then taken to be the cluster number. In practice, however, the graph-clustering step proves to be prohibitively slow, and results for test languages are submitted using fixed numbers of clusters of size 500, 1000, 1500, and 1900. In experiments on the dev languages, they find that the orthographic representations outperform the semantic representations for all languages, and thus submit four systems utilizing orthographic representations.

The Boulder-Gerlach-Wiemerslage-Kann team (Boulder-GWK; Gerlach et al., 2021) submits two systems based on an unsupervised lemmatization system originally proposed by Rosa and Zabokrtsky (2019). Their approach is based on agglomerative hierarchical clustering of word types, where the distance between word types is computed as a combination of a string distance metric and the cosine distance of fastText embeddings (Bojanowski et al., 2017). Their choice of fastText embeddings is due to the limited size of the shared task datasets. Two variants of edit distance are compared to quantify string distance: (1) Jaro-Winkler edit distance (Winkler, 1990) resembles regular edit distance of strings but emphasizes similarity at the start of strings, which is likely to bias the system toward languages expressing inflection via suffixation. (2) A weighted variant of edit distance, where costs for insertions, deletions, and substitutions are derived from a character-based language model trained on the shared task data.
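The general shape of such a combined distance can be sketched as follows. This is a hypothetical illustration, not the Boulder-GWK implementation: the string metric here is difflib's ratio rather than Jaro-Winkler, and the equal weighting is an assumption.

```python
from difflib import SequenceMatcher

def combined_distance(w1, w2, v1, v2, alpha=0.5):
    """Mix a string distance between words w1, w2 with the cosine
    distance of their embedding vectors v1, v2."""
    string_d = 1.0 - SequenceMatcher(None, w1, w2).ratio()
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sum(a * a for a in v1) ** 0.5
    n2 = sum(b * b for b in v2) ** 0.5
    cos_d = 1.0 - dot / (n1 * n2)
    return alpha * string_d + (1.0 - alpha) * cos_d
```

Feeding such pairwise distances into agglomerative clustering lets orthographically dissimilar but semantically related forms (and vice versa) still end up in the same cluster.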

The CU–UBC team (Yang et al., 2021) provides systems that build upon the official shared task baseline: given the pseudo-paradigms found by the baseline, they extract inflection rules of multiple types. Comparing pairs of words in each paradigm, they learn both continuous and discontinuous character sequences that transform the first word into the second, following work on supervised inflectional morphology, such as Durrett and DeNero (2013) and Hulden et al. (2014). Rules are sorted by frequency to separate genuine inflectional patterns from noise. Starting from a random seed word, paradigms are constructed by iteratively applying the most frequent rules. Generated paradigms are further tested for paradigm coherence using metrics such as graph degree calculation and fastText embedding similarity.
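One simple instance of a "continuous" rule is a suffix replacement obtained by stripping the longest common prefix of a word pair (an illustrative sketch in the spirit of this approach, not the CU–UBC code):

```python
def suffix_rule(src, tgt):
    """Extract a suffix-replacement rule (old_suffix, new_suffix)
    turning src into tgt by removing their longest common prefix."""
    i = 0
    while i < min(len(src), len(tgt)) and src[i] == tgt[i]:
        i += 1
    return (src[i:], tgt[i:])
```

Counting how often each extracted rule recurs across pairs is what lets frequent, genuinely inflectional patterns be separated from one-off noise.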

The Edinburgh team (McCurdy et al., 2021) submits a system based on adaptor grammars (John-


               English  Navajo  Spanish  Finnish  Bulgarian  Basque  Kannada  German  Turkish  Average

Boulder-PDP-1
  Rec    28.93   32.71   23.90   18.43   20.55   28.57   25.19   25.50   15.70   24.39
  Prec   29.27   34.15   24.68   18.81   20.75   29.51   35.18   25.64   15.90   25.99
  F1     29.10   33.41   24.29   18.62   20.65   29.03   29.36   25.57   15.80   25.09

Boulder-PDP-2
  Rec    36.57   36.92   28.52   23.38   26.37   30.16   25.83   33.21   19.53   28.94
  Prec   37.00   38.54   29.45   23.86   26.63   31.15   36.08   33.40   19.79   30.65
  F1     36.78   37.71   28.98   23.62   26.50   30.65   30.11   33.31   19.66   29.70

Boulder-PDP-3
  Rec    42.79   37.85   29.41   26.01   28.73   26.98   25.94   38.18   21.38   30.81
  Prec   43.30   39.51   30.37   26.55   29.01   27.87   36.23   38.39   21.66   32.54
  F1     43.04   38.66   29.88   26.27   28.87   27.42   30.23   38.28   21.52   31.58

Boulder-PDP-4
  Rec    45.45   40.19   30.64   26.60   29.79   28.57   24.54   39.86   21.65   31.92
  Prec   45.99   41.95   31.63   27.15   30.08   29.51   34.28   40.08   21.93   33.62
  F1     45.72   41.05   31.13   26.87   29.93   29.03   28.61   39.97   21.79   32.68

Boulder-GWK-2
  Rec    28.81   10.75   19.27   22.02   30.02   19.05   18.54   31.92   20.63   22.33
  Prec   66.33   65.71   69.93   67.36   71.69   35.29   62.45   78.56   64.09   64.60
  F1     40.17   18.47   30.21   33.19   42.32   24.74   28.60   45.39   31.22   32.70

Boulder-GWK-1
  Rec    24.53   11.21   18.30   22.69   31.18   25.40   16.93   30.98   21.16   22.49
  Prec   56.47   68.57   66.41   69.41   74.46   47.06   57.04   76.26   65.74   64.60
  F1     34.20   19.28   28.69   34.20   43.96   32.99   26.12   44.06   32.02   32.83

Baseline
  Rec    76.69   59.81   72.18   76.73   73.02   25.40   38.48   77.62   65.82   62.86
  Prec   38.76   23.02   26.56   17.86   26.50   18.60   17.22   25.35   15.60   23.28
  F1     51.49   33.25   38.83   28.97   38.89   21.48   23.79   38.22   25.23   33.35

CU–UBC-5
  Rec    66.95   50.93   60.52   45.96   65.08   17.46   30.33   66.57   43.25   49.67
  Prec   90.40   68.55   72.70   56.47   76.85   52.38   61.26   74.40   54.05   67.45
  F1     76.93   58.45   66.05   50.68   70.48   26.19   40.57   70.26   48.05   56.41

CU–UBC-6
  Rec    63.76   51.87   63.62   48.75   63.84   17.46   33.12   65.05   45.81   50.36
  Prec   85.99   69.38   76.49   59.67   75.99   52.38   64.24   72.39   57.52   68.23
  F1     73.23   59.36   69.46   53.66   69.39   26.19   43.71   68.52   51.00   57.17

CU–UBC-7
  Rec    60.36   53.74   64.05   51.51   58.18   22.22   35.37   59.32   47.74   50.28
  Prec   81.42   72.33   76.98   62.58   69.23   66.67   69.77   66.13   60.17   69.47
  F1     69.33   61.66   69.92   56.51   63.23   33.33   46.94   62.54   53.24   57.41

CU–UBC-3
  Rec    83.39   47.66   76.48   52.06   73.14   25.40   36.33   74.28   46.50   57.25
  Prec   84.38   49.76   78.97   53.14   73.87   26.23   50.75   74.70   47.10   59.88
  F1     83.89   48.69   77.71   52.60   73.50   25.81   42.35   74.49   46.80   58.42

CU–UBC-4
  Rec    80.69   47.66   78.35   57.29   73.77   28.57   40.73   74.06   50.93   59.12
  Prec   81.64   49.76   80.89   58.48   74.50   29.51   56.89   74.47   51.59   61.97
  F1     81.16   48.69   79.60   57.88   74.14   29.03   47.47   74.27   51.26   60.39

CU–UBC-1
  Rec    75.96   47.66   75.73   65.35   69.07   28.57   49.52   65.08   60.58   59.73
  Prec   76.86   49.76   78.19   66.71   69.92   29.51   69.16   65.44   61.36   62.99
  F1     76.41   48.69   76.94   66.03   69.50   29.03   57.71   65.26   60.97   61.17

CU–UBC-2
  Rec    88.16   41.59   81.90   72.68   76.58   28.57   50.91   73.98   67.37   64.64
  Prec   89.21   43.41   84.56   74.18   77.34   29.51   71.11   74.39   68.24   67.99
  F1     88.68   42.48   83.21   73.42   76.96   29.03   59.34   74.18   67.80   66.12

Edinburgh
  Rec    89.54   41.59   82.38   59.58   80.22   31.75   58.95   78.97   72.82   66.20
  Prec   90.75   43.41   85.06   60.84   83.30   32.79   82.34   79.41   73.75   70.18
  F1     90.14   42.48   83.70   60.20   81.73   32.26   68.71   79.19   73.28   67.96

stanza
  Rec    95.31   -       85.49   86.21   84.74   65.08   -       79.19   86.80   83.26
  Prec   93.87   -       85.84   85.91   82.79   50.62   -       71.57   86.87   79.64
  F1     94.59   -       85.66   86.06   83.75   56.94   -       75.19   86.84   81.29

Table 3: Results on all test languages for all systems in %; the official shared task metric is best-match F1. To provide a more complete picture, we also show precision and recall. stanza is a supervised system.

son et al., 2007) modeling word structure. Their work draws on parallels between the unsupervised paradigm clustering task and unsupervised morphological segmentation. Their grammars segment word forms in the shared task corpora into a sequence of zero or more prefixes and a single stem followed by zero or more suffixes.

Based on the segmented words from the raw text data, they then determine whether the language uses prefixes or suffixes for inflection. The final stem for words in a predominantly suffixing language then consists of the prefixes and stem identified by the adaptor grammar. For a predominantly prefixing language, the final stem instead contains all suffixes of the word form. The team notes that this approach is unsuitable for languages which make extensive use of both prefixes and suffixes, such as Basque.

Finally, they group all words which share the same stem into paradigms. However, because sampling from an adaptor grammar is a non-deterministic process, i.e., the system may return multiple possible segmentations for a single word form, they construct preliminary clusters by including all forms which might share a given stem. Then they select the cluster that maximizes a score based on the frequency of occurrence of the induced segment in all segmentations.
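The final grouping step amounts to an inverted index from stems to word forms (a sketch; the input mapping from word to chosen stem is hypothetical and would come from the adaptor-grammar segmenter):

```python
def cluster_by_stem(analyses):
    """Group word forms sharing a stem into paradigms.
    analyses: dict mapping word form -> its selected stem."""
    paradigms = {}
    for word, stem in analyses.items():
        paradigms.setdefault(stem, set()).add(word)
    return list(paradigms.values())
```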

4 Results and Discussion

The official results obtained by all submitted systems on the test sets are shown in Table 3.

The Edinburgh system performs best overall with an average best-match F1 of 67.96%. In general, grammar-based systems attain the best results, with all of the CU–UBC systems and the Edinburgh system outperforming the baseline by at least 23.06% F1. The Boulder-GWK and Boulder-PDP systems, both of which perform clustering over word representations, approach but do not exceed baseline performance. Perkoff et al. (2021) found that clustering over word2vec embeddings performs poorly on the development languages, and their scores on the test set reflect clusters found with vectors based purely on orthography. The Boulder-GWK systems contain incomplete results, and partial evidence suggests that their clustering method, which combines fastText embeddings trained on the provided Bible corpora with edit distance, can indeed outperform the baseline. However, it likely cannot outperform the grammar-based submissions.

For comparison, we also evaluate a supervised lemmatizer from the Stanza toolkit (Qi et al., 2020). The Stanza lemmatizer is a neural network model trained on Universal Dependencies (UD) treebanks (Nivre et al., 2020), which first tags for parts of speech, and then uses these tags to generate lemmas for a given word. Because there is no UD corpus in the current version for Navajo or Kannada, we do not have scores for those languages. Stanza's accuracy on our task is far lower than that reported for lemmatization on UD data. We note, however, that 1) our data is from a different domain, 2) Biblical language in particular can differ strongly from contemporary text, and 3) we evaluate on only a partial set of types in the corpus, which could represent a particularly challenging set of paradigms for some languages. The Stanza lemmatizer outperforms all systems for all languages, except for German. This is unsurprising as it is a supervised system, though it is interesting that the German score falls short of that of the Edinburgh system.

naagha      neiikai      naahkai
naasha      nijigha      nideeshaał
naaya       ninadaah     nanina
ninajıdaah  nizhdoogaał

Table 4: A paradigm from our gold set for Navajo.

Overgeneralization/Underspecification When acquiring language, children often overgeneralize morphological analogies to new, ungrammatical forms. For example, the past tense of the English verb to know might be expressed as knowed, rather than the irregular knew. The same behavior can also be observed in learning algorithms at some point during the learning process (Kirov and Cotterell, 2018). This is reflected to some extent in Table 3 by trade-offs between precision and recall. Low precision but high recall indicates that a system is overgeneralizing: some surface forms are erroneously assigned to too many paradigms. In effect, these systems are hypothesizing that a substring is productive, and thus proposing a paradigmatic relationship between two words. For example, the English words approach and approve share the stem appro- with unproductive segments as suffixes. The baseline tends to overgeneralize due to its creation of large paradigms via a naive grouping of words by shared n-grams.

On the other hand, several systems seem to underspecify, as indicated by their low recall. Low recall but high precision indicates that a system does not attribute inflected forms to a paradigm that the form does in fact belong to. This can be caused by suppletion in systems based purely on orthography, for example, generating the paradigm with go and goes, but attributing went to a separate paradigm. Underspecification is apparent in the CU–UBC submissions that relied on discontinuous rules (CU–UBC 5, 6, and 7). This is likely because they filtered these systems down to far fewer rules than their prefix/suffix systems, in order to avoid the severe overgeneralization that can result from spurious morphemes based on discontinuous substrings. Similarly, the Boulder-GWK systems both have reasonable precision, but very low recall. They report that this is due to the fact that they ignore any words below a certain frequency in the corpus due to time constraints, thus creating small paradigms and ignoring many words completely.

Language and Typology In general, we find that Basque and Navajo are the two most difficult test languages. Both languages have relatively small


Figure 3: Singleton paradigm counts for the best performing system on all test languages. Languages for which we have more than 100 paradigms on the left, and those for which we have less than 100 paradigms on the right. Predicted singleton paradigms are in red and blue, gold singleton paradigms are in grey.

Figure 4: The F1 score across paradigm sizes for the best performing system on all test languages. From left to right, the graphs represent the groups of languages in increasing order of how well systems typically performed on them. F1 scores are interpolated for paradigm sizes that do not exist in a given language.

corpora, and are typologically agglutinative, that is, they express inflection via the concatenation of potentially many morpheme segments, which can result in a large number of unique surface forms. Both languages thus have relatively high type-token ratios (TTR), especially Navajo, which has the highest TTR, cf. Table 2. It is also important to note that both Basque and Navajo have comparatively small sets of paradigms against which we evaluate. This leaves the possibility that the subset of paradigms in the gold set is particularly challenging. However, the difference between system scores indicates that these two languages do offer challenges related to their morphology.

Navajo is a predominantly prefixing language, the only one in the development and test sets, and Basque also inflects using prefixes, though to a lesser extent. The top two performing systems both obtain low scores for Navajo. The CU–UBC-2 system considers only suffix rules, which results in it being the lowest performing CU–UBC system on Navajo. The Edinburgh submission should be able to identify prefixes and consider the suffix to be part of the stem in Navajo. However, the large number of types for a relatively small Navajo corpus may cause difficulties for their algorithm that builds clusters based on affix frequency. Notably, the CU–UBC-7 system, which learns discontinuous rules rather than rules that model strictly concatenative morphology, performs best on Navajo by a large margin when compared to the best performing system, which relies on strictly concatenative grammars. It also performs best on Basque, though by a smaller margin. Another difficulty in Navajo morphology is that it exhibits verbal stem alternation for expressing mood, tense, and aspect, which creates challenges for systems that rely on rewrite rules or string similarity based on continuous substrings. For instance, our evaluation algorithm aligns a singleton predicted paradigm to the gold paradigm in Table 4 for nearly all systems.

On Basque, most systems perform poorly. McCurdy et al. (2021), the best performing system overall, obtains a low score for Basque, which may be due to their system assuming that a language inflects either via prefixation or suffixation, but not both, as Basque does. Other systems, however, attain similarly low scores for Basque.

The next tier of difficulty seems to comprise Finnish, Kannada, and Turkish, on which most systems obtain low scores. All of these languages are suffixing, but also have an agglutinative morphology. The largest paradigms of these 3 languages are all among the top 4 largest paradigms in Table 2. This implies that large paradigm sizes and large numbers of distinct inflectional morphemes, two properties often assumed to correlate with agglutinative morphology, coupled with sparse corpora to learn from, offer challenges for paradigm clustering. Though agglutinative morphology, having relatively unchanged morphemes across words, might be simpler for automatic segmentation systems than fusional morphology, our sparse data sets are likely to complicate this.

Finally, systems obtain the best results for English, followed by Spanish, and then Bulgarian. These three languages are also strongly suffixing, but typically express inflection with a single morpheme. German appears to be a bit of an outlier, generally exhibiting scores that lie somewhere between the highest scoring languages and the more difficult agglutinative languages. McCurdy et al. (2021) hypothesize that this may be due to non-concatenative morphology from German verbal circumfixes. This hypothesis could explain why the Boulder-GWK system performs better on German than on other languages: it incorporates semantic information. However, the CU–UBC systems that use discontinuous rules (systems 5, 6, and 7), and thus should better model circumfixation, do not produce higher German scores than those with continuous rules, including the suffix-only system.

5 Analysis: Partial Paradigm Sizes

The effect of the size of the gold partial paradigms on F1 score for the best system is illustrated in Figure 4. For Basque and Navajo, the F1 score tends to drop as paradigm size increases. We see the same trend for Finnish, Kannada, and German, with a few exceptions, but this trend does not exist for all languages. English resembles something like a bell shape, other than the low-scoring outlier for the largest paradigms of size 7. Interestingly, Spanish and Turkish attain both very high and very low scores for larger paradigms.

An artifact of a sparse corpus is that many singleton paradigms arise. For theoretically larger paradigms, only a single inflected form might occur in such a small corpus. Of course, this also happens naturally for certain word classes. However, nouns, verbs, and occasionally adjectives typically form paradigms comprising several inflected forms. Figure 3 demonstrates that the best system tends to overgenerate singleton paradigms. We see this to some extent for all agglutinative languages, which may be due to the high number of typically long, unique forms. This is especially true for Navajo, which has a small corpus and an extremely high type-token ratio. On the other hand, for the languages for which the highest scores are obtained, Spanish and English, the system does not overgenerate singleton paradigms. Of the large number of singleton paradigms predicted for both languages, the vast majority are correct. For other systems not pictured in the figure, singleton paradigms are typically undergenerated for Spanish and English. In the case of English, this could be due to words that share a derivational relationship. For example, the word accomplishment might be assigned to the paradigm of the verb accomplish, when, in fact, their relationship is not inflectional.

6 Conclusion and Future Shared Tasks

We presented the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. Submissions roughly fell into two categories: similarity-based methods and grammar-based methods, with the latter proving more successful at the task of clustering inflectional paradigms. The best systems significantly improved over the provided n-gram baseline, roughly doubling the F1 score, mostly through much improved precision. A comparison against a supervised lemmatizer demonstrated that we have not yet reached the ceiling for paradigm clustering: many words are still either incorrectly left in singleton paradigms or incorrectly clustered with circumstantially (and often derivationally) related words. Regardless of the ground still to be covered, the submitted results were a successful first step in automatically inducing the morphology of a language without access to expert-annotated data.

Unsupervised morphological paradigm clustering is only the first step in a morphological learning process that more closely models human L1 acquisition. We envision future tasks expanding on this task to include other important aspects of morphological acquisition. Paradigm slot categorization is a natural next step. To correctly categorize paradigm slots, cross-paradigmatic similarities must be considered; for example, the German words liest and schreibt are both 3rd person singular present indicative inflections of two different verbs. This can occasionally be identified via string similarity, but more often requires syntactic information. Syncretism (the collapsing of multiple paradigm slots into a single representation) further complicates the task. A similar subtask involves lemma identification, where a canonical form (Cotterell et al., 2016b) is identified within the paradigm.

Likewise, another important task involves filling unrealized slots in paradigms by generating the correct surface form, which can be approached similarly to previous SIGMORPHON shared tasks on inflection (Cotterell et al., 2016a, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020), but will likely be based on noisy information from the slot categorization; all previous tasks have assumed that the morphosyntactic information provided to an inflector is correct. Currently, investigations into the robustness of these systems to noise are sparse.

Another direction for this task is the expansion to more under-resourced languages. The submitted results demonstrate that the task becomes particularly difficult when the provided raw text is small, but under-documented languages are often the ones most in need of morphological corpora. The JHUBC contains Bible data for more than 1500 languages, which can potentially be augmented by other raw text corpora, because morphology is relatively stable across domains. Future tasks may enable the construction of inflectional paradigms in languages that need them in order to build further computational tools.

Acknowledgments

We would like to thank all of our shared task participants for their hard work on this difficult task!

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. arXiv preprint arXiv:1810.07125.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016a. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.

Ryan Cotterell, Tim Vieira, and Hinrich Schütze. 2016b. A joint model of orthography and morphological segmentation. Association for Computational Linguistics.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Andrew Gerlach, Adam Wiemerslage, and Katharina Kann. 2021. Paradigm clustering with weighted edit distance. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Erez Hartuv and Ron Shamir. 2000. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175–181.

Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569–578.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya McCarthy, and Katharina Kann. 2020. Unsupervised morphological paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6696–6707, Online. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, Sharon Goldwater, et al. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. Advances in Neural Information Processing Systems, 19:641.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Richard M. Karp. 1980. An algorithm to solve the m×n assignment problem in expected time O(mn log n). Networks, 10(2):143–152.

Christo Kirov and Ryan Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics, 6:651–665.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernstreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020a. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020b. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Kate McCurdy, Sharon Goldwater, and Adam Lopez. 2021. Adaptor grammars for unsupervised paradigm clustering. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

E. Margaret Perkoff, Josh Daniels, and Alexis Palmer. 2021. Orthographic vs. semantic representations for unsupervised morphological paradigm clustering. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Rudolf Rosa and Zdeněk Žabokrtský. 2019. Unsupervised lemmatization as embeddings-based word clustering. CoRR, abs/1908.08528.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. arXiv preprint arXiv:2006.11572.

William E. Winkler. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359.

Changbing Yang, Garrett Nicolai, and Miikka Silfverberg. 2021. Unsupervised paradigm clustering using transformation rules. In Proceedings of the 18th Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 82–89, August 5, 2021. ©2021 Association for Computational Linguistics

Adaptor Grammars for Unsupervised Paradigm Clustering

Kate McCurdy, Sharon Goldwater, Adam Lopez
School of Informatics
University of Edinburgh
[email protected], sgwater, [email protected]

Abstract

This work describes the Edinburgh submission to the SIGMORPHON 2021 Shared Task 2 on unsupervised morphological paradigm clustering. Given raw text input, the task was to assign each token to a cluster with other tokens from the same paradigm. We use Adaptor Grammar segmentations combined with frequency-based heuristics to predict paradigm clusters. Our system achieved the highest average F1 score across 9 test languages, placing first out of 15 submissions.

1 Introduction

While the task of supervised morphological inflection has seen dramatic gains in accuracy over recent years (e.g. Cotterell et al., 2016, 2017, 2018; Vylomova et al., 2020), unsupervised morphological analysis remains an open challenge. This is evident in the results of the 2020 SIGMORPHON Shared Task 2 on Unsupervised Morphological Paradigm Completion, in which no submission consistently outperformed the baseline (Kann et al., 2020; Jin et al., 2020).

The 2021 Shared Task 2 (Wiemerslage et al., 2021) focuses on a subproblem from the 2020 task: given raw text input, cluster tokens together based on membership in the same morphological paradigm. For example, given the sentence "My dog met some other dogs", a successful system would assign "dog" and "dogs" to the same paradigm because they are two inflected forms of the same lemma "dog", while each other word would occupy its own cluster. Furthermore, a successful system needs to cluster typologically diverse, morphologically rich languages such as Finnish and Navajo, with inflectional paradigms which are much larger than English paradigms.

2 Adaptor Grammars

Our approach is based upon Adaptor Grammars, a framework which achieves state-of-the-art results on the related task of unsupervised morphological segmentation (Eskander et al., 2020).

2.1 Model

Adaptor Grammars (AGs; Johnson et al., 2007b) are a class of nonparametric Bayesian probabilistic models which learn structured representations, or parses, of natural language input strings. An AG has two components: a Probabilistic Context-Free Grammar (PCFG) and one or more adaptors. The PCFG is a 5-tuple (N, W, R, S, θ) which specifies a base distribution over parse trees. Parse trees are generated top-down by expanding nonterminals N (including the start symbol S ∈ N) to nonterminals N (excluding S) and terminals W, using the set of allowed expansion rules R with expansion probability θ_r for each rule r ∈ R. PCFGs have very strong independence assumptions; the adaptor component relaxes these assumptions by allowing certain nonterminals to adapt to a particular corpus, meaning they can cache and re-use subtrees with probabilities conditioned on that corpus.

An AG extends a PCFG by specifying a set of adapted nonterminals A ⊆ N and a vector of adaptors C. For each adapted nonterminal X ∈ A, the adaptor C_X stores all subtrees previously emitted with the root node X. When a new tree rooted in X is sampled, the adaptor C_X either generates a new tree from the PCFG base distribution or returns a previously emitted subtree from its cache. The adaptor distribution is generally based on a Pitman-Yor Process (PYP; Pitman and Yor, 1997), under which the probability of C_X returning a particular subtree σ is roughly proportional to the number of times X has previously expanded to σ. This leads to a "rich-get-richer" effect as more frequently sampled subtrees gain higher probability within the conditional adapted distribution. Given an AG specification, MCMC sampling can be used to infer values for the PCFG rule probabilities θ (Johnson et al., 2007a) and PYP hyperparameters (Johnson and Goldwater, 2009).

Figure 1: A possible morphological analysis (1c) learned by the grammar in (1a) over the corpus shown in (1b) (from Johnson et al., 2007b).
  (a) Example grammar (adapted nonterminals: Stem, Suffix):
      Word → Stem Suffix
      Word → Stem
      Stem → Chars
      Suffix → Chars
      Chars → Char
      Chars → Char Chars
  (b) Toy corpus with target segmentations: walked → walk-ed; jumping → jump-ing; walking → walk-ing; jump → jump.
  (c) Example target morphological analyses, showing only the top 2 levels of structure: trees rooted in Word with Stem "walk" + Suffix "ed", and Stem "jump" + Suffix "ing".
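To make the PCFG base distribution concrete, the following sketch (our own illustrative code, not from the paper or any AG implementation) samples a word top-down from the toy grammar of Figure 1, with uniform rule probabilities standing in for the learned θ_r values:

```python
import random

# Toy grammar from Figure 1a; each nonterminal maps to its allowed expansions.
# Rule probabilities theta_r are uniform here purely for illustration.
RULES = {
    "Word":   [["Stem", "Suffix"], ["Stem"]],
    "Stem":   [["Chars"]],
    "Suffix": [["Chars"]],
    "Chars":  [["Char"], ["Char", "Chars"]],
}

def sample_word(symbol="Word", alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Expand nonterminals top-down until only terminal characters remain."""
    if symbol == "Char":
        return random.choice(alphabet)
    expansion = random.choice(RULES[symbol])  # uniform choice over rules
    return "".join(sample_word(s, alphabet) for s in expansion)

word = sample_word()
```

An adaptor would intervene at the adapted nonterminals (Stem, Suffix), returning cached substrings like "walk" or "ed" with probability roughly proportional to their past counts instead of always sampling fresh from this base distribution.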

2.2 AGs for Morphological Analysis

The probabilistic parses generated by adaptor grammars can be used to segment sequences. In cases where the grammar specifies word structures, the segmentations may reflect morphological analyses. For example, an AG trained with the simple grammar shown in Figure 1a may learn to cache "jump" and "walk" as Stem subtrees, and "ing" and "ed" as Suffix subtrees, ideally producing the target segmentations shown in Figure 1c. In practice, researchers have successfully applied AGs to the task of unsupervised morphological segmentation (Sirts and Goldwater, 2013; Eskander et al., 2016). Eskander et al. (2020) found that a language-independent AG framework achieved state-of-the-art results on 12 typologically distinct languages.

3 System description

3.1 Overview

The task of unsupervised paradigm clustering is closely related to morphological segmentation, but we are not aware of previous applications of AGs to the current task. To use AGs for paradigm clustering, we need a method to group words together based on their AG segmentations. The example segmentations shown in Figure 1 suggest a very simple approach to paradigm clustering: assign all forms with the same stem to the same cluster. For example, "walked" and "walking" would correctly cluster together with the shared stem "walk". Our system builds upon this intuition.

As a preliminary step, we select grammars to sample from, looking only at the development languages. We build simple clusters and heuristically select grammars which show relatively high performance, as described in Section 3.2. In this case we select two grammars. Once the grammars have been selected, we discard the simple clusters in favor of a more sophisticated strategy.

We implement[1] a procedure to generate clusters for both development and test languages. First, we sample 3 separate AG parses for each corpus and each grammar, resulting in 6 segmentations for each word. We then use frequency-based metrics over the segmentations to identify the language's adfix direction, i.e. whether it is predominantly prefixing or suffixing, as described in Section 3.3. Finally, we iterate over the entire vocabulary and apply frequency-based scores to generate paradigm clusters, as described in Section 3.4.

3.2 Grammar selection

An adaptor grammar builds upon an initial PCFG specification, and many such grammars can be applied to model word structure. As a first step, we evaluate various grammar specifications on the development languages and select the grammars for our final model.

To train the adaptor grammar representations, we use MorphAGram (Eskander et al., 2020), a framework which extends the adaptor grammar implementation of Johnson et al. (2007b). Eskander et al. (2020) evaluated nine different PCFG grammar specifications on the task of unsupervised word segmentation. Each grammar specifies the range of possible word structures which can be learned under that model. We evaluated six of their nine proposed grammars on the development languages (Maltese, Persian, Portuguese, Russian, and Swedish). Following their procedure, we extracted a vocabulary V of word types as AG inputs.[2]

[1] https://github.com/kmccurdy/paradigm-clusters
[2] Although AGs can also model token frequencies (Goldwater et al., 2006), which could conceivably improve performance on this task, we did not explore this option.

To evaluate grammar performance, we follow the intuition in Section 3.1 and group by AG-segmented stems. Grouping by stem can be more difficult for complex words. For example, an AG with a more complex grammar might segment the plural noun "actionables" into "action-able-s", with "action" as the stem (see also the example in Figure 2a); however, the target paradigm for clustering includes only "actionable" and "actionables", not "action" and "actions". To address this issue for our clustering task, we make the further simplifying (but linguistically motivated; e.g. Stump, 2005, 56) assumption that inflectional morphology is generally realized on a word's periphery, so a segmentation like "action-able-s" implies the stem "actionable" (in a suffixing language like English, where the prefix is included in the stem). As all of the development languages were predominantly suffixing (with the partial exception of Maltese, which includes root-and-pattern morphology), we simply grouped together words with the same AG-segmented Prefix + Stem.

We selected two grammars with the following desirable attributes: 1) they reliably showed good performance on the development set, relative to other grammars; and 2) they specified very similar structures, making it easier to combine their outputs in later steps. Both grammars model words as a tripartite Prefix-Stem-Suffix sequence. Both grammars also use a SubMorph level of representation, which has been shown to aid word segmentation (Sirts and Goldwater, 2013), although we only consider segments from the level directly above SubMorphs in clustering. The full grammar specifications are included in Appendix A.

• Simple+SM: Each word comprises one optional prefix, one stem, and one optional suffix. Each of these levels can comprise one or more SubMorphs.

• PrStSu+SM: Each word comprises zero or more prefixes, one stem, and zero or more suffixes. Each of these levels can comprise one or more SubMorphs. Eskander et al. (2020) found that this grammar showed the highest performance in unsupervised segmentation across the languages they evaluated.

Sampling from an adaptor grammar is a non-deterministic process, so the same set of initial parameters applied to the same data can predict different segmentation outputs. Given this variability, we run the AG sampler three times for each of our two selected grammars, yielding 6 parses of the lexicon for each language. The number of grammar runs was heuristically selected and not tuned in any way, so adding more runs for each grammar might improve performance (for example, Sirts and Goldwater, 2013, use 5 samples per grammar). We then combine the resulting segmentations using the following procedure.

Figure 2: Two example parses of the word "apportioned" from our two distinct grammar specifications, learned on the English test data. (a) PrStSu+SM: Prefix "app" + Stem "ort" + Suffixes "ion", "ed". (b) Simple+SM: Prefix "ap" + Stem "port" + Suffix "ioned". (Full trees, including the SubMorph level, omitted here.)

3.3 Adfix direction

The first step is to determine the adfix direction for each language, i.e. whether the language uses predominantly prefixing or suffixing inflection. We heuristically select the adfix direction using the following automatic procedure.

First, we count the frequency of each peripheral segment across all 6 parses of the lexicon. A peripheral segment is a substring at the start or end of a word which has been parsed as a segment above the SubMorph level in some AG sample. For instance, in the parse shown in Figure 2a, "app-" would be the initial peripheral segment, and "-ed" would be the final peripheral segment. By contrast, for the parse shown in Figure 2b, "ap-" would be the initial peripheral segment, and "-ioned" would be the final peripheral segment.

Next, we rank the segmented adfixes by their frequency, and select the top N for consideration, where N is some heuristically chosen quantity. In light of the generally Zipfian properties of linguistic distributions, we chose to scale N logarithmically with the vocabulary size, so N = log(|V|).

Finally, we select the majority label (i.e. "prefix" or "suffix") of the N most frequent segments as the adfix direction. This simple approach has obvious limitations; to name just one, it neglects the reality of nonconcatenative morphology, such as the root-and-pattern inflection of many Maltese verbs. Nonetheless, it appears to capture some key distinctions: this method correctly identified Navajo as a prefixing language, and all other development and test languages as predominantly suffixing.
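The adfix-direction heuristic can be sketched as follows. This is our own illustrative code, not the authors' implementation; it assumes peripheral segments have already been extracted from the AG parses upstream, and the tie-breaking choice is ours:

```python
import math
from collections import Counter

def adfix_direction(peripheral_segments, vocab_size):
    """Majority-vote the top N = log(|V|) most frequent peripheral segments.

    peripheral_segments: ("prefix" | "suffix", segment_string) pairs collected
    across all 6 AG parses of the lexicon (hypothetical input format).
    """
    counts = Counter(peripheral_segments)
    n = max(1, int(math.log(vocab_size)))      # N scales with log vocab size
    top = counts.most_common(n)                # [((direction, segment), count), ...]
    suffix_votes = sum(1 for (direction, _), _ in top if direction == "suffix")
    return "suffix" if suffix_votes * 2 >= len(top) else "prefix"

# A toy language where "-s" and "-ed" dominate should come out suffixing.
segs = [("suffix", "s")] * 40 + [("suffix", "ed")] * 25 + [("prefix", "un")] * 5
direction = adfix_direction(segs, vocab_size=1000)
```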

3.4 Creating paradigm clusters

Once we have inferred the adfix direction for a language, we use a greedy iterative procedure over words to identify and score potential clusters. Our scoring metric is frequency-based, motivated by the observation that inflectional morphology (such as the "-s" in "actionables") tends to be more frequent across word types relative to derivational morphology (such as the "-able" in "actionables"). Yarowsky and Wicentowski (2000) have demonstrated the value of frequency metrics in aligning inflected forms from the same lemma.

We start with no assigned clusters and iterate through the vocabulary in alphabetical order.[3] For each word w which has not yet been assigned to a cluster, we identify the most likely cluster using the following procedure.

Find possible stems: Identify each possible stem s from all of the segmentations for w, where the "stem" comprises the entire substring up to a peripheral adfix. For example, based on the two parses shown in Figure 2, "apportion" and "apport" would constitute possible stems for the word "apportioned". The word w in its entirety is also considered as a possible stem.

Find possible cluster members: For each stem s, identify other words in the corpus which might share that stem, forming a potential cluster c_s. A word potentially shares a stem if it shares the same substring from the non-adfixing direction, so a stem is a shared prefix substring in a suffixing language like English, and vice versa for a prefixing language like Navajo. For each word w_i that is identified this way, the rest of the string outside of the possible stem s is a possible adfix a_i. In the example from Figure 2, if "apportions" were also in the corpus, it would be added to the cluster for the stem "apportion", with "-s" as the adfix a_i. Similarly, it would also be considered in the cluster for the stem "apport", with adfix "-ions".

[3] The method is relatively insensitive to order, except reversed alphabetical order, which is worse for most languages.
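A minimal sketch of this candidate-generation step (our own hypothetical code, assuming a suffixing language so that stems are shared prefixes of the surface string):

```python
def possible_stems(word, segmentations):
    """Each stem is the substring up to a peripheral suffix in some parse;
    the whole word also counts as a possible stem.
    segmentations: list of segment lists, e.g. [["apportion", "ed"], ...].
    """
    stems = {word}
    for segs in segmentations:
        if len(segs) > 1:
            stems.add("".join(segs[:-1]))  # strip the final peripheral segment
    return stems

def candidate_members(stem, vocabulary):
    """Words sharing the stem as a prefix; the remainder is the candidate adfix."""
    return {w: w[len(stem):] for w in vocabulary if w.startswith(stem)}

# Hypothetical segmentations roughly mirroring the "apportioned" example.
segmentations = [["apportion", "ed"], ["apport", "ioned"]]
stems = possible_stems("apportioned", segmentations)
members = candidate_members("apportion", {"apportioned", "apportions", "apart"})
# members maps "apportioned" -> "ed" and "apportions" -> "s"
```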

Score cluster members: For each word w_i in c_s, calculate a score x_i:

    x_i = √(A_wi) · log(A_i)    (1)

where A_i is the normalized overall frequency of the i-th adfix a_i (suffix or prefix) per 10,000 types in the corpus of 6 segmentations, and A_wi is the proportion of segmentations of the i-th word w_i which contain the adfix a_i. For example, if "apportioned" were in consideration for a hypothetical cluster based on the stem "apportion", A_i would be the normalized corpus frequency of "-ed", and A_wi would be .5 (assuming only the two segmentations shown in Figure 2). For a cluster with the stem "apport", A_i would be the normalized frequency of "-ioned", and A_wi would still be .5.

Intuitively, when evaluating a single word, Eq. 1 assumes that adfixes which appear frequently in the segmented corpus overall are more likely to be inflectional, so words with more frequent adfixes are more likely paradigm members (the log(A_i) term). For instance, the high frequency of the "-s" suffix in English will increase the score of any word with an "-s" suffix in its segmentation (e.g. "apportion-s"). Eq. 1 also assumes that, for all segmentations of this particular word w_i, adfixes which appear in a higher proportion of segmentations are more reliable (the √(A_wi) term), so the more times some AG samples the "apportion-s" segmentation, the higher the score for "apportions" membership in the "apportion"-stem paradigm. The square root transform was selected based on development set performance, and has not been tuned extensively.

Filter and score clusters: For each possible stem cluster c_s, filter out words whose score x_i is below the score threshold hyperparameter t, to create a new cluster c′_s. Calculate the cluster score x_s by taking the average of x_i for only those words in c′_s, i.e. only words with score x_i ≥ t. The value for t is selected via grid search on the development set. We found that setting t = 2 maximized F1 across the development languages as a whole.

Select cluster: Select the potential cluster c′_s with the highest score, and assign w to that cluster, along with each word w_i in c′_s.
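Putting Eq. 1 and the filtering step together in a short sketch (our own illustrative code; the frequency values below are made-up numbers, with `adfix_freq` standing in for the normalized per-10,000-types adfix frequency A_i and `seg_proportion` for the fraction A_wi of a word's six segmentations containing that adfix):

```python
import math

def member_score(seg_proportion, adfix_freq):
    """Eq. 1: x_i = sqrt(A_wi) * log(A_i)."""
    return math.sqrt(seg_proportion) * math.log(adfix_freq)

def filter_and_score(candidates, t=2.0):
    """Keep members with x_i >= t; cluster score is the mean of surviving x_i."""
    kept = {w: x for w, x in candidates.items() if x >= t}
    score = sum(kept.values()) / len(kept) if kept else 0.0
    return kept, score

# Hypothetical numbers: "-ed" is frequent (A_i = 120 per 10k types) while
# "-ioned" is rare (A_i = 3); both appear in half of the word's parses.
x_ed = member_score(0.5, 120.0)      # frequent, likely inflectional adfix
x_ioned = member_score(0.5, 3.0)     # rare, likely derivational adfix
kept, score = filter_and_score({"apportioned": x_ed, "apportionable": 1.1}, t=2.0)
```

With the development-set threshold t = 2, the frequent "-ed" adfix clears the cutoff while the rare "-ioned" does not, matching the intuition that high-frequency adfixes signal inflection.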


Language     Precision  Recall  F1
Maltese      .30        .30     .30
Persian      .54        .52     .53
Portuguese   .92        .91     .91
Russian      .83        .82     .82
Swedish      .85        .81     .83
Mean         .69        .67     .68

Table 1: Performance on development languages

Language     Precision  Recall  F1
Basque       .33        .32     .32
Bulgarian    .83        .80     .82
English      .91        .90     .90
Finnish      .61        .60     .60
German       .79        .79     .79
Kannada      .82        .59     .69
Navajo       .43        .42     .42
Spanish      .85        .82     .84
Turkish      .74        .73     .73
Mean         .70        .66     .68

Table 2: Performance on test languages

4 Results and Discussion

Performance was evaluated using the script provided by the shared task organizers. Table 1 shows the results for the development languages, and Table 2 shows the results for the test languages. While the average F1 score ends up being quite similar for both development and test languages, it's clear within both groups that there are large differences in performance across different languages.

4.1 Error analysis and ways to improve

Noncontiguous stems: The clustering method described in Section 3.4 makes an unjustifiably strong assumption that stems are contiguous substrings, which effectively eliminates its ability to represent nonconcatenative morphology. This limitation contributes to the low score on Maltese, a Semitic language which includes root-and-pattern morphology for certain verbs. The model further assumes that the left or right edge of a word (the side opposite from the adfix direction) is contiguous with the stem. This leads to errors on German, as most verbs have a circumfixing past participle form "ge- + -t" or "ge- + -en". For example, the model correctly assigns "ändern", "änderten", and "ändert" to the same cluster, but incorrectly assigns "geändert" to a separate cluster. We estimate that roughly 30% of the model's incorrect German predictions stem from this issue. This limitation also contributed to our model's poor performance on Basque, which, like Maltese, uses both prefixing and suffixing inflection to express polypersonal agreement.[4]

One obvious way to improve this issue would be to use an extension of the AG framework which can represent nonconcatenative morphology. Botha and Blunsom (2013) present such an extension, replacing the PCFG with a Probabilistic Simple Range Concatenating Grammar. They report successful results for unsupervised segmentation on Hebrew and Arabic. On the other hand, it's unclear whether such a nonconcatenative-focused approach could also adequately represent concatenative morphology. Fullwood and O'Donnell (2013) explore a similar framework, using Pitman-Yor processes to sample separate lexica of roots, templates, and "residue" segments; they find that their model works well for Arabic, but much less well for English. In addition, Eskander et al. (2020) report state-of-the-art morphological segmentation for Arabic using the PrStSu+SM grammar which we also use here. Their findings suggest that, rather than changing the AG framework, we might attempt a more intelligent clustering method based on noncontiguous segmented subsequences rather than contiguous substrings.

Irregular morphology: The strong assumption of contiguous substrings as stems also hinders accurate clustering of irregular forms of any kind, from predictable stem alternations (such as umlaut in German and Swedish, or theme vowels in Portuguese and Spanish) to more challenging suppletive forms such as English "go"-"went". The latter likely requires additional input from semantic representations, but semiregular alternations in forms could also be handled in principle by a more intelligent clustering process. On this point, we note that some small but significant fraction of AG parses of Portuguese verbs grouped verbal theme vowels and inflections together (e.g. parsing "apresentada" as "apresent-ada" rather than "apresenta-da", "apresentarem" as "apresent-arem" rather than "apresenta-rem", and so on), and these parses were crucial to our model's relatively high performance on Portuguese.

[4] We thank an anonymous reviewer for bringing this to our attention.


Derivation vs. inflection: Another issue is that the parses sampled by AGs do not distinguish between inflectional and derivational morphology. This is apparent in Figure 2, where both grammars parse "apportioned" with "-ioned" as the suffix. We seek to address this issue with frequency-based metrics in our clustering method, but frequent derivational adfixes often score high enough to be assigned a wrong paradigm cluster. For example, in English our model correctly clusters "allow", "allows", and "allowed" together, but it also incorrectly assigns "allowance" to the same cluster.

A straightforward way to handle this within our existing approach would be to allow language-specific variation of the score threshold t. As we had no method for unsupervised estimation of t for unfamiliar languages, we did not pursue this; however, a researcher who had minimal familiarity with the language in question might be able to select a more sensible value for t based on inspecting the clusters. Beyond that, the distinction between inflectional and derivational morphology is an intriguing and contested issue within linguistics (e.g. Stump, 2005), and the question of how to model it computationally requires much more attention.

4.2 Things That Didn't Work

We attempted a number of unsupervised approaches beyond AG segmentations, with the goal of incorporating them during the clustering process; however, we could not consistently improve performance with any of them. It seems likely to us that these methods could still be used to improve AG-segmentation-based clusters, but we could not find immediately obvious ways to do this.

FastText As the AG framework only models word structure based on form, we hoped to use the distributional representations learned by FastText (Bojanowski et al., 2017) to incorporate semantic and syntactic information into our model's clusters. We tried several different approaches without success. 1) We trained a skipgram model with a context window of 5 words, a setting often used for semantic applications, in hopes that words from the same paradigm might have similar semantic representations. Agglomerative clustering on these representations alone yielded much worse clusters than the AG method, and we could not find a way to combine them successfully with the AG clusters. 2) Erdmann et al. (2020) trained a skipgram model with a context window of 1 word and a minimum subword length of 2 characters, and used it to cluster words from the same cell rather than the same paradigm (e.g. clustering together English verbs in the third person singular such as "walks" and "jumps"). We attempted to follow this procedure, but it proved too difficult, as paradigm cell information was not explicitly included in the development data for this shared task. 3) We used the method described by Bojanowski et al. (2017) to identify important subwords within a word, in hopes of combining them with AG segmentations. However, the identified subwords did not consistently align with stem-adfix segmentations as we had hoped, and did not seem to provide any additional benefit.
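For reference, the subword inventory FastText associates with a word is the set of boundary-marked character n-grams (by default n = 3 to 6). The toy sketch below is a hypothetical re-implementation for illustration only; the real library additionally hashes n-grams into a fixed number of buckets:

```python
def fasttext_subwords(word, min_n=3, max_n=6):
    """Boundary-marked character n-grams, in the style of FastText
    (Bojanowski et al., 2017). Toy sketch; the real library also
    hashes these n-grams into a fixed number of buckets."""
    marked = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# e.g. the inventory for "walks" includes "<wa", "walk", and "lks>"
subs = fasttext_subwords("walks")
```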

Brown clustering Part of speech tags could provide latent structure as a higher-order grouping for paradigm clusters: for example, verbs would be expected to have paradigms more similar to other verbs than to nouns. Brown clusters (Brown et al., 1992) have been used for unsupervised induction of word classes approximating part of speech tags. We used a spectral clustering algorithm (Stratos et al., 2014) to learn Brown clusters, but they did not reliably correspond to part of speech categories on our development language data.

5 Conclusion

The Adaptor Grammar framework has previously been applied to unsupervised morphological segmentation. In this paper, we demonstrate that AG segmentations can be used for the related task of unsupervised paradigm clustering with successful results, as shown by our system's performance in the 2021 SIGMORPHON Shared Task.

We note that there is still considerable room for improvement in our clustering procedure. Two key directions for future development are more sophisticated treatment of nonconcatenative morphology, and incorporation of additional sources of information beyond the word form alone.

Acknowledgments

This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and by a James S. McDonnell Foundation Scholar Award (#220020374) to the second author.


References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jan A. Botha and Phil Blunsom. 2013. Adaptor grammars for learning non-concatenative morphology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 345–356.

Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. The CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 Shared Task—Morphological Reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The Paradigm Discovery Problem. arXiv:2005.01630 [cs].

Ramy Eskander, Francesca Callejas, Elizabeth Nichols, Judith Klavans, and Smaranda Muresan. 2020. MorphAGram, Evaluation and Framework for Unsupervised Morphological Segmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7112–7122, Marseille, France. European Language Resources Association.

Ramy Eskander, Owen Rambow, and Tianchun Yang. 2016. Extending the Use of Adaptor Grammars for Unsupervised Morphological Segmentation of Unseen Languages. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 900–910, Osaka, Japan. The COLING 2016 Organizing Committee.

Michelle Fullwood and Tim O'Donnell. 2013. Learning non-concatenative morphology. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), pages 21–27, Sofia, Bulgaria. Association for Computational Linguistics.

Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459–466.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy, and Katharina Kann. 2020. Unsupervised Morphological Paradigm Completion. arXiv:2005.00970 [cs].

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 317–325, Boulder, Colorado. Association for Computational Linguistics.

Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007a. Bayesian Inference for PCFGs via Markov Chain Monte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139–146, Rochester, New York. Association for Computational Linguistics.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007b. Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Bernhard Schölkopf, John Platt, and Thomas Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Jim Pitman and Marc Yor. 1997. The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. The Annals of Probability, 25(2):855–900.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-Supervised Morphological Segmentation using Adaptor Grammars. Transactions of the Association for Computational Linguistics, 1:255–266.

Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu. 2014. A spectral algorithm for learning class-based n-gram models of natural language. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Gregory T. Stump. 2005. Word-formation and inflectional morphology. In Handbook of Word-Formation, volume 64, pages 49–71. Springer, Dordrecht, The Netherlands.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online. Association for Computational Linguistics.

Adam Wiemerslage, Arya McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, and Katharina Kann. 2021. The SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally Supervised Morphological Analysis by Multimodal Alignment. Pages 207–216.

A PCFGs

Our system uses the following two grammar specifications, developed by Eskander et al. (2016, 2020). Nonterminals are adapted by default. Non-adapted nonterminals are preceded by 1, indicating an expansion probability of 1, i.e. the PCFG always expands this rule and never caches it.

A.1 Simple+SM

1 Word --> Prefix Stem Suffix
Prefix --> ˆˆˆ SubMorphs
Prefix --> ˆˆˆ
Stem --> SubMorphs
Suffix --> SubMorphs $$$
Suffix --> $$$
1 SubMorphs --> SubMorph SubMorphs
1 SubMorphs --> SubMorph
SubMorph --> Chars
1 Chars --> Char
1 Chars --> Char Chars

A.2 PrStSu+SM

1 Word --> Prefix Stem Suffix
Prefix --> ˆˆˆ
Prefix --> ˆˆˆ PrefMorphs
1 PrefMorphs --> PrefMorph PrefMorphs
1 PrefMorphs --> PrefMorph
PrefMorph --> SubMorphs
Stem --> SubMorphs
Suffix --> $$$
Suffix --> SuffMorphs $$$
1 SuffMorphs --> SuffMorph SuffMorphs
1 SuffMorphs --> SuffMorph
SuffMorph --> SubMorphs
1 SubMorphs --> SubMorph SubMorphs
1 SubMorphs --> SubMorph
SubMorph --> Chars
1 Chars --> Char
1 Chars --> Char Chars


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 90–97, August 5, 2021. ©2021 Association for Computational Linguistics

Orthographic vs. Semantic Representations for Unsupervised Morphological Paradigm Clustering

E. Margaret Perkoff⋆, Josh Daniels†, Alexis Palmer†
⋆Dept. of Computer Science, †Dept. of Linguistics
University of Colorado Boulder
margaret.perkoff, joda4370, [email protected]

Abstract

This paper presents two different systems for unsupervised clustering of morphological paradigms, in the context of the SIGMORPHON 2021 Shared Task 2. The goal of this task is to correctly cluster words in a given language by their inflectional paradigm, without any previous knowledge of the language and without supervision from labeled data of any sort. The words in a single morphological paradigm are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also, usually, show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Additionally, we discuss the merits of randomly initialized centroids versus pre-defined centroids for clustering. Pre-defined centroids are identified based on either a standard longest common substring algorithm or a connected graph method built off of longest common substring. For all development languages, the character-based embeddings perform similarly to the baseline, and the semantic embeddings perform well below the baseline. Analysis of the systems' errors suggests that clustering based on orthographic representations is suitable for a wide range of morphological mechanisms, particularly as part of a larger system.

1 Introduction

One significant barrier to progress in morphological analysis is the lack of available data for most of the world's languages. As a result, there is a dramatic divide between high and low resource languages when it comes to performance on automated morphological analysis (as well as many other language-related tasks). Even for languages

Surface Forms        Morphological Features
walk     bring       V; 1SG; 2SG; 3PL; 1PL
walks    brings      V; 3SG
walking  bringing    PRES; PART
walked   brought     PAST

Table 1: Morphological paradigms for the English verbs walk and bring.

with resources suitable for computational morphological analysis, there is no guarantee that the available data in fact covers all important aspects of the language, leading to significant error rates on unseen data. This uncertainty regarding training data makes unsupervised learning a natural modeling choice for the field of computational morphology. The unsupervised setting takes away the need for large quantities of labeled text in order to detect linguistic phenomena. The SIGMORPHON 2021 shared task aims to leverage the unsupervised setting in order to identify morphological paradigms, at the same time including languages with a wide range of morphological properties.

For a given language, the morphological paradigms are the models that relate root forms (or lemmas) of words to their surface forms. The task we tackle is to cluster surface word forms into groups that reflect the application of a morphological paradigm to a single lemma. The lemma of the paradigm is typically the dictionary citation form, and the corresponding surface forms are inflected variations of that lemma, conveying grammatical properties such as tense, gender, or plurality. For example, Table 1 displays partial clusters for two English verbs: walk and bring.

In developing our system, we consider two types of information that could reasonably play a role in unsupervised paradigm induction. First, the words in a single paradigm cluster are different inflectional variants of an underlying lemma, meaning that the words share a common core meaning. They also, usually, show a high degree of orthographical similarity. Following these intuitions, we investigate KMeans clustering using two different types of word representations: one focusing on orthographical similarity and the other focusing on semantic similarity. Intuitively, we would expect the cluster of forms for walk to be recognizable largely based on orthographic similarity. The partially irregular cluster for bring shows greater orthographical variability in the past-tense form brought and so might be expected to require information beyond orthographic similarity.

System Overview. The core of our approach is to cluster unlabelled surface word forms using KMeans clustering; a complete architecture diagram can be seen in Figure 1. After reading the input file for a particular language to identify the lexicon and alphabet, we transform each word into two different types of vector representations. To capture semantic information, we train Word2Vec embeddings from the input data. The orthography-based representations we learn are character embeddings, again trained from the input data. Details for both representations appear in section 4.1. For the experiments in this paper, we test each type of representation separately, using randomly initialized centers for the clustering. In later work, we plan to explore the integration of both types of representations. We would also like to explore the use of pre-defined centers for clustering. These pre-defined centers could be provided using either a longest common subsequence method or a graph-based algorithm such as that described in section 4.3. The final output of the system is a set of clusters, each one representing a morphological paradigm.

2 Previous Work

The SIGMORPHON 2020 shared task set included an open problem calling for unsupervised systems to complete morphological paradigms. For the 2020 task, participants were provided with the set of lemmas available for each language (Kann, 2020). In contrast, the 2021 SIGMORPHON task 2 outlines that submissions are unsupervised systems that cluster input tokens into the appropriate morphological paradigm (Nicolai et al., 2020). Given the novelty of the task, there is a lack of previous work done to cluster morphological paradigms in an unsupervised manner. However, we have identified key methods from previous work in computational morphology and unsupervised learning that could be combined to approach this problem.

Previous work has identified the benefit of combining rules based on linguistic characteristics with machine learning techniques. Erdmann et al. (2020) established a baseline for the Paradigm Discovery Problem that clusters the unannotated sentences first by a combination of string similarity and lexical semantics and then uses this clustering as input for a neural transducer. Erdmann and Habash (2018) investigated the benefits of different similarity models as they apply to Arabic dialects. Their findings demonstrated that Word2Vec embeddings significantly underperformed in comparison to the Levenshtein distance baseline. The highest performing representation was a combination of FastText and a de-lexicalized morphological analyzer. The FastText embeddings (Bojanowski et al., 2016) have the benefit of including sub-word information by representing words as character n-grams. The de-lexicalized analyzer relies on linguistic expert knowledge of Arabic to identify the morphological closeness of two words. In the context of the paper, it is used to prune out word relations that do not conform to Arabic morphological rules. The approach mentioned greatly benefits from the use of a morphological analyzer, something that is not readily available for low-resource languages. Soricut and Och (2015) focused on the use of morphological transformations as the basis for word representations. Their representation can be quite accurate for affix-based morphology.

Our representations are based entirely off of unlabelled data and do not require linguistic experts to provide morphological transformation rules for the language. Additionally, we hoped to create a system that would be robust for languages that include non-affix based morphology. In this work we compare Word2Vec representations to character-based representations of orthography. We have not yet evaluated additional representations or combinations of the two.

3 Task overview

The 2021 SIGMORPHON Shared Task 2 created a call for unsupervised systems that would create morphological paradigm clusters. This was intended to build upon the shared task from 2020, which focused on morphological paradigm completion. Participants were provided with tokenized Bible data from the JHU Bible corpus (McCarthy et al., 2020) and gold standard paradigm clusters for five development languages: Maltese, Persian, Portuguese, Russian and Swedish. Teams could use this data to train their systems and evaluate against the gold standard files as well as a baseline. The provided baseline groups together words that share a substring of length n and then removes any duplicate clusters. The resulting systems were then used to cluster tokenized data from a set of test languages: Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.
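The substring baseline described above can be sketched as follows. This is a hypothetical re-implementation for illustration; the official shared-task code may differ in details such as how n is chosen or how short words are handled:

```python
from collections import defaultdict

def substring_baseline(lexicon, n=4):
    """Toy sketch of the shared-task baseline: group words that share
    any character substring of length n, then drop duplicate clusters."""
    by_substring = defaultdict(set)
    for word in lexicon:
        if len(word) < n:
            by_substring[word].add(word)   # short words: singleton cluster
        for i in range(len(word) - n + 1):
            by_substring[word[i:i + n]].add(word)
    # deduplicate clusters by freezing each into a frozenset
    return {frozenset(cluster) for cluster in by_substring.values()}

clusters = substring_baseline(["walked", "walking", "bring", "bringing"], n=4)
```

Note that one word can appear in several clusters (one per substring it shares), which is why duplicate-cluster removal is part of the baseline.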

4 System Architecture

The overall architecture of our system includes several distinct pieces, as shown in Figure 1. For a given language, we read the corpus text provided and generate a lexicon of unique words. The lexicon is then fed to an embedding layer and an optional lemma identification layer. The embedding layer generates a vector representation of each word based on either a character level embedding or a Word2Vec embedding. When used, the lemma identification layer generates a set of predefined lemmas from the lexicon based on either the standard longest common substring or a connected graph formed from the longest common substring. The resulting word embeddings, along with the optional set of predefined lemmas, are used as input to a KMeans clustering algorithm. In the event predefined lemmas are not provided, the system defaults to using a randomly initialized set of centroids. Otherwise, the initial centroids for the clusters are the result of finding the appropriate word embedding for the lemmas identified. Once a cluster has been created, the output cluster predictions are formatted into a paradigm dictionary which can be written to a file for evaluation.

4.1 Word Representations

We create two different types of word representations, aiming to capture information that may reflect the relatedness of words within a paradigm.

Character Based Embeddings. To capture orthographic information, we generate a character-based word embedding for the language. For each language we do the following:

1. Generate a lexicon of all the words in the development corpus and an alphabet of unique characters in the language.

2. Identify the maximum word length in the lexicon.

3. Create a dictionary of the alphabet where each character corresponds to a float value between 0 (non-inclusive) and 1 (inclusive).

4. For each word:

(a) Initialize an array of zeros the same size as the maximum-length word.

(b) Map each character in the word, in order, to its respective float value based on our alphabet dictionary. Leave the remaining values as zero.

This representation focuses purely on the characters of the language. For the time being, it does not take into account the relationship between orthographic characters in any of the languages, but future work could attempt to create smarter numerical representations based on these relationships.
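Steps 1 through 4 can be sketched as follows. This is a minimal illustration; the function name and the exact float assignment (alphabet order) are hypothetical, since the paper only requires the character-to-float mapping to be arbitrary:

```python
def char_embeddings(lexicon):
    """Fixed-length character vectors as described above: each character
    maps to a float in (0, 1], and words are zero-padded to the length
    of the longest word in the lexicon."""
    alphabet = sorted({ch for word in lexicon for ch in word})
    # character -> float in (0, 1], assigned arbitrarily by sort order
    char_to_float = {ch: (i + 1) / len(alphabet) for i, ch in enumerate(alphabet)}
    max_len = max(len(word) for word in lexicon)
    vectors = {}
    for word in lexicon:
        vec = [0.0] * max_len
        for i, ch in enumerate(word):
            vec[i] = char_to_float[ch]
        vectors[word] = vec
    return vectors

vecs = char_embeddings(["walk", "walks", "walked"])
```

Because shared prefixes map to identical leading values, "walk" and "walked" agree on their first four dimensions, which is exactly the signal KMeans exploits.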

Word Embeddings with Word2Vec. To incorporate semantic and syntactic information, we use Word2Vec embeddings. Specifically, we train a Word2Vec model for each language with the Gensim skip-gram implementation (Řehůřek and Sojka, 2010).

4.2 (Optional) Lemma Identification

LCS Graph Formation One of the challenges of using clustering-based methods on this problem is determining the number of morphological paradigms expected to be present and then finding suitable lemmas for each to serve as centers for clustering. One potential approach to find lemmas is to first arrange the words into a network graph based on the longest common substring relationships between them. Specifically, for each attested word W in a language's data, the longest common substring (LCS) is calculated between W and every other attested word in the language. Graph edges are then constructed between W and the word (or words, if there are multiple with the same length LCS) that have the longest LCS with W. This process is repeated for every word in the given language's corpus. This results in a large graph that appears to capture many of the morphological dependencies within the language.
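The edge construction can be sketched as follows. This is a hypothetical re-implementation: `lcs_len` is standard longest-common-substring dynamic programming, and `lcs_graph_edges` connects each word to its best LCS partner(s), as described above; the authors' actual code may differ:

```python
def lcs_len(a, b):
    """Length of the longest common substring of a and b (O(|a||b|) DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def lcs_graph_edges(lexicon):
    """For each word W, add an edge to the word(s) sharing the longest
    LCS with W; repeated over the whole lexicon this yields the graph."""
    edges = set()
    for w in lexicon:
        scores = {v: lcs_len(w, v) for v in lexicon if v != w}
        best = max(scores.values())
        for v, s in scores.items():
            if s == best:
                edges.add(frozenset((w, v)))
    return edges

edges = lcs_graph_edges(["walk", "walks", "bring", "brings"])
```

Computing LCS between all pairs is quadratic in the lexicon size, which is consistent with the computational cost the authors report for the graph analysis below.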

Figure 1: Overall statistical clustering architecture diagram. There are two possible word embedding algorithms represented in the diagram (left side of split). The optional lemma identification layer also includes two possible methods (right side).

Next, we split the graph into highly connected subgraphs (HCSs). An HCS is defined as a graph in which the number of edges that would need to be cut to split the graph into two disconnected subgraphs is greater than one half of the number of nodes. This is helpful because, in the LCS graphs generated, morphologically related forms tend to be connected relatively densely to each other and only weakly connected to forms from other paradigms. Additionally, the use of a threshold-based algorithm like HCS, unlike other clustering methods, would allow lemmas to be extracted without having to prespecify the expected number of lemmas beforehand. Unfortunately, during testing the HCS graph analysis proved computationally taxing and could not be completed in time for evaluation, though qualitative analysis of the generated LCS graphs suggests the technique may still be useful with more computational power. We will explore this method further in future work.

4.3 Clustering

The word representations described in section 4.1 are used as input to a clustering algorithm. We use the KMeans algorithm as defined by the sklearn implementation (Pedregosa et al., 2011). The KMeans approach is one of the pioneering algorithms in unsupervised learning (MacQueen et al., 1967). Input values are grouped by continuously shifting clusters and their centers while attempting to minimize the variance of each cluster. This means that the cluster a particular word is assigned to should be as close (as defined by Euclidean distance) to the cluster's center, or the lemma word, as possible.
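To make the clustering step concrete, here is a minimal sketch using sklearn's KMeans on toy character vectors. All vector values here are hypothetical stand-ins for the section 4.1 representations; this is an illustration, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy character vectors of the kind described in section 4.1
# (hypothetical values; two "paradigms" that differ in their characters).
words = ["walk", "walks", "walked", "bring", "brings"]
X = np.array([
    [0.9, 0.1, 0.5, 0.4, 0.0, 0.0],   # walk
    [0.9, 0.1, 0.5, 0.4, 0.7, 0.0],   # walks
    [0.9, 0.1, 0.5, 0.4, 0.3, 0.2],   # walked
    [0.2, 0.6, 0.4, 0.5, 0.8, 0.0],   # bring
    [0.2, 0.6, 0.4, 0.5, 0.8, 0.7],   # brings
])

# Randomly initialized centers, as in the paper's default configuration.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = dict(zip(words, km.labels_))
```

Words whose vectors share leading dimensions (a common stem) fall near the same center, while n_clusters plays the role of the cluster-size hyperparameter discussed next.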

Clustering with Randomly Initiated Centers. For comparison, we evaluate the effectiveness of using randomly initialized centers for our clusters. In the context of this task, this means that the first set of centers fed to the algorithm do not necessarily correspond to any valid word in the given language, or perhaps any language. Another obstacle for this approach in an unsupervised setting is defining the number of clusters to use. Identifying this requires human interference with hyper-parameters that are not going to be cross-linguistically relevant. The size of the input Bible corpus and the inflectional morphology of the language both directly impact the number of clusters, or the number of lemmas, that are relevant. We used a range of cluster sizes for the development languages from 100 to 6000 to evaluate which ones provided the highest accuracy. For the test languages, we chose to submit results for clusters of size 500, 1000, 1500, and 1900 to assess performance variability based on the number of lemmas.

Extension: Initializing with Non-Random Centers. The use of non-random centers would have multiple benefits in the context of this task. This approach would incorporate linguistic information


Language     BL    KMW2V  KMCE
Maltese      0.29  0.19   0.25
Persian      0.30  0.18   0.36
Portuguese   0.34  0.06   0.24
Russian      0.36  0.11   0.34
Swedish      0.44  0.18   0.45

Table 2: F1 scores for each of the model types on all development languages. The best F1 scores are in bold. BL is Baseline, KMW2V is KMeans with Word2Vec embeddings, and KMCE is KMeans with Character Embeddings.

Language    BL    500   1000  1500  1900
Basque      0.21  0.29  0.31  0.27  0.29
Bulgarian   0.39  0.21  0.27  0.29  0.30
English     0.52  0.29  0.37  0.43  0.45
Finnish     0.29  0.19  0.24  0.26  0.27
German      0.38  0.26  0.33  0.38  0.40
Kannada     0.24  0.29  0.30  0.30  0.29
Navajo      0.33  0.33  0.38  0.39  0.41
Spanish     0.39  0.24  0.29  0.30  0.31
Turkish     0.25  0.16  0.20  0.22  0.22

Table 3: F1 scores for the baseline (BL) and the KMCE models on the test languages. The best F1 scores are in bold. Test languages were evaluated on KMCE models with clusters of size 500, 1000, 1500, and 1900.

to inform the initial set of centers. This could lead to quicker convergence of a model due to more intelligently picked centers. It could also prevent the model from being skewed towards less than ideal center values. Additionally, with pre-defined centers we can remove the need to arbitrarily define the number of clusters.

In the scope of this task, we were unable to experiment with pre-defined center values, but we have proposed two potential methods for doing so: using longest common substrings and picking highly connected nodes from an LCS graph formation. The longest common substring approach would mimic the lemma identification approach described above (4.2). Both of these systems are represented as an optional lemma identification layer on the right-hand side of Figure 1. The output of each one would be a set of words to use as centers. Each word would be converted to the appropriate word representation and then fed as an input to the KMeans clustering.

5 Results

Table 2 shows results to date. We compare the two representation methods on the development languages. The KMeans clusterings for the development languages were generated based on optimal cluster values, starting with size 100 and increasing to a cluster size of 6000, or until the accuracy no longer improved with an increase in cluster size. For the Word2Vec embeddings we used clusterings of size 110 for Maltese, 130 for Persian, 1490 for Portuguese, 1490 for Russian, and 1490 for Swedish. With the character embeddings, we had 540 clusters for Maltese, 110 clusters for Persian, 2200 clusters for Portuguese, 4000 clusters for Russian, and 5400 clusters for Swedish. The F1 scores provided are based on comparing the appropriate model's predictions to the gold paradigms for this task, using the evaluation function defined in the SIGMORPHON 2021 Task 2 GitHub repository. The KMCE models clearly and consistently outperform the KMW2V models for all development languages.

For the test languages, we run clustering only with the better-performing character-based representations. The performance on test languages was evaluated with clusters of size 500, 1000, 1500, and 1900. These results are in Table 3. We found that our algorithm outperformed the baseline for Basque, German, Kannada, and Navajo. For both Basque and Kannada, the largest clustering did not have the highest result, suggesting that the corpora provided for these languages contain a smaller number of morphological paradigms. In the case of Bulgarian, English, Spanish, and Finnish, we note that the KMCE model performance increases with each increase in cluster size. This suggests that the model accuracy would continue increasing if we ran the model for these languages with a higher number of clusters. Additional discussion of the error analysis appears in section 6, and of the results in section 7.

6 Error Analysis

We have evaluated the results from the Word2Vec representations and our character-based embedding and compared them to the gold standard paradigms provided by the task organizers. We have found that, overall, the character-based version is more robust on regular verb forms than the Word2Vec version, and that neither is effective on irregular forms. Additionally, we explore some of the nuanced errors with the character-based embeddings and how they could be addressed in future work.

6.1 Regular Verb Forms

Our results are consistent with our initial expectation that an orthographic word representation would perform better on regular verb forms than the Word2Vec representation, since it weights closeness based on the characters of the word. The character embedding correctly groups many surface forms together based on regular English morphological paradigms, i.e. those that follow the pattern of -ed for past tense, -s for third person singular present, and -∅ for first, second, and third person plural present. However, there were sometimes words missing from the paradigm. For example, the system generated the paradigm stumble, stumbles, stumbling. This should have included stumbled, but instead that form is in a paradigm with thaddaeus. In contrast, the Word2Vec representation separates all four surface forms into different morphological paradigms. For Spanish paradigms, we see that the character embeddings perform well for matching some of the regular surface forms together, but cannot handle longer suffixes. For example, aprendas, aprendan and aprenden are grouped together while leaving out longer surface forms like aprenderemos. Similarly, hablaste, hablaras, hablaran and hablara are grouped in the same morphological paradigm, but hablar, hablan, hables and hablen are part of a separate grouping. We discuss the issue of errors related to word length in detail below.

6.2 Irregular Verb Forms

Because it focuses on semantic relatedness, we expected the Word2Vec representation to be more accurate in grouping together irregular surface forms from the same paradigm. For example, we have the paradigm for go: go, goes, going, went. In fact, Word2Vec created a morphological paradigm for go, one for upstairs, carry, goes, favour, going, reply, robbed, dared, gaius, failed, godless, and one for went. The orthographic representation also produced some undesired results, with a paradigm for eyes, goat, goes, gone, gong, else, eloi, noah, none, sons, long and another for gains, noisy, lasea, lysias, gates, lying, fatal, often, notes, loses, latin, latest, going. The other surface forms of went and go also ended up in separate morphological paradigms. These results suggest that neither representation is currently robust enough to handle irregular verb forms.

6.3 Character Distance Errors

In some cases, the character representations result in strange cluster formations due to the usage of Euclidean distance in the sklearn KMeans library. Since each character in the language's alphabet was mapped arbitrarily to a numeric value, the closeness of a pair of characters does not reflect a morphological relationship between those symbols. However, characters that are assigned numerical values that are closer to one another will be classified as closer by the Euclidean distance algorithm. It would be possible to learn more about the language-specific character relations by training a recurrent network with a focus on the character sequence alignments. This network could then be used as an encoder to generate character-level embeddings.
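To illustrate the problem (a hypothetical reconstruction of the setup, not the submitted code): with an arbitrary character-to-integer mapping, Euclidean distance rewards accidental numeric adjacency between letters rather than any morphological similarity.

```python
import numpy as np

# Hypothetical arbitrary character-to-integer mapping, as described above.
alphabet = "abcdefghijklmnopqrstuvwxyz"
char2num = {c: i for i, c in enumerate(alphabet)}

def encode(word, max_len=8):
    """Encode a word as a fixed-length vector of arbitrary character codes."""
    vec = np.zeros(max_len)
    for i, c in enumerate(word):
        vec[i] = char2num[c]
    return vec

# 'cat'/'bat' and 'cat'/'rat' each differ by one substitution, yet Euclidean
# distance treats them very differently, only because b (1) is numerically
# closer to c (2) than r (17) is.
d_cb = np.linalg.norm(encode("cat") - encode("bat"))
d_cr = np.linalg.norm(encode("cat") - encode("rat"))
print(d_cb, d_cr)  # 1.0 vs. 15.0
```

A learned character encoder, as suggested above, would replace these arbitrary codes with vectors whose distances reflect actual distributional behavior.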

6.4 Non-Affix Based Morphology

For verb forms in English that do not use a regular affixation paradigm, we find that some surface forms are paired together in the correct clusters, but those clusters often contain additional unrelated words. Consider the following group: drank, drinks, drink, drunk, breaks, break, branch. In this cluster, we see that drank, drinks, drink and drunk were all correctly identified as being related. The algorithm also matched break and breaks together. This suggests that character representations have the potential to identify morphological transformations that occur at different points in the word, as opposed to just prefixes or suffixes. However, the result is a combination of what should be three distinct morphological paradigms, including a unique paradigm for branch. In Navajo, jidooleeł and dadooleeł are correctly put in the same paradigm. However, this paradigm also includes jidooleełgo and dadooleełgo, which are not morphologically related. We also see the tendency of over-grouping in Basque, where bitzate, nitzan, and ditzan are all grouped together along with over ten other unrelated forms. This could potentially be addressed by increasing the number of clusters to favor smaller clusters. Adding semantic features to the word embeddings, such as part of speech or limited context windows, may also help filter out words that are not relevant to a particular paradigm.


6.5 Word Length

Another type of cluster error has to do with word length. The word representation vectors were sized based on the longest word present in a given language's corpus. If a word is under the maximum length, the remainder of the vector is filled with zeros. This means that words that are similar in length are more likely to be paired together in a cluster. The gold data created the morphological paradigm crowd, crowds, crowding, while ours created two separate clusterings: crowd, crowds and brawling, proposal, crowding. This is also present in the clustering of certain words in Navajo. Our algorithm grouped nizaad and bizaad together, but some of the longer forms in this paradigm were excluded, such as danihizaad and nihizaad. In future work, we would attempt to mitigate this by using subword distances or cosine similarity as the basis for distance metrics in a clustering algorithm. This could prevent inaccurate groupings due to large affix lengths.
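A hypothetical sketch of the padding effect: with zero-padded character-code vectors, an unrelated word of similar length (brawl is an illustrative word, not from the clusters above) can end up closer to crowd than the morphologically related crowding.

```python
import numpy as np

# Illustrative reconstruction of the padding scheme described above:
# vectors are sized to the longest word and padded with zeros.
char2num = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

def encode(word, max_len=10):
    vec = np.zeros(max_len)
    for i, c in enumerate(word):
        vec[i] = char2num[c]
    return vec

# 'crowding' differs from 'crowd' only in trailing padding positions, yet
# those positions dominate the distance, so the unrelated but similar-length
# 'brawl' ends up closer to 'crowd'.
d_related = np.linalg.norm(encode("crowd") - encode("crowding"))
d_unrelated = np.linalg.norm(encode("crowd") - encode("brawl"))
print(d_related > d_unrelated)  # True
```

Cosine similarity or subword-based distances, as proposed above, would reduce the influence of these padding positions.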

7 Discussion, Conclusions, Future Work

Overall, these results demonstrate an improvement over the baseline in several languages, namely Persian, Swedish, Basque, German, Kannada, and Navajo, when using KMeans clustering over character embeddings. This suggests that embedding-based clustering systems merit further exploration as a potential approach to unsupervised problems in morphology. The fact that the character embedding system outperformed the W2V one, together with the fact that performance was strongest on words with regular inflectional paradigms, suggests that this approach might be best suited to synthetic and agglutinating languages in which morphology is encoded fairly simply within the orthography of the word. Languages that rely heavily on more complex morphological processes, particularly non-concatenative morphology, would likely require an extension of this system that integrates more sources of non-orthographic information, or a different approach altogether.

One obvious avenue for building on this research is to find more efficient and more effective methods for the initial process of lemma identification. Developing a set of lemmas would allow a pre-defined set of centers to be fed into the clustering algorithm rather than using randomly defined centers, which would likely improve performance. This could be done by leveraging an initial rule-based analysis or through the threshold-based graph clustering technique discussed above. Other potential variations on that approach, once the problem of computational limits has been solved, include using longest common subsequences rather than longest common substrings, and weighting graph edges by the length of the LCS between the two words. The former would potentially help accommodate forms of non-concatenative morphology, while the latter would potentially include more information about morphological relationships than an unweighted graph does. Future research should also explore how other sources of linguistic information could be leveraged for this task. This could include other forms of semantic information outside of the context-based semantics used by W2V, as well as things like the orthographic-phonetic correspondences in a given language.

Finally, we would like to explore filtering the output clusters according to language-specific properties in order to improve the overall results. This would involve adding additional layers to our system architecture that take place after a distance-based clustering. One such layer could prune unlikely clusters based on morphological transformations, such as the method used by Soricut and Och (2015). Future unsupervised systems for clustering morphological paradigms should consider the benefits of hierarchical models that leverage different algorithm types to gain the most information possible.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Alexander Erdmann and Nizar Habash. 2018. Complementary strategies for low resourced morphological modeling. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 54–65, Brussels, Belgium. Association for Computational Linguistics.

Katharina Kann. 2020. Acquisition of inflectional morphology in artificial neural networks with prior knowledge. In Proceedings of the Society for Computation in Linguistics 2020, pages 144–154, New York, New York. Association for Computational Linguistics.

J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281–297. University of California Press.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Garrett Nicolai, Kyle Gorman, and Ryan Cotterell, editors. 2020. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics, Online.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado. Association for Computational Linguistics.

Radim Rehurek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 46–50, Valletta, Malta. University of Malta.


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 98–106, August 5, 2021. ©2021 Association for Computational Linguistics

Unsupervised Paradigm Clustering Using Transformation Rules

Changbing Yang    Garrett Nicolai    Miikka Silfverberg
University of Colorado Boulder    University of British Columbia
[email protected]    [email protected]

Abstract

This paper describes the submission of the CU-UBC team for the SIGMORPHON 2021 Shared Task 2: Unsupervised morphological paradigm clustering. Our system generates paradigms using morphological transformation rules which are discovered from raw data. We experiment with two methods for discovering rules. Our first approach generates prefix and suffix transformations between similar strings. Secondly, we experiment with more general rules which can apply transformations inside the input strings in addition to prefix and suffix transformations. We find that the best overall performance is delivered by prefix and suffix rules, but more general transformation rules perform better for languages with templatic morphology and very high morpheme-to-word ratios.

1 Introduction

Supervised sequence-to-sequence models for word inflection have delivered impressive results in the past few years, and a number of shared tasks on supervised learning of morphology have helped to raise the state of the art on this task (Cotterell et al., 2016, 2017, 2018; McCarthy et al., 2019; Vylomova et al., 2020). In contrast, unsupervised approaches to morphology have received far less attention in recent years. Nevertheless, the question of whether the morphological system of a language can be discovered from raw text data alone is certainly an interesting one.

This paper describes the submission of the CU-UBC team for the SIGMORPHON 2021 Shared Task 2: Unsupervised morphological paradigm clustering (Wiemerslage et al., 2021).1 The objective of this task is to group

1 github.com/changbingY/Sigmorph-2021-task2

the distinct inflected forms of lexemes occurring in a corpus into morphological paradigms. Figure 1 illustrates the task.

Our system generates paradigms using morphological transformation rules which are discovered from raw data. As an example, consider the rule ed → ing, which maps an English past tense verb form like walked into the present participle walking. In this paper, we use regular expressions of symbol pairs (that is, regular relations) in the well-known Xerox formalism (Beesley and Karttunen, 2003) to denote rules: for example, ?+ e:i d:n 0:g. These rules can be applied using composition of regular relations:

[w a l k e d] .o. [?+ e:i d:n 0:g]

will result in the output form w a l k i n g. We cluster forms into the same paradigm if we can find morphological transformation rules which map one of the forms into the other. Our approach is illustrated in Figure 2.

We experiment with two methods for discovering rules, described in Section 3.2. Our first approach is inspired by work on morphology discovery by Soricut and Och (2015), who generate prefix and suffix transformations between similar strings. This idea closely parallels our approach for extracting rules. Unlike Soricut and Och (2015), however, we do not utilize word embeddings when extracting rules due to the very small size of the shared task datasets. In addition to prefix and suffix rules, we also experiment with more general discontinuous transformation rules which can apply transformations to infixes as well as prefixes and suffixes. For example, the rule

?+ i:0 ?+ e:i ?+ 0:t

would transform the input form gidem (‘to bite’ in Maltese) to gdimt. Our results


Figure 1: The unsupervised paradigm clustering task. Raw data such as "the cat meowed" and "the cats are meowing" is grouped into the paradigms {cat, cats}, {meowed, meowing}, {the}, and {are}.

demonstrate that prefix and suffix rules deliver stronger performance for most languages in the shared task dataset, but our more general transformation rules are beneficial for templatic languages like Maltese and languages with a high morpheme-to-word ratio like Basque.

2 Related Work

The unsupervised paradigm clustering task is closely related to the 2020 SIGMORPHON shared task on unsupervised morphological paradigm completion (Kann et al., 2020). However, paradigm clustering systems do not infer missing forms in paradigms. Our system resembles the baseline system for the paradigm completion task (Jin et al., 2020), which also extracts transformation rules, however, in the form of edit trees (Chrupala et al., 2008).

Several approaches to unsupervised or minimally supervised morphology learning, which share characteristics with our system, have been proposed. Our rules are essentially identical to the FST rules used by Beemer et al. (2020) for the task of supervised morphological inflection. Likewise, Durrett and DeNero (2013) and Ahlberg et al. (2015) both extract inflectional rules after aligning forms from known paradigms. Yarowsky and Wicentowski (2000) also generate rules for morphological transformations, but their system for minimally supervised morphological analysis requires additional information in the form of a list of morphemes as input.

Erdmann et al. (2020) present a task called the paradigm discovery problem which is quite similar to the unsupervised paradigm clustering task. In their formulation of the task, inflected forms are clustered into paradigms, and corresponding forms in distinct paradigms (like all plural forms of English nouns) are clustered into cells. Their benchmark system is based on splitting every form into a (potentially discontinuous) base and exponent, where the base is the longest common subsequence of the forms in a paradigm and the exponent is the residual of the form. They then maximize the base in each paradigm while minimizing the exponents of individual forms.

3 Methods

This section describes how we extract rules from the dataset and apply them to paradigm clustering. We also describe methods for filtering out extraneous forms from generated paradigms.

3.1 Baseline

As a baseline, we use the character n-gram clustering method provided by the shared task organizers (Wiemerslage et al., 2021). Here all forms sharing a given substring of length n are clustered into a paradigm. Duplicate paradigms are removed. The hyperparameter n can be tuned on validation data if such data is available (we use n = 5 in all our experiments).
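As a rough illustration (not the organizers' actual code), the baseline's shared-n-gram grouping can be sketched as follows; the vocabulary is a toy example and n = 5 matches the setting used in our experiments.

```python
def ngrams(word, n):
    """All character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def baseline_clusters(vocab, n=5):
    """Group all forms sharing any character n-gram; drop duplicate groups."""
    by_gram = {}
    for w in vocab:
        for g in ngrams(w, n):
            by_gram.setdefault(g, set()).add(w)
    # Remove duplicate paradigms by deduplicating identical groups.
    unique = {frozenset(c) for c in by_gram.values()}
    return [sorted(c) for c in unique]

clusters = baseline_clusters(["walked", "walking", "walks", "talked"], n=5)
print(clusters)  # 'walked' and 'talked' share the 5-gram 'alked'
```

Note that in this sketch a form can appear in several overlapping groups; it is intended only to convey the shared-substring criterion, not the exact baseline output.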

3.2 Transformation rules

Our approach builds on the baseline paradigms discovered in the previous step. We start by extracting transformation rules between all word forms in a single baseline paradigm. For each pair of strings like dog and dogs belonging to a paradigm, we generate a rule like ?+ 0:s which translates the first form into the second one. From a paradigm of size n, we can therefore extract n² − n rules, one for each ordered pair of distinct word forms. Preliminary experiments showed that large baseline paradigms tended to generate many incorrect rules which did not represent genuine morphological transformations. We, therefore, limited rule discovery to paradigms spanning maximally 20 forms.

After generating transformation rules, we compute rule frequency over all baseline paradigms and discard rare rules which are unlikely to represent genuine morphological transformations (the minimum threshold for rule frequency is a hyperparameter). The remaining rules are then applied iteratively to


Figure 2: A schematic representation of our approach. We start by generating preliminary paradigms using the baseline method. We then extract transformation rules for each word pair in our paradigms, noting how many times each unique rule occurred. For example, here both (dog, dogs) and (cat, cats) result in a rule ?+ 0:s, which therefore has count 2. Subsequently, we discard rare rules like h:0 o:0 t:0 ?+ which are unlikely to represent genuine morphological transformations. We then use the remaining rules to reconstruct our morphological paradigms as explained in Section 3.3.

our datasets to construct paradigms. We experiment with two rule types, which are described below.
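The pairwise rule extraction and frequency filtering described above can be sketched roughly as follows. For brevity, this hypothetical implementation represents a rule as a suffix substitution only; the actual system also supports prefix and discontinuous transformations.

```python
from collections import Counter
from itertools import permutations

def suffix_rule(src, dst):
    """Return a (suffix_out, suffix_in) rule mapping src to dst."""
    i = 0
    while i < min(len(src), len(dst)) and src[i] == dst[i]:
        i += 1
    return (src[i:], dst[i:])  # e.g. ('', 's') for dog -> dogs

def extract_rules(paradigms, min_freq=2):
    """Count rules over all ordered pairs; keep those above a threshold."""
    counts = Counter()
    for paradigm in paradigms:
        if len(paradigm) > 20:  # skip overly large baseline paradigms
            continue
        for a, b in permutations(paradigm, 2):  # n^2 - n ordered pairs
            counts[suffix_rule(a, b)] += 1
    return {r for r, c in counts.items() if c >= min_freq}

rules = extract_rules([["dog", "dogs", "hotdog"], ["cat", "cats"]])
print(rules)  # the hotdog-specific rules occur only once and are discarded
```

Here the minimum frequency 2 is an arbitrary illustration of the hyperparameter mentioned above.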

3.2.1 Prefix and Suffix Rules

Our first approach to rule discovery is based on identifying a contiguous word stem shared by both forms. The stem is defined as the longest common substring of the forms. We split both forms into a prefix, stem and suffix. The morphological transformation is then defined as a joint substitution of a prefix and suffix. For example, given the German forms acker+n and ge+acker+t (German ‘to plow’), we would generate the rule:

0:g 0:e ?+ n:t

As mentioned above, these rules are extracted from paradigms generated by the baseline system.

We also experiment with a more restricted form of these rules in which only suffix transformations are allowed. While this limits the possible transformations, it will also result in fewer incorrect rules and may, therefore, deliver better performance for languages which are predominantly suffixing.
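A minimal illustration of the prefix/suffix split (an approximation, not the shared-task code), using Python's SequenceMatcher to find the longest common substring:

```python
from difflib import SequenceMatcher

def ps_rule(src, dst):
    """Split src and dst around their longest common substring and return
    the joint ((src_prefix, dst_prefix), (src_suffix, dst_suffix)) rule."""
    m = SequenceMatcher(None, src, dst).find_longest_match(
        0, len(src), 0, len(dst))
    return ((src[:m.a], dst[:m.b]),
            (src[m.a + m.size:], dst[m.b + m.size:]))

# German ackern -> geackert: add the prefix ge- and replace -n with -t,
# mirroring the rule 0:g 0:e ?+ n:t above.
print(ps_rule("ackern", "geackert"))  # (('', 'ge'), ('n', 't'))
```

In the actual system these substitutions are encoded as regular relations in the Xerox formalism rather than as string tuples.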

3.2.2 Discontinuous rules

Even though prefix and suffix transformations are adequate for representing morphological transformations in many languages, they fail to derive the appropriate generalizations for languages with templatic morphology like Maltese (which was included among the development languages). For example, it is impossible to identify a contiguous stem-like unit spanning more than a single character for the Maltese forms gidem ‘to bite’ and gdimt. We need a rule which can apply transformations inside the input string:

?+ i:0 ?+ e:i ?+ 0:t

Like prefix and suffix rules, discontinuous rules are generated from baseline paradigms. Unlike prefix and suffix rules, however, discontinuous rules require a character-level alignment between the input and output string. To this end, we start by generating a dataset consisting of all string pairs like (dog, dogs) and (hotdog, dog), where both strings belong to the same paradigm. We then apply a character-level aligner based on the iterative Markov chain Monte Carlo method to this dataset.2 Using this method, we can jointly align all string pairs in the baseline paradigms. This is beneficial because the MCMC aligner will prefer common substitutions, deletions and insertions over rare ones,3 which enforces consistency of the alignment over the entire dataset. This in turn can help us find linguistically motivated transformation rules.

Character-level alignment results in pairs:

INPUT:  d o g 0
OUTPUT: d o g s

INPUT:  h o t d o g
OUTPUT: 0 0 0 d o g

Each symbol pair in the alignment represents one of the following types: (1) an identity pair x:x, (2) an insertion 0:x, (3) a deletion x:0, or (4) a substitution x:y. In order to convert a pair of aligned strings into a transformation

2 This aligner was initially used for the baseline system in the 2016 iteration of the SIGMORPHON shared task (Cotterell et al., 2016).

3 This is a consequence of the fact that the algorithm iteratively maximizes the likelihood of the alignment for each example given all other examples in the dataset.


rule, we simply replace all contiguous sequences of identity pairs with ?+. For the alignments above, we get the rules ?+ 0:s and h:0 o:0 t:0 ?+.
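This conversion can be sketched as follows; the function and the list-of-tuples encoding of aligned pairs are illustrative assumptions rather than the authors' implementation.

```python
def pairs_to_rule(aligned):
    """Collapse runs of identity pairs into the wildcard '?+' and emit the
    remaining pairs in x:y notation (with '0' as the empty symbol)."""
    rule, in_identity_run = [], False
    for x, y in aligned:
        if x == y and x != "0":
            if not in_identity_run:  # emit '?+' once per identity run
                rule.append("?+")
                in_identity_run = True
        else:
            rule.append(f"{x}:{y}")
            in_identity_run = False
    return " ".join(rule)

print(pairs_to_rule([("d", "d"), ("o", "o"), ("g", "g"), ("0", "s")]))
# -> ?+ 0:s
print(pairs_to_rule([("h", "0"), ("o", "0"), ("t", "0"),
                     ("d", "d"), ("o", "o"), ("g", "g")]))
# -> h:0 o:0 t:0 ?+
```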

3.3 Iterative Application of Rules

After extracting a set of rules from baseline paradigms, we discard the baseline paradigms. We then construct new paradigms using our rules. We start by picking a random word form w from the dataset. We then form the paradigm P for w as the set of all forms in our dataset which can be derived from w by applying our rules iteratively. For example, given the form eats and the rules:

?+ s:0 and ?+ 0:i 0:n 0:g

the paradigm of eats would contain both eat (generated by the first rule) and eating (generated by the second rule from eats), provided that both of these forms are present in our original dataset. All forms in P are removed from the dataset, and we then repeat the process for another randomly sampled form in the remaining dataset. This continues until the dataset is exhausted. The procedure is sensitive to the order in which we sample forms from the dataset, but exploring the optimal way to sample forms falls beyond the scope of the present work.
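The clustering loop described above can be sketched roughly as follows; apply_rule and the toy suffix rules are hypothetical stand-ins for the FST rules used in the actual system, and rules are applied iteratively as for discontinuous rules.

```python
import random

def apply_rule(word, rule):
    """Apply a toy (old_suffix, new_suffix) rule, or return None."""
    old, new = rule
    if word.endswith(old):
        return word[:len(word) - len(old)] + new
    return None

def build_paradigms(vocab, rules, seed=0):
    """Repeatedly sample a form and collect everything derivable from it."""
    remaining, paradigms = set(vocab), []
    rng = random.Random(seed)
    while remaining:
        w = rng.choice(sorted(remaining))
        paradigm, frontier = {w}, [w]
        while frontier:  # iterate until no new attested forms derive
            form = frontier.pop()
            for rule in rules:
                out = apply_rule(form, rule)
                if out in remaining and out not in paradigm:
                    paradigm.add(out)
                    frontier.append(out)
        paradigms.append(sorted(paradigm))
        remaining -= paradigm
    return paradigms

rules = [("s", ""), ("", "s"), ("s", "ing"), ("ing", "s")]
print(build_paradigms(["eats", "eat", "eating", "cat"], rules))
```

As in the text, only forms present in the original dataset are admitted; cat remains a singleton because no rule maps it to another attested form.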

For prefix and suffix rules, we limit rule application to a single iteration because this delivered better results in practice. Applying rules iteratively tended to result in very large paradigms. For discontinuous rules, we do apply rules iteratively.

3.4 Filtering Paradigms

According to our preliminary experiments, many large paradigms generated by transformation rules contained word forms which were morphologically unrelated to the other forms in the paradigm. To counteract this, we experimented with three strategies for filtering out individual extraneous forms from generated paradigms: the degree test, the rule-frequency test and the embedding-similarity test. Forms which fail all three tests are removed from the paradigm.4

4 These filtering strategies are applied to paradigms containing > 20 forms. This threshold was determined based on examining the output clusters for the development languages.

Figure 3: Given the candidate paradigm walk, wall, walking, walked, walks, we can form a graph where two word forms are connected if a rule like ?+ 0:e 0:d derives one of the forms (here walked) from the other (walk). We experiment with filtering out forms which have low degree in this graph, since those are more likely to be spurious additions resulting from rules like ?+ l:k in the example, which do not capture genuine morphological regularities. Here, wall might be filtered out because it has degree one, compared to all other forms, which have degree three.

If we first generate all paradigms and then filter out extraneous forms, we will be left with a number of forms which have not been assigned to a paradigm. In order to circumvent this problem, we apply filtering immediately after generating each individual paradigm. Forms which are filtered out from the paradigm are placed back into the original dataset. They can then be included in paradigms which are generated later in the process.

Degree test Our morphological transformation rules induce dependencies, and therefore a graph structure, between the forms in a paradigm, as demonstrated in Figure 3. Within each paradigm, we calculate the degree of a word in the following way: for each attested word w in the generated paradigm, its degree is the number of forms w′ in the paradigm for which we can find a transformation rule mapping w → w′. We increment the degree if there is at least one edge between words w and w′ in the paradigm (the number of distinct rules mapping form w to w′ is irrelevant here, as long as there is at least one). If the degree of a word is less than a third of the paradigm size, the word fails the degree test.
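A sketch of the degree test under these definitions; the derives predicate is a hypothetical stand-in for checking whether any rule maps one form to another.

```python
def degree_filter(paradigm, rules, derives):
    """Keep forms whose degree is at least a third of the paradigm size.
    derives(w1, w2, rules) should report whether any rule maps w1 to w2."""
    degrees = {
        w: sum(1 for v in paradigm if v != w and derives(w, v, rules))
        for w in paradigm
    }
    return [w for w in paradigm if degrees[w] >= len(paradigm) / 3]

# Toy predicate: forms count as connected if they share a 4-character prefix.
derives = lambda a, b, rules: a[:4] == b[:4]
paradigm = ["walk", "walks", "walked", "walking", "wall"]
print(degree_filter(paradigm, [], derives))  # 'wall' has degree 0 and is dropped
```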

Rule-Frequency test Some rules, like ?+ e:i d:n 0:g for English, represent genuine inflectional transformations and will therefore occur often in our datasets. Others, like the rule ?+ l:k in Figure 3, instead result from coincidence, and will usually have low frequency. We can, therefore, use rule frequency as a criterion when identifying extraneous forms in generated paradigms. We examine the cumulative frequency of all rules applying to the form in our paradigm. If this frequency is lower than the median cumulative frequency in the paradigm, the form fails the rule-frequency test.

Embedding-similarity test If a word fails to pass the degree and rule-frequency tests, we measure the semantic similarity of the given form with other forms in the paradigm. To this end, we trained FastText embeddings (Bojanowski et al., 2017) and calculated cosine similarity between embedding vectors as a measure of semantic relatedness.5 We start by selecting two reference words in the paradigm which have high degree (at least 50% of the maximal degree) and whose cumulative rule frequency is above the paradigm's median value. We then compute their cosine similarity as a reference point r. For all other words in the paradigm, we then compute their cosine similarity r′ to one of the reference forms. Forms fail the embedding-similarity test if r′ < 0.5 and r − r′ > 0.3.
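The thresholds above can be sketched as follows, with toy 2-dimensional vectors standing in for the 300-dimensional FastText embeddings used in the actual system.

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def fails_embedding_test(vec, ref1, ref2):
    """Apply the r' < 0.5 and r - r' > 0.3 criteria described above."""
    r = cosine(ref1, ref2)       # similarity between the two reference forms
    r_prime = cosine(vec, ref1)  # similarity of the candidate to a reference
    return r_prime < 0.5 and r - r_prime > 0.3

# A candidate orthogonal to the references fails; a near-parallel one passes.
print(fails_embedding_test([0.0, 1.0], [1.0, 0.0], [0.9, 0.1]))  # True
print(fails_embedding_test([1.0, 0.1], [1.0, 0.0], [0.9, 0.1]))  # False
```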

4 Experiments and Results

In this section, we describe experiments on the shared task development and test languages.

4.1 Data and Resources

The shared task uses two data resources. Corpus data for the five development languages (Maltese, Persian, Portuguese, Russian and Swedish) and nine test languages (Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish and Turkish) is sourced from the Johns Hopkins Bible Corpus (McCarthy et al., 2020b). For most of the languages, complete Bibles were provided, but for some of them we only had access to a subset (see Wiemerslage et al. (2021) for details). Gold standard paradigms were automatically generated using the Unimorph 3.0 database (McCarthy et al., 2020a).

5 We train 300-dimensional embeddings with context window 3 and use character n-grams of size 3-6.

4.2 Experiments on Validation Languages

Since our transformation rules are generated from paradigms discovered by the baseline system, which contain incorrect items, it is to be expected that some incorrect rules are generated. We filter out infrequent rules, as they are less likely to represent genuine morphological transformations. For prefix and suffix rules (PS), we experimented with including the top 2000 (PS-2000), 5000 (PS-5000), and all rules (PS-all), as measured by rule frequency. Additionally, we present experiments using a system which relies exclusively on suffix transformations, including all of them regardless of frequency (S-all). For discontinuous rules (D), we used lower thresholds because our preliminary experiments indicated that incorrect generalizations were a more severe problem for this rule type. We selected the 200 (D-200), 300 (D-300), and 500 (D-500) most frequent rules, respectively. Results with regard to best-match F1 score (see Wiemerslage et al. (2021) for details) are shown in Table 1.

According to the results, all of our systems outperform the baseline system by at least 25.53%, as measured using the mean best-match F1 score. Plain suffix rules (S-all) provide the best performance with a mean F1 score of 65.41%, followed by the other affixal systems (PS-2000, PS-5000 and PS-all). On average, discontinuous rules (D-200, D-300 and D-500) are slightly less successful, but they deliver the best performance for Maltese. Table 1 demonstrates that simply increasing the number of rules does not always contribute to better performance: the optimal threshold varies between languages.

As explained in Section 3.4, we aim to filter out extraneous forms from overly large paradigms. We applied this approach to discontinuous rules with a 500-rule threshold. Results are shown in Table 2. As the table shows, the filtering strategy offers very limited improvements. Most of the languages do not benefit from this approach, and even for languages which do, the gain is minuscule. Due to their very limited effect, we did not apply filtering strategies to the test languages.


            Maltese  Persian  Portuguese  Russian  Swedish    Mean
Baseline      29.07    30.04       34.15    36.30    43.62   34.64
PS-2000       35.41    50.17       65.53    81.20   *81.14   62.69
PS-5000       36.81    50.40       71.33   *81.96    79.82   64.06
PS-all        40.67    53.15       76.63    75.39    72.46   63.66
S-all         30.32    52.69      *82.67    80.65    80.74  *65.41
D-200         42.99   *54.65       66.86    70.38    68.76   60.73
D-300         42.99    53.64       69.38    72.33    67.14   61.10
D-500        *45.05    51.82       66.37    75.26    62.30   60.16

Table 1: F1 scores for each of the model types on all development languages. The best F1 score in each column is marked with *.

            Maltese  Persian  Portuguese  Russian  Swedish    Mean
Baseline      29.07    30.04       34.15    36.30    43.62   34.64
D-500         45.05    51.82       66.37    75.26    62.30   60.16
Filter        45.05    51.82       66.45    75.26    62.30   60.18

Table 2: F1 scores for the discontinuous-rule system (D-500) and the filtering system across the five validation languages.

4.3 Experiments on Test Languages

Results for the test languages are presented in Table 3. We find that all of our systems surpassed the baseline results by at least 23.06% in F1 score. The system using all of the suffix rules (S-all) displays the best performance, with an F1 score of 66.12%. Among the discontinuous systems, the system with a threshold of 500 has the best results. On average, the affixal systems outperform the discontinuous ones. In particular, these methods perform best on languages which are known to be predominantly suffixing, such as English, Spanish, and Finnish. Contrarily, discontinuous rules deliver the best performance for Navajo, a strongly prefixing language. Discontinuous rules also result in the best performance for Basque, which has a very high morpheme-to-word ratio.

In order to better understand the behavior of our systems, we analyzed the distribution of the size of generated paradigms for prefix and suffix systems as well as discontinuous systems. Results for selected systems are shown in Figure 4. We conducted this experiment for the overall best system (S-all), as well as the best discontinuous system (D-500). Both systems follow the same overall pattern: large paradigms are rarer than smaller ones, and the frequency drops very rapidly with increasing paradigm size. The majority of generated paradigms have sizes in the range 1-5. Although the tendency is similar for suffix rules and discontinuous rules, discontinuous rules tend to generate more paradigms of size 1. In contrast to the paradigms generated by our systems, the frequency of gold standard paradigms drops far more slowly as the paradigms grow. For example, for Finnish and Kannada, paradigms containing 10 forms are still very common. The only language where the distribution generated by our systems very closely parallels the gold standard is Spanish. For all other languages, our systems very clearly over-generate small paradigms.

5 Discussion and Conclusions

Paradigm construction can suffer from two main difficulties: overgeneralization and underspecification. In the former, paradigms are too generous when adding new members. Consider, for example, a paradigm headed by "sea". We would want to include the plural "seas", but not the unrelated words "seal", "seals", "undersea", etc. Contrarily, a paradigm selection algorithm that is overly selective will result in a large number of small paradigms, which is less than ideal in a morphologically dense language.

Considering the results described in the previous section, we note that our two best models skew towards conservatism: they prefer smaller paradigms. This is likely an artifact of our development cycle: we found that the baseline preferred large paradigms, often capturing derivational features, or even circumstantial string similarities, when clustering paradigms. Much of our focus was thus on limiting rule application only to those rules we could be certain were genuine. Unfortunately, this means that many words are excluded, residing in singleton paradigms.

System    English  Navajo  Spanish  Finnish  Bulgarian  Basque  Kannada  German  Turkish  Mean
Baseline  51.49    33.25   38.83    28.97    38.89      21.48   23.79    38.22   25.23    33.35
PS-2000   83.89    48.69   77.71    52.60    73.50      25.81   42.35    74.49   46.80    58.42
PS-5000   81.16    48.69   79.60    57.88    74.14      29.03   47.47    74.27   51.26    60.39
PS-all    76.41    48.69   76.94    66.03    69.50      29.03   57.71    65.26   60.97    61.17
S-all     88.68    42.48   83.21    73.42    76.96      29.03   59.34    74.18   67.80    66.12
D-200     76.93    58.45   66.05    50.68    70.48      26.19   40.57    70.26   48.05    56.41
D-300     73.23    59.36   69.46    53.66    69.39      26.19   43.71    68.52   51.00    57.17
D-500     69.33    61.66   69.92    56.51    63.23      33.33   46.94    62.54   53.24    57.41

Table 3: F1 scores for each of the model types on all test languages. The best F1 scores are in bold.

[Figure 4: nine panels, (a) Basque, (b) Bulgarian, (c) English, (d) Finnish, (e) German, (f) Kannada, (g) Navajo, (h) Spanish, (i) Turkish. Each panel plots the paradigm size distribution for the Suffix-all system, the Discontinuous-500 system, and the gold standard.]

Figure 4: Paradigm size distribution across nine test languages. The x-axis stands for paradigm size, ranging from 1 to 20. The y-axis shows the percentage that each paradigm size accounts for among all paradigms the system generates.

Our methods were also affected by the choice of development languages. Of these languages, only one (Persian) is agglutinating, and none of the authors can read its script, so it had a smaller impact on the evolution of our methods. We believe that several languages, namely Finnish, Turkish, and Basque, could have benefited from iterative rule application; however, the iterative process was not selected after we observed a degradation (due to overgeneralization) on the development languages.

It is also worth discussing two outliers in our system selection. Our suffix-first model performed very well on all of the development languages except Maltese. This is not surprising, given Maltese's templatic morphology. Maltese inspired the creation of our discontinuous rule set, and indeed, these rules outperformed the suffixes for Maltese. Switching to the test languages, we see that this model has higher performance for Navajo and Basque, two languages that are rarely described as templatic. We observe, however, that both languages make heavy use of prefixing. Note in Table 2 that including prefixes (PS-all) significantly improves Navajo: the only language to see such a benefit. Likewise, Navajo also has significant stem alternation, which may benefit from discontinuous rule sets. Basque is trickier: it does not improve simply from including prefixal rules. Upon closer inspection, we observe that much Basque prefixation more closely resembles circumfixation: the stem has a prefixal vowel to indicate tense, which is jointly applied with inflectional suffixes. One round of rule application, even if it includes both suffixes and prefixes, appears to be insufficient.

There is still plenty of ground to be covered, with the mean F1 score below 70%. We believe that the next step lies in re-establishing a bottom-up construction for those paradigms that our methods currently separate into small sub-paradigms. Our methods predict roughly two to three times as many singleton paradigms as exist in the gold data, and there is not significant rule support to combine them. Possible areas for exploration include iterative rule extraction on successively more correct paradigms, or the incorporation of a machine learning element that can predict missing forms.

In this paper, we have presented a method for automatically building inflectional paradigms from raw data. Starting with an n-gram baseline, we extract intra-paradigmatic rewrite rules. These rules are then re-applied to the corpus in a discovery process that re-establishes known paradigms. Our methods prove very competitive, with our best model finishing within 2% of the best submitted system.

References

Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1024–1029.

Sarah Beemer, Zak Boston, April Bukoski, Daniel Chen, Princess Dickens, Andrew Gerlach, Torin Hopkins, Parth Anand Jawale, Chris Koski, Akanksha Malhotra, Piyush Mishra, Saliha Muradoglu, Lan Sang, Tyler Short, Sagarika Shreevastava, Elizabeth Spaulding, Testumichi Umada, Beilei Xiang, Changbing Yang, and Mans Hulden. 2020. Linguist vs. machine: Rapid development of finite-state morphological grammars. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 162–170, Online. Association for Computational Linguistics.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-State Morphology: Xerox Tools and Techniques. CSLI, Stanford.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Grzegorz Chrupala, Georgiana Dinu, and Josef van Genabith. 2008. Learning morphology with Morfette. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian J. Mielke, Garrett Nicolai, Miikka Silfverberg, et al. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, et al. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya McCarthy, and Katharina Kann. 2020. Unsupervised morphological paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6696–6707, Online. Association for Computational Linguistics.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai, and Mans Hulden. 2020. The SIGMORPHON 2020 shared task on unsupervised morphological paradigm completion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 51–62, Online. Association for Computational Linguistics.

Arya D. McCarthy, Christo Kirov, Matteo Grella, Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekaterina Vylomova, Sabrina J. Mielke, Garrett Nicolai, Miikka Silfverberg, Timofey Arkhangelskiy, Nataly Krizhanovsky, Andrew Krizhanovsky, Elena Klyachko, Alexey Sorokin, John Mansfield, Valts Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan Cotterell, Mans Hulden, and David Yarowsky. 2020a. UniMorph 3.0: Universal Morphology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3922–3931, Marseille, France. European Language Resources Association.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, et al. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, Aaron Mueller, Winston Wu, Oliver Adams, Garrett Nicolai, Matt Post, and David Yarowsky. 2020b. The Johns Hopkins University Bible corpus: 1600+ tongues for typological exploration. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2884–2892, Marseille, France. European Language Resources Association.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637, Denver, Colorado. Association for Computational Linguistics.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, et al. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. SIGMORPHON 2020.

Adam Wiemerslage, Arya McCarthy, Alexander Erdmann, Garrett Nicolai, Manex Agirrezabal, Miikka Silfverberg, Mans Hulden, and Katharina Kann. 2021. The SIGMORPHON 2021 shared task on unsupervised morphological paradigm clustering. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Association for Computational Linguistics.

David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 207–216, Hong Kong. Association for Computational Linguistics.

Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 107–114, August 5, 2021. ©2021 Association for Computational Linguistics

Paradigm Clustering with Weighted Edit Distance

Andrew Gerlach, Adam Wiemerslage and Katharina Kann
University of Colorado Boulder
[email protected]

Abstract

This paper describes our system for the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering, which asks participants to group inflected forms together according to their underlying lemma without the aid of annotated training data. We employ agglomerative clustering to group word forms together using a metric that combines an orthographic distance and a semantic distance from word embeddings. We experiment with two variations of an edit distance-based model for quantifying orthographic distance, but, due to time constraints, our systems do not outperform the baseline. However, we also show that, with more time, our results improve strongly.

1 Introduction

Most of the world's languages express grammatical properties, such as tense or case, via small changes to a word's surface form. This process is called morphological inflection, and the canonical form of a word is known as its lemma. A search of the WALS database of linguistic typology shows that 80% of the database's languages mark verb tense and 65% mark grammatical case through morphology (Dryer and Haspelmath, 2013).

The English lemma do, for instance, has an inflected form did that expresses past tense. Though English verbs inflect to express tense, there are generally only 4 to 5 surface variations for a given English lemma. In contrast, a Russian verb can have up to 30 morphological inflections per lemma, and other languages, such as Basque, have hundreds of forms per lemma, cf. Table 1.

Inflected forms are systematically related to each other: in English, most noun plurals are obtained from the lemma by adding -s or -es to the end of the noun, e.g., list/lists or kiss/kisses. However, irregular plurals also exist, such as ox/oxen or mouse/mice. Although irregular forms are less frequent, they cause challenges for the automatic generation or analysis of the surface forms of English plural nouns.

Basque lemma: egin
begi              begiate           begidate
begie             begiete           begigu
begigute          begik             begin
beginate          begio             begiote
begit             begite            begitza
...               ...               ...
zenegizkigukeen   zenegizkigukete   zenegizkiguketen
zenegizkigun      zenegizkigute     zenegizkiguten
zenegizkio        zenegizkioke      zenegizkiokeen
zenegizkiokete    zenegizkioketen   zenegizkion
zenegizkiote      zenegizkioten     zenegizkit

Table 1: The paradigm of the Basque verb egin consists of 674 inflected forms. In contrast, the paradigm of the English verb do only consists of 5 inflected forms: do, does, doing, did, and done.

In this work, we address the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering ("Task 2") (Wiemerslage et al., 2021). The goal of this shared task is to group words encountered in naturally occurring text into morphological paradigms. Unsupervised paradigm clustering can be helpful for state-of-the-art natural language processing (NLP) systems, which typically require large amounts of training data. The ability to group words together into paradigms is a useful first step for training a system to induce full paradigms from a limited number of examples, a task known as (supervised) morphological paradigm completion. Building paradigms can help an NLP system to induce representations for rare words or to generate words that have not been observed in a given corpus. Lastly, unsupervised systems have the advantage of not needing annotated data, which can be costly in terms of time and money, or, in the case of extinct or endangered languages, entirely impossible.

Since 2016, the Association for Computational Linguistics' Special Interest Group on Computational Morphology and Phonology (SIGMORPHON) has created shared tasks to help spur the development of state-of-the-art systems that explicitly handle morphological processes in a language. These tasks have involved morphological inflection (Cotterell et al., 2016), lemmatization (McCarthy et al., 2019), as well as other, related tasks. SIGMORPHON has increased the level of difficulty of the shared tasks, largely along two dimensions. The first dimension is the amount of data available for models to learn from, reflecting the difficulties of analyzing low-resource languages. The second dimension is the amount of structure provided in the input data. Initially, SIGMORPHON shared tasks provided predefined tables of lemmas, morphological tags, and inflected forms. For the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering, only raw text is provided as input.

We propose a system that combines orthographic and semantic similarity measures to cluster surface forms found in raw text. We experiment with a character-level language model for weighting substring differences between words. Due to time constraints, we are only able to cluster over a subset of each language's vocabulary. Despite this, our system's performance is comparable to the baseline.

2 Related Work

Unsupervised morphology has attracted a great deal of interest historically, including a large body of work focused on segmentation (Xu et al., 2018; Creutz and Lagus, 2007; Poon et al., 2009; Narasimhan et al., 2015). Recently, the task of unsupervised morphological paradigm completion has been proposed (Kann et al., 2020; Jin et al., 2020; Erdmann et al., 2020), wherein the goal is to induce full paradigms from raw text corpora.

In this year's SIGMORPHON shared task, we are asked to address only part of the unsupervised paradigm completion task: paradigm clustering. Intuitively, the task of segmentation is related to paradigm clustering, but the outputs are different. Goldsmith (2001) produces morphological signatures, which are similar to approximate paradigms, based on an algorithm that uses minimum description length. However, this type of algorithm relies heavily on purely orthographic features of the vocabulary. Schone and Jurafsky (2001) hypothesize that approximating semantic information can help differentiate between hypothesized morphemes, revealing those that are productive. They propose an algorithm that combines orthography, semantics, and syntactic distributions to induce morphological relationships. They used semantic relatedness, quantified by latent semantic analysis, combined with the frequencies of affixes and syntactic context (Schone and Jurafsky, 2000).

More recently, Soricut and Och (2015) have used SkipGram word embeddings (Mikolov et al., 2013) to find meaningful morphemes based on analogies: regularities exhibited by embedding spaces allow for inferences of certain types (e.g., king is to man what queen is to woman). Hypothesizing that these regularities also hold for morphological relations, they represent morphemes by vector differences between semantically similar forms; e.g., the vector for the suffix -s may be represented by the difference between the embeddings of cats and cat.
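The analogy intuition can be illustrated with toy vectors. This sketch uses hand-made 2-dimensional embeddings purely for illustration; real systems use trained SkipGram or fastText embeddings:

```python
# Toy 2-d embeddings for illustration (not trained vectors).
emb = {"cat": [1.0, 0.0], "cats": [1.0, 1.0],
       "dog": [2.0, 0.0], "dogs": [2.0, 1.0]}

def morpheme_vector(inflected, lemma):
    # Represent a morpheme (e.g. the suffix -s) as the difference
    # between the embeddings of an inflected form and its lemma.
    return [a - b for a, b in zip(emb[inflected], emb[lemma])]

suffix_s_from_cat = morpheme_vector("cats", "cat")
suffix_s_from_dog = morpheme_vector("dogs", "dog")
print(suffix_s_from_cat == suffix_s_from_dog)  # True: same suffix, same shift
```

In this idealized space, the same morphological relation corresponds to the same vector offset, which is exactly the regularity that Soricut and Och (2015) exploit.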

Drawing upon these intuitions, we follow Rosa and Zabokrtsky (2019), who combine a semantic distance computed over fastText embeddings (Bojanowski et al., 2017) with an orthographic distance between word pairs. Words are then clustered into paradigms using agglomerative clustering.


3 Task Description

Given a raw text corpus, the task is to sort words into clusters that correspond to paradigms. More formally, for the vocabulary $\Sigma$ of all types attested in the corpus and the set of morphological paradigms $\Pi$ for which at least one word is in $\Sigma$, the goal is to output clusters corresponding to $\pi_k \cap \Sigma$ for all $\pi_k \in \Pi$.
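The target output is thus a set of intersections. A minimal sketch with a toy vocabulary and toy gold paradigms (both invented for illustration):

```python
# Toy corpus vocabulary Σ and gold paradigms π_1, π_2.
vocab = {"do", "did", "done", "walks", "walked"}
paradigms = [{"do", "does", "doing", "did", "done"},
             {"walk", "walks", "walked", "walking"}]

# The expected clusters are the non-empty intersections π_k ∩ Σ
# (sorted here only to make the output deterministic).
clusters = [sorted(p & vocab) for p in paradigms if p & vocab]
print(clusters)  # [['did', 'do', 'done'], ['walked', 'walks']]
```

Forms of a paradigm that never occur in the corpus (e.g. "doing" above) are simply absent from the corresponding cluster.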

Data As the raw text data for this task, JHU Bible corpora (McCarthy et al., 2020b) are provided by the organizers. This is the only data that systems can use. The organizers further provide development and test sets consisting of gold clusters for a subset of words in the Bible corpora. Each cluster is a list of words representing $\pi_k \cap \Sigma$ for $\pi_k \in \Pi_{dev}$ or $\pi_k \in \Pi_{test}$, respectively, with $\Pi_{dev}, \Pi_{test} \subsetneq \Pi$.

The partial morphological paradigms in $\Pi_{dev}$ and $\Pi_{test}$ are taken from the UniMorph database (McCarthy et al., 2020a). Development sets are only available for the development languages, while test sets are only provided for the test languages. All test sets are hidden from the participants until the conclusion of the shared task.

Languages The development languages featured in the shared task are Maltese, Persian, Portuguese, Russian, and Swedish. The test languages are Basque, Bulgarian, English, Finnish, German, Kannada, Navajo, Spanish, and Turkish.

4 System Descriptions

We submit two systems based on Rosa and Zabokrtsky (2019). The first, referred to below as JW-based clustering, follows their work very closely. The second, LM-based clustering, contains the same main components, but approximates orthographic distances with the help of a language model.

4.1 JW-based Clustering

We describe the system of Rosa and Zabokrtsky (2019) in more detail here. This system clusters words whose distance is computed as a combination of orthographic and semantic distances.

Orthographic Distance The orthographic distance of two words is computed as their Jaro-Winkler (JW) edit distance (Winkler, 1990). JW distance differs from the more common Levenshtein distance (Levenshtein, 1966) in that JW distance gives more importance to the beginnings of strings than to their ends, which is where characters belonging to the stem are likely to be in suffixing languages.

The JW distance is averaged with the JW distance of a simplified variant of the string. The simplified variant is a string that has been lowercased, transliterated to ASCII, and had its non-initial vowels deleted. This is done to soften the impact of characters that are likely to correspond to affixes. Crucially, we believe that this biases the system towards languages that express inflection via suffixation.
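The combination of JW distance with a simplified string variant can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the `simplify` function here only lowercases and drops non-initial ASCII vowels (transliteration to ASCII is omitted for brevity):

```python
def jaro(s1, s2):
    # Jaro similarity: count characters matching within a sliding
    # window, then discount by the number of transpositions.
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, idx2 = [], []
    used = [False] * len(s2)
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                m1.append(c)
                idx2.append(j)
                break
    if not m1:
        return 0.0
    m2 = [s2[j] for j in sorted(idx2)]
    t = sum(a != b for a, b in zip(m1, m2)) / 2
    m = len(m1)
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Winkler modification: boost the score by the length of the
    # common prefix (at most 4 characters), favoring shared stems.
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

def simplify(word):
    # Lowercase and drop non-initial vowels (ASCII transliteration
    # would be applied here as well in the full system).
    word = word.lower()
    return word[0] + "".join(c for c in word[1:] if c not in "aeiou")

def orthographic_distance(w1, w2):
    # Average the JW distance of the raw strings and of the
    # simplified variants, as described above.
    full = 1 - jaro_winkler(w1, w2)
    simple = 1 - jaro_winkler(simplify(w1), simplify(w2))
    return (full + simple) / 2
```

For example, `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961, the classic worked example for this metric.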

Semantic Distance We represent words in the corpus by fastText embeddings, similar to Erdmann and Habash (2018), who cluster fastText embeddings for the same task in various Arabic dialects. We expect fastText embeddings to provide better representations than, e.g., Word2Vec (Mikolov et al., 2013), due to the limited size of the Bible corpora. Unfortunately, using fastText may also inadvertently result in higher similarity between words belonging to different lemmas that contain overlapping subwords corresponding to affixes.

Overall Distance We compute a pairwise distance matrix for all words in the corpus. The distance between two words $w_1$ and $w_2$ is computed as:

$$d(w_1, w_2) = 1 - \delta(w_1, w_2) \cdot \frac{\cos(\mathbf{w}_1, \mathbf{w}_2) + 1}{2}, \quad (1)$$

where $\mathbf{w}_1$ and $\mathbf{w}_2$ are the embeddings of $w_1$ and $w_2$, $\cos$ is the cosine similarity, and $\delta$ is the JW edit distance score. The cosine term is mapped to $[0, 1]$ to avoid negative distances.
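Equation 1 can be sketched in a few lines. Note that for $d$ to be 0 for identical words, $\delta$ must act as a similarity score (1 for identical strings), which is how we treat it in this illustrative sketch; the embeddings and the string-similarity function below are toy stand-ins for fastText vectors and the JW score:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two embedding vectors, in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def combined_distance(w1, w2, emb, jw_score):
    # Eq. (1): d(w1, w2) = 1 - δ(w1, w2) * (cos(w1, w2) + 1) / 2.
    # The cosine term is mapped from [-1, 1] to [0, 1].
    semantic = (cosine(emb[w1], emb[w2]) + 1) / 2
    return 1 - jw_score(w1, w2) * semantic

# Toy embeddings standing in for fastText vectors:
emb = {"cat": [1.0, 0.0], "cats": [0.9, 0.1], "dog": [0.0, 1.0]}
# Placeholder string-similarity function standing in for the JW score:
jw = lambda a, b: 1.0 if a == b else 0.8
print(combined_distance("cat", "cat", emb, jw))  # 0.0
```

Orthogonal embeddings and dissimilar strings both push $d$ towards 1, so only word pairs that are close on both axes end up with a small distance.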

Finally, agglomerative clustering is performed by first assigning each word form to a unique cluster. At each step, the two clusters with the lowest average distance are merged together. The merging continues while the distance between clusters stays below a threshold. We tune this hyperparameter on the development set, and our final threshold is 0.3.
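The merging procedure is standard average-linkage agglomerative clustering. A naive pure-Python sketch (O(n³), for illustration only; a real implementation would use an optimized library and the precomputed distance matrix):

```python
def agglomerative_cluster(words, dist, threshold=0.3):
    # Start with one singleton cluster per word form, then repeatedly
    # merge the pair of clusters with the lowest average pairwise
    # distance, stopping once that distance reaches the threshold.
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist(a, b) for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] >= threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Toy distance: words sharing a first letter are close.
toy_dist = lambda a, b: 0.1 if a[0] == b[0] else 0.9
print(agglomerative_cluster(["walk", "walks", "do", "did"], toy_dist, 0.3))
# [['walk', 'walks'], ['do', 'did']]
```

With the real pairwise distances from Equation 1, the resulting clusters are the system's predicted paradigms.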

4.2 LM-based Clustering

The JW-based clustering described above relies on heuristics to obtain a good measure of orthographic similarity. These heuristics help to quantify orthographic similarity between two words by relying more on the shared characters in the stem than in the affix: The plural past participles gravados and louvados in Portuguese have longer substrings in common than the substrings by which they differ. This is due to the affix -ados, which indicates that the two words express the same inflectional information, even though their lemmas are different. Similarly, the Portuguese verbs abafa and abafavamos differ in many characters, though they belong to the same paradigm, as can be observed by the shared stem abaf.

However, not all languages express inflection exclusively via suffixation, nor via concatenation. We thus experiment with removing the edit distance heuristics and, instead, utilizing probabilities from a character-level language model (LM) to distinguish between stems and affixes. In doing so, we hope to achieve better results for templatic languages, such as Maltese. We hypothesize that the LM will have a higher confidence for characters that are part of an affix than for those that are part of the stem. We then draw upon this hypothesis and weight edit operations between two strings based on these confidences.

LM-weighted Edit Distance Similar to the intuition behind Silfverberg and Hulden (2018), we train a character-level LM on the entire vocabulary of each Bible corpus. Unlike their work, we do not have inflectional tags for each word. Despite this, we hypothesize that the highly regular and frequent nature of inflectional affixes will lead to higher likelihoods for characters that occur in affixes than for those in stems. We train a two-layer LSTM (Hochreiter and Schmidhuber, 1997) with an embedding size of 128 and a hidden layer size of 128. We train the model until the training loss stops decreasing, for up to 100 epochs, using Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and a batch size of 16.
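The role the LSTM plays here is simply to supply a probability for each character of a word. As a much simpler stand-in for the paper's LSTM, the sketch below estimates those probabilities with a count-based character bigram model (with add-one smoothing); it is not the authors' model, but it exposes the same `p(w_i)` interface the weighted edit distance needs:

```python
from collections import Counter, defaultdict

class CharBigramLM:
    # Count-based stand-in for a character-level LSTM LM:
    # p(c | prev) is estimated from bigram counts with add-one smoothing.
    # For the backward model (LM-B), train on reversed words.
    def __init__(self, vocab):
        self.bigrams = defaultdict(Counter)
        self.alphabet = set()
        for word in vocab:
            chars = ["<s>"] + list(word)
            self.alphabet.update(word)
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[prev][cur] += 1

    def char_probs(self, word):
        # Return p(w_i) for each character of the word.
        probs = []
        prev = "<s>"
        for c in word:
            counts = self.bigrams[prev]
            total = sum(counts.values()) + len(self.alphabet)
            probs.append((counts[c] + 1) / total)
            prev = c
        return probs

lm = CharBigramLM(["walked", "talked", "walks", "talks"])
print(lm.char_probs("walked"))
```

In this toy vocabulary, the frequent suffix characters (-ed, -s) receive relatively high probabilities, mimicking the behavior we hope to see from the trained LSTM.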

When calculating the edit distance between two words, the insertion, deletion, and substitution costs are computed as a function of the LM probabilities. We expect this to give more weight to differences in the stem than to those in other parts of the word. Each character is then associated with a cost given by

$$\mathrm{cost}(w_i) = 1 - \frac{p(w_i)}{\sum_{j=1}^{|w|} p(w_j)}, \quad (2)$$

where $p(w_i)$ is the probability of the $i$th character in word $w$ as given by the LM. We then compute the cost of an insertion or deletion as the cost of the character being inserted or deleted. The cost of a substitution is the average of the costs of the two involved characters. The sum over these operations is the weighted edit distance between two words, $\varepsilon(w_1, w_2)$. Finally, we compute pairwise distances using Equation 1, replacing $\delta(w_1, w_2)$ with

$$\frac{\varepsilon(w_1, w_2)}{\max(|w_1|, |w_2|)}.$$
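The weighted edit distance described above can be sketched as a standard dynamic program with per-character operation costs. This is an illustrative reimplementation under the stated cost scheme (Eq. 2 for the costs, average cost for substitutions, length normalization at the end):

```python
def char_costs(probs):
    # Eq. (2): cost(w_i) = 1 - p(w_i) / sum_j p(w_j). Characters the
    # LM finds likely (affix-like) become cheap to edit.
    total = sum(probs)
    return [1 - p / total for p in probs]

def weighted_edit_distance(w1, w2, c1, c2):
    # Dynamic program over insertions, deletions, and substitutions
    # with per-character costs c1/c2; exact matches cost nothing and
    # a substitution costs the average of the two characters' costs.
    n, m = len(w1), len(w2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + c1[i - 1]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + c2[j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if w1[i - 1] == w2[j - 1] else (c1[i - 1] + c2[j - 1]) / 2
            D[i][j] = min(D[i - 1][j] + c1[i - 1],   # delete from w1
                          D[i][j - 1] + c2[j - 1],   # insert from w2
                          D[i - 1][j - 1] + sub)     # (mis)match
    # Length-normalize as in the text: ε(w1, w2) / max(|w1|, |w2|).
    return D[n][m] / max(n, m)

# With uniform costs this reduces to normalized Levenshtein distance:
uniform = lambda w: [1.0] * len(w)
print(weighted_edit_distance("kitten", "sitting",
                             uniform("kitten"), uniform("sitting")))  # 3/7
```

In the full system, the cost vectors come from `char_costs` applied to the LM's per-character probabilities, so edits inside frequent affixes are discounted relative to edits inside stems.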

Forward vs. Backward LM We hypothesize that the direction in which the LM is trained affects the probabilities for affixes. Intuitively, an LM is likely to assign higher confidence to characters at the beginning of a word than at the end. Thus, an LM trained on data in the forward direction (LM-F) should be more likely to assign higher probabilities to characters at the beginning of a word, such as prefixes, while a model trained on reversed words (LM-B) should assign higher probabilities to suffixes. In practice, LM-B outperforms LM-F on all development languages, cf. Table 2. Because of that, we employ LM-B to weight edit operations for all test languages.1

1This might be caused by none of the development languages being prefixing. However, in order to make a more informed choice, a method to automatically distinguish between prefixing and suffixing languages from raw text alone would be necessary.


Lang        Baseline               LMC-B                  LMC-F                  JWC
            prec.  rec.   F1      prec.  rec.   F1      prec.  rec.   F1      prec.  rec.   F1
Maltese     0.250  0.348  0.291   0.465  0.229  0.307   0.411  0.202  0.272   0.489  0.241  0.323
Persian     0.265  0.348  0.300   0.321  0.307  0.314   0.494  0.197  0.282   0.579  0.231  0.330
Portuguese  0.218  0.794  0.341   0.771  0.248  0.376   0.494  0.159  0.241   0.742  0.239  0.362
Russian     0.234  0.807  0.363   0.802  0.282  0.417   0.726  0.255  0.378   0.792  0.278  0.412
Swedish     0.303  0.776  0.436   0.818  0.378  0.517   0.695  0.321  0.439   0.838  0.388  0.530
Average     0.254  0.615  0.346   0.635  0.289  0.386   0.482  0.186  0.268   0.688  0.275  0.391

Table 2: Precision, recall, and F1 for all development languages. LMC-B is the LM-clustering system with language models trained on reversed words (backward); LMC-F uses language models trained in the forward direction; JWC is the JW-clustering system. The highest F1 for each language is in bold.

Lang       Baseline               LMC                    JWC
           prec.  rec.   F1      prec.  rec.   F1      prec.  rec.   F1
English    0.388  0.767  0.515   0.565  0.245  0.342   0.663  0.288  0.402
Navajo     0.230  0.598  0.333   0.686  0.112  0.193   0.657  0.108  0.185
Spanish    0.266  0.722  0.388   0.664  0.183  0.287   0.699  0.193  0.302
Finnish    0.179  0.767  0.290   0.694  0.227  0.342   0.674  0.220  0.332
Bulgarian  0.265  0.730  0.390   0.745  0.312  0.440   0.717  0.300  0.423
Basque     0.186  0.254  0.215   0.471  0.254  0.330   0.353  0.191  0.247
Kannada    0.172  0.385  0.238   0.570  0.169  0.261   0.625  0.185  0.286
German     0.254  0.776  0.382   0.763  0.310  0.441   0.787  0.319  0.454
Turkish    0.156  0.658  0.252   0.657  0.212  0.320   0.641  0.206  0.312
Average    0.233  0.629  0.334   0.646  0.225  0.328   0.646  0.223  0.327

Table 3: Precision, recall, and F1 for all test languages. LMC is the LM-clustering system; JWC is the JW-clustering system. The highest F1 for each language is in bold.

5 Results and Discussion

The official scores obtained by our systems as well as the baseline are shown in Table 3.

Both of our systems perform minimally worse than the baseline if we consider F1 averaged over languages (0.334 vs. 0.328 and 0.327). However, we believe this to be largely due to our submissions only generating clusters for a subset of the full vocabularies: due to time constraints, we only consider words that appear at least 5 times in the corpus. No other words are included in the predicted clusters. The large gap between precision and recall reflects this constraint: our submissions have a high average precision (0.646 for both systems), indicating that the limited set of words we consider are clustered more accurately than the F1 scores would suggest. The low recall scores (0.225 and 0.223) are likely at least partially caused by the missing words in our predictions.2

Conversely, the baseline system has a high recall (0.629) and a low precision (0.233). This is likely due to it simply clustering words with shared substrings, such that a given word is likely to appear in many predicted clusters.

2We confirm this hypothesis with additional experiments after the shared task's completion. Those results can be found in the appendix.

Interestingly, both of our submissions have the same average precision on the test set, despite varying across languages. Notably, the LM-based clustering system strongly outperforms the JW-based system on Basque with respect to precision. However, the JW-based system outperforms the LM-based one by a large margin on English. One hypothesis for the difference in results is that agglutinating inflection in Basque causes very long affixes, which our LM-based system should downweight in its measurement of orthographic similarity. Basque is also not a strictly suffixing language, which we expect the JW-based model to be biased towards. On the other hand, English has relatively little inflectional morphology, and is strictly suffixing (in terms of inflection). The assumptions behind the JW-based system are thus more ideal for a language like English. The JW system performs best on Maltese, which suggests that the heuristics of that system are sufficient for a templatic language, compared to the LM-based system.


6 Conclusion

We present two systems for the SIGMORPHON 2021 Shared Task on Unsupervised Morphological Paradigm Clustering. Both of our systems perform slightly worse than the official baseline. However, we also show that this is because our official submissions, due to time constraints, only make predictions for a subset of the corpus's vocabulary, and that at least one of our systems improves strongly when the time constraints are removed.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON, Berlin, Germany. Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1).

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Alexander Erdmann, Micha Elsner, Shijie Wu, Ryan Cotterell, and Nizar Habash. 2020. The paradigm discovery problem. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7778–7790, Online. Association for Computational Linguistics.

Alexander Erdmann and Nizar Habash. 2018. Complementary strategies for low resourced morphological modeling. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 54–65, Brussels, Belgium. Association for Computational Linguistics.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.

Sepp Hochreiter and Jurgen Schmidhuber. 1997.Long short-term memory. Neural computation,9(8):1735–1780.

Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, AryaMcCarthy, and Katharina Kann. 2020. Unsuper-vised morphological paradigm completion. In Pro-ceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics, pages 6696–6707, Online. Association for Computational Lin-guistics.

Katharina Kann, Arya D. McCarthy, Garrett Nicolai,and Mans Hulden. 2020. The SIGMORPHON2020 shared task on unsupervised morphologicalparadigm completion. In Proceedings of the 17thSIGMORPHON Workshop on Computational Re-search in Phonetics, Phonology, and Morphology,pages 51–62, Online. Association for Computa-tional Linguistics.

Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

Vladimir Iosifovich Levenshtein. 1966. Binary codescapable of correcting deletions, insertions and re-versals. Soviet Physics Doklady, 10(8):707–710.Doklady Akademii Nauk SSSR, V163 No4 845-8481965.

Arya D. McCarthy, Christo Kirov, Matteo Grella,Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekate-rina Vylomova, Sabrina J. Mielke, Garrett Nico-lai, Miikka Silfverberg, Timofey Arkhangelskiy, Na-taly Krizhanovsky, Andrew Krizhanovsky, ElenaKlyachko, Alexey Sorokin, John Mansfield, ValtsErnstreits, Yuval Pinter, Cassandra L. Jacobs, RyanCotterell, Mans Hulden, and David Yarowsky.2020a. UniMorph 3.0: Universal Morphology.In Proceedings of the 12th Language Resourcesand Evaluation Conference, pages 3922–3931, Mar-seille, France. European Language Resources Asso-ciation.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu,Chaitanya Malaviya, Lawrence Wolf-Sonkin, Gar-rett Nicolai, Christo Kirov, Miikka Silfverberg, Sab-rina J. Mielke, Jeffrey Heinz, Ryan Cotterell, andMans Hulden. 2019. The SIGMORPHON 2019shared task: Morphological analysis in context andcross-lingual transfer for inflection. In Proceedingsof the 16th Workshop on Computational Research inPhonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for ComputationalLinguistics.

Arya D. McCarthy, Rachel Wicks, Dylan Lewis, AaronMueller, Winston Wu, Oliver Adams, Garrett Nico-lai, Matt Post, and David Yarowsky. 2020b. TheJohns Hopkins University Bible corpus: 1600+tongues for typological exploration. In Proceed-ings of the 12th Language Resources and EvaluationConference, pages 2884–2892, Marseille, France.European Language Resources Association.

Tomas Mikolov, Kai Chen, Greg Corrado, and JeffreyDean. 2013. Efficient estimation of word represen-tations in vector space.

112

Karthik Narasimhan, Regina Barzilay, and TommiJaakkola. 2015. An unsupervised method for un-covering morphological chains. Transactions of theAssociation for Computational Linguistics, 3:157–167.

Hoifung Poon, Colin Cherry, and Kristina Toutanova.2009. Unsupervised morphological segmentationwith log-linear models. In Proceedings of HumanLanguage Technologies: The 2009 Annual Confer-ence of the North American Chapter of the Associa-tion for Computational Linguistics, pages 209–217,Boulder, Colorado. Association for ComputationalLinguistics.

Rudolf Rosa and Zdenek Zabokrtsky. 2019. Unsu-pervised lemmatization as embeddings-based wordclustering. CoRR, abs/1908.08528.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge-free induction of morphology using latent semanticanalysis. In Fourth Conference on ComputationalNatural Language Learning and the Second Learn-ing Language in Logic Workshop.

Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In Sec-ond Meeting of the North American Chapter of theAssociation for Computational Linguistics.

Miikka Silfverberg and Mans Hulden. 2018. Anencoder-decoder approach to the paradigm cell fill-ing problem. In Proceedings of the 2018 Conferenceon Empirical Methods in Natural Language Process-ing, pages 2883–2889, Brussels, Belgium. Associa-tion for Computational Linguistics.

Radu Soricut and Franz Och. 2015. Unsupervised mor-phology induction using word embeddings. In Pro-ceedings of the 2015 Conference of the North Amer-ican Chapter of the Association for ComputationalLinguistics: Human Language Technologies, pages1627–1637, Denver, Colorado. Association for Com-putational Linguistics.

Adam Wiemerslage, Arya McCarthy, Alexander Erd-mann, Garrett Nicolai, Manex Agirrezabal, MiikkaSilfverberg, Mans Hulden, and Katharina Kann.2021. The SIGMORPHON 2021 shared task on un-supervised morphological paradigm clustering. InProceedings of the 18th SIGMORPHON Workshopon Computational Research in Phonetics, Phonol-ogy, and Morphology. Association for Computa-tional Linguistics.

William E. Winkler. 1990. String comparator met-rics and enhanced decision rules in the fellegi-suntermodel of record linkage. In Proceedings of the Sec-tion on Survey Research, pages 354–359.

Ruochen Xu, Yiming Yang, Naoki Otani, and YuexinWu. 2018. Unsupervised cross-lingual transfer ofword embedding spaces. In Proceedings of the 2018

Conference on Empirical Methods in Natural Lan-guage Processing, pages 2465–2474, Brussels, Bel-gium. Association for Computational Linguistics.

113

7 Appendix

Here we present new results which include the entire data set for selected languages. We see an improvement in F1 for each language. This is due to the increased recall scores from the paradigms being more complete. Precision scores decrease across the board. This may be due to the languages being sensitive to the threshold value.

Lang        Subset                   Full
            prec.  rec.   F1        prec.  rec.   F1
Basque      0.471  0.254  0.330     0.443  0.429  0.435
Bulgarian   0.745  0.312  0.440     0.638  0.631  0.634
English     0.565  0.245  0.342     0.430  0.425  0.428
German      0.763  0.310  0.441     0.703  0.699  0.701
Maltese     0.465  0.229  0.307     0.402  0.400  0.401
Navajo      0.686  0.112  0.193     0.449  0.430  0.435
Spanish     0.664  0.183  0.287     0.579  0.560  0.569
Swedish     0.818  0.378  0.517     0.783  0.737  0.759
Average     0.659  0.252  0.357     0.553  0.539  0.545

Table 4: Post-shared task results using the full data set for selected languages. These results use LM-B with a threshold value of 0.3.


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 115–125. August 5, 2021. ©2021 Association for Computational Linguistics

Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion

Lucas F.E. Ashby∗, Travis M. Bartley∗, Simon Clematide†, Luca Del Signore∗, Cameron Gibson∗, Kyle Gorman∗, Yeonju Lee-Sikka∗, Peter Makarov†, Aidan Malanoski∗, Sean Miller∗, Omar Ortiz∗, Reuben Raff∗, Arundhati Sengupta∗, Bora Seo∗, Yulia Spektor∗, Winnie Yan∗

∗Graduate Program in Linguistics, Graduate Center, City University of New York
†Department of Computational Linguistics, University of Zurich

Abstract

Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The second iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year's task (Gorman et al. 2020), including additional languages, a stronger baseline, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Four teams submitted a total of thirteen systems, at best achieving relative reductions of word error rate of 11% in the high-resource subtask and 4% in the low-resource subtask.

1 Introduction

Many speech technologies demand mappings between written words and their pronunciations. In open-vocabulary systems—as well as certain resource-constrained embedded systems—it is insufficient to simply list all possible pronunciations; these mappings must generalize to rare or unseen words as well. Therefore, the mapping must be expressed as a mapping from a sequence of orthographic characters—graphemes—to a sequence of sounds—phones or phonemes.¹

The earliest work on grapheme-to-phoneme conversion (G2P), as this task is known, used ordered rewrite rules. However, such systems are often brittle, and the linguistic expertise needed to build, test, and maintain rule-based systems is often in short supply. Furthermore, rule-based systems are outperformed by modern neural sequence-to-sequence models (e.g., Rao et al. 2015, Yao and Zweig 2015, van Esch et al. 2016).

With the possible exception of van Esch et al. (2016), who evaluate against a proprietary database of 20 languages and dialects, virtually all of the prior published research on grapheme-to-phoneme conversion evaluates only on English, for which several free and low-cost pronunciation dictionaries are available. The 2020 SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion (Gorman et al. 2020) represented a first attempt to construct a multilingual benchmark for grapheme-to-phoneme conversion. The 2020 shared task targeted fifteen languages and received 23 submissions from nine teams. The second iteration of this shared task attempts to further refine this benchmark by introducing additional languages, a much stronger baseline model, new quality assurance procedures for the data, and automated error analysis techniques. Furthermore, in response to suggestions from participants in the 2020 shared task, the task has been divided into high-, medium-, and low-resource subtasks.

¹We note that referring to elements of transcriptions as phonemes implies an ontological commitment which may or may not be justified; see Lee et al. 2020 (fn. 4) for discussion. Therefore, we use the term phone to refer to symbols used to transcribe pronunciations.

2 Data

As in the previous year's shared task, all data was drawn from WikiPron (Lee et al. 2020), a massively multilingual pronunciation database extracted from the online dictionary Wiktionary. Depending on the language and script, Wiktionary pronunciations are either manually entered by human volunteers working from language-specific pronunciation guidelines and/or generated from the graphemic form via language-specific server-side scripting. WikiPron scrapes these pronunciations from Wiktionary, optionally applying case-folding to the graphemic form, removing any stress and syllable boundaries, and segmenting the pronunciation—encoded in the International Phonetic Alphabet—using the Python library segments (Moran and Cysouw 2018). In all, 21 WikiPron languages were selected for the three subtasks, including seven new languages and fourteen of the fifteen languages used in the 2020 shared task.²

In several cases, multiple scripts or dialects are available for a given language. For instance, WikiPron has both Latin and Cyrillic entries for Serbo-Croatian, and three different dialects of Vietnamese. In such cases, the largest data set of the available scripts and/or dialects is chosen. Furthermore, WikiPron distinguishes between "broad" transcriptions delimited by forward slashes (/) and "narrow" transcriptions delimited by square brackets ([ and ]).³ Once again, the larger of the two data sets is the one used for this task.

3 Quality assurance

During the previous year's shared task we became aware of several consistency issues with the shared task data. This led us to develop quality assurance procedures for WikiPron and the "upstream" Wiktionary data. For a few languages, we worked with Wiktionary editors who automatically enforced upstream consistency via "bots", i.e., scripts which automatically edit Wiktionary entries. We also improved WikiPron's routines for extracting pronunciation data from Wiktionary. In some cases (e.g., Vietnamese), this required the creation of language-specific extraction routines.

In early versions of WikiPron, users had limited means to separate out entries for languages written in multiple scripts. We therefore added an automated script detection system which ensures that entries for the many languages written with multiple scripts—including shared task languages Maltese, Japanese, and Serbo-Croatian—are sorted according to script.

We noticed that the WikiPron data includes many hyper-foreign pronunciations with non-native phones. For example, the English data includes a broad pronunciation of Bach (the surname of a family of composers) as /bɑːx/ with a velar fricative /x/, a segment which is common in German but absent in modern English. Furthermore, unexpected phones may represent simple human error. Therefore, we wished to exclude pronunciations which include any non-native segments. This was accomplished by creating phonelists which enumerate native phones for a given language. Separate phonelists may be provided for broad and narrow transcriptions of the same language. During data ingestion, if a pronunciation contains any segment not present on the phonelist, the entry is discarded. Phonelist filtration was used for all languages in the medium- and low-resource subtasks, described below.

²The fifteenth language, Lithuanian, was omitted due to unresolved quality assurance issues.

³Sorting by script, dialect, and broad vs. narrow transcription is performed automatically during data ingestion.

4 Task definition

In this task, participants were provided with a collection of words and their pronunciations, and then scored on their ability to predict the pronunciation of a set of unseen words.

4.1 Subtasks

In the previous year's shared task, each language's data consisted of 4,500 examples, sampled from WikiPron, split randomly into 80% training examples, 10% development examples, and 10% test examples. As part of their system development, two teams in the 2020 shared task (Hauer et al. 2020, Yu et al. 2020) down-sampled these data to simulate a lower-resource setting, and one participant expressed concern whether the methods used in the shared task would generalize effectively to high-resource scenarios like the large English data sets traditionally used to evaluate grapheme-to-phoneme systems. This motivated a division of the data into three subtasks, varying the amount of data provided, as described below.⁴

High-resource subtask  The first subtask consists of a roughly 41,000-word sample of Mainstream American English (eng_us). Participating teams were permitted to use any and all external resources to develop their systems except for Wiktionary or WikiPron. It was anticipated that participants would exploit other freely available American English pronunciation dictionaries.

Medium-resource subtask  The second subtask represents a medium-resource task. For each of the ten target languages, a sample of 10,000 words was used. Teams participating in this subtask were permitted to use UniMorph paradigms (Kirov et al. 2018) to lemmatize or to look up morphological features, but were not permitted to use any other external resources. The languages for this subtask are listed and exemplified in Table 1.

⁴Languages were sorted into medium- vs. low-resource subtasks according to data availability. For example, Icelandic was placed in the low-resource subtask simply because it has fewer than 10,000 pronunciations available.

Low-resource subtask  The third subtask is designed to simulate a low-resource setting and consists of 1,000 words from each of ten languages. Teams were not permitted to use any external resources for this subtask. The languages for this subtask are shown in Table 2.

4.2 Data preparation

The procedures for sampling and splitting the data are similar to those used in the previous year's shared task; see Gorman et al. 2020, §3. For each of the three subtasks, the data for each language are first randomly downsampled according to their frequencies in the Wortschatz (Goldhahn et al. 2012) norms. Words containing fewer than two Unicode characters or fewer than two phone segments are excluded, as are words with multiple pronunciations. The resulting data are randomly split into 80% training data, 10% development data, and 10% test data. As in the previous year's shared task, these splits are constrained so that inflectional variants of any given lemma—according to the UniMorph (Kirov et al. 2018) paradigms—can occur in at most one of the three shards. Training and development data were made available at the start of the task. The test words were also made available at the start of the task; test pronunciations were withheld until the end of the task. Some additional processing is required for certain languages, as described below.
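The lemma-constrained split can be sketched as follows: group words by lemma, then assign whole groups to shards. This is an illustrative simplification, not the organizers' script; the `lemma_of` mapping (word to UniMorph lemma) is an assumed input, and the 80/10/10 proportions are approximate because whole groups are assigned at once.

```python
import random

def lemma_constrained_split(words, lemma_of, seed=0):
    """Split words ~80/10/10 so all inflections of a lemma share one shard.

    `lemma_of` maps each word to its lemma; words without an entry are
    treated as singleton groups.
    """
    rng = random.Random(seed)
    # Group words by lemma so a paradigm can never straddle two shards.
    groups = {}
    for w in words:
        groups.setdefault(lemma_of.get(w, w), []).append(w)
    keys = sorted(groups)
    rng.shuffle(keys)
    train, dev, test = [], [], []
    for key in keys:
        # Assign the whole lemma group to a single shard.
        shard = rng.choices([train, dev, test], weights=[80, 10, 10])[0]
        shard.extend(groups[key])
    return train, dev, test
```

Assigning by group keeps, e.g., "walk", "walked", and "walks" in the same shard, so a system can never see an inflectional variant of a test lemma during training.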

English  The Wiktionary American English pronunciations exhibit a large number of inconsistencies. These pronunciations were validated by automatically comparing them with entries in the CALLHOME American English Lexicon (Kingsbury et al. 1997), which provides broad ARPAbet transcriptions of Mainstream American English. Furthermore, a script was used to standardize use of vowel length and enforce consistent use of tie bars with affricates (e.g., /tʃ/ → /t͡ʃ/). However, we note that Gautam et al. (2021:§2.1) report several residual quality issues with this data.

Bulgarian  Bulgarian Wiktionary transcriptions make inconsistent use of tie bars on affricates; for example, ц is transcribed as both /ts/ and /t͡s/. Furthermore, the broad transcriptions sometimes contain allophones of the consonants /t, d, l/ (Ternes and Vladimirova-Buhtz 1990); e.g., л is transcribed as both /l/ and /ɫ/. A script was used to enforce a consistent broad transcription.

Maltese  In the Latin-script Maltese data, Wiktionary has multiple transcriptions of the digraph għ, which in the contemporary language indicates lengthening of an adjacent vowel, except word-finally, where it is read as [ħ] (Hoberman 2007:278f.). Rather than excluding multiple pronunciations, a script was used to eliminate pronunciations which contain archaic readings of this digraph, e.g., as pharyngealization or as [ɣ].

Welsh  WikiPron's transcriptions of the Southern dialect of Welsh include the effects of variable processes of monophthongization and deletion (Hannahs 2013:18–25). Once again, rather than excluding multiple pronunciations, a script was used to select the "longer" pronunciation—naturally, the pronunciation without variable monophthongization or deletion—of Welsh words with multiple pronunciations.

5 Evaluation

The primary metric for this task was word error rate (WER), the percentage of words for which the hypothesized transcription sequence is not identical to the gold reference transcription. As the medium- and low-resource subtasks involve multiple languages, macro-averaged WER was used for system ranking. Participants were provided with two evaluation scripts: one which computes WER for a single language, and one which also computes macro-averaged WER across two or more languages. The 2020 shared task also reported another metric, phone error rate (PER), but this was found to be highly correlated with WER and has therefore been omitted here.
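Because WER as defined here is whole-transcription exact match, both metrics reduce to a few lines. A minimal sketch (not the official evaluation scripts):

```python
def wer(hyps, golds):
    """Word error rate: % of words whose hypothesis differs from the gold transcription."""
    assert len(hyps) == len(golds) and golds
    return 100 * sum(h != g for h, g in zip(hyps, golds)) / len(golds)

def macro_wer(per_language):
    """Macro-averaged WER over a {language: (hyps, golds)} mapping."""
    rates = [wer(h, g) for h, g in per_language.values()]
    return sum(rates) / len(rates)
```

Note that macro-averaging weights each language equally regardless of test-set size, which is why it is appropriate for ranking across the ten languages of a subtask.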

6 Baseline

The 2020 shared task included three baselines: a WFST-based pair n-gram model, a bidirectional LSTM encoder-decoder network, and a transformer. All models were tuned to minimize per-language development-set WER using a limited-budget grid search. Best results overall were obtained by the bidirectional LSTM. Despite the extensive GPU resources required to execute a per-language grid search, the best baseline was handily outperformed by nearly all submissions. This led us to seek a simpler, stronger, and less computationally demanding baseline for this year's shared task.

Armenian (Eastern)       arm_e      համադրություն    h ɑ m ɑ d ә ɾ u tʰ j u n
Bulgarian                bul        обоснованият     o b o s n o v a n i j ә t
Dutch                    dut        konijn           k oː n ɛ i n
French                   fre        joindre          ʒ w ɛ̃ d ʁ
Georgian                 geo        მოუქნელად        m ɔ u kʰ n ɛ l ɑ d
Serbo-Croatian (Latin)   hbs_latn   opadati          o p ǎː d a t i
Hungarian                hun        lobog            l o b o ɡ
Japanese (Hiragana)      jpn_hira   ぜんたいしゅぎ    dz ẽ n t a i ɕ ɨᵝ ɡʲ i
Korean                   kor        쇠가마우지        sʰ w e ɡ a m a u dʑ i
Vietnamese (Hanoi)       vie_hanoi  ngừng bắn        ŋ ɨ ŋ ˨˩ ʔ ɓ a n ˧˦

Table 1: The ten languages in the medium-resource subtask with language codes and example training data pairs.

Adyghe               ady        кӏэшӏыхьан   t ʃʼ a ʃʼ ә ħ aː n
Greek                gre        λέγεται      l e ʝ e t e
Icelandic            ice        maður        m aː ð ʏ r
Italian              ita        marito       m a r i t o
Khmer                khm        របហារ        p r ɑ h aː
Latvian              lav        mīksts       m îː k s t s
Maltese (Latin)      mlt_latn   minna        m ɪ n n a
Romanian             rum        ierburi      j e r b u rʲ
Slovenian            slv        oprostite    ɔ p r ɔ s t íː t ɛ
Welsh (Southwest)    wel_sw     gorff        ɡ ɔ r f

Table 2: The ten languages in the low-resource subtask with language codes and example training data pairs.

The baseline for the 2021 shared task is a neural transducer system using an imitation learning paradigm (Makarov and Clematide 2018). A variant of this system (Makarov and Clematide 2020) was the second-best system in the 2020 shared task.⁵ Alignments are computed using ten iterations of expectation maximization, and the imitation learning policy is trained for up to sixty epochs (with a patience of twelve) using the Adadelta optimizer. A beam of size four is used for prediction. Final predictions are produced by a majority-vote ten-component ensemble. Internal processing is performed using the decomposed Unicode normalization form (NFD), but predictions are converted back to the composed form (NFC). An implementation of the baseline was provided during the task and participating teams were encouraged to adapt it for their submissions.

⁵The baseline was implemented using the DyNet neural network toolkit (Neubig et al. 2017). In contrast to the previous year's baseline, the imitation learning system does not require a GPU for efficient training; it runs effectively on CPU and can exploit multiple CPU cores if present. Training, ensembling, and evaluation for all three subtasks took roughly 72 hours of wall-clock time on a commodity desktop PC.
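The NFD/NFC round-trip used by the baseline can be reproduced with Python's standard unicodedata module; decomposing separates base characters from combining diacritics, giving the transducer smaller, more reusable symbol units:

```python
import unicodedata

composed = "\u00e9"                                  # 'é' as a single code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + U+0301 combining acute accent
assert len(composed) == 1 and len(decomposed) == 2

# Predictions computed over NFD sequences are converted back to NFC at the end:
assert unicodedata.normalize("NFC", decomposed) == composed
```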

7 Submissions

Below we provide brief descriptions of submissions to the shared task; more detailed descriptions—as well as various exploratory analyses and post-submission experiments—can be found in the system papers later in this volume.

AZ  Hammond (2021) produced a single submission to the low-resource subtask. The model is inspired by the previous year's bidirectional LSTM baseline but also employs several data augmentation strategies. First, much of the development data is used for training rather than for validation. Second, new training examples are generated using substrings of other training examples. Finally, the AZ model is trained simultaneously on all languages, a method used in some of the previous year's shared task submissions (e.g., Peters and Martins 2020, Vesik et al. 2020).


CLUZH  Clematide and Makarov (2021) produced four submissions to the medium-resource subtask and three to the low-resource subtask. All seven submissions are variations on the imitation learning baseline model (section 6). They experiment with processing individual IPA Unicode characters instead of entire IPA "segments" (e.g., CLUZH-1, CLUZH-5, and CLUZH-6), and larger ensembles (e.g., CLUZH-3). They also experiment with input dropout, mogrifier LSTMs, and adaptive batch sizes, among other features.

Dialpad  Gautam et al. (2021) submitted three systems to the high-resource subtask. The Dialpad-1 submission is a large ensemble of seven different sequence models. Dialpad-2 is a smaller ensemble of three models. Dialpad-3 is a single transformer model implemented as part of CMU Sphinx. Gautam et al. also experiment with subword modeling techniques.

UBC  Lo and Nicolai (2021) submitted two systems for the low-resource subtask, both variations on the baseline model. The UBC-1 submission hypothesizes that, as previously reported by van Esch et al. (2016), inserting explicit syllable boundaries into the phone sequences enhances grapheme-to-phoneme performance. They generate syllable boundaries using an automated onset-maximization heuristic. The UBC-2 submission takes a different approach: it assigns additional language-specific penalties for mis-predicted vowels and diacritic characters such as the length mark /ː/.
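Onset maximization places each syllable boundary so that the following syllable receives the longest phonotactically legal onset. A toy sketch of the idea; the vowel set and legal-onset list below are illustrative assumptions, not UBC's actual inventory:

```python
VOWELS = {"a", "e", "i", "o", "u"}  # illustrative vowel phones
LEGAL_ONSETS = {"", "p", "t", "k", "s", "r", "st", "tr", "str"}  # illustrative onsets

def syllabify(phones):
    """Insert '.' boundaries so each non-initial syllable takes the longest legal onset."""
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    boundaries = set()
    for prev, cur in zip(vowel_idx, vowel_idx[1:]):
        cluster = phones[prev + 1:cur]
        # Smallest k whose suffix is a legal onset = maximal onset for the next syllable.
        for k in range(len(cluster) + 1):
            if "".join(cluster[k:]) in LEGAL_ONSETS:
                boundaries.add(prev + 1 + k)
                break
    out = []
    for i, p in enumerate(phones):
        if i in boundaries:
            out.append(".")
        out.append(p)
    return out
```

For example, an intervocalic cluster like "str" attaches wholly to the following syllable ("a.stra") because "str" is a legal onset, whereas an illegal cluster would be split to leave the longest legal remainder as the onset.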

8 Results

Multiple submissions to the high- and low-resource subtasks outperformed the baseline; however, no submission to the medium-resource subtask exceeded the baseline. The best results for each language are shown in Table 3.

8.1 Subtasks

High-resource subtask  The Dialpad team submitted three systems for the high-resource subtask, all of which outperformed the baseline. Results for this subtask are shown in Table 4. The best submission overall, Dialpad-1, a seven-component ensemble, achieved an impressive 4.5% absolute (11% relative) reduction in WER over the baseline.

Medium-resource subtask  The CLUZH team submitted four systems for the medium-resource subtask. All of these systems are variants of the baseline model. The results are shown in Table 5; note that the individual language results are expressed as three-digit percentages since there are 1,000 test examples each. While several of the CLUZH systems outperform the baseline on individual languages, including Armenian, French, Hungarian, Japanese, Korean, and Vietnamese, the baseline achieves the best macro-accuracy.

Low-resource subtask  Three teams—AZ, CLUZH, and UBC—submitted a total of six systems to the low-resource subtask. Results for this subtask are shown in Table 6; note that the results are expressed as two-digit percentages since there are 100 test examples for each language. Three submissions outperformed the baseline. The best-performing submission was UBC-2, an adaptation of the baseline which assigns higher penalties for mis-predicted vowels and diacritic characters. It achieved a 1.0% absolute (4% relative) reduction in WER over the baseline.

8.2 Error analysis

Error analysis can help identify strengths and weaknesses of existing models, suggesting future improvements and guiding the construction of ensemble models. Prior experience using gold crowd-sourced data extracted from Wiktionary suggests that a non-trivial portion of errors made by top systems are due to errors in the gold data itself. For example, Gorman et al. (2019) report that a substantial portion of the prediction errors made by the top two systems in the 2017 CoNLL–SIGMORPHON Shared Task on Morphological Reinflection (Cotterell et al. 2017) are due to target errors, i.e., errors in the gold data. Therefore we conducted an automatic error analysis for four target languages. It was hoped that this analysis would also help identify (and quantify) target errors in the test data.

Two forms of error analysis were employed here. First, after Makarov and Clematide (2020), the most frequent error types in each language are shown in Table 7. From this table it is clear that many errors can be attributed to the ambiguity of a language's writing system. For example, in both Serbo-Croatian and Slovenian the most common errors involve the confusion or omission of suprasegmental information such as pitch accent and vowel length, neither of which is represented in the orthography. Likewise, in French and Italian the most frequent errors confuse vowel sounds


            Baseline WER   Best submission(s)                    WER
eng_us      41.91          Dialpad-1                             37.43
arm_e       7.0            CLUZH-7                               6.4
bul         18.3           CLUZH-6                               18.8
dut         14.7           CLUZH-7                               14.7
fre         8.5            CLUZH-4, CLUZH-5, CLUZH-6             7.5
geo         0.0            CLUZH-4, CLUZH-5, CLUZH-6, CLUZH-7    0.0
hbs_latn    32.1           CLUZH-7                               35.3
hun         1.8            CLUZH-6, CLUZH-7                      1.0
jpn_hira    5.2            CLUZH-7                               5.0
kor         16.3           CLUZH-4                               16.2
vie_hanoi   2.5            CLUZH-5, CLUZH-7                      2.0
ady         22             CLUZH-2, CLUZH-3, UBC-2               22
gre         21             CLUZH-1, CLUZH-3                      20
ice         12             CLUZH-1, CLUZH-3                      10
ita         19             UBC-1                                 20
khm         34             UBC-2                                 28
lav         55             CLUZH-2, CLUZH-3, UBC-2               49
mlt_latn    19             CLUZH-1                               12
rum         10             UBC-2                                 10
slv         49             UBC-2                                 47
wel_sw      10             CLUZH-1                               10

Table 3: Baseline WER, and the best submission(s) and their WER, for each language.

            Baseline   Dialpad-1   Dialpad-2   Dialpad-3
eng_us      41.94      37.43       41.72       41.58

Table 4: Results for the high-resource (US English) subtask.

represented by the same graphemes.

Many errors may also be attributable to problems with the target data. For example, the two most frequent errors for English are predicting [ɪ] instead of [ә], and predicting [ɑ] instead of [ɔ]. Impressionistically, the former is due in part to inconsistent transcription of the -ed and -es suffixes, whereas the latter may reflect inconsistent transcription of the low back merger.

The second error analysis technique used here is an adaptation of a quality assurance technique proposed by Jansche (2014). For each language targeted by the error analysis, a finite-state covering grammar is constructed by manually listing all pairs of permissible grapheme-phone mappings for that language. Let C be the set of all such ⟨g, p⟩ pairs. Then the covering grammar γ is the rational relation given by the closure over C; thus γ = C*. Covering grammars were constructed for three medium-resource languages and four of the low-resource languages. A fragment of the Bulgarian covering grammar, showing readings of the characters б, ф, and ю, is presented in Table 8.⁶

Let G be the graphemic form of a word and let P and P̂ be the corresponding gold and hypothesis pronunciations for that word. For error analysis we are naturally interested in cases where P̂ ≠ P, i.e., those cases where the gold and hypothesis pronunciations do not match, since these are exactly the cases which contribute to word error rate. Then, 𝒫 = π_o(G ∘ γ) is a finite-state lattice representing the set of all "possible" pronunciations of G admitted by the covering grammar. When P̂ ≠ P but P ∈ 𝒫—that is, when

⁶Error analysis software was implemented using the Pynini finite-state toolkit (Gorman 2016). See Gorman and Sproat 2021, ch. 3, for definitions of the various finite-state operations used here.


               Baseline   CLUZH-4   CLUZH-5   CLUZH-6   CLUZH-7
arm_e          7.0        7.1       6.6       6.6       6.4
bul            18.3       20.1      19.2      18.8      19.7
dut            14.7       15.0      14.9      15.6      14.7
fre            8.5        7.5       7.5       7.5       7.6
geo            0.0        0.0       0.0       0.0       0.0
hbs_latn       32.1       38.4      35.6      37.0      35.3
hun            1.8        1.5       1.2       1.0       1.0
jpn_hira       5.2        5.9       5.3       5.5       5.0
kor            16.3       16.2      16.9      17.2      16.3
vie_hanoi      2.5        2.3       2.0       2.1       2.0
Macro-average  10.6       11.4      10.9      11.1      10.8

Table 5: Results for the medium-resource subtask.

               Baseline   AZ     CLUZH-1   CLUZH-2   CLUZH-3   UBC-1   UBC-2
ady            22         30     24        22        22        25      22
gre            21         23     20        22        20        22      22
ice            12         22     10        12        10        13      11
ita            19         25     23        24        21        20      22
khm            34         42     32        33        32        31      28
lav            55         53     53        49        49        58      49
mlt_latn       19         19     12        16        14        19      18
rum            10         13     13        13        12        14      10
slv            49         90     50        59        55        56      47
wel_sw         10         40     10        13        12        13      12
Macro-average  25.1       35.7   24.7      26.3      24.7      27.1    24.1

Table 6: Results for the low-resource subtask.

the gold pronunciation is one of the possible pronunciations—we refer to such errors as model deficiencies, since this condition suggests that the system in question has failed to guess one of several possible pronunciations of the current word. In many cases this reflects genuine ambiguities in the orthography itself. For example, in Italian, e is used to write both the phonemes /e, ɛ/, and o is similarly read as /o, ɔ/ (Rogers and d'Arcangeli 2004). There are few if any orthographic clues to which mid-vowel phoneme is intended, and all submissions incorrectly predicted that the o in nome 'name' is read as [ɔ] rather than [o]. Similar issues arise in Icelandic and French. The preceding examples both represent global ambiguities, but model deficiencies may also occur when the system has failed to disambiguate a local ambiguity. One example of this can be found in French: the verbal third-person plural suffix -ent is silent, whereas non-suffixal word-final ent is normally read as [ɑ̃]. Morphological information was not provided to the covering grammar, but it could easily be exploited by grapheme-to-phoneme models.

Another condition of interest is when P̂ ≠ P

but P /∈ P. We refer to such errors as coverage de-ficiencies, since they arise when the gold pronun-ciation is not one permitted by the covering gram-mar. While coverage deficiencies may result fromactual deficiencies in the covering grammar itself,they more often arise when a word does not fol-low the normal orthographic principles of its lan-guage. For instance, Italian has borrowed the En-glish loanword weekend [wikɛnd] ‘id.’ but has notyet adapted it to Italian orthographic principles. Fi-nally, coverage deficiencies may indicate target er-rors, inconsistencies in the gold data itself. For ex-ample, in the Italian data, the tie bars used to indi-


eng_us      ɪ ә 113  |  ɑ ɔ 112  |  _ ʊ• 96  |  _ ɪ• 85  |  ɪ i 76
arm_e       _ ә• 16  |  ә• _ 10  |  tʰ d 6  |  d tʰ 6  |  j• _ 3
bul         ɛ• d 32  |  a ә 31  |  ә ɤ 30  |  _ 27  |  ә a 25
dut         ә eː 10  |  _ ː 10  |  ә ɛ 9  |  eː ә 8  |  z s 8
fre         a ɑ 6  |  _ •s 5  |  ɔ o 5  |  e ɛ•ʁ 3  |  _ •t 3
geo
hbs_latn    _ ː 85  |  ː _ 76  |  _ 55  |  53  |  _ 52
hun         _ ː 6  |  h ɦ 3  |  ʃ s 2  |  ː _ 2
jpn_hira    _ 20  |  _ 11  |  _ d 4  |  ː •ɯᵝ 3  |  h ɰᵝ 3
kor         _ ː 73  |  ː _ 28  |  ʌ ɘː 23  |  ʰ 9  |  ɘː ʌ 6
vie_hanoi   _ w• 3  |  _ ˧ 3  |  _ w•ŋm• 2  |  ɕ •ɹ 2  |  _ ʔ• 2
ady         ʼ _ 3  |  ː _ 3  |  ʃ ʂ 3  |  ə• _ 2  |  a ә 2
gre         ɾ r 8  |  r ɾ 3  |  i ʝ 3  |  m• _ 2  |  ɣ ɡ 2
ice         ː _ 2  |  _ 2  |  _ ː 2
ita         o ɔ 6  |  e ɛ 5  |  j i 3  |  • 2  |  ɔ o 2
khm         aː i•ә 3  |  _ ʰ 3  |  _ •ɑː 2  |  ĕ ɔ 2  |  ɑ a 2
lav         11  |  _ 10  |  _ 9  |  _ 7  |  _ 4
mlt_latn    _ ː 5  |  _ ɪ• 2  |  ɐ a 2  |  b p 2  |  a ɐ 2
rum         • 2
slv         7  |  ː _ 6  |  ː _ 6  |  _ ː 5  |  ɛ éː 4
wel_sw      ɪ iː 3  |  ɪ i 2  |  _ ɛ• 2

Table 7: The five most frequent error types, represented by the hypothesis string, gold string, and count, for each language; • indicates whitespace and _ the empty string.

…
б    b
б    bj
б    p
…
ф    f
ф    fj
…
ю    ju
ю    u
…

Table 8: Fragment of a covering grammar for Bulgarian; the left column contains graphemes and corresponding phones are given in the right column.

cate affricates are not always present, and many apparent errors are the result of gold pronunciations which omit a tie bar.

WER and model deficiency rate (MDR) are shown for select systems and three languages from the medium-resource subtask in Table 9, and Table 10 shows similar statistics for four low-resource languages. Note that by construction, one can obtain the coverage deficiency rate simply by subtracting MDR from WER. By comparing WER and MDR one can see that the overwhelming majority of errors in these seven languages are model deficiencies, most naturally arising from genuine ambiguities in orthography rather than target errors (i.e., data inconsistencies).

To facilitate ensemble construction and further error analysis, we release all submissions’ test set predictions to the research community.7

9 Discussion

We once again see an enormous difference in language difficulty. One of the languages with the highest amount of data, English, also has one of the highest WERs. In contrast, the baseline and all four submissions to the medium-resource subtask achieve perfect performance on Georgian. This is a substantial change from the previous year’s shared task: with a sample roughly half the size of this year’s task, the best system (Yu et al. 2020) obtained a WER of 24.89 on Georgian (Gorman et al.

7 https://drive.google.com/drive/folders/1Fer7UfHBnt5k-WFHsVXQO8ac3BvREAyC


           Baseline        CLUZH-5
           WER    MDR      WER    MDR
bul        18.3   17.6     19.2   19.0
fre         8.5    7.5      7.5    6.8
jpn_hira    5.2    4.4      5.3    4.5

Table 9: WER and model deficiency rate (MDR) for three languages from the medium-resource subtask.

       Baseline     AZ           CLUZH-1      UBC-2
       WER   MDR    WER   MDR    WER   MDR    WER   MDR
ady    22    22     30    23     24    21     22    22
gre    21    18     23    19     20    17     22    21
ice    12     9     22    17     10     7     11     5
ita    19    15     25    19     23    16     22    19

Table 10: WER and model deficiency rate (MDR) for four languages from the low-resource subtask.

2020:47). This enormous improvement likely reflects quality assurance work on this language,8 but we did not anticipate reaching ceiling performance. Insofar as the above quality assurance and error analysis techniques prove effective and generalizable, we may soon be able to ask what makes a language hard to pronounce (cf. Gorman et al. 2020:45f.).

As mentioned above, the data here are a mixture

of broad and narrow transcriptions. At first glance, this might explain some of the variation in language difficulty; for example, it is easy to imagine that the additional details in narrow transcriptions make them more difficult to predict. However, for many languages, only one of the two levels of transcription is available at scale, and for other languages, the divergence between broad and narrow transcriptions is impressionistically quite minor. This impression ought to be quantified, however.

While we responded to community demand for lower- and higher-resource subtasks, only one team submitted to the high- and medium-resource subtasks, respectively. It was surprising that none of the medium-resource submissions were able to consistently outperform the baseline model across the ten target languages. Clearly, this year’s baseline is much stronger than the previous year’s.

Participants in the high- and medium-resource subtasks were permitted to make use of lemmas and morphological tags from UniMorph as additional features. However, no team made use of

8 https://github.com/CUNY-CL/wikipron/issues/138

these resources. Some prior work (e.g., Demberg et al. 2007) has found morphological tags highly useful, and error analysis (§8.2) suggests this information would make an impact in French.

There is a large performance gap between the medium-resource and low-resource subtasks. For instance, the baseline achieves a WER of 10.6 in the medium-resource scenario and a WER of 25.1 in the low-resource scenario. It seems that current models are unable to reach peak performance with the 800 training examples provided in the low-resource subtask. Further work is needed to develop more efficient models and data augmentation strategies for low-resource scenarios. In our opinion, this scenario is the most important one for speech technology, since speech resources—including pronunciation data—are scarce for the vast majority of the world’s written languages.

10 Conclusions

The second iteration of the shared task on multilingual grapheme-to-phoneme conversion features many improvements on the previous year’s task, most of all data quality. Four teams submitted thirteen systems, achieving substantial reductions in both absolute and relative error over the baseline in two of three subtasks. We hope the code and data, released under permissive licenses,9 will be used to benchmark grapheme-to-phoneme conversion and sequence-to-sequence modeling techniques more generally.

9 https://github.com/sigmorphon/2021-task1/


Acknowledgements

We thank the Wiktionary contributors, particularly Aryaman Arora, without whom this shared task would be impossible. We also thank contributors to WikiPron, especially Sasha Gutkin, Jackson Lee, and the participants of Hacktoberfest 2020. Finally, thank you to Andrew Kirby for last-minute copy editing assistance.

References

Simon Clematide and Peter Makarov. 2021. CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: variations on a baseline. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL–SIGMORPHON 2017 shared task: universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30.

Vera Demberg, Helmut Schmid, and Gregor Möhler. 2007. Phonological constraints and morphological preprocessing for grapheme-to-phoneme conversion. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 96–103.

Daan van Esch, Mason Chua, and Kanishka Rao. 2016. Predicting pronunciations with syllabification and stress with recurrent neural networks. In INTERSPEECH 2016: 17th Annual Conference of the International Speech Communication Association, pages 2841–2845.

Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Frederic Mailhot, Shreekantha Nadig, Riqiang Wang, and Nathan Zhang. 2021. Avengers, ensemble! Benefits of ensembling in grapheme-to-phoneme prediction. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: from 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 759–765.

Kyle Gorman. 2016. Pynini: a Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 75–80.

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50.

Kyle Gorman, Arya D. McCarthy, Ryan Cotterell, Ekaterina Vylomova, Miikka Silfverberg, and Magdalena Markowska. 2019. Weird inflects but OK: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140–151.

Kyle Gorman and Richard Sproat. 2021. Finite-State Text Processing. Morgan & Claypool.

Michael Hammond. 2021. Data augmentation for low-resource grapheme-to-phoneme mapping. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

S. J. Hannahs. 2013. The Phonology of Welsh. Oxford University Press.

Bradley Hauer, Amir Ahmad Habibi, Yixing Luan, Arnob Mallik, and Grzegorz Kondrak. 2020. Low-resource G2P and P2G conversion with synthetic training data. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 117–122.

Robert Hoberman. 2007. Maltese morphology. In Alan S. Kaye, editor, Morphologies of Asia and Africa, volume 1, pages 257–282. Eisenbrauns.

Martin Jansche. 2014. Computer-aided quality assurance of an Icelandic pronunciation dictionary. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 2111–2114.

Paul Kingsbury, Stephanie Strassel, Cynthia McLemore, and Robert MacIntyre. 1997. CALLHOME American English Lexicon (PRONLEX). LDC97L20.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: universal morphology. In Proceedings of the 11th Language Resources and Evaluation Conference, pages 1868–1873.

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4216–4221.

Roger Yu-Hsiang Lo and Garrett Nicolai. 2021. Linguistic knowledge in multilingual grapheme-to-phoneme conversion. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Peter Makarov and Simon Clematide. 2018. Imitation learning for neural morphological string transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2877–2882.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 171–176.

Steven Moran and Michael Cysouw. 2018. The Unicode Cookbook for Linguists: Managing Writing Systems using Orthography Profiles. Language Science Press.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, and Pengcheng Yin. 2017. DyNet: the dynamic neural network toolkit. arXiv:1701.03980.

Ben Peters and André F.T. Martins. 2020. One-size-fits-all multilingual models. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 63–69.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4225–4229.

Derek Rogers and Luciana d’Arcangeli. 2004. Italian. Journal of the International Phonetic Association, 34(1):117–121.

Elmar Ternes and Tatjana Vladimirova-Buhtz. 1990. Bulgarian. Journal of the International Phonetic Association, 20(1):45–47.

Kaili Vesik, Muhammad Abdul-Mageed, and Miikka Silfverberg. 2020. One model to pronounce them all: multilingual grapheme-to-phoneme conversion with a transformer ensemble. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 146–152.

Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In INTERSPEECH 2015: 16th Annual Conference of the International Speech Communication Association, pages 3330–3334.

Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2020. Ensemble self-training for low-resource languages: grapheme-to-phoneme conversion and morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 70–78.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 126–130, August 5, 2021. ©2021 Association for Computational Linguistics

Data augmentation for low-resource grapheme-to-phoneme mapping

Michael Hammond
Dept. of Linguistics
U. of Arizona
Tucson, AZ, USA
[email protected]

Abstract

In this paper we explore a very simple neural approach to mapping orthography to phonetic transcription in a low-resource context. The basic idea is to start from a baseline system and focus all efforts on data augmentation. We will see that some techniques work, but others do not.

1 Introduction

This paper describes a submission by our team to the 2021 edition of the SIGMORPHON Grapheme-to-Phoneme conversion challenge. Here we demonstrate our efforts to improve grapheme-to-phoneme mapping for low-resource languages in a neural context using only data augmentation techniques.

The basic problem in the low-resource condition was to build a system that maps from graphemes to phonemes with very limited data. Specifically, there were 10 languages with 800 training pairs and 100 development pairs. Each pair was a word in its orthographic representation and a phonetic transcription of that word (though some multi-word sequences were also included). Systems were then tested on 100 additional pairs for each language. The 10 languages are given in Table 1.

To focus our efforts, we kept to a single system, intentionally similar to one of the simple baseline systems from the previous year’s challenge.

We undertook and tested three data augmentation techniques.

1. move as much development data to training data as possible

2. extract substring pairs from the training data to use as additional training data

3. train all the languages together

In the following, we first provide additional details on our base system and then outline and test

Code        Language
ady         Adyghe
gre         Modern Greek
ice         Icelandic
ita         Italian
khm         Khmer
lav         Latvian
mlt(_latn)  Maltese (Latin script)
rum         Romanian
slv         Slovene
wel(_sw)    Welsh (South Wales dialect)

Table 1: Languages and codes

each of the moves above separately. We will see that some work and some do not.

We acknowledge at the outset that we do not expect a system of this sort to “win”. Rather, we were interested in seeing how successful a minimalist approach might be, one that did not require major changes in system architecture or training. This minimalist approach entailed that the system not require a lot of detailed manipulation, and so we started with a “canned” system. This approach also entailed that training be something that could be accomplished with modest resources and time. All configurations below were run on a Lambda Labs Tensorbook with a single GPU.1 No training run took more than 10 minutes.

2 General architecture

The general architecture of the model is inspired by one of the 2020 baseline systems (Gorman et al., 2020): a sequence-to-sequence neural net with a two-level LSTM encoder and a two-level LSTM decoder. The system we used is adapted from the OpenNMT base (Klein et al., 2017).

There is a 200-element embedding layer in both

1 RTX 3080 Max-Q.


encoder and decoder. Each LSTM layer has 300 nodes. The systems are connected by a 5-head attention mechanism (Luong et al., 2015). Training proceeds in 24,000 steps and the learning rate starts at 1.0 and decays at a rate of 0.8 every 1,000 steps starting at step 10,000. Optimization is stochastic gradient descent, the batch size is 64, and the dropout rate is 0.5.
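The learning-rate schedule just described can be sketched as a small function. The function name and the exact step at which the first decay applies are our own assumptions; OpenNMT's implementation may differ in boundary details:

```python
def learning_rate(step, start_lr=1.0, decay=0.8,
                  decay_every=1000, start_decay_at=10000):
    """Stepwise decay schedule as described in the text: the rate holds
    at start_lr until `start_decay_at`, then is multiplied by `decay`
    every `decay_every` steps. Whether the first decay fires exactly at
    step 10,000 is our own boundary assumption."""
    if step < start_decay_at:
        return start_lr
    n_decays = (step - start_decay_at) // decay_every + 1
    return start_lr * decay ** n_decays

print(learning_rate(5000))    # before decay starts: 1.0
print(learning_rate(10000))   # first decay: 0.8
print(learning_rate(12500))   # 0.8 ** 3
```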

We spent a fair amount of time tuning the system to these settings for optimal performance with this general architecture on these data. We do not detail these efforts, as this is just a normal part of working with neural nets and not our focus here.

Precise instructions for building the docker image, full configuration files, and auxiliary code files are available at https://github.com/hammondm/g2p2021.

3 General results

In this section, we give the general results of the full system with all strategies in place, and then in the next sections we strip away each of our augmentation techniques to see what kind of effect each has. In building our system, we did not have access to the correct transcriptions for the test data provided, so we report performance on the development data here.

The system was subject to a certain amount of randomness because of randomization of training data and random initial weights in the network. We therefore report mean final accuracy scores over multiple runs.

Our system provides accuracy scores for development data in terms of character-level accuracy. The general task was scored in terms of word-level error rate, but we keep this measure for several reasons. First, it was simply easier, as this is what the system provided as a default. Second, this is a more granular measure that enabled us to adjust the system more carefully. Finally, we were able to simulate word-level accuracy in addition, as described below.

We use a Monte Carlo simulation to calculate expected word-level accuracy based on character-level accuracy and average transcription length for the training data for the different languages. The simulation works by generating 100,000 words with a random distribution of a specific character-level accuracy rate and then calculating word-level accuracy from that. Running the full system ten times, we get the results in Table 2. Keep in mind

Character  Word
94.84      75.6
94.78      75.3
94.46      74.0
94.84      75.5
94.71      75.0
94.59      74.5
94.90      75.8
94.53      74.2
94.53      74.2
94.71      75.0

Mean       94.69  74.91

Table 2: Development accuracy for 10 runs of the full system with all languages grouped together, with estimated word-level accuracy

Character  Word
94.60      74.6
94.71      75.0
94.35      73.8
94.48      74.0
94.48      74.0
94.50      74.2
94.59      74.5
94.71      75.0
94.80      75.4
94.59      74.5

Mean       94.58  74.5

Table 3: Development accuracy for 10 runs of the reduced system with all languages grouped together with 100 development pairs, with estimated word-level accuracy

that we are reporting accuracy rather than error rate, so the goal is to maximize these values.
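The simulation described above can be sketched as follows, under the simplifying assumption that each character is independently correct with the given probability (function name and parameters are our own illustration):

```python
import random

def simulate_word_accuracy(char_acc, word_len, n_words=100_000, seed=0):
    """Monte Carlo estimate of word-level accuracy from character-level
    accuracy: sample words whose characters are independently correct
    with probability `char_acc`; a word counts as correct only if every
    character is. `word_len` stands in for the average transcription
    length in training; independence is our simplifying assumption."""
    rng = random.Random(seed)
    correct = sum(
        all(rng.random() < char_acc for _ in range(word_len))
        for _ in range(n_words)
    )
    return 100 * correct / n_words

# With ~95% character accuracy and 6-phone words, the estimate hovers
# around 0.95 ** 6, i.e. roughly 73-74% of words fully correct.
acc = simulate_word_accuracy(0.95, 6)
```

Under independence the closed form is simply char_acc ** word_len; the sampling version mirrors the paper's description and generalizes easily to length distributions.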

4 Using development data

The default partition for each language is 800 pairs for training and 100 pairs for development. We shifted this to 880 pairs for training and 20 pairs for development. The logic of this choice was to retain what seemed like the minimum number of development items. Running the system ten times without this repartitioning gives the results in Table 3.

There is a small difference in the right direction, but it is not significant for characters (t = −1.65, p = 0.11, unpaired) or words (t = −1.56, p = 0.13, unpaired). It may be that with a larger sample of runs, the difference becomes more stable.


Code       Items added
ady        4
gre        223
ice        58
ita        194
khm        39
lav        100
mlt_latn   62
rum        119
slv        127
wel_sw     7

Table 4: Number of substrings added for each language

5 Using substrings

This method involves finding peripheral letters that map unambiguously to some symbol and then finding plausible splitting points within words to create partial words that can be added to the training data.

Let’s exemplify this with Welsh. First, we identify all word-final letters that always correspond to the same symbol in the transcription. For example, the letter c always corresponds to a word-final [k]. Similarly, we identify word-initial characters with the same property. For example, in these data, the word-initial letter t always corresponds to [t].2

We then search for any word in training that has the medial sequence ct where the transcription has [kt]. We take each half of the relevant item and add them to the training data if that pair is not already there. For example, the word actor [aktɔr] fits the pattern, so we can add the pairs ac–[ak] and tor–[tɔr] to the training data. Table 4 gives the number of items added for each language. This strategy is a more limited version of the “slice-and-shuffle” approach used by Ryan and Hulden (2020) in last year’s challenge.
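Under the simplifying assumption of a one-to-one letter-phone alignment (the paper does not spell out its alignment step), the procedure can be sketched as follows; all names and the toy data are our own:

```python
from collections import defaultdict

def unambiguous_edges(pairs):
    """Collect word-final letters that always map to the same final
    phone, and word-initial letters that always map to the same initial
    phone. Assumes orthography and transcription align one-to-one."""
    final, initial = defaultdict(set), defaultdict(set)
    for orth, phon in pairs:
        final[orth[-1]].add(phon[-1])
        initial[orth[0]].add(phon[0])
    fin = {l: ps.pop() for l, ps in final.items() if len(ps) == 1}
    ini = {l: ps.pop() for l, ps in initial.items() if len(ps) == 1}
    return fin, ini

def substring_pairs(pairs):
    """Split a word at a medial letter bigram whose left letter is an
    unambiguous final and right letter an unambiguous initial, yielding
    two new pairs, as with actor [aktɔr] -> ac-[ak] and tor-[tɔr]."""
    fin, ini = unambiguous_edges(pairs)
    out = []
    for orth, phon in pairs:
        for i in range(1, len(orth) - 1):
            a, b = orth[i - 1], orth[i]
            if (a in fin and b in ini
                    and phon[i - 1] == fin[a] and phon[i] == ini[b]):
                out.append((orth[:i], phon[:i]))
                out.append((orth[i:], phon[i:]))
    return [p for p in out if p not in pairs]

# Toy data: only "actor" contains a qualifying split point.
pairs = [("actor", "aktɔr"), ("mac", "mak"), ("tir", "tir")]
print(substring_pairs(pairs))  # [('ac', 'ak'), ('tor', 'tɔr')]
```

The non-local generalizations discussed next (e.g. Welsh y) are exactly what this purely local split cannot see.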

Note that this procedure can make errors. If there are generalizations about the pronunciation of letters that are not local, that involve elements at a distance, this procedure can obscure those. Another example from Welsh makes the point. There are exceptions, but the letter y in Welsh is pronounced two ways. In a word-final syllable, it is pronounced [ɨ], e.g. gwyn [gwɨn] ‘white’. In a non-final syllable, it is pronounced [ə], e.g. gwynion [gwənjɔn] ‘white ones’. Though it doesn’t happen in the training data here, the procedure above could easily

2 This is actually incorrect for the language as a whole. Word-initial t in the digraph th corresponds to a different sound [θ].

Character  Word
95.15      76.9
94.40      73.7
95.15      76.8
94.59      74.5
94.65      74.8
95.27      77.4
94.53      74.2
94.78      75.2
95.09      76.6
94.59      74.5

Mean       94.82  75.46

Table 5: 10 runs with all languages grouped together without substrings added for each language

result in a y in a non-final syllable ending up in a final syllable in a substring generated as above.

Table 5 shows the results of 10 runs without these additions and simulated word error rates for each run.

Strikingly, adding the substrings lowered performance, but the difference with the full model is not significant for either characters (t = 1.18, p = 0.25, unpaired) or for words (t = 1.17, p = 0.25, unpaired). This model without substrings is the best-performing of all the models we tried, so this is what was submitted.

6 Training together

The basic idea here was to leverage the entire set of languages. Thus all languages were trained together. To distinguish them, each pair was prefixed by its language code. Thus if we had orthography O = 〈o1, o2, . . . , on〉 and transcription T = 〈t1, t2, . . . , tm〉 from language x, the net would be trained on the pair O′ = 〈x, o1, o2, . . . , on〉 and T′ = 〈x, t1, t2, . . . , tm〉. The idea is that, while the mappings and orthographies are distinct, there are similarities in what letters encode what sounds and in the possible sequences of sounds that can occur in the transcriptions. This approach is very similar to that of Peters et al. (2017), except that we tag the orthography and the transcription with the same language tag. Peters and Martins (2020) took a similar approach in last year’s challenge, but use embeddings prefixed at each time step.
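The prefixing is a one-line transformation over token sequences; a minimal sketch (function name is our own):

```python
def tag_pair(lang, orth, phon):
    """Prefix both orthography and transcription with the language code,
    as described: O' = <x, o1, ..., on> and T' = <x, t1, ..., tm>.
    Sequences are token lists; the tag is a single extra token."""
    return [lang] + list(orth), [lang] + list(phon)

src, tgt = tag_pair("wel_sw", "gwyn", ["g", "w", "ɨ", "n"])
# src == ["wel_sw", "g", "w", "y", "n"]
# tgt == ["wel_sw", "g", "w", "ɨ", "n"]
```

Because the tag appears in both source and target vocabularies, the decoder can also condition on it at every step through attention over the encoder states.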

In Table 6 we give the results for running each language separately 5 times. Since there was a lot less training data for each run, these models settled faster, but we ran them the same number of steps


as the full models for comparison purposes.

There’s a lot of variation across runs and the

means for each language are quite different, presumably based on different levels of orthographic transparency. The general pattern is clear that, overall, training together does better than training separately. Comparing run means with our baseline system is significant (t = −6.06, p < .001, unpaired).

This is not true in all cases, however. For some languages, individual training seems to be better than training together. Our hypothesis is that languages that share similar orthographic systems did better with joint training and that languages with diverging systems suffered.

The final results show that our best system (no substrings included, all languages together, moving development data to training) performed reasonably for some languages, but did quite poorly for others. This suggests a hybrid strategy that would have been more successful. In addition to training the full system here, train individual systems for each language. For test, compare final development results for individual languages for the jointly trained system and the individually trained systems, and use whichever does better for each language in testing.

7 Conclusion

To conclude, we have augmented a basic sequence-to-sequence LSTM model with several data augmentation moves. Some of these were successful: redistributing data from development to training and training all the languages together. Some techniques were not successful though: the substring strategy resulted in diminished performance.

Acknowledgments

Thanks to Diane Ohala for useful discussion. Thanks to several anonymous reviewers for very helpful feedback. All errors are my own.

References

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50. Association for Computational Linguistics.

G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. ArXiv e-prints. 1701.02810.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Ben Peters, Jon Dehdari, and Josef van Genabith. 2017. Massively multilingual neural grapheme-to-phoneme conversion. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems, pages 19–26, Copenhagen. Association for Computational Linguistics.

Ben Peters and Andre F. T. Martins. 2020. DeepSPIN at SIGMORPHON 2020: One-size-fits-all multilingual models. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 63–69. Association for Computational Linguistics.

Zach Ryan and Mans Hulden. 2020. Data augmentation for transformer-based G2P. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 184–188. Association for Computational Linguistics.


Language   Run 1   Run 2   Run 3   Run 4   Run 5   Mean
ady        95.27   91.12   93.49   94.67   93.49   93.61
gre        97.25   98.35   98.35   98.90   98.90   98.35
ice        91.16   94.56   93.88   90.48   94.56   92.93
ita        93.51   94.59   94.59   94.59   94.59   94.38
khm        94.19   90.32   90.97   90.97   90.97   91.48
lav        94.00   90.67   89.33   92.00   90.67   91.33
mlt_latn   91.89   94.59   91.89   92.57   93.24   92.84
rum        95.29   96.47   94.71   95.88   95.29   95.51
slv        94.01   94.61   94.61   94.61   94.01   94.37
wel_sw     96.30   97.04   96.30   97.04   96.30   96.59
Mean       94.29   94.23   93.81   94.17   94.20   94.14

Table 6: 5 separate runs for each language


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 131–140, August 5, 2021. ©2021 Association for Computational Linguistics

Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion

Roger Yu-Hsiang Lo
Department of Linguistics
The University of British Columbia
[email protected]

Garrett Nicolai
Department of Linguistics
The University of British Columbia
[email protected]

Abstract

This paper documents the UBC Linguistics team’s approach to the SIGMORPHON 2021 Grapheme-to-Phoneme Shared Task, concentrating on the low-resource setting. Our systems expand the baseline model with simple modifications informed by syllable structure and error analysis. In-depth investigation of test-set predictions shows that our best model rectifies a significant number of mistakes compared to the baseline prediction, besting all other submissions. Our results validate the view that careful error analysis in conjunction with linguistic knowledge can lead to more effective computational modeling.

1 Introduction

With speech technologies becoming ever more prevalent, grapheme-to-phoneme (G2P) conversion is an important part of the pipeline. G2P conversion refers to mapping a sequence of orthographic representations in some language to a sequence of phonetic symbols, often transcribed in the International Phonetic Alphabet (IPA). This is often an early step in tasks such as text-to-speech, where the pronunciation must be determined before any speech is produced. An example of such a G2P conversion, in Amharic, is illustrated below:

አማርኛ ↦ [amarɨɲːa] ‘Amharic’

For the second year, one of the SIGMORPHON shared tasks concentrates on G2P. This year, the task is further broken into three subtasks of varying data levels: high-resource (33K training instances), medium-resource (8K training instances), and low-resource (800 training instances). Our focus is on the low-resource subtask. The language data and associated constraints in the low-resource setting will be summarized in Section 3.1; the reader interested in the other two subtasks is referred to Ashby et al. (this volume) for an overview.

In this paper, we describe our methodology and approaches to the low-resource setting, including insights that informed our methods. We conclude with an extensive error analysis of the effectiveness of our approach.

This paper is structured as follows: Section 2 overviews previous work on G2P conversion. Section 3 gives a description of the data in the low-resource subtask, the evaluation metric, and baseline results, along with the baseline model architecture. Section 4 introduces our approaches as well as the motivation behind them. We present our results in Section 5 and associated error analyses in Section 6. Finally, Section 7 concludes our paper.

2 Previous Work on G2P conversion

The techniques for performing G2P conversion have long been coupled with contemporary machine learning advances. Early paradigms utilize joint sequence models that rely on the alignment between grapheme and phoneme, usually with variants of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The resulting sequences of graphones (i.e., joint grapheme-phoneme tokens) are then modeled with n-gram models or Hidden Markov Models (e.g., Jiampojamarn et al., 2007; Bisani and Ney, 2008; Jiampojamarn and Kondrak, 2010). A variant of this paradigm includes weighted finite-state transducers trained on such graphone sequences (Novak et al., 2012, 2015).

With the rise of various neural network techniques, neural-based methods have dominated the scene ever since. For example, bidirectional long short-term memory-based (LSTM) networks using a connectionist temporal classification layer produce comparable results to earlier n-gram models (Rao et al., 2015). By incorporating alignment information into the model, the ceiling set by n-gram


models has since been broken (Yao and Zweig, 2015). Attention further improved the performance, as attentional encoder-decoders (Toshniwal and Livescu, 2016) learned to focus on specific input sequences. As attention became “all that was needed” (Vaswani et al., 2017), transformer-based architectures have begun looming large (e.g., Yolchuyeva et al., 2019).

Recent years have also seen works that capitalize on multilingual data to train a single model with grapheme-phoneme pairs from multiple languages. For example, various systems from last year’s shared task submissions learned from a multilingual signal (e.g., ElSaadany and Suter, 2020; Peters and Martins, 2020; Vesik et al., 2020).

3 The Low-resource Subtask

This section provides relevant information concerning the low-resource subtask.

3.1 Task Data

The provided data in the low-resource subtask come from ten languages1: Adyghe (ady; in the Cyrillic script), Modern Greek (gre; in the Greek alphabet), Icelandic (ice), Italian (ita), Khmer (khm; in the Khmer script, which is an alphasyllabary system), Latvian (lav), Maltese transliterated into the Latin script (mlt_latn), Romanian (rum), Slovene (slv), and the South Wales dialect of Welsh (wel_sw). The data are extracted from Wiktionary2 using WikiPron (Lee et al., 2020), and filtered and downsampled with proprietary techniques, resulting in each language having 1,000 labeled grapheme-phoneme pairs, split into a training set of 800 pairs, a development set of 100 pairs, and a blind test set of 100 pairs.

3.2 The Evaluation Metric

This year, the evaluation metric is the word error rate (WER), which is simply the percentage of words for which the predicted transcription sequence differs from the ground-truth transcription. Different systems are ranked based on the macro-average over all languages, with lower scores indicating better systems. We also adopted this metric when evaluating our models on the development sets.
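Concretely, the metric can be computed as below (our own sketch, not the official evaluation script):

```python
def wer(predictions, references):
    """Word error rate: the percentage of words whose predicted
    transcription sequence differs from the ground-truth transcription."""
    assert len(predictions) == len(references)
    errors = sum(p != r for p, r in zip(predictions, references))
    return 100 * errors / len(references)

def macro_average_wer(per_language_wers):
    """Systems are ranked by the unweighted mean WER over all languages."""
    return sum(per_language_wers) / len(per_language_wers)
```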

1 All output is represented in IPA; unless specified otherwise, the input is written in the Latin alphabet.

2 https://www.wiktionary.org/

3.3 Baselines

The official baselines for individual languages are based on an ensembled neural transducer trained with the imitation learning (IL) paradigm (Makarov and Clematide, 2018a). The baseline WERs are tabulated in Table 3. In what follows, we overview this baseline neural-transducer system, as our models are built on top of it. A detailed formal description of the baseline system can be found in Makarov and Clematide (2018a,b,c, 2020).

The neural transducer in question defines a conditional distribution over edit actions, such as copy, deletion, insertion, and substitution:

pθ(y, a | x) = ∏_{j=1}^{|a|} pθ(a_j | a_{<j}, x),

where x denotes an input sequence of graphemes, and a = a_1 ... a_{|a|} stands for a sequence of edit actions. Note that the output sequence y is missing from the conditional probability on the right-hand side, as it can be deterministically computed from x and a. The model is implemented with an LSTM decoder coupled with a bidirectional LSTM encoder.
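To make the determinism of y given x and a concrete, the sketch below replays an action sequence over the input; the action encoding here is our own illustration, and the baseline's actual action inventory and implementation differ.

```python
def apply_actions(x, actions):
    """Deterministically compute the output y from input graphemes x and
    edit actions a. The encoding is illustrative: ("copy",) emits the
    current grapheme, ("sub", p) emits phoneme p and consumes a grapheme,
    ("ins", p) emits p without consuming, ("del",) consumes silently."""
    y, i = [], 0
    for action in actions:
        if action[0] == "copy":
            y.append(x[i]); i += 1
        elif action[0] == "sub":
            y.append(action[1]); i += 1
        elif action[0] == "ins":
            y.append(action[1])
        elif action[0] == "del":
            i += 1
    return y
```

For instance, for Icelandic traust, the sequence sub(tʰ), copy, sub(ø), sub(y), copy, copy yields [tʰ r ø y s t].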

The model is trained with IL and therefore demands an expert policy, which contains demonstrations of how the task can be optimally solved given any configuration. Cast as IL, the mapping from graphemes to phonemes can be understood as following an optimal path, dictated by the expert policy, that gradually turns input orthographic symbols into output IPA characters. To acquire the expert policy, a Stochastic Edit Distance model (Ristad and Yianilos, 1998) trained with the EM algorithm is employed to find an edit sequence consisting of four types of edits: copy, deletion, insertion, and substitution. During training, the expert policy is queried to identify the next optimal edit that minimizes the following objective, expressed in terms of Levenshtein distance and edit sequence cost:

β · ED(y, ŷ) + ED(x, ŷ), β ≥ 1,

where the first term is the Levenshtein distance between the target sequence y and the predicted sequence ŷ, and the second term measures the cost of editing x into ŷ.
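A sketch of this objective, using plain Levenshtein distance for both terms (a simplification: in the baseline the second term comes from the learned Stochastic Edit Distance model):

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/copy
        prev = cur
    return prev[-1]

def expert_cost(x, y_true, y_pred, beta=1.0):
    """β·ED(y, ŷ) + ED(x, ŷ) with β ≥ 1; the expert is queried for the
    next edit that minimizes this quantity."""
    assert beta >= 1
    return beta * levenshtein(y_true, y_pred) + levenshtein(x, y_pred)
```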

The baseline is run with default hyperparameter values, which include ten different initial seeds and a beam of size 4 during inference. The predictions of these individual models are ensembled using a


voting majority. Early efforts to modify the ensemble to incorporate system confidence showed that a majority ensemble was sufficient.
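The majority-vote step can be sketched as follows (a simplification of the baseline's ensembling; ties here are broken by first occurrence):

```python
from collections import Counter

def majority_vote(candidate_transcriptions):
    """Return the transcription predicted by the most ensemble members.
    Counter.most_common is stable, so ties fall back to first occurrence."""
    counts = Counter(tuple(t) for t in candidate_transcriptions)
    best, _ = counts.most_common(1)[0]
    return list(best)
```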

This model has proved to be competitive, judging from its performance on the previous year's G2P shared task. We therefore decided to use it as the foundation on which to construct our systems.

4 Our Approaches

This section lays out our attempted approaches. We investigate two alternatives, both linguistic in nature. The first is inspired by a universal linguistic structure—the syllable—and the other by the error patterns discerned from the baseline predictions on the development data.

4.1 System 1: Augmenting Data with Unsupervised Syllable Boundaries

Our first approach originates from the observation that, in natural languages, a sequence of sounds does not just assume a flat structure. Neighboring sounds group to form units, such as the onset, nucleus, and coda. In turn, these units can further project to a syllable (see Figure 1 for an example of such projection). Syllables are useful structural units in describing various linguistic phenomena and indeed in predicting the pronunciation of a word in some languages (e.g., Treiman, 1994). For instance, in Dutch, the vowel quality of the nucleus can be reliably inferred from the spelling after proper syllabification: .dag. [dɑx] 'day' but .da.gen. [daːɣən] 'days', where . marks syllable boundaries. Note that a in a closed syllable is pronounced as the short vowel [ɑ], but as the long vowel [aː] in an open syllable. In applying syllabification to G2P conversion, van Esch et al. (2016) find that training RNNs to jointly predict phoneme sequences, syllabification, and stress leads to further performance gains in some languages, compared to models trained without syllabification and stress information.

Figure 1: The syllable structure of twelfth [twɛłfθ]: the onset [tw] combines with the rhyme, which comprises the nucleus [ɛ] and the coda [łfθ].

To identify syllable boundaries in the input sequence, we adopted a simple heuristic, the specific steps of which are listed below:3

1. Find vowels in the output: We first identify the vowels in the phoneme sequence by comparing each segment with the vowel symbols from the IPA chart. For instance, the symbols [ø] and [y] in [tʰrøyst] for Icelandic traust are vowels because they match the vowel symbols [ø] and [y] on the IPA chart.

2. Find vowels in the input: Next, we align the grapheme sequence with the phoneme sequence using an unsupervised many-to-many aligner (Jiampojamarn et al., 2007; Jiampojamarn and Kondrak, 2010). By identifying graphemes that are aligned to phonemic vowels, we can identify vowels in the input. Using the Icelandic example again, the aligner produces a one-to-one mapping: t ↦ tʰ, r ↦ r, a ↦ ø, u ↦ y, s ↦ s, and t ↦ t. We therefore assume that the input characters a and u represent two vowels. Note that this step is often redundant for input sequences based on the Latin script but is useful for identifying vowel symbols in other scripts.

3. Find valid onsets and codas: A key step in syllabification is to identify which sequences of consonants can form an onset or a coda. Without resorting to linguistic knowledge, one way to identify valid onsets and codas is to look at the two ends of a word—consonant sequences appearing word-initially before the first vowel are valid onsets, and consonant sequences after the final vowel are valid codas. Looping through each input sequence in the training data gives us a list of valid onsets and codas. In the Icelandic example traust, the initial tr sequence must be a valid onset, and the final st sequence a valid coda.

4. Break word-medial consonant sequences into an onset and a coda: Unfortunately, identifying onsets and codas among word-medial consonant sequences is not as straightforward. For example, how do we know whether the

3 We are aware that different languages permit distinct syllable constituents (e.g., some languages allow syllabic consonants while others do not), but given the restriction that we are not allowed to use external resources in the low-resource subtask, we simply assume that all syllables must contain a vowel.


sequence in the input VngstrV (V for a vowel character) should be parsed as Vng.strV, as Vn.gstrV, or even as V.ngstrV? To tackle this problem, we use the valid onset and coda lists gathered from the previous step: we split the consonant sequence into two parts, and we choose the split where the first part is a valid coda and the second part a valid onset. For instance, suppose we have an onset list str, tr and a coda list ng, st. This implies that we only have a single valid split—Vng.strV—so ng is treated as the coda of the previous syllable and str as the onset of the following syllable. In cases where more than one split is acceptable, we favor the split that produces a more complex onset, based on the linguistic heuristic that natural languages tend to tolerate more complex onsets than codas. For example, Vng.strV > Vngs.trV. In the situation where none of the splits produces a concatenation of a valid coda and onset, we adopt the following heuristic:

• If there is only one medial consonant (such as in the case where the consonant can only occur word-internally but not in the onset or coda position), this consonant is classified as the onset of the following syllable.

• If there is more than one consonant, the first consonant is classified as the coda and attached to the previous syllable, while the rest serve as the onset of the following syllable.

Of course, this procedure is not free of errors (e.g., some languages have onsets that are only allowed word-medially, so word-initial onsets will naturally not include them), but overall it gives reasonable results.

5. Form syllables: The last step is to put together consonant and vowel characters to form syllables. The simplest approach is to allow each vowel character to project a nucleus and to distribute onsets and codas around these nuclei to build syllables. If there are four vowels in the input, there are likewise four syllables. There is one important caveat, however. When there are two or more consecutive vowel characters, some languages prefer to merge them into a single vowel/nucleus in their pronunciation (e.g., Greek και ↦ [ce])

while other languages simply default to vowel hiatuses/two side-by-side nuclei (e.g., Italian badia ↦ [badia])—indeed, both are common cross-linguistically. We again rely on the alignment results from the second step to select the vowel segmentation strategy for individual languages.
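Steps 3 and 4 of the heuristic above can be sketched as follows (our own simplified rendering; the function names are ours):

```python
def onset_coda_inventories(words, vowels):
    """Step 3: consonant runs before the first vowel are valid onsets;
    consonant runs after the final vowel are valid codas."""
    onsets, codas = set(), set()
    for w in words:
        v = [i for i, ch in enumerate(w) if ch in vowels]
        if v:
            onsets.add(tuple(w[:v[0]]))
            codas.add(tuple(w[v[-1] + 1:]))
    return onsets, codas

def split_medial(cluster, onsets, codas):
    """Step 4: split a word-medial consonant run into coda + onset,
    preferring the most complex legal onset (smallest split index)."""
    valid = [k for k in range(len(cluster) + 1)
             if tuple(cluster[:k]) in codas and tuple(cluster[k:]) in onsets]
    if valid:
        return min(valid)
    if len(cluster) == 1:
        return 0   # a lone medial consonant syllabifies as an onset
    return 1       # otherwise: first consonant is the coda, rest the onset
```

With onsets {tr, str} and codas {ng, st}, split_medial on ngstr returns split index 2, i.e., the Vng.strV parse from the example above.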

After we have identified the syllables that compose each word, we augment the input sequences with syllable boundaries. We use four labels to distinguish different types of syllable boundaries—<cc>, <cv>, <vc>, and <vv>—depending on the classes of sound of the segments straddling the syllable boundary. For instance, the input sequence b í l a v e r k s t æ ð i in Icelandic will be augmented to b í <vc> l a <vc> v e r k <cc> s t æ <vc> ð i. We applied the same syllabification algorithm to all languages to generate new input sequences, with the exception of Khmer, as the Khmer script does not permit a straightforward linear mapping between input and output sequences, which is crucial for the vowel identification step. We then used these syllabified input sequences, along with their target transcriptions, as the training data for the baseline model.4
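The boundary-labeling step can be sketched as below (assuming syllabification and vowel identification have already been performed; this is our own illustration, not the submitted code):

```python
def add_boundary_labels(syllables, vowels):
    """Insert <cc>/<cv>/<vc>/<vv> tokens between syllables, chosen by the
    classes of the two characters flanking each boundary."""
    out = []
    for i, syl in enumerate(syllables):
        out.extend(syl)
        if i < len(syllables) - 1:
            left = "v" if syl[-1] in vowels else "c"
            right = "v" if syllables[i + 1][0] in vowels else "c"
            out.append(f"<{left}{right}>")
    return out
```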

4.2 System 2: Penalizing Vowel and Diacritic Errors

Our second approach focuses on the training objective of the baseline model and is driven by the errors we observed in the baseline predictions. Specifically, we noticed that the majority of errors for the languages with a high WER—Khmer, Latvian, and Slovene—concerned vowels, some examples of which are given in Table 1. Note the nature of these mistakes: the mismatch can be in the vowel quality (e.g., [ɔ] for [o]), in the vowel length (e.g., [aː] for [a]), in the pitch accent (e.g., [ìː] for [íː]), or a combination thereof.

Based on the above observation, we modified the baseline model to explicitly address this vowel-mismatching issue. We modified the objective such that erroneous vowel or diacritic (e.g., the lengthening marker [ː]) predictions during training incur

4 The hyperparameters used are the default values provided in the baseline model code: character and action embedding size = 100, encoder LSTM state dimension = decoder LSTM state dimension = 200, encoder layers = decoder layers = 1, beam width = 4, roll-in hyperparameter = 1, epochs = 60, patience = 12, batch size = 5, EM iterations = 10, ensemble size = 10.


Language    Target          Baseline prediction

khm         n u h           n u ə h
            r ɔː j          r e ə j
            s p o ə n       s p a n

lat         t s eː l s      t s ɛː l s
            j ù o k s       j u o k s
            v æ l s         v æː l s

slv         j oː g u r t    j ɔ g uː r t
            k r íː ʃ        k r ìː ʃ
            z d a j         z d aː j

Table 1: Typical errors in the development set that involve vowels, from Khmer (khm), Latvian (lat), and Slovene (slv)

additional penalties. Each incorrectly predicted vowel incurs this penalty. The penalty acts as a regularizer that forces the model to expend more effort on learning vowels. This modification is in the same spirit as the softmax-margin objective of Gimpel and Smith (2010), which penalizes high-cost outputs more heavily, but our approach is even simpler—we merely supplement the loss with additional penalties for vowels and diacritics. We fine-tuned the vowel and diacritic penalties using a grid search on the development data, incrementing each by 0.1 from 0 to 0.5. In the case of ties, we skewed higher, as the penalties generally worked better at higher values. The final values used to generate predictions for the test data are listed in Table 2. We also note that the vowel penalty had significantly more impact than the diacritic penalty.
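A simplified, illustrative view of the modified objective (the actual system alters the transducer's training loss internally; the function and its signature below are ours):

```python
def penalized_step_loss(nll, predicted, gold, vowels, diacritics,
                        vowel_penalty, diacritic_penalty):
    """Supplement the per-step negative log-likelihood with a fixed
    penalty whenever a vowel or diacritic is predicted incorrectly."""
    loss = nll
    if predicted != gold:
        if gold in vowels or predicted in vowels:
            loss += vowel_penalty
        if gold in diacritics or predicted in diacritics:
            loss += diacritic_penalty
    return loss
```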

            Penalty
Language    Vowel    Diacritic

ady         0.5      0.3
gre         0.3      0.2
ice         0.3      0.3
ita         0.5      0.5
khm         0.2      0.4
lav         0.5      0.5
mlt_latn    0.2      0.2
rum         0.5      0.2
slv         0.4      0.4
wel_sw      0.4      0.5

Table 2: Vowel penalty and diacritic penalty values in the final models

5 Results

The performance of our systems, measured in WER, is juxtaposed with the official baseline results in Table 3. We first note that the baseline was particularly strong—gains were difficult to achieve for most languages. Our first system (Syl), which is based on syllabic information, unfortunately does not outperform the baseline. Our second system (VP), which includes additional penalties for vowels and diacritics, however, does outperform the baseline in several languages. Furthermore, its macro-average WER not only outperforms the baseline but also all other submitted systems.

            WER
Language    Baseline    Syl    VP

ady         22          25     22
gre         21          22     22
ice         12          13     11
ita         19          20     22
khm         34          31     28
lav         55          58     49
mlt_latn    19          19     18
rum         10          14     10
slv         49          56     47
wel_sw      10          13     12

Average     25.1        27.1   24.1

Table 3: Comparison of test-set results based on the word error rates (WERs)

It seems that extra syllable information does not help with predictions in this particular setting. It might be the case that additional syllable boundaries increase input variability without providing much useful information to the current neural-transducer architecture. Alternatively, information about syllable boundary locations might be redundant for this set of languages. Finally, it is possible that the unsupervised nature of our syllable annotation was too noisy to aid the model. We leave these speculations as research questions for future endeavors and restrict the subsequent error analyses and discussion to the results from our vowel-penalty system.5

5 One reviewer raised the question of why only syllable boundaries, as opposed to smaller constituents such as onsets or codas, are marked. Our hunch is that many phonological alternations happen at syllable boundaries, and that vowel length in some languages depends on whether the nucleus vowel is in a closed or open syllable. Also, given that adding syllable


Figure 2: Distributions of error types in test-set predictions across languages (baseline, Syl, and VP systems for ady, gre, ice, ita, khm, lav, mlt_latn, rum, slv, and wel_sw). Error types are distinguished based on whether an error involves only consonants (C-C, C-ϵ, ϵ-C), only vowels (V-V, V-ϵ, ϵ-V), or both (C-V, V-C). For example, C-V means that the error is caused by a ground-truth consonant being replaced by a vowel in the prediction; C-ϵ means a deletion error where the ground-truth consonant is missing in the prediction, while ϵ-C represents an insertion error where a consonant is wrongly added.

6 Error Analyses

In this section, we provide detailed error analyses of the test-set predictions from our best system. The goals of these analyses are twofold: (i) to examine the aspects in which this model outperforms the baseline, and to what extent, and (ii) to get a better understanding of the nature of the errors made by the system—we believe that insights and improvements can be derived from a good grasp of error patterns.

We analyzed the mismatches between predicted sequences and ground-truth sequences at the segmental level. For this purpose, we again utilized many-to-many alignment (Jiampojamarn et al., 2007; Jiampojamarn and Kondrak, 2010), but this time between a predicted sequence and the corresponding ground-truth sequence.6 For each error along the aligned sequence, we classified it into one of three kinds:

• Those involving erroneous vowel insertions (e.g., ϵ → [ə]), deletions (e.g., [ə] → ϵ), or substitutions (e.g., [ə] → [a]).

• In the same vein, those involving erroneous consonant insertions (e.g., ϵ → [ʔ]), deletions

boundaries does not improve the results, it is unlikely that marking constituent boundaries, which adds more variability to the input, will result in better performance, though we did not test this hypothesis.

6 The alignment parameters used are: allowing deletion of input grapheme strings, a maximum aligned grapheme/phoneme substring length of one, and a training threshold of 0.001.

(e.g., [ʔ] → ϵ), and substitutions (e.g., [d] → [t]).

• Those involving exchanges of a vowel and a consonant (e.g., [w] → [u]) or vice versa.
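The three-way classification of an aligned mismatch can be sketched as follows (our own rendering, with ϵ standing for the empty symbol of the alignment):

```python
def classify_error(gold, pred, vowels, epsilon="ϵ"):
    """Label one aligned mismatch as e.g. "V-ϵ" (vowel deletion),
    "ϵ-C" (consonant insertion), or "C-V" (a ground-truth consonant
    replaced by a vowel in the prediction)."""
    def cls(segment):
        if segment == epsilon:
            return epsilon
        return "V" if segment in vowels else "C"
    return f"{cls(gold)}-{cls(pred)}"
```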

The frequency of each error type made by the baseline model and our systems for each individual language is plotted in Figure 2. Some patterns are immediately clear. First, both systems have a similar distribution of error types across languages, although ours makes fewer errors on average. Second, both systems err on different elements depending on the language. For instance, while Adyghe (ady) and Khmer (khm) have a more balanced distribution between consonant and vowel errors, Slovene (slv) and Welsh (wel_sw) are dominated by vowel errors. Third, the improvements gained by our system seem to come mostly from a reduction in vowel errors, as is evident in the case of Khmer, Latvian (lav), and, to a lesser extent, Slovene.

The final observation is backed up if we zoom in on the errors in these three languages, which we visualize in Figure 3. Many incorrect vowels generated by the baseline model are now correctly predicted. We note that there are also cases, though less common, where the baseline model gives the right prediction but ours does not. It should be pointed out that, although our system shows improvement over the baseline, there is still plenty of room for improvement in many languages, and our system still produces incorrect vowels in many


Figure 3: Comparison of vowels predicted by the baseline model and our best system (VP) with the ground-truth vowels for Khmer, Latvian, and Slovene. Here we only visualize the cases where either the baseline model gives the right vowel but our system does not, or vice versa. We do not include cases where both the baseline model and our system predict the correct vowel, or both predict an incorrect vowel, to avoid cluttering the view. Each line connecting a baseline, ground-truth, and VP vowel represents a set of aligned vowels in the same word; a horizontal line segment between a system and the ground truth means that the prediction from that system agrees with the ground truth. Color hues distinguish cases where the prediction from the baseline is correct from those where the prediction from our second system is correct. Shaded areas on the plots enclose vowels of similar quality.

instances.

Finally, we look at several languages which still resulted in a high WER on the test set—ady, gre, ita, khm, lav, and slv. We use a confusion matrix analysis to identify clusters of commonly confused phonemes. This analysis again relies on the alignment between the ground-truth sequence and the corresponding predicted sequence to characterize error distributions. The results from this analysis are shown in Figure 4, and some interesting patterns are discussed below. Figure 2 suggests that Khmer has an equal share of consonant and vowel errors, and the heat maps in Figure 4 reveal that these errors do not seem to follow a particular pattern. However, a different picture emerges with Latvian and Slovene. For both languages, Figure 2 indicates the dominance of errors tied to vowels; consonant errors account for a relatively small proportion. This observation is borne out in Figure 4, with the consonant heat maps for the two languages displaying a clear diagonal stripe, and the vowel heat maps showing much more off-diagonal signal. What is more interesting is that the vowel errors in fact form clusters, as highlighted by white squares on the heat maps. The general pattern is that confusion only arises within a cluster where vowels are of similar quality but differ in length or pitch accent. For example, while [iː] might be incorrectly predicted as [i], our model does not confuse it with, say, [u]. The challenges these languages present to the models are therefore largely suprasegmental—vowel length and pitch accent, both of which are lexicalized and not explicitly marked in the orthography. For the other three languages, the errors also show distinct patterns: for Adyghe, consonants differing only in secondary features can get confused; in Greek, many errors can be attributed to the mixing of [r] and [ɾ]; in Italian, front and back mid vowels can trick our model.
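The row-normalized confusion counts behind these heat maps can be sketched as follows (our own simplified rendering of the analysis):

```python
from collections import Counter, defaultdict

def confusion(aligned_pairs):
    """Row-normalized confusion proportions over aligned
    (ground-truth, predicted) segment pairs: rows are predicted
    segments, columns are ground-truth segments."""
    rows = defaultdict(Counter)
    for gold, pred in aligned_pairs:
        rows[pred][gold] += 1
    return {pred: {gold: n / sum(counts.values())
                   for gold, n in counts.items()}
            for pred, counts in rows.items()}
```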

We hope that our detailed error analyses show not only that these errors "make linguistic sense"—and therefore attest to the power of the model—but also point out a pathway along which future modeling can be improved.

7 Conclusion

This paper presented the approaches adopted by the UBC Linguistics team to tackle the SIGMORPHON 2021 Grapheme-to-Phoneme Conversion challenge in the low-resource setting. Our submissions build upon the baseline model with modifications inspired by syllable structure and vowel error patterns. While the first modification does not result in more accurate predictions, the second modification does lead to sizable improvements over the baseline results. Subsequent error analyses reveal that the modified model indeed reduces erroneous vowel predictions for languages whose errors are dominated by vowel mismatches. Our approaches also demonstrate that patterns uncovered from careful error analyses can inform the directions for potential improvements.

Figure 4: Confusion matrices of vowel and consonant predictions by our second system (VP) for languages with a test WER > 20% (consonant matrices for Khmer, Latvian, Slovene, Adyghe, and Greek; vowel matrices for Khmer, Latvian, Slovene, and Italian). Each row represents a predicted segment, with colors across columns indicating the proportion of times the predicted segment matches individual ground-truth segments. A gray row means the segment in question is absent from all predicted phoneme sequences but present in at least one ground-truth sequence. The diagonal elements represent the number of times the predicted segment matches the target segment, while off-diagonal elements are mispredictions. White squares highlight segment groups where mismatches are common.

References

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50:434–451.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

Omnia ElSaadany and Benjamin Suter. 2020. Grapheme-to-phoneme conversion with a multilingual transformer model. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 85–89.

Daan van Esch, Mason Chua, and Kanishka Rao. 2016. Predicting pronunciations with syllabification and stress with recurrent neural networks. In Proceedings of Interspeech 2016, pages 2841–2845.

Kevin Gimpel and Noah A. Smith. 2010. Softmax-margin CRFs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 733–736.

Sittichai Jiampojamarn and Grzegorz Kondrak. 2010. Letter-phoneme alignment: An exploration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 780–788.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and Hidden Markov Models to letter-to-phoneme conversion. In Proceedings of NAACL HLT 2007, pages 372–379.

Jackson L. Lee, Lucas F. E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4223–4228.

Peter Makarov and Simon Clematide. 2018a. Imitation learning for neural morphological string transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2877–2882.

Peter Makarov and Simon Clematide. 2018b. Neural transition-based string transduction for limited-resource setting in morphology. In Proceedings of the 27th International Conference on Computational Linguistics, pages 83–93.

Peter Makarov and Simon Clematide. 2018c. UZH at CoNLL-SIGMORPHON 2018 shared task on universal morphological reinflection. In Proceedings of the CoNLL-SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 69–75.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 171–176.

Josef R. Novak, Nobuaki Minematsu, and Keikichi Hirose. 2012. WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49.

Josef Robert Novak, Nobuaki Minematsu, and Keikichi Hirose. 2015. Phonetisaurus: Exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework. Natural Language Engineering, 22(6):907–938.

Ben Peters and André F. T. Martins. 2020. DeepSPIN at SIGMORPHON 2020: One-size-fits-all multilingual models. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 63–69.

Kanishka Rao, Fuchun Peng, Hasim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4225–4229.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532.

Shubham Toshniwal and Karen Livescu. 2016. Jointly learning to align and convert graphemes to phonemes with neural attention models. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 76–82.

Rebecca Treiman. 1994. To what extent do orthographic units in print mirror phonological units in speech? Journal of Psycholinguistic Research, 23(1):91–110.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), pages 1–11.

Kaili Vesik, Muhammad Abdul-Mageed, and Miikka Silfverberg. 2020. One model to pronounce them all: Multilingual grapheme-to-phoneme conversion with a Transformer ensemble. In Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 146–152.

Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. In Proceedings of Interspeech 2015, pages 3330–3334.

Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-Tóth. 2019. Transformer based grapheme-to-phoneme conversion. In Proceedings of Interspeech 2019, pages 2095–2099.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 141–147. August 5, 2021. ©2021 Association for Computational Linguistics

Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction

Vasundhara Gautam, Wang Yau Li, Zafarullah Mahmood, Fred Mailhot∗, Shreekantha Nadig†, Riqiang Wang, Nathan Zhang

∗Dialpad Canada, †Dialpad India
vasundhara, wangyau.li, zafar, fred.mailhot, shree, riqiang.wang, [email protected]

Abstract

We describe three baseline-beating systems for the high-resource English-only sub-task of the SIGMORPHON 2021 Shared Task 1: a small ensemble that Dialpad's1 speech recognition team uses internally, a well-known off-the-shelf model, and a larger ensemble model comprising these and others. We additionally discuss the challenges related to the provided data, along with the processing steps we took.

1 Introduction

The transduction of sequences of graphemes to phones or phonemes,2 that is, from characters used in orthographic representations to characters used to represent minimal units of speech, is a core component of many tasks in speech science & technology. This grapheme-to-phoneme conversion (or g2p) may be used, e.g., to automate or scale the creation of digital lexicons or pronunciation dictionaries, which are crucial to FST-based approaches to automatic speech recognition (ASR) and synthesis (Mohri et al., 2002).

The SIGMORPHON 2021 Workshop included a Shared Task on g2p conversion, comprising 3 sub-tasks.3 The low- and medium-resource tasks were multilingual, while the high-resource task was English-only. This paper provides an overview of the three baseline-beating systems submitted by the Dialpad team for the high-resource sub-task,

∗ Corresponding author. Contributing authors are listed alphabetically.
1 https://www.dialpad.com/
2 We use these terms interchangeably here to refer to graphical representations of minimal speech sounds, remaining agnostic as to their theoretical or ontological status.
3 https://github.com/sigmorphon/2021-task1

along with a discussion of the challenges posed by the data that was provided.

2 Sub-task 1: high-resource, English-only

The organizers provided 41,680 lines of data in total: 33,344 for training, and 4,168 each for development and test. The data consists of word/pronunciation pairs (word-pron pairs, henceforth), where words are sequences of graphemes and pronunciations are sequences of characters from the International Phonetic Alphabet (International Phonetic Association, 1999). The data was derived from the English portion of the WikiPron database (Lee et al., 2020), a massively multilingual resource of word-pron pairs extracted from Wiktionary4 and subject to some manual QA and post-processing.5

The baseline model provided was the 2nd-place finisher from the 2020 g2p shared task (Gorman et al., 2020). It is an ensembled neural transition model that operates over edit actions and is trained via imitation learning (Makarov and Clematide, 2020).

Evaluation scripts were provided to compute word error rate (WER), the percentage of words for which the output sequence does not match the gold label.

Notwithstanding the baseline's strong prior performance and the amount of data available, the task proved to be challenging; the baseline system achieved development and test set WERs of 45.13 and 41.94, respectively. We discuss possible reasons for this below.

4https://en.wiktionary.org/
5See https://github.com/sigmorphon/2021-task1 for fuller details on data formatting and processing.


2.1 Data-related challenges

Wiktionary is an open, collaborative, public effort to create a free dictionary in multiple languages. Anyone can create an account and add or amend words, pronunciations, etymological information, etc. As with most user-generated content, this is a noisy method of data creation and annotation.

Even setting aside the theory-laden question of when or whether a given word should be counted as English,6 the open nature of Wiktionary means that speakers of different variants or dialects of English may submit varying or conflicting pronunciations for sets of words. For example, some transcriptions indicate that the users who input them had the cot/caught merger while others do not; in the training data "cot" is transcribed /k ɑ t/ and "caught" is transcribed /k ɔ t/, indicating a split, but "aughts" is transcribed as /ɑ t s/, indicating merger. There is also variation in the narrowness of transcription. For example, some transcriptions include aspiration on stressed-syllable-initial stops while others do not, cf. "kill" /kʰ ɪ l/ and "killer" /k ɪ l ɚ/.

6E.g., the training data included the arguably French word-pronunciation pair: embonpoint /ɑ b ɔ p w ɛ/

Typically the set of English phonemes is taken to number somewhere between 38 and 45, depending on variant/dialect (McMahon, 2002). In exploring the training data, we found a total of 124 symbols in the training set transcriptions, many of which only appeared in a small set (1–5) of transcriptions. To reduce the effect of this long tail of infrequent symbols, we normalized the training set.

The main source of symbols in the long tail was the variation in the broadness of transcription: vowels were sometimes but not always transcribed with nasalization before a nasal consonant, aspiration on word-initial voiceless stops was inconsistently indicated, phonetic length was occasionally indicated, etc. There were also some cases of erroneous transcription that we uncovered by looking at the lowest-frequency phones and the word-pronunciation pairs where they appeared. For instance, the IPA /j/ was transcribed as /y/ twice, the voiced alveolar approximant /ɹ/ was mistranscribed as the trill /r/ over 200 times, and we found a handful of issues where a phone was transcribed with a Unicode symbol not used in the IPA at all.

In most of these cases, the rare variant was at least two orders of magnitude less frequent than the common variant of the symbol. There was, however, one class of sounds where the variation was less dramatically skewed: the consonants /m/, /n/, and /l/ appeared in unstressed syllables following schwa (/əm/, /ən/, /əl/) roughly one order of magnitude more frequently than their syllabic counterparts (/m/, /n/, /l/), and we opted not to normalize these. If we had normalized the syllabic variants, it would have resulted in more consistent g2p output, but it would likely also have penalized our performance on the uncleaned test set.7 In the end, our training data contained 47 phones (plus end-of-sequence and UNK symbols for some models).
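The kind of symbol normalization described above can be sketched as a simple mapping over space-delimited phone strings. The mapping below covers only the two substitutions mentioned in the text (/y/ → /j/ and trill /r/ → approximant /ɹ/); the full normalization table used for the submission was larger and is not reproduced here.

```python
# Illustrative normalization of long-tail transcription symbols.
# These two entries come from examples in the text; the actual
# table used for the shared task submission was more extensive.
NORMALIZE = {
    "y": "j",   # non-IPA use of /y/ for the palatal glide
    "r": "ɹ",   # trill mistranscribed for the English approximant
}

def normalize_pron(pron):
    """Map each space-delimited phone to its canonical variant."""
    return " ".join(NORMALIZE.get(ph, ph) for ph in pron.split())

print(normalize_pron("r ɛ d"))  # → "ɹ ɛ d"
```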

3 Models

We trained and evaluated several models for this task, including publicly available, in-house, and custom-developed systems, along with various ensembling permutations. In the end, we submitted three sets of baseline-beating results. The organizers assigned sequential identifiers to multiple submissions (e.g. Dialpad-N); we include these in the discussion of our entries below for ease of subsequent reference.

3.1 The Dialpad model (Dialpad-2)

Dialpad uses a g2p system internally for scalable generation of novel lexicon additions. We were motivated to enter this shared task as a means of assessing potential areas of improvement for our system; in order to do so, we needed to assess our own performance as a baseline.

This model is a simple majority-vote ensemble of 3 existing publicly available g2p systems: Phonetisaurus (Novak et al., 2012), a WFST-based model; Sequitur (Bisani and Ney, 2008), a joint-sequence model trained via EM; and a neural sequence-to-sequence model developed at CMU as part of the CMUSphinx8

7Although the possibility also exists that one or more of our models would have found and exploited contextual cues that weren't obvious to us by inspection.

8https://cmusphinx.github.io


toolkit (see subsection 3.2). As Dialpad uses a proprietary lexicon and phoneset internally, we retrained all three models on the cleaned version of the shared task training data, retaining default hyperparameters and architectures.

In the end, this ensemble achieved a test set WER of 41.72, narrowly beating the baseline (results are discussed in more depth in Section 4).

3.2 A strong standalone model: CMUSphinx g2p-seq2seq (Dialpad-3)

CMUSphinx is a set of open systems and tools for speech science developed at Carnegie Mellon University, including a g2p system.9 It is a neural sequence-to-sequence model (Sutskever et al., 2014) that is Transformer-based (Vaswani et al., 2017), written in TensorFlow (Abadi et al., 2015). A pre-trained 3-layer model is available for download, but it is trained on a dictionary that uses ARPABET, a substantially different phoneset from the IPA used in this challenge. For this reason, we retrained a model from scratch on the cleaned version of the training data.

This model achieved a test set WER of 41.58, again narrowly beating the baseline. Interestingly, this outperformed the Dialpad model which incorporates it, suggesting that Phonetisaurus and Sequitur add more noise than signal to predicted outputs, to say nothing of increased computational resources and training time. More generally, this points to the CMUSphinx seq2seq model as a simple and strong baseline against which future g2p research should be assessed.

3.3 A large ensemble (Dialpad-1)

In the interest of seeing what results could be achieved via further naive ensembling, our final submission was a large ensemble, comprising two variations on the baseline model, the Dialpad-2 ensemble discussed above, and two additional seq2seq models, one using LSTMs and the other Transformer-based. The latter additionally incorporated a sub-word extraction method designed to bias a model's input-output mapping toward "good" grapheme-phoneme correspondences.

9https://github.com/cmusphinx/g2p-seq2seq

The method of ensembling for this model is word-level majority-vote ensembling. We select the most common prediction when there is a majority prediction (i.e. one prediction has more votes than all of the others). If there is a tie, we pick the prediction that was generated by the best standalone model with respect to each model's performance on the development set.

This collection of models achieved a test set WER of 37.43, a 10.75% relative reduction in WER over the baseline model. As shown in Table 1, although a majority of the component models did not outperform the baseline, there was sufficient agreement across different examples that a simple majority voting scheme was able to leverage the models' varying strengths effectively. We discuss the components and their individual performance below and in Section 4.
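The voting scheme just described can be sketched as follows. This is an illustrative reading, not the submitted implementation; `predictions` and `dev_ranking` are hypothetical names, and the tie-break picks the best-ranked model whose output is among the tied candidates.

```python
from collections import Counter

def majority_vote(predictions, dev_ranking):
    """Word-level majority voting over component model outputs.
    `predictions` maps model name -> predicted phone string;
    `dev_ranking` lists models best-first by dev-set WER and is
    used only to break ties."""
    counts = Counter(predictions.values())
    best_count = max(counts.values())
    winners = {p for p, c in counts.items() if c == best_count}
    if len(winners) == 1:  # strict majority: one prediction out-votes all others
        return winners.pop()
    # Tie: fall back to the prediction of the best standalone model.
    for model in dev_ranking:
        if predictions[model] in winners:
            return predictions[model]

preds = {"A": "k æ t", "B": "k æ t", "C": "k ɑ t"}
print(majority_vote(preds, ["C", "A", "B"]))  # → "k æ t"
```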

3.3.1 Baseline variations

The "foundation" of our ensemble was the default baseline model (Makarov and Clematide, 2018), which we trained using the raw data and default settings in order to reflect the baseline performance published by the organizers. We included this in order to individually assess the effect of additional models on overall performance.

In addition to this default base, we added a larger version of the same model, for which we increased the number of encoder and decoder layers from 1 to 3, and increased the hidden dimensions from 200 to 400.

3.3.2 biLSTM+attention seq2seq

We conducted experiments with an RNN seq2seq model, comprising a biLSTM encoder, LSTM decoder, and dot-product attention.10 We conducted several rounds of hyperparameter optimization over layer sizes, optimizer, and learning rate. Although none of these models outperformed the baseline, a small network (16-d embeddings, 128-d LSTM layers) proved to be efficiently trainable (2 CPU-hours) and improved the ensemble results, so it was included.

10We used the DyNet toolkit (Neubig et al., 2017) for these experiments.


3.3.3 PAS2P: Pronunciation-assisted sub-words to phonemes

Sub-word segmentation is widely used in ASR and neural machine translation tasks, as it reduces the cardinality of the search space over word-based models and mitigates the issue of OOVs. The use of sub-words for g2p tasks has been explored; e.g., Reddy and Goldsmith (2010) develop an MDL-based approach to extracting sub-word units for the task of g2p. Recently, a pronunciation-assisted sub-word model (PASM) (Xu et al., 2019) was shown to improve the performance of ASR models. We experimented with pronunciation-assisted sub-words to phonemes (PAS2P), leveraging the training data and a reparameterization of the IBM Model 2 aligner (Brown et al., 1993) dubbed fast_align (Dyer et al., 2013).11

The alignment model is used to find an alignment of sequences of graphemes to their corresponding phonemes. We follow a process similar to Xu et al. (2019) to find the consistent grapheme-phoneme pairs and to refine the pairs for the PASM model. We also collect grapheme sequence statistics and marginalize them by summing up the counts of each type of grapheme sequence over all possible types of phoneme sequences. These counts are the weights of each sub-word sequence.

Given a word and the weights for each sub-word, the segmentation process is a search problem over all possible sub-word segmentations of that word. We solve this search problem by building weighted FSTs12 of a given word and the sub-word vocabulary, and finding the best path through this lattice. For example, the word "thoughtfulness" would be segmented by PASM as "th_ough_t_f_u_l_n_e_ss", and this would be used as the input in the PAS2P model rather than the full sequence of individual graphemes.

Finally, the PAS2P transducer is a Transformer-based sequence-to-sequence model trained using the ESPnet end-to-end speech processing toolkit (Watanabe et al., 2018), with pronunciation-assisted sub-words as inputs and phones as outputs. The

11https://github.com/clab/fast_align
12We use Pynini (Gorman, 2016) for this.

model has 6 layers of encoder and decoder with 2048 units, and 4 attention heads with 256 units. We use dropout with a probability of 0.1 and label smoothing with a weight of 0.1 to regularize the model. This model achieved WERs of 44.84 and 43.40 on the development and test sets, respectively.

4 Results

Our main results are shown in Table 1, where we show both dev and test set WER for each individual model in addition to the submitted ensembles. In particular, we can see that many of the ensemble components do not beat the baseline WER, but nonetheless serve to improve the ensembled models.

Model                      dev    test
Dialpad-3                  43.30  41.58
PAS2P                      44.84  43.40
Baseline (large)           44.99  41.65
Baseline (organizer)       45.13  41.94
Phonetisaurus              45.44  43.88
Baseline (raw data)        45.92  41.70
Sequitur                   46.69  43.86
biLSTM seq2seq             47.89  44.05
Dialpad-2                  43.83  41.72
Dialpad-1                  40.12  37.43

Table 1: Results for components of ensembles, and submitted models/ensembles (bolded).

5 Additional experiments

We experimented with different ensembles and found that incorporating models with different architectures generally improves overall performance. In the standalone results, only the top three models beat the baseline WER, but adding additional models with higher WER than the baseline continues to reduce overall WER. Table 2 shows the effect of this progressive ensembling, from our top-3 models to our top-7 (i.e. the ensemble for the Dialpad-1 model).

5.1 Edit distance-based voting

In addition to varying our ensemble sizes and components, we investigated a different ensemble voting scheme, in which ties are broken using edit distance when there is no 1-best majority option. That is, in the event of


Model                        dev    test
Ensemble-top3                41.10  39.71
Ensemble-top4                40.74  38.89
Ensemble-top5                40.50  38.12
Ensemble-top6                40.31  37.69
Ensemble-top7 (Dialpad-1)    40.12  37.43

Table 2: Progressive ensembling results, with top-performing components.

a tie, instead of selecting the prediction made by the best standalone model (our usual tie-breaking method), we select the prediction that minimizes edit distance to all other predictions that have the same number of votes. The idea of this method is to maximize sub-word-level agreement. Although this method did not show clear improvements on the development set, we found after submission that it narrowly but consistently outperformed the top-N ensembles on the test set (see Table 3).
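The edit-distance tie-break can be sketched as follows, assuming token-level Levenshtein distance over space-delimited phone strings; the function names are ours and this is an illustration of the scheme, not the submitted code.

```python
from collections import Counter

def edit_distance(a, b):
    """Token-level Levenshtein distance between two phone sequences,
    computed with a single rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def ed_tiebreak(votes):
    """Among predictions tied for the most votes, return the one with
    minimal summed edit distance to the other tied candidates."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [p for p, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    return min(tied, key=lambda p: sum(edit_distance(p.split(), q.split())
                                       for q in tied if q != p))
```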

Model           dev    test
ED-Dialpad-3    43.76  41.70
ED-top3         41.24  39.40
ED-top4         40.62  38.48
ED-top5         40.50  37.69
ED-top6         40.28  37.50
ED-top7         40.21  37.31

Table 3: Results for ensembling with edit-distance tie-breaking.

6 Error analysis

We conducted some basic analyses of the Dialpad-1 submission's patterns of errors, to better understand its performance and identify potential areas of improvement.13

6.1 Oracle WER

We began by calculating the oracle WER, i.e. the theoretical best WER that the ensemble could have achieved if it had selected the correct/gold prediction every time it was present in the pool of component model predictions for a given input. The Dialpad-1 system's oracle WERs on the dev and test sets were 25.12 and 23.27, respectively (cf. 40.12 and 37.43 actual).

13We are grateful to an anonymous reviewer for suggesting that this would strengthen the paper.
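Oracle WER as defined above counts a word as correct if any component model produced the gold pronunciation for it. A minimal sketch, with hypothetical data:

```python
def oracle_wer(gold, component_preds):
    """Oracle WER: a word is an error only if NO component model
    produced the gold pronunciation for it.

    gold: list of gold phone strings
    component_preds: one list of model predictions per word
    """
    errors = sum(g not in preds
                 for g, preds in zip(gold, component_preds))
    return 100.0 * errors / len(gold)

# Hypothetical pool of 3 model outputs for 2 words: the gold label
# for the first word is present in the pool, the second is not.
gold = ["k æ t", "d ɔ g"]
pool = [["k æ t", "k ɑ t", "k æ t"], ["d o g", "d ɑ g", "d o g"]]
print(oracle_wer(gold, pool))  # → 50.0
```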

These represent massive potential performance improvements (approx. 15% absolute, or 37% relative, WER reduction), and suggest that refinement of our output selection/voting method (perhaps via some kind of confidence weighting) could lead to much-improved results.

6.2 Data-related errors

We also investigated outputs for which none of our component models predicted the correct pronunciation, in hopes of finding some patterns of interest.

Many of the training data-related issues raised in Section 2.1 appeared in the dev and test labels as well. In some cases this led to high cross-component agreement, even on incorrect predictions. Our hope that subtle contextual cues might reveal patterns in the distribution of syllabic versus schwa-following liquids and nasals was not borne out; e.g., our ensemble was led astray on words like "warble", which had a labelled pronunciation of /w ɔ ɹ b l/, while all 7 of our models predicted /w ɔ ɹ b ə l/, a functionally non-distinct pronunciation. In addition, the previously mentioned issue of /ɹ/ being mistranscribed as /r/ affected our performance, e.g. with the word "unilateral", whose labelled pronunciation was /j u n ɪ l æ t ə r ə l/ instead of /j u n ɪ l æ t ə ɹ ə l/, which was again the pronunciation predicted by all 7 models. Finally, narrowness of transcription was also an issue that affected our performance on the dev and test sets, e.g. for words like "cloudy" /k ɫ a ʊ d i/ and "cry" /k ɹ a ɪ/, for which we predicted /k l a ʊ d i/ and /k ɹ a ɪ/, respectively. In the end, it seems that noisiness in the data was a major source of errors for our submissions.14

Aside from issues arising due to label noise, our systems also made some genuine errors that are typical of g2p models, mostly related to data distribution or sparsity. For example, our component models overwhelmingly predicted that "irreparate" (/ɪ ɹ ɛ p ə ɹ ə t/) should rhyme instead with "rate" (this "-ate-" /e ɪ t/ correspondence was overwhelmingly present in the training data), that "backache" (/b æ k e ɪ k/) must contain the affricate /tʃ/, that

14We nonetheless acknowledge the magnitude and challenge of the task of cleaning/normalizing a large quantity of user-generated data, and thank the organizers for the work that they did in this area.


“acres” (/e ɪ k ɚ z/) rhymes with “degrees”, and that “beret” has a /t/ sound in it. In each of these cases, there were either not enough samples in the training set to reliably learn the relevant grapheme-phoneme correspondence, or else a conflicting (but correct) correspondence was over-represented in the training data.

7 Conclusion

We presented and discussed three g2p systems submitted for the SIGMORPHON 2021 English-only shared sub-task. In addition to finding a strong off-the-shelf contender, we show that naive ensembling remains a strong strategy in supervised learning, of which g2p is a sub-domain, and that simple majority-voting schemes in classification can often leverage the respective strengths of sub-optimal component models, especially when diverse architectures are combined. Additionally, we provided more evidence for the usefulness of linguistically-informed sub-word modeling as an input transformation on speech-related tasks.

We also discussed additional experiments whose results were not submitted, indicating the benefit of exploring top-N model vs. ensemble trade-offs, and demonstrating the potential benefit of an edit-distance-based tie-breaking method for ensemble voting.

Future work includes further search for the optimal trade-off between ensemble size and performance, as well as additional exploration of the edit-distance voting scheme, and more sophisticated ensembling/voting methods, e.g. majority voting at the phone level on aligned outputs.

Acknowledgments

We are grateful to Dialpad Inc. for providing the resources, both temporal and computational, to work on this project.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Kyle Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 75–80, Berlin, Germany. Association for Computational Linguistics.

Kyle Gorman, Lucas F.E. Ashby, Aaron Goyzueta, Arya McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50, Online. Association for Computational Linguistics.

International Phonetic Association. 1999. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, Cambridge, U.K.

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4216–4221, Marseille.

Peter Makarov and Simon Clematide. 2018. Imitation learning for neural morphological string transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2877–2882, Brussels, Belgium. Association for Computational Linguistics.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 171–176, Online. Association for Computational Linguistics.

April McMahon. 2002. An Introduction to English Phonology. Edinburgh University Press, Edinburgh, U.K.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit.

Josef R. Novak, Nobuaki Minematsu, and Keikichi Hirose. 2012. WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49, Donostia–San Sebastián. Association for Computational Linguistics.

Sravana Reddy and John Goldsmith. 2010. An MDL-based approach to extracting subword units for grapheme-to-phoneme conversion. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 713–716, Los Angeles, California. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech 2018, pages 2207–2211.

Hainan Xu, Shuoyang Ding, and Shinji Watanabe. 2019. Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7110–7114.


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 148–153
August 5, 2021. ©2021 Association for Computational Linguistics

CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Variations on a Baseline

Simon Clematide and Peter Makarov
Department of Computational Linguistics
University of Zurich, Switzerland
[email protected] [email protected]

Abstract

This paper describes the submission by the team from the Department of Computational Linguistics, University of Zurich, to the Multilingual Grapheme-to-Phoneme Conversion (G2P) Task 1 of the SIGMORPHON 2021 challenge in the low and medium settings. The submission is a variation of our 2020 G2P system, which serves as the baseline for this year's challenge. The system is a neural transducer that operates over explicit edit actions and is trained with imitation learning. For this challenge, we experimented with the following changes: a) emitting phoneme segments instead of single-character phonemes, b) input character dropout, c) a mogrifier LSTM decoder (Melis et al., 2019), d) enriching the decoder input with the currently attended input character, e) parallel BiLSTM encoders, and f) an adaptive batch size scheduler. In the low setting, our best ensemble improved over the baseline; however, in the medium setting, the baseline was stronger on average, although improvements could be observed for certain languages.

1 Introduction

The SIGMORPHON Grapheme-to-Phoneme Conversion task consists of mapping a sequence of characters in some language into a sequence of whitespace-delimited International Phonetic Alphabet (IPA) symbols, which represent the pronunciation of this input character sequence (not necessarily a phonemic transcription, despite the name of the task) according to the language-specific conventions used in the English Wiktionary.1 The data was collected and post-processed by the WikiPron project (Lee et al., 2020). Post-processing removes stress and syllable markers and applies IPA segmentation for combining and modifier diacritics as

1https://en.wiktionary.org/

Lang.  Grapheme    Phoneme             Wiktionary
ice    persóna     pʰ ɛ r s o uː n a   /ˈpʰɛr.souːna/
fra    williams    w i l j a m z       /wi.ljamz/
bul    засичайки   z ɐ s i tʃ ə j k i  /zɐˈsitʃəjki/
kor    검출         k ɘː m tɕʰ u ɭ      [ˈkʌ(ː)mtɕʰuɭ]

Figure 1: Examples of the original G2P shared task data from four different languages and their pronunciation entries in Wiktionary.

well as contour information. See Figure 1 for the post-processed shared task entries and the original entries from the Wiktionary pronunciation section. For more information, we refer the reader to the shared task overview paper (Ashby et al., 2021).

In the low and medium data settings, the 2021 SIGMORPHON multilingual G2P challenge features ten different languages from various phylogenetic families and written in different scripts. The low setting comes with 800 training, 100 development and 100 test examples. In the medium setting, the data splits are 10 times larger. Although it is permitted to use external resources for the medium setting, all our models used exclusively the official training material.

Our system is a neural transducer with pointer network-like monotonic hard attention (Aharoni and Goldberg, 2017) that operates over explicit character edit actions and is trained with imitation learning (Daume III et al., 2009; Ross et al., 2011; Chang et al., 2015). It is an adaptation of our type-level morphological inflection generation system that proved its data efficiency and performance in the SIGMORPHON 2018 shared task (Makarov and Clematide, 2018). G2P shares many similarities with traditional morphological string transduction: the changes are mostly local and often simple, depending on how closely the spelling of a language reflects pronunciation. For most languages, a substantial part of the work is actually


[Figure 2: a single-state FST with self-loop transitions Σ:ε / p(DEL(Σ)), ε:Ω / p(INS(Ω)), and Σ:Ω / p(SUB(Σ, Ω)), and final weight p(#)]

Figure 2: Stochastic edit distance (Ristad and Yianilos, 1998): A memoryless probabilistic FST. Σ and Ω stand for any input and output symbol, respectively. Transition weights are to the right of the slash and p(#) is the final weight.

applying character-by-character substitutions. An extreme case is Georgian, which features an almost deterministic one-to-one mapping between graphemes and IPA segments that can be learned almost perfectly from little training data.2

The main goal of our submission was to test whether last year's system, which is the baseline for this year's G2P challenge, already exhausts the potential of its architecture, or whether changes to the output representation (IPA segments vs. IPA Unicode codepoints; input character dropout), to the LSTM decoder (the mogrifier steps and the additional input of the attended character), to the BiLSTM encoder (parallel encoders), or to other hyper-parameter settings (adaptive batch size) can improve the results without replacing the LSTM-based encoder/decoder setup by a Transformer-based architecture (see e.g. Wu et al. (2021) for Transformer-based state-of-the-art results).

2 Model description

The model defines a conditional distribution over substitution, insertion and deletion edits p_θ(a | x) = ∏_{j=1}^{|a|} p_θ(a_j | a_{<j}, x), where x = x_1 … x_{|x|} is an input sequence of graphemes and a = a_1 … a_{|a|} is an edit action sequence. The output sequence of IPA symbols y is deterministically computed from x and a. The model is equipped with an LSTM decoder and a bidirectional LSTM encoder (Graves and Schmidhuber, 2005). At each decoding step j, the model attends to a single grapheme x_i. The attention steps monotonically through the input sequence, steered by the edits that consume input (e.g. a deletion shifts the attention to the next grapheme x_{i+1}).
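The deterministic mapping from an input x and an edit action sequence a to the output y can be illustrated with a minimal interpreter. This is a sketch under our own simplifying assumptions (a tuple encoding of actions, a three-way SUB/INS/DEL inventory); the actual system's action set and alignment logic are richer.

```python
def apply_actions(x, actions):
    """Deterministically compute the output y from input graphemes x
    and an edit action sequence. Attention starts at x[0]; SUB and
    DEL consume one input grapheme, INS does not. Actions are
    (op, symbol) tuples; DEL carries no symbol."""
    y, i = [], 0
    for op, *sym in actions:
        if op == "SUB":
            y.append(sym[0])
            i += 1          # consume the attended grapheme
        elif op == "INS":
            y.append(sym[0])
        elif op == "DEL":
            i += 1          # consume without emitting
    assert i == len(x), "all input graphemes must be consumed"
    return y

# Example: Russian кит mapped to the phone sequence k j i t.
print(apply_actions(list("кит"),
                    [("SUB", "k"), ("INS", "j"), ("SUB", "i"), ("SUB", "t")]))
```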

2Even a reduced training set of only 100 items allows a single model to achieve over 90% accuracy on the Georgian test set.

The imitation learning algorithm relies on an expert policy for suggesting intuitive and appropriate character substitution, insertion and deletion actions. For instance, for the data sample кит ↦ /kjit/ (Russian: "whale"), we would like the following most natural edit sequence to attain the lowest cost: SUBS[k], INS[j], SUBS[i], SUBS[t]. The cost function for these actions is estimated by fitting a Stochastic Edit Distance (SED) model (Ristad and Yianilos, 1998), a memoryless weighted finite-state transducer shown in Figure 2, on the training data. The resulting SED model is integrated into the expert policy, the SED policy, which uses Viterbi decoding to compute optimal edit action sequences for any point in the action search space: given a transducer configuration of partially processed input, find the best edit actions to generate the remaining target sequence suffix. During training, an aggressive exploration schedule p_sampling(i) = 1 / (1 + exp(i)), where i is the training epoch number, exposes the model to configurations sampled by executing edit actions from the model. For an extended description of the SED policy and IL training, we refer the reader to last year's system description paper (Makarov and Clematide, 2020).

2.1 Changes to the baseline model

This section describes the changes that we implemented in our submissions.

IPA segments vs. IPA Unicode characters: Emitting an IPA segment in one action (including its whitespace delimiter), e.g. SUBS[kj•]3 for the Russian example from above, instead of producing the same output by three actions SUBS[k], INS[j], INS[•], reduces the number of action predictions (and potential errors) considerably, which is beneficial. On the other hand, this might lead to larger action vocabularies and sparse training distributions. Therefore, we experimented with character (CHAR) and IPA segment (SEG) edit actions in our submission. Table 1 shows statistics on the resulting vocabulary sizes if CHAR or SEG actions are used. Some caution is needed, though, because some segments might appear only once in the training data; e.g., English has an IPA segment sː that appears only in the word "psst".

Input character dropout: To prevent the model from memorizing the training set and to force it to learn about syllable contexts, we randomly replace

3 • denotes the whitespace symbol.


S  Language    NFD<    SEG   CNFC  CNFD

L  ady          0.5%    67    37    37
L  gre          4.3%    33    33    33
L  ice         30.3%    60    36    36
L  ita          0.8%    32    29    29
L  khm          0.5%    47    36    34
L  lav         12.4%    73    51    36
L  mlt latn     9.0%    41    29    29
L  rum          0.3%    45    31    31
L  slv          4.3%    48    38    30
L  wel sw       2.4%    43    37    37
M  arm e        0.0%    54    31    31
M  bul          3.5%    46    34    34
M  dut          0.8%    49    39    39
M  fre          0.1%    39    36    36
M  geo          0.0%    33    27    27
M  hbs latn     3.7%    63    43    33
M  hun         42.5%    66    37    37
M  jpn hira    36.1%    64    42    39
M  kor         99.8%    60    46    46
M  vie hanoi   88.2%    49    44    44
H  eng us       0.0%   124    83    80
   Average     16.2%   54.1  39.0  37.0

Table 1: Statistics on Unicode normalization for low (L), medium (M), and high (H) settings (column S). Column NFD< specifies the percentage of training items where NFD-normalized graphemes had a smaller length difference to the phonemes than in NFC normalization. Column SEG gives the vocabulary size of IPA segments (the counts are the same for NFC and NFD). Column CNFC reports the phoneme vocabulary size in NFC Unicode characters (CHAR) and CNFD in NFD.

an input character with the UNK symbol according to a linearly decaying schedule.4
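The linearly decaying schedule can be sketched in a few lines, using the values stated in footnote 4 (50% at the start, reduced to a 1% floor over 10 epochs):

```python
def unk_prob(epoch: int, p_start: float = 0.5, p_min: float = 0.01,
             decay_epochs: int = 10) -> float:
    """Linearly decaying probability of replacing an input character
    with the UNK symbol; values from footnote 4 of the paper."""
    if epoch >= decay_epochs:
        return p_min
    return p_start - (p_start - p_min) * epoch / decay_epochs
```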

Mogrifier LSTM decoder: Mogrifier LSTMs (Melis et al., 2019) iteratively and mutually update the hidden state of the previous time step with the current input before feeding the modified hidden state and input into a standard LSTM cell. On language modeling tasks with smaller corpora, this technique closed the gap between LSTM and Transformer models. We apply a standard mogrifier with 5 rounds of updates in our experiments. We expect the mogrifier decoder to profit from IPA segmentation because in this setup the decoder mogrifies neighboring IPA phoneme segments and not space

4 For all experiments, we start with a probability of 50% for UNKing a character in a word and reduce this rate over 10 epochs to a minimal probability of 1%. Light experimentation on a few languages led to this cautious setting, which might leave room for further improvement.

and IPA characters.

Enriching the decoder input with the currently attended input character: The auto-regressive decoder of the baseline system uses the LSTM decoder output of the previous time step and the BiLSTM-encoded representation of the currently attended input character as input. Intuitively, by feeding the input character embedding directly into the decoder (as a kind of skip connection), we want to liberate the BiLSTM encoder from transporting the hard attention information to the decoder, thereby motivating the sequence encoder to focus more on the contextualization of the input character.

Multiple parallel BiLSTM encoders: Convolutional encoders typically use many convolutional filters for representation learning, and Transformer encoders similarly feature multi-head attention. Using several LSTM encoders in parallel has been proposed by Zhu et al. (2017) for language modeling and translation and was, e.g., also successfully used for named entity recognition (Zukov-Gregoric et al., 2018). Technically, the same input is fed through several smaller LSTMs, each with its own parameter set, and their outputs are then concatenated for each time step. The idea behind parallel LSTM encoders is to provide a more robust ensemble-style encoding with lower variance between models. For our submission, there was not enough time to systematically tune the input and hidden state sizes as well as the number of parallel LSTMs.
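The per-timestep concatenation of parallel encoders can be illustrated framework-independently; here each "encoder" is a toy function mapping a sequence to per-timestep feature vectors, standing in for the smaller BiLSTMs:

```python
def parallel_encode(sequence, encoders):
    """Feed the same input through several encoders and concatenate
    their per-timestep outputs. Each encoder maps a sequence of
    length T to T feature vectors (lists of floats)."""
    outputs = [enc(sequence) for enc in encoders]  # k encodings of shape T x d_k
    return [sum((out[t] for out in outputs), []) for t in range(len(sequence))]

# Two toy "encoders" with 2 features each yield a concatenated
# 4-dimensional state per input character:
enc_a = lambda seq: [[float(ord(c)), 0.0] for c in seq]
enc_b = lambda seq: [[0.0, float(len(seq))] for c in seq]
states = parallel_encode("кит", [enc_a, enc_b])
assert len(states) == 3 and all(len(s) == 4 for s in states)
```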

Adaptive batch size scheduler: We combine the ideas of “Don't Decay the Learning Rate, Increase the Batch Size” (Smith et al., 2017) and cyclical learning schedules by dynamically enlarging or reducing the batch size according to development set accuracy: Starting with a defined minimal batch size threshold m, the batch size for the next epoch is set to ⌊m − 0.5⌋ if the development set performance improved, or ⌊m + 0.5⌋ otherwise.5 If a predefined maximum batch size is reached, the batch size is reset in one step to the minimum threshold. The motivation for the reset comes from empirical observations that going back to a small batch size can help overcome local optima. With larger training sets, we subsample the training sets per epoch randomly in order to have a more dynamic behavior.6
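A sketch of this scheduler, under one plausible reading of the floor rule in which m is kept as a float state that is nudged by 0.5 per epoch and floored to obtain the integer batch size (the class name and clamping at the minimum are our assumptions):

```python
import math

class AdaptiveBatchScheduler:
    """Shrink the batch size when dev accuracy improves, grow it
    otherwise, and reset to the minimum once the maximum is exceeded.
    One plausible reading of the floor(m +/- 0.5) rule, with m kept
    as a float state."""

    def __init__(self, min_size: int = 3, max_size: int = 10):
        self.min_size = min_size
        self.max_size = max_size
        self.m = float(min_size)

    def next_batch_size(self, dev_improved: bool) -> int:
        self.m += -0.5 if dev_improved else 0.5
        self.m = max(self.m, float(self.min_size))
        if math.floor(self.m) > self.max_size:
            self.m = float(self.min_size)  # cyclical reset in one step
        return math.floor(self.m)
```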

5 See also the recent discussion on learning rates and batch sizes by Wu et al. (2021).

6 The subsample size is set to 3,000 items per epoch in all our experiments.


2.2 Unicode normalization

For some writing systems, e.g. for Korean or Vietnamese, applying Unicode NFD normalization to the input has a great impact on the input sequence length and consequently on the G2P character correspondences. The decomposition of diacritics and other composing characters for all languages, as performed in the baseline, has the disadvantage of longer input sequences. We apply a simple heuristic to decide on NFD normalization based on a criterion for the minimum length distance between graphemes and phonemes: If more than 50% of the training grapheme sequences in NFD normalization have a smaller length difference compared to the phoneme sequence than their corresponding NFC variants, then NFD normalization is applied. See Table 1 for statistics, which indicate a preference for NFD for only 2 languages.
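The heuristic is easy to state as code with the standard library's `unicodedata` module; the function name and input format are our assumptions:

```python
import unicodedata

def prefer_nfd(pairs, threshold=0.5):
    """Apply NFD normalization to a language if more than `threshold`
    of its training items have a smaller grapheme/phoneme length
    difference under NFD than under NFC.
    `pairs` is a list of (grapheme_string, phoneme_list)."""
    wins = 0
    for graphemes, phonemes in pairs:
        nfc = unicodedata.normalize("NFC", graphemes)
        nfd = unicodedata.normalize("NFD", graphemes)
        if abs(len(nfd) - len(phonemes)) < abs(len(nfc) - len(phonemes)):
            wins += 1
    return wins / len(pairs) > threshold

# Korean Hangul syllables decompose into jamo under NFD, bringing the
# grapheme length much closer to the phoneme count:
assert prefer_nfd([("\ud55c", ["h", "a", "n"])])  # 한 -> /h a n/
```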

3 Submission details

Modifications such as mogrifier LSTMs, additional input character skip connections, or parallel encoders increase the number of model parameters and make it difficult to compare the baseline system directly with its variants. Additionally, we did not have enough time before the submission to systematically explore and fine-tune for the best combination of model modifications and hyper-parameters. In the end, after some light experimentation, we had to stick to settings that might not be optimal.

We train separate models for each language on the official training data and use the development set exclusively for model selection. As beam decoding for mogrifier models sometimes suffered compared to greedy decoding, we built all ensembles from greedy model predictions. Like the baseline system (B), we train the SED model for 10 epochs, use one-layer LSTMs, hidden state dimension 200 for the decoder LSTMs and action embedding dimension 100. For the low (L) and medium (M) setting, we have the following specific hyper-parameters:

• patience: 12 (B), 24 (L), 18 (M)
• maximal epochs: 60 (B), 80 (L/M)
• minimal batch size:7 3 (L), 5 (M)
• maximal batch size: 10 (L/M)
• character embedding dimension:8 100 (B), 50 (L/M)
• LSTM encoder hidden state dimension: 200 (B), 300 (L/M), divided by 6 parallel encoders

7 The baseline system's batch size is 5.

8 The motivation for lowering the character embedding size comes from adding the input character to the mogrifier decoder LSTM, which increases the parameter size for each of the 5 update weight matrices.

We submit 3 ensemble runs for the low setting:
CLUZH-1: 15 models with CHAR input,
CLUZH-2: 15 models with SEG input,
CLUZH-3: 30 models with CHAR or SEG input.

We submit 4 ensemble runs for the medium setting:
CLUZH-4: 5 models with CHAR input,
CLUZH-5: 10 models with SEG input,
CLUZH-6: 5 models with SEG input,
CLUZH-7: 15 models with CHAR or SEG input.

Due to a configuration error, medium results were actually computed without two add-ons: mogrifier LSTMs and the additional input character. In post-submission experiments, we computed runs that enabled these features and report their results as well (CLUZH-4m/5m).

4 Results and discussion

Table 2 shows a comparison of results for the low setting. We report the development and test set average word error rate (WER) performance to illustrate the sometimes dramatic differences between these sets (e.g. Greek). Both runs containing CHAR action emitting models (CLUZH-1, CLUZH-3) have second-best results (the best system reaches 24.1). The SEG models with IPA segmentation actions excel on some languages (Adyghe, Latvian), but fail badly on Slovene and Maltese. Only for Romanian and Italian do we see an improvement for the 30-strong mixed ensemble. The expectation that the size difference between the SEG and CHAR vocabulary correlates with language-specific performance differences cannot be confirmed given the numbers in Table 1. E.g., Latvian features 73 different IPA segments but only 51 IPA characters; still, the SEG variant shows only 49% WER.

Table 3 shows a comparison of results for the medium setting. We report selected development and test set average performance to illustrate that also in this larger setting, the expectation of a slightly higher development set performance does not always hold (e.g. Korean or Japanese). On the other hand, Bulgarian and Dutch have a sharp increase in errors on the test set compared to the development set. The comparison between runs with the mogrifier LSTM decoder and the attended character input (CLUZH-Nm) or without (C-N) suggests that these changes are not beneficial. In the medium setting, C-4 (CHAR) and C-6 (SEG) can be directly


      CLUZH-1 (CHAR)        CLUZH-2 (SEG)         C-3    OUR BASELINE          BSL    Other
LNG   dev   test  sd   E    dev   test  sd   E    E      dev   test  sd   E    test   test
ady   25.0  27.8  3.3  24   25.6  26.2  1.8  22   22     26    25.2  2.8  21   22     22
gre    6.5  22.2  2.3  20    5.1  22.8  2.8  22   20      5    26.0  3.3  25   21     21
ice   14.8  12.4  2.4  10   16.1  14.5  2.2  12   10     21    15.8  2.1  12   12     11
ita   24.5  27.0  2.2  23   24.4  26.3  3.2  24   21     25    22.7  3.5  19   19     20
khm   39.8  38.2  3.4  32   40.3  36.9  2.2  33   32     39    40.4  2.5  34   34     28
lav   47.2  53.7  2.8  53   46.9  55.3  3.7  49   49     44    56.5  2.2  54   55     49
mlt   17.0  18.0  2.4  12   19.7  21.2  2.9  16   14     23    21.8  5.1  17   19     18
rum   11.1  13.7  1.8  13   10.3  14.1  1.0  13   12     11    12.5  2.1  10   10     10
slv   46.4  56.4  2.7  50   48    60.2  3.4  59   55     44    54.2  2.1  51   49     47
wel   18.0  14.9  3.5  10   15.6  15.7  1.8  13   12     19    14.8  2.0  12   10     12
AVG   25.0  28.4  2.7  24.7 25.2  29.3  2.5  26.3 24.7   25.7  29.0  2.8  25.5 25.1   23.8

Table 2: Overview of the dev and test results in the low setting. C-3 is the CLUZH-3 ensemble. OUR BASELINE shows the results for our own run of the baseline configuration. They are different from the official baseline results (BSL) due to different random seeds. Column sd always reports the test set standard deviation. E means ensemble results.

      C-4   CLUZH-4m (CHAR)        C-5   CLUZH-5m (SEG)        C-5l  C-6   C-7   OUR BASELINE           BSL
LNG   E5    dev   test  sd   E5    E10   dev   test  sd   E10  E10   E5    E15   dev   test  sd   E10   E10
arm    7.1   5.4   7.9  0.7   6.4   6.6   5.1   7.2  0.5   6.2  7.1   6.6   6.4   5.8   7.8  0.7   6.5   7.0
bul   20.1  12.2  20.4  2.0  19.9  19.2  11.9  23.3  2.1  22.4 16.2  18.8  19.7  12.5  19.7  1.7  19.3  18.3
dut   15.0  13.1  18.3  1.2  14.8  14.9  12.4  16.8  0.6  14.6 14.5  15.6  14.7  13.1  17.7  1.3  14.3  14.7
fre    7.5   8.4   9.7  0.6   8.2   7.5   8.5   9.5  0.7   8.1  8.1   7.5   7.6   8.9   9.1  0.5   7.8   8.5
geo     .0    .0    .0   .0    .0    .0    .0    .0   .0    .0   .0    .0    .0    .0    .0   .0    .0    .0
hbs   38.4  43.2  44.5  1.1  39.1  35.6  42.4  44.3  1.5  36.8 35.7  37.0  35.3  39.1  38.9  1.2  33.6  32.1
hun    1.5   1.8   1.8  0.1   1.6   1.2   1.7   1.5  0.3   1.0  0.9   1.0   1.0   1.7   2.0  0.3   1.8   1.8
jpn    5.9   6.9   6.8  0.2   5.5   5.3   6.8   6.5  0.3   5.4  5.2   5.5   5.0   6.8   6.4  0.5   5.5   5.2
kor   16.2  21.3  18.6  0.7  17.4  16.9  19.6  18.3  0.8  16.2 16.1  17.2  16.3  20.4  18.9  0.8  16.5  16.3
vie    2.3   1.2   2.4  0.1   2.3   2.0   1.2   2.1  0.1   2.1  2.2   2.1   2.0   1.4   2.5  0.2   2.4   2.5
AVG   11.4  11.4  13.0  0.7  11.5  10.9  11.0  12.9  0.7  11.3 10.6  11.1  10.8  11.0  12.3  0.7  10.8  10.6

Table 3: Overview of the development and test results in the medium setting. C-N is the CLUZH-N ensemble. CLUZH-Nm runs use the mogrifier decoder and additional input character in the decoder (these are post-submission runs). C-5l uses a larger parameterization and reaches WER 10.60 (BSL: 10.64). OUR BASELINE shows the results for our own run of the baseline configuration. Boldface indicates best performance in official shared task runs; underline marks the best performance in post-submission configurations. Column sd always reports the test set standard deviation. En means n-strong ensemble results.

compared because they feature the same ensemble size: The results suggest that IPA segmentation (SEG) for higher resource settings (and the specific medium languages) seems to be slightly better than CHAR. C-5l is a post-submission run with a larger parametrization.9 This post-submission ensemble outperforms the baseline system by a small margin, but still struggles with Serbo-Croatian (hbs) compared to the official baseline results.

In a post-submission experiment on the high setting, we built a large10 5-strong SEG-based ensemble. It achieves an impressively low word error rate of 38.7, compared to the official baseline (41.94) and the best other submission (37.43).

Future work: Performance variance between different runs of our LSTM-based architecture makes it difficult to reliably assess the actual usefulness of the small architectural changes; extensive experimentation, e.g. in the spirit of Reimers and Gurevych (2017), is needed for that. One should also investigate the impact of the official data set splits: The observed differences between the development set and test set performance in the low setting for Slovene or Greek are extreme. Cross-validation experiments might help assess the true difficulty of the WikiPron datasets.

9 Three parallel encoders with 200 hidden units each; character embedding dimension of 200; no mogrifier; no input character added to the decoder.

10 Character embedding dimension: 200; action embedding dimension: 100; 10 parallel encoders with hidden state dimension 100; decoder hidden state dimension: 500; minimal batch size: 5; maximal batch size: 20; epochs: 200 (subsampled to 3,000 items); patience: 24; no mogrifier; no input character added to the decoder.

5 Conclusion

This paper presents the approach taken by the CLUZH team to solving the SIGMORPHON 2021 Multilingual Grapheme-to-Phoneme Conversion challenge. Our submission for the low and medium settings is based on our successful SIGMORPHON 2020 system, which is a majority-vote ensemble of neural transducers trained with imitation learning. We add several modifications to the existing LSTM architecture and experiment with IPA segment vs. IPA character action predictions. For the low setting languages, our IPA character-based run outperforms the baseline and ranks second overall. The average performance of segment-based action edits suffers from performance outliers for certain languages. For the medium setting languages, we note small improvements on some languages, but the overall performance is lower than the baseline. Using a mogrifier LSTM decoder and enriching the encoder input with the currently attended input character did not improve performance in the medium setting. Post-submission experiments suggest that network capacity for the submitted systems was too small. A post-submission run for the high setting shows considerable improvement over the baseline.

References

Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In ACL.

Lucas F.E. Ashby, Travis M. Bartley, Simon Clematide, Luca Del Signore, Cameron Gibson, Kyle Gorman, Yeonju Lee-Sikka, Peter Makarov, Aidan Malanoski, Sean Miller, Omar Ortiz, Reuben Raff, Arundhati Sengupta, Bora Seo, Yulia Spektor, and Winnie Yan. 2021. Results of the Second SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion. In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Kai-Wei Chang, Akshay Krishnamurthy, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In PMLR, volume 37 of Proceedings of Machine Learning Research.

Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3).

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5).

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively multilingual pronunciation modeling with WikiPron. In LREC.

Peter Makarov and Simon Clematide. 2018. Imitation learning for neural morphological string transduction. In EMNLP.

Peter Makarov and Simon Clematide. 2020. CLUZH at SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Gábor Melis, Tomáš Kočiský, and Phil Blunsom. 2019. Mogrifier LSTM. CoRR, abs/1909.01792.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In EMNLP.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5).

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In PMLR, volume 15 of Proceedings of Machine Learning Research.

Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. 2017. Don't decay the learning rate, increase the batch size. CoRR, abs/1711.00489.

Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021. Applying the transformer to character-level transduction. In EACL.

Danhao Zhu, Si Shen, Xin-Yu Dai, and Jiajun Chen. 2017. Going wider: Recurrent neural network with parallel cells. CoRR, abs/1705.01346.

Andrej Zukov-Gregoric, Yoram Bachrach, and Sam Coope. 2018. Named entity recognition with parallel recurrent neural networks. In ACL.


Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 154–166, August 5, 2021. ©2021 Association for Computational Linguistics

What transfers in morphological inflection? Experiments with analogical models

Micha Elsner
Department of Linguistics
The Ohio State University
[email protected]

Abstract

This paper investigates how abstract processes like suffixation can be learned from morphological inflection task data using an analogical memory-based framework. In this framework, the inflection target form is specified by providing an example inflection of another word in the language. This model is capable of near-baseline performance on the SigMorphon 2020 inflection challenge. Such a model can make predictions for unseen languages, allowing one-shot inflection for natural languages and the investigation of morphological transfer with synthetic probes. Accuracy for one-shot transfer can be unexpectedly high for some target languages (88% in Shona) and language families (53% across Romance). Probe experiments show that the model learns partially generalizable representations of prefixation, suffixation and reduplication, aiding its ability to transfer. The paper argues that the degree of generality of these process representations also helps to explain transfer results from previous research.

1 Introduction

Morphological transfer learning has proven to be a powerful and effective technique for improving the performance of inflection models on under-resourced languages. The beneficial effects of transfer between source and target languages are known to be higher when the two are closely related (Anastasopoulos and Neubig, 2019) or typologically similar (Lin et al., 2019), mediated by the effect of script (Murikinati et al., 2020). But these effects are not always consistent; a variety of researchers report failure of transfer between closely related languages, or surprising successes with rather dissimilar ones (Sec 2). Pushing forward our understanding of these cases requires a more nuanced understanding of what is transferred by morphological transfer learning — that is, what

abstract representational concepts do inflection networks acquire and how are these shared across languages?

This is a difficult question to address in the standard framework for inflection (Kann and Schütze, 2016), in which morphosyntactic properties are closely tied to their specific exponents in a particular language as well as to the more abstract processes by which these exponents are applied. In such a network, it is difficult to test whether a generic suffixation operation has been learned, without reference to a particular form/feature mapping, for instance between the Maori passive feature PASS and the spelling of a particular passive suffix -tia. Suffixing as a generic operation is much more likely to be useful in another language than the individual suffix. This work decouples these representational pieces by performing inflection in an analogical, memory-based framework.1 In this framework, inflection instances do not have tags; rather, they include an instance of the desired mapping with respect to a different lemma (Figure 1). For example, to produce a passive Maori verb, the system takes an example verb with its passive and completes the four-part analogy: lemma : target :: exemplar lemma : exemplar target. The advantage of this redefinition of the task is that, in principle, the system does not need to learn anything about the individual affixes of a particular language, since these can be copied from the exemplar. Thus, it is possible to investigate how well such a system has learned a particular morphological process such as suffixation, which is expected to be present in a variety of languages.2

1 “Memory-based” has been used in the literature to refer to models with dynamic read-write memory (Graves et al., 2016), as well as KNN-like exemplar models which store a large number of examples in a static memory (van den Bosch and Daelemans, 1999). The current work is of the latter type.

2 Code available at: https://github.com/melsner/transformerbyexample.


Section 5 shows that this analogical framework for inflection can predict inflections across a variety of languages, demonstrating reasonable performance on the Sigmorphon 2020 multilingual benchmark (Vylomova et al., 2020). Section 6 describes one-shot learning experiments, performing language transfer without fine-tuning, and shows that for languages with concatenative affixes, one-shot transfer can be more effective than previously thought. Section 7 studies the system's ability to apply different types of morphological processes using constructed stimuli, showing that some configurations are capable of learning generic and transferable representations of processes including prefixing, suffixing and reduplication.

2 Related work

The overall positive effect of transfer learning is well established (McCarthy et al., 2019). Previous research has also evaluated how the choice of source language affects the performance in the target. While there is a robust trend for related languages to perform better, there are also many reports of exceptions. Kann (2020) finds that Hungarian is a better source for English than German and a better source for Spanish than Italian. She concludes that matching the target language's default affix placement (prefixing/suffixing) is important, and that agglutinative languages might be beneficial to transfer learning in general, but that genetic relatedness is not always a necessary or sufficient condition for effective transfer. Lin et al. (2019) also find that Hungarian and Turkish are good source languages for a surprising variety of unrelated targets. Rather than attribute this to agglutination, they propose that these languages lead to good transfer because of their large datasets and difficulty as tasks. Further puzzling results come from Anastasopoulos and Neubig (2019), who find that Italian data does not improve performance in closely related Ladin or Neapolitan3 once monolingual hallucinated data is available, and that Latvian is as good a source for Scots Gaelic as its relative Irish.

Previous analyses of transfer learning have attempted to differentiate the contributions of various parts of the model through factored vocabularies or ciphering (Kann et al., 2017b; Jin and Kann, 2017). These methods give disjoint representations to characters and tags in the source and target languages,

3 Regional Romance languages spoken in Northern and Southern Italy respectively.

or disrupt the mapping between them. Low-level correspondence between character sets is the most important factor for successful transfer in very low-resource settings, but models with disjoint character representations still succeed at transfer once at least 200 target examples are available, indicating that higher-level information is also transferred and contributes to performance.

Kann et al. (2017b) also represents a prior one-shot morphological learning experiment. Their setting is not quite the same as the one here; they assume access to a single inflected form in half the paradigm cells in their target language (Spanish), which are used to fine-tune a pretrained system. Because their system uses the conventional tag-based framework, they are capable of filling cells for which no example is available (zero-shot learning), while the memory-based system presented here is not. On the other hand, the current work does not use fine-tuning or require target-language data at training time. They evaluate inflection on both seen and unseen cells as a function of five source languages, four of which are in the Romance family. The best one-shot transfer within Romance scores 44% exact match, the worst 13%. Transfer from unrelated Arabic scores 0%. One-shot learning experiments in this work use a much larger set of languages, and although performance in the typical case is similar, the best results are substantially better.

The memory-based design of the current work is rooted in cognitive theories of morphological processing. The widely accepted dual route model of morphological processing postulates that the mind retrieves familiar inflected forms from memory as well as synthesizing forms from scratch (Milin et al., 2017; Alegre and Gordon, 1998; Butterworth, 1983). It has often been claimed that memorized forms of specific words are central to the structure of inflection classes (Bybee and Moder, 1983; Bybee, 2006; Jackendoff and Audring, 2020). In such a theory, production of a form of a rare lemma is guided by the memory of the appropriate forms of common ones. Additional evidence for this view comes from historical changes in which one word's paradigm is analogically remodeled on another's (Krott et al., 2001; Hock and Joseph, 1996, ch.5). Liu and Hulden (2020) evaluate a model very similar to this one (a transformer in which target forms of other words, which they term “cross-table” examples, are provided as part of the input). They


                                Lemma    Target specification   → Target
Standard inflection generation  waiata   V;PASS                 waiatatia
Memory-based                    waiata   karanga : karangatia   waiatatia
                                waiata   kaukau : kaukauria     waiatatia

Figure 1: Differing inputs for inflection models, eliciting the passive of the Maori verb waiata “sing”. The memory-based system relies on an exemplar verb as the target specifier; shown here are karanga “call”, which takes a matching suffix, and kaukau “swim”, which mismatches.

find that such examples are complementary to data hallucination and yield improved results in data-sparse settings. Some earlier non-neural models also rely on stored word forms (Skousen, 1989; Albright and Hayes, 2002).

3 Exemplar selection

The system uses instances generated as described in Figure 1, separating the lemma, exemplar lemma and exemplar form with punctuation characters. Each instance also contains two features indicating the language and language family of the example (e.g. LANG MAO, FAM AUSTRONESIAN).
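A sketch of how such an instance string might be assembled; the `:` and `::` separators follow the synthetic example given later in the paper, while the exact placement of the language and family tags is an assumption:

```python
def format_instance(lemma, ex_lemma, ex_form, lang, family):
    """Assemble one memory-based input: lemma, exemplar lemma and
    exemplar form separated by punctuation, plus language and
    language-family tag features (tag placement is hypothetical)."""
    return f"LANG_{lang} FAM_{family} {lemma}:{ex_lemma}::{ex_form}"

src = format_instance("waiata", "karanga", "karangatia",
                      "MAO", "AUSTRONESIAN")
assert "waiata:karanga::karangatia" in src
```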

The selection of the exemplar is critical to the model's performance. Ideally, the lemma and the exemplar inflect in the same way, reducing the inflection task to copying. But this is not always the case. For example, Maori verbs fall into inflection classes, as shown in Figure 1; when the exemplar comes from a different class than the lemma, copying will yield an invalid output, so the model has to guess which class the input belongs to.4

This paper presents experiments using two settings: In random selection, the exemplar lemma is chosen arbitrarily from the set of training lemma/form pairs for the appropriate language and cell. This makes the task difficult, but allows the model to learn to cope with the distribution of inputs it will face at test time. In similarity-based selection, each source lemma is paired with an exemplar for which the transductions are highly similar. This makes the task easy, but since it relies on access to the true target form, it can be used only for model training, not for testing.5 All models are

4 In cases of class-dependent syncretism, the model must also guess which cell is being specified. For instance, German feminine nouns do not inflect for case, but some masculine nouns do, so the combination of a masculine lemma and a feminine exemplar can yield an unsolvable prediction problem.

5 Within the training set, the same lemma/inflected form pair can appear as both an exemplar and a target instance; a reviewer speculates that this might allow the model to memorize lexically-specific outputs within the training set even when

evaluated using instances generated using random selection.

To perform similarity-based selection, each lemma is aligned with its target form in the training data in order to extract an edit rule (Durrett and DeNero, 2013; Nicolai et al., 2016). (For the first memory-based example in Figure 1, both words have the same edit rule -+tia.) The selected exemplar/form pair uses the same edit rule, if possible. During training, a lemma is allowed to act as its own exemplar, so that there is always at least one candidate. However, words in the test set must be given exemplars from the training set. If a cell in the test set does not appear in the training set, no prediction can be made; in this case, the system outputs the lemma. Extending the model to cover this case is discussed below as future work.6
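A toy version of edit rule extraction illustrates the idea; a longest-common-prefix split stands in for the alignment-based rules of the cited work, which handle stem changes more robustly:

```python
import os

def edit_rule(lemma: str, form: str):
    """Extract a simple suffix edit rule: (deleted suffix, added suffix).
    E.g. waiata -> waiatatia gives ('', 'tia'), i.e. the rule -+tia.
    A rough stand-in for the alignment-based rules cited in the paper."""
    prefix_len = len(os.path.commonprefix([lemma, form]))
    return (lemma[prefix_len:], form[prefix_len:])

assert edit_rule("waiata", "waiatatia") == ("", "tia")
assert edit_rule("karanga", "karangatia") == ("", "tia")
# An exemplar from a different class has a mismatching rule:
assert edit_rule("kaukau", "kaukauria") != edit_rule("waiata", "waiatatia")
```

Similarity-based selection then amounts to indexing training pairs by their rule and sampling an exemplar from the same bucket when one exists.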

4 Model design

The system uses the character-based transformer (Wu et al., 2020) as its learning model; this is a sequence-to-sequence transformer (Vaswani et al., 2017) tuned for morphological tasks, and serves as a strong official baseline for the Sigmorphon 2020 task. Moreover, transformers are known to perform well in the few-shot setting (Brown et al., 2020). All default hyperparameters7 match those of Wu et al. (2020).

As discussed in prior work (Anastasopoulos and Neubig, 2019; Kann and Schütze, 2017), it is important to pretrain the model to predispose it to copy strings. To ensure this, the system is trained on a synthetic dataset. Each synthetic instance is generated within a random character set. The instance consists of a random pseudo-lemma and pseudo-exemplar created by sampling word

using random selection. To avoid this issue, no training scores are reported in this paper.

6 In the SigMorphon 2020 datasets, this rarely occurs in practice. ≥ 99% of target cells are covered in all languages except Ingrian (98%), Evenki (96%), and notably Ludic (61%).

7 Including 4 layers, batches of 64, and the learning rate schedule.


lengths from the training word length distribution and then filling each one with random characters. With probability 50% the example is given a prefix; independently with probability 50% a suffix; independently with probability 10% an infix at a random character position. Prefixes and suffixes are random strings between 2-5 characters long and infixes are 1-2 characters long. (This means that, in some cases, no affix is added and the transformation is the identity, as occurs in cases of morphological syncretism.) An example such instance is mpienjmel:rbeaikkea::zlurbeaikkeaue with output zlumpienjmelue. The language tags for these examples indicate the kinds of affixation operations which were performed, for example LANG PREFIX SUFFIX; the family tag identifies them as SYNTHETIC. While this synthetic dataset is inspired by hallucination techniques (Anastasopoulos and Neubig, 2019; Silfverberg et al., 2017), note that these synthetic instances are not presented to the model as part of any natural language.
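The generation procedure can be sketched as follows. Two simplifications relative to the text are deliberate: a fixed lowercase ASCII alphabet and uniform word lengths replace the sampled random character sets and the training length distribution:

```python
import random
import string

def synthetic_instance(rng: random.Random, max_len: int = 8):
    """One synthetic pretraining instance: a pseudo-lemma and
    pseudo-exemplar receive the same randomly drawn affixes
    (50% prefix, 50% suffix, 10% infix), so the correct target
    can largely be copied from the exemplar."""
    def rand_str(lo, hi):
        return "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(rng.randint(lo, hi)))

    lemma, exemplar = rand_str(3, max_len), rand_str(3, max_len)
    prefix = rand_str(2, 5) if rng.random() < 0.5 else ""
    suffix = rand_str(2, 5) if rng.random() < 0.5 else ""
    infix = rand_str(1, 2) if rng.random() < 0.1 else ""

    def inflect(word):
        out = prefix + word + suffix
        if infix:  # shared infix, but at an independent random position
            pos = rng.randrange(len(out) + 1)
            out = out[:pos] + infix + out[pos:]
        return out

    return lemma, exemplar, inflect(exemplar), inflect(lemma)
```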

The SIGMORPHON 2020 data is divided into “development languages” (45 languages in 5 families: Austronesian, Germanic, Niger-Congo, Oto-Manguean and Uralic) and “surprise languages” (45 more languages, including some members of development families as well as unseen families). Data from all the “development languages”, plus the synthetic examples from the previous stage, is used to train a multilingual model, which is fine-tuned by family. Finally the family models are fine-tuned by language. During multilingual training and per-family tuning, the dataset is balanced to contain 20,000 instances per language; languages with more training instances than this are subsampled, while languages with fewer are upsampled by sampling multiple exemplars (with replacement) for each lemma/target pair. For the final language-specific fine-tuning stage, all instances from the specific language are used.
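The per-language balancing step can be sketched as below. This is a simplified illustration (the paper upsamples by drawing multiple exemplars per lemma/target pair; here instances are simply resampled with replacement), and the function name is not from the paper.

```python
import random

def balance_per_language(data_by_lang, target=20_000, seed=0):
    """Balance a multilingual training pool to `target` instances per
    language: subsample larger languages, upsample smaller ones."""
    rng = random.Random(seed)
    balanced = {}
    for lang, instances in data_by_lang.items():
        if len(instances) >= target:
            balanced[lang] = rng.sample(instances, target)  # subsample without replacement
        else:
            balanced[lang] = [rng.choice(instances) for _ in range(target)]  # upsample
    return balanced
```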

5 Fine-tuned results

This section shows the test results for fully fine-tuned models on the development languages. Table 1 shows the average exact match and standard deviation by language family. Full results are given in Appendix A. Tables also show the results of the official competition baseline which is closest to the current work, the character transformer (Wu et al., 2020) fine-tuned by language, TRM-SINGLE.

Because the results of exemplar-based models

Family             Random   Similarity  Base
Austronesian (4)   83 (13)  67 (21)     81 (18)
Germanic (10)      87 (10)  51 (16)     90 (9)
Niger-Congo (9)    98 (4)   94 (9)      97 (3)
Oto-Manguean (10)  82 (16)  39 (23)     86 (12)
Uralic (11)        92 (6)   46 (14)     93 (0.05)
Overall            89 (12)  57 (26)     90 (11)

Table 1: Fine-tuned accuracy scores for models trained with random and similarity-based selection, compared to the baseline. Number of languages in family and score standard deviation across languages in parentheses.

can vary based on the choice of exemplar, the system applies a simple post-process to compensate for unlucky choices: it runs each lemma with five randomly-selected exemplars and chooses the majority output.
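The majority-vote post-process amounts to the following sketch, where `predict` stands in for the trained transducer (both names are illustrative, not from the paper):

```python
import random
from collections import Counter

def majority_inflect(predict, lemma, exemplars, k=5, seed=0):
    """Run the model on `lemma` with k randomly chosen exemplars and
    return the most frequent output."""
    rng = random.Random(seed)
    outputs = [predict(lemma, rng.choice(exemplars)) for _ in range(k)]
    return Counter(outputs).most_common(1)[0][0]
```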

Neither model achieves the same performance as the baseline (90%), although the random-exemplar model (89%) comes quite close. The similar-exemplar model (57%) is clearly inferior due to its severe mismatch between training and test settings. Performance varies across language families. All models perform well in Niger-Congo, although the conference organizers state that data from these languages may have been biased toward regular forms in an unrepresentative way.8 The random-exemplar model is at or near baseline performance in Austronesian and Uralic, but falls further below baseline in Germanic and Oto-Manguean. Both of these families are characterized by complex inflection class structure in which randomly chosen exemplars are less likely to resemble the target for a given word.

The similar-exemplar model also performs poorly in Uralic. While some Uralic languages have inflection classes (Baerman, 2014), many (like Finnish) do not, but have complex systems of phonological alternations (Koskenniemi and Church, 1988). While the random-exemplar model can learn to compensate for these, the similar-exemplar model does not.

6 One-shot results

This section shows the results of one-shot learning. These experiments apply the multilingual and family models from the development languages to the surprise languages, without fine-tuning. For languages within development families, they use the appropriate family model; otherwise they use the

8 A Swahili speaker confirms that some forms in the data appear artificially over-regularized (Martha Johnson p.c.).


multilingual model. Thus, the model’s only access to information about the target language is via the provided exemplar.

Each experiment evaluates the results across five random exemplars per test instance (with replacement), but averages the results rather than applying majority selection. This computes the expected performance in the one-shot setting where only a single exemplar is available.
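The evaluation protocol can be sketched as follows; `predict` and the function name are placeholders, not the paper's code. Averaging hit rates over exemplar draws, rather than voting, estimates the expected single-exemplar accuracy.

```python
import random
from statistics import mean

def expected_one_shot_accuracy(predict, test_items, exemplars, k=5, seed=0):
    """Score each (lemma, gold) item under k exemplar draws (with
    replacement) and average, rather than taking a majority vote."""
    rng = random.Random(seed)
    per_item = []
    for lemma, gold in test_items:
        hits = [predict(lemma, rng.choice(exemplars)) == gold for _ in range(k)]
        per_item.append(mean(hits))
    return mean(per_item)
```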

Results are shown in Table 2. One-shot learning is not competitive with the baseline fine-tuned system in any language family, but has some capacity to predict inflections in all families. Performance is generally better in families for which related languages were present in development.

The system trained with random exemplars achieves its best results on Tajik (Iranian: tgk, score 89%), Shona (Niger-Congo: sna, score 75%)9, and Norwegian Nynorsk (Germanic: nno, score 42%). The system trained with similar exemplars achieves its best results on Shona (88%), Zarma (Songhay: dje, score 82%) and Tajik (79%). Notably, some of these high scores are achieved on languages that were difficult for the baseline systems; the score for Tajik beats the transformer baseline (56%), perhaps due to data sparsity, since baselines regularized using data hallucination perform better (93%).

Training with similar exemplars leads to clearly better results than random exemplars, a reversal of the trend observed with fine-tuning. This difference is particularly marked in Romance (53% average vs 5%). While the random-exemplar system is better at guessing what to do when the exemplar and target forms are divergent, this causes errors with unfamiliar languages. The system attempts to guess the correct inflection, rather than simply copying.

As an example, Table 3 shows an analysis of performance in Catalan (cat), selected because its results are fairly typical of the Romance family; the similar-exemplar system scores 53% while the random-exemplar system scores 12%. The table shows selected instances with different levels of exemplar match and mismatch. The first two, arrissar “curl” and disputar “discuss”, match their exemplars well and are good cases for copying. The random-exemplar model gets these both wrong, segmenting incorrectly in the first and adding a spurious character in the second. The next two, repetir

9 As stated above, the Niger-Congo datasets are artificialized and probably do not represent the real difficulty of the inflection task.

Family            Random   Similarity  Base
Germanic (3)      29 (13)  38 (22)     80 (13)
Niger-Congo (1)   75 (0)   88 (0)      100 (0)
Uralic (5)        21 (9)   28 (12)     76 (26)
Afro-Asiatic (3)  7 (3)    26 (18)     96 (3)
Algic (1)         2 (0)    14 (0)      68 (0)
Dravidian (2)     7 (7)    13 (3)      85 (9)
Indic (4)         4 (5)    4 (2)       98 (3)
Iranian (3)       35 (39)  34 (32)     82 (19)
Romance (8)       6 (4)    53 (19)     99 (1)
Sino-Tibetan (1)  21 (0)   9 (0)       84 (0)
Siouan (1)        13 (0)   13 (0)      96 (0)
Songhay (1)       21 (0)   82 (0)      88 (0)
Southern Daly     4 (0)    6 (0)       90 (0)
Tungusic (1)      28 (0)   27 (0)      57 (0)
Turkic (9)        7 (8)    19 (11)     96 (7)
Uto-Aztecan (1)   33 (0)   30 (0)      81 (0)
Overall           14 (18)  30 (25)     90 (15)

Table 2: One-shot accuracy scores for models trained with random and similarity-based selection, compared to the baseline. Number of languages in family and score standard deviation across languages in parentheses. Families represented in development above the line, surprise families below.

“repeat” and engolir “ingest”, are mismatched with exemplars from a different inflection class; both systems make incorrect predictions, but the similar-exemplar system preserves the suffixes while the random-exemplar system does not. Finally, in the last example llevar-se “get up”, the similar-exemplar model misinterprets the reflexive suffix -se as part of the verb stem, while the random-exemplar model fails to make any edit.

A more systematic analysis computes an alignment-based edit rule for each system prediction (King et al., 2020) and counts the unique rules used to form one-shot predictions in the Catalan development set. Over 37105 instances, the random-exemplar model applies 626 unique edit rules, 20 of which appear in correct predictions. The similar-exemplar model applies 3137 unique rules, 154 of them correctly. The greater variety of both correct and incorrect outputs from the similar-exemplar model demonstrates its preference for faithfulness to the exemplar rather than remodeling the output to fit language-specific constraints.
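A coarse version of such alignment-based rule extraction can be sketched as below. This is a simplification in the spirit of King et al. (2020), not their actual procedure: it anchors on the longest common substring and records how each edge of the lemma was rewritten.

```python
from difflib import SequenceMatcher

def edit_rule(lemma, prediction):
    """Extract a coarse edit rule from a lemma/prediction pair."""
    sm = SequenceMatcher(None, lemma, prediction)
    m = sm.find_longest_match(0, len(lemma), 0, len(prediction))
    prefix_edit = (lemma[:m.a], prediction[:m.b])                    # edge before the match
    suffix_edit = (lemma[m.a + m.size:], prediction[m.b + m.size:])  # edge after the match
    return prefix_edit, suffix_edit

def count_unique_rules(pairs):
    # Distinct rules over a set of (lemma, prediction) pairs.
    return len({edit_rule(l, p) for l, p in pairs})
```

For example, the Catalan pair arrissar → arrissarien yields the rule “add suffix -ien”, shared with other regular conditional forms.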

7 Synthetic transfer experiments

When transfer learning fails, it can be difficult to tell whether the system has failed to represent a general morphological process, or whether it misapplies what it has learned due to mismatched lexical/phonological triggers. Experiments on artificial data can probe what abstract processes the model


Lemma      Exemplar                Rand. Sel.  Sim. Sel.    Target
arrissar   posar : posarien        arrissaren  arrissarien  arrissarien
disputar   descriure : descriuria  disputarta  disputaria   disputaria
repetir    cremar : cremo          repetirer   repetio      repeteixo
engolir    forjar : forjava        engolire    engoliva     engolia
llevar-se  terminar : termino      llevar-se   llevor-se    llevo

Table 3: Development data from Catalan (Romance: cat) showing the outputs of two one-shot systems.

has learned to apply, the links between these processes and language families, and the environments in which they can operate.

A probing dataset is synthesized to model several morphological operations (Figure 2), including prefix/suffix affixation, reduplication and gemination. Affixation is typologically widespread (Bickel and Nichols, 2013) and appears in every development language on which the model was trained. Suffixation is more common in Germanic and Uralic; Oto-Manguean tonal morphology is also often represented via word-final diacritics.10 Prefixing is more common in the Niger-Congo family.

Reduplication appears in three of the four Austronesian development languages, Tagalog, Hiligaynon and Cebuano (WAL, 2013), but not in the Maori dataset provided. The probe language has partial reduplication of the first syllable, as found in Tagalog and Hiligaynon. Previous work with artificial data demonstrates that sequence-to-sequence learners can learn fully abstract representations of reduplication (Prickett et al., 2018; Nelson et al., 2020; Haley and Wilson, 2021), but it has not been previously shown that networks trained on real data do this in a transferable way. In one-shot language transfer, reduplication instances are actually ambiguous. Given an instance modi : :: gobu : gogobu, there are two plausible interpretations, reduplicative momodi and affixal gomodi. Thus, analysis of reduplicative instances can be informative about the model’s learned linkage between language family and typology.

Gemination is an inflectional process whereby a segment is lengthened to mark some morphological feature (Samek-Lodovici, 1992). The probe language geminates the last non-final consonant. None of the development languages have morphological gemination.

The probe languages use two alphabets: the first is a common subset of characters which appear in

10 No Unicode normalization was performed; Oto-Manguean tone diacritics are treated as characters (as are parts of the complex characters of the Indic scripts). The placement of these diacritics within the word varies from language to language.

at least half the languages of every development family.11 The second is a subset of Cyrillic characters intended to test transfer to a less-familiar orthography; a few Uralic development languages are written in Cyrillic. Each language has 90 random lemmas, sampled with the frames CVCV, CVCVC, CVCVCVC; affixal languages have 30 affixes of types VCV, CV, CVCV, plus 7 single-letter affixes. No probe lemma coincides with any real lemma, and no probe affix has frequency > 5% as a string prefix or suffix in any real language. Affixal languages contain an instance for every lemma/affix pair. Reduplication and gemination languages have one instance per lemma.
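Lemma sampling from the CV frames can be sketched as below, using the shared Latin character sets from footnote 11. The sampling loop is an illustrative reconstruction; the paper additionally filters out lemmas that coincide with real lemmas.

```python
import random

CONSONANTS = "mpbntdrlskgh"   # shared Latin consonants (footnote 11)
VOWELS = "aeiou"

def make_probe_lemmas(n=90, frames=("CVCV", "CVCVC", "CVCVCVC"), seed=0):
    """Sample n distinct probe lemmas from the CV frames (sketch)."""
    rng = random.Random(seed)
    realize = lambda frame: "".join(
        rng.choice(CONSONANTS if slot == "C" else VOWELS) for slot in frame)
    lemmas = set()
    while len(lemmas) < n:
        lemmas.add(realize(rng.choice(frames)))
    return sorted(lemmas)
```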

The model is prompted to inflect the probes as if they are members of each language family, and as members of a comparatively well-resourced language selected from those families, specifically Tagalog (tgl), German (deu), Mezquital Otomi (ote), Swahili (swa) and Finnish (fin), as well as the synthetic suffixing language used in pretraining (suff). In addition to checking whether the output matches, the table shows whether reduplicated instances have been correctly reduplicated (using a regular expression).
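The reduplication check can be implemented with a pattern like the one below. This is an illustrative reconstruction, not the paper's actual regular expression: it tests whether the output is the lemma's first CV syllable copied onto the lemma itself.

```python
import re

# First CV syllable over the shared Latin alphabet (footnote 11).
FIRST_CV = re.compile(r"^([mpbntdrlskgh][aeiou])")

def is_correct_reduplication(lemma, output):
    """True iff `output` prepends a copy of the lemma's first CV syllable."""
    m = FIRST_CV.match(lemma)
    return bool(m) and output == m.group(1) + lemma
```

On the ambiguous instance discussed above, this check accepts reduplicative momodi for modi but rejects the affixal reading gomodi.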

Table 4 shows the results. A comparison between the random-exemplar and similar-exemplar models confirms the hypothesis from above that random-exemplar models have less generalizable representations of morphological processes, especially prefixation and suffixation. While both models are capable of attaching affixes in the synthetic language, the random-exemplar model learns very language- and suffix-specific rules for applying these operations, leading to very low accuracy for copying generic affixes. Both models show less language-specific remodeling of affixes in the family-only setting than when the probes are labeled as part of a particular language; this effect is again more pronounced for the random-exemplar model.

Both models learn to reduplicate arbitrary CV syllables, but this process is mostly restricted to

11 Consonants mpbntdrlskgh, vowels aeiou.


Lemma: semet
Probe type     Exemplar        Target
Prefixing      kigu : igokigu  igosemet
Suffixing      kigu : kiguigo  semetigo
Reduplication  modi : momodi   sesemet
Gemination     bogu : boggu    semmet

Figure 2: Probe tasks illustrated for a single lemma.

Tagalog,12 with some generalization to Austronesian. Most other languages interpret reduplication instances as affixes.

Only the similar-exemplar model gets any gemination instances correct, and these primarily in Uralic.13 This is unsurprising, since the model was never trained with morphological gemination. It demonstrates that the model’s representations of morphological processes represent the input typology and are not simply artifacts of the transformer architecture. While Uralic does not have gemination as an independent morphological process, alternations involving geminates do occur in some paradigms; the NOM.PL of tikka “dart” is tikat.14

The model seems to have learned a little about gemination from this morphophonological process, but not a fully generalized representation.

Affixation remains relatively successful when using Cyrillic characters (suffixes more than prefixes), but for the most part, less so than with Latin characters, although in the random-exemplar model, Cyrillic suffixes are somewhat more accurate, probably due to less interference from language-specific knowledge. This substantiates the general finding (Murikinati et al., 2020) that transfer across scripts is more difficult than within-script. Cyrillic reduplication sees a much larger drop in accuracy. The difference is probably that simple affixation is phonologically uncomplicated, while reduplication requires phonological information about vowels and consonants.

8 Discussion

These experiments with real and synthetic transfer suggest some useful insights into the problematic findings of earlier transfer experiments. Why

12 The random-exemplar model has low accuracy for reduplication in Tagalog because it appends spurious Tagalog prefixes to the output, another example of a language-specific rule. However, the regular expression check confirms that reduplication is performed correctly.

13 Because of this poor performance, Cyrillic gemination was not tested.

14 See Silfverberg et al. (2021) for a fuller investigation of generalizable representations of gradation processes in Finnish noun paradigms.

is Hungarian so successful as a source language for unrelated targets? Kann (2020) suggests that it is its agglutinative nature. The results shown here offer some speculative support for this view: perhaps the relative segmentability of prototypically agglutinative languages (Plank, 1999) acts like the similar-exemplar setting in the memory-based model, giving the source model a general bias for concatenative affixation, unpolluted by too many lexical and phonological alternations. As reported here, such a model is a promising starting point for inflection in many non-agglutinative systems, such as Romance verbs, which nevertheless are strongly concatenative.

Where transfer between related languages fails, it is conjecturally possible that the source model representations of edit operations are too closely linked to particular phonological and lexical properties of the source. This is clearly shown in the synthetic transfer experiments, where generic suffixation fails in Germanic and Uralic despite these families being strongly suffixing, because the system has learned to remodel its outputs to conform too closely to source-language templates.

More broadly, the synthetic experiments show a link between language typology and learning of morphological processes, suggesting that language structure, not only language relatedness, is key to successful transfer: transfer of structural principles can lead to improvements even without cognate words or affixes. For instance, successful reduplication appears only in Austronesian and successful gemination only in Uralic. A promising direction for future work would be to replace the language family feature with a set of typological feature indicators such as WALS properties (WAL, 2013), which might help the model to learn faster in low-resource target languages.

Two other extensions might bring the memory-based model closer to the state of the art in supervised inflection prediction. First, although the SIGMORPHON 2020 datasets are balanced by paradigm cell, real datasets are Zipfian, with sparse coverage of cells (Blevins et al., 2017; Lignos and Yang, 2018). For languages with large paradigms, the model thus requires the capacity to fill cells for which no exemplar can be retrieved, perhaps using a variant of adaptive source selection (Erdmann et al., 2020; Kann et al., 2017a). Second, the similar-exemplar model performs better in one-shot transfer experiments, but is hampered in the supervised setting by train-test mismatch. Selecting training exemplars using a classifier which could also be used at inference time would reduce this mismatch. These experiments are left for future work.

Model  Fam/Lg.      Pref  Pref (Cyrl)  Suff  Suff (Cyrl)  Redup.   Redup. (Cyrl)  Gem.
Rand.  austro       62    36           26    38           0 (10)   0              0
       austro/tgl   0     1            0     0            28 (90)  3 (7)          0
       ger          1     0            25    36           0 (3)    0              0
       ger/deu      0     0            8     10           0 (3)    0              0
       n-congo      92    55           40    41           0 (3)    0              0
       n-congo/swa  100   76           36    25           0 (3)    0              0
       oto          20    15           21    33           0 (3)    0              0
       oto/ote      35    30           1     9            0 (3)    0              0
       uralic       3     0            23    34           0 (3)    0              0
       uralic/fin   0     0            7     22           0 (3)    0              0
       synth        84    62           97    91           0 (3)    0              0
       synth/suff   28    1            100   97           0 (3)    0              0
Sim.   austro       86    75           94    85           30 (30)  0              0
       austro/tgl   30    35           75    63           88 (88)  8 (8)          0
       ger          85    55           99    96           3 (3)    0              8
       ger/deu      86    55           99    98           0        0              5
       n-congo      99    96           98    93           0 (3)    0              3
       n-congo/swa  99    98           88    57           0        0              0
       oto          88    76           95    87           18 (18)  0              0
       oto/ote      96    84           59    17           5 (5)    0              0
       uralic       59    10           97    95           0        0              17
       uralic/fin   52    4            98    98           0        0              12
       synth        94    84           99    95           8 (10)   0              2
       synth/suff   86    42           100   99           0        0              2

Table 4: Accuracy of synthetic probe tasks presented as different language and language family. (Cyrl) indicates Cyrillic alphabet. Parentheses in reduplication columns show frequency of correct CV reduplication.

Finally, since the memory-based architecture is cognitively inspired, it might be adapted as a cognitive model of language learning in contact situations. Work on this learning process suggests that speakers find it much easier to learn new exponents than to learn new morphological processes (Dorian, 1978; Mithun, 2020). Thus, the impact of source-language transfer may indeed be most significant in cases where the L1 and L2 (source and target) languages differ in the abstract mechanisms of inflection rather than the specifics. Historical contact-induced change provides evidence for this viewpoint in the form of systems which have changed to employ the same processes as a contact language. For example, Cappadocian Greek has become agglutinative through its extensive contact with Turkish (Janse, 2004). For other examples, see Green (1995); Thomason (2001).

9 Conclusion

The results of this paper demonstrate that the proposed cognitive mechanism of memory-based analogy provides a relatively strong basis for inflection prediction. Performance in a supervised setting is

strongest in languages without large numbers of inflection classes, and requires training exemplars to be selected in the same way as test exemplars. Memory-based analogy also provides a foundation for one-shot transfer; in this case, training exemplars should closely match the elicited inflections, so that the model learns to copy rather than reconstruct the output form. One-shot transfer using this mechanism can achieve higher accuracy than previously thought, even when no genetically related languages are available in training. Scores vary widely, but can be over 80% for some languages.

Finally, this paper provides new evidence about what kinds of abstract information (beyond character correspondences) are transferred between languages when learning to inflect. The model learns general processes for prefixation and suffixation which apply (to some extent) across character sets, but its application of these can be disrupted by language-specific morpho-phonological rules. It also learns to reduplicate arbitrary CV sequences, but applies this process only when targeting a language with reduplication. Learning of morphological processes in general appears to be driven by the input typology. The discussion argues that the usefulness of general representations for prefixation and suffixation accounts for the puzzling effectiveness of agglutinative languages as transfer sources reported in previous research.


Acknowledgments

This research is deeply indebted to ideas contributed by Andrea Sims. I am also grateful to members of LING 5802 in autumn 2020 at Ohio State, and to the three anonymous reviewers for their comments and suggestions. Parts of this work were run on the Ohio Supercomputer (OSC, 1987).

References

2013. World atlas of language structures online. Available online at https://wals.info/, accessed 3 June 2020.

Adam Albright and Bruce Hayes. 2002. Modeling English past tense intuitions with minimal generalization. In Proceedings of the Sixth Meeting of the Association for Computational Linguistics Special Interest Group in Computational Phonology in Philadelphia, July 2002, pages 58–69.

Maria Alegre and Peter Gordon. 1998. Frequency effects and the representational status of regular inflections. Journal of Memory and Language, 40:41–61.

Antonios Anastasopoulos and Graham Neubig. 2019. Pushing the limits of low-resource morphological inflection. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 984–996, Hong Kong, China. Association for Computational Linguistics.

Matthew Baerman. 2014. Covert systematicity in a distributionally complex system. Journal of Linguistics, pages 1–47.

Balthasar Bickel and Johanna Nichols. 2013. Fusion of selected inflectional formatives. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

James P. Blevins, Petar Milin, and Michael Ramscar. 2017. The Zipfian paradigm cell filling problem. In Ferenc Kiefer, James P. Blevins, and Huba Bartos, editors, Perspectives on morphological organization: Data and analyses, pages 141–158. Brill.

Antal van den Bosch and Walter Daelemans. 1999. Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 285–292, College Park, Maryland, USA. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Brian Butterworth. 1983. Lexical representation. In Brian Butterworth, editor, Language production, vol. 2: Development, writing and other language processes, pages 257–294. Academic Press.

Joan Bybee. 2006. From usage to grammar: The mind’s response to repetition. Language, 82(4):711–733.

Joan Bybee and Carol Moder. 1983. Morphological classes as natural categories. Language, 59(2):251–270.

Nancy C. Dorian. 1978. The fate of morphological complexity in language death: Evidence from East Sutherland Gaelic. Language, 54(3):590–609.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia. Association for Computational Linguistics.

Alexander Erdmann, Tom Kenter, Markus Becker, and Christian Schallhart. 2020. Frugal paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8248–8273, Online. Association for Computational Linguistics.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.

Ian Green. 1995. The death of ‘prefixing’: contact induced typological change in northern Australia. In Annual Meeting of the Berkeley Linguistics Society, volume 21, pages 414–425.

Coleman Haley and Colin Wilson. 2021. Deep neural networks easily learn unnatural infixation and reduplication patterns. Proceedings of the Society for Computation in Linguistics, 4(1):427–433.

Hans Henrich Hock and Brian D. Joseph. 1996. Language history, language change and language relationship: An introduction to historical and comparative linguistics. Mouton de Gruyter.

Ray Jackendoff and Jenny Audring. 2020. The texture of the lexicon: Relational Morphology and the Parallel Architecture. Oxford University Press.

Mark Janse. 2004. Animacy, definiteness, and case in Cappadocian and other Asia Minor Greek dialects. Journal of Greek Linguistics, 5(1):3–26.

Huiming Jin and Katharina Kann. 2017. Exploring cross-lingual transfer of morphological knowledge in sequence-to-sequence models. In Proceedings of


the First Workshop on Subword and Character Level Models in NLP, pages 70–75, Copenhagen, Denmark. Association for Computational Linguistics.

Katharina Kann. 2020. Acquisition of inflectional morphology in artificial neural networks with prior knowledge. In Proceedings of the Society for Computation in Linguistics 2020, pages 144–154, New York, New York. Association for Computational Linguistics.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017a. Neural multi-source morphological reinflection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 514–524, Valencia, Spain. Association for Computational Linguistics.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017b. One-shot neural cross-lingual transfer for paradigm completion. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1993–2003, Vancouver, Canada. Association for Computational Linguistics.

Katharina Kann and Hinrich Schütze. 2016. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 62–70, Berlin, Germany. Association for Computational Linguistics.

Katharina Kann and Hinrich Schütze. 2017. Unlabeled data for morphological generation with character-based sequence-to-sequence models. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 76–81, Copenhagen, Denmark. Association for Computational Linguistics.

David King, Andrea Sims, and Micha Elsner. 2020. Interpreting sequence-to-sequence models for Russian inflectional morphology. In Proceedings of the Society for Computation in Linguistics 2020, pages 481–490, New York, New York. Association for Computational Linguistics.

Kimmo Koskenniemi and Kenneth Ward Church. 1988. Complexity, two-level morphology and Finnish. In Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics.

Andrea Krott, R. Harald Baayen, and Robert Schreuder. 2001. Analogy in morphology: modeling the choice of linking morphemes in Dutch.

Constantine Lignos and Charles Yang. 2018. Morphology and language acquisition. In Andrew Hippisley and Gregory T. Stump, editors, Cambridge handbook of morphology, pages 765–791. Cambridge University Press.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.

Ling Liu and Mans Hulden. 2020. Analogy models for neural word inflection. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2861–2878, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden. 2019. The SIGMORPHON 2019 shared task: Morphological analysis in context and cross-lingual transfer for inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244, Florence, Italy. Association for Computational Linguistics.

Petar Milin, Laurie Beth Feldman, Michael Ramscar, Roberta A. Hendrick, and R. Harald Baayen. 2017. Discrimination in lexical decision. PLoS ONE, 12(2):e0171935.

Marianne Mithun. 2020. Where is morphological complexity? In Peter Arkadiev and Francesco Gardani, editors, The complexities of morphology, pages 306–327. Oxford University Press.

Nikitha Murikinati, Antonios Anastasopoulos, and Graham Neubig. 2020. Transliteration for cross-lingual morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 189–197, Online. Association for Computational Linguistics.

Max Nelson, Hossep Dolatian, Jonathan Rawski, and Brandon Prickett. 2020. Probing RNN encoder-decoder generalization of subregular functions using reduplication. In Proceedings of the Society for Computation in Linguistics 2020, pages 167–178, New York, New York. Association for Computational Linguistics.

Garrett Nicolai, Bradley Hauer, Adam St Arnaud, and Grzegorz Kondrak. 2016. Morphological reinflection via discriminative string transduction. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 31–35, Berlin, Germany. Association for Computational Linguistics.

OSC. 1987. Ohio supercomputer center.


Frans Plank. 1999. Split morphology: How agglutination and flexion mix. Linguistic Typology, 3:279–340.

Brandon Prickett, Aaron Traylor, and Joe Pater. 2018. Seq2Seq models with dropout can learn generalizable reduplication. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 93–100, Brussels, Belgium. Association for Computational Linguistics.

Vieri Samek-Lodovici. 1992. A unified analysis of crosslinguistic morphological gemination. In Proceedings of CONSOLE, volume 1, pages 265–283. Citeseer.

Miikka Silfverberg, Francis Tyers, Garrett Nicolai, and Mans Hulden. 2021. Do RNN states encode abstract phonological alternations? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5501–5513. Association for Computational Linguistics.

Miikka Silfverberg, Adam Wiemerslage, Ling Liu, and Lingshuang Jack Mao. 2017. Data augmentation for morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 90–99, Vancouver. Association for Computational Linguistics.

Royal Skousen. 1989. Analogical modeling of language. Springer Science & Business Media.

Sarah Grey Thomason. 2001. Contact-induced typological change. In Language typology and language universals: An international handbook, volume 2, pages 1640–1648.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online. Association for Computational Linguistics.

Shijie Wu, Ryan Cotterell, and Mans Hulden. 2020. Applying the transformer to character-level transduction. arXiv preprint arXiv:2005.10213.

A Full results

For replicability, this appendix provides full results for all languages, as 0-1 accuracy on the official test datasets. The reported baseline is TRM-SINGLE, copied from https://docs.google.com/spreadsheets/d/1ODFRnHuwN-mvGtzXA1sNdCi-jNqZjiE-i9jRxZCK0kg. Scores for supervised systems on the development languages are shown in Table 5, and scores for one-shot systems on surprise languages are shown in Table 6. See Vylomova et al. (2020) for language abbreviation definitions.

Lang     Fam                  Rand  Sim  Base
ang      Indo-Eur: Germanic     72   19    78
azg      Oto-Manguean           94   22    95
ceb      Austronesian           79   69    84
cly      Oto-Manguean           82   19    91
cpa      Oto-Manguean           74   33    91
ctp      Oto-Manguean           43   15    60
czn      Oto-Manguean           83   32    80
dan      Indo-Eur: Germanic     75   42    75
deu      Indo-Eur: Germanic     93   62    98
eng      Indo-Eur: Germanic     97   67    97
est      Uralic                 94   47    95
fin      Uralic                100   39   100
frr      Indo-Eur: Germanic     81   39    87
gaa      Niger-Congo           100  100    98
gmh      Indo-Eur: Germanic     94   75    91
hil      Austronesian           97   74    98
isl      Indo-Eur: Germanic     88   37    97
izh      Uralic                 85   33    87
kon      Niger-Congo            99   99    98
krl      Uralic                 99   36    99
lin      Niger-Congo           100  100   100
liv      Uralic                 93   54    96
lug      Niger-Congo            90   74    91
mao      Austronesian           71   57    52
mdf      Uralic                 92   67    94
mhr      Uralic                 91   67    93
mlg      Austronesian          100  100   100
myv      Uralic                 93   61    94
nld      Indo-Eur: Germanic     99   61    99
nob      Indo-Eur: Germanic     75   47    76
nya      Niger-Congo           100  100   100
ote      Oto-Manguean           99   80    99
otm      Oto-Manguean           98   46    98
pei      Oto-Manguean           65   17    72
sme      Uralic                 99   31   100
sot      Niger-Congo           100  100    98
swa      Niger-Congo           100  100   100
swe      Indo-Eur: Germanic     97   59    99
tgl      Austronesian           69   35    72
vep      Uralic                 83   28    84
vot      Uralic                 81   41    86
xty      Oto-Manguean           90   79    91
zpv      Oto-Manguean           87   46    85
zul      Niger-Congo            92   83    92
Overall                         89   57    90
Stdev                           12   26    11

Table 5: Zero-one test-set accuracy scores by language for SIGMORPHON 2020 development languages (supervised).

Lang     Fam                  Rand  Sim  Base
ast      Indo-Eur: Romance       2   64   100
aze      Turkic                  9   17    81
bak      Turkic                 15   14   100
ben      Indo-Aryan              1    4    99
bod      Sino-Tibetan           21    9    84
cat      Indo-Eur: Romance      12   53   100
cre      Algic                   2   14    68
crh      Turkic                 24   45    99
dak      Siouan                 13   13    96
dje      Nilo-Saharan           21   82    88
evn      Tungusic               28   27    57
fas      Indo-Eur: Iranian       2   13   100
frm      Indo-Eur: Romance       7   73   100
fur      Indo-Eur: Romance      11   19   100
glg      Indo-Eur: Romance       9   59   100
gml      Indo-Eur: Germanic     11   11    62
gsw      Indo-Eur: Germanic     33   64    93
hin      Indo-Aryan              0    1   100
kan      Dravidian              13   16    76
kaz      Turkic                  0    7    98
kir      Turkic                  2    6    98
kjh      Turkic                 11   11   100
kpv      Uralic                 17   47    97
lld      Indo-Eur: Romance       3   68    99
lud      Uralic                 22   14    32
mlt      Afro-Asiatic           10   13    97
mwf      Australian              4    6    90
nno      Indo-Eur: Germanic     42   40    86
olo      Uralic                 37   33    94
ood      Uto-Aztecan            33   30    81
orm      Afro-Asiatic            2   52    99
pus      Indo-Eur: Iranian      13    9    90
san      Indo-Aryan             13    5    93
sna      Niger-Congo            75   88   100
syc      Afro-Asiatic            8   13    91
tel      Dravidian               0   10    95
tgk      Indo-Eur: Iranian      89   79    56
tuk      Turkic                  0   21    86
udm      Uralic                 11   30    98
uig      Turkic                  0   26    99
urd      Indo-Aryan              2    7    99
uzb      Turkic                  0   21   100
vec      Indo-Eur: Romance       2   62   100
vro      Uralic                 17   17    61
xno      Indo-Eur: Romance       2   22    96
Overall                         14   30    90
Stdev                           18   25    15

Table 6: Zero-one test-set accuracy scores by language for SIGMORPHON 2020 surprise languages (one-shot).

Proceedings of the Eighteenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 167–176. August 5, 2021. ©2021 Association for Computational Linguistics

Simple induction of (deterministic) probabilistic finite-state automata for phonotactics by stochastic gradient descent

Huteng Dai
Rutgers University
[email protected]

Richard Futrell
University of California, Irvine
[email protected]

Abstract

We introduce a simple and highly general phonotactic learner which induces a probabilistic finite-state automaton from word-form data. We describe the learner and show how to parameterize it to induce unrestricted regular languages, as well as how to restrict it to certain subregular classes such as Strictly k-Local and Strictly k-Piecewise languages. We evaluate the learner on its ability to learn phonotactic constraints in toy examples and in datasets of Quechua and Navajo. We find that an unrestricted learner is the most accurate overall when modeling attested forms not seen in training; however, only the learner restricted to the Strictly Piecewise language class successfully captures certain nonlocal phonotactic constraints. Our learner serves as a baseline for more sophisticated methods.

1 Introduction

Natural language phonotactics is argued to fall in the class of regular languages, or even in a smaller class of subregular languages (Rogers et al., 2013). This observation has motivated the study of probabilistic finite-state automata (PFAs) that generate these languages as models of phonotactics. Here we implement a simple method for the induction of PFAs for phonotactics from data, which can induce general regular languages in addition to languages in certain more restricted subclasses, for example Strictly k-Local and Strictly k-Piecewise languages (Heinz, 2018; Heinz and Rogers, 2010). We evaluate our learner on corpus data from Quechua and Navajo, with a particular emphasis on the ability to learn nonlocal constraints.

We make both theoretical and empirical contributions. Theoretically, we present the differentiable linear-algebraic formulation of PFAs which enables learning of the structure of the automaton by gradient descent. In our framework, it is possible to induce an unrestricted automaton with a given number of states, or an automaton with hard-coded constraints representing various subregular languages. This work fills a gap in the formal linguistics literature, where learners have been developed within certain subregular classes (Shibata and Heinz, 2019; Heinz, 2010; Heinz and Rogers, 2010; Futrell et al., 2017), whereas our learner can in principle induce any (sub)regular language. In addition, we demonstrate how Strictly Local and Strictly Piecewise constraints can be encoded within our framework, and show how information-theoretic regularization can be applied to produce deterministic automata.

Empirically, our main result is to show that our approach gives reasonable and linguistically accurate results. We find that inducing an unrestricted PFA produces the best fit to held-out attested forms, while inducing an automaton for a Strictly 2-Piecewise language yields a model that successfully captures nonlocal constraints. We also analyze the nondeterminism of induced automata, and the extent to which induced automata overfit to their training data.

2 Model specification

2.1 Probabilistic Finite-state Automata

A probabilistic finite-state automaton (PFA) for generating sequences consists of a finite set of states Q, an inventory of symbols Σ, an emission distribution with probability mass function p(x | q), which gives the probability of generating a symbol x ∈ Σ given state q ∈ Q, and a transition distribution with probability mass function p(q′ | q, x), which gives the probability of transitioning into a new state q′ from state q after emission of symbol x.

We parameterize a PFA using a family of right-stochastic matrices. The emission matrix E, of shape |Q| × |Σ|, gives the probability of emitting a symbol x given a state. Each row in the matrix represents a state, and each column represents an output symbol. Given a distribution on states represented as a stochastic vector q, the probability mass function over symbols is:

p(· | q) = qᵀE.  (1)

Each symbol x ∈ Σ is associated with a right-stochastic transition matrix T_x of shape |Q| × |Q|, so that the probability distribution over following states, given that the symbol x was emitted from the distribution on states q, is

p(· | q, x) = qᵀT_x.  (2)

Generation of a particular sequence x ∈ Σ* works by starting in a distinguished initial state q0, generating a symbol x, transitioning into the next state q′, and so on recursively until reaching a distinguished final state qf. Given a PFA parameterized by matrices E and T, the probability of a sequence x_{t=1}^{N}, marginalizing over all trajectories through states, can be calculated according to the Forward algorithm (Baum et al., 1970; Vidal et al., 2005a, §3) as follows:

p(x_{t=1}^{N} | E, T) = f(x_{t=1}^{N} | δ_{q0}),

where δ_q is a one-hot coordinate vector on state q and

f(∅ | q) = δ_{qf}ᵀ q,
f(x_{t=1}^{n} | q) = p(x_1 | q) · f(x_{t=2}^{n} | qᵀT_{x_1}).

The important aspect of this formulation is that the probability of a sequence is a differentiable function of the matrices E and T that define the PFA. Because the probability function is differentiable, we can induce a PFA from a set of training sequences by using gradient descent to search for matrices that maximize the probability of the training sequences.
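As a concrete illustration, the Forward computation can be sketched in a few lines of numpy. The function name and the array layout (per-symbol transition matrices stacked in a 3-D array) are our own conventions, not the paper's released code.

```python
import numpy as np

def sequence_prob(x, E, T, q0=0):
    """Probability of a symbol sequence x (a list of symbol indices) under a
    PFA with emission matrix E (|Q| x |Sigma|) and stacked transition
    matrices T (|Sigma| x |Q| x |Q|), marginalizing over state trajectories.
    The final symbol is assumed to be the boundary '#', whose transition
    matrix sends all mass back to the initial state q0 (reused as q_f)."""
    alpha = np.zeros(E.shape[0])
    alpha[q0] = 1.0                      # start in the initial state
    for sym in x:
        # emit sym from each state, then transition given (state, sym)
        alpha = (alpha * E[:, sym]) @ T[sym]
    return alpha[q0]                     # mass on the final state q_f = q0
```

With a single-state automaton this reduces to a product of emission probabilities, which makes a convenient sanity check.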

2.2 Learning by gradient descent

We describe a simple and highly general method for inducing a PFA from data by stochastic gradient descent. Although more specialized learning algorithms and heuristics exist for special cases (see for example Vidal et al., 2005b, §3), ours has the advantage of generality. Our goal is to see how effective this simple approach can be in practice.

Given a data distribution X with support over Σ*, we wish to learn a PFA by finding parameter matrices E and T to minimize an objective function of the form

J(E, T) = ⟨− log p(x | E, T)⟩_{x∼X} + C(E, T),  (3)

where ⟨·⟩_{x∼X} indicates an average over values x drawn from the data distribution X, and − log p(x | E, T) is the negative log likelihood (NLL) of a sample x under the model; the average negative log likelihood is equivalent to the cross entropy of the data distribution X and the model. By minimizing cross entropy, we maximize likelihood and thus fit to the data. The term C(E, T) represents additional complexity constraints on the E and T matrices, discussed in Section 2.4. When C is interpreted as a log prior probability on automata, minimizing Eq. 3 is equivalent to fitting the model by maximum a posteriori estimation.

Given the formulation in Eq. 3, because the objective function is differentiable, we can search for the optimal matrices E and T by performing (stochastic) descent on the gradients of the objective. That is, for a parameter matrix X, we can search for a minimum by performing updates of the form

X′ = X − η∇J(X),  (4)

where the scalar η is the learning rate. In stochastic gradient descent, each update is performed using a random finite sample from the data distribution, called a minibatch, to approximate the average over the data distribution in Eq. 3.

However, we cannot apply these updates directly to the matrices E and T, because they must be right-stochastic, meaning that the entries in each row must be positive and sum to 1. There is no guarantee that the output of Eq. 4 would satisfy these constraints. This issue was addressed by Dai (2021) by clipping the values of the matrix E to be between 0 and 1. A more general solution is that, instead of doing optimization on the E and T matrices directly, we do optimization over underlying real-valued matrices Ẽ and T̃ such that

E_ij = exp(Ẽ_ij) / Σ_k exp(Ẽ_ik),   T_ij = exp(T̃_ij) / Σ_k exp(T̃_ik);

in other words, we derive the matrices E and T by applying the softmax function to underlying matrices Ẽ and T̃, whose entries are called logits. Gradient descent is then done on the objective as a function of the logit matrices Ẽ and T̃. This approach to parameterizing probability distributions is standard in machine learning. Applied to induce a PFA with states Q and symbol inventory Σ, our formulation yields a total of |Q| × (|Q| × |Σ| − 1) meaningful trainable parameters.
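Concretely, the softmax parameterization might look as follows in numpy (a sketch; variable names are ours):

```python
import numpy as np

def softmax(logits, axis=-1):
    """Row-wise softmax, with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=axis, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_states, n_symbols = 3, 4

E_logits = rng.standard_normal((n_states, n_symbols))            # unconstrained reals
T_logits = rng.standard_normal((n_symbols, n_states, n_states))

E = softmax(E_logits)   # every row is now a valid emission distribution
T = softmax(T_logits)   # every T[x] is a right-stochastic transition matrix
```

Gradient updates act on `E_logits` and `T_logits`; the stochasticity constraints then hold automatically after every update.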

We note that this procedure is not guaranteed to find an automaton that globally minimizes the objective when optimizing T (see Vidal et al., 2005b, §3). But in practice, stochastic gradient descent in high-dimensional spaces can avoid local minima, functioning as a kind of annealing (Bottou, 1991, §4); using these simple optimization techniques on non-convex objectives is now standard practice in machine learning.

2.3 Sequence representation and word boundaries

In order to model phonotactics, a PFA must be sensitive to the boundaries of words, because there are often constraints that apply only at word beginnings or endings (Hayes and Wilson, 2008; Chomsky and Halle, 1968). To account for this, we include in the symbol inventory Σ a special word boundary delimiter #, which occurs as the final symbol of each word, and which only occurs in that position. Furthermore, we constrain all matrices T to transition deterministically back into the initial state following the symbol #, effectively reusing the initial state q0 as the final state qf.

By constructing the automata in this way, we ensure that their long-run behavior is well-behaved. If an automaton of this form is allowed to keep generating past the symbol #, it will generate successive concatenated independent and identically distributed samples from its distribution over words, with boundary symbols # delineating them. This construction makes it possible to calculate stationary distributions over states and complexity measures related to them.
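The boundary constraint amounts to hard-coding a single transition matrix. A minimal sketch (ours), assuming symbol index 0 for # and state index 0 for q0:

```python
import numpy as np

n_states = 4

# Transition matrix for '#': from every state, move to the initial state q0
# with probability 1, so the machine restarts after each word and q0 also
# serves as the final state.
T_boundary = np.zeros((n_states, n_states))
T_boundary[:, 0] = 1.0
```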

2.4 Regularization

The objective in Eq. 3 includes a regularization term C representing complexity constraints. Any differentiable complexity measure could be used here. This regularization term can be viewed from a Bayesian perspective as defining a prior over automata and providing an inductive bias. We propose to use this term to constrain the PFA induction process to produce deterministic automata.

Most formal work on probabilistic finite-state automata for phonology has focused on deterministic PFAs because of their nice theoretical properties (Heinz, 2010). A deterministic PFA is distinguished by having fully deterministic transition matrices T. This condition can be expressed information-theoretically. Assuming 0 log 0 = 0, and letting the entropy of a stochastic vector p be

H[p] = − Σ_i p_i log p_i,

a PFA is deterministic when it satisfies the condition H[qᵀT_x] = 0 for all symbols x and state distributions q.

We can use this expression to monitor the degree of nondeterminism of a PFA during optimization, or to add a determinism constraint to the objective in Section 2.2. The average nondeterminism N of a PFA is given by

N(E, T) = Σ_{ij} q_i E_{ij} H[δ_{q_i}ᵀ T_j],

where q is the stationary distribution over states, representing the long-run average occupancy distribution over states. The stationary distribution q is calculated by finding the left eigenvector of the matrix S satisfying

qᵀS = qᵀ,

where S is a right-stochastic matrix giving the probability that the PFA transitions from state i to state j, marginalizing over emitted symbols:

S_ij = Σ_{x∈Σ} p(x | q_i) p(q_j | q_i, x).

For the Strictly Local and Strictly Piecewise automata, N = 0 by construction. For an automaton parameterized by T = softmax(T̃), where T̃ is a logit matrix, it is not possible to attain N = 0 exactly, but N can nonetheless be made arbitrarily small. There are alternative parameterizations where N = 0 is achievable, for example using the sparsemax function instead of softmax (Martins and Astudillo, 2016; Peters et al., 2019).

In order to constrain automata to be deterministic, we set the regularization term in Eq. 3 to

C = αN,

where α is a non-negative scalar determining the strength of the trade-off between cross entropy and nondeterminism in the optimization. With α = 0 there is no constraint on the nondeterminism of the automaton, and minimizing the objective in Eq. 3 reduces to maximum likelihood estimation.
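The average nondeterminism N can be computed directly from these definitions. The sketch below is our own code (using log base 2, so that N is in bits); it finds the stationary distribution as the left eigenvector of S with eigenvalue 1.

```python
import numpy as np

def nondeterminism(E, T):
    """Average nondeterminism N(E, T) = sum_ij q_i E_ij H[T_j[i]], where q is
    the stationary distribution of the symbol-marginalized transition matrix
    S_ij = sum_x p(x|q_i) p(q_j|q_i, x)."""
    n_states, n_symbols = E.shape
    S = sum(E[:, [x]] * T[x] for x in range(n_symbols))
    # Stationary distribution: left eigenvector of S for eigenvalue 1.
    w, v = np.linalg.eig(S.T)
    q = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    q = q / q.sum()

    def entropy(p):
        p = p[p > 0]                 # 0 log 0 = 0 convention
        return -(p * np.log2(p)).sum()

    return sum(q[i] * E[i, x] * entropy(T[x][i])
               for i in range(n_states) for x in range(n_symbols))
```

A fully deterministic automaton (one-hot transition rows) should give N = 0, which makes a convenient check.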

2.5 Implementing restricted automata

We define Strictly Local and Strictly Piecewise automata as automata that generate the respective languages. We implement Strictly Local and Strictly Piecewise automata by hard-coding the transition matrices T. For these automata, we only do optimization over the emission matrices E.

Strictly Local  In a Strictly k-Local (k-SL) language, each symbol is conditioned only on the immediately preceding k − 1 symbol(s) (Heinz, 2018; Rogers and Pullum, 2011). We implement a 2-SL automaton by associating each state q ∈ Q with a unique element x of the symbol inventory Σ. Upon emitting symbol x, the automaton deterministically transitions into the corresponding state, denoted q_x. Thus each transition matrix T_x is zero everywhere except the column for q_x, which is all ones:

T_x = [ ⋯ 0 ⋯ 1 ⋯ 0 ⋯ ]  (every row places probability 1 on q_x).

This construction can be straightforwardly extended to k-SL, yielding |Σ|^{k−1} × (|Σ| − 1) trainable parameters for a k-SL automaton.
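As a sketch (ours), the hard-coded 2-SL transition structure can be built in one loop:

```python
import numpy as np

def sl2_transitions(n_symbols):
    """2-SL transition structure: one state per symbol; emitting symbol x
    deterministically moves the automaton into the state q_x, regardless of
    the current state."""
    # T[x] is |Q| x |Q| with |Q| = |Sigma| (one state per symbol).
    T = np.zeros((n_symbols, n_symbols, n_symbols))
    for x in range(n_symbols):
        T[x, :, x] = 1.0   # every row of T[x] puts probability 1 on q_x
    return T
```

Only the emission matrix is then trained; these transition matrices stay fixed.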

Strictly Piecewise  In a Strictly k-Piecewise (k-SP) language, each symbol depends on the presence of any preceding k − 1 symbols at arbitrary distance (Heinz, 2007, 2018; Shibata and Heinz, 2019). For example, in a 2-SP language, in a string abc, c would be conditioned on the presence of a and the presence of b, without regard to the distance or the relative order of a and b.

The implementation of an SP automaton is slightly more complex than the SL automaton, as the number of states required in a naïve implementation is exponential in the symbol inventory size, resulting in intractably large matrices. We circumvent this complexity by parameterizing a 2-SP automaton as a product of simpler automata. We associate each symbol x ∈ Σ with a sub-automaton A_x which has two states q0ˣ and q1ˣ, with state q0ˣ indicating that the symbol x has not been seen, and q1ˣ indicating that it has been seen. Each sub-automaton A_x has an emission matrix E⁽ˣ⁾ of size 2 × |Σ| corresponding to the two states q0ˣ and q1ˣ; the emission matrix for all states q0ˣ is constrained to be the uniform distribution over symbols. The transition matrices T⁽ˣ⁾ are

T⁽ˣ⁾_x = [0 1; 0 1],   T⁽ˣ⁾_y = [1 0; 0 1]  for y ≠ x.

Then the probability of the t'th symbol x_t in a sequence, given the context of previous symbols x_1 … x_{t−1}, is the geometric mixture of the probability of x_t under each sub-automaton, also called the co-emission probability:

p(x_t | x_1 … x_{t−1}) ∝ ∏_{y∈Σ} p_{A_y}(x_t | x_1 … x_{t−1}).

Because each sub-automaton A_y is deterministic, its state after seeing the context x_1 … x_{t−1} is known, and the conditional probability p_{A_y}(x_t | x_1 … x_{t−1}) can be computed using Eq. 1. For calculating the probability of a sequence, we assume an initial state of having seen the boundary symbol #; that is, the sub-automaton A_# starts in state q1^#.

Using this parameterization, we can do optimization over the collection of emission matrices {E⁽ˣ⁾}_{x∈Σ}. This construction yields |Σ| × (|Σ| − 1) trainable parameters for the 2-SP automaton, the same number of parameters as the 2-SL automaton.
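The co-emission probability can be sketched as follows (our code; the boolean vector `seen` plays the role of each sub-automaton's binary state):

```python
import numpy as np

def sp2_prob(x_t, seen, E_sub):
    """Co-emission probability of symbol x_t under a 2-SP product automaton.
    seen[y] is True iff symbol y occurred earlier in the word; E_sub[y] is
    the 2 x |Sigma| emission matrix of sub-automaton A_y (row 0: y unseen,
    row 1: y seen). Returns the normalized geometric mixture."""
    rows = [E_sub[y][1 if seen[y] else 0] for y in range(len(E_sub))]
    p = np.prod(rows, axis=0)      # elementwise product over sub-automata
    return p[x_t] / p.sum()        # normalize (the definition is a proportionality)
```

Because unseen states emit uniformly, only the sub-automata whose symbols have already been seen influence the mixture.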

SP + SL  It is also possible to create and train an automaton that conditions on both 2-SL and 2-SP factors by taking the product of 2-SL and 2-SP automata, as proposed by Heinz and Rogers (2013). We refer to the language generated by such an automaton as 2-SL + 2-SP. We experiment with such product machines below.

2.6 Related work

PFA induction from data is a well-studied task which has been the subject of multiple competitions over the years (see Verwer et al., 2012, for a review). The most common approaches are variants of Baum-Welch and heuristic state-merging algorithms (see for example de la Higuera, 2010). Gibbs samplers and spectral methods have also been proposed (Gao and Johnson, 2008; Bailly, 2011; Shibata and Yoshinaka, 2012). Induction of restricted PDFAs, especially for SL and SP languages, is explored in Heinz and Rogers (2013, 2010).

Our work differs from previous approaches in its simplicity. Inspired by Shibata and Heinz (2019), we optimize the training objective directly via gradient descent, without approximations or heuristics other than the use of minibatches. The same algorithm is applied to learn both transition and emission structure, for learning of both general PFAs and restricted PDFAs. One of our contributions is to show that this very simple approach gives reasonable results for learning phonotactics.

3 Inducing toy languages

First, we test the ability of the model to recover automata for simple examples of subregular languages. We do so for the two subregular classes 2-SL and 2-SP described in Section 2.5. For each of these language classes, we implement a reference PFA which generates strings from a simple example language in that class, then generate 10,000 sample sequences from the reference PFA. We then use these samples as training data and study whether our learners can recover the relevant constraints from the data.
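Generating training samples from a reference PFA is straightforward; a minimal sampler (ours), following the boundary-symbol convention of Section 2.3, is:

```python
import numpy as np

def sample_word(E, T, q0=0, boundary=0, rng=None, max_len=100):
    """Draw one word from a PFA by alternating emission and transition steps
    until the boundary symbol '#' (index `boundary`) is produced."""
    rng = rng or np.random.default_rng()
    q, word = q0, []
    for _ in range(max_len):
        x = rng.choice(E.shape[1], p=E[q])     # emit a symbol from state q
        word.append(int(x))
        if x == boundary:
            break                              # '#' ends the word
        q = rng.choice(T.shape[1], p=T[x][q])  # transition given (q, x)
    return word
```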

3.1 Evaluation

We evaluate the ability to induce appropriate automata in two ways. First, since we are studying very simple languages and automata, it is possible to directly inspect the E and T matrices and check that they implement the correct automaton by observing the transition and emission probabilities.

Second, we study the probabilities assigned to carefully selected strings which exemplify the constraints that define the languages. For each language, we define an illegal test string which violates the constraints of the language, and a minimally different legal test string. Given an automaton, we can measure the legal–illegal difference: the log probability of the legal test string minus the log probability of the illegal test string. A larger legal–illegal difference indicates that the model assigns a higher probability to the legal form compared to the illegal one, and therefore is successfully learning the constraints represented by the testing data.
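The metric itself is a one-liner given model probabilities for the two test strings (here in natural log; the paper's figures report bits):

```python
import math

def legal_illegal_difference(p_legal, p_illegal):
    """Log probability of the legal test string minus that of the illegal one.
    Positive values mean the model prefers the legal form."""
    return math.log(p_legal) - math.log(p_illegal)
```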

3.2 Languages

All languages are defined over the symbol inventory {a, b, c} plus the boundary symbol #.

As an exemplar of 2-SL languages, we use the language characterized by the forbidden factor *ab. A deterministic PFA for this language is given in Figure 2 (top). The language contains all strings that do not have an a followed immediately by a b. Our legal test string for this language is bacccb# and the illegal test string is babccc#.

As an exemplar of 2-SP languages, we use the language characterized by the forbidden factor *a…b. This language contains all strings that do not have an a followed by a b at any distance. The reference automaton is given in Figure 2 (bottom). The legal test string is baccca# and the illegal test string is bacccb#.

3.3 Training parameters

The underlying logit matrices for E and T are initialized with random draws from a standard normal distribution (Derrida, 1981). We perform stochastic gradient descent using the Adam algorithm, which adaptively sets the learning rate (Kingma and Ba, 2015). We perform 10,000 update steps with starting learning rate η = 0.001 and minibatch size 5.

3.4 Results

Unrestricted PFA induction succeeds in recovering the reference automata for both toy languages. Learners restricted to the appropriate classes, as well as the automaton combining SL and SP factors, also succeed in inducing the appropriate automata, while learners restricted to the 'wrong' class fail.

Figure 1 shows the legal–illegal differences for test strings over the course of training. We can see that, when the learner is unrestricted or in the appropriate class, it eventually picks up on the relevant constraint, with the legal–illegal difference increasing apparently without bound over training. Unrestricted learners take longer to reach this point, but they reach it reliably. On the other hand, looking at the legal–illegal differences for learners in the wrong class, we see that they asymptote to a small value and stop improving.

These results demonstrate that our simple method for PFA induction does succeed in inducing certain simple structures relevant for modeling phonotactics in a small, controlled setting. Next, we turn to induction of phonotactics from corpus data.

4 Corpus experiments

We evaluate our learner by training it on dictionary forms from Quechua and Navajo, and then studying its ability to predict attested forms that were held out in training, in addition to artificially constructed nonce forms which probe the ability of the model to represent nonlocal constraints.

4.1 Training parameters

All training parameters are as in Section 3.3, except that we train for 100,000 steps and control the succession of minibatches to be the same across models within the same language.

[Figure 1 shows, for each target language (*ab, 2-SL, with legal test string bacccb# and illegal test string babccc#; *a…b, 2-SP, with legal test string baccca# and illegal test string bacccb#), the legal–illegal difference over 10,000 training steps, with one curve per learner class: 2-SL, 2-SP, 2-SP + 2-SL, and an unrestricted PFA with |Q| = 2.]

Figure 1: Difference in log probabilities for legal and illegal forms over the course of PFA induction for toy languages. A large positive value indicates that the relevant constraint has been learned.

[Figure 2 shows two two-state automaton diagrams (states q0 and q1) whose arcs are labeled with emitted symbols and emission probabilities; the diagrams are not reproduced here.]

Figure 2: Reference automata for the 2-SL language characterized by the constraint *ab (top) and the 2-SP language characterized by the constraint *a…b (bottom). Arcs are annotated with symbols emitted and their corresponding emission probabilities.

4.2 Dataset

The proposed learner is applied to the datasets of Navajo and Quechua (Gouskova and Gallagher, 2020), in which nonlocal phonotactics are attested.

In Navajo, the co-occurrence of an alveolar and a palatal strident is illegal. The Navajo learning data include 6,279 phonological words; we divide these into a training set of 5,023 forms and a held-out set of 1,256 forms. The Navajo nonce test data consist of 5,000 generated nonce words, labelled as illegal (N = 3,271) or legal (N = 1,729) based on whether the nonlocal phonotactics are satisfied.

In Quechua, a stop cannot be followed by an ejective or aspirated stop at any distance. The Quechua learning data include 10,804 phonological words, which we separate into 8,643 training forms and 2,160 held-out forms. The Quechua test data (Gouskova and Gallagher, 2020) consist of 24,352 nonce forms, manually classified as legal (N = 18,502) or illegal (N = 5,810, including stop-aspirate and stop-ejective pairs).

4.3 Dependent Variables

For the linguistic performance of the classifier, we study two main dependent variables. First, the average held-out negative log likelihood (NLL) indicates the ability of the model to assign high probabilities to unseen but attested forms; lower NLL indicates higher probabilities. Second, using our nonce-form dataset, we measure the extent to which the model can differentiate the legal forms from the illegal forms, using the difference in log likelihood for the legal forms minus the illegal forms. This is the same as the legal–illegal difference described in Section 3.1, but now as an average over many legal–illegal nonce pairs instead of a difference for one pair.

[Figure 3 plots, for Navajo and Quechua, held-out NLL, overfitting, and nondeterminism N (at α = 0 and at α = 1), all in bits, against training step, with one curve per number of states |Q| ∈ {32, 64, 128, 256, 512, 1024}.]

Figure 3: Accuracy and complexity metrics for unrestricted PFA induction. 'Overfitting' is the difference between held-out NLL and training-set NLL. N is nondeterminism and α is the regularization parameter (see Section 2.4). Runs with |Q| = 128, 256, 512 and α = 1 on Navajo data terminated early due to numerical underflow in the calculation of the stationary distribution.

[Figure 4 plots, for Navajo and Quechua, held-out NLL and the legal–illegal difference, in bits, against training step, for the four learners compared.]

Figure 4: Performance of a 2-SP automaton, a 2-SL automaton, a 2-SP + 2-SL product automaton, and an unrestricted PFA with 1,024 states and α = 0. 'Heldout NLL' is the average NLL of a form in the set of attested forms never seen during training. 'Legal–illegal difference' is the difference in log likelihood between 'legal' and 'illegal' forms in the nonce test set.

4.4 Results

Unrestricted PFA induction  Figure 3 shows results from induction of unrestricted PFAs with various numbers of states. We show the average NLL of forms in the held-out data, as well as 'overfitting', defined as the average held-out NLL minus the average training-set NLL. This number shows the extent to which the model assigns higher probabilities to forms in the training set as opposed to the held-out set, an index of overfitting. We find that automata with more states fit the data better, but are also more prone to overfitting to the training set.

In Figure 3 (bottom two rows) we also show the measured nondeterminism N of the induced automata throughout training, for different values of the regularization parameter α (see Section 2.4). We find that, even without an explicit constraint for determinism, the induced PFAs tend towards determinism over time, with N reaching around 1.5 bits by the final training step. Explicit regularization (with α = 1) makes this process faster, with N reaching around 0.5 bits. Regularization for determinism has only a minimal effect on the NLL values.

Linguistic performance and restricted models  Figure 4 shows held-out NLL and the legal–illegal difference for both languages, comparing the SL automaton, the SP automaton, the product SP + SL automaton, and a PFA with 1,024 states and α = 0.

In terms of the ability to predict attested held-out forms, the best model is consistently the unrestricted PFA, with the SP automaton performing the worst. However, in terms of predicting the ill-formedness of artificial forms violating nonlocal phonotactic constraints, the best model is either the SP automaton or the SP + SL product automaton. Both of these automata successfully induce the nonlocal constraint.

On the other hand, the unrestricted PFA learner shows no evidence at all of having learned the difference between legal and illegal forms in the artificial data, despite having the capacity to do so in theory, and despite succeeding in inducing a 2-SP language in Section 3.

4.5 Discussion

We find that an unrestricted PFA learner performs most accurately when predicting real held-out forms, while an SP learner is most effective in learning certain nonlocal constraints. In fact, in terms of its ability to model the nonlocal constraints, the PFA learner ends up comparable to an SL learner, which cannot learn the constraints at all. Meanwhile, the SP learner, which is unable to model local constraints, fares much worse than even the SL learner at predicting held-out forms. The product SP + SL learner combines the strengths of both restricted learners, but still does not assign as high a probability to the real held-out forms as the unrestricted PFA learner.

This pattern of performance suggests that the PFA learner is using most of its states to model local constraints beyond those captured in a 2-SL language. These constraints are important for predicting real held-out forms. The SP automaton is unable to achieve strong performance on held-out forms without the ability to model these local constraints. On the other hand, the unrestricted PFA tends to overfit to its training data, perhaps explaining its failure to detect nonlocal constraints which are picked up by the appropriate restricted automata.

5 Conclusion

We introduced a framework for phonotactic learning based on simple induction of probabilistic finite-state automata by stochastic gradient descent. We showed how this framework can be used to learn unrestricted PFAs, in addition to PFAs restricted to certain formal language classes such as Strictly Local and Strictly Piecewise, via constraints on the transition matrices that define the automata. Furthermore, we showed that the framework is successful in learning some phonotactic phenomena, with unrestricted automata performing best in a wide-coverage evaluation on attested but held-out forms, and Strictly Piecewise automata performing best in a targeted evaluation using nonce forms focusing on nonlocal constraints.
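As a concrete illustration of the scoring side of such a framework, the sketch below (our own toy example, not the released PFA-learner code) computes the negative log-likelihood of a string under a small PFA with the forward algorithm; in the paper's setting, these probabilities would come from parameterized transition matrices optimized by stochastic gradient descent.

```python
import math

# Hedged sketch: NLL of a string under a probabilistic finite-state automaton.
# The PFA encoding is our own assumption: `init` maps states to initial
# probabilities, `trans[(q, a)]` lists (probability, next_state) pairs, and
# `final` maps states to stopping probabilities.
def nll(string, init, trans, final):
    """Negative log-likelihood of `string` via the forward algorithm."""
    alpha = dict(init)                       # state -> forward probability
    for a in string:
        nxt = {}
        for q, p in alpha.items():
            for prob, r in trans.get((q, a), []):
                nxt[r] = nxt.get(r, 0.0) + p * prob
        alpha = nxt
    total = sum(p * final.get(q, 0.0) for q, p in alpha.items())
    return -math.log(total) if total > 0 else float("inf")

# A toy one-state PFA: P(a) = P(b) = 0.4 per step, stopping probability 0.2.
toy_trans = {("q0", "a"): [(0.4, "q0")], ("q0", "b"): [(0.4, "q0")]}
print(nll("ab", {"q0": 1.0}, toy_trans, {"q0": 0.2}))  # -ln(0.032) ≈ 3.442
```

Restricting such a model to a class like Strictly Local or Strictly Piecewise then amounts to constraining which transition entries may carry probability mass.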

Our results leave open the question of whether the unrestricted learner or one of the restricted learners is 'best' for learning phonotactics, since they perform differently on different metrics. A key question for future work is whether there might be some model that could do well in inducing both local and nonlocal constraints simultaneously, performing well on both the held-out evaluation and the nonce form evaluation. Such a model could come in the form of another restricted language class such as Tier-Based Strictly Local languages (Heinz et al., 2011; Jardine and Heinz, 2016; McMullin, 2016; Jardine and McMullin, 2017), or perhaps in the form of a regularization term in the training objective which enforces an inductive bias that favors certain nonlocal interactions.

The code for this project is available at http://github.com/hutengdai/PFA-learner.

Acknowledgments

This work was supported by a GPU Grant from the NVIDIA corporation. We thank the three anonymous reviewers and Adam Jardine, Jeff Heinz, and Dakotah Lambert for their comments.

References

Raphael Bailly. 2011. Quadratic weighted automata: Spectral algorithm and likelihood maximization. In Asian Conference on Machine Learning, pages 147–163. PMLR.

Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171.

Léon Bottou. 1991. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8):12.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. Harper & Row.

Huteng Dai. 2021. Learning nonlocal phonotactics in a Strictly Piecewise phonotactic model. In Proceedings of the 2020 Annual Meeting on Phonology.

Colin de la Higuera. 2010. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press.

Bernard Derrida. 1981. Random-energy model: An exactly solvable model of disordered systems. Physical Review B, 24(5):2613.

Richard Futrell, Adam Albright, Peter Graff, and Timothy J. O'Donnell. 2017. A generative model of phonotactics. Transactions of the Association for Computational Linguistics, 5:73–86.

Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 344–352, Honolulu, Hawaii. Association for Computational Linguistics.

Maria Gouskova and Gillian Gallagher. 2020. Inducing nonlocal constraints from baseline phonotactics. Natural Language & Linguistic Theory, 38(1):77–116.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Jeffrey Heinz. 2007. The inductive learning of phonotactic patterns. Ph.D. thesis, University of California, Los Angeles.

Jeffrey Heinz. 2010. Learning long-distance phonotactics. Linguistic Inquiry, 41(4):623–661.

Jeffrey Heinz. 2018. The computational nature of phonological generalizations. Phonological Typology, Phonetics and Phonology, pages 126–195.

Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner. 2011. Tier-based Strictly Local constraints for phonology. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 58–64, Portland, Oregon, USA. Association for Computational Linguistics.

Jeffrey Heinz and James Rogers. 2010. Estimating Strictly Piecewise distributions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 886–896, Uppsala, Sweden. Association for Computational Linguistics.

Jeffrey Heinz and James Rogers. 2013. Learning subregular classes of languages with factored deterministic automata. In Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13), pages 64–71, Sofia, Bulgaria. Association for Computational Linguistics.

Adam Jardine and Jeffrey Heinz. 2016. Learning Tier-based Strictly 2-Local languages. Transactions of the Association for Computational Linguistics, 4:87–98.

Adam Jardine and Kevin McMullin. 2017. Efficient learning of Tier-based Strictly k-Local languages. In International Conference on Language and Automata Theory and Applications, pages 64–76. Springer.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA.

Andre Martins and Ramon Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pages 1614–1623. PMLR.

Kevin James McMullin. 2016. Tier-based locality in long-distance phonotactics: learnability and typology. Ph.D. thesis, University of British Columbia.

Ben Peters, Vlad Niculae, and André F. T. Martins. 2019. Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1504–1519, Florence, Italy. Association for Computational Linguistics.

James Rogers, Jeffrey Heinz, Margaret Fero, Jeremy Hurst, Dakotah Lambert, and Sean Wibel. 2013. Cognitive and sub-regular complexity. In Formal Grammar, pages 90–108. Springer.

James Rogers and Geoffrey K. Pullum. 2011. Aural pattern recognition experiments and the subregular hierarchy. Journal of Logic, Language and Information, 20(3):329–342.

Chihiro Shibata and Jeffrey Heinz. 2019. Maximum likelihood estimation of factored regular deterministic stochastic languages. In Proceedings of the 16th Meeting on the Mathematics of Language, pages 102–113, Toronto, Canada. Association for Computational Linguistics.

Chihiro Shibata and Ryo Yoshinaka. 2012. Marginalizing out transition probabilities for several subclasses of PFAs. Journal of Machine Learning Research - Workshops and Conference Proceedings, 21:259–263.

Sicco Verwer, Rémi Eyraud, and Colin de la Higuera. 2012. PAutomaC: A PFA/HMM learning competition. Journal of Machine Learning Research - Workshops and Conference Proceedings, 21.

Enrique Vidal, Franck Thollard, Colin de la Higuera, Francisco Casacuberta, and Rafael C. Carrasco. 2005a. Probabilistic finite-state machines – Part I. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013–1025.

Enrique Vidal, Franck Thollard, Colin de la Higuera, Francisco Casacuberta, and Rafael C. Carrasco. 2005b. Probabilistic finite-state machines – Part II. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1026–1039.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 177–187, August 5, 2021. ©2021 Association for Computational Linguistics

Recognizing Reduplicated Forms: Finite-State Buffered Machines

Yang Wang
Department of Linguistics
University of California, Los Angeles
Los Angeles, CA, USA
[email protected]

Abstract

Total reduplication is common in natural language phonology and morphology. However, formally, as copying on reduplicants of unbounded size, unrestricted total reduplication requires computational power beyond context-free, while other phonological and morphological patterns are regular, or even sub-regular. Thus, existing language classes characterizing reduplicated strings inevitably include typologically unattested context-free patterns, such as reversals. This paper extends regular languages to incorporate reduplication by introducing a new computational device: finite-state buffered machines (FSBMs). We give their mathematical definitions and discuss some closure properties of the corresponding set of languages. As a result, the class of regular languages and languages derived from them through a copying mechanism is characterized. As suggested by previous literature (Gazdar and Pullum, 1985), this class of languages should approach the characterization of natural language word sets.

1 The Puzzle of (Total) Reduplication

Formal language theory (FLT) provides computational mechanisms characterizing different classes of abstract languages based on their inherent structures. Following FLT in the study of human languages, in principle, researchers would expect a hierarchy of grammar formalisms that matches empirical findings: more complex languages in such a hierarchy are supposed to be 1) less common in natural language typology; and 2) harder for learners to learn.

The classical Chomsky Hierarchy (CH) puts formal languages into four levels of increasing complexity: regular, context-free, context-sensitive, and recursively enumerable (Chomsky, 1956; Jäger and Rogers, 2012). Does the CH notion of formal complexity have the desired empirical correlates?

Several findings suggest that those four levels do not align with natural languages precisely, some leading to major refinements of the CH. First, the unbounded crossing dependencies in Swiss-German case marking (Shieber, 1985) motivated attempts to characterize mildly context-sensitive (MCS) languages, which extend context-free languages (CFLs) but still preserve some useful properties of CFLs (e.g., Joshi, 1985; Seki et al., 1991; Stabler, 1997). Secondly, it is generally accepted that phonology is regular (e.g., Johnson, 1972; Kaplan and Kay, 1994). However, being regular is argued to be an unrestrictive property for phonologically well-formed strings: for example, a language whose words are sensitive to an even or odd number of certain sounds is unattested (Heinz, 2018). With strong typological evidence, the sub-regular hierarchy was further developed, and it continues to be an active area of research (e.g., McNaughton and Papert, 1971; Simon, 1975; Heinz, 2007; Heinz et al., 2011; Chandlee, 2014; Graf, 2017).

In this paper, we analyze another mismatch between existing well-known language classes and empirical findings: reduplication, which involves copying operations on certain base forms (Inkelas and Zoll, 2005). The reduplicated phonological strings are either totally identical (total reduplication) or partially identical (partial reduplication) to the base forms. Table 1 provides examples showing the difference between total and partial reduplication: in Dyirbal, the pluralization of nominals is realized by fully copying the singular stems, while in the Agta examples, plural forms only copy the first CVC sequence of the corresponding singular forms (Healey, 1960; Marantz, 1982).

Reduplication is common cross-linguistically. According to Rubino (2013) and Dolatian and Heinz (2020), 313 out of 368 natural languages exhibit productive reduplication; 35 of these languages have only total reduplication, not partial reduplication. As a comparison, it is widely recognized that context-free string reversals are rare in phonology and morphology (Marantz, 1982) and appear to be confined to language games (Bagemihl, 1989).

Total reduplication: Dyirbal plurals (Dixon, 1972, 242)
Singular    Gloss                     Plural             Gloss
midi        'little, small'           midi-midi          'lots of little ones'
gulgiói     'prettily painted men'    gulgiói-gulgiói    'lots of prettily painted men'

Partial reduplication: Agta plurals (Healey, 1960, 7)
Singular    Gloss      Plural        Gloss
labang      'patch'    lab-labang    'patches'
takki       'leg'      tak-takki     'legs'

Table 1: Total reduplication: Dyirbal plurals (top); partial reduplication: Agta plurals (bottom).

Figure 1: Crossing dependencies in Dyirbal total reduplication 'midi-midi' (top) versus nesting dependencies in the unattested string reversal 'midi-idim' (bottom)

Unrestricted total reduplication, or unbounded copying, can be abstracted as L_ww = {ww | w ∈ Σ*}, a well-known non-context-free language (Culy, 1985; Hopcroft and Ullman, 1979).¹ Its non-context-freeness comes from the incurred crossing dependencies among symbols, similar to the Swiss-German case-marking constructions. However, the typologically rare string reversals ww^R demonstrate nesting dependencies, which are context-free (see Fig. 1 for an illustration).
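The contrast between copying and reversal can be made concrete with a quick membership check. The following is a small illustration of ours, not part of the paper's formal apparatus:

```python
# Membership tests for the copying language L_ww = {ww | w in Sigma*}
# and the reversal language {w w^R | w in Sigma*}.
def in_L_ww(s: str) -> bool:
    """True iff s is some string w immediately followed by an identical copy."""
    if len(s) % 2:
        return False
    half = len(s) // 2
    return s[:half] == s[half:]

def in_L_wwR(s: str) -> bool:
    """True iff s is some string w followed by its mirror image w^R."""
    if len(s) % 2:
        return False
    half = len(s) // 2
    return s[:half] == s[half:][::-1]

# Dyirbal-style total reduplication is in L_ww but not in the reversal language:
print(in_L_ww("midimidi"), in_L_wwR("midimidi"))  # True False
print(in_L_ww("midiidim"), in_L_wwR("midiidim"))  # False True
```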

Given that most phonological and morphological patterns are regular, how can one fit in reduplicated strings without including reversals? Gazdar and Pullum (1985, 278) made the remark that

¹Total reduplication does not immediately guarantee unboundedness. When the set of bases is finite, i.e., {ww | w ∈ L} with L finite, total reduplication can be squeezed into languages described by 1-way finite-state machines (Chandlee, 2017), though doing so eventually leads to state explosion (Roark and Sproat, 2007; Dolatian and Heinz, 2020). Computationally, only total reduplication with an infinite number of potential reduplicants is true unbounded copying. With careful treatment, unbounded copying, externalizing a primitive copying operation, can be justified as a model of reduplication in natural languages. More in-depth discussion of 1) bounded versus unbounded copying and 2) copying as a primitive operation can be found in Clark and Yoshinaka (2014); Chandlee (2017); Dolatian and Heinz (2020).

Figure 2: The class of regular with copying languages in the CH, situating a^i b^j (regular), ww^R (context-free), and ww and a^i b^j c^i d^j (mildly context-sensitive) within the nested hierarchy of regular, context-free, mildly context-sensitive, and context-sensitive languages

We do not know whether there exists an independent characterization of the class of languages that includes the regular sets and languages derivable from them through reduplication, or what the time complexity of that class might be, but it currently looks as if this class might be relevant to the characterization of NL word-sets.

Motivated by Gazdar and Pullum (1985), this article aims to give a formal characterization of regular with copying languages. Specifically, it examines what minimal changes can be brought to regular languages to include stringsets with two adjacent copies, while excluding some typologically unattested context-free patterns, such as reversals, as shown in Fig. 2. One possible way to probe such a language class is by adding copying to the set of operations whose closure defines regular languages. Instead, the approach we take in this paper is to add reduplication to finite state automata (FSAs), which compute regular languages.


Various attempts have followed this vein:² one example is the finite state registered machine (FSRA) of Cohen-Sygal and Wintner (2006), with finitely many registers as its memory, limited in that it only models bounded copying. The state-of-the-art finite state machinery that computes unbounded copying elegantly and adequately is the 2-way finite state transducer (2-way FST), capturing reduplication as a string-to-string mapping (w → ww) (Dolatian and Heinz, 2018a,b, 2019, 2020). To avoid the mirror image function (w → ww^R), Dolatian and Heinz (2020) further developed subclasses of 2-way FSTs which cannot output anything during right-to-left passes over the input (cf. rotating transducers: Baschenis et al., 2017).

It should be noted that the issue addressed by 2-way FSTs is a different one: there, reduplication is modeled as a function (w → ww), while this paper focuses on a set of languages containing identical sub-strings (ww). The stringset question is non-trivial and well-motivated for reasons of both formal aspects and theoretical relevance. Firstly, since the studied 2-way FSTs are not readily invertible, how to get the inverse relation ww → w remains an open question, as acknowledged in Dolatian and Heinz (2020). Although this paper does not directly address this morphological analysis problem, recognizing which strings are reduplicated and belong to L_ww or any other copying language may be an important first step.³

As for the theoretical aspects, there are some attested forms of meaning-free reduplication in natural languages. Zuraw (2002) proposes aggressive reduplication in phonology: speakers are sensitive to phonological similarity between sub-strings within words, and reduplication-like structures are attributed to those words. It is still arguable whether those meaning-free reduplicative patterns of unbounded strings are generated via a morphological function or not. Overall, it is desirable to have models that help to detect sub-string identity within surface strings when those sub-strings are in the regular set.

This paper introduces a new computational device: finite state buffered machines (FSBMs). They are two-taped finite state automata, sensitive to copying activities within strings and hence able to detect identity between sub-strings. This paper is organized as follows: Section 2 provides a definition of FSBMs with examples. Then, to better understand the copying mechanism, complete-path FSBMs, which recognize exactly the same set of languages as general FSBMs, are highlighted. Section 3 examines the computational and mathematical properties of the set of languages recognized by complete-path FSBMs. Section 4 concludes with discussion and directions for future research.

²Some other examples, pursuing more linguistically sound and computationally efficient finite state techniques, are Walther (2000), Beesley and Karttunen (2000) and Hulden (2009). However, they fail to model unbounded copying. Roark and Sproat (2007), Cohen-Sygal and Wintner (2006) and Dolatian and Heinz (2020) provide more comprehensive reviews.

³Thanks to the reviewer for bringing this point up.

2 Finite State Buffered Machine

2.1 Definitions

FSBMs are two-taped automata with a finite-state core control. One tape stores the input, as in normal FSAs; the other serves as an unbounded memory buffer, storing reduplicants temporarily for future identity checking. Intuitively, FSBMs are an extension of FSRAs, but equipped with unbounded memory. In theory, an FSBM with a bounded buffer would be as expressive as an FSRA and can therefore be converted to an FSA.

The buffer interacts with the input in restricted ways: 1) the buffer is queue-like; 2) the buffer must work on the same alphabet as the input, unlike the stack in a pushdown automaton (PDA), for example; 3) once one symbol is removed from the buffer, everything else must also be wiped off before the buffer is available for further symbol addition. These restrictions together ensure the machine does not generate string reversals or other non-reduplicative non-regular patterns.

There are three possible modes for an FSBM M when processing an input: 1) in normal (N) mode, M reads symbols and transits between states, functioning as a normal FSA; 2) in buffering (B) mode, besides consuming symbols from the input and taking transitions among states, it adds a copy of each just-read symbol to the queue-like buffer, until it exits buffering (B) mode; 3) after exiting buffering (B) mode, M enters emptying (E) mode, in which M matches the stored symbols in the buffer against input symbols. When all buffered symbols have been matched, M switches back to normal (N) mode for another round of computation. Under the current augmentation, FSBMs can only capture local reduplication with two adjacent, completely identical copies. They cannot handle non-local reduplication, nor multiple reduplication.

179

Definition 1. A finite-state buffered machine (FSBM) is a 7-tuple 〈Σ, Q, I, F, G, H, δ〉 where

• Σ: a finite alphabet

• Q: a finite set of states

• I ⊆ Q: initial states

• F ⊆ Q: final states

• G ⊆ Q: states where the machine must enter buffering (B) mode

• H ⊆ Q: states visited while the machine is emptying the buffer

• G ∩ H = ∅

• δ ⊆ Q × (Σ ∪ {ε}) × Q: the state transitions according to a specific symbol

Specifying G and H states allows an FSBM to control which portions of a string are copied. To avoid complications, G and H are defined to be disjoint. In addition, states in H identify certain special transitions. Transitions between two H states check input-memory identity and consume symbols from both the input and the buffer. By contrast, transitions with at least one state not in H can be viewed as normal FSA transitions. In all, there are effectively two types of transitions in δ.

Definition 2. A configuration of an FSBM is a tuple D = (u, q, v, t) ∈ Σ* × Q × Σ* × {N, B, E}, where u is the remaining input string; q is the current state; v is the string in the buffer; and t is the current mode the machine is in.

Definition 3. Given an FSBM M and x ∈ (Σ ∪ {ε}), u, w, v ∈ Σ*, we define the relation a configuration D1 yields a configuration D2 in M (D1 ⊢M D2) as the smallest relation such that:⁴

• For every transition (q1, x, q2) with at least one of q1, q2 ∉ H:
  (xu, q1, ε, N) ⊢M (u, q2, ε, N), provided q1 ∉ G
  (xu, q1, v, B) ⊢M (u, q2, vx, B), provided q2 ∉ G

• For every transition (q1, x, q2) with q1, q2 ∈ H:
  (xu, q1, xv, E) ⊢M (u, q2, v, E)

• For every q ∈ G:
  (u, q, ε, N) ⊢M (u, q, ε, B)

• For every q ∈ H:
  (u, q, v, B) ⊢M (u, q, v, E)
  (u, q, ε, E) ⊢M (u, q, ε, N)

⁴Note that a machine cannot do both symbol consumption and mode changing at the same time.

Figure 3: An FSBM M1 with G = {q1} (diamond) and H = {q3} (square); dashed arcs are used only for the emptying process. L(M1) = {ww | w ∈ {a, b}*}

Figure 4: An FSBM M2 with G = {q1} and H = {q4}. L(M2) = {a^i b^j a^i b^j | i, j ≥ 1}

Definition 4. A run of an FSBM M on w is a sequence of configurations D0, D1, D2, ..., Dm such that 1) there is q0 ∈ I with D0 = (w, q0, ε, N); 2) there is qf ∈ F with Dm = (ε, qf, ε, N); 3) for all 0 ≤ i < m, Di ⊢M Di+1. The language recognized by an FSBM M is denoted by L(M); w ∈ L(M) iff there is a run of M on w.
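Definitions 1–4 can be prototyped directly as a search over configurations. The sketch below is our own illustration (the machine encoding and names are assumptions, not the paper's code); it explores the yields relation breadth-first and accepts exactly when some run ends at a final state in N mode with empty input and buffer:

```python
from collections import deque

# Hedged sketch: breadth-first search over FSBM configurations
# (remaining input, state, buffer, mode), following Definitions 1-4.
# delta is a set of triples (q1, x, q2), with x a symbol or "" (epsilon).
def fsbm_accepts(w, I, F, G, H, delta):
    start = {(w, q, "", "N") for q in I}
    seen, todo = set(start), deque(start)
    while todo:
        u, q, v, t = todo.popleft()
        if u == "" and v == "" and t == "N" and q in F:
            return True
        succs = []
        # mode changes (no symbol consumed)
        if t == "N" and q in G:
            succs.append((u, q, "", "B"))
        if t == "B" and q in H:
            succs.append((u, q, v, "E"))
        if t == "E" and q in H and v == "":
            succs.append((u, q, "", "N"))
        for (q1, x, q2) in delta:
            if q1 != q:
                continue
            if t == "E":
                # emptying: only H-H arcs, matching input against the buffer
                if q1 in H and q2 in H and u.startswith(x) and v.startswith(x):
                    succs.append((u[len(x):], q2, v[len(x):], "E"))
            elif not (q1 in H and q2 in H) and u.startswith(x):
                if t == "N" and q1 not in G:
                    succs.append((u[len(x):], q2, "", "N"))
                elif t == "B" and q2 not in G:
                    succs.append((u[len(x):], q2, v + x, "B"))
        for c in succs:
            if c not in seen:
                seen.add(c)
                todo.append(c)
    return False

# M2 from Figure 4: G = {q1}, H = {q4}; L(M2) = {a^i b^j a^i b^j | i, j >= 1}
M2 = dict(I={"q1"}, F={"q4"}, G={"q1"}, H={"q4"},
          delta={("q1", "a", "q2"), ("q2", "a", "q2"), ("q2", "b", "q3"),
                 ("q3", "b", "q3"), ("q3", "", "q4"),
                 ("q4", "a", "q4"), ("q4", "b", "q4")})
print(fsbm_accepts("abbabb", **M2), fsbm_accepts("ababb", **M2))  # True False
```

The search terminates because every reachable configuration is built from suffixes of w and buffers of bounded length, and the `seen` set blocks revisits.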

2.2 Examples

In all illustrations, G states are drawn with diamonds and H states are drawn with squares. The special transitions between H states are dashed.

Example 1. Total reduplication. Figure 3 offers an FSBM M1 for L_ww, with arbitrary strings over the alphabet Σ = {a, b} as candidate bases.

L_ww is the simplest representation of unbounded copying, but this language is somewhat structurally dull. For the rest of the illustration, we focus on the FSBM M2 in Figure 4. M2 recognizes the non-context-free {a^i b^j a^i b^j | i, j ≥ 1}. This language can be viewed as total reduplication added to the regular language {a^i b^j | i, j ≥ 1} (recognized by the FSA M0 in Figure 5).

State q1 is an initial state and, more importantly, a G state, forcing M2 to enter B mode before it takes any arcs and transits to other states. Then, M2 in B mode always keeps a copy of the consumed input symbols until it proceeds to q4, an H state. State q4 requires M2 to stop buffering and switch to E mode in order to check for string identity. Using the special transitions between H states (in this case, the a and b loops on state q4), M2 checks whether the stored symbols in the buffer match the remaining input. If so, after emitting all symbols from the buffer, M2 with a blank buffer can switch to N mode. It eventually ends at state q4, a legal final state. Figure 6 gives a complete run of M2 on the string "abbabb". Figure 7 shows that M2 rejects the non-totally-reduplicated string "ababb", since a final configuration cannot be reached.

Figure 5: An FSA M0 with L(M0) = {a^i b^j | i, j ≥ 1}

Example 3. Partial reduplication. Assume Σ = {b, t, k, ng, l, i, a}; the FSBM M3 in Figure 8 serves as a model of the two Agta CVC-reduplicated plurals in Table 1.

Given that the initial state q1 is in G, M3 has to enter B mode before it takes any transitions. In B mode, M3 transits to a plain state q2, consuming an input consonant and keeping it in the buffer. Similarly, M3 transits to a plain state q3 and then to q4. When M3 first reaches q4, the buffer contains a CVC sequence. q4, an H state, urges M3 to stop buffering and enter E mode. Using the special transitions between H states (in this case, the loops on q4), M3 matches the CVC in the buffer with the remaining input. Then, M3 with a blank buffer can switch to N mode at q4. In N mode, M3 loses access to the loops on q4, as they are available only in E mode. It transits to q5 to process the rest of the input by the normal transitions on q5. A successful run should end at q5, the only final state. Figure 9 gives a complete run of M3 on the string "taktakki".

2.3 Complete-path FSBMs

As shown in the definitions and the examples above, an FSBM is supposed to end in N mode after processing an input. There are two possible scenarios for a run to meet this requirement: either never entering B mode or undergoing full cycles of N, B, E, N mode changes. The corresponding languages reflect either no copying (functioning as plain FSAs) or full copying, respectively.

In any specific run, it is the states that inform an FSBM M of its modality. The first time M reaches a G state, it has to enter B mode, and it keeps buffering as it transits between plain states. The first time it reaches an H state, M is supposed to enter E mode and to transit only between H states while in E mode. Hence, to go through full cycles of mode changes, once M reaches a G state and switches to B mode, it has to encounter some H states later to be put in E mode. To allow us to reason about only the "useful" arrangements of G and H states, we impose an ordering requirement on G and H states along a path in a machine and define a complete path.

Definition 5. A path s from an initial state to a final state in a machine is said to be complete if

1. for each H state in s, there is a preceding G state;

2. once a G state is in s, s must contain at least one H state following that G state;

3. in between a G state and the first following H state there are only plain states.

Schematically, with P representing the non-G, non-H plain states and I, F representing initial and final states respectively, the regular expression denoting the state information along a path s should be of the form I(P*GP*HH*P* | P*)*F.
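This schema can be checked mechanically. The sketch below is our own illustration: each state along a path is labeled I, F, G, H, or P, and the label string is tested against the pattern from Definition 5:

```python
import re

# Check whether the state labels along a path match the complete-path schema
# I(P*GP*HH*P*|P*)*F. Labels: I = initial, F = final, G, H, P = plain state.
COMPLETE = re.compile(r"I(P*GP*HH*P*|P*)*F")

def is_complete(labels: str) -> bool:
    return COMPLETE.fullmatch(labels) is not None

print(is_complete("IGPHHPF"))  # True: a full N, B, E, N cycle
print(is_complete("IPHF"))     # False: an H state with no preceding G
print(is_complete("IGPF"))     # False: a G state never followed by an H
```

The two rejected label strings correspond to the incomplete-path cases discussed below Definition 6.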

Definition 6. A complete-path finite state buffered machine is an FSBM in which all possible paths are complete.

The example FSBMs provided so far (Figures 3, 4 and 8) are all complete-path FSBMs. For the rest of this section, we describe several cases of an incomplete path in a machine M.

No H states. When a G state does not have any reachable H state following it, there is no complete run, since M always stays in B mode.

No H states in between two G states. When a G state q0 has to transit to another G state q′0 before any H states, M cannot go to q′0, for M would enter B mode at q0, while transiting to another G state in B mode is ill-defined.

H states first. When M has to follow a path containing two consecutive H states before any G state, it would clash in the end, because the transitions between two H states can only be used in E mode. However, it is impossible to enter E mode without entering B mode, which is enforced by some G state.

It should be emphasized that M in N mode can pass through one (and only one) H state to another plain state. For instance, the language of the FSBM


Used Arc          State Info   Configuration
1.  N/A           q1 ∈ I       (abbabb, q1, ε, N)
2.  N/A           q1 ∈ G       (abbabb, q1, ε, B)   buffering triggered by q1 and empty buffer
3.  (q1, a, q2)   q2 ∉ G       (bbabb, q2, a, B)
4.  (q2, b, q3)                (babb, q3, ab, B)
5.  (q3, b, q3)                (abb, q3, abb, B)
6.  (q3, ε, q4)                (abb, q4, abb, B)    emptying triggered by q4
7.  N/A                        (abb, q4, abb, E)
8.  (q4, a, q4)                (bb, q4, bb, E)
9.  (q4, b, q4)                (b, q4, b, E)
10. (q4, b, q4)   q4 ∈ H       (ε, q4, ε, E)        normal mode triggered by q4 and empty buffer
11. N/A           q4 ∈ F       (ε, q4, ε, N)

Figure 6: M2 in Figure 4 accepts abbabb

Used Arc          State Info   Configuration
1. N/A            q1 ∈ I       (ababb, q1, ε, N)
2. N/A            q1 ∈ G       (ababb, q1, ε, B)    buffering triggered by q1 and empty buffer
3. (q1, a, q2)    q2 ∉ G       (babb, q2, a, B)
4. (q2, b, q3)                 (abb, q3, ab, B)
5. (q3, ε, q4)                 (abb, q4, ab, B)     emptying triggered by q4
6. N/A                         (abb, q4, ab, E)
7. (q4, a, q4)                 (bb, q4, b, E)
8. (q4, b, q4)    q4 ∈ H       (b, q4, ε, E)        normal mode triggered by q4 and empty buffer
9. N/A                         (b, q4, ε, N)
Clash: no available transition consumes the remaining input.

Figure 7: M2 in Figure 4 rejects ababb

Figure 8: An FSBM M3 for Agta CVC-reduplicated plurals, over Σ = {b, t, k, ng, l, i, a}, with G = {q1} and H = {q4}

Used Arc          State Info   Configuration
1.  N/A           q1 ∈ I       (taktakki, q1, ε, N)
2.  N/A           q1 ∈ G       (taktakki, q1, ε, B)   buffering triggered by q1 and empty buffer
3.  (q1, t, q2)   q2 ∉ G       (aktakki, q2, t, B)
4.  (q2, a, q3)                (ktakki, q3, ta, B)
5.  (q3, k, q4)                (takki, q4, tak, B)    emptying triggered by q4
6.  N/A                        (takki, q4, tak, E)
7.  (q4, t, q4)                (akki, q4, ak, E)
8.  (q4, a, q4)                (kki, q4, k, E)
9.  (q4, k, q4)   q4 ∈ H       (ki, q4, ε, E)         normal mode triggered by q4 and empty buffer
10. N/A                        (ki, q4, ε, N)
11. (q4, ε, q5)                (ki, q5, ε, N)
12. (q5, k, q5)                (i, q5, ε, N)
13. (q5, i, q5)   q5 ∈ F       (ε, q5, ε, N)

Figure 9: M3 in Figure 8 accepts taktakki


Figure 10: An incomplete FSBM M4 with G = ∅ and H = {q2, q4}; L(M4) = {abba}

Figure 11: An FSA (equivalently, an FSBM with G = ∅ and H = ∅) whose language is the same as that of M4 in Figure 10

M4 in Figure 10 is equivalent to the language recognized by the FSA in Figure 11. M4 remains an incomplete FSBM because it does not have any G state preceding the H states q2 and q4.

The languages recognized by complete-path FSBMs are precisely the languages recognized by general FSBMs. One key observation is that the language recognized by a machine is the union of the languages along all possible paths. The validity of this statement then builds on the different incomplete cases of G and H states along a path: they either recognize the empty-set language or show equivalence to finite state machines. Therefore, the language along an incomplete path of the machine is still in the regular set. Only a complete path, containing at least one well-arranged G...HH* sequence, uses the copying power and extends the regular languages. Therefore, in the next section, we focus on complete-path FSBMs.

3 Some closure properties of FSBMs

In this section, we show some closure properties of complete-path FSBM-recognizable languages and their linguistic relevance. Section 3.1 discusses closure under intersection with regular languages; Section 3.2 shows the class is closed under homomorphism; Section 3.3 briefly mentions union, concatenation, and Kleene star. These operations are of special interest because they are the regular operations defining regular expressions (Sipser, 2013, 64). That complete-path FSBMs are closed under the regular operations leads to a conjecture that the set of languages recognized by the new automata is equivalent to the set of languages denoted by a version of regular expressions with copying added.

Noticeably, given that FSBMs are FSAs with a copying mechanism, the proof ideas in this section are similar to the corresponding proofs for FSAs, which can be found in Hopcroft and Ullman (1979) and Sipser (2013).

3.1 Intersection with FSAs

Theorem 1. If L1 is a complete-path FSBM-recognizable language and L2 is a regular language, then L1 ∩ L2 is a complete-path FSBM-recognizable language.

In other words, if L1 is a language recognized by a complete-path FSBM M1 = ⟨Q1, Σ, I1, F1, G1, H1, δ1⟩, and L2 is a language recognized by an FSA M2 = ⟨Q2, Σ, I2, F2, δ2⟩, then L1 ∩ L2 is a language recognizable by another complete-path FSBM. It is easy to construct an intersection machine M = ⟨Q, Σ, I, F, G, H, δ⟩ with 1) Q = Q1 × Q2; 2) I = I1 × I2; 3) F = F1 × F2; 4) G = G1 × Q2; 5) H = H1 × Q2; 6) ((q1, q′1), x, (q2, q′2)) ∈ δ iff (q1, x, q2) ∈ δ1 and (q′1, x, q′2) ∈ δ2. Paths in M inherit completeness from M1 under this construction. Then L(M) = L1 ∩ L2, as M simulates L1 ∩ L2 by running M1 and M2 simultaneously: M accepts w if and only if both M1 and M2 accept w.
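The product construction just described translates almost line-for-line into code. In the sketch below, the dictionary encoding of machines (state sets plus a set of transition triples) is our own illustrative choice; only the product logic itself follows the definition in the text.

```python
from itertools import product

def intersect(fsbm, fsa):
    """Product construction: complete-path FSBM x FSA -> complete-path FSBM.

    fsbm: dict with keys Q, I, F, G, H, delta (delta: set of (p, x, q) triples)
    fsa:  dict with keys Q, I, F, delta
    """
    # A pair transitions on x iff both component machines do.
    delta = {((p1, p2), x, (q1, q2))
             for (p1, x, q1) in fsbm["delta"]
             for (p2, y, q2) in fsa["delta"] if x == y}
    return {
        "Q": set(product(fsbm["Q"], fsa["Q"])),
        "I": set(product(fsbm["I"], fsa["I"])),
        "F": set(product(fsbm["F"], fsa["F"])),
        # G and H membership is inherited from the FSBM component only:
        # G = G1 x Q2 and H = H1 x Q2, exactly as in the construction.
        "G": set(product(fsbm["G"], fsa["Q"])),
        "H": set(product(fsbm["H"], fsa["Q"])),
        "delta": delta,
    }
```

Because the buffering states G and H are copied over wholesale from M1, completeness of paths is preserved, as the text notes.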

By nature, FSAs can be viewed as FSBMs without copying: any FSA can be converted to an FSBM with an empty G set, an empty H set, and trivially no special transitions between H states.

That FSBM-recognizable languages are closed under intersection with regular languages is of great relevance to phonological theory: assume a natural language X imposes backness vowel harmony, which can be modeled by an FSA MVH. In addition, this language also requires phonological strings of certain forms to be reduplicated, which can be modeled by an FSBM MRED. One can thereby construct another FSBM MRED+VH to enforce both backness vowel harmony and the total identity of sub-strings in those forms. Not limited to harmony systems, phonotactics other than identity of sub-strings are regular (Heinz, 2018), indicating that almost all phonological markedness constraints can be modeled by FSAs. When FSBMs intersect with FSAs computing those phonotactic restrictions, the resulting formalism is still an FSBM, not some other grammar with higher computational power. Thus, FSBMs can model natural language phonotactics even once surface sub-string identity is included.


Figure 12: An FSBM M5 on the alphabet {C, V} such that L(M5) = h(L(M3)), with M3 in Figure 8.

3.2 Homomorphism and inverse alphabetic homomorphism

Definition 7. A (string) homomorphism is a function mapping one alphabet to strings of another alphabet, written h : Σ → ∆∗. We can extend h to operate on strings over Σ∗ such that 1) h(εΣ) = ε∆; 2) ∀a ∈ Σ, h(a) ∈ ∆∗; 3) for w = a1a2 . . . an ∈ Σ∗, h(w) = h(a1)h(a2) . . . h(an), where each ai ∈ Σ. An alphabetic homomorphism h0 is a special homomorphism with h0 : Σ → ∆.

Definition 8. Given a homomorphism h : Σ → ∆∗ and L1 ⊆ Σ∗, L2 ⊆ ∆∗, define h(L1) = {h(w) | w ∈ L1} ⊆ ∆∗ and h−1(L2) = {w | h(w) ∈ L2} ⊆ Σ∗.
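Definitions 7 and 8 are directly executable. The sketch below represents words as sequences of symbols, an encoding choice of ours (so that multi-character symbols such as ng are handled correctly), and computes the image h(L1) for a finite sample of L1.

```python
def extend(h):
    """Extend a symbol homomorphism h: Sigma -> Delta* to strings (Definition 7):
    h(eps) = eps and h(a1 a2 ... an) = h(a1) h(a2) ... h(an).
    Words are sequences of symbols, so a digraph symbol like 'ng' is one unit."""
    return lambda w: "".join(h[a] for a in w)

# The consonant/vowel abstraction used in Section 3.2.
h = {"b": "C", "t": "C", "k": "C", "ng": "C", "l": "C", "i": "V", "a": "V"}
h_star = extend(h)

def image(h, sample):
    """h(L1) = { h(w) | w in L1 } (Definition 8), over a finite sample of L1."""
    f = extend(h)
    return {f(w) for w in sample}
```

For example, h_star applied to the symbol sequence b-a-ng yields the CV skeleton "CVC".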

Theorem 2. The set of complete-path FSBM-recognizable languages is closed under homomorphisms.

Theorem 2 can be proved by constructing a new machine Mh based on M. The informal intuition goes as follows: relabel the old arcs with their mapped strings and add states to split the arcs so that there is only one symbol or ε on each arc in Mh. When there are multiple symbols on normal arcs, the newly added states can only be plain non-G, non-H states. For multiple symbols on the special arcs between two H states, the newly added states must be H states. Again, under this construction, complete paths in M lead to newly constructed complete paths in Mh.

The fact that complete-path FSBMs guarantee closure under homomorphism allows theorists to perform analyses at certain levels of abstraction of certain symbol representations. Consider two alphabets Σ = {b, t, k, ng, l, i, a} and ∆ = {C, V} with a homomorphism h mapping every consonant (b, t, k, ng, l) to C and every vowel (i, a) to V. As illustrated by M3 on alphabet Σ (Figure 8) and M5 on alphabet ∆ (Figure 12), an FSBM-definable pattern on Σ maps to another FSBM-definable pattern on ∆.

We conjecture that the set of languages recognized by complete-path FSBMs is not closed under inverse alphabetic homomorphisms, and thus not under inverse homomorphisms. Consider a complete-path FSBM-recognizable language L = {a^i b^j a^i b^j | i, j ≥ 1} (cf. Figure 4). Consider an alphabetic homomorphism h : {0, 1, 2} → {a, b}∗ such that h(0) = a, h(1) = a and h(2) = b. Then h−1(L) = {(0|1)^i 2^j (0|1)^i 2^j | i, j ≥ 1} seems to be challenging for FSBMs. Finite-state machines cannot handle the incurred crossing dependencies, while the augmented copying mechanism only contributes to recognizing identical copies, not general cases of symbol correspondence.5

3.3 Other closure properties

Union Assume there are complete-path FSBMs M1 and M2 such that L(M1) = L1 and L(M2) = L2; then L1 ∪ L2 is a complete-path FSBM-recognizable language. One can construct a new machine M that accepts an input w if either M1 or M2 accepts w. The construction keeps M1 and M2 unchanged but adds a new plain state q0. Now q0 becomes the only initial state, branching into the previous initial states of M1 and M2 with ε-arcs. In this way, the new machine guesses whether M1 or M2 accepts the input. If one accepts w, M will accept w, too.

Concatenation Assume there are complete-path FSBMs M1 and M2 such that L(M1) = L1 and L(M2) = L2; then there is a complete-path FSBM M that can recognize L1L2 by normal concatenation of the two automata. The new machine adds a new plain state q0 and makes q0 the only initial state, branching into the previous initial states of M1 with ε-arcs. All final states in M2 are the only final states in M. Besides, the new machine adds ε-arcs from any old final state in M1 to any possible initial state in M2. A path in the resulting machine is guaranteed to be complete because it is essentially the concatenation of two complete paths.

Kleene Star Assume there is a complete-path FSBM M1 such that L(M1) = L1; then L1∗ is a complete-path FSBM-recognizable language. A new automaton M is similar to M1 but has a new initial state q0, branching into the old initial states of M1. q0 is also a final state, so M accepts the empty string ε. q0 is never a G state nor an H state. Moreover, to make sure M can jump back to an initial state after it hits a final state, ε-arcs from every final state to the old initial states are added.

5 The statement on the inverse homomorphism closure is left as a conjecture. We admit that a more formal and rigorous mathematical proof showing h−1(L) is not complete-path FSBM-recognizable should be conducted. To achieve this goal, a more formal tool, such as a pumping lemma developed for the corresponding set of languages, is important.
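The three ε-arc constructions can be sketched uniformly. The encoding below is our own (machines as dictionaries, ε as the empty-string arc label); the G and H sets are omitted for brevity, since the constructions leave them unchanged apart from the new plain state q0.

```python
EPS = ""  # empty-string label on an arc

def disjoint(m, tag):
    """Rename states so two machines never share state names."""
    r = lambda q: (tag, q)
    return {"Q": {r(q) for q in m["Q"]}, "I": {r(q) for q in m["I"]},
            "F": {r(q) for q in m["F"]},
            "delta": {(r(p), x, r(q)) for (p, x, q) in m["delta"]}}

def union(m1, m2):
    """New plain initial state q0 with eps-arcs into both machines' old starts."""
    a, b = disjoint(m1, "1"), disjoint(m2, "2")
    return {"Q": a["Q"] | b["Q"] | {"q0"}, "I": {"q0"},
            "F": a["F"] | b["F"],
            "delta": a["delta"] | b["delta"]
                     | {("q0", EPS, q) for q in a["I"] | b["I"]}}

def concat(m1, m2):
    """eps-arcs from M1's final states into M2's initial states."""
    a, b = disjoint(m1, "1"), disjoint(m2, "2")
    return {"Q": a["Q"] | b["Q"], "I": a["I"], "F": b["F"],
            "delta": a["delta"] | b["delta"]
                     | {(f, EPS, i) for f in a["F"] for i in b["I"]}}

def star(m1):
    """New initial-and-final q0; eps-arcs from finals back to old starts."""
    a = disjoint(m1, "1")
    return {"Q": a["Q"] | {"q0"}, "I": {"q0"}, "F": a["F"] | {"q0"},
            "delta": a["delta"] | {("q0", EPS, q) for q in a["I"]}
                     | {(f, EPS, q) for f in a["F"] for q in a["I"]}}
```

Each function mirrors the prose exactly: a fresh plain state is introduced where the text introduces q0, and ε-arcs are the only new transitions.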

4 Discussion and conclusion

In summary, this paper provides a new computational device to compute unrestricted total reduplication on any regular language, including the simplest copying language Lww where w can be any arbitrary string of an alphabet. As a result, it introduces a new class of languages incomparable to CFLs. This class of languages allows unbounded copying without generating non-reduplicative non-regular patterns: we hypothesize context-free string reversals are excluded since the buffer is queue-like. Meanwhile, the MCS Swiss-German cross-serial dependencies, abstracted as {a^i b^j c^i d^j | i, j ≥ 1}, are also excluded, since the buffer works on the same alphabet as the input tape and only matches identical sub-strings.

Following the sub-classes of 2-way FSTs in Dolatian and Heinz (2018a,b, 2019, 2020), which successfully capture unbounded copying as functions while excluding the mirror-image mapping, complete-path FSBMs successfully capture the total-reduplicated stringsets while excluding string reversals. Comparison between the languages characterized in this paper and the image of the functions in Dolatian and Heinz (2020) should be further carried out to build the connection. Moreover, one natural next step is to extend FSBMs as acceptors to finite-state buffered transducers (FSBTs). Our intuition is that FSBTs would be helpful in handling the morphological analysis question (ww → w), a not-yet-solved problem for the 2-way FSTs that Dolatian and Heinz (2020) study. After reading the first w in the input and buffering this chunk of string in memory, the transducer can output ε for each matched symbol when transiting among H states.

Another potential area of research is applying this new machinery to Primitive Optimality Theory (Eisner, 1997; Albro, 1998). Albro (2000, 2005) used weighted finite-state machines to model constraints while representing the set of candidates by Multiple Context-Free Grammars to enforce base-reduplicant correspondence (McCarthy and Prince, 1995). Parallel to Albro's approach, given that complete-path FSBMs are intersectable with FSAs, it is possible to computationally implement the reduplicative identity requirement with complete-path FSBMs without using the full power of mildly context-sensitive formalisms. To achieve this goal, future work should consider developing an efficient algorithm that intersects complete-path FSBMs with weighted FSAs.

The present paper is the first step toward recognizing reduplicated forms in adequate yet more restrictive models and techniques compared to MCS formalisms. There are some limitations of the current approach with respect to the whole typology of reduplication. Complete-path FSBMs can only capture local reduplication with two adjacent identical copies. As for non-local reduplication, the modification should be straightforward: the machines need to allow the filled buffer in N mode (or in another newly-defined memory-holding mode) and match strings only when needed. As for multiple reduplication, complete-path FSBMs can easily be modified to include multiple copies of the same base form ({w^n | w ∈ Σ∗, n ∈ N}) but cannot be easily modified to recognize the non-semilinear language containing copies of the copy ({w^(2^n) | w ∈ Σ∗, n ∈ N}). The computational nature of multiple reduplication remains an open question. Last but not least, as a reviewer points out, recognizing non-identical copies can be achieved by storing or emptying not exactly the same input symbols, but symbols mapped according to some function f. Under this modification, the new automata would recognize {a^n b^n | n ∈ N} with f(a) = b but still exclude string reversals. In all, detailed investigations on how to modify complete-path FSBMs should be the next step to complete the typology.

Acknowledgments

The author would like to thank Tim Hunter, Bruce Hayes, Dylan Bumford, Kie Zuraw, and the members of the UCLA Phonology Seminar for their feedback and suggestions. Special thanks to the anonymous reviewers for their constructive comments and discussions. All errors remain my own.

References

Daniel M. Albro. 1998. Evaluation, implementation, and extension of primitive Optimality Theory. Master's thesis, UCLA.

Daniel M. Albro. 2000. Taking primitive Optimality Theory beyond the finite state. In Proceedings of the Fifth Workshop of the ACL Special Interest Group in Computational Phonology, pages 57–67, Centre Universitaire, Luxembourg. International Committee on Computational Linguistics.

Daniel M. Albro. 2005. Studies in computational Optimality Theory, with special reference to the phonological system of Malagasy. Ph.D. thesis, University of California, Los Angeles, Los Angeles.

Bruce Bagemihl. 1989. The crossing constraint and 'backwards languages'. Natural Language & Linguistic Theory, 7(4):481–549.

Felix Baschenis, Olivier Gauwin, Anca Muscholl, and Gabriele Puppis. 2017. Untwisting two-way transducers in elementary time. In 2017 32nd Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pages 1–12.

Kenneth R. Beesley and Lauri Karttunen. 2000. Finite-state non-concatenative morphotactics. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 191–198, Hong Kong. Association for Computational Linguistics.

Jane Chandlee. 2014. Strictly local phonological processes. Ph.D. thesis, University of Delaware.

Jane Chandlee. 2017. Computational locality in morphological maps. Morphology, 27:599–641.

Noam Chomsky. 1956. Three models for the description of language. IRE Trans. Inf. Theory, 2:113–124.

Alexander Clark and Ryo Yoshinaka. 2014. Distributional learning of parallel multiple context-free grammars. Mach. Learn., 96(1–2):5–31.

Yael Cohen-Sygal and Shuly Wintner. 2006. Finite-state registered automata for non-concatenative morphology. Computational Linguistics, 32(1):49–82.

Christopher Culy. 1985. The complexity of the vocabulary of Bambara. Linguistics and Philosophy, 8(3):345–351.

Robert M. W. Dixon. 1972. The Dyirbal Language of North Queensland, volume 9 of Cambridge Studies in Linguistics. Cambridge University Press, Cambridge.

Hossep Dolatian and Jeffrey Heinz. 2018a. Learning reduplication with 2-way finite-state transducers. In Proceedings of the 14th International Conference on Grammatical Inference, volume 93 of Proceedings of Machine Learning Research, pages 67–80. PMLR.

Hossep Dolatian and Jeffrey Heinz. 2018b. Modeling reduplication with 2-way finite-state transducers. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 66–77, Brussels, Belgium. Association for Computational Linguistics.

Hossep Dolatian and Jeffrey Heinz. 2019. RedTyp: A database of reduplication with computational models. In Proceedings of the Society for Computation in Linguistics (SCiL) 2019, pages 8–18.

Hossep Dolatian and Jeffrey Heinz. 2020. Computing and classifying reduplication with 2-way finite-state transducers. Journal of Language Modelling, 8(1):179–250.

Jason Eisner. 1997. Efficient generation in primitive Optimality Theory. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–320, Madrid, Spain. Association for Computational Linguistics.

Gerald Gazdar and Geoffrey K. Pullum. 1985. Computationally relevant properties of natural languages and their grammars. New Generation Computing, 3(3):273–306.

Thomas Graf. 2017. The power of locality domains in phonology. Phonology, 34(2):385–405.

Phyllis M. Healey. 1960. An Agta Grammar. Bureau of Printing, Manila.

Jeffrey Heinz. 2007. The Inductive Learning of Phonotactic Patterns. Ph.D. thesis, University of California, Los Angeles.

Jeffrey Heinz. 2018. The computational nature of phonological generalizations. In Larry Hyman and Frans Plank, editors, Phonological Typology, Phonetics and Phonology, chapter 5, pages 126–195. De Gruyter Mouton.

Jeffrey Heinz, Chetan Rawal, and Herbert G. Tanner. 2011. Tier-based strictly local constraints for phonology. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 58–64.

John E. Hopcroft and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, NY.

Mans Hulden. 2009. Finite-state Machine Construction Methods and Algorithms for Phonology and Morphology. Ph.D. thesis, University of Arizona, Tucson, USA.

Sharon Inkelas and Cheryl Zoll. 2005. Reduplication: Doubling in Morphology, volume 106. Cambridge University Press.

Gerhard Jäger and James Rogers. 2012. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1598):1956–1970.

C. Douglas Johnson. 1972. Formal Aspects of Phonological Description. Monographs on Linguistic Analysis. Mouton, The Hague.

Aravind K. Joshi. 1985. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? Studies in Natural Language Processing, pages 206–250. Cambridge University Press.

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Alec Marantz. 1982. Re reduplication. Linguistic Inquiry, 13(3):435–482.

John J. McCarthy and Alan S. Prince. 1995. Faithfulness and reduplicative identity. In Jill N. Beckman, Laura Walsh Dickey, and Suzanne Urbanczyk, editors, Papers in Optimality Theory. GLSA (Graduate Linguistic Student Association), Dept. of Linguistics, University of Massachusetts, Amherst, MA.

Robert McNaughton and Seymour A. Papert. 1971. Counter-Free Automata (MIT Research Monograph No. 65). The MIT Press.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax, volume 4. Oxford University Press.

Carl Rubino. 2013. Reduplication. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229.

Stuart M. Shieber. 1985. Evidence against the context-freeness of natural language. In Philosophy, Language, and Artificial Intelligence, pages 79–89. Springer.

Imre Simon. 1975. Piecewise testable events. In Automata Theory and Formal Languages, pages 214–222, Berlin, Heidelberg. Springer Berlin Heidelberg.

Michael Sipser. 2013. Introduction to the Theory of Computation, third edition. Course Technology, Boston, MA.

Edward Stabler. 1997. Derivational minimalism. In Logical Aspects of Computational Linguistics, pages 68–95, Berlin, Heidelberg. Springer Berlin Heidelberg.

Markus Walther. 2000. Finite-state reduplication in one-level prosodic morphology. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Kie Zuraw. 2002. Aggressive reduplication. Phonology, 19(3):395–439.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 188–197, August 5, 2021. ©2021 Association for Computational Linguistics

An FST morphological analyzer for the Gitksan language

Clarissa Forbesα, Garrett Nicolaiβ, Miikka Silfverbergβ
αUniversity of Arizona βUniversity of British Columbia

[email protected] [email protected]

Abstract

This paper presents a finite-state morphological analyzer for the Gitksan language. The analyzer draws from a 1250-token Eastern-dialect wordlist. It is based on finite-state technology and additionally includes two extensions which can provide analyses for out-of-vocabulary words: rules for generating predictable dialect variants, and a neural guesser component. The pre-neural analyzer, tested against interlinear-annotated texts from multiple dialects, achieves coverage of 75–81% and maintains high precision (95–100%). The neural extension improves coverage at the cost of lowered precision.

1 Introduction

Endangered languages of the Americas are typically underdocumented and underresourced. Computational tools like morphological analyzers present the opportunity to speed up ongoing documentation efforts by enabling automatic and semi-automatic data analysis. This paper describes the development of a morphological analyzer for Gitksan, an endangered Indigenous language of Western Canada. The analyzer is capable of providing the base form and morphosyntactic description of inflected word forms: a word gupdiit ‘they ate’ is annotated gup-TR-3PL.

Our Gitksan analyzer is based on two core documentary resources: a wordlist spanning approximately 1250 tokens, and an 18,000-token interlinear-annotated text collection. Due to the scarcity of available lexical and corpus resources, we take a rule-based approach to modeling the morphology, which is less dependent on large datasets than machine learning methods. Our analyzer is based on finite-state technology (Beesley and Karttunen, 2003) using the foma finite-state toolkit (Hulden, 2009b).

Our work has three central goals: (1) We want to build a flexible morphological analyzer to supplement lexical and textual resources in support of language learning. Such an analyzer can support learners in identifying the base-form of inflected words where the morpheme-to-word ratio might be particularly high, in a way not addressed by a traditional dictionary. It may also productively generate inflected forms of words. (2) We want to facilitate ongoing efforts to expand the aforementioned 1250-token wordlist into a broad-coverage dictionary of the Gitksan language. Running our analyzer on Gitksan texts, we can rapidly identify word forms whose base-form has not yet been documented. An analyzer can also help automate the process of identifying sample sentences for dictionary words, the addition of which substantially increases the value of the dictionary. (3) We want to use the model to further our understanding of Gitksan morphology. Unanalyzable and erroneously analyzed forms can help us identify shortcomings in our description of the morphological system and can thus feed back into the documentation effort of the language.

The Gitksan-speaking community recognizes two dialects: Eastern (Upriver) and Western (Downriver). Our analyzer is based on resources which mainly represent the Eastern dialect. Consequently, our base analyzer achieves higher coverage of 71% for the Eastern dialect as measured on a manually annotated test set. For the Western dialect, coverage is lower at 53%. In order to improve coverage on the Western variety, we explore two extensions to our analyzer. First, we implement a number of dialectal relaxation rules which model the orthographic variation between Eastern and Western dialects. This leads to sizable improvements in coverage for the Western dialect (around 9%-points on types and 6%-points on tokens). Moreover, the precision of our analyzer remains high for both the Eastern and Western dialects even after applying dialect rules. Secondly, we extend our FST morphological analyzer by adding a data-driven neural guesser which further improves coverage for both the Eastern and Western varieties.

2 The Gitksan Language

The Gitxsan are one of the indigenous peoples of British Columbia, Canada. Their traditional territories consist of upwards of 50,000 square kilometers of land along the Skeena River in the BC northern interior. The Gitksan language is the easternmost member of the Tsimshianic family, which spans the entirety of the Skeena and Nass River watersheds to the Pacific Coast. Today, Gitksan is the most vital Tsimshianic language, but is still critically endangered with an estimated 300–850 speakers (Dunlop et al., 2018).

The Tsimshianic family can be broadly understood as a dialect continuum, with each village along these rivers speaking somewhat differently from its neighbors up- or downstream, and the two endpoints being mutually unintelligible. The six Gitxsan villages are commonly divided into two dialects: East/Upriver and West/Downriver. The dialects have some lexical and phonological differences, with the most prominent being a vowel shift. Consider the name of the Skeena River: Xsan, Ksan (Eastern) vs. Ksen (Western).
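The Eastern/Western correspondence illustrated by Ksan vs. Ksen is the kind of predictable variation that dialect-relaxation rules can target. The toy sketch below generates candidate cross-dialect spellings by a bare a → e substitution; this is a deliberate oversimplification of ours for illustration, not the actual rule set of the analyzer.

```python
import re

def western_candidates(eastern: str) -> set:
    """Generate candidate Western-dialect spellings for an Eastern form by
    swapping each 'a' (one at a time) for 'e', reflecting the dialect vowel
    shift. Illustrative only: the real correspondence is context-sensitive."""
    candidates = {eastern}  # the unshifted form is always a candidate
    for m in re.finditer("a", eastern):
        i = m.start()
        candidates.add(eastern[:i] + "e" + eastern[i + 1:])
    return candidates
```

Relaxation rules like this trade precision for coverage: every generated variant is looked up, so overgeneration only matters if a spurious variant collides with a real lexical entry.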

2.1 Morphological description

The Gitksan language has strict VSO word order and multifunctional, fusional morphology (Rigsby, 1986). It utilizes prefixation, suffixation, and both pro- and encliticization. Category derivation and number marking are prefixal, while markers of argument structure, transitivity, and person/number agreement are suffixal.

The Tsimshianic languages have been described as having word-complexity similar to German (Tarpent, 1987). The general structure of a noun or verb stem is presented in the template in Figure 1. A stem consists of minimally a root (typically CVC); an example is monomorphemic gup ‘eat’. Stems may also include derivational prefixes or transitivity-related suffixes; compare gupxw ‘be eaten; be edible’.

In sentential context, stems are inflected for features like transitivity and person/number. Our analyzer is concerned primarily with stem-external inflection and cliticization. The structure of stem-external morphology for the most complex word type, a transitive verb, is schematized in the template in Figure 2; an example word with all these slots filled would be ’naagask’otsdiitgathl ‘apparently they cut.PL open (common noun)’.

On the left edge of the stem can appear any number of modifying ‘proclitics’. These contribute locative, adjectival, and manner-related information to a noun or verb, often producing semi- or non-compositional idioms in a similar fashion to Germanic particle verbs.1 It is often unclear whether these proclitics constitute part of the root or stem, or if they are distinct words entirely. The orthographic boundaries on this edge are consequently sometimes fuzzy. Sometimes clear contrasts are presented, as with the sequence lax-yip ‘on-earth’: we see compositional lax yip ‘on the ground’ versus lexicalized laxyip ‘land, territory’. However, the boundary between compositional and idiomatic is not always so obvious, as in examples like (1).

(1) a. saa-’witxw (away-come, ‘come from’)
b. k’ali-aks (upstream-water, ‘upriver’)
c. xsi-ga’a (out-see, ‘choose’)
d. luu-no’o (in-hole, ‘annihilate’)

Inflectional morphology largely appears on the right edge of the stem. The main complexity of Gitksan inflection involves homophony and opacity: a similar or identical wordform often has multiple possible analyses. For example, a word like gubin transparently involves a stem gup ‘eat’ and a 2SG suffix -n, but the intervening vowel i might be analyzed as epenthetic, as transitive inflection (TR), or as a specially-induced transitivizing suffix (T), resulting in three possible analyses in (2). Similarly, a word gupdiit involves the same stem gup ‘eat’ and a 3PL suffix -diit, but this suffix is able to delete preceding transitive suffixes, resulting in four possible analyses as in (3).

(2) gubin
a. gup-2SG
b. gup-TR-2SG
c. gup-T-2SG

(3) gupdiit
a. gup-3PL
b. gup-TR-3PL
c. gup-T-3PL
d. gup-T-TR-3PL

1 E.g. nachschlagen ‘look up’ in German.
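The four-way ambiguity in (3) follows from the optional suffix sequences that -diit can mask. A toy enumeration, with the suffix combinations read directly off example (3) rather than taken from the analyzer itself:

```python
def analyses_3pl(stem):
    """All analyses of stem + 'diit': the 3PL suffix may attach directly,
    after TR, after T, or after T+TR (the TR being deleted on the surface).
    The four suffix combinations mirror (3a)-(3d)."""
    return [(stem, *suffixes, "3PL")
            for suffixes in [(), ("TR",), ("T",), ("T", "TR")]]
```

An analyzer that returns all such parses leaves disambiguation to downstream context, which is the behavior the interlinear-annotated corpus can be used to evaluate.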


Derivation– Proclitics– Plural– Root –Argument Structure

Figure 1: Morphological template of a complex nominal or verbal stem

Proclitics– Stem –Transitive –Person/Number =Epistemic =Next Noun Class

Figure 2: Morphological template of modification, inflection, and cliticization for a transitive verbal predicate

Running speech in Gitksan is additionally rife with clitics, which pose a more complex problem for morphological modeling. First, there is a set of ergative ‘flexiclitics’, which are able to either procliticize or encliticize onto a subordinator or auxiliary, or stand independently. The same combination of host and clitic might result in sequences like n=ii (1SG=and), ii=n (and=1SG), or ii na (and 1SG) (Stebbins, 2003; Forbes, 2018).

Second, all nouns are introduced with a noun-class clitic that attaches to the preceding word, as illustrated by the VSO sentence in (4). Here, the proper noun clitic =s attaches to the verb but is syntactically associated with Mary, and the common noun clitic =hl attaches to Mary but is associated with gayt ‘hat’.

(4) Giigwis Maryhl gayt.
giikw-i-t=s Mary=hl gayt
buy-TR-3.II=PN Mary=CN hat
‘Mary bought a hat.’

Any word able to precede a noun phrase is a possible host for one of these clitics (hence their appearance on transitive verbs in Figure 2).

Finally, there are several sentence-final and second-position clitics, whose distribution is based on prosodic rather than strictly categorial properties; these attach on the right edge of subordinators/auxiliaries, predicates, and argument phrases, depending on the structure of the sentence.

A large part of Gitksan’s unique morphological complexity therefore arises not in nominal or verbal inflection, but in the flexibility of multiple types of clitics used in connected speech, and the logic of which possible sequences can appear with which wordforms.

2.2 Resources

The Gitksan community orthography was designed and documented in the Hindle and Rigsby (1973) wordlist (H&R). Though it originally reflected only the single dialect of one of the authors (Git-an’maaxs, Eastern), this orthography is in broad use today across the Gitxsan community for all dialects, as well as neighboring Nisga’a, with some variations. Given the relatively short period that this orthography has been in use, orthographic conventions can vary widely across dialects and writers. In producing this initial analyzer, we attempt to mitigate the issue by working with a small number of more-standardized sources: the original H&R and an annotated, multidialectal text collection.

We worked with a digitized version of the H&R wordlist (Mother Tongues Dictionaries, 2020). The original wordlist documents only the Git-an’maaxs Eastern dialect; our version adds a small number of additional dialect variants, and fifteen common verbs and subordinators. In total, the list contains approximately 1250 lexemes and phrases, plus noted variants and plural forms.

The analyzer was informed by descriptive work on both Gitksan and its mutually intelligible neighbor Nisga’a. This work details many aspects of Gitksan inflection, including morphological opacity and the complex interactions of certain suffixes and clitics (Rigsby, 1986; Tarpent, 1987; Hunt, 1993; Davis, 2018; Brown et al., 2020).

A text collection of approximately 18,000 words was also used in the development and evaluation of the analyzer. This collection consists of oral narratives given by three speakers from different villages: Ansbayaxw (Eastern), Gijigyukwhla’a (Western), and Git-anyaaw (Western) (cf. Forbes et al., 2017). It includes multiple genres: personal anecdotes, traditional tales (ant’imahlasxw), histories of ownership (adaawk), recipes, and explanations of cultural practice. The collection is fully annotated in the ‘interlinear gloss’ format with free translation, exemplified in (5).

(5) Ii al’algaltgathl get,
ii CVC-algal-t=gat=hl get
CCNJ PL-watch-3.II=REPORT=CN people
‘And they stood by and watched,’

The analyzed corpus provides insight into the use of clitics in running speech, and is the dataset against which we test the results of the analyzer.


3 Related Work

While considering different approaches to computational modeling of Gitksan morphology, finite-state morphology arose as a natural choice. At the present time, finite-state methods are quite widely applied to Indigenous languages of the Americas. Chen and Schwartz (2018) present a morphological analyzer for St. Lawrence Island / Central Siberian Yupik for aid in language preservation and revitalization work. Strunk (2020) presents another analyzer for Central Alaskan Yupik. Snoek et al. (2014) present a morphological analyzer for Plains Cree nouns, and Harrigan et al. (2017) present one for Plains Cree verbs. Littell (2018) builds a finite-state analyzer for Kwak’wala. All of the above are languages which present similar challenges to the ones encountered in the case of Gitksan: word forms consisting of a large number of morphemes, both prefixing and suffixing morphology, and morphophonological alternations. Finite-state morphology is well-suited for dealing with these challenges. It is noteworthy that, similarly to Gitksan, a number of the aforementioned languages are also undergoing active documentation efforts.

While we present the first morphological analyzer for Gitksan which is capable of productive inflection, this is not the first electronic lexical resource for the Gitksan language. Littell et al. (2017) present an electronic dictionary interface, Waldayu, for endangered languages and apply it to Gitksan. The model is capable of performing fuzzy dictionary search, which is an important extension in the presence of the orthographic variation which widely occurs in Gitksan. While this represents an important development for computational lexicography for Gitksan, the method cannot model productive inflection, which is important particularly for language learners who might not be able to easily deduce the base-form of an inflected word (Hunt et al., 2019). As mentioned earlier, our model can analyze inflected forms of lexemes.

We extend the coverage of our finite-state analyzer by incorporating a neural morphological guesser which can be used to analyze word forms which are rejected by the finite-state analyzer. Similar mechanisms have been explored for other American Indigenous languages. Micher (2017) uses segmental recurrent neural networks (Kong et al., 2015) to augment a finite-state morphological analyzer for Inuktitut.2 These jointly segment the input word into morphemes and label each morpheme with one or more grammatical tags. Very similarly to the approach that we adopt, Schwartz et al. (2019) and Moeller et al. (2018) use attentional LSTM encoder-decoder models to extend morphological analyzers for St. Lawrence Island / Central Siberian Yupik and Arapaho, respectively.

2 The Uquailaut morphological analyzer:

4 The Model

Our morphological analyzer was designed with several considerations in mind. First, given the small amount of data at our disposal, we chose to construct a rule-based finite-state transducer, built from a predefined lexicon and morphological description. The dependence of this type of analyzer on a lexicon supports one of the major goals of this project: lexical discovery from texts. Words which cannot be analyzed will likely be novel lemmas that have yet to be documented. Furthermore, the process of constructing a morphological description allows for the refinement of our understanding of Gitksan morphology and orthographic standards. For example, there is a common post-stem rounding effect that generates variants such as jogat, jogot ‘those who live’; the project helps us identify where this effect occurs. Our analyzer can also later serve as a tool to explore the behavior of less-documented constructions (e.g. distributive, partitive), as grammatical and pedagogical resources continue to be developed.

Our general philosophy was to take a maximal-segmentation approach to inflection and cliticiza-tion: morphemes were added individually, and in-teractions between morphemes (e.g. deletion) werederived through transformational rules based onmorphological and phonological context. Mostinteractions of this kind are strictly local; thereare few long-distance dependencies between mor-phemes. The only exception to the minimal chunk-ing rule is a specific interaction between noun-classclitics and verbal agreement: when these cliticsappend to verbal agreement suffixes, they eitheragglutinate with (6-a) or delete them (6-b) depend-ing on whether the agreement and noun-class mor-pheme are associated with the same noun (Tarpent,1987; Davis, 2018). That is, the conditioning factorfor this alternation is syntactic, not morphophono-logical.

http://www.inuktitutcomputing.ca/Uqailaut


(6) Realizations of gup-i-t=hl (eat-TR-3=CN)
a. gubithl ‘he/she ate (common noun)’
b. gubihl ‘(common noun) ate’

The available set of resources further constrained our options for the analyzer’s design and our means of evaluating it. The H&R wordlist is quite small, and of only a single dialect, while the corpus for testing was multidialectal. We therefore aimed to produce a flexible analyzer able to recognize orthographic variation, to maximize the value of its small lexicon.

4.1 FST implementation

Our finite-state analyzer was written in lexc and xfst format and compiled using foma (Hulden, 2009b). Finite-state analyzers like this one are constructed from a dictionary of stems, with affixes added left-to-right, and morphophonological rewrite rules applied to produce allomorphs and contextual variation. The necessary components of the analyzer are therefore a lexicon, a morphotactic description, and a set of morphophonological transformations, as illustrated in Figure 3.

Our analyzer’s lexicon is drawn from the H&R wordlist. As a first step, each stem from that list was assigned a lexical category to determine its inflectional possibilities. The resulting 1506 word + category pairs were imported to category-specific groups in the morphotactic description.

Any of the major stem categories could be used to start a word; modifiers, preverbs, and prenouns could also be used as verb/noun prefixes. Each categorized group flowed to a series of category-specific sections which appended the appropriate part of speech, and then listed various derivational or inflectional affixes that could be appended. A morphological group would terminate either with a hard stop (#) or by flowing to a final group ‘Word’, where clitics were appended.
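The continuation-class mechanism described above can be illustrated with a small Python sketch. The toy lexicon below is a hypothetical fragment loosely modeled on Figure 3, not the project’s actual lexc source; entries are (analysis-side string, surface-side string, continuation class), with ‘#’ as the hard stop.

```python
def expand(lexicon, cls="Root", analysis="", surface=""):
    """Follow lexc-style continuation classes, yielding
    (analysis, surface) pairs; '#' terminates a word."""
    if cls == "#":
        yield analysis, surface
        return
    for upper, lower, nxt in lexicon[cls]:
        yield from expand(lexicon, nxt, analysis + upper, surface + lower)

# Hypothetical toy fragment (entries are illustrative, not from the FST):
TOY = {
    "Root":  [("maa'y", "maa'y", "N"), ("yee", "yee", "VI")],
    "N":     [("+N", "", "NInfl")],
    "NInfl": [("-ATTR", "^m", "#"), ("", "", "#")],
    "VI":    [("+VI", "", "#")],
}
pairs = sorted(expand(TOY))
# pairs contains e.g. ("maa'y+N-ATTR", "maa'y^m") and ("yee+VI", "yee")
```

In a real lexc grammar, foma compiles these continuation classes into a single transducer; the sketch only enumerates the analysis/surface pairs that such a transducer would accept before rewrite rules apply.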

Finally, forms were subject to a sequence of orthographic transformations reflecting morphophonological rules. Some examples included the deletion of adjacent morphemes which could not co-occur, processes of vowel epenthesis or deletion, vowel coloring by rounded and back consonants, and prevocalic stop voicing.

A sample form produced by the FST for the word saabisbisdiithl ‘they tore off (pl. common noun)’ is in example (7). This form involves a preverb saa being affixed directly to a transitive verb bisbis, a reduplicated plural form of the verb which was listed directly in the H&R wordlist (the symbol ˆ marks morpheme boundaries).3 After the verb, we find two inflectional suffixes and one clitic. Ultimately, rewrite rules are used to delete the transitive suffix and segmentation boundaries (8).

(7) saaˆbisbisˆiˆdiitˆhl
    saa+PVB-bisbis+VT-TR-3PL=CN

(8) saabisbisdiithl
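The derivation from (7) to (8) can be simulated with a short Python sketch. This is an illustrative simplification, not the analyzer’s actual xfst rules: it uses `^` for the morpheme boundary ˆ and `=` for clitic boundaries, assumes a five-vowel inventory, and implements only two of the Figure 3 rewrite rules.

```python
import re

def apply_rules(form: str) -> str:
    """Apply a simplified, ordered subset of the rewrite rules.
    '^' marks morpheme boundaries and '=' clitic boundaries (assumed
    notation); the real analyzer uses compiled xfst replace rules."""
    # Deletion before -3PL: ^i -> 0 / _ ^diit
    form = form.replace("^i^diit", "^diit")
    # Prevocalic stop voicing: p,t,ts,k -> b,d,j,g / _ V (across boundaries)
    voice = {"p": "b", "t": "d", "ts": "j", "k": "g"}
    form = re.sub(r"(ts|p|t|k)([\^=]?[aeiou])",
                  lambda m: voice[m.group(1)] + m.group(2), form)
    # Finally, strip segmentation boundaries
    return form.replace("^", "").replace("=", "")

print(apply_rules("saa^bisbis^i^diit^hl"))  # saabisbisdiithl, cf. (8)
print(apply_rules("gup^i^t=hl"))            # gubithl, cf. (6-a)
```

Rule ordering matters here just as in xfst: voicing must see the transitive vowel before boundary deletion removes the segmentation marks.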

4.2 Analyzer iterations

We built and evaluated four iterations of the Gitksan morphological analyzer based upon the foundation presented in Section 4.1: the v1. Lexical FST, v2. Complete FST, v3. Dialectal FST, and v4. FST+Neural. Each iteration cumulatively expands the previous one by incorporating additional vocabulary items, rules, or modeling components.

The first analyzer (v1: Lexical FST) included only the open-class categories of verbs, nouns, modifiers, and adverbs which made up the bulk of the H&R wordlist. The main focus of the morphotactic description was transitive inflection, person/number agreement, and cliticization for these categories. Some semi-productive argument-structural morphemes (e.g. the passive -xw or antipassive -asxw) were also included.

The second analyzer (v2: Complete FST) incorporated functional and closed-class morphemes such as subordinators, pronouns, prepositions, quotatives, demonstratives, and aspectual particles, including additional types of clitics.

The third analyzer (v3: Dialectal FST) further incorporated predictable stem-internal variation, such as the vowel shift and dorsal stop lenition/fortition seen across dialects. In order to apply the vowel shift in a targeted way, all items in the lexicon were marked for stress using the notation $. Parses prior to rule application now appear as in (9) (compare to (7)).

(9) s$aaˆbisb$isˆiˆdiitˆhl

Finally, we seek to expand the coverage of the analyzer through machine learning, namely neural architectures (v4: FST+Neural). Our FST architecture allows for the automatic extraction of surface–analysis pairs; this enables us to create

3The FST has no component to productively handle reduplication, but this would be possible to implement given a closed lexicon (Hulden, 2009a, Ch. 4).


LEXICON RootN
maa'y N ;
smax N ;

LEXICON RootVI
yee VI ;
t'aa VI ;

LEXICON RootPrenoun
lax_ Prenoun ;

(a) Lexicon

LEXICON N
+N: NInfl ;

LEXICON NInfl
-ATTR:^m # ;
-SX:^it Word ;
Agr_II ;
Word ;

LEXICON Prenoun
+PNN: # ;
+PNN: RootN ;

(b) Morphotactic description

Deletion before -3PL:
ˆi → 0 / _ ˆdiit

Vowel insertion:
0 → i / C ˆ _ Sonorant #

Prevocalic voicing:
p,t,ts,k,k → b,d,j,g,g / _ V

(c) Rewrite rules

Figure 3: Three main components of the FST (simplified)

a training set for the neural models. We experiment with two alternative neural architectures: the hard-attentional model over edit actions (HA) described by Makarov and Clematide (2018), and the transformer model (Vaswani et al., 2017) as implemented in Fairseq (Ott et al., 2019). Unlike the FST, the neural models can extend morphological patterns beyond a defined series of stems, analyzing forms that the FST cannot recognize.

For both models, we extract 10,000 random analysis pairs, with replacement; early stopping for both models uses a 10% validation set extracted from the training data, with no overlap between training and validation sets (although stem overlap is allowed). The best checkpoint is chosen based on validation accuracy. The HA model uses a Chinese Restaurant Process alignment model, and is trained for 60 epochs with 10 epochs patience; the encoder and decoder both have hidden dimension 200, and are trained with 50% dropout on recurrent connections. The Transformer model is a 3-layer, 4-head transformer trained for 50 epochs. The encoders and decoders each have an embedding size of 512 and a feed-forward size of 1024, with 50% dropout and 30% attentional dropout. We optimize using Adam (0.9, 0.98), and cross-entropy with 20% label smoothing as our objective.
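The sampling and split procedure above can be sketched as follows. This is a hedged reconstruction, not the authors’ code: the function name and the way the 10% held-out portion is carved from the unique pairs are assumptions.

```python
import random

def make_training_splits(pairs, n=10_000, dev_frac=0.1, seed=0):
    """Sample n (surface, analysis) pairs with replacement, then hold out
    ~10% of the unique pairs as a validation set so that no exact pair
    appears in both splits (stem overlap remains possible)."""
    rng = random.Random(seed)
    sample = [rng.choice(pairs) for _ in range(n)]
    uniq = sorted(set(sample))
    rng.shuffle(uniq)
    held_out = set(uniq[:max(1, int(len(uniq) * dev_frac))])
    train = [p for p in sample if p not in held_out]
    dev = [p for p in sample if p in held_out]
    return train, dev
```

Holding out whole pairs (rather than a random 10% of tokens) is what guarantees the no-overlap condition between training and validation.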

Any wordform which received no analysis from the FST was provided a set of five possible analyses each from the HA and Fairseq models.
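This back-off scheme amounts to a simple wrapper. The sketch below uses hypothetical function interfaces (`fst_analyze`, `ha_nbest`, `seq2seq_nbest`), not the project’s actual API:

```python
def analyze_with_backoff(word, fst_analyze, ha_nbest, seq2seq_nbest, k=5):
    """Return FST analyses when available; otherwise back off to k
    candidate analyses from each neural guesser."""
    analyses = fst_analyze(word)
    if analyses:
        return analyses
    return ha_nbest(word, k) + seq2seq_nbest(word, k)
```

The FST thus retains full authority over forms it can parse; the neural guessers only ever fill coverage gaps, which is why precision degrades gracefully rather than globally.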

5 Evaluation

5.1 FST Coverage

The analyzers were run on two 2000-token datasets drawn from the multidialectal corpus: an Eastern Gitksan dataset (1 speaker), and a Western Gitksan dataset (2 speakers and dialects). Token and type coverage for the three FSTs is provided in Table 1, representing the percentage of wordforms for which each analyzer was able to provide one or more possible parses.

                Types    Tokens
East  Lexical   63.12%   54.17%
      Complete  71.10%   81.48%
      Dialectal 71.10%   81.48%
West  Lexical   45.49%   38.09%
      Complete  53.20%   70.12%
      Dialectal 62.35%   75.98%

Table 1: Analyzer coverage on 2000-token datasets
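Type and token coverage as reported in Table 1 can be computed with a few lines of Python. This is a generic sketch of the metric, not the authors’ evaluation script; `analyzer` is any function returning a (possibly empty) list of parses:

```python
from collections import Counter

def coverage(tokens, analyzer):
    """Type and token coverage: the share of unique wordforms (and of
    running tokens) for which the analyzer returns at least one parse."""
    counts = Counter(tokens)
    parsed = {w for w in counts if analyzer(w)}
    type_cov = len(parsed) / len(counts)
    token_cov = sum(counts[w] for w in parsed) / sum(counts.values())
    return type_cov, token_cov
```

Because frequent function words are repeated, token coverage can rise much faster than type coverage when closed-class items are added, exactly the pattern Table 1 shows for the ‘Complete’ analyzer.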

The effect of adding function-word coverage to the second ‘Complete’ analyzer was broadly similar across dialects, increasing type coverage by about 8% and token coverage by 27–32%, demonstrating the relative importance of function words to lexical coverage.

The first two analyzers performed substantially better on the Eastern dataset, which more closely matched the dialect of the wordlist/lexicon. The third ‘Dialectal’ analyzer incorporated four types of predictable stem-internal allomorphy to generate Western-style variants. These transformations had no effect on coverage for the Eastern dataset, but increased type and token coverage for the Western dataset by 9% and 6% respectively.

5.2 FST precision

While our analyzer manipulates vocabulary items at the level of the stem seen in the lexicon, the corpus used for evaluation is annotated to the level of the root and was not always comparable (e.g. ihlee’etxw ‘red’ vs ihlee’e-xw ‘blood-VAL’). Accuracy evaluation therefore had to be done manually by comparing the annotated analysis in the corpus to the parse options produced by the FST (10).

(10) japhl
a. make[-3.II]=CN (Corpus)
b. j$ap+N=CN
   j$ap+N-3=CN
   j$ap+VT-3=CN (FSTv3)

We evaluated the accuracy of the Dialectal FST on two smaller datasets: 150 tokens Eastern, and 250 tokens Western. These datasets included 85 and 180 unique wordform/annotation pairs respectively. The same wordform might have multiple attested analyses, depending on its usage. The performance of the Dialectal analyzer on each dataset is summarized in Table 2. Precision is calculated as the percentage of word/annotation pairs for which the analyzer produced a parse matching the context-sensitive annotation in the corpus.4 Other analyses produced by the FST were ignored. For example in (10), the token would be evaluated as correct given the final parse, which uses the appropriate stem (jap ‘make’) and matching morphology; the other parses using a different stem (jap ‘box trap’) and/or different morphology could not qualify the token as correctly parsed. Only parsable wordforms were considered (i.e. English words and names are excluded).
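The precision metric can be made concrete with a short sketch (a hypothetical data structure, not the authors’ evaluation code): each word/annotation pair is reduced to two flags, whether it received any parse and whether one of the parses matched the gold annotation.

```python
def parse_precision(evaluations):
    """Precision over (received_parse, parse_matches_gold) flag pairs,
    one per word/annotation pair; items with no parse are excluded
    from the denominator."""
    matches = [ok for parsed, ok in evaluations if parsed]
    return sum(matches) / len(matches)
```

Feeding in the Western counts reported below (116 correct, 5 incorrect, 53 unparsed) reproduces the 95.87% figure.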

                 East             West
Coverage         71.76% (61/85)   68.89% (124/180)
Correct parse    71.76% (61)      64.44% (116)
Incorrect parse  0.00% (0)        2.78% (5)
Name, English    2.5% (2)         3.33% (6)
No parse         27.5% (22)       29.44% (53)
Precision        100.00% (61/61)  95.87% (116/121)

Table 2: Accuracy evaluation for dialectal analyzer (v3) on small datasets

The Western dataset was larger, and consisted of two distinct dialects, in contrast to the smaller and more homogeneous Eastern dataset. Regardless, analyzer coverage between the two datasets was comparable (68–72%) and precision was very high (95–100%). When this analyzer was able to provide a possible parse, one was almost always correct.

4Note that precision is computed only on word forms which received at least one analysis from the FST.

To further understand the analyzer’s limitations, we categorized the reasons for erroneous and missing analyses, listed in Table 3. In addition to the small datasets, for which all words were checked, we also evaluated the 100 most-frequent word/analysis pairs in the larger datasets.

The majority of erroneous and absent analyses were due to the use of new lemmas not in the lexicon, or novel variants not captured by productive stem-alternation rules. Novel lemmas made up about 18% each of the small datasets, and 4–8% of the top-100 most frequent types. Some functional items had specific dialectal realizations; for example, all three speakers used a different locative preposition (goo-, go’o-, ga’a-), only one of which was recognized.

There were also a few errors attributable to the morphotactic rules encoded in the parser. For example, there were several instances in the dataset of supposed ‘preverb’ modifiers combining with nouns (e.g. t’ip-no’o=si, sharply.down-hole=PROX, ‘this steep-sided hole’), which the parser could not recognize. This category combination flags the need for further documentation of certain ‘preverbs’. As a second example, numbers attested without agreement were not recognized because the analyzer expected that they would always agree. This could be fixed by updating the morphotactic description for numbers (e.g. to more closely match intransitive verbs).

5.3 FST + Neural performance

The addition of the neural component significantly increased the analyzer’s coverage (mean HA: +21%, Fairseq: +17%), but at the expense of precision (mean -15% for both). The results of the manual accuracy evaluation are presented in Figure 4. There remained several forms for which the neural analyzers produced no analyses.

Both analyzers performed better on the 100-most-frequent types datasets, where they tended to accurately identify dialectal variants of common words (e.g. t’ihlxw from tk’ihlxw ‘child’, diye from diya ‘3=QUOT (third person quotative)’). In the small datasets of running text, these models were occasionally able to correctly identify unknown noun and verb stems that had minimal inflection. However, they struggled with identifying categories, and often failed to identify correct inflection. These difficulties stem from category flexibility and homophony in Gitksan. Nouns and


                    East                           West
                    150 tokens (22)  Top-100 (17)  250 tokens (58)  Top-100 (23)
New lemma           15               2             30               2
New function word   1                2             4                6
Lexical variant     3                8             6                5
Functional variant  2                3             9                9
Morphotactic error  1                2             9                1

Table 3: Categorization of erroneous and absent analyses for dialectal analyzer (FSTv3)

[Figure 4: four bar-chart panels (East 150 tokens, East top-100 tokens, West 250 tokens, West top-100 tokens), each comparing FST, FST+HA, and FST+FairSeq on a 0–1 scale.]

Figure 4: Proportion of forms which receive the correct analysis from each of our models (indicated in blue) and the number of forms which receive only incorrect analyses from our models (indicated in red). The remaining forms received no analyses.

verbs use the exact same inflection and clitics, making the category itself difficult to infer. Short inflectional sequences have a large number of homophonous parses, and even more differ only by a character or two.

Qualitatively, the HA model tended to produce more plausible predictions, often producing the correct stem or else a mostly-plausible analysis that could map to the same surface form, but with incorrect categories or inflection. In contrast, the Fairseq model often introduced stem changes or inflectional sequences which could not ultimately map to the surface form. Example (11) provides a sample set of incorrect predictions (surface-plausible analyses are starred).

(11) ksimaasdiit  ksi+PVB-m$aas+VT-TR-3PL
a. HA model
   xsim$aas+N-3PL (*)
   xsim$aas+N-T-3PL (*)
   xsim$aas+NUM-3PL (*?)
   xsim$aast+N-T-3PL
b. Fairseq model
   xsim$aast+N-3PL
   xsim$aast+N=RESEM
   xsim$aast+N-SX=PN
   xsim$as+N-3PL

Further work can be done to improve the performance of the neural addition, such as training the model on attested tokens instead of, or in addition to, tokens randomly generated from the FST analyzer.

6 Discussion and Conclusions

The grammatically-informed FST is able to handle many of Gitksan’s morphological complexities with a high degree of precision, including homophony, contextual deletion, and position-flexible clitics. The FST analyzer’s patchy coverage can be attributed to its small lexicon. Unknown lexical items and variants comprised roughly 18% of each small dataset. Notably, errors and unidentified forms in the FST analyzer signal the current limits of morphotactic descriptions and lexical documentation. The analyzer can therefore serve as a useful part of a documentary linguistic workflow to quickly and systematically identify novel lexical items and grammatical rules from texts, facilitating the expansion of lexical resources. It can also be used as a pedagogical tool to identify word stems in running text, or to generate morphological exercises for language learners.

The neural system, with its expanded coverage, can serve as part of a feedback system with a human in the loop, informing future iterations of the annotation process. While its precision is lower than the FST’s, it can still inform annotators on words that the FST does not analyze. Newly-annotated data can then be used to enlarge the FST’s coverage.

Acknowledgments

’Wii t’isim ha’miyaa nuu’m aloohl the fluent speakers who continue to share their knowledge with me (Barbara Sennott, Vincent Gogag, Hector Hill, Jeanne Harris), as well as the UBC Gitksan Research Lab. This research was supported by funding from the National Endowment for the Humanities (Documenting Endangered Languages Fellowship) and the Social Sciences and Humanities Research Council of Canada (Grant 430-2020-00793). Any views/findings/conclusions expressed in this publication do not necessarily reflect those of the NEH, NSF, or SSHRC.

References

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.

Colin Brown, Clarissa Forbes, and Michael David Schwan. 2020. Clause-type, transitivity, and the transitive vowel in Tsimshianic. In Papers of the International Conference on Salish and Neighbouring Languages 55. UBCWPL.

Emily Chen and Lane Schwartz. 2018. A morphological analyzer for St. Lawrence Island/Central Siberian Yupik. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Henry Davis. 2018. Only connect! A unified analysis of the Tsimshianic connective system. International Journal of American Linguistics, 84(4):471–511.

Britt Dunlop, Suzanne Gessner, Tracey Herbert, and Aliana Parker. 2018. Report on the status of BC First Nations languages. Report of the First People’s Cultural Council. Retrieved March 24, 2019.

Clarissa Forbes. 2018. Persistent ergativity: Agreement and splits in Tsimshianic. Ph.D. thesis, University of Toronto.

Clarissa Forbes, Henry Davis, Michael Schwan, and the UBC Gitksan Research Laboratory. 2017. Three Gitksan texts. In Papers for the 52nd International Conference on Salish and Neighbouring Languages, pages 47–89. UBC Working Papers in Linguistics.

Atticus G. Harrigan, Katherine Schmirler, Antti Arppe, Lene Antonsen, Trond Trosterud, and Arok Wolvengrey. 2017. Learning from the computational modelling of Plains Cree verbs. Morphology, 27(4):565–598.

Lonnie Hindle and Bruce Rigsby. 1973. A short practical dictionary of the Gitksan language. Northwest Anthropological Research Notes, 7(1).

Mans Hulden. 2009a. Finite-state machine construction methods and algorithms for phonology and morphology. Ph.D. thesis, The University of Arizona.

Mans Hulden. 2009b. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session, pages 29–32. Association for Computational Linguistics.

Benjamin Hunt, Emily Chen, Sylvia L.R. Schreiner, and Lane Schwartz. 2019. Community lexical access for an endangered polysynthetic language: An electronic dictionary for St. Lawrence Island Yupik. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 122–126, Minneapolis, Minnesota. Association for Computational Linguistics.

Katharine Hunt. 1993. Clause Structure, Agreement and Case in Gitksan. Ph.D. thesis, University of British Columbia.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2015. Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018.

Patrick Littell. 2018. Finite-state morphology for Kwak’wala: A phonological approach. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 21–30, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Patrick Littell, Aidan Pine, and Henry Davis. 2017. Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 141–150.

Peter Makarov and Simon Clematide. 2018. Neural transition-based string transduction for limited-resource setting in morphology. In Proceedings of the 27th International Conference on Computational Linguistics, pages 83–93, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jeffrey Micher. 2017. Improving coverage of an Inuktitut morphological analyzer using a segmental recurrent neural network. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 101–106, Honolulu. Association for Computational Linguistics.


Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cowell, and Mans Hulden. 2018. A neural morphological analyzer for Arapaho verbs learned from a finite state transducer. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 12–20.

Mother Tongues Dictionaries. 2020. Gitksan. Edited by the UBC Gitksan Research Lab. Accessed June 4, 2020. (https://mothertongues.org/gitksan).

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Bruce Rigsby. 1986. Gitxsan grammar. Master’s thesis, University of Queensland, Australia.

Lane Schwartz, Emily Chen, Benjamin Hunt, and Sylvia L.R. Schreiner. 2019. Bootstrapping a neural morphological analyzer for St. Lawrence Island Yupik from a finite-state transducer. In Proceedings of the Workshop on Computational Methods for Endangered Languages, volume 1.

Conor Snoek, Dorothy Thunder, Kaidi Loo, Antti Arppe, Jordan Lachler, Sjur Moshagen, and Trond Trosterud. 2014. Modeling the noun morphology of Plains Cree. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 34–42.

Tonya Stebbins. 2003. On the status of intermediate form-classes: Words, clitics, and affixes in Coast Tsimshian (Sm’algyax). Linguistic Typology, 7(3):383–415.

Lonny Strunk. 2020. A Finite-State Morphological Analyzer for Central Alaskan Yup’ik. Ph.D. thesis, University of Washington.

Marie-Lucie Tarpent. 1987. A Grammar of the Nisgha Language. Ph.D. thesis, University of Victoria.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 198–211, August 5, 2021. ©2021 Association for Computational Linguistics

Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction

Maria Ryskina1 Eduard Hovy1 Taylor Berg-Kirkpatrick2 Matthew R. Gormley3

1Language Technologies Institute, Carnegie Mellon University
2Computer Science and Engineering, University of California, San Diego
3Machine Learning Department, Carnegie Mellon University

Abstract

Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.1

1 Introduction and prior work

Many natural language sequence transduction tasks, such as transliteration or grapheme-to-phoneme conversion, call for a character-level parameterization that reflects the linguistic knowledge of the underlying generative process. Character-level transduction approaches have even been shown to perform well for tasks that are not entirely character-level in nature, such as translating between related languages (Pourdamghani and Knight, 2017).

Weighted finite-state transducers (WFSTs) have traditionally been used for such character-level tasks (Knight and Graehl, 1998; Knight et al., 2006). Their structured formalization makes it easier to encode additional constraints, imposed either

1Code will be published at https://github.com/ryskina/error-analysis-sigmorphon2021

[Figure 1 examples: Russian “это точно” ↔ “3to to4no”; Kannada (native script, garbled in extraction) ↔ “mana belagitu”; Serbian “техничка и стручна настава” ↔ Bosnian “tehničko i stručno obrazovanje”]

Figure 1: Parallel examples from our test sets for two character-level transduction tasks: converting informally romanized text to its original script (top; examples in Russian and Kannada) and translating between closely related languages (bottom; Bosnian–Serbian). Informal romanization is idiosyncratic and relies on both visual (ч → 4) and phonetic (т → t) character similarity, while translation is more standardized but not fully character-level due to grammatical and lexical differences (‘nastava’ → ‘obrazovanje’) between the languages. The lines show character alignment between the source and target side where possible.

by the underlying linguistic process (e.g. monotonic character alignment) or by the probabilistic generative model (Markov assumption; Eisner, 2002). Their interpretability also facilitates the introduction of useful inductive bias, which is crucial for unsupervised training (Ravi and Knight, 2009; Ryskina et al., 2020).

Unsupervised neural sequence-to-sequence (seq2seq) architectures have also shown impressive performance on tasks like machine translation (Lample et al., 2018) and style transfer (Yang et al., 2018; He et al., 2020). These models are substantially more powerful than WFSTs, and they successfully learn the underlying patterns from


monolingual data without any explicit informationabout the underlying generative process.

As the strengths of the two model classes differ, so do their weaknesses: the WFSTs and the seq2seq models are prone to different kinds of errors. On a higher level, this is explained by the structure–power trade-off: while the seq2seq models are better at recovering long-range dependencies and their outputs look less noisy, they also tend to insert and delete words arbitrarily because their alignments are unconstrained. We attribute the errors to the following aspects of the trade-off:

Language modeling capacity: the statistical character-level n-gram language models (LMs) utilized by finite-state approaches are much weaker than the RNN language models with unlimited left context. While a word-level LM can improve the performance of a WFST, it would also restrict the model’s ability to handle out-of-vocabulary words.
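To make the capacity gap concrete, a character n-gram LM of the kind used in finite-state pipelines can be sketched in a few lines. This is a minimal stand-in with assumed add-k smoothing and a bigram order; real systems use higher orders and stronger smoothing such as Kneser–Ney.

```python
import math
from collections import Counter

class CharBigramLM:
    """Add-k-smoothed character bigram LM (illustrative sketch)."""
    BOS, EOS = "<s>", "</s>"

    def __init__(self, corpus, k=0.1):
        self.k = k
        self.bigrams, self.context = Counter(), Counter()
        self.vocab = set()
        for line in corpus:
            chars = [self.BOS] + list(line) + [self.EOS]
            self.vocab.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[a, b] += 1
                self.context[a] += 1

    def logprob(self, line):
        chars = [self.BOS] + list(line) + [self.EOS]
        v = len(self.vocab)
        return sum(math.log((self.bigrams[a, b] + self.k) /
                            (self.context[a] + self.k * v))
                   for a, b in zip(chars, chars[1:]))
```

Because each prediction conditions on only one preceding character, such a model cannot enforce word-level consistency, which is exactly where RNN LMs with unbounded left context gain their advantage.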

Controllability of learning: more structured models allow us to ensure that the model does not attempt to learn patterns orthogonal to the underlying process. For example, domain imbalance between the monolingual corpora can cause the seq2seq models to exhibit unwanted style-transfer effects like inserting frequent target-side words arbitrarily.

Search procedure: WFSTs make it easy to perform exact maximum likelihood decoding via the shortest-distance algorithm (Mohri, 2009). For the neural models trained using conventional methods, decoding strategies that optimize for the output likelihood (e.g. beam search with a large beam size) have been shown to be susceptible to favoring empty outputs (Stahlberg and Byrne, 2019) and generating repetitions (Holtzman et al., 2020).

Prior work on leveraging the strengths of the two approaches proposes complex joint parameterizations, such as neural weighting of WFST arcs or paths (Rastogi et al., 2016; Lin et al., 2019) or encoding alignment constraints into the attention layer of seq2seq models (Aharoni and Goldberg, 2017; Wu et al., 2018; Wu and Cotterell, 2019; Makarov et al., 2017). We study whether performance can be improved with simpler decoding-time model combinations, reranking and product of experts, which have been used effectively for other model classes (Charniak and Johnson, 2005; Hieber and Riezler, 2015), evaluating on two unsupervised tasks: decipherment of informal romanization (Ryskina et al., 2020) and related language translation (Pourdamghani and Knight, 2017).

While there has been much error analysis for the WFST and seq2seq approaches separately, it largely focuses on the more common supervised case. We perform detailed side-by-side error analysis to draw high-level comparisons between finite-state and seq2seq models and investigate whether the intuitions from prior work transfer to the unsupervised transduction scenario.

2 Tasks

We compare the errors made by the finite-state and the seq2seq approaches by analyzing their performance on two unsupervised character-level transduction tasks: translating between closely related languages written in different alphabets and converting informally romanized text into its native script. Both tasks are illustrated in Figure 1.

2.1 Informal romanization

Informal romanization is an idiosyncratic transformation that renders a non-Latin-script language in the Latin alphabet, extensively used online by speakers of Arabic (Darwish, 2014), Russian (Paulsen, 2014), and many Indic languages (Sowmya et al., 2010). Figure 1 shows examples of romanized Russian (top left) and Kannada (top right) sentences along with their “canonicalized” representations in Cyrillic and Kannada scripts respectively. Unlike official romanization systems such as pinyin, this type of transliteration is not standardized: character substitution choices vary between users and are based on the specific user’s perception of how similar characters in different scripts are. Although the substitutions are primarily phonetic (e.g. Russian н /n/ → n), i.e. based on the pronunciation of a specific character in or out of context, users might also rely on visual similarity between glyphs (e.g. Russian ч → 4), especially when the associated phoneme cannot be easily mapped to a Latin-script grapheme (e.g. Arabic ع /ʕ/ → 3). To capture this variation, we view the task of decoding informal romanization as a many-to-many character-level decipherment problem.

The difficulty of deciphering romanization also depends on the type of the writing system the language traditionally uses. In alphabetic scripts, where grapheme-to-phoneme correspondence is mostly one-to-one, there tends to be a one-to-one monotonic alignment between characters in the romanized and native script sequences (Figure 1, top left). Abjads and abugidas, where graphemes correspond to consonants or consonant-vowel syllables, increasingly use many-to-one alignment in their romanization (Figure 1, top right), which makes learning the latent alignments, and therefore decoding, more challenging. In this work, we experiment with three languages spanning three major types of writing systems: Russian (alphabetic), Arabic (abjad), and Kannada (abugida), and compare how well-suited character-level models are for learning these varying alignment patterns.

2.2 Related language translation

As shown by Pourdamghani and Knight (2017) and Hauer et al. (2014), character-level models can be used effectively to translate between languages that are closely enough related to have only small lexical and grammatical differences, such as Serbian and Bosnian (Ljubešić and Klubička, 2014). We focus on this specific language pair and tie the languages to specific orthographies (Cyrillic for Serbian and Latin for Bosnian), approaching the task as an unsupervised orthography conversion problem. However, the transliteration framing of the translation problem is inherently limited since the task is not truly character-level in nature, as shown by the alignment lines in Figure 1 (bottom). Even the most accurate transliteration model will not be able to capture non-cognate word translations (Serbian ‘nastava’ [‘education, teaching’] → Bosnian ‘obrazovanje’ [‘education’]) and the resulting discrepancies in morphological inflection (Serbian -a endings in adjectives agreeing with feminine ‘nastava’ map to Bosnian -o representing agreement with neuter ‘obrazovanje’).

One major difference with the informal roman-ization task is the lack of the idiosyncratic orthogra-phy: the word spellings are now consistent acrossthe data. However, since the character-level ap-proach does not fully reflect the nature of the trans-formation, the model will still have to learn a many-to-many cipher with highly context-dependent char-acter substitutions.

3 Data

Table 1 details the statistics of the splits used for all languages and tasks. Below we describe each dataset in detail, explaining the differences in data split sizes between languages. Additional preprocessing steps applied to all datasets are described in §3.4.²

3.1 Informal romanization

Source: de el menu:)
Filtered: de el menu<...>
Target: <...> [Arabic script]
Gloss: 'This is the menu'

Figure 2: A parallel example from the LDC BOLT Arabizi dataset, written in Latin script (source) and converted to Arabic (target) semi-manually. Some source-side segments (in red) are removed by annotators; we use the version without such segments (filtered) for our task. The annotators also standardize spacing on the target side, which results in differences with the source (in blue).

Arabic We use the LDC BOLT Phase 2 corpus (Bies et al., 2014; Song et al., 2014) for training and testing the Arabic transliteration models (Figure 2). The corpus consists of short SMS and chat messages in Egyptian Arabic represented using Latin script (Arabizi). The corpus is fully parallel: each message is automatically converted into the standardized dialectal Arabic orthography (CODA; Habash et al., 2012) and then manually corrected by human annotators. We split and preprocess the data according to Ryskina et al. (2020), discarding the target (native script) and source (romanized) sides of the parallel sentences to create the source and target monolingual training splits, respectively.

Russian We use the romanized Russian dataset collected by Ryskina et al. (2020), augmented with monolingual Cyrillic data from the Taiga corpus of Shavrina and Shapovalova (2017) (Figure 3). The romanized data is split into training, validation, and test portions, and all validation and test sentences are converted to Cyrillic by native-speaker annotators. Both the romanized and the native-script sequences are collected from public posts and comments on the Russian social network vk.com, and they are on average three times longer than the messages in the Arabic dataset (Table 1). However, although both sides were scraped from the same online platform, the relevant Taiga data is collected primarily from political discussion groups, so there is still a substantial domain mismatch between the source and target sides of the data.

² Links to download the corpora and other data sources discussed in this section can be found in Appendix A.


                    Train (source)    Train (target)    Validation       Test
                    Sent.   Char.     Sent.   Char.     Sent.   Char.    Sent.   Char.
Romanized Arabic     5K     104K       49K    935K      301     8K       1K      20K
Romanized Russian    5K     319K      307K    111M      227     15K      1K      72K
Romanized Kannada   10K     1M        679K    64M       100     11K      100     10K
Serbian→Bosnian    160K     9M        136K    9M        16K     923K     100     9K
Bosnian→Serbian    136K     9M        160K    9M        16K     908K     100     10K

Table 1: Dataset splits for each task and language. The source and target train data are monolingual, and the validation and test sentences are parallel. For the informal romanization task, the source and target sides correspond to the Latin and the original script respectively. For the translation task, the source and target sides correspond to the source and target languages. The validation and test character statistics are reported for the source side.

Annotated
Source: proishodit s prirodoy 4to to very very bad
Filtered: proishodit s prirodoy 4to to <...>
Target: происходит с природой что-то <...>
Gloss: 'Something very very bad is happening to the environment'

Monolingual
Source: —
Target: это видеоролики со съезда партии «Единая Россия»
Gloss: 'These are the videos from the "United Russia" party congress'

Figure 3: Top: a parallel example from the romanized Russian dataset. We use the filtered version of the romanized (source) sequences, removing the segments the annotators were unable to convert to Cyrillic, e.g. code-switched phrases (in red). The annotators also standardize minor spelling variation such as hyphenation (in blue). Bottom: a monolingual Cyrillic example from the vk.com portion of the Taiga corpus, which mostly consists of comments in political discussion groups.

Kannada Our Kannada data (Figure 4) is taken from the Dakshina dataset (Roark et al., 2020), a large collection of native-script text from Wikipedia for 12 South Asian languages. Unlike the Russian and Arabic data, the romanized portion of Dakshina is not scraped directly from users' online communication, but instead elicited from native speakers given the native-script sequences. Because of this, all romanized sentences in the data are parallel: we allocate most of them to the source-side training data, discarding their original-script counterparts, and split the remaining annotated ones between validation and test.


Source: moola saaketnalli ddr3yannu balasalu
Target: [Kannada script]
Gloss: 'to use DDR3 in the source circuit'

Figure 4: A parallel example from the Kannada portion of the Dakshina dataset. The Kannada-script data (target) is scraped from Wikipedia and manually converted to Latin (source) by human annotators. Foreign target-side characters (in red) are preserved in the annotation, but our preprocessing replaces them with UNK on the target side.

Serbian: свако има право на живот, слободу и безбедност личности.
Bosnian: svako ima pravo na život, slobodu i osobnu sigurnost.
Gloss: 'Everyone has the right to life, liberty and security of person.'

Figure 5: A parallel example from the Serbian-Cyrillic and Bosnian-Latin UDHR. The sequences are not entirely parallel on the character level due to paraphrases and non-cognate translations (in blue).

3.2 Related language translation

Following prior work (Pourdamghani and Knight, 2017; Yang et al., 2018; He et al., 2020), we train our unsupervised models on the monolingual data from the Leipzig corpora (Goldhahn et al., 2012). We reuse the non-parallel training and synthetic parallel validation splits of Yang et al. (2018), who generated their parallel data using the Google Translation API. Rather than using their synthetic test set, we opt to test on natural parallel data from the Universal Declaration of Human Rights (UDHR), following Pourdamghani and Knight (2017).

We manually sentence-align the Serbian-Cyrillic and Bosnian-Latin declaration texts and follow the preprocessing guidelines of Pourdamghani and Knight (2017). Although we strive to approximate the training and evaluation setup of their work for fair comparison, there are some discrepancies: for example, our manual alignment of the UDHR yields 100 sentence pairs compared to their 104. We use the data to train translation models in both directions, simply switching the source and target sides from Serbian to Bosnian and vice versa.

3.3 Inductive bias

As discussed in §1, the WFST models are less powerful than the seq2seq models; however, they are also more structured, which we can use to introduce inductive bias to aid unsupervised training. Following Ryskina et al. (2020), we introduce informative priors on character substitution operations (for a description of the WFST parameterization, see §4.1). The priors reflect the visual and phonetic similarity between characters in different alphabets and are sourced from human-curated resources built with the same concepts of similarity in mind. For all tasks and languages, we collect phonetically similar character pairs from phonetic keyboard layouts (or, in the case of the translation task, from the default Serbian keyboard layout, which is phonetic in nature due to the dual orthography standard of the language). We also add some visually similar character pairs by automatically pairing all symbols that occur in both the source and target alphabets (same Unicode codepoints). For Russian, which exhibits a greater degree of visual similarity than Arabic or Kannada, we also make use of the Unicode confusables list (different Unicode codepoints but same or similar glyphs).³

It should be noted that these automatically generated informative priors also contain noise: keyboard layouts have spurious mappings because each symbol must be assigned to exactly one key in the QWERTY layout, and Unicode-constrained visual mappings might prevent the model from learning correspondences between punctuation symbols (e.g. the Arabic question mark ؟ → ?).
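As an illustration, assembling such a set of similar character pairs might be sketched as follows. The layout mapping, alphabets, and function name below are hypothetical stand-ins for the actual resources (which are linked in Appendix A), not the paper's code:

```python
def similar_pairs(phonetic_layout, src_alphabet, tgt_alphabet, confusables=()):
    """Collect (source_char, target_char) pairs that should receive
    boosted prior mass in the emission model."""
    pairs = set()
    # 1. Phonetically similar pairs: a phonetic keyboard layout maps each
    #    native-script letter to the Latin key it sits on.
    for native, latin in phonetic_layout.items():
        pairs.add((latin, native))
    # 2. Visually identical pairs: symbols shared by both alphabets
    #    (same Unicode codepoint), e.g. punctuation and digits.
    for ch in src_alphabet & tgt_alphabet:
        pairs.add((ch, ch))
    # 3. Optional confusable pairs (different codepoints, similar glyphs).
    pairs.update(confusables)
    return pairs

# Toy Russian example: Cyrillic 'н' sits on the 'n' key of a phonetic layout,
# '-' is shared by both alphabets, and Latin 'a' / Cyrillic 'а' are confusables.
layout = {"н": "n", "р": "r"}
src = {"n", "r", "4", "a", "-"}
tgt = {"н", "р", "ч", "а", "-"}
prior = similar_pairs(layout, src, tgt, confusables={("a", "а")})
```

Note that, as discussed above, a pair like ("4", "ч") is missed by all three heuristics, which is exactly the kind of idiosyncratic mapping the model must learn from data.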

3.4 Preprocessing

We lowercase and segment all sequences into characters as defined by Unicode codepoints, so diacritics and non-printing characters like ZWJ are also treated as separate vocabulary items. To filter out foreign or archaic characters and rare diacritics, we restrict the alphabets to characters that cover 99% of the monolingual training data. After that, we add any standard alphabetical characters and numerals that have been filtered out back into the source and target alphabets. All remaining filtered characters are replaced with a special UNK symbol in all splits except for the target-side test.

³ Links to the keyboard layouts and the confusables list can be found in Appendix A.

4 Methods

We perform our analysis using the finite-state and seq2seq models from prior work and experiment with two joint decoding strategies: reranking and product of experts. Implementation details and hyperparameters are described in Appendix B.

4.1 Base models

Our finite-state model is the WFST cascade introduced by Ryskina et al. (2020). The model is composed of a character-level n-gram language model and a script conversion transducer (emission model), which supports one-to-one character substitutions, insertions, and deletions. Character operation weights in the emission model are parameterized with multinomial distributions, and similar character mappings (§3.3) are used to create Dirichlet priors on the emission parameters. To avoid marginalizing over sequences of infinite length, a fixed limit is set on the delay of any path (the difference between the cumulative number of insertions and deletions at any timestep). Ryskina et al. (2020) train the WFST using stochastic stepwise EM (Liang and Klein, 2009), marginalizing over all possible target sequences and their alignments with the given source sequence. To speed up training, we modify their training procedure towards 'hard EM': given a source sequence, we predict the most probable target sequence under the model, marginalize over alignments, and then update the parameters. Although unsupervised WFST training is still slow, the stepwise training procedure is designed to converge using fewer data points, so we choose to train the WFST model only on the 1,000 shortest source-side training sequences (500 for Kannada).
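As a rough sketch of the interaction between the Dirichlet priors and the (hard-aligned) counts, the following toy update re-estimates the multinomial substitution parameters using prior pseudo-counts (a posterior-mean estimate). All names and hyperparameter values are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter, defaultdict

def map_substitution_probs(aligned_pairs, tgt_alphabet, prior_pairs,
                           alpha=1.0, boost=10.0):
    """aligned_pairs: (src_char, tgt_char) tuples from the current hard alignments.
    prior_pairs: similar-character pairs (§3.3) that get extra prior pseudo-counts."""
    counts = Counter(aligned_pairs)
    probs = defaultdict(dict)
    src_chars = {s for s, _ in counts} | {s for s, _ in prior_pairs}
    for s in src_chars:
        # Dirichlet pseudo-counts: a flat `alpha` everywhere, boosted for
        # the phonetically/visually similar pairs from the informative prior.
        pseudo = {t: alpha + (boost if (s, t) in prior_pairs else 0.0)
                  for t in tgt_alphabet}
        observed = {t: counts[(s, t)] for t in tgt_alphabet}
        total = sum(pseudo.values()) + sum(observed.values())
        for t in tgt_alphabet:
            probs[s][t] = (pseudo[t] + observed[t]) / total
    return probs

# Toy update: three hard-aligned occurrences of Latin 'n' -> Cyrillic 'н'.
probs = map_substitution_probs([("n", "н")] * 3, {"н", "р"}, {("n", "н")})
```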

Our default seq2seq model is the unsupervised neural machine translation (UNMT) model of Lample et al. (2018, 2019) in the parameterization of He et al. (2020). The model consists of an


                        Arabic               Russian              Kannada
                     CER   WER  BLEU      CER   WER  BLEU      CER   WER  BLEU
WFST                .405   .86   2.3     .202   .58  14.8     .359   .71  12.5
Seq2Seq             .571   .85   4.0     .229   .38  48.3     .559   .79  11.3
Reranked WFST       .398   .85   2.8     .195   .57  16.1     .358   .71  12.5
Reranked Seq2Seq    .538   .82   4.6     .216   .39  45.6     .545   .78  12.6
Product of experts  .470   .88   2.5     .178   .50  22.9     .543   .93   7.0

Table 2: Character and word error rates (lower is better) and BLEU scores (higher is better) for the romanization decipherment task. Bold indicates best per column. Model combinations mostly interpolate between the base models' scores, although reranking yields minor improvements in character-level and word-level metrics for the WFST and seq2seq respectively. Note: base model results are not intended as a direct comparison between the WFST and seq2seq, since they are trained on different amounts of data.

                                      srp→bos              bos→srp
                                   CER   WER  BLEU      CER   WER  BLEU
WFST                              .314   .50  25.3     .319   .52  25.5
Seq2Seq                           .375   .49  34.5     .395   .49  36.3
Reranked WFST                     .314   .49  26.3     .317   .50  28.1
Reranked Seq2Seq                  .376   .48  35.1     .401   .47  37.0
Product of experts                .329   .54  24.4     .352   .66  20.6
Pourdamghani and Knight (2017)      —     —   42.3       —     —   39.2
He et al. (2020)                  .657   .81   5.6     .693   .83   4.7

Table 3: Character and word error rates (lower is better) and BLEU scores (higher is better) for the related language translation task. Bold indicates best per column. The WFST and the seq2seq have comparable CER and WER despite the WFST being trained on up to 160x less source-side data (§4.1). While none of our models achieve the scores reported by Pourdamghani and Knight (2017), they all substantially outperform the subword-level model of He et al. (2020). Note: base model results are not intended as a direct comparison between the WFST and seq2seq, since they are trained on different amounts of data.

LSTM (Hochreiter and Schmidhuber, 1997) encoder and decoder with attention, trained to map sentences from each domain into a shared latent space. Using a combined objective, the UNMT model is trained to denoise, translate in both directions, and discriminate between the latent representations of sequences from different domains. Since a sufficient amount of balanced data is crucial for UNMT performance, we train the seq2seq model on all available data on both the source and target sides. Additionally, the seq2seq model decides on early stopping by evaluating on a small parallel validation set, which our WFST model does not have access to.

The WFST model treats the target and source training data differently, using the former to train the language model and the latter for learning the emission parameters, while the UNMT model is trained to translate in both directions simultaneously. Therefore, we reuse the same seq2seq model for both directions of the translation task, but train a separate finite-state model for each direction.

4.2 Model combinations

The simplest way to combine two independently trained models is reranking: using one model to produce a list of candidates and rescoring them according to another model. To generate candidates with a WFST, we apply the n-shortest-paths algorithm (Mohri and Riley, 2002). It should be noted that the n-best list might contain duplicates, since each path represents a specific source-target character alignment. The length constraints encoded in the WFST also restrict its capacity as a reranker: beam search in the UNMT model may produce hypotheses too short or long to have a non-zero


Input: свако има право да слободно учествује у културном животу заједнице, да ужива у уметности и да учествује у научном напретку и у добробити која отуда проистиче.

Ground truth: svako ima pravo da slobodno sudjeluje u kulturnom životu zajednice, da uživa u umjetnosti i da ucestvuje u znanstvenom napretku i u njegovim koristima.

WFST: svako ima pravo da slobodno ucestvuje u kulturnom životu s jednice, da uživa u m etnosti i da ucestvuje u naucnom napretku i u dobrobiti koja otuda pr istice.

Reranked WFST: svako ima pravo da slobodno ucestvuje u kulturnom životu s jednice, da uživa u m etnosti i da ucestvuje u naucnom napretku i u dobrobiti koja otuda pr istice.

Seq2Seq: svako ima pravo da slobodno ucestvuje u kulturnom životu zajednice, da ucestvuje u naucnom napretku i u dobrobiti koja otuda proistice.

Reranked Seq2Seq: svako ima pravo da slobodno ucestvuje u kulturnom životu zajednice, da uživa u umjetnosti i da ucestvuje u naucnom napretku i u dobrobiti koja otuda proistice

Product of experts: svako ima pravo da slobodno ucestvuje u kulturnom za u sajednice, da živa u umjetnosti i da ucestvuje u naucnom napretku i u dobro j i koja otuda proisti

Subword Seq2Seq: sami ima pravo da slobodno utice na srpskom nivou vlasti da razgovaraju u bosne i da djeluje u medunarodnom turizmu i na buducnosti koja muža decisno.

Table 4: Different model outputs for a srp→bos translation example. Prediction errors are highlighted in red. Correctly transliterated segments that do not match the ground truth (e.g. due to paraphrasing) are shown in yellow. Here the WFST errors are substitutions or deletions of individual characters, while the seq2seq drops entire words from the input (§5 #4). The latter problem is solved by reranking with a WFST for this example. The seq2seq model with subword tokenization (He et al., 2020) produces mostly hallucinated output (§5 #2). Example outputs for all other datasets can be found in the Appendix.

probability under the WFST.

Our second approach is a product-of-experts-style joint decoding strategy (Hinton, 2002): we perform beam search on the WFST lattice, reweighting the arcs with the output distribution of the seq2seq decoder at the corresponding timestep. For each partial hypothesis, we keep track of the WFST state s and the partial input and output sequences x1:k and y1:t.⁴ When traversing an arc with input label i ∈ {xk+1, ε} and output label o, we multiply the arc weight by the probability of the neural model outputting o as the next character: pseq2seq(yt+1 = o | x, y1:t). Transitions with o = ε (i.e. deletions) are not rescored by the seq2seq. We group hypotheses by their consumed input length k and select the n best extensions at each timestep.

4.3 Additional baselines

For the translation task, we also compare to prior unsupervised approaches of different granularity: the deep generative style transfer model of He et al. (2020) and the character- and word-level WFST decipherment model of Pourdamghani and Knight (2017). The former is trained on the same training set tokenized into subword units (Sennrich et al., 2016), and we evaluate it on our UDHR test set for fair comparison. While the train and test data of Pourdamghani and Knight (2017) also use the same respective sources, we cannot account for tokenization differences that could affect the scores reported by the authors.

⁴ Due to insertions and deletions in the emission model, k and t might differ; epsilon symbols are not counted.

5 Results and analysis

Tables 2 and 3 present our evaluation of the two base models and three decoding-time model combinations on the romanization decipherment and related language translation tasks respectively. For each experiment, we report character error rate, word error rate, and BLEU (see Appendix C). The results for the base models support what we show later in this section: the seq2seq model is more likely to recover words correctly (higher BLEU, lower WER), while the WFST is more faithful on the character level and avoids word-level substitution errors (lower CER). Example predictions can be found in Table 4 and in the Appendix.
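For reference, the edit-distance-based metrics can be sketched as below; this is a minimal generic implementation, not the exact evaluation script used for the tables (which may differ in tokenization and normalization details):

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences (single-row DP)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # delete h
                                   d[j - 1] + 1,     # insert r
                                   prev + (h != r))  # substitute or match
    return d[len(ref)]

def cer(hyp, ref):
    """Character error rate: character edit distance over reference length."""
    return edit_distance(hyp, ref) / len(ref)

def wer(hyp, ref):
    """Word error rate: word edit distance over reference word count."""
    return edit_distance(hyp.split(), ref.split()) / len(ref.split())
```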

Our further qualitative and quantitative findings are summarized in the following high-level takeaways:

#1: Model combinations still suffer from search issues. We would expect the combined decoding to discourage all errors common under one model but not the other, improving performance by leveraging the strengths of both model classes. However, as Tables 2 and 3 show, they instead


Figure 6: Highest-density sub-matrices of the two base models' character confusion matrices (WFST left, Seq2Seq right), computed on the Russian romanization task. White cells represent zero elements. The WFST confusion matrix is noticeably sparser than the seq2seq one, indicating more repetitive errors. The # symbol stands for UNK.

mostly interpolate between the scores of the two base models. In the reranking experiments, we find that this is often due to the same base model error (e.g. the seq2seq model hallucinating a word mid-sentence) repeating across all the hypotheses in the final beam. This suggests that successful reranking would require a much larger beam size or a diversity-promoting search mechanism.

Interestingly, we observe that although adding a reranker on top of a decoder does improve performance slightly, the gain is only in terms of the metrics that the base decoder is already strong at (character-level for the reranked WFST and word-level for the reranked seq2seq) at the expense of the other scores. Overall, none of our decoding strategies achieves the best results across the board, and no model combination substantially outperforms both base models in any metric.

#2: Character tokenization boosts performance of the neural model. In the past, UNMT-style models have been applied to various unsupervised sequence transduction problems. However, since these models were designed to operate on the word or subword level, prior work assumes the same tokenization is necessary. We show that for tasks allowing a character-level framing, such models in fact respond extremely well to character input.

Table 3 compares the UNMT model trained on characters with the seq2seq style transfer model of He et al. (2020) trained on subword units. The original paper shows improvement over the UNMT baseline in the same setting, but simply switching to character-level tokenization without any other changes results in a gain of 30 BLEU points in either direction. This suggests that the tokenization choice can act as an inductive bias for seq2seq models, and a character-level framing can be useful even for tasks that are not truly character-level.

This observation also aligns with the findings of recent work on language modeling complexity (Park et al., 2021; Mielke et al., 2019). For many languages, including several Slavic ones related to the Serbian-Bosnian pair, a character-level language model yields lower surprisal than one trained on BPE units, suggesting that the effect might also be explained by character tokenization making the language easier to language-model.

#3: The WFST model makes more repetitive errors. Although two of our evaluation metrics, CER and WER, are based on edit distance, they do not distinguish between the different types of edits (substitutions, insertions, and deletions). Breaking them down by edit operation, we find that while both models favor substitutions on both the word and character level, insertions and deletions are more frequent under the neural model (43% vs. 30% of all edits on the Russian romanization task). We also find that the character substitution choices of the neural model are more context-dependent: while the total counts of substitution errors for the two models are comparable, the WFST is more likely to repeat the same few substitutions per character type. This is illustrated by Figure 6, which visualizes the most populated submatrices of the confusion matrices for the same task as heatmaps. The WFST confusion matrix is noticeably more sparse, with the same few substitutions occurring much more frequently than others: for a given character, the WFST tends to repeat the same one or two substitutes, while the neural model's substitutions for the same character are distributed closer to uniform. This suggests that the WFST errors might be easier to correct with rule-based postprocessing. Interestingly, we did not observe the same effect for the translation task, likely due to the more constrained nature of the orthography conversion.
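The confusion-matrix analysis can be sketched as follows: align each prediction to its reference with character-level edit distance, then tally which reference character was replaced by which predicted character. This is a generic reimplementation under our own naming, not the paper's analysis code:

```python
from collections import Counter

def align_ops(hyp, ref):
    """Levenshtein alignment; returns (ref_char, hyp_char) pairs for
    substitutions and matches, skipping insertions and deletions."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    # Trace back, collecting substitution/match pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            pairs.append((ref[j - 1], hyp[i - 1]))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1          # insertion in the hypothesis
        else:
            j -= 1          # deletion from the reference
    return pairs[::-1]

def confusion_counts(sentence_pairs):
    """Count (ref_char -> hyp_char) substitution errors over (hyp, ref) pairs."""
    conf = Counter()
    for hyp, ref in sentence_pairs:
        for r, h in align_ops(hyp, ref):
            if r != h:       # count errors only, as in the heatmaps
                conf[(r, h)] += 1
    return conf
```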


Figure 7: Character error rate per word for the WFST (left) and seq2seq (right) bos→srp translation outputs. The predictions are segmented using the Moses tokenizer (Koehn et al., 2007) and aligned to ground truth with word-level edit distance. The increased frequency of CER=1 for the seq2seq model as compared to the WFST indicates that it replaces entire words more often.

#4: The neural model is more sensitive to data distribution shifts. The language model aiming to replicate its training data distribution can cause the output to deviate from the input significantly. This can be an artifact of a domain shift, such as in Russian, where the LM training data came from a political discussion forum: the seq2seq model frequently predicts unrelated domain-specific proper names in place of very common Russian words, e.g. жизнь [žizn', 'life'] → Зюганов [Zjuganov, 'Zyuganov (politician's last name)'] or это [èto, 'this'] → Единая Россия [Edinaja Rossija, 'United Russia (political party)'], presumably distracted by the shared first character in the romanized version. To quantify the effect of a mismatch between the train and test data distributions in this case, we inspect the most common word-level substitutions under each decoding strategy, looking at all substitution errors covered by the 1,000 most frequent substitution 'types' (ground truth-prediction word pairs) under the respective decoder. We find that 25% of the seq2seq substitution errors fall into this category, as compared to merely 3% for the WFST; this is notable given the relative proportion of in-vocabulary words in the models' outputs (89% for UNMT vs. 65% for WFST).

Comparing the error rate distribution across output words for the translation task also supports this observation. As can be seen from Figure 7, the seq2seq model is likely to either predict a word correctly (CER of 0) or entirely wrong (CER of 1), while the WFST more often predicts the word partially correctly; the examples in Table 4 illustrate this as well. We also see this in the Kannada outputs: the WFST typically gets all the consonants right but makes mistakes in the vowels, while the seq2seq tends to replace the entire word.

6 Conclusion

We perform comparative error analysis of finite-state and seq2seq models and their combinations on two unsupervised character-level tasks: informal romanization decipherment and related language translation. We find that the two model types tend towards different errors: seq2seq models are more prone to word-level errors caused by distributional shifts, while WFSTs produce more character-level noise despite their hard alignment constraints.

Although none of our simple decoding-time combinations substantially outperforms the base models, we believe that combining neural and finite-state models to harness their complementary advantages is a promising research direction. Such combinations might involve biasing seq2seq models towards WFST-like behavior via pretraining or directly encoding constraints such as hard alignment or monotonicity into their parameterization (Wu et al., 2018; Wu and Cotterell, 2019). Although recent work has shown that the Transformer can learn to perform character-level transduction without such biases in a supervised setting (Wu et al., 2021), exploiting the structured nature of the task could be crucial for making up for the lack of large parallel corpora in low-data and/or unsupervised scenarios. We hope that our analysis provides insight into leveraging the strengths of the two approaches for modeling character-level phenomena in the absence of parallel data.

Acknowledgments

The authors thank Badr Abdullah, Deepak Gopinath, Junxian He, Shruti Rijhwani, and Stas Kashepava for helpful discussion, and the anonymous reviewers for their valuable feedback.


ReferencesRoee Aharoni and Yoav Goldberg. 2017. Morphologi-

cal inflection generation with hard monotonic atten-tion. In Proceedings of the 55th Annual Meeting ofthe Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 2004–2015, Vancouver,Canada. Association for Computational Linguistics.

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Woj-ciech Skut, and Mehryar Mohri. 2007. OpenFst: Ageneral and efficient weighted finite-state transducerlibrary. In Proceedings of the Ninth InternationalConference on Implementation and Application ofAutomata, (CIAA 2007), volume 4783 of LectureNotes in Computer Science, pages 11–23. Springer.http://www.openfst.org.

Sowmya V. B., Monojit Choudhury, Kalika Bali,Tirthankar Dasgupta, and Anupam Basu. 2010. Re-source creation for training and testing of translit-eration systems for Indian languages. In Proceed-ings of the Seventh International Conference on Lan-guage Resources and Evaluation (LREC’10), Val-letta, Malta. European Language Resources Associ-ation (ELRA).

Ann Bies, Zhiyi Song, Mohamed Maamouri, StephenGrimes, Haejoong Lee, Jonathan Wright, StephanieStrassel, Nizar Habash, Ramy Eskander, and OwenRambow. 2014. Transliteration of Arabizi into Ara-bic orthography: Developing a parallel annotatedArabizi-Arabic script SMS/chat corpus. In Proceed-ings of the EMNLP 2014 Workshop on Arabic Nat-ural Language Processing (ANLP), pages 93–103,Doha, Qatar. Association for Computational Lin-guistics.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminativereranking. In Proceedings of the 43rd Annual Meet-ing of the Association for Computational Linguis-tics (ACL’05), pages 173–180, Ann Arbor, Michi-gan. Association for Computational Linguistics.

Kareem Darwish. 2014. Arabizi detection and conver-sion to Arabic. In Proceedings of the EMNLP 2014Workshop on Arabic Natural Language Processing(ANLP), pages 217–224, Doha, Qatar. Associationfor Computational Linguistics.

Jason Eisner. 2002. Parameter estimation for prob-abilistic finite-state transducers. In Proceedingsof the 40th Annual Meeting of the Association forComputational Linguistics, pages 1–8, Philadelphia,Pennsylvania, USA. Association for ComputationalLinguistics.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff.2012. Building large monolingual dictionaries atthe Leipzig corpora collection: From 100 to 200 lan-guages. In Proceedings of the Eighth InternationalConference on Language Resources and Evaluation(LREC’12), pages 759–765, Istanbul, Turkey. Euro-pean Language Resources Association (ELRA).

Kyle Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 75–80, Berlin, Germany. Association for Computational Linguistics.

Nizar Habash, Mona Diab, and Owen Rambow. 2012. Conventional orthography for dialectal Arabic. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 711–718, Istanbul, Turkey. European Language Resources Association (ELRA).

Bradley Hauer, Ryan Hayward, and Grzegorz Kondrak. 2014. Solving substitution ciphers with combined language models. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2314–2325, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations.

Felix Hieber and Stefan Riezler. 2015. Bag-of-words forced decoding for cross-lingual information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1182, Denver, Colorado. Association for Computational Linguistics.

G. E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, and Brian Roark. 2021. Finite-state script normalization and processing utilities: The Nisaba Brahmic library. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 14–23, Online. Association for Computational Linguistics.

Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499–506, Sydney, Australia. Association for Computational Linguistics.


Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611–619, Boulder, Colorado. Association for Computational Linguistics.

Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley, and Jason Eisner. 2019. Neural finite-state transducers: Beyond rational relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 272–283, Minneapolis, Minnesota. Association for Computational Linguistics.

Nikola Ljubešić and Filip Klubička. 2014. bs,hr,srWaC – web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29–35, Gothenburg, Sweden. Association for Computational Linguistics.

Peter Makarov, Tatiana Ruzsics, and Simon Clematide. 2017. Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 49–57, Vancouver. Association for Computational Linguistics.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Mehryar Mohri. 2009. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer.

Mehryar Mohri and Michael Riley. 2002. An efficient algorithm for the n-best-strings problem. In Seventh International Conference on Spoken Language Processing.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology matters: A multilingual language modeling analysis. Transactions of the Association for Computational Linguistics, 9:261–276.

Martin Paulsen. 2014. Translit: Computer-mediated digraphia on the Runet. Digital Russia: The Language, Culture and Politics of New Media Communication.

Nima Pourdamghani and Kevin Knight. 2017. Deciphering related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2513–2518, Copenhagen, Denmark. Association for Computational Linguistics.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner. 2016. Weighting finite-state transductions with neural context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 623–633, San Diego, California. Association for Computational Linguistics.

Sujith Ravi and Kevin Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45, Boulder, Colorado. Association for Computational Linguistics.

Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea. Association for Computational Linguistics.

Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, Isin Demirsahin, and Keith Hall. 2020. Processing South Asian languages written in the Latin script: The Dakshina dataset. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2413–2423, Marseille, France. European Language Resources Association.

Maria Ryskina, Matthew R. Gormley, and Taylor Berg-Kirkpatrick. 2020. Phonetic and visual priors for decipherment of informal Romanization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8308–8319, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Tatiana Shavrina and Olga Shapovalova. 2017. To the methodology of corpus construction for machine learning: Taiga syntax tree corpus and parser. In Proc. CORPORA 2017 International Conference, pages 78–84, St. Petersburg.

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas Thomas, Brendan Callahan, and Ann Sawyer. 2014. Collecting natural SMS and chat conversations in multiple languages: The BOLT phase 2 corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1699–1704, Reykjavik, Iceland. European Language Resources Association (ELRA).

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.

Shijie Wu and Ryan Cotterell. 2019. Exact hard monotonic attention for character-level transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1530–1537, Florence, Italy. Association for Computational Linguistics.

Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021. Applying the transformer to character-level transduction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1901–1907, Online. Association for Computational Linguistics.

Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4425–4438, Brussels, Belgium. Association for Computational Linguistics.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In NeurIPS, pages 7298–7309.


A Data download links

The romanized Russian and Arabic data and preprocessing scripts can be downloaded here. This repository also contains the relevant portion of the Taiga dataset, which can be downloaded in full at this link. The romanized Kannada data was downloaded from the Dakshina dataset.

The scripts to download the Serbian and Bosnian Leipzig corpora data can be found here. The UDHR texts were collected from the corresponding pages: Serbian, Bosnian.

The keyboard layouts used to construct the phonetic priors are collected from the following sources: Arabic 1, Arabic 2, Russian, Kannada, Serbian. The Unicode confusables list used for the Russian visual prior can be found here.

B Implementation

WFST We reuse the unsupervised WFST implementation of Ryskina et al. (2020),5 which utilizes the OpenFst (Allauzen et al., 2007) and OpenGrm (Roark et al., 2012) libraries. We use the default hyperparameter settings described by the authors (see Appendix B in the original paper). We keep the hyperparameters unchanged for the translation experiment and set the maximum delay value to 2 for both translation directions.

UNMT We use the PyTorch UNMT implementation of He et al. (2020),6 which incorporates improvements introduced by Lample et al. (2019), such as the addition of a max-pooling layer. We use a single-layer LSTM (Hochreiter and Schmidhuber, 1997) with hidden state size 512 for both the encoder and the decoder, and embedding dimension 128. For the denoising autoencoding loss, we adopt the default noise model and hyperparameters as described by Lample et al. (2018). The autoencoding loss is annealed over the first 3 epochs. We predict the output using greedy decoding and set the maximum output length equal to the length of the input sequence. Patience for early stopping is set to 10.

Model combinations Our joint decoding implementations rely on PyTorch and the Pynini finite-state library (Gorman, 2016). In reranking, we rescore the n = 5 best hypotheses produced using beam search and the n-shortest-paths algorithm for the UNMT and WFST models, respectively. Product-of-experts decoding is also performed with beam size 5.

5 https://github.com/ryskina/romanization-decipherment

6 https://github.com/cindyxinyiwang/deep-latent-sequence-model
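As a sketch, the reranking step amounts to picking, from one model's n-best list, the hypothesis with the best combined score. The additive log-score combination and the `weight` parameter below are illustrative assumptions of ours, not necessarily the exact scheme used in the experiments:

```python
def rerank(nbest, rescore, weight=1.0):
    """Pick the hypothesis maximizing base_score + weight * rescore(hyp).

    nbest:   list of (hypothesis, base_log_score) pairs from one model
    rescore: function returning the other model's log score for a hypothesis
    """
    return max(nbest, key=lambda pair: pair[1] + weight * rescore(pair[0]))[0]
```

With `weight=0` this reduces to taking the first model's one-best output, so the weight interpolates between the two models' preferences.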

C Metrics

The character error rate (CER) and word error rate (WER) are measured as the Levenshtein distance between the hypothesis and the reference divided by the reference length:

$$\mathrm{ER}(h, r) = \frac{\mathrm{dist}(h, r)}{\mathrm{len}(r)}$$

with both the numerator and the denominator measured in characters and words, respectively.
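Concretely, both error rates can be computed with the standard dynamic-programming edit distance; this is a minimal sketch of ours, not the evaluation script actually used:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (characters or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def error_rate(hyp, ref, unit="char"):
    """CER for unit='char'; WER for unit='word' (whitespace tokenization)."""
    h, r = (list(hyp), list(ref)) if unit == "char" else (hyp.split(), ref.split())
    return levenshtein(h, r) / len(r)
```

Note that, because the denominator is the reference length, the error rate can exceed 1 when the hypothesis is much longer than the reference.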

We report the BLEU-4 score (Papineni et al., 2002), measured using the Moses toolkit script.7 For both BLEU and WER, we split sentences into words using the Moses tokenizer (Koehn et al., 2007).

7 https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl


Input: kongress ne odobril biudjet dlya osuchestvleniye "bor'bi s kommunizmom" v yuzhniy amerike.
Ground truth: kongress ne odobril bjudžet dlja osuščestvlenija "bor'by s kommunizmom" v južnoj amerike.
WFST: kongress ne odobril viud et dla osuscestvleniy e "bor#bi s kommunizmom" v uuznani amerike.
Reranked WFST: kongress ne odobril vid et de la osuscestvleniy e "bor#bi s kommunizmom" v uuznani amerike.
Seq2Seq: kongress ne odobril by udivitel'no s kommunizmom" v južnyj amerike.
Reranked Seq2Seq: kongress ne odobril bjudžet dlja osuščestvlenie "bor'by s kommunizmom" v južnyj amerike.
Product of experts: kongress ne odobril b id et dlja a osuščestvleniy e "bor'by s kommunizmom" v uuznnik ameri

Table 5: Different model outputs for a Russian transliteration example (scientific transliteration; the Cyrillic column of the original table is not reproducible here). Prediction errors are shown in red in the original. Correctly transliterated segments that do not match the ground truth because of spelling standardization in annotation are in yellow. # stands for UNK.

Input: ana h3dyy 3lek bokra 3la 8 kda
Ground truth: AnA H>Edy Elyk bkrp ElY 8 kdh
WFST: AnA H d yy l k bkr l> 8 kdh
Reranked WFST: AnA H d yy l k bkr l> 8 kdh
Seq2Seq: AnA b>dy >xl k Hr >wl 1 kdh
Reranked Seq2Seq: AnA b>dy >xl k Hr >wl 1 kdh
Product of experts: AnA dy l k b krA > lA 8 kdh

Table 6: Different model outputs for an Arabizi transliteration example (Buckwalter transliteration; the Arabic-script column of the original table is not reproducible here). Prediction errors are highlighted in red in the romanized versions in the original. Correctly transliterated segments that do not match the ground truth because of spelling standardization during annotation are highlighted in yellow.

Input: kshullaka baalina avala horaatavannu adu vivarisuttade.
Ground truth: ks.ullaka bal.ina aval.a horat.avannu adu vivarisuttade.
WFST: k uhu ll akhe ba l inu v a l.a horatavannu adu vivarisuttade.
Reranked WFST: k uhu ll akhe ba l ina v a l. u horatavannu adu vivarisuttade.
Seq2Seq: kal. uhul.l. a bav i mg illav e horat.avannu idu vivarisuttade.
Reranked Seq2Seq: kal. uhul.l. a bav i mta ill av e horat.avannu idu vivarisuttade.
Product of experts: k al.l. a bakal inna v ala horat.a tv annu du viv a risuttada

Table 7: Different model outputs for a Kannada transliteration example (ISO 15919 transliteration; the Kannada-script column of the original table is not reproducible here). The ISO romanization is generated using the Nisaba library (Johny et al., 2021). Prediction errors are highlighted in red in the romanized versions in the original.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 212–221, August 5, 2021. ©2021 Association for Computational Linguistics

Finite-state Model of Shupamem Reduplication

Magdalena Markowska, Jeffrey Heinz, and Owen Rambow
Stony Brook University
Department of Linguistics & Institute for Advanced Computational Science
magdalena.markowska,jeffrey.heinz,[email protected]

Abstract

Shupamem, a language of Western Cameroon, is a tonal language which also exhibits the morpho-phonological process of full reduplication. This creates two challenges for a finite-state model of its morpho-syntax and morpho-phonology: how to manage the full reduplication, as well as the autosegmental nature of lexical tone. Dolatian and Heinz (2020) explain how 2-way finite-state transducers can model full reduplication without an exponential increase in states, and finite-state transducers with multiple tapes have been used to model autosegmental tiers, including tone (Wiebe, 1992; Dolatian and Rawski, 2020a; Rawski and Dolatian, 2020). Here we synthesize 2-way finite-state transducers and multi-tape transducers, resulting in a finite-state formalism that subsumes both, to account for the full reduplicative processes in Shupamem which also affect tone.

1 Introduction

Reduplication is a very common morphological process cross-linguistically. Approximately 75% of the world's languages exhibit partial or total reduplication (Rubino, 2013). This morphological process is particularly interesting from the computational point of view because it introduces challenges for 1-way finite-state transducers (FSTs). Even though partial reduplication can be modelled with 1-way FSTs (Roark and Sproat, 2007; Chandlee and Heinz, 2012), there is typically an explosion in the number of states. Total reduplication, on the other hand, is the only known morpho-phonological process that cannot be modelled with 1-way FSTs, because the number of copied elements, in principle, has no upper bound. Dolatian and Heinz (2020) address this challenge with 2-way FSTs, which can move back and forth on the input tape, producing a faithful copy of a string.

Deterministic 2-way FSTs can model both partial and full segmental reduplication in a compact way.

However, many languages that exhibit reduplicative processes are also tonal, which often means that tones and segments act independently from one another in their morpho-phonology. For instance, in Shupamem, a tonal language of Western Cameroon, ndap 'house' → ndap ndap 'houses' (Markowska, 2020).

tones:     H      HL     L
segments:  ndap   ndap   ndap

Pioneering work in autosegmental phonology (Leben, 1973; Williams, 1976; Goldsmith, 1976) shows tones may act independently from their tone-bearing units (TBUs). Moreover, tones may exhibit behavior that is not typical for segments (Hyman, 2014; Jardine, 2016), which brings yet another strong argument for separating them from segments in their linguistic representations. Such autosegmental representations can be mimicked using finite-state machines, in particular, Multi-Tape Finite-State Transducers (MT FSTs) (Wiebe, 1992; Dolatian and Rawski, 2020a; Rawski and Dolatian, 2020). We note that McCarthy (1981) uses the same autosegmental representations in the linguistic representation to model templatic morphology, and this approach has been modeled for Semitic morphological processing using multi-tape automata (Kiraz, 2000; Habash and Rambow, 2006).

This paper investigates what finite-state machinery is needed for languages which have both reduplication and tones. We first argue that we need a synthesis of the aforementioned transducers, i.e. 1-way, 2-way and MT FSTs, to model morphology in the general case. The necessity for such a formal device will be supported by the morpho-phonological processes present in Shupamem nominal and verbal reduplication. We then discuss an alternative, in which we use the MT FST to handle both reduplication and tones. It is important to emphasize that all of the machines we discuss are deterministic, which serves as another piece of evidence that even complex processes like full reduplication can be modelled with deterministic finite-state technology (Chandlee and Heinz, 2012; Heinz, 2018).

This paper is structured as follows. First, we will briefly summarize the linguistic phenomena observed in Shupamem reduplication (Section 2). We then provide a formal description of the 2-way (Section 3) and MT FSTs (Section 4). We propose a synthesis of the 1-way, 2-way and MT FSTs in Section 5 and further illustrate them using relevant examples from Shupamem in Section 6. In Section 7 we discuss a possible alternative to the model which uses only MT FSTs. Finally, in Section 8 we show that the proposed model works for other tonal languages as well, and we conclude our contributions.

2 Shupamem nominal and verbal reduplication

Shupamem is an understudied Grassfields Bantu language of Cameroon spoken by approximately 420,000 speakers (Eberhard et al., 2021). It exhibits four contrastive surface tones (Nchare, 2012): high (H; we use the diacritic V́ on a vowel as an orthographic representation), low (L; diacritic V̀), rising (LH; diacritic V̌), and falling (HL; diacritic V̂). Nouns and verbs in the language reduplicate to create plurals and introduce semantic contrast, respectively. Out of 13 noun classes, only one exhibits reduplication. Nouns that belong to that class are monosyllabic and carry either H or L lexical tones. Shupamem verbs are underlyingly H or rising (LH). Table 1 summarizes the data adapted from Markowska (2020).

In both nouns and verbs, the first item of the reduplicated phrase is the base, while the reduplicant is the suffix. We follow the analysis in Markowska (2020) and summarize it here. The nominal reduplicant is toneless underlyingly, while the verbal reduplicant has an H tone. Furthermore, the rule of Opposite Tone Insertion explains the tonal alternation in the base of reduplicated nouns, and Default L-Insertion accounts for the L tone on the suffix. Interestingly, more tonal alternations are observed when the tones present in the reduplicated phrase interact with other phrasal/grammatical tones. For the purpose of this paper, we provide only a summary of those tonal alternations in Table 2.

        Transl.   Lemma      Red. form
Nouns   'crab'    kam (H)    kam kam (HL L)
        'game'    kam (L)    kam kam (LH L)
Verbs   'fry'     ka (H)     ka kŤa (H ŤH)
        'peel'    ka (LH)    ka kŤa (LH ŤH)

Table 1: Nominal and verbal reduplication in Shupamem (tones of each form shown in parentheses)

        Red. tones   Output
Nouns   HL L         HL H
        LH L         LH H
Verbs   HŤH          HŤH
        LHŤH         HL LH

Table 2: Tonal alternations: interaction of tones for reduplicated forms ("Red. tones") with grammatical H tones

The underlined tones indicate changes triggered by the H grammatical/phrasal tone. In the presence of an H tone associated with the right edge of the subject position in Shupamem, the L tone that is present on the surface in the suffix of reduplicated nouns (recall Table 1) is now represented with an H tone. Now it should be clear that the noun reduplicant should, in fact, be toneless in the underlying representation (UR). While the presence of an H tone directly preceding the reduplicated verb does not affect H-tone verbs, such as ka 'fry', it causes major tonal alternations in rising reduplicated verbs. Let us look at a particular example representing the final row in Table 2:

pə 'PASTIII' + ka ka 'peel.CONTR' → pə ka ka

The H tone associated with the tense marker introduces two tonal changes to the reduplicated verb: it causes tonal reversal on the morphological base, and it triggers L-tone insertion at the left edge of the reduplicant.

The data in both Table 1 and 2 show that 1) verbs and nouns in Shupamem reduplicate fully at the segmental level, and 2) tones are affected by phonological rules that function solely at the suprasegmental level. Consequently, a finite-state model of the language must be able to account for those factors. In the next two sections, we will provide a brief formal introduction to 2-way FSTs and MT-FSTs, and explain how they correctly model full reduplication and autosegmental representation, respectively.

[Figure 1 state diagram not reproducible in plain text: states q0–q4, with transitions (⋊, λ, +1) into q1; (k, k, +1) and (a, a, +1) looping in q1; (⋉, λ, −1) into q2; (k, λ, −1) and (a, λ, −1) looping in q2; (⋊, ∼, +1) into q3; (k, k, +1) and (a, a, +1) looping in q3; and (⋉, λ, +1) into the final state q4.]

Figure 1: 2-way FST for total reduplication of ka 'fry.IMP' → ka ka 'fry.IMP, as opposed to boiling'

In this paper, we use an orthographic representation for Shupamem which uses diacritics to indicate tone. Shupamem does not actually use this orthography; however, we are interested in modeling the entire morpho-phonology of the language, independently of choices made for the orthography. Furthermore, many languages do use diacritics to indicate tone, including the Volta-Niger languages Yoruba and Igbo, as well as Dschang, a Grassfields language closely related to Shupamem. (For a discussion of the orthography of Cameroonian languages, with a special consideration of tone, see (Bird, 2001).) Diacritics are also used to write tone in non-African languages, such as Vietnamese. Therefore, this paper is also relevant to NLP applications for morphological analysis and generation in languages whose orthography marks tones with diacritics: the automata we propose could be used to directly model morpho-phonological computational problems for the orthography of such languages.

3 2-way FSTs

As Roark and Sproat (2007) point out, almost all morpho-phonological processes can be modelled with 1-way FSTs, with the exception of full reduplication, whose output is not a regular language. One way to increase the expressivity of a 1-way FST is to allow the read head of the machine to move back and forth on the input tape. This is exactly what a 2-way FST does (Rabin and Scott, 1959), and Dolatian and Heinz (2020) explain how these transducers model full reduplication not only effectively, but more faithfully to linguistic generalizations.

Similarly to a 1-way FST, when a 2-way FST reads an input, it writes something on the output tape. If the desired output is a fully reduplicated string, then the FST faithfully 'copies' the input string while scanning it from left to right. In contrast, while scanning the string from right to left, it outputs nothing (λ), and it then copies the string again from left to right.

Figure 1 illustrates a deterministic 2-way FST that reduplicates ka 'fry.IMP'; readers are referred to Dolatian and Heinz (2020) for formal definitions. The key difference between deterministic 1-way FSTs and deterministic 2-way FSTs is the addition of the 'direction' parameters +1, 0, −1 on the transitions, which tell the FST to advance to the next symbol on the input tape (+1), stay on the same symbol (0), or return to the previous symbol (−1). Deterministic 1-way FSTs can be thought of as deterministic 2-way FSTs where transitions are all (+1).

The input to this machine is ⋊ka⋉. The ⋊ and ⋉ symbols mark the beginning and end of a string, and ∼ indicates the boundary between the first and the second copy. None of those symbols are essential for the model; nevertheless, they facilitate the transitions. For example, when the machine reads ⋉, it transitions from state q1 to q2 and reverses the direction of the read head. After outputting the first copy (state q1) and rewinding (state q2), the machine changes to state q3 when it scans the left boundary symbol ⋊ and outputs ∼ to indicate that another copy will be created. In this particular example, not marking the morpheme boundary would not affect the outcome. However, in Section 5, where we propose the fully-fledged model, it will be crucial to somehow separate the first from the second copy.
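The machine in Figure 1 can be simulated with an explicit transition table. This is a sketch of ours, not code from the paper: `<`, `>`, and `~` stand in for the boundary markers ⋊, ⋉ and the copy boundary ∼, and the table only covers the segments k and a; a general machine would carry such looping transitions for every segment in the alphabet.

```python
# (state, symbol) -> (next_state, output, direction), following Figure 1
DELTA = {
    ("q0", "<"): ("q1", "",  +1),   # skip the left boundary
    ("q1", "k"): ("q1", "k", +1),   # copy segments left-to-right
    ("q1", "a"): ("q1", "a", +1),
    ("q1", ">"): ("q2", "",  -1),   # hit right boundary: reverse
    ("q2", "k"): ("q2", "",  -1),   # rewind, emitting nothing
    ("q2", "a"): ("q2", "",  -1),
    ("q2", "<"): ("q3", "~", +1),   # back at left edge: mark copy boundary
    ("q3", "k"): ("q3", "k", +1),   # copy the string a second time
    ("q3", "a"): ("q3", "a", +1),
    ("q3", ">"): ("q4", "",  +1),   # accept
}

def run_2way(word, delta=DELTA, start="q0", final="q4"):
    """Run a deterministic 2-way FST over <word> with boundary markers."""
    tape = "<" + word + ">"
    state, pos, out = start, 0, []
    while state != final:
        state, emit, move = delta[(state, tape[pos])]
        out.append(emit)
        pos += move
    return "".join(out)
```

The state count is fixed no matter how long the input is; it is the ability to rewind, not extra states, that produces the second copy.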

4 Multitape FSTs

Multiple-tape FSTs are machines which operate in exactly the same way as a 1-way FST, with one key difference: they can read the input and/or write the output on multiple tapes. Such a transducer can operate in either a synchronous or an asynchronous manner, such that the input will be read on all tapes and an output produced simultaneously, or the machine will operate on the input tapes one by one. An MT-FST can take a single ('linear') string as an input and output multiple strings on multiple tapes, or it can do the reverse (Rabin and Scott, 1959; Fischer, 1965).

To illustrate this idea, let us look at the Shupamem noun mapam 'coat'. It has been argued that Shupamem nouns with only L surface tones will have the L tone present in the UR (Markowska, 2019, 2020). Moreover, in order to avoid violating the Obligatory Contour Principle (OCP) (Leben, 1973), which prohibits two identical consecutive elements (tones) in the UR of a morpheme, we will assume that only one L tone is present in the input. Consequently, the derivation will look as shown in Table 3.

Input: T-tape        L
Input: S-tape        mapam
Output: single tape  màpàm

Table 3: Representation of MT-FST for màpàm 'coat'

Separating tones from segments in this manner, i.e. by representing tones on the T(one)-tape and segments on the S(egmental)-tape, faithfully resembles the linguistic understanding of the UR of a word. The surface form màpàm has only one L tone present in the UR, which then spreads to all TBUs, which happen to be vowels in Shupamem, if no other tone is present.

An example of a multi-tape machine is presented in Figure 2. For better readability, we introduce generalized symbols for vowels (V) and consonants (C), so that the input alphabet is Σ = {C, V, L, H} ∪ {⋊, ⋉}, and the output alphabet is Γ = {C, V́, V̀}. The machine operates on 2 input tapes and writes the output on a single tape. Therefore, we could think of such a machine as a linearizer. The two input tapes represent the Tonal and Segmental tiers, and so we label them T and S, respectively. We illustrate the functioning of the machine using the example (mapam, L) → màpàm 'coat'. While transitioning from state q0 to q1, the output is an empty string, since the left edge marker is being read on both tapes simultaneously. In state q1, when a consonant is being read, the machine outputs the exact same consonant on the output tape. However, when the machine reaches a vowel, it outputs a vowel with the tone that is being read at the same time on the T-tape (in our example, (L, V) → V̀) and transitions to state q2 (if the symbol on the T-tape is H) or q3 (for L, as in our example). In states q2 and q3, consonants are simply output as in state q1, but for vowels, one of three conditions may occur: the read head on the Tonal tape may be at H or L, in which case the automaton transitions (if not already there) to q2 (for H) or q3 (for L), and outputs the appropriate orthographic symbol. But if on the Tonal tape the read head is on the right boundary marker ⋉, we are in a case where there are more vowels on the Segmental tape than tones on the Tonal tape. This is when the OCP determines the interpretation: all vowels get the tone of the last symbol on the Tone tier (which we remember as states q2 and q3). In our example, this is an L. Finally, when the Segmental tape also reaches the right boundary marker ⋉, the machine transitions to the final state q4. This ('linearizing') MT-FST consists of 4 states and shows how OCP effects can be handled with asynchronous multi-tape FSTs. Note that when there are more tones on the Tonal tier than vowels on the Segmental tier, they are simply ignored. We refer readers to Dolatian and Rawski (2020b) for formal definitions of these MT transducers.
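The function this linearizer computes can be sketched as follows (ours, not the machine itself): the use of combining acute/grave accents for H/L and the small vowel inventory are illustrative assumptions covering the examples in the text.

```python
ACCENT = {"H": "\u0301", "L": "\u0300"}  # combining acute = H, grave = L
VOWELS = set("aeiou\u0259")              # illustrative inventory, incl. schwa

def linearize(tones, segments):
    """Linearize a (tone tape, segment tape) pair, e.g. (L, mapam) -> màpàm.

    Assumes at least one tone whenever the segment tape contains a vowel.
    """
    out, t, tone = [], 0, None
    for s in segments:
        if s in VOWELS:
            if t < len(tones):        # advance on the tone tape
                tone, t = tones[t], t + 1
            # otherwise the last tone spreads to remaining vowels (the OCP
            # effect described in the text)
            out.append(s + ACCENT[tone])
        else:
            out.append(s)             # consonants pass through unchanged
    return "".join(out)               # extra tones, if any, are ignored
```

The two defining behaviours of the machine are visible here: when tones run out, the last tone spreads to the remaining vowels, and when tones outnumber vowels, the leftovers are silently dropped.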

We are also interested in the inverse process, that is, a finite-state machine that in the example above would take a single input string [màpàm] and produce two output strings [L] and [mapam]. While multitape FSTs are generally conceived as relations over n-ary relations over strings, Dolatian and Rawski (2020b) define their machines deterministically with n input tapes and a single output tape. We generalize their definition below.

Similarly to the spreading processes described above, separating tones from segments gives us many benefits when accounting for the tonal alternations taking place in nominal and verbal reduplication in Shupamem. First of all, functions such as Opposite Tone Insertion (OTI) apply solely at the tonal level, while segments can undergo other operations at the same time (recall that MT-FSTs can operate on some or all tapes simultaneously). Secondly, representing tones separately from segments makes tonal processes local, and therefore all the alternations can be expressed with less powerful functions (Chandlee, 2017).

Now that we have presented the advantages of MT-FSTs, and the need for 2-way FSTs to model full reduplication, we combine these machines to account for all the morphophonological processes described in Section 2.

[Figure 2 (the state diagram of the linearizing MT-FST) is not reproduced in this text rendering.]

Figure 2: MT-FST: linearize. C and V are notational meta-symbols for consonants and vowels, resp.; T indicates the tone tape, S the segmental tape, and O the output tape.

5 Deterministic 2-Way Multi-tape FST

Before we define the Deterministic 2-Way Multi-tape FST (2-way MT-FST for short) we introduce some notation. An alphabet Σ is a finite set of symbols and Σ* denotes the set of all strings of finite length whose elements belong to Σ. We use λ to denote the empty string. For each n ∈ ℕ, an n-string is a tuple ⟨w1, …, wn⟩ where each wi is a string belonging to Σi* (1 ≤ i ≤ n). These n alphabets may or may not contain distinct symbols. We write w⃗ to indicate an n-string and λ⃗ to indicate the n-string in which each wi = λ. We also write Σ⃗ to denote a tuple of n alphabets: Σ⃗ = Σ1 × ⋯ × Σn. Elements of Σ⃗ are denoted σ⃗.

If w⃗ and v⃗ belong to Σ⃗*, then the pointwise concatenation of w⃗ and v⃗ is denoted w⃗v⃗ and equals ⟨w1, …, wn⟩⟨v1, …, vn⟩ = ⟨w1v1, …, wnvn⟩. We are interested in functions that map n-strings to m-strings, with n, m ∈ ℕ. In what follows we generally use the index i to range from 1 to n and the index j to range from 1 to m.
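Pointwise concatenation is the only operation on n-strings the definitions below rely on; as a minimal sketch (our own illustration, representing an n-string as a Python tuple of strings):

```python
def pointwise_concat(w, v):
    """Pointwise concatenation of two n-strings:
    <w1,...,wn><v1,...,vn> = <w1v1,...,wnvn>."""
    assert len(w) == len(v), "both n-strings must have the same arity n"
    return tuple(wi + vi for wi, vi in zip(w, v))

print(pointwise_concat(("ab", "x"), ("c", "yz")))  # → ('abc', 'xyz')
```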

We define the Deterministic 2-Way n,m Multitape FST (2-way (n,m) MT-FST for short) for n, m ∈ ℕ by synthesizing the definitions of Dolatian and Heinz (2020) and Dolatian and Rawski (2020b); n and m refer to the number of input tapes and output tapes, respectively. A Deterministic 2-Way n,m Multitape FST is a six-tuple (Q, Σ⃗, Γ⃗, q0, F, δ), where Q is a finite set of states with initial state q0 ∈ Q and final states F ⊆ Q, and

• Σ⃗ = ⟨Σ1, …, Σn⟩ is a tuple of n input alphabets that include the boundary symbols, i.e., {⋊, ⋉} ⊂ Σi (1 ≤ i ≤ n);

• Γ⃗ is a tuple of m output alphabets Γj (1 ≤ j ≤ m);

• δ : Q × Σ⃗ → Q × Γ⃗* × D⃗ is the transition function, where D = {−1, 0, +1} is an alphabet of directions, D⃗ is an n-tuple of directions, and Γ⃗* is an m-tuple of strings written to the output tapes.


Figure 3: Synthesis of 2-way FST and MT-FST

We understand δ(q, σ⃗) = (r, v⃗, d⃗) as follows. If the transducer is in state q and the n read heads on the input tapes are on the symbols ⟨σ1, …, σn⟩ = σ⃗, then several actions ensue. The transducer changes to state r and pointwise concatenates v⃗ onto the m output tapes. The n read heads then move according to the instructions d⃗ ∈ D⃗: the read head on input tape i moves back one symbol iff di = −1, stays where it is iff di = 0, and advances one symbol iff di = +1. (If a read head "falls off" the beginning or end of its string, the computation halts.)

The function recognized by a 2-way (n,m) MT-FST is defined as follows. A configuration of an (n,m) MT-FST T is a 4-tuple in Σ⃗* × Q × Σ⃗* × Γ⃗*. The meaning of the configuration (w⃗, q, x⃗, u⃗) is that the input to T is w⃗x⃗, that the machine is currently in state q with the n read heads on the first symbol of each xi (or fallen off the right edge of the i-th input tape if xi = λ), and that u⃗ is currently written on the m output tapes.

If the current configuration is (w⃗, q, x⃗, u⃗) and δ(q, σ⃗) = (r, v⃗, d⃗), then the next configuration is (w⃗′, r, x⃗′, u⃗v⃗), where w⃗′ = ⟨w′1, …, w′n⟩, x⃗′ = ⟨x′1, …, x′n⟩, and for each i (1 ≤ i ≤ n):

• w′i = wi and x′i = xi iff di = 0;

• w′i = wiσ and x′i = x″i iff di = +1 and there exist σ ∈ Σi and x″i ∈ Σi* such that xi = σx″i;

• w′i = w″i and x′i = σxi iff di = −1 and there exist σ ∈ Σi and w″i ∈ Σi* such that wi = w″iσ.

We write (w⃗, q, x⃗, u⃗) → (w⃗′, r, x⃗′, u⃗v⃗). Observe that since δ is a function, there is at most one next configuration (i.e., the system is deterministic). Note that in some circumstances there is no next configuration: for instance, if di = +1 and xi = λ, then there is no place for the read head to advance. In such cases, the computation halts.

The transitive closure of → is denoted →⁺. Thus, if c →⁺ c′ then there exists a finite sequence of configurations c1, c2, …, cn with n > 1 such that c = c1 → c2 → ⋯ → cn = c′.

At last we define the function that a 2-way (n,m) MT-FST T computes. The input strings are augmented with word boundaries on each tape: let ⋊w⃗⋉ = ⟨⋊w1⋉, …, ⋊wn⋉⟩. For each n-string w⃗ ∈ Σ⃗*, fT(w⃗) = u⃗ ∈ Γ⃗* provided there exists qf ∈ F such that (λ⃗, q0, ⋊w⃗⋉, λ⃗) →⁺ (⋊w⃗⋉, qf, λ⃗, u⃗).

If fT(w⃗) = u⃗ then u⃗ is unique because the next configuration is determined uniquely at each step. If the computation of a 2-way MT-FST T halts on some input w⃗ (perhaps because a subsequent configuration does not exist), then we say T is undefined on w⃗.
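The whole definition (configurations, the step relation →, and fT) can be condensed into a small interpreter. The sketch below is our own illustration, not code from the paper: `<` and `>` stand in for the boundary markers ⋊ and ⋉, δ is a Python dict, and the example transition table implements full reduplication (w ↦ w∼w) with a 2-way (1,1) machine in the style of Dolatian and Heinz (2020).

```python
def run(delta, q0, finals, inputs, n_out, max_steps=10_000):
    """Run a deterministic 2-way (n,m) MT-FST. delta maps
    (state, symbol-tuple) -> (state, output-tuple, direction-tuple)."""
    tapes = ["<" + w + ">" for w in inputs]   # add boundary markers
    heads = [0] * len(tapes)                  # read-head positions
    outs = [""] * n_out                       # the m output tapes
    q = q0
    for _ in range(max_steps):
        # accept: final state with every head fallen off the right edge
        if q in finals and all(h == len(t) for h, t in zip(heads, tapes)):
            return tuple(outs)
        # a head off either edge without acceptance: computation halts
        if any(h < 0 or h >= len(t) for h, t in zip(heads, tapes)):
            return None
        syms = tuple(t[h] for t, h in zip(tapes, heads))
        if (q, syms) not in delta:
            return None                       # f is undefined on this input
        q, emitted, dirs = delta[(q, syms)]
        outs = [o + e for o, e in zip(outs, emitted)]
        heads = [h + d for h, d in zip(heads, dirs)]
    return None

# Full reduplication as a 2-way (1,1) machine: copy left-to-right,
# rewind silently to the left boundary, then copy again.
delta = {("q0", ("<",)): ("q1", ("",), (+1,)),
         ("q1", (">",)): ("q2", ("~",), (-1,)),
         ("q2", ("<",)): ("q3", ("",), (+1,)),
         ("q3", (">",)): ("q4", ("",), (+1,))}
for c in "ndap":
    delta[("q1", (c,))] = ("q1", (c,), (+1,))   # first copy
    delta[("q2", (c,))] = ("q2", ("",), (-1,))  # silent rewind
    delta[("q3", (c,))] = ("q3", (c,), (+1,))   # second copy

print(run(delta, "q0", {"q4"}, ("ndap",), 1))   # → ('ndap~ndap',)
```

The same interpreter runs machines with several input or output tapes by widening the symbol, output, and direction tuples, which is how the 2-way (2,2) machine of Section 6 would be encoded.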

The 2-way FSTs studied by Dolatian and Heinz (2020) are 2-way (1,1) MT-FSTs. The n-MT-FSTs studied by Dolatian and Rawski (2020b) are 2-way (n,1) MT-FSTs in which no transition contains the −1 direction. In this way, the definition presented here properly subsumes both.

Figure 4 shows an example of a (1,2) MT-FST that "splits" a phonetic (or orthographic) transcription of a Shupamem word into a linguistic representation


with a tonal and a segmental tier by outputting two strings, one for each tier.
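The effect of this splitting machine can be approximated in a few lines (our own illustration, not a literal encoding of the automaton; the accent-to-tone mapping is an assumption, and contour tones such as HL and LH, which the machine in Figure 4 also handles, are omitted for brevity):

```python
TONE = {"á": "H", "à": "L"}   # accented vowel -> tone (illustrative assumption)
BARE = {"á": "a", "à": "a"}   # accented vowel -> bare segment

def split(word):
    """Split an orthographic form into (tonal string, segmental string),
    as the 1,2 MT-FST writes one output tape per tier."""
    tones, segs = "", ""
    for ch in word:
        if ch in TONE:
            tones += TONE[ch]   # tone goes to the tonal output tape
            segs += BARE[ch]    # de-accented vowel to the segmental tape
        else:
            segs += ch          # consonants and plain vowels pass through
    return tones, segs

print(split("ndáp"))  # → ('H', 'ndap')
```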

[Figure 4 (the state diagram of the splitting MT-FST) is not reproduced in this text rendering.]

Figure 4: MT-FST: split. ndap → (ndap, H) 'house'; C and V are notational meta-symbols for consonants and vowels, resp.; T indicates the output tone tape, S the segmental output tape, and I the input tape.

6 Proposed model

As presented in Figure 3, our proposed model, i.e. the 2-way (2,2) MT-FST, consists of 1-way and 2-way deterministic transducers which together operate on two tapes. Both input and output are represented on two separate tapes, a Tonal tape and a Segmental tape. Such a representation is desirable as it correctly mimics linguistic representations of tonal languages, in which segments and tones act independently of each other. On the T-tape, a 1-way FST takes the H tone associated with the lexical representation of ndap 'house' and outputs HL∼ by implementing the Opposite Tone Insertion function. On the S-tape, a 2-way FST takes ndap as input and outputs a faithful copy of that string (ndap ndap). The ∼ symbol indicates the morpheme boundary and facilitates the subsequent linearization of the output. A detailed derivation of ndap ↦ ndap ndap is shown in Table 4.
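The two center machines of Figure 3 amount to Opposite Tone Insertion on the tonal tape and full copying on the segmental tape. A compressed sketch of their joint effect (our own illustration; '~' marks the morpheme boundary, and the placement of the boundary symbol on each tape follows the derivation in Table 4):

```python
OPP = {"H": "L", "L": "H"}  # the "opposite" of each level tone

def tonal_reduplication(tones, segments):
    """OTI on the tonal tape: append the opposite of the last lexical
    tone plus a boundary; full copy on the segmental tape."""
    return tones + OPP[tones[-1]] + "~", segments + "~" + segments

print(tonal_reduplication("H", "ndap"))  # → ('HL~', 'ndap~ndap')
```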

Figure 3 also represents two additional 'transformations': splitting and linearizing. First, the phonetic transcription of a string (ndap) is split into tones and segments with a (1,2) MT-FST. The output (H, ndap) serves as input to the 2-way (2,2) MT-FST. After the two processes discussed above apply, together acting on both tapes, the output is linearized with a (2,1) MT-FST. The composition of these three machines, i.e. the (1,2) MT-FST, the 2-way (2,2) MT-FST, and the (2,1) MT-FST, is particularly useful in applications where a phonetic or orthographic representation needs to be processed.

As was discussed in Section 2, the tone on the second copy depends on whether there was an H tone preceding the reduplicated phrase. If there was one, the tone on the reduplicant will be H. Otherwise, the L-Default Insertion rule will insert an L tone onto the toneless TBU of the second copy. Because those tonal changes are not part of the reduplicative process per se, we represent them neither in our model in Figure 3 nor in the derivation in Table 4. Those alternations could be accounted for with a 1-way FST that simply inserts an H or L tone into the output of the composed machine represented in Figure 3.

Modelling verbal reduplication and the tonal processes revolving around it (see Table 2) works in exactly the same way as described above for nominal reduplication. The only difference is the set of functions applied to the T-tape.

7 An Alternative to 2-Way Automata

2-way (n,m) MT-FSTs generalize regular functions (Filiot and Reynier, 2016) to functions from n-strings to m-strings. It is worth asking, however, what each of these mechanisms brings, especially in light of fundamental operations such as functional composition.

Figure 5: An alternative model for Shupamem redupli-cation

For instance, it is clear that 2-way (1,1) MT-FSTs can handle full reduplication, in contrast to 1-way (1,1) MT-FSTs, which cannot. However, full reduplication can also be obtained via the composition of a 1-way (1,2) MT-FST with a 1-way (2,1) MT-FST. To illustrate, the former takes a single string as input, e.g. ndap, and 'splits' it into two identical copies represented on two separate output tapes. The 2-string output by this machine then becomes the input to the next machine, a 1-way (2,1) MT-FST. Since this machine is asynchronous, it can linearize the


State | Segment-tape | Tone-tape | S-output | T-output
q0 | ⋊ndap⋉ | ⋊H⋉ | λ | λ
q1 | ⋊ndap⋉ S:⋊:+1 | ⋊H⋉ T:⋊:+1 | n | HL
q1 | ⋊ndap⋉ S:n:+1 | ⋊H⋉ T:H:+1 | nd | HL∼
q1 | ⋊ndap⋉ S:d:+1 | ⋊H⋉ T:⋉:0 | nda | HL∼
q1 | ⋊ndap⋉ S:a:+1 | ⋊H⋉ T:⋉:0 | ndap | HL∼
q1 | ⋊ndap⋉ S:p:+1 | ⋊H⋉ T:⋉:0 | ndap∼ | HL∼
q2 | ⋊ndap⋉ S:⋉:-1 | ⋊H⋉ T:⋉:0 | ndap∼ | HL∼
q2 | ⋊ndap⋉ S:p:-1 | ⋊H⋉ T:⋉:0 | ndap∼ | HL∼
q2 | ⋊ndap⋉ S:a:-1 | ⋊H⋉ T:⋉:0 | ndap∼ | HL∼
q2 | ⋊ndap⋉ S:d:-1 | ⋊H⋉ T:⋉:0 | ndap∼ | HL∼
q2 | ⋊ndap⋉ S:n:-1 | ⋊H⋉ T:⋉:0 | ndap∼ | HL∼
q3 | ⋊ndap⋉ S:⋊:+1 | ⋊H⋉ T:⋉:0 | ndap∼n | HL∼
q3 | ⋊ndap⋉ S:n:+1 | ⋊H⋉ T:⋉:0 | ndap∼nd | HL∼
q3 | ⋊ndap⋉ S:d:+1 | ⋊H⋉ T:⋉:0 | ndap∼nda | HL∼
q3 | ⋊ndap⋉ S:a:+1 | ⋊H⋉ T:⋉:0 | ndap∼ndap | HL∼
q3 | ⋊ndap⋉ S:p:+1 | ⋊H⋉ T:⋉:0 | ndap∼ndap | HL∼

Table 4: Derivation of ndap ↦ ndap ndap 'houses'. This derivation happens in the two automata in the center of Figure 3. The FST for the Segmental tier is the one shown in Figure 1, and the states in the table above refer to this FST.

2-string (e.g. (ndap, ndap)) into a 1-string (ndap ndap) by reading along one of its input tapes (and writing it), and reading the other one (and writing it) only when the read head on the first input tape reaches the end. Consequently, an alternative way to model full reduplication is to write the exact same output on two separate tapes and then linearize them. Therefore, instead of implementing Shupamem tonal reduplication with a 2-way (2,2) MT-FST, we could use the composition of two 1-way MT-FSTs, a (1,3) MT-FST and a (3,1) MT-FST, as shown in Figure 5. (We need three tapes: two segmental tapes to allow reduplication, as just explained, and one tonal tape, as discussed before.)
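This 1-way alternative can be sketched in two steps (our own illustration of the idea, ignoring the tonal tape; the space separator follows the "ndap ndap" rendering in the text):

```python
def split_copies(word):
    """A 1-way 1,2 MT-FST: write the same string to two output tapes."""
    return word, word

def linearize_copies(tape1, tape2):
    """A 1-way 2,1 MT-FST, asynchronously: copy tape 1 to the output,
    then copy tape 2 only after tape 1's read head reaches the end."""
    return tape1 + " " + tape2

def reduplicate(word):
    """Full reduplication as the composition of the two 1-way machines."""
    return linearize_copies(*split_copies(word))

print(reduplicate("ndap"))  # → 'ndap ndap'
```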

This example shows that additional tapes can be understood as serving a role similar to registers in register automata (Alur and Cerny, 2011; Alur et al., 2014). Alur and his colleagues have shown that deterministic 1-way transducers with registers are equivalent in expressivity to 2-way deterministic transducers (without registers).

8 Beyond Shupamem

The proposed model is not limited to modeling full reduplication in Shupamem. It can be used for other tonal languages exhibiting this morphological process. We provide examples of the applicability of this model for the three following languages: Adhola, Kikerewe, and KiHehe, and we predict that other languages could also be accounted for.

All three languages undergo full reduplication at the segmental level. What differs is the tonal pattern governing this process. In Adhola (Kaplan, 2006), the second copy is always realized with a fixed tonal pattern H.HL, where '.' indicates a syllable boundary, regardless of the lexical tone on the non-reduplicated form. In the following examples, unaccented vowels indicate low tone. For instance, tiju 'work' ↦ tija tija 'work too much', tSemo 'eat' ↦ tSema tSŤema 'eat too much'. In Kikerewe (Odden, 1996), if the first (two) syllable(s) of the verb are marked with an H tone, the H tone will also be present in the first two syllables of the reduplicated phrase. On the other hand, if the last two syllables of the non-reduplicated verb are marked with an H tone, the H tone will be present on the last two syllables of the reduplicated phrase. For instance, biba 'plant' ↦ biba biba 'plant carelessly, here and there', bibile 'planted (yesterday)' ↦ bibile bibile 'planted (yesterday) carelessly, here and there'. Finally, in KiHehe (Odden and Odden, 1985), if an H tone appears in the first syllable of the verb, the H tone will also be present in the first syllable of the second copy, for example doongoleesa 'roll' ↦ dongolesa doongoleesa 'roll a bit'.

The examples discussed above can be modelled in a way similar to Shupamem, such that, first,


the input is output on two tapes, Tonal and Segmental; then some (morpho-)phonological processes apply on both levels. The final step is the 'linearization', which is independent of the particular case. For example, in Kikerewe, if the first tone read on the Tonal tape is H, and a vowel is read on the Segmental tape, the output will be a vowel with an acute accent. If the second tone is L, as in biba, this L tone will be 'attached' to every remaining vowel in the reduplicated phrase. While Kikerewe provides an example where there are more TBUs than tones, Adhola presents the reverse situation, where there are more tones than TBUs (contour tones). Consequently, it is crucial to mark syllable boundaries, such that only when '.' or the right edge marker (⋉) is read does the FST output the 'linearized' element.

9 Conclusion

In this paper we proposed a deterministic finite-state model of total reduplication in Shupamem. As is typical for Bantu languages, Shupamem is a tonal language in which the phonological processes operating on the segmental level differ from those on the suprasegmental (tonal) level. Consequently, Shupamem introduces two challenges for 1-way FSTs: copying and autosegmental representation. We addressed those challenges by proposing a synthesis of a deterministic 2-way FST, which correctly models total reduplication, and an MT-FST, which enables autosegmental representation. Such a machine operates on two tapes (Tonal and Segmental), which faithfully replicate the linguistic analysis of Shupamem reduplication discussed in Markowska (2020). Finally, the output of the 2-way (2,2) MT-FST is linearized with a separate (2,1) MT-FST, yielding the desired surface representation of a reduplicated word. The proposed model is based on previously studied finite-state models for reduplication (Dolatian and Heinz, 2020) and tonal processes (Dolatian and Rawski, 2020b,a).

There are several areas of future research that we plan to pursue. First, we have suggested that we can handle reduplication using the composition of 1-way deterministic MT-FSTs, dispensing with the need for 2-way automata altogether; further formal comparison of these two approaches is warranted. More generally, we plan to investigate the closure properties of classes of 2-way MT-FSTs. A third line of research is to collect more examples of full reduplication in tonal languages and to include them in the RedTyp database (Dolatian and Heinz, 2019) so that a broader empirical typology can be studied with respect to the formal properties of these machines.

References

Rajeev Alur, Adam Freilich, and Mukund Raghothaman. 2014. Regular combinators for string transformations. In Proceedings of the Joint Meeting of the Twenty-Third EACSL Annual Conference on Computer Science Logic (CSL) and the Twenty-Ninth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), CSL-LICS '14, pages 9:1–9:10, New York, NY, USA. ACM.

Rajeev Alur and Pavol Cerny. 2011. Streaming transducers for algorithmic verification of single-pass list-processing programs. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '11, pages 599–610, New York, NY, USA. Association for Computing Machinery.

Steven Bird. 2001. Orthography and identity in Cameroon. Written Language & Literacy, 4(2):131–162.

Jane Chandlee. 2017. Computational locality in morphological maps. Morphology, pages 1–43.

Jane Chandlee and Jeffrey Heinz. 2012. Bounded copying is subsequential: Implications for metathesis and reduplication. In Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology, pages 42–51, Montreal, Canada. Association for Computational Linguistics.

Hossep Dolatian and Jeffrey Heinz. 2019. RedTyp: A database of reduplication with computational models. In Proceedings of the Society for Computation in Linguistics, volume 2. Article 3.

Hossep Dolatian and Jeffrey Heinz. 2020. Computing and classifying reduplication with 2-way finite-state transducers. Journal of Language Modelling, 8(1):179–250.

Hossep Dolatian and Jonathan Rawski. 2020a. Computational locality in nonlinear morphophonology. Ms., Stony Brook University.

Hossep Dolatian and Jonathan Rawski. 2020b. Multi-input strictly local functions for templatic morphology. In Proceedings of the Society for Computation in Linguistics, volume 3.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2021. Ethnologue: Languages of the World, 24th edition. SIL International, Dallas, Texas.


Emmanuel Filiot and Pierre-Alain Reynier. 2016. Transducers, logic and algebra for functions of finite words. ACM SIGLOG News, 3(3):4–19.

Patrick C. Fischer. 1965. Multi-tape and infinite-state automata: a survey. Communications of the ACM, pages 799–805.

John Goldsmith. 1976. Autosegmental Phonology. Ph.D. thesis, Massachusetts Institute of Technology.

Nizar Habash and Owen Rambow. 2006. MAGEAD: A morphological analyzer for Arabic and its dialects. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (Coling-ACL'06), Sydney, Australia.

Jeffrey Heinz. 2018. The computational nature of phonological generalizations. In Larry Hyman and Frans Plank, editors, Phonological Typology, Phonetics and Phonology, chapter 5, pages 126–195. De Gruyter Mouton.

Larry Hyman. 2014. How autosegmental is phonology? The Linguistic Review, 31(2):363–400.

Adam Jardine. 2016. Computationally, tone is different. Phonology, 33:247–283.

Aaron F. Kaplan. 2006. Tonal and morphological identity in reduplication. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 31.

George Anton Kiraz. 2000. Multi-tiered nonlinear morphology using multi-tape finite automata: A case study on Syriac and Arabic. Computational Linguistics, 26(1):77–105.

William R. Leben. 1973. Suprasegmental Phonology. Ph.D. thesis, Massachusetts Institute of Technology.

Magdalena Markowska. 2019. Tones in Shupamem possessives. Ms., Graduate Center, City University of New York.

Magdalena Markowska. 2020. Tones in Shupamem reduplication. CUNY Academic Works.

John McCarthy. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry, 12:373–418.

Abdoulaye L. Nchare. 2012. The Grammar of Shupamem. Ph.D. thesis, New York University.

David Odden. 1996. Patterns of reduplication in Kikerewe. OSU WPL, 48:111–148.

David Odden and Mary Odden. 1985. Ordered reduplication in KiHehe. Linguistic Inquiry, 16:497–503.

Michael O. Rabin and Dana Scott. 1959. Finite automata and their decision problems. IBM Journal of Research and Development, 3:114–125.

Jonathan Rawski and Hossep Dolatian. 2020. Multi-input strict local functions for tonal phonology. Proceedings of the Society for Computation in Linguistics, 3(1):245–260.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford.

Carl Rubino. 2013. Reduplication. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Bruce Wiebe. 1992. Modelling autosegmental phonology with multi-tape finite state transducers.

Edwin S. Williams. 1976. Underlying tone in Margi and Igbo. Linguistic Inquiry, 7:463–484.


Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 222–228, August 5, 2021. ©2021 Association for Computational Linguistics

Improved pronunciation prediction accuracy using morphology

Dravyansh Sharma, Saumya Yashmohini Sahai, Neha Chaudhari†, Antoine Bruguier†
Google LLC*

[email protected], [email protected], [email protected], [email protected]

Abstract

Pronunciation lexicons and prediction models are a key component in several speech synthesis and recognition systems. We know that morphologically related words typically follow a fixed pattern of pronunciation which can be described by language-specific paradigms. In this work we explore how deep recurrent neural networks can be used to automatically learn and exploit this pattern to improve the pronunciation prediction quality of words related by morphological inflection. We propose two novel approaches for supplying morphological information, using the word's morphological class and its lemma, which are typically annotated in standard lexicons. We report improvements across a number of European languages with varying degrees of phonological and morphological complexity, and two language families, with greater improvements for languages where the pronunciation prediction task is inherently more challenging. We also observe that combining bidirectional LSTM networks with attention mechanisms is an effective neural approach for the computational problem considered, across languages. Our approach seems particularly beneficial in the low-resource setting, both by itself and in conjunction with transfer learning.

1 Introduction

Morphophonology is the study of the interaction between morphological and phonological processes, and mostly involves the description of sound changes that take place in morphemes (minimal meaningful units) when they combine to form words. For example, the plural morpheme in English appears as '-s' or '-es' in orthography and as [s], [z], and [ɪz]

*Part of the work was done when D.S., N.C. and A.B. were at Google.

in phonology, e.g. in cops, cogs and courses. The different forms can be thought of as derived from a common plural morphophoneme which undergoes context-dependent transformations to produce the correct phones.

A pronunciation model, also known as a grapheme-to-phoneme (G2P) converter, is a system that produces a phonemic representation of a word from its written form. The word is converted from the sequence of letters in the orthographic script to a sequence of phonemes (sound symbols) in a pre-determined transcription, such as IPA or X-SAMPA. It is expensive, and possibly (say, in morphologically rich languages with productive compounding) infeasible, to list the pronunciations of all words, so one uses rules or learned models for this task. Pronunciation models are important components of both speech recognition (ASR) and synthesis (text-to-speech, TTS) systems. Even though end-to-end models have been gathering recent attention (Graves and Jaitly, 2014; Sotelo et al., 2017), state-of-the-art models in industrial production systems often involve conversion to and from an intermediate phoneme layer.

A single system of morphophonological rules which connects morphology with phonology is well known (Chomsky and Halle, 1968). In fact, computational models of morphology such as the two-level morphology of Koskenniemi (1983) and Kaplan and Kay (1994) have the bulk of their machinery designed to handle phonological rules. However, the approach involves encoding language-specific rules as a finite-state transducer, a tedious and expensive process requiring linguistic expertise. Linguistic rules are augmented computationally for small corpora in Ermolaeva (2018), although the scalability and applicability of the approach across languages is not tested.

We focus on using deep neural models to improve the quality of pronunciation prediction using


morphology. G2P fits nicely into the well-studied sequence-to-sequence learning paradigm (Sutskever et al., 2014); here we use extensions that can handle supplementary inputs in order to inject the morphological information. Our techniques are similar to Sharma et al. (2019), although the goal there is to lemmatize or inflect more accurately using pronunciations. Taylor and Richmond (2020) consider improving neural G2P quality using morphology; our work differs in two respects. First, we use morphology class and lemma entries instead of morpheme boundaries, for which annotations may not be as readily available. Secondly, they consider BiLSTM and Transformer models, but we additionally consider architectures which combine BiLSTMs with attention and outperform both. We also show significant gains from morphology injection in the context of transfer learning for low-resource languages where sufficient annotations are unavailable.

2 Background and related work

Pronunciation prediction is often studied in the settings of speech recognition and synthesis. Some recent work explores new representations (Livescu et al., 2016; Sofroniev and Coltekin, 2018; Jacobs and Mailhot, 2019), but in this work a pronunciation is a sequence of phonemes, syllable boundaries and stress symbols (van Esch et al., 2016). A lot of work has been devoted to the G2P problem (e.g. see Nicolai et al. (2020)), ranging from work focused on accuracy and model size to work discussing approaches for data-efficient scaling to low-resource languages or multilingual modeling (Rao et al., 2015; Sharma, 2018; Gorman et al., 2020).

Morphology prediction is of independent interest and has applications in natural language generation as well as understanding. The problems of lemmatization and morphological inflection have been studied in both contextual (in a sentence, which involves morphosyntactics) and isolated settings (Cohen and Smith, 2007; Faruqui et al., 2015; Cotterell et al., 2016; Sharma et al., 2019).

Morphophonological prediction, by which we mean viewing morphology and pronunciation prediction as a single task with several related inputs and outputs, has received relatively less attention as a language-independent computational task, even though its significance for G2P has been argued (Coker et al., 1991). Sharma et al. (2019) show improved morphology prediction using phonology,

and Taylor and Richmond (2020) show the reverse. The present work aligns with the latter, but instead of requiring full morphological segmentation of words, we work with weaker and more easily annotated morphological information like word lemmas and morphological categories.

3 Improved pronunciation prediction

We consider the G2P problem, i.e. prediction of the sequence of phonemes (the pronunciation) from the sequence of graphemes in a single word. The G2P problem forms a clean, simple application of seq2seq learning, which can also be used to create models that achieve state-of-the-art accuracies in pronunciation prediction. Morphology can aid this prediction in several ways. One, we could use the morphological category as a non-sequential side input. Two, we could use knowledge of the morphemes of the word and their pronunciations, which may be possible with lower amounts of annotation. For example, the lemma (and its pronunciation) may already be annotated for an out-of-vocabulary word. Standard lexicons often list the lemmata of derived/inflected words, and lemmatizer models can be used as a fallback. Learning from the exact morphological segmentation (Taylor and Richmond, 2020) would need more precise models and annotation (Demberg et al., 2007).

Given the spelling, language-specific models can predict the pronunciation by using knowledge of typical grapheme-to-phoneme mappings in the language. Some errors of these models may be fixed with help from morphological information, as argued above. For instance, homograph pronunciations can be predicted using morphology, but it is impossible to deduce them correctly using just orthography.1 The pronunciation of 'read' (/ɹiːd/ for present tense and noun, /ɹɛd/ for past and participle) can be determined by the part of speech and tense; the stress shifts from the first to the second syllable between 'project' the noun and the verb.

3.1 Dataset

We train and evaluate our models for five languages to cover some morphophonological diversity: (American) English, French, Russian, Spanish and Hungarian. For training our models, we use pronunciation lexicons (word-pronunciation pairs) and morphological lexicons (containing the lexical form, i.e. lemma and morphology class) of only inflected words, of size on the order of 10^4 for each language (see Table 5 in Appendix A). For the languages discussed, these lexicons are obtained by scraping2 Wiktionary data and filtering for words that have annotations (including pronunciations available in IPA format) for both the surface form and the lexical form. While this order of data is often available for high-resource languages, in Section 3.3 we discuss the extension of our work to low-resource settings, using Finnish and Portuguese for illustration, where the Wiktionary data is about an order of magnitude smaller.

1Homographs are words which are spelt identically but have different meanings and pronunciations.

Word (language) | Morph. Class | Pron. | LS | LP
masseuses (fr) | n-f-pl | /ma.søz/ | masseur | /ma.sœʁ/
fagylaltozom (hu) | v-fp-s-in-pr-id | /ˈfɒɟlɒltozom/ | fagylaltozik | /ˈfɒɟlɒltozik/

Table 1: Example annotated entries. (v-fp-s-in-pr-id: verb, first-person singular indicative present indefinite)

We keep 20% of the pronunciation lexicons aside for evaluation using the word error rate (WER) metric. WER counts an output as correct only if the entire output pronunciation sequence matches the ground-truth annotation for the test example.
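As a sketch of the metric (our own illustration, not the paper's evaluation code), whole-sequence WER is simply the percentage of test words whose predicted phoneme sequence is not an exact match:

```python
def wer(predictions, references):
    """Word error rate over whole pronunciations: a prediction counts
    as correct only if the full phoneme sequence matches the reference."""
    assert len(predictions) == len(references)
    wrong = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * wrong / len(references)

# one of two predictions differs from its reference → 50.0
print(wer([["ɹ", "iː", "d"], ["ɹ", "iː", "d"]],
          [["ɹ", "iː", "d"], ["ɹ", "ɛ", "d"]]))
```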

3.1.1 Morphological category

The morphological category of the word is appended as an ordinal encoding to the spelling, separated by a special character. That is, the categories of a given language are appended as unique integers, as opposed to one-hot vectors, which may be too large in morphologically rich languages.
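A minimal sketch of this encoding (our own illustration; the separator '#' and the order in which category ids are assigned are assumptions, not details from the paper):

```python
def encode_with_class(spelling, morph_class, class_ids):
    """Append the morphological category as an ordinal id, assigning a
    fresh integer the first time a category is seen."""
    cid = class_ids.setdefault(morph_class, len(class_ids))
    return f"{spelling}#{cid}"

ids = {}
print(encode_with_class("masseuses", "n-f-pl", ids))  # → masseuses#0
print(encode_with_class("chats", "n-m-pl", ids))      # → chats#1
print(encode_with_class("lues", "n-f-pl", ids))       # → lues#0 (same class)
```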

3.1.2 Lemma spelling and pronunciation

Information about the lemma is given to the models by appending both the lemma pronunciation 〈LP〉 and the lemma spelling 〈LS〉 to the word spelling 〈WS〉, all separated by special characters, like 〈WS〉§〈LP〉¶〈LS〉. The lemma spelling can potentially help in irregular cases; for example, 'be' has past forms 'gone' and 'were', so the model can reject the lemma pronunciation in this case by noting that the lemma spellings are different (but potentially still use it for 'been').
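Building this input is a one-line string template; a sketch (our own illustration, using the '§' and '¶' separators from the 〈WS〉§〈LP〉¶〈LS〉 scheme above):

```python
def encode_with_lemma(word_spelling, lemma_pron, lemma_spelling):
    """Concatenate word spelling, lemma pronunciation and lemma
    spelling with the two special separator characters."""
    return word_spelling + "§" + lemma_pron + "¶" + lemma_spelling

print(encode_with_lemma("masseuses", "ma.sœʁ", "masseur"))
# → masseuses§ma.sœʁ¶masseur
```

The separators must be characters that never occur in spellings or transcriptions, so the character-level model can unambiguously segment its input.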

3.2 Model details

The models described below are implemented inOpenNMT (Klein et al., 2017).

2kaikki.org/dictionary/

3.2.1 Bidirectional LSTM networks

LSTMs (Hochreiter and Schmidhuber, 1997) allow learning of fixed-length sequences, which is not a major problem for pronunciation prediction since grapheme and phoneme sequences (represented as one-hot vectors) are often of comparable length, and in fact state-of-the-art accuracies can be obtained using bidirectional LSTMs (Rao et al., 2015). We use a single-layer BiLSTM encoder-decoder with 256 units and 0.2 dropout to build a character-level RNN. Each character is represented by a trainable embedding of dimension 30.

3.2.2 LSTM-based encoder-decoder networks with attention (BiLSTM+Attn)

Attention-based models (Vaswani et al., 2017; Chan et al., 2016; Luong et al., 2015; Xu et al., 2015) are capable of taking a weighted sample of the input, allowing the network to focus effectively on different, possibly distant, relevant segments of the input to predict the output. We use the model defined in Section 3.2.1 with Luong attention (Luong et al., 2015).

3.2.3 Transformer networks

The Transformer (Vaswani et al., 2017) uses self-attention in both encoder and decoder to learn rich text representations. We use a similar architecture but with fewer parameters: 3 layers, 256 hidden units, 4 attention heads and 1024-dimensional feed-forward layers with ReLU activation. Both the attention and feed-forward dropout are 0.1. The input character embedding dimension is 30.

3.3 Transfer learning for low-resource G2P

Both non-neural and neural approaches have been studied for transfer learning (Weiss et al., 2016) from a high-resource language to a low-resource language in the G2P setting, using a variety of strategies including semi-automated bootstrapping, using acoustic data, designing representations suitable for neural learning, active learning, data augmentation, and multilingual modeling (Maskey et al., 2004; Davel and Martirosian, 2009; Jyothi and Hasegawa-Johnson, 2017; Sharma, 2018; Ryan and Hulden, 2020; Peters et al., 2017; Gorman et al., 2020). Recently, transformer-based architectures have also been used for this task (Engelhart et al., 2021). Here we apply a similar approach of using representations learned from the high-resource language as an additional input for low-resource models, but for our BiLSTM+Attn architecture.

Model        Inputs   en              fr              ru              es              hu
BiLSTM       b/+c/+l  39.7/39.4/37.1  8.69/8.94/7.94  5.26/4.87/5.60  1.13/1.44/1.30  6.96/5.85/7.21
BiLSTM+Attn  b/+c/+l  36.9/36.1/31.0  4.45/4.20/4.12  5.06/3.80/4.04  0.32/0.32/0.29  1.78/1.31/1.12
Transformer  b/+c/+l  40.2/39.3/37.7  8.19/7.11/10.6  6.57/6.38/5.36  2.29/1.62/2.20  8.20/4.93/8.11

Table 2: Models and their Word Error Rates (WERs). 'b' corresponds to the baseline (vanilla G2P), '+c' refers to morphology class injection (Sec. 3.1.1), and '+l' to the addition of lemma spelling and pronunciation (Sec. 3.1.2).

We evaluate our model for two language pairs: hu (high) to fi (low), and es (high) to pt (low) (results in Table 3). We perform morphology injection using lemma spelling and pronunciation (Sec. 3.1.2) since it can be easier to annotate and is potentially more effective (per Table 2). fi and pt are not really low-resource, but have relatively few Wiktionary annotations for the lexical forms (Table 5).
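The WERs reported in Tables 2 and 3 can be computed with a simple whole-word comparison; assuming WER here means the fraction of words whose predicted phoneme sequence differs from the reference (the standard convention in G2P evaluation), a minimal sketch with hypothetical data:

```python
# Word Error Rate as the percentage of words whose predicted phoneme
# string does not exactly match the reference. The function name and
# toy data are ours; the metric definition is the usual G2P one.
def word_error_rate(predictions, references):
    errors = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * errors / len(references)

preds = ["bɪn", "ˈdʒæmɪŋ", "sɛnˈtɛnsɪz", "biːn"]
refs  = ["biːn", "ˈdʒæmɪŋ", "ˈsɛntənsɪz", "biːn"]
assert word_error_rate(preds, refs) == 50.0  # 2 of 4 words wrong
```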

Model               fi      fi+hu   pt      pt+es
BiLSTM+Attn (base)  18.53   9.81    62.65   58.87
BiLSTM+Attn (+lem)  9.27    8.45    59.63   55.48

Table 3: Transfer learning for vanilla G2P (base) and morphology-augmented G2P (+lem, Sec. 3.1.2).

4 Discussion

We discuss our results under two themes: the efficacy of the different neural models we have implemented, and the effect of the different ways of injecting morphology that we considered.

We consider three neural models as described above. To compare them, we first note the approximate number of parameters of each model that we trained:

• BiLSTM: ∼1.7M parameters,
• BiLSTM+Attn: ∼3.5M parameters,
• Transformer: ∼5.2M parameters.

For BiLSTM and BiLSTM+Attn, the parameter size is based on neural architecture search, i.e., we estimated the sizes at which accuracies (nearly) peaked. For the Transformer, we believe even larger models could be more effective; the current size was chosen due to computational restrictions and for a "fairer" comparison of model effectiveness. Under this setting, BiLSTM+Attn models seem to clearly outperform both other models, even without morphology injection (cf. Gorman et al. (2020), albeit in the multilingual modeling context). The Transformer can beat the BiLSTM in some cases even under the suboptimal model-size restriction, but is consistently worse when the sequence lengths are larger, which is the case when we inject lemma spellings and pronunciations.

We also look at how adding lexical form information, i.e., morphological class and lemma, helps with pronunciation prediction. We notice that the improvements are particularly prominent when the G2P task itself is more complex, for example in English. In particular, ambiguous or exceptional grapheme-to-phoneme subsequence mappings (e.g., ough in English) may be resolved with help from lemma pronunciations. Morphological category also seems to help, for example in Russian, where it can carry a lot of information due to the language's inherent morphological complexity (about 25% relative error reduction). See Appendix B for a more detailed comparison and error analysis of the models.
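The roughly 25% figure for Russian can be checked against the numbers in Table 2 (BiLSTM+Attn row, baseline 'b' vs. '+c'):

```python
# Relative error reduction from morphology class injection for Russian,
# using the BiLSTM+Attn WERs reported in Table 2 (b = 5.06, +c = 3.80).
baseline, with_class = 5.06, 3.80
rel_reduction = (baseline - with_class) / baseline
assert round(100 * rel_reduction) == 25  # about a 25% relative reduction
```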

Our transfer learning experiments indicate that morphology injection gives even larger gains in the low-resource setting. In fact, for both languages considered, adding morphology gives almost as much gain as adding a high-resource language to the BiLSTM+Attn model. This could be useful for low-resource languages like Georgian, where a high-resource language from the same language family is unavailable. Even with the high-resource augmentation, using morphology can give a significant further boost to prediction accuracy.

5 Conclusion

We note that combining BiLSTM with attention seems to be the most attractive alternative for improving pronunciation prediction by leveraging morphology, and hence corresponds to the most appropriate 'model bias' for the problem among the alternatives considered. We also note that all the neural network paradigms discussed are capable of improving G2P prediction quality when augmented with morphological information. Since our approach can potentially support partial/incomplete data (using appropriate 〈MISSING〉 or 〈N/A〉 tokens), one can use a single model which injects morphology class and/or lemma pronunciation as available. For languages where neither is available, our results suggest building word-lemma lists or utilizing effective lemmatizers (Faruqui et al., 2015; Cotterell et al., 2016).

6 Future work

Our work only leverages inflectional morphology paradigms for better pronunciation prediction. However, in addition to inflection, morphology also drives word formation via derivation and compounding. Unlike inflection, derivation and compounding can involve multiple root words, so an extension would need a generalization of the above approach along with appropriate data. An alternative would be to learn these in an unsupervised way using a dictionary-augmented neural network, which can efficiently refer to pronunciations in a dictionary and use them to predict pronunciations of polymorphemic words from the pronunciations of the base words (Bruguier et al., 2018). It would be interesting to see whether combining morphological side information with dictionary augmentation yields a further accuracy boost. Developing non-neural approaches for morphology injection could also be interesting, although, as noted before, the neural approaches are the state of the art (Rao et al., 2015; Gorman et al., 2020).

One interesting application of the present work would be to use the more accurate pronunciation prediction for morphologically related forms for efficient pronunciation lexicon development (useful for low-resource languages where high-coverage lexicons currently don't exist): for example, annotating the lemma pronunciation should be enough, and the pronunciations of all the related forms can then be predicted with high accuracy. This is hugely beneficial for languages where hundreds or even thousands of surface forms are associated with the same lemma. Another concern for reliably using neural approaches is explainability (Molnar, 2019). Some recent research looks at explaining neural models with orthographic and phonological features (Sahai and Sharma, 2021); an extension to morphological features should be useful.

References

Antoine Bruguier, Anton Bakhtin, and Dravyansh Sharma. 2018. Dictionary Augmented Sequence-to-Sequence Neural Network for Grapheme to Phoneme prediction. Proc. Interspeech 2018, pages 3733–3737.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4960–4964. IEEE.

Noam Chomsky and Morris Halle. 1968. The sound pattern of English.

Shay B Cohen and Noah A Smith. 2007. Joint morphological and syntactic disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Cecil H Coker, Kenneth W Church, and Mark Y Liberman. 1991. Morphology and rhyming: Two powerful alternatives to letter-to-sound rules for speech synthesis. In The ESCA Workshop on Speech Synthesis.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22.

Marelie Davel and Olga Martirosian. 2009. Pronunciation dictionary development in resource-scarce environments.

Vera Demberg, Helmut Schmid, and Gregor Möhler. 2007. Phonological constraints and morphological preprocessing for grapheme-to-phoneme conversion. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 96–103.

Eric Engelhart, Mahsa Elyasi, and Gaurav Bharaj. 2021. Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects. arXiv preprint arXiv:2104.04091.

Marina Ermolaeva. 2018. Extracting morphophonology from small corpora. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 167–175.

Daan van Esch, Mason Chua, and Kanishka Rao. 2016. Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks. In INTERSPEECH, pages 2841–2845.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2015. Morphological inflection generation using character sequence to sequence learning. arXiv preprint arXiv:1512.06110.

Kyle Gorman, Lucas FE Ashby, Aaron Goyzueta, Arya D McCarthy, Shijie Wu, and Daniel You. 2020. The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 40–50.


Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Cassandra L Jacobs and Fred Mailhot. 2019. Encoder-decoder models for latent phonological representations of words. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 206–217.

Preethi Jyothi and Mark Hasegawa-Johnson. 2017. Low-resource grapheme-to-phoneme conversion using recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5030–5034. IEEE.

Ronald M Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada. Association for Computational Linguistics.

Kimmo Koskenniemi. 1983. Two-Level Model for Morphological Analysis. In IJCAI, volume 83, pages 683–685.

Karen Livescu, Preethi Jyothi, and Eric Fosler-Lussier. 2016. Articulatory feature-based pronunciation modeling. Computer Speech & Language, 36:212–232.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Sameer Maskey, Alan Black, and Laura Tomokiyo. 2004. Bootstrapping phonetic lexicons for new languages. In Eighth International Conference on Spoken Language Processing.

Christoph Molnar. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.

Garrett Nicolai, Kyle Gorman, and Ryan Cotterell. 2020. Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

Ben Peters, Jon Dehdari, and Josef van Genabith. 2017. Massively Multilingual Neural Grapheme-to-Phoneme Conversion. EMNLP 2017, page 19.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4225–4229. IEEE.

Zach Ryan and Mans Hulden. 2020. Data augmentation for transformer-based G2P. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 184–188.

Saumya Sahai and Dravyansh Sharma. 2021. Predicting and explaining French grammatical gender. In Proceedings of the Third Workshop on Computational Typology and Multilingual NLP, pages 90–96.

Dravyansh Sharma. 2018. On Training and Evaluation of Grapheme-to-Phoneme Mappings with Limited Data. Proc. Interspeech 2018, pages 2858–2862.

Dravyansh Sharma, Melissa Wilson, and Antoine Bruguier. 2019. Better Morphology Prediction for Better Speech Systems. In INTERSPEECH, pages 3535–3539.

Pavel Sofroniev and Çağrı Çöltekin. 2018. Phonetic vector representations for sound sequence alignment. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 111–116.

Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. 2017. Char2wav: End-to-end speech synthesis.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.

Jason Taylor and Korin Richmond. 2020. Enhancing Sequence-to-Sequence Text-to-Speech with Morphology. Submitted to IEEE ICASSP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010.

Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data, 3(1):1–40.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.


Model        Inputs   en              de              es            ru              avg. rel. gain
BiLSTM       b/+c/+l  31.0/30.5/25.2  17.7/15.5/12.3  8.1/7.9/6.7   18.4/15.6/15.9  -/+7.9%/+20.0%
BiLSTM+Attn  b/+c/+l  29.0/27.1/21.3  12.0/11.6/11.6  4.9/2.6/2.4   14.1/13.6/13.1  -/+15.1%/+22.0%

Table 4: Word Error Rates on the larger dataset (Appendix A), with average relative gains over the baseline 'b'.

Appendix

A On size of data

We record the size of the data scraped from Wiktionary in Table 5. There is marked inconsistency across the languages considered in the number of annotated inflected words for which a pronunciation transcription is available, as a fraction of the total vocabulary.

In the main paper, we have discussed results on the publicly available Wiktionary dataset. We performed more experiments on a larger dataset (10⁵–10⁶ examples of annotated inflections per language) using the same data format and methodology for (American) English, German, Spanish, and Russian (Table 4). We get very similar observations in this regime in terms of relative gains in model performance using our techniques, but these results are likely more representative of word error rates for the languages as a whole.
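As a sanity check, the "avg. rel. gain" column of Table 4 for the BiLSTM row can be reproduced from the per-language WERs (variable names are ours):

```python
# Average relative gain, as in Table 4: the mean over languages of the
# relative WER reduction of each augmented system versus the baseline 'b'.
baseline = {"en": 31.0, "de": 17.7, "es": 8.1, "ru": 18.4}  # BiLSTM, 'b'
plus_c   = {"en": 30.5, "de": 15.5, "es": 7.9, "ru": 15.6}  # BiLSTM, '+c'
plus_l   = {"en": 25.2, "de": 12.3, "es": 6.7, "ru": 15.9}  # BiLSTM, '+l'

def avg_rel_gain(system):
    gains = [(baseline[l] - system[l]) / baseline[l] for l in baseline]
    return 100 * sum(gains) / len(gains)

assert round(avg_rel_gain(plus_c), 1) == 7.9   # matches the +7.9% in Table 4
assert round(avg_rel_gain(plus_l), 1) == 20.0  # matches the +20.0% in Table 4
```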

Language   Total senses   Annotated inflections
en         1.25M          7543
es         0.93M          28495
fi         0.24M          3663
fr         0.46M          24062
hu         77.7K          31486
pt         0.39M          2647
ru         0.47M          20558

Table 5: Number of total Wiktionary entries, and inflected entries with pronunciation and morphology annotations, for the languages considered.

B Error analysis

Neural sequence-to-sequence models, while highly accurate on average, make "silly" mistakes, like omitting or inserting a phoneme, which are hard to explain. With that caveat in place, there are still reasonable patterns to be gleaned when comparing the outputs of the various neural models discussed here. The BiLSTM+Attn model seems not only to make fewer of these "silly" mistakes, but also appears to be better at learning the genuinely more challenging predictions. For example, the French word pédagogiques ('pedagogical', plural) /pe.da.ɡɔ.ʒik/ is pronounced correctly by BiLSTM+Attn, but as /pe.da.ʒɔ.ʒik/ by BiLSTM. Similarly, BiLSTM+Attn predicts /ˈdʒæmɪŋ/, while the Transformer network outputs /ˈdʒamɪŋ/ for jamming (en). We note that errors for Spanish often involve incorrect stress assignment, since the grapheme-to-phoneme mapping is otherwise highly consistent.

Adding morphological class information seems to reduce errors in the endings for morphologically rich languages, which can be an important source of error when transcriptions for the inflected words are relatively scarce. For example, with our BiLSTM+Attn model, the pronunciation of фуррем (ru, 'furry', instrumental singular noun) is fixed from /ˈfurʲːem/ to /ˈfurʲːɪm/, and koronavírusról (hu, 'coronavirus', delative singular) gets corrected from /ˈkoronɒviːruʃoːl/ to /ˈkoronɒviːruʃroːl/. On the other hand, adding the lemma pronunciation usually helps with pronouncing the root morpheme correctly. Without the lemma injection, our BiLSTM+Attn model mispronounces debriefing (en) as /dɪˈbɹiːfɪŋ/ and sentences (en) as /sɛnˈtɛnsɪz/. Based on these observations, it would be interesting to try injecting both categorical and lemma information simultaneously.


Author Index

Agirrezabal, Manex, 72
Ashby, Lucas F.E., 115

Bartley, Travis M., 115
Batsuren, Khuyagbaatar, 39
Bella, Gábor, 39
Berg-Kirkpatrick, Taylor, 198
Bruguier, Antoine, 222

Chaudhari, Neha, 222
Choudhury, Monojit, 60
Clematide, Simon, 115, 148

Dai, Huteng, 167
Daniels, Josh, 90
De Santo, Aniello, 11
Del Signore, Luca, 115
Dolatian, Hossep, 11

Elsner, Micha, 154
Erdmann, Alexander, 72

Forbes, Clarissa, 188
Futrell, Richard, 167

Gautam, Vasundhara, 141
Gerlach, Andrew, 107
Gibson, Cameron, 115
Giunchiglia, Fausto, 39
Goldwater, Sharon, 82
Gorman, Kyle, 115
Gormley, Matthew R., 198
Graf, Thomas, 11

Hammond, Michael, 126
Heinz, Jeffrey, 212
Hovy, Eduard, 198
Hulden, Mans, 72

Jayanthi, Sai Muralidhar, 49

Kann, Katharina, 72, 107
Kirby, James, 32
Kogan, Ilan, 1

Lee-Sikka, Yeonju, 115
Li, Wang Yau, 141
Lo, Roger Yu-Hsiang, 131
Lopez, Adam, 82

Mahmood, Zafarullah, 141
Mailhot, Frederic, 141
Makarov, Peter, 115, 148
Malanoski, Aidan, 115
Markowska, Magdalena, 212
McCarthy, Arya D., 72
McCurdy, Kate, 82
Miller, Sean, 115

Nadig, Shreekantha, 141
Nicolai, Garrett, 72, 98, 131, 188

Ortiz, Omar, 115

Palmer, Alexis, 90
Papillon, Maxime, 23
Perkoff, E. Margaret, 90
Pratapa, Adithya, 49

Raff, Reuben, 115
Rambow, Owen, 212
Roewer-Despres, Francois, 1
Ryskina, Maria, 198

Sahai, Saumya, 222
Sathe, Aalok, 60
Sengupta, Arundhati, 115
Seo, Bora, 115
Sharma, Dipti, 60
Sharma, Dravyansh, 222
Silfverberg, Miikka, 72, 98, 188
Spektor, Yulia, 115

Vaduguru, Saujas, 60

Wang, Riqiang, 141
Wang, Yang, 177
Wiemerslage, Adam, 72, 107

Yan, Winnie, 115
Yang, Changbing, 98
Yeung, Arnold, 1

Zhang, Nathan, 141
