A Resource-Light Approach to Morpho-Syntactic Tagging

LANGUAGE AND COMPUTERS: STUDIES IN PRACTICAL LINGUISTICS

No. 70

edited by Christian Mair, Charles F. Meyer and Nelleke Oostdijk


A resource-light approach to morpho-syntactic tagging

Anna Feldman and Jirka Hana

Amsterdam - New York, NY 2010


The authors’ research on resource-light morphology is currently supported by the U.S. National Science Foundation (Grant # 0916280)

Cover painting: Yakov Feldman, "Dialogue 30", http://www.feldman-art.com

Cover design: Pier Post

The paper on which this book is printed meets the requirements of "ISO 9706:1994, Information and documentation - Paper for documents - Requirements for permanence".

ISBN: 978-90-420-2768-8
E-Book ISBN: 978-90-420-2769-5
© Editions Rodopi B.V., Amsterdam - New York, NY 2010
Printed in The Netherlands


Contents

List of tables

List of figures

Preface

1 Introduction
  1.1 Organization of the book

2 Common tagging techniques
  2.1 Supervised methods
  2.2 Unsupervised methods
  2.3 Comparison of the tagging approaches
  2.4 Classifier combination
  2.5 A special approach to tagging highly inflected languages
  2.6 Summary

3 Previous resource-light approaches to NLP
  3.1 Unsupervised or minimally supervised approaches
  3.2 Cross-language knowledge induction
  3.3 Summary

4 Languages, corpora and tagsets
  4.1 Language properties
  4.2 Corpora
  4.3 Tagset design
  4.4 Tagsets in our experiments

5 Quantifying language properties
  5.1 Tagset size, tagset coverage
  5.2 How much training data is necessary?
  5.3 Data sparsity, context, and tagset size
  5.4 Summary

6 Resource-light morphological analysis
  6.1 Introduction


  6.2 Motivation – Lexical statistics of Czech
  6.3 A Morphological Analyzer of Czech
  6.4 Application to other languages
  6.5 Possible enhancements

7 Cross-language morphological tagging
  7.1 Why a Markov model
  7.2 Tagging Russian using Czech
  7.3 Using source language directly
  7.4 Expectations
  7.5 Using MA to approximate emissions
  7.6 Improving emissions – cognates
  7.7 Improving transitions – "Russifications"
  7.8 Dealing with data sparsity – tag decomposition
  7.9 Results on test corpus
  7.10 Catalan
  7.11 Portuguese
  7.12 Conclusion

8 Summary and further work
  8.1 Summary of the book
  8.2 Future work

Bibliography

Appendices

A Tagsets we use
  A.1 Czech tagset
  A.2 Russian tagset
  A.3 Romance tagsets

B Corpora
  B.1 Slavic corpora
  B.2 Romance corpora

C Language properties
  C.1 Slavic Languages
  C.2 Czech
  C.3 Russian
  C.4 Romance languages
  C.5 Catalan
  C.6 Portuguese
  C.7 Spanish

Citation Index


List of tables

4.1 Abbreviations of morphological categories
4.2 Slavic: Shallow contrastive analysis
4.3 Example comparison of Czech and Russian noun declension
4.4 Homonymy of the a ending in Czech
4.5 Ending -e and noun cases in Czech
4.6 Basic words: Comparison of Czech and Russian
4.7 Romance: Shallow contrastive analysis
4.8 Overview of the corpora
4.9 Positional Tag System for Czech
4.10 Overview and comparison of the Czech and Russian tagsets
4.11 Overview and comparison of the Romance tagsets
4.12 Overview of the tagsets we use
5.1 Basic characteristics of Slavic, Romance and English based on the Stat corpora
6.1 Corpus coverage by lemma frequency
6.2 Noun lemma distribution by the number of forms in the corpus
6.3 Forms of atom 'atom' and the hard masculine inanimate paradigms
6.4 Examples of the žena paradigm nouns


6.5 Examples of lexical entries for some nouns of the žena paradigm
6.6 Forms of the lemma podpora in the Raw corpus
6.7 Candidate entries for podpora forms
6.8 Forms of the lemma atom in the Raw corpus
6.9 Fit of the forms of atom to the hrad and pán paradigms
6.10 Evaluation of the Czech morphological analyzer (on nouns)
6.11 Evaluation of the Russian morphological analyzer
6.12 Evaluation of the Catalan morphological analyzer
6.13 Evaluation of the Portuguese morphological analyzer
7.1 Direct Tagger: Czech tagger applied to Russian
7.2 Tagging Russian with various combination of Czech and Russian emissions and transitions
7.3 Tagging with evenly distributed output of Russian MA
7.4 Tagging Russian using cognates
7.5 Tagging Russian using Russified Czech transitions
7.6 Russian tagger performance trained on individual slots vs. tagger performance trained on the full tag
7.7 Russian tagger performance trained on the combination of two features vs. tagger performance trained on the full tag
7.8 Russian tagger performance trained on the combination of three or four features vs. tagger performance trained on the full tag
7.9 Voted classifier
7.10 Complementarity rate of subtaggers
7.11 Overview of results on the test corpus
7.12 Detailed results obtained with the Russified tagger


7.13 Comparison with the traditional approach and combination with the traditional approach
7.14 Catalan: Overview of results on the test corpus
7.15 Catalan: Comparison with the traditional approach and combination with the traditional approach
7.16 Portuguese: Overview of results on the test corpus
A.1 Positions of the Czech and Russian tagsets
A.2 Values of individual positions of the Czech tagset
A.3 Values of individual positions of the Russian tagset
A.4 Overview of the Russian tagset
A.5 Positions of the Romance tagsets
A.6 Values of individual positions of Romance tagsets
C.1 Declension Ia – an example
C.2 I-conjugation – grabit' 'rob'
C.3 Germanic influence on Spanish, Portuguese, and Catalan
C.4 Arabic influence on Spanish, Portuguese, and Catalan
C.5 Basic words: Comparison of Spanish, Portuguese, and Catalan


List of figures

4.1 Atomic and wildcard gender values
5.1 The number of distinct tags plotted against the number of tokens
5.2 The percentage of the tagset covered by a number of tokens
5.3 The percentage of the corpus covered by the five most frequent tags
5.4 Accession rate
6.1 Lemma characteristics by frequency
7.1 Complementarity rate analysis (Brill and Wu 1998)
C.1 Slavic languages
C.2 Romance languages


Preface

Some five years ago, we wanted to use a Russian morphological tagger to extract verb frames from a Russian corpus. To our surprise, we could not find a large annotated corpus of Russian or an off-the-shelf Russian tagger. Developing such resources would take many years and cost a lot of money. At the same time, resources and tools for Czech, a related language, were already available in abundance. We used a Czech tagger directly on Russian (after translating the script; Czech uses the Latin alphabet, Russian uses the Cyrillic alphabet). The results were far from perfect, but good enough to be useful. Since then, we have explored various tagging algorithms and experimented with different language pairs. This book is the summary of our efforts. It addresses the problem of rapid development of morpho-syntactic taggers for resource-poor languages.

This work is truly a joint effort in all ways. Even though our names are the only authors on this work, many have contributed to its development: those who provided insights, comments, and suggestions, and those who provided friendship, love, and support. First, we want to thank Chris Brew. This work started as a joint project, and many ideas developed in this book were inspired by discussions with him. He was also Anna's thesis advisor. A portion of the work included in this book is based on her Ph.D. dissertation. We also want to thank Jan Hajic, Erhard Hinrichs, Brian Joseph, and Detmar Meurers for their always extremely insightful comments and feedback, and Luiz Amaral for helping us with the Romance languages. We are indebted to the people who helped us with the corpora used in the experiments: Sandra Maria Aluísio, Gemma Boleda, Toni Badia, Lukasz Debowski, Maria das Graças Volpe Nunes, Ricardo Hasegawa, Vicente López, Lluís Padró, Carlos Rodríguez Penagos, Adam Przepiórkowski, and Martí Quixal.

It would be difficult indeed for us to thank everyone who had an influence on the ideas presented here. In the last five years, we have learned a great deal, more than we can acknowledge here, from various colleagues. To name just a few: Stacey Bailey, Mary Beckman, Angelo Costanzo, Peter Culicover, Mike Daniels, Eileen Fitzpatrick, Eric Fosler-Lussier, Kordula De Kuthy, Markus Dickinson, Martin Jansche, Greg Kondrak, Soyoung Kang, Xiaofei Lu, Arantxa Martin-Lozano, Vanessa Metcalf, Andrea Sims, Shari Speer, Richard Sproat, Shravan Vasishth, Mike White, and many, many other people. Thank you all!


Last but not least, we express enormous gratitude to our families for their love, patience and support throughout the long and difficult process of completing this book. It is they who reminded us of the fact that there is more to life than science.

June, 2009

Anna Feldman and Jirka Hana


Chapter 1

Introduction

The year is 1944, and World War II is near its end. A simple stroke of fate brings together three people: a Finnish soldier who is being punished for displaying reluctance in battle, a disgraced Soviet captain injured in a bomb attack en route to trial, and a Lapp widow working a reindeer farm. The three discover that they have no language in common, and they struggle to understand each other while hostilities are running high. This is the story depicted in a Russian film, The Cuckoo (Kukushka, 2002). At the end of the movie, as in any well-intentioned, man-made story, life wins and the barriers fall, giving mankind a sense of hope and reconciliation.

As shown in the movie, language barriers contribute a great deal to misunderstanding and miscommunication. Today's technology is doing a tremendous job of overcoming language barriers. For instance, by using some online machine translation systems, Internet users can gain access to information from the original source language, and therefore, ideally, form unbiased opinions. The process of learning foreign languages is also facilitated by technology. It is no longer a luxury to have an intelligent computer language tutor that will detect and correct our spelling, grammar, and stylistic errors. These are just a few examples of what language technology is capable of doing. It is unfortunate, however, that not all languages receive equal attention. Many languages lack even the most rudimentary technological resources.

Success in natural language processing (NLP) depends crucially on good resources. Standard tagging techniques are accurate, but they rely heavily on high-quality annotated training data. The training data also has to be statistically representative of the data on which the system will be tested. In order to adapt a tagger to new kinds of data, it has to be trained on new data that is similar in style and genre. However, the creation of such data is time-consuming and labor-intensive. It took six years to create the Brown corpus (Kucera and Francis 1967), a one-million-token corpus of American English annotated with 87 part-of-speech tags, for instance. If state-of-the-art performance requires this level of annotation effort and time spent for English, what of languages that typically receive less effort and attention, but suddenly become important? How can we ever hope to build annotated resources for more than a handful of the world's languages?

Resnik (2004) compares high-quality translation with detailed linguistic annotation and puts them on the same order of magnitude of difficulty: turnaround times for professional translation services, based on an informal survey of several Web sites, suggest a productivity estimate of around 200–300 words per hour for experienced translators. If this is the rate of progress for this task, the prospect for manual annotation of linguistic representations across hundreds of languages seems bleak indeed. Even though it might seem like the annotation and translation tasks require different levels of language knowledge, a mere knowledge of the grammar is insufficient for doing manual morphological annotation. Languages with rich morphology are highly ambiguous: the same morphological form can correspond to multiple analyses; understanding the context and the meaning of words is crucial for disambiguation.

The focus of this book is on the portability of technology to new languages and on rapid language technology development. We address the development of taggers for resource-poor languages. "Morphological tagging" is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Resource-poor languages in this context are languages without available annotated corpora. There are various reasons for such lack of resources: financial, political, legal, etc. We describe a rapid, low-cost approach to the development of taggers by exploring the possibility of approximating resources of one language by resources of a related language.

Languages that are either related by common heritage (e.g. Czech and Russian) or by borrowing (or "contact", e.g. Bulgarian and Greek) often share a number of properties: morphological systems, word order, and vocabulary. Our method uses these language relationships for porting information from one language to another. The method avoids the use of labor-intensive resources; instead, it relies on the following:

1. an unannotated corpus of the target language

2. an annotated corpus of a related source language

3. a description of the target language morphology (either taken from a basic grammar book or elicited from a native speaker)

4. optionally, morphological information about the most frequent words (solicited from a native speaker or a dictionary).

Our approach takes the middle road between knowledge-free approaches and those that require extensive manually created resources. For the majority of languages and applications, neither of these extreme approaches is warranted. The knowledge-free approach lacks precision, and the knowledge-intensive approach is usually too costly.


This book mainly deals with inflectional languages. Inflectional information is crucial for various tagging applications. Inflections are not just another quirk of certain languages. Inflectional languages usually have free word order. In order to decide what syntactic relationships hold between the elements of a sentence, and what constituents agree with what constituents, detailed morphological information is essential. Morphological tags carry important information which is essential for parsing or text-to-speech applications, for instance. We want not only to tell apart verbs from nouns, but also singular from plural, nominative from genitive, all of which are ambiguous one way or the other. For example, in order to determine which syllable of a given instance of the Russian word snega should be stressed, one must know the morphological properties of that instance: the genitive singular form of the word is stressed on the first syllable, while the nominative plural form is stressed on the second: snèga.Noun.Gen.Masc.Singular 'snow' vs. snegà.Noun.Nom-Acc.Plural 'snows'.

The experiments discussed include both Slavic and Romance languages. As far as we know, this is the first systematic study to investigate the possibility of adapting the knowledge and resources of one morphologically rich language to process another related inflectional language without the use of parallel corpora or bilingual lexicons. The main scientific contribution is to provide a better understanding of the generality or language-specificity of cross-language annotation methods. The practical contribution consists of developing and implementing a portable system for tagging resource-poor languages. Finding effective ways to adapt a tagger which was trained on another language with similar linguistic properties has potential to become the standard way of tagging languages for which large, labeled corpora are not available.

Part-of-speech (POS) tagging is important for a variety of reasons, including:

1. Corpora that have been POS-tagged are very useful in linguistic research for finding instances or frequencies of particular constructions in large corpora (e.g. Meurers 2005).

2. POS information can also provide a useful basis for syntactic parsing. Knowing the part of speech information about each word in an input sentence helps determine a correct syntactic structure in a given formalism.

3. Knowing which POS occurs next to which can be useful in a language model for speech recognition (i.e. for deciphering spoken words and phrases). In addition, a word's POS can tell us something about how the word is pronounced. Thus, for example, in English the verb object [əbˈdʒɛkt] is pronounced differently from the noun object [ˈabdʒɛkt].

4. Knowing a word's POS is useful in morphological generation (i.e. mapping a linguistic stem to all matching words), since knowing a word's POS gives us information about which morphological affixes it can take. This knowledge is crucial for extracting verbs or other important words from documents, which later can be used for text summarization, for example.

5. Automatic POS taggers can help in building automatic word-sense disambiguation algorithms, since the meaning of individual words is related to their POS and the POS of adjacent words. For example, down as a preposition (as in look down), down as an adjective (as in down payment), and down as a verb (as in They down wild boars) do not have the same meaning.

1.1 Organization of the book

The rest of the book is organized as follows. Chapters 2–5 lay out the linguistic and computational foundations of our work. Chapter 2 provides a survey of tagging techniques as well as classifier combination methods. A number of supervised and unsupervised methods are described and compared, and the final sections of the chapter are devoted to the question of the appropriateness of these methods for inflected languages in general, and for Romance and Slavic languages in particular. Chapter 3 summarizes previous resource-light approaches to Natural Language Processing (NLP) tasks. We discuss two approaches to this problem: unsupervised or minimally supervised learning of linguistic generalizations from corpora, and cross-language knowledge induction. Chapter 4 provides an overview of the languages, the corpora, and tagsets used in our experiments. The discussion centers around the adequacy of the tagset for describing the properties of these languages and the computational suitability of various tagsets to the task of tagging. Another question touched on in that chapter is the standardization of a tagset.

Chapters 5–7 introduce our resource-light approach to morpho-syntactic tagging of inflected languages. Chapter 5 examines a number of properties of Slavic and Romance languages quantitatively, focusing on tagsets, their size, their coverage by corpora, and the information they provide. Chapter 6 introduces the portable resource-light approach to morphological analysis used in this book. Chapter 7 discusses a range of experiments in cross-language morphological annotation transfer. It explores the possibility of tagging Slavic and Romance languages without relying on any labor- and knowledge-intensive resources for those languages. It shows various ways to tag a language by combining information from morphological analysis and annotated corpora of a related language. We first describe in detail how to tag Russian using Czech resources, and then we show how the same methods can be used to tag Catalan and Portuguese using Spanish resources.

Finally, Chapter 8 summarizes the work and describes the future direction of research arising from this book.


Chapter 2

Common tagging techniques

Part-of-speech (POS) tagging is the task of labeling each word in a sentence with its appropriate POS information. Morphological tagging is very similar. It is a process of labeling words in a text with their appropriate detailed morphological information. The importance of the part of speech for language processing is that it gives a significant amount of information about a word and its neighbors. For example, corpora that have been POS-tagged are very useful in linguistic research for finding instances or frequencies of particular constructions in large corpora (e.g. Meurers 2005).

Formally, the tagging procedure f selects a sequence of tags T for the input text W:

(2.1) $f : W \to T$, $f(w_i) = t_i$, $t_i \in \mathrm{TAGS}_{w_i}$, $\forall i : 1 \le i \le |W|$,
where $\mathrm{TAGS}_{w_i}$ is the set of meaningful tags for a word token $w_i$ (in this work, it is determined by morphological analysis)
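As a minimal illustration of definition (2.1), the Python sketch below treats a tagger as a function that, for each token, picks one tag from the candidates licensed by morphological analysis. The names morph_analyze and pick are hypothetical stand-ins for an analyzer and a disambiguation strategy, not components described in this book.

    from typing import Callable, List

    def tag_text(tokens: List[str],
                 morph_analyze: Callable[[str], List[str]],
                 pick: Callable[[str, List[str]], str]) -> List[str]:
        """Assign one tag per token, restricted to the analyzer's candidates (TAGS_w)."""
        tagged = []
        for w in tokens:
            candidates = morph_analyze(w)        # TAGS_w for this token
            tagged.append(pick(w, candidates))   # the disambiguation step
        return tagged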

In this section, different tagging techniques and their suitability for the task of tagging inflected languages are discussed. In addition, we provide a discussion of classifier combination, since one of our methods (section 7.8) relies on this technique as well.

There are many approaches to automated POS tagging. One of the first distinctions which can be made among POS taggers is in terms of the degree of automation of the training and tagging process. The terms commonly applied to this distinction are supervised vs. unsupervised. Supervised taggers typically rely on pretagged corpora to serve as the basis for creating any tools to be used throughout the tagging process, such as the tagger dictionary, the word/tag frequencies, the tag sequence probabilities and/or the rule set. Unsupervised models, on the other hand, are those which do not require a pretagged corpus but instead use sophisticated computational methods to automatically induce word groupings (i.e. tagsets) and, based on those automatic groupings, to either calculate the probabilistic information needed by stochastic taggers or to induce the context rules needed by rule-based systems. Each of these approaches has its pros and cons.


It is known that supervised POS taggers tend to perform best when both trained and used on the same genre of text. The unfortunate reality is that pretagged corpora are not readily available for the many languages and genres which one might wish to tag. Unsupervised tagging addresses the need to tag previously untagged genres and languages in light of the fact that hand tagging of training data is a costly and time-consuming process. There are, however, drawbacks to fully unsupervised POS tagging. The word clusterings (i.e. automatically derived tagsets) which tend to result from these methods are very coarse, i.e. one loses the fine distinctions found in the carefully designed tagsets used in the supervised methods.

The following measures are typically used for evaluating the performance of a tagger:

(2.2) $\text{Precision} = \dfrac{\text{Correctly-Tagged-Tokens}}{\text{Tokens-generated}}$

$\text{Recall} = \dfrac{\text{Correctly-Tagged-Tokens}}{\text{Tokens-in-data}}$

$\text{F-measure} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Precision measures the percentage of system-provided tags that were correct. Recall measures the percentage of tags actually present in the input that were correctly identified by the system. The F-measure (van Rijsbergen 1979) provides a way to combine these two measures into a single metric.
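A minimal sketch of how the measures in (2.2) can be computed, assuming the system output is a list of (token index, tag) pairs and the gold standard is a dictionary; this is only an illustration of the definitions, not an evaluation tool used in the book.

    def evaluate_tagging(system_tags, gold_tags, num_input_tokens):
        """Precision, recall and F-measure as defined in (2.2).

        system_tags:      list of (token_index, tag) pairs produced by the tagger
        gold_tags:        dict mapping token_index -> correct tag
        num_input_tokens: number of tokens actually present in the input
        """
        correct = sum(1 for i, t in system_tags if gold_tags.get(i) == t)
        precision = correct / len(system_tags) if system_tags else 0.0
        recall = correct / num_input_tokens if num_input_tokens else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure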

2.1 Supervised methods

Supervised part-of-speech taggers rely on the presence of accurate gold-standard tags to learn statistical models of the process of part-of-speech tagging. In the following sections the focus is on the most widely used techniques for supervised tagging. All these approaches use the surrounding local context (typically, a window of two or three words and/or tags) to determine the proper tag for a given corpus position.

2.1.1 N-gram taggers/Markov models

N-gram taggers (Church 1988; DeRose 1988; Weischedel et al. 1993; Brants 2000) limit the class of models considered to $(n-1)$th order Markov models. Recall that a Markov model (MM) is a doubly stochastic process defined over a set of hidden states $\{s_i \in S\}$ and a set of output symbols $\{w_j \in W\}$. There are two sets of probabilities involved.

• Transition probabilities control the movement from state to state. They have the form $P(s_k \mid s_{k-1} \ldots s_{k-n+1})$, which encodes the assumption that only the previous $n$ states are relevant to the current prediction.

• Emission probabilities control the emission of output symbols from the hidden states. They have the form $P(w_k \mid s_k)$, encoding the fact that only the identity of the current state feeds into the decision about what to emit.


In an HMM-based part-of-speech tagger, the hidden states are identified with part-of-speech labels, while the output symbols are identified either with individual words or with equivalence classes over these words (the latter option is taken by, for example, Cutting et al. (1992), because of the desire to reduce the data sparsity problem).

Taken together with a distribution over the initial state $s_0$, the emission and transition probabilities provide a $k$th order Markov model of the tagging process.

$$P(s_0 \ldots s_k, w_0 \ldots w_k) = P(s_0) \prod_{i=0}^{k} P(w_i \mid s_i)\, P(s_{i+1} \mid s_i \ldots s_{i-k+1})$$

This defines the joint probability of a tag sequence $s_0 \ldots s_k$ and a word sequence $w_0 \ldots w_k$.
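The factorization above can be made concrete with a small sketch. The code below computes the joint log-probability of a tag and word sequence under a first-order (bigram) simplification of the model; the probability tables are plain dictionaries, and the flooring of unseen events is only a stand-in for proper smoothing.

    import math

    def sequence_log_prob(tags, words, init_p, trans_p, emit_p):
        """log P(tags, words) under a bigram HMM: P(s0) * prod P(w_i|s_i) * P(s_{i+1}|s_i).

        init_p:  dict tag -> P(tag) for the initial state
        trans_p: dict (prev_tag, tag) -> P(tag | prev_tag)
        emit_p:  dict (tag, word) -> P(word | tag)
        """
        floor = 1e-12                       # crude stand-in for smoothing
        logp = math.log(init_p.get(tags[0], floor))
        for i, (t, w) in enumerate(zip(tags, words)):
            logp += math.log(emit_p.get((t, w), floor))
            if i + 1 < len(tags):
                logp += math.log(trans_p.get((tags[i], tags[i + 1]), floor))
        return logp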

As in speech recognition, the forward-backward algorithm (an instance of the Expectation Maximization (EM) algorithm) provides the designers of part-of-speech taggers with the option of adapting a Markov model to a pre-existing unlabeled corpus, but common practice is to eschew this possibility, preferring rather to learn transition and emission probabilities by direct counting of labels and words occurring in a gold-standard corpus of correctly tagged data.

For actual tagging, one must find the best possible path through the Markov model of states and transitions, based on the transition and emission probabilities. However, in practice, this is extremely costly, as multiple ambiguous words mean that there will be a rapid growth in the number of transitions between states. To overcome this, the Viterbi algorithm (Viterbi 1967) is commonly used. The main observation made by the Viterbi algorithm is that for any state, there is only one most likely path to that state. Therefore, if several paths converge at a particular state, instead of recalculating them all when calculating the transitions from this state to the next, less likely paths can be discarded, and only the most likely ones are used for calculations. So, instead of calculating the costs for all paths, at each state only the k-best paths are kept.
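A compact sketch of the Viterbi search just described, again for a bigram model with dictionary-based probabilities; a real tagger would add beam pruning and unknown-word handling, which are omitted here.

    import math

    def viterbi(words, tagset, init_p, trans_p, emit_p):
        """Most likely tag sequence for `words` under a bigram HMM."""
        floor = 1e-12

        def lp(table, key):
            return math.log(table.get(key, floor))

        # delta[t]: best log-probability of any path ending in tag t at the current word
        delta = {t: lp(init_p, t) + lp(emit_p, (t, words[0])) for t in tagset}
        backpointers = []
        for w in words[1:]:
            new_delta, pointers = {}, {}
            for t in tagset:
                best_prev = max(tagset, key=lambda s: delta[s] + lp(trans_p, (s, t)))
                new_delta[t] = (delta[best_prev] + lp(trans_p, (best_prev, t))
                                + lp(emit_p, (t, w)))
                pointers[t] = best_prev
            delta, backpointers = new_delta, backpointers + [pointers]

        # Recover the best path by following back-pointers from the best final tag
        best = max(delta, key=delta.get)
        path = [best]
        for pointers in reversed(backpointers):
            path.append(pointers[path[-1]])
        return list(reversed(path))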

The terms Visible Markov model (VMM) and Hidden Markov model (HMM) are sometimes confused. In the case of supervised training, the formalism is really a mixed formalism. In training, a VMM is constructed, but it is then treated as an HMM when it is put to use for tagging new corpora.

One major problem with standard n-gram models is that they must be trained from some corpus, and because any particular training corpus is finite, some perfectly acceptable n-grams are bound to be missing from it. That means that the n-gram matrix is sparse; it is bound to have a very large number of cases of putative zero-probability n-grams that should really have some non-zero probability. In addition, this maximum-likelihood estimation method produces poor estimates when the counts are non-zero but still small. The n-grams cannot use long-distance context. Thus, they always tend to underestimate the probability of strings that happen not to have occurred nearby in the training corpus. There are some techniques that can be used to assign a non-zero probability to unseen possibilities. Such procedures are called "smoothing" (e.g. Chen and Goodman 1996).

TnT (Brants 2000)

Trigrams'n'Tags (TnT) is a statistical Markov model tagging approach, developed by Brants (2000). Contrary to the claims found in the literature about Markov model POS tagging, TnT performs as well as other current approaches, such as Maximum Entropy (see section 2.1.3). A recent comparison has even shown that TnT performs significantly better than the Maximum Entropy model for the tested corpora (see Brants 2000 and section 2.1.1). This section describes this tagger in more detail, since the experiments that are discussed in the subsequent chapter use this particular classifier.

The tagger is based on a trigram Markov model. The states of the model represent tags, outputs represent the words. Transition probabilities depend on the states, and thus on pairs of tags. Output (emission) probabilities only depend on the most recent category.

So, explicitly, for a given sequence of words $w_1, \ldots, w_T$ of length $T$, the following is calculated:

(2.3) $\arg\max_{t_1,\ldots,t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)$

$t_1, \ldots, t_T$ are elements of the tagset; the additional tags $t_{-1}$, $t_0$, and $t_{T+1}$ are beginning-of-sequence and end-of-sequence markers. As Brants mentions, using these additional tags, even if they stem from rudimentary processing of punctuation marks, slightly improves tagging results. This is different from formulas presented in other publications, which just stop with a "loose end" at the last word. If sentence boundaries are not marked in the input, TnT adds these tags if it encounters one of [.!?;] as a token.

Transitions and output probabilities are estimated from a tagged corpus, using maximum likelihood probabilities derived from the relative frequencies.

As has been described above, trigram probabilities generated from a corpus usually cannot directly be used because of the sparsity problem. This means that there are not enough instances for each trigram to reliably estimate the probability. Setting a probability to zero because the corresponding trigram never occurred in the corpus is undesired, since it causes the probability of a complete sequence to be set to zero, making it impossible to rank different sequences containing a zero probability. The smoothing paradigm that brings the best results in TnT is linear interpolation of unigrams, bigrams, and trigrams. A trigram probability is estimated this way:

(2.4) $P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)$


$\hat{P}$ are maximum likelihood estimates of the probabilities, and $\lambda_1 + \lambda_2 + \lambda_3 = 1$, so $P$ again represents probability distributions.

Brants (2000) uses the context-independent variant of linear interpolation, where the values of the $\lambda$s do not depend on the particular trigram; that yields better results than the context-dependent variant. The values of the $\lambda$s are estimated by deleted interpolation. This technique successively removes each trigram from the training corpus and estimates the best values for the $\lambda$s from all other n-grams in the corpus. Given the frequency counts for unigrams, bigrams, and trigrams, the weights can be very efficiently determined with a processing time that is linear in the number of different trigrams.
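The following sketch shows the interpolated trigram estimate of (2.4) together with a simplified version of the deleted-interpolation weight estimation described by Brants (2000); ties and zero denominators are handled crudely, so it should be read as an illustration of the idea rather than a reimplementation of TnT.

    from collections import Counter

    def ngram_counts(tag_sequences):
        """Unigram, bigram and trigram tag counts from a list of tag sequences."""
        uni, bi, tri = Counter(), Counter(), Counter()
        for tags in tag_sequences:
            uni.update(tags)
            bi.update(zip(tags, tags[1:]))
            tri.update(zip(tags, tags[1:], tags[2:]))
        return uni, bi, tri

    def deleted_interpolation(uni, bi, tri):
        """Estimate (lambda1, lambda2, lambda3): each trigram is conceptually removed
        from the counts, and the lambda of the n-gram estimate that predicts it best
        is credited with the trigram's frequency."""
        n = sum(uni.values())
        lam = [0.0, 0.0, 0.0]                     # weights for uni-, bi-, trigram
        for (t1, t2, t3), c in tri.items():
            cands = [
                (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,
                (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
                (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
            ]
            lam[cands.index(max(cands))] += c
        total = sum(lam) or 1.0
        return [x / total for x in lam]

    def interpolated_trigram(t1, t2, t3, uni, bi, tri, lam):
        """P(t3 | t1, t2) as the weighted sum in (2.4)."""
        n = sum(uni.values())
        p1 = uni[t3] / n
        p2 = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
        p3 = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
        return lam[0] * p1 + lam[1] * p2 + lam[2] * p3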

To handle unknown words, Brants (2000) uses Samuelsson's (1993) suffix analysis, which seems to work best for inflected languages. Tag probabilities are set according to the word's ending. Suffixes are strong predictors for word classes (e.g. 98% of the words in the Penn Treebank corpus ending in -able are adjectives and the rest are nouns).

The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix of some predefined maximum length. The term suffix, as used in TnT (as well as in the work described in this book), means 'final sequence of characters of a word', which is not necessarily a linguistically meaningful suffix.
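A rough sketch of suffix-based guessing for unknown words in this spirit: tag distributions are collected for word-final character sequences up to a fixed length, and the longest suffix seen in training decides the distribution. The successive-abstraction smoothing and frequency thresholds of the real TnT suffix model are deliberately left out.

    from collections import Counter, defaultdict

    def build_suffix_model(tagged_words, max_len=5):
        """Map each word-final character sequence (up to max_len) to tag counts."""
        by_suffix = defaultdict(Counter)
        for word, tag in tagged_words:
            for k in range(1, min(max_len, len(word)) + 1):
                by_suffix[word[-k:]][tag] += 1
        return by_suffix

    def guess_tags(word, by_suffix, max_len=5):
        """P(tag | longest suffix of `word` observed in training), or {} if none."""
        for k in range(min(max_len, len(word)), 0, -1):
            dist = by_suffix.get(word[-k:])
            if dist:
                total = sum(dist.values())
                return {tag: count / total for tag, count in dist.items()}
        return {}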

Additional information which is used in TnT is capitalization. Tags are usually not informative about capitalization, but probability distributions of tags around capitalized words are different from those around non-capitalized words. The effect is large for languages such as English or Russian, and smaller for German, which capitalizes all nouns. Brants (2000) uses a flag $c_i$ that is true if $w_i$ is a capitalized word and false otherwise. These flags are added to the contextual probability distributions: instead of $P(t_3 \mid t_1, t_2)$, Brants (2000) uses $P(t_3, c_3 \mid t_1, c_1, t_2, c_2)$. This is equivalent to doubling the size of the tagset and using different tags depending on the capitalization.

The processing time of the Viterbi algorithm is reduced by introducing a beam search. Each state that receives a $\delta$ value smaller than the largest $\delta$ divided by some threshold value $\theta$ is excluded from further processing. While the Viterbi algorithm is guaranteed to find the sequence of states with the highest probability, this is no longer true when beam search is added. However, as Brants (2000) reports, for practical purposes and the right choice of $\theta$, there is virtually no difference between the algorithm with and without a beam.

Tagging inflected languages with MMs

The Markov model tagging has been applied to a number of inflected languages. Hajic and Hladká (1998a) and Hladká (2000) perform a set of experiments using the Markov model. These experiments are divided into two types: 1) those that exclude morphological preprocessing and 2) those that include morphological preprocessing. The difference in experiments is based on how the set of all meaningful tags for a given word is obtained. For the experiments without morphological preprocessing, the set of meaningful tags for each word is obtained from the training corpus. For the experiments with morphological preprocessing, the set is obtained through morphological analysis.

The Hajic and Hladká (1998a) and Hladká (2000) experiments without morphological preprocessing vary 1) the order of the MM (first- or second-order), 2) the training data size, and 3) the tagset size. The results of the experiments investigating the order of the MM are inconclusive with respect to whether including the tags of the two previous word tokens gives better results than including the tag of just the preceding word token. Regarding the training data size, their conclusion is that the more training data there is, the better the success rate will be. And finally, a reduced tagset brings better absolute success values (from 81.3% accuracy with the detailed tagset to 90% with the reduced one). On the other hand, it is unfortunate to disregard such important morpho-syntactic descriptions for Czech as case and gender, which are eliminated in the reduced tagset. In other words, the relatively high performance is achieved at the cost of omitted morphological information that may be essential for various post-tagging applications (see chapter 1).

The Hajic and Hladká (1998a) and Hladká (2000) experiments with morphological preprocessing show that the trigram models give the best performance even on a large tagset. Adding the morphological preprocessing leads to a 14% improvement in performance.

In other work on tagging inflected languages with MMs, Debowski (2004) implements a trigram POS tagger for Polish whose performance is 90.6% using a detailed tagset with more than 200 tags. The result of this experiment is important because a trigram MM applied to another inflected language, Polish, performs equally well as it does on Czech.

A final tagging study relevant to mention here is Carrasco and Gelbukh (2003), which evaluates the performance of TnT on Spanish. TnT shows an overall tagging accuracy between 92.95% and 95.84% on test data, specifically, between 95.47% and 98.56% on known words and between 75.57% and 83.49% on unknown words. Unfortunately, the details about the tagsets are not provided in the study.

2.1.2 Transformation-based error-driven learning (TBL)

Transformation-based error-driven learning (TBL) (Brill 1995) is a technique which attempts to automatically derive rules for classification from the training corpus. The advantage over statistically-based tagging is that the rules are more linguistic and, thus, more easily interpretable. The supervised TBL employs not only a small, annotated corpus but also a large unannotated corpus. A set of allowable lexical and contextual transformations is predetermined by templates operating on word forms and word tokens, respectively. A general lexical/contextual template has the form: "for a given word, change tag A to tag B if precondition C is true". An example of a specific rule from an instantiated template, cited in Brill (1995), is "change the tagging of a word from noun to verb if the previous word is tagged as a modal". The set of allowable transformations used in Brill (1995) permits tags to be changed depending on the previous (following) three tags and on the previous (following) two word forms, but other conditions, including wider contexts, could equally well be specified.

There are three main steps in the TBL training process:

1. From the annotated corpus, a lexicon is built specifying the most likely tag for a given word. Unknown words are tagged with the most frequently occurring tag in the annotated corpus.

2. Lexical transformations are learned to guess the most likely tag for the unknown words (i.e. words not covered by the lexicon).

3. Contextual transformations are learned to improve tagging accuracy.

The learning procedure is carried out over several iterations. During each iteration, the result of each transformation (i.e. an instantiation of a template) is compared to the truth, and the transformation that causes the greatest error reduction is chosen. If there is no such transformation, or if the error reduction is smaller than a specified threshold, the learning process is halted. The complexity of learning the cues is $O(L \cdot N_{train} \cdot R)$, where $L$ is the number of prespecified templates, $N_{train}$ is the size in words of the training data and $R$ is the number of possible template instances. The complexity of the tagging of test data is $O(T \cdot N_{test})$, where $T$ is the number of transformations and $N_{test}$ is the test data size. This rule-based tagger trained on 600K of English text has a tagging accuracy of 96.9%.
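The greedy loop below is a self-contained sketch of this training regime, not Brill's implementation: candidate rules are given as (precondition, from-tag, to-tag) triples, every iteration scores each candidate against the current tagging, and the rule with the greatest net error reduction is kept until no rule reaches a minimum gain.

    def tbl_train(tokens, gold_tags, initial_tags, candidate_rules, min_gain=1):
        """Learn an ordered list of transformations by greedy error reduction.

        candidate_rules: list of (precondition, from_tag, to_tag) triples, where
        precondition(tokens, tags, i) -> bool tests the context of position i.
        """
        def apply(rule, tags):
            pre, frm, to = rule
            return [to if t == frm and pre(tokens, tags, i) else t
                    for i, t in enumerate(tags)]

        def errors(tags):
            return sum(1 for t, g in zip(tags, gold_tags) if t != g)

        tags, learned = list(initial_tags), []
        while candidate_rules:
            # Net error reduction of every candidate on the current tagging
            scored = [(errors(tags) - errors(apply(r, tags)), r) for r in candidate_rules]
            gain, best = max(scored, key=lambda x: x[0])
            if gain < min_gain:
                break
            tags = apply(best, tags)
            learned.append(best)
        return learned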

Megyesi (1999) demonstrates how Brill's rule-based tagger can be applied to a highly agglutinative language, Hungarian. When she applies the original rule-based tagger designed for English, the tagging accuracy for Hungarian is 85.9%, lower than the 96.9% for English. To get higher accuracy, the author modifies the lexical and contextual templates with regard to the character of Hungarian. For example, the maximum window length is changed from four to six. The modifications increase the tagging accuracy for Hungarian to 91.9%. The size of the Hungarian training corpus is 99,860 word tokens, and the tagset size is 452.

Current approaches to TBL rely crucially on preselecting all and only the relevant templates for transformations. Failure to satisfy this condition will result in overtraining or under-performance. For the task of tagging under-resourced languages using minimal knowledge, it is very likely that it will be difficult to obtain pre-theoretical intuitions for specifying the relevant templates for each language.

Tagging inflected languages with TBL

Hajic and Hladká (1998a) and Hladká (2000) experiment with the TBL tagger "as is" (i.e. designed for English), with the prespecified lexical/contextual templates, as described above. As for the MM model, the relevant parameters for evaluating the results are the tagset size and the data set size. The experiments show that the more radical the reduction of Czech tags is, the higher the accuracy of the results. However, comparing the results of a TBL approach with the MM model, it seems that the training data size does not need to be as large in the former as in the latter. Moreover, Džeroski et al. (1999) report 86% accuracy for the performance of the TBL tagger on Slovene with a tagset of 1,000 tags using a 109,640-token training set.

Like all supervised methods, the supervised TBL approach relies on at least a small annotated corpus for training, and is thus not directly applicable to resource-poor languages.

2.1.3 Maximum Entropy

A third supervised learning approach is the Maximum Entropy (MaxEnt) tagger (Ratnaparkhi 1996), which uses a probabilistic model basically defined as

(2.5) $p(h,t) = \pi\mu \prod_{j=1}^{k} \alpha_j^{f_j(h,t)}$,

where $h$ is a context from the set of possible word and tag contexts (i.e., so-called "histories"), $t$ is a tag from the set of possible tags, $\pi$ is a normalization constant, $\{\mu, \alpha_1, \alpha_2, \ldots, \alpha_k\}$ are the positive model parameters and $\{f_1, f_2, \ldots, f_k\}$ is a set of yes/no features (i.e. $f_i(h,t) \in \{0,1\}$).

Each parameter $\alpha_i$ (the so-called feature weight) corresponds to exactly one feature $f_i$, and features operate over the events (context, tag). For a current word, the set of specific contexts is limited to the current word, the preceding two words together with their tags, and the following two words. The positive model parameters are chosen to maximize the likelihood of the training data. An $f_i$ is true (or equals 1) if a particular linguistic condition is met.

Features which are determined to be important to the task are constrained to have the same expected value in the model as in the training data. That is, consistency with the training data is maintained by asserting that this equality holds, as shown in (2.6), where $E f_j$ is the expected value of $f_j$ in the model and $\tilde{E} f_j$ is the empirical expected value of $f_j$ in the training sample.

(2.6) $E f_j = \tilde{E} f_j$

The features used in Ratnaparkhi (1996) are derived from templates, similar to those in Brill (1995). For example, three templates are shown in (2.7), where $w_i$ is the $i$th word, $t_i$ is the $i$th tag, and $X$ and $T$ refer to values to be filled in.

(2.7) 1. $X$ is a suffix of $w_i$, $|X| \le 4$ & $t_i = T$
2. $t_{i-1} = X$ & $t_i = T$
3. $w_{i+1} = X$ & $t_i = T$


A feature $f$ will be equal to one when the condition is met and zero otherwise. A feature has access to any word or tag in the history $h$ of a given tag, as shown in (2.8).

(2.8) $h_i = \{w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}\}$

So, for example, a feature might be as in (2.9).

(2.9) $f_j(h_i, t_i) = \begin{cases} 1 & \text{if suffix}(w_i) = \text{"ing" and } t_i = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$

To set the features, the model will go through the training corpus asking yes/no questions about each item in $h$ for a given tag $t$. From this, a tag obtains a given probability of being correct, based on its history.
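The binary features can be pictured as small closures over a history. The sketch below instantiates two of the templates in (2.7), with the history represented as a plain dictionary; the key names are an assumption of this illustration, not Ratnaparkhi's data structures.

    def make_suffix_feature(suffix, tag):
        """Template 1 in (2.7): fires when the current word ends in `suffix` and t = tag."""
        def f(history, t):
            return 1 if history["w0"].endswith(suffix) and t == tag else 0
        return f

    def make_prev_tag_feature(prev_tag, tag):
        """Template 2 in (2.7): fires when the previous tag is `prev_tag` and t = tag."""
        def f(history, t):
            return 1 if history["t-1"] == prev_tag and t == tag else 0
        return f

    # A history h_i as in (2.8), here as a dictionary
    history = {"w0": "running", "w+1": "fast", "w+2": ".",
               "w-1": "was", "w-2": "he", "t-1": "VBD", "t-2": "PRP"}

    f1 = make_suffix_feature("ing", "VBG")      # the feature shown in (2.9)
    f2 = make_prev_tag_feature("VBD", "VBG")
    active = [f for f in (f1, f2) if f(history, "VBG") == 1]   # both are active here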

When tagging a text, the joint probability of a tag $t$ and its history $h$, i.e. $p(h,t)$, should be found. The joint probability is partly determined by the so-called active features, those features which have a value of one. The way the features determine the joint probability is by the constraint mentioned earlier, where the expected value for a feature $f$ in the model must be equal to the empirical expected value for the feature. And the expected values are sums over the joint probabilities, as shown in (2.10), where $H$ is the set of possible histories (word and tag contexts) and $T$ is the set of possible tags. Thus, because $p(h,t)$ and $f_j(h,t)$ are involved in calculating $E f$, the value of $p(h,t)$ is constrained by the value of $f_j(h,t)$.

(2.10) 1. $E f_j = \sum_{h \in H,\, t \in T} p(h,t)\, f_j(h,t)$

2. $\tilde{E} f_j = \sum_{i=1}^{n} \tilde{p}(h_i, t_i)\, f_j(h_i, t_i)$

This model can also be interpreted under the Maximum Entropy formalism, in which the goal is to maximize the entropy of a distribution subject to certain constraints. Here, the entropy of the distribution $p$ is defined as follows:

(2.11) $H(p) = -\sum_{h \in H,\, t \in T} p(h,t) \log p(h,t)$

During the test step, the tagging procedure gives for each word a list of the $Y$ highest probability sequences up to and including that word. The algorithm is a beam search in that for the current word, only the $Y$ highest probability sequences up to that point are kept. In calculating sequence probabilities, the algorithm considers every tag for a word, unless it has access to a tag dictionary, in which case it only considers the tags given for a word in the dictionary. Using this model, Ratnaparkhi (1996) obtains an accuracy of 96.43% on English test data.

The complexity of the searching procedure for MaxEnt is $O(N_{test} \cdot T \cdot F \cdot Y)$, where $N_{test}$ is the test data size (number of words), $T$ is the number of meaningful tags, $F$ is the average number of features that are active for a given event $(h,t)$, and $Y$ is explained above. The cost of parameter estimation is $O(N_{train} \cdot T \cdot F)$, where $T$ and $F$ are defined above and $N_{train}$ is the training data size, in words.


Tagging inflected languages with the MaxEnt-tagger

Džeroski et al. (1999) train the MaxEnt tagger on the Slovene translation of 1984, comparing the tagging results with the results of the (trigram) MM, TBL, and Memory-based tagger (MBT) (see section 2.1.4) for Slovene. It turns out that the performance of the MM tagger (83.31%) is not as good as that of TBL (85.95%). MBT performed better than TBL, with 86.42% accuracy, as did MaxEnt, with 86.36% accuracy. The tagset for Džeroski et al.'s (1999) experiments contains more than 1,000 tags, and the training corpus used is relatively small: 109,640 tokens.

2.1.4 Memory-based tagging (MBT)

In the memory-based approach to POS tagging (Daelemans et al. 1996, 1999), a set of example cases is kept in memory. Each example case consists of a word with preceding and following context, as well as the corresponding category for that word in that context. Thus, training is simply a matter of selecting the size of the context and storing these cases. A new sentence is tagged by selecting for each word in that sentence the most similar case(s) in memory, and extrapolating the categories of the words from these "nearest neighbors". During testing, the distance between each test pattern (i.e. word plus context information) and all training patterns present in memory is computed. A tag from the "closest" training pattern is assigned to the given word in the test data. When a word is not found in the lexicon, its lexical representation is computed on the basis of its form, its context is determined, and the resulting pattern is disambiguated using extrapolation from the most similar cases in an unknown words case base. In each case, the output is a "best guess" of the category for the word in its current context.
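In miniature, the nearest-neighbor step looks like the sketch below, with a fixed-length feature tuple per case and plain feature overlap as the distance; actual memory-based taggers weight the features (e.g. by information gain) and use an optimized case base.

    from collections import Counter

    def mbt_tag(test_pattern, memory, k=1):
        """Return the majority tag of the k cases in `memory` closest to `test_pattern`.

        memory: list of (pattern, tag) pairs; a pattern is a tuple of features,
                e.g. (previous tag, focus word, next word).
        """
        def distance(a, b):
            return sum(1 for x, y in zip(a, b) if x != y)   # unweighted overlap

        nearest = sorted(memory, key=lambda case: distance(case[0], test_pattern))[:k]
        votes = Counter(tag for _, tag in nearest)
        return votes.most_common(1)[0][0]

    # Toy usage: pattern = (previous tag, focus word, next word)
    memory = [(("DT", "dog", "barks"), "NN"), (("DT", "dogs", "bark"), "NNS")]
    print(mbt_tag(("DT", "dog", "runs"), memory))            # -> 'NN'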

Memory-based tagging requires a large training corpus in order to extract a lexicon. For each word, the number of times it occurs with each category is recorded. For the task of tagging English, Daelemans et al. (1996) generate the lexicon based on a 2-million-word training set and test the tagger on 200K test words, getting a score of 96.4%. For tagging Dutch, they use a training corpus of nearly 600K words and test on 100K words from another corpus, obtaining an accuracy of 95.7%. The English tagset used in the experiments contains about 40 tags, whereas the Dutch tagset has 13 tags.

Tagging inflected languages with MBT

As mentioned in section 2.1.3, Džeroski et al. (1999) report on the results of tagging Slovene using several tagging models, including MBT. Recall that the MBT tagger's accuracy is 86.42% on all tokens. After comparing the MB tagger with the MM, TBL, and MaxEnt taggers, the authors conclude that for Slovene, given the tagset and training data sizes, MBT is the most efficient and most accurate classifier.

With respect to the current task, the MBT performance at first seems promising. Indeed, the MBT tagger worked the best for Slovene, whose linguistic characteristics are similar to Russian. Note, however, that the performances of the MBT and MaxEnt are comparable (86.42% vs. 86.36%, respectively). Both taggers were trained on a relatively small amount of data, so the conclusion that the MBT tagger works better for languages such as Slovene is not necessarily definitive.

In addition, this method, though very efficient, is not directly applicable to the task of tagging under-resourced inflectional languages. The major problem is that it requires an annotated corpus to extract entries for the lexicon.

2.1.5 Decision trees

Schmid (1994b) develops another technique – a probabilistic decision tree tagger known as TreeTagger. TreeTagger is a Markov model tagger which makes use of a decision tree to get more reliable estimates for contextual parameters. So, the determining context for deciding on a tag is the space of the previous $n$ tags ($n=2$ in the case of a second-order Markov model). The methods differ, however, in the way the transition probability $p(t_n \mid t_{n-2} t_{n-1})$ is estimated. N-gram taggers often estimate the probability using the maximum likelihood principle, as mentioned above. Unlike those approaches, TreeTagger constructs a binary-branching decision tree. The binary tree is built recursively from a training set of trigrams. The nodes of the tree correspond to questions (or tests) about the previous one or two tags. The branches correspond to either a yes or no answer. For instance, a node might be $tag_{-2} = \text{DET}$?, which asks whether the tag two previous positions away is a determiner. By following the path down to the terminal elements of the tree, one can determine what the most likely tag is. That is, the terminal elements are sets of (tag, probability) pairs.

To construct the tree, all possible tests are compared to determine which tests should be assigned to which nodes. The criterion used to compare the tests is the amount of information gained about the third tag by performing each test. Each node should divide the data maximally into two subsets (i.e. it should ask the question which provides the most information about a tagging decision). To do this, a metric of information gain is used. The information gain is maximized, which, in turn, minimizes the average amount of information still needed after the decision is made.
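A toy version of this node-selection criterion: for a set of training trigrams, the gain of a yes/no test on the context tags is the entropy of the third tag minus the weighted entropy after the split. This is only meant to make the criterion concrete; TreeTagger's actual tree construction differs in detail.

    import math
    from collections import Counter

    def entropy(tags):
        total = len(tags)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(tags).values()) if total else 0.0

    def information_gain(trigrams, test):
        """Gain of a yes/no `test` on (t-2, t-1) about the third tag of each trigram."""
        third = [t3 for _, _, t3 in trigrams]
        yes = [t3 for t1, t2, t3 in trigrams if test(t1, t2)]
        no = [t3 for t1, t2, t3 in trigrams if not test(t1, t2)]
        split = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(trigrams)
        return entropy(third) - split

    # Example test corresponding to the node "tag_-2 = DET?"
    gain = information_gain([("DET", "ADJ", "NOUN"), ("VERB", "ADV", "VERB")],
                            lambda t1, t2: t1 == "DET")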

Once a decision tree is constructed, it can be used to derive transition probabilities for a given state in a Markov model. As with other probabilistic classifiers utilizing a Markov model, the Viterbi algorithm is used to find the best sequence of tags. With this, and with training the model on 2M words and testing it on 1K words, Schmid (1994b) obtains 96.36% accuracy using the Penn Treebank tagset.

Tagging inflected languages with decision trees

There were several applications of this learning method to tagging languages which have a richer morphology than that of English. The models trained on French, German, and Italian are provided at Schmid's web page (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html). The French model was trained on 43,834 tokens with 59 tags. The resulting test set accuracy for this model is 95.55%. The Italian model was trained on 40,847 tokens, but its precision has not been evaluated yet. The highest accuracy achieved for any of the models was 97.53%, for the German model trained on 20,000 tokens with 54 tags.

Orphanos and Christodoulakis (1999) and Orphanos et al. (1999) train a decision tree classifier on a Greek corpus of 137,765 tokens. However, their annotation does not include morphological information, just POS information. Their tagging result is 93% accuracy.

2.1.6 Neural networks

Artificial neural networks consist of a large number of simple processing units. These units are highly interconnected by directed weighted links. Associated with each unit is an activation value. Through the connections, this activation is propagated to other units.

In multilayer perceptron networks (MLP-networks), the most popular network type, the processing units are arranged vertically in several layers. Connections exist only between units in adjacent layers. The bottom layer is called the input layer because the activations of the units in this layer represent the input to the network. Correspondingly, the top layer is called the output layer. Any layers between the input and output layers are called hidden layers because their activations are not visible externally. The goal is to find the best network to predict, based on the input nodes, the correct output nodes.

In the case of tagging (Schmid 1994a), each unit in the output layer of the MLP network corresponds to one of the tags in the tagset. The network learns during training to activate the output unit that represents the correct tag and to deactivate all other output units. Hence, in the trained network, the output unit with the highest activation indicates which tag should be attached to the word that is currently being processed.

The input of the network comprises all the information that the system has about the POS's of the current word, the p preceding words and the f following words. More specifically, for each POS tag pos_j and each of the p + 1 + f words in the context, there is an input unit whose activation in_{ij} represents the probability that word_i has part of speech pos_j. So, if there are n possible tags, there are n ∗ (p + 1 + f) input nodes.
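A minimal sketch of how such an input vector might be assembled, assuming the per-word tag probabilities are given as dictionaries (the names and data layout are illustrative, not Schmid's implementation):

def build_input_vector(tagset, context_probs):
    # Concatenate one activation per tag for each of the p + 1 + f context words,
    # giving n * (p + 1 + f) input activations in total.
    vector = []
    for word_probs in context_probs:
        vector.extend(word_probs.get(tag, 0.0) for tag in tagset)
    return vector

tagset = ["DET", "NOUN", "VERB"]
# p = 1 preceding word (already tagged), the current word, f = 1 following word
context = [{"DET": 1.0},                   # preceding word: network output copied back as input
           {"NOUN": 0.6, "VERB": 0.4},     # current word: lexical tag probabilities
           {"VERB": 0.9, "NOUN": 0.1}]     # following word: lexical tag probabilities
print(build_input_vector(tagset, context))  # 9 = 3 * (1 + 1 + 1) activations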

For the input word being tagged and its following words, only the lexical POS probability p(pos_j | word_i) is known. This probability does not take into account any contextual information. For the preceding words, there is more information available, because they have already been tagged. Copying output activations of the network into the input introduces recurrence into the network.

The network is trained on an annotated corpus using so-called backpropagation, which feeds the information from the corpus about the correct tag back to the input layer.


The lexicon has three parts — a full-form lexicon, a suffix lexicon, and a default entry; each of the three parts covers a priori tag probabilities for each lexical entry.

This technique, which has no hidden layers, results in an accuracy of 96.22% for English (trained on 2M words).

Tagging inflected languages with neural networks

Nemec (2004) applies a neural network approach to Czech. He trains the classifier on 1.5M tokens, using the positional tag system developed for Czech (see section 4.4.1). Various context lengths were evaluated, but the best results were obtained using a left context of length 2 and a suffix of length 4. The overall performance is 88.71% accuracy.

2.2 Unsupervised methods

As mentioned above, the problem with using supervised models for tagging resource-poor languages is that supervised models assume the existence of a labeled training corpus. Unsupervised models do not make this assumption, which makes them more applicable to the task of morpho-syntactic tagging of resource-poor languages.

Unsupervised models generally rely on the presence of a dictionary, or lexicon, which contains the possible parts of speech for a given word type. This list of parts of speech may be ordered or unordered and in the former case may contain probabilities. For each word token in the corpus, the parts of speech in the dictionary for that word type are considered as possibilities in tagging.

2.2.1 Markov models

MM taggers work well when there is a large, tagged training set. MMs can be used without a corpus to train on, too. In the unsupervised case, the MM approach (Jelinek 1985; Cutting et al. 1992; Merialdo 1994) still has three major components: 1) an initial (probability) vector, 2) a transition (probability) matrix, and 3) an emission (probability) matrix. Each of these components is iteratively estimated until the process converges. For tagging, the Viterbi algorithm is used, as described in section 2.1.1.

The difference between Visible MM (VMM) tagging (i.e. supervised) and Hidden MM (HMM) tagging (i.e. unsupervised) is in how the model is trained. Since no pre-tagged corpus is available, the probabilities have to be estimated in some other way. To do this, the initial parameters of the model are set based on a dictionary that lists all possible tags for each word.

There are two steps in HMM training — expectation (estimation) and maximization, which alternate during the training process, thus giving the Expectation Maximization (EM) algorithm.1 Basically, first the parameters of the model are estimated — the initial, transition, and emission probabilities — and then the Viterbi algorithm is used to determine which estimation maximizes the probability of a sequence of tags. This sequence of tags is then used to reestimate the parameters.

When the probability of traversing an arc from t_i to t_{i+1} is estimated, both forward probabilities (the probability of the sequence of tags leading up to t_i) and backward probabilities (the probability of the sequence of tags following t_{i+1}) are examined. During the expectation phase, a forward pass over the data is made to (re-)estimate the forward probabilities and a backward pass for backward probability (re-)estimation. This multi-directional information gives a better estimate of the probability of traversing an arc than can be obtained using forward probabilities alone.
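For orientation, the re-estimated quantity can be written in the standard Baum-Welch form (the notation below is the textbook one, not necessarily that of the cited implementations). The expected probability of using the arc from state i to state j at corpus position t is

\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{k}\sum_{l} \alpha_t(k)\, a_{kl}\, b_l(o_{t+1})\, \beta_{t+1}(l)},

where α and β are the forward and backward probabilities, a_{ij} the transition parameters, and b_j(o_{t+1}) the emission probability of the following word.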

With an unsupervised HMM tagger, Cutting et al. (1992) are able to obtain accuracies of up to 96% for English, on par with other current technologies. This raises the question whether such an approach could be used for other languages.

Tagging inflectional languages with HMMs

Literature on applications of the HMM algorithm to tagging inflectional languages is not available. The Cutting et al. (1992) tagger relies heavily on a lexicon for the target language and a suitably large sample of ordinary text. One could try to create such a lexicon using a morphological analyzer and then try unsupervised learning. But given that a morphological analyzer provides many spuriously ambiguous results, high tagging accuracy cannot be expected. This book does not explore this avenue.

2.2.2 Transformation-based learning (TBL)

In supervised transformation-based learning (TBL), a corpus is used for scoring the outcome of applying transformations in order to find the best transformation in each iteration of learning. In the unsupervised case, this scoring function must be found without a manually tagged corpus. To adapt to a new scoring function, Brill (1995, 1999) redefines all three components of the TBL model.

The unsupervised TBL learner begins with an unannotated text corpus, and a dictionary listing words and the allowable part of speech tags for each word. The initial state annotator tags each word in the corpus with a list of all allowable tags.

Since each word is now annotated with a set of tags instead of a single tag, the transformation templates must also be changed. Instead of being templates which change one tag to another, they select a tag from the set of tags. That is, they change a word's tagging from a set of tags to a single tag. A template for such transformations is outlined in (2.12). The context C can be defined as before, although Brill (1999) limits the context to the previous (following) word/tag.

1 The Baum-Welch or Forward-Backward algorithm, which is used for HMM training, is a special case of general EM.


(2.12) Change the tag of a word from c to Y in context C,
where c is a set of two or more tags and Y is a single tag, such that Y ∈ c.

When using supervised TBL to train a POS tagger, the scoring function is just the tagging accuracy that results from applying a transformation. With unsupervised learning, the learner does not have a gold standard training corpus with which accuracy can be measured. Instead, the information from the distribution of unambiguous words is used to find reliable disambiguating contexts.

In each learning iteration, the score of a transformation is computed based on the current tagging of the training set. As stated above, each word in the training set is initially tagged with all tags allowed for that word, as indicated in the dictionary. In later learning iterations, the training set is transformed as a result of applying previously learned transformations.

To calculate the score for a transformation rule, as described in (2.12), Brill computes (2.13) for each tag Z ∈ c, Z ≠ Y.

(2.13) freq(Y)/freq(Z) ∗ incontext(Z,C),

where freq(Y) is the number of occurrences of words unambiguously tagged with tag Y in the corpus, freq(Z) is the number of occurrences of words unambiguously tagged with tag Z in the corpus, and incontext(Z,C) is the number of times a word unambiguously tagged with tag Z occurs in context C in the training corpus. To produce a score, first let R be defined as in (2.14). Then the score for the transformation in (2.12) is as in (2.15).

(2.14) R = argmax_Z freq(Y)/freq(Z) ∗ incontext(Z,C)

(2.15) incontext(Y,C) − freq(Y)/freq(R) ∗ incontext(R,C)

To further explain what the scoring function in (2.15) does, first consider that a good transformation for removing the tag ambiguity of a word is one for which one of the possible tags appears much more frequently. This is measured here by unambiguously tagged words in the context, after adjusting for the differences in relative frequency between different tags (i.e. freq(Y)/freq(R)). So, the comparison is made between how often Y unambiguously appears in a given context C and the number of unambiguous instances of the most likely tag R in the same context, where R ∈ c, R ≠ Y. The tag is changed from c to Y if Y is the best choice. That is, the learner will accept a transformation for a given learning iteration if the transformation maximizes the function in (2.15).
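The score computation in (2.13)–(2.15) can be sketched as follows, assuming the counts freq(·) and incontext(·,·) have already been collected from the unambiguously tagged words; the names and data layout are illustrative, not Brill's code.

def score_transformation(Y, candidate_tags, freq, incontext, C):
    # Score for "change c to Y in context C" following (2.13)-(2.15).
    # freq:      dict tag -> count of words unambiguously tagged with that tag
    # incontext: dict (tag, context) -> count of unambiguous occurrences in that context
    competitors = [Z for Z in candidate_tags if Z != Y]
    # (2.14): the competitor with the highest frequency-adjusted unambiguous count in C
    R = max(competitors,
            key=lambda Z: freq[Y] / freq[Z] * incontext.get((Z, C), 0))
    # (2.15): how much more often Y appears unambiguously in C than the adjusted count for R
    return incontext.get((Y, C), 0) - freq[Y] / freq[R] * incontext.get((R, C), 0)

freq = {"NN": 1000, "VB": 400}
incontext = {("NN", "after_DET"): 300, ("VB", "after_DET"): 10}
print(score_transformation("NN", {"NN", "VB"}, freq, incontext, "after_DET"))  # 300 - 1000/400*10 = 275.0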

2.3 Comparison of the tagging approaches

Through these different approaches, two common points have emerged. First, for any given word, only a few tags are possible, a list of which can be found either in a dictionary or through a morphological analysis of the word. Second, when a word has several possible tags, the correct tag can generally be chosen from the local context, using contextual rules that define the valid sequences of tags. These rules may be given different priorities so that a selection can be made even when several rules apply.

2.4 Classifier combination

Since the current work uses the idea of tagger combination (see section 7.8), an introduction to this technique is necessary. Dietterich (1997) summarizes four directions that can lead to improvements in supervised learning. One of them is learning ensembles of classifiers. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples.

One of the most active areas of research in supervised learning has been to study methods for constructing good ensembles of classifiers. The main discovery is that ensembles are often much more accurate than the individual classifiers the ensembles are composed of. However, it is worth noting that an ensemble can be more accurate than its component classifiers only if the individual classifiers disagree with one another (Hansen and Salamon 1990). Many methods for constructing ensembles have been developed. Some methods are general, and they can be applied to any learning algorithm. Other methods are specific to particular algorithms. What follows is an overview of various approaches to constructing ensembles.

2.4.1 Subsampling of training examples

One of the general techniques is subsampling the training examples. This method manipulates the training examples to generate multiple hypotheses. The learning algorithm is run several times, each time with a different subset of training examples. This technique works especially well for 'unstable' learning algorithms — algorithms whose output classifier undergoes major changes in response to small changes in the training data. Decision-tree, neural network, and rule-learning algorithms are all unstable. Linear regression, nearest neighbor, and linear threshold algorithms are generally stable.

Three particular methods of sampling training data include bagging, cross-validated committees, and AdaBoost.

Bagging. The most straightforward way of manipulating the training set is called 'bagging'. On each run, the learning algorithm is presented with a training set that consists of a sample of m training examples drawn randomly with replacement from the original training set of m items. Such a training set is called a bootstrap replicate of the original training set, and the technique is called bootstrap aggregation, from which the term bagging is derived (Breiman 1996).
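A minimal sketch of drawing such bootstrap replicates (illustrative only; the toy data are invented):

import random

def bootstrap_replicates(training_set, n_replicates, seed=0):
    # Draw n_replicates samples of size m with replacement from a training set of m items
    rng = random.Random(seed)
    m = len(training_set)
    return [[rng.choice(training_set) for _ in range(m)] for _ in range(n_replicates)]

data = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]
for replicate in bootstrap_replicates(data, n_replicates=2):
    print(replicate)   # each replicate has 3 items, some possibly repeated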


Cross-validated committees. Another training set sampling method (Parmanto et al. 1996) is to construct the training sets by leaving out disjoint subsets of the training data. For example, the training set can be randomly divided into 10 disjoint subsets. Then, 10 overlapping training sets can be constructed by leaving out a different one of these 10 subsets.

AdaBoost. The third method for manipulating the training set is the AdaBoost algorithm, developed by Freund and Shapire (1996). Like bagging, AdaBoost manipulates the training examples to generate multiple hypotheses. The main idea of AdaBoost is to assign each example of the given training set a weight. At the beginning, all weights are equal. But in every round the weak learner returns a hypothesis, and the weights of all examples misclassified by that hypothesis are increased. In this way, the weak learner is forced to focus on the difficult examples of the training set. The final hypothesis is a combination of the hypotheses of all rounds, namely a weighted majority vote, where hypotheses with lower classification error have higher weight.

In addition to subsampling, there are other techniques for generating multiple classifiers. These include input feature manipulation, output target manipulation, and injecting randomness. The details of these techniques are outside of the scope of this discussion. However, within scope are methods for combining individual classifiers. These include simple voting, pairwise voting, and stacked classifiers, discussed in sections 2.4.2 through 2.4.3.

2.4.2 Simple voting

The simplest approach to combining classifiers is to take an (un)weighted vote (see section 7.8). Many weighted voting methods have been developed for ensembles. For classification problems, weights are usually obtained by measuring the accuracy of each individual classifier on the training data and constructing weights that are proportional to those accuracies. Another way is to use not only precision information, but also recall information.
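A minimal sketch of accuracy-weighted voting over the taggers' suggestions (the tagger names and weights are invented):

from collections import defaultdict

def weighted_vote(predictions, weights):
    # predictions: dict tagger_name -> predicted tag for the current token
    # weights:     dict tagger_name -> weight (e.g., accuracy on held-out data)
    scores = defaultdict(float)
    for tagger, tag in predictions.items():
        scores[tag] += weights.get(tagger, 1.0)
    return max(scores, key=scores.get)

print(weighted_vote({"hmm": "NOUN", "tbl": "VERB", "maxent": "NOUN"},
                    {"hmm": 0.95, "tbl": 0.90, "maxent": 0.96}))   # NOUN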

Pairwise voting

It is possible to investigate all situations where one tagger suggests tag T1 and the other T2 and estimate the probability that in this situation the tag should actually be Tx. For example, if X suggests DT and Y suggests CS (which can happen if the token is "that"), the probabilities for the appropriate tag are: CS (subordinate conjunction) 0.3276; DT (determiner) 0.6207; QL (quantifier) 0.0172; WPR (wh-pronoun) 0.0345. When combining the taggers, every tagger pair is taken in turn and allowed to vote (with the probability described above) for each possible tag (i.e. not just the ones suggested by the component taggers). With this method (as well as with the stacked classifiers, discussed below, and as sketched in the example that follows), a tag suggested by a minority (or even none) of the taggers still has a (slight) chance to win.
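A sketch of the pairwise scheme, assuming the pair-conditioned tag distributions have been estimated beforehand on held-out data (the dictionary layout is our own illustration; the probabilities are the ones quoted above):

from collections import defaultdict
from itertools import combinations

def pairwise_vote(predictions, pair_distributions):
    # predictions:        dict tagger -> suggested tag
    # pair_distributions: dict (tagger1, tagger2, tag1, tag2) -> dict tag -> probability,
    #                     keyed by taggers in sorted order
    scores = defaultdict(float)
    for t1, t2 in combinations(sorted(predictions), 2):
        dist = pair_distributions.get((t1, t2, predictions[t1], predictions[t2]), {})
        for tag, prob in dist.items():
            scores[tag] += prob          # every pair votes for every tag it has mass for
    return max(scores, key=scores.get) if scores else None

pairs = {("X", "Y", "DT", "CS"): {"CS": 0.3276, "DT": 0.6207, "QL": 0.0172, "WPR": 0.0345}}
print(pairwise_vote({"X": "DT", "Y": "CS"}, pairs))   # DT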

2.4.3 Stacked classifiers

The practice of feeding the outputs of a number of classifiers to a next learner as features is usually called stacking. The algorithm works as follows. Suppose there are L different learning algorithms A_1, ..., A_L and a set S of m training examples (x_1, y_1), ..., (x_m, y_m). Each of these algorithms is applied to the training data to produce hypotheses h_1, ..., h_L. The goal of stacking is to learn a good combining classifier h∗ such that the final classification will be computed by h∗(h_1(x), ..., h_L(x)). Wolpert (1992) proposed a scheme for learning h∗ using a form of leave-one-out cross-validation. He defines h_l^(−i) to be a classifier constructed by algorithm A_l applied to all of the training examples in S except example i. In other words, each algorithm is applied to the training data m times, leaving out one training example each time. Each classifier h_l^(−i) can be applied to example x_i to obtain the predicted class y_i^l. This provides a new data set containing 'level 2' examples whose features are the classes predicted by each of the L classifiers. Now some other algorithm can be applied to this level 2 data to learn h∗.
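Wolpert's leave-one-out construction of the 'level 2' data can be sketched as follows; the learner interface (a function from training data to a classifier function) and the two toy learners are invented stand-ins for real algorithms A_1, ..., A_L.

from collections import Counter

def stacked_level2_data(algorithms, S):
    # S: list of m training examples (x, y); each algorithm maps such a list to a classifier
    level2 = []
    for i, (x_i, y_i) in enumerate(S):
        held_out = S[:i] + S[i + 1:]                    # leave out example i
        features = [algorithm(held_out)(x_i) for algorithm in algorithms]
        level2.append((features, y_i))                  # level-2 example: predicted classes + gold class
    return level2

def majority_learner(data):
    most_common = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: most_common

def last_label_learner(data):
    last = data[-1][1]
    return lambda x: last

S = [("dog", "NOUN"), ("runs", "VERB"), ("cat", "NOUN")]
print(stacked_level2_data([majority_learner, last_label_learner], S))

Any learning algorithm can then be trained on the resulting level-2 examples to obtain h∗.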

While ensembles provide very accurate classifiers, there are problems that may limit their practical application. One problem is that ensembles can require large amounts of memory to store and large amounts of computation to apply (Dietterich 1997). A second difficulty with ensemble classifiers is that an ensemble provides little insight into how it makes its decisions. A single decision tree can often be interpreted by human users, but an ensemble of 200 voting decision trees is much more difficult to understand. The question is whether methods for obtaining explanations from ensembles can be found.

2.4.4 Combining POS taggers

The combination of ensembles of classifiers, although well-established in the machine learning literature, has only recently been applied as a method for increasing accuracy in natural language processing tasks. There has, of course, been a lot of research on the combination of different methods (e.g., knowledge-based and statistical) in hybrid systems, or on the combination of different information sources. Recently, several papers on combining POS-taggers have emerged.

For POS tagging, a significant increase in accuracy through combining the output of different taggers was first demonstrated in Brill and Wu (1998) and van Halteren et al. (1998).

Brill and Wu (1998) show that the errors made by three different state-of-the-art POS taggers — a standard trigram tagger, the transformation-based tagger (Brill 1995) and the maximum entropy tagger (Ratnaparkhi 1996) — are strongly complementary. The authors show how this complementary behavior can be used to improve tagging accuracy. Specifically, they show that by using contextual cues to guide tagger combination, it is possible to derive a new tagger that achieves performance significantly greater than any of the individual taggers.

van Halteren et al. (2001) examine how differences in language models, learned by different data-driven systems performing the same NLP task, can be exploited to yield higher accuracy than the best individual system. They experiment with morpho-syntactic word class tagging using three different tagged corpora: the Lancaster-Oslo/Bergen corpus (LOB) (1M words, 170 different tags, Johansson 1986); the Wall Street Journal corpus (1M words, 48 tags, Paul and Baker 1992); and the Eindhoven corpus (750K words, 341 tags, den Boogaart 1975). They train four taggers – HMM, memory based, transformation rules, and maximum entropy – on the same corpus data. After comparison, their outputs are combined using several voting strategies and second-stage classifiers. All combinations outperform their best component. The amount of improvement varies from 11.3% error reduction for WSJ to 24.3% error reduction for LOB. The data set that is used appears to be the primary factor in the variation in improvement. The data set's consistency directly affects performance. The authors notice that their stacked systems outperform a simple voting system.

Borin (1999, 2000) investigates how off-the-shelf POS taggers can be combined to better cope with text material that differs from the type of text the taggers were originally trained on, and for which there are no readily available training corpora. The author uses three taggers for German — TreeTagger (Schmid 1994b), Morphy (Lezius et al. 1998) and QTAG (Mason 1997). He evaluates the taggers and creates a list of differences between taggers and a hypothesis about which parameters are likely to influence tagger performance. Using this information, the author formulates symbolic rules to choose the output of the inferior tagger (Morphy) over that of the better tagger (TreeTagger) under certain systematically recurring conditions.2 The evaluation of taggers is done on a very small corpus — only 10 sentences. The author calculates the expected improvement from using the rules (1.7%), but an actual evaluation of the implementation of the method is not provided.

2 The author mentions that the overall lower-performing tagger is sometimes right and the better taggers are wrong. Morphy, for instance, deals better with abbreviations. So, Borin (1999, 2000) formulates the conditions in such a way that the decisions of the inferior tagger are sometimes taken into account.

Sjöbergh (2003a) trains and evaluates seven taggers on a Swedish corpus and then combines the taggers in different ways to maximize the accuracy. He uses fnTBL (Ngai and Florian 2001), a transformation-based tagger; Granska (Carlberger and Kann 1999); TnT, a trigram MM-tagger (Brants 2000); Mxpost (Ratnaparkhi 1996), a maximum entropy tagger; Timbl (Daelemans et al. 2001), a memory based tagger; Stomp (Sjöbergh 2003b); and TreeTagger (Schmid 1994b), a decision tree tagger. The author summarizes several experiments. He found that simple voting does not work because errors made by the taggers are not independent. Manually assigning the taggers different voting weights by giving them weights proportional to their stand-alone accuracy (determined by data separate from the test data) does not improve on simple voting either. Interestingly, Sjöbergh (2003a) mentions that adding a rather bad tagger increases the performance of an ensemble if the tagger is different enough from the taggers already in the ensemble. This observation is similar to Borin's (1999, 2000) outlined above.

In another experiment, Sjöbergh (2003a) tries giving confident taggers more weight. One use for the confidence measurements is to let the tagger change its voting contribution according to its confidence (i.e. give the tagger more weight for words where it is confident). The author tried three variants of this idea. First, a tagger is allowed to overrule the voting when its confidence is above a chosen threshold; otherwise voting proceeds as normal. Second, the vote from a tagger is ignored when the confidence is below a chosen threshold. Finally, each tagger's vote is proportional to the confidence. As Sjöbergh (2003a) reports, none of these variants improves on simple voting.

Yet another way explored by Sjöbergh (2003a) to combine the taggers in an ensemble is to train a new classifier on the tags selected by the taggers. This has the potential to correct even those cases where none of the taggers chooses the correct tag (which no voting scheme can do). This is, in fact, the stacked classifier approach. This approach is also advantageous because with stacked classifiers, it is easy to combine taggers that use different tagsets. With voting, it is more difficult to handle combinations involving different tagsets. Both stacked classifiers and voting schemes behave similarly in that they mainly correct uncommon error types. 15% to 18% error reduction was achieved in the experiments with stacked classifiers. Sjöbergh (2003a) concludes that combining taggers by voting or training a new stacked classifier increases the number of errors of some of the common error types, but removes many more errors of uncommon types. This leads to fewer total errors and a concentration of errors in fewer error types. This property is useful. It is, for instance, less work to manually create correction rules for a few classes of errors than for many.

Nakagawa et al. (2002) present a revision learning (RL) method which combines a model with high generalization capacity (e.g., Support Vector Machines (SVMs), Vapnik (1998)) and a model with small computational cost (e.g., an HMM). RL uses a binary classifier with higher capacity to revise the errors made by the stochastic model with lower capacity. During the training phase, a ranking is assigned to each class by the stochastic model for a training example. That is, the candidate classes are sorted in descending order of their conditional probabilities given the example. Then the classes are checked in their ranked order. If the class is incorrect, the example is added to the training data for that class as a negative example, and the next ranked class is checked. If the class is correct, the example is added to the training data for that class as a positive example, and the remaining ranked classes are not taken into consideration. Using these training data, binary classifiers are created. The binary classifier is trained to answer whether the output from the stochastic model is correct or not. During the test phase, the ranking of the candidate classes for a given example is assigned by the stochastic model as in training. Then, the binary classifier classifies the example according to the ranking. If the classifier determines that the example is incorrect, the next highest ranked class becomes the next candidate to be checked. But if the example is classified as correct, the class of the classifier is returned as the answer for the example. Nakagawa et al. (2002) apply revision learning to the morphological analysis of Japanese. The combined classifier outperforms the best tagger by 2.52%.

Clark et al. (2003) investigate bootstrapping part-of-speech taggers using co-training, in which two taggers, TnT (Brants 2000) and the maximum entropy C&C tagger (Curran and Clark 2003), are iteratively re-trained on each other's output. Since the output of both taggers is noisy, the challenge is to decide which newly labelled examples should be added to the training set. They investigate selecting examples by directly maximizing tagger agreement on unlabeled data. The results show that simply re-training on all of the newly labelled data is surprisingly effective, with performance depending on the amount of newly labelled data added at each iteration. The authors also show that co-training can still benefit both taggers when the performance of one tagger is initially much better than the other. They also show that naive co-training, which does not explicitly maximize agreement, is unable to improve the performance of the taggers when they have already been trained on large amounts of manually annotated data.

2.5 A special approach to tagging highly inflected languages

Most words in English are unambiguous; they have only a single POS tag. But many of the most common words are ambiguous (e.g., 'can' can be an auxiliary, a noun, and a verb). Still, many of these ambiguous tokens are easy to disambiguate, since the various tags associated with a word are not equally likely.

In contrast, languages with rich morphologies are more challenging. Most Russian nouns, for instance, have singular and plural forms in all six cases (nominative, accusative, genitive, dative, locative, and instrumental). Most adjectives (at least potentially) form all three genders (masculine, feminine and neuter), both numbers (singular and plural), all six cases, all three degrees of comparison, and can be of either positive or negative polarity. That yields 3 × 2 × 6 × 3 × 2 = 216 possible forms for adjectives, many of which are homonymous on the surface. Therefore, the cardinality of the tagsets used for languages such as Russian is usually much larger than that for English. An additional complication is raised by the fact that inflectional languages typically have relatively free word order.

To sum up, the combination of a high degree of morphological ambiguity, a large tagset, and free word order, together with the lack of available resources, makes morphological tagging of highly inflectional languages a challenging problem.

The chapter so far has summarized a number of experiments that used different tagging techniques on Slavic languages. For instance, the Markov model has been applied to Czech and Polish with quite satisfactory results, using a large training corpus (see section 2.1.1). Supervised TBL has been tried on Czech and Slovene, and though the results are quite promising (∼86% accuracy), they are not as good as with the n-gram model. One of the reasons for this is that the default templates prespecified by the algorithm are not necessarily universal, and one would need to explore different templates for languages with rich inflectional morphologies, different (from English) agreement patterns, and free word order. The MaxEnt tagger has been tried on Slovene as well. The performance was around 86%, as with the supervised TBL model. The MB tagger performs similarly to the MaxEnt tagger for Slovene. The neural network approach was also applied to tagging Czech, and the accuracy was 88.71%. But Džeroski et al. (2000) report that training times for the MaxEnt and the rule-based (RB) taggers are unacceptably long (over a day for training), while the MB taggers and the TnT tagger are much more efficient.

The next section describes a tagger which was designed for morphologically rich languages in general, and for Czech in particular. A special property of this tagger is that it operates on subpositions of a tag (i.e. it assumes a positional tag system). This tagger deserves a special mention because its performance is thus far the best for Czech.

2.5.1 Exponential tagger

The Exponential tagger (EXP) was first introduced in Hajič and Hladká (1998b). This approach is primarily designed for tagging Czech. It predicts proper tags from the list of meaningful tags given by a morphological analyzer, which works with a positional tag system (see section 4.3). The Maximum entropy tagger described in section 2.1.3 operates on the tag level, whereas the exponential tagger operates on the subtag level (i.e. on the level of individual morphological categories). The ambiguity on the subtag level is mapped onto so-called ambiguity classes (ACs). For instance, for the word se the morphology generates two possible tags, RV------------- (preposition 'with') and P7-X4---------- (reflexive particle). The ambiguity on the subtag level is represented by four ACs: [R,P] (1st subtag), [V,7] (2nd subtag), [-,X] (4th subtag), and [7,4] (5th subtag). The number of ACs matches the number of morphological categories (MCs) whose value is not unique across the list of tags for a given word.


With regard to the ACs, EXP generates a separate model P_AC(y|x), where x is a context and y is the predicted subtag value ∈ Y. This model has the general form determined by the equation in (2.16) for each AC.

(2.16) p_AC,e(y|x) = exp(∑_{i=1}^{n} λ_i f_i(y,x)) / Z(x),

where Z(x) is the normalization factor given by (2.17).

(2.17) Z(x) = ∑_{y∈Y} exp(∑_{i=1}^{n} λ_i f_i(y,x))

To avoid the "null" probabilities caused by an unseen context in the training data or by an unseen AC in the training data (i.e. there is no model for the AC), the final p_AC(y|x) distribution can be formulated as in (2.18).

(2.18) p_AC(y|x) = σ p_AC,e(y|x) + (1−σ) p(y),

where p(y) is the unigram distribution per MC. In (2.16), {f_1, f_2, ..., f_n} is a set of yes/no features, i.e. f_i(y,x) ∈ {0,1}.

Each parameter λ_i (the so-called 'feature weight') corresponds to exactly one feature f_i, and the features operate over the events (subtag value, context). Hajič and Hladká (1998b) view the context as a set of attribute-value pairs with a discrete range of values. Every feature can thus be represented by the set of contexts in which it is positive.

Let Cat_AC be the ambiguity class AC of a morphological category Cat (for instance, Cat = gender and Cat_AC = {feminine, neuter}), let y be an attribute for the subtag value being predicted, x an attribute for the context value, and ȳ, x̄ values of the y and x attributes. Then the feature function f_{Cat_AC, ȳ, x̄}(y,x) → {0,1} is well-defined iff ȳ ∈ Cat_AC. The value of a well-defined function f_{Cat_AC, ȳ, x̄}(y,x) is determined by the formula in (2.19).

(2.19) f_{Cat_AC, ȳ, x̄}(y,x) = 1 ⇔ y = ȳ ∧ x̄ ⊆ x

The weight estimation is built on the ratio of the conditional probability of y in the context defined by the feature f_{AC, ȳ, x̄} and the uniform distribution for the ambiguity class, as in (2.20).

(2.20) λ_{f_{AC, ȳ, x̄}} = P_AC(y | x̄) / (1/|AC|)

The EXP tagger puts stress on the model's feature selection (during the training step) from the error rate point of view (similar to TB learning). In other words, from the pool of features available for selection, it chooses the features which lead to the maximal improvement in the error rate with respect to the setting of the threshold. The threshold is set to half the number of data items which contain the ambiguity class AC at the beginning of the feature selection loop, and then it is cut in half again at every iteration.

Page 43: A Resource-Light Approach to Morpho-Syntactic Tagging

28 Chapter 2. Common tagging techniques

This algorithm predicts all morphological categories independently and, moreover, the prediction is based on the ACs rather than on the previously predicted values. Thus, the tag which is suggested by the EXP tagger does not have to be an element of the list of tags returned by the morphological analyzer for the given word. That is why the purely subtag-independent strategy is modified by the so-called Valid Tag Combination (VTC) strategy. The dependence assumption is expressed in (2.21).

(2.21) p(t|x) = ∏_{Cat_AC, Cat ∈ Categories} p_AC(y_AC|x)

where t is a complete tag, x is a context, y_AC ∈ Cat_AC, and p_AC is determined by (2.18).
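A small sketch of the computation in (2.16)–(2.18), with invented feature functions, weights and smoothing value; the EXP tagger's actual feature selection and weight estimation are not reproduced here.

import math

def p_ac(y, x, features, weights, values, unigram, sigma=0.9):
    # features: list of yes/no functions f_i(y, x) -> 0 or 1
    # weights:  list of feature weights lambda_i
    # values:   the set Y of possible subtag values for this ambiguity class
    # unigram:  dict value -> unigram probability p(y) for this category
    def unnormalized(v):
        return math.exp(sum(w * f(v, x) for f, w in zip(features, weights)))
    z = sum(unnormalized(v) for v in values)           # (2.17)
    p_exp = unnormalized(y) / z                        # (2.16)
    return sigma * p_exp + (1 - sigma) * unigram[y]    # (2.18)

# Toy ambiguity class [R, P] for the first subtag position, one feature on the context
features = [lambda y, x: 1 if y == "P" and x.get("next_tag") == "V" else 0]
weights = [1.5]
print(p_ac("P", {"next_tag": "V"}, features, weights, {"R", "P"}, {"R": 0.5, "P": 0.5}))

The VTC probability in (2.21) is then simply the product of such per-category distributions over all categories of the tag.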

The Penn Treebank dataset has been used for EXP tagging of English. The Penn tagset was converted to the positional tag system. A Penn Treebank positional tag is defined as a concatenation of four categories: POS, SubPOS, number, and gender. For instance, for the word under, there are three possible Penn Treebank tags: IN (preposition), JJ (adjective), and RB (adverb), which translate into RR--, AA-1, and DO-1, respectively, in the positional system. The EXP tagger trained on the WSJ (1.2M words) gives 96.8% accuracy (Hladká 2000).

Hajič and Hladká (1998b) use the EXP tagger on Czech. They train the classifier on 130K words and test on 1K words. There are 378 different ambiguity classes (of subtags) across all categories. In these experiments, they use a positional tag system (see section 4.3). First, they run a morphological analyzer which covers about 98% of running, unrestricted text (newspapers, magazines, novels, etc.). The analyzer is based on a lexicon containing about 228K lemmata, and it can analyze about 20M word forms. The tagger achieves an accuracy of 91.25% on the full tag.

2.5.2 Other experiments

Finally, some experiments combine the exponential model described above with various other learning algorithms to improve tagging results.

Hajič et al. (2001) describe a hybrid system (applied to Czech) which combines the strengths of manual rule-writing and statistical learning, obtaining results superior to both methods applied separately. Their combination of a rule-based system and a statistical one is serial. The rule-based system, performing partial disambiguation with recall close to 100%, is applied first, and a trigram MM tagger runs on the results. The main contribution of the architecture is that the combination of the systems does not commit the linguistically trivial errors which occur from time to time in the results of purely statistical tagging. The improvement obtained (4.58% relative error reduction) beats the pure statistical classifier combination (Hladká 2000).

Hladká (2000) conducts several corpus-based tagging experiments. She performs an error analysis which suggests that the Markov model and Exponential taggers are partially complementary classifiers. Using plurality voting to combine a rule-based and a Markov model tagger trained on the CTC (Czech Tagged Corpus, 600K tokens), the individual tagger performance improves by more than 5%. When the training corpus is doubled, the accuracy improvement becomes more significant. However, the combination of the Exponential and Markov model taggers trained on the PDT (the Prague Dependency Treebank) by means of the plurality voting strategy does not bring any gain over the baseline Exponential tagger. This illustrates that a relatively high complementary error rate between taggers does not necessarily imply that there is anything to be gained by tagger plurality voting. But to take advantage of the high complementary rates, Hladká (2000) employs a context-based combination. In other words, she locates the contexts more "suitable" for the Markov model taggers. Given the partial success of the plurality voting procedure, the author applies it (and its variants) to combine Markov model taggers trained on partially different data produced by the bagging procedure. But with this approach, Hladká (2000) reports no improvement in tagging accuracy.

The best tagger currently available for Czech is the one developed by Spoustová et al. (2007). This is a hybrid system that uses three different statistical methods (HMM, Maximum Entropy and neural networks) and reaches 95.68% accuracy on the full positional tag.

2.6 Summary

This chapter has discussed a variety of tagging techniques and how they have been applied to the task of tagging inflected languages. What is interesting is that Markov models perform surprisingly well on such languages, which allow free word order. Markov models record the information about the word order in the transition probabilities. What the performance of Markov models suggests is that even if a language has the potential for free word order, there may be recurring patterns in the progressions of parts of speech attested in the training corpus (e.g. constituents whose average length is three words); otherwise, the information about the transition probabilities would not be helpful.

In addition, most literature on tagging inflected languages suggests that for languages with high morphological ambiguity, doing morphological analysis before tagging is a useful step to improve the efficiency and effectiveness of a tagger. A trigram model is reported to be the best when morphological preprocessing is employed.

The Czech exponential tagger operates on subtags rather than full tags. This is facilitated by the design of the Czech tagset, where tags can be easily decomposed into smaller units (see chapter 4). This provides an additional motivation for using a structured tag system for languages with rich inflection, such as Czech, Russian, or Spanish.

The experiments described in chapter 7 explore a different avenue from the taggers described in this chapter, namely, the question of whether the transition information obtained for Czech (Spanish) is useful for Russian (Portuguese/Catalan). In addition, the experiments measure the degree to which the emission information acquired from one language is useful for tagging another, and whether the lexical similarities between related languages can be used effectively for creating target-language models trained on a source-language corpus. This is a cross-lingual approach to tagging.


Chapter 3

Previous resource-light approaches to NLP

Supervised corpus-based methods, including those described in the previous chapter, are highly accurate for different NLP tasks, including POS tagging. However, they are difficult to port to other languages because they require resources that are expensive to create.

Previous research in resource-light language learning has defined resource-light in different ways. Some have assumed only partially tagged training corpora (Merialdo 1994); some start with small tagged seed wordlists (as in Cucerzan and Yarowsky (1999) for named-entity tagging). Others have exploited the automatic transfer of an already-existing annotated resource on a different genre or a different language (e.g. cross-language projection of POS tags, syntactic bracketing and inflectional morphology (Yarowsky et al. 2001; Yarowsky and Ngai 2001), requiring no direct supervision in the target language).

Ngai and Yarowsky (2000) observe that the most practical measure of the degree of supervision is the sum of weighted human and resource costs of different modes of supervision, which allows manual rule writing to be compared directly with active learning on a common cost-performance learning curve. Cucerzan and Yarowsky (2002), in turn, point out that another useful measure of minimal supervision is the additional cost of obtaining a desired functionality from existing commonly available knowledge sources. They note that for a remarkably wide range of languages, there exist plenty of reference grammar books and dictionaries which are invaluable linguistic resources.

This chapter takes a closer look at two bootstrapping solutions, both because they are fairly well-researched and because they seem promising for the problem of creating language technology for resource-poor languages. At the same time, there are some theoretically interesting questions as to their general applicability, which we address here as well. One of the possible solutions is unsupervised or minimally supervised learning of linguistic generalizations from corpora; the other is cross-language knowledge induction.

3.1 Unsupervised or minimally supervised approaches

Extensive previous work exists on unsupervised or minimally supervised learning in domains such as morphology, POS tagging, and prepositional phrase attachment. Only the most recent and relevant work, which has inspired the ideas presented in this book, will be discussed here.

3.1.1 Unsupervised POS tagging

Section 2.2 outlined unsupervised approaches to tagging. In a nutshell, unsupervised tagging approaches do not rely on the existence of a training corpus, but most require a dictionary or a lexicon that lists all possible parts of speech for a given word. This list of parts of speech may be ordered or unordered and, in the former case, may contain probabilities. For each word token in the corpus, the parts of speech in the dictionary for that word type are considered as possibilities. There are two challenges with relying on dictionaries for POS information. First, obtaining such dictionaries is work-intensive for resource-poor languages. And second, even when such dictionaries are available, it is often the case that unsupervised taggers based on them do not achieve suitable levels of accuracy.

3.1.2 Minimally supervised morphology learning

There has been extensive previous research on unsupervised learning in the domain of morphology. Learning inflectional morphology directly from an unannotated corpus is an interesting and important problem, since many languages of the world have more complex morphology than English. Borin (2003), for instance, cites that out of 95 languages for which the information is available, 41 have simple morphology, while 54 have complex morphology.

In the literature, the problem of learning morphology is sometimes seen as involving only the ability to relate word forms among themselves in a pairwise fashion, without any attempt at segmentation (e.g. Baroni et al. 2002). In other cases, the aim is to learn quite general regularities in string transformations (Theron and Cloete 1997; Clark 2001; Neuvel and Fulop 2002). However, most research on morphology induction proposes to factor out common substrings among the words in the corpus, segmenting word forms into non-overlapping pieces. This produces a concatenative model of morphology. Thus, the words are most commonly divided into a stem and a suffix. There are also attempts to learn recursive structures (i.e. stem+affix structures, where stems in turn are seen as made up of stem+affix; e.g. the Linguistica morphology learning program described by Goldsmith (2001)) and iterative ones (i.e. morph(eme) sequences; e.g. Creutz and Lagus 2002; Creutz 2003), as well as prefix-suffix combinations (e.g. Schone and Jurafsky 2002).


Various methods have been proposed for deciding which forms should be related to one another and where to make the cuts in the word forms. In the most commonly used approach, the factorization involves some variant of an information theoretic or probability measure, which, in turn, is used to calculate the division points between morphs or the overall best division point between stem and suffix. Very common here is the use of Minimum Description Length (MDL; Zemel (1993)) as in Brent (1994, 1999), de Marcken (1995) and Goldsmith (2001). MDL is an approach for finding an optimal number of clusters. The basic idea is that the measure of goodness captures both how well the objects fit into the clusters and how many clusters there are. In the framework of MDL, both the clusters and the objects are specified by code words whose length is measured in bits. The more clusters there are, the fewer bits are necessary to encode the objects. In order to encode an object, just the difference between it and the cluster it belongs to is encoded. More clusters mean the clusters describe objects better, and fewer bits are needed to describe the difference between objects and clusters. However, more clusters obviously take more bits to encode. Since the cost function captures the length of the code for both data and clusters, minimizing this function (which maximizes the goodness of clustering) will determine both the number of clusters and how to assign objects to clusters. The primary goal of using MDL is to induce lexemes from boundaryless speech-like streams. The MDL approach is based on the insight that a good grammar can be used to most compactly describe the corpus. MDL reflects both the most compact grammar and the most compact representation of the corpus using that grammar (i.e. the grammar matches the corpus well; Hana and Culicover 2008). Goldsmith (2001) uses an MDL approach in an algorithm acquiring (with 86% precision) concatenative morphology in a completely unsupervised manner from raw text. More specifically, Goldsmith uses MDL to accept or reject the hypotheses proposed by a set of heuristics.
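In its standard form (a generic statement, not necessarily Goldsmith's exact objective), the MDL criterion chooses the grammar G that minimizes the combined description length of the grammar and of the data D encoded with it:

G^{*} = \arg\min_{G} \bigl[ L(G) + L(D \mid G) \bigr],

where both lengths are measured in bits; the first term penalizes overly large grammars (too many clusters or morphs), while the second penalizes grammars that fit the corpus poorly.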

There are also approaches which do not use probability or information-theoretic measures at all, but instead seek purely discrete relatedness measures and symbolic factorizations. Such approaches include:

• engineering methods: e.g., calculating the minimum edit distance, or Levenshtein distance (Levenshtein 1966), between pairs of word forms (e.g. Theron and Cloete 1997; Yarowsky and Wicentowski 2000; Baroni et al. 2002)

• graph theoretic analyses: e.g., a trie is built and manipulated, yielding insight, engineering advantages, or both (Schone and Jurafsky 2000; Johnson and Martin 2003).

These proposals boil down to using additional sources of information deemed relevant for the morphology learning problem. Goldsmith (2001), for instance, eliminates singly occurring 'stems' and 'affixes' (i.e. each proposed stem and affix should appear at least twice or it will be removed from consideration). In addition, there are attempts to use syntax, in the form of near context, to separate homonymous stems or affixes according to their parts of speech or functions, respectively (Yarowsky and Wicentowski 2000; Belkin and Goldsmith 2002; Schone and Jurafsky 2002). Still other attempts use semantics in the form of mutual information (Baroni et al. 2002) to separate homonymous stems and affixes according to their meanings or functions and to eliminate spurious segmentations.

The following sections describe in more detail several approaches to morphology and part-of-speech tagging that use minimal supervision.

Yarowsky and Wicentowski (2000)

Yarowsky and Wicentowski (2000) present an original algorithm for the nearly unsupervised induction of inflectional morphological analysis. They treat morphological analysis as an alignment task in a large corpus, combining four similarity measures based on expected frequency distributions, context, morphologically-weighted Levenshtein distance, and an iteratively bootstrapped model of affixation and stem-change probabilities. They divide this task into three separate steps:

1. Estimate a probabilistic alignment between inflected forms and root forms in a given language.

2. Train a supervised morphological analysis learner on a weighted subset of these aligned pairs.

3. Use the result in Step 2 as either a stand-alone analyzer or a probabilistic scoring component to iteratively refine the alignment in Step 1.

The morphological induction assumes the following available resources:

1. A table of the inflectional parts of speech of the given language, along with a list of the canonical suffixes for each part of speech.

2. A large unannotated text corpus.

3. A list of the candidate noun, verb, and adjective roots of the language (typically obtainable from a dictionary) and any rough mechanism for identifying the candidate parts of speech of the remaining vocabulary, not based on morphological analysis.

4. A list of consonants and vowels of the given language.

5. A list of common function words of the given language.

6. Various distance/similarity tables generated by the same algorithm on previously studied languages can be useful as seed information, especially if these languages are closely related (optional).

Page 50: A Resource-Light Approach to Morpho-Syntactic Tagging

3.1. Unsupervised or minimally supervised approaches 35

The first similarity measure – alignment by frequency similarity – assumes that two forms belong to the same lemma when their relative frequency fits the expected distribution. The distribution of irregular forms is approximated by the distribution of regular forms.

Alignment by context similarity, the second similarity measure, is based on the idea that inflectional forms of the same lemma have similar selectional preferences (mostly much closer than even synonyms). For example, related verbs tend to occur with similar subjects/objects. To minimize the needed training resources, Yarowsky and Wicentowski (2000) identify the positions of head-noun objects and subjects of verbs using a set of simple regular expressions. The authors notice that such expressions extract significant noise and fail to match many legitimate contexts, but because they are applied to a large monolingual corpus, the partial coverage is tolerable.

The third alignment similarity function considers overall stem edit distance using a weighted Levenshtein measure (Levenshtein 1966). One important feature of this distance measure is that the edit costs for vowels and consonants are not the same. The motivation for the difference in costs is based on the idea that in morphological systems worldwide, vowels and vowel clusters are mutable through morphological processes, while consonants generally tend to have a lower probability of change during inflection. Rather than treating all string edits as equal, four values are used: V for vowels, V+ for vowel clusters, C for consonants, and C+ for consonant clusters. They are initially set to relatively arbitrary assignments reflecting their respective tendencies towards mutability, and then are iteratively re-estimated. A table from a similar language can also be used to set the initial edit costs. Even though this approach is shown to work, there is no linguistic research that supports this claim.
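A sketch of such a weighted edit distance, simplified to two cost classes (single vowels vs. consonants; the cost values stand in for the iteratively re-estimated ones and are not those of Yarowsky and Wicentowski):

def weighted_levenshtein(a, b, vowels="aeiou", vowel_cost=0.5, consonant_cost=1.0):
    # Edit distance where edits involving vowels are cheaper than edits involving consonants
    def cost(ch):
        return vowel_cost if ch.lower() in vowels else consonant_cost

    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0.0 if a[i - 1] == b[j - 1] else max(cost(a[i - 1]), cost(b[j - 1]))
            d[i][j] = min(d[i - 1][j] + cost(a[i - 1]),      # deletion
                          d[i][j - 1] + cost(b[j - 1]),      # insertion
                          d[i - 1][j - 1] + substitution)    # substitution or match
    return d[m][n]

print(weighted_levenshtein("sing", "sang"))   # 0.5: one vowel substitution
print(weighted_levenshtein("sing", "sting"))  # 1.0: one consonant insertion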

The fourth alignment is done with morphological transformation probabilities. The goal is to generalize the inflection-root alignments via a generative probabilistic model. At each iteration of the algorithm, the probabilistic mapping function is trained on the table output of the previous iteration (i.e. on the root-inflection pairs with optional POS tags, confidence scores, and stem change+suffix analysis). Each training example is weighted with its alignment confidence, and mappings which have low confidence are filtered out.

Of the four measures, no single model is sufficiently effective on its own. Therefore, traditional classifier combination techniques are applied to merge the scores of the four models.

Applying the method developed by Yarowsky and Wicentowski (2000) to the languages used in the current context raises a number of problems. First, the suffix-focused transformational model is not sufficient for languages such as Russian that exhibit prefixal morphology.1 Second, most of the difficult substance of the lemmatization problem is often captured in Yarowsky and Wicentowski's (2000) work by a large root+POS↔inflection mapping table and a simple transducer to handle residual forms. Unfortunately, such an approach is not directly applicable to highly inflected languages, such as Czech or Russian, where sparse data becomes an issue.

1 The morphological analyzer used in the experiments in subsequent chapters does not handle prefixes either, except for the negative ne- and the superlative nai-.

Yarowsky and Wicentowski (2000) use Cucerzan and Yarowsky's (2000) bootstrapping approximation of tag probability distributions. Their algorithm starts with a small annotated corpus. For French, for example, the initial training data was 18,000 tokens. Here, the goal is to develop a portable system which will not rely on any training corpus of the target language. Moreover, manually creating an annotated corpus that uses such fine-grained morpho-syntactic descriptions is extremely time-consuming.

Even though the algorithm described by Yarowsky and Wicentowski (2000) cannot be used directly because of the issues outlined above, their ideas, to a large extent, inspired the current work. The main goal here is to produce detailed morphological resources for a variety of languages without relying on large quantities of annotated training data. Similarly to Yarowsky and Wicentowski (2000), this work relies on a subset of manually encoded knowledge, instead of applying completely unsupervised methods.

3.2 Cross-language knowledge induction

Recent approaches to different NLP tasks exploit knowledge of words and text behavior in one (or more) language(s) to help solve tasks in another language. An example of such a task is word-sense disambiguation in one language using translations from a second language. Another example is verb classification by studying properties of verbs across several languages. A third example to be discussed in this chapter is cross-lingual propagation of morphological analysis.

Knowledge transfer across languages can also take advantage of existing resources for resource-rich languages to induce knowledge in languages for which few linguistic resources are available. This is made possible by the wider availability of parallel corpora with better alignment methods at the paragraph, sentence, and word level. Examples of knowledge induction tasks include learning morphology, part-of-speech tags, and grammatical gender, as well as the development of wordnets for many languages using, as a starting point, knowledge transfer from the Princeton WordNet (Miller 1990).

This section summarizes some of the relevant work in cross-language applications.

3.2.1 Cross-language knowledge transfer using parallel texts

It is a common situation to find a dominant language with some language technology resources and a lesser-known language lacking one or all of these resources, but with a fair amount of (machine-readable) parallel texts in the two languages. The obvious solution to the lack of resources is to try to transfer dominant-language annotations into the resource-poor language via an alignment of the parallel texts at some linguistic level. The performance of such systems depends on a number of factors, such as the kind of annotation targeted and the closeness of the languages involved. In some cases, the annotation transfer could be used to get a first, rough annotation that could then be refined by a mix of human and automatic correction methods.

A special case of this methodology would be to use another language indirectly, as it were, using an annotation tool trained on some language X for annotating a different language Y.

Bilingual lexicon acquisition

Algorithms for bilingual lexicon extraction from parallel corpora exploit a number of characteristics of translated, bilingual texts (Fung 1998). Such approaches usually assume that

• words have one sense per corpus,

• words have a single translation per corpus,

• there are no missing translations in the target document,

• frequencies of bilingual word occurrences are comparable,

• positions of bilingual word occurrences are comparable.

Most translated texts are domain-specific. Thus, their content words are usually used in a single sense and are translated consistently into the same target words. Once the corpus is aligned sentence by sentence, it is possible to learn the mapping between the bilingual words in these sentences. Sometimes, lexicon extraction is just a by-product of alignment algorithms aimed at constructing a statistical translation model (Brown et al. 1990, 1993; Chen 1993; Fung and Church 1994; Kay and Röscheisen 1993; Wu and Xia 1994). For other algorithms, lexicon extraction is the main goal. One approach (Dagan et al. 1993; Dagan and Church 1994) uses an EM-based model to align words in sentence pairs in order to obtain a technical lexicon. Other algorithms use sentence-aligned parallel texts to further compile a bilingual lexicon of technical words or terms using similarity measures on bilingual lexical pairs (Gale and Church 1991; Kupiec 1993; Smadja 1996). Still others focus on translating phrases or terms which consist of multiple words (Dagan and Church 1994; Kupiec 1993; Smadja 1996). In addition, Melamed (2000) shows how a statistical translation model can take advantage of preexisting knowledge that might be available about particular language pairs. Even the simplest kinds of language-specific knowledge, such as the distinction between content words and function words, are shown to reliably boost translation model performance on some tasks.
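
As a toy illustration of the kind of co-occurrence evidence such algorithms rely on (not a reimplementation of any cited model), the sketch below scores candidate translation pairs from a sentence-aligned corpus with the Dice coefficient; all data are invented.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus: (source sentence, target sentence) pairs.
bitext = [
    (["the", "house", "is", "red"], ["das", "haus", "ist", "rot"]),
    (["the", "house", "is", "big"], ["das", "haus", "ist", "gross"]),
    (["the", "car", "is", "red"],   ["das", "auto", "ist", "rot"]),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src, tgt in bitext:
    for s in set(src):
        src_freq[s] += 1
    for t in set(tgt):
        tgt_freq[t] += 1
    for s, t in product(set(src), set(tgt)):
        pair_freq[(s, t)] += 1

def dice(s: str, t: str) -> float:
    # Dice coefficient over sentence-level co-occurrence counts.
    return 2 * pair_freq[(s, t)] / (src_freq[s] + tgt_freq[t])

print(dice("house", "haus"))   # 1.0 -> strong translation candidate
print(dice("house", "rot"))    # 0.5 -> weaker candidate
```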


Cross-lingual propagation of morphological analysis and POS tagging

Similar to the approach described in this book, the underlying assumption in Snyder and Barzilay (2008a,b) and Snyder et al. (2008) is that structural commonality across different languages is a powerful source of information for morphological analysis. Their approach relies on parallel data.

Snyder and Barzilay (2008a,b) propose a model that supports fully symmetrical knowledge transfer, utilizing any combination of supervised and unsupervised data across language barriers. The goal of their work is to separate a word into its individual morphemes. The authors present a non-parametric Bayesian model that jointly induces morpheme segmentations of each language under consideration and at the same time identifies cross-lingual morpheme patterns. They evaluate their approach on a Hebrew-Arabic parallel corpus of short phrases. Their best performance on Arabic is 67.75% and on Hebrew is 64.90%. The approach is interesting and promising; however, its current performance might still be insufficient for further NLP applications.

Snyder et al. (2008) apply the same multilingual model to the task of POS tagging. The main hypothesis is that the patterns of ambiguity found in each language at the part-of-speech level will differ in systematic ways. Another assumption is that for pairs of words that share similar semantic or syntactic functions, the associated tags will be statistically correlated, though not necessarily identical. The authors use such word pairs as the bilingual anchors of the model, allowing cross-lingual information to be shared via joint tagging decisions. This hierarchical Bayesian model selects the appropriate tag for each token occurring in a text based on a tag dictionary (i.e., a set of possible tags for each word type). Even though they experiment with the parallel data provided by the Multext-East corpus, the evaluation is done not on the full detailed positional tag, but only on the 11 major POS categories. The performance of the taggers for English, Bulgarian, Slovene, and Serbian is in the range of 86%–95%, depending on the language combination. Unfortunately, when the lexicon is reduced to the 100 most frequent words, the model provides much less accurate results: 57%–71%, depending on the language combination. The important conclusion the authors draw based on these experiments is that the results of the bilingual model are consistently and significantly better than the monolingual baseline for all language pairs.

Borin (2002) describes an experiment where tags are transferred from a POS-tagged German text to a parallel Swedish text by automatic word alignment. After aligning the German and the Swedish texts, the German text is POS tagged with Morphy (Lezius et al. 1998). For every German word-tag combination, if there is a word alignment with a Swedish word, that word is manually assigned the SUC tag (Ejerhed and Källgren 1997) most closely corresponding to the POS tag of the German word. The results show that for the correct alignments, the German tag is generally the correct one for the Swedish correspondence (in 95% of the cases). For incorrect alignments, the proportions are reversed. This means that at least for this language pair and this text type, POS tagging of the source language combined with word alignment can be used to accomplish a partial POS tagging of the target language. Unfortunately, the author does not provide information about the size and granularity of the tagset. In addition, the POS transfer is done by hand. To automate the process, it would be necessary to formulate both the exact correspondences between the German and the Swedish tags and a procedure to decide whether (i) the alignment is correct and (ii) the POS transfer should be applied.

Dien and Kiem (2003) suggest a solution to the shortage of annotated resources in Vietnamese by building a POS tagger for EVC, an automatically word-aligned English-Vietnamese parallel corpus. The POS tagger makes use of the TB-learning method to project POS information from English to Vietnamese using word alignments. The Penn TreeBank tagset for English (36 non-punctuation tags) and a corresponding tagset of the same size for Vietnamese are used. Due to the typological differences between English and Vietnamese (an isolating language), direct projection of tags is not trivial. The authors use a number of heuristics to deal with the linguistic differences. The performance of the system on 1,000 words of the test data is 94.6% accuracy. Given that the alignments are created automatically by the GIZA++ model (Och and Ney 2000) with 87% precision and given the typological differences between the two languages, the tagging results the authors report are rather surprising.

Yarowsky and Ngai (2001) and Yarowsky et al. (2001) describe a system and a set of algorithms for automatically inducing stand-alone monolingual POS taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary language using parallel corpora. Case studies include French, Chinese, Czech, and Spanish. The authors apply existing analysis tools for English to bilingual text corpora and their output is projected onto the second language via statistically derived word alignments. This simple direct annotation projection is quite noisy, so the authors develop a training procedure which is capable of accurate system bootstrapping from noisy and incomplete initial projections. The performance of the induced POS tagger applied to French achieves 96% core POS tag accuracy. Unfortunately, the performance of the model on the other three languages is not reported.
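
The direct projection step can be pictured with the following schematic sketch: source-side tags are simply copied along the word-alignment links, producing the noisy target-side annotation that the cited training procedure then has to clean up. The data, tag names, and function are invented for illustration.

```python
# Schematic direct projection of POS tags through word alignments.
# This illustrates only the noisy first step; the cited work then trains
# a robust tagger on such projections rather than using them directly.

def project_tags(src_tags, alignment, tgt_len, default="UNK"):
    """src_tags: one POS tag per source token.
    alignment: list of (src_index, tgt_index) links.
    Returns one (possibly noisy) tag per target token."""
    projected = [default] * tgt_len
    for s_i, t_i in alignment:
        projected[t_i] = src_tags[s_i]
    return projected

src_tags = ["DET", "NOUN", "VERB", "ADJ"]        # e.g. "the house is red"
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]     # 1-to-1 links for illustration
print(project_tags(src_tags, alignment, tgt_len=4))
# ['DET', 'NOUN', 'VERB', 'ADJ']
```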

Parsing

Parsing is another domain of cross-language research. Most of these approaches rely on the existence of parallel corpora for projecting syntactic trees.

Hwa et al. (2004) explore using parallel text to help solve the problem of creating syntactic annotation in new languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. An important point in Hwa et al.'s (2004) work is that a distinction should be made between what can be projected versus what can only be learned on the basis of monolingual information in the language to be parsed. Hwa et al. (2004) explore the possibility of starting with a small, manually produced seed corpus in order to provide the key monolingual facts, and iteratively improving that corpus using information projected from English. For example, in the English-Chinese case, trees projected from English may make it possible to confidently identify many of the verb-argument relations, and a small number of confidently annotated Chinese trees may suffice to teach the parser how to identify attachment points for aspectual markers. Their experiments show that the parser performance from an automatically projected Chinese treebank is only a few points below what one would obtain after one or two years of manual treebanking, yet required less than one person-month for writing manual correction rules to account for limitations in projecting dependencies from English.

Cavestro and Cancedda (2005) consider the problem of projecting syntactic trees across the two sides of an English-French parallel corpus, without using any language-dependent feature. To achieve this, they introduce a literality score and use it to sort the bi-sentences of the parallel corpus into different classes. The source side is annotated with both syntactic and dependency trees, whereas the target side is annotated with POS tags. The intuition behind the literality score is that syntactic information can be projected more effectively when two parallel sentences are literal translations of each other. The literality score function turns this intuition into a ranking criterion. Since no manually annotated French treebank was available at that time, the authors evaluate the performance of their system by measuring the convergence rate of the parsers trained on the French side relative to the rate of convergence on the English side.

In a final example of cross-lingual research into parsing, Smith and Smith (2004) describe a bilingual parser that jointly searches for the best English parse, Korean parse, and word alignment, where the hidden structures constrain one another. The bilingual parser combines simple, commonly understood statistical models, such as statistical dependency parsers, probabilistic context-free grammars, and word-to-word translation models. The model used for parsing is completely factored into the two parsers and the translation model, allowing separate parameter estimation. The authors evaluate their bilingual parser on the Penn Korean Treebank and against several baseline systems and show improvements in parsing Korean with very limited data.

Semantic classes

Padó and Lapata (2005) consider the problem of unsupervised semantic lexicon acquisition. They introduce a fully automatic approach that exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet (Baker et al. 1998) lexicon, their method exploits word alignments to generate frame-candidate lists for new languages, which are subsequently pruned automatically using a small set of linguistically motivated filters. Their evaluation shows that such an approach can produce high-precision, multilingual FrameNet lexicons without recourse to bilingual dictionaries or deep syntactic analysis.


Tsang (2001) examines the use of multilingual resources in the automatic learning of verb classification. The author shows that statistics of carefully selected multilingual features, collected from a bilingual English-Chinese corpus, are useful in automatic lexical acquisition in English. In combination with English features, Chinese POS tags, passive particles, and periphrastic particles are reported as the features that contribute the most significant improvements to the performance of English-only features in the acquisition task.

3.2.2 Cross-language knowledge transfer without parallel corpora

Despite a surge in research using parallel corpora for various machine translation tasks and other applications that have been described above, the amount of available bilingual parallel corpora is still relatively small in comparison to the large amount of available monolingual text. It is unlikely that one can find parallel corpora in any given domain in electronic form. This is a particularly acute problem in "less popular" languages. Using non-parallel corpora for various NLP applications is a daunting task and considered much more difficult than performing the same tasks with parallel corpora.

This section describes cross-language knowledge induction in various domains without the use of parallel corpora.

Word sense disambiguation (WSD) and translation lexicons

Dagan (1990) was the first to use a pair of non-parallel texts for the task of lexical disambiguation on one of the two texts. His algorithm is based on the premise that a polysemous word in one language maps to different words corresponding to its various senses in the other language. In other work on sense classification, Schuetze (1992) forms large vectors containing context words for each word he tries to classify. He then uses Singular Value Decomposition (SVD) to obtain the most discriminative context words for further classification of new words. Large vectors containing context or collocational words are also used in Gale et al. (1992a,b,c) and Yarowsky (1995) to disambiguate multiple senses of a word.

The basic idea in Dagan (1990) extends to choosing a translation among multiple candidates (Dagan and Itai 1994) given contextual information. Given a small segment containing a few words, they represent a feature for a word in terms of its co-occurrence with other words in that segment. A similar idea is later applied by Rapp (1995) to show the plausibility of correlations between words in non-parallel texts. His paper reports a preliminary study showing that words which co-occur in a text are likely to co-occur in another text as well. He proposes a matrix permutation method matching co-occurrence patterns in two non-parallel texts, but notes that there are computational limitations to this method. Using the same idea, Tanaka and Iwasaki (1996) demonstrate how to eliminate candidate words in a bilingual dictionary.


Fung and McKeown (1997) present an initial algorithm for translating technical terms using a pair of non-parallel corpora. They present a statistical word feature, the Word Relation Matrix, which can be used to find translated pairs of words and terms from non-parallel corpora, across language groups.

Fung (1998) and Fung and Lo (1998) describe a new method which combines information retrieval (IR) and NLP techniques to extract new word translations from automatically downloaded English-Chinese non-parallel newspaper texts. The authors present an algorithm which uses context seed word term frequency (TF) and inverse document frequency (IDF) measures. This was the first algorithm to generate a collocation bilingual lexicon from a non-parallel, comparable corpus. The algorithm has good precision, but the recall is low due to the difficulty in extracting unambiguous Chinese and English words.
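
A toy sketch of the underlying intuition — words are compared through context vectors built over a small seed lexicon — is given below. It is a deliberate simplification, not the Word Relation Matrix or the TF/IDF weighting of the cited work, and it assumes the seed words are already mapped across the two corpora.

```python
import math
from collections import Counter

# Toy illustration of comparing words across non-parallel corpora via context
# vectors over a small seed lexicon. Seeds are assumed to be already mapped
# (shown here with identical strings in both corpora).

SEED = ["bank", "money", "water"]

def context_vector(word, sentences, seeds=SEED):
    """Count how often each seed word co-occurs with `word` in a sentence."""
    counts = Counter()
    for sent in sentences:
        if word in sent:
            counts.update(s for s in seeds if s in sent)
    return [counts[s] for s in seeds]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

english = [["loan", "bank", "money"], ["loan", "interest", "money"]]
foreign = [["kredit", "bank", "money"], ["kredit", "money"]]
print(cosine(context_vector("loan", english),
             context_vector("kredit", foreign)))   # ≈ 1.0: identical context profiles
```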

Named Entity (NE) recognition

There are a number of experiments that deal with applying a source language NE recognizer to a target language. Some use genetically related languages; others do not. For example, the experiment described in Maynard et al. (2003) applies an English NE recognizer to Cebuano, an Austronesian language of the Philippines.

According to a Linguistic Data Consortium (LDC) categorization2, Cebuano is classed as a language which is of medium difficulty to process. The main problem is that no large scale translation dictionaries, parallel corpora, or morphological analyzers are available. However, the language has Latin script, is written with spaces between words, and has capitalization similar to Western languages, all of which make processing a much easier task than for, say, Chinese or Arabic. The important points are, therefore, that little work has been done on the language, and few resources exist, but that the language is not intrinsically hard to process.

Maynard et al. (2003) describe an experiment to adapt an NE recognition system from English to Cebuano as part of the TIDES surprise language program3.

2 The Linguistic Data Consortium (LDC) conducted a survey of the largest (by population) 300 languages (http://www.ldc.upenn.edu/Projects/TIDES/language-summary-table.html) in order to establish what resources were available for each language and which languages would be potentially feasible to process. Their categorization includes factors such as whether they could find dictionaries, newspaper texts, a copy of the Bible, etc. on the Internet, and whether the language has its words separate in writing, simple punctuation, orthography, morphology, and so on.

3 The TIDES Surprise Language Exercise is a collaborative effort between a number of sites to develop resources and tools for various language engineering tasks on a surprise language. Within a month of the language being announced, resources must be collected and tools developed for tasks such as Information Extraction (IE), Machine Translation (MT), Summarization and Cross-language Information Retrieval (CLIR). The aim is to establish how quickly the NLP community can build such tools in the event of a national emergency such as a terrorist attack.


With four person-days of effort, with no previous knowledge of which language would be involved, with no knowledge of the language in question once it was announced, and with no training data available, Maynard et al. (2003) adapt the ANNIE system4 and achieve an F-measure of 69.1% (85.1% precision and 58.2% recall). The only Cebuano-specific resources the authors use are one native speaker to manually annotate some texts with Named Entities (for testing the system), and two websites in Cebuano (local news from Iligan City and the surrounding area).

4 ANNIE is an open-source, robust IE system, developed at the University of Sheffield, that relies on finite-state algorithms. ANNIE consists of the following main language processing tools: tokenizer, sentence splitter, POS tagger, and named entity recognizer.

Carreras et al. (2003) present work on developing low-cost Named Entity recognizers (NER) for a language with no available annotated resources, using existing resources for a similar language as a starting point. They devise and evaluate several strategies to build a Catalan NER system using only annotated Spanish data and unlabeled Catalan text. They compare their approach with a classical bootstrapping approach where a small initial corpus in the target language is hand-tagged. One strategy they experiment with is to first train models for Spanish and then translate them into Catalan. Another strategy is to directly train bilingual models. The resulting models are retrained on unlabeled Catalan data using bootstrapping techniques. It turns out that the hand translation of a Spanish model is better than a model directly learned from a small hand-annotated training corpus of Catalan. The best result is achieved using cross-linguistic features. Solorio and López (2005) follow their approach in applying the NER system for Spanish directly to Portuguese and train a classifier using the output and the real classes.

Pedersen et al. (2006) describe a method for discriminating ambiguous names that relies upon features found in corpora of a more abundant language. In particular, they discriminate ambiguous names in Bulgarian, Romanian, and Spanish corpora using information derived from much larger quantities of English data. They mix together occurrences of the ambiguous name found in English with the occurrences of the name in the language in which they are trying to discriminate. They refer to this as a "language salad", and find that it often results in even better performance than when only using English or the test language itself as the source of information for discrimination.

Verb classes

Tsang et al. (2002) investigate the use of multilingual data in the automatic classification of English verbs and show that there is a useful transfer of information across languages. The authors report experiments with three lexical semantic classes of English verbs. They collect statistical features over a sample of English verbs from each of the classes, as well as over Chinese translations of these verbs. They use the English and Chinese data, alone and in combination, as training data for a machine learning algorithm whose output is an automatic verb classifier. They demonstrate that not only is Chinese data useful for classifying the English verbs, but also a multilingual combination of data outperforms the English data alone (82% vs. 85% accuracy). In addition, the results show that it is not necessary to use a parallel corpus to extract the translations in order for this technique to be successful (cf. Tsang 2001).

Ruimy et al.'s (2004) approach boils down to finding cognate words in a bilingual dictionary, using the information about the cognate suffixes, and assuming that if an Italian word has the same translation for all its senses, the French equivalent will share all the senses with that word. Such an approach gives high precision but is inadequate in cases where words have more than one translation. Ruimy et al. (2004) propose a second strategy which uses frequency, morphological, and lexical relation (e.g. hypernymy) indicators to decide on the right set of senses for the target word.

Mann and Yarowsky (2001) present a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages. Bilingual lexicons within language families are induced using probabilistic string edit distance models. Translation lexicons for arbitrary distant language pairs are then generated by a combination of these intra-family translation models and one or more cross-family online dictionaries. Up to 95% exact match accuracy is achieved on the target vocabulary (30-68% of inter-family test pairs). Mann and Yarowsky (2001) conclude that substantial portions of translation lexicons can be generated accurately for languages where no bilingual dictionary or parallel corpora may exist.

It is important to mention the work of Mann and Yarowsky (2001) here, since it inspired several ideas described in this book. First, Mann and Yarowsky (2001) report that the Levenshtein distance is better for cognate identification than the HMM or the stochastic transducer. Based on this report, Levenshtein distance was chosen for the current work (see chapter 7). Second, the authors use the idea that languages within the same language family are often close enough to each other to share many cognate pairs. The more closely languages are related, the more cognate pairs they presumably share. This idea is adopted in the present work, as will be further described in the following chapters. Third, the current work relies on a fundamental approach similar to that of Mann and Yarowsky (2001), namely, the use of resources that are available for one language to induce resources for another, related language. In the case of Mann and Yarowsky (2001), the bridge language is the one that has the resources (i.e. a source-bridge language bilingual dictionary) and is the language that is genetically related to the target language. An interesting point that Mann and Yarowsky (2001) make is that combining several bridge languages together improves coverage but does not always improve the performance over using the best single bridge language. This point will be revisited later in this book.


Inducing POS taggers with a bilingual lexicon

Cucerzan and Yarowsky (2002) present a method of bootstrapping a fine-grained, broad-coverage part-of-speech tagger in a new language using only one person-day of data acquisition effort. The approach requires three resources:

1. An online or hard-copy pocket-sized bilingual dictionary.

2. A basic reference grammar.

3. Access to an existing monolingual text corpus in the language.

The steps of the algorithm are as follows:

1. Induce initial lexical POS distributions from English translations in a bilingual dictionary without POS tags.

2. Induce morphological analyses.

The authors notice that when the translation candidate is a single word, inducing a preliminary POS distribution for a foreign word via a simple translation list is not problematic. For example, suppose the Romanian word mandat can be translated as the English warrant, proxy and mandate. Each of these English words can in turn be different parts of speech. Now suppose that P(N|warrant) = 66% and P(V|warrant) = 34%; P(N|proxy) = 55% and P(A|proxy) = 45%; P(N|mandate) = 80% and P(V|mandate) = 20%. Then, P(N|mandat) = (66% + 55% + 80%)/3 = 67%, which means that in the majority of cases, the Romanian word mandat is a noun.
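
The single-word case can be replicated with a few lines of code; the sketch below simply averages the POS distributions of the translations, using the numbers from the example above.

```python
# Averaging the POS distributions of a foreign word's English translations,
# recreating the mandat example from the text.

def induce_pos_distribution(translations, english_pos):
    """translations: English translations of the foreign word.
    english_pos: dict mapping an English word to its POS distribution."""
    combined = {}
    for word in translations:
        for tag, prob in english_pos[word].items():
            combined[tag] = combined.get(tag, 0.0) + prob / len(translations)
    return combined

english_pos = {
    "warrant": {"N": 0.66, "V": 0.34},
    "proxy":   {"N": 0.55, "A": 0.45},
    "mandate": {"N": 0.80, "V": 0.20},
}
print(induce_pos_distribution(["warrant", "proxy", "mandate"], english_pos))
# ≈ {'N': 0.67, 'V': 0.18, 'A': 0.15}
```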

However, if a translation candidate is phrasal (e.g. the Romanian word mandat is translated as money order), then modeling the more general probability of the foreign word's POS is more challenging, since English words often have multiple parts of speech:

(1) P(T_f | w_e1, ..., w_en) = P(T_f | T_e1, ..., T_en) · P(T_e1, ..., T_en | w_e1, ..., w_en)

The authors mention several options for estimating P(T_f | T_e1, ..., T_en). One is to assume that the POS usage of phrasal (English) translations is generally consistent across dictionaries (e.g. P(N_f | N_e1, N_e2) remains high regardless of publisher or language). Thus, any foreign-English bilingual dictionary that also includes the true foreign-word POS could be used to train these probabilities. Another option is to do a first-pass assignment of foreign-word parts of speech based only on single-word translations and use this to train P(T_f | T_e1, ..., T_en) for those foreign words that have both phrasal and single-word translations. Cucerzan and Yarowsky (2002) suggest a third way to obtain the probability of the foreign-word parts of speech via a third language dictionary (e.g. Romanian via Spanish). Unfortunately, the authors are not explicit about the method they apply for inducing these probabilities, but a table given in the article states that the English translations were untagged and the training dictionary (in the case of Romanian) was Spanish-English. Presumably, the probabilities of Romanian parts of speech are derived from the following series of steps: Romanian word → English translations → Spanish translations with parts of speech → Spanish parts of speech to Romanian words via English translations. If this is indeed the case, then Cucerzan and Yarowsky's (2002) idea is very similar to the one explored in subsequent chapters of this book — the idea of transferring POS information from a related language to the target language.

The next step in Cucerzan and Yarowsky's (2002) work is to induce parts of speech using morphological analysis. They explore the idea that for inducing morphological analysis it is enough to begin with whatever knowledge can be efficiently manually entered from a grammar book in several hours. The experiments to be described also explore this idea, specifically, using paradigm-based morphology for Russian, Portuguese, and Catalan, including only the basic paradigms from a standard grammar textbook. Cucerzan and Yarowsky create a dictionary of regular inflectional affix changes and their associated POS, and on the basis of it, they generate hypothesized inflected forms following the regular paradigms. Clearly, these hypothesized forms are inaccurate and overgenerated. Therefore, the authors perform a probabilistic match between all lexical tokens actually observed in a monolingual corpus and the hypothesized forms. In their next step, Cucerzan and Yarowsky combine these two models, the model created on the basis of dictionary information and the one produced by the morphological analysis. This approach relies heavily on two assumptions: 1) words of the same POS tend to have similar tag sequence behavior, and 2) there are sufficient instances of each POS tag labeled by either the morphology models or closed-class entries. However, for richly inflected languages such as Russian or Czech, data sparsity is the classical problem because of the large tagset (see the discussion in chapter 5), so there is no guarantee that assumption (2) will always hold.
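
A minimal sketch of this overgenerate-and-filter idea is shown below; the paradigm fragment, tags, and corpus are toy data, and the probabilistic matching of Cucerzan and Yarowsky (2002) is replaced here by simple exact lookup.

```python
# Minimal sketch of overgenerate-and-filter: hypothesize inflected forms from
# textbook paradigms, then keep only those attested in a raw monolingual corpus.

PARADIGM = {  # ending -> morphological tag (toy fragment of a feminine noun paradigm)
    "a":  "noun.fem.sg.nom",
    "y":  "noun.fem.sg.gen",
    "u":  "noun.fem.sg.acc",
    "ou": "noun.fem.sg.ins",
}

def hypothesize(stem):
    """Overgenerate candidate forms for a stem; many will not exist."""
    return {stem + ending: tag for ending, tag in PARADIGM.items()}

def filter_by_corpus(hypotheses, corpus_tokens):
    """Keep only the hypothesized forms that are attested in raw text."""
    attested = set(corpus_tokens)
    return {form: tag for form, tag in hypotheses.items() if form in attested}

corpus = ["žena", "ženu", "dům"]        # raw, unannotated tokens
print(filter_by_corpus(hypothesize("žen"), corpus))
# {'žena': 'noun.fem.sg.nom', 'ženu': 'noun.fem.sg.acc'}
```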

The last step in the approach to POS tagging adopted by Cucerzan and Yarowsky (2002) is inducing the agreement features, specifically, the gender information. Unlike English, languages such as Romanian or Spanish have Adj-Noun, Det-Noun, and Noun-Verb agreement at the subtag level (e.g. for person, number, case and gender). This information is missing in the induced tags, since it is projected from English. The assumption that the authors make is that words exhibiting a property such as grammatical gender tend to co-occur in a relatively narrow window (±3) with other words of the same gender. Since the majority of nouns have a single grammatical gender independent of context, smoothing is performed to force nouns (which are sufficiently frequent in the corpus) toward their single most likely gender. The other agreement features are induced in a similar fashion (but the details are omitted in the article).
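
The window-based assumption can be illustrated with a toy voting sketch; the word list, the single-sentence setting, and the absence of the smoothing step are simplifications, not the authors' actual procedure.

```python
# Toy sketch of window-based agreement propagation: a word of unknown gender
# is assigned the gender that dominates among already-classified words within
# a +/-3 token window. Data and the lexicon are invented for illustration.

from collections import Counter

KNOWN_GENDER = {"una": "fem", "casa": "fem", "un": "masc", "libro": "masc"}

def guess_gender(token, sentence, window=3):
    votes = Counter()
    for i, w in enumerate(sentence):
        if w != token:
            continue
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i and sentence[j] in KNOWN_GENDER:
                votes[KNOWN_GENDER[sentence[j]]] += 1
    return votes.most_common(1)[0][0] if votes else None

sentence = ["compró", "una", "casa", "blanca", "nueva"]
print(guess_gender("blanca", sentence))   # 'fem'
```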

The accuracy of the model on the fine-grained (up to 5-features) POS space is 75.5%. For nouns, they distinguish number, gender, definiteness, and case; for verbs – tense, number, and person; and for adjectives – gender and number.

Again, similarly to Cucerzan and Yarowsky (2002), the present work uses a basic library reference grammar book and access to an existing monolingual text corpus in the language. However, they also use a medium-sized bilingual dictionary. In this work, a paradigm-based morphology, including only the basic paradigms from a standard grammar textbook (see chapters 6 and 7), is used instead.

Parsing

Agirre et al. (2004, 2005) explore the viability of porting lexico-syntactic information from English to Basque in order to make PP attachment decisions. Basque is a free-constituent-order language where PPs in a multiple-verb sentence can be attached to any of the verbs. Their method consists of several steps. First, the head-cases/prepositions from the test Basque data are obtained. Next, they are translated into English. Then, all possible English VP(head)-PP(head-case) translation combinations are built, English combination frequencies in the English corpus are collected, and each frequency is assigned a weight. Using this approach, the best precision value obtained is 72%. This method, even though it does not rely on parallel corpora, relies heavily on the availability of a translation lexicon.

3.3 Summary

This chapter has described a number of resource-light approaches to various NLP tasks. These approaches can be divided into two groups: those that use no or minimal training data and those that use resources in one language to project linguistic knowledge into another, resource-poor language. The latter approach can be subdivided into two types — techniques that use parallel data and those that do not rely on parallel texts. Many applications have been described within these approaches, including inducing morphology, POS tagging, projecting syntactic trees, inducing translation lexicons, learning verb classes, etc. Some of the approaches use a hybrid methodology (i.e. parallel data supported by minimal knowledge encoding, or semi-supervised techniques starting with a small training text and then bootstrapping from the results of the initial step, using a mixture of parallel and comparable data).

All these approaches rely crucially on analyses that establish relationships between individual word forms, possibly using frequencies or information-theoretic reasoning to make the connections. All can be conceptualized as providing a space of possibilities, then filtering it to remove possibilities that are undesired. The variety of different approaches to filtering suggests that no particular instance of this general class of approaches can be relied upon to work well for an arbitrarily selected language. Part of the difficulty, we believe, is that the techniques involved are not easy to connect to pre-existing knowledge that has been acquired and systematized by traditional linguists and language teachers. This means that when they misfire the results are hard to understand and interpret.

We want our approach to make use of existing knowledge, and be accessible to consultants whose only qualification is advanced expertise in the languages concerned. We are averse to approaches that produce opaque or uninterpretable knowledge. We therefore suggest a simple paradigm-based approach that uses well-established facts about morphology and language relationships.

Our approach to morphological processing adopts many ideas from the work mentioned above. Similarly to Cucerzan and Yarowsky (2002), we assume that reference grammar books are a great starting point for automatic morphological analysis. Like many approaches described above, we do not rely on parallel corpora for projecting annotation. In addition, some corpora used in our experiments (see chapter 7) are not even comparable. For example, the Czech corpus used for training the tagger is a collection of newspapers, whereas the target Russian corpus is a literary text. Moreover, we do not assume the availability of pre-existing bilingual dictionaries either.

We used a modified Levenshtein distance to identify cognates. In this, we follow Mann and Yarowsky (2001), who report that compared with the HMM or the stochastic transducer, the Levenshtein distance is better for cognate identification. The variant of the distance we use is similar to that of Yarowsky and Ngai (2001) and Yarowsky et al. (2001), although they use it for a different purpose.

The following chapters describe in more detail the methodology, the resources, and the evaluation results of the experiments with cross-lingual projection of morphological information.


Chapter 4

Languages, corpora and tagsets

This chapter provides an overview of the languages (section 4.1), the corpora (section 4.2), and the tagsets (sections 4.3 and 4.4) used in our experiments.

4.1 Language properties

This section briefly describes Czech, Russian, Catalan, Portuguese, and Spanish – the languages used in the experiments. The first two belong to the Slavic family; the other three belong to the Romance group of languages. A more detailed discussion of the languages can be found in Appendix C. Since the goal of the task is to project morpho-syntactic information from a source language to a target language, the discussion concentrates mainly on characterizing the morpho-syntactic properties of these languages.

4.1.1 Czech and Russian

Czech and Russian are both Slavic (Slavonic) languages. Slavic languages are a group of Indo-European languages spoken in most of Eastern Europe, much of the Balkans, part of Central Europe, and the Russian part of Asia. Czech belongs to the West branch of Slavic languages, whereas Russian is an East-Slavonic language. The description of these languages is based on Comrie and Corbett (2002), Shenker (1995) and on Karlík et al. (1996) for Czech, and Wade (1992) for Russian. We abbreviate the nominal morphological categories as shown in Table 4.1. For example, S1 or nom.sg. stands for singular nominative.

The similarity of Czech and Russian can be illustrated by a parallel Czech-Russian example in (2). Of course, not all sentences are so similar. There are many differences on all language levels. A brief summary of some important linguistic properties of Czech and Russian is provided in Table 4.2; the following text provides more details.


Table 4.1. Abbreviations of morphological categories

S or sg.             singular              1 or nom.     nominative
P or pl.             plural                2 or gen.     genitive
M or masc.(anim.)    masculine (animate)   3 or dat.     dative
I or masc.inam.      masculine inanimate   4 or acc.     accusative
F or fem.            feminine              5 or voc.     vocative
N or neut.           neuter                6 or loc.     local
                                           7 or inst.    instrumental

Table 4.2. Slavic: Shallow contrastive analysis

                           Czech                            Russian
fusional                   +                                +
case                       7                                6 (+2)
gender                     3                                3
number                     2 (+1)                           2
animacy                    only in masc                     in acc
short adjectives           +                                +
articles                   -                                -
subjunctive                inflected auxiliary by           particle by
tense                      present, past, future            present, past, future
word order                 free, old<new                    free, old<new
copula                     obligatory                       past/future
negation                   prefix ne-                       particle ne
reflexivization            clitic se                        suffix -sja
subject-verb agreement     number, person,                  number, person,
                           gender (past), animacy (past)    gender (past, sg)
adjective-noun agreement   case, number,                    case, number,
                           gender, animacy                  gender (sg)
pron/aux clitic            +                                -
neg concord                +                                +
genitive of negation       - (archaic)                      partial


(2) a. [Czech]
    Byl jasný, studený dubnový den a hodiny odbíjely trináctou.
    was.masc.past bright.masc.sg.nom cold.masc.sg.nom April.masc.sg.nom day.masc.sg.nom and clocks.fem.pl.nom stroke.fem.pl.past thirteenth.fem.sg.acc

    b. [Russian]
    Byl jasnyj, xolodnyj aprel'skij den' i casy probili trinadtsat'.
    was.masc.past bright.masc.sg.nom cold.masc.sg.nom April.masc.sg.nom day.masc.sg.nom and clocks.pl.nom stroke.pl.past thirteen.acc

    'It was a bright cold day in April, and the clocks were striking thirteen.' [from Orwell's '1984']

Inflection

Both languages are fusional languages, i.e. several inflections are often fused into one phonetic and orthographic form. Both are richly inflected. The morphological systems of the two languages are very close. The order and function of morphemes are nearly identical. Naturally, the morphemes have different shapes (and are written in different scripts), but even from this point of view, they are also often similar. As an illustration, we show two parallel paradigms in Table 4.3.

Table 4.3. Example comparison of Czech and Russian noun declension

          Czech      Russian          Gloss
sg.nom    žen-a      ženšcin-a        'woman'
   gen    žen-y      ženšcin-y
   dat    žen-e      ženšcin-e
   acc    žen-u      ženšcin-u
   voc    žen-o      –
   loc    žen-e      ženšcin-e
   ins    žen-ou     ženšcin-oj/ou
pl.nom    žen-y      ženšcin-y
   gen    žen        ženšcin
   dat    žen-ám     ženšcin-am
   acc    žen-y      ženšcin
   voc    žen-y      –
   loc    žen-ách    ženšcin-ax
   ins    žen-ami    ženšcin-ami


Nominal categories (adjectives, nouns, pronouns) inflect for gender, number, and case. Both languages have 3 genders (masculine, feminine, neuter) and two numbers (Czech also has some remnants of dual number). They share 6 cases with roughly the same meaning (nominative, genitive, dative, accusative, local, instrumental). In addition, Czech has vocative and Russian has two secondary cases: second genitive and second locative. In both languages, nouns are grouped into declension classes. In addition to so-called long adjectives that distinguish gender, number and case, both languages also have so-called short adjectives whose syntactic distribution is restricted and which do not distinguish case. Russian adjectives, possessives, ordinal numerals, etc. do not distinguish gender in plural. Numerals use declensional strategies which range from near indeclinability to adjective-like declension. Neither language has articles; (in)definiteness is expressed using other means, e.g. word order (Brun 2001).

Verbs distinguish number, three persons, three tenses, mood, and perfective and imperfective aspect; participles distinguish gender (in Czech in both numbers, in Russian only in singular). Aspect is expressed morphologically. Past tense, imperfective future tense and the subjunctive are expressed periphrastically. The subjunctive is formed by the use of conjugated forms of the auxiliary by in Czech, whereas in Russian it is formed by the particle by. The copula in the present tense is omitted in Russian in the majority of cases, but it is obligatory in Czech. In Czech, reflexivization is expressed by sentential clitics, whereas in Russian it is expressed by verbal suffixes. Russian verb negation is marked by a separate particle, while Czech verb negation is marked by a prefix.1

Morphology in both languages exhibits both (i) a high number of morphemic categories whose values are combined in clusters, each of which is expressed by a single ending (e.g. number, gender, and case with nouns or adjectives, or tense, number, and person with finite verbs), and (ii) a high degree of synonymy and ambiguity of the endings. Their synonymous morphemic shapes are divided into many paradigms and classes, within which individual morphs often express different values. For example, see the homonymy of the Czech ending -a in Table 4.4. The ending -e expresses six of the seven cases in the singular number (see Table 4.5). The situation in Russian is similar. In addition, there are manifold phonemic and morphemic alternations in the stems (e.g. forms of the same word such as knig-a/knižk-a 'book' in Russian, or vítr/vetr-u 'wind' and matk-a/matc-e/matek/matc-in 'mother' in Czech).

In Czech, there is a significant difference in morphology and the lexicon between the standard and colloquial levels of Czech. The automatic morphological analysis of such a language is especially challenging since the same word can have several morphological forms, depending on the language level. It also means that a tagset of Czech (assuming it captures this feature) is significantly larger than the tagset of another Slavic language with otherwise comparable morphology.

1 It can be argued that all four morphemes (Czech reflexive clitic, Russian reflexive suffix, Czech negative prefix and Russian negative particle) are various types of clitics. However, for text-based language processing, including our work, the important distinguishing feature is that Czech reflexives and Russian negation are orthographically separate while the other two morphemes are not.


Table 4.4. Homonymy of the -a ending in Czech

form        lemma       gloss        category
mest-a      mesto       town         NS2 noun neut sg gen
                                     NP1 (5) noun neut pl nom (voc)
                                     NP4 noun neut pl acc
tém-a       téma        theme        NS1 (5) noun neut sg nom (voc)
                                     NS4 noun neut sg acc
žen-a       žena        woman        FS1 noun fem sg nom
pán-a       pán         man          MS2 noun masc anim sg gen
                                     MS4 noun masc anim sg acc
ostrov-a    ostrov      island       IS2 noun masc inanim sg gen
predsed-a   predseda    president    MS1 noun masc anim sg nom
vide-l-a    videt       see          verb past fem sg
                                     verb past neut pl
vide-n-a                             verb passive fem sg
                                     verb passive neut pl
vid-a                                verb transgressive masc sg
dv-a        dv-a        two          numeral masc sg nom
                                     numeral masc sg acc

Table 4.5. Ending -e and noun cases in Czech

case    form      lemma      gender        gloss
nom     kur-e     kure       neuter        chicken
gen     muž-e     muž        masc.anim.    man
dat     mouš-e    moucha     feminine      fly
acc     muž-e     muž        masc.anim.    man
voc     pan-e     pán        masc.anim.    mister
loc     mouš-e    moucha     feminine      fly
inst    –         –

Agreement

Adjectives and possessives agree in gender, number and case with the noun they modify. In Russian, gender agreement is only in the singular. Main verbs agree in person and number with their subjects. Past participles agree with the subject in number and gender (again, in Russian gender agreement is only in the singular).


Word-order

Both Czech and Russian are free-word-order languages. Syntactic relations within a sentence are expressed by inflection and the order of constituents is determined mainly by pragmatic constraints. The theme (roughly old information) usually precedes the rheme (roughly new information). There are, however, certain rigid word order combinations, such as noun modifiers, clitics (in Czech), and negation (in Russian).

Genitive of negation

Modern Czech does not have the "genitive of negation"; in Russian, the genitive of negation is fragmentary.

Lexicon

In terms of the lexical similarities and differences between Russian and Czech, many words are cognate and have a similar etymology (not necessarily Slavonic, but Germanic or Romance). Many other words diverge in their origins, as illustrated by Table 4.6.

Table 4.6. Basic words: Comparison of Czech and Russian

Czech      Russian       Gloss
kreslo     kreslo        chair
stul       stol          table
okno       okno          window
strýcek    djadja        uncle
synovci    plemjannik    nephew

Writing systems

The Russian alphabet is based on a modified version of the Cyrillic alphabet and contains 33 letters (21 consonants, 10 vowels, and 2 symbols indicating whether the preceding consonant in a word is hard or soft, i.e. palatalized or not). Czech uses a version of the Latin alphabet with diacritics. As a result of Jan Hus' reform in the early 1400s, it uses one grapheme for every phoneme with only a few exceptions (ch for [x] and ě for [jE] or [E] with the preceding consonant palatalized; y and i are both pronounced as [I]).

4.1.2 Romance languages

The Romance languages constitute a major branch of the Indo-European language family (see Figure C.2 in Appendix C for a brief overview of their history). Romance languages share a number of linguistic features that set them apart from other Indo-European branches. The following three sections briefly describe the morpho-syntactic properties of Catalan, Portuguese, and Spanish. A slightly more detailed discussion can be found in Appendix C.

Catalan

As in all Romance languages, Catalan nouns distinguish feminine and masculine gender. In addition, Catalan has a special neuter agreement pronoun ho used when no noun has been mentioned.

Most adjectives have distinct forms for masculine and feminine, at least in the singular. There is no direct relation between word endings and gender, although it is true that consonant endings and -e/-o endings tend to represent the masculine, while -a tends to represent the feminine. Exceptions to this 'rule' are so many that it cannot be taken for granted.

Catalan is characterized, like Spanish, Italian, and Portuguese, but unlike French, by the way in which subject pronouns accompany verbs only for particular emphasis. The Catalan pronominal clitics, also generally referred to as weak object pronouns, are unstressed elements which, while having a distinctive grammatical function (expressing verbal complements), are nevertheless reduced phonetically to forming a single unit with the verb that governs them. These weak pronouns can either stand (proclitically) before a verb or be attached (enclitically) after it (e.g. M'ajudes. 'You help me.', Em pots ajudar. 'You can help me.', Ajuda'm. 'Help me.', Has d'ajudar-me. 'You must help me.').

On the basis of their inflectional paradigms, Catalan verbs fall into three classes (with some subdivisions).

In addition to agreement with masculine and feminine gender, Catalan is unusual among Romance languages in having a special neuter agreement. There is a small set of genderless (neuter) demonstrative pronouns: això 'this/that', allò 'that' and ço 'that'. The members of this small set of neuter pronouns, together with the neuter pronominal clitic ho, have two functions. They refer to objects as yet unidentified or not specific enough to be identified by a noun with a gender (e.g. D'on has tret això? 'Where have you got that from?'). The weak pronoun ho represents the direct object 'it' when the direct object complement cannot be identified as a specific noun (e.g. No volia dir-nos el que cercava i era impossible endevinar-ho. 'He wouldn't tell us what he was looking for and it was impossible to guess.').

Portuguese

The morphology and syntax of Portuguese are similar to the grammar of most other Romance languages, especially Galician and the other languages of the Iberian Peninsula.


Portuguese nouns distinguish gender (masculine and feminine) and number (singular and plural). Most Portuguese nouns end in -s when in the plural. But the way -s is added to the singular may vary, according to the way the singular itself ends. The most relevant features of the singular, for purposes of plural formation, are whether or not the last syllable is stressed and what particular consonant or diphthong is found at the end of the word.

Romance languages often use articles where English would not, and Portuguese is particularly extreme in this regard, as it will often use articles before person names, especially in informal registers or if the name includes a title (e.g. A Maria saiu 'Maria left'). Articles also occur before certain country and organization names.

Verb inflections are usually classified into moods, tenses, and impersonal forms. This is similar to the rest of the Romance languages. Portuguese has a number of grammatical features that distinguish it from most other Romance tongues, such as a synthetic pluperfect past verbal tense (e.g. ele tinha falado and ele havia falado 'He had spoken' can also be expressed as ele falara), a future subjunctive tense, and the inflection of the personal infinitive. The future subjunctive is rather uncommon among Indo-European languages. In Portuguese it is used in adverbial subordinate clauses (e.g. Se cantarmos, seremos pagos. 'If we sing, we will be paid.'). In general, verbs are divided into three main conjugation classes according to the ending of their infinitive form, which may be either -ar, -er, or -ir. This is very similar to Spanish and Catalan. There are, of course, numerous irregular forms. Each conjugation class has its own distinctive set of 50 or so inflectional suffixes.

Most adjectives and demonstratives, and all articles, must be inflected according to the gender and number of the noun they reference. Verbs agree in number and person with their subjects.

The word order of Portuguese is relatively flexible compared to English. Adjectives in Portuguese generally follow the noun they modify. European Portuguese is a subject pro-drop language, similarly to Catalan and Spanish, which means that an explicit subject is often dropped. Brazilian Portuguese is both subject and object pro-drop.

Spanish

Spanish is a relatively richly inflected language, with a two-gender system and about fifty conjugated forms per verb. However, the nominal (and pronominal) declension system is relatively simple.

All Spanish nouns have one of two grammatical genders: masculine or feminine (mostly conventional, that is, arbitrarily assigned). Most adjectives and pronouns, and all articles and participles, indicate the gender of the noun they reference or modify. In some cases, the same word can take two genders with a different meaning for each (e.g. el capital 'funds' vs. la capital 'capital city'). Note that the division between uncountable and countable nouns is not as clear-cut as in English.


Spanish verbs are one of the most complex areas of Spanish grammar. Like the other Romance languages, Spanish is a synthetic language with a moderate-to-high degree of inflection, which shows up mostly in the verb conjugation. Spanish verbs are conjugated in four categories known as moods: indicative, subjunctive, conditional and imperative. Each verb also has three non-finite forms: an infinitive, a gerund, and a past participle (more exactly, a passive perfect participle). Verbs are divided into three classes, which differ with respect to their conjugation. The class of the verb can be identified by looking at the infinitive ending.

The Spanish word order is flexible. Usually, the factors that determine Spanish word order are considerations of style, context, and emphasis, similar to the word order rules for Portuguese and Catalan discussed above.

Contrastive study

Table 4.7 provides a brief summary of some important properties of Spanish, Portuguese, and Catalan. They have a complex system of word inflections to indicate syntactic relationships between words. Nouns distinguish gender and number. There are definite and indefinite grammatical articles, derived from Latin demonstratives and the numeral unus ("one").

The verb is inflected to indicate various aspects of the action, such as time, completedness, or continuation. All three languages have the -ar, -er, and -ir conjugation systems.

There is agreement in gender and number within a noun phrase between the noun and its modifiers. Verbs are also inflected according to the grammatical person and grammatical number of the subject, but there is subject-verb agreement in gender. Most Romance languages have polite forms of address that change the person and/or number of 2nd person subjects, such as the tu/vous contrast in French or the tu/lei contrast in Italian.

All the Romance languages are written with the Latin alphabet, subsequently modified and augmented in various ways.

Avoiding strong claims about the three Romance languages described here, it seems that the number of differences between Portuguese and Spanish and between Catalan and Spanish is comparable. Brazilian Portuguese, unlike Spanish or Catalan, optionally omits objects (in addition to subjects). In Brazilian Portuguese, negative particles can either precede or follow main verbs. Its clitics raise only up to the auxiliary (aux<clitic<Vmain). Catalan, in turn, is more like French than like Spanish. It has a negative postverbal clitic pas and a complicated pronominal system. From the lexicon point of view, Portuguese seems to have more words shared with Spanish than with Catalan. The words for many everyday concepts in the three languages do not always coincide, even etymologically. Also, Catalan, unlike Castilian Spanish and Portuguese, exhibits a unique resistance to the agglutination of the Arabic article al- to words, as exemplified in Table C.4.


Table 4.7. Romance: Shallow contrastive analysis (Spanish | Catalan | Portuguese)

fusional: + | + | +
gender: masc, fem, neut | masc, fem, neut | masc, fem
number: 2 | 2 | 2
articles: indef, def | indef, def | indef, def
personal articles: – | + | –
simple tenses: present, preterite, imperfect, future | present, preterite, imperfect, future | present, preterite, imperfect, future, pluperfect
synthetic pluperfect: – | – | +
inflected infinitive: – | – | +
future subjunctive: – | – | +
periphrastic past: – | + | –
subject-verb agreement: gender, number | gender, number | gender, number
adjective-noun agreement: gender, number | gender, number | gender, number
word order: free | free | free
negation: precede V, negative concord | precede V + postverbal optional pas, negative concord | precede/follow V, negative concord
pro-drop: subject | subject | subject, object (BP)
clitics: clitic<aux<Vmain | clitic<aux<Vmain | aux<clitic<Vmain
mesoclisis: – | – | +
adverbial pronouns: – | + | –
preposition dropping: – | + | –


4.1.3 Summary

The discussion in this section is by no means an exhaustive contrastive study of these languages. All the languages that have been described are fusional. As a consequence of having rich inflectional morphology, they display relatively free word order. But since the Slavic morphological systems are more fine-grained and complex, their word order is more flexible than that of the Romance languages.

4.2 Corpora

For the research described in this book, we used a number of corpora. We wanted to stay within the resource-light paradigm, so we intentionally avoided the use of parallel corpora or target-language annotated corpora. For Russian, Czech, and Catalan, we used a small annotated development corpus (Dev) of around 2K tokens to tune our tools. For Portuguese, unfortunately, such a corpus was not available to us. To evaluate the performance of the system, we always tried to obtain the largest test corpus available.

We used the following corpora for each target language (Czech, Russian, Catalan, and Portuguese):

1. Dev – annotated corpus, about 2K (intentionally small). For development testing, testing of hypotheses and for tuning the parameters of our tools; not available for Portuguese.

2. Test – annotated corpus, preferably large. For final testing.

3. Raw – raw unannotated corpus, no limit on size. Used in cognate detection (see section 7.6), to acquire a lexicon (see section 6.3.4), and to get the most frequent words.

4. Train – large annotated corpus. Used to report statistics for the purpose of this book (not used during development); available only for Czech and Catalan.

5. Stat – corpus used to report statistics about the language, its tagset etc. This corpus is equal to the first 1,893 words of the Dev corpus (Test for Russian and Portuguese). The Stat corpora have the same size for all languages to make the statistics comparable; the size was determined by the size of the available Portuguese corpus because Portuguese has the smallest annotated corpus.


and the following corpora for each source language (Czech and Spanish):

1. Train – large annotated corpus. Used to train the source language tagger (i.e. emission and transition probabilities) and to report statistics.

2. Raw – large unannotated corpus. Used in cognate detection and to get the most frequent words.

3. Stat – same properties as Stat above. Based on the initial portion of Dev for Czech and of Train for Spanish.

Table 4.8 summarizes the properties of the corpora we used in our experiments. For each target and source language, we report the size of the training, development, and test corpora, the sources we used, the type of the tagset, and whether a corpus was annotated manually or automatically. Thus, for example, Russian Dev and Test and Portuguese Test were annotated manually. The term positionalized means that we translated a tagset into our positional system described in section 4.4 below. More details about the corpora we used can be found in Appendix B.

The Russian and Portuguese corpora were annotated by us. To facilitate the annotation process, we first run the morphological analyzer, whose output is fed into a special annotation tool. During the manual annotation, the annotator has to choose from several analyses offered by the morphological analyzer for each token, instead of entering all the information from scratch. The window displays the token in question, the context it appears in, and all possible lemma and tag analyses. The tool also provides the option to insert new lemmas and tags manually and make corrections, if needed.

4.3 Tagset design

4.3.1 Types of tagsets

There are many ways to classify morphological tagsets. For our purposes, we distinguish the following three types:

1. atomic (flat in Cloeren 1993) – tags are atomic symbols without any formal internal structure (e.g., the Penn TreeBank tagset, Marcus et al. (1993a)).

2. structured – tags can be decomposed into subtags, each tagging a particular feature.

a) compact – e.g. Multext-East (Erjavec 2004) or Czech Compact tagsets (Hajic 2004).

b) positional – e.g. Czech Positional tagset (Hajic 2004)


Table 4.8. Overview of the corpora

Language            | Corpus | Size  | Source                    | Manual/Automatic tagging       | Tagset
Czech (src/target)  | Dev    | 2K    | PDT 1.0                   | Manual                         | Czech Positional
                    | Test   | 125K  | PDT 1.0                   | Manual                         | Czech Positional
                    | Raw    | 39M   | distributed with PDT 1.0  | N/A                            | –
                    | Train  | 1.5M  | PDT 1.0                   | Manual                         | Czech Positional
                    | Stat   | 1,893 | PDT 1.0                   | Manual                         | Czech Positional
Russian (target)    | Dev    | 1,758 | Orwell's 1984             | Manual (by us)                 | Russian Positional
                    | Test   | 4,011 | Orwell's 1984             | Manual (by us)                 | Russian Positional
                    | Raw    | 1M    | Uppsala                   | N/A                            | –
                    | Stat   | 1,893 | Orwell's 1984             | Manual (by us)                 | Russian Positional
Spanish (src)       | Train  | 106K  | CLiC-TALP                 | Automatic, manually validated  | positionalized CLiC-TALP
                    | Stat   | 1,893 | CLiC-TALP                 | Automatic, manually validated  | positionalized CLiC-TALP
Catalan (target)    | Dev    | 2K    | CLiC-TALP                 | Automatic, manually validated  | positionalized CLiC-TALP
                    | Test   | 20.6K | CLiC-TALP                 | Automatic, manually validated  | positionalized CLiC-TALP
                    | Raw    | 63M   | El Periodico              | N/A                            | –
                    | Train  | 80.5K | CLiC-TALP                 | Automatic, manually validated  | positionalized CLiC-TALP
                    | Stat   | 1,893 | CLiC-TALP                 | Automatic, manually validated  | positionalized CLiC-TALP
Portuguese (target) | Test   | 1,893 | NILC                      | Manual (by us)                 | modified CLiC-TALP
                    | Raw    | 1.2M  | NILC                      | N/A                            | –
                    | Stat   | 1,893 | NILC (= Test)             | Manual (by us)                 | modified CLiC-TALP


4.3.2 Structured tagsets – Tagsets for richly inflected languages

Romance and Slavic languages have rich inflection. Thus, any tagset capturing their morphological features is necessarily large. A natural way to make them manageable is to use a structured system. In such a system, a tag is a composition of tags, each coming from a much smaller and simpler atomic tagset tagging a particular morpho-syntactic property (e.g. gender or tense).

Examples of structured tagsets are what we call positional tagsets and compact tagsets. In both systems, the tags are sequences of values encoding individual morphological features. In a positional tagset, all tags have the same length, encoding all the features distinguished by the tagset. Features not applicable for a particular word have an N/A value. In a compact tagset, the N/A values are left out. Usually, part-of-speech or a similar category (e.g. the so-called SubPOS in the Czech Positional Tagset (Hajic 2004)) determines which values are applicable and which are not.

For example, AAFS4----2A---- in the Czech Positional Tagset and AFS42A in the Czech Compact Tagset (now obsolete) both encode the same information: adjective (A), feminine gender (F), singular (S), accusative (4), comparative (2).

For large tagsets, a structured system has many practical benefits. For example:

1. It is easier to work with for humans. It is much easier to link traditional linguistic categories to the corresponding structured tag than to an unstructured atomic tag. While it takes some time to learn the positions and the associated values of the Czech Positional Tagset, for most people, it is still far easier than learning the corresponding 4000+ tags as atomic symbols.

2. The morphological descriptions are more systematic. In each system, the attribute positions are determined by either POS or SubPOS. Thus, for example, knowing that a token is a common noun (NN) automatically provides information that the gender, number, and case positions should have values.

3. The fact that the tag can be decomposed into individual components has been used in various applications. For instance, the best tagger developed for Czech operates on the subtag level (see the discussion in section 2.5.1).

4. The evaluation of tagging results can be done in a more systematic way. Each category can be evaluated separately on each morphological feature. Not only is it easy to see on which POS the tagger performs the best/worst, but it is also possible to determine which individual morphological features cause the most problems.
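As an illustration of the last benefit, the following minimal sketch (ours, not the evaluation tool actually used in the experiments) compares gold and predicted positional tags position by position; the toy tags are invented purely for the example.

```python
from collections import Counter

def per_position_accuracy(gold_tags, predicted_tags):
    """Compare positional tags position by position.

    Both arguments are lists of equally long tag strings (one tag per token).
    Returns {position_index: accuracy}, counting only tokens where the gold
    tag specifies a value (i.e. is not '-') at that position.
    """
    correct, total = Counter(), Counter()
    for gold, pred in zip(gold_tags, predicted_tags):
        for i, (g, p) in enumerate(zip(gold, pred), start=1):
            if g == '-':          # category not applicable for this token
                continue
            total[i] += 1
            if g == p:
                correct[i] += 1
    return {i: correct[i] / total[i] for i in total}

# Toy example with 15-position tags (values chosen only for illustration):
gold = ['NNFS1-----A----', 'AAFS1----1A----']
pred = ['NNFS4-----A----', 'AAFS1----1A----']
print(per_position_accuracy(gold, pred))   # position 5 (case) gets accuracy 0.5
```

Such a breakdown shows directly which morphological feature (case, gender, tense, …) a tagger gets wrong most often.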

It is also worth noting that it is trivial to view a structured tagset as an atomic tagset (e.g. by assigning a unique natural number to each tag), while the opposite is not true.


In our work, we use positional tagsets, mostly for practical reasons. Our project started with creating a Russian tagger on the basis of Czech, and the Czech corpus was annotated with a positional tagset. Continuing to use a positional tagset for the other languages allowed us to use all the tools developed at that initial stage. Later, we also realized that the positional tagset allowed us to decompose the problem into smaller problems, train a battery of subtaggers and combine them by voting (see chapter 7).

4.3.3 Tagset size and tagging accuracy

Tagsets for highly inflected languages are typically far bigger than those for English. It might seem obvious that the size of a tagset would be negatively correlated with tagging accuracy: for a smaller tagset, there are fewer choices to be made, and thus there is less opportunity for an error.

However, as Elworthy (1995) shows, this is not true. The following trivial example illustrates this. Let's assume a language where determiners agree with nouns in number; determiners are non-ambiguous for number while nouns sometimes are. Consider two tagsets: one containing four tags (singular determiner, plural determiner, singular noun, plural noun), and another containing three tags (determiner, singular noun, plural noun). A bigram tagger will get better accuracy when using the larger tagset: since determiners are non-ambiguous, a determiner is tagged correctly, and it in turn determines the correct tag for the noun. In the smaller tagset, the determiner is also tagged correctly; however, the tag does not provide any information to help in tagging the noun.
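The toy sketch below replays this argument numerically. The tag names and all probabilities are invented solely for illustration; the point is only that with the four-tag tagset the transition row after a number-marked determiner forces the right noun tag, while with the three-tag tagset it does not.

```python
# Toy bigram disambiguation of a number-ambiguous noun after a determiner.
# All tag names and probabilities are invented for illustration.

def best_noun_tag(prev_tag, transitions, candidates):
    """Pick the noun tag with the highest transition probability from prev_tag."""
    return max(candidates, key=lambda t: transitions.get((prev_tag, t), 0.0))

# Larger tagset: determiners carry number, so the transition row is informative.
trans_large = {
    ('DET_SG', 'NOUN_SG'): 0.9, ('DET_SG', 'NOUN_PL'): 0.1,
    ('DET_PL', 'NOUN_SG'): 0.1, ('DET_PL', 'NOUN_PL'): 0.9,
}
print(best_noun_tag('DET_PL', trans_large, ['NOUN_SG', 'NOUN_PL']))  # NOUN_PL

# Smaller tagset: a single DET tag, so both noun readings look equally likely
# and the context cannot disambiguate the noun.
trans_small = {
    ('DET', 'NOUN_SG'): 0.5, ('DET', 'NOUN_PL'): 0.5,
}
print(best_noun_tag('DET', trans_small, ['NOUN_SG', 'NOUN_PL']))  # tie – no help
```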

Elworthy concludes that there is no clear relationship between tagset size and tagging accuracy. Using the HMM terminology, we would say that a smaller tagset means fewer choices and thus emission probabilities are more powerful, while a larger tagset provides more information and thus may make transition probabilities more helpful (assuming there is enough data to actually learn those transitions).

Elworthy suggests that what is important is to choose the tagset required for the application, rather than to optimize it for the tagger. The experiments with subtaggers in chapter 7 are, in a sense, a follow-up to this work, and provide further confirmation of the results. An additional comment that can be made here is that a large tagset could always be reduced to a smaller and less detailed one if the application demands it.

4.3.4 Harmonizing tagsets across languages

Another issue concerns the question of whether tagsets should be harmonized across languages within the same family or languages that have similar properties. Harmonized tagsets make it easier to develop multilingual applications or to evaluate language technology tools across several languages. The process of standardization is interesting from a language-typological perspective as well.


Examples of such tagsets are the tagsets of the Multext-East project. They are based on a common repertoire of grammatical classes (parts of speech) and grammatical categories (e.g. case, person, gender, etc.), and each tagset uses a subset of those grammatical classes or categories. For the Romance languages, the Spanish and the Catalan CLiC-TALP (Civit 2000) tagsets are also very similar; however, some differences are present, e.g., proper nouns are tagged differently. In general, standardized tagsets allow for a quick and efficient comparison of language properties.

Przepiórkowski and Wolinski (2003) notice certain weaknesses with the standardization approach. The relative uniformity of the POS classes across the nine languages of the MULTEXT-EAST project is attained at the cost of introducing the grammatical category ‘type’ whose values reflect the considerable differences between POS systems of the languages involved. In addition, it is not clear that the various grammatical categories and their values have the same interpretation in each language.

This issue is relevant to our work as well. On the one hand, we benefit greatly from harmonized tagsets, because they make the transfer of morphological information across languages much easier. On the other hand, we must be aware of the limitations this brings. We address this issue below in the sections about the tagsets we developed (Russian and Portuguese). In any case, we do not claim that the tagsets used in our experiments are the most adequate linguistic descriptions of the Slavic and Romance languages.

4.4 Tagsets in our experiments

The tagsets we use in all experiments are positional structured tagsets. As explained above, by structured we mean that tags are sequences of symbols each encoding a single category, and by positional we mean that (for a given tagset) the i-th symbol always encodes the same category, regardless of the part-of-speech.

With the exception of the Czech tagset (Hajic 2004), all other tagsets were developed by us. The Catalan and Spanish tagsets are simple transformations of the existing CLiC-TALP tagsets into a positional system. For Portuguese, we used the Spanish tagset with only the most needed modifications. We could not find any tagset designed specifically for Russian at that time, so the Russian tagset was created completely from scratch, using the Czech tagset as a model.

The reason for using positional tagsets instead of the existing compact ones is largely a pragmatic one – when we started to experiment with these languages we already had a range of tools which all required positional tagsets (morphological analyzer, evaluation tools, annotation tools, etc.). Since the translation between a compact and a positional system is a trivial task, we opted for that path, instead of modifying all the tools.

All tagsets follow the basic design features of the Czech positional tagset:

1. The first position specifies POS.


2. The second position specifies Detailed POS (SubPOS).

3. SubPOS uniquely determines POS.

4. SubPOS generally determines which positions are specified (with very few exceptions).

5. The - value meaning N/A or not-specified is possible for all positions except the first two (POS and SubPOS).

4.4.1 Slavic tagsets in our experiments

Czech

In the Czech positional tag system, every tag is represented as a string of 15 symbols, each corresponding to one morphological category. Table 4.9 provides the details. Every value in each category is represented as a single symbol, mostly an uppercase letter of the English alphabet; non-applicable values are denoted by a single hyphen ‘-’. For example, the word vidělo (‘saw.neutr.active’ in Czech) is assigned the tag VpNS---XR-AA--- because it is a verb (V), past participle (p), neuter (N), singular (S), does not distinguish case (-), possessor's gender (-), possessor's number (-), can be any person (X), is past tense (R), is not gradable (-), affirmative (A), active voice (A), and is not a stylistic variant (the last hyphen).

Table 4.9. Positional Tag System for Czech

Position | Abbr | Name       | Description              | Example: vidělo ‘saw’
1        | p    | POS        | part of speech           | V  verb
2        | s    | SubPOS     | detailed part of speech  | p  past participle
3        | g    | gender     | gender                   | N  neuter
4        | n    | number     | number                   | S  singular
5        | c    | case       | case                     | -  n/a
6        | f    | possgender | possessor's gender       | -  n/a
7        | m    | possnumber | possessor's number       | -  n/a
8        | e    | person     | person                   | X  any
9        | t    | tense      | tense                    | R  past tense
10       | d    | grade      | degree of comparison     | -  n/a
11       | a    | negation   | negation                 | A  affirmative
12       | v    | voice      | voice                    | A  active voice
13       |      | reserve1   | unused                   | -  n/a
14       |      | reserve2   | unused                   | -  n/a
15       | i    | var        | variant, register        | -  basic variant
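To make the positional encoding concrete, here is a small sketch of ours (not part of the original toolchain) that decodes a Czech positional tag into a category-value mapping, using the position names of Table 4.9.

```python
# Position names follow Table 4.9 (positions 1-15).
CZECH_POSITIONS = [
    'POS', 'SubPOS', 'gender', 'number', 'case', 'possgender', 'possnumber',
    'person', 'tense', 'grade', 'negation', 'voice', 'reserve1', 'reserve2', 'var',
]

def decode_czech_tag(tag):
    """Turn a 15-character positional tag into a {category: value} dict.

    Positions holding '-' (not applicable / not specified) are skipped.
    """
    assert len(tag) == len(CZECH_POSITIONS), 'Czech positional tags have 15 symbols'
    return {name: value
            for name, value in zip(CZECH_POSITIONS, tag)
            if value != '-'}

print(decode_czech_tag('VpNS---XR-AA---'))
# {'POS': 'V', 'SubPOS': 'p', 'gender': 'N', 'number': 'S',
#  'person': 'X', 'tense': 'R', 'negation': 'A', 'voice': 'A'}
```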

Page 81: A Resource-Light Approach to Morpho-Syntactic Tagging

66 Chapter 4. Languages, corpora and tagsets

Atomic values:
F  feminine
I  masculine inanimate
M  masculine animate
N  neuter

Wildcard values:
X  M, I, F, N   any of the basic four genders
H  F, N         feminine or neuter
T  I, F         masculine inanimate or feminine (plural only)
Y  M, I         masculine (either animate or inanimate)
Z  M, I, N      not feminine (i.e. masculine animate/inanimate or neuter)
Q               feminine (with singular only) or neuter (with plural only)

Figure 4.1. Atomic and wildcard gender values

Thus, unlike what we find in the MULTEXT-EAST tagsets, the position of a particular attribute is the same regardless of the POS. If it is inappropriate for a particular POS (or more precisely SubPOS), it simply has an N/A value (-).

The Czech tagset uses a rather large number of wildcards, i.e. values that cover more than one atomic value. For example, consider gender: as Figure 4.1 shows, there are four atomic values and six wildcard values, covering not only various sets of the atomic values (e.g. Z = {M,I,N}), but in one case also their combination with number values (Q = {FS,NP}).

On the other hand, there are some values appropriate for a single word only. For example, the SubPOS value E is used only for the relative pronoun což, which corresponds to the English which in subordinate clauses.

It is worth noting that the values of detailed part of speech do not always encode the same level of detail. If the values are seen as a separate tagset, it is an atomic tagset which could be naturally expressed as a structured tagset having two positions expressing two levels of detail. For example, there is no single value encoding personal pronouns. Instead, there are three values encoding three different types of personal pronouns: P (regular personal pronoun), H (clitical personal pronoun), and 5 (personal pronoun in prepositional form). Similarly, there are eight values corresponding to relative pronouns, four to generic numerals, etc.

Russian

The Russian tagset we use was developed on the basis of the Czech positional tagset (Hajic 2004). We are aware of the issues associated with harmonized tagsets, discussed in section 4.3.4; however, by using the same tag system, the cross-lingual morpho-syntactic transfer is made much easier.

The tagsets encode the same set of morphological categories in the same order and in most cases do so using the same symbols. However, there are some differences. Many of them are a consequence of linguistic differences between the languages. For example, Russian has neither vocative nor dual, nor does it have auxiliary or pronominal clitics; and the difference between colloquial and official Russian is not as systematic and profound as in Czech. Table 4.10 compares the number of values for individual positions.

Table 4.10. Overview and comparison of the Czech and Russian tagsets

Pos | Description            | Abbr. | No. of values: Czech | Russian
1   | POS                    | p     | 12                   | 12
2   | SubPOS – detailed POS  | s     | 69                   | 45
3   | Gender                 | g     | 11                   | 5
4   | Number                 | n     | 6                    | 4
5   | Case                   | c     | 9                    | 8
6   | Possessor's Gender     | f     | 5                    | 5
7   | Possessor's Number     | m     | 3                    | 3
8   | Person                 | e     | 5                    | 5
9   | Tense                  | t     | 5                    | 5
10  | Degree of comparison   | d     | 4                    | 4
11  | Negation               | a     | 3                    | 3
12  | Voice                  | v     | 3                    | 3
13  | Unused                 |       | 1                    | 1
14  | Unused                 |       | 1                    | 1
15  | Variant, Style         | i     | 10                   | 8

The Russian tagset also uses far fewer wildcards (symbols representing a set of atomic values). Even though wildcards might lead to better tagging performance, we intentionally avoid them. The reason is that they provide less information about the word, which might be needed for linguistic analysis or an NLP application. In addition, it is trivial to translate atomic values to wildcards if needed. The tagset contains only wildcards covering all atomic values (denoted by X for all applicable positions). There are no wildcards covering a subset of atomic values. Forms that would be tagged with a tag containing a partial wildcard in Czech are regarded as ambiguous. For example, the Czech tomto ‘this.masc/neut.loc’ is tagged as PDZS6---------- both in v tomto domě ‘in this house.masc’ and in v tomto místě ‘in this place.neut’. The Russian ètom ‘this.masc/neut.loc’ is tagged as PDMS6---------- in v ètom dome ‘in this house.masc’ and PDNS6---------- in v ètom meste ‘in this place.neut’. More details about the tagset can be found in Appendix A.2.


4.4.2 Romance tagsets in our experiments

We used positional tags for the Romance languages as well. For Spanish and Catalan we translated the structured tags provided by the CLiC-TALP project (http://clic.ub.edu/en/what-is-clic) into our positional system.

Spanish

The Spanish CLiC-TALP system is a structured system, where the attribute positions are determined by POS. The tagset distinguishes common and proper nouns, various types of determiners, auxiliary and main verbs, and so on. In addition, it makes more fine-grained morphological distinctions for mood, tense, person, gender, number, etc., for the relevant categories. This system is similar and easily comparable to the one developed by the CLiC-TALP project for Catalan; it is also easily translatable into the system we developed for Portuguese. We used a positionalized version of this tagset.

Catalan

A tagset for Catalan was developed within the CLiC-TALP project as well. This tagset provides fine-grained morpho-syntactic descriptions. The system includes 289 tags in a structured compact tagset. Number, gender, person, tense, and mood are distinguished as well as subcategories of POS, such as common vs. proper noun, interrogative, relative, personal pronouns, and so on. The system is parallel to that developed for Spanish. Since several experiments in this book deal with projecting morphology from Spanish into Catalan, this system is convenient to use. Nevertheless, the most important reason for choosing this system is that this formalism describes the morpho-syntactic properties of the languages in the most detailed, precise way. Moreover, from a technical perspective, it is a system which is easily translatable into other tag formats.

Portuguese

The NILC corpus (http://nilc.icmc.sc.usp.br/nilc/) is used for the experiments with Portuguese in this book (see chapter 7). The original annotation was created semi-automatically with PALAVRAS (Bick 2000) and was not directly translatable into our positional system — the tagset was not structured and it mixed lexical and syntactic categories (e.g., various types of pronouns and determiners are treated as a general category SPEC (specifiers)). Therefore, we completely ignored the original annotation and created our own detailed morpho-syntactic positional tagset for Portuguese, similar to the ones used for the other languages described in this book.

Table 4.11 suggests that in the majority of cases, the Spanish, Portuguese, and Catalan tagsets use the same values. However, some differences are unavoidable. For instance, the pluperfect is a compound verb tense in Spanish, but a separate word that needs a tag of its own in Portuguese. Notice there are six possible values for the gender position in all the tagsets. These correspond to M (masculine), F (feminine), N (neuter, for certain pronouns), C (common, either M or F), 0 (unspecified for this form within the category), and - (the category does not distinguish gender).

Table 4.11. Overview and comparison of the Romance tagsets

Pos | Description            | Abbr. | No. of values: Spanish | Portuguese | Catalan
1   | POS                    | p     | 14                     | 14         | 14
2   | SubPOS – detailed POS  | s     | 29                     | 30         | 29
3   | Gender                 | g     | 6                      | 6          | 6
4   | Number                 | n     | 5                      | 5          | 5
5   | Case                   | c     | 6                      | 6          | 6
6   | Possessor's Number     | m     | 4                      | 4          | 4
7   | Form                   | o     | 3                      | 3          | 3
8   | Person                 | e     | 5                      | 5          | 5
9   | Tense                  | t     | 7                      | 8          | 7
10  | Mood                   | m     | 7                      | 7          | 7
11  | Participle             | r     | 3                      | 3          | 3

4.4.3 Summary

In this section, we have briefly discussed the tagsets used for the Slavic and Romance languages in our experiments. We have described their sources and our motivation for using the positional tag system, and we have provided information about the size and the structure of each tagset. Appendix A provides further details. Table 4.12 summarizes the size and the number of slots for each source and target language used in our experiments. The corpora used in our research were described in section 4.2 above.

Table 4.12. Overview of the tagsets we use

Language   | Size  | # of tags in Stat corpus | # of positions
Czech      | 4,251 | 216                      | 13 (+2 not used)
Russian    | 1,063 | 179                      | 13 (+2 not used)
Spanish    | 282   | 109                      | 11
Catalan    | 289   | 88                       | 11
Portuguese | 259   | 73                       | 11


Chapter 5

Quantifying language properties

This chapter examines a number of properties of Slavic and Romance languages quantitatively. For comparison, we also provide the same measures for English. The results of the experiments discussed below provide a strong motivation for using the morphological analysis described in this book and for applying an n-gram approach to tagging Slavic and Romance languages.

The language properties in this chapter were obtained from the Stat corpora (see section 4.2). They are all the same size and contain 1,893 tokens. English numbers are based on a fragment of the WSJ corpus (Marcus et al. 1993b). All the graphs and tables come in pairs. The first ones are always based on the Stat corpora annotated with the full tagset, while the second ones are based on the Stat corpora annotated with a reduced tagset. The reduced tagset for the Slavic and Romance languages corresponds to the SubPOS values; the English reduced tagset is the same as its full tagset, except for the punctuation tags. In the reduced tagset, a single tag is used for all punctuation marks. This makes all the reduced tagsets roughly comparable in size.

5.1 Tagset size, tagset coverage

Table 5.1 provides information about the size of the tagsets for the Slavic and the Romance languages used in our experiments, as well as for English.

The potential sparsity problem can be seen by comparing the number of distinct tags that appear in a Stat corpus to the number of tags in the full tagset. Figures 5.1 and 5.2 illustrate the coverage of the tagset in the corpora for each language. The graphs in Figure 5.1 depict the number of distinct tags of the detailed and reduced tagsets seen as the corpus size grows. In the case of the full tagset (especially for the Slavic languages, whose tagsets are the most detailed), the number of new tags continues to grow with the size of the corpus. In the case of the reduced tagset, by contrast, after processing the first 1,000 word tokens, the system stops learning new tags. Figure 5.2 supports the same observation — for English, the percentage of the tagset covered by the corpus stops growing after the first 1,000


Table 5.1. Basic characteristics of Slavic, Romance and English based on the Stat corpora

Full tagset
                                 Czech   Russian  Spanish  Catalan  Portug.  English
Distinct tags in corpus          216     179      110      88       73       38
Tagset size                      4,251   1,027    282      289      259      45
Tokens                           1,893   1,893    1,893    1,893    1,893    1,893
Types                            1,046   993      797      657      622      836
Distinct bigrams                 1,028   845      645      505      476      364
Distinct trigrams                1,171   983      743      584      541      402
H(X)                             6.163   5.685    5.169    4.924    4.689    4.305
I(X;Y)                           2.897   2.438    1.935    2.043    1.599    1.206
Avg tag/token ambiguity          1.158   1.109    1.125    1.110    1.226    1.078
Avg tag/token, context w = -1    1.022   1.019    1.025    1.022    1.045    1.013
Avg tag/token, context w = -2    1.003   1.002    1.004    1.002    1.015    1.004
Avg tag/token, context w = +1    1.009   1.010    1.028    1.020    1.079    1.011
Avg tag/token, context w = +2    1.001   1.002    1.008    1.005    1.009    1.004

Reduced tagset
                                 Czech   Russian  Spanish  Catalan  Portug.  English
Distinct tags in corpus          37      33       28       26       20       31
Tagset size                      69      45       29       29       30       37
Tokens                           1,893   1,893    1,893    1,893    1,893    1,893
Types                            1,046   993      797      657      622      802
Distinct bigrams                 268     254      224      193      174      283
Distinct trigrams                302     286      252      219      194      315
H(X)                             3.453   3.545    3.759    3.646    3.374    4.052
I(X;Y)                           0.614   0.708    0.932    1.153    1.119    1.075
Avg tag/token ambiguity          1.035   1.014    1.106    1.100    1.213    1.074
Avg tag/token, context w = -1    1.008   1.003    1.024    1.024    1.062    1.014
Avg tag/token, context w = -2    1.002   1.001    1.007    1.008    1.032    1.004
Avg tag/token, context w = +1    1.006   1.001    1.046    1.026    1.086    0.012
Avg tag/token, context w = +2    1.002   1.001    1.019    1.006    1.024    1.005

word tokens are processed. More than 80% of the whole tagset is discovered at that point. For the other languages, the discovery of new tags does not proceed as fast. For instance, after nearly 2,000 tokens of the text are processed, more than 90% of the Czech tagset is still unknown.1

1 Even the Czech Train corpus with nearly 1.5M tokens contains only 1,490 distinct tags, or only 35% of the Czech tagset! To a large extent, this is because it is a newspaper corpus and therefore most colloquial declension forms are not present.


Figure 5.1. The number of distinct tags plotted against the number of tokens (two panels: full tagset and reduced tagset; languages: cze, rus, spa, cat, por, eng)


Figure 5.2. The percentage of the tagset covered by a number of tokens (two panels: full tagset and reduced tagset; languages: cze, rus, spa, cat, por, eng)


From the tables and the graphs presented thus far, it is evident that languages such as Czech and Russian require a larger training corpus in order to learn the information about the occurrences of a significant subset of all possible tags. Spanish, Portuguese, and Catalan also show the same pattern, although to a lesser extent than the Slavic languages. They are also more prone to data sparsity compared to English. More training data is needed for these languages as well.

5.2 How much training data is necessary?

The next questions to ask are whether it is indeed necessary to see all the possible tags and how much data can be covered just by a set of the most frequent tags. To explore these issues, the five most frequent tags for each language were selected and the percentage of the corpus which would be covered by such a set was calculated. Figure 5.3 illustrates the results. The results for the detailed tagsets are comparable — a range of 30-50% coverage of the corpus seems to be constant across languages and independent of the size of the corpus (e.g. compare the results for 500 tokens, 1,000 tokens etc.). For Czech, for instance, only 30% of the corpus is covered by the five most frequent tags. For the reduced tagset, the coverage is better, as much as 80%, but generally, the graph shows that the increase in the text size does not affect the text coverage of the five most frequent tags.
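The statistics just described are simple corpus counts. A minimal sketch of the kind of computation involved is given below; it is not the script used for the book's experiments, and the example tag sequence is invented.

```python
from collections import Counter

def tag_statistics(tags, k=5):
    """Basic tag statistics over a sequence of tags (one tag per token).

    Returns the number of distinct tags and the fraction of the corpus
    covered by the k most frequent tags.
    """
    counts = Counter(tags)
    top_k = sum(count for _, count in counts.most_common(k))
    return len(counts), top_k / len(tags)

# Invented toy sequence; in the experiments, `tags` would come from a Stat corpus.
tags = ['NN', 'VB', 'NN', 'DT', 'NN', 'JJ', 'NN', 'VB', 'DT', 'IN']
distinct, coverage = tag_statistics(tags, k=2)
print(distinct, coverage)   # 5 distinct tags; NN + VB cover 6/10 = 0.6
```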

Entropy H(Y) of the tagsets was also measured using the formula in (5.1), where Y denotes a random variable over Tagset and y ∈ Tagset.

(5.1)   H(Y) = \sum_{y \in Y} p(y) \log \frac{1}{p(y)}

Intuitively, entropy is a measure of the size of the ‘search space’ consisting of the possible tags and their associated probabilities. The higher the entropy, the larger the ‘search space’. Table 5.1 gives the results of the entropy calculations for each tagset and language. These entropy scores provide an additional piece of evidence that Czech and Russian, followed by Catalan, Spanish, and Portuguese, are the most challenging languages for tagging (if we use detailed tagsets).

The discussion above suggests that even a large tagset creates a larger ‘search space’. In addition, the figures show that even though the tagsets for morphologically rich languages are larger than the English tagset, the percentage of the corpus covered by the five most frequent tags is only slightly higher for English (see Figure 5.3). To investigate this further, the accession rate for new tags, i.e. the rate at which new tags are discovered as more text is processed, was examined (see e.g. Krotov et al. 1999 for further explanation of accession rates). One might expect that as more text is processed, the number of new tags added per text will be smaller. The accession rate is measured for both the detailed and reduced tagsets. The results, plotted in Figure 5.4, show that tag accession drops significantly after the first 100-200 tokens of the text are processed, but then proceeds at a relatively constant rate throughout the processing of the rest of the 2K-token corpus. Given that


Figure 5.3. The percentage of the corpus covered by the five most frequent tags (two panels: full tagset and reduced tagset; languages: cze, rus, spa, cat, por, eng)


Figure 5.4. Accession rate: p(tag is new) plotted against the number of tokens (two panels: full tagset and reduced tagset; languages: cze, rus, spa, cat, por, eng)


the accession rate is measured on rather small corpora, strong claims cannot be made as to whether the accession rate becomes constant after processing the first 1,500 tokens of text (for the detailed tagset) or the first 800 tokens (for the reduced tagset). Clearly, it slows down significantly, which means that the discovery of new tags does not grow with the size of the corpus. This fact suggests that even though a large tagset requires more training data, it is unclear how much data is actually needed to discover the full tagset.

5.3 Data sparsity, context, and tagset size

How much context contributes to reducing the ‘search space’, and how much the uncertainty about a tag is reduced by knowing the preceding tag, was also measured. For that, the mutual information I(X;Y) is calculated as in (5.2), where X denotes a random variable over the set of tags that occur in the first position of all tag bigrams in a corpus, and Y denotes a random variable over the set of tags that occur in the second position of all tag bigrams in the same corpus. The results of the mutual information calculations are summarized in Table 5.1.

(5.2)   I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
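A minimal sketch of how (5.1) and (5.2) can be estimated from a tagged corpus is given below. It uses maximum-likelihood estimates from raw counts and an invented toy tag sequence; it is meant only to make the formulas concrete, not to reproduce the exact numbers in Table 5.1.

```python
import math
from collections import Counter

def entropy(tags):
    """H(Y) = sum_y p(y) log(1/p(y)), with p estimated by relative frequency."""
    counts = Counter(tags)
    n = len(tags)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mutual_information(tags):
    """I(X;Y) over adjacent tag pairs: sum_{x,y} p(x,y) log(p(x,y) / (p(x) p(y)))."""
    pairs = list(zip(tags, tags[1:]))
    pair_counts = Counter(pairs)
    first = Counter(x for x, _ in pairs)
    second = Counter(y for _, y in pairs)
    n = len(pairs)
    return sum((c / n) * math.log2((c / n) / ((first[x] / n) * (second[y] / n)))
               for (x, y), c in pair_counts.items())

# Invented toy sequence, just to exercise the formulas.
tags = ['DT', 'NN', 'VB', 'DT', 'NN', 'VB', 'DT', 'JJ', 'NN', 'VB']
print(round(entropy(tags), 3), round(mutual_information(tags), 3))
```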

The higher the I(X;Y) score is, the more dependent the current tag is on the previous tag. Comparing the mutual information scores for the detailed and reduced tagsets for the inflected languages clearly shows that the dependence is greater in the case of the detailed tagset. This means that by reducing the tagset for inflected languages, important information is lost about agreement features (gender, number, case, etc.), which might, in turn, bring about a reduction in overall tagging accuracy (as is indeed reported in Elworthy 1995). Among the five inflected languages used in the experiments, Portuguese is the one for which the knowledge about the preceding tag helps the least. This fact suggests that a tagging approach which relies on the preceding context will be less effective for Portuguese than for the rest of the languages. For the reduced tagsets, the greatest dependency between two tags is for English, a morphologically poor language. For the other five languages, a preceding tag is not as helpful for predicting the current tag when the reduced tagset is used.

5.4 Summary

In this chapter, we investigated some properties of Slavic and Romance languages and their tagsets. These properties were compared with those of English, a well-known and well-studied case. The comparison once again proves that the data sparsity problem for the languages with a large tagset is real. This is observable in the relationship between the corpus size and the number of new tags discovered. This is an expected observation. The surprising outcome of the experiments is that for Catalan, Czech, Portuguese, and Russian, the knowledge about the preceding tag (tag n−1)


reduces the uncertainty about tag n if the detailed tagset is used. Recall that the detailed tagset contains the information about case, gender, number, and other important agreement features. But when the tagset is reduced to the size of the English tagset (eliminating the detailed information), the mutual information score drops significantly for the inflected languages. Compared to the English case, it seems that the two reduced adjacent tags for the Slavic and Romance languages are relatively independent of each other. This fact suggests that using a detailed tagset for languages such as Czech or Portuguese is beneficial and that a reduction in the tagset will not necessarily lead to better tagging results. In addition, even though the inflected languages are considered to be relatively word-order free, the adjacent information seems to be helpful for reducing the tag/token ambiguity. This is another interesting result of the investigation, which supports the existence of a relatively fixed order of syntactic constituents in so-called “free word order” languages and provides an additional argument in favor of using the n-gram techniques for tagging these languages (see chapter 2).


Chapter 6

Resource-light morphological analysis

In this chapter, we introduce both the general framework for doing and training resource-light morphological analyses and its instantiation for Czech. We also discuss the modifications needed for analyzing other languages, namely Russian, Portuguese and Catalan, the languages used in the tagging experiments described in the following chapter.

6.1 Introduction

This chapter describes a knowledge- and labor-light system for morphological analysis. Our approach takes the middle road between completely unsupervised systems à la Goldsmith (2001) on the one hand and systems with extensive manually-created resources à la Hajic (2004) on the other. These approaches are scientifically interesting and there are cases when they are also practically justifiable (e.g. the former for analyzing understudied languages and the latter for applications requiring very high precision). However, we believe that for the majority of languages and the majority of purposes neither of these extreme approaches seems warranted. The knowledge-free approach still lacks precision and the knowledge-intensive approach is usually too costly. We show that a system that uses a little knowledge can be effective. We exploit the 80:20 rule: the part of the work that is easy to do and that matters most is done manually or semi-automatically and the rest is done automatically.

Czech this way? We use Czech to test our hypotheses. We do not suggest that morphological analysis of Czech should be designed exactly in the way we do. An excellent high precision system using manual resources1 already exists (Hajic

1 We use the term manual resources to refer to manually-created resources, automatic resources to automatically created resources (with possibly some


2004). The main reason for working with Czech is that we can easily evaluate our system on the Prague Dependency Treebank – a large morphologically annotated corpus (http://ufal.mff.cuni.cz/pdt).

However, no manual resources, including those of Hajic (2004), can cover arbitrary text – there is an unbounded universe of names (people, products, companies, musical groups, . . . ), technical terms, neologisms, quotes from other languages, typos, . . . We suggest that for languages such as Czech and Russian, morphological analysis should rely on extensive manual resources backed up by a system similar to ours. Less dense languages (e.g. Sorbian (Lusatian), Czech used in chat-rooms or in any other specialized settings, etc.) can use less of the expensive manual resources and more of the automatic or semi-automatic resources.

The system. For our work, we developed an open, flexible, fast and portable system for morphological analysis. It uses a sequence of analyzing modules. Modules can be reordered, added or removed from the system. And although we provide a basic set of analyzing modules, it is possible to add other modules for specific purposes without modifying the rest of the system. The modules we provide are re-usable for both resource-light and resource-intensive approaches, although the latter option is not explored in detail here.

Nouns only. In the rest of this chapter we focus exclusively on Czech nouns. We have several reasons for this:

1. they are hard for the unsupervised systems, because their endings are highly homonymous (at least in Slavic languages);

2. they are the class where the manually-created resources approach fails the most – they are the most open class of all (consider proper names);

3. for practical reasons, we have to limit the scope of our work.

Data and glosses. The corpora we use and their labels are discussed in section 4.2 and Appendix B. Section 4.1.1 provides an overview of the way nominal morphological categories are abbreviated.

6.2 Motivation – Lexical statistics of Czech

To motivate our approach, we provide some statistics about Czech nouns, assuming that nouns in other Slavic languages behave similarly. The statistics are based on the Train1 and Train2 corpora (see section 4.2). The Train1 corpus contains

minor manual input) and semi-automatic resources to automatic resources manually corrected (fully or partially).


222,304 noun tokens (out of 619,984 tokens in total), corresponding to 42,212 distinct forms (out of 87,321) and 23,643 lemmas (out of 43,056).2

Table 6.1 and Figure 6.1 break lemmas into deciles by their frequency and compare their corpus coverage. In a fashion similar to Zipf's law (Zipf 1935, 1949), they make two things apparent:

• It is quite easy to get a decent coverage of a text with a small number of high frequency lemmas. The 2.4K lemmas in the 10th decile cover 3/4 of the noun tokens in the corpus, and 7.1K lemmas in the top three deciles cover nearly 90% of all noun tokens. That means that even in labor-light systems, it is not necessary to go the way of completely automatically acquired morphology.

• It is practically impossible to get a perfect coverage of a running text even with very large lexicons.

– First, the lemmas in each of the lower deciles add relatively little additional coverage.

– Second, infrequent lemmas also tend to be text specific. 77% of the lemmas in the lowest decile of the Train1 corpus did not occur in the Train2 corpus – even though the corpora are very similar (they both consist of texts from the same newspapers and magazines). Even when we take the first half of the lemmas (deciles 1-5), 70% of the lemmas are text specific!

These facts justify our approach – to provide manually a small amount of information that makes the most difference and let the system learn the rest. This makes it possible to keep the amount of necessary labor close to that of the unsupervised system, with quality not much worse than that of the expensive system with manual resources.

6.3 A Morphological Analyzer of Czech

In this section, we introduce both the general framework for doing and training resource-light morphological analysis and its instantiation on Czech. Application to other languages is discussed in section 6.4.

2 The lemmas in Train1 (and in the whole PDT) distinguish not only between homonyms but often also between words related by polysemy. For example, there are at least four different lemmas for the word strana: strana-1 ‘side (in space)’, strana-2 ‘political party’, strana-3 ‘(contracting) party, (on somebody's) side, ..’, strana-4 ‘page’. All four have the same morphological properties – it is a feminine noun, paradigm žena. While these statistics treat them as four distinct entities, our Guesser and automatically acquired lexicons do not distinguish between them. However, the statistics are still valid, because only relatively few lemmas have such a distinction.


Table 6.1. Corpus coverage by lemma frequency

Lemma freq decile | Number of tokens | Corpus coverage (%) | Cumulative coverage (%) | Lemmas not in Train2 (%)
10 | 164,643 | 74 | 74  | 0.2
9  | 22,515  | 10 | 84  | 7
8  | 11,041  | 5  | 89  | 22
7  | 6,741   | 3  | 92  | 36
6  | 4,728   | 2  | 94  | 48
5  | 3,179   | 1  | 96  | 61
4  | 2,365   | 1  | 97  | 65
3  | 2,364   | 1  | 98  | 70
2  | 2,364   | 1  | 99  | 75
1  | 2,364   | 1  | 100 | 77

Notes: Each decile contains 2,364 or 2,365 noun lemmas. Cumulative coverage for the i-th decile shows corpus coverage for all lemmas in the i-th to 10th deciles. Percentages do not add up to 100 due to rounding.

Figure 6.1. Lemma characteristics by frequency (Train1 corpus coverage and the percentage of lemmas not present in Train2, by lemma frequency percentile)


We discuss the analyzer in general; the strategy of using it (section 6.3.2); how morphological paradigms are seen by a linguist and how by our system (section 6.3.3); and the automatic creation of large lexicons (section 6.3.4). Finally, we evaluate the whole system (section 6.3.5).

6.3.1 Morphological analyzer

Morphological analysis is a function that assigns a set of lemmas (base forms), each with a set of tags, to a form:

(3)  MA: form → set(lemma × set(tag))

     ženou → { ( žena ‘woman’, { noun fem sg inst } ),
               ( hnát ‘hurry’, { verb pres pl 3rd } ) }

     ženy  → { ( žena ‘woman’, { noun fem sg gen,
                                 noun fem pl nom,
                                 noun fem pl acc,
                                 noun fem pl voc } ) }
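In code, such a function can be represented as a mapping from word forms to sets of lemma-tagset pairs. The sketch below is ours, not the book's actual implementation, and the short human-readable tags simply mirror example (3); real analyses would use positional tags.

```python
# Morphological analysis as a mapping: form -> set of (lemma, frozenset of tags).
# The toy entries mirror example (3).
MA = {
    'ženou': {
        ('žena', frozenset({'noun fem sg inst'})),
        ('hnát', frozenset({'verb pres pl 3rd'})),
    },
    'ženy': {
        ('žena', frozenset({'noun fem sg gen', 'noun fem pl nom',
                            'noun fem pl acc', 'noun fem pl voc'})),
    },
}

def analyze(form):
    """Return the set of (lemma, tags) analyses for a form (empty set if unknown)."""
    return MA.get(form, set())

print(analyze('ženou'))
```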

Our goal was to design an open, fast, portable and easily configurable morphological analyzer. It is a modular system that queries its analyzing modules in a particular order. Any module can be loaded several times with different parameters (say, different lexicons). A module receives information about the word, its potential prefixes and its context (currently just the preceding word with its analysis, and the following word). The module returns zero or more analyses. An analysis must contain information about a lemma and a tag. Depending on the mode the morphological analyzer is run in, it can also contain additional information, like a paradigm name, ending length, etc.

6.3.2 General Strategy

We focus our work and knowledge on creating a limited amount of resources that make the most difference and that are easy to create. The rest is done automatically. The system uses a mix of modules with various levels of precision and invested effort. The modules are run in a cascading way. Modules that make fewer errors and overgenerate less are run before modules that make more errors and overgenerate more. Modules on the subsequent level are used for analysis only if the modules from the previous level did not succeed (although this is configurable).

The system contains three types of modules (in addition, there are specialized modules for handling numbers, abbreviations, symbols, etc.); a schematic sketch of the resulting cascade follows the list below:

1. Simple word lists – each word form is accompanied by information about its lemma and tags.

2. Guesser – analyzes words using only information about paradigms.


On the plus side, (1) the Guesser has a high recall and (2) is very labor-light – it is enough to specify the paradigms. However, the disadvantages are that (1) it has a low precision (overgenerates a lot) and (2) it is quite slow – there are too many things to check and perform on too many analyses.

3. Lexicons – analyzes words using a lexicon and a list of paradigms. Lexicon-based analysis has just the opposite properties of the Guesser. It requires a lexicon, which is usually very costly to produce. However, (1) only analyses that match the stem in the lexicon and its paradigm are considered; (2) it is very fast, because stem changes, etc. can be computed in advance and be simply listed in the lexicon. The problem of the costly lexicon is partly addressed in section 6.3.4.
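The sketch below shows one way such a cascade could be wired together; it is a simplified illustration under our own naming, not the actual analyzer code. Each module is simply a function from a form to a set of analyses, the module order reflects reliability (word list, then lexicon, then guesser), and the toy entries and tags are invented.

```python
def cascade(modules, form):
    """Query analyzing modules in order; return the analyses of the first module
    that succeeds (a non-empty set), or an empty set if all modules fail."""
    for module in modules:
        analyses = module(form)
        if analyses:
            return analyses
    return set()

# Toy stand-ins for the three module types (real modules would consult the
# manual word list, the acquired lexicon and the paradigm-based guesser):
WORD_LIST = {'ženou': {('žena', 'NNFS7-----A----')}}   # manual, most reliable
LEXICON   = {'atomu': {('atom', 'NNIS2-----A----')}}   # acquired automatically

def word_list_module(form):
    return WORD_LIST.get(form, set())

def lexicon_module(form):
    return LEXICON.get(form, set())

def guesser_module(form):
    # Overgenerating safety net: here it just guesses a noun reading for any form.
    return {(form, 'NN??------A----')}

modules = [word_list_module, lexicon_module, guesser_module]
print(cascade(modules, 'ženou'))    # found by the word list
print(cascade(modules, 'neznámé'))  # falls through to the guesser
```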

Traditional labor-intensive systems use information about paradigms together with a large lexicon, possibly backed up by a guesser (e.g. Hajic 2004; Mikheev and Liubushkina 1995). Word lists are usually used for languages with simple inflectional morphology like English. It might seem obvious that for Czech, a language with seven cases, two numbers and four genders, form lists are out of the question. However, in practice only a few lemmas occur in a larger number of forms. Table 6.2 summarizes the distribution of lemma occurrences in the Train1 corpus in terms of the number of encountered forms. It can be seen that 64% of the lemmas occur only in one form and 93% of lemmas have four or fewer forms.

Table 6.2. Noun lemma distribution by the number of forms in the corpus

Nr of forms | Lemmas: Count | Percentage | Cumulative percentage
1     | 15,192 | 64   | 64
2     | 4,155  | 18   | 82
3     | 1,807  | 8    | 89
4     | 948    | 4    | 93
5-9   | 1,523  | 7    | 100
10-17 | 18     | 0.08 | 100
Total | 23,643 | 100  |

Notes: Cumulative percentage for n forms shows what percentage of lemmas has n forms or less. Percentages do not add up to 100 due to rounding.

Entering a lexicon entry is very costly. While it is usually easy (for a native speaker) to assign a lemma to one of the major paradigm groups, it takes considerably more time to select the exact paradigm variant differing only in one or two forms (in fact, this may even be idiolect-dependent). For example, it is easy to see that atom ‘atom’ does not decline according to the neuter paradigm město ‘town’, but it takes more time to decide to which of the hard masculine inanimate paradigms it belongs


(see Table 6.3). On the other hand, entering possible analyses for individual word forms is usually very straightforward.

Table 6.3. Forms of atom ‘atom’ and the hard masculine inanimate paradigms

                   hard masculine inanimate paradigms
     atom        hrad        ostrov       rybník          zámek       domeček
     ‘atom’      ‘castle’    ‘island’     ‘pond’          ‘chateau’   ‘small house’
S1   atom-0      hrad-0
S2   atom-u      hrad-u      ostrov-u/a   rybník-u/a
S3   atom-u      hrad-u
S4   atom-0      hrad-0
S5   atom-e      hrad-e                                   zámk-u      domečk-u
S6   atom-u      hrad-e/u                 rybníc-e/ík-u   zámk-u
S7   atom-em     hrad-em
P1   atom-y      hrad-y
P2   atom-ů      hrad-ů
P3   atom-ům     hrad-ům
P4   atom-y      hrad-y
P5   atom-y      hrad-y
P6   atom-ech    hrad-ech                                 zámc-ích    domečc-ích/čk-ách
P7   atom-y      hrad-y

Therefore, our system uses a list of manually entered analyses for the most common forms, an automatically acquired lexicon for less common words, and finally, the ending-based guesser as a safety net covering the rest.

Note that the process of providing the list of forms is not completely manual – a native speaker selects the correct analyses from those suggested by the ending-based guesser. Analyses of closed-class words can be entered by a non-native speaker on the basis of a basic grammar book. Finally, there is the possibility to manually process the automatically acquired lexicon: a native speaker removes the most obvious errors for the most frequent lexical entries. They remove errors that are easy to identify and that have the highest impact on the results of the system. We did not use this possibility when building the analyzer for Czech, but we did use it when annotating development corpora for Portuguese and Russian.

6.3.3 Czech paradigms

Czech paradigms seen by a linguist

Simply put, in a fusional language like English or Czech, a paradigm is a set of endings with their tags, e.g. 0 – noun singular, s – noun plural. The endings are added to stems producing word forms characterized by those tags, e.g. cat – noun singular, cats – noun plural. However, life is not easy, and the concatenation is often accompanied by various more or less complicated phonological/graphemic processes affecting the stem, the ending or both, e.g. potato-es, countri-es, kniv-es, etc.

As a more complex illustration, consider several examples of Czech nouns belonging to the žena ‘woman’ paradigm, a relatively ‘well-behaved’ paradigm of feminine nouns, in Table 6.4.

Table 6.4. Examples of the žena paradigm nouns

     woman     owl       draft      goat      iceberg   vapor     fly
S1   žen-a     sov-a     skic-a     koz-a     kr-a      pár-a     mouch-a
S2   žen-y     sov-y     skic-i     koz-y     kr-y      pár-y     mouch-y
S3   žen-ě     sov-ě     skic-e     koz-e     kř-e      pář-e     mouš-e
S4   žen-u     sov-u     skic-u     koz-u     kr-u      pár-u     mouch-u
S5   žen-o     sov-o     skic-o     koz-o     kr-o      pár-o     mouch-o
S6   žen-ě     sov-ě     skic-e     koz-e     kř-e      pář-e     mouš-e
S7   žen-ou    sov-ou    skic-ou    koz-ou    kr-ou     pár-ou    mouch-ou
P1   žen-y     sov-y     skic-i     koz-y     kr-y      pár-y     mouch-y
P2   žen-0     sov-0     skic-0     koz-0     ker-0     par-0     much-0
P3   žen-ám    sov-ám    skic-ám    koz-ám    kr-ám     pár-ám    mouch-ám
P4   žen-y     sov-y     skic-i     koz-y     kr-y      pár-y     mouch-y
P5   žen-y     sov-y     skic-i     koz-y     kr-y      pár-y     mouch-y
P6   žen-ách   sov-ách   skic-ách   koz-ách   kr-ách    pár-ách   mouch-ách
P7   žen-ami   sov-ami   skic-ami   koz-ami   kr-ami    pár-ami   mouch-ami

Without going too deeply into linguistics, we can see several complications:

1. Ending variation: žen-ě, sov-ě vs. burz-e, kř-e, pář-e; žen-y vs. skic-i. The dative and locative sg. ending is -ě after alveolar stops (d, t, n) and labials (b, p, m, v, f). It is -e otherwise. Czech spelling rules require the ending -y to be spelled as -i after certain consonants, in this case: c, č, ď, ň, š. The pronunciation is the same ([I]).

2. Palatalization of the stem-final consonant: kr-a – kř-e, mouch-a – mouš-e. The -e/-ě ending affects the preceding consonant: ch [x] → š, g/h → z, k → c, r → ř.

3. Epenthesis: kr-a – ker. Sometimes, there is an epenthesis in the genitive plural. This usually happens when the noun ends with particular consonants. There are certain tendencies, but in the end it is just a property of the lexeme; cf. občank-a – občanek ‘she-citizen, id-card’ vs. bank-a – bank ‘bank’ (both end with nk, but one epenthesizes and the other does not). Some nouns allow both possibilities, e.g. jacht-a – jachet/jacht ‘yacht’.

4. Stem-internal vowel shortening: pár-a – par.
Often the vowels á, í, ou shorten to a, i/e, u in the genitive plural and sometimes also in the dative, locative and instrumental plural. If the vowel is followed by multiple consonants in the nominative singular, the shortening usually does not happen. In many cases there are both short and long variants (pár-a – par – pár-ám/par-ám, pár-ách/par-ách, pár-ami/par-ami ‘vapor’), which usually differ stylistically.

It would be possible to discuss all the Czech (noun) paradigms in a similar manner. Depending on how you count, there are roughly 13 basic paradigms – four neuter, three feminine and six masculine; plus there are nouns with adjectival declension (another two paradigms). In addition, there are many subparadigms and subsubparadigms, all of which involve a great amount of irregularity and variation on the one hand and a great amount of homonymy on the other (see Tables 4.4 and 4.5). For a more detailed discussion, see for example Karlík et al. 1996; Fronek 1999.

Czech paradigms seen by an engineer

There are two different ways to address phonological/graphemic variations and complex paradigm systems when designing a morphological analyzer:

• A linguistic approach. Such a system employs a phonological component accompanying the simple concatenative process of attaching an ending. This implies a smaller set of paradigms and morphemes. Two-level morphology (Koskenniemi 1983, 1984) is an example of such a system, and an example for Czech can be found in Skoumalová (1997). The problem is that implementing the morphology of a language in such a system requires a lot of linguistic work and expertise. For many languages, the linguistic knowledge is not precise enough. Moreover, it is usually not straightforward to translate even a precisely formulated linguistic description of a morphology into the representation recognized by such a system.
In Czech, the forms of the noun kra ‘icebergFS1’, kře ‘icebergFS36’, ker ‘icebergFP2’ etc. (see Table 6.4) would be analyzed as involving the stem kr, the endings -a, -e and -0 and phonological/graphemic alternations. Forms of the noun žena ‘womanFS1’ (ženě ‘FS36’, žen ‘FP2’, etc.) would belong to the same paradigm as kra.

• An engineering approach. Such a system does not have a phonological component, or the component is very rudimentary. Phonological changes and irregularities are factored into endings and a higher number of paradigms.


This implies that the terms stem and ending have slightly different meanings from the ones they traditionally have. A stem is the part of the word that does not change within its paradigm, and the ending is the part of the word that follows such a stem.
Examples of such an approach are Hajič (2004) for Czech and Mikheev and Liubushkina (1995) for Russian. The previous version of our system (Hana et al. 2004) also belongs to this category. The advantages of such a system are its high speed, simple implementation and straightforward morphology specification. The problems are a very high number of paradigms (several hundred in the case of Czech) and the impossibility of capturing even the simplest and most regular phonological changes and thus predicting the behavior of new lexemes.
For example, the English noun paradigm above (0 – s) would be captured as several paradigms including 0 – s, 0 – es, y – ies, f – ves.
In Czech, the forms of the noun kra ‘icebergFS1’ would be analyzed as involving the stem k followed by the endings -ra, -ře and -er. Forms of the nouns žena ‘womanFS1’ and kra would belong to two different paradigms.

Our system is a compromise between these two approaches. It allows some basic phonological alternations (changes of a stem-tail³ and a simple epenthesis), but in many cases our endings and stems are still different from the linguistically motivated ones. Therefore, many of the paradigms are still technical. Currently, our system is capable of capturing all of the processes described above except the stem-internal vowel shortening:

1. Ending variation: A paradigm can have several subparadigms. There are three paradigms corresponding to the linguistic paradigm žena (see Table 6.4): NFzena, its subparadigm NFkoza and its subparadigm NFskica.

– A subparadigm specifies only the endings that are different from the main paradigm. NFkoza is like NFzena but has -e in S3 and S6; NFskica is like NFkoza but has -i in S2, P1, P4 and P5.

– Each (sub)paradigm can restrict which words decline according to it by specifying the possible stem-tails of these words. For example, NFzena requires the stems to end in consonants that can be followed by ě (i.e. b, d, f, m, n, p, t, v), while NFskica requires the stems to end in so-called soft consonants (i.e. c, č, ď, j, ň, ř, š, ť, ž). Note that these restrictions do not need to (and usually do not) uniquely assign word forms to paradigms. Moreover, in resource-light settings, they are likely to be far less specific than they theoretically could be.

3 We use the term tail to refer to a final sequence of characters of a string. We reserve the word ending to refer to those tails that are morphemes (in the traditional linguistic sense or in our technical sense).


2. Palatalization: A paradigm can specify a simple replacement rule for changing stem-tails. For example, the paradigm NFkoza says that stem-final ch changes to š in S3 and S6.

3. Epenthesis: An ending can be marked as allowing epenthesis. All three paradigms allow epenthesis in P2.

The current paradigm module cannot capture stem vowel changes. Therefore, the Guesser analyzes such forms incorrectly. It still provides the correct tags but not the correct lemma. For example, par is analyzed as a form of the incorrect lemma para instead of the correct pára; the tag NNFP2-----A---- is correct.
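To make the engineering-style specification more concrete, the following sketch shows how an ending-based guesser might apply such paradigm descriptions. The paradigm data, the abbreviated tags, and the function names are illustrative simplifications of what is described above, not the actual implementation.

    # Sketch of an ending-based Guesser applying simplified paradigm specifications.
    # The data below is an illustrative fragment only: the real system has 64 noun
    # paradigms with subparadigms, stem-tail restrictions and replacement rules,
    # and uses full positional tags rather than the abbreviated ones shown here.

    PARADIGMS = {
        "NFzena": {
            "endings": {"a": "NNFS1", "y": "NNFS2", "e": "NNFS3", "u": "NNFS4"},
            "stem_final": set("bdfmnptv"),   # stems whose final consonant allows the S3/S6 ending
            "replace": {},                   # no palatalization rule
        },
        "NFkoza": {
            "endings": {"a": "NNFS1", "y": "NNFS2", "e": "NNFS3", "u": "NNFS4"},
            "stem_final": set("zskrchg"),
            "replace": {"š": "ch"},          # surface š before -e comes from stem-final ch
        },
    }

    def guess(form):
        """Return (lemma, paradigm, tag) candidates for a form, purely ending-based."""
        analyses = []
        for name, p in PARADIGMS.items():
            for ending, tag in p["endings"].items():
                if not form.endswith(ending):
                    continue
                stem = form[: len(form) - len(ending)]
                if not stem:
                    continue
                # undo the palatalization recorded in the paradigm, if any
                for surface, underlying in p["replace"].items():
                    if stem.endswith(surface):
                        stem = stem[: -len(surface)] + underlying
                # check the stem-tail restriction of the (sub)paradigm
                if stem[-1] not in p["stem_final"]:
                    continue
                analyses.append((stem + "a", name, tag))   # lemma = nominative singular
            # (epenthesis and further subparadigms are omitted for brevity)
        return analyses

    print(guess("mouše"))   # -> [('moucha', 'NFkoza', 'NNFS3')]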

Our system specifies 64 noun paradigms (still not exploiting all the possibilities) and 14 common paradigms for adjectives and verbs. The choices of what to cover involve a balance between precision, coverage and effort. More work would be somewhat beneficial, but our goal is to stop improving the language specification before the return on effort becomes too low.

Paradigms and Lexicons

A lexicon entry contains information about the lemma, its paradigm and stem or stems. The Lexicon-based Analyzer does not require the information about stem changes that the Guesser uses, but instead refers to the stems listed directly in the lexicon entry. This not only speeds up the processing but also makes it possible to capture phonological changes or irregularities that the Guesser is currently unable to handle, including the stem vowel changes mentioned above. Table 6.5 lists several lexicon entries; for most of them the full declensions can be found in Table 6.4. Stem2 is used in the genitive plural (P2) for all paradigms. This stem expresses epenthesis (chodb → chodeb) and stem vowel shortening (pár → par). Entries belonging to the NFskica or NFkoza paradigms can specify a third stem used in the dative and locative singular (S3, S6). This stem expresses palatalization (mouch → mouš).

Table 6.5. Examples of lexical entries for some nouns of the žena paradigm

lemma    gloss     paradigm  stem1   stem2    stem3
žena     woman     NFzena    žen     =1       —
sova     owl       NFzena    sov     =1       —
chodba   corridor  NFzena    chodb   chodeb   —
skica    draft     NFskica   skic    =1       =1
koza     goat      NFkoza    koz     =1       =1
kra      iceberg   NFkoza    kr      ker      kř
pára     vapor     NFkoza    pár     par      pář
moucha   fly       NFkoza    mouch   much     mouš
váha     weight    NFkoza    váh     =1       váz


6.3.4 Lexicon acquisition

The morphological analyzer supports a module or modules employing a lexicon containing information about lemmas, stems and paradigms. There is always the possibility to provide this information manually. That, however, is very costly. In this section we describe how to acquire a lexicon approximation from a large raw corpus.

This approach differs from the work by Mikheev (1997) or Hlaváčová (2001). Mikheev’s algorithm attempts to acquire a lexicon that would cover forms not covered by a large manually created lexicon. Similarly, Hlaváčová (2001) describes a guesser that acquires rules for analyzing unknown words on the basis of a large set of known words (it associates tails, usually endings, often preceded by a final part of a stem, with tags). In other words, in both cases it is assumed that a manually created lexicon covers most of the text and the automatically created lexicon or rules are used only as a backup. In our case, it is the main lexicon that is acquired automatically (note that our form lists are significantly smaller than the lexicons used in Mikheev 1997 or Hlaváčová 2001).

General idea

The general idea is very simple. The ending-based Guesser module overgenerates. Part of the ambiguity is usually real, but most of it is spurious. We can use a large corpus to weed out the spurious analyses. In such a corpus, open-class lemmas are likely to occur in more than one form. Therefore, if a lemma-stem-paradigm candidate suggested by the Guesser occurs in other forms in other parts of the corpus, this increases the likelihood that the candidate is real, and vice versa.

To make it more concrete: if we encounter the word talking in an English corpus, using the information about paradigms, we assume that it is either the -ing form of the lemma talk or that it is a monomorphemic word (such as sibling). Based on this single form we cannot really say more. However, if we also encounter the forms talk, talks and talked, the former analysis seems more probable; and therefore, it seems reasonable to include the lemma talk as a verb into the lexicon. If we also encountered talkings, talkinged and talkinging, we would include both lemmas talk and talking as verbs.

Examples and problems

We can use our Morphological Analyzer to analyze all the words in the corpus and then create all the possible hypothetical lexical entries consistent with these analyses. After that, we would like to run some filtering that would drop most of the bad entries and leave a small number of entries that would include the good ones. In this subsection, we discuss some of the problems associated with such a filtering.


Let’s consider for example the lemma podpora ‘support’. It is a feminine noun belonging to (a variant of) the žena paradigm. The Raw corpus contains 8,138 tokens of this lemma in nine forms – see Table 6.6.⁴ There are 192 (!) ways to assign a lemma and a paradigm to various subsets of these forms (see Table 6.7). Most of them sound very funny to a native speaker; only a minority sounds funny to an average learner of Czech; none sounded funny to our Guesser. In this case, we are lucky that we got nearly all the forms of the paradigm; only the vocative singular form is missing, which is not very surprising.

Table 6.6. Forms of the lemma podpora in the Raw corpus

forms        possible case     occurrences
podpor-a     S1                810
podpor-y     S2, P1, P4, P5    1,633
podpoř-e     S3, S6            782
podpor-u     S4                4,128
podpor-o     S5                0
podpor-ou    S7                625
podpor-0     P2                123
podpor-ám    P3                11
podpor-ách   P6                20
podpor-ami   P7                6
podporaa     typo              1

Table 6.7. Candidate entries for podpora forms

# of covered forms    # of entries
9                     1
6                     2
5                     2
4                     8
3                     7
2                     3
1                     169

4 We ignore all colloquial forms.

We could select the hypothetical entry that has the highest number of forms. While it would be the correct choice in this case, this strategy would not work in all cases. Consider for example the noun bezvědomí ‘unconsciousness’ and the adjective bezvědomý ’unconscious’. Ignoring negation, bezvědomí has four theoretical forms, but one of them accounts for 70% of the categories, moreover those much


more frequent ones (cf. pondělí ‘Monday’, which declines the same way). Bezvědomý has potentially more than 20 forms. The problem is that the common form of the former is also a form of the latter. So if we considered a simple majority of forms, the nouns similar to bezvědomí would usually lose. We could instead compare the realized percentages of the theoretical number of forms. However, this unnaturally penalizes paradigms with the following properties:

• Paradigms with distinct rare forms. There are many rare categories that are not realized even for a common lemma. For example, the vocative is extremely rarely found in a written text. However, for certain paradigms, the form is very easy to find because it is simply the same as a form of a frequent category (e.g. bezvědomí ‘unconsciousnessS5/1/2/3/4/...’ vs. pane ‘MisterS5’, which is only S5).

• Paradigms with a large number of distinct forms in general. One form is enough to see 25% of the forms of a word like bezvědomí, while five forms are necessary for the same percentage of a word like bezvědomý.

• Paradigms with alternative forms: The paradigm hrad has only one nominative plural, while the paradigm pán has two (e.g. páni / pánové ‘gentlemenP1’). Should we count those alternative forms as one or as two? What if some (but not all) work also for a different category?

A different problem is presented by “stolen” forms. Consider the word atom ‘atom’, an inanimate noun of the hrad paradigm. The Raw corpus contains 161 tokens of this lemma in seven forms – see Table 6.8. Seeing those seven forms is not enough to decide whether the word belongs to an animate or inanimate paradigm. There are five paradigms each covering all seven forms; Table 6.9 lists two of them. If the Raw corpus contained only those forms, we could simply keep all five hypotheses and still be happy to drop the other 122 hypotheses covering a smaller number of forms. The problem is that the corpus also contains 208 tokens of the adjective atomové ‘atomicFS2/FP1/...’ which, however, also fits the nominative plural of the animate paradigm pán. Therefore the incorrect paradigm pán seems to cover more forms than the correct hrad paradigm.⁵

5 It does not help that the corpus also contains the name Atoma, which looks like an animate genitive or accusative singular.


Table 6.8. Forms of the lemma atom in the Raw corpus

forms       possible case     occurrences
atom-0      S1, S4            48     36%
atom-u      S2, S3, S6        28     21%
atom-e      S5                0      0%
atom-em     S7                1      0%
atom-y      P1, P4, P5, P7    22     17%
atom-ů      P2                30     23%
atom-ům     P3                1      0%
atom-ech    P6                1      0%
Total                         132    100%

Table 6.9. Fit of the forms of atom to the hrad and pán paradigms

       masculine inanimate   atom in Raw    masculine animate   atom in Raw
S1     hrad-0                +              pán-0               +
S2     hrad-u                +              pán-a
S3     hrad-u                +              pán-u/ovi           +/–
S4     hrad-0                +              pán-a
S5     hrad-e                               pan-e
S6     hrad-ě/u              –/+            pán-u               +
S7     hrad-em               +              pán-em              +
P1     hrad-y                +              pán-i/ové           –/(+)
P2     hrad-ů                +              pán-ů               +
P3     hrad-ům               +              pán-ům              +
P4     hrad-y                +              pán-y               +
P5     hrad-y                +              pán-i
P6     hrad-ech              +              pán-ech             +
P7     hrad-y                +              pán-y               +
Total                        7                                  7 (8)

For a native speaker of Czech, it is hard to resist mentioning some of the other non-existing lexical entries our algorithm found at various levels of development:

• Neuter noun bylo (paradigm město; forms byloS14, bylaS2/P14, bylP2, bylyP7). In fact, these are past participle forms of the verb být ‘to be’: byloNS, bylaFS/NP, bylMS, bylyFP/IP. The word lists providing analyses for the most frequent word forms fix this particular problem.


• Neuter noun architektuře ‘baby architect?’ (paradigm kuře ‘chicken’; form architektuřeS14). In fact, it is a form of the feminine noun architektura ‘architecture’ (architektuřeS36).

• Masculine animate noun papír (paradigm pán; forms papírS1, papírovéP15, papíruS6, ...). In fact, these are forms of the inanimate noun papír ‘paper’ (papírS14, papíruS236, ...) and of the adjective papírový ‘made from paper’ (papírovéFP145/IP14/MP4).
The sets of endings of animate and inanimate declensions are very similar. One of the distinctions is that only animate paradigms can contain the -ové ending (P15). However, many adjectives are derived from inanimate nouns by the suffix -ov-, which in certain forms can be followed by -é. The simple higher-number-of-forms-wins approach would produce systematic errors.

The algorithm

The algorithm has four steps:

1. Morphological analysis of a raw corpus
For this we can use any morphological analyzer that provides information not only about lemmas and tags, but also about the paradigms used. We used our MA system configured to provide the necessary information.

2. Creating all possible hypothetical lexical entries
Every entry has to contain information about its lemma, paradigm and the set of forms that occurred in the corpus.

3. Filtering out bad entries
The general idea is that the entry that covers the highest number of forms wins. However, taking into account the problems mentioned above, we allow several refinements:

• Certain forms can be excluded from the counting. This is used for endings that cause systematic errors; see the example with papírové at the end of the previous section.

• Certain entries are not dropped even when competing entries cover more forms. This is used for paradigms with a very low number of distinct forms, such as stavení or jarní.

• An entry covering less frequent forms (e.g. instrumental or vocative) need not be considered if it does not cover frequent forms as well (e.g. nominative).


• The size of the winning crust can be specified, in relative or absolute terms. A crust of, say, 15% means that not only the entries with the highest number of forms are kept, but also entries with a number of forms up to 15% smaller. This decreases the precision of the lexicon but increases its recall (i.e. leads to a higher ambiguity and a lower error rate).

• A minimal number of tokens and/or forms for an entry can be specified. This allows limiting the algorithm to entries with a statistically reliable number of forms/tokens.

4. Creating a lexicon
This step is quite uninteresting – it is necessary to create appropriate lexical entries for the items that survived all the filtering. For that we need information about the lemma and paradigm, which we have, and about the stem(s), which we can easily derive.
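The following sketch illustrates steps 2 and 3 of the algorithm under simplifying assumptions: the input is a list of (form, count, lemma, paradigm) tuples as the Guesser might produce them, and only the crust and minimum-token refinements are shown; all names and the toy data are ours.

    from collections import defaultdict

    def acquire_lexicon(analyses, crust=0.15, min_tokens=2):
        """Filter hypothetical lexical entries acquired from a raw corpus.

        `analyses` holds (form, token_count, lemma, paradigm) tuples produced by
        the Guesser (step 1).  Step 2 groups the attested forms under each
        hypothetical entry; step 3 keeps, for every form, only the competing
        entries that cover (nearly) as many forms as the best one.
        """
        entries = defaultdict(lambda: {"forms": set(), "tokens": 0})
        by_form = defaultdict(set)
        for form, count, lemma, paradigm in analyses:
            key = (lemma, paradigm)
            entries[key]["forms"].add(form)
            entries[key]["tokens"] += count
            by_form[form].add(key)

        kept = set()
        for form, keys in by_form.items():
            candidates = [k for k in keys if entries[k]["tokens"] >= min_tokens]
            if not candidates:
                continue
            best = max(len(entries[k]["forms"]) for k in candidates)
            kept.update(k for k in candidates
                        if len(entries[k]["forms"]) >= best * (1 - crust))
        return sorted(kept)

    # Toy input: three forms supporting the verb lemma 'talk'; one of them is also
    # compatible with a spurious noun lemma 'talking', which gets filtered out.
    toy = [("talk", 5, "talk", "V-regular"),
           ("talks", 3, "talk", "V-regular"),
           ("talking", 4, "talk", "V-regular"),
           ("talking", 4, "talking", "N-regular")]
    print(acquire_lexicon(toy))   # -> [('talk', 'V-regular')]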

6.3.5 Evaluation

We evaluated our Morphological Analyzer against the Test corpus manipulating two parameters:

• Whether a lexicon automatically acquired from the raw corpus is used.

• Size of a word list capturing analyses of the most frequent word forms (top forms list, or TFL). The lists were created on the basis of the Raw corpus.

The results are summarized in Table 6.10. It is worth repeating that we are concerned only with nouns. The TFLs help without question – they lower both the error rate (they help with irregular words that are not covered by our paradigms) and the ambiguity. The automatic lexicon lowers ambiguity (by pruning incorrect lexical entries), but also increases the error rate (by pruning correct lexical entries). Without a TFL, ambiguity decreases by 40% and error rate increases by 38%. With the 10K TFL, ambiguity decreases by 32% and error rate increases by 25%. Depending on what the results will be used for, it may or may not make sense to use an automatic lexicon. The quality of the results is worse than that of Hajič (2004), a system with a large manually created lexicon: our recall error is roughly three times as large and our precision error twice as large. As mentioned before, the Guesser is relatively slow; therefore using a TFL and/or a lexicon increases the speed of analysis.


Table 6.10. Evaluation of the Czech morphological analyzer (on nouns)

Lexicon              –      –      –      +      +      +      Hajič⁶
Top forms list       0K     5K     10K    0K     5K     10K
Error rate           3.6    2.9    2.7    5.8    3.9    3.6    1.3
Ambiguity (tag/w)    19.6   13.1   11.5   11.7   8.5    7.8    3.8
Speed (w/s)⁷         3000   3500   4800   4500   6500   8200

6.4 Application to other languages

To test the portability of our approach to other languages, we created similar morphological analyzers for Russian (Hana et al. 2004), Portuguese (Hana et al. 2006) and Catalan (Feldman et al. 2006). The systems, their setup and results are described below.

Setup

The setup for all three languages is similar to that for Czech, with the following exceptions:

1. Not just nouns: Because the morphological analyzer is a component of a tagger (see chapter 7), we gave equal importance to all parts of speech (for Czech, we focused on nouns only).

2. Top-frequency lists: For Catalan, we used a list containing 1,000 words. For Russian and Portuguese we did not use such lists for practical reasons, although we believe they would help significantly. We plan to employ them in the near future.

3. Longest ending filtering: To increase the precision of the analyzer we use a heuristic which we call longest-ending-only filtering (LEO, see Hana et al. 2004); a sketch follows after this list. This is a simple heuristic to decrease the number of analyses produced by the Guesser module. It assumes that the correct ending is usually one of the longest candidate endings. A similar approach was used by Mikheev (1997). In English, it would mean that if a word is analyzed either as having a zero ending or an -ing ending, we would consider only the latter; obviously, in the vast majority of cases that would be the correct analysis. In addition, we specify that a few long but very rare endings should not be included in the maximum-length calculation. To stay within the labor-light paradigm, we address only the few most common systematic errors that the LEO filtering introduces.

6 300K lexicon (Hajič, p.c.)
7 Running on Sun Java RE 1.5.0.01 with HotSpot, MS Windows XP on a Pentium Celeron 2.6 GHz with 750MB RAM. The time needed to initialize the system (load and compile lexicons, paradigms, etc.) is not included.


4. No development corpus for Portuguese: We do not have a development corpus for Portuguese, therefore we cannot directly tune the parameters of the Portuguese analyzer. Instead, we use the parameters of the analyzer for Catalan, a related language.
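As promised above, a minimal sketch of the longest-ending-only filtering. It assumes each analysis records the ending it used; the treatment of the rare endings excluded from the maximum-length computation (kept rather than discarded) is one possible reading of point 3, and the names are ours.

    def leo_filter(analyses, ignore=frozenset()):
        """Longest-ending-only filtering of the Guesser's output for one word form.

        `analyses` is a list of (lemma, tag, ending) triples.  Endings listed in
        `ignore` (a few long but very rare ones) do not take part in the
        maximum-length computation; here they are kept rather than discarded.
        """
        lengths = [len(e) for _, _, e in analyses if e not in ignore]
        if not lengths:
            return analyses
        longest = max(lengths)
        return [a for a in analyses if a[2] in ignore or len(a[2]) == longest]

    # English-style illustration: 'talking' analyzed as talk+ing (verb) or as a
    # monomorphemic noun with a zero ending; only the former survives.
    candidates = [("talk", "VBG", "ing"), ("talking", "NN", "")]
    print(leo_filter(candidates))   # -> [('talk', 'VBG', 'ing')]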

The resources for each language were compiled in a slightly different way:

• Catalan: Created by the first author, who is not a speaker of Catalan, nor of any other Romance language. It is based on Wheeler et al. (1999). However, the Catalan top-frequency list was not created manually. Instead, we used the Raw corpus to identify the most frequent words and then extracted the tags they occurred with from an 80K corpus of Catalan (disjoint from the Dev and Test corpora).

• Portuguese: Created by a native speaker of Brazilian Portuguese. It is based on Cunha and Cintra (2001). The description contains about 40 paradigms, about 460 closed-class words, and a number of the most common irregular verbs. It took about 14 hours to compile.

• Russian: Created by the second author, who is a poor speaker of Russian and a native speaker of Czech, on the basis of Wade (1992).

Evaluation

The parameters of the system (parameters of lexicon acquisition, LEO filtering, etc.) were tuned against the development corpora Dev. Here we present an evaluation against the testing corpora Test manipulating two parameters:

• Whether the lexicon automatically acquired from the raw corpus is used.

• Whether the longest ending filtering is used.

The results are summarized in Tables 6.11 (Russian), 6.12 (Catalan) and 6.13 (Portuguese). We can see that either the Lexicon or the LEO filtering lowers the ambiguity but also increases the recall error. In the case of Catalan, similarly to Czech, the top-frequency list helps significantly, lowering the ambiguity and the recall error at the same time.

Note that for Portuguese, adding the LEO filtering to the system with a lexicon does not help: the ambiguity (and thus precision) stays the same, while the recall drops. Unfortunately, we do not have a Portuguese development corpus to discover this. We tune the Portuguese analyzer on the basis of Catalan, where LEO filtering helps. Therefore, the Portuguese tagger presented in section 7.11 does use the results of the analyzer with LEO filtering, even though the accuracy would probably be higher without it.


Table 6.11. Evaluation of the Russian morphological analyzer

Lexicon                 no     yes    no     yes
LEO                     no     no     yes    yes
All  Recall error       2.9    4.3    12.7   6.6
     Ambiguity (tag/w)  9.7    4.4    3.3    2.8
N    Recall error       2.6    4.9    41.6   13.7
     Ambiguity (tag/w)  18.6   6.8    6.5    4.3
A    Recall error       6.2    7.0    8.1    7.5
     Ambiguity (tag/w)  21.6   10.8   3.3    5.7
V    Recall error       0.8    2.0    2.3    2.3
     Ambiguity (tag/w)  14.7   4.8    1.5    1.5

Table 6.12. Evaluation of the Catalan morphological analyzer

Top-form-list           no     yes    no     no     yes    yes
Lexicon                 no     no     yes    no     yes    yes
LEO                     no     no     no     yes    no     yes
All  Recall error       3.8    2.5    5.5    14.7   3.4    4.2
     Ambiguity (tag/w)  5.4    4.1    3.3    2.6    2.9    2.6
N    Recall error       2.0    1.4    7.0    7.0    4.3    6.8
     Ambiguity (tag/w)  10.2   6.7    5.9    5.9    4.6    4.1
A    Recall error       7.0    3.2    8.8    50.1   4.4    6.0
     Ambiguity (tag/w)  13.2   10.8   7.6    4.8    4.6    6.1
V    Recall error       12.1   5.1    14.1   14.1   5.9    6.2
     Ambiguity (tag/w)  8.6    6.1    3.0    3.0    2.5    2.3

Table 6.13. Evaluation of the Portuguese morphological analyzer

Lexicon                 no     yes    no     yes
LEO                     no     no     yes    yes
All  Recall error       1.2    1.8    2.4    2.0
     Ambiguity (tag/w)  4.2    3.4    3.9    3.4
N    Recall error       0.9    2.6    0.9    2.6
     Ambiguity (tag/w)  5.8    4.7    5.4    4.5
A    Recall error       2.8    6.1    11.1   7.8
     Ambiguity (tag/w)  5.9    4.0    5.1    4.0
V    Recall error       2.0    1.5    6.1    1.5
     Ambiguity (tag/w)  5.3    2.3    4.8    2.3


6.5 Possible enhancements

Currently, the main effort is focused on improving lexicon acquisition:

1. considering frequencies and contexts of word forms when eliminating incorrect hypotheses;

2. replacing the sequential application of heuristics with their weighted parallel combination;

3. using information about common derivation patterns to extend the algorithm over several lemmas related by derivation and to eliminate some of the systematic errors mentioned in section 6.3.4. The preliminary results for Czech nouns are very promising: providing very basic and noisy information on 15 common derivational suffixes results in ambiguity lowered by nearly 50% and slightly increased recall.

4. We are also exploring the possibilities of combining our approach with various machine learning techniques.

Finally, we are in the process of improving our tools used by native (or informed) speakers to provide the limited amount of information needed by the analyzer in a fast and effective way.


Chapter 7

Cross-language morphological tagging

The current chapter explores the possibility of tagging Slavic and Romance languages without relying on any labor- and knowledge-intensive resources for those languages. Instead, we employ the resource-light morphological analyzers described in the previous chapter and available annotated corpora of a related language. We combine this information in various ways to produce transition and emission probabilities usable by a second-order Markov model tagger (see chapter 2), namely the TnT tagger (Brants 2000). We first describe in detail experiments with tagging Russian using Czech resources, and then we show how the same method can be used to tag Catalan and Portuguese using Spanish resources.

7.1 Why a Markov model

Slavic languages and, to a large extent, also Romance languages have relatively free word order, so it may seem an odd choice to use a Markov model (MM) tagger. Why should TnT, a second-order MM, be able to capture useful facts about such languages?

Firstly, as chapter 5 notes, while these languages potentially allow for all word permutations in the sentence, in reality, it still turns out that there are recurring patterns in the progressions of tags attested in a training corpus. The order of words within a constituent is relatively fixed, and the order of constituents does not vary as much as one could expect.¹

1 The word order of Slavic languages, for example, is often determined by pragmatic constraints such as information structure or (in)definiteness (due to the lack of (in)definite articles).

Secondly, as we discuss in chapter 2, n-gram models including MMs have indeed been shown to be successful for various Slavic and Romance languages, e.g., Czech (Hajič et al. 2001) or Slovene (Džeroski et al. 2000), although not as


much as for English. This shows that the transition information captured by the second-order MM from a Czech or Slovene corpus is useful for Czech or Slovene. We show that the transition information acquired from Czech is also useful for Russian, and that Spanish transitions are useful for Catalan and Portuguese. See chapter 2 for more discussion of the TnT tagger, Markov models, and tagging in general.

7.2 Tagging Russian using Czech

The current chapter shows that the transition information acquired from a source language (e.g. Czech or Spanish) is also useful for a related target language (e.g. Russian, Portuguese, or Catalan).

We treat Czech as an approximation of Russian. In the simplest model (see section 7.3), we use a Czech tagger to tag Russian directly (modulo tagset and script mappings). In the subsequent experiments, we improve this initial model (making sure we stay in the labor- and knowledge-light paradigm):

1. We use the morphological analyzer (MA) from chapter 6 to approximate emissions (section 7.5).

2. We use an automatically acquired list of cognates to combine Czech emissions and emissions based on the MA (section 7.6).

3. We apply simple syntactic transformations to the Czech corpus (“Russifications”) to make it more Russian-like and thus improve the acquired transitions (section 7.7).

4. We train batteries of taggers on subtags to address the data sparsity problem (section 7.8).

All results are reported for the development corpus, because we do not use the test data to tune the tools. Section 7.9 reports the results of the best method and, for the reader’s sake, also a representative sample of the other experiments on the test data. In addition to accuracy for the full tag, we also report accuracy on the Detailed POS (SubPOS) slot. The reason is that SubPOS values are comparable to the Penn Treebank tagset (Marcus et al. 1993b) in their number and, to a certain extent, also in their meaning. Where appropriate, we also report the accuracy for other categories. We report these numbers not only for all tokens but also for the three most frequent parts of speech: nouns, adjectives and verbs. We do this because the accuracy differs significantly for tokens of different POS.

Finally, a note about the Cyrillic script and transliteration: unfortunately, we ran into problems when using TnT with Cyrillic characters, regardless of the encodings we tried. Therefore, we transliterate all Russian texts using Scientific Transliteration (ISO-9 1995).


7.3 Using source language directly

In this model we assume that Czech is such a good approximation of Russian that we can use a Czech tagger to tag Russian directly (modulo tagset and script mappings).

7.3.1 Tagset translation

The Czech and Russian tagsets are very similar, but they are not identical. There are two types of differences between the tagsets (see section 4.4.1 and Appendix A for more details):

• Some differences are caused by different properties of the languages. For example, Russian does not have a vocative or dual; neither does it have auxiliary or pronominal clitics.

• Some differences are due to a different tagset design. For example, forms that would be tagged with a tag containing a partial wildcard (e.g. Z meaning any non-feminine gender) in Czech are regarded as ambiguous by the Russian tagset (M or N in this case) because it does not use any partial wildcards.

We translated the Czech tagset into a tagset closer to the Russian system in the following way:

• Translate to the corresponding category in Russian (if obvious; again, we want to stay within the labor- and knowledge-light paradigm). For example, where Czech uses the vocative, Russian uses the nominative; where Czech uses pronominal clitics, Russian uses pronouns.

• Drop distinctions Russian does not make. For example, short adjectives do not distinguish case, verbs do not distinguish negation.

• Ignore rare tags.

Some tagset differences are not easy to resolve with our translation procedure because it considers the tags alone, without the associated form, lemma, context, etc. For example, the (official) Czech past participles use the same form for feminine singular and neuter plural. While this is usually considered ambiguity, the Czech tagset assigns them the same tag with the Q value for gender and W for number; thus the QW gender-number combination stands for both FS and NP. Russian participles distinguish gender in the singular but use the same form for plural subjects of all genders. Therefore, QW should be translated as FS (feminine singular) or as XP (any gender, plural), depending on the gender and number of the subject. However, identification of the subject is far beyond the capabilities of any simple tag-translation algorithm (it does not help that the subject is often elided). So, in such cases we either choose the most frequent tag (FS for participles, based on the distribution of potential subjects, i.e. nouns and pronouns in the nominative) or, for a less frequent discrepancy, simply ignore it.
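A schematic rendering of a few of the translation rules just described (vocative → nominative, the QW gender-number combination → FS, conditional auxiliary → particle). The position indices used here (gender in slot 3, number in slot 4, case in slot 5) follow the common positional-tagset layout and are an assumption on our part, as are the sample tags; the rule set is only a small illustrative fragment, not the full mapping.

    # Illustrative fragment of the Czech-to-Russian tag translation applied to
    # the training data before transitions are estimated.  Tags are assumed to be
    # 15-position strings; only three rules are shown.

    def translate_tag(cz_tag):
        tag = list(cz_tag)
        # The conditional auxiliary 'by' (SubPOS c) becomes a particle.
        if tag[0] == "V" and tag[1] == "c":
            return "TT" + "-" * 13
        # Czech vocative (case 5) is mapped to nominative (Russian has no vocative).
        if tag[4] == "5":
            tag[4] = "1"
        # The QW gender-number combination of past participles is mapped to FS,
        # the most frequent resolution.
        if tag[2] == "Q" and tag[3] == "W":
            tag[2], tag[3] = "F", "S"
        return "".join(tag)

    print(translate_tag("Vc-S---1-------"))   # hypothetical conditional tag -> 'TT-------------'
    print(translate_tag("NNMS5-----A----"))   # hypothetical vocative noun -> 'NNMS1-----A----'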


7.3.2 Script translation

Since Russian and Czech use different scripts (the Cyrillic and Latin alphabets, respectively), there are virtually no common words between the Czech corpus used for training and the tagged Russian corpus (although some tokens, including punctuation and numbers, are the same). This makes the Czech emission probabilities nearly useless in tagging Russian. To get around this issue, we transliterate Russian texts into the Latin alphabet, using the so-called Scientific Transliteration.² The transliteration of Russian produces a representation similar to the Czech spelling system, e.g. it produces š for [ʃ] and č for [tʃ].

In addition, we modify the Czech corpus, replacing characters not present in the transliterated Russian with their obvious (or most likely) counterparts. For example, long vowels are shortened (á → a), palatalization is expressed using the soft sign (ň → n’), etc.

The results are presented in Table 7.1. While they are far better than random guessing (the tagset contains approximately 1,000 tags), an accuracy of 48% is not even close to the mid-90s of a standard Slavic tagger. One of the reasons is that while Czech and Russian words are similar, they are not identical. About 55% of the tokens of the Russian corpus did not occur in the Czech training data (59% with the Scientific transliteration only). Moreover, many of those that did occur do not have the same tags, e.g. ni is a personal pronoun (‘sheloc’) in Czech but a negative particle in Russian, nej being the pronoun corresponding to Czech ni.

Table 7.1. Direct Tagger: Czech tagger applied to Russian

Tagger name            direct
                       Scientific transliteration   Better transliteration
Unknown tokens (%)     59.0                          55.3
All  Full tag          44.9                          48.1
     SubPOS            61.0                          63.8
N    Full tag          32.8                          37.3
     SubPOS            84.0                          81.1
A    Full tag          20.7                          31.7
     SubPOS            33.8                          51.7
V    Full tag          36.1                          39.9
     SubPOS            44.6                          48.1

2 It would be more natural to transliterate the Czech training corpus into the Cyrillic alphabet; unfortunately, as we noted in the previous section, TnT does not work well with any Cyrillic encodings we tried. Therefore, we tag all Russian texts in their transliterated form.


In the following, we suggest various methods that will alleviate this problem. However, first we will discuss the realistic expectations of our method. In the rest of this chapter, anytime we refer to the Direct tagger, we mean the tagger described in this section.

7.4 Expectations

In the previous section, we observed that while Czech and Russian words are similar, Czech emissions are not easily usable for approximating Russian emissions. The following experiments test how useful the Czech transitions and emissions are for approximating Russian. We train the TnT tagger on 1,000 tokens of Russian and on 1,000 tokens of Czech. Then we combine the obtained emission and transition probabilities in all possible ways to tag the other part of the Russian development corpus. The results are summarized in Table 7.2.

Table 7.2. Tagging Russian with various combinations of Czech and Russian emissions and transitions

Emissions              Russian   Russian   Czech     Czech
Transitions            Russian   Czech     Russian   Czech
Unknown tokens (%)     46.9      46.9      73.4      73.4
All  Full tag          67.6      66.4      36.1      33.6
     SubPOS            81.9      81.9      48.0      47.5
N    Full tag          32.0      29.1      18.4      17.5
     SubPOS            75.7      71.8      64.1      60.2
A    Full tag          42.2      48.9      15.6      4.4
     SubPOS            75.6      84.4      33.3      35.6
V    Full tag          81.4      76.7      31.4      22.1
     SubPOS            89.5      87.2      33.7      29.1

• The first tagger uses Russian transitions and Russian emissions. It is a traditional tagger trained and used for the same language, Russian. The only unusual thing is the small size of the training data. Correspondingly, the results are not stellar.

• The second tagger uses Russian emissions and Czech transitions. This tagger tests how useful the Czech transitions are for tagging Russian. The drop in accuracy in comparison to the first tagger reflects the mismatch between Czech and Russian word order (transitions). The drop is very small: 67.6% vs. 66.4% for the full tag on all tokens; the drop for individual tag slots and for tokens with a particular POS is also small. In fact, for adjectives, it produces slightly better results. Based on this, we can conclude that Czech transitions are a good approximation of Russian transitions.


• The third tagger uses Czech emissions and Russian transitions. This tagger tests how useful the Czech lexicon is for tagging Russian. The drop in accuracy in comparison to the first tagger reflects the mismatch between the Czech and Russian lexicons (emissions). The drop is substantial: 67.6% vs. 36.1% for full tags on all tokens. The drop for verbs and especially adjectives is even worse. This shows that Czech emissions are not a very good approximation of Russian emissions, at least not when used directly.

• The fourth tagger uses Czech transitions and Czech emissions. Except for the size of the training corpus, this tagger is equivalent to the Direct tagger (see section 7.3). The worse results reflect the smaller size of the training corpus. It is not surprising that this tagger performs the worst – it uses both ‘syntactic’ and ‘lexical’ information derived from Czech to tag Russian. When the results are compared to the results of the previous tagger, they also reinforce the conclusions above – Czech approximates the Russian word order quite well.

To sum up, the results show that Russian and Czech words are far more similar in their order (transitions) than they are in their shape and/or distribution (emissions). In the following two sections we focus on approximating emissions.

Note: One might argue that the results in this section and in sections 7.5–7.8 are not comparable, because the latter are reported for the whole development corpus while the former only for the first half of it. However, we ran all the following experiments on the second half of the development corpus as well, and the results were very similar to those obtained on the whole development corpus.

7.5 Using MA to approximate emissions

The Direct tagger (see section 7.3) used Czech emissions to approximate Russian emissions. The previous section suggested that this is the main culprit of its poor performance. The Czech emissions can differ from the ideal Russian emissions in three ways:

1. A particular emitted word does not exist.

2. The set of tags associated with an emitted word is different.

3. The distribution of tags in that set is different.

The emissions almost certainly differ in all three ways. For example, as we mentioned in the evaluation of the Direct tagger, 55% of tokens in the Russian corpus did not occur in the Czech training corpus.

The tagger described in this section uses our Russian Morphological Analyzer (see section 6.4) to produce the emission probabilities. This greatly alleviates the first two problems. The analyzer does not produce a distribution over its output, therefore we simply assume a uniform distribution. The following section describes a more sophisticated way to distribute the tags.

The tagger uses the same transition probabilities as the Direct tagger (see section 7.3), i.e. transitions trained on the Czech training corpus with the Czech tagset translated into the Russian tagset. This means that the transitions are produced during the training phase and are independent of the tagged text. The emissions are produced by the morphological analyzer on the basis of the tagged text during tagging.

The results in Table 7.3 show that the accuracy clearly improved for all major open classes, especially for verbs. The much lower accuracy for nouns (54.4%) and adjectives (53.1%) than for verbs (90.1%) is expected. In the output of the morphological analyzer that is the basis for the emissions, verbs have an ambiguity of 1.6, while the ambiguity for nouns and adjectives is 4.3 and 5.6, respectively (see the last column in Table 6.11). Moreover, verbs also have a higher recall.

Table 7.3. Tagging with evenly distributed output of Russian MA

Tagger name        Direct   Even
Transitions        Czech    Czech
Emissions          Czech    uniform Russian MA
All  Full tag      48.1     77.6
     SubPOS        63.8     91.2
N    Full tag      37.3     54.4
     SubPOS        81.1     89.6
A    Full tag      31.7     53.1
     SubPOS        51.7     86.9
V    Full tag      39.9     90.1
     SubPOS        48.1     95.7

7.6 Improving emissions – cognates

Although it is true that the forms and distributions of Czech and Russian words are not the same, they are also not completely unrelated. As any Czech speaker would agree, the knowledge of Czech words is useful when trying to understand a text in Russian (obviously, one has to understand the script, as most Czechs do). The reason is that many of the corresponding Czech and Russian words are cognates (i.e. historically they descend from the same ancestor root or they are mere translations).

In this section, we use cognates to combine the Czech emissions (as used by the Direct tagger, section 7.3) and the output of the Russian morphological analyzer (used by the Even tagger, section 7.5).


7.6.1 Cognate pair

We define a cognate pair as a translation pair where words from two languages share both meaning and a similar surface form. Depending on how closely the two languages are related, they may share more or fewer cognate pairs. Linguistic intuition suggests that the information about cognate words in Czech should help in tagging Russian. Two hypotheses are tested in the experiments with respect to cognates:

1. Cognate pairs have similar morphological and distributional properties.

2. Cognate pairs are similar in form.

Obviously both of these assumptions are approximations because

1. Cognates could have departed in their meaning, and thus probably have different distributions. For example, consider život ‘life’ in Czech vs. život ‘belly’ in Russian, and krásný (adj.) ‘nice’ in Czech vs. krasnyj (adj.) ‘red’ in Russian.

2. Cognates could have departed in their morphological properties. For example, téma ‘theme’, borrowed from Greek, is neuter in Czech and feminine in Russian.

3. There are false cognates — unrelated, but similar or even identical words. For example, dělo ‘cannon’ in Czech vs. delo ‘matter, affair’ in Russian, jel [jEl] ‘drove’ in Czech vs. el [jEl] ‘ate’ in Russian, pozor ‘attention’ in Czech vs. pozor ‘disgrace’ in Russian, ni ‘sheloc’ in Czech vs. ni, the negative particle in Russian (corresponding to Czech ani).³

Nevertheless, the assumption here is that these examples are true exceptions to the rule and that in the majority of cases, cognates will look and behave similarly. The borrowings, counter-borrowings, and parallel developments of both Slavic and Romance languages have been extensively studied (see e.g. Derksen 2008 for Slavic and Gess and Arteaga 2006 for Romance), but this book does not provide a survey of this research.

3 It is interesting that many unrelated languages have amazing coincidences. For example, the Russian gora ‘mountain/hill’ and the Konda goro ‘mountain/hill’ do not seem related; or the Czech mladá ‘young’ is a false cognate of the Arabic malad ‘youth’, but coincidentally, the words have similar meanings. This is definitely not a very frequent language phenomenon, but even though the words are not etymologically related, finding such pairs should not hurt the performance of the system.

In Feldman et al. (2005), we report the results of an experiment where the 200 most frequent nouns from the Russian development corpus are manually translated into Czech. They constitute about 60% of all noun tokens in the development


corpus. The information about the distribution of the Czech translations is transferred into the Russian model using an algorithm similar to the one outlined in section 7.6.3. The performance of the tagger that uses manual translations of these nouns improves by 10% on nouns and by 3.5% overall. The error analysis reveals that some Czech-Russian translations do not correspond well in their morphological properties and, therefore, create extra errors in the transfer process. However, overall the accuracy does improve.

Obviously, if we want to stay in the resource/knowledge-light paradigm, we cannot provide the list manually. The following section describes a language-independent algorithm for achieving comparable results.

7.6.2 Identifying cognates

Our approach to cognate detection does not assume access to philological erudition, to accurate Czech-Russian translations, or even to a sentence-aligned corpus. None of these resources would be obtainable in a resource-poor setting. Instead we simply look for similar words, using a modified edit distance (Levenshtein 1966) as a measure of similarity.

We use a variant of the edit distance where the cost of operations is dependent on the arguments. In general, we assume that characters sharing certain phonetic features are closer than characters not sharing them (we use spelling as an approximation of pronunciation – in both Russian and Czech the relation between spelling and pronunciation is relatively simple). Thus, for example, b is closer to p than to, say, j. In addition, costs are refined based on some well-known and common language-specific phonetic-orthographic regularities. The non-standard distances for Czech and Russian include, for example:

• Russian è and e have zero distance from Czech e.

• Czech h and g have zero distance from Russian g (in Czech, the original Slavic g was replaced by h; in Russian it was not).

• The length of Czech vowels is ignored (in Russian, vowel length is not phonemic).

• y and i are closer to each other than other vowels (modern Czech does not distinguish between them in pronunciation).

However, performing a detailed contrastive morpho-phonological analysis is undesirable, since portability to other languages is a crucial feature of the system. So, some facts from a simple grammar reference book should be enough. Ideally, optimal distances should be calculated; however, currently we set them based on our intuition.

To speed up the computation of distances we preprocess the corpora, replacing every character that has a unique zero-distance counterpart by that counterpart.


At the end of the cognate acquisition process, the cognates are translated back to their original spelling. Because edit distance is affected by the number of arguments (characters) it needs to consider, the edit distance measure is normalized by word length. The list of cognates includes all Czech-Russian pairs of words whose distance is below a certain threshold. We further require that the words have the same morphological features (except for the gender of nouns and the variant, as these are lexical features).
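A sketch of the weighted, length-normalized edit distance just described; the concrete cost values and the substitution-cost entries below are illustrative stand-ins for the language-specific distances listed above, not the ones actually used.

    def weighted_edit_distance(a, b, subst_cost=None, default=1.0):
        """Levenshtein distance with argument-dependent substitution costs,
        normalized by the length of the longer word."""
        subst_cost = subst_cost or {}
        m, n = len(a), len(b)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * default
        for j in range(1, n + 1):
            d[0][j] = j * default
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    sub = 0.0
                else:
                    pair = frozenset((a[i - 1], b[j - 1]))
                    sub = subst_cost.get(pair, default)
                d[i][j] = min(d[i - 1][j] + default,       # deletion
                              d[i][j - 1] + default,       # insertion
                              d[i - 1][j - 1] + sub)       # substitution
        return d[m][n] / max(m, n, 1)

    # Illustrative costs: phonetically close characters are cheaper to substitute.
    COSTS = {frozenset(("h", "g")): 0.0,    # Czech h ~ Russian g
             frozenset(("y", "i")): 0.2,    # y and i are close
             frozenset(("b", "p")): 0.5}    # differ only in voicing

    print(weighted_edit_distance("kniha", "kniga", COSTS))   # Czech vs. Russian 'book' -> 0.0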

7.6.3 Using cognates

The list of cognates obtained by the procedure described above is used to map the Czech emission probabilities to Russian emissions. To further explain this, assume w_cze and w_rus are cognate words. Let T_cze denote the tags that w_cze occurs with in the Czech training corpus. Let p_cze(t) be the emission probability of tag t (t ∉ T_cze ⇒ p_cze(t) = 0). Let T_rus denote the tags assigned to w_rus by the morphological analyzer; 1/|T_rus| is the even emission probability. Then, assign the new emission probability p′_rus(t) to every tag t ∈ T_rus as given in (4) (followed by normalization):

(4)   p′_rus(t) = p_cze(t) + 1/|T_rus|   if t ∈ T_rus
      p′_rus(t) = 0                      otherwise
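The mapping in (4), including the final normalization, can be rendered as a short function; p_cze is the Czech emission distribution of the cognate and t_rus the tag set the morphological analyzer assigns to the Russian form. The function name and the toy tags are ours.

    def cognate_emissions(p_cze, t_rus):
        """Combine the Czech emissions of a cognate with the uniform distribution
        over the tags offered by the Russian morphological analyzer (formula (4)),
        followed by normalization."""
        if not t_rus:
            return {}
        uniform = 1.0 / len(t_rus)
        raw = {t: p_cze.get(t, 0.0) + uniform for t in t_rus}
        total = sum(raw.values())
        return {t: p / total for t, p in raw.items()}

    # The Czech cognate was seen mostly as NNFS1, sometimes as NNFS4; the analyzer
    # allows NNFS1, NNFS4 and NNFS2 for the Russian form.
    print(cognate_emissions({"NNFS1": 0.7, "NNFS4": 0.3}, {"NNFS1", "NNFS4", "NNFS2"}))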

The results are presented in Table 7.4. For comparison, we also show the results of the Direct (see section 7.3) and Even taggers (see section 7.5). In comparison with the Even tagger, the accuracy of the Cognates tagger improves in all measures (with the exception of SubPOS of nouns and adjectives, where it gives the same accuracy).

Table 7.4. Tagging Russian using cognates

Tagger name        Direct   Even      Cognates
Transitions        Czech    Czech     Czech
Emissions          Czech    even MA   cognates
All  Full tag      48.1     77.6      79.5
     SubPOS        63.8     91.2      92.2
N    Full tag      37.3     54.4      57.3
     SubPOS        81.1     89.6      89.9
A    Full tag      31.7     53.1      54.5
     SubPOS        51.7     86.9      86.9
V    Full tag      39.9     90.1      90.6
     SubPOS        48.1     95.7      96.1


7.7 Improving transitions – “Russifications”

The experiments in section 7.4 showed that Czech transitions are a fairly good approximation of Russian transitions. Nevertheless, there was still a drop, although small, in accuracy, especially for verbs, when compared to the native Russian transitions. In this section we experiment with “Russifications”, modifications to the Czech corpus that make its structure look more like Russian, thus resulting in more Russian-like transitions.

Negation in Czech is expressed by the prefix ne-, whereas in Russian it is very common to see a separate particle (ne) instead:

(5)  a. Nic      neřekl.
        nothing  not-said
        ‘He didn’t say anything.’ [Cz]

     b. On  ničego   ne   skazal.
        he  nothing  not  said
        ‘He didn’t say anything.’ [Ru]

Reflexivization of verbs is expressed by a separate word in Czech,⁴ and by affixation in Russian. Compare the following examples (se is the reflexive clitic in Czech, and sja is the reflexive suffix in Russian):

(6)  a. Filip  se       ještě  neholí.
        Filip  REFL-CL  still  not-shaves
        ‘Filip doesn’t shave yet.’ [Cz]

     b. Filip  ešče   ne   breet+sja.
        Filip  still  not  shaves+REFL.SUFFIX
        ‘Filip doesn’t shave yet.’ [Ru]

Even though the auxiliaries and the copula are forms of the same verb být/byt’ ‘to be’ both in Czech and in Russian, the use of this verb is different in the two languages. For example, Russian does not use an auxiliary to form the past tense:

(7)  a. Já  jsem     psal.
        I   aux.1sg  wrote
        ‘I was writing/I wrote.’ [Cz]

     b. Ja  pisal.
        I   wrote
        ‘I was writing/I wrote.’ [Ru]

4 While Czech reflexive pronouns are separate tokens in writing, from a linguistic point of view the situation is less clear. In most cases, reflexive pronouns are not full-fledged words but rather clitics, i.e. elements on the borderline between independent words and bound morphemes (see, e.g., Hana 2007, Chapter 4). They usually occur in the so-called second position, following the first constituent in the sentence, regardless of the position of the verb they belong to.

It also does not use the present-tense copula, except for emphasis; but it uses forms of the verb byt’ in some other constructions like the past passive.
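To give a flavor of how cheap such transformations are, here is a sketch of the negation rule applied to a corpus represented as (form, tag) pairs. The representation and the placeholder tags are our simplification, and the real rule would also adjust the negation slot of the tag.

    # Sketch of one "Russification" rule: Czech negated verbs (prefix ne-) are
    # split into a separate negative particle plus the verb without the prefix,
    # so that the transformed corpus resembles Russian 'ne + verb'.

    def split_negation(sentence):
        out = []
        for form, tag in sentence:
            # only split reasonably long verb forms, to avoid verbs that merely
            # begin with the letters 'ne' (a crude illustrative guard)
            if tag.startswith("V") and form.startswith("ne") and len(form) > 4:
                out.append(("ne", "TT"))        # negative particle (placeholder tag)
                out.append((form[2:], tag))     # verb form without the prefix
            else:
                out.append((form, tag))
        return out

    print(split_negation([("Nic", "PN"), ("neřekl", "Vp")]))
    # -> [('Nic', 'PN'), ('ne', 'TT'), ('řekl', 'Vp')]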

Table 7.5 shows the results of the Cognates tagger with transitions trained on the Czech Train corpus russified according to the three transformations listed above. The transformations indeed result in a modest improvement in accuracy: from 79.5% to 80.0% on the full tag; also the results for verbs and adjectives improve.

Table 7.5. Tagging Russian using Russified Czech transitions

Tagger name        Cognates (section 7.6)   Russified
Transitions        Czech                    Russified Czech
Emissions          cognates                 cognates
All  Full tag      79.5                     80.0
     SubPOS        92.2                     92.3
N    Full tag      57.3                     57.1
     SubPOS        89.9                     89.3
A    Full tag      54.5                     55.9
     SubPOS        86.9                     86.9
V    Full tag      90.6                     92.7
     SubPOS        96.1                     96.6

The results suggest that writing simple transformation rules and making the structure of the Czech corpus look more like Russian can improve the performance of the system. There are many other possible candidates for “russifications”. However, the effort needed for their implementation is not warranted, for two reasons. First, they are by their nature language specific and therefore do not fit into our goal of developing a resource- and knowledge-light framework where most of the work and expertise is invested into language-independent infrastructure. Second, the penalty for using Czech transitions is very small (although this might be different for other language pairs).

Also note that some improvements in transitions are the result of the tagset translation, which is part of the most basic tagger, the Direct tagger. For example, conditionals are expressed in Czech periphrastically by the auxiliary verb by, conjugated for both numbers and all three persons, plus the past participle. Russian conditionals also consist of by plus the past participle. However, by does not inflect and is usually categorized as a particle. See example (8) for comparison. The Czech tagset annotates the conditional auxiliary with tags of the form Vc-g---e------- (g and e standing for gender and person, respectively). The tag-translation procedure of the Direct tagger replaces these tags by TT-------------, denoting particles.

(8)

     Czech            Russian
     Já bych spal.    Ja by spal.    ‘I would sleep.’
     Ty bys spal.     Ty by spal.    ‘You.sg would sleep.’
     On by spal.      On by spal.    ‘He would sleep.’

7.8 Dealing with data sparsity – tag decomposition

One of the problems when tagging with a large tagset is data sparsity; with 1,000 tags there are 1,000³ potential trigrams. It is very unlikely that a naturally occurring corpus will contain all the acceptable tag combinations with sufficient frequency to reliably distinguish them from the unacceptable combinations. However, not all morphological attributes are useful for predicting the attributes of the succeeding word (e.g. tense is not really useful for case).

In this section, we describe an experiment originally presented in Hana et al. (2004). To overcome data sparsity issues, we trained taggers on individual components of the full tag, in the hope that the reduced tagset of each such sub-tagger reduces data sparsity. Unfortunately, the method did not improve the results as we had hoped. It does increase the accuracy of the less effective taggers (e.g. Even from section 7.5 or a similar tagger described in the original paper), but not of those with higher accuracy. The results are still interesting for at least two reasons. First, they show that a smaller tagset does not necessarily lead to an increase in accuracy. Second, it is possible, and even likely, that the basic method can be modified in a way that would indeed lead to improved results.

7.8.1 Tag decomposition

We focus on six positions — POS (p), SubPOS (s), gender (g), number (n), case (c), and person (e). The selection of the slots is based on linguistic intuition. For example, because a typical Slavic NP has the structure (Det) A* N (NPgen) PP* (very similar to English), it is reasonable to assume that the information about part of speech and agreement features (gender, number, case) of previous words should help in the prediction of the same slots of the current word. Likewise, information about part of speech, case and person should assist in determining person (finite verbs agree with the subject, subjects are usually in the nominative). On the other hand, the combination of tense and case is prima facie unlikely to be of much use for prediction. Indeed, most of the expectations are confirmed in the results.

The performance of some of the models on the Russian development corpus is summarized in Tables 7.6, 7.7, and 7.8. All models are based on the Russified tagger (see section 7.7), with the full-tag tagger being identical to it. The numbers marked by an asterisk indicate instances in which the sub-tagger outperforms the full-tag tagger. As can be seen, all the taggers trained on individual positions are worse than the full-tag tagger on those positions.


Table 7.6. Russian tagger performance trained on individual slots vs. tagger performance trained on the full tag

              full tag   POS    SubPOS   gender   number   case
1 (POS)       92.2       92.0   –        –        –        –
2 (SubPOS)    91.3       –      90.1     –        –        –
3 (gender)    89.9       –      –        89.4     –        –
4 (number)    94.1       –      –        –        92.1     –
5 (case)      87.2       –      –        –        –        82.6

Table 7.7. Russian tagger performance trained on the combination of two features vs. tagger performance trained on the full tag

              full tag  POS+case  gender+case  gender+negation  number+case  case+person  case+tense
1 (POS)       92.2      91.9      –            –                –            –            –
2 (SubPOS)    91.3      –         –            –                –            –            –
3 (gender)    89.9      –         89.7         89.2             –            –            –
4 (number)    94.1      –         –            –                93.2         –            –
5 (case)      87.2      85.6      85.6         –                84.7         82.9         83.3
8 (person)    99.2      –         –            –                –            98.9         –
9 (tense)     98.6      –         –            –                –            –            98.4
11 (negation) 96.0      –         –            *96.3            –            –            –

Table 7.8. Russian tagger performance trained on the combination of three or four features vs. tagger performance trained on the full tag

              full tag  POS+gender+case  POS+number+case  SubPOS+gender+case  SubPOS+number+case  SubPOS+gender+number+case
1 (POS)       92.2      91.8             *92.3            *92.4               *92.5               *92.4
2 (SubPOS)    91.3      –                –                90.5                90.5                90.6
3 (gender)    89.9      89.6             –                89.6                –                   *90.2
4 (number)    94.1      –                94.0             –                   93.8                *94.3
5 (case)      87.2      86.3             *87.3            86.7                87.1                *87.6

This shows that a smaller tagset does not necessarily imply that tagging is easier (see Elworthy 1995, as well as the discussion in chapter 5). Similarly, there is no improvement from the combination of unrelated slots — case and tense, or gender and negation. However, combinations of (detailed) part-of-speech information with various agreement features (e.g. SubPOS, number, and case) outperform the full-tag tagger on at least some of the slots. All of the improvements are quite modest.

7.8.2 Combining sub-taggers

The next step is to put the sub-tags back together to produce estimates of the correct full tags, and to see how performance is affected. Simply combining the values offered by the best taggers for each slot is not possible because that could yield illegal tags (e.g. nouns in past tense). Instead, we let the taggers choose the best tag from the tags offered by the morphological analyzer.

There are many possible formulas that could be used. We used the formula in (9):

(9)  bestTag = argmax_{t ∈ T_MA} val(t)

where:
1. T_MA is the set of tags offered by the MA,
2. val(t) = ∑_{k=0}^{14} N_k(t)/N_k,
3. N_k(t) is the number of taggers voting for the k-th slot of t,
4. N_k is the total number of taggers on slot k.

This formula means that the best tag is the tag that receives the highest average percentage of votes for each of its slots. Weighting slots in the val function is also possible if certain slots are more important than others; however, we did not use this option.
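A minimal sketch of this voting scheme, assuming tags are positional strings and that the votes of the sub-taggers covering a slot have been collected per slot (the names and data layout are illustrative, not the actual implementation):

    def best_tag(ma_tags, votes_per_slot):
        """ma_tags: tags licensed by the morphological analyzer for the token.
        votes_per_slot: dict mapping a slot index to the list of values that the
        sub-taggers responsible for that slot have voted for."""
        def val(tag):
            score = 0.0
            for k, votes in votes_per_slot.items():
                if votes:                                            # N_k taggers on slot k
                    agreeing = sum(1 for v in votes if v == tag[k])  # N_k(t)
                    score += agreeing / len(votes)
            return score
        return max(ma_tags, key=val)

Leaving the slots unweighted corresponds to the unweighted val function described above.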

We ran a number of possible sub-tagger combinations, using 1–4 taggers for each slot. Unfortunately, none of the resulting taggers outperformed the Russified tagger, the tagger they are based on, on the full tag (although some did on some of the individual slots). As an example, Table 7.9 reports the performance of a system where the three best taggers for a particular slot vote on that slot. The better accuracy for a given criterion is marked by an asterisk. The voted tagger is clearly worse than the original tagger on all tokens (77.2% vs. 80.0%).

Table 7.9. Voted classifier

                 Russified (section 7.7)   sample voting tagger
All   Full tag   *80.0                      77.2
      SubPOS      92.3                      92.3
N     Full tag    57.1                      57.1
      SubPOS      89.3                     *89.9
A     Full tag   *55.9                      53.8
      SubPOS      86.9                      86.9
V     Full tag   *92.7                      82.8
      SubPOS      96.6                      96.6


Even though, intuitively, it seemed that the tagger decomposition approach should improve the overall performance of the system, our experiments have shown the opposite. One possible explanation is that the tag decomposition was based on our linguistic intuition, and it is unclear whether such a decomposition is optimal. We suggest exploring alternative tag decomposition techniques, such as the random decomposition used in error-correcting output coding (Dietterich and Bakiri 1991). This could shed interesting light on why the experiments described in this chapter were unsuccessful and how to further improve the tagging performance.

7.8.3 Tagger independence

In general, as discussed in chapter 2, classifier combination is sensible only if the classifiers are (relatively) independent. To evaluate the independence of the classifiers used, the complementarity rate analysis is employed (see Figure 7.1), as proposed by Brill and Wu (1998).

CR(A,B) = (1 − Errors(A+B) / Errors(A)) × 100%

If CR(A,B) = 100%, taggers A and B are independent.
If CR(A,B) = 0%, taggers A and B are totally dependent.

Figure 7.1. Complementarity rate analysis (Brill and Wu 1998)

The complementarity rate CR(A,B) between two classifiers A and B measures the percentage of the mistakes of A that are not made by B; Errors(A+B) is the number of errors common to both classifiers. Two classifiers that misclassify exactly the same instances have a complementarity rate of 0%. Two classifiers that always misclassify different instances have a complementarity rate of 100%. Table 7.10 summarizes the results. It shows that the subtaggers used here are relatively independent.
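A minimal sketch of this measure, assuming each tagger's errors are available as a set of the token positions it mislabeled (this representation is an assumption made for illustration):

    def complementarity_rate(errors_a, errors_b):
        """Percentage of A's errors that B does not make (Figure 7.1)."""
        if not errors_a:
            return 0.0
        common = len(set(errors_a) & set(errors_b))
        return (1 - common / len(errors_a)) * 100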

7.9 Results on test corpus

In the previous sections, we tuned our method using feedback from the development corpus. Now we are ready to report the results of the best method, the Russified tagger, a tagger that uses transitions trained on the “russified” Czech corpus and emissions created by merging the output of the Russian analyzer with Czech emissions using cognates (see the last column of Table 7.11). For comparison, we present the accuracy of the other taggers as well. The results are very similar to the results on the development corpus.

The overall accuracy of the Russified tagger is about 80%. As with the other taggers, the accuracy for verbs is quite high (93.9%), but the accuracy for nouns and adjectives is rather low: 62.1% and 65.8%, respectively.


Table 7.10. Complementarity rate of subtaggers (see section 7.8.3)

             all  p   s   g   n   c   pc  gc  ga  nc  ce  ct  pgc pnc psgc psnc psgnc psgnce psgncfme psgncfmetdav
all          0    74  69  72  86  62  48  41  68  62  61  58  28  46  20   41   14    12     11       5
p            29   0   -   -   -   -   16  -   -   -   -   -   14  23  27   32   30    29     29       29
s            33   -   0   -   -   -   -   -   -   -   -   -   -   -   24   27   29    27     26       33
g            36   -   -   0   -   -   -   32  6   -   -   -   29  -   29   -    34    34     34       35
n            52   -   -   -   0   -   -   -   -   37  -   -   -   44  -    50   51    51     50       52
c            51   -   -   -   -   0   39  37  -   36  12  15  49  47  54   50   53    53     52       50
pc           30   59  -   -   -   36  0   44  -   47  36  37  20  23  30   25   31    31     31       30
gc           30   -   -   65  -   41  50  0   63  45  43  41  24  53  27   53   29    29     29       29
ga           35   -   -   19  -   -   -   38  0   -   -   -   38  -   39   -    43    43     42       35
nc           50   -   -   -   75  33  47  39  -   0   37  31  52  32  55   33   48    48     48       48
ce           49   -   -   -   -   11  37  38  -   38  0   20  46  48  54   49   52    50     50       48
ct           46   -   -   -   -   16  40  38  -   35  22  0   49  48  53   48   53    53     52       46
pgc          22   66  -   67  -   57  35  31  66  60  55  56  0   41  15   45   22    22     22       22
pnc          32   65  -   -   79  48  27  51  -   35  49  48  31  0   37   18   32    32     32       31
psgc         16   72  62  68  -   62  44  36  68  64  62  61  18  48  0    39   11    11     11       15
psnc         28   70  58  -   82  52  32  52  -   39  52  50  39  21  29   0    24    24     24       26
psgnc        8    73  64  70  85  61  44  37  69  59  60  60  23  43  10   33   0     1      1        6
psgnce       7    73  64  70  85  61  45  38  70  59  60  60  24  43  11   34   3     0      1        6
psgncfme     7    73  64  70  85  61  46  38  70  59  60  60  25  44  12   35   3     1      0        5
psgncfmetdav 2    74  68  71  86  60  46  39  66  60  59  56  26  44  17   38   9     8      6        0

Table 7.11. Overview of results on the test corpus

               expectations              direct           even    cog     russif
emissions      ru      ru      cz        cz      cz       MA      cog     cog
transitions    ru      cz      ru        cz      cz       cz      cz      czru
All  Full tag  70.5    71.0    40.2      37.9    45.6     77.6    79.3    79.7
     SubPOS    82.5    83.4    55.4      55.7    62.3     90.4    91.4    91.3
N    Full tag  49.9    51.5    22.8      19.8    36.7     59.6    61.2    62.1
     SubPOS    77.5    80.2    64.5      68.3    81.9     89.5    89.8    89.8
A    Full tag  54.4    54.7    31.0      20.8    18.9     62.5    64.7    65.8
     SubPOS    74.9    77.1    46.6      36.9    36.1     86.5    86.8    86.8
V    Full tag  85.4    84.0    32.8      28.9    44.1     93.0    93.2    93.9
     SubPOS    89.5    87.1    40.6      40.8    54.3     95.5    95.7    95.7


In other words, one out of three nominal tags is incorrect. However, in most cases the error is caused by incorrect information in just one or two tag slots, while the other slots are correct. In Table 7.12, we report the performance of the tagger on the individual positions of the tag. Indeed, the table shows that, excluding the gender and case features, the overall accuracy is above 90% and the accuracy for nouns and adjectives is above 80%. In the case of verbs, all categories are predicted with an accuracy over 95%. These tagging results might be helpful for NLP applications that do not require knowledge of all 13 morpho-syntactic features of the full tag.

Table 7.12. Detailed results obtained with the Russified tagger

                 All     N       A       V
Full Tag         79.7    62.1    65.8    93.9
POS          P   92.2    89.8    90.6    96.9
SubPOS       S   91.3    89.8    86.8    95.7
Gender       g   89.9    76.4    90.0    97.7
Number       n   94.1    89.2    96.8    96.7
Case         c   87.2    74.6    80.9    97.1
Poss Gender  f   99.8    100.0   99.7    100.0
Poss Number  m   99.8    100.0   100.0   100.0
Person       e   99.2    97.8    100.0   99.0
Tense        t   98.6    97.0    97.8    97.1
Grade        d   97.6    97.8    91.9    99.4
Negation     a   96.0    93.9    94.1    96.9
Voice        v   99.5    100.0   94.9    100.0
Var          i   99.6    99.1    98.7    100.0

The obvious question is: Is it worth it? Is it not possible to get the same or better results by annotating a small corpus of Russian and training a tagger in the traditional way? To answer this question, we trained a tagger on the full development corpus (about 2,000 tokens); it differs from the Rus-Rus tagger in the first column only in the size of the training data. Table 7.13 compares it with our best tagger. This baseline is worse than the Russified tagger, with an accuracy of 73.5% vs. 79.7% on full tags for all tokens. This corresponds to about a 23% relative decrease in error.

In fact, it is possible to get even better results by combining these two taggers. Since we did not have a tuning corpus to tune the parameters of such a combination, we decided to use the transitions from the Russified tagger and to combine the lexicons of both taggers, giving them equal weight. As the last column of Table 7.13 shows, this is the best tagger.
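A minimal sketch of such an equal-weight combination, assuming each emission lexicon is stored as a dictionary mapping a word form to a distribution over tags (the representation is an assumption made for illustration):

    def merge_emissions(lex_a, lex_b):
        """Equal-weight combination of two emission lexicons {word: {tag: prob}}.
        Words known to only one lexicon keep that lexicon's distribution."""
        merged = {}
        for word in set(lex_a) | set(lex_b):
            da, db = lex_a.get(word), lex_b.get(word)
            if da and db:
                merged[word] = {t: 0.5 * da.get(t, 0.0) + 0.5 * db.get(t, 0.0)
                                for t in set(da) | set(db)}
            else:
                merged[word] = dict(da or db)
        return merged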


Table 7.13. Comparison with the traditional approach and combination with the traditional approach

                 baseline   our best       combination
                 2K         russif         2k+russif
transitions      ru         czru           czru
emissions        ru         cog(MA+cz)     Rus+cog(MA+cz)
All   Full tag   73.5       79.7           81.3
      SubPOS     84.9       91.3           92.2
N     Full tag   53.5       62.1           65.5
      SubPOS     80.2       89.8           91.2
A     Full tag   55.8       65.8           68.5
      SubPOS     75.7       86.8           87.1
V     Full tag   85.9       93.9           93.9
      SubPOS     89.1       95.7           95.7

7.10 Catalan

The previous sections discussed resource-light tagging for Russian. Tables 7.14 and 7.15 (analogous to Tables 7.11 and 7.13 for Russian) show the results that can be obtained by applying the same algorithms to Catalan.

Table 7.14. Catalan: Overview of results on the test corpus

               expectations              direct           even    cog     russif
emissions      ca      sp      ca        sp      sp       MA      cog     cog
transitions    ca      ca      sp        sp      sp       sp      sp      spca
All  Full tag  79.8    78.2    47.9      47.8    56.6     81.9    83.9    86.6
     SubPOS    82.4    81.0    55.1      55.0    62.9     86.4    87.7    90.6
N    Full tag  78.2    73.8    33.8      30.6    57.8     76.4    79.5    81.0
     SubPOS    83.1    79.8    44.5      52.1    74.9     83.1    85.0    87.1
A    Full tag  33.2    32.0    18.5      18.0    35.8     46.4    55.8    55.4
     SubPOS    41.2    38.7    27.3      28.3    51.2     78.1    81.2    80.8
V    Full tag  70.2    65.4    38.9      38.4    43.1     87.9    88.0    88.2
     SubPOS    74.4    70.5    63.2      42.2    50.1     92.7    92.7    92.9

While the tagging algorithms are the same for both languages, there are, obviously, some differences in the data used by these algorithms:

• For Catalan, the Catalan morphological analyzer described in section 6.4 is used.


Table 7.15. Catalan: Comparison with the traditional approach and combination with the traditional approach

                 baseline   our best       combination
                 2K         russif         2k+russif
emissions        ca         cog(MA+sp)     ca+cog(MA+sp)
transitions      ca         spca           spca
All   Full tag   84.0       86.6           87.1
      SubPOS     86.7       90.6           91.1
N     Full tag   83.3       81.0           81.5
      SubPOS     88.5       87.1           87.5
A     Full tag   43.2       55.4           55.7
      SubPOS     52.6       80.8           80.8
V     Full tag   79.9       88.2           87.9
      SubPOS     83.5       92.9           92.6

• There are only minor tag translations. On the one hand, this is because the tagsets are very similar; on the other hand, it is not obvious how the remaining differences should be handled.

  The Spanish and Catalan tagsets are far more similar than the Czech and Russian tagsets. They use not only the same slots but also the same values for those slots. However, some of the rules on the co-occurrence of values in tags are different. The differences are mostly in the different usage of wildcard and atomic values (e.g. C common gender vs. M masculine and F feminine gender). For example, in the case of possessive determiners in the 3rd person, Spanish distinguishes masculine, feminine and common gender (DPCS-0-3---, DPFS-0-3---, DPMS-0-3---), while Catalan has only masculine and feminine gender (DPFS-0-3---, DPMS-0-3---).

  If Spanish had only the F/M tags and Catalan the C tag, we could replace Spanish F/M by C. This is done when any gender of Czech plural adjectives is mapped to the any-gender wildcard X in Russian. If, on the other hand, Spanish had only the C tag and Catalan the F/M tags, we could translate all F/M tags in the output of the Catalan analyzer by C and then translate them back after tagging. Similarly, the masculine and neuter gender of certain Russian pronouns is translated to the masculine-or-neuter wildcard Z used in Czech and then translated back after tagging. However, in this case there is a significant loss of information. We would need syntactic analysis to determine whether Spanish DPCS-0-3--- should be translated to DPFS-0-3--- or DPMS-0-3---.

• There is only a single rule in the Russification (‘Catalanization’) transformation of the Spanish corpus: all Spanish articles la or el are replaced by l’ in front of vowels or h. Correspondingly, the tag loses the gender distinction. (A minimal sketch of this rule follows the list.)
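A minimal sketch of this single Catalanization rule, applied to (word, tag) pairs of the Spanish training corpus. The way the gender value is dropped here (blanking an assumed gender position of the tag) is an illustrative assumption; the actual tagset may express the lost distinction differently:

    import re

    GENDER_SLOT = 2   # assumed 0-based index of the gender position in the tag

    def catalanize(tokens):
        """tokens: list of (word, tag) pairs from the Spanish training corpus."""
        out = []
        for i, (word, tag) in enumerate(tokens):
            nxt = tokens[i + 1][0] if i + 1 < len(tokens) else ""
            if word.lower() in ("el", "la") and re.match(r"[aeiouáéíóúh]", nxt, re.I):
                # replace the article by l' and drop the gender distinction
                out.append(("l'", tag[:GENDER_SLOT] + "-" + tag[GENDER_SLOT + 1:]))
            else:
                out.append((word, tag))
        return out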

It is worth stressing that the only resource provided by a native speaker of Catalan is the development corpus. Apart from that, no speaker of any Romance language was involved.

Similarly to the experiments with Russian, cognates and ‘catalanizations’ improve the tagging result. The relatively large effect of the ‘catalanizations’ is rather surprising because of their limited scope. Recall also that the expectation experiments, described in section 7.4, showed only a small drop in accuracy when using non-native transitions.

7.11 Portuguese

Finally, Table 7.16 shows analogous results for Portuguese. We do not have a development corpus of Portuguese; therefore, all parameters (e.g. the distance between words to be considered cognates) are copied from the Catalan experiments. This also means that we cannot provide the results of taggers trained on the development corpus or a portion of it (expectation taggers, 2k, and combination with the 2k tagger).

Table 7.16. Portuguese: Overview of results on the test corpus

               direct   even    cog     russif
emissions      sp       MA      cog     cog
transitions    sp       sp      sp      sppor
All  Full tag  55.2     77.7    80.3    80.4
     SubPOS    60.5     81.9    83.7    83.8
N    Full tag  64.7     59.6    67.1    67.3
     SubPOS    76.3     69.5    74.8    75.1
A    Full tag  60.6     61.7    68.9    68.9
     SubPOS    70.6     71.1    76.1    76.1
V    Full tag  35.4     87.9    90.4    90.4
     SubPOS    39.9     94.9    95.5    95.5

7.12 Conclusion

This chapter discussed experiments with tagging Russian via Czech, and Catalan and Portuguese via Spanish. We described several experiments, including experiments where a source-language model was directly applied to the target language; where the target-language transitions were approximated by the source-language transitions and the target-language emissions were approximated by the output of the morphological analyzer; where the target-language emissions were approximated by a combination of source-target language cognates and the target-language morphological analyzer; and where the target-language transitions were approximated by writing simple linguistic transformation rules to make the structure of the source-language training data look more like the target language. We have also discussed experiments where we decompose the problem into sub-problems, train a battery of sub-taggers and combine them by simple voting. The results of our experiments suggest that the word order of Russian is approximated by Czech quite well. We can say the same about the Romance languages: Spanish is a good source language for tagging Portuguese and Catalan. We also think the approximation of the emissions by the morphological analyzer and the cognates is rather promising, but a more sophisticated method of cognate detection and transfer is still needed. Simple linguistic transformations work, but we do not want to expand the list of rules, since we want to remain resource- and labor-light. To our surprise, the voting of the subtaggers did not work as well as we expected. There are several reasons for that: the subtaggers participating in the voting are only relatively independent, and the tag decomposition was based on our linguistic intuition. It is possible that a more sophisticated method of subtagger selection would improve the tagging results. The next chapter lists the areas that need to be addressed to improve the performance of the system described in this book.


Chapter 8

Summary and further work

8.1 Summary of the book

This book has explored the portability of morpho-syntactic knowledge from one language to another, related language. A transfer system has been described which relies on only a small amount of manually created resources. The approach, tested on the Russian-Czech, Portuguese-Spanish, and Catalan-Spanish language pairs, has been shown to be successful.

Adaptation of the system to a new language pair can be done in a fraction of the time needed for systems with extensive, manually created resources: days instead of years. The following resources are required:

1. a reference grammar book (for information about paradigms and closed-class words),

2. a large amount of text (for learning a lexicon, e.g. newspapers from the Internet),

3. an annotated training corpus of a related source language,

4. optionally, a dictionary (or a native speaker) to provide analyses of the most frequent words.

The practical contribution of this book consists of developing and implementing a portable system which is both easily adaptable to new language pairs and resource-light. Finding effective ways to adapt a tagger trained on another language with similar linguistic properties can become the standard way of tagging languages for which large labeled corpora are not available. There are populations all over the world, speakers of different languages, who need access to information not readily available in their native language. Morpho-syntactic information is crucial for processing such languages (e.g. for creating machine translation systems).

From the theoretical perspective, this is one of the few studies that investigate the possibility of adapting the knowledge and resources of one inflectional language to process another related inflected language without the use of parallel corpora or bilingual lexicons. Some of the specific results presented in this book include the following:

1. Related languages possess differences and similarities which can be filtered out and used for technological benefit.

2. The transfer of morpho-syntactic information is possible without parallel corpora, which makes it very suitable for languages where parallel corpora are not available or easy to find.

3. Minimal encoding of morphology in paradigms is effective and beneficial for morphological analysis of inflected languages.

4. A large tagset does not necessarily result in a more demanding tagging task. In fact, the experiments with the subtaggers show that training on individual features reduces performance.

5. Word order of the target language can be approximated by the word order of a related language, which shows that inflected languages, which are claimed to be word-order free, are not as free in a real setting.

6. Lexical probabilities of the target language can be approximated by the lexical probabilities of the source language by using cognates.

8.2 Future work

There are many directions in which this line of research can develop. Some of these are sketched below.

8.2.1 Deepening scope

Morphological Analysis

The main effort here should focus on improving lexicon acquisition: (i) considering frequencies and contexts of word forms when eliminating incorrect hypotheses; (ii) replacing sequential application of heuristics with their weighted parallel combination; (iii) using information about common derivation patterns to extend the algorithm over several lemmas related by derivation and eliminating some of the systematic errors mentioned earlier. We are currently also exploring the possibilities of combining our approach with various machine learning techniques.

The performance of the morphological analyzer can be improved by directing human effort (either of a linguistic expert or a native speaker) into creating or improving a variety of resources (more paradigms, more specific phonological restrictions on paradigms, larger top-frequency word lists with full or partial tags, removing some incorrect candidates from an automatically acquired lexicon, ...). Our experience tells us that the ratio between return and cost varies significantly. For example, in section 6.3.2, we concluded that for the same price, word-lists provide a higher return than lexicons. We would like to perform a more formal and systematic comparison of the cost and impact of various possibilities.

Cognate Detection and Transfer

Our results suggest that the model that approximates emissions by cognates significantly outperforms the basic model for all the languages. This is already a useful result. But our experiments thus far only use a very simple algorithm for cognate identification, and the output of the algorithm is noisy. In our ongoing work, we are developing an algorithm that detects cognate stems instead of whole forms and generates word forms using the morphological analyzer. The identification of cognate stems should give more reliable cognate classes, but the next challenge is to map the generated target forms to the source forms. A more sophisticated algorithm for transferring cognates from the source language to the target is needed as well.
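For concreteness, a minimal sketch of the kind of surface-similarity pairing that such a simple cognate identification amounts to; the similarity measure and the threshold here are illustrative assumptions, not the exact settings used in the experiments reported above:

    from difflib import SequenceMatcher

    def cognate_pairs(source_words, target_words, threshold=0.8):
        """Pair each target word with its most similar source word, keeping only
        pairs whose surface similarity exceeds the threshold."""
        pairs = []
        for t in target_words:
            best = max(source_words, key=lambda s: SequenceMatcher(None, s, t).ratio())
            if SequenceMatcher(None, best, t).ratio() >= threshold:
                pairs.append((t, best))
        return pairs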

Alternative Methods of Tag Decomposition

In the experiments with training a battery of classifiers, we were driven by our linguistic intuition when we decomposed the positional tag. However, the question remains whether this tag decomposition is the optimal one. Currently, we are exploring alternative tag decomposition techniques, such as the random decomposition used in error-correcting output coding (Dietterich and Bakiri 1991). This could shed interesting light on whether the linguistically motivated decomposition is actually the one best supported by the data.

Moving to the Unsupervised Mode

Future work should include experiments with a number of machine learning approaches to make our method less dependent on labeled data and improve the performance. In particular, we think that methods used for domain adaptation need to be explored. So far, we have used generative models of learning (e.g., HMMs). These learning methods usually work best when their training and test data are drawn from the same distribution. In our case, not only are the training and the test data not similar, they are taken from different languages.

Blitzer et al. (2006) introduce a new method, structural correspondence learning (SCL), whose key idea is to identify correspondences among features from different domains by modeling their correlations with pivot features (i.e., features which behave in the same way for (discriminative) learning in both domains). They apply their method to the problem of POS tagging. In the situation when they have no labeled data of the target domain, they use as pivot features words that occur more than 50 times in both domains. They obtain a 19.1% relative reduction in error over the supervised baseline. We think that this method is applicable to the problem of cross-lingual transfer. In our case, our resource-light morphological analyzer needs to be used in order to identify cognate lemmas, which will be used later to generate pivot features.


Based on the detailed evaluation of the taggers that we have created, we can already say that certain POS and morpho-syntactic features are more reliable than others. The system performs extremely well on verbs and on certain types of nouns and pronouns. At the same time, our performance on adjectives (at least for the Slavic languages) still needs improvement. The same is true of various morpho-syntactic features: e.g., number is often much easier than case. Other machine learning ideas need to be explored, such as 1) reranking the tagger's output by generating n-best outputs and reorganizing them using more detailed features (see, e.g., Collins 2000; Charniak and Johnson 2005; McClosky et al. 2006), and 2) self-training, i.e. tagging unlabeled data and adding it to the training source-language corpus. This is inspired by the work on parsing (e.g., McClosky et al. 2006), which shows that adding millions of words of machine-parsed and reranked data improves the performance of the parser on the related test data.

8.2.2 Broadening scope and better insights

Quantifying Language Similarity and Language Pair Success

In our ongoing work, we are developing a method to quantify language similarity, which in turn should lead to an algorithm for ranking language pairs by their potential success in the morpho-syntactic transfer. Thus far, we have identified the following features for the similarity measure: the number of cognates with similar morpho-syntactic properties shared by the languages, shared word order recurrences, and the number of paradigms needed to encode the morphological properties of each language. The transfer success measure should include additional features, such as the availability of the labeled and unlabeled resources as well as their size, and the number and the type of 1) morpho-syntactic features available in the source language, and 2) morpho-syntactic features needed to be transferred to the target.

Alternative Evaluation Metrics

In this book, the accuracy of the system has been measured on the gold standard test data created by hand. This is important to get a realistic estimate of success. At the same time, in reality, gold standard corpora for resource-poor languages are rarely available. Therefore, alternative evaluation metrics are needed. The goal of this research was to provide methods for the rapid development of annotated resources. Clearly, given the present level of precision, for many applications post-processing will be necessary. This modification will require human intervention, but it is not immediately obvious how costly this intervention will be.

As an ad hoc measure of this cost, Feldman (2006) has proposed a measure that calculates the number of changes that would be required to transform the tagger's output into the desired gold standard tags. We expect that this will be useful as a basis for comparison between different language pairs, because it is an estimate of the cost of the labor that is needed, for instance, to rapidly deploy tools to analyze a suddenly critical language. We are also investigating the related problem of estimating the likely costs and benefits of applying our technology to a new language pair. Such a method would be most useful if it can be applied before committing to a substantial annotation effort. It is certainly possible to build a simple instance of such a measure on top of character-based language models, and to explore its effectiveness for the new task.
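A minimal sketch of one possible instantiation of such a correction-cost measure; counting changes at the level of individual tag positions is an assumption made for illustration, not necessarily the granularity used by Feldman (2006):

    def correction_cost(predicted_tags, gold_tags):
        """Number of changes needed to turn the tagger's output into the gold tags,
        counted over the individual positions of equal-length positional tags."""
        changes = 0
        for pred, gold in zip(predicted_tags, gold_tags):
            changes += sum(1 for p, g in zip(pred, gold) if p != g)
        return changes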

Other morpho-syntactic features

The tagging experiments described in this book used fine-grained morpho-syntactic tags developed for Slavic and Romance languages. One of the important morphological features omitted by these tagsets is the category of aspect. However, the aspectual information of verbs is encoded (at least to some extent) in the morphology of Slavic languages, mostly by prefixes. The resource-light morphological analyzer used in this book does not handle aspectual information. Yet some of the source-language tagsets (e.g. the Czech MULTEXT-EAST tagset) annotate words with aspectual features as well.

Opinions differ as to whether Slavic aspect is to be treated as an inflectional or a derivational category. Whereas most linguists more or less confidently prefer to categorize Russian aspect as a derivational category (Karcevski 1927; Růžička 1952; Dahl 1985; Bermel 1997), only very few claim aspect to be an inflectional category (e.g. Isacenko 1968). As a consequence, opinions also differ as to whether aspect should be included in the tagset that describes such languages.

One interesting future experiment would be to see to what extent aspectual information can be projected from Czech into Russian (without having it encoded in the paradigms of the morphological analyzer). If this is successful, it will mean that word order and aspect are interdependent, because the only source of aspectual information in the target language will be the transition information learned from the source language. A related experiment is described in Resnik (1996), where the author shows that distributionally derived selectional constraints help predict whether verbs can participate in a class of diathesis alternations, with aspectual properties of verbs clearly influencing the alternations of interest.

In general, the work in this book projects morpho-syntactic information either from a language that has a more detailed tagset (e.g. Czech to Russian) or from a language with a tagset that is as detailed as that of the target language (from Spanish to Catalan or Portuguese). However, an obvious question to ask is how the algorithm would perform in the case where the source-language tagset is not as detailed as that of the target language. This has not been explored in the experiments, but one possibility along these lines is to rely on the morphological analyzer of the target language in case the source language does not provide the relevant details.


Other annotation schemes

All experiments in this book used the same type of annotation scheme. No experiments have been run on other types of morphological annotation. However, it would be interesting to see how much the performance of the transfer system depends on the tag system used. In chapter 5, it was shown that a smaller tagset does not necessarily improve the performance of the system. However, systems that make other types of fine-grained morphological distinctions have not been experimented with. For example, the MULTEXT-EAST tagset does not have a separate tag for numerals, since their morphology is either nominal or adjectival, whereas the PDT tagset does make this distinction. At the same time, the MULTEXT-EAST tagset marks verbs for aspect, a category which is partially encoded in the verb morphology. Such experiments could quantify the extent to which certain morphological distinctions are due to lexical idiosyncrasy and the extent to which they depend on the annotation scheme used and the number of linguistic features shared by a language pair.

Other types of knowledge induction

This work can be extended to other types of annotation. For example, the Czech corpus is also annotated with syntactic (dependency) trees. Exploring the possibility of projecting this information into Russian, without using parallel corpora, is another avenue of research to pursue.

Other inflected languages

The experiments described in this book deal with inflected (fusional) languages. Most Indo-European languages are fusional languages. Even within this family, if this approach works, many new annotated resources could be created. Another family of inflected (fusional) languages is the Semitic group. The main property of these languages is a non-linear (templatic) structure of morphology. It would be interesting to investigate to what extent this type of morphology plays a crucial role in inducing morpho-syntactic features. It could be hypothesized that, by omitting vocalizations (diacritics), Semitic languages are relatively linear in writing and their morphology can be encoded in paradigms. Future experiments will test this hypothesis.

Another interesting future experiment is to explore a pair of languages which belong to different language families but share a number of syntactic and lexical properties (e.g. Bulgarian and Greek or Bulgarian and Romanian). Even though the experiments described in chapter 7 dealt with pairs of languages within the same language family (i.e. Slavic and Romance), the method proposed in this book does not depend on the nature of the relationship.


Language transfer in language acquisition

Even though this book deals with transferring linguistic knowledge from one language to another for computational purposes, it raises a broader spectrum of questions, such as how cross-linguistic overlap affects second language learning, what the general role of first language (L1) transfer is in Second Language Acquisition (SLA), and what a computational model of second language (L2) acquisition looks like.

This book identifies a number of features which are useful for projecting morpho-syntactic information from one inflected language into another related language. Exploring what information helps language learners guess the meaning and grammatical function of an L2 word correctly and what information is misleading is a possible extension of this book. This is important both for refining the method of automatic cognate detection and for improving the performance of the cross-language annotation system. Such research can also shed light on the process of L1 transfer in SLA, explore what effect cognate pairing techniques have on the L2 acquisition process, discover which learning strategies are important for acquiring L2, and perhaps get closer to identifying the mental processes involved in constructing and using an interlanguage.


Bibliography

Agirre, E., A. Atutxa, K. Gojenola, and K. Sarasola (2004). Exploring Portability of Syntactic Information from English. In Proceedings of Language Resources and Evaluation Conference (LREC), Lisbon, Portugal.

Agirre, E., A. Atutxa, K. Gojenola, K. Sarasola, and D. Terrón (2005). PP Attachment for Basque Based on English Parses. In Proceedings of International Cross-Language Knowledge Induction Workshop held as part of the Eurolan 2005 Summer School, Babes-Bolyai University, Cluj-Napoca, Romania.

Baker, C. F., C. J. Fillmore, and J. B. Lowe (1998). The Berkeley FrameNet Project. In Proceedings of 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL), pp. 86–90.

Baroni, M., J. Matiasek, and H. Trost (2002). Unsupervised Discovery of Morphologically Related Words Based on Orthographic and Semantic Similarity. In M. Maxwell (Ed.), Proceedings of the Workshop on Morphological and Phonological Learning of ACL/SIGPHON-2002, pp. 48–57.

Belkin, M. and J. A. Goldsmith (2002). Using Eigenvectors of the Bigram Graph to Infer Morpheme Identity. In M. Maxwell (Ed.), Proceedings of the Workshop on Morphological and Phonological Learning of ACL/SIGPHON-2002, pp. 42–47.

Bémová, A., J. Hajic, B. Hladká, and J. Panevová (1999). Morphological and Syntactic Tagging of the Prague Dependency Treebank. In Proceedings of Association pour le Traitement Automatique des Langues (ATALA) Workshop, pp. 21–29. Paris, France.

Bermel, N. (1997). Context and the Lexicon in the Development of Russian Aspect. Berkeley: University of California Press.

Bick, E. (2000). The Parsing System PALAVRAS: Automatic Grammatical Analysis of Portuguese in a Constraint-Grammar Framework. Ph. D. thesis, University of Aarhus, DK.


Blitzer, J., R. McDonald, and F. Pereira (2006). Domain Adaptation with Structural Correspondence Learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 120–128.

Böhmová, A., J. Hajic, E. Hajicová, and B. Hladká (2001). The Prague Dependency Treebank: Three-Level Annotation Scenario. In A. Abeillé (Ed.), Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers.

Borin, L. (1999). Enhancing Tagging Performance by Combining Knowledge Sources. In Papers from the Association Suédoise de Linguistique Appliquée (ASLA) Symposium on Corpora in Research and Teaching, Växjö Universitet, Växjö, pp. 19–31.

Borin, L. (2000). Something Borrowed, Something Blue: Rule-based Combination of POS Taggers. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, pp. 21–26.

Borin, L. (2002). Alignment and Tagging. In L. Borin (Ed.), Parallel Corpora, Parallel Words, pp. 207–218. Amsterdam: Rodopi.

Borin, L. (2003). Language Technology Resources for Less Prevalent Languages. Nordic Language Technology, 71–82.

Brants, T. (2000). TnT - A Statistical Part-of-Speech Tagger. In Proceedings of 6th Applied Natural Language Processing Conference and North American chapter of the Association for Computational Linguistics annual meeting (ANLP-NAACL), pp. 224–231.

Breiman, L. (1996). Bagging Predictors. Machine Learning 24(2), 123–140.

Brent, M. R. (1994). From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. Computational Linguistics 19(2), 243–262.

Brent, M. R. (1999). An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning 34(1-3), 71–105.

Brill, E. (1995). Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics 21(4), 543–565.

Brill, E. (1999). A Closer Look at the Automatic Induction of Linguistic Knowledge. In Learning Language in Logic, pp. 49–56.

Brill, E. and J. Wu (1998). Classifier Combination for Improved Lexical Disambiguation. In Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pp. 191–195.


Brown, P. F., J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. Roossin (1990). A Statistical Approach to Machine Translation. Computational Linguistics 16, 79–85.

Brown, P. F., S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer (1993). The Mathematics of Machine Translation: Parameter Estimation. Computational Linguistics 19(2), 263–311.

Brun, D. (2001). Information Structure and the Status of NP in Russian. Theoretical Linguistics 27(2/3), 109–136.

Carlberger, J. and V. Kann (1999). Implementing an Efficient Part-of-speech Tagger. Software – Practice and Experience 29(9), 815–832.

Carrasco, R. M. and A. Gelbukh (2003). Evaluation of TnT Tagger for Spanish. In Proceedings of the Fourth Mexican International Conference on Computer Science, Instituto Tecnologico de Puebla, Mexico, pp. 18–25.

Carreras, X., L. Màrquez, and L. Padró (2003). Named Entity Recognition for Catalan Using Only Spanish Resources and Unlabelled Data. In Proceedings of 8th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 43–50.

Cavestro, B. and N. Cancedda (2005). Literality Based Sample Sorting for Syntax Projection. In Proceedings of International Cross-Language Knowledge Induction Workshop held as part of the Eurolan 2005 Summer School, Babes-Bolyai University, Cluj-Napoca, Romania.

Charniak, E. and M. Johnson (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 2005 Meeting of the Association for Computational Linguistics (ACL), pp. 173–180.

Chen, S. (1993). Aligning Sentences in Bilingual Corpora Using Lexical Information. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), Columbus, Ohio, pp. 9–16.

Chen, S. F. and J. T. Goodman (1996). An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL), Santa Cruz, CA, pp. 310–318.

Church, K. W. (1988). A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, Texas, pp. 136–143.

Civit, M. (2000). Guía para la anotación morfológica del corpus CLiC-TALP (Versión 3). Technical Report WP-00/06, X-Tract Working Paper. Centre de Llenguatge i Computació (CLiC), Barcelona, Catalunya.


Clark, A. (2001). Learning Morphology with Pair Hidden Markov Models. In Proceedings of the Student Workshop at the 39th Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, France, pp. 55–60.

Clark, S., J. R. Curran, and M. Osborne (2003). Bootstrapping POS Taggers Using Unlabelled Data. In W. Daelemans and M. Osborne (Eds.), Proceedings of the 7th Conference on Natural Language Learning (NLL), pp. 49–55. Edmonton, Canada.

Cloeren, J. (1993). Toward a cross-linguistic tagset. In Workshop On Very Large Corpora: Academic And Industrial Perspectives.

Collins, M. (2000). Discriminative Reranking for Natural Language Processing. In Proceedings of the Seventeenth International Conference (ICML 2000), Stanford, California, pp. 175–182.

Comrie, B. and G. G. Corbett (Eds.) (2002). The Slavonic Languages. London; New York: Routledge.

Creutz, M. (2003). Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan, pp. 280–287.

Creutz, M. and K. Lagus (2002). Unsupervised Discovery of Morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of the Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, Pennsylvania, USA, pp. 21–30.

Cucerzan, S. and D. Yarowsky (1999). Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing (EMNLP) and Very Large Corpora (VLC), University of Maryland, USA, pp. 90–99.

Cucerzan, S. and D. Yarowsky (2000). Language Independent Minimally Supervised Induction of Lexical Probabilities. In Proceedings of the 38th Meeting of the Association for Computational Linguistics (ACL), Hong Kong, pp. 270–277.

Cucerzan, S. and D. Yarowsky (2002). Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL), pp. 132–138. Taipei, Taiwan.

Cunha, C. and L. F. L. Cintra (2001). Nova Gramática do Português Contemporâneo. Rio de Janeiro, Brazil: Nova Fronteira.


Curran, J. R. and S. Clark (2003). Investigating GIS and Smoothing for Maximum Entropy Taggers. In Proceedings of the 11th Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL), Budapest, Hungary, pp. 91–98.

Cutting, D., J. Kupiec, J. Pedersen, and P. Sibun (1992). A Practical Part-of-speech Tagger. In Proceedings of the Third Conference on Applied Natural Language Processing (ANLP), pp. 133–140. Association for Computational Linguistics.

Daelemans, W., A. van den Bosch, and J. Zavrel (1999). Forgetting Exceptions is Harmful in Language Learning. Machine Learning 34, 11–43.

Daelemans, W., J. Zavrel, and S. Berck (1996). MBT: A Memory-based Part of Speech Tagger-Generator. In Proceedings of the Fourth Workshop on Very Large Corpora (VLC), pp. 14–27.

Daelemans, W., J. Zavrel, K. van der Sloot, and A. van den Bosch (2001). TIMBL: Tilburg Memory-based Learner – Version 4.0, Reference Guide.

Dagan, I. (1990). Two Languages Are Better Than One. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL), Pittsburgh, Pennsylvania, USA, pp. 130–137.

Dagan, I. and K. W. Church (1994). Termight: Identifying and Translating Technical Terminology. In Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany, pp. 34–40.

Dagan, I., K. W. Church, and W. A. Gale (1993). Robust Bilingual Word Alignment for Machine Aided Translation. In Proceedings of the Workshop on Very Large Corpora (VLC): Academic and Industrial Perspectives, Stuttgart, Germany, pp. 1–8.

Dagan, I. and A. Itai (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus. Computational Linguistics 20(4), 563–596.

Dahl, O. (1985). Tense and Aspect Systems. Basil Blackwell, Oxford.

de Marcken, C. (1995). Acquiring a Lexicon from Unsegmented Speech. In 33rd Annual Meeting of the Association for Computational Linguistics (ACL), Cambridge, Massachusetts, USA, pp. 311–313.

Debowski, Ł. (2004). Trigram Morphosyntactic Tagger for Polish. In Proceedings from the Intelligent Information Processing and Web Mining conference, Advances in Soft Computing, pp. 409–413. Springer-Verlag.

den Boogaart, P. U. (1975). Woordfrequenties in geschreven en gesproken Nederlands. Utrecht: Oosthoek, Scheltema and Holkema.


Derksen, R. (2008). Etymological Dictionary of the Slavic Inherited Lexicon. Number 4 in Leiden Indo-European Etymological Dictionary Series. Brill Press.

DeRose, S. J. (1988). Grammatical Category Disambiguation by Statistical Optimization. Computational Linguistics 14(1), 31–39.

Dien, D. and H. Kiem (2003). POS-Tagger for English-Vietnamese Bilingual Corpus. In R. Mihalcea and T. Pedersen (Eds.), Human Language Technology and North-American Chapter of the Association for Computational Linguistics (HLT-NAACL). Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, Alberta, Canada, pp. 88–95. Association for Computational Linguistics.

Dietterich, T. (1997). Machine Learning Research: Four Current Directions. AI Magazine 18(4), 97–136.

Dietterich, T. G. and G. Bakiri (1991). Error-correcting output codes: a general method for improving multiclass inductive learning programs. In T. L. Dean and K. McKeown (Eds.), Proceedings of the Ninth AAAI National Conference on Artificial Intelligence, Menlo Park, CA, pp. 572–577. AAAI Press.

Džeroski, S., T. Erjavec, and J. Zavrel (1999). Morphosyntactic Tagging of Slovene: Evaluating PoS Taggers and Tagsets. Technical report, Jožef Stefan Institute Research Report IJS-DP 8018, Ljubljana, Slovenia.

Džeroski, S., T. Erjavec, and J. Zavrel (2000). Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, pp. 1099–1104.

Ejerhed, E. and G. Källgren (1997). Stockholm Umeå Corpus version 1.0, SUC 1.0. Technical report, Department of Linguistics, Umeå University, Stockholm, Sweden.

Elworthy, D. (1995). Tagset Design and Inflected Languages. In 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL), From Texts to Tags: Issues in Multilingual Language Analysis SIGDAT Workshop, Dublin, pp. 1–10.

Erjavec, T. (2004). MULTEXT-EAST Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04, ELRA, Paris, France, pp. 1535–1538.

Feldman, A. (2006). Portable Language Technology: A Resource-light Approach to Morpho-syntactic Tagging. Ph. D. thesis, The Ohio State University.


Feldman, A., J. Hana, and C. Brew (2005). Buy One, Get One Free or What to Do When Your Linguistic Resources are Limited. In Proceedings of the Third International Seminar on Computer Treatment of Slavic and East-European Languages (Slovko), Bratislava, Slovakia.

Feldman, A., J. Hana, and C. Brew (2006). A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.

Freund, Y. and R. Schapire (1996). Experiments with a New Boosting Algorithm. In L. Saitta (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning, San Francisco, CA, pp. 148–156.

Fronek, J. (1999). English-Czech/Czech-English Dictionary. Praha: Leda. Contains an overview of Czech grammar.

Fung, P. (1998). A Statistical View on Bilingual Lexicon Extraction: from Parallel Corpora to Non-parallel Corpora. In D. Farwell, L. Gerber, and E. Hovy (Eds.), Third Conference of the Association for Machine Translation in the Americas, pp. 1–16. Springer-Verlag.

Fung, P. and K. W. Church (1994). Kvec: A New Approach for Aligning Parallel Texts. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), Kyoto, Japan, pp. 1096–1102.

Fung, P. and Y. Y. Lo (1998). An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL), Montreal, Canada, pp. 414–420.

Fung, P. and K. McKeown (1997). A Technical Word- and Term-Translation Aid Using Noisy Parallel Corpora across Language Groups. Machine Translation 12(1-2), 53–87.

Gale, W. A. and K. W. Church (1991). Identifying Word Correspondences in Parallel Text. In Proceedings of the Fourth Darpa Workshop on Speech and Natural Language, pp. 152–157.

Gale, W. A., K. W. Church, and D. Yarowsky (1992a). Estimating Upper and Lower Bounds on the Performance of Word-sense Disambiguation Programs. In Proceedings of the 30th Meeting of the Association for Computational Linguistics (ACL), pp. 249–256.

Gale, W. A., K. W. Church, and D. Yarowsky (1992b). Using Bilingual Materials to Develop Word Sense Disambiguation Methods. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Montreal, Canada, pp. 101–112.


Gale, W. A., K. W. Church, and D. Yarowsky (1992c). Work on Statistical Methods for Word Sense Disambiguation. In Proceedings of American Association for Artificial Intelligence (AAAI). Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA, pp. 54–60.

Gess, R. S. and D. L. Arteaga (2006). Historical Romance Linguistics: Retrospective and Perspectives. J. Benjamins.

Goldsmith, J. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27(2), 153–198.

Hajic, J. (2004). Disambiguation of Rich Inflection: Computational Morphology of Czech. Prague, Czech Republic: Karolinum, Charles University Press.

Hajic, J. and B. Hladká (1998a). Czech Language Processing — POS Tagging. In Proceedings of the First Conference on Language Resources and Evaluation (LREC), Granada, Spain, pp. 931–936.

Hajic, J. and B. Hladká (1998b). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Proceedings of the Conference (COLING-ACL), Montreal, Canada, pp. 483–490.

Hajic, J., P. Krbec, P. Kveton, K. Oliva, and V. Petkevic (2001). Serial Combination of Rules and Statistics: A Case Study in Czech Tagging. In Proceedings of Association for Computational Linguistics (ACL) Conference, Toulouse, France, pp. 260–267.

Hana, J. (2007). Czech Clitics in Higher Order Grammar. Ph. D. thesis, The Ohio State University.

Hana, J. and P. W. Culicover (2008). Morphological complexity outside of universal grammar. OSUWPL 58, 85–109.

Hana, J., A. Feldman, L. Amaral, and C. Brew (2006). Tagging Portuguese with a Spanish Tagger Using Cognates. In Proceedings of the Workshop on Cross-language Knowledge Induction hosted in conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 33–40.

Hana, J., A. Feldman, and C. Brew (2004). A Resource-light Approach to Russian Morphology: Tagging Russian Using Czech Resources. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), Barcelona, Spain, pp. 222–229.

Hansen, L. and P. Salamon (1990). Neural Network Ensembles. In Institute of Electrical and Electronics Engineers (IEEE) Transactions. Pattern Analysis and Machine Intelligence, pp. 993–1001. Washington, DC, USA.


Hladká, B. (2000). Czech Language Tagging. Ph. D. thesis, Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics (MFF), Charles University (UK), Prague, Czech Republic.

Hlavácová, J. (2001). Morphological Guesser of Czech Words. In V. Matoušek (Ed.), Text, Speech and Dialogue, Lecture Notes in Computer Science, pp. 70–75. Berlin: Springer-Verlag.

Hwa, R., P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak (2004). Bootstrapping Parsers via Syntactic Projection Across Parallel Texts. Natural Language Engineering 1(1), 1–15.

Isacenko, A. V. (1968). Die russische Sprache der Gegenwart. Halle-Saale: Niemeyer.

ISO-9 (1995). Information and Documentation – Transliteration of Cyrillic Characters into Latin Characters – Slavic and non-Slavic Languages. International Organization for Standardization.

Jelinek, F. (1985). Markov Source Modeling of Text Generation. In F. K. Skwirzinski (Ed.), Impact of Processing Techniques on Communication.

Johansson, S. (1986). The Tagged LOB Corpus: User's Manual. Bergen, Norway: Norwegian Computing Centre for the Humanities.

Johnson, H. and J. D. Martin (2003). Unsupervised Learning of Morphology for English and Inuktitut. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Edmonton, Alberta, Canada.

Karcevski, S. (1927). Système du Verbe Russe; Essai de Linguistique Synchronique. Prague: Plamja.

Karlík, P., M. Nekula, and Z. Rusínová (1996). Prírucní mluvnice ceštiny [Concise Grammar of Czech]. Praha: Nakladatelství Lidové Noviny.

Kay, M. and M. Röscheisen (1993). Text-translation Alignment. Computational Linguistics 19(1), 121–142.

Koskenniemi, K. (1983). Two-level model for morphological analysis. In IJCAI-83, Karlsruhe, Germany, pp. 683–685.

Koskenniemi, K. (1984). A general computational model for word-form recognition and production. In COLING-84, Stanford University, California, USA, pp. 178–181. Association for Computational Linguistics.

Krotov, A., M. Hepple, R. Gaizauskas, and Y. Wilks (1999). Evaluating Two Methods for Treebank Grammar Compaction. Natural Language Engineering 5(4), 377–394.


Kucera, H. and W. Francis (1967). Computational Analysis of Present-Day American English. Providence, R.I.: Brown University Press.

Kupiec, J. (1993). An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL), Columbus, OH, pp. 17–22.

Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Cybernetics and Control Theory 10(8), 707–710.

Lezius, W., R. Rapp, and M. Wettler (1998). A Freely Available Morphological Analyzer, Disambiguator, and Context Sensitive Lemmatizer for German. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL), Montreal, Quebec, Canada.

Mann, G. S. and D. Yarowsky (2001). Multipath Translation Lexicon Induction via Bridge Languages. In Proceedings of the Second Meeting of the North-American Association for Computational Linguistics (NAACL), Pittsburgh, PA.

Marcus, M., B. Santorini, and M. A. Marcinkiewicz (1993a). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330.

Marcus, M., B. Santorini, and M. A. Marcinkiewicz (1993b). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330.

Mason, O. (1997). Qtag – A Portable Probabilistic Tagger. UK: University of Birmingham.

Maynard, D., V. Tablan, and H. Cunningham (2003). NE Recognition Without Training Data on a Language You Don't Speak. In M. B. Olsen (Ed.), Proceedings of the ACL Workshop on Multilingual and Mixed-language Named Entity Recognition, pp. 33–40.

McClosky, D., E. Charniak, and M. Johnson (2006). Reranking and self-training for parser adaptation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 337–344.

Megyesi, B. (1999). Improving Brill's POS Tagger for an Agglutinative Language. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC), University of Maryland, USA, pp. 275–284.

Melamed, D. (2000). Models of Translational Equivalence among Words. Computational Linguistics 26(2), 221–249.


Merialdo, B. (1994). Tagging English Text with a Probabilistic Model. Computational Linguistics 20(2), 155–171.

Meurers, W. D. (2005). On the Use of Electronic Corpora for Theoretical Linguistics. Case Studies from the Syntax of German. Lingua 115(11), 1619–1639. http://ling.osu.edu/~dm/papers/meurers-03.html.

Mikheev, A. (1997). Automatic rule induction for unknown word guessing. Computational Linguistics 23(3), 405–423.

Mikheev, A. and L. Liubushkina (1995). Russian Morphology: An Engineering Approach. Natural Language Engineering 3(1), 235–260.

Miller, G. A. (1990). WordNet: An On-Line Lexical Database. International Journal of Lexicography 3(4), 235–312.

Nakagawa, T., T. Kudo, and Y. Matsumoto (2002). Revision Learning and its Application to Part-of-Speech Tagging. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 497–504.

Neuvel, S. and S. A. Fulop (2002). Unsupervised Learning of Morphology without Morphemes. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Workshop on Unsupervised Learning in Natural Language Processing, University of Pennsylvania, Philadelphia, PA, pp. 31–40.

Ngai, G. and R. Florian (2001). Transformation-based Learning in the Fast Lane. In Proceedings of the Second Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Carnegie Mellon University, Pittsburgh, USA, pp. 1–8.

Ngai, G. and D. Yarowsky (2000). Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking. In Proceedings of the 38th Meeting of the Association for Computational Linguistics (ACL), pp. 117–125.

Nemec, P. (2004). Application of Artificial Neural Networks in Morphological Tagging of Czech. Association for Computing Machinery (ACM) Student Competition. Center for Computational Linguistics, Faculty of Mathematics and Physics (MFF), Charles University (UK). Prague, Czech Republic.

Och, F. J. and H. Ney (2000, October). Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), Hong Kong, China, pp. 440–447.

Orphanos, G. S. and D. N. Christodoulakis (1999). POS Disambiguation and Unknown Word Guessing with Decision Trees. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL), Bergen, Norway, pp. 134–141.


Orphanos, G. S., D. Kalles, A. Papagelis, and D. Christodoulakis (1999). Decision Trees and NLP: A Case Study in POS Tagging. In Proceedings of Annual Conference on Artificial Intelligence (ACAI), Greece.

Padó, S. and M. Lapata (2005). Cross-lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet. In Proceedings of the Spring Symposia of the American Association for Artificial Intelligence (AAAI), pp. 1087–1092.

Parmanto, B., P. Munro, and H. Doyle (1996). Improving Committee Diagnosis with Resampling Techniques. In D. Touretzky, M. Mozer, and M. Hesselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8, pp. 882–888. Cambridge, MA: MIT Press.

Paul, D. B. and J. M. Baker (1992). The Design of the Wall Street Journal-based CSR corpus. In Proceedings of the ARPA Speech and Natural Language Workshop, pp. 357–362.

Pedersen, T., A. Kulkarni, R. Angheluta, Z. Kozareva, and T. Solorio (2006). Improving Name Discrimination: A Language Salad Approach. In Proceedings of the European Association of Computational Linguistics (EACL), Workshop on Cross-Language Knowledge Induction, Trento, Italy.

Perini, M. A. (2002). Modern Portuguese: A Reference Grammar. New Haven, Connecticut: Yale University Press.

Przepiórkowski, A. and M. Wolinski (2003). A Flexemic Tagset for Polish. In Proceedings of the Workshop on Morphological Processing of Slavic Languages held at the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 33–40.

Rapp, R. (1995). Identifying Word Translations in Non-parallel Texts. In Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL), Student Session, Boston, MA, pp. 321–322.

Ratnaparkhi, A. (1996). A Maximum Entropy Part-of-speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP) Conference, University of Pennsylvania, Philadelphia, USA, pp. 133–142.

Resnik, P. (1996). Selectional Constraints: An Information-Theoretic Model and its Computational Realization. Cognition 61, 127–159.

Resnik, P. (2004). Exploiting Hidden Meanings: Using Bilingual Text for Monolingual Annotation. In A. Gelbukh (Ed.), Lecture Notes in Computer Science 2945: Computational Linguistics and Intelligent Text Processing, Seoul, Korea, pp. 283–299. Springer-Verlag.


Ruimy, N., P. Bouillon, and B. Cartoni (2004). Semi-Automatic Derivation of a French Lexicon from CLIPS. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), Volume IV, Lisbon, Portugal, pp. 1099–1102.

Růžička, R. (1952). Der russische Verbalaspekt [Russian verbal aspect]. Der Russischunterricht (5), 161–169.

Samuelsson, C. (1993). Morphological Tagging Based Entirely on Bayesian Inference. In Proceedings of the 9th Nordic Conference on Computational Linguistics (NoDaLiDa), Stockholm, Sweden.

Schmid, H. (1994a). Part-of-Speech Tagging with Neural Networks. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), pp. 172–176.

Schmid, H. (1994b). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Stuttgart, Germany, pp. 44–49.

Schone, P. and D. Jurafsky (2000). Knowledge-Free Induction of Morphology Using Latent Semantic Analysis. In The 4th Conference on Computational Natural Language Learning and 2nd Learning Language in Logic Workshop, Lisbon, Portugal, pp. 67–72.

Schone, P. and D. Jurafsky (2002). Knowledge-Free Induction of Inflectional Morphologies. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL).

Schuetze, H. (1992). Dimensions of Meaning. In Proceedings of Supercomputing, Minneapolis, MN, pp. 787–796.

Schenker, A. M. (1995). The Dawn of Slavic. Yale University Press.

Sjöbergh, J. (2003a). Combining POS-taggers for Improved Accuracy on Swedish Text. In Proceedings of the 14th Nordic Conference on Computational Linguistics (NoDaLiDa).

Sjöbergh, J. (2003b). Stomp, a POS-tagger with a Different View. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov (Eds.), Proceedings of the Recent Advances in Natural Language Processing Conference (RANLP), Volume 260 of Current Issues in Linguistic Theory (CILT), pp. 54–60. Amsterdam: John Benjamins.

Skoumalová, H. (1997). A Czech Morphological Lexicon. In Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, Madrid, pp. 41–47. ACL.


Smadja, F. (1996). Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics 21(4), 1–38.

Smith, D. A. and N. A. Smith (2004). Bilingual Parsing with Factored Estimation: Using English to Parse Korean. In D. Lin and D. Wu (Eds.), Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, pp. 49–56.

Snyder, B. and R. Barzilay (2008a). Cross-lingual Propagation for Morphological Analysis. In Proceedings of AAAI 2008, pp. 848–854.

Snyder, B. and R. Barzilay (2008b). Unsupervised Multilingual Learning for Morphological Segmentation. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 737–745. Association for Computational Linguistics.

Snyder, B., T. Naseem, J. Eisenstein, and R. Barzilay (2008). Unsupervised Multilingual Learning for POS Tagging. In Proceedings of EMNLP 2008, pp. 334–343.

Solorio, T. and A. L. López (2005). Learning Named Entity Recognition in Portuguese from Spanish. In A. F. Gelbukh (Ed.), Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing), Volume 2945 of Lecture Notes in Computer Science, Mexico City, Mexico, pp. 762–768. Springer-Verlag.

Spoustová, D., J. Hajic, J. Votrubec, and P. Krbec (2007, June). The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In Balto-Slavonic Natural Language Processing 2007, Prague, pp. 67–74. Association for Computational Linguistics.

Tanaka, K. and H. Iwasaki (1996). Extraction of Lexical Translations from Non-aligned Corpora. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, pp. 580–585.

Theron, P. and I. Cloete (1997). Automatic Acquisition of Two-Level Morphological Rules. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP), Washington, DC, pp. 103–110.

Tsang, V. (2001). Second Language Information Transfer in Automatic Verb Classification. Master's thesis, Department of Computer Science, University of Toronto, Toronto, Canada.

Tsang, V., S. Stevenson, and P. Merlo (2002). Crosslinguistic Transfer in Automatic Verb Classification. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Taipei, Taiwan, pp. 1023–1029.

van Halteren, H., J. Zavrel, and W. Daelemans (1998). Improving Data-driven Wordclass Tagging by System Combination. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL-COLING), pp. 491–497.


van Halteren, H., J. Zavrel, and W. Daelemans (2001). Improving Accuracy in Word-class Tagging through Combination of Machine Learning Systems. Computational Linguistics 27(2), 199–230.

van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths.

Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley.

Viterbi, A. (1967). Error Bounds for Convolutional Codes and an Asymptotically Optimal Decoding Algorithm. In Institute of Electrical and Electronics Engineers (IEEE) Transactions on Information Theory, Volume 13, pp. 260–269.

Wade, T. (1992). A Comprehensive Russian Grammar. Oxford; Malden, MA: Blackwell Publishers. 582 pp.

Weischedel, R., M. Meteer, R. Schwartz, L. Ramshaw, and J. Palmucci (1993). Coping with Ambiguity and Unknown Words through Probabilistic Methods. Computational Linguistics 19(2), 361–382.

Wheeler, M. W., A. Yates, and N. Dols (1999). Catalan: A Comprehensive Grammar. London; New York: Routledge.

Wolpert, D. (1992). Stacked Generalization. Neural Networks, 241–260.

Wu, D. and X. Xia (1994). Learning an English-Chinese Lexicon from a Parallel Corpus. In Association for Machine Translation in the Americas (AMTA), Washington, DC, USA, pp. 206–213.

Yarowsky, D. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Conference of the Association for Computational Linguistics (ACL), pp. 189–196.

Yarowsky, D. and G. Ngai (2001). Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora. In Proceedings of the Second Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 200–207.

Yarowsky, D., G. Ngai, and R. Wicentowski (2001). Inducing Multilingual Text Analysis via Robust Projection across Aligned Corpora. In Proceedings of the First International Conference on Human Language Technology Research (HLT), pp. 161–168.

Yarowsky, D. and R. Wicentowski (2000). Minimally Supervised Morphological Analysis by Multimodal Alignment. In Proceedings of the 38th Meeting of the Association for Computational Linguistics (ACL), pp. 207–216.

Zemel, R. S. (1993). A Minimum Description Length Framework for Unsupervised Learning. Ph. D. thesis, Department of Computer Science, University of Toronto, Toronto, Canada.


Zipf, G. K. (1935). The Psychobiology of Language. Houghton-Mifflin.

Zipf, G. K. (1949). Human Behavior and the Principle of Least-Effort. Addison-Wesley.


Appendix A

Tagsets we use

In section 4.4, we discussed the tagsets we used in our experiments. In this appendix, we provide more technical details about the tagsets: possible values for individual positions, restrictions on co-occurrence of values and some examples.

The Czech tagset was developed by Jan Hajic (2004). The other tagsets were developed by us following the basic design principles of the Czech tagset. The Russian tagset was created from scratch, while the Romance tagsets are translations of existing compact tagsets into a positional system.

A.1 Czech tagset

The Czech tagset is described in detail in Hajic (2004). We provide only a brief overview to allow a comparison with the other tagsets. The Czech tags have 15 positions summarized in Table A.1; two of them are not used. Table A.2 summarizes possible values for each position. Note that not all combinations are possible. The Detailed-POS determines whether a particular position is applicable (e.g. nouns distinguish number, whereas interjections do not) and what values are possible (while nouns and pronouns both distinguish number, the value dual (D) is possible only for nouns). Hajic (2004) provides detailed co-occurrence tables.
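
Because the tags are fixed-width strings, they are easy to handle programmatically. The following minimal sketch (in Python; the position names follow Table A.1 below, and the example tag is the noun tag of nenavist' 'hate' used later in section A.2.4) splits a positional tag into named categories:

    # Positions of the Czech/Russian positional tagset (cf. Table A.1);
    # positions 13 and 14 are unused reserves.
    POSITIONS = [
        "POS", "SubPOS", "Gender", "Number", "Case",
        "PossGender", "PossNumber", "Person", "Tense",
        "Grade", "Negation", "Voice", "Reserve1", "Reserve2", "Var",
    ]

    def parse_tag(tag):
        """Map a 15-character positional tag to {category: value}; '-' means not applicable."""
        assert len(tag) == 15, "positional tags always have 15 characters"
        return dict(zip(POSITIONS, tag))

    # A feminine singular nominative, non-negated noun:
    print(parse_tag("NNFS1-----A----"))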


Table A.1. Positions of the Czech and Russian tagsets

Position  Abbreviation  Name        Description
1         p             POS         Part of Speech
2         s             SubPOS      Detailed Part of Speech
3         g             Gender      Gender
4         n             Number      Number
5         c             Case        Case
6         f             PossGender  Possessor's Gender
7         m             PossNumber  Possessor's Number
8         e             Person      Person
9         t             Tense       Tense
10        d             Grade       Degree of comparison
11        a             Negation    Negation
12        v             Voice       Voice
13                      Reserve1    Unused
14                      Reserve2    Unused
15        i             Var         Variant, Style, Register, Special Usage

Table A.2. Values of individual positions of the Czech tagset

Position 1 – POS

A Adjective
C Numeral
D Adverb
I Interjection
J Conjunction
N Noun
P Pronoun
V Verb
R Preposition
T Particle
X Unknown, not determined, unclassifiable
Z Punctuation

Position 2 – SubPOS

N N: noun
2 A: Hyphenated adjective (cesko-(anglický) 'Czech-(English)')
A A: General adjective (velký 'big')
C A: Nominal (short) adjective (rád, schopen)
G A: Adjective derived from present transgressive (chodící 'walking')
M A: Adjective derived from past transgressive (napsavší 'having written')
O A: Idiosyncratic pronouns svuj 'true to type', nesvuj 'not-in-mood', tentam 'gone'
U A: Possessive adjective (Martinuv 'Martin's')


P P: Personal pronoun (já 'I')
H P: Personal pronoun in clitical form (me 'me')
5 P: 3rd person pronoun in prepositional forms (nem 'him')
7 P: Reflexive pronouns se, si, ses, sis
6 P: Reflexive pronoun se in long forms (sebou)
S P: Possessive pronoun (muj 'my')
8 P: Possessive reflexive pronoun svuj
D P: Demonstrative pronoun (ten, onen)
J P: Relative pronoun jenž 'who'
9 P: Relative pronoun jenž in prepositional form (nehož)
1 P: Relative possessive pronoun (jehož)
4 P: Relative/interrogative pronoun with adjectival declension (jaký, který)
E P: Relative pronoun což 'which'
K P: Relative/interrogative pronoun kdo 'who'
Q P: Relative/interrogative pronoun co 'what'
Y P: Relative/interrogative pronoun co as an enclitic to a prep. (oc, nac, zac)
L P: Indefinite pronoun všechen 'all', sám 'alone'
Z P: Indefinite pronoun (nejaký 'some')
W P: Negative pronoun (nic 'nothing', nijaký 'no')
= C: Number written using digits (3)
} C: Number written using Roman numerals (XIV)
l C: Cardinal numeral <5; also sto '100' and tisíc '1,000'
n C: Cardinal numeral >=5
r C: Ordinal numeral (adjective declension without degrees of comparison)
? C: Numeral kolik 'how many/much'
u C: Interrogative numeral kolikrát 'how many times'
z C: Interrogative numeral kolikátý 'at what position (place) in a sequence'
a C: Indefinite numeral (mnoho 'many')
w C: Indefinite numeral with adjectival declension (tolikátý 'so many times repeated')
d C: Generic numeral with adjectival declension (dvojí 'two-kinds')
h C: Generic numeral; only jedny 'one kind of' and nejedny 'not only one kind of'
j C: Generic numeral 4+ used as a noun (ctvero 'four kinds of')
k C: Generic numeral 4+ used as an adjective, short form (ctvery 'four kinds of')
o C: Multiplicative indefinite numeral (mnohokrát 'many times')
v C: Multiplicative definite numeral (petkrát 'five times')
y C: Fraction ending at -ina used as a noun
* C: Word krát 'times'
f V: Infinitive
B V: Verb in present or future tense
c V: Conditional (bych 'would')
i V: Imperative
p V: Active past participle
s V: Passive past participle
e V: Present transgressive (adverbial participle)
m V: Past transgressive; also archaic present transgressive of perfective verbs
t V: Verb in present or future tense with the enclitic -t' (archaic)
q V: Active past participle with the enclitic -t' (archaic)


b D: Adverb without negative or comparative forms (pozadu 'behind')
g D: Adverb with negative and comparative forms (velký 'big')
R R: Preposition without vocalization (v, k)
V R: Preposition with vocalization (ve, ku)
0 R: Preposition with attached clitic -n (pron 'for him')
F R: Preposition, part of; never appears isolated (vzhledem (k))
^ J: Conjunction connecting main clauses, not subordinate
, J: Conjunction subordinate (incl. aby, kdyby)
* J: Word krát 'times'
I I: Interjections
T T: Particle
X X: Word form recognized, but tag is missing in dictionary
@ X: Unrecognized word form
: Z: Punctuation
# Z: Sentence boundary (for the virtual word ###)

Position 3 – Gender

M Masculine animate
I Masculine inanimate
F Feminine
N Neuter
X Any of the basic four genders
H Feminine or neuter
T Masculine inanimate or feminine (plural only)
Y Masculine (either animate or inanimate)
Z Not feminine (i.e. masculine animate/inanimate or neuter)
Q Feminine (with singular only) or neuter (with plural only)

Position 4 – Number

S Singular
P Plural
D Dual
W Singular for feminine gender, plural with neuter
X Any number

Position 5 – Case

1 Nominative
2 Genitive
3 Dative
4 Accusative
5 Vocative
6 Locative
7 Instrumental
X Any case

Position 6 – Possessor's Gender

M Masculine animate possessor


F Feminine possessor
X Possessor of any gender
Z Not feminine (both masculine or neuter)

Position 7 – Possessor's Number

S Singular possessor
P Plural possessor

Position 8 – Person

1 1st person
2 2nd person
3 3rd person
X Any person

Position 9 – Tense

F Future
H Past or present
P Present
R Past
X Any tense

Position 10 – Degree of comparison

1 Positive
2 Comparative
3 Superlative

Position 11 – Negation

A Affirmative (not negated)
N Negated

Position 12 – Voice

A Active
P Passive

Position 15 – Variant, Style, Register, Special Usage

- Basic variant
1 Variant, second most used (less frequent), still standard
2 Variant, rarely used, bookish, or archaic
3 Very archaic, also archaic + colloquial
4 Very archaic or bookish, but standard at the time
5 Colloquial, but (almost) tolerated even in public
6 Colloquial (standard in spoken Czech)
7 Colloquial (standard in spoken Czech), less frequent variant
8 Abbreviations
9 Special uses, e.g. personal pronouns after prepositions etc.


A.2 Russian tagset

The Russian tagset was developed on the basis of the Czech positional tagset (see the previous section). The tagsets encode the same set of morphological categories in the same order and in most cases do so using the same symbols. However, there are some differences. Many of them are a consequence of linguistic differences between the languages. For example, Russian has neither vocative nor dual, nor does it have auxiliary or pronominal clitics; and the difference between colloquial and official Russian is not as systematic and profound as in Czech.

The Russian tagset also uses far fewer wildcards (symbols representing a set of atomic values). Even though wildcards might lead to better tagging performance, we intentionally avoid them. The reason is that they provide less information about the word, which might be needed for linguistic analysis or an NLP application. In addition, it is trivial to translate atomic values to wildcards if needed.

The tagset contains only wildcards covering all atomic values (denoted by X for all applicable positions). There are no wildcards covering a subset of atomic values. Forms that would be tagged with a tag containing a partial wildcard in Czech are regarded as ambiguous. For example, the Czech tomto 'this' (masc/neut, loc) is tagged as PDZS6---------- both in v tomto dome 'in this house (masc)' and in v tomto míste 'in this place (neut)'. The Russian ètom 'this' (masc/neut, loc) is tagged as PDMS6---------- in v ètom dome 'in this house (masc)' and as PDNS6---------- in v ètom meste 'in this place (neut)'.
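
Since the text above notes that translating atomic values into wildcards is trivial, the following sketch (Python) illustrates that direction for the gender slot only, using just the four gender wildcards listed in Table A.2; it is an illustration of the idea, not the full conversion:

    # Czech-style gender wildcards (a subset of Table A.2).
    GENDER_WILDCARDS = {
        "X": set("MIFN"),  # any of the four basic genders
        "H": set("FN"),    # feminine or neuter
        "Y": set("MI"),    # masculine, animate or inanimate
        "Z": set("MIN"),   # anything but feminine
    }

    def gender_wildcard(atomic):
        """Smallest wildcard covering a set of atomic gender values (None if there is none)."""
        candidates = [(len(vals), wc) for wc, vals in GENDER_WILDCARDS.items()
                      if set(atomic) <= vals]
        return min(candidates)[1] if candidates else None

    # The two atomic readings of Russian ètom (masculine and neuter) are covered
    # by the wildcard Z ('not feminine'), which is what the Czech tag of tomto uses:
    print(gender_wildcard("MN"))  # -> 'Z'

Going in the other direction, expanding a wildcard into its atomic values, is a plain table lookup, which is why the Russian tagset can afford to store only atomic values.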

For comparison, there are 1,027 tags in the Russian tagset and 2,263 tags in the Czech tagset. These counts contain only the basic tags, including only three values of the variant (final) slot of the tag: basic (-), abbreviations (8) and special use tags (9). Other alternative forms, colloquial and archaic language values are excluded (1-7 in the variant slot). When counting all the variants, there are 1,063 tags in the Russian tagset and 4,251 tags in the Czech tagset.

A.2.1 Positions

The positions are the same as in the Czech tagset (see Table A.1). The current version of the tagset captures neither animacy nor reflexivity. We plan to add these features to the Russian tagset in the future.

A.2.2 Values

Table A.3 summarizes possible values for each position. Note that, similarly to Czech, not all combinations are possible.

Table A.3. Values of individual positions of the Russian tagset

Position 1 – POS

A Adjective
C Numeral


D Adverb
I Interjection
J Conjunction
N Noun
P Pronoun
V Verb
R Preposition
T Particle
X Unknown, not determined, unclassifiable
Z Punctuation

Position 2 – SubPOS

N N: Noun
A A: Adjective (long, non-participle) (xorošij, ploxoj)
C A: Short adjective (non-participle) (surov, krasiv)
G A: Participle, active or long passive (citajušcij, citavšij, procitavšij, citaemyj, procitannyj; but not procitan)
c A: Short passive participle (procitan)
U A: Possessive adjective (mamin, ovec'ju)
6 P: Personal reflexive pronoun (sebja)
8 P: Possessive reflexive pronoun (svoj, svoju, svoim, ...)
P P: Personal pronoun (ja, my, ty, vy, on, ona, ono, oni)
5 P: 3rd person pronoun in prepositional forms (nego, nej, ...)
S P: Possessive pronoun (moj, ego, ...)
D P: Demonstrative pronoun (ètot, tot, sej, takoj, èkij, ...)
Q P: Relative/interrogative pronoun with nominal declension (kto, cto)
q P: Relative/interrogative pronoun with adjectival declension (kakoj, kotoryj, cej, ...)
W P: Negative pronoun with nominal declension (nicto, nikto)
w P: Negative pronoun with adjectival declension (nikakoj, nicej)
Z P: Indefinite pronoun with nominal declension (kto-to, kto-nibud', cto-to, ...)
z P: Indefinite pronoun with adjectival declension (samyj, ves', ...)
= C: Number written using digits
} C: Number written using Roman numerals (XIV)
n C: Cardinal numeral (odin, tri, sorok)
r C: Ordinal numeral (pervyj, tretij)
j C: Generic/collective numeral (dvoje, cetvero)
u C: Interrogative numeral (skol'ko)
a C: Indefinite numeral (mnogo, neskol'ko)
v C: Multiplicative numeral (dvaždy, triždy)
B V: Verb in present (or rarely future) form (citaju, splju, pišu)
f V: Infinitive (delat', spat')
i V: Imperative (spi, sdelaj, procti)
p V: Past form (spal, ždal)
e V: Imperfective gerund (delaja)
m V: Perfective gerund (pridja, otpisav)
b D: Adverb without a possibility to form negation and degrees of comparison (vverxu, vnizu, potom)
g D: Adverb forming negation and degrees of comparison (vysoko, daleko)
F R: Part of a preposition; never appears isolated (nesmotrja)
R R: Nonvocalized preposition (ob, pered, s, v, ...)
V R: Vocalized preposition (obo, peredo, so, vo, ...)
, J: Subordinate conjunction (esli, cto, kotoryj)
^ J: Non-subordinate conjunction (i, a, xotja, pricem)
I I: Interjection (oj, aga, m-da)
T T: Particle (by, li)
# Z: Sentence boundary
: Z: Punctuation
0 X: Part of a multiword foreign phrase
X X: Unknown, Not Determined, Unclassifiable

Position 3 – Gender. Distinguished for: N, A{ACGUc}, P{P5DLwSq8}, C{nra}, Vp

F Feminine
M Masculine
N Neuter
X Any gender

Position 4 – Number. Distinguished for: N, A{ACGUc}, P{P5DLwSq8}, C{nra}, V{Bp}

P Plural
S Singular
X Any number

Position 5 – Case. Distinguished for: N, A{AGU}, P{P5DLWwSQq68}, C{nrjua}

1 Nominative
2 Genitive
3 Dative
4 Accusative
6 Locative
7 Instrumental
X Any case

Position 6 – Possessor's Gender. Distinguished for: PS, AU

F Feminine possessor
M Masculine possessor
N Neuter possessor
X Any gender

Position 7 – Possessor's Number. Distinguished for: PP

P Plural (possessor)
S Singular (possessor)

Position 8 – Person. Distinguished for: P{P5S}, V{Bi}

1 1st person
2 2nd person
3 3rd person
X Any person

Position 9 – Tense. Distinguished for: A{G}, V{Bp}

F Future
P Present
R Past
X Any (Past, Present, or Future)

Position 10 – Degree of comparison. Distinguished for: AA, Dg

1 Positive
2 Comparative
3 Superlative

Position 11 – Negation. Distinguished for: N, A, Dg

A Affirmative (not negated)
N Negated

Position 12 – Voice. Distinguished for: AG, Ac

A Active
P Passive

Position 15 – Variant. Distinguished as needed

- Basic variant
1 Variant, second most used (less frequent), still standard
2 Variant, rarely used, bookish, or archaic
3 Very archaic
5 Colloquial, but (almost) tolerated even in public
6 Colloquial
7 Colloquial, less frequent variant
8 Abbreviations

The X wild-card values are used only in the following cases:

• Gender: agreement gender in plural (adjectives, participles, determiners, etc.), plurale-tantum nouns, non-declinable adjectives (e.g. non-Russian words and abbreviations), personal pronouns in 3rd person plural.

• Number: non-declinable nouns, adjectives and verbs, 3rd person possessive pronouns.

• Case: non-declinable nouns and adjectives, 3rd person possessive pronouns.

• Possessor’s Gender: for the 3rd person plural possessive pronoun.

• Person: for non-declinable verbs VB-X---XP------

• Tense: for passive long participles (AG).


A.2.3 Tagset overview by POS

Table A.4 provides an overview of the Russian tagset by POS. A template in the table below denotes a set of tags. The Roman letters refer to particular values, while the italics denote variables. Thus for example, to obtain the set of tags corresponding to the template NNgnc-----a----, one needs to instantiate all the possible combinations of the g (gender), n (number), c (case), and a (negation) variables. In this case, g ∈ {F,M,N,X}, n ∈ {P,S,X}, c ∈ {1,2,3,4,6,7,X}, a ∈ {A,N}. A variable never stands for the - (N/A) value. If a single Sub-POS allows a particular position to have both the N/A value and other values, we list them as separate templates.

In some cases, there might be additional restrictions on possible co-occurrences of values; for example, only nouns distinguish gender in plural (i.e. only lexical gender, not agreement gender, is distinguished). We mention only some of these restrictions. Also, the templates are somewhat simplified by ignoring the possibility of having different variants of the same tag (the final slot).
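
To make the template notation concrete, the sketch below (Python) blindly instantiates the noun template NNgnc-----a---- with the variable values listed above; the co-occurrence restrictions just mentioned are ignored, so the real tagset is somewhat smaller than such an expansion suggests:

    from itertools import product

    # Variable values for the noun template NNgnc-----a---- as given in the text.
    VALUES = {
        "g": "FMNX",     # gender
        "n": "PSX",      # number
        "c": "123467X",  # case (Russian has no vocative)
        "a": "AN",       # negation
    }

    def expand(template, values=VALUES):
        """Instantiate every variable (lowercase letter) of a positional-tag template."""
        slots = [values.get(ch, ch) for ch in template]
        return ["".join(tag) for tag in product(*slots)]

    tags = expand("NNgnc-----a----")
    print(len(tags))  # 4 * 3 * 7 * 2 = 168 candidate noun tags
    print(tags[0])    # NNFP1-----A----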

Table A.4. Overview of the Russian tagset

template  description  sample word  sample tag

N – Nouns

NNgnc-----a---- noun golos NNMS4-----A----

A – Adjectives (incl. Participles)

AAgnc----da---- long adjective tjaželyj AAMS4----1A----
ACgn------a---- short adjective krasiv ACMS------A----
AGgnc---t-av--- long participle, tv ∈ {PA,RA,XP} citajušcij AGMS1---P-AA---
                                                 citavšij AGMS1---R-AA---
                                                 citaemyj AGMS1---X-AP---

AUgncf----a---- possessive adjective mužnin AUMS2M----A----
Acgn------aP--- pass. perf. short participle procitan AcMS------AP---

P – pronoun

PP-nc--e------- personal pronoun nam PP-P3--1-------
PPgnc--3------- personal pronoun 3rd person on PPMS1--3-------
P5gnc--3------- personal p. in prep. forms nego P5MS2--3-------
PDgnc---------- demonstrative ètu PDFS4----------
PW--c---------- negative (nominal declension) nicto PW--1----------
Pwgnc---------- negative (adj declension) nikakoj PwMS1----------
PSgnc-me------- possessive moja PSFS1-S1-------
PSXXXfm3------- possessive ego PSXXXMS3-------
PQ--c---------- relative/interrogative (nom decl) što, kto PQ--1----------
Pqgnc---------- relative/interrogative (adj decl) kakoj PqMS1----------
PZ--c---------- indefinite (nominal declension) kogo-to PZ--4----------
Pzgnc---------- indefinite (adjectival declension) kakoj-to PzMS1----------
P6--c---------- personal reflexive sebja sebja P6--4----------
P8gnc---------- possessive reflexive svoj P8MS1----------


C – Numeral

C=------------- numbers (using digits) 3.14 C=-------------
C}------------- roman numeral XVII C}-------------
Cngnc---------- cardinal numeral 1 odnomu CnMS3----------
Cng-c---------- cardinal numeral 2, poltora dvux CnM-2----------
Cn--c---------- cardinal numeral 3+ pjati Cn--2----------
Crgnc---------- ordinal pervyj CrMS1----------
Cj--c---------- generic/collective numeral dvoim Cj--3----------
Cu--c---------- interrogative skol'ko Cu--x----------
Ca--c---------- indefinite numeral neskol'ko Ca--1----------
Cagnc---------- indefinite num. (adj decl.) mnogomu CaMS3----------
Cv------------- multiplicative triždi Cv-------------

V – verb

VB-n---et------ present (rarely fut.) finite form otryvaeš' VB-P---2P------
Ve------------- imperfective gerund grozja Ve-------------
Vf------------- infinitive spat' Vf-------------
Vi-n---e------- imperative trevož'tec' Vi-P---2-------
Vm------------- perfective gerund napisav Vm-------------
Vpgn----R------ past form cital VpMS----R------

D – Adverb

Db------------- adv. not forming negation/degrees tam Db-------------
Dg-------da---- adv. forming negation/degrees sil'nee Dg-------2A----

R – Preposition

RR--c---------- nonvocalized prep. with c case nad RR--7----------
RV--c---------- vocalized prep. with c case nado RV--7----------
RF------------- part of a multiword prep. nesmotrja RF-------------

J – Conjunction

J^------------- coordinating conj. i J^-------------
J,------------- subordinating conj. cto J,-------------

T – particle

TT------------- particle net TT-------------

I – Interjection

II------------- Interjection II-------------

Z – punctuation

Z#------------- Sentence boundary Z#-------------
Z:------------- Punctuation ! Z:-------------

X – special

X0------------- part of a multiword foreign phrase X0-------------
XX------------- unknown XX-------------


A.2.4 Notes

Negation

The values A/N in the negation slot refer to the presence (N) or absence (A) of a negative prefix (ne) for open class words. For pronouns, the slot always has the N/A value; whether they have negative meaning or not is specified by their lemma. Words that are not negated synchronically do not have N in this slot (they may still have negative semantics, but the initial ne is not a morphological prefix anymore); for example, nenavist' 'hate' is tagged as NNFS1-----A----.

All adjectives, including participles, allow negation, at least in theory:

• Ego nevolcij vzgljad menja ispugal ‘His non-wolfish look scared me’.

• Staršij syn byl bolee "nemamin" 'The eldest son was more non-mother's' (unusual, but at least theoretically possible)

Numerals

1. nol’/nul’ ‘zero’ and numerals above 999 (e.g. tysjaca ‘thousand’, milion) areconsidered to be regular nouns.

2. Only odinoždy ‘one time’, triždy ‘three times’, etc. are considered to be mul-tiplicative numerals; šestikratnyj ‘sixfold’ is annotated as a regular adjective.

3. Other words related to numerals are considered to be nouns or adjectives:number names (dvojka ‘number two’, devjatka ‘number nine’ – nouns); pja-tok ‘five’, desjatok ‘dozens’– nouns); composites (dvuxletnij ‘biannual’ – ad-jective, pjatiletka ‘five-year period/plan’ – noun).

Participles

1. All participles are classified as adjectives:

• citajušcij – AGMS1---P-AA--- – active (A) present (P) participle
• citavšij – AGMS1---R-AA--- – active (A) past (R) participle
• citaemyj – AGMS1---X-AP--- – passive (P) long (imperf/perf) participle
• procitan – AcMS------AP--- – passive (P) perf. short participle

2. Similarly to Czech, all -nyj (ostavlennyj 'deserted', varenyj 'cooked', zadelannyj 'clogged') participles/adjectives are considered to be general adjectives, because it is very hard to draw the line between their purely adjectival and participial use.


A.3 Romance tagsets

The Romance tagsets that we use are translations of the CLiC-TALP (Civit 2000) tagset, a compact structured tagset, into a positional system. The translation follows the basic design principles outlined in section 4.4. All Romance tagsets distinguish 11 positions with the same meaning summarized in Table A.5.

The individual values of our tagset are always equal to the relevant values of the CLiC-TALP tagset, with one exception: SubPOS values are equal in most cases, but not always. We changed a minimal number of SubPOS values to make them unique. The possible values for each position are listed in Table A.6. As with the Slavic tagsets, not all combinations are possible (e.g. nouns distinguish number but interjections do not).

Table A.5. Positions of the Romance tagsets

Position  Abbreviation  Name        Description
1         p             POS         Part of Speech
2         s             SubPOS      Detailed Part of Speech
3         g             Gender      Gender
4         n             Number      Number
5         c             Case        Case
6         m             PossNumber  Possessor's Number
7         o             Form        Form of Preposition
8         e             Person      Person
9         t             Tense       Tense
10        m             Mood        Mood
11        r             Participle  Is the adjective a participle?

Table A.6. Values of individual positions of Romance tagsets

Position 1 – POS

N Noun
A Adjective
D Determiner
P Pronoun
V Verb
R Adverb
S Preposition, Adposition
C Conjunction
I Interjection
Z Mathematical/Numeric Characters
Y Abbreviations
F Punctuation
X Undefined elements


Position 2 – Detailed POS

C N: Common noun
E N: Proper noun
O A: Predicative (ordinal) adjective
Q A: Qualitative adjective
A D: Article
D D: Demonstrative determiner
I D: Indefinite determiner
E D: Exclamative determiner (only in Spanish)
N D: Numeral determiner
P D: Possessive determiner
T D: Interrogative determiner
i P: Indefinite pronoun
d P: Demonstrative pronoun
e P: Exclamative pronoun
0 P: Indeterminate pronoun type
n P: Numeral pronoun
p P: Personal pronoun
r P: Relative pronoun
t P: Interrogative pronoun
X P: Possessive pronoun
M V: Main verb
S V: Semiauxiliary
a V: Auxiliary
G R: General adverb
F R: Negative adverb
K S: Preposition, adposition
c C: Coordinate conjunction
s C: Subordinate conjunction
I I: Interjection
W Z: Years, Hours, etc.
Z Z: Mathematical/Numeric Characters
Y Y: Abbreviations
F F: Punctuation
X X: Undefined elements

Position 3 – Gender

M Masculine
F Feminine
C Common
N Neutral
0 Inapplicable for the form


Position 4 – Number

S Singular
P Plural
N Invariable
0 Inapplicable for the form

Position 5 – Case

A Accusative
D Dative
N Nominative
O Oblique
0 Inapplicable for the form

Position 6 – Possessor's Number

S Singular
P Plural
0 Inapplicable for the form

Position 7 – Preposition's Form

S Simple
C Contracted

Position 8 – Person

1 1st person
2 2nd person
3 3rd person
0 Inapplicable for the form

Position 9 – Tense

P Present
I Imperfective
F Future
C Conditional
S Past
L Pluperfect (for Portuguese synthetic perfect)
0 Inapplicable for the form

Position 10 – Mood

I Indicative
S Subjunctive
M Imperative
N Infinitive
G Gerund
P Participle

Position 11 – Participle (for adjectives)

P Participle
0 Non-participle


A.3.1 Example tag correspondence

Below are examples of the original Romance tags and their translations into our tagset; a small conversion sketch in code follows the examples.

1. Spanish

a) AO0FP0 → AOFP------0

b) AQ0MSP → AQMS------P

c) NCFN000 → NCFN-------

d) NP00000 → NE00-------

2. Catalan

a) AO0FP0 → AOFP------0

b) AQ0MSP → AQMS------P

c) NC00000 → NC00-------
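
For illustration, here is a hypothetical re-implementation (Python) of the translation for just the noun and adjective tags shown above; the slot mapping is our reading of these five example pairs, not the full conversion tables used in the experiments:

    # Proper noun: the CLiC-TALP SubPOS 'P' is renamed to 'E' to keep SubPOS values unique.
    SUBPOS_FIX = {("N", "P"): "E"}

    def clic_to_positional(tag):
        """Translate a CLiC-TALP noun or adjective tag into the 11-slot positional tag."""
        pos = tag[0]
        if pos == "N":    # N + type + gender + number + ...
            subpos = SUBPOS_FIX.get((pos, tag[1]), tag[1])
            return pos + subpos + tag[2] + tag[3] + "-" * 7
        if pos == "A":    # A + type + degree + gender + number + participle
            return pos + tag[1] + tag[3] + tag[4] + "-" * 6 + tag[5]
        raise ValueError("only N and A tags are handled in this sketch")

    for t in ["AO0FP0", "AQ0MSP", "NCFN000", "NP00000", "NC00000"]:
        print(t, "->", clic_to_positional(t))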


Appendix B

Corpora

See Table 4.8, p. 61, for an overview of the basic properties of all corpora.

B.1 Slavic corpora

B.1.1 Czech corpora

All the Czech corpora are either part of the Prague Dependency Treebank 1.0 (PDT, Bémová et al. (1999); Böhmová et al. (2001); see http://ufal.mff.cuni.cz/pdt/Corpora/Raw_Texts/index.html) or are part of the PDT distribution. Let's discuss them in more detail:

• Raw consists of all the texts labeled as Raw texts in the PDT distribution. The texts come from a Czech daily newspaper Lidové Noviny from the years 1991–1995. It contains over 39M tokens or nearly 2.4M sentences.

• Test consists of all the annotated texts labeled as evaluation data. It contains approximately 125K tokens or 8K sentences. The texts come from two daily newspapers, a business weekly and a popular scientific magazine.

• Train consists of all the annotated texts labeled as training data. It contains approximately 1.5M tokens or 95K sentences. The texts come from the same sources as the Test texts. To allow evaluation of how particular statistics transfer from one corpus to another, we split the corpus into two parts, each with about 620K tokens.1 These smaller corpora are referred to as Train1 and Train2. The results are reported in section 6.2.

1 The remaining tokens are not used in this book. PDT is organized by sources and date of publication. To prevent differences between the two corpora caused by such organization, we split the corpus into 40 pieces and put all the odd pieces into Train1 and all the even pieces into Train2.
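
A minimal sketch of that split (Python; sentences is a placeholder for the list of PDT training sentences):

    def split_train(sentences, pieces=40):
        """Cut the corpus into consecutive pieces and interleave them, so that both
        halves draw evenly from all sources and publication dates; the remainder
        beyond the last full piece is left unused."""
        size = len(sentences) // pieces
        chunks = [sentences[i * size:(i + 1) * size] for i in range(pieces)]
        # 1st, 3rd, 5th, ... piece -> Train1; 2nd, 4th, ... piece -> Train2
        train1 = [s for i, chunk in enumerate(chunks) if i % 2 == 0 for s in chunk]
        train2 = [s for i, chunk in enumerate(chunks) if i % 2 == 1 for s in chunk]
        return train1, train2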


B.1.2 Russian corpora

• Dev, the development corpus: 1,758 manually annotated word tokens from the Russian translation of Orwell's 1984. The annotation was done by us.

• Test, the test corpus: 4,011 manually annotated word tokens from the Russian translation of Orwell's 1984. The annotation was done by us.

• Raw: 1M unannotated tokens of the Uppsala Russian Corpus. The corpus is freely available from Uppsala University at http://www.slaviska.uu.se/ryska/corpus.html. It mostly contains newspaper texts and political speeches, but also literary fragments.

B.2 Romance corpora

B.2.1 Spanish corpora

1. Train: 106,124 tokens (18,629 types) of the Spanish section of CLiC-TALP (Civit 2000), a balanced corpus containing texts of various genres and styles. The CLiC-TALP tagset was automatically translated into the current system for easier detailed evaluation and comparison (see section 4.4.2).

B.2.2 Portuguese corpora

1. Test, the test corpus: manually annotated word tokens from the NILC corpus (Núcleo Interdisciplinar de Lingüística Computacional, available at http://nilc.icmc.sc.usp.br/nilc/). The annotation was done manually by Luiz Amaral, a native speaker.

2. Raw: The NILC corpus minus the Test corpus. It contains 1.2M tokens. The version with POS tags assigned by PALAVRAS was used, but the POS tags were ignored.

B.2.3 Catalan corpora

1. Dev, a development corpus: 2K tokens from the CLiC-TALP corpus. The original tagset was translated to our positional tagset.

2. Test, a test corpus: 20,645 tokens from the CLiC-TALP corpus.

3. Raw: 63M tokens of texts from the "El Periodico" newspaper, available at http://www.elperiodico.es. Note that this newspaper is published in Spanish and Catalan, and the Catalan version is obtained via a machine translation system plus post-editing and correction. Thus, the Catalan version might appear more Spanish-like than other Catalan newspapers.


Appendix C

Language properties

Section 4.1 briefly describes the properties of the languages used in our experiments. Here we provide a slightly more detailed description of these languages.

C.1 Slavic Languages

Slavic languages are divided into three branches: South, West, and East Slavic. The South Slavic branch is split further into Western and Eastern subgroups. The Western subgroup is composed of Slovenian, Serbian, Bosnian, and Croatian. The languages from the Western subgroup are spoken in Slovenia, Bosnia and Herzegovina, Croatia, Serbia and Montenegro, and the adjacent regions. The Eastern subgroup consists of Bulgarian in Bulgaria and adjacent areas, and Macedonian in the Republic of Macedonia, Bulgaria, Greece and Albania. West Slavic includes Czech in the Czech Republic and Slovak in Slovakia, Upper and Lower Sorbian in Germany, and Lekhitic (Polish and its related dialects, Kashubian, Polabian, Obodrits). Russian, Ukrainian and Belarusian belong to the East Slavic branch.

C.2 Czech

General. The Czech language is one of the West Slavic languages. It is spoken by most people in the Czech Republic and by Czechs all over the world — about 12 million native speakers in total (http://www.ethnologue.com).

Morphology. Czech is a richly inflected language like other Slavic languages. Czech nouns and adjectives distinguish gender, number, and case, and in some cases, animacy. There are seven cases: nominative, accusative, genitive, dative, instrumental, locative, and vocative. About half of the singular noun paradigms have a distinctive vocative form shared by no other case; no adjectival, pronominal, numeral or plural noun paradigms have distinct vocative forms (i.e. vocative = nominative). There are three genders, the subcategory of animacy functioning within the masculine only. In the singular, animate accusative equals genitive,


West
  – Czech
  – Polish
  – Slovak
  – Sorbian (Lusatian)

South
  – Western
    ∗ Slovenian
    ∗ Serbian
    ∗ Bosnian
    ∗ Croatian
  – Eastern
    ∗ Bulgarian
    ∗ Macedonian

East
  – Belarusian
  – Russian
  – Ukrainian

Figure C.1. Slavic languages

which itself, in the core (hard) masculine paradigm, differs from the inanimate genitive. Similarly, animate dative and locative usually differ from their inanimate equivalents. In the plural, the animate and inanimate differ only in nominative.

As in the other Slavic languages, the morphology of numerals is complex in Czech. For example, among the cardinal numbers, only '1', '2', '3', and '4' function adjectivally and retain the morphology of case. The inflection of the other cardinal numerals is limited to the oblique-case ending -i. Ordinal (multidigit) numbers have all digits in the ordinal form, for example dvacátý pátý '25th', and are fully declining (at least in Literary Czech). Two-digit numerals between whole tens may have an inverted one-word form (e.g. petadvacátý '25th').

For verbs, person is expressed through inflection. Three tenses are recognized, a superficially simple system refined by the Slavonic aspects. Present time meanings are expressed by the basic conjugated forms. The imperative is expressed morphologically in second and first person plural, and analytically in others. The conditional is expressed by a combination of a verb with the conjugated enclitic auxiliary by.

Five main conjugational types of verbs are recognized. They are distinguished on the basis of the third person singular form, marked by the following endings: (I) -e, (II) -n-e, (III) -j-e, (IV) -í, and (V) -á. Class V is a historic innovation, born of the contraction of once disyllabic endings and assimilation to the verb dát.

Syntax. In syntax, verbs agree with their subjects in number, person (in present forms), gender (in past forms) and animacy (in masculine forms). Adjectives agree with nouns they modify in case, number, gender, and animacy. As with other Slavic languages, Czech is a so-called free-word-order language, where the order of syntactic constituents is determined by pragmatic constraints. However, the position of adjectives is relatively rigid before the nouns they qualify, as is the position of dependent infinitives following the verbs on which they depend. Another issue in word order is the placing of clitics, elements on the boundary of syntax and morphology, which generally follow the first constituent of the clause. Czech clitics include the past and conditional auxiliaries, the "weak" forms of the personal pronouns, and a small number of particles. See, for example, Hana (2007) for more details on properties and placement of Czech clitics. The main copular verb is být and its frequentative bývat; it can never be omitted.

Reflexivity is expressed primarily by the free morpheme se. It is often described as a particle rather than a pronoun on the grounds of the many functions in which it is referentially empty, and because under emphasis or where agreement might be required, it behaves differently from other pronoun objects.

Sentence negation in Czech is formed by the prefix ne- attached to the verb. As in other Slavic languages, negative elements accumulate; any negative subject or object pronoun or pronoun-adverb is reinforced by ne- in the verb. Unlike the other languages described in this section, modern Czech is the only one that does not have the "genitive of negation" phenomenon, i.e. the situation when an accusative object becomes genitive if a verb selecting it is negated. The direct object after a negative is in the accusative (except for some archaic cases).

C.3 Russian

General. Russian is an East Slavic language. Russian is primarily spoken in Russia and, to a lesser extent, the other countries that were once constituent republics of the USSR, as well as in Israel, North America and Western Europe. According to http://www.ethnologue.com, there are 167 million first-language Russian speakers in the world.

Morphology. Like Czech, Russian is a fusional language in which several inflections are often fused into one phonetic and orthographic form. For example, in the verb dela-et, the suffix -et indicates the person (3rd), the number (Sg), and the tense (present).

In Russian, nominal parts of speech express distinctions of case, number and gender with different degrees of consistency and not always by the same morphological means. Number is expressed in all nominal parts of speech except numerals themselves. Russian has six primary cases (nominative, genitive, accusative, dative, instrumental, and locative) and two secondary cases (second genitive and second locative).

Nouns in Russian can be grouped into equivalence classes according to various criteria. One such grouping is declension class (see, for example, Table C.1); another is (syntactic) gender, expressed through agreement in other parts of speech — attributive adjectives, predicative adjectives, the past tense of verbs, and pronouns. Declension type and gender are largely isomorphic — the members of a given declension or subdeclension condition the same agreement, and belong to the same gender. The exceptions mostly involve animate nouns.

Table C.1. Declension Ia – an example

          Hard stem     Soft stem
Singular
nom       cin 'rank'    kon' 'horse'
gen       cina          konja
dat       cinu          konju
acc       = nom         = gen
loc       cine          kone
ins       cinom         konëm
Plural
nom       ciny          koni
gen       cinov         konej
dat       cinam         konjam
acc       = nom         = gen
loc       cinax         konjax
ins       cinami        konjami

Another equivalence class of nouns is defined by the animate accusative, or the use of the genitive for a direct object. Nouns in Russian make use of relatively few case-number morphemes, and the three declension patterns into which they are organized are also limited and relatively uniform, though there are some recognizable subdeclensions.

There are two types of adjectives. Short-form adjectives, whose syntactic distribution is restricted, preserve only the nominal endings of the nominative case. Long-form adjectives agree with nouns in number, gender, and case.


Numerals use declensional strategies which range from near indeclinability to adjective-like declension. Certain cardinal numerals expressing large, round units of counting have minimal declension, with one form for the nominative and accusative, and another for the remaining cases (e.g. sto.Nom/Acc vs. sta.Gen/Dat/Inst/Loc 'hundred').

Russian verbs distinguish three moods: the indicative, the imperative, and the conditional. In the indicative, three tenses are distinguished. Finite forms inflect for person and number (see Table C.2). In the past tense and the conditional, verbs do not inflect for person, but inflect for gender in the singular. In addition, each verb belongs to a particular aspectual class. Change of aspect is encoded by derivational suffixes or prefixes. In some cases, verbal aspect is changed through a vowel alternation in the root.

Table C.2. I-conjugation – grabit’ ‘rob’

infinitive grabit’present tense

1 sg grablju2 sg grabiš3 sg grabit1 pl grabim2 pl grabite3 pl grabjat

past tensepast masc grabilpast fem grabilapast neut grabilopast pl grabili

imperative2 sg grab’2 pl grab’te

participlespresent active grabjašcijpast active grabivšijpast passive grablen

gerund (verbal adverb)present grabjapast -grabiv(ši)

Syntax. In syntax, main verbs agree in person and number with their subjects, and in gender in past, singular forms. Adjectives agree in number, gender and case with the noun they modify. The word order of syntactic constituents in a sentence is relatively free in Russian. The naturalness and frequency of various orders depend on the role of the noun phrase and the semantics of the verb, and different orders have different stylistic consequences. The neutral order (the order that can be used in most situations) is Subject-Verb-Object. Pragmatic information and considerations of topic and focus (new information conveyed by the sentence) are, as in other Slavic languages, important in determining word order. Constituents with old information precede constituents with new information or those that carry most emphasis. The inflectional system determines the grammatical relations and roles.

Sentences stating copular relations — equations, descriptions, class membership — consist of a (nominative) subject, a predicative noun or adjective and, sometimes, a copular verb. In the present tense, there is normally no overt copular verb, the conjugated forms of "be" having been eliminated in all functions.

The subjunctive mood is formed by the clitic by, which occurs in the so-called 'second position'. 'Second position' is usually defined as the position after the first syntactic constituent or the first prosodic word.

The negative particle ne can attach to any constituent, with local scope. Negation shows an affinity with genitive case marking in place of nominative for subjects of intransitives or accusative for objects of transitives, as illustrated in (C.1).

(C.1) a. Podlennik      pis'ma       soxranilsja
         original.nom   letter.gen   preserved
         'The original of the letter was preserved.'

      b. Podlennika     pis'ma       ne    soxranilos'
         original.gen   letter.gen   not   preserved
         'The original of the letter was not preserved.'

Multiple negative elements can occur together with sentence negation, as shown in (C.2).

(C.2) On    nikogo    ne    videl
      he    nobody    not   saw
      'He didn't see anybody.'

Reflexivity is expressed by the verbal suffix -sja (compare myt' 'wash (somebody)' vs. myt'sja 'wash oneself').

There are no articles, pronominal or auxiliary clitics in Russian.

C.4 Romance languages

The Romance languages (see Figure C.2), a major branch of the Indo-European language family, comprise all languages that descend from Latin,


Romance
  – West-Iberian
    ∗ Aragonese
    ∗ Asturo-Leonese
    ∗ Fala
    ∗ Galician
    ∗ Ladino (Judaeo-Spanish)
    ∗ Portuguese
    ∗ Riverense Portuñol
    ∗ Spanish (Castilian)
  – Catalan
    ∗ various dialects (Central, Northern, Aragonese, Valencian)
  – Northern French languages
    ∗ Bourguignon-Morvandiau
    ∗ Champenois
    ∗ Franc-Comtois
    ∗ French
    ∗ Gallo
    ∗ Lorrain
    ∗ Norman
    ∗ Picard
    ∗ Poitevin-Saintongeais
    ∗ Walloon
  – Franco-Provençal
  – Southern French languages (Occitan)
  – Corsican
  – Sardinian
  – Northern Italian (Gallo-Romance) languages
  – Rhaetian languages
  – Italo-Dalmatian languages
    ∗ Italian
    ∗ Judeo-Italian
    ∗ Neapolitan
    ∗ Romanesco
    ∗ Sicilian
  – East Romance languages:
    ∗ Aromanian
    ∗ Romanian
    ∗ Moldovan

Figure C.2. Romance languages


the language of the Roman Empire. They have more than 600 million native speakers worldwide, mainly in the Americas, Europe, and Africa, as well as in many smaller regions scattered through the world.

All Romance languages descend from Vulgar Latin, the language of soldiers, settlers, and slaves of the Roman Empire, which was substantially different from the Classical Latin of the Roman literati. Between 200 BC and 100 AD, the expansion of the Roman Empire, coupled with administrative and educational policies of Rome, made Vulgar Latin the dominant native language over a wide area, spanning from the Iberian Peninsula to the Western coast of the Black Sea. During the Empire's final years and after its collapse and fragmentation in the 5th century, Vulgar Latin began to evolve independently within each local area and eventually diverged into dozens of distinct languages. The overseas empires established by Spain, Portugal, and France after the 15th century then spread Romance languages to the other continents to such an extent that about 2/3 of all Romance speakers are now outside Europe.

In spite of multiple influences from pre-Roman languages and from later invasions, the phonology, morphology, lexicon, and syntax of all Romance languages are predominantly derived from Vulgar Latin. As a result, the group shares a number of linguistic features that set it apart from other Indo-European branches. With only one or two exceptions, Romance languages have lost the declension system of Classical Latin.

The most widely spoken Romance language is Spanish, followed by Portuguese, French, Italian, Romanian, and Catalan. These six languages are all main and official national languages in more than one country. A few other languages have official status on a regional or otherwise limited level in these nations. For instance, Sardinian is officially recognized in Italy, Romansh in Switzerland, and Valencian, Galician, and Aranese in Spain.

The remaining Romance languages survive mostly as spoken languages for informal contact. National governments have historically viewed linguistic diversity as an economic, administrative, or military liability, and as a potential source of separatist movements. Therefore, they have generally fought to eliminate such diversity by massively promoting the use of the official language, by restricting the use of the “other” languages in the media, by characterizing them as mere “dialects”, or worse.

In the last decades of the 20th century, however, increased sensitivity to the rights of minorities has allowed those languages to recover some of their prestige and lost rights. It is not yet clear whether those political changes will be enough to reverse the decline of non-official languages.


C.5 Catalan

General. Catalan is a member of the Romance family of languages.1 As its geographical position might suggest, Catalan shares several features with its Romance neighbors — Italian, Sardinian, Occitan, and Spanish — while being distinct in several respects from all of them. A conservative estimate of the number of native speakers of Catalan is about 6.5 million (http://www.ethnologue.com), in Catalonia, Valencia, and the Balearics. Though there are significant dialect differences in Catalan, the dialects are, to a very high degree, mutually intelligible. They are conventionally divided into two groups, eastern and western, on the basis of differences in phonology as well as in some features of verb morphology. There are some interesting lexical differences, too. The eastern dialect group covers North Catalan, Central Catalan, Balearic, and Alguerese. The western group consists of North Western Catalan and Valencian. The descriptions and the examples used in this chapter as well as in the experiments in later chapters are mainly based on the Central dialect, which is the closest to the established prescriptive norms.

Morphology. As in all Romance languages, Catalan nouns distinguish feminine and masculine gender. In addition, Catalan has a special neuter agreement pronoun ho used when no noun has been mentioned. The feminine form is usually derived with the suffix -a (e.g. noi (masc.) vs. noia (fem.) ‘boy/girl’). In some cases, the feminine form has another suffix, such as -essa. Some masculine forms are derived from feminine, e.g. abella (fem.) vs. abellot (masc.) ‘bee/drone’. As in any language, some gender pairs are expressed with unrelated, or irregularly related, nouns, e.g. amo (masc.) vs. mestressa (fem.) ‘owner’. A frequent type of compound is that composed of a verb and a noun. Its gender is not related to the noun it contains. If the compound is endocentric, its gender comes from the head noun. In the case of exocentric compounds, which denote something different from the denotations of either of the individual compound elements, and in the case of noun-noun compounds, where both elements contribute equally to the meaning, the principle seems to be that a compound that contains a feminine singular noun and no masculine nouns is feminine (e.g. cama-roja ‘flamingo’), otherwise the compound is masculine (e.g. estira-i-arronsa ‘give and take’). Words borrowed from Spanish, French, Italian, and Latin typically bring with them the gender they have in the source language; words from other languages tend to be masculine, unless they are closely associated semantically with a feminine Catalan word.

There are five related patterns for forming the plural of nouns and adjectives in Catalan. These are (i) the addition of -s to the singular form (the predominant pattern), (ii) the replacement of final -a by -es, (iii) the addition of -ns, (iv) the addition of -os (unique to masculine nouns), and (v) an invariable form (the plural is identical with the singular).
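Choosing among these patterns generally requires lexical knowledge, so a resource-light guesser can only over-generate candidate plurals and leave the selection to a lexicon or corpus evidence. The Python sketch below is our illustration of that idea, not the analyzer described earlier in the book; the function name and example nouns are ours.

    # Illustrative sketch only: over-generate candidate Catalan plurals
    # according to the five patterns listed above. Picking the correct
    # candidate for a given lexeme would need a lexicon or corpus filtering.
    def catalan_plural_candidates(singular):
        candidates = {singular + "s"}              # (i) add -s (the predominant pattern)
        if singular.endswith("a"):
            candidates.add(singular[:-1] + "es")   # (ii) replace final -a by -es
        candidates.add(singular + "ns")            # (iii) add -ns (e.g. pa 'bread' -> pans)
        candidates.add(singular + "os")            # (iv) add -os (masculine nouns)
        candidates.add(singular)                   # (v) invariable plural
        return candidates

    if __name__ == "__main__":
        for noun in ["noia", "pa", "fill"]:
            print(noun, "->", sorted(catalan_plural_candidates(noun)))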

1 The description of Catalan is based on Wheeler et al.’s (1999) reference grammar.


The use of definite and indefinite articles in Catalan is, on the whole, comparable with the use of the corresponding forms in English. However, there are two main differences of usage. The first is that in generic noun phrases, Catalan uses the definite article not only with singular count nouns, but also with singular mass nouns and plural nouns. The second is that, more often than in English, a singular indefinite noun phrase may lack an article or determiner altogether. Catalan also uses a ‘personal article’ before proper names of people. Articles can be orthographically separate words or they can appear in a cliticized form (cf. el (sg., masc., definite) vs. l’ (sg., masc., definite, before a vowel or h-)).

Differences between the dialects of Catalan are most noticeable in verb morphology. Compound verbs (containing prefixes such as con-, de-, en-, ex-, sub-, etc.) generally have the same inflections as the root verbs they are based on.

On the basis of their inflectional paradigms, Catalan verbs fall into three classes (with some subdivisions). Conjugation I mostly includes regular verbs. Conjugation II contains most of the irregular verbs. In particular, Conjugation II verbs have irregular participles, and their infinitives are more varied in their endings than the other inflections. Conjugation III verbs have infinitives in -ir. This conjugation class includes ‘inchoative’ verbs, which manifest stem alternation, and other irregular verbs. There are also verbs of mixed conjugation. The non-finite verb categories are infinitive, gerund, and (past) participle. Tense categories are present, past, and future. The expression of present and past overlaps with the expression of mood and aspect. The mood categories are indicative, imperative, and subjunctive. The indicative combines with all tenses, but the imperative is found only with the present tense, and the subjunctive only with the present and past tenses. The aspect categories are perfective and imperfective. They are distinguished only in the past tense of the indicative, where the perfective is called the preterite, and marginally in the subjunctive mood. Imperfect is the conventional name for the past imperfective indicative.

Perfect and conditional do not fit neatly into the framework of tense, aspect, and mood. Perfect in Catalan is expressed periphrastically via the auxiliary verb haver (sometimes ser in North Catalan and esser in Balearic) with the participle of the verb in question. Perfect combines with all of the categories of tense, aspect, and mood (also with conditional), except that there is no perfect imperative.

Syntax. Typical simple sentences consist of a verb and one or more arguments (see (C.3)).

(C.3) No hi podem anar a Roma, enguany
      No there able go to Roma, this-year
      ‘We won’t be able to go to Roma this year.’

Sentences may contain adverbial or prepositional phrase adjuncts, specifying place, time, manner, etc. The subject pronouns are not required to be present, since finite verbs largely indicate what the subject is by means of inflections.


The verbs haver and ser in their third-person singular forms have special usages. Such constructions have no grammatical subject. Personal pronouns have unstressed (clitic) direct object forms distinct from subject forms. Noun phrases generally take no special form as direct object, but the preposition a is used before direct object noun phrases in certain limited contexts. Except in the third person, the indirect object unstressed (clitic) personal pronouns are the same as the direct object ones. The third-person indirect object clitics are li (sg.) and els/los (pl., both genders). Stressed indirect object pronouns, and other types of indirect object noun phrases, are marked with the preposition a.

Passive constructions in Catalan are formed similarly to English. Two verbs are used in causative constructions: fer ‘make’/‘get’ and deixar ‘allow’/‘let’. Both verbs use the same construction, which varies depending on whether the caused situation is expressed by an intransitive or by a transitive verb (which appears in the infinitive form). By definition, non-finite verbs lack expression of person (i.e. agreement with the subject of the verb). The (past) participle takes adjective-type inflections of number and gender, agreeing with the surface subject of a passive verb and, in certain contexts, with the object of a verb in the perfect. Finite verbs agree with the subject in person and number.

A positive sentence in Catalan is negated with the particle no. No must precede the verb or the pronominal clitics placed immediately before the verb. Some dialects use the particle pas postverbally, similarly to French.

Catalan, particularly in the colloquial language, makes frequent use of the adverbial pronouns en (a partitive, ‘of that’), e.g. en tinc ‘I have [of it]’, and hi (which stands in for certain prepositional phrases). For example, He quedat amb la Marta ‘I have a date with Marta’ can be replaced by Hi he quedat ‘I have a date [with Marta]’.

In Catalan, the definite articles el or en for masculine and la or na for feminine are used with personal names.

The normal unmarked word order of Catalan has elements (when present) in the following order: sentential adjunct, subject, no, verb, short adverbial, direct object or predicative phrase, indirect object or other complement phrase, adverbial adjunct. In general, however, Catalan word order is freer than that of English. An important reason for this is that, in Catalan, it is the end position in a sentence that carries the information focus; it is the place where the major pitch movement occurs in speech, and it is where the most informative element of the sentence goes. Elements of the basic word-order sequence may have to be dislocated to achieve this. Adverbials (adverbs, prepositional and adverbial phrases, and clauses) are placed either immediately before or immediately after the word(s) they modify.

Additional features of Catalan syntax include the systematic dropping of prepositions in front of complementizers, as illustrated in (C.4), and the periphrastic past, in which the verb anar ‘go’ is used with an infinitive to express the past, unlike similar constructions in related languages such as French, Occitan, or Spanish, where they express the future.


(C.4) a. Te’n recordes de mi?
         You-EN remember of me?
         ‘Do you remember me?’

      b. No se’n recordava que dimecres havien anat a la platja.
         Not SE-EN remember that Wednesday had gone to the beach.
         ‘He/She didn’t remember they had gone to the beach on Wednesday.’

C.6 Portuguese

General. Portuguese2 is the language spoken in Portugal and in Brazil; it has many speakers in several African nations as well, including Angola, Mozambique, Guiné-Bissau, Cabo Verde, and São Tomé, and is also spoken by immigrant minorities in the United States, Canada, and some countries of Western Europe. A conservative estimate puts the number of its monolingual speakers at about 200 million.

In Brazil, where Portuguese is the only language used by the whole population and where it has been evolving independently for almost five centuries, the spoken language differs markedly from European Portuguese.

Morphology. A distinguishing feature of Portuguese among the Romance languages, nowadays found mostly in archaic texts, is mesoclisis: the insertion of weak pronouns between the verb stem and the future or conditional verb ending (e.g. Comprá-lo-ei = comprarei + o ‘I will buy it’).

As in all Romance languages, the grammatical gender of inanimate entities is quite arbitrary, and often different from that used in sister languages. Thus, for example, Portuguese árvore ‘tree’ and flor ‘flower’ are feminine, while Spanish árbol and Italian fiore are masculine; Portuguese mar ‘sea’ and mapa ‘map’ are masculine, while French mer and mappe are feminine; and so on.

The gender and number of many nouns can typically be deduced from their endings: the basic pattern is -o/-os for masculine singular and plural, and -a/-as for feminine (e.g. casa ‘house’). However, the complete set of rules is much more complex, and there are many irregular forms.

Portuguese has definite and indefinite articles, inflected for gender and number. The forms o/a/os/as are used for definites and um/uma/uns/umas for indefinites. Portuguese articles contract with certain prepositions, in writing and in speech, e.g. de + o/a/os/as = do/da/dos/das (‘of the’, ‘from the’). Demonstratives are inflected for person and number too, although the rules are a bit different from those of articles.
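A minimal sketch of these two regularities, assuming nothing beyond the basic patterns just described (the function name, the constant, and the test words are ours): it guesses gender and number from the -o/-a endings and expands the de + article contractions. The “much more complex” rules and irregular forms mentioned above would need an exception lexicon.

    # Illustrative sketch only: the basic -o/-a gender/number pattern and the
    # de + definite-article contractions. Irregular nouns are not handled.
    DE_CONTRACTIONS = {"o": "do", "a": "da", "os": "dos", "as": "das"}  # 'of the', 'from the'

    def guess_gender_number(noun):
        """Guess (gender, number) of a Portuguese noun from its ending alone."""
        if noun.endswith("os"):
            return ("masc", "pl")
        if noun.endswith("as"):
            return ("fem", "pl")
        if noun.endswith("o"):
            return ("masc", "sg")
        if noun.endswith("a"):
            return ("fem", "sg")
        return ("unknown", "unknown")   # endings outside the basic pattern

    if __name__ == "__main__":
        for noun in ["casa", "livro", "livros"]:
            print(noun, guess_gender_number(noun))
        print("de + o =", DE_CONTRACTIONS["o"])   # do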

2 The description of Portuguese is based on Perinin’s (2002) reference grammar.


Syntax. As mentioned in section 4.1, the word order of Portuguese is relatively flexible compared to English. European Portuguese is a subject pro-drop language, like Catalan and Spanish, which means that an explicit subject is often dropped. Brazilian Portuguese is both subject and object pro-drop. (C.5) illustrates a case where the object is dropped.

(C.5) A: O que você fez com o livro?
         What you did with the book?
      B: Eu dei para Maria.
         I gave to Mary
      A: ‘What did you do with the book?’ B: ‘I gave it to Mary.’

Verbs agree in number and person with their subjects. Adjectives in Portuguese generally follow the noun they modify. Thus “white house” is casa branca, never branca casa (except in poetic speech). However, a few adjectives like bom ‘good’, belo ‘nice’, and grande ‘great, big’ often precede the noun. Some of these have different meanings depending on position: um grande homem means ‘a great man’, um homem grande means ‘a big man’.

Like other Romance languages, Portuguese has passive voice variants of clauses with transitive verbs and objects. The rules are basically the same as in those languages: the original object becomes the subject; the original subject becomes an adverbial complement with the preposition por ‘by’; and the verb is replaced by its past participle, preceded by the verb ser ‘to be’ inflected in the original mood and tense.

Two verbs are used as main copulas, as in many other Romance languages: ser and estar ‘to be’. The choice of verb encodes the distinction between the permanent and the temporary, rather than essence versus state as in the original Latin sum vs. sto. This makes Portuguese closer to Catalan than to Spanish. For example, Sou feliz. means ‘I’m happy.’ and Estou feliz. means ‘I’m happy now.’.

In addition to the pronouns that act as subjects of a sentence and the stressed oblique pronouns employed after prepositions, Portuguese has several clitic object pronouns used with non-prepositional verbs, or as indirect objects. These can appear before the verb as separate words, as in Ela me ama. ‘She loves me.’, or appended to the verb after the tense/person inflection, as in Ele amou-a. ‘He loved her.’. Portuguese spelling rules (unlike those of Italian and Spanish) require a hyphen between the verb and the clitic pronoun.

Clitic placement may require some adjustments in the verb ending and/or in the pronoun, e.g. cantar + o = cantá-lo ‘to sing it’. The direct and indirect object pronouns can be contracted, as in dei + lhe + os = dei-lhos ‘I gave them to him’.
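The two adjustments mentioned in this paragraph are regular enough to sketch in a few lines of Python. This is a toy illustration under the assumptions stated in the comments, not a full treatment of Portuguese clitics; the function names are ours.

    # Toy illustration of the two adjustments mentioned above.
    # Assumption: only infinitives in -ar/-er/-ir and 3rd-person accusative
    # clitics (o/a/os/as) are covered; other hosts and clitics are left as-is.
    ACCENTED_VOWEL = {"ar": "á", "er": "ê", "ir": "i"}

    def attach_accusative_clitic(infinitive, clitic):
        """cantar + o -> cantá-lo: drop -r, accent the theme vowel, use the l- allomorph."""
        ending = infinitive[-2:]
        if ending in ACCENTED_VOWEL and clitic in {"o", "a", "os", "as"}:
            return infinitive[:-2] + ACCENTED_VOWEL[ending] + "-l" + clitic
        return infinitive + "-" + clitic

    def contract_clitics(indirect, direct):
        """Contract an indirect-object clitic with a 3rd-person direct-object clitic."""
        if indirect in {"lhe", "lhes"}:
            return "lh" + direct           # lhe + os -> lhos (as in dei-lhos)
        if indirect in {"me", "te"}:
            return indirect[:-1] + direct  # me + o -> mo
        return indirect + "-" + direct     # other combinations are not modelled here

    if __name__ == "__main__":
        print(attach_accusative_clitic("cantar", "o"))   # cantá-lo
        print(contract_clitics("lhe", "os"))             # lhos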

Romance languages often use articles where English would not, and Portuguese is particularly extreme in this regard: it will often use articles before names of people, especially in informal registers or if the name includes a title (e.g. A Maria saiu ‘Maria left’). Articles also occur before certain country and organization names.


C.7 Spanish

General. Spanish is an Iberian Romance language. It is the most widely spoken Romance language, and the fourth most widely spoken language in the world according to some sources. It is spoken as a first language by about 400 million people, and by approximately 500 million including non-native speakers (http://www.ethnologue.com).

Morphology. As has been said in section 4.1, verbs are one of the most complex areas of Spanish grammar. Verbs are divided into three classes, which differ with respect to their conjugation. The class of a verb can be identified by looking at the infinitive ending, -ar, -er, or -ir, as shown in the dictionary form of the verb. The vowel in the ending (a, e, or i) is technically termed the thematic vowel.

The -ar verbs are the most numerous and the most regular; moreover, the -ar class is usually chosen for new verbs. The -er and -ir classes comprise far fewer verbs, which also tend to be more irregular. There are also subclasses of semi-regular verbs which show vowel alternation conditioned by stress. This is very similar to Portuguese and Catalan, as has been described above.

All Spanish nouns have one of two grammatical genders: masculine or feminine (mostly conventional, that is, arbitrarily assigned). Most adjectives and pronouns, and all articles and participles, indicate the gender of the noun they reference or modify. In some cases, the same word can take two genders with a different meaning for each (e.g. el capital ‘funds’ vs. la capital ‘capital city’). Note that the division between uncountable and countable nouns is not as clear-cut as in English.

Nouns ending in -o are masculine, with the only notable exception of the word mano ‘hand’. The ending -a is typically feminine, with notable exceptions. Nouns ending in other vowels or in consonants are more often than not masculine, but many are feminine, particularly those referring to women (la madre) or ending in -ción, -dad, or -ez (la nación, la soledad, la vejez).

A small set of words of Greek origin ending in -ma are masculine: problema ‘problem’, lema ‘lemma’, tema ‘theme’, sistema ‘system’, telegrama ‘telegram’, etc. Words ending in -ista and referring to a person can generally be of either gender: el artista, la artista ‘the artist’. The same is true of words ending in -ante or -ente, though sometimes separate female forms ending in -a are used. Words taken from foreign languages either take the gender they have in that language (with neuter treated as masculine) or the gender suggested by their form (e.g. la Coca-Cola is feminine because it ends in -a).
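These generalizations translate directly into a small ending-based gender guesser of the kind used in resource-light morphological analysis. The sketch below is ours; the function name is made up, and the exception list is deliberately tiny rather than exhaustive.

    # Illustrative ending-based gender guesser for Spanish nouns.
    # The exception list is a small sample; real coverage needs a lexicon.
    EXCEPTIONS = {"mano": "fem", "problema": "masc", "tema": "masc",
                  "sistema": "masc", "telegrama": "masc"}

    def guess_spanish_gender(noun):
        if noun in EXCEPTIONS:
            return EXCEPTIONS[noun]
        if noun.endswith("o"):
            return "masc"
        if noun.endswith(("a", "ción", "dad", "ez")):
            return "fem"
        return "masc"        # other endings are "more often than not" masculine

    if __name__ == "__main__":
        for w in ["libro", "nación", "soledad", "vejez", "mano", "problema"]:
            print(w, guess_spanish_gender(w))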

While Spanish is generally regarded as having two genders, its ancestor, Latin, had three. The transition from three genders to two is mostly complete; however, vestiges of a neuter gender can still be seen, most notably in pronouns like esto, eso, aquello, and ello, which are the neuter forms of este, ese, aquel, and él, respectively. These words correspond to English ‘this’, ‘that’, ‘him’ or ‘it’, in the same order.


Additionally, the word lo, while usually masculine, can be considered neuter in some circumstances.

Adjectives in Spanish can mostly be divided into two large groups: those whose dictionary form ends in -o, and all others. The former typically agree in number and gender; the latter typically agree just in number.

Syntax. Word order in Spanish is flexible. Many adjectives may be placed before or after the noun they modify (e.g. en el pasado remoto/en el remoto pasado ‘in the remote past’). A subject may follow or precede a verb (e.g. Juan lo sabe/lo sabe Juan). A direct object noun phrase may follow or precede the verb, and as in English or the other Romance languages discussed here, adverbs and adverb phrases may occupy various positions in relation to the verb that they modify. Usually, the factors that determine Spanish word order are considerations of style, context, and emphasis, similar to the word order rules for Portuguese and Catalan discussed above. (C.6) provides an example of a simple Spanish sentence.

(C.6) Los estudiantes ya compraron los libros.
      The students already bought the books
      ‘The students have already bought the books.’

Nearly all Spanish adjectives agree with nouns and pronouns in number, and many also agree in gender. Thus, they have either two forms (e.g. natural/naturales) or four (e.g. bueno/buena/buenos/buenas). Verbs agree with their subjects in person and number.
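The two- versus four-form distinction just illustrated can be sketched with a short form generator. This is our illustration, not a full inflection module: pluralization is simplified to -s after a vowel and -es otherwise, and the function name is ours.

    # Illustrative sketch: adjectives in -o get four forms (gender x number),
    # the rest get two (number only). Pluralization is deliberately simplified.
    def spanish_adjective_forms(lemma):
        if lemma.endswith("o"):
            stem = lemma[:-1]
            return [stem + "o", stem + "a", stem + "os", stem + "as"]
        plural = lemma + "s" if lemma[-1] in "aeiou" else lemma + "es"
        return [lemma, plural]

    if __name__ == "__main__":
        print(spanish_adjective_forms("bueno"))    # ['bueno', 'buena', 'buenos', 'buenas']
        print(spanish_adjective_forms("natural"))  # ['natural', 'naturales']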

Spanish is a subject pro-drop language, which means that subject pronouns may be omitted when they are pragmatically unnecessary. In this, Spanish resembles European Portuguese and Catalan.

Table C.3. Germanic influence on Spanish, Portuguese, and Catalan

Germanic   Meaning   Spanish   Portuguese   Catalan   French   German   English
blao       blue      –         –            blau      bleu     blau     blue
laith      ugly      –         –            lleig     laid     leid     –
lothr      free      –         –            lloure    –        Leute    lewd
reiks      rich      rico      rico         ric       riche    reich    rich

The largest part of the vocabulary of the three Romance languages is inherited from Vulgar Latin. Catalan remained relatively unaffected by Arabization, as the Moorish domination was rather short. If we review the Arabic vocabulary in the three languages, we notice that Catalan, unlike Castilian Spanish and Portuguese, exhibits a unique resistance to the agglutination of the Arabic article al- to the words, as exemplified in Table C.4. Catalan, on the other hand, incorporated a good many Germanic loanwords (see Table C.3).


Thus, the words for many everyday concepts in the three languages do not always coincide, even etymologically (see Table C.5).

Table C.4. Arabic influence on Spanish, Portuguese, and Catalan

Arabic       Spanish     Portuguese   Catalan   Meaning
al-harsufa   alcachofa   alcachofra   carxofa   artichoke
al-qutun     algodón     algodão      cotó      cotton
ar-rabd      arrabal     arrabalde    raval     suburb

Table C.5. Basic words: Comparison of Spanish, Portuguese, and Catalan

Spanish   Portuguese     Catalan    Occitan    French    Meaning
silla     cadeira        cadira     cadièra    chaire    chair
mesa      tabela, mesa   taula      taula      table     table
ventana   bilheteria     finestra   fenèstra   fenêtre   window
tío       tio            oncle      oncle      oncle     uncle
sobrino   sobrinho       nebot      nebot      neveu     nephew


Citation Index

Agirre et al. (2004), 47, 133
Agirre et al. (2005), 47, 133
Böhmová et al. (2001), 134, 165
Bémová et al. (1999), 133, 165
Baker et al. (1998), 40, 133
Baroni et al. (2002), 32–34, 133
Belkin and Goldsmith (2002), 34, 133
Bermel (1997), 129, 133
Bick (2000), 68, 133
Blitzer et al. (2006), 127, 133
Borin (1999), 23, 24, 134
Borin (2000), 23, 24, 134
Borin (2002), 38, 134
Borin (2003), 32, 134
Brants (2000), 6, 8, 9, 24, 25, 103, 134
Breiman (1996), 20, 134
Brent (1994), 33, 134
Brent (1999), 33, 134
Brill and Wu (1998), x, 22, 118, 134
Brill (1995), 10–12, 18, 22, 134
Brill (1999), 18, 134
Brown et al. (1990), 37, 134
Brown et al. (1993), 37, 135
Brun (2001), 52, 135
Carlberger and Kann (1999), 24, 135
Carrasco and Gelbukh (2003), 10, 135
Carreras et al. (2003), 43, 135
Cavestro and Cancedda (2005), 40, 135
Charniak and Johnson (2005), 128, 135
Chen and Goodman (1996), 8, 135
Chen (1993), 37, 135
Church (1988), 6, 135
Civit (2000), 64, 135, 161, 166
Clark et al. (2003), 25, 136
Clark (2001), 32, 135
Cloeren (1993), 60, 136
Collins (2000), 128, 136
Comrie and Corbett (2002), 49, 136
Creutz and Lagus (2002), 32, 136
Creutz (2003), 32, 136
Cucerzan and Yarowsky (1999), 31, 136
Cucerzan and Yarowsky (2000), 36, 136
Cucerzan and Yarowsky (2002), 31, 45, 46, 48, 136
Cunha and Cintra (2001), 99, 136
Curran and Clark (2003), 25, 136
Cutting et al. (1992), 7, 17, 18, 137
Džeroski et al. (1999), 12, 14, 138
Džeroski et al. (2000), 26, 103, 138
Daelemans et al. (1996), 14, 137
Daelemans et al. (1999), 14, 137
Daelemans et al. (2001), 24, 137
Dagan and Church (1994), 37, 137
Dagan and Itai (1994), 41, 137
Dagan et al. (1993), 37, 137
Dagan (1990), 41, 137
Dahl (1985), 129, 137
DeRose (1988), 6, 138
Derksen (2008), 110, 137
Dien and Kiem (2003), 39, 138
Dietterich and Bakiri (1991), 118, 127, 138
Dietterich (1997), 20, 22, 138
Debowski (2004), 10, 137
Ejerhed and Källgren (1997), 38, 138
Elworthy (1995), 63, 78, 116, 138
Erjavec (2004), 60, 138
Feldman et al. (2005), 110, 138
Feldman et al. (2006), 98, 139
Feldman (2006), 128, 138
Freund and Shapire (1996), 21, 139
Fronek (1999), 89, 139
Fung and Church (1994), 37, 139
Fung and Lo (1998), 42, 139
Fung and McKeown (1997), 41, 139
Fung (1998), 37, 42, 139
Gale and Church (1991), 37, 139
Gale et al. (1992a), 41, 139
Gale et al. (1992b), 41, 139
Gale et al. (1992c), 41, 139
Gess and Arteaga (2006), 110, 140
Goldsmith (2001), 32, 33, 81, 140
Hajic and Hladká (1998a), 9–11, 140
Hajic and Hladká (1998b), 26–28, 140
Hajic et al. (2001), 28, 103, 140
Hajic (2004), 60, 62, 64, 66, 81, 82, 86, 90, 97, 140, 149
Hana and Culicover (2008), 33, 140
Hana et al. (2004), 90, 98, 115, 140
Hana et al. (2006), 98, 140
Hana (2007), 113, 140, 169
Hansen and Salamon (1990), 20, 140
Hladká (2000), 9–11, 28, 29, 140
Hlavácová (2001), 92, 141
Hwa et al. (2004), 39, 141
ISO-9 (1995), 104, 141
Isacenko (1968), 129, 141
Jelinek (1985), 17, 141
Johansson (1986), 23, 141
Johnson and Martin (2003), 33, 141
Karcevski (1927), 129, 141
Karlík et al. (1996), 49, 89, 141
Kay and Röscheisen (1993), 37, 141
Koskenniemi (1983), 89, 141
Koskenniemi (1984), 89, 141
Krotov et al. (1999), 75, 141
Kucera and Francis (1967), 1, 141
Kupiec (1993), 37, 142
Levenshtein (1966), 33, 35, 111, 142
Lezius et al. (1998), 23, 38, 142
Mann and Yarowsky (2001), 44, 48, 142
Marcus et al. (1993a), 60, 142
Marcus et al. (1993b), 71, 104, 142
Mason (1997), 23, 142
Maynard et al. (2003), 42, 43, 142
McClosky et al. (2006), 128, 142
Megyesi (1999), 11, 142
Melamed (2000), 37, 142
Merialdo (1994), 17, 31, 142
Meurers (2005), 3, 5, 143
Mikheev and Liubushkina (1995), 86, 90, 143
Mikheev (1997), 92, 98, 143
Miller (1990), 36, 143
Nemec (2004), 17, 143
Nakagawa et al. (2002), 24, 25, 143
Neuvel and Fulop (2002), 32, 143
Ngai and Florian (2001), 24, 143
Ngai and Yarowsky (2000), 31, 143
Och and Ney (2000), 39, 143
Orphanos and Christodoulakis (1999), 16, 143
Orphanos et al. (1999), 16, 143
Padó and Lapata (2005), 40, 144
Parmanto et al. (1996), 21, 144
Paul and Baker (1992), 23, 144
Pedersen et al. (2006), 43, 144
Perinin (2002), 144, 178
Przepiórkowski and Wolinski (2003), 64, 144
Rapp (1995), 41, 144
Ratnaparkhi (1996), 12, 13, 22, 24, 144
Resnik (1996), 129, 144
Resnik (2004), 2, 144
Ruimy et al. (2004), 44, 144
Ruzýicýka (1952), 129, 145
Samuelsson (1993), 9, 145
Schmid (1994a), 16, 145
Schmid (1994b), 15, 23, 24, 145
Schone and Jurafsky (2000), 33, 145
Schone and Jurafsky (2002), 32, 34, 145
Schuetze (1992), 41, 145
Shenker (1995), 49, 145
Sjöbergh (2003a), 23, 24, 145
Sjöbergh (2003b), 24, 145
Skoumalová (1997), 89, 145
Smadja (1996), 37, 145
Smith and Smith (2004), 40, 146
Snyder and Barzilay (2008a), 38, 146
Snyder and Barzilay (2008b), 38, 146
Snyder et al. (2008), 38, 146
Solorio and López (2005), 43, 146
Spoustová et al. (2007), 29, 146
Tanaka and Iwasaki (1996), 41, 146
Theron and Cloete (1997), 32, 33, 146
Tsang et al. (2002), 43, 146
Tsang (2001), 40, 44, 146
Vapnik (1998), 24, 147
Viterbi (1967), 7, 147
Wade (1992), 49, 99, 147
Weischedel et al. (1993), 6, 147
Wheeler et al. (1999), 99, 147, 175
Wolpert (1992), 22, 147
Wu and Xia (1994), 37, 147
Yarowsky and Ngai (2001), 31, 39, 48, 147
Yarowsky and Wicentowski (2000), 33–36, 147
Yarowsky et al. (2001), 31, 39, 48, 147
Yarowsky (1995), 41, 147
Zemel (1993), 33, 147
Zipf (1935), 83, 147
Zipf (1949), 83, 148
de Marcken (1995), 33, 137
den Boogaart (1975), 23, 137
van Halteren et al. (1998), 22, 146
van Halteren et al. (2001), 23, 146
van Rijsbergen (1979), 6, 147